What cryptRopen is for
cryptRopen pseudonymizes variables inside data frames so
that sensitive values (identifiers, emails, names) can be shared or
stored in a derivative form without exposing the originals. The
pseudonymization is deterministic: the same input value
paired with the same key always produces the same output, so joins
across tables remain possible on the pseudonymized columns. A side table
— the correspondence table — keeps the mapping between original
and pseudonymized values for the party that holds the key.
Typical use cases: preparing an extract for an analytics team that should not see customer identifiers; sharing a dataset with an external partner; feeding a downstream pipeline while keeping the original identifiers under control.
Pseudonymization, not encryption
The crypt in the package name is historical. The
underlying transformation is a salted hash, not
reversible encryption:
- A normalized version of each input value is concatenated with a
user-provided salt (
key) and hashed withdigest::digest(). - The default algorithm is
"md5". Any algorithm supported bydigest::digest()("sha1","sha256","sha512", …) is accepted via thealgorithmparameter — see Algorithm choice below. - Holding the output does not let anyone recover the input. Only the correspondence table (or a rainbow-table-style attack on a weak key) does.
Consequence: cryptRopen is suitable for pseudonymization under regulations such as GDPR, not for confidentiality-bearing encryption.
When not to use cryptRopen
Pseudonymization is one tool among several. Reach for a different approach when:
-
You need to recover the original value later. A
salted hash is one-way. If your workflow must un-pseudonymize at some
point, use symmetric encryption (e.g. the
saferorsodiumpackages) and manage the key separately. - You need formal anonymization, not pseudonymization. Under regulations such as GDPR, pseudonymized data is still personal data — the correspondence table is the de facto re-identification key. For irreversible anonymization, consider differential privacy, k-anonymity, or aggregation.
- You want to hide value distributions. A deterministic hash preserves the frequency profile of the original column. An attacker who knows the rough distribution of the input (e.g. that certain customer IDs occur very often) can still spot the most frequent hashes. Add random padding or switch to format-preserving encryption if that matters for your threat model.
- You need integrity guarantees on the data, not just on its identifiers. Hashes here are computed per cell, not over a whole row or file. They do not detect tampering at the dataset level — pair them with a separate signature scheme if you need that.
Algorithm choice
The package delegates the actual hashing to
digest::digest(), so every algorithm of that function is
available via the algorithm parameter. The trade-offs you
should know:
-
"md5"(default) — fast and short (32 hex chars). Cryptographic collisions exist as theoretical constructions but are negligible at practical dataset scales for pseudonymization. Recommended unless you have a specific reason to step up. -
"sha1"— 40 hex chars, deprecated in formal cryptography but still acceptable for non-adversarial pseudonymization. Rarely a better choice than MD5 or SHA-256. -
"sha256"— 64 hex chars, the modern conservative default if you want to be on the safer side. Slower than MD5 but the difference is typically below 1 % of total runtime on a real pseudonymization job dominated by I/O. -
"sha512"— 128 hex chars, only useful when downstream systems happen to expect a 512-bit hash. The output is twice as long for no practical security gain in this context.
Whatever you pick, lock it down in your run and keep the
same encryption_key across all runs that need to
join: changing either breaks the deterministic property and your tables
stop joining on the pseudonymized columns.
Hashing a single vector
crypt_vector() is the lowest-level entry point. It takes
a vector, a key, and a hash algorithm, and returns a character vector of
the same length. NA and empty strings stay NA.
Input is uppercased, trimmed, and stripped of internal spaces before
hashing, so "Alice", "alice " and
" ALICE " collapse to the same hash.
crypt_vector(
vector = c("Alice", "Bob", "Charlie"),
key = "my-secret-salt",
algo = "md5"
)
#> [1] "450AC498E2781A99FCEE98F8D460A87E" "88C40ACBB5A8351455503E2EB8A44089"
#> [3] "1EEAF5D958D4AF3BA58900DDB131BB18"Determinism and the NA / empty-string rules:
# Same value + same key => same hash, every time.
identical(
crypt_vector("Alice", key = "k", algo = "md5"),
crypt_vector("alice ", key = "k", algo = "md5")
)
#> [1] TRUE
# NAs and empty strings are preserved.
crypt_vector(c("Alice", NA, "", " "), key = "k", algo = "md5")
#> [1] "7A9C2C3ACBB6E4B93066B2B66501EB10" NA
#> [3] NA NAHashing one or several columns of a data frame
crypt_data() operates on a data frame and returns a new
data frame with the requested columns replaced by their hashed
counterparts (suffix _crypt). Original columns can
optionally be dropped (vars_to_remove).
head(mtcars, 3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
encrypted <- crypt_data(
loaded_dataset = head(mtcars, 3),
vars_to_encrypt = "mpg",
vars_to_remove = "cyl",
encryption_key = "demo-key",
algorithm = "md5",
correspondence_table = TRUE,
correspondence_table_label = "demo"
)
encrypted
#> mpg_crypt disp hp drat wt qsec vs am
#> Mazda RX4 C33FDD39F6EAADF6C5E135AD5083E3D6 160 110 3.90 2.620 16.46 0 1
#> Mazda RX4 Wag C33FDD39F6EAADF6C5E135AD5083E3D6 160 110 3.90 2.875 17.02 0 1
#> Datsun 710 7E7E52C52B1C3686F553D1E4AC6A8207 108 93 3.85 2.320 18.61 1 1
#> gear carb
#> Mazda RX4 4 4
#> Mazda RX4 Wag 4 4
#> Datsun 710 4 1The correspondence table is not returned alongside
the data frame; it is stored inside a package-private environment and
retrieved via get_correspondence_tables():
tcs <- get_correspondence_tables()
names(tcs)
#> [1] "tc_crypt_demo"
tcs$tc_crypt_demo
#> mpg mpg_crypt
#> Mazda RX4 21.0 C33FDD39F6EAADF6C5E135AD5083E3D6
#> Mazda RX4 Wag 21.0 C33FDD39F6EAADF6C5E135AD5083E3D6
#> Datsun 710 22.8 7E7E52C52B1C3686F553D1E4AC6A8207Tables accumulate across calls in the same session. Pass a character
vector to names to pull only the ones you need:
get_correspondence_tables(names = "tc_crypt_demo")
#> $tc_crypt_demo
#> mpg mpg_crypt
#> Mazda RX4 21.0 C33FDD39F6EAADF6C5E135AD5083E3D6
#> Mazda RX4 Wag 21.0 C33FDD39F6EAADF6C5E135AD5083E3D6
#> Datsun 710 22.8 7E7E52C52B1C3686F553D1E4AC6A8207Pseudonymising multiple columns at once is a matter of extending
vars_to_encrypt:
crypt_data(
loaded_dataset = head(mtcars, 3),
vars_to_encrypt = c("mpg", "hp"),
encryption_key = "demo-key",
correspondence_table = TRUE,
correspondence_table_label = "demo_multi"
)
#> mpg_crypt hp_crypt
#> Mazda RX4 C33FDD39F6EAADF6C5E135AD5083E3D6 7D10FEDBB84390262D97BCD0956D8762
#> Mazda RX4 Wag C33FDD39F6EAADF6C5E135AD5083E3D6 7D10FEDBB84390262D97BCD0956D8762
#> Datsun 710 7E7E52C52B1C3686F553D1E4AC6A8207 BB8BE843A95B4E5921897169087511EC
#> cyl disp drat wt qsec vs am gear carb
#> Mazda RX4 6 160 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 6 160 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 4 108 3.85 2.320 18.61 1 1 4 1When you need to batch multiple files
crypt_data() is the right tool when you hold one data
frame in memory and want an encrypted version of it. For an
industrialized workflow — several input files specified in an Excel
mask, outputs written to disk, parallel execution, a recap log — see
vignette("crypt_r-workflow").
Related resources
-
?crypt_vector,?crypt_data,?crypt_r— the three public entry points. -
?get_correspondence_tables— retrieval of the correspondence tables stored in the package-private environment. -
NEWS.md— full change history.