Changelog
Source: NEWS.md
cryptRopen 0.2.0
This release closes the 0.1.x cycle and consolidates the package into an API-stable, CRAN-ready state. Submission to CRAN is planned once the package has been validated on production workloads.
API stability
The public surface (crypt_vector(), crypt_data(), crypt_r() and its async companions, get_correspondence_tables(), inspect()) is committed across the 0.x series: existing arguments will not be removed or renamed; new arguments may be added with defaults.
Documentation overhaul (for CRAN clarity)
- Title and Description rewritten to describe the package honestly: pseudonymization via salted hash, not reversible encryption.
- ?cryptRopen is now a real index: entry points, async companions, and pointers to the two vignettes.
- The Getting Started vignette gained a “When not to use cryptRopen” section and an “Algorithm choice” section.
- All @param descriptions are standardized; two factually wrong notes were corrected (inspect() does not require the data frame to live in globalenv(); crypt_data() accepts a data frame value, not an expression).
- Examples on the async companions migrated from \dontrun{} to runnable \donttest{} blocks anchored on the shipped persons.csv fixture.
- US English spelling propagated throughout (pseudonymize, normalize, finalize, behavior, …).
Code modernization
- dplyr::filter_all(any_vars(...)) replaced by dplyr::filter(if_any(...)) in the mask import step (the superseded family was generating soft deprecation warnings on dplyr >= 1.1).
- inspect()’s optional row-count side effect now flows through message() instead of print().
- crypt_data() argument-validation error messages reformulated for clarity, with call. = FALSE.
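The last two items can be illustrated with base R alone. The sketch below uses two hypothetical helpers (report_rows_sketch() and check_vars_sketch() are illustrative names, not cryptRopen functions) to show what the switch buys: message() output can be silenced by the caller, and call. = FALSE keeps the internal call out of the error condition.

```r
# Hypothetical helper: reports a row count as a diagnostic, not as a value.
report_rows_sketch <- function(df) {
  message("Rows inspected: ", nrow(df))
  invisible(df)
}

# Hypothetical validator: call. = FALSE drops the "Error in f(...)" prefix.
check_vars_sketch <- function(vars) {
  if (!is.character(vars)) {
    stop("`vars_to_encrypt` must be a character vector.", call. = FALSE)
  }
  invisible(vars)
}

# The caller can silence the diagnostic without losing the return value.
quiet <- suppressMessages(report_rows_sketch(mtcars))

# The captured condition carries no call, so the user sees only the message.
err <- tryCatch(check_vars_sketch(42), error = function(e) e)
no_call <- is.null(conditionCall(err))
```

With print(), the row count would land on stdout and mix with real output; with message(), scripts can wrap the call in suppressMessages() when the count is not wanted.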
Packaging
- DESCRIPTION gained URL, BugReports, Depends: R (>= 4.1.0), Language: en-US, Config/testthat/edition: 3.
- devtools removed from Suggests (was unused at runtime).
- The async test suite (tests/testthat/test-crypt_r_async.R) now skips on CRAN via skip_on_cran(). Full coverage is exercised in CI on every push.
cryptRopen 0.1.1
Bug fixes
- crypt_r() no longer crashes on mask rows with an empty vars_to_encrypt cell. Such rows are now treated as legitimate “copy / convert only” instructions: the input file is written to output_path as-is (re-encoded to the requested output format), vars_to_remove is applied if non-empty, no _crypt columns are emitted, and no tc_*.parquet is produced. The recap log records success = TRUE and tc_name = NA for such rows. Previously the three engines tried to build a zero-length _crypt data frame and failed at the assembly step with “Can’t recycle ..1 (size 0) to match ..2” (more visible at CSV streaming scale).
Improvements
- New private helper .parse_mask_vars() centralizes the str_split(",") %>% unlist() %>% str_trim() idiom that the three engines duplicated. It also normalizes NA / "" / whitespace-only / list-internal blanks ("a,,b") to character(0) / clean items, fixing a latent edge case in vars_to_remove as well.
- .transform_stream_chunk() and .process_mask_row_in_memory() now short-circuit encryption + correspondence-table construction when vars_to_encrypt is empty, while still applying vars_to_remove.
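The normalization rules described for .parse_mask_vars() can be sketched in base R. The function below is a hypothetical stand-in written for illustration (the real helper is private and built on stringr; its actual body is not shown in this changelog), assuming the input is a single mask cell.

```r
# Hypothetical re-implementation of the parsing rules, base R only.
parse_mask_vars_sketch <- function(cell) {
  # A missing or zero-length cell means "no variables".
  if (length(cell) == 0L || all(is.na(cell))) {
    return(character(0))
  }
  # Split on commas, then trim surrounding whitespace from every item.
  items <- trimws(unlist(strsplit(as.character(cell), ",", fixed = TRUE)))
  # Drop NA and empty items, e.g. the blank between the commas in "a,,b".
  items[!is.na(items) & nzchar(items)]
}

parsed_clean  <- parse_mask_vars_sketch(" name , birth_date ")
parsed_blanks <- parse_mask_vars_sketch("a,,b")
parsed_empty  <- parse_mask_vars_sketch("   ")
parsed_na     <- parse_mask_vars_sketch(NA)
```

Returning character(0) for every "nothing here" input lets callers test length(vars) == 0L once instead of checking NA, "" and whitespace separately.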
Breaking-ish change
- crypt_data() is now fail-fast on an empty vars_to_encrypt: it raises an explicit error pointing the user to dplyr / rio for plain column dropping or format conversion. Previously it errored too, but later and with a misleading message (“All indicated vars_to_encrypt must be effectively a variable name”). The asymmetry with crypt_r() is intentional: crypt_r() is driven by a hand-filled spreadsheet describing a heterogeneous batch, where a “copy / purge only” row is a reasonable use case; crypt_data() is a direct call on a loaded object, where calling it with nothing to encrypt is a misuse and silently succeeding would hide a mistake.
- Side effect: crypt_data() now trims vars_to_encrypt before the membership check, so a typo like " mpg " (with surrounding whitespace) is now silently accepted instead of producing the “All indicated vars_to_encrypt …” error. The historical trim happening later in the function already absorbed those whitespace cases on successful runs, so this aligns the membership check with the rest of the pipeline.
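The order of checks described above (trim first, then fail fast on empty input, then test membership) can be sketched as follows. validate_vars_sketch() is a hypothetical function and its error wording is invented for illustration; only the ordering is taken from this entry.

```r
# Hypothetical validator illustrating trim -> empty check -> membership check.
validate_vars_sketch <- function(data, vars_to_encrypt) {
  # 1. Trim before anything else, so " mpg " and "mpg" are the same request.
  vars <- trimws(as.character(vars_to_encrypt))
  vars <- vars[!is.na(vars) & nzchar(vars)]

  # 2. Fail fast: nothing to encrypt is treated as a misuse, not a no-op.
  if (length(vars) == 0L) {
    stop("`vars_to_encrypt` is empty: nothing to encrypt.", call. = FALSE)
  }

  # 3. Membership check runs on the trimmed names.
  unknown <- setdiff(vars, names(data))
  if (length(unknown) > 0L) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  vars
}

accepted <- validate_vars_sketch(mtcars, " mpg ")
rejected <- tryCatch(
  validate_vars_sketch(mtcars, c(NA_character_, " ")),
  error = function(e) e
)
```

Here accepted holds the trimmed name and rejected holds the error condition raised by the empty-input check.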
Tests
- Two new baseline cases lock the new crypt_r() empty-vars behavior: empty_vars_remove_csv (purge-only CSV) and empty_vars_copy_rds (verbatim copy of an RDS). Their intermediate/ directories are empty by contract, asserted by expect_setequal() in test-baseline.R.
- New test-parse_mask_vars.R covers the helper’s edge cases (12 assertions).
- test-crypt_data.R adds three expect_error() for the fail-fast on character(0), "", c(NA_character_, " ").
cryptRopen 0.1.0
First public release milestone — closes the refactor-v1 branch. No public API change vs. the historical 0.0.0.9000; under the hood, crypt_r() is now non-blocking with mirai orchestration, a streaming engine handles large parquet / CSV inputs, correspondence tables live in a package-private environment, and the codebase has been restructured for readability.
Readability audit (refactor-v1)
The refactor branch ran a six-step readability audit after the hardening phases. No API change, no semantic change; tests and devtools::check() stay at 0/0/0 throughout.
- Audit 0 (retire fusen): R/*.R becomes the single source of truth; dev/flat_*.Rmd removed.
- Audit A: R/cryptRopen-package.R stub with consolidated @importFrom and utils::globalVariables().
- Audit B: redundant requireNamespace() calls removed on hard Imports; function ( → function( on three sites.
- Audit C: R/crypt_r.R (1094 lines) split into seven engine-specific files: crypt_r.R, crypt_r_result.R, crypt_r_dispatcher.R, crypt_r_engine_in_memory.R, crypt_r_engine_stream_shared.R, crypt_r_engine_stream_parquet.R, crypt_r_engine_stream_csv.R.
- Audit D.1: cryptic local names expanded: sm → mask_row, tf → transformed, ds → arrow_dataset, x0 → col, redundant g alias dropped.
- Audit D.2: styler::style_pkg(): de-aligned = in multi-line calls, normalized whitespace.
- Audit E: roxygen bug fixes, @family async_job, missing @return, typos, inline phase references migrated to this file.
cryptRopen 0.0.0.9000
Phase 2 — Hardening
Phase 2.C — retire assign_to_global() and clear R CMD check
- R/assign_to_global.R removed along with its fusen flat file and dev/config_fusen.yaml entry.
- ^tests/baseline$ added to .Rbuildignore; baseline fixtures stay accessible to devtools::test() via skip_if_no_baseline() but are excluded from the tarball.
- nanoparquet moved to Imports (was an implicit dependency of rio::export() for parquet output; R CMD check --as-cran exposed the gap).
- test-baseline.R: source("cases.R") guarded so the file no longer errors when the baseline directory is stripped from the tarball.
- mirai::mirai() body resolves .process_mask_row() via utils::getFromNamespace() instead of the ::: operator, which clears the “::: calls to the package's namespace in its code” NOTE.
- devtools::check() reaches 0 ERROR / 0 WARNING / 0 NOTE for the first time in the refactor.
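The getFromNamespace() substitution can be shown with a stand-in. The sketch below resolves utils::head by name purely for illustration (resolve_fn_sketch() is a hypothetical wrapper); the same call also reaches non-exported objects, which is why it can replace ::: inside code that R CMD check inspects.

```r
# Hypothetical wrapper: look a function up by name in a package namespace.
# Unlike `pkg:::fn`, the lookup is an ordinary function call on two strings.
resolve_fn_sketch <- function(fn_name, pkg) {
  utils::getFromNamespace(fn_name, pkg)
}

# `head` from utils is only a stand-in for an internal helper.
head_fn <- resolve_fn_sketch("head", "utils")
first_three <- head_fn(letters, 3)
```

Because the target is named by strings, the lookup happens at run time in whichever process evaluates it, which suits a function body shipped to a worker.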
Phase 2.B — crypt_r() fast-fail + cryptR_results()
- crypt_r() validates output_path and intermediate_path before dispatching mirai tasks. Per-row input paths stay unchecked: a per-row failure must not interrupt the rest.
- get_correspondence_tables(names = NULL): optional character vector to select and order the returned tables. Missing keys become NULL entries.
- New exported cryptR_results(job): disk-oriented companion to cryptR_status(). One row per filtered mask row with output_file_path, live exists, worker-captured size_bytes / sha256, success, error_message. Joins cleanly with cryptR_status() on encrypted_file.
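The names = NULL contract (select, order, and return NULL for missing keys) can be sketched against a private environment. Everything below is hypothetical: tc_env_sketch stands in for the package-private environment, the table contents are made up, and get_tables_sketch() only mirrors the behavior described in this entry.

```r
# Stand-in for a package-private environment holding correspondence tables.
tc_env_sketch <- new.env(parent = emptyenv())
assign("tc_persons", data.frame(id = "A1", id_crypt = "9f2c"), envir = tc_env_sketch)
assign("tc_visits", data.frame(visit = "V7", visit_crypt = "03be"), envir = tc_env_sketch)

# Hypothetical getter: all tables by default, or a selection in caller order.
get_tables_sketch <- function(names = NULL, env = tc_env_sketch) {
  if (is.null(names)) {
    return(mget(ls(env, all.names = TRUE), envir = env))
  }
  out <- lapply(names, function(key) {
    # A key that was never stored yields a NULL entry instead of an error.
    if (exists(key, envir = env, inherits = FALSE)) get(key, envir = env) else NULL
  })
  stats::setNames(out, names)
}

picked <- get_tables_sketch(c("tc_visits", "tc_absent", "tc_persons"))
all_tables <- get_tables_sketch()
```

Keeping NULL placeholders means the result always has one entry per requested name, so callers can index it positionally.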
Phase 2.A — enriched job state and summary() method
- cryptR_status() grows from 3 to 7 columns: start_time, end_time, duration_sec, n_rows_processed are extracted from the per-task payload. No breaking change.
- New S3 method summary.cryptR_job() (exported) returns a compact dashboard object; its print() method renders it.
- print.cryptR_job() gains an active-workers line.
- Shared helper .n_workers_active() de-duplicates the three open-coded mirai::status()$daemons call sites.
Phase 1.D — crypt_r() refactor and mirai orchestration
Phase 1.D.6.c — log xlsx, TC round-trip, auto-watcher
- Engines return a typed list(success, error_message, tc_name, tc_df, metrics) via new helper .make_row_result().
- New file R/cryptR_log.R with the log-writing and correspondence-table re-injection helpers.
- New file R/cryptR_watcher.R with a later::later() self-rescheduling watcher so the recap log is written automatically when the last mirai task resolves. Graceful fallback when later is unavailable.
- cryptR_collect() now runs the shared finalize pipeline (TC re-injection + log writer), idempotent against a watcher that has already fired.
- The TC limitation from 1.D.6.b is lifted: get_correspondence_tables() in the parent process sees the TCs produced by mirai daemons.
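The typed result list lends itself to a small constructor. The sketch below is a hypothetical version: the five field names come from this entry, while the default values, the validation, and the example payloads are assumptions.

```r
# Hypothetical constructor: every engine returns the same five fields,
# whether the row succeeded or failed.
make_row_result_sketch <- function(success,
                                   error_message = NA_character_,
                                   tc_name = NA_character_,
                                   tc_df = NULL,
                                   metrics = list()) {
  stopifnot(is.logical(success), length(success) == 1L, !is.na(success))
  list(
    success = success,
    error_message = error_message,
    tc_name = tc_name,
    tc_df = tc_df,
    metrics = metrics
  )
}

ok <- make_row_result_sketch(
  TRUE,
  tc_name = "tc_persons",
  tc_df = data.frame(id = 1L, id_crypt = "ab12"),
  metrics = list(n_rows_processed = 1L)
)
failed <- make_row_result_sketch(FALSE, error_message = "input file not found")
```

A fixed shape lets the parent process read fields from every worker result without first checking which ones are present.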
Phase 1.D.6.b — mirai orchestration inside crypt_r()
- crypt_r() becomes non-blocking: one mirai::mirai() per filtered mask row, results aggregated into a cryptR_job returned immediately.
- New parameter n_workers = NULL with heuristic min(detectCores() - 1, n_rows, 8L) (floored at 1). Existing daemons on the default profile are reused; only spawned daemons are torn down by cryptR_collect().
- .process_mask_row_in_memory() no longer writes to globalenv(): no assign_to_global(), no eval(parse(text = ...)). The correspondence table lives in .cryptRopen_env + on disk.
- DESCRIPTION: job removed from Imports, {parallel} added.
- assign_to_global() demoted to @noRd (kept as a no-op to avoid breaking downstream code that may still reference it).
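The n_workers heuristic above is small enough to write out. default_n_workers_sketch() is a hypothetical function; the n_cores argument and the NA fallback are assumptions added so the rule can be exercised with fixed inputs.

```r
# Hypothetical version of the heuristic min(detectCores() - 1, n_rows, 8L),
# floored at 1. `n_cores` is exposed so the rule can be checked deterministically.
default_n_workers_sketch <- function(n_rows, n_cores = parallel::detectCores()) {
  # detectCores() may return NA on some platforms; assume a single core then.
  if (is.na(n_cores)) n_cores <- 1L
  as.integer(max(1L, min(n_cores - 1L, n_rows, 8L)))
}

few_rows   <- default_n_workers_sketch(n_rows = 3L, n_cores = 16L)   # limited by rows
many_rows  <- default_n_workers_sketch(n_rows = 100L, n_cores = 16L) # limited by the cap
small_host <- default_n_workers_sketch(n_rows = 100L, n_cores = 4L)  # limited by cores
one_core   <- default_n_workers_sketch(n_rows = 5L, n_cores = 1L)    # floored at 1
```

Each of the three limits covers a different case: one core stays free for the parent session, no more workers are spawned than there are rows, and the count never exceeds 8.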
Phase 1.D.6.a — cryptR_job scaffolding
- New S3 class cryptR_job and companions cryptR_status(), cryptR_wait(), cryptR_collect(), print.cryptR_job() (still wired in isolation; crypt_r() not touched yet).
- New files R/cryptR_job.R and R/cryptR_job_helpers.R.
- DESCRIPTION: mirai added to Imports, withr to Suggests (test-only, for daemon teardown).
Phase 1.D.5 — baseline covers large parquet + multi-chunk streaming
- New case large_parquet_multichunk (50 000 rows, chunk_size = 15000) in tests/baseline/cases.R; exercises multi-iteration scanner reads.
- tests/baseline/generate_baseline.R: capture_one_crypt_data() now reads correspondence tables via cryptRopen::get_correspondence_tables(), which fixes a latent regression from Phase 1.C where TC names were captured empty.
Phase 1.D.4 — streaming engines
- 1.D.4.a: crypt_r(engine, chunk_size) parameters added and plumbed through the dispatcher (no routing change yet); arrow promoted from Suggests to Imports.
- 1.D.4.b: parquet streaming engine via arrow::open_dataset() + Scanner + ParquetFileWriter. Correspondence table built incrementally with distinct().
- 1.D.4.c: CSV streaming engine via arrow::open_csv_dataset() for reads; utils::write.table(append = TRUE) for writes (arrow::CsvWriter not universally exported).
- 1.D.4.d: shared helpers factored out (.transform_stream_chunk(), .finalize_stream_tc(), .write_stream_inspect()); dispatcher routes parquet-in/parquet-out and csv-in/csv-out to streaming, mixed or non-streamable endpoints to in-memory. rio::import() kept for the inspect re-read to preserve baseline class() tuples on CSV date columns.
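The write side of the CSV streaming approach, utils::write.table(append = TRUE), can be sketched without arrow. The function below is hypothetical and chunks an in-memory data frame only to keep the example self-contained; in the package the chunks come from an arrow dataset reader.

```r
# Hypothetical chunked writer: header on the first chunk only, later chunks
# appended to the same file.
write_csv_in_chunks_sketch <- function(df, path, chunk_size) {
  stopifnot(is.data.frame(df), nrow(df) > 0L, chunk_size >= 1L)
  starts <- seq(1L, nrow(df), by = chunk_size)
  for (start in starts) {
    rows <- start:min(start + chunk_size - 1L, nrow(df))
    first_chunk <- start == 1L
    utils::write.table(
      df[rows, , drop = FALSE],
      file = path,
      sep = ",",
      row.names = FALSE,
      col.names = first_chunk, # write the header exactly once
      append = !first_chunk,   # first chunk creates the file, the rest append
      qmethod = "double"       # CSV-style escaping of embedded quotes
    )
  }
  invisible(path)
}

input <- data.frame(id = 1:10, city = rep(c("Lyon", "Nice"), 5))
out_path <- tempfile(fileext = ".csv")
write_csv_in_chunks_sketch(input, out_path, chunk_size = 4L)
round_trip <- utils::read.csv(out_path)
```

Tying col.names and append to the same flag avoids a duplicated header and the warning write.table() emits when column names are appended to an existing file.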
Phase 1.D.1 → 1.D.3 — extract and freeze the in-memory engine
- 1.D.1: .process_mask_row() extracted from crypt_r(); thin orchestrator + helper.
- 1.D.2: .process_mask_row() becomes a dispatcher; historical body moves into .process_mask_row_in_memory() unchanged.
- 1.D.3: three invariant test blocks lock the in-memory engine: inspect xlsx layout, TC parquet uniqueness, output column order. .process_mask_row_in_memory() marked @section FROZEN.
Phase 1.A → 1.C — API surface refactor
- Phase 1.A: crypt_vector() cleaned up: helper .normalize_crypt_input() extracted, mask-before-hash, vapply(USE.NAMES = FALSE).
- Phase 1.B: inspect() cleaned up: helper .inspect_column() extracted, mutate_if → vapply / lapply on POSIXct columns, map_df → list_rbind(names_to = "variables").
- Phase 1.C: crypt_data() decoupled from globalenv(). Correspondence tables now routed through .cryptRopen_env (R/private_env.R) and retrievable via the new exported get_correspondence_tables().
Phase 0 — baseline regression infrastructure
- tests/baseline/ with 29 cases (12 crypt_vector, 6 crypt_data, 11 crypt_r) generated by generate_baseline.R and compared with test-baseline.R. All subsequent refactors preserve byte-level equality of disk outputs and semantic equality of in-memory results.
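A byte-level comparison of disk outputs can be approximated with file checksums. The sketch below is an assumption about how such a check could look, not the content of test-baseline.R: same_bytes_sketch() is hypothetical, the file contents are made up, and MD5 via tools::md5sum() is used only because it ships with base R.

```r
# Hypothetical check: two files hold the same bytes if their checksums match.
same_bytes_sketch <- function(path_a, path_b) {
  sums <- unname(tools::md5sum(c(path_a, path_b)))
  # md5sum() returns NA for unreadable files; treat that as "not equal".
  !anyNA(sums) && sums[1] == sums[2]
}

baseline  <- tempfile(fileext = ".csv")
candidate <- tempfile(fileext = ".csv")
drifted   <- tempfile(fileext = ".csv")
writeLines(c("id,id_crypt", "1,ab12"), baseline)
writeLines(c("id,id_crypt", "1,ab12"), candidate)
writeLines(c("id,id_crypt", "1,ab13"), drifted)

unchanged <- same_bytes_sketch(baseline, candidate)
changed   <- same_bytes_sketch(baseline, drifted)
```

A checksum comparison catches any drift in the written file, including changes that a parsed, semantic comparison would treat as equal, such as quoting or line endings.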