
This page describes the directions being explored for upcoming versions of scrutr. They are exploratory directions, not commitments: scope, naming, and timing may change, and some items may be reshaped or dropped based on feedback.

If one of these axes matches a problem you are dealing with, the most useful thing you can do is open an issue describing it — even briefly. The package will move fastest in the directions that solve real, named pain.

A common thread runs through all six axes: each one operates on a collection of related datasets, not on a single table. This is what scrutr is for, and what distinguishes it from packages such as skimr, pointblank, or DataExplorer, which are excellent but single-table by design.


1. Schema as a spec

The situation. You receive 50 CSVs from partners every month. There is always one file in which someone renamed a column, switched a date format from ISO to US, or dropped a variable nobody flagged. You usually find out three days later, when something downstream breaks for an unrelated-looking reason.

What this would change. You write the expected shape of the collection once, in a small file you version like code. Every new delivery is then checked against it in a single call.

# collection_spec.yml
columns:
  client_id:    { type: integer, required_in: [orders, invoices, clients] }
  signup_date:  { type: date,    format: "%Y-%m-%d" }
  amount:       { type: numeric, min: 0 }

scrutr_check("collection_spec.yml", input_path = "deliveries/2026-05/")
#> 49/50 files conform.
#> orders_2026-05-12.csv:
#>   - column `signup_date`: 12 values not matching format "%Y-%m-%d"
#>   - column `amount`: missing

You get a single report telling you exactly which files, which columns, and which rows broke the contract — across the whole delivery, not file by file.

Technical note: declarative spec in YAML or CSV, evaluated against any folder or named list of data frames. Conceptually distinct from pointblank and validate, which describe rules in R per dataset; here the spec lives outside the code and evaluates over the entire collection in one pass.
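
To make the idea concrete, here is a minimal base-R sketch of evaluating one spec over a whole collection. It is an illustration under stated assumptions, not the planned implementation: check_collection() is a hypothetical helper, the spec is written as an R list instead of being read from YAML, and only class and minimum-value rules are covered.

```r
# Hypothetical spec, written as an R list for brevity; in the proposal it
# would live in a YAML or CSV file outside the code.
spec <- list(
  client_id = list(class = "integer", required_in = c("orders", "clients")),
  amount    = list(class = "numeric", min = 0)
)

# Toy collection: a named list of data frames.
collection <- list(
  orders  = data.frame(client_id = c(1L, 2L), amount = c(10.5, -3)),
  clients = data.frame(client_id = c("1", "2"), stringsAsFactors = FALSE)
)

# check_collection() is an illustrative helper, not a scrutr function.
# It returns one row per violation found across the whole collection.
check_collection <- function(collection, spec) {
  issues <- list()
  add <- function(table, column, problem) {
    issues[[length(issues) + 1L]] <<- data.frame(
      table = table, column = column, problem = problem,
      stringsAsFactors = FALSE
    )
  }
  for (tbl in names(collection)) {
    df <- collection[[tbl]]
    for (col in names(spec)) {
      rule <- spec[[col]]
      if (!col %in% names(df)) {
        # A missing column is only a violation where the spec requires it.
        if (tbl %in% rule$required_in) add(tbl, col, "missing")
        next
      }
      x <- df[[col]]
      if (!identical(class(x)[1], rule$class)) {
        add(tbl, col, paste0("class is ", class(x)[1], ", expected ", rule$class))
        next
      }
      if (!is.null(rule$min)) {
        n_bad <- sum(x < rule$min, na.rm = TRUE)
        if (n_bad > 0) add(tbl, col, paste0(n_bad, " value(s) below ", rule$min))
      }
    }
  }
  if (length(issues) == 0L) {
    return(data.frame(table = character(), column = character(),
                      problem = character(), stringsAsFactors = FALSE))
  }
  do.call(rbind, issues)
}

report <- check_collection(collection, spec)
print(report)
```

The point of the design is that the report is a single data frame covering every table, so it can be filtered, counted, or rendered without looping over files yourself.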


2. Diff between collections

The situation. You refactored a pipeline. The new output looks fine. You have no good way to be sure it produces the same thing as the old one without opening dozens of files side by side. The same problem arises when comparing prod to staging before a release, this month’s data to last month’s, or the data before and after a vendor change.

What this would change. A single function compares two collections and tells you exactly what moved.

scrutr_diff(before = "output_v1/", after = "output_v2/")
#> Tables added:    sessions_summary.csv
#> Tables removed:  legacy_events.csv
#> Schema changes:
#>   orders.csv: column `tax_rate` type integer -> numeric
#> Row count changes:
#>   invoices.csv: 12,408 -> 12,402 (-6)
#> Distribution drifts (>5% Wasserstein distance):
#>   orders.csv$amount, customers.csv$age

It tells you in one pass whether the refactor was safe, or where it leaked.

Technical note: structural and statistical comparison between two collections (folders or named lists). Returns a structured object that can be rendered to HTML or inspected programmatically. No CRAN equivalent takes two folders as input today.
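
The structural half of such a comparison can be sketched in a few lines of base R. This is a rough illustration only: diff_collections() is a hypothetical helper, the two collections are assumed to be named lists of data frames, and the statistical part (distribution drift) is left out.

```r
# diff_collections() is an illustrative helper, not a scrutr function.
diff_collections <- function(before, after) {
  common <- intersect(names(before), names(after))

  # Columns whose class differs between the two versions of a table.
  schema_changes <- list()
  for (tbl in common) {
    cols <- intersect(names(before[[tbl]]), names(after[[tbl]]))
    for (col in cols) {
      old <- class(before[[tbl]][[col]])[1]
      new <- class(after[[tbl]][[col]])[1]
      if (!identical(old, new)) {
        schema_changes[[paste0(tbl, "$", col)]] <- paste(old, "->", new)
      }
    }
  }

  # Row-count delta for every table present on both sides.
  row_delta <- vapply(
    common,
    function(tbl) nrow(after[[tbl]]) - nrow(before[[tbl]]),
    integer(1)
  )

  list(
    added          = setdiff(names(after), names(before)),
    removed        = setdiff(names(before), names(after)),
    schema_changes = unlist(schema_changes),
    row_changes    = row_delta[row_delta != 0]
  )
}

# Toy collections standing in for two output folders.
before <- list(
  orders        = data.frame(id = 1:3, tax_rate = c(5L, 5L, 10L)),
  legacy_events = data.frame(id = 1:2)
)
after <- list(
  orders           = data.frame(id = 1:2, tax_rate = c(5.5, 10)),
  sessions_summary = data.frame(id = 1L)
)

d <- diff_collections(before, after)
str(d)
```

Returning a plain list keeps the result inspectable programmatically; rendering it to HTML would be a separate step on top.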


3. Inferred relational model

The situation. Someone hands you a folder of 15 tables. No README, no schema diagram, the original author left two years ago. Before you can do anything useful, you have to spend half a day guessing which column joins to which.

What this would change. scrutr does the guessing for you, with confidence scores you can trust or audit.

model <- scrutr_infer_model("legacy_export/")
#> 4 candidate primary keys (uniqueness = 1.0)
#> 7 candidate foreign keys (inclusion >= 0.98 + name similarity)
#> Granularity hierarchy: clients > orders > order_lines

The returned object is a dm object, so you immediately get diagrams, joins, and validation for free. You stay in the driver’s seat: nothing is assumed, everything is proposed.

Technical note: heuristic inference of PK/FK relationships with confidence scores, leveraging uniqueness and inclusion checks plus naming similarity. Complementary to the dm package, which expects you to declare the model upfront — here the model is reverse-engineered from the data.
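
The two checks named above, uniqueness and inclusion, are simple to express. The sketch below is a simplified illustration in base R, assuming a named list of data frames; the helpers uniqueness() and inclusion() are hypothetical, and naming similarity is reduced to an exact column-name match.

```r
tables <- list(
  clients = data.frame(client_id = 1:4, region = c("N", "S", "N", "E")),
  orders  = data.frame(order_id = 101:105, client_id = c(1L, 1L, 2L, 4L, 4L))
)

# Share of distinct non-missing values: 1 means the column could be a key.
uniqueness <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) return(0)
  length(unique(x)) / length(x)
}

# Share of child values that also appear in the parent column.
inclusion <- function(child, parent) {
  child <- child[!is.na(child)]
  if (length(child) == 0L) return(0)
  mean(child %in% parent)
}

# Candidate primary keys: columns that are fully unique within their table.
pk <- do.call(rbind, lapply(names(tables), function(tbl) {
  u <- vapply(tables[[tbl]], uniqueness, numeric(1))
  data.frame(table = tbl, column = names(u), uniqueness = unname(u),
             stringsAsFactors = FALSE)
}))
pk <- pk[pk$uniqueness == 1, ]

# Candidate foreign keys: a same-named column in another table whose values
# are (almost) all contained in a candidate primary key.
fk <- list()
for (i in seq_len(nrow(pk))) {
  for (tbl in setdiff(names(tables), pk$table[i])) {
    col <- pk$column[i]
    if (col %in% names(tables[[tbl]])) {
      score <- inclusion(tables[[tbl]][[col]], tables[[pk$table[i]]][[col]])
      if (score >= 0.98) {
        fk[[length(fk) + 1L]] <- data.frame(
          child = tbl, parent = pk$table[i], column = col, inclusion = score,
          stringsAsFactors = FALSE
        )
      }
    }
  }
}
fk <- do.call(rbind, fk)

print(pk)
print(fk)
```

Keeping the scores in the output, instead of only the accepted candidates, is what lets you audit a proposal before accepting it.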


4. Operational consistency audit

The situation. This is the boring stuff that silently poisons real pipelines:

  • half your files are UTF-8, the other half Windows-1252, and your é characters quietly turn into Ã© somewhere in the middle of an analysis;
  • three files use , as decimal separator, two use ., and your means come out wildly wrong;
  • -999, "N/A", ".", and empty strings stand in for missing values, inconsistently, and none of them are recognised as NA;
  • the column client_id is integer in one file, character in another, and numeric (with .0 appended) in a third: joins run, results lie.

None of these raise errors. They just silently distort everything downstream.

What this would change. One function, run on a folder, surfaces all of them at once.

scrutr_audit("data/")
#> Encoding inconsistencies:    3 files in Windows-1252, others UTF-8
#> Decimal separators:          mixed (`,` and `.`)
#> Likely sentinel values:      `-999` in 4 files, `"N/A"` in 2 files
#> Class drift across files:    column `client_id`: integer, character, numeric
#> Suspicious silent coercions: column `date_signup` parsed as character in 3/15

You no longer wonder what is wrong with the collection — you know.

Technical note: generalisation of the work vars_compclasses() and detect_chars_structure_datasets() already do, bundled into a single audit pass with a structured output you can render, store, or feed into CI.
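
Three of these checks can be sketched in base R as follows. This is an illustration under assumptions, not the existing scrutr code: it does not call vars_compclasses() or detect_chars_structure_datasets(), the files are represented as a named list of data frames, and the list of sentinel values is hypothetical.

```r
files <- list(
  a = data.frame(client_id = c(1L, 2L), score = c("12,5", "-999"),
                 stringsAsFactors = FALSE),
  b = data.frame(client_id = c("1", "2"), score = c("7.25", "N/A"),
                 stringsAsFactors = FALSE)
)

# Class drift: columns whose class is not the same in every file.
all_cols <- unique(unlist(lapply(files, names)))
classes <- lapply(all_cols, function(col) {
  present <- Filter(function(df) col %in% names(df), files)
  vapply(present, function(df) class(df[[col]])[1], character(1))
})
names(classes) <- all_cols
drift <- Filter(function(cl) length(unique(cl)) > 1L, classes)

# Sentinel values: count, per file, the cells equal to one of the
# (hypothetical) stand-ins for missing data.
sentinels <- c("-999", "N/A", ".", "")
sentinel_hits <- vapply(files, function(df) {
  values <- unlist(lapply(df, as.character), use.names = FALSE)
  sum(values %in% sentinels)
}, integer(1))

# Decimal separators: which separator appears in number-like strings.
decimal_seps <- lapply(files, function(df) {
  values <- unlist(lapply(df, as.character), use.names = FALSE)
  c(comma = any(grepl("^-?[0-9]+,[0-9]+$", values)),
    dot   = any(grepl("^-?[0-9]+\\.[0-9]+$", values)))
})

print(drift)
print(sentinel_hits)
print(decimal_seps)
```

Each check returns a small named structure, which is the property that makes the results easy to store or to assert on in CI.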


5. Larger-than-memory backend

The situation. Your collection is 12 GB of Parquet sitting on a network drive. R chokes if you try to load it. You still want to know how many duplicates there are, whether class consistency holds, or how distributions compare across files — without materialising anything.

What this would change. The same scrutr functions, but lazy.

coll <- scrutr_lazy("warehouse_export/")
inspect(coll)
vars_compclasses(coll)
dupl_show(coll)

The heavy lifting is delegated to arrow or duckdb under the hood. You keep the scrutr API; you scale to data that does not fit in RAM.

Technical note: lazy backend that translates scrutr operations into arrow/duckdb queries. The engines exist; the audit semantics on top of them do not. This is the path to making scrutr viable for collections that were until now out of reach.


6. Consistent PII masking across tables

The situation. You need to send a copy of your tables to an external consultant — for analysis, for support, for a regulator. The names, the IDs, the emails must go. But the joins on their side still need to work, or your deliverable is useless. Hashing each column independently breaks every relationship in the collection. Doing it consistently across 20 files by hand is how mistakes happen.

What this would change. The same Excel-mask pattern you already use for conversions and renames, extended to PII.

mask_pii_r(
  input_folderpath  = "data/",
  output_folderpath = "data_masked/",
  mask_filepath     = "pii_mask.xlsx"
)

Behind the scenes: the same client_id hashes to the same value in orders.csv, invoices.csv, and clients.csv. Dates shift by a single offset per entity. Free-text fields can be redacted, replaced, or kept verbatim depending on the mask. You send the masked copy, the joins still work, and no PII leaves the building.

Technical note: deterministic, salt-based transformations that preserve referential integrity across a collection. Complementary to anonymizer and similar packages, which provide per-column primitives but no multi-file consistency layer.
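
The referential-integrity property can be illustrated in base R. To keep the sketch dependency-free it uses a shared lookup table of pseudonyms in place of the salted hashing mentioned above; mask_table() and the lookup layout are hypothetical, and a real implementation would need to keep the lookup (or the salt) secret.

```r
collection <- list(
  clients = data.frame(
    client_id = c("C1", "C2", "C3"),
    signup    = as.Date(c("2024-01-10", "2024-02-01", "2024-03-15")),
    stringsAsFactors = FALSE
  ),
  orders = data.frame(
    client_id = c("C2", "C2", "C3"),
    ordered   = as.Date(c("2024-02-05", "2024-02-20", "2024-04-01")),
    stringsAsFactors = FALSE
  )
)

# Build one lookup for the whole collection: each distinct identifier gets
# a pseudonym and a single date offset (in days).
set.seed(42)
ids <- sort(unique(unlist(lapply(collection, function(df) df$client_id))))
lookup <- data.frame(
  client_id = ids,
  pseudonym = sprintf("ID%04d", sample(length(ids))),
  offset    = sample(-30:30, length(ids), replace = TRUE),
  stringsAsFactors = FALSE
)

# Apply the same lookup to every table: shift each Date column by the
# offset of the row's entity, then replace the identifier.
mask_table <- function(df, lookup, id_col = "client_id") {
  idx <- match(df[[id_col]], lookup[[id_col]])
  for (col in names(df)) {
    if (inherits(df[[col]], "Date")) df[[col]] <- df[[col]] + lookup$offset[idx]
  }
  df[[id_col]] <- lookup$pseudonym[idx]
  df
}

masked <- lapply(collection, mask_table, lookup = lookup)

# Joins behave the same before and after masking.
joined_before <- merge(collection$clients, collection$orders, by = "client_id")
joined_after  <- merge(masked$clients, masked$orders, by = "client_id")
print(joined_after)
```

Because the offset is attached to the entity and not to the column, the interval between a client's signup and each of their orders is unchanged in the masked copy.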


How to weigh in

These six axes are not in priority order, except for axis 1 (schema as a spec), which is the most likely candidate to anchor the next minor version because it becomes the substrate for several of the others.

If any of this resonates — especially if you have a concrete dataset or collection in mind where you would actually use it — please open an issue. Even a one-line “I would use this for X” is the single most useful input we can receive at this stage.