## Methodology
### What we measure
| Family | Metric | Direction |
| --- | --- | --- |
| Statistical fidelity | SDMetrics QualityReport: overall, column shapes, column pair trends | higher = better |
| ML utility | TSTR (Train-Synthetic-Test-Real) and TRTR (Train-Real-Test-Real) AUC, for logistic regression and gradient boosting | higher = better; a small TSTR-to-TRTR gap is the meaningful signal |
| Privacy | DCR (5th-percentile distance to closest record, normalised by intra-real median) and NNDR (median nearest-neighbour distance ratio) | DCR higher = better; NNDR closer to 1 = better |
| Constraint conformance | Fraction of synthetic rows that satisfy all schema constraints (range, enum membership) | higher = better |
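To make the ML-utility row concrete, here is a minimal sketch of the TSTR vs TRTR comparison using scikit-learn. The function and variable names are ours, not the harness's, and the inputs are assumed to be already numerically encoded; treat it as an illustration of the protocol, not the harness implementation.

```python
"""Sketch of the TSTR vs TRTR comparison (illustrative, not the harness code)."""
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def _fit_auc(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def tstr_trtr(real_X, real_y, synth_X, synth_y, seed=42):
    """Return {model_name: (trtr_auc, tstr_auc)} on a shared real test set.

    Inputs are assumed to be numerically encoded feature matrices and
    binary targets; encoding is out of scope for this sketch.
    """
    # Hold out a real test set; both training protocols are scored on it.
    Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
        real_X, real_y, test_size=0.25, random_state=seed
    )
    out = {}
    for cls in (LogisticRegression, GradientBoostingClassifier):
        kwargs = {"max_iter": 1000} if cls is LogisticRegression else {"random_state": seed}
        trtr = _fit_auc(cls(**kwargs), Xr_tr, yr_tr, Xr_te, yr_te)      # train real, test real
        tstr = _fit_auc(cls(**kwargs), synth_X, synth_y, Xr_te, yr_te)  # train synthetic, test real
        out[cls.__name__] = (trtr, tstr)
    return out
```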
### What we run
- SynthForge at the commit recorded in `synthforge_commit`.
- CTGAN via SDV's `CTGANSynthesizer`, default hyperparameters, 300 epochs, seed 42.
Both synthesizers produce a synthetic dataset of the same size as the real dataset, then every metric is run on the (real, synthetic) pair.
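For reference, the CTGAN baseline amounts to roughly the following calls (a sketch assuming the SDV 1.x API; the file path is a placeholder, and we seed numpy and torch directly since, to our knowledge, `CTGANSynthesizer` does not take a seed argument):

```python
# Sketch of the CTGAN baseline run (SDV 1.x API assumed; path is a placeholder).
import numpy as np
import pandas as pd
import torch
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

np.random.seed(42)      # all randomness uses seed 42 (see Reproducibility)
torch.manual_seed(42)

real = pd.read_csv("adult_clean.csv")  # placeholder; see Datasets below

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

synthesizer = CTGANSynthesizer(metadata, epochs=300)  # defaults otherwise
synthesizer.fit(real)

# Same size as the real dataset; metrics then run on the (real, synthetic) pair.
synthetic = synthesizer.sample(num_rows=len(real))
```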
### Datasets
- UCI Adult (Census Income, 1994). After dropping rows with missing values: 30,162 rows, 14 features, binary income target.
- UCI Credit Card Default (Taiwan, 2005). 30,000 rows, 23 features, binary default target.
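As a sketch of the Adult preprocessing implied above (assuming the `adult.data` train split downloaded per the README; column names follow the UCI data dictionary):

```python
# UCI encodes missing values as "?"; dropping those rows yields 30,162.
import pandas as pd

ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

adult = (
    pd.read_csv("adult.data", header=None, names=ADULT_COLUMNS,
                skipinitialspace=True, na_values="?")
    .dropna()
)
assert adult.shape == (30162, 15)  # 14 features plus the binary income target
```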
### Schema authoring: the important caveat
SynthForge is schema-driven, not data-fitted. For each dataset we hand-author a SynthForge schema from the public UCI data dictionary (documented column types, ranges, categorical sets, and standard demographic priors). We do not fit the SynthForge schema to the real CSV.
This is the honest framing: given only the public data dictionary, what does SynthForge produce? CTGAN, in contrast, sees the real data during training. This is a deliberate asymmetry the benchmark exists to measure. A tool that needs no data access has different operational properties from one that does.
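To give a feel for what "hand-authored from the data dictionary" means, here is a purely illustrative schema fragment plus the constraint-conformance check from the metrics table. The dict format below is a stand-in, not SynthForge's actual schema syntax; the enum values come from the public UCI data dictionary and the numeric bounds are illustrative.

```python
import numpy as np

# Illustrative stand-in for a hand-authored schema. This is NOT SynthForge's
# real schema format; enum values are from the public UCI data dictionary,
# numeric bounds are illustrative.
ADULT_SCHEMA = {
    "age": {"type": "int", "min": 17, "max": 90},
    "hours-per-week": {"type": "int", "min": 1, "max": 99},
    "workclass": {"type": "enum", "values": [
        "Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
        "Local-gov", "State-gov", "Without-pay", "Never-worked",
    ]},
}


def conformance(synth_df, schema=ADULT_SCHEMA):
    """Fraction of synthetic rows satisfying every range/enum constraint."""
    ok = np.ones(len(synth_df), dtype=bool)
    for col, spec in schema.items():
        if spec["type"] == "enum":
            ok &= synth_df[col].isin(spec["values"]).to_numpy()
        else:
            ok &= synth_df[col].between(spec["min"], spec["max"]).to_numpy()
    return float(ok.mean())
```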
### How to read the privacy numbers
A higher DCR means synthetic rows are not close copies of real rows. An NNDR closer to 1 means a synthetic row is not anomalously close to one specific real row. SynthForge's high DCR is structural: it cannot memorise data it never saw.
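Both metrics reduce to nearest-neighbour distances. A minimal sketch, assuming numeric, consistently scaled feature matrices (the harness implementation in `app/benchmarks` may differ in detail):

```python
# Sketch of DCR and NNDR as defined in the metrics table (illustrative).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def dcr_nndr(real, synth):
    # Nearest and second-nearest real neighbour of every synthetic row.
    d_syn, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(synth)

    # Intra-real baseline: each real row's nearest *other* real row
    # (column 0 is the row itself at distance 0, so take column 1).
    d_real, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)
    baseline = np.median(d_real[:, 1])

    dcr = np.percentile(d_syn[:, 0], 5) / baseline  # 5th percentile, normalised
    nndr = np.median(d_syn[:, 0] / d_syn[:, 1])     # nearest / second-nearest
    return dcr, nndr
```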
### Reproducibility
- Harness lives at `apps/generator/app/benchmarks/`; tests at `apps/generator/tests/benchmarks/`.
- CLI: `python -m app.benchmarks.cli --dataset all --output results.json`.
- Real CSVs are not committed; download instructions are in `apps/generator/benchmarks_data/README.md`.
- All randomness uses seed 42.
- Re-run any time. The numbers above come from the most recent run committed to `www/src/data/benchmarks/results.json`.
### What we explicitly do not measure (and why)
- Differential privacy guarantees. Differential privacy (DP) is a mathematical bound on how much any single record in the training data can change the output, expressed as a pair like (ε = 2, δ = 1e-6); the formal definition appears after this list. SynthForge does not yet ship a DP-aware path; SmartNoise integration is on the roadmap. Claiming DP numbers without the math behind them would be misleading, so we don't.
- Image, text, and time-series datasets. These are out of scope for a tabular synthetic-data benchmark.
- Membership-inference attacks against DP baselines. Same gap as DP.
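For readers unfamiliar with the (ε, δ) notation above: a randomised mechanism M is (ε, δ)-differentially private if, for every pair of datasets D and D′ differing in a single record and every set of outputs S,

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta.
$$

Smaller ε and δ mean a tighter bound, i.e. stronger privacy.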
Want to inspect the raw numbers? Download `results.json`. The harness source is not currently public; reach out at hello@synthforge.io if you want to review or replicate the methodology.