## Methodology
### What we measure
| Family | Metric | Direction |
| --- | --- | --- |
| Statistical fidelity | SDMetrics QualityReport: overall, column shapes, column pair trends | higher = better |
| ML utility | TSTR (Train-Synthetic-Test-Real) and TRTR (Train-Real-Test-Real) AUC, for logistic regression and gradient boosting | higher = better; a small TSTR-to-TRTR gap is the meaningful signal |
| Privacy | DCR (5th-percentile distance to closest record, normalised by intra-real median) and NNDR (median nearest-neighbour distance ratio) | DCR higher = better; NNDR closer to 1 = better |
| Constraint conformance | Fraction of synthetic rows that satisfy all schema constraints (range, enum membership) | higher = better |
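To make the ML-utility row concrete, here is a minimal sketch of the TSTR vs TRTR comparison using scikit-learn. The function and variable names are ours, not the harness's, and the inputs are assumed to be already numerically encoded; treat it as an illustration of the protocol, not the harness implementation.

```python
"""Sketch of the TSTR vs TRTR comparison (illustrative, not the harness code)."""
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def _fit_auc(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def tstr_trtr(real_X, real_y, synth_X, synth_y, seed=42):
    """Return {model_name: (trtr_auc, tstr_auc)} on a shared real test set.

    Inputs are assumed to be numerically encoded feature matrices and
    binary targets; encoding is out of scope for this sketch.
    """
    # Hold out a real test set; both training protocols are scored on it.
    Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
        real_X, real_y, test_size=0.25, random_state=seed
    )
    out = {}
    for cls in (LogisticRegression, GradientBoostingClassifier):
        kwargs = {"max_iter": 1000} if cls is LogisticRegression else {"random_state": seed}
        trtr = _fit_auc(cls(**kwargs), Xr_tr, yr_tr, Xr_te, yr_te)      # train real, test real
        tstr = _fit_auc(cls(**kwargs), synth_X, synth_y, Xr_te, yr_te)  # train synthetic, test real
        out[cls.__name__] = (trtr, tstr)
    return out
```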
### What we run
- SynthForge at the commit recorded in `synthforge_commit`.
- CTGAN via SDV's `CTGANSynthesizer`, default hyperparameters, 300 epochs, seed 42.
Both synthesizers produce a synthetic dataset of the same size as the real dataset, then every metric is run on the (real, synthetic) pair.
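For reference, the CTGAN baseline amounts to roughly the following calls (a sketch assuming the SDV 1.x API; the file path is a placeholder, and we seed numpy and torch directly since, to our knowledge, `CTGANSynthesizer` does not take a seed argument):

```python
# Sketch of the CTGAN baseline run (SDV 1.x API assumed; path is a placeholder).
import numpy as np
import pandas as pd
import torch
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

np.random.seed(42)      # all randomness uses seed 42 (see Reproducibility)
torch.manual_seed(42)

real = pd.read_csv("adult_clean.csv")  # placeholder; see Datasets below

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

synthesizer = CTGANSynthesizer(metadata, epochs=300)  # defaults otherwise
synthesizer.fit(real)

# Same size as the real dataset; metrics then run on the (real, synthetic) pair.
synthetic = synthesizer.sample(num_rows=len(real))
```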
### Datasets
- UCI Adult (Census Income, 1994). After dropping rows with missing values: 30,162 rows, 14 features, binary income target.
- UCI Credit Card Default (Taiwan, 2005). 30,000 rows, 23 features, binary default target.
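As a sketch of the Adult preprocessing implied above (assuming the `adult.data` train split downloaded per the README; column names follow the UCI data dictionary):

```python
# UCI encodes missing values as "?"; dropping those rows yields 30,162.
import pandas as pd

ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

adult = (
    pd.read_csv("adult.data", header=None, names=ADULT_COLUMNS,
                skipinitialspace=True, na_values="?")
    .dropna()
)
assert adult.shape == (30162, 15)  # 14 features plus the binary income target
```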
### Schema authoring: the important caveat
SynthForge is schema-driven, not data-fitted. For each dataset we hand-author a SynthForge schema from the public UCI data dictionary (documented column types, ranges, categorical sets, and standard demographic priors). We do not fit the SynthForge schema to the real CSV.
This is the honest framing: given only the public data dictionary, what does SynthForge produce? CTGAN, in contrast, sees the real data during training. This is a deliberate asymmetry the benchmark exists to measure. A tool that needs no data access has different operational properties from one that does.
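To give a feel for what "hand-authored from the data dictionary" means, here is a purely illustrative schema fragment plus the constraint-conformance check from the metrics table. The dict format below is a stand-in, not SynthForge's actual schema syntax; the enum values come from the public UCI data dictionary and the numeric bounds are illustrative.

```python
import numpy as np

# Illustrative stand-in for a hand-authored schema. This is NOT SynthForge's
# real schema format; enum values are from the public UCI data dictionary,
# numeric bounds are illustrative.
ADULT_SCHEMA = {
    "age": {"type": "int", "min": 17, "max": 90},
    "hours-per-week": {"type": "int", "min": 1, "max": 99},
    "workclass": {"type": "enum", "values": [
        "Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
        "Local-gov", "State-gov", "Without-pay", "Never-worked",
    ]},
}


def conformance(synth_df, schema=ADULT_SCHEMA):
    """Fraction of synthetic rows satisfying every range/enum constraint."""
    ok = np.ones(len(synth_df), dtype=bool)
    for col, spec in schema.items():
        if spec["type"] == "enum":
            ok &= synth_df[col].isin(spec["values"]).to_numpy()
        else:
            ok &= synth_df[col].between(spec["min"], spec["max"]).to_numpy()
    return float(ok.mean())
```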
### How to read the privacy numbers
A higher DCR means synthetic rows are not close copies of real rows. An NNDR closer to 1 means a synthetic row is not anomalously close to one specific real row. SynthForge's high DCR is structural: it cannot memorise data it never saw.
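Both metrics reduce to nearest-neighbour distances. A minimal sketch, assuming numeric, consistently scaled feature matrices (the harness implementation in `app/benchmarks` may differ in detail):

```python
# Sketch of DCR and NNDR as defined in the metrics table (illustrative).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def dcr_nndr(real, synth):
    # Nearest and second-nearest real neighbour of every synthetic row.
    d_syn, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(synth)

    # Intra-real baseline: each real row's nearest *other* real row
    # (column 0 is the row itself at distance 0, so take column 1).
    d_real, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)
    baseline = np.median(d_real[:, 1])

    dcr = np.percentile(d_syn[:, 0], 5) / baseline  # 5th percentile, normalised
    nndr = np.median(d_syn[:, 0] / d_syn[:, 1])     # nearest / second-nearest
    return dcr, nndr
```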
### Reproducibility
- Harness lives at `apps/generator/app/benchmarks/`; tests at `apps/generator/tests/benchmarks/`.
- CLI: `python -m app.benchmarks.cli --dataset all --output results.json`.
- Real CSVs are not committed; download instructions are in `apps/generator/benchmarks_data/README.md`.
- All randomness uses seed 42.
- Re-run any time. The numbers above come from the most recent run committed to `www/src/data/benchmarks/results.json`.
### What we explicitly do not measure (and why)
- Differential privacy guarantees. Differential privacy (DP) is a mathematical bound on how much any single record in the training data can change the output, expressed as a pair like (ε = 2, δ = 1e-6); the formal definition appears after this list. SynthForge does not yet ship a DP-aware path; SmartNoise integration is on the roadmap. Claiming DP numbers without the math behind them would be misleading, so we don't.
- Image, text, and time-series datasets. These are out of scope for a tabular synthetic-data benchmark.
- Membership-inference attacks against DP baselines. Same gap as DP.
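For readers unfamiliar with the (ε, δ) notation above: a randomised mechanism M is (ε, δ)-differentially private if, for every pair of datasets D and D′ differing in a single record and every set of outputs S,

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta.
$$

Smaller ε and δ mean a tighter bound, i.e. stronger privacy.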
Want to inspect the raw numbers? Download `results.json`. The harness source is not currently public; reach out at hello@synthforge.io if you want to review or replicate the methodology.