Five minutes from schema to seeded local DB

Paste your DDL, generate a dataset, and run one command to seed a local Postgres, MySQL, or SQLite. No factory_boy script. No fixtures.yaml. No seed.py.

The friction

You sit down to bring up a local dev database. You need real-ish data in it. Your options: a seed script that hasn’t survived the last three schema migrations, a fixtures.yaml somebody checked in two years ago that references tables that no longer exist, a seed.sql that only runs on one person’s machine, or nothing. Most teams pick nothing and hardcode IDs.

We built SynthForge to skip all of that. Paste your schema, generate a dataset, run one command. Populated local DB in five minutes.

Hand SynthForge your schema

You already have the DDL. Paste it at /schemas/create/sql. There are five explicit dialects (Postgres, MySQL, SQLite, SQL Server, MariaDB), plus auto-detect as the default. The parser is deterministic, with no LLM in the loop, and CockroachDB is handled on the auto-detect path as Postgres-compatible syntax.

You can describe it in plain English. /schemas/create/ai is a single text box. Describe your tables and relationships; the AI agent emits a schema you can edit before you ever generate data.

You want a starting point. /schemas/create/template has pre-built schemas for common shapes (e-commerce, healthcare, SaaS). Pick one and customize.

You want to draw it. /schemas/new opens the visual editor directly.

Forge the dataset

Open “Forge Dataset” from your schema. The form renders in this order:

Dataset name (optional). Then row counts. Top-level tables take a plain number. Child tables (any table with a foreign key pointing up to a parent) get a mode toggle: fixed count or per parent row. That toggle is what we wrote about in the cardinality post; you set “8 to 12 orders per customer” and SynthForge derives the child count. For M:N junction tables, the structure is auto-detected and the form picks a sensible larger-side driver.
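To make the per-parent mode concrete, here is a minimal Python sketch of how a child row count could follow from a per-parent range. The function name, the uniform draw, and the fixed seed are assumptions made for illustration; the post does not describe how SynthForge actually samples the counts.

```python
import random


def derive_child_counts(parent_rows: int, low: int, high: int, seed: int = 0) -> list[int]:
    """Draw one child count per parent row from [low, high].

    The uniform distribution is an assumption, not SynthForge's documented behavior.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(parent_rows)]


# 1,000 customers with "8 to 12 orders per customer"
counts = derive_child_counts(parent_rows=1_000, low=8, high=12)
total_orders = sum(counts)  # derived row count for the orders table
print(total_orders)  # lands between 8,000 and 12,000
```

The point is that the child table's size is an output of the parent count and the range, not a number you type in.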

Below the row counts, the form shows a live memory estimate. If it climbs too high, the Generate button disables and the banner tells you why.

Output formats come next. Pick CSV. That’s the format that ships the SQL loader scripts. Parquet and JSON are there too, but they won’t seed a SQL database.

Hit Generate. First run after a quiet period adds a couple of seconds while the container wakes. If you’ve been working in the editor for the past few minutes, it’s already warm and the job starts without the pause.

What you download

A zip. For a three-table schema it looks like this:

my-dataset.zip
├── csv/
│   ├── users.csv
│   ├── orders.csv
│   └── order_items.csv
├── import_postgresql.sql
├── import_mysql.sql
├── import_sqlite.sql
├── import_sqlserver.sql
├── import_mariadb.sql
├── import_duckdb.sql
└── import_cockroachdb.sql

One loader script per supported engine, seven total. To seed a local Postgres:

unzip my-dataset.zip && psql mydb -f import_postgresql.sql

Each script runs in four phases: DROP TABLE IF EXISTS in reverse-dependency order (children drop before parents), CREATE TABLE in dependency order with no inline FK constraints, COPY/LOAD/import from csv/{table}.csv, then ALTER TABLE ADD CONSTRAINT for every foreign key at the end. All rows are present before any FK constraint is applied, so import order within a transaction is safe regardless of which engine you’re targeting.
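The four-phase ordering can be sketched in Python with the standard library's `graphlib`. This is an illustration of the ordering idea only: `loader_phases` and the statement strings are hypothetical placeholders, not the SQL that SynthForge emits.

```python
from graphlib import TopologicalSorter


def loader_phases(foreign_keys: dict[str, set[str]]) -> list[str]:
    """foreign_keys maps each table to the set of parent tables it references."""
    # static_order() yields parents before the children that depend on them.
    create_order = list(TopologicalSorter(foreign_keys).static_order())

    # Phase 1: drop children before parents (reverse dependency order).
    steps = [f"DROP TABLE IF EXISTS {t}" for t in reversed(create_order)]
    # Phase 2: create tables with no inline FK constraints.
    steps += [f"CREATE TABLE {t} (...)" for t in create_order]
    # Phase 3: bulk-load each CSV (placeholder text, not real SQL).
    steps += [f"LOAD csv/{t}.csv INTO {t}" for t in create_order]
    # Phase 4: add every foreign key once all rows are present.
    steps += [
        f"ALTER TABLE {child} ADD CONSTRAINT ... REFERENCES {parent}"
        for child in create_order
        for parent in sorted(foreign_keys.get(child, ()))
    ]
    return steps


fks = {"users": set(), "orders": {"users"}, "order_items": {"orders"}}
for step in loader_phases(fks):
    print(step)
```

Because the constraints are added last, the load phase never has to care whether a parent row exists yet.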

Why CSV plus COPY

COPY FROM (or your engine’s equivalent: LOAD DATA, .import, BULK INSERT) is considerably faster than a stream of per-row INSERT statements on any non-trivial load. How much faster depends on the engine, the row width, and your hardware. People who care already know this.

The CSVs are also portable. All seven loader scripts point at the same files. Generate once, hand the zip to a teammate on a different stack, and they run their own loader without touching the data.

The DDL travels with the artifact too. Each loader script creates its tables before it loads anything. No separate schema file, no versioning question, no drift.

If you want a single INSERT-statement dump you can replay against a running instance, pg_dump --data-only --inserts is the right tool (without --inserts, pg_dump emits COPY blocks by default). That’s not what this is.

The one gotcha: memory

The memory estimator works in bytes, not rows. It multiplies bytes per row by row count, then by the overhead Polars and the format writers add on top. With a single CSV output the multiplier lands at about 5.6x, which means the 8 GB peak budget corresponds to roughly 1.43 GB of raw row data.

How fast you burn through that depends on your field types. Each integer column counts as 8 bytes, email as 28, paragraph as 500, and text as 2,000. A schema with five integer columns uses 40 bytes per row: you could push about 38 million rows before the soft-warn fires. A schema with one text column and three integers is 2,024 bytes per row, and the warn fires somewhere around 750,000 rows.
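The arithmetic is easy to reproduce. The sketch below uses the per-field sizes and the 5.6x multiplier quoted above; the function names are hypothetical, and reading "8 GB" as 8 GiB (2**30 bytes each) is an assumption that happens to match the row counts quoted in this post.

```python
GIB = 2**30
PEAK_BUDGET_BYTES = 8 * GIB   # peak budget quoted in the post (GiB is an assumption)
CSV_MULTIPLIER = 5.6          # approximate overhead for a single CSV output

# Per-field byte estimates quoted in the post
FIELD_BYTES = {"integer": 8, "email": 28, "paragraph": 500, "text": 2000}


def bytes_per_row(columns: list[str]) -> int:
    """Sum the estimated size of every column in one row."""
    return sum(FIELD_BYTES[c] for c in columns)


def estimated_peak_bytes(columns: list[str], rows: int,
                         multiplier: float = CSV_MULTIPLIER) -> float:
    """Bytes per row x row count x format overhead."""
    return bytes_per_row(columns) * rows * multiplier


def max_rows(columns: list[str], budget: int = PEAK_BUDGET_BYTES,
             multiplier: float = CSV_MULTIPLIER) -> int:
    """Largest row count whose estimated peak stays within the budget."""
    return int(budget / (multiplier * bytes_per_row(columns)))


narrow = ["integer"] * 5
wide = ["text"] + ["integer"] * 3
print(bytes_per_row(narrow), max_rows(narrow))  # 40 bytes/row, about 38 million rows
print(bytes_per_row(wide), max_rows(wide))      # 2024 bytes/row, roughly 750,000 rows
```

Swapping a `text` column for a `paragraph` in `wide` shows why narrowing a wide column moves the ceiling more than trimming row counts does.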

Those are memory ceilings, not speed promises. In practice, generation time is usually the binding limit first: a typical table of about a million rows finishes in a few minutes, and richly faked or very wide tables run slower. Treat the multi-million-row memory headroom as a ceiling, not an everyday target, and split very large volumes across multiple jobs.

When the warning fires, the Generate button disables. The banner shows your estimated peak and the free-tier cap (8 GB). Narrowing a wide column’s max_length usually brings the estimate down faster than reducing row counts.

The free tier

5 GB stored per account, two generation jobs running concurrently. No credit card required.

Try it

Start at app.synthforge.io and open /schemas/create/sql to paste your DDL. The full feature surface is on the data generation page.

