Generate Fraud Detection ML Training Data

Generate production-ready fraud detection datasets with extreme class imbalance (2% fraud), log-normal transaction amounts, and behavioral risk signals, ready for anomaly detection models.

Binary classification · 10,000 rows · 5 features · 2/98 extreme imbalance · Log-normal amounts · Noise 0.1

Fraud Detection template configuration

Here is the pre-built template configuration. You can customize every setting after loading it.

fraud-detection.json

{
  "templateName": "Fraud Detection",
  "taskType": "classification",
  "numSamples": 10000,
  "features": [
    { "name": "transaction_amount", "type": "numeric", "distribution": "log-normal", "mean": 150, "std": 200 },
    { "name": "distance_from_home", "type": "numeric", "distribution": "exponential", "mean": 25 },
    { "name": "time_since_last", "type": "numeric", "distribution": "exponential", "mean": 48 },
    { "name": "merchant_category", "type": "categorical", "categories": ["retail", "online", "travel", "food", "gas"] },
    { "name": "is_international", "type": "boolean", "trueRatio": 0.08 }
  ],
  "target": { "labels": ["legitimate", "fraud"], "weights": [98, 2] },
  "noise": 0.1
}
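
To make the target block concrete, here is a minimal Python sketch that parses the configuration and derives the expected number of rows per class. It assumes the weights are relative values that are normalized to probabilities; the helper name expected_class_counts is hypothetical and not part of SynthForge IO.

```python
import json

# The template configuration shown above, embedded as a string for illustration.
CONFIG = json.loads("""
{
  "templateName": "Fraud Detection",
  "taskType": "classification",
  "numSamples": 10000,
  "features": [
    { "name": "transaction_amount", "type": "numeric", "distribution": "log-normal", "mean": 150, "std": 200 },
    { "name": "distance_from_home", "type": "numeric", "distribution": "exponential", "mean": 25 },
    { "name": "time_since_last", "type": "numeric", "distribution": "exponential", "mean": 48 },
    { "name": "merchant_category", "type": "categorical", "categories": ["retail", "online", "travel", "food", "gas"] },
    { "name": "is_international", "type": "boolean", "trueRatio": 0.08 }
  ],
  "target": { "labels": ["legitimate", "fraud"], "weights": [98, 2] },
  "noise": 0.1
}
""")


def expected_class_counts(config):
    """Hypothetical helper. Assumption: weights are relative and get normalized."""
    target = config["target"]
    total_weight = sum(target["weights"])
    n_rows = config["numSamples"]
    return {
        label: n_rows * weight / total_weight
        for label, weight in zip(target["labels"], target["weights"])
    }


counts = expected_class_counts(CONFIG)
feature_names = [feature["name"] for feature in CONFIG["features"]]
print(counts)
print(feature_names)
```

Under that assumption, the 98/2 weights on 10,000 rows work out to 9,800 legitimate and 200 fraud rows.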

Built for Fraud Detection

Every feature is configured with domain-appropriate distributions and realistic parameters.

Extreme Class Imbalance (2/98)

Pre-configured with 2% fraud / 98% legitimate split reflecting real-world fraud rates. Test how your model handles rare event detection with SMOTE, undersampling, or cost-sensitive learning.
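
As a rough illustration of cost-sensitive learning on this split, the sketch below computes inverse-frequency class weights with the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for class_weight="balanced". The function name is hypothetical and the counts assume an exact 9,800/200 split.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_for_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Assumed exact split for a 10,000-row dataset at the default 2% fraud rate.
weights = balanced_class_weights({"legitimate": 9800, "fraud": 200})
cost_ratio = weights["fraud"] / weights["legitimate"]
print(weights, cost_ratio)
```

With these counts a misclassified fraud row costs about 49 times as much as a misclassified legitimate row, which is simply the 98/2 ratio expressed as a penalty.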

Transaction Amount Distributions

Transaction amounts follow a log-normal distribution: most transactions are small, with a long tail of large purchases. The mean and standard deviation are configurable, so you can match your domain.
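
The sketch below shows one way to turn a mean and standard deviation into log-normal samples using only the Python standard library. It assumes the configured mean (150) and std (200) describe the amounts themselves rather than the underlying normal distribution; how SynthForge IO interprets these parameters is not stated here, so treat the conversion as an illustration.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert an arithmetic mean/std into the (mu, sigma) of the underlying normal."""
    sigma_squared = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - sigma_squared / 2.0
    return mu, math.sqrt(sigma_squared)


mu, sigma = lognormal_params(150, 200)
median = math.exp(mu)  # the median of a log-normal is exp(mu)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
amounts = [rng.lognormvariate(mu, sigma) for _ in range(10_000)]
share_below_mean = sum(a < 150 for a in amounts) / len(amounts)
print(round(median, 2), round(share_below_mean, 2))
```

Under this reading the median amount is 90, well below the mean of 150, and roughly two thirds of the sampled amounts fall below the mean. That gap is the long tail in practice.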

Behavioral Distance Features

Distance from home and time since last transaction use exponential distributions. These behavioral signals model how fraudulent transactions differ from normal spending patterns.
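
For a feel of what these exponential features look like, here is a small standard-library sketch that samples both using the template means (25 and 48). The configuration does not state units, so none are assumed, and the helper name is hypothetical.

```python
import math
import random


def sample_exponential(mean, n, rng):
    """Draw n values from an exponential distribution with the given mean."""
    # random.expovariate takes the rate, which is 1 / mean.
    return [rng.expovariate(1.0 / mean) for _ in range(n)]


rng = random.Random(0)
distances = sample_exponential(25, 10_000, rng)  # distance_from_home
gaps = sample_exponential(48, 10_000, rng)       # time_since_last

mean_distance = sum(distances) / len(distances)
mean_gap = sum(gaps) / len(gaps)
theoretical_median_distance = 25 * math.log(2)   # median = mean * ln(2)
share_far = sum(d > 75 for d in distances) / len(distances)
print(round(mean_distance, 1), round(mean_gap, 1), round(share_far, 3))
```

Most values cluster near zero, and only about 5% of distances exceed three times the mean (e^-3). Those tail values are the ones a model can learn to treat as unusual.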

Merchant Category Risk

Five merchant categories with configurable weights. Model category-specific fraud risk and test how merchant type contributes to fraud detection accuracy.
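
Weighted category sampling can be sketched as follows. The category names come from the template, but the weights are purely hypothetical values chosen for illustration; the template shown above does not list any.

```python
import random
from collections import Counter

categories = ["retail", "online", "travel", "food", "gas"]
weights = [35, 30, 5, 20, 10]  # hypothetical relative weights, not template defaults

rng = random.Random(0)
draws = rng.choices(categories, weights=weights, k=10_000)

# Observed share of each category in the sample.
shares = {name: count / len(draws) for name, count in Counter(draws).items()}
print(shares)
```

Lowering the weight of a category such as travel makes it rare in the data, which is useful for checking whether a model overreacts to infrequent merchant types.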

Who uses Fraud Detection training data?

Fintech Fraud Teams

Build and benchmark fraud detection models with realistic transaction data. Test SMOTE, threshold tuning, and cost-sensitive approaches with known ground truth.

Risk & Compliance Analysts

Evaluate fraud detection rules and thresholds with controlled synthetic data. Because the data is entirely synthetic, PCI-DSS concerns do not apply.

ML Researchers

Study extreme class imbalance techniques with configurable fraud rates. Compare oversampling, undersampling, and ensemble methods on controlled datasets.
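
As one example of the resampling methods mentioned above, the following sketch implements plain random undersampling of the majority class. The row layout (dictionaries with a "label" key) and the function name are assumptions made for illustration, not the export schema.

```python
import random


def undersample_majority(rows, rng, ratio=1.0, label_key="label", positive="fraud"):
    """Keep every positive row and a random subset of negatives.

    ratio is the number of negatives kept per positive (1.0 gives a balanced set).
    """
    positives = [row for row in rows if row[label_key] == positive]
    negatives = [row for row in rows if row[label_key] != positive]
    keep = min(len(negatives), round(len(positives) * ratio))
    sampled = positives + rng.sample(negatives, keep)
    rng.shuffle(sampled)  # avoid leaving all positives at the front
    return sampled


rng = random.Random(0)
# Toy training split with the default 2/98 imbalance.
train = [{"label": "fraud"}] * 200 + [{"label": "legitimate"}] * 9800
balanced = undersample_majority(train, rng)
fraud_share = sum(row["label"] == "fraud" for row in balanced) / len(balanced)
print(len(balanced), fraud_share)
```

Resampling should be applied to the training split only. The test split keeps the original imbalance so that the reported metrics reflect the rarity the model will actually face.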

Extreme Imbalance Handling

Fraud detection requires models that work with extremely rare positive events. SynthForge IO generates datasets purpose-built for this challenge.

2% Fraud Rate Default

The default 2/98 split matches real-world fraud rates. At 10,000 rows, you get ~200 fraud cases, enough to train and evaluate, while maintaining realistic rarity.

Adjustable Imbalance

Set the fraud rate from 0.1% (extreme rarity) to 50% (balanced). Test how your model degrades as the positive class becomes rarer.
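
To see what a given fraud rate means in absolute numbers, this sketch computes the expected count of fraud rows and its standard deviation. It assumes each row's label is drawn independently (a binomial model); the generator may instead fix the class counts exactly.

```python
import math


def positive_count_stats(n_rows, fraud_rate):
    """Expected number of fraud rows and its standard deviation (binomial model)."""
    expected = n_rows * fraud_rate
    std = math.sqrt(n_rows * fraud_rate * (1.0 - fraud_rate))
    return expected, std


for rate in (0.001, 0.02, 0.5):
    expected, std = positive_count_stats(10_000, rate)
    print(f"rate={rate:.1%}  expected fraud rows={expected:.0f}  std={std:.1f}")

default_expected, default_std = positive_count_stats(10_000, 0.02)
rare_expected, _ = positive_count_stats(10_000, 0.001)
```

At 0.1% a 10,000-row dataset contains only about 10 fraud rows, so a stress test at that rate generally needs a larger row count to stay meaningful.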

10,000 Row Default

The 10,000-row default yields about 200 positive samples at the 2% fraud rate. At more extreme imbalance ratios, scale up to 100K+ rows so that enough positive samples remain for production-scale testing.

Evaluation-Ready

Evaluate with precision-recall curves, F1 scores, and AUC-PR rather than accuracy alone. The baseline model evaluation highlights metrics appropriate for imbalanced data.
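
AUC-PR is commonly summarized as average precision. The sketch below is a minimal standard-library version of that metric; it assumes labels are 0/1 and that scores contain no ties, and it is not necessarily how the SynthForge IO baseline evaluation computes its numbers.

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision@rank over the ranks where a positive appears.

    Assumes binary 0/1 labels, at least one positive, and no tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(y_true)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Toy example: positives ranked 1st and 3rd out of five rows.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Here the two positives contribute precisions of 1/1 and 2/3, giving an average precision of 5/6. A random ranker would score close to the fraud rate itself, which is why the metric stays informative under heavy imbalance.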

Frequently asked questions

Why is the default dataset 10,000 rows?

At a 2% fraud rate, 10,000 rows gives you ~200 fraud cases. Smaller datasets would have too few positive samples for reliable model training and evaluation. You can increase the size for more positive samples.

Can I adjust the fraud rate?

Yes. The class weights are fully configurable. Set fraud from 0.1% (extreme rarity for stress-testing) to 50% (balanced dataset for initial model development). The default 2% matches typical real-world fraud rates.

Is this data PCI-DSS compliant?

PCI-DSS applies to real cardholder data. SynthForge IO generates entirely synthetic data from statistical distributions. No real transaction records are used or derivable. The output is safe for development, testing, and education.

What export formats are available?

Datasets export as a ZIP containing CSV and Parquet files with separate train and test splits, a data quality report, and an auto-generated Jupyter notebook for immediate exploration and modeling.

How should I evaluate fraud detection models on this data?

With extreme imbalance, accuracy is misleading (a model predicting 'legitimate' for everything scores 98%). Use precision-recall curves, F1 score, and AUC-PR instead. The SynthForge IO baseline evaluation reports these metrics.
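
The 98% figure can be reproduced in a few lines. This sketch scores a baseline that always predicts "legitimate" on an assumed exact 200/9,800 split, using a small hypothetical metrics helper (1 = fraud, 0 = legitimate).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Define precision/recall/F1 as 0.0 when their denominators are zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1] * 200 + [0] * 9800   # 2% fraud, 98% legitimate
always_legitimate = [0] * 10_000  # baseline that never flags fraud
metrics = binary_metrics(y_true, always_legitimate)
print(metrics)
```

The baseline reaches 98% accuracy while catching no fraud at all, so its recall and F1 are both zero.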

Start Generating Fraud Detection Training Data

Load the Fraud Detection template, customize features and parameters, and export publication-ready datasets in seconds.