Structured Synthetic Data Generation: Preserving Statistical Relationships Between Features

Preserving Statistical Relationships in Synthetic Data Through Copula-Based Modeling and Cluster Analysis

Artificial Intelligence
Rohit Aggarwal
Harpreet Singh

In today's data-driven world, organizations face a critical challenge: they need realistic data for testing, development, and training AI models, but using real customer data raises serious privacy concerns and may violate regulations like HIPAA or FERPA. Enter synthetic data generation—a powerful solution that creates artificial data mimicking the statistical properties of real data while completely protecting privacy.
This article explores a sophisticated synthetic data generation pipeline that transforms real data into artificial data that is statistically similar yet entirely synthetic. What sets this approach apart is that it not only replicates individual feature distributions but also preserves the crucial relationships between features: the correlations and dependencies that make data truly useful. Let's dive into how this works.

 

Why Synthetic Data Matters

Before exploring the technical implementation, let's understand why synthetic data is valuable:

  • Privacy compliance: Eliminates the risk of exposing sensitive customer information
  • Development freedom: Enables teams to work with realistic data without security constraints
  • Training AI models: Provides diverse, representative data for machine learning applications
  • Testing edge cases: Allows creation of specific scenarios that might be rare in real data
  • Relationship preservation: Maintains critical correlations and dependencies between variables that simple randomization methods cannot capture

The last point is particularly crucial. Many synthetic data approaches can generate values that match individual feature distributions, but fail to maintain the relationships between features. For example, in a banking dataset, a simple approach might generate realistic account balances and realistic transaction frequencies independently, but miss the vital correlation between these variables. Our method specifically addresses this challenge.

 

The Problem with Independent Feature Generation

To understand why preserving relationships matters, consider a common e-commerce scenario:

In real customer data, affluent customers not only spend more per purchase, but they also tend to buy a wider variety of products. This creates a natural correlation between:

  • Average purchase amount
  • Number of unique products purchased
  • Customer income level

If we were to generate synthetic data by creating each of these features independently—even if each feature's distribution perfectly matches the original data—we would lose these critical relationships. We might end up with unrealistic scenarios like low-income customers purchasing large numbers of luxury items or high-income customers only purchasing a single budget item.

This problem compounds when building predictive models. A model trained on such independently-generated synthetic data would learn incorrect patterns and make faulty predictions when applied to real data. For instance, a recommendation engine might suggest luxury products to customers unlikely to purchase them or miss obvious cross-selling opportunities.

Our approach using copula-based modeling specifically addresses this challenge by mathematically capturing and preserving the dependency structures between features, ensuring the synthetic data maintains these natural correlations.

 

The Synthetic Data Generation Pipeline

Our implementation creates a comprehensive pipeline that transforms real data into synthetic data through several sophisticated steps while preserving statistical properties and relationships. Let's break down each component:

 

1. Preprocessing

The first stage prepares the data through three main steps:

Missing Data Handling

  • Processes target variables first, addressing imbalanced classes for categorical targets or applying transformations to reduce skewness in continuous targets
  • Imputes missing values using median for numerical features and mode/"Unknown" for categorical features

Categorical Data Encoding

  • Applies intelligent encoding based on cardinality (number of unique values):
    • Binary encoding for features with 2 unique values
    • One-hot encoding for features with ≤10 unique values
    • Frequency encoding for high-cardinality features
  • Identifies and transforms highly skewed numerical features using Box-Cox transformation

Standardizing Features

  • Scales numerical features to have zero mean and unit variance
  • Preserves categorical features in their encoded form
  • Stores all transformation parameters for later inverse transformation
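
A minimal sketch of the cardinality-based encoding and scaling logic described above, assuming pandas and scikit-learn; the column names are purely illustrative, and the actual pipeline also stores its transformation parameters for the later inverse step:

import pandas as pd
from sklearn.preprocessing import StandardScaler

def encode_categoricals(df, categorical_cols):
    # Encode each categorical column according to its cardinality
    df = df.copy()
    for col in categorical_cols:
        n_unique = df[col].nunique()
        if n_unique == 2:
            # Binary encoding: map the two categories to 0/1
            df[col] = df[col].map({v: i for i, v in enumerate(df[col].unique())})
        elif n_unique <= 10:
            # One-hot encoding for low-cardinality features
            df = pd.get_dummies(df, columns=[col], prefix=col)
        else:
            # Frequency encoding for high-cardinality features
            df[col] = df[col].map(df[col].value_counts(normalize=True))
    return df

# Illustrative data: two categorical columns and one numerical column
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "region": ["north", "south", "east", "west"],
    "income": [52000.0, 61000.0, 45000.0, 78000.0],
})
df = encode_categoricals(df, ["gender", "region"])
scaler = StandardScaler()                               # kept for the inverse transform later
df[["income"]] = scaler.fit_transform(df[["income"]])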

     

2. Clustering using HDBSCAN

Data often comes from mixed populations with different underlying patterns. HDBSCAN clustering helps identify these natural groupings:

  • Uses Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
  • Advantages over traditional clustering algorithms:
    • No need to specify the number of clusters in advance
    • Finds clusters of varying densities and shapes
    • Adaptively determines cluster count based on data density
  • Handles small datasets by adjusting clustering parameters (min_cluster_size, min_samples)
  • Assigns noise points to their nearest clusters
  • Creates a 'cluster' column to track membership, falling back to a single cluster if needed
  • Separates data by cluster for subsequent processing
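
A minimal sketch of this clustering step, assuming the hdbscan package; the parameter values, the nearest-centroid reassignment of noise points, and the single-cluster fallback are illustrative choices rather than the exact implementation:

import numpy as np
import hdbscan

def cluster_with_fallback(X, min_cluster_size=5, min_samples=5):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
    labels = clusterer.fit_predict(X)            # noise points are labeled -1
    clusters = set(labels) - {-1}
    if len(clusters) < 1:
        # Fall back to a single cluster if no dense structure is found
        return np.zeros(len(X), dtype=int)
    # Reassign noise points to the cluster with the nearest centroid
    centroids = {c: X[labels == c].mean(axis=0) for c in clusters}
    for i in np.where(labels == -1)[0]:
        labels[i] = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
    return labels

labels = cluster_with_fallback(np.random.rand(200, 3))   # illustrative standardized data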

     

3. For Each Cluster

Processing each cluster separately allows the algorithm to better capture the unique characteristics of different data subgroups:

a) Statistical Modeling

This three-step process captures both individual feature distributions and their interrelationships:

Fit Marginal Distribution for Each Feature

  • Tests multiple distribution types (normal, lognormal, exponential, gamma)
  • Selects best fit using AIC (Akaike Information Criterion)
  • Stores distribution parameters for each feature
  • Models each feature's unique pattern independently (e.g., ages might follow a normal distribution, while income follows a log-normal distribution)

Transform to Uniform using CDF

  • Applies Cumulative Distribution Function (CDF) of fitted distributions
  • Transforms each feature to uniform [0,1] distribution
  • Creates standardized representation necessary for copula modeling
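
A minimal sketch of the marginal-fitting and CDF steps using scipy.stats; the candidate list and AIC bookkeeping mirror the description above, while the sample data is illustrative:

import numpy as np
from scipy import stats

CANDIDATES = {"normal": stats.norm, "lognormal": stats.lognorm,
              "exponential": stats.expon, "gamma": stats.gamma}

def fit_best_marginal(x):
    # Fit every candidate distribution and keep the one with the lowest AIC
    best = None
    for name, dist in CANDIDATES.items():
        params = dist.fit(x)
        loglik = np.sum(dist.logpdf(x, *params))
        aic = 2 * len(params) - 2 * loglik
        if np.isfinite(aic) and (best is None or aic < best[2]):
            best = (name, params, aic, dist)
    return best

x = np.random.lognormal(mean=10, sigma=0.5, size=1000)   # illustrative income-like feature
name, params, aic, dist = fit_best_marginal(x)
u = dist.cdf(x, *params)   # uniform [0, 1] values, ready for copula fitting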

Fit Best Copula Model Across Features

  • Tests different copula types (Gaussian, Student-t, Clayton)
  • Selects best-fitting copula based on log-likelihood
  • Captures dependency structure between features

A copula is a mathematical function that connects different marginal distributions to form a joint distribution, preserving relationships between variables. For instance, if higher income correlates with more frequent purchases in your original data, copulas maintain this relationship in synthetic data.

This is where the real magic happens in preserving feature relationships. While each feature's individual distribution is important, the connections between features often contain the most valuable information. For example:

  • In financial data, transaction frequency may be correlated with account balance
  • In healthcare data, age may be correlated with certain medical conditions
  • In e-commerce data, purchase frequency may be correlated with customer lifetime value

Copulas mathematically encode these dependencies, allowing us to generate synthetic data where these critical relationships remain intact. Without this step, you might have realistic-looking individual features but unrealistic combinations of values that would never occur in real data.
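
To make the idea concrete, here is a minimal sketch of fitting a Gaussian copula by hand with numpy and scipy; the pipeline described above also tries Student-t and Clayton copulas and selects the best by log-likelihood, which is omitted here for brevity:

import numpy as np
from scipy import stats

def fit_gaussian_copula(U):
    # U is an (n_samples, n_features) array of uniform [0, 1] marginals
    # Map uniforms to standard-normal scores, clipping to avoid infinities at 0 and 1
    Z = stats.norm.ppf(np.clip(U, 1e-6, 1 - 1e-6))
    # A Gaussian copula is parameterized by the correlation matrix of these scores
    return np.corrcoef(Z, rowvar=False)

U = np.random.rand(500, 3)          # illustrative uniform marginals from the previous step
corr = fit_gaussian_copula(U)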

b) Data Generation

After modeling comes generation of the synthetic data:

Draw Samples from Fitted Copula

  • Generates correlated uniform [0,1] samples from the fitted copula model
  • Maintains the dependency structure between features

Inverse CDF to Transform Each Feature Back

  • Applies inverse CDF (percent point function) using stored distribution parameters
  • Transforms uniform values back to realistic data following original distributions
  • Restores each feature's original statistical shape while preserving relationships
  • Adds appropriate cluster labels to track membership
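
A minimal sketch of the generation step for a Gaussian copula, continuing the assumptions above; the correlation matrix and the marginals list (distribution plus fitted parameters per feature) are illustrative placeholders for the values stored during modeling:

import numpy as np
from scipy import stats

def sample_gaussian_copula(corr, n_samples, seed=0):
    # Draw correlated normal scores, then map them to correlated uniforms
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    return stats.norm.cdf(Z)

def inverse_transform(U, marginals):
    # Map each uniform column back through its fitted marginal via the inverse CDF (ppf)
    return np.column_stack([dist.ppf(U[:, j], *params)
                            for j, (dist, params) in enumerate(marginals)])

corr = np.array([[1.0, 0.6], [0.6, 1.0]])                       # illustrative dependency
marginals = [(stats.norm, (50_000, 12_000)), (stats.gamma, (2.0, 0, 5.0))]
synthetic = inverse_transform(sample_gaussian_copula(corr, 1000), marginals)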

     

4. Combine Cluster Data

  • Merges synthetic data from all clusters based on original cluster proportions
  • Maintains the natural groupings and subpopulations present in the original data
  • Preserves the overall data structure and cluster characteristics
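
A minimal sketch of the combination step, assuming a dict of per-cluster synthetic DataFrames and a dict of the original cluster proportions; both names are illustrative:

import pandas as pd

def combine_clusters(synthetic_by_cluster, original_proportions, n_total):
    # Sample each cluster's synthetic rows in proportion to its share of the original data
    parts = []
    for cluster_id, frac in original_proportions.items():
        n = int(round(frac * n_total))
        parts.append(synthetic_by_cluster[cluster_id].sample(n=n, replace=True, random_state=0))
    return pd.concat(parts, ignore_index=True)

synthetic_by_cluster = {0: pd.DataFrame({"x": [1.0, 2.0]}), 1: pd.DataFrame({"x": [9.0, 10.0]})}
combined = combine_clusters(synthetic_by_cluster, {0: 0.7, 1: 0.3}, n_total=100)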

     

5. Postprocessing

This stage restores the data to its original format:

Reverse Encoding of Categorical Features

  • Converts encoded categorical features back to their original form:
    • Binary encodings → original binary categories
    • One-hot encodings → original categorical values
    • Frequency encodings → original categorical values

Reverse Standardization

  • Applies inverse transformation to all standardized numerical features
  • Restores original scale and data types
  • Ensures the synthetic data matches the format of the original data
  • Handles decimal formatting and type conversion
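
A minimal sketch of the reversal logic, assuming one-hot columns named with a prefix (as produced by pd.get_dummies) and the StandardScaler kept from preprocessing; the column names are illustrative:

import pandas as pd

def reverse_one_hot(df, prefix):
    # Collapse one-hot columns (prefix_value) back into a single categorical column
    onehot_cols = [c for c in df.columns if c.startswith(prefix + "_")]
    df[prefix] = df[onehot_cols].idxmax(axis=1).str[len(prefix) + 1:]
    return df.drop(columns=onehot_cols)

df = pd.DataFrame({"region_north": [1, 0], "region_south": [0, 1], "income": [0.3, -0.3]})
df = reverse_one_hot(df, "region")
# Numerical columns are restored with the scaler stored during preprocessing, e.g.:
# df[["income"]] = scaler.inverse_transform(df[["income"]])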

 

6. Validation

The final step is thorough quality checking to ensure the synthetic data truly resembles the original:

Validate Each Feature & Target Distribution Independently

  • For numerical features: Applies Kolmogorov-Smirnov tests and compares statistical moments
  • For categorical features: Performs chi-square tests and compares category frequencies
  • Calculates metrics like maximum and average differences between distributions
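
A minimal sketch of these per-feature checks with scipy.stats; real and synth are illustrative arrays for a single feature, and the chi-square version assumes the synthetic categories also occur in the real data:

import numpy as np
from scipy import stats

def validate_numeric(real, synth):
    # Two-sample Kolmogorov-Smirnov test plus a comparison of the first two moments
    ks_stat, p_value = stats.ks_2samp(real, synth)
    return {"ks_stat": ks_stat, "p_value": p_value,
            "mean_diff": abs(np.mean(real) - np.mean(synth)),
            "std_diff": abs(np.std(real) - np.std(synth))}

def validate_categorical(real, synth):
    # Chi-square test comparing synthetic category counts against real frequencies
    cats = sorted(set(real))
    synth_counts = np.array([list(synth).count(c) for c in cats])
    real_freq = np.array([list(real).count(c) for c in cats]) / len(real)
    chi2, p = stats.chisquare(synth_counts, f_exp=real_freq * synth_counts.sum())
    return {"chi2": chi2, "p_value": p}

print(validate_numeric(np.random.normal(0, 1, 500), np.random.normal(0, 1, 500)))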

Validate Correlations

  • Compares correlation matrices (Pearson, Spearman)
  • Calculates Frobenius norm of difference matrices
  • Ensures dependency structures are preserved

This validation step is critical for our goal of relationship preservation. After all the transformations, we need to verify that the synthetic data maintains the same correlation patterns as the original. The process compares both linear (Pearson) and rank-based (Spearman) correlations, allowing us to detect if the relationship structures have been maintained across different types of dependencies.
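
A minimal sketch of the correlation check with pandas and numpy; the two DataFrames here are illustrative stand-ins for the real and synthetic datasets:

import numpy as np
import pandas as pd

def correlation_gap(real_df, synth_df, method="pearson"):
    # Frobenius norm of the difference between the two correlation matrices
    diff = real_df.corr(method=method) - synth_df.corr(method=method)
    return np.linalg.norm(diff.values, ord="fro")

rng = np.random.default_rng(0)
real_df = pd.DataFrame(rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], 500), columns=["a", "b"])
synth_df = pd.DataFrame(rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], 500), columns=["a", "b"])
print(correlation_gap(real_df, synth_df, "pearson"), correlation_gap(real_df, synth_df, "spearman"))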

Validate Cluster Preservation

  • Compares cluster proportions between original and synthetic data
  • Evaluates if cluster characteristics are maintained
  • Compiles all validation results into a comprehensive report with statistical measures
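
For the cluster-proportion part of this check, a minimal sketch (the label arrays are illustrative):

import pandas as pd

def cluster_proportion_gap(real_labels, synth_labels):
    # Largest absolute difference in cluster shares between original and synthetic data
    real_p = pd.Series(real_labels).value_counts(normalize=True)
    synth_p = pd.Series(synth_labels).value_counts(normalize=True)
    return real_p.subtract(synth_p, fill_value=0).abs().max()

print(cluster_proportion_gap([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 2, 2]))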

Running the Script

The script can be run in several ways:

With Cursor/Windsurf/Cline

Simply ask the built-in AI assistant to run the script.

Without Cursor/Windsurf/Cline

Run the following in a terminal to install the dependencies:

python -m pip install -r requirements.txt

If you're having trouble, try upgrading pip:

python -m pip install --upgrade pip

For Windows users facing installation issues, follow the guide at: https://github.com/bycloudai/InstallVSBuildToolsWindows?tab=readme-ov-file

Note: Sometimes on Windows machines, py works instead of python:

py -m pip install -r requirements.txt

 

Limitations and Considerations

While this implementation is powerful, it has some limitations:

  • It doesn't make special distinctions between different types of variables during the correlation modeling phase—it treats all variables (including transformed categorical ones) as continuous.
  • This means it might not perfectly preserve some special relationships between categorical and continuous variables, or between categories that were originally part of the same variable.

 

Conclusion

The synthetic data generation pipeline described here offers a powerful solution for organizations needing realistic test data without privacy concerns. What sets it apart from simpler approaches is its sophisticated handling of feature relationships through copula modeling and cluster-aware generation.

By carefully modeling both the individual distributions of features and their relationships, then generating new data that follows these patterns, we can create synthetic data that:

  • Is statistically similar to the real data
  • Maintains important relationships between different pieces of information
  • Preserves the overall structure and patterns of the original data
  • Is safe to use without worrying about privacy regulations
  • Is suitable for testing, development, and analysis purposes

This relationship preservation is crucial for many real-world applications:

  • AI model training: Models trained on synthetic data with preserved relationships will learn the same patterns present in real data
  • Financial analysis: Synthetic financial data must maintain relationships between risk factors and outcomes
  • Healthcare research: The correlations between patient characteristics and medical conditions must be preserved
  • Market research: Customer behavior patterns and preferences need to maintain their interdependencies

The attached code implements this entire pipeline, making it accessible for data scientists and developers who need high-quality synthetic data where relationships between features matter as much as the features themselves.