The Synthetic Data Quality Score is computed by taking a weighted combination of the
individual quality metrics: Field Distribution Stability, Field Correlation Stability
and Deep Structure Stability.
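As a minimal sketch of that weighted combination (the sub-scores and equal weights below are illustrative placeholders; Gretel's actual weights are not given in this document):

    # Illustrative sketch only: the sub-scores and equal weights are made-up example values.
    sub_scores = {
        "field_distribution_stability": 90,
        "field_correlation_stability": 85,
        "deep_structure_stability": 95,
    }
    weights = {name: 1 / 3 for name in sub_scores}  # hypothetical equal weighting
    sqs = sum(weights[name] * score for name, score in sub_scores.items())
    print(round(sqs))  # 90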
Learn more about Synthetic Data Quality Score (SQS)
The Synthetic Data Quality Score is an estimate of how well the generated synthetic data maintains the same
statistical properties as the original dataset. In this sense,
the Synthetic Data Quality Score can be viewed as a utility score or a confidence score as to whether
scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset
instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower
score may be just as acceptable.
If your Synthetic Data Quality Score isn't as high as you'd like it to be,
read here
for a multitude of ideas for improving your model.
How to interpret your SQS
Excellent: Suitable for machine learning or statistical analysis
Good: Suitable for balancing or augmenting machine learning data sources
Moderate: Suitable for pre-production testing environments
Poor: Suitable for demo environments or mock data. Improve your model using our tips and advice.
Very Poor: Significant tuning required to improve model
Data Sharing Use Case
The report also relates the Privacy Protection Level, rated as Excellent, Very Good, Good, Normal, or Poor, to the
data sharing use cases it supports: internally within the same team, internally across different teams, externally
with trusted partners, and externally with public availability. The more broadly the synthetic data will be shared,
the higher the Privacy Protection Level should be.
Your Privacy Protection Level (PPL) is determined by the privacy mechanisms you've enabled in the synthetic
configuration. The use of these mechanisms helps to ensure that your synthetic data is safe from adversarial
attacks. There are four primary protection mechanisms you can add to the creation of synthetic data.
The Outlier Filter ensures that no synthetic record is an outlier with respect to the training space, and is
enabled in the configuration by setting privacy_filters.outliers to medium or high.
The Similarity Filter ensures that no synthetic record is overly similar to a training record. This filter is
enabled in the configuration by setting privacy_filters.similarity to medium or high.
You can also set privacy_filters.outliers to auto, which will try medium first and fall back to turning the filter
off if it prevents the synthetic model from generating the requested number of records.
Overfitting Prevention ensures that model training stops before it has a chance to overfit and is enabled
using the validation_split: True and
early_stopping: True configuration settings.
Differential Privacy is an experimental implementation of DP-SGD that modifies the optimizer to offer provable
guarantees of privacy, enabling safe training on private data. Differential Privacy can cause a hit to utility, often
requiring larger datasets to work well, but it uniquely provides privacy guarantees against both known and unknown
attacks on data. Differential Privacy can be enabled by setting dp: True and can
be modified using the associated configuration settings.
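Taken together, these mechanisms correspond to a small set of configuration keys. A minimal sketch of those
settings, shown here as a Python dict that mirrors the YAML keys named above (their exact placement inside a full
Gretel configuration file is not shown):

    # Sketch only: mirrors the configuration keys described in this section.
    privacy_settings = {
        "privacy_filters": {
            "outliers": "medium",    # or "high", or "auto"
            "similarity": "medium",  # or "high"
        },
        "validation_split": True,    # Overfitting Prevention
        "early_stopping": True,      # Overfitting Prevention
        "dp": True,                  # experimental Differential Privacy (DP-SGD)
    }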
Field Correlation Stability
To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the
training data, and then in the synthetic data. The absolute difference between these values is then computed and
averaged across all field pairs. The lower this average value is, the higher the Field Correlation Stability quality
score will be. To aid in the comparison of field correlations, a heatmap is shown for both the training data and
the synthetic data, as well as a heatmap for the computed difference of correlation values. If the intended purpose
of the synthetic data is to perform statistical analysis or machine learning, maintaining the integrity of field
correlations can be critical.
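As a rough sketch of this comparison for numeric fields, assuming the training and synthetic datasets are loaded as
pandas DataFrames with matching columns (an illustration of the idea, not Gretel's exact implementation):

    import numpy as np
    import pandas as pd

    def correlation_gap(train: pd.DataFrame, synth: pd.DataFrame) -> float:
        """Mean absolute difference between pairwise field correlations computed on the
        training data and on the synthetic data; a lower gap means higher stability."""
        train_corr = train.corr(numeric_only=True)
        synth_corr = synth.corr(numeric_only=True).reindex_like(train_corr)
        diff = (train_corr - synth_corr).abs().to_numpy()
        off_diagonal = ~np.eye(len(diff), dtype=bool)  # ignore each field's self-correlation
        return float(diff[off_diagonal].mean())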
Deep Structure Stability
To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a
Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. A synthetic
quality score is created by comparing the distributional distance between the principal components found in each
dataset. The closer the principal components are, the higher the synthetic quality score will be. As PCA is a very
common approach used in machine learning for both dimensionality reduction and visualization, this metric gives
immediate feedback as to the utility of the synthetic data for machine learning purposes.
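A loose sketch of that idea using scikit-learn, assuming both datasets have been numerically encoded with the same
columns; the exact distance measure and preprocessing Gretel uses are not specified here:

    import numpy as np
    from sklearn.decomposition import PCA

    def deep_structure_distance(train, synth, n_components=5):
        """Fit PCA separately on the training and synthetic data and compare the
        principal axes; a smaller value suggests the deep structure is better preserved."""
        pca_train = PCA(n_components=n_components).fit(train)
        pca_synth = PCA(n_components=n_components).fit(synth)
        # The sign of a principal component is arbitrary, so compare absolute cosine similarity.
        similarities = [abs(np.dot(a, b))
                        for a, b in zip(pca_train.components_, pca_synth.components_)]
        return 1.0 - float(np.mean(similarities))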
Field Distribution Stability
Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those
in the original data. For each numeric or categorical field we use a common approach for comparing two distributions
referred to as the Jensen-Shannon Distance. The lower the JS Distance score is on average across all fields, the
higher the Field Distribution Stability quality score will be. Note that highly unique strings (neither numeric nor
categorical) will not have a distributional distance score. To aid in the comparison of original versus synthetic
field distributions, a bar chart or histogram is shown for each numeric or categorical field. Depending on the
intended purpose of the synthetic data, maintaining the integrity of field distributions can be critical.
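A small sketch of the per-field comparison using SciPy; the binning choices here, and how the per-field distances
roll up into the final score, are assumptions made for illustration:

    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import jensenshannon

    def field_js_distance(train_col: pd.Series, synth_col: pd.Series, bins: int = 20) -> float:
        """Jensen-Shannon distance between one field's training and synthetic distributions."""
        if pd.api.types.is_numeric_dtype(train_col):
            edges = np.histogram_bin_edges(train_col.dropna(), bins=bins)
            p, _ = np.histogram(train_col.dropna(), bins=edges)
            q, _ = np.histogram(synth_col.dropna(), bins=edges)
        else:
            # Categorical field: align value counts over the union of observed categories.
            categories = sorted(set(train_col.dropna()) | set(synth_col.dropna()))
            p = train_col.value_counts().reindex(categories, fill_value=0).to_numpy()
            q = synth_col.value_counts().reindex(categories, fill_value=0).to_numpy()
        return float(jensenshannon(p / p.sum(), q / q.sum()))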
Data Summary Statistics
The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the
number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic
data created. The more examples available when training a model, the easier it is for the model to accurately learn
the distributions and correlations in the data. Always strive to have a minimum of 3000 training examples;
increasing that to 5000 or even 50,000 is better still.
The more synthetic rows generated, the easier it is to assess whether the statistical integrity of the data remains
intact. If your Synthetic Data Quality Score isn't as high as you'd like it to be, make sure you’ve generated at
least 5000 synthetic data records.
The Training Lines Duplicated value is an important way of ensuring the privacy of the generated synthetic data.
In almost all situations, this value should be 0. The only exception would be if the training data itself contained
a multitude of duplicate rows. If this is the situation, simply remove the duplicate rows before training.
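If your training data does contain duplicate rows, dropping them before training is a one-liner in pandas (the file
names below are hypothetical):

    import pandas as pd

    train_df = pd.read_csv("training_data.csv")  # hypothetical input file
    train_df.drop_duplicates().to_csv("training_data_deduped.csv", index=False)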
Privacy Protection Summary
The report's Privacy Protection Summary lists the default and advanced privacy protections and shows whether each
mechanism (Outlier Filter, Similarity Filter, Overfitting Prevention, Differential Privacy) was enabled or disabled
for the model.
The Outlier privacy filter ensures that no synthetic record is an outlier with respect to the training dataset.
Outliers revealed in the synthetic dataset can be exploited by Membership Inference Attacks, Attribute Disclosure
attacks, and a wide variety of other adversarial attacks. They are a serious privacy risk. The Outlier Filter is enabled
by the "privacy_filters.outliers" configuration setting. A value of "medium" will filter out any synthetic record
that has a very high likelihood of being an outlier. A value of "high" will filter out any synthetic record that
has a medium to high likelihood of being an outlier.
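As a conceptual illustration only, and not Gretel's implementation, an outlier filter could be approximated by
fitting an outlier detector on the numerically encoded training data and discarding any synthetic rows it flags:

    from sklearn.ensemble import IsolationForest

    def drop_outlier_records(synth_encoded, train_encoded, contamination=0.05):
        """Illustrative stand-in for an outlier filter: rows the detector marks as
        outliers relative to the training space are removed from the synthetic data."""
        detector = IsolationForest(contamination=contamination, random_state=0).fit(train_encoded)
        flags = detector.predict(synth_encoded)  # -1 marks an outlier, 1 an inlier
        return synth_encoded[flags == 1]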
The Similarity privacy filter ensures that no synthetic record is overly similar to a training record. Overly
similar training records can be a severe privacy risk as adversarial attacks commonly exploit such records to
gain insights into the original data. The Similarity Filter is enabled by the "privacy_filters.similarity"
configuration setting. A value of "medium" will filter out any synthetic record that is an exact duplicate of a
training record. A value of "high" will filter out any synthetic record that is 99% similar or more to a
training record.
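As a conceptual sketch of the "medium" behavior (not Gretel's implementation; the actual similarity measure behind
the "high" setting is not described here, so it is only noted in a comment):

    import pandas as pd

    def drop_exact_training_matches(synth: pd.DataFrame, train: pd.DataFrame) -> pd.DataFrame:
        """Illustrative stand-in for the 'medium' similarity setting: drop any synthetic row
        that exactly duplicates a training row. A 'high' setting would additionally drop rows
        that are nearly identical (99% or more similar) to some training row."""
        # Assumes both DataFrames share the same columns in the same order.
        train_rows = set(train.astype(str).itertuples(index=False, name=None))
        keep = [row not in train_rows
                for row in synth.astype(str).itertuples(index=False, name=None)]
        return synth[keep]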
The Overfitting Prevention privacy mechanism ensures that the synthetic model will stop training before it has a
chance to overfit. When a model is overfit, it will start to memorize the training data as opposed to learning
generalized patterns in the data. This is a severe privacy risk as overfit models are commonly exploited by
adversaries seeking to gain insights into the original data.
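The general pattern behind validation_split and early_stopping, sketched generically rather than as Gretel's actual
training loop (the model interface here is hypothetical):

    def train_with_early_stopping(model, data, valid_fraction=0.2, patience=5, max_epochs=1000):
        """Hold out part of the training data as a validation split and stop training
        as soon as validation loss stops improving, before the model can overfit."""
        split = int(len(data) * (1 - valid_fraction))
        train_part, valid_part = data[:split], data[split:]
        best_loss, epochs_without_improvement = float("inf"), 0
        for _ in range(max_epochs):
            model.fit_one_epoch(train_part)    # hypothetical model interface
            loss = model.evaluate(valid_part)  # hypothetical model interface
            if loss < best_loss:
                best_loss, epochs_without_improvement = loss, 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
        return model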
Differential Privacy ensures that no individual training record can unduly influence the output of a synthetic
model. It is very effective at preventing the generation of records that are overly similar to the training
set and at ensuring an even distribution of values, though it can result in a modest degradation of the SQS
utility score. Differential privacy is best suited to larger datasets (typically 50K rows or more) where
probabilistic privacy guarantees are required.
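The core mechanics of DP-SGD, sketched schematically and not as Gretel's implementation: each example's gradient is
clipped to a maximum norm, Gaussian noise calibrated to that norm is added, and only then is the optimizer step
applied.

    import numpy as np

    def dp_sgd_update(params, per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.01):
        """One schematic DP-SGD step: clip per-example gradients, sum, add noise, average, update."""
        clipped = []
        for g in per_example_grads:
            norm = np.linalg.norm(g)
            clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=clipped[0].shape)
        noisy_mean = (np.sum(clipped, axis=0) + noise) / len(clipped)
        return params - lr * noisy_mean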
Training field overview
The high-level Field Distribution Stability score is computed by taking the average of the individual Field
Distribution Stability scores, shown in the table below. Distributional stability is applicable to numeric and
categorical fields, but not highly unique strings. To better understand a field's Distribution Stability score,
click on the field name to be taken to a graph comparing the training and synthetic distributions.
The table below also shows the count of unique and missing field values, the average length of each field, as well
as its datatype. When a dataset contains a large number of highly unique fields, or a large amount of missing data,
these characteristics can impede the model's ability to accurately learn the statistical structure of the data.
Exceptionally long fields can also have the same impact.
Read here
for advice on how best to handle fields like these.