Gretel synthetics report

Synthetic Data Quality Report

Moderate

Synthetic Data Quality Score

Normal

Privacy Protection Level

The Synthetic Data Quality Score is computed by taking a weighted combination of the individual quality metrics: Field Distribution Stability, Field Correlation Stability and Deep Structure Stability. Learn more about Synthetic Data Quality Score (SQS)

The Synthetic Data Quality Score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the Synthetic Data Quality Score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.

If your Synthetic Data Quality Score isn't as high as you'd like it to be, read here for a multitude of ideas for improving your model.

How to interpret your SQS	Excellent	Good	Moderate	Poor	Very Poor
Suitable for machine learning or statistical analysis
Suitable for balancing or augmenting machine learning data sources
Suitable for pre-production testing environments
Suitable for demo environments or mock data
Improve your model using our tips and advice
Significant tuning required to improve model

Data Sharing Use Case	Excellent	Very Good	Good	Normal	Poor
Internally, within the same team
Internally, across different teams
Externally, with trusted partners
Externally, public availability

Your Privacy Protection Level (PPL) is determined by the privacy mechanisms you've enabled in the synthetic configuration. The use of these mechanisms help to ensure that your synthetic data is safe from adversarial attacks. There are four primary protection mechanisms you can add to the creation of synthetic data.

The Outlier Filter ensures that no synthetic record is an outlier with respect the training space, and is enabled with the privacy_filters.outliers: [medium, high].
The Similarity Filter ensures that no synthetic record is overly similar to a training record. This filter is enabled in the configuration by setting privacy_filters.similarity: [medium, high].
You can also set privacy_filters.outliers to auto which will try for medium, and fall back to turning the filter off if it prevents the synthetic model from generating the requested number of records.
Overfitting Prevention ensures that model training stops before it has a chance to overfit and is enabled using the validation_split: True and early_stopping: True configuration settings.
Differential Privacy is an experimental implementation of DP-SGD that modifies the optimizer to offer provable guarantees of privacy, enabling safe training on private data. Differential Privacy can cause a hit to utility, often requiring larger datasets to work well, but it uniquely provides privacy guarantees against both known and unknown attacks on data. Differential Privacy can be enabled by setting dp: True and can be modified using the associated configuration settings.

Data Summary Statistics

Good

Field Correlation Stability

Very Poor

Deep Structure Stability

Excellent

Field Distribution Stability

	Training Data	Synthetic Data
Row Count	20	20
Column Count	3	3
Training Lines Duplicated	--	4

What do these values mean?

Privacy Protection Summary

Default Privacy Protections	Advanced Protections

Outlier Filter Disabled

Similarity Filter Disabled

Overfitting Prevention Disabled

Differential Privacy Disabled

Training field overview

Field	Unique	Ave. Length	Type	Distribution Stability
self.Branch_ID\|Branch_Name	10	20.00	Text	N/A
self\|Specialization_Name	20	11.55	Other	N/A
self.Branch_ID\|Address	10	35.90	Text	N/A

Data Summary Statistics

Privacy Protection Summary

Training field overview

Training and Synthetic Data Correlation

Principal Component Analysis