Synthetic Data Generation

Using privacy-sensitive production data for software testing and quality assurance is not only outdated but also restricted by privacy regulations and standards such as GDPR, PCI DSS, and HIPAA. Yet effective software testing requires test data that is “production-like.”

How do you ensure that your test environment’s data sets are both representative and non-traceable to individuals? The solution lies in a combination of data masking and synthetic test data generation.

Generate Synthetic Data: Why and How?

Synthetic test data, often referred to as fake, dummy, mock, or example data, is data created artificially for the purpose of developing and testing applications. Unlike real data or existing information, synthetic test data is generated using algorithms.

There are two primary reasons for generating synthetic test data:

1. Privacy Protection: Synthetic data is used to replace privacy-sensitive information in testing environments, ensuring compliance with data protection regulations.

2. Specific Testing Needs: It is generated to meet specific testing requirements or conditions that may not be present in the production data.

Synthetic test data can be derived from a seed file, generated randomly, or produced based on predefined logic.
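The three sources mentioned above can be illustrated with a minimal Python sketch. The seed values, the `CUST-` identifier format, and the function names are assumptions made for this example only; they are not taken from any particular tool.

```python
import csv
import io
import random

# Hypothetical seed data; in practice this would be a CSV file on disk.
SEED_CSV = "first_name\nAnna\nBram\nChloe\nDaan\n"


def from_seed(rng, seed_text=SEED_CSV):
    """Pick a value from a seed file (here an in-memory CSV)."""
    rows = list(csv.DictReader(io.StringIO(seed_text)))
    return rng.choice(rows)["first_name"]


def random_value(rng, low=18, high=90):
    """Generate a value randomly, for example an age within a range."""
    return rng.randint(low, high)


def from_logic(customer_number):
    """Produce a value from predefined logic, for example a formatted ID."""
    return f"CUST-{customer_number:06d}"


rng = random.Random(42)  # fixed seed so test runs are reproducible
record = {
    "customer_id": from_logic(1),
    "first_name": from_seed(rng),
    "age": random_value(rng),
}
print(record)
```

Seeding the random generator is a design choice: it makes a test data set reproducible, so a failing test can be rerun against identical data.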

Generate Synthetic Data From Real Data

While data masking offers various techniques such as shuffle, redact, and blank to protect data, there are cases where these techniques alone are not enough to ensure that the data cannot be traced to an individual. In such instances, integrating a dummy data generator into your masking project becomes a valuable option.

By incorporating synthetically generated test data, you can replace privacy-sensitive information such as names, email addresses, and bank account numbers. This approach not only enhances data privacy but also aligns your test data more effectively with your test cases.
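As a rough illustration of this kind of replacement, the Python sketch below swaps the name and email address in a record for synthetic values. It is not how DATPROF Privacy works internally; the name lists, the `salt` parameter, and the `mask_customer` function are hypothetical. The replacement is chosen from a hash of the original value, so the same input always maps to the same synthetic output.

```python
import hashlib

# Hypothetical pools of replacement values.
FIRST_NAMES = ["Anna", "Bram", "Chloe", "Daan", "Eva", "Finn"]
LAST_NAMES = ["Jansen", "Smith", "Meyer", "Dubois", "Rossi"]


def _pick(options, original, salt):
    """Choose an option deterministically from a hash of the original value."""
    digest = hashlib.sha256((salt + original).encode("utf-8")).digest()
    return options[int.from_bytes(digest[:8], "big") % len(options)]


def mask_customer(row, salt="project-secret"):
    """Return a copy of the row with name and email replaced by synthetic values."""
    first = _pick(FIRST_NAMES, row["first_name"], salt)
    last = _pick(LAST_NAMES, row["last_name"], salt)
    masked = dict(row)  # non-sensitive columns are copied unchanged
    masked["first_name"] = first
    masked["last_name"] = last
    masked["email"] = f"{first}.{last}@example.com".lower()
    return masked


original = {
    "customer_id": 17,
    "first_name": "Margaret",
    "last_name": "Thornbury",
    "email": "m.thornbury@realmail.test",
}
masked = mask_customer(original)
print(masked)
```

With pools this small, different customers will often receive the same synthetic name and email address. A real masking run would need much larger pools and an explicit uniqueness check where the column requires it.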

Synthetic data generators within DATPROF Privacy

Basic

  • Random string
  • Random date/time
  • Random number
  • Random decimal number
  • Sequential numbers
  • Color
  • Color code
  • And more…

Names

  • Brand
  • Company
  • Male First name
  • Female First name
  • Last name
  • Location
  • Country Code
  • City
  • Street
  • Country
  • And more…

Business

  • BSN (Dutch Social Security Number)
  • SSN (US Social Security Number)
  • IBAN
  • Currency Code
  • Currency Symbol
  • Military rank
  • Job/profession
  • And more…
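Identifiers such as BSN and IBAN carry check digits, so a synthetic value has to satisfy the same checksum to pass application validation. The Python sketch below shows one way to do this, assuming the Dutch "11-test" for BSN and the ISO 13616 mod-97 check for IBAN. The bank code `FAKE` and all function names are hypothetical, and this is not a description of how DATPROF Privacy implements its generators.

```python
import random

# Weights of the Dutch 11-test; the last digit is weighted -1.
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def bsn_is_valid(bsn):
    """Check a 9-digit BSN against the 11-test."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    total = sum(w * int(d) for w, d in zip(BSN_WEIGHTS, bsn))
    return total % 11 == 0


def generate_bsn(rng):
    """Draw 8 random digits and derive the 9th so the 11-test passes."""
    while True:
        first8 = [rng.randint(0, 9) for _ in range(8)]
        check = sum(w * d for w, d in zip(BSN_WEIGHTS, first8)) % 11
        if check < 10:  # a check value of 10 has no single digit; retry
            return "".join(map(str, first8)) + str(check)


def _iban_to_int(rearranged):
    """Convert to a number; letters become two digits (A=10 ... Z=35)."""
    return int("".join(str(int(ch, 36)) for ch in rearranged))


def iban_is_valid(iban):
    """Move the first four characters to the end and test mod 97 == 1."""
    iban = iban.replace(" ", "").upper()
    return _iban_to_int(iban[4:] + iban[:4]) % 97 == 1


def generate_iban(rng, country="NL", bank_code="FAKE"):
    """Build an IBAN-shaped value with correct check digits."""
    account = "".join(str(rng.randint(0, 9)) for _ in range(10))
    bban = bank_code + account
    check = 98 - _iban_to_int(bban + country + "00") % 97
    return f"{country}{check:02d}{bban}"


rng = random.Random(7)
bsn = generate_bsn(rng)
iban = generate_iban(rng)
print(bsn, bsn_is_valid(bsn))
print(iban, iban_is_valid(iban))
```

The sketch only guarantees the IBAN check digits. Any additional rules a bank applies to the account number itself are outside its scope.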

Advanced

  • Random value from seed file (Pick values from a custom CSV seed file)
  • Regular expression (Generate values based on a regular expression)
  • And more…

Generate Synthetic Data to Match Sample Data

While synthetic data is often discussed as a masking technique, the importance of generating data from scratch should not be overlooked. When developing an application for a new system without pre-existing data, data masking isn’t applicable. Yet, you still require test data to assess the application’s functionality with production-like data or to create data volumes that don’t exist.

In such scenarios, synthetic data generation tools come to the rescue. These tools allow you to define the type of data you need, including columns and tables, and populate them with realistic, representative data.
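A minimal sketch of this idea in Python, using an in-memory SQLite database: each column is declared with a SQL type and a small generator function, and the table is then created and filled from that definition. The table name, column names, and value ranges are assumptions chosen for illustration.

```python
import random
import sqlite3
from datetime import date, timedelta


def gen_id(rng, i):
    return i  # sequential number


def gen_last_name(rng, i):
    return rng.choice(["Jansen", "Smith", "Meyer", "Dubois"])


def gen_signup_date(rng, i):
    return (date(2020, 1, 1) + timedelta(days=rng.randint(0, 1460))).isoformat()


def gen_balance(rng, i):
    return round(rng.uniform(0, 5000), 2)


# Column name -> (SQL type, generator function). Hypothetical schema.
COLUMNS = {
    "id": ("INTEGER PRIMARY KEY", gen_id),
    "last_name": ("TEXT NOT NULL", gen_last_name),
    "signup_date": ("TEXT NOT NULL", gen_signup_date),
    "balance": ("REAL NOT NULL", gen_balance),
}


def create_and_fill(conn, table, columns, row_count, rng):
    """Create the table from the column definition and insert generated rows."""
    ddl = ", ".join(f"{name} {sql_type}" for name, (sql_type, _) in columns.items())
    conn.execute(f"CREATE TABLE {table} ({ddl})")
    placeholders = ", ".join("?" for _ in columns)
    rows = [
        tuple(gen(rng, i) for _, gen in columns.values())
        for i in range(1, row_count + 1)
    ]
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


conn = sqlite3.connect(":memory:")
create_and_fill(conn, "customers", COLUMNS, 1000, random.Random(1))
row_total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_total)
print(conn.execute("SELECT * FROM customers LIMIT 3").fetchall())
```

Because the row count is just a parameter, the same definition can produce a handful of rows for a unit test or a large volume for a performance test.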

Synthetic Data Generation Tools

Many database specialists understand how to create test data manually, but this process can be time-consuming, especially when done regularly. The growing demand for synthetic test data and test data generation tools is driven by the need to streamline this process. Test data creation should serve as a means to an end, not the end goal itself.

Synthetic test data can be generated efficiently with test data generator tools. Free test data generators are available online, but they are best suited to simple tasks such as generating a list of first names. For complex tables with multiple columns and interrelated data, relying on free or open-source mock data generators can quickly become impractical and unreliable.

Generating test data is not inherently complex, as algorithms handle the task. What adds complexity is ensuring that the generated data behaves correctly within a database, making it suitable for effective testing. Similar to data masking, data generation requires careful planning and configuration, especially when defining parameters like Primary Key start values and unique constraints within tables. Hence, licensed synthetic data generation tools are often preferred. These tools offer more capabilities and ensure technical and functional consistency, which is crucial for effective development and testing work.
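The constraints mentioned above can be made concrete with a short Python sketch. It generates two related tables with a configurable primary key start value, a unique email column, and foreign keys that always point to an existing parent row, then loads them into SQLite with constraint checking enabled. The schema, start values, and function names are assumptions for this example.

```python
import itertools
import random
import sqlite3


def generate_customers(rng, count, pk_start):
    """Generate (id, email) rows with sequential keys and unique emails."""
    ids = itertools.count(pk_start)
    seen_emails = set()
    customers = []
    while len(customers) < count:
        name = rng.choice(["anna", "bram", "chloe", "daan"])
        email = f"{name}{rng.randint(1, 9999)}@example.com"
        if email in seen_emails:
            continue  # respect the unique constraint; draw again
        seen_emails.add(email)
        customers.append((next(ids), email))
    return customers


def generate_orders(rng, customers, count, pk_start):
    """Generate (id, customer_id, amount) rows that reference existing customers."""
    customer_ids = [cid for cid, _ in customers]
    return [
        (pk_start + i, rng.choice(customer_ids), round(rng.uniform(5, 500), 2))
        for i in range(count)
    ]


rng = random.Random(11)
customers = generate_customers(rng, count=200, pk_start=100000)
orders = generate_orders(rng, customers, count=1000, pk_start=500000)

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "amount REAL NOT NULL)"
)
# Any primary key, unique, or foreign key violation raises sqlite3.IntegrityError.
conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

orphans = conn.execute(
    "SELECT COUNT(*) FROM orders o "
    "LEFT JOIN customers c ON c.id = o.customer_id WHERE c.id IS NULL"
).fetchone()[0]
print(orphans)
```

Starting the keys above the range already in use lets generated rows be added alongside existing data without key collisions.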

Generating Test Data for Your Database with DATPROF Privacy

When you decide to use synthetically generated data for testing, the task of generating a significant amount of data to fit your database might seem daunting. However, with DATPROF Privacy, this process becomes straightforward. DATPROF Privacy is not just a data masking tool; it also serves as a powerful test data generation tool, providing realistic and high-quality test data in terms of content and volume.

DATPROF Privacy supports all major relational databases, such as SQL Server, Oracle, and DB2. Once it is connected to your database, you can integrate data generation seamlessly into your testing process. You can add a generation function, just like any other function in your masking template, to create data for specific columns in your database. Additionally, you have the flexibility to create new columns and generate random data from scratch.

One notable advantage of this approach is that it preserves all existing relationships between tables, ensuring that your complex data structure remains both functionally and technically consistent. Instead of using privacy-sensitive production data, or as an augmentation to your existing data, you can employ synthetic data with confidence. DATPROF Privacy also supports the generation of test data across a chain of interconnected systems, making it a versatile tool for comprehensive testing needs.

Synthetic Data Generation Using Generative AI

When artificial intelligence is used to generate test data, the software first needs to build a model. Generative AI models, or foundation models, use machine learning to learn the relationships between attributes from training data, which enables them to create new data based on these relationships. However, there are some important considerations to keep in mind when using generative AI for test data generation.
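The idea of learning relationships between attributes and then sampling new rows can be shown with a toy Python example. This is a simple frequency table, far simpler than a generative AI model, and is meant only to illustrate the fit-then-sample principle. The training rows and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical training data: (country, city) pairs.
TRAINING_ROWS = [
    ("NL", "Amsterdam"), ("NL", "Utrecht"), ("NL", "Amsterdam"),
    ("DE", "Berlin"), ("DE", "Hamburg"), ("US", "Chicago"),
]


def fit(rows):
    """Learn how often each country occurs and which cities occur per country."""
    countries = Counter(country for country, _ in rows)
    cities = defaultdict(Counter)
    for country, city in rows:
        cities[country][city] += 1
    return countries, cities


def sample(model, rng):
    """Draw a country, then a city conditioned on that country."""
    countries, cities = model
    country = rng.choices(list(countries), weights=list(countries.values()))[0]
    options = cities[country]
    city = rng.choices(list(options), weights=list(options.values()))[0]
    return country, city


model = fit(TRAINING_ROWS)
rng = random.Random(5)
generated = [sample(model, rng) for _ in range(10)]
print(generated)
```

Because the city is drawn conditionally on the country, the generated rows never pair a city with the wrong country. The example also shows one limitation: this toy model can only emit values it saw during training, so it would reproduce real values if trained on sensitive columns.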

Learn more about Generative AI for Test Data Generation here.

FAQ

What is synthetic test data?

Synthetic test data is generated (fake) data that you can, and are allowed to, use for software testing. Because it is not real, it does not contain privacy-sensitive information.

What is synthetic test data generation?

Synthetic test data generation is the process of creating data artificially, either from scratch or to replace existing data, with the help of a test data generation tool.

How is synthetic data generated?

Synthetic data can be generated manually or with the help of a synthetic data generation tool. The latter is the best option if you need volumes of data that you don’t have.

What are the pros of synthetic test data?

  • Using less data
  • Perfectly aligned with your test cases
  • No risk of data leakage
  • Limited dependencies
  • Savings on storage costs and, for example, licenses

What are the cons of synthetic test data?

You need to account for all the attributes your system requires. This means knowing how many attributes your data model (not the database) has, as well as the functional requirements of your systems, data quality issues, historical data, and so on.