Data generator from lists #200

pounard · 2024-11-25T14:28:55Z

To make creating anonymizers easier, we have sometimes talked about letting users provide sample data as plain text, JSON or YAML files.

While thinking about this, and because I wrote the Faker integration not so long ago, I noticed that we could generate much more data if data sources could provide a set of expressions that mix their own data with data from other data sources. Faker does this in a way, using plain old PHP code, which differs for each language.

I ended up imagining a structure spec for such data source and expression definitions.

The example below is in YAML. Please note that YAML is not a requirement: it's a simple data structure that can be expressed in many languages. YAML is one of the easiest for non-IT people, I guess, but JSON or PHP arrays would do the trick as well.

Without further ado, here it is:

fr-fr:
    # A datasource is basically an item list from which data will be
    # pulled randomly.
    car_type:
        data: [sedan, hatchback, limousine, SUV, crossover]

    # Moreover, the data can be in an additional file, relative
    # to this file path, which is the recommended way to ease
    # maintenance.
    galaxy_name:
        # File can be either:
        #   - If JSON, it must contain a single array `["foo","bar"]`
        #   - If TXT, it must contain one item per line.
        data: "./resources/galaxy-name.txt"

    # Additionally, a datasource can hold one or more "expressions".
    # If so, data is ignored, but expressions are selected randomly
    # for each generated line, and executed.
    # An expression is a very simple string, containing:
    #  - "{{foo}}" where "foo" is a datasource name: randomly takes
    #    one value out of the given datasource. It can refer to itself,
    #    in which case data will be fetched from the "data" list.
    #  - "[-12;+123]" will pick a random integer between the bounds.
    #  - "(foo|bar)" will randomly output either "foo" or "bar".
    #  - Everything else is just raw text.
    # Simple example:
    firstname:
        data: [Jules, Nolwenn, Roberto, Jessica]
        expressions:
            - "{{firstname}}"
            - "{{firstname}}-{{firstname}}"

    # Datasources may consist of expressions only, without
    # the data list, in which case they must pull data from other
    # datasources:
    single_lastname:
        data: "./resources/lastname.txt"
    lastname:
        expressions:
            - "{{single_lastname}}"
            - "{{single_lastname}}-{{single_lastname}}"
            - "{{single_lastname}} {{single_lastname}}"

    # Multicolumn datasources can be "fixed", in which case each item will be
    # returned as-is, without modification. This is useful when working
    # with data that must remain consistent.
    # You must specify the "type: fixed" property when doing this,
    # otherwise columns will be mixed up when generating samples.
    # Column names are mandatory.
    # In this case, data file structure is:
    #   - If JSON, it must contain a single array `[{"a": "b"},{"a":"d"}]`
    #   - If TXT or CSV, it is parsed as CSV, using commas as separators.
    # In all cases, multicolumn datasources can use any number of "invariant"
    # columns (fixed value for all items generated by the sample).
    # Fixed datasources cannot hold expressions.
    address_fixed:
        columns: [street, city, country]
        type: fixed
        invariant:
            country: France
        data:
            - {street: 'rue des fleurs', city: 'Donaldville'}
            - {street: 'rue de la Bastille', city: 'Paris'}

    # For the sake of the example, we choose to split an address into a set
    # of datasources whose purpose is to help generate more random
    # addresses.
    address_street_type:
        data: [rue, avenue, impasse]
    address_street:
        data: ["des fleurs", "de l'espoir", "des marronniers"]
        expressions:
            - "[1;128] {{address_street_type}} {{address_street}}"
    address_city:
        data: [Nantes, Paris, Lyon]

    # Multicolumn datasources are "mixed" by default, which can be
    # expressed using the "type: mixed" property, or by omitting
    # the entry.
    # In this case, they must use expressions to randomly fetch data
    # from other datasources.
    address:
        columns: [street, city, country]
        invariant:
            country: France
        expressions:
            street: "{{address_street}}"
            city: "{{address_city}}"
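
To make the expression rules above more concrete, here is a minimal sketch of how such a pack could be evaluated. It is written in Python purely for illustration (the actual implementation would be PHP), and every name in it (`PACK`, `pick`, `render`, `row`) is hypothetical. It assumes the pack has already been parsed into nested dictionaries, that integer ranges use the `[min;max]` form, and that a self reference reads the raw `data` list.

```python
import random
import re

# Hypothetical in-memory form of a pack (datasource name -> definition),
# as it could look once the definition file has been parsed.
PACK = {
    "firstname": {
        "data": ["Jules", "Nolwenn", "Roberto", "Jessica"],
        "expressions": ["{{firstname}}", "{{firstname}}-{{firstname}}"],
    },
    "address_street_type": {"data": ["rue", "avenue", "impasse"]},
    "address_street": {
        "data": ["des fleurs", "de l'espoir"],
        "expressions": ["[1;128] {{address_street_type}} {{address_street}}"],
    },
    "address_city": {"data": ["Nantes", "Paris", "Lyon"]},
    "address": {
        "columns": ["street", "city", "country"],
        "invariant": {"country": "France"},
        "expressions": {"street": "{{address_street}}", "city": "{{address_city}}"},
    },
    "address_fixed": {
        "columns": ["street", "city", "country"],
        "type": "fixed",
        "invariant": {"country": "France"},
        "data": [{"street": "rue de la Bastille", "city": "Paris"}],
    },
}

TOKEN = re.compile(
    r"\{\{(?P<ref>\w+)\}\}"                    # {{datasource}}
    r"|\[(?P<lo>[+-]?\d+);(?P<hi>[+-]?\d+)\]"  # [min;max]
    r"|\((?P<alt>[^()|]*(?:\|[^()|]*)+)\)"     # (foo|bar)
)


def pick(pack, name, rng, current=None):
    """Return one random value from the single-column datasource `name`."""
    source = pack[name]
    expressions = source.get("expressions")
    # A datasource referring to itself reads its raw "data" list,
    # which also prevents infinite recursion.
    if expressions and name != current:
        return render(pack, rng.choice(expressions), rng, current=name)
    return rng.choice(source["data"])


def render(pack, expression, rng, current=None):
    """Replace every token of `expression`; anything else is raw text."""
    def replace(match):
        if match["ref"]:
            return pick(pack, match["ref"], rng, current)
        if match["lo"] is not None:
            return str(rng.randint(int(match["lo"]), int(match["hi"])))
        return rng.choice(match["alt"].split("|"))
    return TOKEN.sub(replace, expression)


def row(pack, name, rng):
    """Generate one multicolumn item as a dict keyed by column name."""
    source = pack[name]
    if source.get("type", "mixed") == "fixed":
        values = dict(rng.choice(source["data"]))  # item kept as-is
    else:
        values = {column: render(pack, expression, rng, current=name)
                  for column, expression in source["expressions"].items()}
    values.update(source.get("invariant", {}))
    return {column: values.get(column) for column in source["columns"]}
```

With this sketch, `pick(PACK, "firstname", random.Random())` yields either a single or a hyphenated first name, and `row(PACK, "address", random.Random())` yields a dictionary with the `street`, `city` and `country` columns. Circular references between two different datasources are not detected here.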

This very simple spec could allow great flexibility for creating new packs and ease maintenance. Moreover, it would allow non-IT people, or people unfamiliar with PHP, to contribute to existing data packs or create their own.

This example covers a lot of the existing fr-fr pack. To create a new pack, a user should provide:

- A datasource definition file (such as the example above), which can be:
    - YAML: the example above,
    - JSON: same structure, but JSON,
    - PHP: using the builder pattern, we should provide an API for datasource definition.
- Raw primary data for the datasource, as text files.
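
As an illustration of how the raw data files could be resolved, here is a small sketch following the file rules given in the spec comments (JSON holding a single array, TXT with one item per line, CSV with comma separators for multicolumn datasources). It is Python for illustration only, `load_data` is a hypothetical name, and it assumes that blank lines are skipped and that CSV files have no header row, since column names come from the `columns` entry.

```python
import csv
import json
from pathlib import Path


def load_data(data, base_dir, columns=None):
    """Resolve a `data` entry: an inline list, or a file path relative
    to the directory holding the definition file (`base_dir`)."""
    if isinstance(data, list):
        return data
    path = Path(base_dir) / data
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        items = json.loads(text)
        if not isinstance(items, list):
            raise ValueError(f"{path}: expected a single JSON array")
        return items
    # Assumption: empty lines carry no item and are skipped.
    lines = [line for line in text.splitlines() if line.strip()]
    if columns is None:
        # Single-column datasource: one item per line.
        return [line.strip() for line in lines]
    # Multicolumn datasource: each CSV row is mapped onto the column names.
    return [dict(zip(columns, values)) for values in csv.reader(lines)]
```

For instance, `load_data("./resources/galaxy-name.txt", base_dir)` would return the list of lines of that file, while passing `columns=["street", "city"]` with a CSV file would return one dictionary per row.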

And that's it.

For distribution:

- If using Composer, add "type": "db-tools-bundle-pack" to the composer.json file, and use a fixed name for the datasource index file, such as anonymizer-datasource.[yaml|json|php].
- Otherwise, search for those files using the existing anonymizer_paths configuration variable.
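
The path-based lookup could be sketched as follows. This is Python for illustration only and `find_index_files` is a hypothetical name. It assumes the index file name suggested above and a recursive search of each configured directory, which the proposal does not specify.

```python
from pathlib import Path

# Fixed index file name suggested above, with the three proposed formats.
INDEX_NAME = "anonymizer-datasource"
INDEX_EXTENSIONS = (".yaml", ".json", ".php")


def find_index_files(paths):
    """Return every datasource index file found under the given directories."""
    found = []
    for root in paths:
        # Assumption: directories are searched recursively.
        for candidate in sorted(Path(root).rglob(INDEX_NAME + ".*")):
            if candidate.is_file() and candidate.suffix in INDEX_EXTENSIONS:
                found.append(candidate)
    return found
```

The list of directories passed in would come from the anonymizer_paths configuration variable.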


pounard added a commit that referenced this issue Nov 27, 2024

feature #200 - data generator without php code prototype

7dbc7ac
