To make creating anonymizers easier, we have occasionally discussed letting users provide sample data as plain text, JSON or YAML files.
While thinking about this, and having written the Faker integration not long ago, I noticed that we could generate much more data if data sources could declare a set of expressions that mix their own data with data from other data sources. Faker does something similar, but with plain PHP code that differs for each language.
I ended up drafting a structure spec for defining such data sources and expressions.
The spec below is written in YAML. Please note that YAML is not a requirement: this is a simple data structure that can be expressed in many languages. YAML is probably one of the easiest for non-IT people, but JSON or PHP arrays would do the trick as well.
Without further ado, here it is:
fr-fr:
    # A datasource is basically an item list from which data will be
    # pulled randomly.
    car_type:
        data: [sedan, hatchback, limousine, SUV, crossover]

    # Moreover, the data can be in an additional file, relative
    # to this file path, which is the recommended way to ease
    # maintenance.
    galaxy_name:
        # File can be either:
        #  - If JSON, it must contain a single array `["foo","bar"]`
        #  - If TXT, then one line per item.
        data: "./resources/galaxy-name.txt"

    # Additionally, a datasource can hold one or more "expressions".
    # If so, data is ignored, but expressions are selected randomly
    # for each generated line, and executed.
    # An expression is a very simple string, containing:
    #  - "{{foo}}" where "foo" is a datasource name: randomly takes
    #    one value out of the given datasource. It can refer to itself,
    #    in which case data will be fetched from the "data" list.
    #  - "[-12;+123]" will produce a random integer between bounds.
    #  - "(foo|bar)" will randomly output either "foo" or "bar".
    #  - Everything else is just raw text.
    # Simple example:
    firstname:
        data: [Jules, Nolwenn, Roberto, Jessica]
        expressions:
            - "{{firstname}}"
            - "{{firstname}}-{{firstname}}"

    # Datasources may consist of only expressions, without
    # the data list, in which case they must pull data from other
    # datasources:
    lastname:
        data: "./resources/lastname.txt"
        expressions:
            - "{{single_lastname}}"
            - "{{single_lastname}}-{{single_lastname}}"
            - "{{single_lastname}} {{single_lastname}}"

    # Multicolumn datasources can be "fixed", in which case each item will
    # be returned as-is, without modification. This is useful when working
    # with data that must remain consistent.
    # You must specify the "type: fixed" property when doing this,
    # otherwise columns will be mixed up when generating samples.
    # Column names are mandatory.
    # In this case, data file structure is:
    #  - If JSON, it must contain a single array `[{"a": "b"},{"a":"d"}]`
    #  - If TXT or CSV, then it's a CSV, use commas as separators.
    # In all cases, multicolumn datasources can use any number of "invariant"
    # columns (fixed value for all items generated by the sample).
    # Fixed datasources cannot hold expressions.
    address_fixed:
        columns: [street, city, country]
        type: fixed
        invariant:
            country: France
        data:
            - {street: 'rue des fleurs', city: 'Donaldville'}
            - {street: 'rue de la Bastille', city: 'Paris'}

    # For the sake of the example, we split an address into a set
    # of datasources that help generate more random addresses.
    address_street_type:
        data: [rue, avenue, impasse]
    address_street:
        data: ["des fleurs", "de l'espoir", "des marroniers"]
        expressions:
            - "[1;128] {{address_street_type}} {{address_street}}"
    address_city:
        data: [Nantes, Paris, Lyon]

    # Multicolumn datasources are "mixed" by default, which can be
    # expressed using the "type: mixed" property, or by not specifying
    # the entry.
    # In this case, it must use expressions to randomly fetch data
    # from other datasources.
    address:
        columns: [street, city, country]
        invariant:
            country: France
        expressions:
            street: "{{address_street}}"
            city: "{{address_city}}"
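To make the expression mini-language from the comments above more concrete, here is a minimal Python sketch of how a single expression could be expanded. Everything in it is an assumption for illustration: the `render` function, the `TOKEN` pattern and the choice of inclusive integer bounds are hypothetical and not part of any existing API. As a simplification, a `{{name}}` reference always reads the raw data list of the named datasource.

```python
import random
import re

# Hypothetical tokenizer for the three constructs named in the spec comments:
#   {{name}}    -> random value from datasource "name"
#   [min;max]   -> random integer between bounds (assumed inclusive)
#   (foo|bar)   -> random choice among the alternatives
# Anything that does not match is kept as raw text.
TOKEN = re.compile(
    r"\{\{(?P<ref>\w+)\}\}"
    r"|\[(?P<lo>[+-]?\d+);(?P<hi>[+-]?\d+)\]"
    r"|\((?P<alt>[^()]*\|[^()]*)\)"
)


def render(expression, datasources, rng=random):
    """Expand one expression; `datasources` maps a name to its raw data list."""

    def replace(match):
        if match.group("ref") is not None:
            # Simplification: references read the raw "data" list only, so a
            # datasource referring to itself never recurses into expressions.
            return str(rng.choice(datasources[match.group("ref")]))
        if match.group("lo") is not None:
            low, high = int(match.group("lo")), int(match.group("hi"))
            return str(rng.randint(low, high))
        return rng.choice(match.group("alt").split("|"))

    return TOKEN.sub(replace, expression)


if __name__ == "__main__":
    sources = {
        "address_street_type": ["rue", "avenue", "impasse"],
        "address_street": ["des fleurs", "de l'espoir"],
    }
    print(render("[1;128] {{address_street_type}} {{address_street}}", sources))
```

A real implementation would also have to decide what happens when a referenced datasource itself holds expressions; this sketch deliberately leaves that open.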
This very simple spec would allow great flexibility for creating new packs and make them easier to maintain. It would also let non-IT people, or people unfamiliar with PHP, contribute to existing data packs or create their own.
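As an illustration of the difference between "fixed" and "mixed" multicolumn datasources in the spec, the following Python sketch builds one row for each kind. The function names and the `render` callable are hypothetical; the datasource dictionaries simply mirror the YAML structure once parsed.

```python
import random


def fixed_row(datasource, rng=random):
    """Pick one stored item as-is, then add the invariant columns."""
    row = dict(rng.choice(datasource["data"]))
    row.update(datasource.get("invariant", {}))
    return row


def mixed_row(datasource, render):
    """Build each column independently from its own expression.

    `render` is an assumed callable that expands an expression string.
    """
    row = {col: render(expr) for col, expr in datasource["expressions"].items()}
    row.update(datasource.get("invariant", {}))
    return row


if __name__ == "__main__":
    address_fixed = {
        "data": [
            {"street": "rue des fleurs", "city": "Donaldville"},
            {"street": "rue de la Bastille", "city": "Paris"},
        ],
        "invariant": {"country": "France"},
    }
    # Street and city always come from the same stored item.
    print(fixed_row(address_fixed))
```

With a fixed datasource the street and city stay paired, whereas a mixed one may combine any street with any city.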
This example covers a lot of the existing fr-fr pack. To create a new pack, a user should provide:
- A datasource definition file, which can be:
  - YAML: the example above,
  - JSON: same structure, but JSON,
  - PHP: using the builder pattern; we should provide an API for datasource definition.
- Raw primary data for the datasource, as text files.
And that's it.
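For the raw data files, the rules stated in the spec (a JSON file holds a single array, a TXT file holds one item per line, and paths are relative to the definition file) could be sketched as follows. The `load_items` helper is hypothetical, and skipping blank lines in TXT files is my own assumption.

```python
import json
from pathlib import Path


def load_items(definition_file, relative_path):
    """Load a single-column data file referenced from a definition file."""
    # Data paths are resolved relative to the definition file location.
    path = Path(definition_file).parent / relative_path
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() == ".json":
        items = json.loads(text)
        if not isinstance(items, list):
            raise ValueError(f"{path}: expected a single JSON array")
        return items
    # TXT: one item per line; blank lines are skipped (assumption).
    return [line.strip() for line in text.splitlines() if line.strip()]
```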
For distribution:
- If using composer, add "type": "db-tools-bundle-pack" to the composer.json file, and use a fixed name for the datasource index file, such as anonymizer-datasource.[yaml|json|php].
- If not, search for those files using the existing anonymizer_paths configuration variable.
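The second option, looking up index files by their fixed name under configured paths, could be sketched as below. This is only an illustration in Python: `find_index_files` is a hypothetical helper, and the recursive search is an assumption, since the proposal does not say how deep the lookup should go.

```python
from pathlib import Path

# Fixed index file names, following the naming suggested in the proposal.
INDEX_NAMES = (
    "anonymizer-datasource.yaml",
    "anonymizer-datasource.json",
    "anonymizer-datasource.php",
)


def find_index_files(paths):
    """Return every datasource index file found under the given directories."""
    found = []
    for base in paths:
        for name in INDEX_NAMES:
            # Recursive lookup (assumption); sorted for a stable order.
            found.extend(sorted(Path(base).rglob(name)))
    return found
```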