Skip to content

Latest commit

 

History

History
475 lines (349 loc) · 19 KB

README.md

File metadata and controls

475 lines (349 loc) · 19 KB

Flattering

 

Flatteting

Flattering is the tool to flatten, format, and export any JSON-like data to CSV (or any other output), no matter how complex or mixed the data is.

So, items like this:

{
    "name": "Product",
    "offers": [{"price": "154.95", "currency": "$"}],
    "sku": 9204,
    "images": [
        "https://m.site.com/i/9204_1.jpg",
        "https://m.site.com/i/9204_2.jpg",
        "https://m.site.com/i/9204_3.jpg"
    ],
    "description": "Custom description\non multiple lines.",
    "additionalProperty": [
        {"name": "size", "value": "XL"}, {"name": "color", "value": "blue"}
    ],
    "aggregateRating": {"ratingValue": 5.0, "reviewCount": 3}
}

will look like this:

Name Price Currency Sku Images Description              AdditionalProperty RatingValue ReviewCount
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg
https://m.site.com/i/9204_2.jpg
https://m.site.com/i/9204_3.jpg
Custom description
on multiple lines.
size: XL
color:blue
5 3

 

Contents

 

Quickstart

Flattering consists of two elements:

  • StatsCollector, to understand how many columns are required, what headers they'll have, and what data is mixed/invalid (to skip or stringify).
  • Exporter, to format and beatify the data, fit it in columns, and export it (as .csv or flat data).
item_list = [{"some_field": "some_value", "another_field": [1, 2, 3]}]
sc = StatsCollector()
sc.process_items(item_list)
exporter = Exporter(sc.stats["stats"], sc.stats["invalid_properties"])
exporter.export_csv_full(item_list, "example.csv")

You could use both parts on the same side or separately. For example, collect stats during a running job, and then provide them (tiny JSON with numbers) to the backend when a user wants to export the data.

Also, stats and items could be processed one by one (use append=True to append rows, if needed):

item_list = [{"some_field": "some_value", "another_field": [1, 2, 3]}]
sc = StatsCollector()
[sc.process_object(x) for x in item_list]
exporter = Exporter(sc.stats["stats"], sc.stats["invalid_properties"])
exporter.export_csv_headers("example.csv")
for item in item_list:
    exporter.export_csv_row(item, "example.csv", append=True)

When you provide the filename, the file will be opened to write/append automatically. If you want to open the file manually or write to any other form of StringIO, TextIO, etc. - check the 8. Export data section.

CLI

Plus, you can use the tool through CLI:

flattering --path="example.json" --outpath="example.csv"

CLI supports all the same parameters, you can get a complete list using the flattering -h command.

 

What you can do

1. Flatten data

Let's pick an initial item to explain what parameters and formatting options do.

{
    "name": "Product",
    "offers": [{"price": "154.95", "currency": "$"}],
    "sku": 9204,
    "images": [
        "https://m.site.com/i/9204_1.jpg",
        "https://m.site.com/i/9204_2.jpg",
        "https://m.site.com/i/9204_3.jpg"
    ],
    "description": "Custom description\non multiple lines.",
    "additionalProperty": [
        {"name": "size", "value": "XL"}, {"name": "color", "value": "blue"}
    ],
    "aggregateRating": {"ratingValue": 5.0, "reviewCount": 3}
}

If you don't provide any custom options:

item_list = [item]
sc = StatsCollector()
sc.process_items(item_list)
exporter = Exporter(sc.stats["stats"], sc.stats["invalid_properties"])
exporter.export_csv_full(item_list, "example.csv")

the export will look like this:

name offers0->price offers0->currency sku images0 images1 images2 description additionalProperty0->name additionalProperty0->value additionalProperty1->name additionalProperty1->value aggregateRating->ratingValue aggregateRating->reviewCount
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg https://m.site.com/i/9204_2.jpg https://m.site.com/i/9204_3.jpg Custom description
on multiple lines.
size XL color blue 5.0 3

 

2. Rename columns

Let's make it a bit more readable with headers_renaming:

renaming = [
    (r"^offers\[0\]->", ""),
    (r"^aggregateRating->", ""),
    (r"^additionalProperty->(.*)->value", r"\1")
]
exporter = Exporter(
    sc.stats["stats"],
    sc.stats["invalid_properties"],
    headers_renaming=renaming)
Name Price Currency Sku Images[0] Images[1] Images[2] Description AdditionalProperty[0]->name AdditionalProperty[0]->value AdditionalProperty[1]->name AdditionalProperty[1]->value RatingValue ReviewCount
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg https://m.site.com/i/9204_2.jpg https://m.site.com/i/9204_3.jpg Custom description
on multiple lines.
size XL color blue 5.0 3

 

3. Format data

Better, but images take too much place. Let's group them in a single cell, using the name of the field and field_options. Fields could be grouped (all data in a single cell), named (create columns based on an object property), or both.

options = {"images": {"named": False, "grouped": True}}
exporter = Exporter(
    sc.stats["stats"],
    sc.stats["invalid_properties"],
    headers_renaming=renaming,
    field_options=options)
Name Price Currency Sku Images Description AdditionalProperty[0]->name AdditionalProperty[0]->value AdditionalProperty[1]->name AdditionalProperty[1]->value RatingValue ReviewCount
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg
https://m.site.com/i/9204_2.jpg
https://m.site.com/i/9204_3.jpg
Custom description
on multiple lines.
size XL color blue 5.0 3

 

Looks even better, but we still have a lot of additionalProperty columns. Let's make them named, by using name property as the name of the column to make it better:

options = {
    "images": {"named": False, "grouped": True},
    "additionalProperty": {
        "named": True, "name": "name", "grouped": False
    }
}
Name Price Currency Sku Images Description Size Color RatingValue ReviewCount
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg
https://m.site.com/i/9204_2.jpg
https://m.site.com/i/9204_3.jpg
Custom description
on multiple lines.
XL blue 5.0 3

 

Now we have a column with a value for each additionalProperty. But if you don't need separate columns for that, you can go even further and format them as both named and grouped:

"additionalProperty": {
    "named": True, "name": "name", "grouped": True
}
Name Price Currency Sku Images Description AdditionalProperty RatingValue ReviewCount
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg
https://m.site.com/i/9204_2.jpg
https://m.site.com/i/9204_3.jpg
Custom description
on multiple lines.
size: XL
color: blue
5.0 3

 

4. Filter columns

Also, let's assume we don't really need ratingValue and reviewCount in this export, so we want to filter them with headers_filters:

filters = [r".*ratingValue.*", ".*reviewCount.*"]
exporter = Exporter(
    sc.stats["stats"],
    sc.stats["invalid_properties"],
    headers_renaming=renaming,
    headers_filters=filters,
    field_options=options
)

It's important to remember that filters are regular expressions and work with the initial headers, so we're replacing aggregateRating->ratingValue and aggregateRating->reviewCount here.

Name Price Currency Sku Images Description AdditionalProperty
Product 154.95 $ 9204 https://m.site.com/i/9204_1.jpg
https://m.site.com/i/9204_2.jpg
https://m.site.com/i/9204_3.jpg
Custom description
on multiple lines.
size: XL
color: blue

 

5. Order columns

And, to add a final touch, let's reorder the headers with headers_order. For example, I want Name and Sku as the first two columns:

order = ["name", "sku"]
exporter = Exporter(
    sc.stats["stats"],
    sc.stats["invalid_properties"],
    headers_renaming=renaming,
    headers_filters=filters,
    headers_order=order,
    field_options=options
)

All headers present in the headers_order list will be ordered, and other headers will be provided in the natural order they appear in your data. Also, we're sorting initial headers, so using name and sku in lowercase.

Name Sku Price Currency Images Description AdditionalProperty
Product 9204 154.95 $ https://m.site.com/i/9204_1.jpg
https://m.site.com/i/9204_2.jpg
https://m.site.com/i/9204_3.jpg
Custom description
on multiple lines.
size: XL
color: blue

 

6. Process invalid data

If your input has mixed types or invalid data, it could be hard to flatten it properly. So, you can decide - either skip such columns or stringify them.

For example, here the property changed type from dict to list:

item_list = [
    {"a": "a_1", "b": {"c": "c_1"}},
    {"a": "a_2", "b": [1, 2, 3]}
]
sc = StatsCollector()
sc.process_items(item_list)
exporter = Exporter(sc.stats["stats"], sc.stats["invalid_properties"])
exporter.export_csv_full(item_list, "example.csv")

By default, invalid properties would be stringified, so you'll get:

a b
a_1 {'c': 'c_1'}
a_2 some_value

 

But if you want to skip them, you could set stringify_invalid parameter to False. It works at all level of nesting, and will affect only the invalid property, so items like this:

item_list = [
    {"a": "a_1", "b": {"c": "c_1", "b": "b_1"}},
    {"a": "a_1", "b": {"c": "c_2", "b": [1, 2, 3]}},
]
sc = StatsCollector()
sc.process_items(item_list)
exporter = Exporter(
    sc.stats["stats"],
    sc.stats["invalid_properties"],
    stringify_invalid=False
)
exporter.export_csv_full(item_list, "example.csv")

Will export like this:

a b->c
a_1 c_1
a_1 c_2

 

7. Process complex data

Following the nesting, you can export and format data with any amount of nested levels. So, let's create a bit unrealistic item with multiple levels, arrays of arrays, and so on:

{
    "a": {
        "nested_a": [[
            {
                "2x_nested_a": {
                    "3x_nested_a": [
                        {"name": "parameter1", "value": "value1"},
                        {"name": "parameter2", "value": "value2"},
                    ]
                }
            },
        ]],
        "second_nested_a": "some_value",
    }
}

If we try to flatten it as is, it will work. However, headers will be a bit questionable, so let's show it as a code:

[
    "a->nested_a[0][0]->2x_nested_a->3x_nested_a[0]->name",
    "a->nested_a[0][0]->2x_nested_a->3x_nested_a[0]->value",
    "a->nested_a[0][0]->2x_nested_a->3x_nested_a[1]->name",
    "a->nested_a[0][0]->2x_nested_a->3x_nested_a[1]->value",
    "a->second_nested_a",
]
["parameter1", "value1", "parameter2", "value2", "some_value"]

But the best part is that we can format data (grouped, named) on any level, so with a bit of field_options magic:

"a->nested_a[0][0]->2x_nested_a->3x_nested_a": {
    "named": True, "name": "name", "grouped": True
}

It will look like this:

a->nested_a[0][0]->2x_nested_a->3x_nested_a a->second_nested_a
parameter1: value1
parameter2: value2
some_value

 

8. Export data

By default, all the data is exported to .csv, either in one go:

exporter = Exporter(sc.stats["stats"], sc.stats["invalid_properties"])
exporter.export_csv_full(item_list, "example.csv")

or one-by-one:

exporter.export_csv_headers("example.csv")
[exporter.export_csv_row(x, "example.csv", append=True) for x in item_list]

Also, you could use any writable input, like TextIO, StringIO, and so on, so all of the examples below will work:

# StringIO
buffer = io.StringIO()
exporter.export_csv_full(item_list, buffer)

# File objects
with open("example.csv", "w") as f:
    exporter.export_csv_full(item_list, f)

# Path-like objects
filename = tmpdir.join("example")
exporter.export_csv_full(item_list, filename)

We plan to support other formats, but for now you could also get flattened items one by one trough export_item_as_row method and write them wherever you want:

# [{"property_1": "value", "property_2": {"nested_property": [1, 2, 3]}}]
flattened_items = [exporter.export_item_as_row(x) for x in item_list]
# [['value', '1', '2', '3']]

 

Arguments

StatsCollector

  • named_columns_limit int(default=50)

    How many named columns could be created for a single field. For example, you have a set of objects like {"name": "color", "value": "blue"}. If you decide to create a separate column for each name ("color", "size", etc.), the limit defines how much data would be collected to make it work. If the limit is hit (too many columns) - no named columns would be created in export. It's required to control memory usage and data size during stats collection (no need to collect stats for 1000 columns if you don't plan to have 1000 columns anyway).

  • cut_separator str(default="->")

    Separator to organize values from items to required columns. Used instead of default "." separator. If your properties' names include the separator - replace it with a custom one.

 

Exporter

  • stats Dict[str, Header]

    Item stats collected by StatsCollector (stats_collector.stats["stats"]).

  • invalid_properties Dict[str, str]

    Invalid properties data provided by StatsCollector (stats_collector.stats["invalid_properties"])

  • stringify_invalid bool(default=True)

    If True - columns with invalid data would be stringified. If False - columns with invalid data would be skipped

  • field_options Dict[str, FieldOption]

    Field options to format data.

    • Options could be named (named=True, name="property_name"), so the exporter will try to create columns based on the values of the property provided in the "name" attribute.
    • Options could be grouped (grouped=True), so the exporter will try to fit all the data for this field into a single cell.
    • Options could be both named and grouped, so the exporter will try to get data collected for each named property and fit all this data in a single field.
  • array_limits Dict[str, int]

    Limit for the array fields to export only first N elements ({"images": 1}).

  • headers_renaming List[Tuple[str, str]]

    Set of RegExp rules to rename existing item columns ([".*_price", "regularPrice"]). The first value is the pattern to replace, while the second one is the replacement.

  • headers_order List[str]

    List to sort columns headers. All headers that are present both in this list and actual data - would be sorted. All other headers would be appended in a natural order. Headers should be provided in the form before renaming ("offers[0]->price", not "Price").

  • headers_filters List[str]

    List of RegExp statements to filter columns. Headers that match any of these statements would be skipped (["name.*", "_key"]).

  • grouped_separator str

    Separator to divide values when grouping data in a single cell (if grouped=True).

  • cut_separator str(default="->")

    Separator to organize values from items to required columns. Used instead of default "." separator. If your properties' names include the separator - replace it with a custom one.

  • capitalize_headers bool(default=False)

    Capitalize fist letter of CSV headers when exporting.

 

Requirements

  • Python 3.7+
  • Works on Linux, Windows, macOS, BSD