Simplify schema modification of child tables #1647

akelad · 2024-07-29T14:50:51Z

Feature description

You should be able to use apply_hints on child tables just like you can on the parent tables. Right now there's no way to modify the schema of child tables without explicitly doing so in a schema.yml

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

When child tables are generated all of the field types are text . This makes them essentially unusable for analysis/aggregation without further transformations. I also can't partition them.

Slack conversation: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1721944874168369

Proposed solution

No response

Related issues

No response

The text was updated successfully, but these errors were encountered:

sh-rp · 2024-07-29T15:05:36Z

What might work is to add a path parameter of type List[str] to apply_table_hints which would enable support for setting hints on subtables. The DltResource (or rather the DltResourceHints class) would have to be able to hold multiple TTableSchemas which then will be merged into the full schema at the end of the extraction or during normalization.

Example:

@dlt.resource(table_name="items")
def my_resource():
   yield [{
      "id": 1,
      "sub_items": [
         {"id": 11}, 
         {"id": 12}
      ]
      }]
   
# applies primary key hint on main table
my_resource.apply_hints(primary_key="id")

# applies primary key hin on sub_items table
my_resource.apply_hints(path=["sub_items"], primary_key="id")

# ever further nested subtable
my_resource.apply_hints(path=["sub_items", "subsub_items"], primary_key="id")

rudolfix · 2024-09-18T14:58:24Z

Interface proposal (based on path idea above).

extend the ResourceHints to include nested_hints dictionary that represent nested hints. Note that DltResource derives from hints so what I show below

resource = my_resource()

resource.nested_hints["sub_items"] = make_hints(primary_key="id")
resource.nested_hints["sub_items"]["subsub_items"] = make_hints(primary_key="id")

# works like filesystem.walk, where path is a tuple ie ("sub_items", "subsub_items") that may be converted into path
# the operation is recursive
for path, hint in resource.nested_hints.items():
   ...

extend the @dlt.resource to accept nested_hints ie as dictionary {path: hints} or as ResourceHints instance

Some implementation details:

There's just one resource here - the root one. All the others are hints that share data with the root. Mind that only root table(s) created may have resource name attached (again: there's just one resource name here)
There are a few places where we compute_table_schema. Now we need to do this but also we need to walk over nested hints and generate a table chain that we'll later add to schema
Mind that table schemas are dynamically created. And nested tables must know the name of parent table. You have two options here: or you store the parent hint and compute the parent table name recursively or you pass the previously generated parent table schema to compute_table_schema (where I think the latter is way more efficient)
You still may allow users to specify table_name on the nested hint. If you do so, you'll need to modify the normalizer so it maps paths to those names. IMO this is for another ticket and bigger overhaul of the schema
prevent following to be set on nested table:

parent_table_name: TTableHintTemplate[str] = None,
write_disposition: TTableHintTemplate[TWriteDispositionConfig] = None,
table_format: TTableHintTemplate[TTableFormat] = None,
incremental: TIncrementalConfig = None,

joscha · 2024-11-26T11:19:03Z

I believe this would also solve the issue described here: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1732199697666119

Background:

I have a data source that has this data in rows:

{   
    "id": 1,
    "links": [
        {
            "link": "https://linkedin.com/in/some-person-1234",
            "text": "Some Person (Co Founder)"
        }
    ],
    "event": {
        "id": 3715858079,
        "type": "meeting",
        "title": "Meeting title",
        "allDay": false,
        "startTime": "2023-05-09T16:00:00Z",
        "endTime": "2023-05-09T17:30:00Z",
        "attendees": [
            {
                "emailAddress": "[email protected]",
                "person": {
                    "id": 147862259,
                    "firstName": "Will",
                    "lastName": "Smith",
                    "primaryEmailAddress": "[email protected]",
                    "type": "external"
                }
            },
            {
                "emailAddress": "[email protected]",
                "person": {
                    "id": 147862258,
                    "firstName": "Joscha",
                    "lastName": "Feth",
                    "primaryEmailAddress": "[email protected]",
                    "type": "internal"
                }
            },
            {
                "emailAddress": "[email protected]",
                "person": null
            }
        ]
    }
}

What I would like to get from this is:

a main table containing the row ID as the primary key
a events table, containing all meetings, using the event.id as the primary key and referencing the row ID as the parent
an attendees table, with the custom fields and a link to an existing person in the persons table that I provide
a links table, containing all links using a synthesized primary key and referencing thw row ID as the parent

E.g. in the above example, I'd end up with:

main table:

ID	...
1	...

links table:

_synth_ID	parent ID	link	text
1234	1	https://linkedin.com/in/some-person-1234	Some Person (Co Founder)

events table:

ID	type	title	all_day	...
3715858079	meeting	Meeting title	FALSE	...

attendees table:

_synth_ID	parent ID	email_address	person_id
1234	3715858079	[email protected]	147862259
4567	3715858079	[email protected]	147862258
8910	3715858079	[email protected]	NULL

persons table:

ID	first_name	last_name	primary_email_address	type
147862259	Will	Smith	[email protected]	external
147862258	Joscha	Feth	[email protected]	internal

Marking nested structures with hints about how to split them would work for this I believe.

akelad added the enhancement New feature or request label Jul 29, 2024

github-project-automation bot added this to dlt core library Jul 29, 2024

github-project-automation bot moved this to Todo in dlt core library Jul 29, 2024

akelad added community This issue came from slack community workspace support This issue is monitored by Solution Engineer labels Jul 29, 2024

rudolfix moved this from Todo to Planned in dlt core library Aug 5, 2024

burnash added this to the 1.0 release milestone Aug 8, 2024

rudolfix self-assigned this Aug 25, 2024

rudolfix added sprint Marks group of tasks with core team focus at this moment and removed community This issue came from slack community workspace labels Aug 25, 2024

rudolfix mentioned this issue Sep 2, 2024

1.0.0 announcement and release notes #1778

Closed

This was referenced Sep 19, 2024

Feature: report missing "parent" in schema yaml #1839

Closed

allows to define hints for nested tables #1855

Draft

rudolfix assigned steinitzu and unassigned rudolfix Dec 3, 2024

rudolfix moved this from Planned to In Progress in dlt core library Dec 3, 2024

joscha mentioned this issue Dec 18, 2024

docs: indent subsection about references in nested tables #2161

Closed

steinitzu linked a pull request Dec 19, 2024 that will close this issue

Apply hints for nested tables #2165

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify schema modification of child tables #1647

Simplify schema modification of child tables #1647

akelad commented Jul 29, 2024

sh-rp commented Jul 29, 2024

rudolfix commented Sep 18, 2024 •

edited

Loading

joscha commented Nov 26, 2024 •

edited

Loading

Simplify schema modification of child tables #1647

Simplify schema modification of child tables #1647

Comments

akelad commented Jul 29, 2024

Feature description

Are you a dlt user?

Use case

Proposed solution

Related issues

sh-rp commented Jul 29, 2024

rudolfix commented Sep 18, 2024 • edited Loading

joscha commented Nov 26, 2024 • edited Loading

rudolfix commented Sep 18, 2024 •

edited

Loading

joscha commented Nov 26, 2024 •

edited

Loading