How to ingest data without duplication allowed? #1877

parselife · 2022-03-31T04:41:58Z

The documentation says:

Data ID: An identifier for the data represented by this row. We do not impose a requirement that Data IDs are globally unique but they should be unique for the adapter. Therefore, the pairing of Internal Adapter ID and Data ID define a unique identifier for a data element. An example of a data ID for vector data would be the feature ID.

According to that, the Adapter ID and Data ID together define a unique identifier. So how can I ingest data without allowing duplicates?

Right now, my index looks like this:

adapter_id   ......   data_id
4            ......   places.12
4            ......   places.12

Why did this happen?

The values of adapter_id and data_id in these two records are the same.

I want a single record with no duplicate. How can I do that?

parselife · 2022-04-01T09:08:30Z

I found the Cassandra table definition:

**primary key (partition, adapter_id, sort, data_id, vis, nano_time, field_mask, value, num_duplicates)**

Is there any way to customize this?

rfecher · 2022-04-01T12:37:24Z

Not sure exactly why you'd want to customize that primary key. You can give it a unique data ID, and other things like the sort and partition keys come from the index (which, again, you could customize but probably don't want to).

The issue is most likely that you are inserting rows into the index with the same adapter ID and data ID but different sort keys. This would happen, for example, if you were using a spatial index and the rows had different geometries (or, similarly, a temporal index with different dates/times). In these rare cases you would want to delete the row prior to ingesting.

The num_duplicates identifier that we tack onto the primary key is a hint that we are intentionally storing duplicates. This can happen in rare circumstances, such as when you store a time range (consider a track that has a start time and an end time) and that time range crosses a periodicity boundary on a temporal index. Because time is unbounded, we place it on the space-filling curve by applying a periodicity, such as a year, which is our default but can be configured. With a year periodicity, if the track started on Dec. 31 and ended on Jan. 1, for example, we have to insert 2 rows, one on each side of the boundary, and we maintain that with the num_duplicates hint.

Hopefully that adds some clarity to your situation. As mentioned, most likely you are inserting a data ID multiple times with different sort keys, such as different geometries within a spatial index, which will require deleting the previous row prior to insertion.
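To make the sort-key explanation concrete, here is a minimal sketch in Python. It is a toy model only: a dict keyed by a simplified version of the primary key stands in for the table, and the function names and sort-key strings (`insert`, `delete_by_data_id`, `upsert`, `"sfc-0001"`) are hypothetical and are not GeoWave APIs. It shows why two writes with the same adapter ID and data ID but different sort keys leave two rows, and how deleting before inserting leaves one.

```python
# Toy model only: a dict keyed by a simplified primary key
# (partition, adapter_id, sort_key, data_id) stands in for the table.
# All names here are hypothetical; none of this is the GeoWave API.


def insert(table, partition, adapter_id, sort_key, data_id, value):
    """Write a row; a row is only overwritten if the WHOLE key matches."""
    table[(partition, adapter_id, sort_key, data_id)] = value


def delete_by_data_id(table, adapter_id, data_id):
    """Remove every row for this adapter ID / data ID pair, whatever its sort key."""
    stale = [k for k in table if k[1] == adapter_id and k[3] == data_id]
    for key in stale:
        del table[key]
    return len(stale)


def upsert(table, partition, adapter_id, sort_key, data_id, value):
    """Delete any previous rows for the data ID, then insert the new one."""
    delete_by_data_id(table, adapter_id, data_id)
    insert(table, partition, adapter_id, sort_key, data_id, value)


# Same adapter ID and data ID, but the geometry changed, so the
# (made-up) sort key derived from it changed as well.
plain = {}
insert(plain, 0, 4, "sfc-0001", "places.12", "POINT(1 1)")
insert(plain, 0, 4, "sfc-0002", "places.12", "POINT(2 2)")
rows_plain = len(plain)

# Deleting the previous row before ingesting keeps a single row.
deduped = {}
upsert(deduped, 0, 4, "sfc-0001", "places.12", "POINT(1 1)")
upsert(deduped, 0, 4, "sfc-0002", "places.12", "POINT(2 2)")
rows_deduped = len(deduped)

print(rows_plain, rows_deduped)
```

In this model the plain inserts end with two rows for `places.12`, while the delete-then-insert path ends with one, holding the latest geometry.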

parselife · 2022-04-24T08:29:52Z

Thanks for your reply. Where can I find the sort keys? In my situation, the data written twice is exactly the same.

rfecher · 2022-04-25T12:35:03Z

Do you have a "ROUND_ROBIN" partition strategy on your index (such as described in the add index help output, https://locationtech.github.io/geowave/latest/userguide.html#help-command)? This partition strategy would, by design, add random partition keys even to identical rows, which would explain the behavior you're seeing.
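As a rough illustration of that effect, the sketch below writes the same row twice into a toy table whose partition key is assigned per write rather than derived from the data. It is only a model: the names (`insert_round_robin`, `NUM_PARTITIONS`, `"sfc-0001"`) are hypothetical and not GeoWave APIs, and the partition is chosen by cycling through a fixed range so the result is reproducible, whereas the comment above describes the keys as random.

```python
from itertools import cycle

# Toy model only: hypothetical names, not the GeoWave API.
NUM_PARTITIONS = 4
next_partition = cycle(range(NUM_PARTITIONS))


def insert_round_robin(table, adapter_id, sort_key, data_id, value):
    """Assign the partition per write, independent of the row's content."""
    partition = next(next_partition)
    table[(partition, adapter_id, sort_key, data_id)] = value


table = {}
# Write an identical row twice: same adapter ID, sort key, data ID and value.
for _ in range(2):
    insert_round_robin(table, 4, "sfc-0001", "places.12", "POINT(1 1)")

row_count = len(table)
distinct_partitions = {key[0] for key in table}
print(row_count, sorted(distinct_partitions))
```

Because the partition is part of the key, the two identical writes land under different keys and both survive, even though every other column matches.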
