chore CORE-4775: remove html page number metadata field #2942

Merged
merged 10 commits into from Apr 30, 2024

@yuming-long yuming-long commented Apr 26, 2024

Summary

Remove the page_number metadata field until we have page counting for all kinds of HTML files (not just news articles with multiple <article> tags).

Test

The unit tests test_add_chunking_strategy_on_partition_html_respects_multipage and test_add_chunking_strategy_title_on_partition_auto_respects_multipage are removed because they rely on the page_number fields from the SEC HTML file. That coverage has moved to a mock test for chunk_by_title; revisit these tests once we find a suitable test file.

Also changed the element ids in the partition outputs for HTML files: the page number feeds into element id hashing, so removing it changes the ids. Todo ticket: update the other deterministic element id tests per crag's comment.
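
To illustrate why dropping the page number changes every id, here is a minimal sketch. It assumes the id is a hash over the element text plus the page number; `toy_element_id` is a hypothetical stand-in and not the library's actual hashing code.

```python
import hashlib
from typing import Optional


def toy_element_id(text: str, page_number: Optional[int] = None) -> str:
    """Hypothetical id function: hash of the text and the page number.

    This is an assumption for illustration only, not the real implementation.
    """
    payload = f"{text}|{page_number}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Same text, with and without a page number in the hash input.
with_page = toy_element_id("Item 1. Business", page_number=1)
without_page = toy_element_id("Item 1. Business")

print(with_page != without_page)  # the id changes once page_number is gone
```

Any fixture that hard-codes ids computed with the old input would need regenerating under this assumption, which matches the fixture updates in this PR.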

"884be260a86bbdf265c248d5fff5ea00",
"0a23b3ae6bd812b3d90e47fec1df9fe0",
"1e9e5be33c99f7bbf2e569b2430e16cf",
"333e32df62a0ec81a8df07d52dd73c99",

it's a bit of a chore to have to update these.
i realized after we merged this pattern that we could just call partition_html twice and check that the element_ids match. that should be a separate PR wherever that pattern is followed, though.
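
A rough sketch of that idea: partition the same input twice and compare the ids, instead of hard-coding hashes in the test. To stay self-contained it uses a hypothetical stand-in partitioner (`toy_partition_html`, `ToyElement`); the helper `ids_are_deterministic` is also an assumption, not something from this repo.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyElement:
    text: str
    id: str


def toy_partition_html(text: str) -> List[ToyElement]:
    """Hypothetical stand-in partitioner: one element per <p> tag.

    The id hashes the element's position and text (an assumption).
    """
    paragraphs = re.findall(r"<p>(.*?)</p>", text, flags=re.DOTALL)
    return [
        ToyElement(
            text=p,
            id=hashlib.sha256(f"{i}:{p}".encode("utf-8")).hexdigest()[:32],
        )
        for i, p in enumerate(paragraphs)
    ]


def ids_are_deterministic(partition: Callable[[str], list], html: str) -> bool:
    """Partition the same input twice and compare the element ids in order."""
    first = [el.id for el in partition(html)]
    second = [el.id for el in partition(html)]
    # Require at least one element so an empty result does not pass trivially.
    return len(first) > 0 and first == second


HTML = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
print(ids_are_deterministic(toy_partition_html, HTML))
```

In a real test the stand-in would presumably be swapped for a thin wrapper such as `lambda html: partition_html(text=html)`, so the test no longer breaks whenever the hash inputs change.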

yuming-long and others added 3 commits April 29, 2024 14:32
…t fixtures update (#2949)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: yuming-long <[email protected]>
@yuming-long yuming-long added this pull request to the merge queue Apr 30, 2024
Merged via the queue into main with commit 542d442 Apr 30, 2024
42 checks passed
@yuming-long yuming-long deleted the yuming/remove_html_page_numer_metadata_field branch April 30, 2024 16:08
3 participants