chore CORE-4775: remove html page number metadata field #2942

Merged Apr 30, 2024 (10 commits)
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
+## 0.13.6
+
+### Enhancements
+* Remove `page_number` metadata fields for HTML partition until we have a better strategy to decide page counting.
+
+### Features
+
+### Fixes
+
 ## 0.13.5

 ### Enhancements
12 changes: 6 additions & 6 deletions test_unstructured/partition/test_html_partition.py
@@ -733,12 +733,12 @@ def test_all_element_ids_are_unique():
 def test_element_ids_are_deterministic():
     ids = [e.id for e in partition_html("example-docs/fake-html-with-duplicate-elements.html")]
     assert ids == [
-        "cba9e551ed975e0f8a1956095894e92a",
-        "f540ea3b6569aafeb433df6616e79971",
-        "f4a34ee0fac26589fffdb53d0dfedbaf",
-        "15168aeddbd19da60791109a5a45af65",
-        "0c027f66120dd96271489dd0bb69bff5",
-        "abe89090c2e46dda8fff81053cc79f17",
+        "5899179e882d799d869a1d98fe7c7e77",
+        "88e47a42516af650afdcedfe098c3e6c",
+        "884be260a86bbdf265c248d5fff5ea00",
+        "0a23b3ae6bd812b3d90e47fec1df9fe0",
+        "1e9e5be33c99f7bbf2e569b2430e16cf",
+        "333e32df62a0ec81a8df07d52dd73c99",
Review comment (Contributor):

It's a bit of a chore to have to update these. I realized after we merged this pattern that we could just call `partition_html` twice and check that the element IDs are the same. That should be a separate PR wherever this pattern is followed, though.

]
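
The reviewer's suggestion could look roughly like the sketch below. It is an illustration, not code from this PR: the helper names `element_ids` and `assert_ids_are_deterministic` are hypothetical, and the partitioner is passed in as a plain callable so the block does not depend on any particular library version.

```python
from typing import Any, Callable, Iterable


def element_ids(partition: Callable[..., Iterable[Any]], *args: Any, **kwargs: Any) -> list[str]:
    """Run `partition` and collect the `id` attribute of each returned element."""
    return [e.id for e in partition(*args, **kwargs)]


def assert_ids_are_deterministic(partition: Callable[..., Iterable[Any]], *args: Any, **kwargs: Any) -> None:
    """Partition the same input twice and require identical id sequences.

    Hypothetical helper: it replaces a hard-coded list of expected hashes
    with a comparison between two runs over the same input.
    """
    first = element_ids(partition, *args, **kwargs)
    second = element_ids(partition, *args, **kwargs)
    assert first == second, "element ids changed between runs"
```

In the test above this would presumably be called as `assert_ids_are_deterministic(partition_html, "example-docs/fake-html-with-duplicate-elements.html")`. The trade-off is that such a check only verifies repeatability within one run; unlike the hard-coded hashes, it would not flag a change to the ID scheme itself.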


2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.13.5" # pragma: no cover
+__version__ = "0.13.6" # pragma: no cover
9 changes: 8 additions & 1 deletion unstructured/partition/html.py
@@ -142,7 +142,7 @@ def partition_html(
     if skip_headers_and_footers:
         document = filter_footer_and_header(document)

-    return list(
+    elements = list(
         apply_lang_metadata(
             document_to_element_list(
                 document,
@@ -158,6 +158,13 @@ def partition_html(
         ),
     )

+    # Note(yuming): Rip off page_number metadata fields here
+    # until we have a better way to handle page counting for html files
+    for element in elements:
+        if hasattr(element.metadata, "page_number"):
+            element.metadata.page_number = None
+    return elements
+

 def convert_and_partition_html(
     source_format: str,
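
The effect of the loop added to `partition_html` can be shown in isolation. This is a minimal sketch under stated assumptions: `Metadata`, `Element`, and `strip_page_numbers` are hypothetical stand-ins written for this example, not the library's own classes or functions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for an element's metadata object."""
    page_number: Optional[int] = None


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element."""
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def strip_page_numbers(elements: List[Element]) -> List[Element]:
    """Clear page_number on every element, mirroring the loop in the diff."""
    for element in elements:
        if hasattr(element.metadata, "page_number"):
            element.metadata.page_number = None
    return elements


elements = strip_page_numbers([
    Element("Title", Metadata(page_number=1)),
    Element("Body", Metadata(page_number=2)),
])
print([e.metadata.page_number for e in elements])  # prints [None, None]
```

Note that the field is set to `None` rather than deleted, so code that reads `metadata.page_number` keeps working, and that the elements are modified in place before the list is returned.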