Curate HTML w/ Semantic similarity | JinaAI Embeddings v2 (Small) | Curate HTML to Markdown with JinaAI Embedding Processing for Redundancy Removal #89
Conversation
- While installing Python, switch to ROOT to avoid installing/using `sudo`
- Switch back before installing `pip` packages to avoid pip warnings
- Added:
  - docstrings
  - granular exception handling (a sketch follows below)
- Ran Black, Flake8, and PyLint against `conv_html_to_markdown`
- Need to change the input file name to BuilderIO's default for consistency
modified: .gitignore
new file: .pylintrc
modified: Dockerfile
renamed: conv_html_to_markdown.py -> src/conv_html_to_markdown.py
new file: tests/test_conv_html_to_markdown.py
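As an illustration of the granular exception handling mentioned above, here is a minimal, hypothetical sketch; the function name and the specific failure modes handled are assumptions, not the module's actual code.

```python
import json
import logging

logger = logging.getLogger(__name__)

def load_chunk(path: str):
    """Load one crawler output file, handling specific failures separately
    rather than with a single blanket `except Exception`.
    """
    try:
        with open(path, encoding="utf-8") as file:
            return json.load(file)
    except FileNotFoundError:
        logger.error("Input file not found: %s", path)
    except json.JSONDecodeError as err:
        logger.error("Malformed JSON in %s: %s", path, err)
    return None
```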
Merge release 1.0.0 changes
- Out of scope for the initial conversion processor (HTML to Markdown)
- Satisfied with the useful:unuseful data ratio
- Float 0.8699 as the threshold for semantic similarity was chosen *intentionally* after extensive, methodical testing
- Room for improvement (see the sketch below):
  - Wrap code blocks, remove all 'copied'
  - Remove language references
  - Remove `<nsource>`
  - Separate sections with '---' or something similar
    - Need a visual representation of each chunk while causing minimal noise
  - Introduce structure to house each class
    - Headers and subheaders, maybe?
    - Could also just use some basic type of Markdown formatting

modified: Dockerfile
modified: src/conv_html_to_markdown.py
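The "room for improvement" items above are still proposals, not merged code; the snippet below is only a rough sketch of how such a cleanup pass might look. The function name and the regular expressions are assumptions for illustration.

```python
import re

def postprocess_markdown(chunks: list[str]) -> str:
    """Rough cleanup pass: strip leftover 'copied' button text and stray
    <nsource> markers, then separate chunks with a horizontal rule."""
    cleaned = []
    for chunk in chunks:
        chunk = chunk.replace("copied", "")        # residue from copy buttons
        chunk = re.sub(r"</?nsource>", "", chunk)  # stray <nsource> tags
        cleaned.append(chunk.strip())
    return "\n\n---\n\n".join(cleaned)
```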
Semantic similarity
- Refactored the load_json function into load_json_files, allowing it to handle multiple JSON files matching a pattern using glob. This change enables aggregation of data from all matched files (a sketch of this pattern follows below). Also updated the main function to reflect the new file-loading process and added explanatory comments for clarity.
modified: config.ts
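Referring to the `load_json_files` refactor described above, here is a minimal sketch of the glob-based aggregation pattern; the exact signature and the example pattern in the usage comment are assumptions, not the function as merged.

```python
import glob
import json

def load_json_files(pattern: str) -> list:
    """Aggregate records from every JSON file matching a glob pattern."""
    aggregated = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as file:
            data = json.load(file)
        # Assumes each crawler output file holds a list of records.
        aggregated.extend(data if isinstance(data, list) else [data])
    return aggregated

# Hypothetical usage: combine every output file produced by the crawler.
# records = load_json_files("output*.json")
```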
Hey @Daethyra 👋! It occurs to me that these changes could be introduced as a separate package on another repo, as something possibly complementary to the gpt-crawler package. I'm missing some instructions on this PR:
Hey @marcelovicentegc, sorry for the late reply. I appreciate you having a look at the PR. In regard to introducing the Python script as a separate package, I leave that to you -- I'm a noob and don't know what's best for production environments or what's sustainable for you and your team. To answer your questions,
Acknowledgements:
Update branch for the sake of BuilderIO's PR BuilderIO#89
Because semantic-similarity is the base of this PR, I merged enhancements from the 'main' branch into 'semantic-similarity'
- Fixes logic errors
Throw out changes to `config.ts`
This pull request introduces significant enhancements to the `conv_html_to_markdown.py` module, integrating JinaAI's embeddings v2 (Small) model for advanced text processing. The primary goal is to refine the conversion of HTML content into Markdown format, with an added focus on removing redundant data using semantic analysis. Below is a detailed overview of the workflow, reasoning, steps taken, and the key features of this updated module.

Workflow and Reasoning:
- We set `trust_remote_code=True` cautiously to allow the execution of the model's custom code, acknowledging the potential risks involved. Passing this value is required for the model to run (see the loading sketch below).
- By enhancing the `conv_html_to_markdown.py` module, we streamline the conversion process and intelligently curate the HTML content, paving the way for more refined and contextually relevant Markdown outputs.
- `conv_html_to_markdown.py` is seamlessly integrated into the root directory's Dockerfile, which has also been updated to install all necessary Python packages for the environment.

Features Covered:
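For context on the `trust_remote_code=True` point above, a minimal loading sketch along the lines of the model card's documented usage; the sample sentences are placeholders, not inputs used by the module.

```python
from transformers import AutoModel

# trust_remote_code=True lets transformers execute the model's custom code,
# which this model requires; enable it only for repositories you trust.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-small-en",
    trust_remote_code=True,
)

# The model's custom code exposes an encode() helper for raw strings.
embeddings = model.encode(["First chunk of page text", "Second chunk"])
print(embeddings.shape)  # the small model yields 512-dimensional vectors
```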
Task List:
Steps Taken:
- Implemented a `process_embeddings` method to handle embeddings in batches, optimizing for large datasets.
- Implemented a `remove_redundant_data` method to filter out semantically similar content using cosine similarity on embeddings (a combined sketch follows below).
- `convert` and `curate_content`.
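A hedged sketch of how batched embedding and cosine-similarity filtering could fit together; the names, signatures, and greedy keep/drop strategy below are illustrative assumptions, not the module's actual implementation.

```python
import numpy as np

def process_embeddings(model, texts: list[str], batch_size: int = 64) -> np.ndarray:
    """Embed texts in batches so large crawls do not exhaust memory."""
    batches = [
        np.asarray(model.encode(texts[i:i + batch_size]))
        for i in range(0, len(texts), batch_size)
    ]
    return np.vstack(batches)

def remove_redundant_data(texts: list[str], embeddings: np.ndarray,
                          threshold: float = 0.8699) -> list[str]:
    """Keep a chunk only if it is not too similar to any chunk kept so far."""
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept_texts: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for text, vec in zip(texts, normed):
        if not kept_vecs or float(np.max(np.stack(kept_vecs) @ vec)) < threshold:
            kept_texts.append(text)
            kept_vecs.append(vec)
    return kept_texts
```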
Future Improvements and Customization:
The module serves as a foundational framework for further customization and enhancement. Future iterations can explore:
This pull request is brought to you by: WELL IT WORKS ON MY MACHINE!