Curate HTML w/ Semantic similarity | JinaAI Embeddings v2 (Small) | Curate HTML to Markdown with JinaAI Embedding Processing for Redundancy Removal #89
Conversation
- While installing Python, switch to ROOT to avoid installing/using `sudo`
- Switch back before installing `pip` packages to avoid pip warnings
- Added:
  - docstrings
  - granular exception handling (a sketch follows below)
- Ran Black, Flake8, and PyLint against `conv_html_to_markdown`
- Need to change the input file name to BuilderIO's default for consistency
modified: .gitignore
new file: .pylintrc
modified: Dockerfile
renamed: conv_html_to_markdown.py -> src/conv_html_to_markdown.py
new file: tests/test_conv_html_to_markdown.py
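As an illustration of the granular exception handling mentioned above, here is a minimal, hypothetical sketch; the function name and the specific failure modes handled are assumptions, not the module's actual code.

```python
import json
import logging

logger = logging.getLogger(__name__)

def load_chunk(path: str):
    """Load one crawler output file, handling specific failures separately
    rather than with a single blanket `except Exception`.
    """
    try:
        with open(path, encoding="utf-8") as file:
            return json.load(file)
    except FileNotFoundError:
        logger.error("Input file not found: %s", path)
    except json.JSONDecodeError as err:
        logger.error("Malformed JSON in %s: %s", path, err)
    return None
```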
Merge release 1.0.0 changes
- Out of scope for the initial conversion processor (HTML to Markdown)
- Satisfied with the useful:unuseful data ratio
- Float 0.8699 as the threshold for semantic similarity was chosen *intentionally* after extensive, methodical testing
- Room for improvement (see the sketch below):
  - Wrap code blocks, remove all 'copied'
  - Remove language references
  - Remove `<nsource>`
  - Separate sections with '---' or something similar
    - Need a visual representation of each chunk while causing minimal noise
  - Introduce structure to house each class
    - Headers and subheaders, maybe?
    - Could also just use some basic type of Markdown formatting

modified: Dockerfile
modified: src/conv_html_to_markdown.py
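The "room for improvement" items above are still proposals, not merged code; the snippet below is only a rough sketch of how such a cleanup pass might look. The function name and the regular expressions are assumptions for illustration.

```python
import re

def postprocess_markdown(chunks: list[str]) -> str:
    """Rough cleanup pass: strip leftover 'copied' button text and stray
    <nsource> markers, then separate chunks with a horizontal rule."""
    cleaned = []
    for chunk in chunks:
        chunk = chunk.replace("copied", "")        # residue from copy buttons
        chunk = re.sub(r"</?nsource>", "", chunk)  # stray <nsource> tags
        cleaned.append(chunk.strip())
    return "\n\n---\n\n".join(cleaned)
```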
Semantic similarity
- Refactored the load_json function into load_json_files, allowing it to handle multiple JSON files matching a pattern using glob. This change enables aggregation of data from all matched files (a sketch of this pattern follows below). Also updated the main function to reflect the new file-loading process and added explanatory comments for clarity.
modified: config.ts
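Referring to the `load_json_files` refactor described above, here is a minimal sketch of the glob-based aggregation pattern; the exact signature and the example pattern in the usage comment are assumptions, not the function as merged.

```python
import glob
import json

def load_json_files(pattern: str) -> list:
    """Aggregate records from every JSON file matching a glob pattern."""
    aggregated = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as file:
            data = json.load(file)
        # Assumes each crawler output file holds a list of records.
        aggregated.extend(data if isinstance(data, list) else [data])
    return aggregated

# Hypothetical usage: combine every output file produced by the crawler.
# records = load_json_files("output*.json")
```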
Hey @Daethyra 👋! It occurs to me that these changes could be introduced as a separate package on another repo, as something possibly complementary to the gpt-crawler package. I'm missing some instructions on this PR:
Hey @marcelovicentegc, sorry for the late reply. I appreciate you having a look at the PR. In regard to introducing the Python script as a separate package, I leave that to you -- I'm a noob and don't know what's best for production environments or what's sustainable for you and your team. To answer your questions,
Acknowledgements:
Update branch for the sake of BuilderIO's PR BuilderIO#89
Because semantic-similarity is the base of this PR, I merged enhancements from the 'main' branch into 'semantic-similarity'
- Fixes logic errors
Throw out changes to `config.ts`
This pull request introduces significant enhancements to the `conv_html_to_markdown.py` module, integrating JinaAI's embeddings v2 (Small) model for advanced text processing. The primary goal is to refine the conversion of HTML content into Markdown format, with an added focus on removing redundant data using semantic analysis. Below is a detailed overview of the workflow, reasoning, steps taken, and the key features of this updated module.

Workflow and Reasoning:
- We set `trust_remote_code=True` cautiously to allow the execution of the model's custom code, acknowledging the potential risks involved. Passing this value is required for the model to run (see the loading sketch below).
- By enhancing the `conv_html_to_markdown.py` module, we streamline the conversion process and intelligently curate the HTML content, paving the way for more refined and contextually relevant Markdown outputs.
- `conv_html_to_markdown.py` is seamlessly integrated into the root directory's Dockerfile, which has also been updated to install all necessary Python packages for the environment.

Features Covered:
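For context on the `trust_remote_code=True` point above, a minimal loading sketch along the lines of the model card's documented usage; the sample sentences are placeholders, not inputs used by the module.

```python
from transformers import AutoModel

# trust_remote_code=True lets transformers execute the model's custom code,
# which this model requires; enable it only for repositories you trust.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-small-en",
    trust_remote_code=True,
)

# The model's custom code exposes an encode() helper for raw strings.
embeddings = model.encode(["First chunk of page text", "Second chunk"])
print(embeddings.shape)  # the small model yields 512-dimensional vectors
```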
Task List:
Steps Taken:
- Implemented a `process_embeddings` method to handle embeddings in batches, optimizing for large datasets.
- Implemented a `remove_redundant_data` method to filter out semantically similar content using cosine similarity on embeddings (a combined sketch follows below).
- `convert` and `curate_content`.
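A hedged sketch of how batched embedding and cosine-similarity filtering could fit together; the names, signatures, and greedy keep/drop strategy below are illustrative assumptions, not the module's actual implementation.

```python
import numpy as np

def process_embeddings(model, texts: list[str], batch_size: int = 64) -> np.ndarray:
    """Embed texts in batches so large crawls do not exhaust memory."""
    batches = [
        np.asarray(model.encode(texts[i:i + batch_size]))
        for i in range(0, len(texts), batch_size)
    ]
    return np.vstack(batches)

def remove_redundant_data(texts: list[str], embeddings: np.ndarray,
                          threshold: float = 0.8699) -> list[str]:
    """Keep a chunk only if it is not too similar to any chunk kept so far."""
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept_texts: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for text, vec in zip(texts, normed):
        if not kept_vecs or float(np.max(np.stack(kept_vecs) @ vec)) < threshold:
            kept_texts.append(text)
            kept_vecs.append(vec)
    return kept_texts
```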
Future Improvements and Customization:
The module serves as a foundational framework for further customization and enhancement. Future iterations can explore:
This pull request is brought to you by: WELL IT WORKS ON MY MACHINE!