Curate HTML w/ Semantic similarity | JinaAI Embeddings v2 (Small) | Curate HTML to Markdown with JinaAI Embedding Processing for Redundancy Removal #89

Closed
21 commits
17a46dd
Moving Python reconstruction content
Daethyra Nov 26, 2023
4711536
Fixed "sudo" not found error
Daethyra Nov 26, 2023
e82353a
Enhancements for `conv_html_to_markdown`
Daethyra Nov 26, 2023
53aa5df
Corrected paths for Python conversion pipeline
Daethyra Nov 26, 2023
97133fe
Create Python module to structure crawler output
Daethyra Nov 26, 2023
ed5ce91
Merge branch 'convert-html-to-markdown' into main
Daethyra Nov 26, 2023
6316f9a
Merge pull request #1 from Daethyra/main
Daethyra Nov 26, 2023
e798e77
Create semantic similarity branch
Daethyra Nov 26, 2023
197f9df
Fixed semantic similarity functionality. needs testing
Daethyra Nov 27, 2023
69409f8
Entire process works! Results are as expected.
Daethyra Nov 28, 2023
25159d9
Merge branch 'BuilderIO:main' into main
Daethyra Nov 28, 2023
732041a
Merge branch 'BuilderIO:main' into semantic-similarity
Daethyra Nov 28, 2023
619f87a
Merge pull request #2 from Daethyra/semantic-similarity
Daethyra Nov 30, 2023
2e00ed1
Merge branch 'BuilderIO:main' into main
Daethyra Nov 30, 2023
5fd6126
Updating test_conv_html_to_markdown
Daethyra Nov 30, 2023
b45c7cc
Merge branch 'main' of https://github.com/daethyra/gpt-crawler
Daethyra Nov 30, 2023
bf266ef
refactor(data): improve JSON file loading and processing
Daethyra Dec 2, 2023
42d3daf
Initialized daethyra/gpt-crawler as a submodule.
Daethyra Dec 2, 2023
b63c9eb
Merge branch 'BuilderIO:main' into semantic-similarity
Daethyra Dec 8, 2023
b36324f
Merge branch 'semantic-similarity' into main
Daethyra Dec 8, 2023
558c24b
Merge pull request #9 from Daethyra/main
Daethyra Dec 8, 2023
6 changes: 6 additions & 0 deletions .gitignore
@@ -14,4 +14,10 @@ storage

# any output from the crawler
*.json

pnpm-lock.yaml

# Python
__pycache__
venv/
.venv/