-
Notifications
You must be signed in to change notification settings - Fork 4.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve gitloader component #5351
base: main
Are you sure you want to change the base?
Conversation
Copilot
AI
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/backend/base/langflow/components/git/git.py:93
- The 'show' attribute for 'repo_path' and 'clone_url' should be set to False only if they are not None to avoid potential KeyError.
build_config["repo_path"]["show"] = False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
src/backend/base/langflow/components/git/git.py:97
- [nitpick] The variable name temp_dir could be more descriptive. Consider renaming it to temporary_directory.
temp_dir = tempfile.mkdtemp(prefix="langflow_clone_")
…rations and improve temporary directory management - Convert methods to async to enhance performance with file operations. - Implement async context manager for handling temporary clone directories. - Update binary file check and content filtering to be asynchronous.
…and binary check - Refactor file filtering logic to utilize fnmatch for pattern matching. - Introduce a new method to check for binary files based on null byte detection. - Update content filtering to handle exceptions more gracefully. - Modify temporary directory cleanup to use rmdir instead of remove for better directory management. - Adjust load_documents method to utilize asyncio.to_thread for lazy loading of documents.
…ring and binary check - Refactor binary file check to be synchronous and handle exceptions more gracefully. - Introduce new methods for checking file patterns and content patterns, allowing for more flexible filtering. - Consolidate filtering logic into a single method for better maintainability. - Update load_documents method to run lazy loading in a separate thread for improved performance.
… binary detection - Introduced new test suite for GitLoaderComponent, including tests for binary file detection and file pattern matching. - Implemented temporary file creation for testing various file types and permissions. - Added tests for combined filtering functionality, ensuring robust handling of file and content patterns.
…itLoaderComponent - Refactored pattern handling to use a more descriptive variable name `pattern_list` for clarity. - Enhanced content filtering by ensuring proper encoding when reading file content. - Updated regex validation to include a test string check for better error handling. - Removed unnecessary comments to streamline the code and improve readability.
8680c78
to
ffbd2ea
Compare
@@ -1,33 +1,50 @@ | |||
import re | |||
import tempfile | |||
from contextlib import asynccontextmanager | |||
from fnmatch import fnmatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from fnmatch import fnmatch | |
from fnmatch import fnmatchcase | |
from functools import lru_cache |
except Exception: # noqa: BLE001 | ||
return True | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@staticmethod | |
@staticmethod | |
@lru_cache(maxsize=128) |
pattern_list: list[str] = [pattern.strip() for pattern in patterns.split(",") if pattern.strip()] | ||
|
||
# If no valid patterns after stripping, treat as include all | ||
if not pattern_list: | ||
return True | ||
|
||
# Process exclusion patterns first | ||
for pattern in pattern_list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern_list: list[str] = [pattern.strip() for pattern in patterns.split(",") if pattern.strip()] | |
# If no valid patterns after stripping, treat as include all | |
if not pattern_list: | |
return True | |
# Process exclusion patterns first | |
for pattern in pattern_list: | |
pattern_list = [pattern.strip() for pattern in patterns.split(",") if pattern.strip()] | |
# Separate inclusion and exclusion patterns | |
include_patterns = set() | |
exclude_patterns = set() | |
exclude_patterns.add(exclude_pattern) | |
else: | |
include_patterns.add(pattern) | |
# Process exclusion patterns first | |
for exclude_pattern in exclude_patterns: | |
if fnmatchcase(path_str, exclude_pattern) or fnmatchcase(file_name, exclude_pattern): | |
return False | |
return any(fnmatchcase(path_str, pattern) or fnmatchcase(file_name, pattern) for pattern in include_patterns) |
⚡️ Codeflash found optimizations for this PR📄 2,449% (24.49x) speedup for
|
Test | Status |
---|---|
⚙️ Existing Unit Tests | 🔘 None Found |
🌀 Generated Regression Tests | ✅ 71 Passed |
⏪ Replay Tests | 🔘 None Found |
🔎 Concolic Coverage Tests | 🔘 None Found |
📊 Tests Coverage | undefined |
🌀 Generated Regression Tests Details
from fnmatch import fnmatch
from pathlib import Path
# imports
import pytest # used for our unit tests
from langflow.components.git.git import GitLoaderComponent
from langflow.custom import Component
# unit tests
def test_basic_functionality():
# Single Inclusion Pattern
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.py")
# Single Exclusion Pattern
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.py")
def test_multiple_patterns():
# Multiple Inclusion Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.txt,*.md")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.py,*.js")
# Multiple Exclusion Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.txt,!*.md")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.py,!*.js")
# Mixed Inclusion and Exclusion Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.txt,!test.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.py,!folder/test.py")
def test_edge_cases():
# Empty Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", " ")
# No Valid Patterns After Stripping
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", ",, ")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", " , , ")
# Patterns with Only Exclusions
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.md,!*.py")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.md,!*.txt")
def test_path_and_filename_matching():
# Match Full Path
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "folder/*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/subfolder/test.py", "folder/**/*.py")
# Match Filename Only
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "test.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/subfolder/test.py", "test.py")
def test_complex_patterns():
# Wildcard Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "*.t?t")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.p*")
# Negated Wildcard Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "!*.t?t")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.p*")
def test_large_scale():
# Large Number of Patterns
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", ",".join(["*.txt"] * 1000))
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", ",".join(["*.py", "!test.py"] * 500))
# Long File Paths
codeflash_output = GitLoaderComponent.check_file_patterns("a/" * 100 + "test.txt", "*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("a/" * 100 + "test.py", "*.py")
def test_special_characters_in_file_paths():
# File Paths with Special Characters
codeflash_output = GitLoaderComponent.check_file_patterns("test[1].txt", "test[1].txt")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test{2}.py", "folder/test{2}.py")
def test_case_sensitivity():
# Case-Sensitive Matching
codeflash_output = GitLoaderComponent.check_file_patterns("Test.TXT", "*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("Folder/TEST.PY", "*.py")
def test_patterns_with_whitespace():
# Patterns with Extra Whitespace
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", " *.txt ")
codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", " *.py ")
# Run the tests
pytest.main()
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from fnmatch import fnmatch
from pathlib import Path
# imports
import pytest # used for our unit tests
from langflow.components.git.git import GitLoaderComponent
# function to test
from langflow.custom import Component
# unit tests
# Basic Inclusion and Exclusion
def test_single_inclusion_pattern():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.py")
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.txt")
def test_single_exclusion_pattern():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "!*.py")
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "!*.txt")
# Multiple Patterns
def test_multiple_inclusion_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.py,*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.py,*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("test.md", "*.py,*.txt")
def test_multiple_exclusion_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "!*.py,!*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.py,!*.txt")
codeflash_output = GitLoaderComponent.check_file_patterns("test.md", "!*.py,!*.txt")
# Mixed Inclusion and Exclusion Patterns
def test_mixed_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.py,!test.py")
codeflash_output = GitLoaderComponent.check_file_patterns("main.py", "*.py,!test.py")
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.py,!test.py")
# Edge Cases
def test_empty_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "")
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " ")
def test_whitespace_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " , ")
def test_no_valid_patterns_after_stripping():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " , , ")
def test_patterns_with_leading_trailing_whitespace():
codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " *.py , *.txt ")
codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", " *.py , *.txt ")
codeflash_output = GitLoaderComponent.check_file_patterns("test.md", " *.py , *.txt ")
# Path vs Filename Matching
def test_full_path_matching():
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "*/src/*.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/docs/test.py", "*/src/*.py")
def test_filename_matching():
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "test.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/docs/test.py", "test.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/docs/main.py", "test.py")
# Complex Patterns
def test_complex_glob_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "**/*.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/subdir/test.py", "**/*.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.txt", "**/*.py")
def test_negated_complex_patterns():
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "**/*.py,!**/test.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/main.py", "**/*.py,!**/test.py")
codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/subdir/test.py", "**/*.py,!**/test.py")
# Large Scale Test Cases
def test_large_number_of_patterns():
patterns = ",".join([f"pattern{i}.py" for i in range(1000)])
codeflash_output = GitLoaderComponent.check_file_patterns("pattern500.py", patterns)
codeflash_output = GitLoaderComponent.check_file_patterns("pattern1000.py", patterns)
def test_large_file_path():
large_path = "/" + "/".join(["dir"] * 1000) + "/test.py"
codeflash_output = GitLoaderComponent.check_file_patterns(large_path, "*.py")
codeflash_output = GitLoaderComponent.check_file_patterns(large_path, "*.txt")
# Additional Considerations
This pull request includes significant updates to the
GitLoaderComponent
class in thesrc/backend/base/langflow/components/git/git.py
file, enhancing its functionality and improving the user interface for selecting repository sources and filtering files.Enhancements to
GitLoaderComponent
:DropdownInput
forrepo_source
, and updated therepo_path
andclone_url
inputs to reflect this selection.file_filter
input'sinfo
attribute to provide clearer examples of how to include and exclude files using patterns.update_build_config
method to dynamically update the visibility and requirements of input fields based on the selected repository source.build_gitloader
method to handle the new repository source selection and ensure proper filtering and cloning behavior.