Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve gitloader component #5351

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

rodrigosnader
Copy link
Contributor

@rodrigosnader rodrigosnader commented Dec 19, 2024

This pull request includes significant updates to the GitLoaderComponent class in the src/backend/base/langflow/components/git/git.py file, enhancing its functionality and improving the user interface for selecting repository sources and filtering files.

Enhancements to GitLoaderComponent:

  • Added support for selecting between local and remote Git repositories by introducing a new DropdownInput for repo_source, and updated the repo_path and clone_url inputs to reflect this selection.
  • Improved the file_filter input's info attribute to provide clearer examples of how to include and exclude files using patterns.
  • Introduced a temporary directory mechanism for cloning remote repositories, which is cleaned up after loading documents. [1] [2]
  • Added the update_build_config method to dynamically update the visibility and requirements of input fields based on the selected repository source.
  • Modified the build_gitloader method to handle the new repository source selection and ensure proper filtering and cloning behavior.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Dec 19, 2024
@ogabrielluiz ogabrielluiz requested a review from Copilot December 19, 2024 18:28
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/backend/base/langflow/components/git/git.py:93

  • The 'show' attribute for 'repo_path' and 'clone_url' should be set to False only if they are not None to avoid potential KeyError.
build_config["repo_path"]["show"] = False

src/backend/base/langflow/components/git/git.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Dec 19, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 20, 2024
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Dec 20, 2024
@ogabrielluiz ogabrielluiz requested a review from Copilot December 20, 2024 12:23

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

src/backend/base/langflow/components/git/git.py:97

  • [nitpick] The variable name temp_dir could be more descriptive. Consider renaming it to temporary_directory.
temp_dir = tempfile.mkdtemp(prefix="langflow_clone_")
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Dec 20, 2024
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Dec 20, 2024
rodrigosnader and others added 7 commits December 20, 2024 10:46
…rations and improve temporary directory management

- Convert methods to async to enhance performance with file operations.
- Implement async context manager for handling temporary clone directories.
- Update binary file check and content filtering to be asynchronous.
…and binary check

- Refactor file filtering logic to utilize fnmatch for pattern matching.
- Introduce a new method to check for binary files based on null byte detection.
- Update content filtering to handle exceptions more gracefully.
- Modify temporary directory cleanup to use rmdir instead of remove for better directory management.
- Adjust load_documents method to utilize asyncio.to_thread for lazy loading of documents.
…ring and binary check

- Refactor binary file check to be synchronous and handle exceptions more gracefully.
- Introduce new methods for checking file patterns and content patterns, allowing for more flexible filtering.
- Consolidate filtering logic into a single method for better maintainability.
- Update load_documents method to run lazy loading in a separate thread for improved performance.
… binary detection

- Introduced new test suite for GitLoaderComponent, including tests for binary file detection and file pattern matching.
- Implemented temporary file creation for testing various file types and permissions.
- Added tests for combined filtering functionality, ensuring robust handling of file and content patterns.
…itLoaderComponent

- Refactored pattern handling to use a more descriptive variable name `pattern_list` for clarity.
- Enhanced content filtering by ensuring proper encoding when reading file content.
- Updated regex validation to include a test string check for better error handling.
- Removed unnecessary comments to streamline the code and improve readability.
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Dec 20, 2024
@@ -1,33 +1,50 @@
import re
import tempfile
from contextlib import asynccontextmanager
from fnmatch import fnmatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from fnmatch import fnmatch
from fnmatch import fnmatchcase
from functools import lru_cache

except Exception: # noqa: BLE001
return True

@staticmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@staticmethod
@staticmethod
@lru_cache(maxsize=128)

Comment on lines +108 to +115
pattern_list: list[str] = [pattern.strip() for pattern in patterns.split(",") if pattern.strip()]

# If no valid patterns after stripping, treat as include all
if not pattern_list:
return True

# Process exclusion patterns first
for pattern in pattern_list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pattern_list: list[str] = [pattern.strip() for pattern in patterns.split(",") if pattern.strip()]
# If no valid patterns after stripping, treat as include all
if not pattern_list:
return True
# Process exclusion patterns first
for pattern in pattern_list:
pattern_list = [pattern.strip() for pattern in patterns.split(",") if pattern.strip()]
# Separate inclusion and exclusion patterns
include_patterns = set()
exclude_patterns = set()
exclude_patterns.add(exclude_pattern)
else:
include_patterns.add(pattern)
# Process exclusion patterns first
for exclude_pattern in exclude_patterns:
if fnmatchcase(path_str, exclude_pattern) or fnmatchcase(file_name, exclude_pattern):
return False
return any(fnmatchcase(path_str, pattern) or fnmatchcase(file_name, pattern) for pattern in include_patterns)

Copy link
Contributor

codeflash-ai bot commented Dec 20, 2024

⚡️ Codeflash found optimizations for this PR

📄 2,449% (24.49x) speedup for GitLoaderComponent.check_file_patterns in src/backend/base/langflow/components/git/git.py

⏱️ Runtime : 3.47 milliseconds 136 microseconds (best of 161 runs)

📝 Explanation and details
  1. Using Sets: Changed the lists for include and exclude patterns to sets for faster lookups.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 71 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage undefined
🌀 Generated Regression Tests Details
from fnmatch import fnmatch
from pathlib import Path

# imports
import pytest  # used for our unit tests
from langflow.components.git.git import GitLoaderComponent
from langflow.custom import Component


# unit tests
def test_basic_functionality():
    # Single Inclusion Pattern
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.py")

    # Single Exclusion Pattern
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.py")

def test_multiple_patterns():
    # Multiple Inclusion Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.txt,*.md")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.py,*.js")

    # Multiple Exclusion Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.txt,!*.md")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.py,!*.js")

    # Mixed Inclusion and Exclusion Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.txt,!test.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.py,!folder/test.py")

def test_edge_cases():
    # Empty Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", " ")

    # No Valid Patterns After Stripping
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", ",, ")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", " , , ")

    # Patterns with Only Exclusions
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.md,!*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.md,!*.txt")

def test_path_and_filename_matching():
    # Match Full Path
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "folder/*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/subfolder/test.py", "folder/**/*.py")

    # Match Filename Only
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "test.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/subfolder/test.py", "test.py")

def test_complex_patterns():
    # Wildcard Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "*.t?t")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "*.p*")

    # Negated Wildcard Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.txt", "!*.t?t")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", "!*.p*")

def test_large_scale():
    # Large Number of Patterns
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", ",".join(["*.txt"] * 1000))
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", ",".join(["*.py", "!test.py"] * 500))

    # Long File Paths
    codeflash_output = GitLoaderComponent.check_file_patterns("a/" * 100 + "test.txt", "*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("a/" * 100 + "test.py", "*.py")

def test_special_characters_in_file_paths():
    # File Paths with Special Characters
    codeflash_output = GitLoaderComponent.check_file_patterns("test[1].txt", "test[1].txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test{2}.py", "folder/test{2}.py")

def test_case_sensitivity():
    # Case-Sensitive Matching
    codeflash_output = GitLoaderComponent.check_file_patterns("Test.TXT", "*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("Folder/TEST.PY", "*.py")

def test_patterns_with_whitespace():
    # Patterns with Extra Whitespace
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", " *.txt ")
    codeflash_output = GitLoaderComponent.check_file_patterns("folder/test.py", " *.py ")

# Run the tests
pytest.main()
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from fnmatch import fnmatch
from pathlib import Path

# imports
import pytest  # used for our unit tests
from langflow.components.git.git import GitLoaderComponent
# function to test
from langflow.custom import Component

# unit tests

# Basic Inclusion and Exclusion
def test_single_inclusion_pattern():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.txt")

def test_single_exclusion_pattern():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "!*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "!*.txt")

# Multiple Patterns
def test_multiple_inclusion_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.py,*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.py,*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.md", "*.py,*.txt")

def test_multiple_exclusion_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "!*.py,!*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "!*.py,!*.txt")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.md", "!*.py,!*.txt")

# Mixed Inclusion and Exclusion Patterns
def test_mixed_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "*.py,!test.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("main.py", "*.py,!test.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", "*.py,!test.py")

# Edge Cases
def test_empty_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", "")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " ")

def test_whitespace_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " , ")

def test_no_valid_patterns_after_stripping():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " , , ")

def test_patterns_with_leading_trailing_whitespace():
    codeflash_output = GitLoaderComponent.check_file_patterns("test.py", " *.py , *.txt ")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.txt", " *.py , *.txt ")
    codeflash_output = GitLoaderComponent.check_file_patterns("test.md", " *.py , *.txt ")

# Path vs Filename Matching
def test_full_path_matching():
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "*/src/*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/docs/test.py", "*/src/*.py")

def test_filename_matching():
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "test.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/docs/test.py", "test.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/docs/main.py", "test.py")

# Complex Patterns
def test_complex_glob_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "**/*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/subdir/test.py", "**/*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.txt", "**/*.py")

def test_negated_complex_patterns():
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/test.py", "**/*.py,!**/test.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/main.py", "**/*.py,!**/test.py")
    codeflash_output = GitLoaderComponent.check_file_patterns("/home/user/src/subdir/test.py", "**/*.py,!**/test.py")

# Large Scale Test Cases
def test_large_number_of_patterns():
    patterns = ",".join([f"pattern{i}.py" for i in range(1000)])
    codeflash_output = GitLoaderComponent.check_file_patterns("pattern500.py", patterns)
    codeflash_output = GitLoaderComponent.check_file_patterns("pattern1000.py", patterns)

def test_large_file_path():
    large_path = "/" + "/".join(["dir"] * 1000) + "/test.py"
    codeflash_output = GitLoaderComponent.check_file_patterns(large_path, "*.py")
    codeflash_output = GitLoaderComponent.check_file_patterns(large_path, "*.txt")

# Additional Considerations

📢 Feedback on this optimization? Discord

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants