[Bug]: Error when parse file that extension not in FILE_NODE_PARSERS #13152

Closed
yukikwi opened this issue Apr 29, 2024 · 1 comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

yukikwi commented Apr 29, 2024

Bug Description

When I parse a file with SimpleFileNodeParser and the file's extension is not in FILE_NODE_PARSERS,
LlamaIndex raises AttributeError: 'tuple' object has no attribute 'ref_doc_id'.

Version

v0.10.33

Steps to Reproduce

from llama_index.core import (
    SimpleDirectoryReader,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import (
    SimpleFileNodeParser,
)

documents = SimpleDirectoryReader(
    input_files=["claude.py"],
    file_metadata=lambda filename: {
        "Original source": "claude.py",
        "extension": ".py"
    }
).load_data()


parser = SimpleFileNodeParser()
nodes = parser.get_nodes_from_documents(documents)

for node in nodes:
    print(node)

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/workspace/tmp/debug_llama_index.py", line 19, in <module>
    nodes = parser.get_nodes_from_documents(documents)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/tmp/tmp_venv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 80, in get_nodes_from_documents
    node.ref_doc_id is not None
    ^^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'ref_doc_id'
@yukikwi yukikwi added bug Something isn't working triage Issue needs to be triaged/prioritized labels Apr 29, 2024

dosubot bot commented Apr 29, 2024

The issue you're experiencing is due to SimpleFileNodeParser being given a Python file, an extension it does not support by default. The error 'tuple' object has no attribute 'ref_doc_id' is raised in get_nodes_from_documents, which expects every item returned by the parser to be a node. For a file whose extension is not recognized, the parser evidently returns tuples instead, and the attribute lookup fails.
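
One plausible way a tuple can end up in the node list, offered as an assumption rather than a confirmed diagnosis: if the code path for unrecognized extensions calls list.extend on a single document object that is iterable over its fields, each (field, value) pair is added instead of the document itself. The sketch below uses a hypothetical FakeDocument class, not the real LlamaIndex Document, to show the difference between extend and append.

```python
class FakeDocument:
    """Hypothetical stand-in for a document whose iteration yields (field, value) pairs."""

    def __init__(self, text, ref_doc_id=None):
        self.text = text
        self.ref_doc_id = ref_doc_id

    def __iter__(self):
        # Assumption: iterating the object yields its fields as tuples,
        # the way some model classes do.
        yield ("text", self.text)
        yield ("ref_doc_id", self.ref_doc_id)


def has_ref_doc_id(items):
    """Report, per item, whether it exposes a ref_doc_id attribute."""
    return [hasattr(item, "ref_doc_id") for item in items]


doc = FakeDocument("print('hello')")

extended = []
extended.extend(doc)  # iterates the object, so tuples land in the list

appended = []
appended.append(doc)  # keeps the object itself

print(has_ref_doc_id(extended))
print(has_ref_doc_id(appended))
```

With extend, the list holds plain tuples, and any later access to ref_doc_id on them fails in the same way as in the traceback above.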

To resolve this, you'll need to either extend the SimpleFileNodeParser class to handle Python files or create a new parser class that can process these files into BaseNode objects or a compatible type. This new or extended parser must ensure that every processed document has the required attributes, such as ref_doc_id, to avoid the error.

The SimpleFileNodeParser is designed to work with specific file types, and Python files are not included in the default supported types. If you're looking to parse unsupported file types, you'll have to implement custom logic to convert these files into a format that the parser can handle correctly.

For more detailed guidance, consider looking into how the SimpleFileNodeParser uses the FILE_NODE_PARSERS dictionary to map file extensions to their respective node parsers, and how you might add support for .py files in a similar manner.
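
As a stopgap that avoids subclassing, documents can be separated by extension before they reach SimpleFileNodeParser. The following is a minimal sketch under stated assumptions: SUPPORTED_EXTENSIONS and split_by_extension are hypothetical names, the extension set is an assumed placeholder for the keys of FILE_NODE_PARSERS, and SimpleNamespace objects stand in for real documents that carry a metadata dict.

```python
from types import SimpleNamespace

# Assumption: placeholder for the extensions the file parser recognizes.
# In a real project this would be taken from the keys of FILE_NODE_PARSERS.
SUPPORTED_EXTENSIONS = {".md", ".html", ".json"}


def split_by_extension(documents, supported=SUPPORTED_EXTENSIONS):
    """Partition documents by whether metadata['extension'] is supported."""
    supported_docs, other_docs = [], []
    for doc in documents:
        ext = doc.metadata.get("extension")
        if ext in supported:
            supported_docs.append(doc)
        else:
            other_docs.append(doc)
    return supported_docs, other_docs


# Stand-in documents; real ones would come from SimpleDirectoryReader.
docs = [
    SimpleNamespace(metadata={"extension": ".md"}),
    SimpleNamespace(metadata={"extension": ".py"}),
]
for_file_parser, for_fallback = split_by_extension(docs)
```

The first list can be passed to SimpleFileNodeParser as before; the second needs a different node parser or splitter of your choosing, since the file parser has no mapping for those extensions.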
