Error during traversal: The text contains a special token that is not allowed #22

Open
slavakurilyak opened this issue Apr 1, 2023 · 3 comments

Comments

@slavakurilyak

slavakurilyak commented Apr 1, 2023

When I run doc index on the langchain repository, I receive the following error:

⠇ Processing 494 files...Error during traversal: The text contains a special token that is not allowed: <|endoftext|>
Failed to find `autodoc.config.json` file. Did you run `doc init`?
Error: The text contains a special token that is not allowed: <|endoftext|>
    at module.exports.__wbindgen_error_new (/usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:398:17)
    at wasm://wasm/00b63e2e:wasm-function[15]:0xebb8
    at wasm://wasm/00b63e2e:wasm-function[154]:0x48af5
    at Tiktoken.encode (/usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:257:18)
    at processFile (file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/commands/index/processRepository.js:24:40)
    at async file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/utils/traverseFileSystem.js:42:21
    at async Promise.all (index 2)
    at async dfs (file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/utils/traverseFileSystem.js:38:13)
    at async file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/utils/traverseFileSystem.js:25:21
    at async Promise.all (index 0)

I believe this is an issue with autodoc, rather than the langchain repository, as I have followed the instructions in the README file and run doc init in the langchain repository before running doc index.

Here is some information about my environment:

  • Operating system: macOS Monterey 12.6.3 (21G419)
  • Node.js version: v19.8.1

Please let me know if there is any additional information I can provide or steps I can take to resolve this issue.

@dahifi

dahifi commented Apr 1, 2023

I get the same problem when trying to process the microsoft/semantic-kernel repo. I managed to get things working by catching the error, but it's a hack, as I don't understand what's throwing it.
src/cli/commands/index/processRepository.ts

   let summaryLength: number;
   try {
     summaryLength = encoding.encode(summaryPrompt).length;
   } catch (error) {
     console.error(
       `Error during encoding of summary prompt: ${(error as Error).message}`,
     );
     // set summaryLength to a default value
     summaryLength = 0;
   }

   let questionLength: number;
   try {
     questionLength = encoding.encode(questionsPrompt).length;
   } catch (error) {
     console.error(
       `Error during encoding of question prompt: ${(error as Error).message}`,
     );
     // set questionLength to a default value
     questionLength = 0;
   }
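
The workaround above falls back to a length of 0 whenever encoding fails. A gentler variant, sketched here under assumptions, retries with the special tokens broken up so the count stays roughly right. `safeTokenCount`, `escapeSpecialTokens`, and the `Encode` type are hypothetical names, not part of autodoc, and the regular expression is only a guess at the shape of such tokens (`<|name|>`), not the tokenizer's actual list.

```typescript
// Hypothetical helper, not part of autodoc: counts tokens without crashing
// on special tokens such as <|endoftext|>.
// `Encode` stands in for any encoder function, e.g. a wrapper around
// encoding.encode from the snippet above.
type Encode = (text: string) => ArrayLike<number>;

// Assumed shape of a special token: <|name|>, e.g. <|endoftext|>.
const SPECIAL_TOKEN = /<\|([a-z_]+)\|>/g;

// Insert a space after the opening "<" so the text no longer forms a
// special token. This changes the text slightly, so counts are approximate.
function escapeSpecialTokens(text: string): string {
  return text.replace(SPECIAL_TOKEN, "< |$1|>");
}

function safeTokenCount(encode: Encode, text: string): number {
  try {
    return encode(text).length;
  } catch {
    // Retry once with the special tokens broken up instead of returning 0.
    return encode(escapeSpecialTokens(text)).length;
  }
}
```

With this in place, a call such as `safeTokenCount((t) => encoding.encode(t), summaryPrompt)` could replace each try/catch block, assuming `encoding` is the same object used above.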

@slavakurilyak
Author

slavakurilyak commented Apr 4, 2023

For langchain, I resolved the issue by deleting docs/modules/agents/toolkits/examples/openai_openapi.yml.

For semantic-kernel, I resolved the issue by deleting dotnet/src/SemanticKernel/Connectors/OpenAI/Tokenizers/Settings/encoder.json.

This issue is caused by <|endoftext|>, a special token used by OpenAI models. The tokenizer refuses to encode text that contains it, and since langchain and semantic-kernel both have files containing this token, the doc index command fails.
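
Rather than deleting files by trial and error, the offending lines can be located first. The following is a minimal sketch, assuming special tokens always have the form `<|name|>`; `findSpecialTokens` and `SpecialTokenHit` are hypothetical names introduced for illustration, not autodoc or tiktoken APIs.

```typescript
// Hypothetical helper: reports where special-token-like strings appear.
interface SpecialTokenHit {
  line: number;  // 1-based line number within the text
  token: string; // the matched token, e.g. "<|endoftext|>"
}

function findSpecialTokens(text: string): SpecialTokenHit[] {
  // Assumed pattern for special tokens; the real set depends on the encoding.
  const pattern = /<\|[a-z_]+\|>/g;
  const hits: SpecialTokenHit[] = [];
  text.split("\n").forEach((content, index) => {
    // With the g flag, match() returns every occurrence or null.
    const matches = content.match(pattern) || [];
    for (const token of matches) {
      hits.push({ line: index + 1, token });
    }
  });
  return hits;
}
```

Running this over each file's contents before indexing would show which files need to be skipped or escaped.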

Here's a possible fix: langchain-ai/langchain#923

@samheutmaker can you patch this?

@samheutmaker
Contributor

Sorry, have been swamped. I'll take a look at this when I get a second.


3 participants