Better File Chunk | 更加强大的文件分块 #3550
Replies: 14 comments 16 replies
-
Excel is arguably a must-have for a knowledge base. Convert it to HTML and then reuse the HTML chunking.
-
Planning to add support for Lrc/Lrcx, so that lyrics files can be analyzed.
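As a rough illustration of what lyric chunking could look like, here is a minimal sketch for plain LRC only (Lrcx is not covered). The function names `parse_lrc` and `chunk_lrc` and the `max_lines` parameter are hypothetical and not part of the project; the sketch assumes the common `[mm:ss.xx]lyric` line format.

```python
import re

# Matches LRC timestamps such as [01:23.45] or [01:23].
# Metadata tags like [ar:Artist] do not match because they are not numeric.
TIMESTAMP = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")


def parse_lrc(text):
    """Return a list of (seconds, lyric) tuples sorted by time."""
    entries = []
    for line in text.splitlines():
        stamps = TIMESTAMP.findall(line)
        if not stamps:
            continue  # skip metadata tags and blank lines
        lyric = TIMESTAMP.sub("", line).strip()
        # A line may carry several timestamps when the lyric repeats.
        for minutes, seconds in stamps:
            entries.append((int(minutes) * 60 + float(seconds), lyric))
    return sorted(entries)


def chunk_lrc(entries, max_lines=4):
    """Group consecutive lyric lines into chunks that keep their time range."""
    chunks = []
    for i in range(0, len(entries), max_lines):
        group = entries[i:i + max_lines]
        chunks.append({
            "start": group[0][0],
            "end": group[-1][0],
            "text": "\n".join(lyric for _, lyric in group),
        })
    return chunks
```

Keeping `start` and `end` on each chunk means a retrieved passage can still be mapped back to a position in the song.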
-
Could typst be supported? It is a relatively new markup language and something of a competitor to LaTeX.
-
Are there existing chunking solutions for Java and Python? Please add them.
-
How do I configure Unstructured.io? Is it integrated into the project?
-
Could you please add ePub support to the list of supported files? Most books are in ePub or PDF format, and LLMs seem to be better at reading ePub (it has less clutter than PDF formatting). You could also add Mobi support, but Mobi is dying: Amazon was the only one supporting it, has stopped doing so, and has itself moved on to ePub. Thanks
-
How do I use Unstructured.io?
-
Hi OP, I thought of an approach: convert the Excel file to YAML, then chunk it according to a set of rules.
-
In another project, I converted Excel sheets to Markdown... and then they could be vectorized.
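A minimal sketch of that idea, assuming the sheet has already been read into a header list plus a list of rows (reading the .xlsx file itself is left out). The function name `rows_to_markdown_chunks` and the `rows_per_chunk` parameter are hypothetical. Repeating the header in every chunk is one possible design choice, so that each chunk stays interpretable on its own after retrieval.

```python
def rows_to_markdown_chunks(header, rows, rows_per_chunk=50):
    """Split table rows into Markdown tables that each repeat the header."""

    def fmt(cells):
        # Escape pipes so cell content cannot break the table layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    head = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        body = [fmt(row) for row in rows[i:i + rows_per_chunk]]
        chunks.append("\n".join(head + body))
    return chunks


# Example usage with a tiny in-memory table.
for chunk in rows_to_markdown_chunks(["name", "qty"], [["a", 1], ["b", 2], ["c", 3]], rows_per_chunk=2):
    print(chunk)
    print()
```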
-
https://github.com/nanbingxyz/5ire I found a project that supports vectorizing Excel files.
-
https://github.com/microsoft/markitdown Microsoft has now officially released a Python tool that converts Office files to Markdown.
-
I have some XML WSDL files that describe my API.
-
Please add transcription via OpenAI Whisper, chunking of the transcribed text, and viewing in an HTML5 media player with links to each transcribed part by timestamp.
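To make the request concrete, here is a rough sketch of the chunking and linking part only. It assumes the transcription step has already produced a list of segments, each a dict with `start`, `end` (seconds) and `text` keys; that input shape is an assumption here. The names `chunk_segments`, `media_link` and `max_chars` are hypothetical. The links use the `#t=start,end` temporal fragment from the W3C Media Fragments URI syntax.

```python
def chunk_segments(segments, max_chars=200):
    """Merge consecutive timed segments into chunks of at most max_chars."""
    chunks, current = [], None
    for seg in segments:
        text = seg["text"].strip()
        # Close the current chunk if adding this segment would exceed the limit.
        if current and len(current["text"]) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = None
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": text}
        else:
            current["end"] = seg["end"]
            current["text"] += " " + text
    if current:
        chunks.append(current)
    return chunks


def media_link(url, chunk):
    """Build a URL with a temporal media fragment for the chunk's time range."""
    return f"{url}#t={chunk['start']:.2f},{chunk['end']:.2f}"
```

Because each chunk keeps its time range, a search hit can be rendered as a link that points the player at the matching part of the recording.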
-
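Background
In RAG, retrieval and querying only work well when files are chunked sensibly. However, there are a great many file types in use, and the first phase currently supports chunking for only some of them.
Currently supported chunking types:
Plain text:
Code:
Rich text:
Tables:
Audio:
Video:
If you need chunking for a particular file type, please leave a comment below and describe how you envision chunking that type of file.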