forked from microsoft/unilm
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
36 changed files
with
102,293 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Model Outputs | ||
## Text Recognition Task | ||
| **Input** | **Output** | | ||
|:---------------------------------------------:|:----------------------------------------------:| | ||
| ![Input](assets/cases/ocr_screen_input.png) | ![Output](assets/cases/ocr_screen_output.png) | | ||
| ![Input](assets/cases/ocr_ppt_input.png) | ![Output](assets/cases/ocr_ppt_output.png) | | ||
| ![Input](assets/cases/ocr_pdf_input.png) | ![Output](assets/cases/ocr_pdf_output.png) | | ||
| ![Input](assets/cases/ocr_cdip_input.png) | ![Output](assets/cases/ocr_cdip_output.png) | | ||
|
||
## Image to Markdown Task | ||
| **Input** | **Output** | | ||
|:---------------------------------------------:|:----------------------------------------------:| | ||
| ![Input](assets/cases/md_readme_input.png) | ![Output](assets/cases/md_readme_output.png) | | ||
| ![Input](assets/cases/md_latex1_input.png) | ![Output](assets/cases/md_latex1_output.png) | | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Microsoft Open Source Code of Conduct | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
|
||
Resources: | ||
|
||
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) | ||
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) | ||
- Contact [[email protected]](mailto:[email protected]) with questions or concerns |
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,93 @@ | ||
- Sep 2023: [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419) | ||
# [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419) | ||
Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared decoder-only auto-regressive Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models. | ||
|
||
| ![Image 1](assets/example/in.png) | ![Image 2](assets/example/ocr.png) | ![Image 3](assets/example/md.png) | | ||
|:---------------------------------:|:----------------------------------:|:---------------------------------:| | ||
| **(a) Input** | **(b) Using the ocr prompt** | **(c) Using the markdown prompt** | | ||
|
||
<sub>More model outputs can be found in the "[CASES.md](./CASES.md)"</sub> | ||
|
||
## News | ||
- May 2024: 🔥We've open-sourced the checkpoint and inference code of Kosmos-2.5, This checkpoint has been trained for more steps than the one reported in the paper. | ||
- Sep 2023: We release the **Kosmos-2.5: A Multimodal Literate Model** paper. Checkout the [paper](https://arxiv.org/abs/2309.11419). | ||
|
||
## Checkpoints | ||
The [checkpoint](https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt?download=true) can be downloaded via: | ||
```bash | ||
wget https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt?download=true | ||
``` | ||
|
||
## Results | ||
### Text Recognition | ||
| | precision | recall | f1 | | ||
|---------|:---------:|:------:|:--------:| | ||
| FUNSD | 83.88 | 82.66 | 83.26 | | ||
| SROIE | 91.72 | 92.57 | 92.14 | | ||
| CORD | 83.64 | 87.83 | 85.69 | | ||
|
||
### Image to Markdown | ||
| | NED | NTED | | ||
|-------------------|:---------:|:------:| | ||
| General Documents | 91.59 | 82.08 | | ||
| README | 95.09 | 91.18 | | ||
| Tables | 85.14 | 90.64 | | ||
|
||
|
||
## Installation | ||
The code uses [Flash Attention2](https://github.com/Dao-AILab/flash-attention), so it only runs on Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). | ||
``` bash | ||
git clone https://github.com/microsoft/unilm.git | ||
cd kosmos-2.5 | ||
pip install -r requirements.txt | ||
``` | ||
|
||
## Inference | ||
|
||
``` bash | ||
python inference.py \ | ||
--do_ocr \ // --do_md for image2md task | ||
--image path/to/image \ | ||
--ckpt path/to/checkpoint \ | ||
``` | ||
For images with extreme aspect ratios, we recommend resizing images to a more typical aspect ratio for better performance with the following command: | ||
``` bash | ||
python inference.py \ | ||
--do_ocr \ // --do_md for image2md task | ||
--image path/to/image \ | ||
--use_preprocess \ | ||
--hw_ratio_adj_upper_span "[1.5, 5]" \ | ||
--hw_ratio_adj_lower_span "[0.5, 1.0]" | ||
``` | ||
Please adjust the parameters based on your use cases. For example, | ||
- `--hw_ratio_adj_upper_span "[1.5, 5]"` indicates that if the image's aspect ratio is between 1.5 and 5, the image will be resized to an aspect ratio of 1.5. | ||
- `--hw_ratio_adj_lower_span "[0.5, 1.0]"` indicates that if the image's aspect ratio is between 0.5 and 1.0, the image will be resized to an aspect ratio of 1.0. | ||
|
||
|
||
|
||
## NOTE: | ||
- This is a research project and is limited for **research purposes only**. | ||
- Since this is a generative model, there is a risk of **hallucination** during the generation process, and it **CAN NOT** guarantee the accuracy of all OCR/Markdown results in the images. | ||
|
||
## Citation | ||
|
||
If you find this repository useful, please consider citing our work: | ||
``` | ||
@article{lv2023kosmos, | ||
title={Kosmos-2.5: A multimodal literate model}, | ||
author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others}, | ||
journal={arXiv preprint arXiv:2309.11419}, | ||
year={2023} | ||
} | ||
``` | ||
|
||
|
||
## License | ||
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) | ||
|
||
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct) | ||
|
||
|
||
## Contact | ||
For help or issues using Kosmos-2.5, please submit a GitHub issue. | ||
|
||
For other communications related to Kosmos-2.5, please contact [Lei Cui](mailto:[email protected]) or [Furu Wei](mailto:[email protected]). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK --> | ||
|
||
## Security | ||
|
||
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin). | ||
|
||
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below. | ||
|
||
## Reporting Security Issues | ||
|
||
**Please do not report security vulnerabilities through public GitHub issues.** | ||
|
||
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report). | ||
|
||
If you prefer to submit without logging in, send email to [[email protected]](mailto:[email protected]). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp). | ||
|
||
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). | ||
|
||
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: | ||
|
||
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) | ||
* Full paths of source file(s) related to the manifestation of the issue | ||
* The location of the affected source code (tag/branch/commit or direct URL) | ||
* Any special configuration required to reproduce the issue | ||
* Step-by-step instructions to reproduce the issue | ||
* Proof-of-concept or exploit code (if possible) | ||
* Impact of the issue, including how an attacker might exploit the issue | ||
|
||
This information will help us triage your report more quickly. | ||
|
||
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs. | ||
|
||
## Preferred Languages | ||
|
||
We prefer all communications to be in English. | ||
|
||
## Policy | ||
|
||
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd). | ||
|
||
<!-- END MICROSOFT SECURITY.MD BLOCK --> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# TODO: The maintainer of this repo has not yet edited this file | ||
|
||
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project? | ||
|
||
- **No CSS support:** Fill out this template with information about how to file issues and get help. | ||
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps. | ||
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide. | ||
|
||
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.* | ||
|
||
# Support | ||
|
||
## How to file issues and get help | ||
|
||
This project uses GitHub Issues to track bugs and feature requests. Please search the existing | ||
issues before filing new issues to avoid duplicates. For new issues, file your bug or | ||
feature request as a new Issue. | ||
|
||
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE | ||
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER | ||
CHANNEL. WHERE WILL YOU HELP PEOPLE?**. | ||
|
||
## Microsoft Support Policy | ||
|
||
Support for this **PROJECT or PRODUCT** is limited to the resources listed above. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
from kosmos2_5.tasks import GenerationTask | ||
from kosmos2_5.models import UniGPTmodel |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.