add kosmos-2.5
Dod-o committed May 13, 2024
1 parent 8c67019 commit 2e11a0d
Showing 36 changed files with 102,293 additions and 1 deletion.
15 changes: 15 additions & 0 deletions kosmos-2.5/CASES.md
@@ -0,0 +1,15 @@
# Model Outputs
## Text Recognition Task
| **Input** | **Output** |
|:---------------------------------------------:|:----------------------------------------------:|
| ![Input](assets/cases/ocr_screen_input.png) | ![Output](assets/cases/ocr_screen_output.png) |
| ![Input](assets/cases/ocr_ppt_input.png) | ![Output](assets/cases/ocr_ppt_output.png) |
| ![Input](assets/cases/ocr_pdf_input.png) | ![Output](assets/cases/ocr_pdf_output.png) |
| ![Input](assets/cases/ocr_cdip_input.png) | ![Output](assets/cases/ocr_cdip_output.png) |

## Image to Markdown Task
| **Input** | **Output** |
|:---------------------------------------------:|:----------------------------------------------:|
| ![Input](assets/cases/md_readme_input.png) | ![Output](assets/cases/md_readme_output.png) |
| ![Input](assets/cases/md_latex1_input.png) | ![Output](assets/cases/md_latex1_output.png) |

9 changes: 9 additions & 0 deletions kosmos-2.5/CODE_OF_CONDUCT.md
@@ -0,0 +1,9 @@
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
437 changes: 437 additions & 0 deletions kosmos-2.5/LICENSE

Large diffs are not rendered by default.

94 changes: 93 additions & 1 deletion kosmos-2.5/README.md
@@ -1 +1,93 @@
- Sep 2023: [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)
# [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)
Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared decoder-only auto-regressive Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.

| ![Image 1](assets/example/in.png) | ![Image 2](assets/example/ocr.png) | ![Image 3](assets/example/md.png) |
|:---------------------------------:|:----------------------------------:|:---------------------------------:|
| **(a) Input** | **(b) Using the ocr prompt** | **(c) Using the markdown prompt** |

<sub>More model outputs can be found in [CASES.md](./CASES.md).</sub>

## News
- May 2024: 🔥 We've open-sourced the checkpoint and inference code of Kosmos-2.5. This checkpoint has been trained for more steps than the one reported in the paper.
- Sep 2023: We released the **Kosmos-2.5: A Multimodal Literate Model** paper. Check out the [paper](https://arxiv.org/abs/2309.11419).

## Checkpoints
The [checkpoint](https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt?download=true) can be downloaded via:
```bash
# Quote the URL so the shell does not interpret "?", and save the file as ckpt.pt.
wget -O ckpt.pt "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt?download=true"
```
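
The same download can be sketched in Python. This is a minimal illustration rather than part of the repository: the helper names `local_filename` and `download_checkpoint` are hypothetical, and only the standard library is used. Because the URL carries a `?download=true` query string, the file name is taken from the URL path alone.

```python
import os
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Checkpoint URL as given in this README.
CKPT_URL = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt?download=true"


def local_filename(url: str) -> str:
    """Return the last path component of the URL, ignoring the query string."""
    return PurePosixPath(urlsplit(url).path).name


def download_checkpoint(url: str = CKPT_URL, dest_dir: str = ".") -> str:
    """Download the checkpoint into dest_dir and return the local path.

    Hypothetical helper: it performs a plain HTTP download with no resume
    or integrity check.
    """
    dest = os.path.join(dest_dir, local_filename(url))
    urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `download_checkpoint()` would fetch the file into the current directory as `ckpt.pt`.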

## Results
### Text Recognition
| | precision | recall | f1 |
|---------|:---------:|:------:|:--------:|
| FUNSD | 83.88 | 82.66 | 83.26 |
| SROIE | 91.72 | 92.57 | 92.14 |
| CORD | 83.64 | 87.83 | 85.69 |
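
As a sanity check on the table above, the sketch below recomputes F1 as the harmonic mean of precision and recall. That the reported `f1` column is computed this way is an assumption; the recomputed values can differ from the table in the last digit because the precision and recall entries are themselves rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the table above.
REPORTED = {
    "FUNSD": (83.88, 82.66, 83.26),
    "SROIE": (91.72, 92.57, 92.14),
    "CORD": (83.64, 87.83, 85.69),
}

if __name__ == "__main__":
    for name, (p, r, reported) in REPORTED.items():
        print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```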

### Image to Markdown
| | NED | NTED |
|-------------------|:---------:|:------:|
| General Documents | 91.59 | 82.08 |
| README | 95.09 | 91.18 |
| Tables | 85.14 | 90.64 |
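
This README does not define the metrics. Assuming NED stands for normalized edit distance and is reported as a similarity score where higher is better, a character-level version could be sketched as follows. The function names are hypothetical and this is not the repository's evaluation code; NTED, presumably the tree-based counterpart, is not covered.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(pred: str, ref: str) -> float:
    """Return 100 * (1 - distance / length of the longer string)."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(pred, ref) / longest)
```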


## Installation
The code uses [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), so it only runs on Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
``` bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2.5
pip install -r requirements.txt
```
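
To check up front whether a GPU falls in the supported families, one option is to compare its CUDA compute capability against 8.0, the capability of Ampere. The sketch below is illustrative only: `supports_flash_attention2` is a hypothetical helper, and the threshold is an assumption derived from the GPU families listed above.

```python
# Ampere starts at compute capability 8.0; Ada is 8.9 and Hopper is 9.0.
MIN_CAPABILITY = (8, 0)


def supports_flash_attention2(capability: tuple) -> bool:
    """Return True if a (major, minor) compute capability is Ampere or newer."""
    return tuple(capability) >= MIN_CAPABILITY


# Compute capabilities of a few common GPUs, for illustration.
EXAMPLES = {
    "V100": (7, 0),
    "T4": (7, 5),
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}

if __name__ == "__main__":
    for name, cap in EXAMPLES.items():
        print(name, "supported" if supports_flash_attention2(cap) else "not supported")
```

With PyTorch installed, the tuple for the current device can be obtained from `torch.cuda.get_device_capability()`.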

## Inference

``` bash
# Use --do_md instead of --do_ocr for the image-to-markdown task.
python inference.py \
    --do_ocr \
    --image path/to/image \
    --ckpt path/to/checkpoint
```
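
When running inference over many images, it can be convenient to build the command above programmatically. The following is a minimal sketch using only the flags shown in this README; `build_inference_command` is a hypothetical helper, not something the repository provides.

```python
import sys


def build_inference_command(image: str, ckpt: str, task: str = "ocr",
                            script: str = "inference.py") -> list:
    """Build the argument list for one inference run.

    task is "ocr" (adds --do_ocr) or "md" (adds --do_md).
    """
    if task not in ("ocr", "md"):
        raise ValueError(f"unknown task: {task!r}")
    return [sys.executable, script, f"--do_{task}", "--image", image, "--ckpt", ckpt]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to execute it from the kosmos-2.5 directory.
    print(build_inference_command("path/to/image", "path/to/checkpoint", task="md"))
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.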
For images with extreme aspect ratios, we recommend resizing them to a more typical aspect ratio for better performance, using the following command:
``` bash
# Use --do_md instead of --do_ocr for the image-to-markdown task.
python inference.py \
    --do_ocr \
    --image path/to/image \
    --ckpt path/to/checkpoint \
    --use_preprocess \
    --hw_ratio_adj_upper_span "[1.5, 5]" \
    --hw_ratio_adj_lower_span "[0.5, 1.0]"
```
Please adjust the parameters based on your use cases. For example,
- `--hw_ratio_adj_upper_span "[1.5, 5]"` indicates that if the image's aspect ratio is between 1.5 and 5, the image will be resized to an aspect ratio of 1.5.
- `--hw_ratio_adj_lower_span "[0.5, 1.0]"` indicates that if the image's aspect ratio is between 0.5 and 1.0, the image will be resized to an aspect ratio of 1.0.
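
The two rules above can be sketched as a small function. This is an illustration of the described behavior, not the code in `inference.py`: the names `parse_span` and `target_ratio` are hypothetical, and treating the span endpoints as inclusive is an assumption.

```python
import json


def parse_span(text: str) -> tuple:
    """Parse a span given as a JSON-style list string, e.g. "[1.5, 5]"."""
    low, high = json.loads(text)
    return float(low), float(high)


def target_ratio(ratio: float,
                 upper_span: tuple = (1.5, 5.0),
                 lower_span: tuple = (0.5, 1.0)) -> float:
    """Return the aspect ratio an image would be resized to.

    Ratios inside upper_span are pulled down to the span's lower bound,
    ratios inside lower_span are pulled up to the span's upper bound,
    and all other ratios are left unchanged.
    """
    upper_low, upper_high = upper_span
    lower_low, lower_high = lower_span
    if upper_low <= ratio <= upper_high:
        return upper_low
    if lower_low <= ratio <= lower_high:
        return lower_high
    return ratio


if __name__ == "__main__":
    for r in (0.3, 0.7, 1.2, 3.0, 8.0):
        print(r, "->", target_ratio(r))
```

Under these defaults, ratios outside both spans (such as 1.2 or 8.0) pass through unchanged, so widening a span is how more extreme images get adjusted.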



## NOTE:
- This is a research project and is intended for **research purposes only**.
- Since this is a generative model, there is a risk of **hallucination** during generation, and the model **cannot** guarantee the accuracy of all OCR/Markdown results for the images.

## Citation

If you find this repository useful, please consider citing our work:
```
@article{lv2023kosmos,
title={Kosmos-2.5: A multimodal literate model},
author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
journal={arXiv preprint arXiv:2309.11419},
year={2023}
}
```


## License
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)


## Contact
For help or issues using Kosmos-2.5, please submit a GitHub issue.

For other communications related to Kosmos-2.5, please contact [Lei Cui](mailto:lecu@microsoft.com) or [Furu Wei](mailto:fuwei@microsoft.com).
41 changes: 41 additions & 0 deletions kosmos-2.5/SECURITY.md
@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).

If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->
25 changes: 25 additions & 0 deletions kosmos-2.5/SUPPORT.md
@@ -0,0 +1,25 @@
# TODO: The maintainer of this repo has not yet edited this file

**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?

- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.

*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*

# Support

## How to file issues and get help

This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.

For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.

## Microsoft Support Policy

Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
2 changes: 2 additions & 0 deletions kosmos-2.5/__init.py
@@ -0,0 +1,2 @@
from kosmos2_5.tasks import GenerationTask
from kosmos2_5.models import UniGPTmodel
Binary file added kosmos-2.5/assets/cases/md_latex1_input.png
Binary file added kosmos-2.5/assets/cases/md_latex1_output.png
Binary file added kosmos-2.5/assets/cases/md_readme_input.png
Binary file added kosmos-2.5/assets/cases/md_readme_output.png
Binary file added kosmos-2.5/assets/cases/ocr_arxiv_input.png
Binary file added kosmos-2.5/assets/cases/ocr_arxiv_output.png
Binary file added kosmos-2.5/assets/cases/ocr_cdip_input.png
Binary file added kosmos-2.5/assets/cases/ocr_cdip_output.png
Binary file added kosmos-2.5/assets/cases/ocr_pdf_input.png
Binary file added kosmos-2.5/assets/cases/ocr_pdf_output.png
Binary file added kosmos-2.5/assets/cases/ocr_ppt_input.png
Binary file added kosmos-2.5/assets/cases/ocr_ppt_output.png
Binary file added kosmos-2.5/assets/cases/ocr_screen_input.png
Binary file added kosmos-2.5/assets/cases/ocr_screen_output.png
Binary file added kosmos-2.5/assets/example/in.png
Binary file added kosmos-2.5/assets/example/md.png
Binary file added kosmos-2.5/assets/example/ocr.png