This repository collects recent text-related papers from top conferences, grouped by topic:
- Text Recognition
- Controllable Text Generation
- Multi-modal Large Language Model
- Text Detection
- GUI Agents
👀
- OmniParser for Pure Vision Based GUI Agent (08 arXiv)
- UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents (08 arXiv)
- PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer (ECCV 2024)
- Self-supervised Character-to-Character Distillation for Text Recognition (ICCV 2023)
- MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition (ICCV 2023)
- Revisiting Scene Text Recognition: A Data Perspective (ICCV 2023)
- Self-Supervised Implicit Glyph Attention for Text Recognition (CVPR 2023)
- Relational Contrastive Learning for Scene Text Recognition (ACMMM 2023)
- TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition (IJCAI 2023)
- Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition (IJCAI 2023)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition (ACMMM 2022)
- Chinese Character Recognition with Augmented Character Profile Matching (ACMMM 2022)
- Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
- Task Grouping for Multilingual Text Recognition (ECCV 2022 Workshops)
- Multi-modal Text Recognition Networks: Interactive Enhancements Between Visual and Semantic Features (ECCV 2022)
- On Calibration of Scene-Text Recognition Models (ECCV 2022 Workshops)
- Pure Transformer with Integrated Experts for Scene Text Recognition (ECCV 2022)
- Optimal Boxes: Boosting End-to-End Scene Text Recognition by Adjusting Annotated Bounding Boxes via Reinforcement Learning (ECCV 2022)
- Multi-granularity Prediction for Scene Text Recognition (ECCV 2022)
- Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition (ECCV 2022)
- Background-Insensitive Scene Text Recognition with Text Semantic Segmentation (ECCV 2022)
- SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition (ECCV 2022)
- Levenshtein OCR (ECCV 2022)
- SVTR: Scene Text Recognition with a Single Visual Model (IJCAI 2022)
- Open-Set Text Recognition via Character-Context Decoupling (CVPR 2022)
- Knowledge Mining with Scene Text for Fine-Grained Recognition (CVPR 2022)
- Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition (AAAI 2022)
- Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition (AAAI 2022)
- Context-Based Contrastive Learning for Scene Text Recognition (AAAI 2022)
- Sequence-to-Sequence Contrastive Learning for Text Recognition (CVPR 2021)
- What if We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels (CVPR 2021)
- MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition (CVPR 2021)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition (CVPR 2021)
- Dictionary-Guided Scene Text Recognition (CVPR 2021)
- Primitive Representation Learning for Scene Text Recognition (CVPR 2021)
👀
- How To Create SOTA Image Generation with Text: Recraft’s ML Team Insights
- Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering (ECCV 2024)
- Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering (ECCV 2024)
- AnyText: Multilingual Visual Text Generation and Editing (ICLR 2024)
- Character-Aware Models Improve Visual Text Rendering (ACL 2023)
- TextDiffuser: Diffusion Models as Text Painters (NeurIPS 2023)
- GlyphControl: Glyph Conditional Control for Visual Text Generation (NeurIPS 2023)
- Layout-Agnostic Scene Text Image Synthesis with Diffusion Models (CVPR 2024)
- CustomText: Customized Textual Image Generation using Diffusion Models (CVPR 2024)
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering (arXiv)
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation (arXiv)
- Typographic Text Generation with Off-the-Shelf Diffusion Model (arXiv)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models (arXiv)
- GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models (arXiv)
- ARTIST: Improving the Generation of Text-rich Images by Disentanglement (arXiv)
👀
- On Pre-training of Multimodal Language Models Customized for Chart Understanding (06 arXiv)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling (10 arXiv)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding (09 arXiv)
- RegionGPT: Towards Region Understanding Vision Language Model (05 arXiv)
- Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding (NeurIPS 2024)
- Honeybee: Locality-enhanced Projector for Multimodal LLM (CVPR 2024)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (01 arXiv)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (04 arXiv)
- Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping (08 arXiv)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (CVPR 2024)
- TRINS: Towards Multimodal Language Models that Can Read (CVPR 2024)
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (EMNLP 2023)
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models (arXiv)
- Multimodal Table Understanding (arXiv)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding (arXiv)
- LayTextLLM: A Bounding Box is Worth One Token - Interleaving Layout and Text in a Large Language Model for Document Understanding (arXiv)
- MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv)
- Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning (arXiv)
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (arXiv)
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (arXiv)
- Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding (arXiv)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (arXiv)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (arXiv)
- DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding (12-2023 arXiv)
👀
- Bridging Synthetic and Real Worlds for Pre-training Scene Text Detector (ECCV 2024)
- LORE: Logical Location Regression Network for Table Structure Recognition (AAAI 2024)
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network (AAAI 2024)
- CPN: Complementary Proposal Network for Unconstrained Text Detection (AAAI 2024)