We introduce a new document parsing framework, MinerU2.5. The key innovation is a decoupled architecture that separates global layout analysis from local content recognition via an efficient coarse-to-fine, two-stage inference mechanism.
In the first stage, the model conducts fast and holistic layout analysis on downsampled document images, capturing the global structural organization with minimal computational cost.
In the second stage, guided by the detected layout, it crops key regions from the original high-resolution input and performs fine-grained recognition within local windows, thereby preserving native resolution and ensuring high accuracy.
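The two stages described above can be sketched roughly as follows. This is a minimal illustration of the coarse-to-fine control flow only, not MinerU2.5's actual code: `detect_layout` and `recognize` are hypothetical stand-ins for the model's two inference passes, `factor` is an assumed downsampling ratio, and plain pixel subsampling stands in for whatever resizing the real system uses.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def downsample(image: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel in both directions (crude subsampling)."""
    return [row[::factor] for row in image[::factor]]


def scale_box(box: Box, factor: int, width: int, height: int) -> Box:
    """Map a box from thumbnail coordinates back to full-resolution coordinates."""
    left, top, right, bottom = box
    return (
        min(left * factor, width),
        min(top * factor, height),
        min(right * factor, width),
        min(bottom * factor, height),
    )


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by `box` out of the full-resolution image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def parse_document(
    image: Image,
    detect_layout: Callable[[Image], List[Box]],  # hypothetical stage-1 model
    recognize: Callable[[Image], object],         # hypothetical stage-2 model
    factor: int = 4,                              # assumed downsampling ratio
) -> List[Tuple[Box, object]]:
    height, width = len(image), len(image[0])

    # Stage 1: layout analysis on the cheap, downsampled view.
    thumbnail = downsample(image, factor)
    coarse_boxes = detect_layout(thumbnail)

    # Stage 2: recognition on native-resolution crops of each detected region.
    results = []
    for box in coarse_boxes:
        full_box = scale_box(box, factor, width, height)
        results.append((full_box, recognize(crop(image, full_box))))
    return results
```

The point of the split is that the layout pass only ever sees the small thumbnail, while the recognition pass only ever sees the cropped regions, so neither pass has to process the whole page at full resolution.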
Too often, document parsing systems attempt to resolve layout and fine-detail recognition in one go, which leads to compromises in either speed or accuracy. By first taking a broad view at low resolution, the system can map the structural blueprint of the document without getting distracted by pixel-level noise. The second stage then turns its full attention to the regions that truly matter, and does so at the original resolution, where no detail is lost...