Paper Notes: NanoNets OCR
last updated 2026-05-10
Curated a dataset of 250,000 pages… research papers, financial documents, legal documents, healthcare documents, tax forms, receipts, and invoices. Additionally, the collection features documents containing images, plots, equations, signatures, watermarks, checkboxes, and complex tables.
Two-stage training: (1) train on synthetically generated datasets, (2) fine-tune on human-annotated datasets.
Special tags for:
- signatures: <signature>
- watermarks: <watermark>
- images: <img>
- page numbers: <page_number>
- LaTeX formulas: $$ (latex formula)
- tables as HTML: <table><tr><td>
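A rough sketch of how output carrying these tags could be post-processed. The tag names come from the notes above. Everything else is an assumption made for illustration: that each tag wraps its content as a paired `<tag>...</tag>`, that formulas are delimited by `$$...$$`, and the helper names `extract_tagged`, `extract_formulas`, and `extract_table_rows` (hypothetical, not part of any NanoNets API).

```python
import re
from html.parser import HTMLParser

# Tag names are from the notes; the paired open/close form is an assumption.
TAGS = ("signature", "watermark", "img", "page_number")


def extract_tagged(text):
    """Return {tag: [contents, ...]} for each special tag found in text."""
    found = {}
    for tag in TAGS:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [m.strip() for m in pattern.findall(text)]
    return found


def extract_formulas(text):
    """Return the bodies of $$...$$ blocks (assumed formula delimiter)."""
    return [m.strip() for m in re.findall(r"\$\$(.+?)\$\$", text, re.DOTALL)]


class TableRows(HTMLParser):
    """Collect the cell text of <table><tr><td> markup as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


def extract_table_rows(text):
    """Parse every HTML table row in text into a list of cell strings."""
    parser = TableRows()
    parser.feed(text)
    return parser.rows


# Made-up sample output, only to exercise the helpers.
sample = (
    "<page_number>3</page_number>\n"
    "<watermark>DRAFT</watermark>\n"
    "$$E = mc^2$$\n"
    "<table><tr><td>Item</td><td>Total</td></tr>"
    "<tr><td>Pen</td><td>2.50</td></tr></table>\n"
    "<signature>J. Doe</signature>"
)
tags = extract_tagged(sample)
formulas = extract_formulas(sample)
rows = extract_table_rows(sample)
```

Regexes are enough for the flat, non-nested tags. The tables go through `html.parser` instead, since cell contents are easier to recover with a real parser than with pattern matching.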
Does not handle handwritten text.