Joe Barrow field_notes

Paper Notes: NanoNets OCR

last updated 2026-05-10

Curated a dataset of 250,000 pages… research papers, financial documents, legal documents, healthcare documents, tax forms, receipts, and invoices. Additionally, the collection features documents containing images, plots, equations, signatures, watermarks, checkboxes, and complex tables.

Two-stage training: (1) train on synthetically generated datasets, then (2) fine-tune on human-annotated datasets.
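A rough sketch of that stage ordering, just to make the curriculum concrete. Everything here is an assumption for illustration: `Stage`, `ToyModel`, `run_stages`, and `build_stages` are hypothetical names, the one-weight model stands in for the actual OCR model, and the learning rates and epoch counts are made up. The notes above only establish the order (synthetic first, human-annotated second).

```python
from dataclasses import dataclass
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y) in this toy setup


@dataclass
class Stage:
    name: str
    examples: List[Example]
    learning_rate: float  # assumed value, not from the paper
    epochs: int           # assumed value, not from the paper


class ToyModel:
    """One-weight linear model y = w * x, standing in for the real model."""

    def __init__(self) -> None:
        self.w = 0.0

    def train_step(self, example: Example, learning_rate: float) -> None:
        x, y = example
        grad = 2.0 * (self.w * x - y) * x  # derivative of squared error w.r.t. w
        self.w -= learning_rate * grad


def run_stages(model: ToyModel, stages: List[Stage]) -> List[Tuple[str, int]]:
    """Run the stages strictly in order and return a (stage name, epoch) log."""
    log = []
    for stage in stages:
        for epoch in range(stage.epochs):
            for example in stage.examples:
                model.train_step(example, stage.learning_rate)
            log.append((stage.name, epoch))
    return log


def build_stages() -> List[Stage]:
    # Hypothetical data: a larger synthetic set, then a smaller annotated set.
    synthetic = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    annotated = [(1.0, 2.0), (2.0, 4.0)]
    return [
        Stage("synthetic_pretrain", synthetic, learning_rate=0.05, epochs=3),
        Stage("human_finetune", annotated, learning_rate=0.01, epochs=2),
    ]


model = ToyModel()
log = run_stages(model, build_stages())
```

The smaller learning rate in the second stage is a common fine-tuning choice, not something the paper notes confirm; the point of the sketch is only that stage 2 starts from the weights stage 1 produced.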

Special tags for:

Does not handle handwritten text.