Software
Document understanding
CommonForms — automatically detect and insert form fields into PDFs. A family of object detectors. As easy as:
pip install commonforms
commonforms input.pdf fillable.pdf
FormalPDF — pypdfium2 wrapper for form operations in PDFs.
OmniOCR — use any open-source OCR model behind a unified API.
Datasets
LOCUS-v1 — The Local Ordinance Corpus for the United States, a dataset of 2.2 million municipal laws from across the country.
CommonForms — A dataset of ~500k prepared form pages, used to train object detectors for form field detection.
Other Things
tinyhnsw — the littlest vector database. A full, readable HNSW implementation in Python, for pedagogy.
LambdaNet — the most popular Haskell deep learning framework. Built in 2015.
AllenNLP: The Hard Way — a tutorial series for the (now defunct 😭) AllenNLP library.