Publications

My Favorite Papers

CommonForms: A Large, Diverse Dataset for Form Field Detection — WACV, 2026. A dataset and models for automatically detecting form fields from PDFs. The Python package is downloaded thousands of times a month.
[Data] [Code]

Syntopical Graphs for Computational Argumentation Tasks — ACL, 2021. Building claim-relation graphs to improve corpus understanding, inspired by Mortimer Adler’s Syntopical Reading.

A Joint Model for Document Segmentation and Segment Labeling — ACL, 2020. Learning to segment documents (back when LSTMs were still cool).

Bias and Fairness in Large Language Models: A Survey — Computational Linguistics, 2024. An in-depth survey on bias and fairness in NLP.

PDFTriage: Question Answering over Long, Structured Documents — EMNLP (Industry Track), 2024. Helping LLMs see documents the way people do.

Chain of Logic: Rule-Based Reasoning with Large Language Models — Findings of ACL, 2024. Rule-based reasoning for legal NLP.

Other Papers

SafePassage: High-Fidelity Information Extraction with Black Box LLMs — arXiv, 2025.
A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality — ICCV, 2025.
From Selection to Generation: A Survey of LLM-Based Active Learning — ACL, 2025.
A Survey on Small Language Models — RANLP, 2025.
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes — NAACL (Short Papers), 2025.
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models — COLM, 2024.
Personalized Multimodal Large Language Models: A Survey — arXiv preprint, 2024.
Personalization of Large Language Models: A Survey — arXiv preprint, 2024.
Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores — arXiv preprint, 2024.
Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards? — ACL, 2021.
It Takes Two to Lie: One to Lie, and One to Listen — ACL, 2020.
Mitigating Noisy Inputs for Question Answering — arXiv preprint, 2019.
Unsupervised System Combination for Set-Based Retrieval with Expectation Maximization — CLEF, 2019.
Surprise Languages: Rapid-Response Cross-Language IR — EVIA Workshop (NTCIR-14), 2019.
UMDeep at SemEval-2017 Task 1: End-to-End Shared Weight LSTM Model for Semantic Textual Similarity — SemEval, 2017.

Patents

Machine learning recollection as part of question answering using a corpus — US Patent 12,596,709, 2026.
Query based classification — US Patent 12,524,415, 2026.
Fine-grained attribution for document question answering — US Patent App. 18/528,618, 2025.
Responding to a user query using machine learning — US Patent App. 18/667,690, 2025.
Machine-learning tool for generating segmentation and topic metadata for documents — US Patent 12,147,499, 2024.
Syntopical reading for collection understanding — US Patent 12,038,962, 2024.
Utilizing embedding-based claim-relation graphs for efficient syntopical reading of content collections — US Patent App. 18/336,380, 2024.
Machine-learning tool for generating segmentation and topic metadata for documents — US Patent 11,783,008, 2023.

Recorded Talks

Richard Hamming believed that it’s the job of a scientist to communicate via publications, prepared talks, and impromptu talks. I do my best. If you’re interested in me speaking somewhere, reach out!

CommonForms @ WACV — 2026.
CommonForms @ Voxel51 — 2025.
Syntopical Graphs @ ACL — 2021.
S-LSTM @ ACL — 2020.

I’ve given lots of unrecorded talks and lectures on:

OCR
Prompt Engineering
Evaluation
Information Retrieval, Multivector Models, etc.