Publications
My Favorite Papers
CommonForms: A Large, Diverse Dataset for Form Field Detection — WACV, 2026. A dataset and models for automatically detecting form fields from PDFs. The Python package is downloaded thousands of times a month.
[Data] [Code]
Syntopical Graphs for Computational Argumentation Tasks — ACL, 2021. Building claim-relation graphs to improve corpus understanding, inspired by Mortimer Adler’s Syntopical Reading.
A Joint Model for Document Segmentation and Segment Labeling — ACL, 2020. Learning to segment documents (back when LSTMs were still cool).
Bias and Fairness in Large Language Models: A Survey — Computational Linguistics, 2024. An in-depth survey on bias and fairness in NLP.
PDFTriage: Question Answering over Long, Structured Documents — EMNLP (Industry Track), 2024. Helping LLMs to see documents like people do.
Chain of Logic: Rule-Based Reasoning with Large Language Models — Findings of ACL, 2024. Rule-based reasoning for legal NLP.
Other Papers
- SafePassage: High-Fidelity Information Extraction with Black Box LLMs — arXiv, 2025.
- A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality — ICCV, 2025.
- From Selection to Generation: A Survey of LLM-Based Active Learning — ACL, 2025.
- A Survey on Small Language Models — RANLP, 2025.
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes — NAACL (Short Papers), 2025.
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models — COLM, 2024.
- Personalized Multimodal Large Language Models: A Survey — arXiv preprint, 2024.
- Personalization of Large Language Models: A Survey — arXiv preprint, 2024.
- Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores — arXiv preprint, 2024.
- Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards? — ACL, 2021.
- It Takes Two to Lie: One to Lie, and One to Listen — ACL, 2020.
- Mitigating Noisy Inputs for Question Answering — arXiv preprint, 2019.
- Unsupervised System Combination for Set-Based Retrieval with Expectation Maximization — CLEF, 2019.
- Surprise Languages: Rapid-Response Cross-Language IR — EVIA Workshop (NTCIR-14), 2019.
- UMDeep at SemEval-2017 Task 1: End-to-End Shared Weight LSTM Model for Semantic Textual Similarity — SemEval, 2017.
Patents
- Machine learning recollection as part of question answering using a corpus — US Patent 12,596,709, 2026.
- Query based classification — US Patent 12,524,415, 2026.
- Fine-grained attribution for document question answering — US Patent App. 18/528,618, 2025.
- Responding to a user query using machine learning — US Patent App. 18/667,690, 2025.
- Machine-learning tool for generating segmentation and topic metadata for documents — US Patent 12,147,499, 2024.
- Syntopical reading for collection understanding — US Patent 12,038,962, 2024.
- Utilizing embedding-based claim-relation graphs for efficient syntopical reading of content collections — US Patent App. 18/336,380, 2024.
- Machine-learning tool for generating segmentation and topic metadata for documents — US Patent 11,783,008, 2023.
Recorded Talks
Richard Hamming believed that it’s the job of a scientist to communicate via publications, prepared talks, and impromptu talks. I do my best. If you’d like me to speak somewhere, reach out!
- CommonForms @ WACV — 2026.
- CommonForms @ Voxel51 — 2025.
- Syntopical Graphs @ ACL — 2021.
- S-LSTM @ ACL — 2020.
I’ve given lots of unrecorded talks and lectures on:
- OCR
- Prompt Engineering
- Evaluation
- Information Retrieval, Multivector Models, etc.