David Mimno

@dmimno

He teaches information science at Cornell. http://mimno.infosci.cornell.edu

6,695 Followers
4,482 Following
683 Posts
Joined 03.07.2023

Latest posts by David Mimno @dmimno

Powerful LLMs and agent workflows have led to a whole lot of very specific "we did a thing" papers. How are people evaluating these?

06.03.2026 18:58 👍 13 🔁 2 💬 3 📌 0
Screenshot of plot showing ELO vs parameter count for different OCR models

There is no best VLM OCR model - rankings can flip completely by document type.

I built ocr-bench: run open OCR models on YOUR documents, get a per-collection leaderboard.

VLM-as-judge with Bradley-Terry ELO, all running on @hf.co. No local GPU needed.

05.03.2026 14:48 👍 46 🔁 10 💬 1 📌 1
Call for Main Conference Papers (EMNLP 2026)

The center of gravity in NLP is shifting. 🌍

This year's #EMNLP2026 Special Theme is "New Missions for NLP Research." We welcome empirical, theoretical, or position and survey papers that reframe our collective research goals.

Find out more:
2026.emnlp.org/calls/main_c...

05.03.2026 10:47 👍 11 🔁 6 💬 0 📌 0

Excellent venue for computational humanities work, colocated with ACL in San Diego on July 6. Please share!

04.03.2026 20:13 👍 13 🔁 7 💬 0 📌 0

Lots of museums have good datasets. What methods are they learning?

04.03.2026 20:10 👍 1 🔁 0 💬 1 📌 0

Here is the announcement for Cornell's talk

03.03.2026 23:08 👍 7 🔁 2 💬 0 📌 0
NLP4DH 2026 Conference Welcome to the OpenReview homepage for NLP4DH 2026 Conference

🚨 NLP4DH 2026 deadline has been extended to March 13! Submission link here: openreview.net/group?id=NLP...

03.03.2026 19:33 👍 7 🔁 6 💬 0 📌 2
TRAILS UMD Post Doctoral Associate Job Description - Spring 2026 Post Doctoral Associate Institute for Trustworthy AI in Law & Society February 2026 The Institute for Trustworthy AI in Law & Society (TRAILS) and the University of Maryland aim to transform the pr...

Come join TRAILS as a postdoc at UMD (and work w folks at GW, MSU & Cornell) to conduct research and scholarship focused on approaches to AI that advance trust and trustworthiness with a great group of colleagues!

🌐 go.umd.edu/trails-postd...
πŸ—“οΈ Summer/Fall 2026 start

03.03.2026 15:42 👍 4 🔁 5 💬 0 📌 0

Fork away!

03.03.2026 12:31 👍 1 🔁 0 💬 0 📌 0

My favorite finding: "Surprisingly, a minimal set of eight words is sufficient to obtain 0.74 AUC on the training and test sets without any degradation in test performance. These words are of, in, to, had to indicate longer duration and you, said, it, he to indicate shorter duration."

02.03.2026 17:15 👍 6 🔁 0 💬 2 📌 0
PDF Text Cleaner

I got frustrated copying quotes from PDFs with line breaks, and used Claude to make this little tool: mimno.github.io/copyoneline/

Paste text into the box, it removes newlines and puts the result back in your clipboard, adding quotation marks if desired.

02.03.2026 17:14 👍 28 🔁 2 💬 2 📌 0
ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools for persuasion, their specific role in online, u...

1/
New preprint! 🧵 Can stories make arguments more persuasive? And which narrative features matter most? In ARGUS we build a framework to study this in Reddit's r/ChangeMyView, with @saranabhani.bsky.social, Khalid Al-Khatib, and @malvinanissim.bsky.social
arxiv.org/abs/2602.24109

02.03.2026 15:03 👍 16 🔁 9 💬 1 📌 0

Yes! Many libraries have a good process for depositing a zip file but an interactive page isn't their strength. And having a thing people can cite is great for academic visibility.

02.03.2026 12:43 👍 1 🔁 0 💬 0 📌 0

Text-based data files in a GitHub repository with an AI-prototyped web front end running on GitHub pages?

Not an archival solution, but a good compromise of user access, data access, and dev cost.

02.03.2026 12:36 👍 2 🔁 0 💬 1 📌 0
Introduction to Handwriting in England c. 700-1700 This course will provide an introduction to manuscript-production in Britain and Ireland over the course of a millennium. It will explore questions of palaeography, diplomatic, and processes of tra…

Do you want to improve your knowledge of medieval manuscripts from England? Book now for this summer school course, in person, in London, 8-12 June. 👇☀️📚 #medievalsky please repost!

palaeography.uk/study/short-...

27.02.2026 16:19 👍 38 🔁 27 💬 0 📌 0
A picture of Joe Halpern smiling in green shirt in front of a blue background.

Today arXiv remembers our colleague Joe Halpern, who was instrumental in founding arXiv's CS section.

Joe's passions ranged far & wide and we're lucky that arXiv was one of them. Joe, thank you for giving so much to arXiv - you are missed.

blog.arxiv.org/2026/02/27/remembering-joe-halpern

27.02.2026 18:38 👍 50 🔁 12 💬 2 📌 2

One mismatch I see is institutional AI seems focused on enabling model training, but most of the applications I see are inference. Batch uses like: "upload a spreadsheet of prompts, get results back in a few hours" or "apply this prompt to these volumes and return a spreadsheet".

27.02.2026 15:54 👍 4 🔁 0 💬 2 📌 0

Circular rainbow zen pic

27.02.2026 14:54 👍 4 🔁 0 💬 0 📌 0

What do you have in mind? Libraries need exciting initiatives. Mostly it's "how do we deal with this year's budget cut" and "how do we deal with the latest demands from publishers"

27.02.2026 14:51 👍 4 🔁 0 💬 1 📌 0

… or build it out of heavy, low-value, durable materials in a remote and inaccessible part of a vast desert

27.02.2026 12:21 👍 2 🔁 0 💬 1 📌 0
Building benchmarks is only one way scholars can help steer AI development. We can also measure the effects of AI on students, build better datasets, or tune new open models. Openness itself could be our most important contribution. Universities have huge libraries, and the legal doctrine of fair use should protect models trained on those collections for a nonprofit educational purpose. At the moment, we are not pressing this advantage. Higher education has been so cautious about fair use that the private sector can now train more freely on our libraries (via Google Books) than is possible for academic AI researchers. We need to be bolder: It is our duty to ensure library collections remain open to the public in a form that empowers 21st-century readers. If our intellectual heritage gets enclosed in proprietary tools, we will find ourselves making the same bad bargain we made with scientific publishers, who sell our own research back to us at a steep markup.

We're in a strange situation rn where Google can train freely on books from university libraries, but researchers *at* universities have limited access. I'm optimistic this can be fixed, but if you're in admin or working at a foundation, please know: univs are failing here & resources are needed.

26.02.2026 22:20 👍 211 🔁 53 💬 10 📌 3

Only a few more days (full consideration deadline: March 1) to apply to our lecturer position at @cornelltech.bsky.social!

26.02.2026 14:09 👍 1 🔁 2 💬 0 📌 0

Students have perceived problem sets as real and reading as optional because for psets it's much easier to verify that they did something. Ironically, AI has leveled the playing field to a new low baseline.

26.02.2026 12:38 👍 8 🔁 0 💬 0 📌 0
March Sadness 90s Edition A yearly March Madness-style tournament of essays about songs we love (and occasionally loathe)

announcing the March Sadness 1990s edition bracket
marchxness.com - brackets due no later than midnight 2/28/26

25.02.2026 17:36 👍 4 🔁 1 💬 0 📌 1
When Was This War?

Fun game for the history nerds! Note that you'll need to specify CE or BCE sometimes. I got within twenty years for nine out of ten but was out by a century for the other which I'm feeling slightly sheepish about.

when-was-this-war.web.app

25.02.2026 16:43 👍 16 🔁 8 💬 1 📌 3
The ultimate guide to optimizing annotation workflows · Explosion This blog post collects tips and advice for how to build efficient human-in-the-loop data development workflows, break down business problems into actionable annotation steps and make the most of auto...

I've been getting a lot of questions recently about optimizing annotation workflows – many new NLP projects are starting atm! ✨

To share some of our tips, I put together a blog post featuring examples inspired by real use cases and a checklist to help you get started.

explosion.ai/blog/optimiz...

24.02.2026 15:06 👍 18 🔁 4 💬 1 📌 1

US copyright only applies to works with human authors, so terms of service violation is the only possible complaint that Anthropocene can make, right?

Can artists and authors impose terms of service? Or are they restricted to copyright?

24.02.2026 14:55 👍 5 🔁 1 💬 0 📌 0
Welcome! You are invited to join a webinar: 2026 Schmidt Sciences HAVI Webinar . After registering, you will receive a confirmation email about joining the webinar. Schmidt Sciences is requesting proposals to the Humanities and AI Virtual Institute (HAVI), aimed at fostering research in the digital humanities with a particular focus on artificial intelligence. Id...

On Tuesday, Feb. 24 at 1PM eastern, we're hosting a second webinar for the @schmidtsciences.bsky.social HAVI program. Feel free to jump on if you'd like to learn more about the program. Register here: schmidtentities.zoom.us/webinar/regi...

24.02.2026 02:22 👍 5 🔁 3 💬 0 📌 0
Bellwether Postdoctoral Scholar - School of Information University of California, Berkeley is hiring. Apply now!

🚨 HIRING 🚨

The I School invites applications for up to three new full-time Bellwether Postdoctoral Scholars to start as soon as July 2026!

This program will allow researchers to develop their own research while collaborating with leading faculty.

Next review date is Feb 28! #academicsky

23.02.2026 21:30 👍 9 🔁 13 💬 0 📌 2
UW researchers analyzed which anthologized writers and books get checked out the most from Seattle Public Library UW researchers analyzed the checkout data from the last 20 years of the 93 authors included in the post-1945 volume of “The Norton Anthology of American Literature,” which is assigned in U.S....

Nice write-up by @uwnews.uw.edu about our research into the most read canonical American authors in Seattle, drawing on library data.

It was so fun to work on this project with @neel2112.bsky.social and a stellar group of undergraduate students.

www.washington.edu/news/2026/01...

12.01.2026 16:18 👍 35 🔁 10 💬 1 📌 3