Sebastian Majstorovic

@storytracer.com

Open Data Consultant for @eleutherai.bsky.social & Digital History Advisor for @eui-history.bsky.social. Website: https://www.storytracer.com/

803 followers · 210 following · 55 posts · joined 19.08.2023

Latest posts by Sebastian Majstorovic @storytracer.com

Independence National Historical Park - A Hopeful Update We recently posted about the takedown of signs at the President’s House site at Independence National Historical Park. We also posted a call for more photos for Save Our Signs, both before and after…

New DRP post: #Philly interpretive panels returned to the President's House. We congratulate the local community that fought HARD for their return. We hope this inspires other communities to #SaveOurSigns

www.datarescueproject.org/independence...

27.02.2026 14:01 👍 15 🔁 5 💬 0 📌 1
Decorative images with a screenshot of tiny.iiif in the background, and image server choices (as text labels) in the foreground: IIPImage, Cantaloupe.

Small update to #tinyIIIF, my no-nonsense #IIIF server! You can now choose your image server during setup:

• IIPImage
• Cantaloupe

Running small- to mid-sized collections? Teaching with IIIF materials? Building IIIF-enabled tools? Check out tiny.iiif!

github.com/rsimon/tiny-...

#DigitalFriday

20.02.2026 09:12 👍 7 🔁 4 💬 1 📌 0
Screenshot of old vs. new OCR. The old OCR text is garbled; the new OCR is much cleaner.

Re-OCR'd the complete 1771 Encyclopaedia Britannica (2,724 pages) with a single command on @hf.co Jobs.

- 0.9B model (GLM-OCR)
- ~$0.002/page
- ~$5 total on an L4 GPU

Before (old Tesseract OCR) → After

19.02.2026 11:29 👍 96 🔁 16 💬 5 📌 6

Get ready for live updates and quotes from @lyndamk.bsky.social and @mikalarae.bsky.social's #IDCC26 Keynote!

18.02.2026 13:53 👍 10 🔁 3 💬 7 📌 1
The Waldseemüller map in liiive (a IIIF annotation tool).

Yay! The first image served from my new #tinyIIIF test instance.

I'm running it on a 2 CPU/4GB RAM VM. Seems to be the absolute minimum & performance is pretty slow.

If anyone has advice on which specs I'd need to get more #IIIF speed out of Cantaloupe – let me know!

17.02.2026 08:43 👍 2 🔁 1 💬 0 📌 0
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

13.02.2026 19:27 👍 22 🔁 12 💬 1 📌 0
Meme from the Simpsons, reading "0 days without needing to save data from the US federal government."

It's official. One year ago today we formalized as the Data Rescue Project. We know this work is exhausting and are so grateful to everyone who has shown up, day in and day out.

05.02.2026 17:57 👍 28 🔁 3 💬 1 📌 2
Research Data Access and Preservation Association - 2025 RDAP Work of the Year Award

🎉Congrats to the winners of the 2025 RDAP Work of the Year Award, The Data Rescue Project! (@datarescueproject.org)🛟🏆

This award acknowledges the work’s impact on the wider research and scholarly communication ecosystem in support of RDAP’s mission and values.

rdapassociation.org/news/13593532

03.02.2026 18:37 👍 20 🔁 7 💬 0 📌 3

“Burning the Books: A History of the Deliberate Destruction of Knowledge” by @richove.bsky.social .

31.01.2026 01:04 👍 6 🔁 0 💬 1 📌 0
SciOp - Public Information Preservation Preserving Public Information

@sucho-org.bsky.social has been preserving Ukrainian cultural heritage data since 2022. @safeguardingdata.bsky.social is an international group of volunteers backing up US data and distributing them as torrents on sciop.net.

31.01.2026 00:11 👍 4 🔁 1 💬 0 📌 0

Say hi! to the wonderful people doing the @datarescueproject.org AMA tonight! @quetzal1234.bsky.social @nurnberger.bsky.social @storytracer.com Tess

Just missing for now:
@mikalarae.bsky.social and @katscade.bsky.social

30.01.2026 23:41 👍 13 🔁 2 💬 2 📌 0

We are always triaging at-risk datasets based on different factors: is there an executive order targeting a specific agency, do we have requests from data users or associations like @icpsr.bsky.social, and has someone else already backed it up?

30.01.2026 23:43 👍 2 🔁 0 💬 0 📌 0

In an evening with Claude Code, I built a 2.5MB image classifier that runs in the browser.

I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬.

It finds illustrated pages in historical books. No server. No GPU.

19.12.2025 12:08 👍 86 🔁 18 💬 2 📌 1
Diagram illustrating the BookReconciler workflow. On the left, a book cover of The Book of Salt by Monique Truong appears alongside “Minimal Metadata,” listing Author: Truong, Monique and Title: The Book of Salt. An arrow points to a box labeled “BookReconciler” with book and diamond icons. A downward arrow leads to “Enriched + Clustered Metadata,” showing multiple editions of the book cover and expanded metadata, including several ISBNs, subject headings (e.g., Vietnamese–France fiction, women authors, household employees, gay men, cooking), and an author VIAF identifier.

Very happy to introduce a new tool, BookReconciler!

You can take spreadsheets with book data and add subject headings, descriptions, ISBNs, HathiTrust IDs, & more. You can also cluster editions & variations of the same "Work."

Led by @thisismattmiller.com and supported by @post45data.bsky.social.

17.12.2025 21:37 👍 123 🔁 56 💬 7 📌 1
Awards: The umbrella association Bibliothek & Information Deutschland (BID) e. V. has awarded the 2025 Karl Preusker Medal to the Ukrainian Library Association. “This honors the extraordinary...

SUCHO has received the 2025 Karl Preusker medal from Bibliothek & Information Deutschland (BID)! The jury selected us as an example of “courage, solidarity, professional excellence, and the central role of libraries, archives, and digital infrastructures in the resilience of democratic societies.”

12.12.2025 20:39 👍 22 🔁 2 💬 2 📌 0

Working with a small museum/archive/project with digitized images but little metadata?

I'm looking for testers for a VLM pipeline for auto-enrichment (transcription, captions, tags, IIIF). If you share a few sample images, I'll run them through + share results. Would love to hear your feedback!

03.12.2025 18:05 👍 10 🔁 5 💬 3 📌 3

Thank you to @datarescueproject.org for publishing this blog post by @kdeeds.bsky.social and myself on GovScape! Extremely grateful to @datarescueproject.org for all their incredible work!

02.12.2025 17:40 👍 5 🔁 2 💬 0 📌 0
Guest Post: GovScape: A Public Search System for 10+ Million Government PDFs This week's guest post is from Benjamin Charles Germain Lee, Assistant Professor at the University of Washington, and Kyle Deeds, Assistant Professor at Boston University. Learn more about their recen...

This is amazing from Benjamin Charles Germain Lee, @kdeeds.bsky.social, and @datarescueproject.org: www.datarescueproject.org/guest-post-g...

02.12.2025 16:25 👍 5 🔁 2 💬 0 📌 0

At the AI4LAM Fantastic Futures conference this week

Happy to chat about @hf.co, open source AI for GLAMs, or why cultural heritage should bet on small, focused models over closed-source giants!

DM or find me at breaks! #AI4LAM #FF2025

01.12.2025 11:13 👍 15 🔁 3 💬 0 📌 0
Podcast Series "Meet a Historian" A podcast brought to you by the ERC projects CAPASIA and ECOINT How do Historians write History in the 21st century?  In this podcast series, early career researchers at the European University Inst...

How do historians write history in the #21stcentury?

🎙️ In the podcast series Meet a Historian, our PhD researchers engage in conversations with some of today’s most innovative historians 👉 loom.ly/IWAoLc8

Brought to you by our CAPASIA 👉 loom.ly/O6KOqFo and @ecointeui.bsky.social research projects

28.11.2025 14:31 👍 6 🔁 2 💬 0 📌 0
Revised news release dates following the 2025 lapse in appropriations

BREAKING NEWS
Bureau of Labor Statistics announced cancellations of several key data releases

🔺 Job Openings and Labor Turnover (JOLTS)
🔺 Employment Situation
🔺 Consumer Price Index
... and more

This has knock-on effects for other products, like GDP (produced at BEA)

www.bls.gov/bls/2025-lap...

21.11.2025 16:50 👍 4 🔁 3 💬 1 📌 1
Screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images.

hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN=HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4

Building datasets to train smaller, task-focused models used to be incredibly time-consuming.

Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!

Try it yourself: huggingface.co/datasets/uv-...!

21.11.2025 13:30 👍 51 🔁 12 💬 1 📌 0

1/ Announcing GovScape – a public search system for 10 million U.S. government PDFs (70 million pages)! GovScape offers visual search, semantic text search, and keyword search. Explore below:

Website: www.govscape.net
ArXiv link: arxiv.org/abs/2511.11010

18.11.2025 20:19 👍 80 🔁 35 💬 3 📌 4

The @mozilla.org team has done a spectacular job for MozFest 2025. If you’re also in Barcelona and would like to chat about Open Data and Open Source AI, send me a DM. I’m here until Monday! #mozfest #mozfest2025 #mozillafestival #mozilla #opensource #ai

07.11.2025 14:36 👍 10 🔁 0 💬 1 📌 0
Data Rescue Project receives support from the John D. and Catherine T. MacArthur Foundation to support data rescue efforts FOR IMMEDIATE RELEASE Since launching in February 2025, the Data Rescue Project has grown substantially. At this point, the DRP has enabled the rescue of more than 1,000 datasets from US Federal…

The John D. and Catherine T. MacArthur Foundation has generously awarded us funding to secure our own storage. This critical processing space will be instrumental in ensuring that large datasets can be temporarily stored, curated, and described.

Thank you, MacArthur Foundation, for your support!

04.11.2025 17:20 👍 74 🔁 20 💬 0 📌 8

Members of our Steering Committee, @lyndamk.bsky.social and @storytracer.com, are in Strasbourg, France, today and tomorrow to talk about the DRP at Numérique en Commune[s]. Some of the earliest interest in our work came from the French media, so it is exciting to be here.

29.10.2025 13:50 👍 5 🔁 2 💬 0 📌 0
Screenshot of the Viabundus website.

A neat tool I just came across: Viabundus, a digital road map of northern Europe 1350-1650, that lets you calculate contemporary travel routes/times. In 1500, going Amiens → Köln by horse took almost 7 days and 13 toll payments.

#medievalsky

www.landesgeschichte.uni-goettingen.de/handelsstras...

24.10.2025 22:58 👍 988 🔁 378 💬 27 📌 47

Stanford created a similar tool for the Roman Empire more than a decade ago: orbis.stanford.edu. ORBIS lets you calculate travel times by land, river, and sea, with options for different modes of transport and travel speeds. It's truly an amazing resource and I'm so grateful they keep hosting it.

25.10.2025 16:19 👍 38 🔁 8 💬 1 📌 2

Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.

15.10.2025 15:55 👍 3 🔁 1 💬 0 📌 0

We are honored to receive an NDSA Digital Preservation Excellence award. In accepting the award, @lyndamk.bsky.social expressed how this work is only possible due to our volunteers who "have spent countless hours working to ensure that public data remains a public good that is publicly accessible."

10.10.2025 20:25 👍 24 🔁 7 💬 1 📌 1