Krithika Ramesh

@stolenpyjak

(she/her) ¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯ PhD student @jhuclsp

454
Followers
279
Following
13
Posts
18.11.2024
Joined

Latest posts by Krithika Ramesh @stolenpyjak

PrivateNLP@ACL 2026 Overview Privacy-preserving data analysis has become essential in the age of Large Language Models (LLMs) where access to vast amounts of data can provide gains over tuned algorithms. A large proporti...

Submission deadline extension: until March 19.

Final Call for Papers: PrivateNLP workshop co-located with ACL 2026

See sites.google.com/view/private... for OpenReview submission link and details

05.03.2026 13:42 👍 2 🔁 2 💬 0 📌 0

📅 Deadlines (AoE):
Regular submissions: March 5
Fast-track: March 24
Non-archival: April 7

For questions, please contact: privatenlp26-orga[at]lists.ruhr-uni-bochum.de

24.02.2026 15:42 👍 0 🔁 0 💬 0 📌 0

🔐 Announcing the call for papers for the 7th Workshop on Privacy-Preserving Natural Language Processing at ACL 2026 in San Diego!
If your research lies at the intersection of privacy and NLP, consider submitting to our workshop!

Website: sites.google.com/view/private...

24.02.2026 15:42 👍 2 🔁 1 💬 0 📌 1

First call for papers - Seventh Workshop on Privacy in Natural Language Processing, co-located with ACL 2026, San Diego (CA), USA (and on Zoom)

sites.google.com/view/private...

16.01.2026 12:16 👍 1 🔁 2 💬 0 📌 0

Frustrated with how most of the world’s low-resource languages have NO evaluation resources?

📢 Check out ChiKhaPo, a massively multilingual lexical comprehension and generation benchmark covering 2700+ languages.
www.arxiv.org/abs/2510.16928

24.11.2025 23:41 👍 1 🔁 2 💬 1 📌 0

Led by @stolenpyjak.bsky.social, we built a user-friendly Python package for generating and evaluating privacy-preserving synthetic data! See details in our EMNLP Demo paper:

10.11.2025 06:14 👍 8 🔁 2 💬 0 📌 0

Catch @zihaozhao.bsky.social at today’s poster session (10:30–12) where he'll be presenting SynthTextEval! Stop by if you're interested in synthetic text for high-stakes domains. Zihao also has another EMNLP paper on private text generation, for people interested in this space!
@jhuclsp.bsky.social

07.11.2025 00:55 👍 3 🔁 0 💬 0 📌 0
GitHub - kr-ramesh/synthtexteval: SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration)

SynthTextEval was developed in close collaboration with
Daniel Smolyak, @zihaozhao.bsky.social, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, @anjalief.bsky.social
@jhuclsp.bsky.social @jhucompsci.bsky.social

Stop by to see our work at EMNLP tomorrow, which Zihao will be presenting!

07.11.2025 00:53 👍 2 🔁 1 💬 0 📌 0

SynthTextEval is a comprehensive toolkit for evaluating synthetic text data with a wide range of metrics. It enables standardized, comparable assessments of generation approaches and builds greater confidence in the quality of synthetic data, especially for high-stakes domains.

07.11.2025 00:53 👍 1 🔁 0 💬 1 📌 0

Synthetic data shouldn’t be a black box - we make it easier to examine and identify issues in synthetic data outputs with
- Interactive text exploration & review with our GUI tool
- Exploring text diversity, structure and themes with our visual and descriptive text analyses tools

07.11.2025 00:53 👍 1 🔁 0 💬 1 📌 0

SynthTextEval also supports fine-tuning models for controllable text generation across diverse domains, which allows users to
- Produce text tailored to user-defined styles, content types, or domain labels
- Generate synthetic data with differentially private guarantees

07.11.2025 00:53 👍 1 🔁 0 💬 1 📌 0

🔧 Utility: downstream task-based evaluations (classification, coreference resolution)
📊 Fairness: distributional balance & representational biases
🔐 Privacy: leakage, memorization, and re-identification risk
📜 Quality: distributional differences between synthetic and real text

07.11.2025 00:53 👍 1 🔁 0 💬 1 📌 0

Conventional metrics like BLEU, ROUGE, or perplexity only scratch the surface of synthetic text quality!

Our framework introduces a multi-dimensional evaluation suite that covers aspects such as utility, privacy, fairness and distributional similarity to the real data.

07.11.2025 00:53 👍 1 🔁 0 💬 1 📌 0

🚀 SynthTextEval, our open-source toolkit for generating and evaluating synthetic text data for high-stakes domains, will be featured at EMNLP 2025 as a system demonstration!

GitHub: github.com/kr-ramesh/sy...
Paper 📝: aclanthology.org/2025.emnlp-d...

#EMNLP2025 #EMNLP #SyntheticData

07.11.2025 00:53 👍 13 🔁 3 💬 1 📌 2
SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerou...

Thank you to @anjalief.bsky.social for advising. Want hands-on experience with DP-SGD? Start with another paper and open-source package of ours
(arxiv.org/abs/2507.07229
github.com/kr-ramesh/sy...)

15.10.2025 20:23 👍 2 🔁 1 💬 0 📌 0
Controlled Generation for Private Synthetic Text Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privac...

🔗 Paper & code
Paper is accepted to EMNLP 2025 Main
arXiv: arxiv.org/abs/2509.25729
Code: github.com/zzhao71/Cont...
#SyntheticData #Privacy #NLP #LLM #Deidentification #HealthcareAI #LLM

15.10.2025 20:23 👍 2 🔁 1 💬 1 📌 0

Take a look at this EMNLP 2025 paper by @zihaozhao.bsky.social, which proposes novel methods for generating high-utility, privacy-preserving synthetic text!

16.10.2025 02:39 👍 1 🔁 0 💬 0 📌 0

‼️‼️

08.07.2025 16:04 👍 1 🔁 0 💬 0 📌 0

This hypothesis says that 1) multilingual generation uses a model-internal task-solving→translation cascade, and 2) failure of the translation stage *despite task-solving success* accounts for a large part of the problem. That is, the model often solves the task but fails to articulate the answer.

04.07.2025 17:04 👍 2 🔁 1 💬 1 📌 0

⁉️

18.06.2025 02:09 👍 1 🔁 0 💬 0 📌 0

We know that speech LID systems flunk on accented speech. But why? And what can we do about it? 🤔
Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of the features the model relies on, and proposes a fix.

07.06.2025 17:27 👍 6 🔁 3 💬 1 📌 0
Hplm (Historical Perspectival LM): org profile for Historical Perspectival LM on Hugging Face

Go find new linguistic changes, compare corpora, and invent!
huggingface.co/Hplm
arxiv.org/abs/2504.05523

15.04.2025 12:45 👍 19 🔁 3 💬 0 📌 1

Historical analysis is a good example, as historical periods can get lost in blended information from different eras. Finetuning large models isn't enough: they “leak” future/modern concepts, making historical analysis impossible. Did you know cars existed in the 1800s? 🤦

15.04.2025 12:45 👍 12 🔁 1 💬 1 📌 0
Pretraining Language Models for Diachronic Linguistic Change Discovery Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and lit...

arxiv.org/abs/2504.05523

Typical Large Language Models (LLMs) are trained on massive, mixed datasets, so the model's behaviour can't be linked to a specific subset of the pretraining data, or, in our case, to specific time eras.

15.04.2025 12:45 👍 15 🔁 3 💬 1 📌 0

How should the humanities leverage LLMs?
▶️Domain-specific pretraining!

Pretraining models can be a research tool; it's cheaper than LoRA, and it allows studying
💠grammatical change
💠emergent word senses
💠who knows what more…

Train on your data with our pipeline or use ours!
#AI #LLM 🤖📈

15.04.2025 12:45 👍 46 🔁 12 💬 2 📌 7

Dialects lie on continua of (structured) linguistic variation, right? And we can’t collect data for every point on the continuum...🤔
📢 Check out DialUp, a technique to make your MT model robust to the dialect continua of its training languages, including unseen dialects.
arxiv.org/abs/2501.16581

27.02.2025 02:44 👍 13 🔁 5 💬 1 📌 1
MASC 2025 Call for Locations Are you able to host MASC this year, sometime in Spring 2025? Responsibilities include: Space for ~150 ish people Managing the review process (really just paper submissions) Organizing the event Choo...

Form here: forms.gle/6DRkaP1CTMYk...

16.12.2024 21:26 👍 1 🔁 1 💬 0 📌 0

📢 Want to host MASC 2025?

The 12th Mid-Atlantic Student Colloquium is a one-day event bringing together students, faculty, and researchers from universities and industry in the Mid-Atlantic.

Please submit this very short form if you are interested in hosting! Deadline January 6th. #MASC2025

16.12.2024 21:19 👍 10 🔁 5 💬 1 📌 2

📢 It's PhD admissions season! 🎓

The PhD admissions process is stressful! 😅

Want a behind-the-scenes look at the process? 👀✨ You have questions, we have answers. 📝🤝

Watch my Admissions AMA for @jhuclsp.

https://youtu.be/YlwpIPFNXjo?si=O7n5QwGT5sQdpg7u

01.12.2024 23:02 👍 13 🔁 2 💬 0 📌 0

I'm super excited about this program and happy to connect if you're interested in working with me through it!

20.11.2024 19:23 👍 25 🔁 11 💬 0 📌 0