
Yapei Chang

@yapeichang

โ˜๏ธ phd in progress @ UMD | ๐Ÿ”— https://lilakk.github.io/

2,760
Followers
653
Following
34
Posts
03.10.2023
Joined

Latest posts by Yapei Chang @yapeichang

Paper: arxiv.org/pdf/2505.11080
Code: github.com/lilakk/BLEUB... (coming soon)

Work done with the amazing @yekyung.bsky.social from UMD, Michael Krumdick from Kensho, Amir Zadeh and Chuan Li from LambdaAI,
@chriswtanner.bsky.social from Kensho, and @miyyer.bsky.social from UMD

20.05.2025 16:25 👍 0 🔁 0 💬 0 📌 0
Post image

Beyond benchmarks, human annotators rate BLEUBERI outputs as comparable to those from GRPO-RM models.

20.05.2025 16:25 👍 0 🔁 0 💬 1 📌 0
Post image

Qualitatively, BLEUBERI models produce more factually grounded outputs, as measured by VeriScore on three diverse datasets. VeriScore extracts verifiable claims from responses and checks each one against Google Search.

20.05.2025 16:25 👍 0 🔁 0 💬 1 📌 0
Post image

The surprising effectiveness of BLEU extends to training. BLEUBERI first selects 5K low-BLEU examples, then trains LLMs with GRPO using BLEU as the reward. BLEUBERI models are competitive with those trained with GRPO-RM (8B) and SFT across 4 benchmarks.
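For intuition on how a string metric can drive GRPO: each sampled completion's BLEU score against the reference answer serves as its reward, and GRPO converts each group's rewards into relative advantages. A minimal stdlib sketch (the function name is mine, not from the paper's training code):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO's advantage: normalize each completion's reward against the
    mean and std of its group (all samples for the same prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all samples tied: no learning signal
    return [(r - mean) / std for r in rewards]
```

With BLEU rewards of, say, [0.2, 0.6] for two samples of one prompt, the lower-BLEU completion gets a negative advantage and is pushed down relative to the higher-BLEU one.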

20.05.2025 16:25 👍 0 🔁 0 💬 1 📌 0
Post image

When BLEU agrees with humans on a pair of model outputs, what n-grams contribute to this decision? Below is an example where it captures both format (the โ€œUkrainianโ€ and โ€œEnglishโ€ headers) and factuality (the number 6.1).
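Inspecting which n-grams drive such a decision is straightforward: the clipped n-gram matches are exactly what BLEU's modified precision counts. A small illustrative helper (names are my own):

```python
from collections import Counter

def ngram_counts(text, n):
    """Contiguous n-gram counts of a whitespace-tokenized string."""
    toks = text.split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def matched_ngrams(candidate, reference, n):
    """The clipped n-gram matches that feed BLEU's modified precision:
    each n-gram shared with the reference, with its clipped count."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    return {g: min(c, ref[g]) for g, c in cand.items() if g in ref}
```

Running this over a preferred/dispreferred pair surfaces which format markers or factual tokens the metric is rewarding.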

20.05.2025 16:25 👍 0 🔁 0 💬 1 📌 0
Post image

BLEU is often dismissed for weak human correlation in generation tasks. But on general instruction following, using BLEU to rank pairs of Chatbot Arena outputs, scored against references from strong LLMs, matches 8B & 27B reward models in human agreement, especially with more refs.
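A sketch of the pairwise setup: score each output in a pair against the same references and prefer the higher one, then measure agreement with human labels. The paper uses BLEU; the unigram-precision scorer below is only a self-contained stand-in with the same signature:

```python
def unigram_precision(output, references):
    """Illustrative stand-in scorer (the paper uses BLEU): fraction of
    output tokens found in the best-matching reference."""
    toks = output.split()
    if not toks:
        return 0.0
    return max(sum(t in set(ref.split()) for t in toks) / len(toks)
               for ref in references)

def metric_prefers(output_a, output_b, references, score=unigram_precision):
    """Pick the preferred output in a pair by scoring each against the
    same references; agreement is then the fraction of pairs where this
    choice matches the human label."""
    sa, sb = score(output_a, references), score(output_b, references)
    return "tie" if sa == sb else ("a" if sa > sb else "b")
```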

20.05.2025 16:25 👍 0 🔁 0 💬 1 📌 0
Post image

BLEU is widely used for machine translation (MT) eval. Given a reference and a generation, it computes modified n-gram precision (1–4 grams) and applies a brevity penalty to penalize short outputs. If given multiple references, it takes the max match per n-gram.
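That recipe fits in a few lines. This is an illustrative minimal implementation (with light smoothing so a single zero match doesn't zero the score), not the exact sacreBLEU configuration:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Contiguous n-gram counts of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence BLEU: geometric mean of modified 1-4-gram precisions
    times a brevity penalty. Multi-reference clipping takes the max
    count of each n-gram across references."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        counts = ngram_counts(cand, n)
        if not counts:
            return 0.0  # candidate shorter than n tokens
        clip = Counter()
        for ref in refs:
            for gram, c in ngram_counts(ref, n).items():
                clip[gram] = max(clip[gram], c)
        matched = sum(min(c, clip[gram]) for gram, c in counts.items())
        total = sum(counts.values())
        # Light smoothing so one zero precision doesn't zero the score.
        prec = matched / total if matched else 1.0 / (2 * total)
        log_prec_sum += math.log(prec)
    # Brevity penalty against the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda L: (abs(L - len(cand)), L))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```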

20.05.2025 16:25 👍 0 🔁 0 💬 1 📌 0

🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment?

🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO.

🫐 Introducing BLEUBERI:

20.05.2025 16:25 👍 5 🔁 1 💬 1 📌 1

🕵️‍♀️ agents are strong on many tasks, but are they good at interacting with the web? 🧸 our BEARCUBS benchmark shows that they struggle on interactive tasks that seem trivial to humans! 📄 check out the paper for how to build robust evaluations & directions for future agent research

12.03.2025 14:40 👍 2 🔁 0 💬 0 📌 0
Post image

Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇

05.03.2025 17:06 👍 14 🔁 5 💬 1 📌 3

current models struggle with complex long-range reasoning tasks 📚 how can we reliably create synthetic training data?

💽 check out CLIPPER, a pipeline that generates data by conditioning on compressed forms of long input documents!

21.02.2025 16:30 👍 8 🔁 0 💬 0 📌 0
Post image

People often claim they know when ChatGPT wrote something, but are they as accurate as they think?

Turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯

28.01.2025 14:55 👍 189 🔁 66 💬 10 📌 19
Preview
Finally, a Replacement for BERT: Introducing ModernBERT We're on a journey to advance and democratize artificial intelligence through open source and open science.

Great blog post (by a 15-author team!) on their release of ModernBERT, the continuing relevance of encoder-only models, and how they relate to, say, GPT-4/llama. Accessible enough that I might use this as an undergrad reading.

19.12.2024 19:11 👍 75 🔁 19 💬 1 📌 1
GitHub to Plain Text Converter Convert GitHub repositories to plain text files easily. Transform code into a single formatted text file.

i've been using this one: repo2txt.simplebasedomain.com it also lets you filter by file type and supports private/local repos

08.12.2024 02:55 👍 2 🔁 0 💬 0 📌 0
Post image

🚨 I too am on the job market ‼️🤯

I'm searching for faculty positions/postdocs in multilingual/multicultural NLP, vision+language models, and eval for genAI!

I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat there!

Papers in 🧵, see more: saxon.me

06.12.2024 01:44 👍 49 🔁 9 💬 1 📌 2
😵 fish washed up on the shore of walden pond

๐Ÿ  what monday feels like..

02.12.2024 23:46 👍 8 🔁 0 💬 0 📌 0

private closed-source evals are the future 🫣

26.11.2024 20:37 👍 2 🔁 0 💬 0 📌 0
Tommy Guerrero Best Of | 最高の ("The Best") YouTube video by partedoparque

www.youtube.com/watch?v=afQT...

25.11.2024 23:17 👍 2 🔁 0 💬 0 📌 0
arxiv-utils Chrome web store


i knew something like this had to exist but why did i only discover it now?? no more suffering from looking at my 10+ open arxiv tabs not knowing which one is which...

25.11.2024 21:22 👍 27 🔁 3 💬 0 📌 1

🙋🏻‍♀️

23.11.2024 22:19 👍 1 🔁 0 💬 0 📌 0

I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux

Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!

23.11.2024 19:54 👍 176 🔁 54 💬 101 📌 4

i also got 10/10! the ones that rhyme too well feel very AI to me..

21.11.2024 16:51 👍 2 🔁 0 💬 1 📌 0

such a creative way of using long-context models! this sounds like a super hard evaluation task, but gemini is already so good at it...

21.11.2024 15:04 👍 5 🔁 0 💬 1 📌 0
A plot showing that reranking improves recall as the number of reranked docs increases, but with diminishing returns and eventually a performance dip.

Mat is not on 🦋; posting on his behalf!

It's time to revisit common assumptions in IR! Embeddings have improved drastically, but mainstream IR evals have stagnated since MSMARCO + BEIR.

We ask: on private or tricky IR tasks, are rerankers better? Surely, reranking many docs is best?

20.11.2024 19:44 👍 81 🔁 23 💬 4 📌 5

llms are now training humans with data from their distribution

19.11.2024 03:47 👍 5 🔁 0 💬 1 📌 0

The soul-searching journey for figuring out what research area is right for you is tricky since so many papers are cool. I tell my early career students that they should try to differentiate papers that they'd like to read 📖, implement 🔨, *and* write 📝 from papers that they'd only like to read 📖.

18.11.2024 23:32 👍 67 🔁 11 💬 4 📌 0
Post image Post image Post image Post image

#EMNLP2024 was fun 🍹 now brainstorming ideas for #EMNLP2025 🙇🏻‍♀️

17.11.2024 22:58 👍 4 🔁 0 💬 0 📌 0
Post image

airbnb >>> hotel for conferences #EMNLP2024

17.11.2024 01:28 👍 4 🔁 0 💬 0 📌 0
Abhilasha Ravichander - Home

✨I am on the faculty job market in the 2024-2025 cycle!✨

My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.

If you have relevant positions, let me know! lasharavichander.github.io Please share/RT!

11.11.2024 14:23 👍 51 🔁 22 💬 2 📌 1