ICLR 2025 was so much fun!
Curious about fine-grained text-to-image model evaluation? Come see our spotlight paper on Gecko 🦎 in the afternoon poster session at #ICLR25
📍Hall 3 + Hall 2B #359
🗓️Friday 3pm
ICLR: iclr.cc/virtual/2025...
Paper: arxiv.org/abs/2404.16820
Prompts: github.com/google-deepm...
Why do LLMs hallucinate with RAG?! 🤔
Find out at my #ICLR25 poster on Sufficient Context! 👇🏼
📍Hall 3 + Hall 2B #230
β° Fri 25 Apr 10 a.m. to 12:30 p.m.
Happy to chat with anyone at ICLR about RAG, LLMs, and factuality!
When RAG systems hallucinate, is the LLM misusing available information, or is the retrieved context insufficient? In our #ICLR2025 paper, we introduce "sufficient context" to disentangle these failure modes. Work with Jianyi Zhang, Chun-Sung Ferng, Da-Cheng Juan, Ankur Taly, @cyroid.bsky.social
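The failure-mode split described above can be pictured as a tiny labeling function. This is purely an illustrative sketch of the framing, not code or terminology from the paper:

```python
def classify_rag_failure(context_sufficient: bool, answer_correct: bool) -> str:
    """Illustrative sketch of the 'sufficient context' failure-mode split.

    context_sufficient: does the retrieved context contain enough
        information to answer the query?
    answer_correct: did the LLM answer the query correctly?
    """
    if answer_correct:
        return "correct"
    # A wrong answer is attributed to either the model or the retrieval
    if context_sufficient:
        return "model misuses available information"
    return "retrieved context is insufficient"
```

In practice, judging whether context is sufficient is itself a nontrivial labeling problem; the sketch only shows how the two failure modes are disentangled once that judgment is made.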
1/2 Just a reminder about the Google Research Scholar Program, which provides up to $60K in unrestricted gifts to recognize early-career professors and support world-class research at institutions around the world. This year, we are particularly interested in the following research areas...
[6/6] The other idea is to do the weighted combination at an instance level. We look at intermediate layers for *each token* and slightly modify the overall distribution. This leads to consistent accuracy improvements for many models and datasets!
Would love to see some theory on why this works!
[5/6] Here's a nice example. We want to do some math. Greedy decoding leads to 5 x $10 = $50 for the overtime pay. This is because A x B = C is a common pattern, but we really need A x B x C = D to get the answer. SLED can help here because the internal layers happen to predict 'x' instead of '='.
[4/6] Our main decoding trick is to use a weighted combination of *all of the layers*. Precisely, we project the layers into the same output distribution (over vocab tokens). Then we combine the intermediate "logits" with the output logits based on our estimate of the LLM's internal knowledge.
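The layer-fusion idea in [4/6] can be sketched roughly as follows. The weighting heuristic here is purely illustrative (the actual SLED update rule is in the paper); the shared output projection is the usual "logit lens" trick of reusing the unembedding matrix on intermediate hidden states:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combine_layer_logits(layer_hiddens, W_out, final_logits, alpha=0.1):
    """Rough sketch of layer-wise logit fusion (not the exact SLED rule).

    layer_hiddens: list of (d,) hidden states from intermediate layers
    W_out: (d, V) shared output projection applied to every layer
    final_logits: (V,) logits from the last layer
    alpha: how strongly the fused intermediate estimate nudges the output
    """
    # Project each intermediate layer into vocabulary space
    inter = np.stack([h @ W_out for h in layer_hiddens])  # (L, V)
    # Weight layers by agreement with the final prediction (illustrative heuristic)
    weights = softmax(inter @ softmax(final_logits))  # (L,)
    fused = (weights[:, None] * inter).sum(axis=0)  # (V,)
    # Nudge the final distribution toward the fused intermediate estimate
    return (1 - alpha) * final_logits + alpha * fused
```

The key design point is that all layers are compared in the *same* space (vocab logits), so intermediate "knowledge" can directly reweight the final token distribution.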
[3/6] The key observation is that LLMs "know" a lot more than they "tell" -- basically, the training process can favor more popular tokens (in the dataset) rather than more accurate predictions for the query at hand.
So we can exploit this at decoding time...
[2/6] Joint work with Jianyi Zhang · Da-Cheng Juan · Chun-Sung Ferng · Heinrich Jiang · Yiran Chen
ArXiv paper: arxiv.org/abs/2411.02433
Project page: jayzhang42.github.io/sled_page/
GitHub: github.com/JayZhang42/S...
But how does it work, you ask?
Longer thread about our new factuality decoding method, SLED, at NeurIPS 2024. Main idea: freeze the model, but be thoughtful about the decoding. With a small amount of extra inference-time compute, we increase accuracy by 3% on several benchmarks! SLED helps for all major open-source models!
First shameless plug -- our new factuality decoding method, SLED, gets SOTA improvements on 14+ models (Llama 2/3, Gemma, Mistral) & 9 benchmarks!
See our #NeurIPS2024 poster today (Friday) in the East Exhibit Hall A-C #3311
Hi friends!🩷
I have never done this, but I'm making a list so I can keep in touch with all of you more easily🫶🏻
Please like this or say hi if I can add you🥰 Thanks🫶🏻
Everyone I spoke to at @rl-conference.bsky.social last summer agreed that it was one of the best conferences ever for an RL researcher... So many great RL-focused papers!
CFP is out, send your work here!
Excited to try out bluesky and chat about GenAI and ML theory!