Ever trusted a metric that works great on average, only for it to fail in your specific use case?
In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.
aclanthology.org/2025.finding...
#NLP #Evaluation
(🧵1/9)
29.04.2025 17:10