PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation (Published: July 20, 2025)
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments (Published: May 28, 2025)
Comparing Framing in Humans and LLMs on Naturally Occurring Texts (Published: February 24, 2025)
SEAM: A Stochastic Benchmark for Multi-Document Tasks (Published: March 07, 2024)
Computation or Weight Adaptation? Rethinking the Role of Plasticity in Learning (Published: March 07, 2024)
Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction (Published: February 21, 2024)
Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution (Published: May 24, 2023)