Blog posts

2025

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

less than 1 minute read

Published: July 20, 2025

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

less than 1 minute read

Published: May 28, 2025

WildIFEval: Instruction Following in the Wild

less than 1 minute read

Published: March 09, 2025

Comparing the Framing Effect in Humans and LLMs on Naturally Occurring Texts

less than 1 minute read

Published: February 24, 2025

2024

SEAM: A Stochastic Benchmark for Multi-Document Tasks

less than 1 minute read

Published: March 07, 2024

Computation or Weight Adaptation? Rethinking the Role of Plasticity in Learning

less than 1 minute read

Published: March 07, 2024

Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction

less than 1 minute read

Published: February 21, 2024

2023

Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution

less than 1 minute read

Published: May 24, 2023