Eval4SD at KONVENS 2026 · Hamburg, Germany · September 2026

News

  • February 2026 — Workshop accepted at KONVENS 2026, Hamburg.
  • TBD — The Call for Papers will be published soon. Submissions will open via OpenReview.

About

In this workshop, we want to showcase efforts to measure the reliability of LLMs for specialized or niche applications.

A wide range of scientific and scholarly disciplines rely on LLMs, but in many cases, the reliability and quality of LLM outputs receive only limited attention. While large-scale benchmark datasets exist for mainstream tasks, resources for specialized domains are often scarce. Existing work has shown limited generalization of LLMs on variations of mainstream benchmark datasets and has explored implicit training effects in prompt engineering. Such results raise concerns that these and similar effects may obscure poor reliability on out-of-distribution data in specialized domains.

The workshop’s scope includes benchmarking LLMs on specialized tasks, replicating existing domain research, and exploring evaluation methodology. The workshop is open to all application domains, including but not limited to digital humanities, social sciences, law, and medicine. We understand the term LLM in the broadest sense, allowing, for example, the investigation of small models, encoder-only models, or other types of language models.

Topics of Interest

Following the general theme described above, we have identified three core research directions and invite submissions on these topics (but not limited to them):

  • LLM Benchmarking
  • Domain Research Replication
  • Metrics and Evaluation Methodology

For further details, see our Call for Papers.