Blog

Research insights, updates, and thought leadership from the debaterhub team.

Papers We Like

Research that shapes how we think about dialectics in AI — tools for structured argumentation, multi-agent evaluation, and computational epistemology.

Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM

Jingcong Liang, Rong Ye, Meng Han, Ruofei Lai, Xinyu Zhang, Xuanjing Huang, Zhongyu Wei (2024)

A framework for automated debate judging that decomposes evaluation along two axes: iterative chronological analysis (processing speeches one by one with memory rather than consuming the entire debate at once) and multi-dimensional collaboration (separate LLM agents each score a specific dimension). Introduces PanelBench, a benchmark of 100+ competitive debate transcripts.

Why we like it: Their speech-by-speech iterative analysis with memory is the same architectural insight behind our own debate evaluation — long multi-turn debates exceed context windows, so you chunk and summarize incrementally. Their dimensional decomposition (argument, source, language, clash scored separately then combined) is a validated design pattern we benchmark against.
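The chunk-and-summarize judging loop can be sketched in a few lines. Everything below is illustrative, not Debatrix's code: the class and function names are made up, and the LLM calls for summarization and per-dimension scoring are replaced with trivial stand-ins.

```python
from dataclasses import dataclass

@dataclass
class DebateMemory:
    """Rolling summary carried across speeches, so a long debate
    never has to fit in one context window."""
    summary: str = ""

    def update(self, speech: str) -> None:
        # Placeholder for an LLM summarization call; here we just
        # keep a truncated running transcript.
        self.summary = (self.summary + " " + speech)[-500:]

DIMENSIONS = ["argument", "source", "language", "clash"]

def score_dimension(dimension: str, speech: str, memory: DebateMemory) -> float:
    # Stand-in for a per-dimension LLM judge; a real score would come
    # from a model conditioned on the memory and the current speech.
    return len(speech) / 100.0

def judge_debate(speeches: list[str]) -> dict[str, float]:
    memory = DebateMemory()
    totals = {d: 0.0 for d in DIMENSIONS}
    for speech in speeches:          # iterative chronological pass
        for d in DIMENSIONS:         # one specialist judge per dimension
            totals[d] += score_dimension(d, speech, memory)
        memory.update(speech)        # summarize before moving on
    return totals
```

The key design point is that each speech is scored against the memory of what came before, not against the raw full transcript, and the dimensional scores stay separate until the very end.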

Think Before You Speak: Training Language Models With Pause Tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan (2024)

Introduces 'pause-training' — appending learnable pause tokens to give Transformers extra hidden-vector computation steps before committing to output. On a 1B parameter model: 18% EM gain on SQuAD, 8% on CommonSenseQA. Only works when injected during both pretraining and finetuning.

Why we like it: Pause tokens are a computational analog to 'thinking before speaking' — giving the model more internal computation before committing to an argument. For debate, where rebuttal and cross-examination require reasoning about multiple interacting claims, this is directly relevant to argument quality.
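As a rough illustration of the mechanism (not the paper's training code): appending pause tokens just pads the input sequence, and the answer is read only from positions after the final pause, so the model gets extra forward-pass steps before it must commit. The token name and helper functions below are our own invention.

```python
PAUSE_TOKEN = "<pause>"

def with_pauses(input_tokens: list[str], num_pauses: int = 10) -> list[str]:
    """Append M learnable pause tokens, delaying the point at which
    the model must emit its first answer token."""
    return input_tokens + [PAUSE_TOKEN] * num_pauses

def extract_answer(model_outputs: list[str], input_len: int, num_pauses: int) -> list[str]:
    """Outputs produced at pause positions carry no answer content and
    are discarded; decoding of the answer begins after the last pause."""
    return model_outputs[input_len + num_pauses:]
```

The subtlety the paper emphasizes is that this padding only helps when the model was also trained (pretrained and finetuned) with pauses present, so it learns to use the extra steps.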

Agent-as-a-Judge: A Survey

Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li (2026)

Traces the evolution from single-pass LLM-as-a-Judge to Agent-as-a-Judge, where evaluation systems use planning, tool-augmented verification, multi-agent collaboration, and persistent memory. Organizes the field into a taxonomy covering five methodological dimensions.

Why we like it: This is the theoretical framework for what our judging system is becoming. Their three-stage taxonomy (Procedural, Reactive, Self-Evolving) gives us a clear roadmap: our current judge is Procedural, and the path forward is toward judges that adapt rubrics mid-evaluation and learn from past judging outcomes.

AI Must Embrace Specialization via Superhuman Adaptable Intelligence

Judah Goldfeder, Philippe Wyder, Yann LeCun, Ravid Shwartz Ziv (2026)

Challenges the prevailing AGI framework by arguing it misunderstands how intelligence works — humans themselves are not general, they specialize. Introduces Superhuman Adaptable Intelligence (SAI) as a replacement concept: systems that specialize in specific domains, achieve superhuman performance within those specializations, and adapt to fill capability gaps humans cannot.

Why we like it: The SAI framing aligns directly with AI Pluralism — intelligence as an ecology of specialized agents rather than a single general system. If the goal isn't one model that does everything but many models that each excel at something, the question becomes how those specialists interact. That's exactly the dialectical architecture we're building.

STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder (2026)

Replaces stochastic high-temperature sampling in Tree-of-Thoughts with discrete, interpretable textual interventions. A controller selects high-level reasoning actions, a generator produces conditioned steps, and an evaluator scores candidates — yielding greater output diversity and interpretable reasoning traces.

Why we like it: For argument generation, the paper shows that explicit action sequences capture interpretable features highly predictive of output quality. This is exactly the problem our speech pipeline solves with tactic selection and skeleton building — STATe provides a principled framework for why structured reasoning actions outperform temperature-based diversity in argumentation tasks.
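The controller/generator/evaluator split can be sketched as a simple loop. This is our own toy rendering, not STATe's implementation: the action names are hypothetical stand-ins for their structured action templates, and the generator and evaluator are stubs where real LLM calls would go.

```python
# Hypothetical discrete reasoning actions; the real STATe action
# templates are richer than this list.
ACTIONS = ["decompose_claim", "cite_evidence", "raise_counterexample", "synthesize"]

def controller(state: list[str]) -> str:
    # Choose an explicit action for the next step (here: cycle by depth)
    # instead of relying on high-temperature sampling for diversity.
    return ACTIONS[len(state) % len(ACTIONS)]

def generator(state: list[str], action: str) -> str:
    # Stand-in for an LLM producing a reasoning step conditioned
    # on the chosen action and the trace so far.
    return f"{action}({len(state)})"

def evaluator(step: str) -> float:
    # Stand-in scorer; a real evaluator would judge step quality.
    return 1.0 / (1 + len(step))

def structured_reasoning(depth: int = 3) -> list[str]:
    state: list[str] = []
    for _ in range(depth):
        action = controller(state)
        candidates = [generator(state, action) for _ in range(2)]
        state.append(max(candidates, key=evaluator))
    return state
```

The payoff is the trace itself: each step is tagged with the action that produced it, so the reasoning path is inspectable rather than buried in sampled text.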

A Superpersuasive Autonomous Policy Debating System

Allen Roush, Devin Gonier, John Hines, Judah Goldfeder, Philippe Martin Wyder, Sanjay Basu, Ravid Shwartz Ziv (2025)

Introduces DeepDebater, an AI system for full, unmodified two-team competitive policy debates. Uses a hierarchical architecture of specialized multi-agent workflows in which teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks: constructing speeches, conducting cross-examinations, and delivering rebuttals through iterative retrieval, synthesis, and self-correction.

Why we like it: This is us. DeepDebater is the system that became DebaterHub's debate generation pipeline. Expert human debate coaches preferred the arguments and evidence constructed by the system, and all code, transcripts, audio, and video are open-sourced.

Sequencing the Neurome: Towards Scalable Exact Parameter Reconstruction of Black-Box Neural Networks

Judah Goldfeder, Quinten Roets, Gabe Guo, John Wright, Hod Lipson (2024)

Tackles the NP-Hard problem of extracting exact parameters from black-box neural networks with only query access. Exploits the constraint that deployed networks use random initialization plus first-order optimization, and introduces a novel algorithm for generating maximally informative queries. Reconstructs networks with 1.5M+ parameters, up to 7 layers deep, with parameter differences below 0.0001.

Why we like it: If you can reconstruct a neural network's exact weights from its behavior, interpretability becomes a fundamentally different problem. For dialectical AI, this connects to the privacy and transparency questions at the heart of federated M-state aggregation — what can you infer about an agent's training from observing its outputs? Judah is a DebaterHub contributor.
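The paper's algorithm targets deep ReLU networks, where the problem is genuinely hard, but the core idea that query access can pin down exact parameters is easy to see in the degenerate single-layer affine case. This toy sketch is ours, not theirs: for f(x) = Wx + b, one query at zero recovers the bias and one query per basis vector recovers each column of W.

```python
import numpy as np

def extract_linear_layer(f, in_dim: int):
    """Recover W and b of an unknown affine map f(x) = W @ x + b
    from black-box queries alone."""
    b = f(np.zeros(in_dim))          # f(0) = b
    W = np.stack(
        [f(np.eye(in_dim)[i]) - b for i in range(in_dim)],
        axis=1,                      # column i <- query on basis vector e_i
    )
    return W, b
```

Once nonlinearities and depth enter, basis-vector queries no longer suffice, which is exactly why the paper's maximally informative query generation is the interesting contribution.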