What EACL 2026 reveals about the next phase of computational linguistics: multilingual agents, evaluation, and language diversity

For the past few years, a lot of AI discussion has centered on scale. Bigger models, bigger datasets, bigger claims. But when I looked through EACL 2026, I came away with a different impression. The most interesting story was not just that language technology is getting more powerful. It was that computational linguistics is becoming more demanding about what counts as progress.

This year’s conference suggests that the field is entering a new phase. Researchers are paying closer attention to multilingual evaluation, cross-linguistic reliability, and the gap between fluent output and genuine linguistic competence. EACL 2026 includes hundreds of long papers, short papers, demos, findings papers, and workshops, but what stands out is the kind of questions those papers are asking. Increasingly, the field is less satisfied with asking whether a model performs well on a benchmark and more interested in whether that benchmark actually tells us anything meaningful.

That shift matters. Computational linguistics has reached a point where sounding convincing is no longer enough. A model may generate polished text, but that does not mean it reasons well, generalizes across languages, or works fairly across different linguistic communities. EACL 2026 reflects a growing awareness of that problem. Its program includes sessions on multilingual reliability, multilingual diversity and resource-aware scaling, historical and multiscript language processing, and evaluation under stress testing. Even one of the plenary talks, “Omnilinguality: Scaling AI to Any Language,” points directly to the conference’s broader focus. (2026.eacl.org)

Moving past the obsession with scale

Public conversations about AI still tend to reward scale. That makes sense to a point. Larger systems often do unlock new capabilities. But EACL 2026 suggests that the next phase of computational linguistics may be shaped less by model size and more by whether models can be evaluated honestly across languages and contexts.

That is one reason the First Workshop on Multilingual Multicultural Evaluation caught my attention. Its goal is not simply to add more languages to existing benchmarks. It focuses on improving multilingual evaluation in terms of accuracy, scalability, comparability, and fairness, while also incorporating cultural and social perspectives. That is a deeper challenge. It asks not only whether our systems work in many languages, but whether our methods for judging them are themselves too narrow.

As a student who is also trying to learn how research in computational linguistics actually works, I think this is one of the most important developments right now. Multilingual NLP has sometimes been treated as English NLP extended outward. Translate the task, rerun the benchmark, report the score. But language diversity is not that simple. Languages differ in structure, meaning-making, and social use. If our evaluation methods smooth over those differences, then our conclusions about model ability may be misleading from the start.

Multilingual agents are raising the stakes

EACL 2026 also makes clear that agents are no longer just a product trend. They are becoming a serious evaluation problem for computational linguistics. Once language models are expected to act as assistants, judges, or multi-step decision makers across languages, the question becomes whether their behavior remains reliable when the language changes.

One paper that stood out to me was MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators. The paper starts with a striking issue: LLMs are increasingly being used to evaluate dialogue quality, but many of the benchmarks for testing those evaluators are static, outdated, and not very multilingual. MEDAL addresses this by generating multilingual dialogues with multiple LLMs and studying how well strong models can judge them. The authors find real cross-lingual differences and show that even strong judge models struggle with nuanced issues like empathy, common sense, and relevance. (aclanthology.org)
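The core setup here, comparing an LLM judge’s scores against human ratings per language, can be sketched in a few lines. This is a minimal illustration of the evaluation pattern, not MEDAL’s actual method: `judge_score` is a hypothetical stand-in for a model call, and the tiny corpora and human scores are invented for the sketch.

```python
# Sketch: comparing a stub "LLM judge" against human ratings, per language.
# judge_score is a placeholder; a real setup would call an LLM API here.

def judge_score(dialogue: str, dimension: str) -> int:
    """Stand-in for an LLM judge: rate a dialogue 1-5 on one quality dimension."""
    # Toy heuristic so the sketch runs; a real judge would be a model call.
    return 5 if dimension in dialogue.lower() else 3

def agreement(dialogues, human_scores, dimension):
    """Fraction of dialogues where judge and human land within 1 point."""
    hits = 0
    for d, h in zip(dialogues, human_scores):
        if abs(judge_score(d, dimension) - h) <= 1:
            hits += 1
    return hits / len(dialogues)

# Per-language comparison: the same judge can track human ratings well in one
# language and drift in another, which is the kind of gap MEDAL measures.
corpora = {
    "en": (["That shows real empathy.", "Totally irrelevant reply."], [5, 2]),
    "de": (["Das war eine passende Antwort.", "Das ergibt keinen Sinn."], [5, 1]),
}
for lang, (dialogues, humans) in corpora.items():
    print(lang, agreement(dialogues, humans, "empathy"))
```

The interesting number is not any single score but the spread between languages: a judge that agrees with humans in English and not in German is exactly the failure mode a multilingual judge benchmark is built to expose.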

What makes this especially interesting is that it reveals a second layer of uncertainty. We already worry about whether language models produce good outputs. Now we also have to worry about whether language models can reliably evaluate other language models, especially across languages. That is a very computational linguistics problem. It sits at the intersection of dialogue, evaluation, pragmatics, and multilinguality. It also shows how weaknesses do not disappear when models are placed in evaluative roles. They can become built into the systems we trust to judge quality.

Evaluation is becoming central, not secondary

If I had to summarize one message from EACL 2026, it would be this: evaluation is no longer a side issue. It is becoming one of the field’s central concerns.

A good example is Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning. This paper addresses one of the biggest benchmark problems today: contamination. If models have already seen benchmark data during training, then high scores become much harder to interpret. The authors respond by introducing a new benchmark based on a text-based trading card game, with English and Arabic versions and adjustable difficulty. Their findings show that performance drops as difficulty increases, that model size does not map neatly onto strategic ability, and that a notable gap remains between English and Arabic performance. (aclanthology.org)
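One way to make the contamination argument concrete: if benchmark items are generated fresh at evaluation time from a parameterized template, the model cannot have memorized them, and a difficulty knob comes for free. The sketch below uses a much simpler card-comparison puzzle than TCG-Bench’s actual game; the generator and its fields are illustrative assumptions, not the paper’s design.

```python
# Sketch: a contamination-resistant, difficulty-scalable benchmark generator.
# Items are created at evaluation time, so they cannot appear in training data;
# "difficulty" scales how many cards the model must reason over.
# The card framing is a simplified stand-in for TCG-Bench, not its rules.

import random

def make_item(difficulty: int, rng: random.Random):
    """Generate one puzzle: pick the card with the highest attack + defense."""
    n_cards = 2 + 2 * difficulty          # more cards = harder comparison
    cards = [(f"card{i}", rng.randint(1, 9), rng.randint(1, 9))
             for i in range(n_cards)]
    best = max(cards, key=lambda c: c[1] + c[2])
    question = ("Which card has the highest total of attack and defense? "
                + ", ".join(f"{n} (atk {a}, def {d})" for n, a, d in cards))
    return question, best[0]

rng = random.Random(0)                    # seed only for reproducible runs
question, answer = make_item(difficulty=1, rng=rng)
print(question)
print("gold:", answer)
```

Because the gold answer is computed alongside the question, scoring is mechanical, and regenerating the benchmark in another language only requires translating the template, not collecting new data.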

That matters because it reflects a larger change in the field’s mindset. It is no longer enough for a benchmark to be widely used or easy to cite. It has to be trustworthy. If a model performs well because it has effectively memorized familiar patterns, then benchmark success may tell us less about reasoning than we think.

Another EACL 2026 paper pushes this idea even further. Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It argues that benchmark success can be fragile and overly dependent on wording, framing, and context. The authors call for process-oriented evaluation rather than relying only on static outcome-based metrics. That is an important shift. The field is becoming less interested in whether a model happened to get the answer right and more interested in what kind of reasoning, if any, led to that answer. (aclanthology.org)

To me, that is one of the healthiest signs in current computational linguistics. A stronger evaluation culture makes a field more precise. It also makes it harder for hype to stand in for evidence.

Language diversity is moving to the center

The other major pattern I noticed at EACL 2026 is that language diversity is being treated less like a side topic and more like a core research challenge. You can see that just from the workshops: African NLP, languages using Arabic script, low-resource language models, low-resource machine translation, Turkic languages, similar languages and dialects, field linguistics, and the Iranian language family. This is not a small corner of the conference. It is a substantial part of the conversation. (aclanthology.org)

One paper that captures this especially well is Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas. The authors evaluate five major model families on 13 Indigenous languages across tasks including language identification, cloze completion, and grammatical feature classification. Their results show major variation across both languages and model families, with many combinations performing near chance. That is a useful reminder that claims of multilingual capability often hide a much less even reality. (aclanthology.org)

What I like about this paper is that it treats underrepresented languages as serious tests of linguistic competence, not as afterthoughts. The authors note that many Indigenous languages include rich morphology and nonstandardized orthographies, which complicate both tokenization and evaluation. These are not just difficult edge cases. They are important cases for understanding whether models have learned anything linguistically meaningful beyond high-resource patterns.

A related example is CETVEL, a benchmark for Turkish that evaluates language understanding, generation, and cultural capacity. What stands out here is not just the breadth of the benchmark, but the fact that it includes Turkish history, idiomatic usage, and culturally grounded content. The paper also finds that Turkish-centric instruction-tuned models can underperform broader multilingual or general-purpose models. That complicates the simple assumption that a more language-specific model is automatically a better one. It suggests that language-specific evaluation needs to be culturally grounded and methodologically strong if it is going to tell us something useful. (aclanthology.org)

What this says about the field

So what does EACL 2026 reveal about the next phase of computational linguistics?

To me, it reveals a field that is becoming more multilingual, more skeptical, and more serious about methodology. The excitement around large language models is still there, but conferences like this suggest that researchers are becoming less willing to accept easy narratives about progress. Instead, they are asking where models fail, how evaluation breaks down, and which linguistic communities are still being underserved.

It also suggests that computational linguistics is reclaiming some of its deeper identity. At its best, this field is not just about generating fluent text. It is about studying language carefully enough to build technologies that are interpretable, robust, and responsive to real linguistic diversity. EACL 2026 feels like evidence of that shift.

The next phase of computational linguistics may not be defined by the loudest demo or the largest model. It may be defined by who can evaluate language technology most honestly across languages, cultures, and communicative settings. For me, that is an encouraging direction. It leaves room for the kinds of questions that made me interested in this field in the first place: What does it mean for a model to know a language? What counts as understanding across different linguistic communities? And how do we design evaluations that respect the fact that language is never uniform? EACL 2026 does not answer all of those questions. But it makes them much harder to ignore.


References

Association for Computational Linguistics. “19th Conference of the European Chapter of the Association for Computational Linguistics.” ACL Anthology, 2026. (aclanthology.org)

EACL 2026 Organizers. “Conference Overview.” EACL 2026. (2026.eacl.org)

EACL 2026 Organizers. “Workshops.” EACL 2026. (2026.eacl.org)

Mendonça, John, Alon Lavie, and Isabel Trancoso. “MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators.” Findings of the Association for Computational Linguistics: EACL 2026.

Alrashed, Sultan, Jianghui Wang, and Francesco Orabona. “Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning.” Findings of the Association for Computational Linguistics: EACL 2026.

Vasselli, Justin, Arturo Mp, Frederikus Hudi, Haruki Sakajo, and Taro Watanabe. “Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas.” Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers).

Er, Abrek, Ilker Kesen, Gözde Gül Şahin, and Aykut Erdem. “CETVEL: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish.” Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).

Mousavi, Seyed Mahed, Edoardo Cecchinato, Lucia Horníková, and Giuseppe Riccardi. “Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It.” Findings of the Association for Computational Linguistics: EACL 2026, pages 1747–1759.

— Andrew


CES 2026 and the Illusion of Understanding in Agentic AI

At CES 2026, nearly every major technology company promised the same thing in different words: assistants that finally understand us. These systems were not just answering questions. They were booking reservations, managing homes, summarizing daily life, and acting on a user’s behalf. The message was unmistakable. Language models had moved beyond conversation and into agency.

Yet watching these demonstrations felt familiar in an uncomfortable way. I have seen this confidence before, often at moments when language systems appear fluent while remaining fragile underneath. CES 2026 did not convince me that machines now understand human language. Instead, it exposed how quickly our expectations have outpaced our theories of meaning.

When an assistant takes action, language stops being a surface interface. It becomes a proxy for intent, context, preference, and consequence. That shift raises the bar for computational linguistics in ways that polished demos rarely acknowledge.

From chatting to acting: why agents raise the bar

Traditional conversational systems can afford to be wrong in relatively harmless ways. A vague or incorrect answer is frustrating but contained. Agentic systems are different. When language triggers actions, misunderstandings propagate into the real world.

From a computational linguistics perspective, this changes the problem itself. Language is no longer mapped only to responses but to plans. Commands encode goals, constraints, and assumptions that are often implicit. A request like “handle this later” presupposes shared context, temporal reasoning, and an understanding of what “this” refers to. These are discourse problems, not engineering edge cases.

This distinction echoes long-standing insights in linguistics. Winograd’s classic examples showed that surface structure alone is insufficient for understanding even simple sentences once world knowledge and intention are involved (Winograd). Agentic assistants bring that challenge back, this time with real consequences attached.

Instruction decomposition is not understanding

Many systems highlighted at CES rely on instruction decomposition. A user prompt is broken into smaller steps that are executed sequentially. While effective in constrained settings, this approach is often mistaken for genuine understanding.

Decomposition works best when goals are explicit and stable. Real users’ goals are neither. Goals evolve mid-interaction. Preferences conflict with past behavior. Instructions are underspecified. Linguistics has long studied these phenomena under pragmatics, where meaning depends on speaker intention, shared knowledge, and conversational norms (Grice).

Breaking an instruction into steps does not resolve ambiguity. It merely postpones it. Without a model of why a user said something, systems struggle to recover when their assumptions are wrong. Most agentic failures are not catastrophic. They are subtle misalignments that accumulate quietly.
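The "ambiguity is postponed, not resolved" point can be made concrete with a toy decomposer. Everything below is illustrative, not a real agent API: the example shows that splitting “handle this later” into steps simply carries the unresolved referent and the unresolved time into the plan as empty slots.

```python
# Sketch: a toy instruction decomposer. The point is that splitting a command
# into steps does not resolve what is underspecified; the unknowns are simply
# carried into each step as empty slots. Names here are illustrative.

def decompose(instruction: str):
    """Split an instruction into executable-looking steps, flagging gaps."""
    steps, unresolved = [], []
    words = instruction.split()
    if "this" in words:
        unresolved.append("referent of 'this'")     # deixis: which object?
        steps.append({"action": "resolve_referent", "target": None})
    if "later" in words:
        unresolved.append("time meant by 'later'")  # temporal underspecification
        steps.append({"action": "schedule", "when": None})
    steps.append({"action": "execute", "task": instruction})
    return steps, unresolved

steps, gaps = decompose("handle this later")
print(gaps)   # the decomposition ran, but the hard questions are still open
```

The plan looks actionable, yet both `target` and `when` are `None`: any system executing it must still answer the pragmatic questions the decomposition never touched.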

Long-term memory is a discourse problem, not a storage problem

CES 2026 placed heavy emphasis on memory and personalization. Assistants now claim to remember preferences, habits, and prior conversations. The implicit assumption is that more memory leads to better understanding.

In linguistics, memory is not simple accumulation. It is interpretation. Discourse coherence depends on salience, relevance, and revision. Humans forget aggressively, reinterpret past statements, and update beliefs about one another constantly. Storing embeddings of prior interactions does not replicate this process.

Research in discourse representation theory shows that meaning emerges through structured updates to a shared model of the world, not through raw recall alone (Kamp and Reyle). Long-context language models still struggle with this distinction. They can retrieve earlier information but often fail to decide what should matter now.
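The storage-versus-interpretation contrast can be shown in a few lines. This is a deliberately minimal sketch with invented names, not a model of discourse representation theory: one store accumulates every utterance and returns them all on recall, while the other revises a single shared belief when the user’s preference changes.

```python
# Sketch contrasting two "memories" under a preference update. Raw recall
# keeps every statement and can surface a stale one; a discourse-style store
# revises the shared model instead of accumulating. Names are illustrative.

def raw_recall(log, query_word):
    """Return every stored utterance containing the query word, oldest first."""
    return [u for u in log if query_word in u]

def discourse_update(state, slot, value):
    """Revise the shared model: the new value replaces, not joins, the old one."""
    state[slot] = value
    return state

log = []
state = {}

log.append("user prefers window seats")
discourse_update(state, "seat", "window")

log.append("user now prefers aisle seats")
discourse_update(state, "seat", "aisle")

print(raw_recall(log, "seats"))   # both statements, contradiction included
print(state["seat"])              # one current belief: 'aisle'
```

A retrieval-based memory answers the question "what did the user say about seats?" with a contradiction; the discourse-style store answers "what does the user want now?" That second question is the one an acting assistant actually needs.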

Multimodality does not remove ambiguity

Many CES demonstrations leaned heavily on multimodal interfaces. Visuals, screens, and gestures were presented as solutions to linguistic ambiguity. In practice, ambiguity persists even when more modalities are added.

Classic problems such as deixis remain unresolved. A command like “put that there” still requires assumptions about attention, intention, and relevance. Visual input often increases the number of possible referents rather than narrowing them. More context does not automatically produce clearer meaning.

Research on multimodal grounding consistently shows that aligning language with perception is difficult precisely because human communication relies on shared assumptions rather than exhaustive specification (Clark). Agentic systems inherit this challenge rather than escaping it.

Evaluation is the quiet failure point

Perhaps the most concerning gap revealed by CES 2026 is evaluation. Success is typically defined as task completion. Did the system book the table? Did the lights turn on? These metrics ignore whether the system actually understood the user or simply arrived at the correct outcome by chance.

Computational linguistics has repeatedly warned against narrow benchmarks that mask shallow competence. Metrics such as BLEU reward surface similarity while missing semantic failure (Papineni et al.). Agentic systems risk repeating this mistake at a higher level.

A system that completes a task while violating user intent is not truly successful. Meaningful evaluation must account for repair behavior, user satisfaction, and long-term trust. These are linguistic and social dimensions, not merely engineering ones.
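The gap between the two evaluation views is easy to show on paper. The episode fields below are invented for the sketch; the point is that a CES-style completion rate and a stricter intent-aware rate can diverge sharply on the same episodes.

```python
# Sketch: two evaluation views of the same agent episodes. "Completion" is the
# CES-style metric; the stricter view also checks whether the user's stated
# constraints were respected. Episode fields are illustrative.

episodes = [
    {"task_done": True,  "constraints_kept": True},   # booked, as asked
    {"task_done": True,  "constraints_kept": False},  # booked, but wrong time
    {"task_done": False, "constraints_kept": True},   # paused to confirm first
]

def completion_rate(eps):
    """Fraction of episodes where the task was completed at all."""
    return sum(e["task_done"] for e in eps) / len(eps)

def intent_success_rate(eps):
    """Success = task done AND the user's stated constraints respected."""
    return sum(e["task_done"] and e["constraints_kept"] for e in eps) / len(eps)

print(round(completion_rate(episodes), 2))      # 0.67: looks strong
print(round(intent_success_rate(episodes), 2))  # 0.33: the stricter view
```

The same system scores 67 percent or 33 percent depending on which question the benchmark asks, which is precisely why defining success as task completion alone is a quiet failure point.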

CES as a mirror for the field

CES 2026 showcased ambition, not resolution. Agentic assistants highlight how far language technology has progressed, but they also expose unresolved questions at the heart of computational linguistics. Fluency is not understanding. Memory is not interpretation. Action is not comprehension.

If agentic AI is the future, then advances will depend less on making models larger and more on how deeply we understand language, context, and human intent.


References

Clark, Herbert H. Using Language. Cambridge University Press, 1996.

Grice, H. P. “Logic and Conversation.” Syntax and Semantics, vol. 3, edited by Peter Cole and Jerry L. Morgan, Academic Press, 1975, pp. 41–58.

Kamp, Hans, and Uwe Reyle. From Discourse to Logic. Springer, 1993.

Papineni, Kishore, et al. “BLEU: A Method for Automatic Evaluation of Machine Translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

Winograd, Terry. “Understanding Natural Language.” Cognitive Psychology, vol. 3, no. 1, 1972, pp. 1–191.

— Andrew


Ex Machina Gears Up for VEX Worlds 2026 in St. Louis

After an incredible season last year, when our team, Ex Machina, competed at the VEX Robotics World Championship 2025, I’m excited to share that we’re back for another season! I’ll be returning as a member of Ex Machina, building on everything we learned together at the global championship.


A New Season, A New Challenge

This year’s game for the VEX V5 Robotics Competition has been announced, and it looks both challenging and fun. Here is the official game reveal video so you can see what teams will be working on this season:

Watch the VEX V5 Robotics Competition 2026 Game Reveal

From the initial reveal, I can already tell that strategy, design innovation, and precise teamwork will be key to succeeding this year.


Balancing Robotics and College Applications

This season is going to be especially busy for me and my teammates. As rising seniors, we’re all deep into the college application process. Between essays, interviews, and preparing for upcoming deadlines, our schedules are definitely packed. But despite the workload, we’ve all decided to continue competing. Robotics has been such an important part of our high school journey, and we’re passionate about pushing ourselves further as a team in our final season together.


VEX Worlds 2026 Heads to St. Louis

There’s another big change this year: for 2026, the VEX Robotics World Championship is moving to St. Louis, Missouri! For the past few years, the event was held in Dallas, Texas, so this will be a new experience for everyone.

The championship will be held in April 2026 at the America’s Center Convention Complex in downtown St. Louis, with specific dates to be announced later. You can read more details about the upcoming event on the REC Foundation’s official page.

Here is a video introducing VEX Worlds 2026 in St. Louis to get you excited for what’s ahead:

VEX Robotics World Championship Heads to St. Louis in 2026


Looking Ahead

It feels both exciting and bittersweet to enter my final year of high school robotics. I know the journey ahead will be intense with balancing robot design, programming, and competition prep alongside college applications, but I’m ready for the challenge.

I’ll keep sharing updates about our season as we start building and competing, so stay tuned to see how Ex Machina continues to grow in 2026.

— Andrew
