What EACL 2026 reveals about the next phase of computational linguistics: multilingual agents, evaluation, and language diversity

For the past few years, a lot of AI discussion has centered on scale. Bigger models, bigger datasets, bigger claims. But when I looked through the EACL 2026 program, I came away with a different impression. The most interesting story was not just that language technology is getting more powerful. It was that computational linguistics is becoming more demanding about what counts as progress.

This year’s conference suggests that the field is entering a new phase. Researchers are paying closer attention to multilingual evaluation, cross-linguistic reliability, and the gap between fluent output and genuine linguistic competence. EACL 2026 includes hundreds of long papers, short papers, demos, findings papers, and workshops, but what stands out is the kinds of questions those papers are asking. Increasingly, the field is less satisfied with asking whether a model performs well on a benchmark and more interested in whether that benchmark actually tells us anything meaningful.

That shift matters. Computational linguistics has reached a point where sounding convincing is no longer enough. A model may generate polished text, but that does not mean it reasons well, generalizes across languages, or works fairly across different linguistic communities. EACL 2026 reflects a growing awareness of that problem. Its program includes sessions on multilingual reliability, multilingual diversity and resource-aware scaling, historical and multiscript language processing, and evaluation under stress testing. Even one of the plenary talks, “Omnilinguality, Scaling AI to Any Language,” points directly to the conference’s broader focus. (2026.eacl.org)

Moving past the obsession with scale

Public conversations about AI still tend to reward scale. That makes sense to a point. Larger systems often do unlock new capabilities. But EACL 2026 suggests that the next phase of computational linguistics may be shaped less by model size and more by whether models can be evaluated honestly across languages and contexts.

That is one reason the First Workshop on Multilingual Multicultural Evaluation caught my attention. Its goal is not simply to add more languages to existing benchmarks. It focuses on improving multilingual evaluation in terms of accuracy, scalability, comparability, and fairness, while also incorporating cultural and social perspectives. That is a deeper challenge. It asks not only whether our systems work in many languages, but whether our methods for judging them are themselves too narrow.

As a student who is also trying to learn how research in computational linguistics actually works, I think this is one of the most important developments right now. Multilingual NLP has sometimes been treated as English NLP extended outward. Translate the task, rerun the benchmark, report the score. But language diversity is not that simple. Languages differ in structure, meaning-making, and social use. If our evaluation methods smooth over those differences, then our conclusions about model ability may be misleading from the start.

Multilingual agents are raising the stakes

EACL 2026 also makes clear that agents are no longer just a product trend. They are becoming a serious evaluation problem for computational linguistics. Once language models are expected to act as assistants, judges, or multi-step decision makers across languages, the question becomes whether their behavior remains reliable when the language changes.

One paper that stood out to me was MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators. The paper starts with a striking issue: LLMs are increasingly being used to evaluate dialogue quality, but many of the benchmarks for testing those evaluators are static, outdated, and not very multilingual. MEDAL addresses this by generating multilingual dialogues with multiple LLMs and studying how well strong models can judge them. The authors find real cross-lingual differences and show that even strong judge models struggle with nuanced issues like empathy, common sense, and relevance. (aclanthology.org)

What makes this especially interesting is that it reveals a second layer of uncertainty. We already worry about whether language models produce good outputs. Now we also have to worry about whether language models can reliably evaluate other language models, especially across languages. That is a very computational linguistics problem. It sits at the intersection of dialogue, evaluation, pragmatics, and multilinguality. It also shows how weaknesses do not disappear when models are placed in evaluative roles. They can become built into the systems we trust to judge quality.

Evaluation is becoming central, not secondary

If I had to summarize one message from EACL 2026, it would be this: evaluation is no longer a side issue. It is becoming one of the field’s central concerns.

A good example is Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning. This paper addresses one of the biggest benchmark problems today: contamination. If models have already seen benchmark data during training, then high scores become much harder to interpret. The authors respond by introducing a new benchmark based on a text-based trading card game, with English and Arabic versions and adjustable difficulty. Their findings show that performance drops as difficulty increases, that model size does not map neatly onto strategic ability, and that a notable gap remains between English and Arabic performance. (aclanthology.org)

That matters because it reflects a larger change in the field’s mindset. It is no longer enough for a benchmark to be widely used or easy to cite. It has to be trustworthy. If a model performs well because it has effectively memorized familiar patterns, then benchmark success may tell us less about reasoning than we think.

Another EACL 2026 paper pushes this idea even further. Garbage In, Reasoning Out? Why Benchmark Scores are Misleading for LLM Social Reasoning argues that benchmark success can be fragile and overly dependent on wording, framing, and context. The authors call for process-oriented evaluation rather than relying only on static outcome-based metrics. That is an important shift. The field is becoming less interested in whether a model happened to get the answer right and more interested in what kind of reasoning, if any, led to that answer. (aclanthology.org)

To me, that is one of the healthiest signs in current computational linguistics. A stronger evaluation culture makes a field more precise. It also makes it harder for hype to stand in for evidence.

Language diversity is moving to the center

The other major pattern I noticed at EACL 2026 is that language diversity is being treated less like a side topic and more like a core research challenge. You can see that just from the workshops: African NLP, languages using Arabic script, low-resource language models, low-resource machine translation, Turkic languages, similar languages and dialects, field linguistics, and the Iranian language family. This is not a small corner of the conference. It is a substantial part of the conversation. (aclanthology.org)

One paper that captures this especially well is Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas. The authors evaluate five major model families on 13 Indigenous languages across tasks including language identification, cloze completion, and grammatical feature classification. Their results show major variation across both languages and model families, with many combinations performing near chance. That is a useful reminder that claims of multilingual capability often hide a much less even reality. (aclanthology.org)

What I like about this paper is that it treats underrepresented languages as serious tests of linguistic competence, not as afterthoughts. The authors note that many Indigenous languages include rich morphology and nonstandardized orthographies, which complicate both tokenization and evaluation. These are not just difficult edge cases. They are important cases for understanding whether models have learned anything linguistically meaningful beyond high-resource patterns.

A related example is CETVEL, a benchmark for Turkish that evaluates language understanding, generation, and cultural capacity. What stands out here is not just the breadth of the benchmark, but the fact that it includes Turkish history, idiomatic usage, and culturally grounded content. The paper also finds that Turkish-centric instruction-tuned models can underperform broader multilingual or general-purpose models. That complicates the simple assumption that a more language-specific model is automatically a better one. It suggests that language-specific evaluation needs to be culturally grounded and methodologically strong if it is going to tell us something useful. (aclanthology.org)

What this says about the field

So what does EACL 2026 reveal about the next phase of computational linguistics?

To me, it reveals a field that is becoming more multilingual, more skeptical, and more serious about methodology. The excitement around large language models is still there, but conferences like this suggest that researchers are becoming less willing to accept easy narratives about progress. Instead, they are asking where models fail, how evaluation breaks down, and which linguistic communities are still being underserved.

It also suggests that computational linguistics is reclaiming some of its deeper identity. At its best, this field is not just about generating fluent text. It is about studying language carefully enough to build technologies that are interpretable, robust, and responsive to real linguistic diversity. EACL 2026 feels like evidence of that shift.

The next phase of computational linguistics may not be defined by the loudest demo or the largest model. It may be defined by who can evaluate language technology most honestly across languages, cultures, and communicative settings. For me, that is an encouraging direction. It leaves room for the kinds of questions that made me interested in this field in the first place: What does it mean for a model to know a language? What counts as understanding across different linguistic communities? And how do we design evaluations that respect the fact that language is never uniform? EACL 2026 does not answer all of those questions. But it makes them much harder to ignore.


References

Association for Computational Linguistics. “19th Conference of the European Chapter of the Association for Computational Linguistics.” ACL Anthology, 2026. (aclanthology.org)

EACL 2026 Organizers. “Conference Overview.” EACL 2026. (2026.eacl.org)

EACL 2026 Organizers. “Workshops.” EACL 2026. (2026.eacl.org)

Mendonça, John, Alon Lavie, and Isabel Trancoso. “MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators.” Findings of the Association for Computational Linguistics: EACL 2026.

Alrashed, Sultan, Jianghui Wang, and Francesco Orabona. “Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning.” Findings of the Association for Computational Linguistics: EACL 2026.

Vasselli, Justin, Arturo Mp, Frederikus Hudi, Haruki Sakajo, and Taro Watanabe. “Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas.” Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers).

Er, Abrek, Ilker Kesen, Gözde Gül Şahin, and Aykut Erdem. “CETVEL: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish.” Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).

Mousavi, Seyed Mahed, Edoardo Cecchinato, Lucia Horníková, and Giuseppe Riccardi. “Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It.” Findings of the Association for Computational Linguistics: EACL 2026, pages 1747–1759.

— Andrew


We Submitted Our ACL 2026 DravidianLangTech Paper on Hope Speech Detection in Tulu

I’m happy to share that we have submitted our paper, “cantnlp@DravidianLangTech 2026: Organic Domain Adaptation Improves Multi-Class Hope Speech Detection in Tulu,” to the Sixth Workshop on Speech and Language Technologies for Dravidian Languages (DravidianLangTech-2026), which is co-located with ACL 2026 in San Diego. ACL 2026 is scheduled for July 2 to July 7, 2026, with workshops on July 3 and 4.

What the shared task is about

Our submission is part of the Hope Speech Detection shared task at DravidianLangTech-2026. The task focuses on identifying hopeful, encouraging, and supportive language in social media text, with a particular emphasis on code-mixed Tulu. That makes it both socially meaningful and technically challenging, especially because Tulu remains a low-resource language in NLP.

What our paper explores

Our paper studies how organic domain adaptation can improve multi-class hope speech detection in Tulu. In low-resource settings, even small domain mismatches can hurt performance, and code-mixed data adds another layer of difficulty. This project looks at how better adaptation strategies can help models generalize more effectively in that setting.

Why this matters

I find this work exciting because it sits at the intersection of low-resource NLP, code-mixed language processing, and socially useful language technology. Hope speech detection is not just a classification problem. It also connects to broader questions about how NLP systems can support healthier online spaces and extend research attention to languages that are often underrepresented.

Acknowledgments

I’m the first author of this submission, and I’m very grateful to my co-author and mentor, Dr. Sidney Wong. His guidance and support were central to both the research process and the writing of the paper.

What comes next

The paper was submitted by the March 5, 2026 shared-task paper deadline, so it is now under review. I’m looking forward to seeing the outcome and, hopefully, sharing more about the project in the months ahead. No matter what happens, this has already been a valuable experience in working on Tulu NLP and contributing to research on Dravidian languages.


— Andrew


CES 2026 and the Illusion of Understanding in Agentic AI

At CES 2026, nearly every major technology company promised the same thing in different words: assistants that finally understand us. These systems were not just answering questions. They were booking reservations, managing homes, summarizing daily life, and acting on a user’s behalf. The message was unmistakable. Language models had moved beyond conversation and into agency.

Yet watching these demonstrations felt familiar in an uncomfortable way. I have seen this confidence before, often at moments when language systems appear fluent while remaining fragile underneath. CES 2026 did not convince me that machines now understand human language. Instead, it exposed how quickly our expectations have outpaced our theories of meaning.

When an assistant takes action, language stops being a surface interface. It becomes a proxy for intent, context, preference, and consequence. That shift raises the bar for computational linguistics in ways that polished demos rarely acknowledge.

From chatting to acting: why agents raise the bar

Traditional conversational systems can afford to be wrong in relatively harmless ways. A vague or incorrect answer is frustrating but contained. Agentic systems are different. When language triggers actions, misunderstandings propagate into the real world.

From a computational linguistics perspective, this changes the problem itself. Language is no longer mapped only to responses but to plans. Commands encode goals, constraints, and assumptions that are often implicit. A request like “handle this later” presupposes shared context, temporal reasoning, and an understanding of what “this” refers to. These are discourse problems, not engineering edge cases.

This distinction echoes long-standing insights in linguistics. Winograd’s classic examples showed that surface structure alone is insufficient for understanding even simple sentences once world knowledge and intention are involved (Winograd). Agentic assistants bring that challenge back, this time with real consequences attached.

Instruction decomposition is not understanding

Many systems highlighted at CES rely on instruction decomposition. A user prompt is broken into smaller steps that are executed sequentially. While effective in constrained settings, this approach is often mistaken for genuine understanding.

Decomposition works best when goals are explicit and stable. Real users are neither. Goals evolve mid-interaction. Preferences conflict with past behavior. Instructions are underspecified. Linguistics has long studied these phenomena under pragmatics, where meaning depends on speaker intention, shared knowledge, and conversational norms (Grice).

Breaking an instruction into steps does not resolve ambiguity. It merely postpones it. Without a model of why a user said something, systems struggle to recover when their assumptions are wrong. Most agentic failures are not catastrophic. They are subtle misalignments that accumulate quietly.
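The point can be made concrete with a toy sketch. Everything below is my own construction for illustration, not how any system shown at CES actually works: a naive decomposer splits a request into sub-steps, and a simple check shows that every sub-step still carries deictic terms the system would need context to resolve.

```python
import re

# Context-dependent words a planner cannot resolve from the text alone.
DEICTIC = {"this", "that", "it", "later", "there"}

def decompose(request: str) -> list[str]:
    """Naively split a request into sub-steps on 'and' / 'then'."""
    parts = re.split(r"\band then\b|\bthen\b|\band\b", request)
    return [p.strip() for p in parts if p.strip()]

def unresolved(step: str) -> set[str]:
    """Deictic words still present in a sub-step after decomposition."""
    return DEICTIC & set(re.findall(r"[a-z']+", step.lower()))

request = "archive this and remind me about it later"
for step in decompose(request):
    print(step, "->", sorted(unresolved(step)))
# archive this -> ['this']
# remind me about it later -> ['it', 'later']
```

The split succeeds, but “this,” “it,” and “later” survive into the sub-steps untouched: the ambiguity has been relocated, not resolved.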

Long-term memory is a discourse problem, not a storage problem

CES 2026 placed heavy emphasis on memory and personalization. Assistants now claim to remember preferences, habits, and prior conversations. The implicit assumption is that more memory leads to better understanding.

In linguistics, memory is not simple accumulation. It is interpretation. Discourse coherence depends on salience, relevance, and revision. Humans forget aggressively, reinterpret past statements, and update beliefs about one another constantly. Storing embeddings of prior interactions does not replicate this process.

Research in discourse representation theory shows that meaning emerges through structured updates to a shared model of the world, not through raw recall alone (Kamp and Reyle). Long-context language models still struggle with this distinction. They can retrieve earlier information but often fail to decide what should matter now.

Multimodality does not remove ambiguity

Many CES demonstrations leaned heavily on multimodal interfaces. Visuals, screens, and gestures were presented as solutions to linguistic ambiguity. In practice, ambiguity persists even when more modalities are added.

Classic problems such as deixis remain unresolved. A command like “put that there” still requires assumptions about attention, intention, and relevance. Visual input often increases the number of possible referents rather than narrowing them. More context does not automatically produce clearer meaning.

Research on multimodal grounding consistently shows that aligning language with perception is difficult precisely because human communication relies on shared assumptions rather than exhaustive specification (Clark). Agentic systems inherit this challenge rather than escaping it.

Evaluation is the quiet failure point

Perhaps the most concerning gap revealed by CES 2026 is evaluation. Success is typically defined as task completion. Did the system book the table? Did the lights turn on? These metrics ignore whether the system actually understood the user or simply arrived at the correct outcome by chance.

Computational linguistics has repeatedly warned against narrow benchmarks that mask shallow competence. Metrics such as BLEU reward surface similarity while missing semantic failure (Papineni et al.). Agentic systems risk repeating this mistake at a higher level.
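To make the criticism concrete, here is a minimal sketch. This is my own toy reimplementation of the BLEU idea (bigram overlap plus a brevity penalty), not the official scoring script: it rewards a candidate that negates the reference while scoring a faithful paraphrase at zero.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, candidate, max_n=2):
    """Toy BLEU-style score: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    bp = 1.0 if len(cand) >= len(ref) else exp(1 - len(ref) / len(cand))
    return bp * geo_mean

ref = "the flight was on time"
bad = "the flight was not on time"      # opposite meaning, heavy word overlap
good = "the plane arrived punctually"   # same meaning, no word overlap
print(round(simple_bleu(ref, bad), 3))  # 0.707
print(simple_bleu(ref, good))           # 0.0
```

The negated candidate scores highly because it shares most of its words and bigrams with the reference; the correct paraphrase scores zero because it shares none. Surface similarity is exactly what the metric measures, and exactly what agentic evaluation cannot afford to rely on.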

A system that completes a task while violating user intent is not truly successful. Meaningful evaluation must account for repair behavior, user satisfaction, and long-term trust. These are linguistic and social dimensions, not merely engineering ones.

CES as a mirror for the field

CES 2026 showcased ambition, not resolution. Agentic assistants highlight how far language technology has progressed, but they also expose unresolved questions at the heart of computational linguistics. Fluency is not understanding. Memory is not interpretation. Action is not comprehension.

If agentic AI is the future, then advances will depend less on making models larger and more on how deeply we understand language, context, and human intent.


References

Clark, Herbert H. Using Language. Cambridge University Press, 1996.

Grice, H. P. “Logic and Conversation.” Syntax and Semantics, vol. 3, edited by Peter Cole and Jerry L. Morgan, Academic Press, 1975, pp. 41–58.

Kamp, Hans, and Uwe Reyle. From Discourse to Logic. Springer, 1993.

Papineni, Kishore, et al. “BLEU: A Method for Automatic Evaluation of Machine Translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

Winograd, Terry. “Understanding Natural Language.” Cognitive Psychology, vol. 3, no. 1, 1972, pp. 1–191.

— Andrew


Attending SCiL 2025: My First In-Person Computational Linguistics Conference at the University of Oregon

This July, I had the amazing opportunity to attend the 2025 Society for Computation in Linguistics (SCiL) conference, held at the University of Oregon in Eugene from July 18 to 20. This wasn’t just my first academic conference in person. It was also my first time attending a conference where I was (surprisingly) the only high school student in the room.


Road Trip to Eugene and My Badge Moment

My family and I made the drive from Seattle to Eugene, a nearly 300-mile road trip along I-5. I was super excited (and a little nervous) to be attending a professional conference alongside professors, postdocs, and graduate students.

When I checked in, I got my conference badge and immediately noticed something funny. My badge just said “Andrew Li,” with no school or organization listed, while everyone else had theirs printed with their university or research institute. I guess Redmond High School isn’t in their system yet!


The Crowd: Grad Students, Professors, and Me

The SCiL crowd was mostly made up of college professors and graduate students. At first, I felt a little out of place sitting in rooms full of experts discussing topics in areas such as pragmatics and large language models. But once the sessions started, I realized that even as a student just starting out in the field, there was so much I could follow and even more that I wanted to learn.

The conference covered a wide range of topics, all tied together by a focus on computational modeling in linguistics. You can find the full conference schedule here.

I was especially drawn to Dr. Malihe Alikhani’s keynote presentation, “Theory of Mind in Generative Models: From Uncertainty to Shared Meaning.” Her talk explored how generative models can effectively facilitate communicative grounding by incorporating theory of mind alongside uncertainty and human feedback. What stood out to me most was the idea that positive friction can be intentionally built into conversational systems to encourage contemplative thinking, such as reflection on uncertain assumptions, by both users and AI systems. I was also fascinated by how generative models embody core mechanisms of pragmatic reasoning, offering linguists and cognitive scientists both methodological challenges and opportunities to question how computational systems reflect and shape our understanding of meaning and interaction.


Networking and New Connections

While I didn’t get the chance to meet Prof. Jonathan Dunn in person as planned (he’s teaching “Computational Construction Grammar” at the LSA 2025 Summer Institute from July 24 through August 7 and won’t arrive until July 23), I still made some great new connections.

One of them was Andrew Liu, a graduate student at the University of Toronto. We chatted about his project, “Similarity, Transformation, and the Newly Found Invariance of Influence Functions,” which he’s presenting during the poster session. He was super friendly and shared valuable advice about studying and doing research in computational linguistics and NLP. Here’s his LinkedIn profile if you’d like to check out his work.

Talking with grad students made me realize how wide the field of computational linguistics really is. Everyone had a different background — some came from linguistics, others from computer science or cognitive science — but they were all united by a shared passion for understanding language through computation.


Final Thoughts

Attending SCiL 2025 was eye-opening. Even though I was probably the youngest person there, I felt inspired, welcomed, and challenged in the best way. It confirmed my passion for computational linguistics/NLP and reminded me how much more I want to learn.

If you’re a high school student curious about computational linguistics/NLP, don’t be intimidated by professional conferences. Dive in, listen closely, ask questions, and you might be surprised by how much you take away.

— Andrew
