What EACL 2026 reveals about the next phase of computational linguistics: multilingual agents, evaluation, and language diversity

For the past few years, a lot of AI discussion has centered on scale. Bigger models, bigger datasets, bigger claims. But when I looked through EACL 2026, I came away with a different impression. The most interesting story was not just that language technology is getting more powerful. It was that computational linguistics is becoming more demanding about what counts as progress.

This year’s conference suggests that the field is entering a new phase. Researchers are paying closer attention to multilingual evaluation, cross-linguistic reliability, and the gap between fluent output and genuine linguistic competence. EACL 2026 includes hundreds of long papers, short papers, demos, findings papers, and workshops, but what stands out is the kind of questions those papers are asking. Increasingly, the field is less satisfied with asking whether a model performs well on a benchmark and more interested in whether that benchmark actually tells us anything meaningful.

That shift matters. Computational linguistics has reached a point where sounding convincing is no longer enough. A model may generate polished text, but that does not mean it reasons well, generalizes across languages, or works fairly across different linguistic communities. EACL 2026 reflects a growing awareness of that problem. Its program includes sessions on multilingual reliability, multilingual diversity and resource-aware scaling, historical and multiscript language processing, and evaluation under stress testing. Even one of the plenary talks, “Omnilinguality, Scaling AI to Any Language,” points directly to the conference’s broader focus. (2026.eacl.org)

Moving past the obsession with scale

Public conversations about AI still tend to reward scale. That makes sense to a point. Larger systems often do unlock new capabilities. But EACL 2026 suggests that the next phase of computational linguistics may be shaped less by model size and more by whether models can be evaluated honestly across languages and contexts.

That is one reason the First Workshop on Multilingual Multicultural Evaluation caught my attention. Its goal is not simply to add more languages to existing benchmarks. It focuses on improving multilingual evaluation in terms of accuracy, scalability, comparability, and fairness, while also incorporating cultural and social perspectives. That is a deeper challenge. It asks not only whether our systems work in many languages, but whether our methods for judging them are themselves too narrow.

As a student who is also trying to learn how research in computational linguistics actually works, I think this is one of the most important developments right now. Multilingual NLP has sometimes been treated as English NLP extended outward. Translate the task, rerun the benchmark, report the score. But language diversity is not that simple. Languages differ in structure, meaning-making, and social use. If our evaluation methods smooth over those differences, then our conclusions about model ability may be misleading from the start.

Multilingual agents are raising the stakes

EACL 2026 also makes clear that agents are no longer just a product trend. They are becoming a serious evaluation problem for computational linguistics. Once language models are expected to act as assistants, judges, or multi-step decision makers across languages, the question becomes whether their behavior remains reliable when the language changes.

One paper that stood out to me was MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators. The paper starts with a striking issue: LLMs are increasingly being used to evaluate dialogue quality, but many of the benchmarks for testing those evaluators are static, outdated, and not very multilingual. MEDAL addresses this by generating multilingual dialogues with multiple LLMs and studying how well strong models can judge them. The authors find real cross-lingual differences and show that even strong judge models struggle with nuanced issues like empathy, common sense, and relevance. (aclanthology.org)
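To make the judge setup concrete, here is a minimal sketch of the kind of LLM-as-judge loop MEDAL probes. This is my own illustration rather than the MEDAL codebase, and call_llm is a hypothetical stub standing in for a real chat-completion client.

    # A minimal sketch of the LLM-as-judge setup that MEDAL-style evaluation
    # probes. My illustration, not the MEDAL codebase; call_llm is a
    # hypothetical stand-in for a real chat-completion client.
    DIMENSIONS = ["coherence", "empathy", "commonsense", "relevance"]

    def call_llm(prompt: str) -> str:
        return "3"  # stub so the sketch runs; replace with a real API call

    def judge(turns: list, lang: str, dimension: str) -> int:
        transcript = "\n".join(f"{role}: {text}" for role, text in turns)
        prompt = (
            f"The dialogue below is in '{lang}'. Rate the assistant's "
            f"{dimension} from 1 (poor) to 5 (excellent). "
            f"Reply with the number only.\n\n{transcript}"
        )
        return int(call_llm(prompt))

    # Cross-lingual reliability check: judge parallel dialogues in several
    # languages and compare score distributions (or agreement with humans).
    dialogue = [("user", "I failed my exam today."),
                ("assistant", "I'm sorry. How are you feeling?")]
    for dim in DIMENSIONS:
        print(dim, judge(dialogue, "en", dim))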

What makes this especially interesting is that it reveals a second layer of uncertainty. We already worry about whether language models produce good outputs. Now we also have to worry about whether language models can reliably evaluate other language models, especially across languages. That is a very computational linguistics problem. It sits at the intersection of dialogue, evaluation, pragmatics, and multilinguality. It also shows how weaknesses do not disappear when models are placed in evaluative roles. They can become built into the systems we trust to judge quality.

Evaluation is becoming central, not secondary

If I had to summarize one message from EACL 2026, it would be this: evaluation is no longer a side issue. It is becoming one of the field’s central concerns.

A good example is Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning. This paper addresses one of the biggest benchmark problems today: contamination. If models have already seen benchmark data during training, then high scores become much harder to interpret. The authors respond by introducing a new benchmark based on a text-based trading card game, with English and Arabic versions and adjustable difficulty. Their findings show that performance drops as difficulty increases, that model size does not map neatly onto strategic ability, and that a notable gap remains between English and Arabic performance. (aclanthology.org)

That matters because it reflects a larger change in the field’s mindset. It is no longer enough for a benchmark to be widely used or easy to cite. It has to be trustworthy. If a model performs well because it has effectively memorized familiar patterns, then benchmark success may tell us less about reasoning than we think.

Another EACL 2026 paper pushes this idea even further. Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It argues that benchmark success can be fragile and overly dependent on wording, framing, and context. The authors call for process-oriented evaluation rather than relying only on static outcome-based metrics. That is an important shift. The field is becoming less interested in whether a model happened to get the answer right and more interested in what kind of reasoning, if any, led to that answer. (aclanthology.org)

To me, that is one of the healthiest signs in current computational linguistics. A stronger evaluation culture makes a field more precise. It also makes it harder for hype to stand in for evidence.

Language diversity is moving to the center

The other major pattern I noticed at EACL 2026 is that language diversity is being treated less like a side topic and more like a core research challenge. You can see that just from the workshops: African NLP, languages using Arabic script, low-resource language models, low-resource machine translation, Turkic languages, similar languages and dialects, field linguistics, and the Iranian language family. This is not a small corner of the conference. It is a substantial part of the conversation. (aclanthology.org)

One paper that captures this especially well is Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas. The authors evaluate five major model families on 13 Indigenous languages across tasks including language identification, cloze completion, and grammatical feature classification. Their results show major variation across both languages and model families, with many combinations performing near chance. That is a useful reminder that claims of multilingual capability often hide a much less even reality. (aclanthology.org)

What I like about this paper is that it treats underrepresented languages as serious tests of linguistic competence, not as afterthoughts. The authors note that many Indigenous languages include rich morphology and nonstandardized orthographies, which complicate both tokenization and evaluation. These are not just difficult edge cases. They are important cases for understanding whether models have learned anything linguistically meaningful beyond high-resource patterns.

A related example is CETVEL, a benchmark for Turkish that evaluates language understanding, generation, and cultural capacity. What stands out here is not just the breadth of the benchmark, but the fact that it includes Turkish history, idiomatic usage, and culturally grounded content. The paper also finds that Turkish-centric instruction-tuned models can underperform broader multilingual or general-purpose models. That complicates the simple assumption that a more language-specific model is automatically a better one. It suggests that language-specific evaluation needs to be culturally grounded and methodologically strong if it is going to tell us something useful. (aclanthology.org)

What this says about the field

So what does EACL 2026 reveal about the next phase of computational linguistics?

To me, it reveals a field that is becoming more multilingual, more skeptical, and more serious about methodology. The excitement around large language models is still there, but conferences like this suggest that researchers are becoming less willing to accept easy narratives about progress. Instead, they are asking where models fail, how evaluation breaks down, and which linguistic communities are still being underserved.

It also suggests that computational linguistics is reclaiming some of its deeper identity. At its best, this field is not just about generating fluent text. It is about studying language carefully enough to build technologies that are interpretable, robust, and responsive to real linguistic diversity. EACL 2026 feels like evidence of that shift.

The next phase of computational linguistics may not be defined by the loudest demo or the largest model. It may be defined by who can evaluate language technology most honestly across languages, cultures, and communicative settings. For me, that is an encouraging direction. It leaves room for the kinds of questions that made me interested in this field in the first place: What does it mean for a model to know a language? What counts as understanding across different linguistic communities? And how do we design evaluations that respect the fact that language is never uniform? EACL 2026 does not answer all of those questions. But it makes them much harder to ignore.


References

Association for Computational Linguistics. “19th Conference of the European Chapter of the Association for Computational Linguistics.” ACL Anthology, 2026. (aclanthology.org)

EACL 2026 Organizers. “Conference Overview.” EACL 2026. (2026.eacl.org)

EACL 2026 Organizers. “Workshops.” EACL 2026. (2026.eacl.org)

Mendonça, John, Alon Lavie, and Isabel Trancoso. “MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators.” Findings of the Association for Computational Linguistics: EACL 2026.

Alrashed, Sultan, Jianghui Wang, and Francesco Orabona. “Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning.” Findings of the Association for Computational Linguistics: EACL 2026.

Vasselli, Justin, Arturo Mp, Frederikus Hudi, Haruki Sakajo, and Taro Watanabe. “Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas.” Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers).

Er, Abrek, Ilker Kesen, Gözde Gül Şahin, and Aykut Erdem. “CETVEL: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish.” Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).

Mousavi, Seyed Mahed, Edoardo Cecchinato, Lucia Horníková, and Giuseppe Riccardi. “Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It.” Findings of the Association for Computational Linguistics: EACL 2026, pages 1747–1759.

— Andrew


When AI Starts Judging Research, What Comes Next?

Introduction: From AI-Assisted Research to AI-Assisted Judgment

In my past two blog posts, I focused on how AI is already reshaping academic research. In “The Productivity Paradox of AI in Scientific Research,” I wrote about how AI can expand productivity while narrowing the range of questions science tends to pursue. In “Citation Hallucinations at NeurIPS and What They Teach Us,” I looked at a different failure mode: AI’s ability to generate polished but false citations that can slip into scholarly work if no one checks carefully. This new story feels like the next stage of that same conversation. It is no longer just about how AI helps produce research. It is about how AI may begin influencing how research is judged.

That is why ABC Money’s article, “Stanford’s AI-Powered Peer Review System Is Rejecting More Papers Than Ever,” stood out to me. The headline is dramatic, but the deeper issue is broader than one Stanford-linked tool or one claim about rising rejection rates. The real question is whether academia is moving toward a system in which AI increasingly helps shape which papers are seen as rigorous, original, and worth publishing.

What Stanford’s AI Reviewer Actually Represents

According to Stanford’s Agentic Reviewer overview, the system takes a paper, retrieves relevant literature, and generates a structured review. The project also says that in an internal evaluation built around ICLR 2025 submissions, the AI’s score correlation with a human reviewer was roughly comparable to the correlation between two human reviewers. That is a notable result. But it is still not the same thing as proving that the system can do peer review well in the fullest sense of the term.

That distinction matters. A model can learn the patterns of past reviewing behavior without understanding whether those patterns reflect good judgment. If peer review already tends to favor polished presentation, trendy topics, and familiar methods, then an AI trained on those signals may reinforce those biases rather than correct them. In that sense, this connects directly to the concern I raised in my productivity-paradox post: AI may accelerate what is already legible and well supported while doing far less for unusual, risky, or interdisciplinary ideas.

It is also worth being precise about what Stanford’s tool appears to be doing. The public materials describe it as an AI review and feedback system, but they do not show that Stanford or major journals have handed final publication decisions over to it. So the ABC Money headline may go further than the official project materials themselves. Even so, the broader trend is real. AI is moving closer to the front end of scholarly evaluation.

AI Is Already Entering Peer Review

And this is not just hypothetical. A 2026 study on ICLR 2025 describes an official AI feedback tool that was deployed to provide reviewers with post-review suggestions in a live, high-stakes conference setting (Chen et al. 2026). The researchers present this as the first empirical evidence of such a tool in a live review process. Importantly, the tool did not replace human reviewers or make final accept-reject decisions. It instead offered feedback on reviews themselves, flagging issues like (1) vagueness or genericity, (2) possible misunderstandings of the paper, and (3) unprofessional tone.

There are also examples outside conference review. openRxiv announced in November 2025 that it was enabling review options that included author-centered AI tools, reflecting a broader effort to expand the review ecosystem around bioRxiv and medRxiv. That is not the same as automated rejection, but it is another sign that AI-based review infrastructure is starting to become normal.

Meanwhile, AI is entering review even when it is not always built directly into official platforms. That is one reason this issue feels larger than any single Stanford project. Once AI becomes a routine part of writing, feedback, and manuscript assessment, it starts shaping what counts as clear, persuasive, and acceptable long before a final decision is made. This is where the conversation stops being just about software and starts becoming a conversation about institutional judgment.

Why This Matters Beyond Research

That concern becomes even more interesting when we look at college admissions. Admissions offices face a similar problem: too many applications, too little time, and pressure to make consistent decisions. The logic for using AI sounds very similar to the logic in peer review. A machine can process transcripts faster, extract structured information from essays, and help staff manage volume.

Some colleges have openly acknowledged using AI in parts of that process. UNC-Chapel Hill says it uses AI programs to provide data points about students’ Common App essays and school transcripts, including writing style, grammar, and course rigor, so that admissions staff can focus on essay content, grades, and curriculum strength. That is a clear example of AI entering applicant screening, even if UNC does not present it as automated decision-making.

Other institutions have been reported as experimenting in related ways. Associated Press reporting carried by VPM says some colleges are publicly incorporating AI into admissions, including Virginia Tech’s use of an AI-powered essay reader, while Caltech has been described as using an AI-supported authenticity check for student-submitted research projects. These are different use cases, but together they show how quickly machine-assisted evaluation is spreading into high-stakes educational settings.

At the same time, some universities are drawing a clear boundary. The University of California says every application is read and that each application gets multiple reviews. USC similarly says it is committed to keeping admissions a human process “absent of AI or other technology.” That contrast matters because it shows institutions still have choices. AI adoption is not inevitable in the same form everywhere.

The Bigger Theme: Institutional Judgment Under Pressure

This is why I find the Stanford story so interesting even if the headline itself may overstate the immediate facts. It points to a larger shift. In my earlier posts, I wrote about AI changing how research gets produced and how mistakes can enter that process. This story points to the next step: AI influencing how research gets filtered, rewarded, and legitimized. And once the same logic appears in college admissions too, it becomes harder to dismiss as a niche issue.

My own view is that AI can be useful in review when it remains a support tool rather than a gatekeeper. If it helps a researcher catch weakly supported claims, missing citations, or unclear phrasing, that seems valuable. If it helps a reviewer notice that their comments are too vague or unnecessarily harsh, that also seems valuable. But once institutions begin leaning on AI to define novelty, merit, authenticity, or promise, the stakes change. Those are not just clerical judgments. They are interpretive ones. And interpretive judgments are exactly where hidden assumptions can have the biggest consequences.

Conclusion: Fast, Scalable, Persuasive, but Not Necessarily Wise

At bottom, I think this is becoming one of the central questions of the AI era: what happens when institutions under pressure begin outsourcing pieces of judgment to systems that are fast, scalable, and persuasive, but not necessarily wise? That question matters for researchers trying to publish, for students trying to get admitted, and for anyone who cares about whether important but unconventional ideas still have a fair chance.

If universities, conferences, and journals keep moving in this direction, transparency will matter a great deal. Students, authors, reviewers, and researchers should know when AI is being used, what role it actually plays, and what decisions remain fully human. Without that clarity, people may assume they are being evaluated by experienced readers when an unseen machine is shaping the first impression. That would not just be a fairness problem. It would also be a trust problem.

References

ABC Money. “Stanford’s AI-Powered Peer Review System Is Rejecting More Papers Than Ever.” March 11, 2026.

Stanford Agentic Reviewer / PaperReview.ai. “Tech Overview.”

Chen et al. “What Happens When Reviewers Receive AI Feedback in Their Reviews?” ACM CHI 2026. https://arxiv.org/abs/2602.13817

openRxiv. “Enabling options for review: from training and transparency to author-centered AI tools.” November 6, 2025. https://openrxiv.org/enabling-review-options/

UNC-Chapel Hill Undergraduate Admissions. “Does Undergraduate Admissions use AI and why?” https://admissions.unc.edu/faqs/does-undergraduate-admissions-use-ai-and-why/

University of California. “How the University of California evaluates student applications.” https://www.universityofcalifornia.edu/news/how-the-university-of-california-evaluates-student-applications

USC Undergraduate Admission Blog. “Humanizing the Admission Process.” February 23, 2026. https://www.admissionblog.usc.edu/p/humanizing-the-admissions-process

Associated Press / VPM. “Welcome to the new era of college admissions: AI may be scoring your essay.” December 2, 2025. https://www.vpm.org/news/2025-12-02/college-admissions-artificial-intelligence-virginia-tech-juan-espinoza

— Andrew


Prompt Repetition as a Surprisingly Strong Baseline for Non-Reasoning LLMs

A recent Google Research paper, Prompt Repetition Improves Non-Reasoning LLMs, makes a claim that feels almost too simple to be real: if you are running an LLM in a non-reasoning mode, you can often get better answers by duplicating the prompt verbatim before the model responds.

The trick

The transformation is exactly what it sounds like:

Baseline: <QUERY>

Repeat:   <QUERY><QUERY>

No extra instructions, no extra examples, no chain-of-thought prompting. Just the same prompt twice.
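The transformation is short enough to sketch in a few lines of Python. This is my own illustration, assuming a generic completion client; client.complete is a hypothetical placeholder, not a real SDK call.

    # A minimal sketch of the prompt-repetition baseline (my illustration,
    # not the paper's code). The paper's Repeat condition concatenates the
    # query with itself verbatim; a separator is a variant you might try.
    def repeat_prompt(query: str, times: int = 2, sep: str = "") -> str:
        """Duplicate the prompt verbatim; times=2 matches the paper's setup."""
        return sep.join([query] * times)

    def answer(client, query: str, repeat: bool = True) -> str:
        prompt = repeat_prompt(query) if repeat else query
        return client.complete(prompt)  # hypothetical non-reasoning API call

    print(repeat_prompt("Which option is correct? (A) ... (B) ... (C) ..."))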

What the authors found

Across a set of major models and a range of benchmarks, the authors report that prompt repetition is consistently helpful in the non-reasoning setting. They present results as head-to-head comparisons against the baseline prompt, and show broad improvement without changing the expected output format.

One detail that makes this especially practical is that they also examine latency and output length. The headline is that repeating the prompt generally does not increase the number of generated tokens, and in their experiments it usually does not meaningfully increase end-to-end latency either.

Why would this work

The paper’s explanation is rooted in a basic property of causal language models: tokens are processed left to right, and each position attends only to what came before it. This means the order of information in a prompt can matter more than we would like. If important details appear early, the model “saw” them before it knew what the final question would be. Repeating the prompt gives the model a second chance to integrate the whole request with the question now fully in view.

A nice way to interpret this is that prompt repetition is not adding new information. It is changing the geometry of attention by making the same information appear again later in the context window, closer to where the model must commit to an answer.

When it seems most useful

The effect should be strongest when prompts are long, structured, or easy to misread. The paper highlights cases like multiple choice formats where the placement of options and questions can create “unfriendly” ordering effects. Repetition helps smooth out those quirks because whatever was awkwardly positioned the first time is now encountered again.

They also introduce stress tests where the model must retrieve or locate items in long lists, and some of those show dramatic jumps with repetition.

What changes when reasoning is enabled

An important nuance is that these gains are mainly about non-reasoning usage. When the model is already encouraged to reason step by step, repetition tends to help less often, and many results become ties. The paper’s intuition is that reasoning style outputs often restate the problem anyway, which can partially mimic the benefit of repetition.

A practical takeaway for prompt design

If you are building an application where you want better performance without paying for longer responses, prompt repetition looks like a strong “cheap” baseline to try first. It is also a reminder that prompt engineering is not only about clever wording. Sometimes it is about controlling where information appears in the sequence so the model can reliably use it.

References

Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Prompt Repetition Improves Non-Reasoning LLMs.” arXiv, 2025.

— Andrew


We Submitted Our ACL 2026 DravidianLangTech Paper on Hope Speech Detection in Tulu

I’m happy to share that we have submitted our paper, “cantnlp@DravidianLangTech 2026: Organic Domain Adaptation Improves Multi-Class Hope Speech Detection in Tulu,” to the Sixth Workshop on Speech and Language Technologies for Dravidian Languages (DravidianLangTech-2026), which is co-located with ACL 2026 in San Diego. ACL 2026 is scheduled for July 2 to July 7, 2026, with workshops on July 3 and 4.

What the shared task is about

Our submission is part of the Hope Speech Detection shared task at DravidianLangTech-2026. The task focuses on identifying hopeful, encouraging, and supportive language in social media text, with a particular emphasis on code-mixed Tulu. That makes it both socially meaningful and technically challenging, especially because Tulu remains a low-resource language in NLP.

What our paper explores

Our paper studies how organic domain adaptation can improve multi-class hope speech detection in Tulu. In low-resource settings, even small domain mismatches can hurt performance, and code-mixed data adds another layer of difficulty. This project looks at how better adaptation strategies can help models generalize more effectively in that setting.

Why this matters

I find this work exciting because it sits at the intersection of low-resource NLP, code-mixed language processing, and socially useful language technology. Hope speech detection is not just a classification problem. It also connects to broader questions about how NLP systems can support healthier online spaces and extend research attention to languages that are often underrepresented.

Acknowledgments

I’m the first author of this submission, and I’m very grateful to my co-author and mentor, Dr. Sidney Wong. His guidance and support were central to both the research process and the writing of the paper.

What comes next

The paper was submitted by the March 5, 2026 shared-task paper deadline, so it is now under review. I’m looking forward to seeing the outcome and, hopefully, sharing more about the project in the months ahead. No matter what happens, this has already been a valuable experience in working on Tulu NLP and contributing to research on Dravidian languages.


— Andrew


MLRegTest: Stress-Testing Whether Models Learn Rules or Just Patterns

When people say “AI understands language,” they usually mean it can produce fluent text, summarize an article, or answer questions. Those abilities are impressive, but they can also hide a real problem: a model can look correct while relying on shortcuts that break in the exact cases we care about most.

That is why I have been interested in MLRegTest, a benchmark designed to stress-test sequence models using 1,800 carefully constructed regular languages. Instead of judging a model by how human its writing sounds, MLRegTest asks a simpler, sharper question: can a model learn a rule, and then apply it reliably when the test gets harder or more precise?

What is MLRegTest, in plain terms?

MLRegTest is a large collection of tiny, made-up “languages” built from simple symbols. Imagine an alphabet like A, B, C, D, and strings such as “AABC” or “BBBA.” Each language has a hidden rule that determines whether a string belongs to it. The model learns from labeled examples and then answers a yes or no question: does this string follow the rule?

This might sound far from English or Spanish, but it is actually a powerful way to test something very relevant to computational linguistics: how models represent patterns and dependencies across sequences.

Why regular languages?

Regular languages are a class of formal languages that can be described using tools like regular expressions and finite-state machines. They are simpler than full human language, but they still capture many meaningful pattern constraints. MLRegTest uses regular languages because they let researchers control the task in a way that is difficult with natural text. The rules are fully known, the labels are unambiguous, and researchers can generate unlimited data under controlled conditions. That makes it possible to test specific kinds of generalization rather than only measuring how well a model matches the distribution of a dataset.
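To make that concrete, here is a self-contained toy version of the setup. The rule is one I made up for illustration (strings over {A, B} with an even number of A’s); MLRegTest’s 1,800 languages are far more varied, but the shape of the task is the same.

    # A toy MLRegTest-style task (my illustration, not benchmark code).
    # Hidden rule: the string contains an even number of A's, a regular
    # language recognizable by a 2-state finite-state machine.
    import random

    def in_language(s: str) -> bool:
        state = 0  # parity of A's seen so far
        for ch in s:
            if ch == "A":
                state = 1 - state
        return state == 0

    def sample_labeled(n: int, max_len: int = 12, seed: int = 0):
        rng = random.Random(seed)
        data = []
        for _ in range(n):
            s = "".join(rng.choice("AB") for _ in range(rng.randint(1, max_len)))
            data.append((s, in_language(s)))  # (string, is-member) pair
        return data

    print(sample_labeled(5))  # labeled yes/no examples a model would train on

Because the rule is fully known, you can generate as much training and test data as you want, at any length, which is exactly the control that natural-language corpora do not give you.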

What makes MLRegTest different from typical benchmarks?

First, MLRegTest is not just one dataset. It is a suite of datasets drawn from 1,800 distinct regular languages, and those languages are organized by properties such as logical complexity and the kinds of constraints they express. That organization matters because “pattern learning” is not a single ability. Some rules are easy to approximate but hard to learn exactly, and some require models to track information across long spans of a sequence. MLRegTest is designed to probe those differences rather than hiding them inside one average score.

Second, the benchmark is built to examine long-distance dependencies in a controlled way. Sequence models often struggle when the relevant information is far apart in the input, and MLRegTest gives researchers a systematic way to test whether a model can handle that challenge.

Third, MLRegTest includes a kind of evaluation highlighted in Stony Brook’s write-up: border tests. These focus on edge cases where examples come in near-identical pairs. The strings might differ by only one symbol, but one is in the language and the other is not. Those are the cases where the true rule matters most, and they are also where shortcut strategies are most likely to fail. According to the Stony Brook announcement, models tended to struggle more on these boundary cases, even when they looked strong on more typical examples, which suggests that they can learn approximations instead of learning the rule itself.
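Here is a small sketch of how such border pairs can be constructed, again with a made-up parity rule for illustration: take a string, flip one symbol, and keep the pair whenever membership flips.

    # A sketch of border-test construction (my illustration). With the parity
    # rule below, every single-symbol edit flips membership, so every edited
    # pair is a border pair; richer languages make these pairs rarer and harder.
    def in_language(s: str) -> bool:
        return s.count("A") % 2 == 0

    def border_pairs(strings, alphabet="AB"):
        pairs = []
        for s in strings:
            for i, old in enumerate(s):
                for new in alphabet:
                    if new != old:
                        t = s[:i] + new + s[i + 1:]
                        if in_language(s) != in_language(t):
                            pairs.append((s, t))  # one member, one non-member
        return pairs

    print(border_pairs(["ABBA", "BAB"]))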

What did the researchers evaluate?

The JMLR paper evaluates multiple neural architectures, including recurrent models and transformers, and reports that performance varies significantly depending on the kind of test set, the class of language, and the model architecture. That is useful because it pushes back on the idea that “a strong model is strong at everything.” MLRegTest makes it easier to ask where a system is strong and where it breaks, and to tie those results to specific properties of the pattern being learned.

Why this matters for evaluating language models

Even though MLRegTest does not test natural language directly, it targets a core issue in NLP evaluation: benchmarks can be “won” for the wrong reasons. A model can score well by picking up statistical hints that correlate with labels without learning the intended generalization. Border tests and other controlled generalization tests help researchers ask whether a model stays consistent when inputs shift in principled ways, whether it generalizes beyond the training regime, and whether it fails exactly when the rule becomes tight. Those questions matter if we want models that are dependable in real settings, especially when rare edge cases are the dangerous ones.

A quiet challenge to “just feed it more data”

MLRegTest also pushes back on a common assumption in AI right now: if a model struggles, the fix is simply more data. The benchmark is asking what happens if the deeper issue is not data quantity, but what the model is actually learning. This is not only a scientific concern but also a practical one. In high-stakes applications like robotic medical assistance or self-driving cars, the most serious situations are often rare. A particular combination of weather, road design, sensor noise, and unpredictable human behavior might occur only one in a million times. In medicine, a rare complication might be exactly the case where you cannot afford a mistake. The border tests connect directly to this idea because they emphasize edge cases where a tiny change can flip the correct decision, which is where shortcut learning becomes most dangerous.

The takeaway is simple: reliability is not the same thing as average performance. If a system only works well on patterns it has seen thousands of times, it may still be fragile in the exact scenarios we care about most. MLRegTest is valuable because it helps us measure that fragility directly instead of waiting to discover it in the real world.

A high school senior takeaway

As a high school senior interested in computational linguistics research, MLRegTest feels like a strong example of what careful evaluation looks like. It controls the task so we know what the model should learn, varies difficulty in interpretable ways so “harder” actually means something specific, and probes failure modes instead of stopping at one headline number. More broadly, it connects to a theme I keep coming back to in NLP: we do not just want systems that perform well. We want systems whose performance we can explain and trust.

References

  1. van der Poel, Sam, et al. “MLRegTest: A Benchmark for the Machine Learning of Regular Languages.” Journal of Machine Learning Research, vol. 25, no. 283, 2024, pp. 1–45. https://www.jmlr.org/papers/v25/23-0518.html
  2. Stony Brook University AI news announcement (February 13, 2026): “How Much Does AI Really Understand: Stress-testing Neural Networks with 1,800 Language Patterns.” https://ai.stonybrook.edu/about-us/News/how-much-does-ai-really-understand-stress-testing-neural-networks-1800-language

— Andrew


How AI is Quietly Changing the Way We Talk

Introduction

In this blog post, I’d like to share recent findings suggesting AI is quietly reshaping the way we talk. You may already be aware that AI is reshaping the way we write, given the wide use of ChatGPT and other LLMs to help generate text, particularly in research papers. See the discussion in my past blog post “Is the Increasing Trend of Leveraging LLMs like ChatGPT in Writing Research Papers Concerning?”.

Florida State University’s Study

A new study from Florida State University shows that large language models are starting to influence spoken language, not just written text. Researchers analyzed over 22 million words from unscripted science and tech podcasts, comparing episodes from before ChatGPT (2019–2021) with episodes after its release (2023–2025).

They found that words commonly used by AI models, such as “delve,” “boast,” and “meticulous,” are showing up more often in everyday conversation, while their close synonyms stayed flat.

The researchers call this phenomenon “lexical seepage,” where AI-preferred words gradually leak into the way people naturally talk.
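The underlying measurement is simple to sketch. Below is my own illustration of the idea, not the FSU team’s pipeline: compute per-million-word rates of the target words in transcripts from the two periods and compare.

    # A sketch of the frequency comparison behind "lexical seepage" (my
    # illustration, not the study's code): per-million-word rates of
    # AI-associated words, before vs. after ChatGPT's release.
    import re
    from collections import Counter

    TARGETS = ["delve", "boast", "meticulous"]

    def rate_per_million(transcripts, word):
        tokens = [t for doc in transcripts
                  for t in re.findall(r"[a-z']+", doc.lower())]
        return 1e6 * Counter(tokens)[word] / max(len(tokens), 1)

    pre = ["placeholder transcript text recorded 2019-2021 ..."]   # hypothetical corpora
    post = ["placeholder transcript text recorded 2023-2025 ..."]

    for w in TARGETS:
        print(w, rate_per_million(pre, w), "->", rate_per_million(post, w))

The study’s extra step of tracking close synonyms as a control is what makes the rise of the target words interpretable as seepage rather than a general vocabulary shift.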

How the Shift Happens

The study links this effect to psychology concepts like implicit learning and priming. People pick up on repeated words, even without realizing it, and then use them themselves. In other words, AI is not just helping us write. It may also be subtly shaping the way we speak. Importantly, the changes were observed in unscripted talk, not just in formal speeches or scripted lectures.

Global Patterns and Concerns

This is not only happening in the U.S. A study in Germany found similar patterns on YouTube, suggesting the trend is global. Experts warn that if companies like OpenAI, Anthropic, and Google fine-tune their models in different ways, people might start adopting slightly different speech patterns. Over time, this could flatten dialects, erase regional slang, and reduce creativity. Some argue we need new benchmarks that push AI to use more diverse language instead of over-relying on the same set of words.

Natural Adoption vs. AI Amplification

The Florida State team also makes an important point: not everything can be pinned on AI.

“It is possible that these words have simply entered a phase of natural, rapid adoption, akin to the rise of expressions like ‘touch base,’ ‘dude,’ and ‘awesome’ in the mid-2000s.”

In this view, LLMs overuse words that were already becoming popular, but they still act as amplifiers that speed up language change. Even if AI is not the original source of these trends, the fact that machine-generated text can influence how humans speak is significant.

Final Thoughts

As a high school student, I find this both fascinating and a little worrying. On the one hand, it shows how powerful AI really is in shaping culture, not just technology. On the other hand, if AI makes everyone talk the same way, that could erase some of the creativity and uniqueness that makes language fun. Just like with social media, the full impact may take years to understand. For now, I think it’s important to keep asking questions about how AI is changing not just what we write, but also what we say.


Further Reading

  • “AI Is Quietly Reshaping the Way We Talk.” Fast Company, https://www.fastcompany.com/91398460/ai-is-quietly-reshaping-the-way-we-talk.
  • Anderson, Bryce, Riley Galpin, and Tom S. Juzek. Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English. arXiv, 2025, doi:10.48550/arXiv.2508.00238.
  • Yakura, Hiromu, et al. Empirical Evidence of Large Language Model’s Influence on Human Spoken Communication. arXiv, 2024, doi:10.48550/arXiv.2409.01754.

— Andrew


LATAM-GPT, Linguistic Bias, and Why Regional AI Infrastructure Matters

When a language model answers in fluent Spanish but misses local context, the problem is not grammar. The problem is representation.

That is the central issue behind linguistic bias in GPT-style systems, and it is why LATAM-GPT is such an important project for computational linguistics researchers. It pushes us to ask a better question than “Can the model generate text?” We should be asking, “Whose language realities are represented in the model?”

Linguistic bias is bigger than offensive outputs

In NLP conversations, bias is often reduced to toxic or stereotypical responses. That matters, but it is only part of the picture. Linguistic bias also includes structural imbalance: which dialects are present in training data, which cultural contexts are understood, and which institutions or histories are treated as central versus peripheral.

For many GPT-like systems, the imbalance starts at the data level. English and Global North content dominate much of the public web, so model behavior tends to be stronger when prompts align with those distributions. A model may produce polished Spanish or Portuguese and still flatten regional variation, miss sociolinguistic nuance, or rely on generic interpretations that do not fit local usage. AP’s reporting on LATAM-GPT directly frames the initiative as a response to this representational gap in mainstream AI systems.

Why regional models matter

Regional models like LATAM-GPT are not only technical artifacts. They are research infrastructure choices.

First, they can improve local relevance because the model is trained with region-specific data and priorities rather than treated as a generic multilingual extension of a primarily external corpus. AP reports that LATAM-GPT was developed specifically to better reflect Latin American language and context.

Second, regional models help build scientific and governance capacity. Reuters describes LATAM-GPT as a collaborative effort among countries and institutions in the region, which means expertise, evaluation norms, and deployment decisions are not fully outsourced.

Third, the initiative is positioned as open infrastructure for downstream applications, not just as another chatbot interface. That distinction matters for public-interest work in education, government services, and domain-specific NLP tools.

What the LATAM-GPT project is

Based on AP’s report, supported by Reuters and the official project site, LATAM-GPT is a regional open-source initiative led by Chile’s National Center for Artificial Intelligence (CENIA). AP reports early backing that included funding from CENIA and the Development Bank of Latin America (CAF), and references future training support tied to a major supercomputing investment in northern Chile. Reuters also notes cloud support in the development process.

The project is collaborative by design. AP reports participation from more than 30 institutions across eight countries, while Reuters presents a broader regional coalition narrative around deployment and adoption. The reported training pipeline includes large-scale data, combining partnership-based sources and synthetic data to improve coverage in underrepresented areas. Initial focus is on Spanish and Portuguese, with plans to expand toward Indigenous languages.

The timeline is also important. AP describes work beginning in 2023, public visibility increasing at the 2025 AI Action Summit, and launch reporting in February 2026.

Performance versus ChatGPT and Gemini

This part needs careful wording.

AP quotes project leadership saying LATAM-GPT can be more accurate and efficient for Latin American and Caribbean contexts because of regional data alignment. That is a meaningful claim and it fits the project’s objective.

At the same time, both AP and Reuters frame LATAM-GPT as not primarily intended to replace ChatGPT or Gemini as general-purpose consumer assistants. It is presented as foundational infrastructure for regional applications. Public reporting so far does not provide a single standardized benchmark table showing universal superiority over frontier global models across all task categories.

So the most responsible interpretation is this: LATAM-GPT’s strength is regional alignment and representational fit, not blanket dominance across every benchmark.

What this implies for a junior computational linguistics researcher

For early-stage researchers, LATAM-GPT signals an important shift in what counts as strong NLP work. Bigger model size is no longer the only story. Research quality increasingly depends on whether your data curation, evaluation design, and error analysis capture real linguistic diversity.

That has practical consequences. If you only run generic leaderboard-style evaluations, you may miss the most consequential failures. Region-aware testing, dialect-sensitive prompts, and sociolinguistic error taxonomies become central methods, not side tasks. Corpus documentation and annotation policy choices also become core contributions, because they shape what the model can and cannot represent.

In other words, this is an opportunity. You can build technically rigorous work while also addressing linguistic equity and real-world usefulness. LATAM-GPT makes that path visible: computational linguistics can be both advanced and locally grounded.

Final reflection

LATAM-GPT matters because it reframes AI development from pure model competition to language representation, participation, and research sovereignty. The key question is not whether it outperforms every major global model on every task. The key question is whether communities that were historically underrepresented in AI can now help shape the systems that represent them.

For junior researchers, that is a powerful direction for the next decade of NLP.

References

  1. AP News. Chile launches open-source AI model designed for Latin America (Feb 2026).
  2. Reuters. Latin American countries to launch own AI model in September (Jun 17, 2025).
  3. LATAM-GPT official site (project overview).

— Andrew


From Hallucinated Citations to Linked Evidence: The OpenScholar Approach

In my recent blog post, I discussed “Citation Hallucinations at NeurIPS and What They Teach Us.” As a student researcher, I think many people are asking the same question: can we use AI tools that help us get citations right without made-up references?

I recently read a Nature article that gave a strong answer. The article introduces OpenScholar, a retrieval-augmented system that combines a language model with a database of about 45 million open-access papers. Instead of relying only on model memory, OpenScholar retrieves papers first and then generates responses with explicit citation links.
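The basic retrieve-then-generate pattern is easy to sketch. The toy version below is my own illustration, not OpenScholar’s pipeline: retrieval here is naive keyword overlap rather than a trained retriever over 45 million papers, and call_llm is a hypothetical stub.

    # A toy retrieve-then-cite loop (my illustration, not OpenScholar's code).
    def overlap(query: str, paper: dict) -> int:
        return len(set(query.lower().split()) & set(paper["abstract"].lower().split()))

    def retrieve(query: str, corpus: list, k: int = 3) -> list:
        return sorted(corpus, key=lambda p: overlap(query, p), reverse=True)[:k]

    def call_llm(prompt: str) -> str:
        return "(model output would appear here)"  # stub; swap in a real client

    def answer_with_citations(query: str, corpus: list) -> str:
        hits = retrieve(query, corpus)
        context = "\n".join(f"[{p['id']}] {p['title']}: {p['abstract']}" for p in hits)
        prompt = (
            "Answer using ONLY the papers below, citing the [id] after each claim.\n"
            f"{context}\n\nQuestion: {query}"
        )
        return call_llm(prompt)

The key design choice is that the citation targets are fixed before generation begins, so the model can only cite documents that actually exist in the index.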

Why this matters

For research workflows, citation reliability is everything. When references are wrong, the writing process breaks down quickly. OpenScholar is designed to reduce that risk by grounding claims in retrieved literature before generating the final response.

According to the article, OpenScholar is also:

  • Open source
  • Relatively lightweight
  • Deployable locally
  • Built for scientific search and literature review

That combination is important because it supports both accuracy and reproducibility, which are essential in research settings.

Reported performance

Nature reports that in the OpenScholar evaluations, the 8B model outperformed GPT-4o on correctness in their benchmark and significantly reduced fabricated citations. The article also notes that citation behavior was described as being comparable to human experts in their testing context.

Comparison with OpenAI deep research tools

The article places OpenScholar in a broader trend. Since OpenScholar was first posted on arXiv about 14 months ago, companies such as OpenAI have integrated similar retrieval-based “deep research” methods into commercial LLM products, improving factual accuracy and citation quality compared with earlier model behavior.

OpenScholar’s main distinction in that landscape is cost-efficiency plus openness. Nature cites the OpenScholar team saying it can run at a fraction of the cost of GPT-5 with deep research, while still grounding outputs in a large scientific corpus.

Limitations to keep in mind

The article is clear that OpenScholar is not perfect. The authors acknowledge two major limitations:

  1. It does not always retrieve the most representative or most relevant papers for every query.
  2. It is limited by the scope of its indexed database.

So even though OpenScholar helps with citation hallucinations, retrieval quality remains a core bottleneck. In practice, researchers still need to verify paper relevance and coverage before relying on output.

Final thoughts

My takeaway is that this is a meaningful step forward for student researchers and independent scholars. Better grounding, lower cost, and open access can make high-quality literature review tools more available to more people.

Nature also quotes an outside researcher who argues that if OpenScholar remains free, it could become one of the most widely used tools for scientific search. I think that is very possible.

If you have tested OpenScholar, share what worked and what did not. I may feature reader feedback in a follow-up post.

— Andrew


Citation Hallucinations at NeurIPS and What They Teach Us

I’m writing this post about a recent discovery by GPTZero, reported in Shmatko et al. (2026). The finding sparked significant discussion across the research community (Goldman 2026). While hallucinations produced by large language models have been widely acknowledged, far less attention has been paid to hallucinations in citations. Even reviewers at top conferences such as NeurIPS failed to catch citation hallucination issues, showing how easily these errors can slip through existing academic safeguards.

For students and early-career researchers, this discovery should serve as a warning. AI tools can meaningfully improve research efficiency, especially during early-stage tasks like brainstorming, summarizing papers, or organizing a literature review. At the same time, these tools introduce new risks when they are treated as sources rather than assistants. Citation accuracy remains the responsibility of the researcher, not the model.

As a junior researcher, I have used AI tools such as ChatGPT to help with literature reviews in my own work. In practice, AI can make the initial stages of research much easier by surfacing themes, suggesting keywords, or summarizing large volumes of text. However, I have also seen how easily this convenience can introduce errors. Citation hallucinations are particularly dangerous because they often look plausible. A reference may appear to have a reasonable title, realistic authors, and a convincing venue, even though it does not actually exist. Unless each citation is verified, these errors can quietly make their way into drafts.

According to GPTZero, citation hallucinations tend to fall into several recurring patterns. One common issue is the combination or paraphrasing of titles, authors, or publication details from one or more real sources. Another is the outright fabrication of authors, titles, URLs, DOIs, or publication venues such as journals or conferences. A third pattern involves modifying real citations by extrapolating first names from initials, adding or dropping authors, or subtly paraphrasing titles in misleading ways. These kinds of errors are easy to overlook during review, particularly when the paper’s technical content appears sound.

The broader lesson here is not that AI tools should be avoided, but that they must be used carefully and responsibly. AI can be valuable for identifying research directions, generating questions, or helping navigate unfamiliar literature. It should not be relied on to generate final citations or to verify the existence of sources. For students in particular, it is important to build habits that prioritize checking references against trusted databases and original papers.
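One concrete habit is programmatic DOI checking. The sketch below uses Crossref’s public REST API via the requests package; note that a miss is a flag rather than proof, since some legitimate works (arXiv preprints, for example) are registered outside Crossref.

    # Verify that a DOI resolves in Crossref before trusting a citation.
    # A 404 is a warning sign, not proof of fabrication: some real works
    # (e.g., arXiv preprints) are registered elsewhere.
    import requests

    def crossref_title(doi: str):
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        if resp.status_code != 200:
            return None  # no Crossref record for this DOI
        titles = resp.json()["message"].get("title", [])
        return titles[0] if titles else ""

    # Example: a DOI cited elsewhere on this blog.
    title = crossref_title("10.1038/s41586-025-09922-y")
    print("No record." if title is None else f"Found: {title}")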

Looking ahead, this finding reinforces an idea that has repeatedly shaped how I approach my own work. Strong research is not defined by speed alone, but by care, verification, and reflection. As AI becomes more deeply embedded in academic workflows, learning how to use it responsibly will matter just as much as learning the technical skills themselves.

References

Shmatko, N., A. Adam, and P. Esau. “GPTZero Finds 100 New Hallucinations in NeurIPS 2025 Accepted Papers.” GPTZero, January 21, 2026.

Goldman, S. “NeurIPS, one of the world’s top academic AI conferences, accepted research papers with 100+ AI-hallucinated citations, new report claims.” Fortune, January 21, 2026.

— Andrew


The Productivity Paradox of AI in Scientific Research

In January 2026, Nature published a paper with a title that immediately made me pause: “Artificial intelligence tools expand scientists’ impact but contract science’s focus” (Hao et al. 2026). The wording alone suggests a tradeoff that feels uncomfortable, especially for anyone working in AI while still early in their academic life.

The study, conducted by researchers at the University of Chicago and China’s Beijing National Research Center for Information Science and Technology, analyzes how AI tools are reshaping scientific research. Their findings are striking. Scientists who adopt AI publish roughly three times as many papers, receive nearly five times as many citations, and reach leadership positions one to two years earlier than their peers who do not use these tools (Hao et al. 2026). On the surface, this looks like a clear success story for AI in science.

But the paper’s core argument cuts in a different direction. While individual productivity and visibility increase, the collective direction of science appears to narrow. AI is most effective in areas that already have abundant data and well-established methods. As a result, research effort becomes increasingly concentrated in the same crowded domains. Instead of pushing into unknown territory, AI often automates and accelerates what is already easiest to study (Hao et al. 2026).

James Evans, one of the authors, summarized this effect bluntly in an interview with IEEE Spectrum. AI, he argued, is turning scientists into publishing machines while quietly funneling them into the same corners of research (Dolgin 2026). The paradox is clear. Individual careers benefit, but the overall diversity of scientific exploration suffers.

Reading this as a high school senior who works in NLP and computational linguistics was unsettling. AI is the reason I can meaningfully participate in research at this stage at all. It lowers barriers, speeds up experimentation, and makes ambitious projects feasible for small teams or even individuals. At the same time, my own work often depends on large, clean datasets and established benchmarks. I am benefiting from the very dynamics this paper warns about.

The authors emphasize that this is not primarily a technical problem. It is not about whether transformer architectures are flawed or whether the next generation of models will be more creative. The deeper issue is incentives. Scientists are rewarded for publishing frequently, being cited often, and working in areas where success is legible and measurable. AI amplifies those incentives by making it easier to succeed where the path is already paved (Hao et al. 2026).

This raises an uncomfortable question. If AI continues to optimize research for speed and visibility, who takes responsibility for the slow, risky, and underexplored questions that do not come with rich datasets or immediate payoff? New fields rarely emerge from efficiency alone. They require intellectual friction, uncertainty, and a willingness to fail without quick rewards.

Evans has expressed hope that this work acts as a provocation rather than a verdict. AI does not have to narrow science’s focus, but using it differently requires changing what we value as progress (Dolgin 2026). That might mean funding exploratory work that looks inefficient by conventional metrics. It might mean rewarding scientists for opening new questions rather than closing familiar ones faster. Without changes like these, better tools alone will not lead to broader discovery.

For students like me, this tension matters. We are entering research at a moment when AI makes it easier than ever to contribute, but also easier than ever to follow the crowd. The challenge is not to reject AI, but to be conscious of how it shapes our choices. If the next generation of researchers only learns to optimize for what is tractable, science may become faster, cleaner, and more impressive on paper while quietly losing its sense of direction.

AI has the power to expand who gets to do science. Whether it expands what science is willing to ask remains an open question.

References

Hao, Q., Xu, F., Li, Y., et al. “Artificial Intelligence Tools Expand Scientists’ Impact but Contract Science’s Focus.” Nature, 2026. https://doi.org/10.1038/s41586-025-09922-y

Dolgin, Elie. “AI Boosts Research Careers but Flattens Scientific Discovery.” IEEE Spectrum, January 19, 2026. https://spectrum.ieee.org/ai-science-research-flattens-discovery-2674892739

“AI Boosts Research Careers, Flattens Scientific Discovery.” ACM TechNews, January 21, 2026. https://technews.acm.org/archives.cfm?fo=2026-01-jan/jan-21-2026.html

— Andrew

