Citation Hallucinations at NeurIPS and What They Teach Us

I’m writing this post about a recent discovery by GPTZero, reported by Shmatko et al. 2026. The finding sparked significant discussion across the research community (Goldman 2026). While hallucinations produced by large language models have been widely acknowledged, far less attention has been paid to hallucinations in citations. Even reviewers at top conferences such as NeurIPS failed to catch citation hallucination issues, showing how easily these errors can slip through existing academic safeguards.

For students and early-career researchers, this discovery should serve as a warning. AI tools can meaningfully improve research efficiency, especially during early-stage tasks like brainstorming, summarizing papers, or organizing a literature review. At the same time, these tools introduce new risks when they are treated as sources rather than assistants. Citation accuracy remains the responsibility of the researcher, not the model.

As a junior researcher, I have used AI tools such as ChatGPT to help with literature reviews in my own work. In practice, AI can make the initial stages of research much easier by surfacing themes, suggesting keywords, or summarizing large volumes of text. However, I have also seen how easily this convenience can introduce errors. Citation hallucinations are particularly dangerous because they often look plausible. A reference may appear to have a reasonable title, realistic authors, and a convincing venue, even though it does not actually exist. Unless each citation is verified, these errors can quietly make their way into drafts.

According to GPTZero, citation hallucinations tend to fall into several recurring patterns. One common issue is the combination or paraphrasing of titles, authors, or publication details from one or more real sources. Another is the outright fabrication of authors, titles, URLs, DOIs, or publication venues such as journals or conferences. A third pattern involves modifying real citations by extrapolating first names from initials, adding or dropping authors, or subtly paraphrasing titles in misleading ways. These kinds of errors are easy to overlook during review, particularly when the paper’s technical content appears sound.

The broader lesson here is not that AI tools should be avoided, but that they must be used carefully and responsibly. AI can be valuable for identifying research directions, generating questions, or helping navigate unfamiliar literature. It should not be relied on to generate final citations or to verify the existence of sources. For students in particular, it is important to build habits that prioritize checking references against trusted databases and original papers.

Looking ahead, this finding reinforces an idea that has repeatedly shaped how I approach my own work. Strong research is not defined by speed alone, but by care, verification, and reflection. As AI becomes more deeply embedded in academic workflows, learning how to use it responsibly will matter just as much as learning the technical skills themselves.

References

Shmatko, N., Adam, A., and Esau, P. GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers, Jan. 21, 2026

Goldman, S. NeurIPS, one of the world’s top academic AI conferences, accepted research papers with 100+ AI-hallucinated citations, new report claims. Fortune, Jan 21, 2026

— Andrew

4,807 hits

The Productivity Paradox of AI in Scientific Research

In January 2026, Nature published a paper with a title that immediately made me pause: Artificial intelligence tools expand scientists’ impact but contract science’s focus (Hao et al. 2026). The wording alone suggests a tradeoff that feels uncomfortable, especially for anyone working in AI while still early in their academic life.

The study, conducted by researchers at the University of Chicago and China’s Beijing National Research Center for Information Science and Technology, analyzes how AI tools are reshaping scientific research. Their findings are striking. Scientists who adopt AI publish roughly three times as many papers, receive nearly five times as many citations, and reach leadership positions one to two years earlier than their peers who do not use these tools (Hao et al. 2026). On the surface, this looks like a clear success story for AI in science.

But the paper’s core argument cuts in a different direction. While individual productivity and visibility increase, the collective direction of science appears to narrow. AI is most effective in areas that already have abundant data and well established methods. As a result, research effort becomes increasingly concentrated in the same crowded domains. Instead of pushing into unknown territory, AI often automates and accelerates what is already easiest to study (Hao et al. 2026).

James Evans, one of the authors, summarized this effect bluntly in an interview with IEEE Spectrum. AI, he argued, is turning scientists into publishing machines while quietly funneling them into the same corners of research (Dolgin 2026). The paradox is clear. Individual careers benefit, but the overall diversity of scientific exploration suffers.

Reading this as a high school senior who works in NLP and computational linguistics was unsettling. AI is the reason I can meaningfully participate in research at this stage at all. It lowers barriers, speeds up experimentation, and makes ambitious projects feasible for small teams or even individuals. At the same time, my own work often depends on large, clean datasets and established benchmarks. I am benefiting from the very dynamics this paper warns about.

The authors emphasize that this is not primarily a technical problem. It is not about whether transformer architectures are flawed or whether the next generation of models will be more creative. The deeper issue is incentives. Scientists are rewarded for publishing frequently, being cited often, and working in areas where success is legible and measurable. AI amplifies those incentives by making it easier to succeed where the path is already paved (Hao et al. 2026).

This raises an uncomfortable question. If AI continues to optimize research for speed and visibility, who takes responsibility for the slow, risky, and underexplored questions that do not come with rich datasets or immediate payoff? New fields rarely emerge from efficiency alone. They require intellectual friction, uncertainty, and a willingness to fail without quick rewards.

Evans has expressed hope that this work acts as a provocation rather than a verdict. AI does not have to narrow science’s focus, but using it differently requires changing what we value as progress (Dolgin 2026). That might mean funding exploratory work that looks inefficient by conventional metrics. It might mean rewarding scientists for opening new questions rather than closing familiar ones faster. Without changes like these, better tools alone will not lead to broader discovery.

For students like me, this tension matters. We are entering research at a moment when AI makes it easier than ever to contribute, but also easier than ever to follow the crowd. The challenge is not to reject AI, but to be conscious of how it shapes our choices. If the next generation of researchers only learns to optimize for what is tractable, science may become faster, cleaner, and more impressive on paper while quietly losing its sense of direction.

AI has the power to expand who gets to do science. Whether it expands what science is willing to ask remains an open question.

References

Hao, Q., Xu, F., Li, Y., et al. “Artificial Intelligence Tools Expand Scientists’ Impact but Contract Science’s Focus.” Nature, 2026. https://doi.org/10.1038/s41586-025-09922-y

Dolgin, Elie. “AI Boosts Research Careers but Flattens Scientific Discovery.” IEEE Spectrum, January 19, 2026. https://spectrum.ieee.org/ai-science-research-flattens-discovery-2674892739

“AI Boosts Research Careers, Flattens Scientific Discovery.” ACM TechNews, January 21, 2026. https://technews.acm.org/archives.cfm?fo=2026-01-jan/jan-21-2026.html

— Andrew

4,807 hits

CES 2026 and the Illusion of Understanding in Agentic AI

At CES 2026, nearly every major technology company promised the same thing in different words: assistants that finally understand us. These systems were not just answering questions. They were booking reservations, managing homes, summarizing daily life, and acting on a user’s behalf. The message was unmistakable. Language models had moved beyond conversation and into agency.

Yet watching these demonstrations felt familiar in an uncomfortable way. I have seen this confidence before, often at moments when language systems appear fluent while remaining fragile underneath. CES 2026 did not convince me that machines now understand human language. Instead, it exposed how quickly our expectations have outpaced our theories of meaning.

When an assistant takes action, language stops being a surface interface. It becomes a proxy for intent, context, preference, and consequence. That shift raises the bar for computational linguistics in ways that polished demos rarely acknowledge.

From chatting to acting: why agents raise the bar

Traditional conversational systems can afford to be wrong in relatively harmless ways. A vague or incorrect answer is frustrating but contained. Agentic systems are different. When language triggers actions, misunderstandings propagate into the real world.

From a computational linguistics perspective, this changes the problem itself. Language is no longer mapped only to responses but to plans. Commands encode goals, constraints, and assumptions that are often implicit. A request like “handle this later” presupposes shared context, temporal reasoning, and an understanding of what “this” refers to. These are discourse problems, not engineering edge cases.

This distinction echoes long-standing insights in linguistics. Winograd’s classic examples showed that surface structure alone is insufficient for understanding even simple sentences once world knowledge and intention are involved (Winograd). Agentic assistants bring that challenge back, this time with real consequences attached.

Instruction decomposition is not understanding

Many systems highlighted at CES rely on instruction decomposition. A user prompt is broken into smaller steps that are executed sequentially. While effective in constrained settings, this approach is often mistaken for genuine understanding.

Decomposition works best when goals are explicit and stable. Real users are neither. Goals evolve mid-interaction. Preferences conflict with past behavior. Instructions are underspecified. Linguistics has long studied these phenomena under pragmatics, where meaning depends on speaker intention, shared knowledge, and conversational norms (Grice).

Breaking an instruction into steps does not resolve ambiguity. It merely postpones it. Without a model of why a user said something, systems struggle to recover when their assumptions are wrong. Most agentic failures are not catastrophic. They are subtle misalignments that accumulate quietly.

Long-term memory is a discourse problem, not a storage problem

CES 2026 placed heavy emphasis on memory and personalization. Assistants now claim to remember preferences, habits, and prior conversations. The implicit assumption is that more memory leads to better understanding.

In linguistics, memory is not simple accumulation. It is interpretation. Discourse coherence depends on salience, relevance, and revision. Humans forget aggressively, reinterpret past statements, and update beliefs about one another constantly. Storing embeddings of prior interactions does not replicate this process.

Research in discourse representation theory shows that meaning emerges through structured updates to a shared model of the world, not through raw recall alone (Kamp and Reyle). Long-context language models still struggle with this distinction. They can retrieve earlier information but often fail to decide what should matter now.

Multimodality does not remove ambiguity

Many CES demonstrations leaned heavily on multimodal interfaces. Visuals, screens, and gestures were presented as solutions to linguistic ambiguity. In practice, ambiguity persists even when more modalities are added.

Classic problems such as deixis remain unresolved. A command like “put that there” still requires assumptions about attention, intention, and relevance. Visual input often increases the number of possible referents rather than narrowing them. More context does not automatically produce clearer meaning.

Research on multimodal grounding consistently shows that aligning language with perception is difficult precisely because human communication relies on shared assumptions rather than exhaustive specification (Clark). Agentic systems inherit this challenge rather than escaping it.

Evaluation is the quiet failure point

Perhaps the most concerning gap revealed by CES 2026 is evaluation. Success is typically defined as task completion. Did the system book the table? Did the lights turn on? These metrics ignore whether the system actually understood the user or simply arrived at the correct outcome by chance.

Computational linguistics has repeatedly warned against narrow benchmarks that mask shallow competence. Metrics such as BLEU reward surface similarity while missing semantic failure (Papineni et al.). Agentic systems risk repeating this mistake at a higher level.

A system that completes a task while violating user intent is not truly successful. Meaningful evaluation must account for repair behavior, user satisfaction, and long-term trust. These are linguistic and social dimensions, not merely engineering ones.

CES as a mirror for the field

CES 2026 showcased ambition, not resolution. Agentic assistants highlight how far language technology has progressed, but they also expose unresolved questions at the heart of computational linguistics. Fluency is not understanding. Memory is not interpretation. Action is not comprehension.

If agentic AI is the future, then advances will depend less on making models larger and more on how deeply we understand language, context, and human intent.


References

Clark, Herbert H. Using Language. Cambridge University Press, 1996.

Grice, H. P. “Logic and Conversation.” Syntax and Semantics, vol. 3, edited by Peter Cole and Jerry L. Morgan, Academic Press, 1975, pp. 41–58.

Kamp, Hans, and Uwe Reyle. From Discourse to Logic. Springer, 1993.

Papineni, Kishore, et al. “BLEU: A Method for Automatic Evaluation of Machine Translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

Winograd, Terry. “Understanding Natural Language.” Cognitive Psychology, vol. 3, no. 1, 1972, pp. 1–191.

— Andrew

4,807 hits

How Computational Linguistics Can Help Stop Phishing Emails?

I’ve always been curious about how language can reveal hidden clues. One place this really shows up is in phishing emails. These are the fake messages that try to trick people into giving away passwords or personal information. They are annoying, but also dangerous, which makes them a great case study for how computational linguistics can be applied in real life.

Why Phishing Emails Matter

Phishing is more than just spam. A single click on the wrong link can cause real damage, from stolen accounts to financial loss. What interests me is that these emails often give themselves away through language. That is where computational linguistics comes in.

How Language Analysis Helps Detect Phishing

  • Spotting unusual patterns: Models can flag odd grammar or overly formal phrases that do not fit normal business communication.
  • Checking stylistic fingerprints: Everyone has a writing style. Computational models can learn those styles and catch imposters pretending to be someone else.
  • Finding emotional manipulation: Many phishing emails use urgency or fear, like “Act now or your account will be suspended.” Sentiment analysis can identify these tactics.
  • Looking at context and meaning: Beyond surface words, models can ask whether the message makes sense in context. A bank asking for login details over email does not line up with how real banks communicate.

Why This Stood Out to Me

What excites me about this problem is that it shows how language technology can protect people. I like studying computational linguistics because it is not just about theory. It has real applications like this that touch everyday life. By teaching computers to recognize how people write, we can stop scams before they reach someone vulnerable.

My Takeaway

Phishing shows how much power is hidden in language, both for good and for harm. To me, that is the motivation for studying computational linguistics: to design tools that understand language well enough to help people. Problems like phishing remind me why the field matters.


📚 Further Reading

Here are some recent peer-reviewed papers if you want to dive deeper into how computational linguistics and machine learning are used to detect phishing:

  • Recommended for beginners
    Saias, J. (2025). Advances in NLP Techniques for Detection of Message-Based Threats in Digital Platforms: A Systematic Review. Electronics, 14(13), 2551. https://doi.org/10.3390/electronics14132551
    A recent review covering multiple types of digital messaging threats—including phishing—using modern NLP methods. It’s accessible, up to date, and provides a helpful overview. Why I recommend this: As someone still learning computational linguistics, I like starting with survey papers that show many ideas in one place. This one is fresh and covers a lot of ground.
  • Jaison J. S., Sadiya H., Himashree S., M. Jomi Maria Sijo, & Anitha T. G. (2025). A Survey on Phishing Email Detection Techniques: Using LSTM and Deep Learning. International Journal for Research in Applied Science & Engineering Technology (IJRASET), 13(8). https://doi.org/10.22214/ijraset.2025.73836
    Overviews deep learning methods like LSTM, BiLSTM, CNN, and Transformers in phishing detection, with notes on datasets and practical challenges.
  • Alhuzali, A., Alloqmani, A., Aljabri, M., & Alharbi, F. (2025). In-Depth Analysis of Phishing Email Detection: Evaluating the Performance of Machine Learning and Deep Learning Models Across Multiple Datasets. Applied Sciences, 15(6), 3396. https://doi.org/10.3390/app15063396
    Compares various machine learning and deep learning detection models across datasets, offering recent performance benchmarks.

— Andrew

4,807 hits

Looking Back on 2025 (and Ahead to 2026)

Happy New Year 2026! I honestly cannot believe it is already another year. Looking back, 2025 feels like it passed in a blur of late nights, deadlines, competitions, and moments that quietly changed how I think about learning. This blog became my way of slowing things down. Each post captured something I was wrestling with at the time, whether it was research, language, or figuring out what comes next after high school. As I look back on what I wrote in 2025 and look ahead to 2026, this post is both a reflection and a reset.

That sense of reflection shaped how I wrote this year. Many of my early posts grew out of moments where I wished someone had explained a process more clearly when I was starting out.

Personal Growth and Practical Guides

Some of my 2025 writing focused on making opportunities feel more accessible. I wrote about publishing STEM research as a high school student and tried to break down the parts that felt intimidating at first, like where to submit and what “reputable” actually means in practice.

I also shared recommendations for summer programs and activities in computational linguistics, pulling from what I applied to, what I learned, and what I wish I had known earlier. Writing these posts helped me realize how much “figuring it out” is part of the process.

As I got more comfortable sharing advice, my posts started to shift outward. Instead of only focusing on how to get into research, I began asking bigger questions about how language technology shows up in real life.

Research and Real-World Application

In the first few months of the year, I stepped back from posting as school, VEX Robotics World Championship, and research demanded more of my attention. When I came back, one of the posts that felt most meaningful to write was Back From Hibernation. In it, I reflected on how sustained effort turned into a tangible outcome: a co-authored paper accepted to a NAACL 2025 workshop.

Working with my co-author and mentor, Sidney Wong, taught me a lot about the research process, especially how to respond thoughtfully to committee feedback and refine a paper through a careful round of revision. More than anything, that experience showed me what academic research looks like beyond the initial idea. It is iterative, collaborative, and grounded in clarity.

Later posts explored the intersection of language technology and society. I wrote about AI resume scanners and the ethical tensions they raise, especially when automation meets human judgment. I also reflected on applications of NLP in recommender systems after following work presented at RecSys 2025, which expanded my view of where computational linguistics appears beyond the examples people usually cite.

Another recurring thread was how students, especially high school students, can connect with professors for research. Writing about that made me more intentional about how I approach academic communities, not just as someone trying to get a yes, but as someone who genuinely wants to learn.

Those topics were not abstract for me. In 2025, I also got to apply these ideas through Student Echo, my nonprofit focused on listening to student voices at scale.

Student Echo and Hearing What Students Mean

Two of the most meaningful posts I wrote this year were about Student Echo projects where we used large language models to help educators understand open-ended survey responses.

In Using LLMs to Hear What Students Are Really Saying, I shared how I led a Student Echo collaboration with the Lake Washington School District, supported by district leadership and my principal, to extract insights from comments that are often overlooked because they are difficult to analyze at scale. The goal was simple but ambitious: use language models to surface what students care about, where they are struggling, and what they wish could be different.

In AI-Driven Insights from the Class of 2025 Senior Exit Survey, I wrote about collaborating with Redmond High School to analyze responses from the senior exit survey. What stood out to me was how practical the insights became once open-ended text was treated seriously, from clearer graduation task organization to more targeted counselor support.

Writing these posts helped me connect abstract AI ideas to something grounded and real. When used responsibly, these tools can help educators listen to students more clearly.

Not all of my learning in 2025 happened through writing or research, though. Some of the most intense lessons happened in the loudest places possible.

Robotics and Real-World Teamwork

A major part of my year was VEX Robotics. In my VEX Worlds 2025 recap, I wrote about what it felt like to compete globally with my team, Ex Machina, after winning our state championship. The experience forced me to take teamwork seriously in a way that is hard to replicate anywhere else. Design matters, but communication and adaptability matter just as much.

In another post, I reflected on gearing up for VEX Worlds 2026 in St. Louis. That one felt more reflective, not just because of the competition ahead, but because it made me think about what it means to stay committed to a team while everything else in life is changing quickly.

Experiences like VEX pushed me to think beyond my own projects. That curiosity carried into academic spaces as well.

Conferences and Big Ideas

Attending SCiL 2025 was my first real academic conference, and writing about it helped me process how different it felt from school assignments. I also reflected on changes to arXiv policy and what they might mean for openness in research. These posts marked a shift from learning content to thinking about how research itself is structured and shared.

Looking across these posts now, from robotics competitions to survey analytics to research reflections, patterns start to emerge.

Themes That Defined My Year

Across everything I wrote in 2025, a few ideas kept resurfacing:

  • A consistent interest in how language and AI intersect in the real world
  • A desire to make complex paths feel more navigable for other students
  • A growing appreciation for the human side of technical work, including context, trust, and listening

2025 taught me as much outside the classroom as inside it. This blog became a record of that learning.

Looking Toward 2026

As 2026 begins, I see this blog less as a record of accomplishments and more as a space for continued exploration. I am heading into the next phase of my education with more questions than answers, and I am okay with that. I want to keep writing about what I am learning, where I struggle, and how ideas from language, AI, and engineering connect in unexpected ways. If 2025 was about discovering what I care about, then 2026 is about going deeper, staying curious, and building with intention.

Thanks for reading along so far. I am excited to see where this next year leads.

— Andrew

4,807 hits

From Human Chatbots to Whale and Bird Talk: The Surprising Rise of Bio-Acoustic NLP in 2025

As a high school student passionate about computational linguistics, I find it amazing how the same technologies that power our everyday chatbots and voice assistants are now being used to decode animal sounds. This emerging area blends bioacoustics (the study of animal vocalizations) with natural language processing (NLP) and machine learning. Researchers are starting to treat animal calls almost like a form of language, analyzing them for patterns, individual identities, species classification, and even possible meanings.

Animal vocalizations do not use words the way humans do, but they frequently show structure, repetition, and context-dependent variation, features that remind us of linguistic properties in human speech.

A Highlight from ACL 2025: Monkey Voices Get the AI Treatment

One of the most interesting papers presented at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), the leading conference in our field, focuses directly on this topic.

Paper title: “Acoustic Individual Identification of White-Faced Capuchin Monkeys Using Joint Multi-Species Embeddings

Authors: Álvaro Vega-Hidalgo, Artem Abzaliev, Thore Bergman, Rada Mihalcea (University of Michigan)

What the paper covers

White-faced capuchin monkeys each have a unique vocal signature. Being able to identify which individual is calling is valuable for studying their social structures, kinship, and conservation efforts.

The main difficulty is the lack of large labeled datasets for wild or rare species. Human speech has massive annotated corpora, but animal data is much scarcer.

The researchers address this through cross-species pre-training, a transfer learning strategy. They take acoustic embedding models (essentially sound “fingerprints”) pre-trained on: (1) Extensive human speech data and (2) Large-scale bird call datasets.

These models are then applied to white-faced capuchin vocalizations, even though the original training never included capuchin sounds.

Key findings

  • Embeddings derived from human speech and bird calls transferred surprisingly well to monkey vocalizations.
  • Combining multi-species representations (joint embeddings) improved identification accuracy further.

This demonstrates how knowledge from one domain can help another distant one, similar to how learning one human language can make it easier to pick up a related one. It offers a practical solution to the data scarcity problem that often limits animal bioacoustics research.

This paper was one of 22 contributions from the University of Michigan’s Computer Science and Engineering group at ACL 2025, showing how far computational linguistics has expanded beyond traditional human text and speech.

Another ACL 2025 Contribution: Exploring Dog Communication

ACL 2025 also included “Toward Automatic Discovery of a Canine Phonetic Alphabet” by Theron S. Wang and colleagues. The work investigates the phonetic-like building blocks in dog vocalizations and aims to discover them automatically. This is an early step toward analyzing dog sounds in a more structured, language-inspired framework.

Why This Matters

  • Conservation applications — Automated systems can monitor endangered species like whales or rare birds continuously, reducing the need for long-term human fieldwork in remote locations.
  • Insights into animal communication — Researchers are beginning to test whether calls follow rule-based patterns or convey specific information (about food, threats, or social bonds), much like how humans use syntax and intonation.
  • Transfer of AI techniques — Models originally built for human speech transfer effectively to other species. New foundation models in 2025 (e.g., like NatureLM-audio) even handle thousands of animal species and support natural language queries such as “What bird is calling here?”

While these ACL 2025 papers represent cutting-edge academic work, the broader field is gaining momentum, with related discussions appearing in events like the 2025 NeurIPS workshop on AI for Non-Human Animal Communication.

This area is growing rapidly thanks to better data availability and stronger models. In the coming years, we might see practical tools that help interpret bird alarm calls or monitor ocean ecosystems through whale vocalizations.

What do you think? Would you be excited to build a simple AI tool to analyze your pet’s sounds or contribute to dolphin communication research? Computational linguistics is moving far beyond chatbots. It is now helping us listen to the voices of the entire planet.

Thanks for reading. I’d love to hear your thoughts in the comments!

— Andrew

4,807 hits

From VEX Robotics to Silicon Valley: Why Physical Intelligence Is Harder Than It Looks

According to ACM TechNews (Wednesday, December 17, 2025), ACM Fellow Rodney Brooks argues that Silicon Valley’s current obsession with humanoid robots is misguided and overhyped. Drawing on decades of experience, he contends that general-purpose, humanlike robots remain far from practical, unsafe to deploy widely, and unlikely to achieve human-level dexterity in the near future. Brooks cautions that investors are confusing impressive demonstrations and AI training techniques with genuine real-world capability. Instead, he argues that meaningful progress will come from specialized, task-focused robots designed to work alongside humans rather than replace them. The original report was published in The New York Times under the title “Rodney Brooks, the Godfather of Modern Robotics, Says the Field Has Lost Its Way.”

I read the New York Times coverage of Rodney Brooks’ argument that Silicon Valley’s current enthusiasm for humanoid robots is likely to end in disappointment. Brooks is widely respected in the robotics community. He co-founded iRobot and has played a major role in shaping modern robotics research. His critique is not anti-technology rhetoric but a perspective grounded in long experience with the practical challenges of engineering physical systems. He makes a similar case in his blog post, “Why Today’s Humanoids Won’t Learn Dexterity”.

Here’s what his core points seem to be:

Why he thinks this boom will fizzle

  • The industry is betting huge sums on general-purpose humanoid robots that can do everything humans do—walk, manipulate objects, adapt to new tasks—based on current AI methods. Brooks argues that belief in this near-term is “pure fantasy” because we still lack the basic sensing and physical dexterity that humans take for granted.
  • He emphasizes that visual data and generative models aren’t a substitute for true touch sensing and force control. Current training methods can’t teach a robot to use its hands with the precision and adaptation humans have.
  • Safety and practicality matter too. Humanoid robots that fall or make a mistake could be dangerous around people, which slows deployment and commercial acceptance.
  • He expects a big hype phase followed by a trough of disappointment—a period where money flows out of the industry because the technology hasn’t lived up to its promises.

Where I agree with him

I think Brooks is right that engineering the physical world is harder than it looks. Software breakthroughs like large language models (LLMs) are impressive, but even brilliant language AI doesn’t give a robot the equivalent of muscle, touch, balance, and real-world adaptability. Robots that excel at one narrow task (like warehouse arms or autonomous vacuum cleaners) don’t generalize to ambiguous, unpredictable environments like a home or workplace the way vision-based AI proponents hope. The history of robotics is full of examples where clever demos got headlines long before practical systems were ready.

It would be naive to assume that because AI is making rapid progress in language and perception, physical autonomy will follow instantly with the same methods.

Where I think he might be too pessimistic

Fully dismissing the long-term potential of humanoid robots seems premature. Complex technology transitions often take longer and go in unexpected directions. For example, self-driving cars have taken far longer than early boosters predicted, but we are seeing incremental deployments in constrained zones. Humanoid robots could follow a similar curve: rather than arriving as general-purpose helpers, they may find niches first (healthcare support, logistics, elder care) where the environment and task structure make success easier. Brooks acknowledges that robots will work with humans, but probably not in a human look-alike form in everyday life for decades.

Also, breakthroughs can come from surprising angles. It’s too soon to say that current research paths won’t yield solutions to manipulation, balance, and safety, even if those solutions aren’t obvious yet.

Bottom line

Brooks’ critique is not knee-jerk pessimism. It is a realistic engineering assessment grounded in decades of robotics experience. He is right to question hype and to emphasize that physical intelligence is fundamentally different from digital intelligence.

My experience in VEX Robotics reinforces many of his concerns, even though VEX robots are not humanoid. Building competition robots showed me how fragile physical systems can be. Small changes in friction, battery voltage, alignment, or field conditions routinely caused failures that no amount of clever code could fully anticipate. Success came from tightly scoped designs, extensive iteration, and task-specific mechanisms rather than general intelligence. That contrast makes the current humanoid hype feel misaligned with how robotics actually progresses in practice, where reliability and constraint matter more than appearance or breadth.

Dismissing the possibility of humanoid robots entirely may be too strict, but expecting rapid, general-purpose success is equally misguided. Progress will likely be slower, more specialized, and far less dramatic than Silicon Valley forecasts suggest.

— Andrew

4,807 hits

A Short Guide to Understanding NeurIPS 2025 Through Three Key Reports

Introduction

NeurIPS (Neural Information Processing Systems) 2025 brought together the global machine learning community for its thirty ninth annual meeting. It represents both continuity and change in the world’s premier machine learning conference. Held December 2 to 7 in San Diego, with a simultaneous secondary site in Mexico City, the conference drew enormous attention from researchers across academia, industry, and policy. The scale was striking. There were more than 21,575 submissions and over 5,200 accepted papers, which placed the acceptance rate at about 24.5 percent. With such breadth, NeurIPS 2025 offered a detailed look at the current state of AI research and the questions shaping its future.

Why I Follow the Conference

Even though my senior year has been filled with college applications and demanding coursework, I continue to follow NeurIPS closely because it connects directly to my future interests in computational linguistics and NLP. Reading every paper is unrealistic, but understanding the broader themes is still possible. For students or early researchers who want to stay informed without diving into thousands of pages, the following three references are especially helpful.

References:

  1. NeurIPS 2025: A Guide to Key Papers, Trends & Stats (Intuition Labs)
  2. Trends in AI at NeurIPS 2025 (Medium)
  3. At AI’s biggest gathering, its inner workings remain a mystery (NBC News)

Executive Summary of the Three Reports

1. Intuition Labs: Key Papers, Trends, and Statistics

The Intuition Labs summary of NeurIPS 2025 is a detailed, professionally structured report that provides a comprehensive overview of the conference. It opens with an Executive Summary highlighting key statistics, trends, awards, and societal themes, followed by sections on Introduction and Background, NeurIPS 2025 Organization and Scope (covering dates, venues, scale, and comparisons to prior years), and Submission and Review Process (with subsections on statistics, responsible practices, and ethics).

The report then delves into the core content through Technical Program Highlights (key themes, notable papers, and interdisciplinary bridging), Community and Social Aspects (affinity events, workshops, industry involvement, and conference life), Data and Evidence: Trends Analysis, Case Studies and Examples (including the best paper on gated attention and an invited talk panel), Implications and Future Directions, and a concluding section that reflects on the event’s significance. This logical flow, from context and logistics to technical depth, community, evidence, specifics, and forward-looking insights, makes it an ideal reference for understanding the conference’s breadth and maturation of AI research. It is a helpful summary for readers who want both numbers and high level insights.

2. Medium: Trends in AI at NeurIPS 2025

This article highlights key trends observed at NeurIPS 2025 through workshops, signaling AI’s maturation beyond text-based models. Major themes include embodied AI in physical/biological realms (e.g., animal communication via bioacoustics, health applications with regulatory focus, robotic world models, spatial reasoning, brain-body foundations, and urban/infrastructure optimization); reliability and interpretability (robustness against unreliable data, regulatable designs, mechanistic interpretability of model internals, and lifecycle-aware LLM evaluations); advanced reasoning and agents (multi-turn interactions, unified language-agent-world models, continual updates, mathematical/logical reasoning, and scientific discovery); and core theoretical advancements (optimization dynamics, structured graphs, and causality).

The author concludes that AI is evolving into situated ecosystems integrating biology, cities, and agents, prioritizing structure, geometry, causality, and protective policies alongside innovation, rather than pure scaling.

3. NBC News: The Challenge of Understanding AI Systems

NBC News focuses on a different but equally important issue. Even with rapid performance gains, researchers remain unsure about what drives model behavior. Many noted that interpretability is far behind capability growth. The article describes concerns about the lack of clear causal explanations for model outputs and the difficulty of ensuring safety when internal processes are not fully understood. Several researchers emphasized that the field needs better tools for understanding neural networks before deploying them widely. This tension between rapid advancement and limited interpretability shaped many of the conversations at NeurIPS 2025.

For Further Exploration

For readers who want to explore the conference directly, the NeurIPS 2025 website provides access to papers, schedules, and workshop materials:
https://neurips.cc/Conferences/2025

— Andrew

4,807 hits

How AI and Computational Linguistics Are Unlocking Medieval Jewish History

On December 3 (2025), ACM TechNews featured a story about a groundbreaking use of artificial intelligence in historical and linguistic research. It referred to an earlier report “Vast trove of medieval Jewish records opened up by AI” from Reuters. The article described a new project applying AI to the Cairo Geniza, a massive archive of medieval Jewish manuscripts that spans nearly one thousand years. These texts were preserved in a synagogue storeroom and contain records of daily life, legal matters, trade, personal letters, religious study, and community events.

The goal of the project is simple in theory and monumental in practice. Researchers are training an AI system to read, transcribe, and organize hundreds of thousands of handwritten documents. This would allow scholars to access the material far more quickly than traditional methods permit.


Handwriting Recognition for Historical Scripts

Computational linguistics plays a direct role in how machines learn to read ancient handwriting. AI models can be taught to detect character shapes, page layouts, and writing patterns even when the script varies from one writer to another or comes from a style no longer taught today. This helps the system replicate the work of experts who have spent years studying how historical scripts evolved.


Making the Text Searchable and Comparable

Once the handwriting is converted to text, another challenge begins. Historical manuscripts often use non standard spelling, abbreviations, and inconsistent grammar. Computational tools can normalize these differences, allowing researchers to search archives accurately and evaluate patterns that would be difficult to notice manually.


Extracting Meaning Through NLP

After transcription and normalization, natural language processing tools can identify names, dates, locations, and recurring themes in the documents. This turns raw text into organized data that supports historical analysis. Researchers can explore how people, places, and ideas were connected across time and geography.


Handling Multiple Languages and Scripts

The Cairo Geniza contains material written in Hebrew, Arabic, Aramaic, and Yiddish. A transcription system must recognize and handle multiple scripts, alphabets, and grammatical structures. Computational linguistics enables the AI to adapt to these differences so the dataset becomes accessible as a unified resource.


Restoring Damaged Manuscripts

Many texts are incomplete because of age and physical deterioration. Modern work in ancient text restoration uses machine learning models to predict missing letters or words based on context and surrounding information. This helps scholars reconstruct documents that might otherwise remain fragmented.


Why This Matters for Researchers and the Public

AI allows scholars to process these manuscripts on a scale that would not be feasible through manual transcription alone. Once searchable, the collection becomes a resource for historians, linguists, and genealogists. Connections between communities and individuals can be explored in ways that were not possible before. Articles about the project suggest that this could lead to a mapping of relationships similar to a historical social graph.

This technology also expands access beyond expert scholars. Students, teachers, local historians, and interested readers may one day explore the material in a clear and searchable form. If automated translation improves alongside transcription, the archive could become accessible to a global audience.


Looking Ahead

This project is a strong example of how computational linguistics can support the humanities. It shows how tools developed for modern language tasks can be applied to cultural heritage, historical research, and community memory. AI is not replacing the work of historians. Instead, it is helping uncover material that scholars would never have time to process on their own.

Projects like this remind us that the intersection of language and technology is not only changing the future. It is now offering a deeper look into the past.

— Andrew

4,807 hits

AI Sycophancy: When Our Chatbots Say “Yes” Instead of “Why”

“I asked ChatGPT to check my argument and it just kept agreeing with me.”
“Gemini told me my logic was solid even when I knew it wasn’t.”
“Grok feels like a hype-man, not a thinking partner.”

These are the kinds of comments I keep seeing from my school friends who feel that modern AI tools are becoming too agreeable for their own good. Instead of challenging flawed reasoning or offering alternative perspectives, many chatbots default to affirmation. This behavior has a name: AI sycophancy. The term does not originate from me. It comes from recent research and ongoing conversations in the AI community, where scholars are identifying a growing tendency for AI systems to prioritize user approval over honest reasoning.

At first glance, this might feel harmless or even comforting. After all, who does not like being told they are right? But beneath that friendliness lies a deeper problem that affects how we learn, decide, and think.


What is AI Sycophancy?

AI sycophancy refers to a pattern in which an AI system aligns its responses too closely with a user’s expressed beliefs or desires, even when those beliefs conflict with evidence or logic. Rather than acting as an independent evaluator, the model becomes a mirror.

For example, a user might say, “I think this argument is correct. Do you agree?” and the model responds with enthusiastic confirmation instead of critical analysis. Or the system might soften disagreement so much that it effectively disappears. Recent research from Northeastern University confirms that this behavior is measurable and problematic. Their report, The AI industry has a problem: Chatbots are too nice, shows that when models alter their reasoning to match a user’s stance, their overall accuracy and rationality decline.
https://news.northeastern.edu/2025/11/24/ai-sycophancy-research/


Why Does It Exist?

Several forces contribute to the rise of AI sycophancy:

  • Training incentives and reward systems.
    Many models are optimized to be helpful, polite, and pleasant. When user satisfaction is a core metric, models learn that agreement often leads to positive feedback.
  • User expectations.
    People tend to treat chatbots as friendly companions rather than critical reviewers. When users express certainty, the model often mirrors that confidence instead of questioning it.
  • Alignment trade-offs.
    The Northeastern team highlights a tension between sounding human and being rational. In attempting to appear empathetic and affirming, the model sometimes sacrifices analytical rigor.
  • Ambiguous subject matter.
    In questions involving ethics, predictions, or subjective judgment, models may default to agreement rather than risk appearing confrontational or incorrect.

What Are the Impacts?

The consequences of AI sycophancy extend beyond mild annoyance.

  • Weakened critical thinking.
    Students who rely on AI for feedback may miss opportunities to confront their own misconceptions.
  • Lower reasoning quality.
    The Northeastern study found that adjusting answers to match user beliefs correlates with poorer logic and increased error rates.
  • Risk in high-stakes contexts.
    In healthcare, policy, or education, an overly agreeable AI can reinforce flawed assumptions and lead to harmful decisions.
  • False confidence.
    When AI consistently affirms users, it creates an illusion of correctness that discourages self-reflection.
  • Ethical concerns.
    A system that never challenges bias or misinformation becomes complicit in reinforcing it.

How to Measure and Correct It

Measuring sycophancy

Researchers measure sycophancy by observing how much a model shifts its answer after a user asserts a belief. A typical approach involves:

  • Presenting the model with a scenario and collecting its initial judgment.
  • Repeating the scenario alongside a strong user opinion or belief.
  • Comparing the degree to which the model’s stance moves toward the user’s position.
  • Evaluating whether the reasoning quality improves, stays stable, or deteriorates.

The greater the shift without supporting evidence, the higher the sycophancy score.


Correcting the behavior

Several strategies show promise:

  • Penalize agreement that lacks evidence during training.
  • Encourage prompts that demand critique or alternative views.
  • Require models to express uncertainty or justify reasoning steps.
  • Educate users to value disagreement as a feature rather than a flaw.
  • Use multi-agent systems where one model challenges another.
  • Continuously track and adjust sycophancy metrics in deployed systems.

Why This Matters to Me as a Student

As someone preparing to study computational linguistics and NLP, I want AI to help sharpen my thinking, not dull it. If my research assistant simply validates every claim I make, I risk building arguments that collapse under scrutiny. In chess, improvement only happens through strong opposition. The same is true for intellectual growth. Agreement without resistance is not growth. It is stagnation.

Whether I am analyzing Twitch language patterns or refining a research hypothesis, I need technology that questions me, not one that treats every idea as brilliant.


Final Thought

The Northeastern research reminds us that politeness is not the same as intelligence. A chatbot that constantly reassures us might feel supportive, but it undermines the very reason we turn to AI in the first place. We do not need machines that echo our beliefs. We need machines that help us think better.

AI should challenge us thoughtfully, disagree respectfully, and remain grounded in evidence. Anything less turns a powerful tool into a flattering reflection.

— Andrew

4,807 hits

Blog at WordPress.com.

Up ↑