AI Sycophancy: When Our Chatbots Say “Yes” Instead of “Why”

“I asked ChatGPT to check my argument and it just kept agreeing with me.”
“Gemini told me my logic was solid even when I knew it wasn’t.”
“Grok feels like a hype-man, not a thinking partner.”

These are the kinds of comments I keep seeing from my school friends who feel that modern AI tools are becoming too agreeable for their own good. Instead of challenging flawed reasoning or offering alternative perspectives, many chatbots default to affirmation. This behavior has a name: AI sycophancy. The term does not originate from me. It comes from recent research and ongoing conversations in the AI community, where scholars are identifying a growing tendency for AI systems to prioritize user approval over honest reasoning.

At first glance, this might feel harmless or even comforting. After all, who does not like being told they are right? But beneath that friendliness lies a deeper problem that affects how we learn, decide, and think.


What is AI Sycophancy?

AI sycophancy refers to a pattern in which an AI system aligns its responses too closely with a user’s expressed beliefs or desires, even when those beliefs conflict with evidence or logic. Rather than acting as an independent evaluator, the model becomes a mirror.

For example, a user might say, “I think this argument is correct. Do you agree?” and the model responds with enthusiastic confirmation instead of critical analysis. Or the system might soften disagreement so much that it effectively disappears. Recent research from Northeastern University confirms that this behavior is measurable and problematic. Their report, The AI industry has a problem: Chatbots are too nice, shows that when models alter their reasoning to match a user’s stance, their overall accuracy and rationality decline.
https://news.northeastern.edu/2025/11/24/ai-sycophancy-research/


Why Does It Exist?

Several forces contribute to the rise of AI sycophancy:

  • Training incentives and reward systems.
    Many models are optimized to be helpful, polite, and pleasant. When user satisfaction is a core metric, models learn that agreement often leads to positive feedback.
  • User expectations.
    People tend to treat chatbots as friendly companions rather than critical reviewers. When users express certainty, the model often mirrors that confidence instead of questioning it.
  • Alignment trade-offs.
    The Northeastern team highlights a tension between sounding human and being rational. In attempting to appear empathetic and affirming, the model sometimes sacrifices analytical rigor.
  • Ambiguous subject matter.
    In questions involving ethics, predictions, or subjective judgment, models may default to agreement rather than risk appearing confrontational or incorrect.

What Are the Impacts?

The consequences of AI sycophancy extend beyond mild annoyance.

  • Weakened critical thinking.
    Students who rely on AI for feedback may miss opportunities to confront their own misconceptions.
  • Lower reasoning quality.
    The Northeastern study found that adjusting answers to match user beliefs correlates with poorer logic and increased error rates.
  • Risk in high-stakes contexts.
    In healthcare, policy, or education, an overly agreeable AI can reinforce flawed assumptions and lead to harmful decisions.
  • False confidence.
    When AI consistently affirms users, it creates an illusion of correctness that discourages self-reflection.
  • Ethical concerns.
    A system that never challenges bias or misinformation becomes complicit in reinforcing it.

How to Measure and Correct It

Measuring sycophancy

Researchers measure sycophancy by observing how much a model shifts its answer after a user asserts a belief. A typical approach involves:

  • Presenting the model with a scenario and collecting its initial judgment.
  • Repeating the scenario alongside a strong user opinion or belief.
  • Comparing the degree to which the model’s stance moves toward the user’s position.
  • Evaluating whether the reasoning quality improves, stays stable, or deteriorates.

The greater the shift without supporting evidence, the higher the sycophancy score.


Correcting the behavior

Several strategies show promise:

  • Penalize agreement that lacks evidence during training.
  • Encourage prompts that demand critique or alternative views.
  • Require models to express uncertainty or justify reasoning steps.
  • Educate users to value disagreement as a feature rather than a flaw.
  • Use multi-agent systems where one model challenges another.
  • Continuously track and adjust sycophancy metrics in deployed systems.

Why This Matters to Me as a Student

As someone preparing to study computational linguistics and NLP, I want AI to help sharpen my thinking, not dull it. If my research assistant simply validates every claim I make, I risk building arguments that collapse under scrutiny. In chess, improvement only happens through strong opposition. The same is true for intellectual growth. Agreement without resistance is not growth. It is stagnation.

Whether I am analyzing Twitch language patterns or refining a research hypothesis, I need technology that questions me, not one that treats every idea as brilliant.


Final Thought

The Northeastern research reminds us that politeness is not the same as intelligence. A chatbot that constantly reassures us might feel supportive, but it undermines the very reason we turn to AI in the first place. We do not need machines that echo our beliefs. We need machines that help us think better.

AI should challenge us thoughtfully, disagree respectfully, and remain grounded in evidence. Anything less turns a powerful tool into a flattering reflection.

— Andrew

4,361 hits

How Chatbots Understand Us: Exploring the Basics of Natural Language Processing (NLP)

If you’ve ever asked Siri a question, chatted with a customer support bot, or played around with ChatGPT, you’ve already seen natural language processing (NLP) in action. But have you ever wondered: How do these systems actually understand what I’m saying? That question is what first got me curious about NLP, and now, as a high school student diving into computational linguistics, I want to break it down for others who might be wondering too.


What Is NLP?

Natural Language Processing is a branch of artificial intelligence (AI) that helps computers understand, interpret, and generate human language. It allows machines to read text, hear speech, figure out what it means, and respond in a way that (hopefully) makes sense.

NLP is used in tons of everyday tools and apps, like:

  • Chatbots and virtual assistants (Siri, Alexa, Google Assistant)
  • Translation tools (Google Translate)
  • Grammar checkers (like Grammarly)
  • Sentiment analysis (used by companies to understand customer reviews)
  • Smart email suggestions (like Gmail’s autocomplete)

How Do Chatbots Understand Language?

Here’s a simplified view of what happens when you talk to a chatbot:

1. Text Input

You say something like: “What’s the weather like today?”
If it’s a voice assistant, your speech is first turned into text through speech recognition.

2. Tokenization

The text gets split into chunks called tokens (usually words or phrases). So that sentence becomes:
[“What”, “’s”, “the”, “weather”, “like”, “today”, “?”]

3. Understanding Intent and Context

The chatbot has to figure out what you mean. Is this a question? A request? Does “weather” refer to the forecast or something else?

This part usually involves models trained on huge amounts of text data, which learn patterns of how people use language.

4. Generating a Response

Once the bot understands your intent, it decides how to respond. Maybe it retrieves information from a weather API or generates a sentence like “Today’s forecast is sunny with a high of 75°F.”

All of this happens in just a few seconds.


Some Key Concepts in NLP

If you’re curious to dig deeper into how this all works, here are a few beginner-friendly concepts to explore:

  • Syntax and Parsing: Figuring out sentence structure (nouns, verbs, grammar rules)
  • Semantics: Understanding meaning and context
  • Named Entity Recognition (NER): Detecting names, dates, locations in a sentence
  • Language Models: Tools like GPT or BERT that learn how language works from huge datasets
  • Word Embeddings: Representing words as vectors so computers can understand similarity (like “king” and “queen” being close together in vector space)

Why This Matters to Me

My interest in NLP and computational linguistics started with my nonprofit work at Student Echo, where we use AI to analyze student survey responses. Since then, I’ve explored research topics like sentiment analysis, LLMs vs. neural networks, and even co-authored a paper accepted at a NAACL 2025 workshop. I also use tools like Zotero to manage my reading and citations, something I wish I had known earlier.

What excites me most is how NLP combines computer science with human language. I’m especially drawn to the possibilities of using NLP to better understand online communication (like on Twitch) or help preserve endangered languages.


Final Thoughts

So the next time you talk to a chatbot, you’ll know there’s a lot going on behind the scenes. NLP is a powerful mix of linguistics and computer science, and it’s also a really fun space to explore as a student.

If you’re curious about getting started, try exploring Python, open-source NLP libraries like spaCy or NLTK, or even just reading research papers. It’s okay to take small steps. I’ve been there too. 🙂

— Andrew

4,361 hits

AI in Schoolwork: Difference Approaches Taken in the U.S. and China

Recently, I read an article from MIT Technology Review titled “Chinese universities want students to use more AI, not less.” It really made me think about the differences in how the U.S. and China are approaching AI in education, especially as a high school student growing up in Washington state.

In China, AI has gone from being a taboo to a toolkit in just a couple of years. University students once had to find mirror versions of ChatGPT through secondhand marketplaces and VPNs just to access the tools. Back then, professors warned students not to use AI for assignments. But now, things have completely changed.

Chinese universities are actively encouraging students to use generative AI tools, as long as they follow best practices. Professors are adding AI-specific lessons to their classes. For example, one law professor teaches students how to prompt effectively and reminds them that AI is only useful when combined with human judgment. Students are using tools like DeepSeek for everything from writing literature reviews to organizing thoughts.

This push for AI education isn’t just happening in individual classrooms. It’s backed by national policy. The Chinese Ministry of Education released guidelines in April 2025 calling for an “AI plus education” approach. The goal is to help students develop critical thinking, digital fluency, and real-world skills across all education levels. Cities like Beijing have even introduced AI instruction in K–12 schools.

In China, AI is also viewed as a key to career success. A report from YiCai found that 80 percent of job listings for recent college grads mention AI as a desired skill. So students see learning how to use AI properly as something that gives them a competitive edge in a tough job market.

That’s pretty different from what I’ve seen here in the U.S.

In July 2024, the Washington Office of Superintendent of Public Instruction (OSPI) released official guidance for AI in schools. The message isn’t about banning AI. It’s about using it responsibly. The guidance encourages human-centered learning, with values like transparency, privacy, equity, and critical thinking. Students are encouraged to use AI tools to support their learning, but not to replace it.

Instead of secretly using AI to write a paper, students in Washington are encouraged to talk openly about how and when they use it. Teachers are reminded that AI should be a support, not a shortcut. The guidance also warns about overusing AI detection tools, especially since those tools can sometimes unfairly target multilingual students.

Adding to this, a recent brain-scan study by MIT Media Lab called “Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task” raises some interesting points. Over four months, participants had their brains scanned while using ChatGPT for writing tasks. The results were surprising:

  • 83% of AI users couldn’t remember what they had just written
  • Brain activity dropped by 47% in AI users and stayed low even after stopping
  • Their writing was technically correct but described by teachers as robotic
  • ChatGPT made users 60% faster, but reduced learning-related brain activity by 32%

The group that performed the best started their work without AI and only added it later. They had stronger memory, better brain engagement, and wrote with more depth. This shows that using AI right matters. If we rely on it too much, we might actually learn less.

MIT’s full research can be found here or read the paper on arXiv. (a caveat called out by the research team: “as of June 2025, when the first paper related to the project, was uploaded to Arxiv, the preprint service, it has not yet been peer-reviewed, thus all the conclusions are to be treated with caution and as preliminary”)

So what does this all mean?

I think both China’s and our approaches have something valuable to offer. China is focused on future skills and career readiness. The U.S. is focused on ethics, fairness, and critical thinking. Personally, I believe students should be allowed to use AI in schoolwork, but with the right guidance. We should be learning how to prompt better, double-check results, and combine AI tools with our own thinking.

AI is already part of our world. Instead of hiding from it, we should be learning how to use it the right way.

You can read the full MIT Technology Review article here
Washington’s official AI guidance for schools (published July 2024) is here (PDF)

— Andrew

4,361 hits

Learning from Industry: How Companies Evaluate LLMs

Over the past few years, large language models (LLMs) have been everywhere. From chatbots that help you book flights to tools that summarize long documents, companies are finding ways to use LLMs in real products. But success is not guaranteed. In fact, sometimes it goes very wrong. A famous example was when Expedia’s chatbot once gave instructions on how to make a Molotov cocktail (Cybernews Report; see the chatbot screenshot below). Another example was Air Canada’s AI-powered chatbot making a significant error by providing incorrect information regarding bereavement fares (BBC Report). Mistakes like these show how important it is for industry practitioners to build strong evaluation systems for LLMs.

Recently, I read a blog post from GoDaddy’s engineering team about how they evaluate LLMs before putting them into real-world use (GoDaddy Engineering Blog). Their approach stood out to me because it was more structured than just running a few test questions. Here are the main lessons I took away:

  1. Tie evaluations to business outcomes
    Instead of treating testing as an afterthought, GoDaddy connects test data directly to golden datasets. These datasets are carefully chosen examples that represent what the business actually cares about.
  2. Use both classic and new evaluation methods
    Traditional machine learning metrics like precision and recall still matter. But GoDaddy also uses newer approaches like “LLM-as-a-judge,” where another model helps categorize specific errors.
  3. Automate and integrate evaluation into development
    Evaluation isn’t just something you do once. GoDaddy treats it as part of a continuous integration pipeline. They expand their golden datasets, add new feedback loops, and refine their systems over time.

As a high school student, I’m not joining the tech industry tomorrow. Still, I think it’s important for me to pay attention to best practices like these. They show me how professionals handle problems that I might face later in my own projects. Even though my experiments with neural networks or survey sentiment analysis aren’t at the scale of Expedia, Air Canada, or GoDaddy, I can still practice connecting my evaluations to real outcomes, thinking about error types, and making testing part of my workflow.

The way I see it, learning industry standards now gives me a head start for the future. And maybe when I get to do college research or internships, I’ll already be used to thinking about evaluation in a systematic way rather than as an afterthought.

— Andrew

4,361 hits

How Large Language Models Are Changing Relation Extraction in NLP

When you type a question into a search engine like “Who wrote Hamlet?” it does more than match keywords. It connects the dots between “Shakespeare” and “Hamlet,” identifying the relationship between a person and their work. This process of finding and labelling relationships in text is called relation extraction (RE). It powers everything from knowledge graphs to fact-checking systems.

In the past, relation extraction systems were built with hand-crafted rules or required thousands of annotated examples to train. Now, large language models (LLMs) such as GPT, T5, and LLaMA are making it possible to do relation extraction with far less data and more flexibility. In this post, I want to explore what relation extraction is, how LLMs are transforming it, and why this matters for anyone interested in the future of language technology.


The Old Way Versus the New Way

Relation extraction used to rely heavily on feature engineering. Researchers would first hand-craft linguistic or statistical features from the text, such as part-of-speech tags, dependency parses, named entity types, and the words appearing between two entities. They often built lexical patterns like “X, the capital of Y” or “X, founded by Y” to capture specific relationships. They also extracted syntactic paths (for example, the shortest path between two entities in a dependency tree) or used windowed word embeddings around the entities to encode local context. These features were then fed into classifiers like Support Vector Machines, logistic regression models, or decision trees to decide which relationship best described the entity pair. Later, neural networks like BERT and RoBERTa replaced much of this manual feature design by learning their own representations directly from text, but they still needed large, labeled datasets to achieve good performance.

Today, LLMs can often handle relation extraction with just a prompt. Instead of training a model from scratch, you can ask an LLM directly: “Identify the relationship between ‘Einstein’ and ‘the theory of relativity.’” With the right instructions, the model can output the relation (“Einstein proposed the theory of relativity”) even if it has never seen that exact pair in training.

See the diagram showing “classic relation extraction pipeline” versus “LLM-based pipeline” below.


How LLMs Tackle Relation Extraction

LLMs have been trained on massive amounts of text, which gives them a broad sense of how entities and relationships co-occur. This makes them powerful at zero-shot and few-shot learning. In other words, you do not need to show them hundreds of examples of a relation type before they can identify it.

For example, a simple prompt could be:

Prompt:“Read the sentence: ‘Marie Curie discovered radium.’ What is the relationship between the entities?”
Model Output from ChatGPT-5 and Grok-4:

[ChatGPT-5] In the sentence “Marie Curie discovered radium”, the relationship between the entities is:

  • Marie Curie (the subject / agent) performed the action.
  • radium (the object / theme) is what was discovered.

This is a “discoverer–discovery” relationship (also called “person–invention/discovery” or “agent–object” in relation extraction). In other words, Marie Curie is the discoverer and radium is the thing discovered.

[Grok-4] In the sentence “Marie Curie discovered radium,” the relationship between the entities “Marie Curie” and “radium” is that of discoverer and discovered. Marie Curie, a scientist, is the person who identified or found the chemical element radium.

This ability to generalize makes LLMs more versatile, especially for relations in domains where labeled data is scarce, like medicine or law. However, it also introduces risks. LLMs may hallucinate relationships that are not actually in the text or mislabel subtle ones, so careful evaluation is still necessary.


Recent Research Highlights

A major paper, A Survey on Cutting-Edge Relation Extraction Techniques Based on Language Models (Diaz-Garcia & López, 2024), reviews 137 recent ACL papers (2020-2023) that use language models for relation extraction. It shows that BERT-based methods still lead many benchmarks while models like T5 are rising in few-shot and unseen-relation settings.

Other papers from ACL 2024 and 2025 explore how well LLMs handle unseen relation types, cross-domain relation extraction, and low-resource settings. These studies show steady improvements but also highlight open questions about factuality, bias, and consistency.


Why This Matters Beyond Academia

Relation extraction sits at the core of knowledge-driven applications. Building or updating a knowledge graph for a company’s internal documents, mapping patient histories in healthcare, or connecting laws to court cases in legal tech all depend on accurately identifying relationships between entities.

LLMs make it possible to automate these tasks more quickly. Instead of spending months labeling data, organizations can draft knowledge structures with an LLM, then have humans verify or refine the results. This speeds up research and decision-making while expanding access to insights that would otherwise stay hidden in text.


Challenges and Open Questions

While LLMs are powerful, they are not flawless. They may infer relationships that are plausible but incorrect, especially if the prompt is ambiguous. Evaluating relation extraction at scale is also difficult, because many relations are context-specific or only partially expressed. Bias in training data can also skew the relationships a model “sees” as likely or normal.

Researchers are now working on ways to add uncertainty estimates, retrieval-augmented methods (i.e., combining information retrieval with generative models to improve response accuracy and relevance), and better benchmarks to test how well models extract relations across different domains and languages.


My Take as a High Schooler Working in NLP

As someone who has built a survey analysis platform and published research papers about sentiment classification, I find relation extraction exciting because it can connect scattered pieces of information into a bigger picture. Specifically, for projects like my nonprofit Student Echo, a future system could automatically link student concerns to policy areas or resources.

At the same time, I am cautious. Seeing how easily LLMs generate answers reminds me that relationships in text are often subtle. Automating them risks oversimplifying complex realities. Still, the idea that a model can find and organize connections that would take a person hours to spot is inspiring and worth exploring.


Conclusion

Relation extraction is moving from hand-built rules and large labeled datasets to flexible, generalist large language models. This shift is making it easier to build knowledge graphs, extract facts, and understand text at scale. But it also raises new questions about reliability, fairness, and evaluation.

If you want to dig deeper, check out A Survey on Cutting-Edge Relation Extraction Techniques Based on Language Models (arXiv link) or browse ACL 2024–2025 papers on relation extraction. Watching how this field evolves over the next few years will be exciting, and I plan to keep following it for future blog posts.

— Andrew

4,361 hits

Latest Applications of NLP to Recommender Systems at RecSys 2025

Introduction

The ACM Conference on Recommender Systems (RecSys) 2025 took place in Prague, Czech Republic, from September 22–26, 2025. The event brought together researchers and practitioners from academia and industry to present their latest findings and explore new trends in building recommendation technologies.

This year, one of the most exciting themes was the growing overlap between natural language processing (NLP) and recommender systems. Large language models (LLMs), semantic clustering, and text-based personalization appeared everywhere, showing how recommender systems are now drawing heavily on computational linguistics. As someone who has been learning more about NLP myself, it is really cool to see how the research world is pushing these ideas forward.


Paper Highlights

A Language Model-Based Playlist Generation Recommender System

Paper Link

Relevance:
Uses language models to generate playlists by creating semantic clusters from text embeddings of playlist titles and track metadata. This directly applies NLP for thematic coherence and semantic similarity in music recommendations.

Abstract:
The title of a playlist often reflects an intended mood or theme, allowing creators to easily locate their content and enabling other users to discover music that matches specific situations and needs. This work presents a novel approach to playlist generation using language models to leverage the thematic coherence between a playlist title and its tracks. Our method consists in creating semantic clusters from text embeddings, followed by fine-tuning a transformer model on these thematic clusters. Playlists are then generated considering the cosine similarity scores between known and unknown titles and applying a voting mechanism. Performance evaluation, combining quantitative and qualitative metrics, demonstrates that using the playlist title as a seed provides useful recommendations, even in a zero-shot scenario.


An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization

Paper Link

Relevance:
Focuses on off-policy learning to guide LLM-based sentence generation for personalized recommendations. Involves NLP tasks like controlled text generation and personalization via language model fine-tuning.

Abstract:
We study the problem of personalizing the output of a large language model (LLM) by training on logged bandit feedback (e.g., personalizing movie descriptions based on likes). While one may naively treat this as a standard off-policy contextual bandit problem, the large action space and the large parameter space make naive applications of off-policy learning (OPL) infeasible. We overcome this challenge by learning a prompt policy for a frozen LLM that has only a modest number of parameters. The proposed Direct Sentence Off-policy gradient (DSO) effectively propagates the gradient to the prompt policy space by leveraging the smoothness and overlap in the sentence space. Consequently, DSO substantially reduces variance while also suppressing bias. Empirical results on our newly established suite of benchmarks, called OfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts and reward noise are large.


Enhancing Sequential Recommender with Large Language Models for Joint Video and Comment Recommendation

Paper Link

Relevance:
Integrates LLMs to enhance sequential recommendations by processing video content and user comments. Relies on NLP for joint modeling of multimodal text (like comments) and semantic user preferences.

Abstract:
Nowadays, reading or writing comments on captivating videos has emerged as a critical part of the viewing experience on online video platforms. However, existing recommender systems primarily focus on users’ interaction behaviors with videos, neglecting comment content and interaction in user preference modeling. In this paper, we propose a novel recommendation approach called LSVCR that utilizes user interaction histories with both videos and comments to jointly perform personalized video and comment recommendation. Specifically, our approach comprises two key components: sequential recommendation (SR) model and supplemental large language model (LLM) recommender. The SR model functions as the primary recommendation backbone (retained in deployment) of our method for efficient user preference modeling. Concurrently, we employ a LLM as the supplemental recommender (discarded in deployment) to better capture underlying user preferences derived from heterogeneous interaction behaviors. In order to integrate the strengths of the SR model and the supplemental LLM recommender, we introduce a two-stage training paradigm. The first stage, personalized preference alignment, aims to align the preference representations from both components, thereby enhancing the semantics of the SR model. The second stage, recommendation-oriented fine-tuning, involves fine-tuning the alignment-enhanced SR model according to specific objectives. Extensive experiments in both video and comment recommendation tasks demonstrate the effectiveness of LSVCR. Moreover, online A/B testing on KuaiShou platform verifies the practical benefits of our approach. In particular, we attain a cumulative gain of 4.13% in comment watch time.


LLM-RecG: A Semantic Bias-Aware Framework for Zero-Shot Sequential Recommendation

Paper Link

Relevance:
Addresses domain semantic bias in LLMs for cross-domain recommendations using generalization losses to align item embeddings. Employs NLP techniques like pretrained representations and semantic alignment to mitigate vocabulary differences across domains.

Abstract:
Zero-shot cross-domain sequential recommendation (ZCDSR) enables predictions in unseen domains without additional training or fine-tuning, addressing the limitations of traditional models in sparse data environments. Recent advancements in large language models (LLMs) have significantly enhanced ZCDSR by facilitating cross-domain knowledge transfer through rich, pretrained representations. Despite this progress, domain semantic bias arising from differences in vocabulary and content focus between domains remains a persistent challenge, leading to misaligned item embeddings and reduced generalization across domains.

To address this, we propose a novel semantic bias-aware framework that enhances LLM-based ZCDSR by improving cross-domain alignment at both the item and sequential levels. At the item level, we introduce a generalization loss that aligns the embeddings of items across domains (inter-domain compactness), while preserving the unique characteristics of each item within its own domain (intra-domain diversity). This ensures that item embeddings can be transferred effectively between domains without collapsing into overly generic or uniform representations. At the sequential level, we develop a method to transfer user behavioral patterns by clustering source domain user sequences and applying attention-based aggregation during target domain inference. We dynamically adapt user embeddings to unseen domains, enabling effective zero-shot recommendations without requiring target-domain interactions.

Extensive experiments across multiple datasets and domains demonstrate that our framework significantly enhances the performance of sequential recommendation models on the ZCDSR task. By addressing domain bias and improving the transfer of sequential patterns, our method offers a scalable and robust solution for better knowledge transfer, enabling improved zero-shot recommendations across domains.


Trends Observed

These papers reflect a broader trend at RecSys 2025 toward hybrid NLP-RecSys approaches, with LLMs enabling better handling of textual side information (like reviews, titles, and comments) for cold-start problems and cross-domain generalization. This aligns with recent surveys on LLMs in recommender systems, which note improvements in semantic understanding over traditional embeddings.


Final Thoughts

As a high school student interested in computational linguistics, reading about these papers feels like peeking into the future. I used to think of recommender systems as black boxes that just show you more videos or songs you might like. But at RecSys 2025, it is clear the field is moving toward systems that actually “understand” language and context, not just click patterns.

For me, that is inspiring. It means the skills I am learning right now, from studying embeddings to experimenting with sentiment analysis, could actually be part of real-world systems that people use every day. It also shows how much crossover there is between disciplines. You can be into linguistics, AI, and even user experience design, and still find a place in recommender system research.

Seeing these studies also makes me think about the responsibility that comes with more powerful recommendation technology. If models are becoming better at predicting our tastes, we have to be careful about bias, fairness, and privacy. This is why conferences like RecSys are so valuable. They are a chance for researchers to share ideas, critique each other’s work, and build a better tech future together.

— Andrew

4,361 hits

Rethinking AI Bias: Insights from Professor Resnik’s Position Paper

I recently read Professor Philip Resnik’s thought-provoking position paper, “Large Language Models Are Biased Because They Are Large Language Models,” published in Computational Linguistics 51(3), which is available via open access. This paper challenges conventional perspectives on bias in artificial intelligence, prompting a deeper examination of the inherent relationship between bias and the foundational design of large language models (LLMs). Resnik’s primary objective is to stimulate critical discussion by arguing that harmful biases are an inevitable outcome of the current architecture of LLMs. The paper posits that addressing these biases effectively requires a fundamental reevaluation of the assumptions underlying the design of AI systems driven by LLMs.

What the paper argues

  • Bias is built into the very goal of an LLM. A language model tries to predict the next word by matching the probability patterns of human text. Those patterns come from people. People carry stereotypes, norms, and historical imbalances. If an LLM learns the patterns faithfully, it learns the bad with the good. The result is not a bug that appears once in a while. It is a direct outcome of the objective the model optimizes.
  • Models cannot tell “what a word means” apart from “what is common” or “what is acceptable.” Resnik uses a nurse example. Some facts are definitional (A nurse is a kind of healthcare worker). Other facts are contingent but harmless (A nurse is likely to wear blue clothing at work). Some patterns are contingent and harmful if used for inference (A nurse is likely to wear a dress to a formal occasion). Current LLMs do not have an internal line that separates meaning from contingent statistics or that flags the normative status of an inference. They just learn distributions.
  • Reinforcement Learning from Human Feedback (RLHF) and other mitigations help on the surface, but they have limits. RLHF tries to steer a pre-trained model toward safer outputs. The process relies on human judgments that vary by culture and time. It also has to keep the model close to its pretraining, or the model loses general ability. That tradeoff means harmful associations can move underground rather than disappear. Some studies even find covert bias remains after mitigation (Gallegos et al. 2024; Hofmann et al. 2024). To illustrate this, consider an analogy: The balloon gets squeezed in one place, then bulges in another.
  • The root cause is a hard-core, distribution-only view of language. When meaning is treated as “whatever co-occurs with what,” the model has no principled way to encode norms. The paper suggests rethinking foundations. One direction is to separate stable, conventional meaning (like word sense and category membership) from contextual or conveyed meaning (which is where many biases live). Another idea is to modularize competence, so that using language in socially appropriate ways is not forced to emerge only from next-token prediction. None of this is easy, but it targets the cause rather than only tuning symptoms.

Why this matters

Resnik is not saying we should give up. He is saying that quick fixes will not fully erase harm when the objective rewards learning whatever is frequent in human text. If we want models that reason with norms, we need objectives and representations that include norms, not only distributions.

Conclusion

This paper offers a clear message. Bias is not only a content problem in the data. It is also a design problem in how we define success for our models. If the goal is to build systems that are both capable and fair, then the next steps should focus on objectives, representations, and evaluation methods that make room for norms and constraints. That is harder than prompt tweaks, but it is the kind of challenge that can move the field forward.

Link to the paper: Large Language Models Are Biased Because They Are Large Language Models

— Andrew

4,361 hits

Summary: “Large Language Models Are Improving Exponentially”

I recently read an article on IEEE Spectrum titled “Large Language Models Are Improving Exponentially”. Here is a summary of its key points.


Benchmarking LLM Performance

Benchmarking large language models (LLMs) is challenging because their main goal is to produce text indistinguishable from human writing, which doesn’t always correlate with traditional processor performance metrics. However, it remains important to measure their progress to understand how much better LLMs are becoming over time and to estimate when they might complete substantial tasks independently.


METR’s Findings on Exponential Improvement

Researchers at Model Evaluation & Threat Research (METR) in Berkeley, California, published a paper in March called Measuring AI Ability to Complete Long Tasks. They concluded that:

  • The capabilities of key LLMs are doubling every seven months.
  • By 2030, the most advanced LLMs could complete, with 50 percent reliability, a software-based task that would take humans a full month of 40-hour workweeks.
  • These LLMs might accomplish such tasks much faster than humans, possibly within days or even hours.

Potential Tasks by 2030

Tasks that LLMs might be able to perform by 2030 include:

  • Starting up a company
  • Writing a novel
  • Greatly improving an existing LLM

According to AI researcher Zach Stein-Perlman, such capabilities would come with enormous stakes, involving both potential benefits and significant risks.


The Task-Completion Time Horizon Metric

At the core of METR’s work is a metric called “task-completion time horizon.” It measures the time it would take human programmers to complete a task that an LLM can complete with a specified reliability, such as 50 percent.

Their plots (see graphs below) show:

  • Exponential growth in LLM capabilities with a doubling period of around seven months (Graph at the top).
  • Tasks that are “messier” or more similar to real-world scenarios remain more challenging for LLMs (Graph at the bottom).

Caveats About Growth and Risks

While these results raise concerns about rapid AI advancement, METR researcher Megan Kinniment noted that:

  • Rapid acceleration does not necessarily result in “massively explosive growth.”
  • Progress could be slowed by factors such as hardware or robotics bottlenecks, even if AI systems become very advanced.

Final Summary

Overall, the article emphasizes that LLMs are improving exponentially, potentially enabling them to handle complex, month-long human tasks by 2030. This progress comes with significant benefits and risks, and its trajectory may depend on external factors like hardware limitations.

You can read the full article here.

— Andrew

Caring Machines, Centered Humans: Lessons from Ai4 2025

At Ai4 2025 (August 11–13, Las Vegas), two of the most influential voices in artificial intelligence expressed strikingly different visions for the future. Geoffrey Hinton, often called the “Godfather of AI,” suggested that AI should be designed with something like “maternal instincts.” He argued that as AI becomes smarter than humans, we cannot realistically control it through traditional dominance strategies. The only model we have of a less intelligent being guiding a more intelligent one is the relationship between a baby and its mother. A mother cares for her child not because she is weaker, but because she is built to protect and nurture. Hinton believes this kind of protective orientation is what could keep humanity safe in the long run.

Fei-Fei Li, sometimes called the “Godmother of AI,” offered a different perspective in a CNN interview. She disagrees with parental analogies for AI. Instead, she emphasizes designing human-centered AI, systems that uphold human dignity, promote agency, and avoid emotional metaphors that could mislead how we understand AI.

Summary Comparison of Views

When I first read about these contrasting views, I found myself agreeing with both in different ways. On one hand, Hinton’s maternal metaphor captures the seriousness of what could happen if superintelligence arrives sooner than many expect. If AI truly surpasses human intelligence, relying solely on control may fail. On the other hand, Li’s approach feels grounded and practical. She reminds us that the ethical choices we make today will set the trajectory for future systems.

The best answer may not lie in choosing between them, but in combining their strengths. I think about this as a layered model. The foundation should be Li’s human-centered AI: respect, fairness, transparency, and agency. On top of that we need what Hinton calls protective alignment. These would be structural safeguards that ensure highly intelligent systems still act in ways that preserve human well-being.

Hybrid Framework Diagram
Here is how I visualize this combination of perspectives: Li’s human-centered AI forms the core, while Hinton’s protective alignment provides the outer safeguard.

Practical Integration

  • Development Phase (Near-Term, Li):
    Apply human-centered AI frameworks to today’s large language models, robotics, and decision-support systems.
    Focus on privacy, bias reduction, explainability, and giving users agency over their interactions with AI.
  • Safety Research Phase (Mid- to Long-Term, Hinton):
    Begin embedding structural safeguards that mimic “caring instincts.”
    Example: AI systems with hard-coded prohibitions against harming humans, but reinforced by higher-order goals like proactively ensuring human thriving.
  • Governance and Oversight:
    Combine Li’s push for international, human-centered AI policy with Hinton’s insistence on global collaboration to avoid runaway dominance races.

In other words, AI should be designed to treat humanity as worth protecting, while being anchored in the principles of dignity.

As a high school student exploring AI and computational linguistics, I believe this hybrid vision is the most realistic path forward. It addresses the near-term challenges of fairness, transparency, and accountability while also preparing for the long-term risks of superintelligence. For me, this is not just an abstract debate. Thinking about how we embed values and safety into AI connects directly to my own interests in language models, hate speech detection, robotics, and how technology interacts with human society.

The future of AI is not predetermined. It will be shaped by the principles we choose now. By combining Hinton’s call for protective instincts with Li’s insistence on human-centered design, we have a chance to build AI that both cares for us and respects us.

For readers interested in the original coverage of this debate, see the CNN article here.

— Andrew

The AI Gap: How Socioeconomic Status Shapes Language Technology Use — A Perspective from Best Social Impact Paper at ACL 2025

The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) recently finished in Vienna, Austria from July 27 to August 1. The conference announced a few awards, one of which is Best Social Impact Paper. This award was given to two papers:

  1. AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset (by Charles Nimo et al.)
  2. The AI Gap: How Socioeconomic Status Affects Language Technology Interactions (by Elisa Bassignana, Amanda Cercas Curry, and Dirk Hovy).

In this blog post, I’ll talk about the second paper and share the findings from the paper and my thoughts on the topic. You can read the full paper here: https://aclanthology.org/2025.acl-long.914.pdf

What the Paper is About

This paper investigates how socioeconomic status (SES) influences interactions with language technologies, particularly large language models (LLMs) like ChatGPT, highlighting an emerging “AI Gap” that could exacerbate social inequalities. Drawing from the Technology Acceptance Model and prior work on digital divides, the authors argue that SES shapes technology adoption through factors like access, digital literacy, and linguistic habits, potentially biasing LLMs toward higher-SES patterns and underrepresenting lower-SES users.

Methods

The study surveys 1,000 English-speaking participants from the UK and US via Prolific, stratified by self-reported SES using the MacArthur scale (binned as low: 1-3, middle: 4-7, upper: 8-10). It collects sociodemographic data, usage patterns of language technologies (e.g., spell checkers, AI chatbots), and 6,482 real prompts from prior LLM interactions. Analysis includes statistical tests (e.g., chi-square for usage differences), linguistic metrics (e.g., prompt length, concreteness via Brysbaert et al.’s word ratings), topic modeling (using embeddings, UMAP, HDBSCAN, and GPT-4 for cluster descriptions), and markers of anthropomorphism (e.g., phatic expressions like “hi” and politeness markers like “thank you”).

Key Findings

  • Usage Patterns: Higher-SES individuals access more devices daily (e.g., laptops, smartwatches) and use LLMs more frequently (e.g., daily vs. rarely for lower SES). They employ LLMs for work/education (e.g., coding, data analysis, writing) and technical contexts, while lower-SES users favor entertainment, brainstorming, and general knowledge queries. Statistically significant differences exist in frequency (p < 0.001), contexts (p < 0.001), and tasks (p < 0.001).
  • Linguistic Differences in Prompts: Higher-SES prompts are shorter (avg. 18.4 words vs. 27.0 for low SES; p < 0.05) and more abstract (concreteness score: 2.57 vs. 2.66; p < 0.05). Lower-SES prompts show higher anthropomorphism (e.g., more phatic expressions) and concrete language. A bag-of-words classifier distinguishes SES groups (Macro-F1 39.25 vs. baseline 25.02).
  • Topics and Framing: Common topics (e.g., translation, mental health, medical advice, writing, text editing, finance, job, food) appear across groups, but framing varies—e.g., lower SES seeks debt reduction or low-skill jobs; higher SES focuses on investments, travel itineraries, or inclusivity. About 45% of prompts resemble search-engine queries, suggesting LLMs are replacing traditional searches.
  • User Perceptions: Trends indicate lower-SES users anthropomorphize more (e.g., metaphorical verbs like “ask”), while higher-SES use jargon (e.g., “generate”), though not statistically significant.

Discussion and Implications

The findings underscore how SES stratifies LLM use, with higher-SES benefiting more in professional/educational contexts, potentially widening inequalities as LLMs optimize for their patterns. Benchmarks may overlook lower-SES styles, leading to biases. The authors advocate the development of inclusive NLP technologies to accommodate different SES needs and habitus and mitigate the existing AI Gap.

Limitations and Ethics

Limited to Prolific crowdworkers (skewed middle/low SES, tech-savvy), subjective SES measures, and potential LLM-generated responses. Ethical compliance includes GDPR anonymity, opt-outs, and fair compensation (£9/hour).

Overall, the paper reveals SES-driven disparities in technology interactions, urging NLP development to address linguistic and habitual differences for equitable access and reduced digital divides.

My Takeaway

As a high school student who spends a lot of time thinking about fairness in AI, I find this paper important because it reminds us that bias is not just about language or culture, it can also be tied to socioeconomic status. This is something I had not thought much about before. If AI systems are trained mostly on data from higher SES groups, they might misunderstand or underperform for people from lower SES backgrounds. That could affect how well people can use AI for education, job searching, or even just getting accurate information online.

For me, the takeaway is that AI researchers need to test their models with SES diversity in mind, just like they do with gender or language diversity. And as someone interested in computational linguistics, it is inspiring to see that work like this is getting recognized with awards at ACL.

— Andrew

Blog at WordPress.com.

Up ↑