The Collins Word of the Year and Why It Matters for Computational Linguistics

Every year, a single word captures the moment when language and culture meet. Sometimes it comes from politics, sometimes from technology, but it always tells a story about how people think and communicate. As someone drawn to both words and code, I see each new “Word of the Year” as more than a headline. It’s data, meaning, and evolution all at once.

As I prepare to study Computational Linguistics in college, I have been paying attention not only to algorithms and corpora but also to the ways language changes around us. One of the most interesting reflections of that change is the annual “Word of the Year” chosen by Collins Dictionary. In this post, I’ll review the past ten years of Collins’ selections, explain how the 2025 Word of the Year was chosen (including the shortlist), and discuss why this matters for computational linguistics.


Past Ten Years of Collins Word of the Year

Year | Word of the Year | Brief explanation
2016 | Brexit | Captured the UK’s vote to leave the EU and its wide political, social, and linguistic effects.
2017 | fake news | Reflected the rise of misinformation and debates about truth in media.
2018 | single-use | Highlighted environmental awareness and discussions about disposable culture.
2019 | climate strike | Described global youth activism inspired by Greta Thunberg and climate movements.
2020 | lockdown | Defined the year of the Covid-19 pandemic and its global restrictions.
2021 | NFT | Stood for “non-fungible token” and represented the emergence of digital assets and blockchain culture.
2022 | permacrisis | Described a long period of instability and uncertainty, fitting the global mood.
2023 | AI | Represented artificial intelligence becoming central to everyday conversation.
2024 | brat | Captured the confident, independent attitude popularized by youth culture and pop music.
2025 | vibe coding | Described the blending of language and technology through conversational code creation.

The 2025 Word of the Year: vibe coding

For 2025, Collins Dictionary selected vibe coding as its Word of the Year. The term refers to new software development practices that use natural language and artificial intelligence to create applications by describing what one wants rather than manually writing code. It describes a form of “coding by conversation” that bridges creativity and computation.

Source: Collins Dictionary Word of the Year 2025


How Collins Selects the Word of the Year

The Collins team monitors its extensive language database throughout the year. Using large-scale corpus analysis, they track words that rise sharply in frequency or reflect cultural, political, or technological change. The process includes:

  • Lexicographic monitoring: Editors and linguists identify new or trending words across print, social media, and digital sources.
  • Corpus analysis: Statistical tools measure frequency and context to see which words stand out.
  • Editorial review: The final decision balances data and cultural relevance to choose a word that captures the spirit of the year.
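To make the corpus-analysis step concrete, here is a minimal sketch of how a sharp rise in frequency could be measured. This is my own illustration, not Collins’ actual tooling: the two tiny corpora are invented, the function names are hypothetical, and I treat “vibe coding” as a single token (vibe_coding) to keep the counting simple.

```python
from collections import Counter


def relative_frequency(tokens, term):
    """Occurrences of `term` per million tokens in a tokenized corpus."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens) * 1_000_000


def frequency_ratio(old_tokens, new_tokens, term, smoothing=1.0):
    """How many times more frequent `term` is in the newer corpus.

    `smoothing` (in per-million units) avoids dividing by zero
    when a word did not exist at all in the older corpus.
    """
    old = relative_frequency(old_tokens, term)
    new = relative_frequency(new_tokens, term)
    return (new + smoothing) / (old + smoothing)


# Toy corpora invented for illustration only.
corpus_2024 = "i wrote the code by hand and tested the code".split()
corpus_2025 = "i tried vibe_coding and the vibe_coding demo worked".split()

ratio = frequency_ratio(corpus_2024, corpus_2025, "vibe_coding")
print(f"vibe_coding is {ratio:,.0f}x more frequent in the newer corpus")
```

A real pipeline would work with millions of documents and proper tokenization, but the core idea is the same: normalize counts by corpus size, then compare across time.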

Shortlist for 2025

In addition to vibe coding, this year’s shortlist includes aura farming, biohacking, broligarchy, clanker, coolcation, glaze, HENRY, micro-retirement, and taskmasking.

You can view the full list on the Collins website: https://www.collinsdictionary.com/us/woty


Why the Collins Word of the Year Matters for Computational Linguistics

As someone preparing to study Computational Linguistics, I find the Collins Word of the Year fascinating for several reasons:

  1. Language change in data
    Each year’s word shows how new vocabulary enters real-world language use. Computational linguistics often studies these changes through corpora to model meaning over time.
  2. Human-machine interaction
    Vibe coding reflects a growing trend where natural language acts as an interface between humans and technology. It is an example of how linguistic principles are now shaping software design.
  3. Semantic and cultural evolution
    The meanings of words like “brat” or “AI” evolve quickly in digital contexts. For computational linguists, tracking these semantic shifts supports research in language modeling and word embeddings.
  4. Lexicographic data as research input
    Collins’ approach mirrors computational methods. Their frequency-based analysis can inspire how we model and predict linguistic trends using data science.
  5. Pedagogical and research relevance
    New words like vibe coding demonstrate how emerging technology changes both everyday communication and the future topics of linguistic research. They show where language innovation meets computation.
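One common way to put a number on semantic shift is to compare a word’s embedding from two time periods using cosine distance. The sketch below only shows the arithmetic: the three-dimensional vectors are made up for illustration, and real embeddings would come from models trained on dated corpora (and would usually need to be aligned before comparison).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_shift(old_vector, new_vector):
    """Cosine distance: 0 means same direction, larger means more change."""
    return 1.0 - cosine_similarity(old_vector, new_vector)


# Hypothetical embeddings for the same word in two periods.
brat_2015 = [0.9, 0.1, 0.0]
brat_2024 = [0.2, 0.7, 0.6]

shift = semantic_shift(brat_2015, brat_2024)
print(f"Estimated shift for 'brat': {shift:.2f}")
```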

Reflection

When I first read that “vibe coding” had been chosen as the 2025 Word of the Year, I couldn’t help thinking about how it perfectly represents where computational linguistics is heading. Language is no longer just a subject of study; it is becoming a tool for creation. What used to be a set of rigid commands is turning into natural conversation.

The term also reminds me that words are living data points. Each new entry in a dictionary records a shift in how people think and communicate. For future computational linguists, observing how dictionaries evolve gives insight into how models and algorithms should adapt too.

It’s easy to see the Word of the Year as a piece of pop culture, but it’s really a linguistic dataset in disguise. Every annual choice documents how society expresses what matters most at that moment, and that is what makes it so meaningful to study.



— Andrew


Learning from Industry: How Companies Evaluate LLMs

Over the past few years, large language models (LLMs) have been everywhere. From chatbots that help you book flights to tools that summarize long documents, companies are finding ways to use LLMs in real products. But success is not guaranteed, and sometimes it goes very wrong. In one widely reported case, Expedia’s chatbot gave instructions on how to make a Molotov cocktail (Cybernews Report; see the chatbot screenshot below). In another, Air Canada’s AI-powered chatbot gave a customer incorrect information about bereavement fares (BBC Report). Mistakes like these show how important it is for industry practitioners to build strong evaluation systems for LLMs.

Recently, I read a blog post from GoDaddy’s engineering team about how they evaluate LLMs before putting them into real-world use (GoDaddy Engineering Blog). Their approach stood out to me because it was more structured than just running a few test questions. Here are the main lessons I took away:

  1. Tie evaluations to business outcomes
    Instead of treating testing as an afterthought, GoDaddy connects test data directly to golden datasets. These datasets are carefully chosen examples that represent what the business actually cares about.
  2. Use both classic and new evaluation methods
    Traditional machine learning metrics like precision and recall still matter. But GoDaddy also uses newer approaches like “LLM-as-a-judge,” where another model helps categorize specific errors.
  3. Automate and integrate evaluation into development
    Evaluation isn’t just something you do once. GoDaddy treats it as part of a continuous integration pipeline. They expand their golden datasets, add new feedback loops, and refine their systems over time.
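To practice the first two lessons on a small scale, I find it helpful to think of a golden dataset as a list of inputs with trusted labels, scored with classic metrics. The sketch below is not GoDaddy’s pipeline. The labels and model outputs are invented, and precision_recall is a helper name I made up for this example.

```python
def precision_recall(predicted, expected, positive_label):
    """Precision and recall for one label, given parallel lists of labels."""
    pairs = list(zip(predicted, expected))
    true_pos = sum(1 for p, e in pairs if p == positive_label and e == positive_label)
    false_pos = sum(1 for p, e in pairs if p == positive_label and e != positive_label)
    false_neg = sum(1 for p, e in pairs if p != positive_label and e == positive_label)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical golden labels (trusted) and model outputs (to be judged).
golden_labels = ["cancel", "renew", "cancel", "other", "cancel"]
model_labels = ["cancel", "cancel", "cancel", "other", "renew"]

precision, recall = precision_recall(model_labels, golden_labels, "cancel")
print(f"cancel: precision={precision:.2f}, recall={recall:.2f}")
```

Running a script like this on every change is a simple version of making evaluation part of the workflow rather than a one-time check.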

As a high school student, I’m not joining the tech industry tomorrow. Still, I think it’s important for me to pay attention to best practices like these. They show me how professionals handle problems that I might face later in my own projects. Even though my experiments with neural networks or survey sentiment analysis aren’t at the scale of Expedia, Air Canada, or GoDaddy, I can still practice connecting my evaluations to real outcomes, thinking about error types, and making testing part of my workflow.

The way I see it, learning industry standards now gives me a head start for the future. And maybe when I get to do college research or internships, I’ll already be used to thinking about evaluation in a systematic way rather than as an afterthought.

— Andrew


How Large Language Models Are Changing Relation Extraction in NLP

When you type a question into a search engine like “Who wrote Hamlet?” it does more than match keywords. It connects the dots between “Shakespeare” and “Hamlet,” identifying the relationship between a person and their work. This process of finding and labelling relationships in text is called relation extraction (RE). It powers everything from knowledge graphs to fact-checking systems.

In the past, relation extraction systems were built with hand-crafted rules or required thousands of annotated examples to train. Now, large language models (LLMs) such as GPT, T5, and LLaMA are making it possible to do relation extraction with far less data and more flexibility. In this post, I want to explore what relation extraction is, how LLMs are transforming it, and why this matters for anyone interested in the future of language technology.


The Old Way Versus the New Way

Relation extraction used to rely heavily on feature engineering. Researchers would first hand-craft linguistic or statistical features from the text, such as part-of-speech tags, dependency parses, named entity types, and the words appearing between two entities. They often built lexical patterns like “X, the capital of Y” or “X, founded by Y” to capture specific relationships. They also extracted syntactic paths (for example, the shortest path between two entities in a dependency tree) or used windowed word embeddings around the entities to encode local context. These features were then fed into classifiers like Support Vector Machines, logistic regression models, or decision trees to decide which relationship best described the entity pair. Later, neural networks like BERT and RoBERTa replaced much of this manual feature design by learning their own representations directly from text, but they still needed large, labeled datasets to achieve good performance.
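Here is a minimal sketch of what the lexical-pattern approach can look like, using the two templates mentioned above. The regular expressions and relation labels are my own simplified assumptions, not taken from any particular system, and they would break on many real sentences.

```python
import re

# Each pattern maps a surface template to a relation label.
# Entities are assumed to start with a capital letter.
PATTERNS = [
    (re.compile(r"(?P<x>[A-Z][\w ]*?), the capital of (?P<y>[A-Z][\w ]*)"), "capital_of"),
    (re.compile(r"(?P<x>[A-Z][\w ]*?), founded by (?P<y>[A-Z][\w ]*)"), "founded_by"),
]


def extract_relations(sentence):
    """Return (entity, relation, entity) triples matched by any pattern."""
    triples = []
    for pattern, label in PATTERNS:
        for match in pattern.finditer(sentence):
            triples.append((match.group("x").strip(), label, match.group("y").strip()))
    return triples


triples = extract_relations("Paris, the capital of France, is on the Seine.")
print(triples)
```

The brittleness is the point: every new relation type or phrasing needs another hand-written pattern, which is why the field moved toward learned representations.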

Today, LLMs can often handle relation extraction with just a prompt. Instead of training a model from scratch, you can ask an LLM directly: “Identify the relationship between ‘Einstein’ and ‘the theory of relativity.’” With the right instructions, the model can output the relation (“Einstein proposed the theory of relativity”) even if it has never seen that exact pair in training.

See the diagram showing “classic relation extraction pipeline” versus “LLM-based pipeline” below.


How LLMs Tackle Relation Extraction

LLMs have been trained on massive amounts of text, which gives them a broad sense of how entities and relationships co-occur. This makes them powerful at zero-shot and few-shot learning. In other words, you do not need to show them hundreds of examples of a relation type before they can identify it.

For example, a simple prompt could be:

Prompt: “Read the sentence: ‘Marie Curie discovered radium.’ What is the relationship between the entities?”
Model Output from ChatGPT-5 and Grok-4:

[ChatGPT-5] In the sentence “Marie Curie discovered radium”, the relationship between the entities is:

  • Marie Curie (the subject / agent) performed the action.
  • radium (the object / theme) is what was discovered.

This is a “discoverer–discovery” relationship (also called “person–invention/discovery” or “agent–object” in relation extraction). In other words, Marie Curie is the discoverer and radium is the thing discovered.

[Grok-4] In the sentence “Marie Curie discovered radium,” the relationship between the entities “Marie Curie” and “radium” is that of discoverer and discovered. Marie Curie, a scientist, is the person who identified or found the chemical element radium.

This ability to generalize makes LLMs more versatile, especially for relations in domains where labeled data is scarce, like medicine or law. However, it also introduces risks. LLMs may hallucinate relationships that are not actually in the text or mislabel subtle ones, so careful evaluation is still necessary.
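One cheap sanity check against hallucinated relations is to confirm that both entities in an extracted triple actually appear in the source sentence. The sketch below assumes the model’s answer has already been parsed into (subject, relation, object) triples; the candidate triples are hypothetical, and the check says nothing about whether the relation label itself is right.

```python
def is_grounded(triple, sentence):
    """True if both entities of the triple are mentioned in the sentence."""
    subject, _relation, obj = triple
    text = sentence.lower()
    return subject.lower() in text and obj.lower() in text


sentence = "Marie Curie discovered radium."

# Hypothetical model outputs: the second triple is plausible,
# but "polonium" never appears in the sentence.
candidates = [
    ("Marie Curie", "discovered", "radium"),
    ("Marie Curie", "discovered", "polonium"),
]

grounded = [t for t in candidates if is_grounded(t, sentence)]
print(grounded)
```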


Recent Research Highlights

A major paper, A Survey on Cutting-Edge Relation Extraction Techniques Based on Language Models (Diaz-Garcia & López, 2024), reviews 137 recent ACL papers (2020-2023) that use language models for relation extraction. It shows that BERT-based methods still lead many benchmarks while models like T5 are rising in few-shot and unseen-relation settings.

Other papers from ACL 2024 and 2025 explore how well LLMs handle unseen relation types, cross-domain relation extraction, and low-resource settings. These studies show steady improvements but also highlight open questions about factuality, bias, and consistency.


Why This Matters Beyond Academia

Relation extraction sits at the core of knowledge-driven applications. Building or updating a knowledge graph for a company’s internal documents, mapping patient histories in healthcare, or connecting laws to court cases in legal tech all depend on accurately identifying relationships between entities.

LLMs make it possible to automate these tasks more quickly. Instead of spending months labeling data, organizations can draft knowledge structures with an LLM, then have humans verify or refine the results. This speeds up research and decision-making while expanding access to insights that would otherwise stay hidden in text.


Challenges and Open Questions

While LLMs are powerful, they are not flawless. They may infer relationships that are plausible but incorrect, especially if the prompt is ambiguous. Evaluating relation extraction at scale is also difficult, because many relations are context-specific or only partially expressed. Bias in training data can also skew the relationships a model “sees” as likely or normal.

Researchers are now working on ways to add uncertainty estimates, retrieval-augmented methods (i.e., combining information retrieval with generative models to improve response accuracy and relevance), and better benchmarks to test how well models extract relations across different domains and languages.


My Take as a High Schooler Working in NLP

As someone who has built a survey analysis platform and published research papers about sentiment classification, I find relation extraction exciting because it can connect scattered pieces of information into a bigger picture. Specifically, for projects like my nonprofit Student Echo, a future system could automatically link student concerns to policy areas or resources.

At the same time, I am cautious. Seeing how easily LLMs generate answers reminds me that relationships in text are often subtle. Automating them risks oversimplifying complex realities. Still, the idea that a model can find and organize connections that would take a person hours to spot is inspiring and worth exploring.


Conclusion

Relation extraction is moving from hand-built rules and large labeled datasets to flexible, generalist large language models. This shift is making it easier to build knowledge graphs, extract facts, and understand text at scale. But it also raises new questions about reliability, fairness, and evaluation.

If you want to dig deeper, check out A Survey on Cutting-Edge Relation Extraction Techniques Based on Language Models (arXiv link) or browse ACL 2024–2025 papers on relation extraction. Watching how this field evolves over the next few years will be exciting, and I plan to keep following it for future blog posts.

— Andrew


Introduction to Zotero: Your Free Personal Research Assistant

At the beginning of this summer (2025), I learned about a tool that I wish I had discovered years ago. I hadn’t even heard of Zotero until my research collaborator, computational sociolinguist Dr. Sidney Wong, introduced it to me while we were working on our computational linguistics project analyzing Twitch data.

After exploring it and learning to use it for my current research, I now realize how effective and essential Zotero is for managing academic work. Honestly, I wish I could have used it for all my previous research projects.


What is Zotero?

Zotero is a free, easy-to-use tool that helps researchers at any level:

  • Collect sources such as journal articles, books, web pages, and more
  • Organize them into collections and tag them for easy retrieval
  • Annotate PDFs directly within the app with highlights and notes
  • Cite sources seamlessly in any citation style while writing papers
  • Share references and collections with collaborators

It’s like having a personal research assistant that keeps all your readings, citations, and notes organized in one place.


Why I Recommend Zotero for High School Students

As high school students, we often juggle multiple classes, club projects, competitions, and research interests. Zotero makes it easy to:

  • Manage research projects efficiently, especially when writing papers that require formal citations
  • Keep track of readings and annotate PDFs, so you don’t lose key insights
  • Collaborate with teammates or research mentors by sharing folders and annotations
  • Avoid citation mistakes, as it automatically generates references in APA, MLA, Chicago, and many other styles

My Experience Using Zotero

When Dr. Wong first recommended Zotero to me, I was a bit hesitant because I thought, “Do I really need another app?” But after installing it and importing my Twitch-related research papers, I quickly saw its value. Now, I can:

  • Search across all my papers by keyword or tag
  • Keep notes attached to specific papers so I never lose insights
  • Instantly generate BibTeX entries for LaTeX documents or formatted citations for my blog posts and papers

I wish I had known about Zotero earlier, especially during my survey sentiment analysis project and my work preparing research paper submissions. It would have saved me so much time managing citations and keeping literature organized.


Zotero vs. Other Reference Managers: Pros and Cons

Here is a quick comparison of Zotero vs. similar tools like Mendeley and EndNote based on my research and initial use:

Pros of Zotero

  • ✅ Completely free and open source with no premium restrictions on core features
  • ✅ Easy to use with a clean interface suitable for beginners
  • ✅ Browser integration for one-click saving of articles and webpages
  • ✅ Excellent plugin support for Word, LibreOffice, and Google Docs
  • ✅ Strong community support and development
  • ✅ Works well for group projects with shared libraries

Cons of Zotero

  • ❌ Limited built-in cloud storage for PDFs (300 MB free; need WebDAV or paid plan for more)
  • ❌ Not as widely used in certain STEM fields compared to Mendeley or EndNote
  • ❌ Slightly fewer advanced citation style editing features than EndNote

Compared to Mendeley

  • Mendeley offers 2 GB free storage and a slightly more modern PDF viewer, but it is owned by Elsevier and some users dislike its closed ecosystem.
  • Zotero, being open-source, is often preferred for transparency and community-driven development.

Compared to EndNote

  • EndNote is powerful and widely used in academia but is expensive (>$100 license), making it inaccessible for many high school students.
  • Zotero offers most of the core features for free with a simpler setup.

Final Thoughts

If you’re a high school student interested in research, I highly recommend checking out Zotero. It’s free, easy to set up, and can make your academic life so much more organized and efficient.

You can explore and download it here. Let me know if you want a future blog post on how I set up my Zotero collections and notes for research projects.

— Andrew


Rethinking AI Bias: Insights from Professor Resnik’s Position Paper

I recently read Professor Philip Resnik’s thought-provoking position paper, “Large Language Models Are Biased Because They Are Large Language Models,” published in Computational Linguistics 51(3), which is available via open access. This paper challenges conventional perspectives on bias in artificial intelligence, prompting a deeper examination of the inherent relationship between bias and the foundational design of large language models (LLMs). Resnik’s primary objective is to stimulate critical discussion by arguing that harmful biases are an inevitable outcome of the current architecture of LLMs. The paper posits that addressing these biases effectively requires a fundamental reevaluation of the assumptions underlying the design of AI systems driven by LLMs.

What the paper argues

  • Bias is built into the very goal of an LLM. A language model tries to predict the next word by matching the probability patterns of human text. Those patterns come from people. People carry stereotypes, norms, and historical imbalances. If an LLM learns the patterns faithfully, it learns the bad with the good. The result is not a bug that appears once in a while. It is a direct outcome of the objective the model optimizes.
  • Models cannot tell “what a word means” apart from “what is common” or “what is acceptable.” Resnik uses a nurse example. Some facts are definitional (A nurse is a kind of healthcare worker). Other facts are contingent but harmless (A nurse is likely to wear blue clothing at work). Some patterns are contingent and harmful if used for inference (A nurse is likely to wear a dress to a formal occasion). Current LLMs do not have an internal line that separates meaning from contingent statistics or that flags the normative status of an inference. They just learn distributions.
  • Reinforcement Learning from Human Feedback (RLHF) and other mitigations help on the surface, but they have limits. RLHF tries to steer a pre-trained model toward safer outputs. The process relies on human judgments that vary by culture and time. It also has to keep the model close to its pretraining, or the model loses general ability. That tradeoff means harmful associations can move underground rather than disappear, like a balloon that gets squeezed in one place and bulges in another. Some studies even find that covert bias remains after mitigation (Gallegos et al. 2024; Hofmann et al. 2024).
  • The root cause is a hard-core, distribution-only view of language. When meaning is treated as “whatever co-occurs with what,” the model has no principled way to encode norms. The paper suggests rethinking foundations. One direction is to separate stable, conventional meaning (like word sense and category membership) from contextual or conveyed meaning (which is where many biases live). Another idea is to modularize competence, so that using language in socially appropriate ways is not forced to emerge only from next-token prediction. None of this is easy, but it targets the cause rather than only tuning symptoms.

Why this matters

Resnik is not saying we should give up. He is saying that quick fixes will not fully erase harm when the objective rewards learning whatever is frequent in human text. If we want models that reason with norms, we need objectives and representations that include norms, not only distributions.

Conclusion

This paper offers a clear message. Bias is not only a content problem in the data. It is also a design problem in how we define success for our models. If the goal is to build systems that are both capable and fair, then the next steps should focus on objectives, representations, and evaluation methods that make room for norms and constraints. That is harder than prompt tweaks, but it is the kind of challenge that can move the field forward.

Link to the paper: Large Language Models Are Biased Because They Are Large Language Models

— Andrew


Summary: “Large Language Models Are Improving Exponentially”

I recently read an article on IEEE Spectrum titled “Large Language Models Are Improving Exponentially”. Here is a summary of its key points.


Benchmarking LLM Performance

Benchmarking large language models (LLMs) is challenging because their main goal is to produce text indistinguishable from human writing, which doesn’t always correlate with traditional processor performance metrics. However, it remains important to measure their progress to understand how much better LLMs are becoming over time and to estimate when they might complete substantial tasks independently.


METR’s Findings on Exponential Improvement

Researchers at Model Evaluation & Threat Research (METR) in Berkeley, California, published a paper in March 2025 called Measuring AI Ability to Complete Long Tasks. They concluded that:

  • The capabilities of key LLMs are doubling every seven months.
  • By 2030, the most advanced LLMs could complete, with 50 percent reliability, a software-based task that would take humans a full month of 40-hour workweeks.
  • These LLMs might accomplish such tasks much faster than humans, possibly within days or even hours.

Potential Tasks by 2030

Tasks that LLMs might be able to perform by 2030 include:

  • Starting up a company
  • Writing a novel
  • Greatly improving an existing LLM

According to AI researcher Zach Stein-Perlman, such capabilities would come with enormous stakes, involving both potential benefits and significant risks.


The Task-Completion Time Horizon Metric

At the core of METR’s work is a metric called “task-completion time horizon.” It measures the time it would take human programmers to complete a task that an LLM can complete with a specified reliability, such as 50 percent.

Their plots (see graphs below) show:

  • Exponential growth in LLM capabilities with a doubling period of around seven months (Graph at the top).
  • Tasks that are “messier” or more similar to real-world scenarios remain more challenging for LLMs (Graph at the bottom).
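To get a feel for what a seven-month doubling period implies, I worked through the arithmetic. If the time horizon doubles every d months, then after t months it is h(t) = h0 × 2^(t/d), and the time needed to reach a target horizon H is t = d × log2(H / h0). The sketch below uses two assumptions of mine that are not from the article: a starting horizon of one hour, and a work month counted as four 40-hour weeks (160 hours).

```python
import math


def horizon_after(months, start_hours, doubling_months=7.0):
    """Time horizon (in human-hours) after `months` of exponential growth."""
    return start_hours * 2 ** (months / doubling_months)


def months_until(target_hours, start_hours, doubling_months=7.0):
    """Months needed for the horizon to grow from start to target."""
    return doubling_months * math.log2(target_hours / start_hours)


WORK_MONTH_HOURS = 4 * 40  # assumption: four 40-hour workweeks
START_HOURS = 1.0          # hypothetical starting horizon

months_needed = months_until(WORK_MONTH_HOURS, START_HOURS)
print(f"About {months_needed:.0f} months to reach a {WORK_MONTH_HOURS}-hour horizon")
```

Under those assumptions the answer comes out to roughly 51 months, a little over four years. Changing the starting horizon shifts the date, but each doubling or halving of the starting point only moves it by seven months.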

Caveats About Growth and Risks

While these results raise concerns about rapid AI advancement, METR researcher Megan Kinniment noted that:

  • Rapid acceleration does not necessarily result in “massively explosive growth.”
  • Progress could be slowed by factors such as hardware or robotics bottlenecks, even if AI systems become very advanced.

Final Summary

Overall, the article emphasizes that LLMs are improving exponentially, potentially enabling them to handle complex, month-long human tasks by 2030. This progress comes with significant benefits and risks, and its trajectory may depend on external factors like hardware limitations.

You can read the full article here.

— Andrew

Speeding Up AI for Everyone: The PaPaformer Model Making Language Tech Work on Phones and Low-Power Devices

AI has become more capable than ever, but many of the most advanced tools still require massive cloud servers to run. That means if you want ChatGPT-level performance, you usually need a reliable internet connection and a lot of computing power behind the scenes. But what if you could have that kind of AI right on your phone, even without Wi‑Fi?

That’s where the PaPaformer model comes in.

What is the PaPaformer Model?
PaPaformer is a new AI architecture developed to train large language models more efficiently and make them small enough to run smoothly on low-power devices like smartphones, tablets, or even embedded systems. You can read more about it in the original paper here: PaPaformer: Language Model from Pre-trained Parallel Paths.

Unlike most large models today that require powerful cloud servers to process requests, PaPaformer is designed so the model can be stored and run directly on your device. This means you can use advanced language technology without a constant internet connection. It also helps protect privacy, since your data stays local instead of being sent to the cloud for processing.

Why It Matters
By making AI lighter and more portable, PaPaformer could bring powerful language tools to more people around the world, including those with limited internet access or older devices. It could also make AI faster to respond, since it does not have to constantly send data back and forth to the cloud.

Examples in Action
Imagine using ChatGPT-style features on a budget smartphone in a remote area. With most current apps, like the regular ChatGPT app, you still need a strong internet connection because the AI runs on servers, not your device. But with a PaPaformer-powered tool, the AI would actually run locally, meaning you could:

  • Translate between languages instantly, even without Wi‑Fi
  • Use a speech-to-text tool for endangered languages that works entirely on your device
  • Let teachers translate lessons in real time for students in rural schools without relying on an internet connection
  • Help students write essays in multiple languages privately, without sending drafts to a remote server

This offline capability is the big difference. It is not just accessing AI through the cloud; it is carrying the AI with you wherever you go.

Looking Ahead
If PaPaformer and similar approaches keep improving, we could see a future where advanced AI is available to anyone, anywhere, without needing expensive devices or constant internet access. For someone like me, interested in computational linguistics, this could also open up new possibilities for preserving languages, creating translation tools, and making language technology more inclusive worldwide.

— Andrew

How NLP Helps Robots Handle Interruptions: A Summary of JHU Research

I recently came across an awesome study from Johns Hopkins University describing how computational linguistics and NLP can make robots better conversational partners by teaching them how to handle interruptions, a feature that feels basic for humans but is surprisingly hard for machines.


What the Study Found

Researchers trained a social robot powered by a large language model (LLM) to manage real-time interruptions based on speaker intent. They categorized interruptions into four types: Agreement, Assistance, Clarification, and Disruption.

By analyzing human conversations from interviews to informal discussions, they designed strategies tailored to each interruption type. For example:

  • If someone agrees or helps, the robot pauses, nods, and resumes speaking.
  • When someone asks for clarification, the robot explains and continues.
  • For disruptive interruptions, the robot can either hold the floor to summarize its remaining points before yielding to the human user, or it can stop talking immediately.

How NLP Powers This System

The robot uses an LLM to:

  1. Detect overlapping speech
  2. Classify the interrupter’s intent
  3. Select the appropriate response strategy

In tests involving tasks and conversations, the system correctly interpreted interruptions about 89% of the time and responded appropriately 93.7% of the time.
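The classify-then-respond logic can be sketched as a simple lookup from interruption type to strategy. This is my own simplification, not the researchers’ code: in the study an LLM does the classification, while here classify_intent is a hypothetical keyword-based stand-in so the example runs on its own.

```python
from enum import Enum


class Interruption(Enum):
    AGREEMENT = "agreement"
    ASSISTANCE = "assistance"
    CLARIFICATION = "clarification"
    DISRUPTION = "disruption"


# Strategies paraphrased from the study summary above.
STRATEGIES = {
    Interruption.AGREEMENT: "pause, nod, and resume speaking",
    Interruption.ASSISTANCE: "pause, nod, and resume speaking",
    Interruption.CLARIFICATION: "explain, then continue",
    Interruption.DISRUPTION: "summarize remaining points and yield, or stop immediately",
}


def classify_intent(utterance):
    """Hypothetical keyword stand-in for the LLM intent classifier."""
    text = utterance.lower().strip()
    if "?" in text or "what do you mean" in text:
        return Interruption.CLARIFICATION
    if text.startswith(("yes", "right", "exactly")):
        return Interruption.AGREEMENT
    if text.startswith(("you mean", "i think the word is")):
        return Interruption.ASSISTANCE
    return Interruption.DISRUPTION


def respond_to(utterance):
    """Classify the interruption and look up the matching strategy."""
    intent = classify_intent(utterance)
    return intent, STRATEGIES[intent]


for line in ["Yes, exactly!", "Wait, what do you mean?", "Anyway, let's talk about lunch."]:
    intent, strategy = respond_to(line)
    print(f"{line!r} -> {intent.value}: {strategy}")
```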


Why This Matters in NLP and Computational Linguistics

This work highlights how computational linguistics and NLP are essential to human-robot interaction.

  • NLP does more than generate responses; it helps robots understand nuance, context, and intent.
  • Developing systems like this requires understanding pause cues, intonation, and conversational flow, all core to computational linguistics.
  • It shows how multimodal AI, combining language with behavior, can enable more natural and effective interactions.

What I Found Most Interesting

The researchers noted that users didn’t like it when the robot “held the floor” too long during disruptive interruptions. It reminded me how much pragmatic context matters. Just as people expect certain rules to hold in human conversations, robots need these conversational skills too.


Looking Ahead

This research expands what NLP can do in real-world settings like healthcare, education, and social assistants. For someone like me who loves robots and language, it shows how computational linguistics helps build smarter, more human-friendly AI systems.

If you want to dive deeper, check out the full report from Johns Hopkins:
Talking robots learn to manage human interruptions

— Andrew

How Dragon Years Shape Marriages and Births: Evidence from Statistical Analysis

Recently, I came across an interesting article in Significance, an official magazine of the Royal Statistical Society, the American Statistical Association, and the Statistical Society of Australia. As a Chinese American, I’m always interested in learning about Chinese culture as well as the language. The article explored something I’ve heard a lot from my family but never thought about deeply: do dragon years really lead more people to get married or have babies?


What Is This All About?

In Chinese astrology, each lunar year is assigned one of 12 animals. The dragon is considered the most powerful and auspicious. Growing up, I often heard my relatives say it’s best to get married or have children in a dragon year because it brings luck and prosperity.

The article shared the author’s personal story about how his Aunty Li would always nag him about getting married. But in the Year of the Dragon (2024), she suddenly stopped. Why? Because planning a wedding or having a baby in a dragon year takes time, and it was already too late for him to give her a “dragon wedding” or “dragon baby.” This story made me smile because it reminded me of my own family gatherings.


What Did the Research Find?

Researchers looked at birth and marriage data from 1970 to 2023 in six countries: Singapore, China, Malaysia, the UK, Kenya, and Mexico. Here are some highlights that stood out to me:

  • In Singapore, there was a strong positive dragon effect. The fertility rate increased by about 0.17 children per woman in dragon years, which is a noticeable boost.
  • In China, surprisingly, there wasn’t a big dragon effect overall. The researchers suggested this could be because of the one-child policy (1979–2015). Families couldn’t plan for a second dragon baby even if they wanted to.
  • In Malaysia, there was a small positive effect, but it wasn’t as strong as Singapore’s.
  • In countries with tiny Chinese populations (UK, Kenya, Mexico), there was no real dragon effect.
  • Snake years, which follow dragon years and are considered less lucky, showed slightly negative effects on fertility, though these were small and not consistent across countries.
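
To get a feel for what a “dragon effect” means numerically, here is a rough Python sketch. It is not the model from the article, which I assume is a proper regression with controls. This is only a naive difference in average fertility between dragon and non-dragon years. The function names are hypothetical, the fertility numbers are made up for illustration, and treating a calendar year as a lunar year is a simplification (the lunar year actually starts in late January or February).

```python
from statistics import mean


def is_dragon_year(year: int) -> bool:
    """Dragon years recur every 12 years: 1976, 1988, 2000, 2012, 2024."""
    return year % 12 == 8


def dragon_effect(fertility_by_year: dict) -> float:
    """Average fertility in dragon years minus the average in all other years."""
    dragon = [rate for year, rate in fertility_by_year.items() if is_dragon_year(year)]
    other = [rate for year, rate in fertility_by_year.items() if not is_dragon_year(year)]
    return mean(dragon) - mean(other)


if __name__ == "__main__":
    # Made-up numbers for illustration only, NOT data from the article.
    toy_rates = {
        1999: 1.50, 2000: 1.70, 2001: 1.50,
        2011: 1.20, 2012: 1.40, 2013: 1.20,
    }
    print(round(dragon_effect(toy_rates), 2))
```

A simple comparison like this ignores long-term trends in fertility, which is one reason a real analysis needs a statistical model.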

What About Marriage?

The study also looked at marriage rates among ethnic Chinese in Singapore. The researchers expected an increase in dragon years, but the results were mixed. There was no clear pattern, and some dragon years actually had fewer marriages. So, while having a dragon baby seems to matter, a dragon wedding might not be as big a deal in the data (even though aunties still care a lot about it!).


Why Does This Matter?

For me, reading this was a cool reminder of how cultural beliefs can actually show up in real data. It also shows how statistical models can help us separate superstition from reality. In Singapore, the effect was strong enough that even the prime minister encouraged citizens to “add a little dragon” in his Lunar New Year speech.

At the same time, the study reminded me that traditions, culture, and policies (like China’s one-child policy) all interact to shape what people decide to do with their lives.


Final Thoughts

As a student interested in computational linguistics and social data, I find studies like this inspiring. They connect language, culture, demographics, and data analysis in a meaningful way. Plus, it makes me think about how traditions continue to shape decisions, even in modern societies.

I wonder if my parents also hoped I would be a dragon baby. (Spoiler: I’m not, but at least I wasn’t born in the Year of the Snake either!)

If you’re curious about Chinese culture, statistics, or demographic trends, I highly recommend reading the full article here (if your school has access). Let me know if you want a follow-up post explaining how the statistical model in the paper worked.

— Andrew

I-Language vs. E-Language: What Do They Mean in Computational Linguistics?

In the summer of 2025, I started working on a computational linguistics research project using Twitch data under the guidance of Dr. Sidney Wong, a computational sociolinguist. As someone who is still pretty new to this field, I was mainly focused on learning how to conduct literature reviews, narrow down research topics, clean data, build models, and extract insights.

One day, Dr. Wong suggested I look into the concept of I-language vs. E-language from theoretical linguistics. At first, I wasn’t sure why this mattered. I thought, Isn’t language just… language?

But as I read more, I realized that understanding this distinction changes how we think about language data and what we’re actually modeling when we work with NLP.

In this post, I want to share what I’ve learned about I-language and E-language, and why this distinction is important for computational linguistics research.


What Is I-Language?

I-language stands for “internal language.” This idea was proposed by Noam Chomsky, who argued that language is fundamentally a mental system. I-language refers to the internal, cognitive grammar that allows us to generate and understand sentences. It is about:

  • The unconscious rules and structures stored in our minds
  • Our innate capacity for language
  • The mental system that explains why we can produce and interpret sentences we’ve never heard before

For example, if I say, “The cat sat on the mat,” I-language is the system in my brain that knows the sentence is grammatically correct and what it means, even though I may never have said that exact sentence before.

I-language focuses on competence (what we know about our language) rather than performance (how we actually use it in real life).


What Is E-Language?

E-language stands for “external language.” This is the language we actually hear and see in the world, such as:

  • Conversations between Twitch streamers and their viewers
  • Tweets, Reddit posts, books, and articles
  • Any linguistic data that exists outside the mind

E-language is about observable language use. It includes everything from polished academic writing to messy chat messages filled with abbreviations, typos, and slang.

Instead of asking, “What knowledge do speakers have about their language?”, E-language focuses on, “What do speakers actually produce in practice?”


Why Does This Matter for Computational Linguistics?

When it comes to computational linguistics and NLP, this distinction affects:

1. What We Model

  • I-language-focused research tries to model the underlying grammatical rules and mental representations. One example is building a parser that captures syntactic structures based on linguistic theory.
  • E-language-focused research uses real-world data to build models that predict or generate language based on patterns, regardless of theoretical grammar. One example is training a neural network on millions of Twitch comments to generate chat responses.
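
Here is a toy Python sketch of the contrast, under heavy simplifying assumptions. The first half hard-codes a tiny lexicon and a single sentence pattern, which is the flavor of a rule-based, theory-driven model. The second half just counts word pairs in whatever text it is given, which is the flavor of a data-driven model. The lexicon, the pattern, the function names, and the chat messages are all invented for illustration. A real parser or neural model is far more complex.

```python
from collections import Counter

# I-language flavor: a hand-written lexicon and one flattened sentence pattern.
LEXICON = {"the": "Det", "cat": "N", "mat": "N", "sat": "V", "on": "P"}
SENTENCE_PATTERN = ["Det", "N", "V", "P", "Det", "N"]


def is_grammatical(sentence: str) -> bool:
    """Accept a sentence only if its tag sequence matches the hand-written rule."""
    tags = [LEXICON.get(word) for word in sentence.lower().split()]
    return tags == SENTENCE_PATTERN


# E-language flavor: no rules, just counts of what people actually typed.
def bigram_counts(messages):
    """Count adjacent word pairs across a list of messages."""
    counts = Counter()
    for message in messages:
        tokens = message.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    print(is_grammatical("The cat sat on the mat"))
    print(is_grammatical("cat the sat on mat the"))

    # Made-up chat messages, not real Twitch data.
    chat = ["gg wp", "gg wp chat", "lol gg"]
    print(bigram_counts(chat).most_common(1))
```

The rule-based half knows nothing about “gg wp,” and the counting half has no idea why “cat the sat” is odd. That gap is basically the distinction in miniature.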

2. Research Goals

If your goal is to understand how humans process and represent language cognitively, you’re leaning towards I-language research. This includes computational psycholinguistics, cognitive modeling, and formal grammar induction.

If your goal is to build practical NLP systems for tasks like translation, summarization, or sentiment analysis, you’re focusing on E-language. These projects care about performance and usefulness, even if the model doesn’t match linguistic theory.


3. How Models Are Evaluated

I-language models are evaluated based on how well they align with linguistic theory or native speaker intuitions about grammaticality.

E-language models are evaluated using performance metrics, such as accuracy, BLEU scores, or perplexity, based on how well they handle real-world data.
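
Since perplexity comes up a lot, here is a small Python sketch of how it can be computed, assuming we already have the probability a model assigned to each token in a test sentence. Perplexity is the exponential of the average negative log-probability per token, so lower is better. The function name and the example probabilities are my own, chosen for illustration.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


if __name__ == "__main__":
    # A model that gives every token probability 0.25 is as uncertain as
    # a uniform choice among 4 options, so its perplexity is about 4.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))

    # Higher probabilities on the observed tokens give lower perplexity.
    print(perplexity([0.9, 0.8, 0.95]))
```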


My Thoughts as a Beginner

When Dr. Wong first told me about this distinction, I thought it was purely theoretical. But now, while working with Twitch data, I see the importance of both views.

For example:

  • If I want to study how syntactic structures vary in Twitch chats, I need to think in terms of I-language to analyze grammar.
  • If I want to build an NLP model that generates Twitch-style messages, I need to focus on E-language to capture real-world usage patterns.

Neither approach is better than the other. They just answer different types of questions. I-language is about why language works the way it does, while E-language is about how language is actually used in the world.


Final Thoughts

Understanding I-language vs. E-language helps me remember that language isn’t just data for machine learning models. It’s a human system with deep cognitive and social layers. Computational linguistics becomes much more meaningful when we consider both perspectives: “What does the data tell us?” and “What does it reveal about how humans think and communicate?”

If you’re also just starting out in this field, I hope this post helps you see why these theoretical concepts matter for practical NLP and AI work. Let me know if you want a follow-up post about other foundational linguistics ideas for computational research.

— Andrew
