Rethinking AI Bias: Insights from Professor Resnik’s Position Paper

I recently read Professor Philip Resnik’s thought-provoking position paper, “Large Language Models Are Biased Because They Are Large Language Models,” published in Computational Linguistics 51(3), which is available via open access. This paper challenges conventional perspectives on bias in artificial intelligence, prompting a deeper examination of the inherent relationship between bias and the foundational design of large language models (LLMs). Resnik’s primary objective is to stimulate critical discussion by arguing that harmful biases are an inevitable outcome of the current architecture of LLMs. The paper posits that addressing these biases effectively requires a fundamental reevaluation of the assumptions underlying the design of AI systems driven by LLMs.

What the paper argues

Bias is built into the very goal of an LLM. A language model tries to predict the next word by matching the probability patterns of human text. Those patterns come from people. People carry stereotypes, norms, and historical imbalances. If an LLM learns the patterns faithfully, it learns the bad with the good. The result is not a bug that appears once in a while. It is a direct outcome of the objective the model optimizes.
Models cannot tell “what a word means” apart from “what is common” or “what is acceptable.” Resnik uses a nurse example. Some facts are definitional (A nurse is a kind of healthcare worker). Other facts are contingent but harmless (A nurse is likely to wear blue clothing at work). Some patterns are contingent and harmful if used for inference (A nurse is likely to wear a dress to a formal occasion). Current LLMs do not have an internal line that separates meaning from contingent statistics or that flags the normative status of an inference. They just learn distributions.
Reinforcement Learning from Human Feedback (RLHF) and other mitigations help on the surface, but they have limits. RLHF tries to steer a pre-trained model toward safer outputs. The process relies on human judgments that vary by culture and time. It also has to keep the model close to its pretraining, or the model loses general ability. That tradeoff means harmful associations can move underground rather than disappear. Some studies even find covert bias remains after mitigation (Gallegos et al. 2024; Hofmann et al. 2024). To illustrate this, consider an analogy: The balloon gets squeezed in one place, then bulges in another.
The root cause is a hard-core, distribution-only view of language. When meaning is treated as “whatever co-occurs with what,” the model has no principled way to encode norms. The paper suggests rethinking foundations. One direction is to separate stable, conventional meaning (like word sense and category membership) from contextual or conveyed meaning (which is where many biases live). Another idea is to modularize competence, so that using language in socially appropriate ways is not forced to emerge only from next-token prediction. None of this is easy, but it targets the cause rather than only tuning symptoms.

Why this matters

Resnik is not saying we should give up. He is saying that quick fixes will not fully erase harm when the objective rewards learning whatever is frequent in human text. If we want models that reason with norms, we need objectives and representations that include norms, not only distributions.

Conclusion

This paper offers a clear message. Bias is not only a content problem in the data. It is also a design problem in how we define success for our models. If the goal is to build systems that are both capable and fair, then the next steps should focus on objectives, representations, and evaluation methods that make room for norms and constraints. That is harder than prompt tweaks, but it is the kind of challenge that can move the field forward.

Link to the paper: Large Language Models Are Biased Because They Are Large Language Models

— Andrew

4,811 hits

September 19, 2025 0

Computational Linguists Help Africa Try to Close the AI Language Gap

Introduction

The fact that African languages are underrepresented in the digital AI ecosystem has gained international attention. On July 29, 2025, Nature published a news article stating that

“More than 2,000 languages spoken in Africa are being neglected in the artificial intelligence (AI) era. For example, ChatGPT recognizes only 10–20% of sentences written in Hausa, a language spoken by 94 million people in Nigeria. These languages are under-represented in large language models (LLMs) because of a lack of training data.” (source: AI models are neglecting African languages — scientists want to change that)

Another example is BBC News, released on September 4, 2025, stating that

“Although Africa is home to a huge proportion of the world’s languages – well over a quarter according to some estimates – many are missing when it comes to the development of artificial intelligence (AI). This is both an issue of a lack of investment and readily available data. Most AI tools, such as ChatGPT, used today are trained on English as well as other European and Chinese languages. These have vast quantities of online text to draw from. But as many African languages are mostly spoken rather than written down, there is a lack of text to train AI on to make it useful for speakers of those languages. For millions across the continent this means being left out.” (source: Lost in translation – How Africa is trying to close the AI language gap)

To address this problem, linguists and computer scientists are collaborating to create AI-ready datasets in 18 African languages via The African Next Voices project. Funded by the Bill and Melinda Gates Foundation ($2.2-million grant), the project involves recording 9,000 hours of speech across 18 African languages in Kenya, Nigeria, and South Africa. The goal is to create a comprehensive dataset that can be utilized for developing AI tools, such as translation and transcription services, which are particularly beneficial for local communities and their specific needs. The project emphasizes the importance of capturing everyday language use to ensure that AI technologies reflect the realities of African societies. The 18 African languages selected represent only a fraction of the over 2,000 languages spoken across the continent, but project contributors aim to include more languages in the future.

Role of Computational Linguists in the Project

Computational linguists play a critical role in the African Next Voices project. Their key contributions include:

Data Curation and Annotation: They guide the transcription and translation of over 9,000 hours of recorded speech in languages like Kikuyu, Dholuo, Hausa, Yoruba, and isiZulu, ensuring linguistic accuracy and cultural relevance. This involves working with native speakers to capture authentic, everyday language use in contexts like farming, healthcare, and education.
Dataset Design: They help design structured datasets that are AI-ready, aligning the collected speech data with formats suitable for training large language models (LLMs) for tasks like speech recognition and translation. This includes ensuring data quality through review and validation processes.
Bias Mitigation: By leveraging their expertise in linguistic diversity, computational linguists work to prevent biases in AI models by curating datasets that reflect the true linguistic and cultural nuances of African languages, which are often oral and underrepresented in digital text.
Collaboration with Technical Teams: They work alongside computer scientists and AI experts to integrate linguistic knowledge into model training and evaluation, ensuring the datasets support accurate translation, transcription, and conversational AI applications.

Their involvement is essential to making African languages accessible in AI technologies, fostering digital inclusion, and preserving cultural heritage.

Final Thoughts

From the perspective of a U.S. high school student interested in pursuing computational linguistics in college, inspired by African Next Voices, here are some final thoughts and conclusions:

Impactful Career Path: Computational linguistics offers a unique opportunity to blend language, culture, and technology. For a student like me, the African Next Voices project highlights how this field can drive social good by preserving underrepresented languages and enabling AI to serve diverse communities, which could be deeply motivating.
Global Relevance: The project underscores the global demand for linguistic diversity in AI. As a future computational linguist, I can contribute to bridging digital divides, making technology accessible to millions in Africa and beyond, which is both a technical and humanitarian pursuit.
Skill Development: The work involves collaboration with native speakers, data annotation, and AI model training/evaluation, suggesting I’ll need strong skills in linguistics, programming (e.g., Python), and cross-cultural communication. Strengthening linguistics knowledge and enhancing coding skills could give me a head start.
Challenges and Opportunities: The vast linguistic diversity (over 2,000 African languages) presents challenges like handling oral traditions or limited digital resources. This complexity is exciting, as it offers a chance to innovate in dataset creation and bias mitigation, areas where I could contribute and grow.
Inspiration for Study: The focus on real-world applications (such as healthcare, education, and farming) aligns with my interest in studying computational linguistics in college and working on inclusive AI that serves people.

In short, as a high school student, I can see computational linguistics as a field where I can build tools that help people communicate and learn. I hope this post encourages you to look into the project and consider how you might contribute to similar initiatives in the future!

— Andrew

4,811 hits

September 16, 2025 0

What I Learned (and Loved) at SLIYS: Two Weeks of Linguistic Discovery at Ohio State

This summer, I had the chance to participate in both SLIYS 1 and SLIYS 2—the Summer Linguistic Institute for Youth Scholars—hosted by the Ohio State University Department of Linguistics. Across two weeks packed with lectures, workshops, and collaborative data collection, I explored the structure of language at every level: from the individual sounds we make to the complex systems that govern meaning and conversation. But if I had to pick just one highlight, it would be the elicitation sessions—hands-on explorations with real language data that made the abstract suddenly tangible.

SLIYS 1: Finding Language in Structure

SLIYS 1 started with the fundamentals—consonants, vowels, and the International Phonetic Alphabet (IPA)—but quickly expanded into diverse linguistic territory: morphology, syntax, semantics, and pragmatics. Each day featured structured lectures covering topics like sociolinguistic variation, morphological structures, and historical linguistics. Workshops offered additional insights, from analyzing sentence meanings to exploring language evolution.

The core experience, however, was our daily elicitation sessions. My group tackled Serbo-Croatian, collaboratively acting as elicitors and transcribers to construct a detailed grammar sketch. We identified consonant inventories, syllable structures (like CV, CVC, and CCV patterns), morphological markers for plural nouns and verb tenses, and syntactic word orders. Through interactions with our language consultant, we tested hypotheses directly, discovering intricacies like how questions were formed using particles like dahlee, and how adjective-noun order worked. This daily practice gave theory immediate clarity and meaning, shaping our skills as linguists-in-training.

SLIYS 2: Choosing My Path in Linguistics

SLIYS 2 built upon our initial foundations, diving deeper into phonological analysis, morphosyntactic properties, and the relationship between language and cognition. This week offered more autonomy, allowing us to select workshops tailored to our interests. My choices included sessions on speech perception, dialectology, semiotics, and linguistic anthropology—each challenging me to think more broadly about language as both cognitive and cultural phenomena.

Yet again, the elicitation project anchored our experience, this time exploring Georgian. Our group analyzed Georgian’s distinctive pluralization system, polypersonal verb agreement (verbs agreeing with both subjects and objects), and flexible sentence orders (SVO/SOV). One fascinating detail we uncovered was how nouns remained singular when preceded by numbers. Preparing our final presentation felt especially rewarding, bringing together the week’s linguistic discoveries in a cohesive narrative. Presenting to our peers crystallized not just what we learned, but how thoroughly we’d internalized it.

More Than Just a Summer Program

What I appreciated most about SLIYS was how seriously it treated us as student linguists. The instructors didn’t just lecture—they listened, challenged us, and encouraged our curiosity. Whether we were learning about deixis or discourse analysis, the focus was always on asking better questions, not just memorizing answers.

By the end of SLIYS 2, I found myself thinking not only about how language works, but why we study it in the first place. Language is a mirror to thought, a map of culture, and a bridge between people—and programs like SLIYS remind me that it’s also something we can investigate, question, and build understanding from.

Moments from SLIYS 2: A Snapshot of a Summer to Remember

As SLIYS 2 came to a close, our instructors captured these Zoom screenshots to help us remember the community, curiosity, and collaboration that made this experience so meaningful.

Special Thanks to the SLIYS 2025 Team

This incredible experience wouldn’t have been possible without the passion, insight, and dedication of the SLIYS 2025 instructors. Each one brought something unique to the table—whether it was helping us break down complex syntax, introducing us to sociolinguistics through speech perception, or guiding us through our elicitation sessions with patience and curiosity. I’m especially grateful for the way they encouraged us to ask deeper questions and think like real linguists.

Special thanks to:

Kyler Laycock – For leading with energy, making phonetics and dialectology come alive, and always reminding us how much identity lives in the details of speech.
Jory Ross – For guiding us through speech perception and conversational structure, and for sharing her excitement about how humans really process language.
Emily Sagasser – For her insights on semantics, pragmatics, and focus structure, and for pushing us to think about how language connects to social justice and cognition.
Elena Vaikšnoraitė – For their thoughtful instruction in syntax and psycholinguistics, and for showing us the power of connecting data across languages.
Dr. Clint Awai-Jennings – For directing the program with care and purpose—and for showing us that it’s never too late to turn a passion for language into a life’s work.

Thank you all for making SLIYS 1 and 2 an unforgettable part of my summer.

— Andrew

July 28, 2025 0

I-Language vs. E-Language: What Do They Mean in Computational Linguistics?

In the summer of 2025, I started working on a computational linguistics research project using Twitch data under the guidance of Dr. Sidney Wong, a Computational Sociolinguist. As someone who is still pretty new to this field, I was mainly focused on learning how to conduct literature reviews, help narrow down research topics, clean data, build models, and extract insights.

One day, Dr. Wong suggested I look into the concept of I-language vs. E-language from theoretical linguistics. At first, I wasn’t sure why this mattered. I thought, Isn’t language just… language?

But as I read more, I realized that understanding this distinction changes how we think about language data and what we’re actually modeling when we work with NLP.

In this post, I want to share what I’ve learned about I-language and E-language, and why this distinction is important for computational linguistics research.

What Is I-Language?

I-language stands for “internal language.” This idea was proposed by Noam Chomsky, who argued that language is fundamentally a mental system. I-language refers to the internal, cognitive grammar that allows us to generate and understand sentences. It is about:

The unconscious rules and structures stored in our minds
Our innate capacity for language
The mental system that explains why we can produce and interpret sentences we’ve never heard before

For example, if I say, “The cat sat on the mat,” I-language is the system in my brain that knows the sentence is grammatically correct and what it means, even though I may never have said that exact sentence before.

I-language focuses on competence (what we know about our language) rather than performance (how we actually use it in real life).

What Is E-Language?

E-language stands for “external language.” This is the language we actually hear and see in the world, such as:

Conversations between Twitch streamers and their viewers
Tweets, Reddit posts, books, and articles
Any linguistic data that exists outside the mind

E-language is about observable language use. It includes everything from polished academic writing to messy chat messages filled with abbreviations, typos, and slang.

Instead of asking, “What knowledge do speakers have about their language?”, E-language focuses on, “What do speakers actually produce in practice?”

Why Does This Matter for Computational Linguistics?

When it comes to computational linguistics and NLP, this distinction affects:

1. What We Model

I-language-focused research tries to model the underlying grammatical rules and mental representations. For example, building a parser that captures syntax structures based on linguistic theory.
E-language-focused research uses real-world data to build models that predict or generate language based on patterns, regardless of theoretical grammar. For example, training a neural network on millions of Twitch comments to generate chat responses.

2. Research Goals

If your goal is to understand how humans process and represent language cognitively, you’re leaning towards I-language research. This includes computational psycholinguistics, cognitive modeling, and formal grammar induction.

If your goal is to build practical NLP systems for tasks like translation, summarization, or sentiment analysis, you’re focusing on E-language. These projects care about performance and usefulness, even if the model doesn’t match linguistic theory.

3. How Models Are Evaluated

I-language models are evaluated based on how well they align with linguistic theory or native speaker intuitions about grammaticality.

E-language models are evaluated using performance metrics, such as accuracy, BLEU scores, or perplexity, based on how well they handle real-world data.

My Thoughts as a Beginner

When Dr. Wong first told me about this distinction, I thought it was purely theoretical. But now, while working with Twitch data, I see the importance of both views.

For example:

If I want to study how syntax structures vary in Twitch chats, I need to think in terms of I-language to analyze grammar.
If I want to build an NLP model that generates Twitch-style messages, I need to focus on E-language to capture real-world usage patterns.

Neither approach is better than the other. They just answer different types of questions. I-language is about why language works the way it does, while E-language is about how language is actually used in the world.

Final Thoughts

Understanding I-language vs. E-language helps me remember that language isn’t just data for machine learning models. It’s a human system with deep cognitive and social layers. Computational linguistics becomes much more meaningful when we consider both perspectives: What does the data tell us? and What does it reveal about how humans think and communicate?

If you’re also just starting out in this field, I hope this post helps you see why these theoretical concepts matter for practical NLP and AI work. Let me know if you want a follow-up post about other foundational linguistics ideas for computational research.

— Andrew

July 20, 2025 0

What Is Computational Linguistics (and How Is It Different from NLP)?

When I first got interested in this field, I kept seeing the terms computational linguistics and natural language processing (NLP) used almost interchangeably. At first, I thought they were the same thing. By delving deeper through reading papers, taking courses, and conducting research, I realized that although they overlap significantly, they are not entirely identical.

So in this post, I want to explain the difference (and connection) between computational linguistics and NLP from the perspective of a high school student who’s just getting started, but really interested in understanding both the language and the tech behind today’s AI systems.

So, what is computational linguistics?

Computational linguistics is the science of using computers to understand and model human language. It’s rooted in linguistics, the study of how language works, and applies computational methods to test linguistic theories, analyze language structure, or build tools like parsers and grammar analyzers.

It’s a field that sits at the intersection of computer science and linguistics. Think syntax trees, morphology, phonology, semantics, and using code to work with all of those.

For example, in computational linguistics, you might:

Use code to analyze sentence structure in different languages
Create models that explain how children learn grammar rules
Explore how prosody (intonation and stress) changes meaning in speech
Study how regional dialects appear in online chat platforms like Twitch

In other words, computational linguistics is often about understanding language (how it’s structured, how it varies, and how we can model it with computers).

Then what is NLP?

Natural language processing (NLP) is a subfield of AI and computer science that focuses on building systems that can process and generate human language. It’s more application-focused. If you’ve used tools like ChatGPT, Google Translate, Siri, or even grammar checkers, you’ve seen NLP in action.

While computational linguistics asks, “How does language work, and how can we model it?”, NLP tends to ask, “How can we build systems that understand or generate language usefully?”

Examples of NLP tasks:

Sentiment analysis (e.g., labeling text as positive, negative, or neutral)
Machine translation
Named entity recognition (e.g., tagging names, places, dates)
Text summarization or question answering

In many cases, NLP researchers care more about whether a system works than whether it matches a formal linguistic theory. That doesn’t mean theory doesn’t matter, but the focus is more on performance and results.

So, what’s the difference?

The line between the two fields can get blurry (and many people work in both), but here’s how I think of it:

Computational Linguistics	NLP
Rooted in linguistics	Rooted in computer science and AI
Focused on explaining and modeling language	Focused on building tools and systems
Often theoretical or data-driven linguistics	Often engineering-focused and performance-driven
Examples: parsing syntax, studying morphology	Examples: sentiment analysis, machine translation

Think of computational linguistics as the science of language and NLP as the engineering side of language technology.

Why this matters to me

As someone who’s really interested in computational linguistics, I find myself drawn to the linguistic side of things, like how language varies, how meaning is structured, and how AI models sometimes get things subtly wrong because they don’t “understand” language the way humans do.

At the same time, I still explore NLP, especially when working on applied projects like sentiment analysis or topic modeling. I think having a strong foundation in linguistics makes me a better NLP researcher (or student), because I’m more aware of the complexity and nuance of language.

Final thoughts

If you’re just getting started, you don’t have to pick one or the other. Read papers from both fields. Try projects that help you learn both theory and application. Over time, you’ll probably find yourself leaning more toward one, but having experience in both will only help.

I’m still learning, and I’m excited to keep going deeper into both sides. If you’re interested too, let me know! I’m always up for sharing reading lists, courses, or just thoughts on cool research.

— Andrew

July 17, 2025 0

Summer Programs and Activities in Computational Linguistics: My Personal Experiences and Recommendations

If you’re a high school student interested in computational linguistics, you might be wondering: What are some ways to dive deeper into this field over the summer? As someone who loves language, AI, and everything in between, I’ve spent the past year researching programs and activities, and I wanted to share what I’ve learned (along with some of my personal experiences).

1. Summer Linguistic Institute for Youth Scholars (SLIYS)

What it is:
SLIYS is a two-week summer program run by The Ohio State University’s Department of Linguistics. It focuses on introducing high school students to language analysis and linguistic theory in a fun and rigorous way. Students get to explore syntax, morphology, phonetics, language universals, and even some computational topics.

My experience:
I’m super excited to share that I’ll be participating in SLIYS this summer (July 14 – 25, 2025). I was so happy to be accepted, and I’m looking forward to learning from real linguistics professors and meeting other students who are passionate about language. I’ll definitely share a reflection post after I finish the program, so stay tuned if you want an inside look!

Learn more about SLIYS here.

2. Summer Youth Camp for Computational Linguistics (SYCCL)

What it is:
SYCCL is a summer camp hosted by the Department of Linguistics and the Institute for Advanced Computational Science at Stony Brook University. It introduces high school students to computational linguistics and language technology, covering topics like language data, NLP tools, and coding for language analysis.

My experience:
I had planned to apply for SYCCL this year as well, but unfortunately, its schedule (July 6 – 18, 2025) conflicted with SLIYS, which I had already accepted. Another challenge I faced was that SYCCL’s website wasn’t updated until late April 2025, which is quite late compared to other summer programs. I had actually contacted the university earlier this year and they confirmed it would run again, but I didn’t see the application open until April. My advice is to check their website frequently starting early spring, and plan for potential conflicts with other summer programs.

Learn more about SYCCL here.

3. North American Computational Linguistics Open Competition (NACLO)

What it is:
NACLO is an annual computational linguistics competition for high school students across North America. It challenges students with problems in linguistics and language data analysis, testing their ability to decipher patterns in unfamiliar languages.

My experience:
I’ve tried twice to participate in NACLO at my local test center. Unfortunately, both times the test dates were weekdays that conflicted with my school final exams, so I had to miss them. If you’re planning to participate, I strongly recommend checking the schedule early to make sure it doesn’t overlap with finals or other major commitments. Despite missing it, I still find their practice problems online really fun and useful for thinking like a computational linguist.

Learn more about NACLO here.

4. LSA Summer Institute

What it is:
The Linguistic Society of America (LSA) Summer Institute is an intensive four-week program held every two years at different universities. It offers courses and workshops taught by top linguists and is known as one of the best ways to explore advanced topics in linguistics, including computational linguistics.

My experience:
I was planning to apply for the LSA Summer Institute this year. However, I found out that it is only open to individuals aged 18 and older. I contacted the LSA Institute Registration Office to ask if there could be any exceptions or special considerations for underage participants, but it was disappointing to receive their response: “Unfortunately, the age limit is firm and the organizers will not be considering any exceptions.” So if you’re thinking about applying, my advice is to check the age qualifications early before starting the application process.

Learn more about LSA Summer Institute here.

5. Local University Outreach Events and Courses

Another great way to explore linguistics and computational linguistics is by checking out courses or outreach events at local universities. For example, last summer I took LING 234 (Language and Diversity) at the University of Washington (Seattle). It was an eye-opening experience to study language variation, identity, and society from a college-level perspective. I wrote a reflection about it in my blog post from November 29, 2024. If your local universities offer summer courses for high school students, I highly recommend checking them out.

6. University-Affiliated AI4ALL Summer Programs for High School Students

What it is:
AI4ALL partners with universities to offer summer programs introducing high school students to AI research, ethics, and applications, often including NLP and language technology projects. While these programs are not focused solely on computational linguistics, they provide a great entry point into AI and machine learning, which are essential tools for language technology research.

About AI4ALL:
AI4ALL is a U.S.-based nonprofit focused on increasing diversity and inclusion in artificial intelligence (AI) education, research, development, and policy, particularly for historically underrepresented groups such as Black, Hispanic/Latinx, Indigenous, women, non-binary, low-income, and first-generation college students. Their mission is to make sure the next generation of AI researchers and developers reflects the diversity of the world.

Examples:

Stanford AI4ALL
Princeton AI4ALL
Carnegie Mellon AI4ALL

These programs are competitive and have different focus areas, but all aim to broaden participation in AI by empowering future researchers early.

Final Thoughts

I feel grateful to have these opportunities to grow my passion for computational linguistics, and I hope this list helps you plan your own summer learning journey. Whether you’re solving NACLO problems in your free time or spending two weeks at SLIYS like I will this summer, every step brings you closer to understanding how language and AI connect.

Let me know if you want a future post reviewing SLIYS after I complete it in July!

— Andrew

July 5, 2025 0

SCiL vs. ACL: What’s the Difference? (A Beginner’s Take from a High School Student)

As a high school student just starting to explore computational linguistics, I remember being confused by two organizations: SCiL (Society for Computation in Linguistics) and ACL (Association for Computational Linguistics). They both focus on language and computers, so at first, I assumed they were basically the same thing.

It wasn’t until recently that I realized they are actually two different academic communities. Each has its own focus, audience, and style of research. I’ve had the chance to engage with both, which helped me understand how they are connected and how they differ.

Earlier this year, I had the opportunity to co-author a paper that was accepted to a NAACL 2025 workshop (May 3–4). NAACL stands for the North American Chapter of the Association for Computational Linguistics. It is a regional chapter that serves researchers in the United States, Canada, and Mexico. NAACL follows ACL’s mission and guidelines but focuses on more local events and contributions.

This summer, I will be participating in SCiL 2025 (July 18–19), where I hope to meet researchers and learn more about how computational models are used to study language structure and cognition. Getting involved with both events helped me better understand what makes SCiL and ACL unique, so I wanted to share what I’ve learned for other students who might also be starting out.

SCiL and ACL: Same Field, Different Focus

Both SCiL and ACL are academic communities interested in studying human language using computational methods. However, they focus on different kinds of questions and attract different types of researchers.

Here’s how I would explain the difference.

SCiL (Society for Computation in Linguistics)

SCiL is more focused on using computational tools to support linguistic theory and cognitive science. Researchers here are often interested in how language works at a deeper level, including areas like syntax, semantics, and phonology.

The community is smaller and includes people from different disciplines like linguistics, psychology, and cognitive science. You are likely to see topics such as:

Computational models of language processing
Formal grammars and linguistic structure
Psycholinguistics and cognitive modeling
Theoretical syntax and semantics

If you are interested in how humans produce and understand language, and how computers can help us model that process, SCiL might be a great place to start.

ACL (Association for Computational Linguistics)

ACL has a broader and more applied focus. It is known for its work in natural language processing (NLP), artificial intelligence, and machine learning. The research tends to focus on building tools and systems that can actually use human language in practical ways.

The community is much larger and includes researchers from both academia and major tech companies like Google, OpenAI, Meta, and Microsoft. You will see topics such as:

Language models like GPT, BERT, and LLaMA
Machine translation and text summarization
Speech recognition and sentiment analysis
NLP benchmarks and evaluation methods

If you want to build or study real-world AI systems that use language, ACL is the place where a lot of that cutting-edge research is happening.

Which One Should You Explore First?

It really depends on what excites you most.

If you are curious about how language works in the brain or how to use computational tools to test theories of language, SCiL is a great choice. It is more theory-driven and focused on cognitive and linguistic insights.

If you are more interested in building AI systems, analyzing large datasets, or applying machine learning to text and speech, then ACL might be a better fit. It is more application-oriented and connected to the latest developments in NLP.

They both fall under the larger field of computational linguistics, but they come at it from different angles. SCiL is more linguistics-first, while ACL is more NLP-first.

Final Thoughts

I am still early in my journey, but understanding the difference between SCiL and ACL has already helped me navigate the field better. Each community asks different questions, uses different methods, and solves different problems, but both are helping to push the boundaries of how we understand and work with language.

I am looking forward to attending SCiL 2025 this summer, and I will definitely write about that experience afterward. In the meantime, I hope this post helps other students who are just starting out and wondering where to begin.

— Andrew

June 28, 2025 0

Happy New Year 2025! Reflecting on a Year of Growth and Looking Ahead

As we welcome 2025, I want to take a moment to reflect on the past year and share some exciting plans for the future.

Highlights from 2024

Academic Pursuits: I delved deeper into Natural Language Processing (NLP), discovering Jonathan Dunn’s Natural Language Processing for Corpus Linguistics, which seamlessly integrates computational methods with traditional linguistic analysis.
AI and Creativity: Exploring the intersection of AI and human creativity, I read Garry Kasparov’s Deep Thinking, which delves into his experiences with AI in chess and offers insights into the evolving relationship between humans and technology.
Competitions and Courses: I actively participated in Kaggle competitions, enhancing my machine learning and data processing skills, which are crucial in the neural network and AI aspects of Computational Linguistics.
Community Engagement: I had the opportunity to compete in the 2024 VEX Robotics World Championship and reintroduced our school’s chess club to the competitive scene, marking our return since pre-COVID times.

Looking Forward to 2025

Expanding Knowledge: I plan to continue exploring advanced topics in NLP and AI, sharing insights and resources that I find valuable.
Engaging Content: Expect more in-depth discussions, tutorials, and reviews on the latest developments in computational linguistics and related fields.
Community Building: I aim to foster a community where enthusiasts can share knowledge, ask questions, and collaborate on projects.

Thank you for being a part of this journey. Your support and engagement inspire me to keep exploring and sharing. Here’s to a year filled with learning, growth, and innovation!

January 1, 2025 0

A Book That Expanded My Perspective on NLP: Natural Language Processing for Corpus Linguistics by Jonathan Dunn

Book Link: https://doi.org/10.1017/9781009070447

As I dive deeper into the fascinating world of Natural Language Processing (NLP), I often come across resources that reshape my understanding of the field. One such recent discovery is Jonathan Dunn’s Natural Language Processing for Corpus Linguistics. This book, a part of the Elements in Corpus Linguistics series by Cambridge University Press, stands out for its seamless integration of computational methods with traditional linguistic analysis.

A Quick Overview

The book serves as a guide to applying NLP techniques to corpus linguistics, especially in dealing with large-scale corpora that are beyond the scope of traditional manual analysis. It discusses how models like text classification and text similarity can help address linguistic problems such as categorization (e.g., identifying part-of-speech tags) and comparison (e.g., measuring stylistic similarities between authors).

What I found particularly intriguing is its structure, which is built around five compelling case studies:

Corpus-Based Sociolinguistics: Exploring geographic and social variations in language use.
Corpus Stylistics: Understanding authorship through stylistic differences in texts.
Usage-Based Grammar: Analyzing syntax and semantics via computational models.
Multilingualism Online: Investigating underrepresented languages in digital spaces.
Socioeconomic Indicators: Applying corpus analysis to non-linguistic fields like politics and sentiment in customer reviews.

The book is as much a practical resource as it is theoretical. Accompanied by Python notebooks and a stand-alone Python package, it provides hands-on tools to implement the discussed methods—a feature that makes it especially appealing to readers with a technical bent.

A Personal Connection

My journey with this book is a bit more personal. While exploring NLP, I had the chance to meet Jonathan Dunn, who shared invaluable insights about this field. One of his students, Sidney Wong, recommended this book to me as a starting point for understanding how computational methods can expand corpus linguistics. It has since become a cornerstone of my learning in this area.

What Makes It Unique

Two aspects of Dunn’s book particularly resonated with me:

Ethical Considerations: As corpus sizes grow, so do the ethical dilemmas associated with their use. From privacy issues to biases in computational models, the book doesn’t shy away from discussing the darker side of large-scale text analysis. This balance between innovation and responsibility is a critical takeaway for anyone venturing into NLP.
Interdisciplinary Approach: Whether you’re a linguist looking to incorporate computational methods or a computer scientist aiming to understand linguistic principles, this book bridges the gap between the two disciplines beautifully. It encourages a collaborative perspective, which is essential in fields as expansive as NLP and corpus linguistics.

Who Should Read It?

If you’re a student, researcher, or practitioner with an interest in exploring how NLP can scale linguistic analysis, this book is for you. Its accessibility makes it suitable for beginners, while the advanced discussions and hands-on code offer plenty for seasoned professionals to learn from.

For me, Natural Language Processing for Corpus Linguistics isn’t just a book—it’s a toolkit, a mentor, and an inspiration rolled into one. As I continue my journey in NLP, I find myself revisiting its chapters for insights and ideas.

December 26, 2024 0