In the summer of 2025, I started working on a computational linguistics research project using Twitch data under the guidance of Dr. Sidney Wong, a computational sociolinguist. As someone who is still pretty new to this field, I was mainly focused on learning how to conduct literature reviews, narrow down research topics, clean data, build models, and extract insights.
One day, Dr. Wong suggested I look into the concept of I-language vs. E-language from theoretical linguistics. At first, I wasn’t sure why this mattered. I thought, “Isn’t language just… language?”
But as I read more, I realized that understanding this distinction changes how we think about language data and what we’re actually modeling when we work with NLP.
In this post, I want to share what I’ve learned about I-language and E-language, and why this distinction is important for computational linguistics research.
What Is I-Language?
I-language stands for “internal language.” This idea was proposed by Noam Chomsky, who argued that language is fundamentally a mental system. I-language refers to the internal, cognitive grammar that allows us to generate and understand sentences. It is about:
- The unconscious rules and structures stored in our minds
- Our innate capacity for language
- The mental system that explains why we can produce and interpret sentences we’ve never heard before
For example, if I say, “The cat sat on the mat,” I-language is the system in my brain that recognizes the sentence as grammatical and works out what it means, even though I may never have said that exact sentence before.
I-language focuses on competence (what we know about our language) rather than performance (how we actually use it in real life).
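The point about producing sentences we’ve never heard before can be made concrete with a toy sketch. The word lists and the single sentence pattern below are my own made-up illustration, not a real model of English or of anyone’s mental grammar. The idea is only that a small, finite set of rules licenses far more sentences than anyone would need to memorize.

```python
from itertools import product

# Hypothetical toy grammar with one sentence pattern:
#   S -> Det N V P Det N
# The word lists are invented for illustration only.
DETS = ["the", "a"]
NOUNS = ["cat", "mat", "dog"]
VERBS = ["sat", "slept"]
PREPS = ["on", "under"]

def generate_sentences():
    """Yield every sentence the toy grammar licenses."""
    for words in product(DETS, NOUNS, VERBS, PREPS, DETS, NOUNS):
        yield " ".join(words)

sentences = list(generate_sentences())
print(len(sentences))                          # 2*3*2*2*2*3 = 144
print("the cat sat on the mat" in sentences)   # True
```

Nine words and one rule already give 144 sentences, most of which I have certainly never said out loud.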
What Is E-Language?
E-language stands for “external language.” This is the language we actually hear and see in the world, such as:
- Conversations between Twitch streamers and their viewers
- Tweets, Reddit posts, books, and articles
- Any linguistic data that exists outside the mind
E-language is about observable language use. It includes everything from polished academic writing to messy chat messages filled with abbreviations, typos, and slang.
Instead of asking, “What knowledge do speakers have about their language?”, an E-language perspective asks, “What do speakers actually produce in practice?”
Why Does This Matter for Computational Linguistics?
In computational linguistics and NLP, this distinction affects three things:
1. What We Model
- I-language-focused research tries to model the underlying grammatical rules and mental representations. For example, building a parser that captures syntactic structures based on linguistic theory.
- E-language-focused research uses real-world data to build models that predict or generate language based on patterns, regardless of theoretical grammar. For example, training a neural network on millions of Twitch comments to generate chat responses.
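To make the contrast between the two bullets above more tangible, here is a minimal sketch that puts both styles side by side. Everything in it is an assumption for illustration: the tiny lexicon, the two sentence patterns, and the chat messages are invented, and a real parser or a real neural model would be far more involved. The rule-based function stands in for the theory-driven approach, and the bigram counter stands in for the data-driven one.

```python
from collections import Counter

# --- Rule-driven side: a hand-written toy grammar (hypothetical) ---
# Accepted patterns: "Det N V" and "Det N V P Det N".
LEXICON = {
    "the": "Det", "a": "Det",
    "cat": "N", "mat": "N", "dog": "N",
    "sat": "V", "slept": "V",
    "on": "P", "under": "P",
}

def parse_np(tags, i):
    """Return the index just past a 'Det N' noun phrase at i, or None."""
    if tags[i:i + 2] == ["Det", "N"]:
        return i + 2
    return None

def is_grammatical(sentence):
    """Check a sentence against the toy grammar's two patterns."""
    words = sentence.lower().split()
    if any(w not in LEXICON for w in words):
        return False  # unknown words are outside the toy grammar
    tags = [LEXICON[w] for w in words]
    i = parse_np(tags, 0)
    if i is None or i >= len(tags) or tags[i] != "V":
        return False
    i += 1
    if i == len(tags):
        return True  # pattern: Det N V
    if tags[i] != "P":
        return False
    return parse_np(tags, i + 1) == len(tags)  # pattern: Det N V P Det N

# --- Data-driven side: bigram counts from observed messages ---
def bigram_counts(messages):
    """Count adjacent token pairs, with start and end markers."""
    counts = Counter()
    for msg in messages:
        tokens = ["<s>"] + msg.lower().split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

def most_likely_next(counts, word):
    """Return the token most often seen after `word`, or None if unseen."""
    candidates = {nxt: c for (prev, nxt), c in counts.items() if prev == word}
    return max(candidates, key=candidates.get) if candidates else None

# Invented chat-style messages, not real Twitch data.
chat = ["gg wp", "gg ez", "gg wp lol", "lol that play"]
counts = bigram_counts(chat)

print(is_grammatical("The cat sat on the mat"))  # True
print(is_grammatical("gg wp"))                   # False
print(most_likely_next(counts, "gg"))            # wp
```

The toy grammar has nothing to say about “gg wp” because those words are not in its lexicon, while the bigram counter happily learns from it without knowing anything about syntax.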
2. Research Goals
If your goal is to understand how humans process and represent language cognitively, you’re leaning towards I-language research. This includes computational psycholinguistics, cognitive modeling, and formal grammar induction.
If your goal is to build practical NLP systems for tasks like translation, summarization, or sentiment analysis, you’re focusing on E-language. These projects care about performance and usefulness, even if the model doesn’t match linguistic theory.
3. How Models Are Evaluated
I-language models are evaluated based on how well they align with linguistic theory or native speaker intuitions about grammaticality.
E-language models are evaluated using performance metrics, such as accuracy, BLEU scores, or perplexity, based on how well they handle real-world data.
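Here is a rough sketch of what these two evaluation styles can look like in code. The function names, the add-one-smoothed unigram model, and the example pair are my own assumptions for illustration. I also put the test tokens into the vocabulary to keep the math simple, which a careful evaluation would not do. The first function scores a model by how often it prefers the grammatical sentence in a pair, which is closer to checking against speaker intuitions. The second computes perplexity on held-out text.

```python
import math
from collections import Counter

def minimal_pair_accuracy(score, pairs):
    """Fraction of (acceptable, unacceptable) pairs where the model's
    score for the acceptable sentence is strictly higher."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on test_tokens.

    Simplification: the vocabulary is the union of train and test
    tokens, so every test token has a nonzero probability.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log((counts[t] + 1) / total) for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

# Hypothetical scorer: 1.0 for one memorized sentence, 0.0 otherwise.
toy_score = lambda s: 1.0 if s == "the cat sat" else 0.0
pairs = [("the cat sat", "cat the sat")]
print(minimal_pair_accuracy(toy_score, pairs))  # 1.0

# Invented tokens, not real chat data.
train = "gg wp gg ez".split()
test = "gg wp".split()
print(unigram_perplexity(train, test))
```

Lower perplexity means the model was less surprised by the held-out text. It says nothing about whether the model’s internal structure looks like a grammar.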
My Thoughts as a Beginner
When Dr. Wong first told me about this distinction, I thought it was purely theoretical. But now, while working with Twitch data, I see the importance of both views.
For example:
- If I want to study how syntactic structures vary in Twitch chats, I need to think in terms of I-language to analyze grammar.
- If I want to build an NLP model that generates Twitch-style messages, I need to focus on E-language to capture real-world usage patterns.
Neither approach is better than the other. They just answer different types of questions. I-language is about why language works the way it does, while E-language is about how language is actually used in the world.
Final Thoughts
Understanding I-language vs. E-language helps me remember that language isn’t just data for machine learning models. It’s a human system with deep cognitive and social layers. Computational linguistics becomes much more meaningful when we consider both perspectives: “What does the data tell us?” and “What does it reveal about how humans think and communicate?”
If you’re also just starting out in this field, I hope this post helps you see why these theoretical concepts matter for practical NLP and AI work. Let me know if you want a follow-up post about other foundational linguistics ideas for computational research.
— Andrew