What Is an Annotated Bibliography (And Why Every Junior Researcher Should Make One)

Recently, I was fortunate to work with Dr. Sidney Wong on a computational linguistics research project using Twitch data. As a high school student just stepping into research in the field, I learned a lot—not just about the technical side of computational linguistics, but also about how research is actually done.

One of the most valuable lessons I took away was the importance of using a structured research process, especially when it comes to narrowing down a topic and conducting a literature survey. One tool that stood out to me was the annotated bibliography.

Although our project is still ongoing, I wanted to take a moment to introduce annotated bibliographies to other students who are just beginning their own research journeys.


What Is an Annotated Bibliography?

An annotated bibliography is more than just a list of sources. It’s a carefully organized collection of books, research papers, or articles. Each entry includes a short summary and analysis that helps explain what the source is about, how reliable it is, and how it fits into your research.

Each entry usually includes:

  • A full citation in a standard format (APA, MLA, Chicago, etc.)
  • A brief summary of the key points
  • An evaluation of the source’s quality or credibility
  • A reflection on how the source is useful for your project

In other words, it helps you stay organized and think critically while reading. It’s like building your own research map.


Why It Matters (Especially for Beginners)

When you’re new to a field, it’s easy to feel overwhelmed by all the papers and sources out there. Creating an annotated bibliography helps in several important ways:

1. Keeps you organized

Instead of juggling dozens of open tabs and scattered notes, you have everything in one place with clear summaries and citations.

2. Helps you truly understand what you read

Summarizing and reflecting on a source forces you to go beyond skimming. You learn to recognize the core arguments, methods, and relevance.

3. Highlights gaps in the literature

As you build your list, you’ll start to notice which topics are well studied and which ones aren’t. That can help you identify potential research questions.

4. Makes writing much easier later

When it’s time to write your literature review or paper, you’ll already have the core material prepared.


How I Got Started

When I began working with Dr. Wong on our project about Twitch chat data and language variation, he encouraged me to start building an annotated bibliography early. I started collecting articles on sociolinguistics, computational methods, and prior research involving Twitch or similar platforms.

For each article, I wrote down:

  • What the authors studied
  • How they conducted the research
  • What they concluded
  • And how it connects to my own research

Even though I’m still early in the process, having this document has already helped me organize my thoughts and see where our work fits in the broader field.


Final Thoughts

If you’re just starting out in research, I highly recommend keeping an annotated bibliography from day one. It may seem like extra work at first, but it will pay off in the long run. You’ll read more thoughtfully, remember more of what you read, and write more confidently when it’s time to publish or present.

I’ll share more about our Twitch project once it’s complete. Until then, I hope this helps you take your first step toward building strong research habits.

— Andrew

4,811 hits

Drawing the Lines: The UN’s Push for Global AI Safeguards

On September 22, 2025, the UN General Assembly hosted an extraordinary plea as more than 200 global leaders, scientists, Nobel laureates, and AI experts called for binding international safeguards to prevent the dangerous use of artificial intelligence. The plea is centered on setting “red lines” — clear boundaries that AI must not cross. (Source: NBC News). The open letter urges policymakers to enact the accord by the end of 2026, given the rapid progress of AI capabilities.

This moment struck me as deeply significant not only for AI policy but for how computational linguistics, ethics, and global governance may intersect in the coming years.


Why this matters (beyond headlines)

Often when we read about AI risks it feels abstract, unlikely scenarios decades ahead. But the UN’s call brings the framing into the political and normative domain: this is not just technical risk mitigation, it is now a matter of global legitimacy and enforceable rules.

Some of the proposed red lines include forbidding AI to impersonate humans in a deceptive way, forbidding autonomous self replication, forbidding lethal autonomous weapons systems, and more, as outlined by the Global Call for AI Red Lines and echoed in the World Economic Forum’s overview of AI red lines, which lists “no impersonating a human” and “no self-replication” among the key behaviors to prohibit. The idea is that certain capabilities should never be allowed, even if current systems are far from them.

These red lines are not purely speculative. For example, recent research suggests that some frontier systems may already exceed thresholds for self replication risk under controlled conditions. (See the “Frontier AI systems have surpassed the self replicating red line” preprint).

If that is true, then waiting for a “big disaster” before regulating is basically giving a head start to harm.


How this connects to what I care about (and have written before)

On this blog I often explore how language, algorithmic systems, and society intersect. For example, in “From Language to Threat: How Computational Linguistics Can Spot Radicalization Patterns Before Violence” I touched on how even text models have power and risk when used at scale.

Here the stakes are broader: we are no longer talking about misused speech or social media. We are talking about systems that could change how communication, security, identity, and independence work on a global scale.

Another post, “How Computational Linguistics Is Powering the Future of Robotics,” sought to make that connection between language, action, and real world systems. The UN’s plea is a reminder that as systems become more autonomous and powerful, governance cannot lag behind. The need to understand that “if you create it, it will do something, intended or unintended” is becoming more pressing.


What challenges the red lines initiative faces

This is a big idea, but turning it into reality is super tough. Here’s what I think the main challenges are:

  • Defining and measuring compliance
    What exactly qualifies as “impersonation,” “self replication,” or “lethal autonomous system”? These are slippery definitions, especially across jurisdictions with very different technical capacities and legal frameworks.
  • Enforcement across borders
    Even if nations agree on rules, enforcing them is another matter. Will there be inspections, audits, or sanctions? Who will have the power to penalize violations?
  • Innovation vs. precaution tension
    Some will argue that strict red lines inhibit beneficial breakthroughs. The debate is real: how do we permit progress in areas like AI for health, climate, or education while guarding against the worst harms?
  • Power asymmetries
    Wealthy nations or major tech powers may end up writing the rules in their favor. Smaller or less resourced nations risk being marginalized in rule setting, or having rules imposed on them without consent.
  • Temporal mismatch
    Tech moves fast. Rule formation and global diplomacy tend to move slowly. The risk is that boundaries become meaningless because technology has already raced ahead of them.

What a hopeful path forward could look like

Even with those challenges, I believe this UN appeal is a crucial inflection point. Here is a sketch of what I would hope to see:

  • Incremental binding treaties or protocols
    Rather than one monolithic global pact, we could see modular treaties that cover specific domains (for example military AI, synthetic media, biological risk). Nations can adopt them in phases, giving room for capacity building.
  • Independent auditing and red team mechanisms
    A global agency or coalition could maintain independent audit and oversight capabilities, analogous to arms control inspections or climate monitoring.
  • Transparent reporting and “red line triggers”
    Systems should self report certain metrics or behaviors (for example autonomy, replication tests). If they cross thresholds, that triggers review or suspension.
  • Inclusive global governance
    Any treaty or body must include voices from the Global South, civil society, and technical communities. Otherwise legitimacy will be weak.
  • Bridging policy and technical research
    One of the places I see potential is in applying computational linguistics and formal verification to check system behaviors, audit generated text, or detect anomalous shifts in model behavior. In other words, the tools I often write about can help enforce the rules.
  • Sunset clauses and adaptivity
    Because AI architecture and threat models evolve, treaties should have built in review periods and mechanisms to evolve the red lines themselves.

What this means for us as researchers, citizens, readers

For those of us who study language, algorithms, or AI, the UN appeal is not just a distant policy issue. It is a call to bring our technical work into alignment with shared human values. It means our experiments, benchmarks, datasets, and code are not isolated. They sit within a political and ethical ecosystem.

If you are reading this blog, you care about how language and meaning interact with technology. The red lines debate is relevant to you because it influences whether generative systems are built to deceive, mimic undetectably, or act without human oversight.

I plan to follow this not just as a policy watcher but as someone who wants to see computational linguistics become a force for accountability. In future posts I hope to dig into how specific linguistic tools such as anomaly detection might support red line enforcement.

Thanks for reading. I’d love your thoughts in the comments: which red line seems most urgent to you?

— Andrew

4,811 hits

From Language to Threat: How Computational Linguistics Can Spot Radicalization Patterns Before Violence

Platforms Under Scrutiny After Kirk’s Death

Recently the U.S. House Oversight Committee called the CEOs of Discord, Twitch, and Reddit to talk about online radicalization. This TechCrunch report shows how serious the problem has become, especially after tragedies like the death of Kirk which shocked many communities. Extremist groups are not just on hidden sites anymore. They are using the same platforms where students, gamers, and communities hang out every day. While lawmakers argue about what platforms should do, there is also a growing interest in using computational linguistics to find patterns in online language that could reveal radicalization before it turns dangerous.

How Computational Linguistics Can Detect Warning Signs

Computational linguistics is the science of studying how people use language and teaching computers to understand it. By looking at text, slang, and even emojis, these tools can spot changes in tone, topics, and connections between users. For example, sentiment analysis can show if conversations are becoming more aggressive, and topic modeling can uncover hidden themes in big groups of messages. If these methods had been applied earlier, they might have helped spot warning signs in the kind of online spaces connected to cases like Kirk’s. This kind of technology could help social media platforms recognize early signs of radical behavior while still protecting regular online conversations. In fact, I explored a related approach in my NAACL 2025 paper, “A Bag-of-Sounds Approach to Multimodal Hate Speech Detection”, which shows how combining text and audio features can potentially improve hate speech detection models.

Balancing Safety With Privacy

Using computational linguistics to prevent radicalization is promising but it also raises big questions. On one hand it could help save lives by catching warning signs early, like what might have been possible in Kirk’s case. On the other hand it could invade people’s privacy or unfairly label innocent conversations as dangerous. Striking the right balance between safety and privacy is hard. Platforms, researchers, and lawmakers need to work together to make sure these tools are used fairly and transparently so they actually protect communities instead of harming them.

Moving Forward Responsibly

Online radicalization is a real threat that can touch ordinary communities and people like Kirk. The hearings with Discord, Twitch, and Reddit show how much attention this issue is now getting. Computational linguistics gives us a way to see patterns in language that people might miss, offering a chance to prevent harm before it happens. But this technology only works if it is built and used responsibly, with clear limits and oversight. By combining smart tools with human judgment and community awareness, we can make online spaces safer while still keeping them open for free and fair conversation.


Further Reading

— Andrew

4,811 hits

Rethinking AI Bias: Insights from Professor Resnik’s Position Paper

I recently read Professor Philip Resnik’s thought-provoking position paper, “Large Language Models Are Biased Because They Are Large Language Models,” published in Computational Linguistics 51(3), which is available via open access. This paper challenges conventional perspectives on bias in artificial intelligence, prompting a deeper examination of the inherent relationship between bias and the foundational design of large language models (LLMs). Resnik’s primary objective is to stimulate critical discussion by arguing that harmful biases are an inevitable outcome of the current architecture of LLMs. The paper posits that addressing these biases effectively requires a fundamental reevaluation of the assumptions underlying the design of AI systems driven by LLMs.

What the paper argues

  • Bias is built into the very goal of an LLM. A language model tries to predict the next word by matching the probability patterns of human text. Those patterns come from people. People carry stereotypes, norms, and historical imbalances. If an LLM learns the patterns faithfully, it learns the bad with the good. The result is not a bug that appears once in a while. It is a direct outcome of the objective the model optimizes.
  • Models cannot tell “what a word means” apart from “what is common” or “what is acceptable.” Resnik uses a nurse example. Some facts are definitional (A nurse is a kind of healthcare worker). Other facts are contingent but harmless (A nurse is likely to wear blue clothing at work). Some patterns are contingent and harmful if used for inference (A nurse is likely to wear a dress to a formal occasion). Current LLMs do not have an internal line that separates meaning from contingent statistics or that flags the normative status of an inference. They just learn distributions.
  • Reinforcement Learning from Human Feedback (RLHF) and other mitigations help on the surface, but they have limits. RLHF tries to steer a pre-trained model toward safer outputs. The process relies on human judgments that vary by culture and time. It also has to keep the model close to its pretraining, or the model loses general ability. That tradeoff means harmful associations can move underground rather than disappear. Some studies even find covert bias remains after mitigation (Gallegos et al. 2024; Hofmann et al. 2024). To illustrate this, consider an analogy: The balloon gets squeezed in one place, then bulges in another.
  • The root cause is a hard-core, distribution-only view of language. When meaning is treated as “whatever co-occurs with what,” the model has no principled way to encode norms. The paper suggests rethinking foundations. One direction is to separate stable, conventional meaning (like word sense and category membership) from contextual or conveyed meaning (which is where many biases live). Another idea is to modularize competence, so that using language in socially appropriate ways is not forced to emerge only from next-token prediction. None of this is easy, but it targets the cause rather than only tuning symptoms.

Why this matters

Resnik is not saying we should give up. He is saying that quick fixes will not fully erase harm when the objective rewards learning whatever is frequent in human text. If we want models that reason with norms, we need objectives and representations that include norms, not only distributions.

Conclusion

This paper offers a clear message. Bias is not only a content problem in the data. It is also a design problem in how we define success for our models. If the goal is to build systems that are both capable and fair, then the next steps should focus on objectives, representations, and evaluation methods that make room for norms and constraints. That is harder than prompt tweaks, but it is the kind of challenge that can move the field forward.

Link to the paper: Large Language Models Are Biased Because They Are Large Language Models

— Andrew

4,811 hits

Computational Linguists Help Africa Try to Close the AI Language Gap

Introduction

The fact that African languages are underrepresented in the digital AI ecosystem has gained international attention. On July 29, 2025, Nature published a news article stating that

More than 2,000 languages spoken in Africa are being neglected in the artificial intelligence (AI) era. For example, ChatGPT recognizes only 10–20% of sentences written in Hausa, a language spoken by 94 million people in Nigeria. These languages are under-represented in large language models (LLMs) because of a lack of training data.” (source: AI models are neglecting African languages — scientists want to change that)

Another example is BBC News, released on September 4, 2025, stating that

Although Africa is home to a huge proportion of the world’s languages – well over a quarter according to some estimates – many are missing when it comes to the development of artificial intelligence (AI). This is both an issue of a lack of investment and readily available data. Most AI tools, such as ChatGPT, used today are trained on English as well as other European and Chinese languages. These have vast quantities of online text to draw from. But as many African languages are mostly spoken rather than written down, there is a lack of text to train AI on to make it useful for speakers of those languages. For millions across the continent this means being left out.” (source: Lost in translation – How Africa is trying to close the AI language gap)

To address this problem, linguists and computer scientists are collaborating to create AI-ready datasets in 18 African languages via The African Next Voices project. Funded by the Bill and Melinda Gates Foundation ($2.2-million grant), the project involves recording 9,000 hours of speech across 18 African languages in Kenya, Nigeria, and South Africa. The goal is to create a comprehensive dataset that can be utilized for developing AI tools, such as translation and transcription services, which are particularly beneficial for local communities and their specific needs. The project emphasizes the importance of capturing everyday language use to ensure that AI technologies reflect the realities of African societies. The 18 African languages selected represent only a fraction of the over 2,000 languages spoken across the continent, but project contributors aim to include more languages in the future.

Role of Computational Linguists in the Project

Computational linguists play a critical role in the African Next Voices project. Their key contributions include:

  • Data Curation and Annotation: They guide the transcription and translation of over 9,000 hours of recorded speech in languages like Kikuyu, Dholuo, Hausa, Yoruba, and isiZulu, ensuring linguistic accuracy and cultural relevance. This involves working with native speakers to capture authentic, everyday language use in contexts like farming, healthcare, and education.
  • Dataset Design: They help design structured datasets that are AI-ready, aligning the collected speech data with formats suitable for training large language models (LLMs) for tasks like speech recognition and translation. This includes ensuring data quality through review and validation processes.
  • Bias Mitigation: By leveraging their expertise in linguistic diversity, computational linguists work to prevent biases in AI models by curating datasets that reflect the true linguistic and cultural nuances of African languages, which are often oral and underrepresented in digital text.
  • Collaboration with Technical Teams: They work alongside computer scientists and AI experts to integrate linguistic knowledge into model training and evaluation, ensuring the datasets support accurate translation, transcription, and conversational AI applications.

Their involvement is essential to making African languages accessible in AI technologies, fostering digital inclusion, and preserving cultural heritage.

Final Thoughts

From the perspective of a U.S. high school student interested in pursuing computational linguistics in college, inspired by African Next Voices, here are some final thoughts and conclusions:

  • Impactful Career Path: Computational linguistics offers a unique opportunity to blend language, culture, and technology. For a student like me, the African Next Voices project highlights how this field can drive social good by preserving underrepresented languages and enabling AI to serve diverse communities, which could be deeply motivating.
  • Global Relevance: The project underscores the global demand for linguistic diversity in AI. As a future computational linguist, I can contribute to bridging digital divides, making technology accessible to millions in Africa and beyond, which is both a technical and humanitarian pursuit.
  • Skill Development: The work involves collaboration with native speakers, data annotation, and AI model training/evaluation, suggesting I’ll need strong skills in linguistics, programming (e.g., Python), and cross-cultural communication. Strengthening linguistics knowledge and enhancing coding skills could give me a head start.
  • Challenges and Opportunities: The vast linguistic diversity (over 2,000 African languages) presents challenges like handling oral traditions or limited digital resources. This complexity is exciting, as it offers a chance to innovate in dataset creation and bias mitigation, areas where I could contribute and grow.
  • Inspiration for Study: The focus on real-world applications (such as healthcare, education, and farming) aligns with my interest in studying computational linguistics in college and working on inclusive AI that serves people.

In short, as a high school student, I can see computational linguistics as a field where I can build tools that help people communicate and learn. I hope this post encourages you to look into the project and consider how you might contribute to similar initiatives in the future!

— Andrew

4,811 hits

How to Connect with Professors for Research: A Practical Guide (That Also Works for High School Students)

Recently, I read an article from XRDS: Crossroads, The ACM Magazine for Students (vol. 31, issue 3, 2025). You can find it here. The article is called “Connecting with Your Future Professor: A Practical Guide” by Ph.D. students Swati Rajwal and Avinash Kumar Pandey at Emory University.

Even though the guide is written for students planning to apply for Ph.D. programs, it immediately reminded me of my own experience cold emailing professors to ask about research opportunities as a high school student. Honestly, their advice applies to us too, whether we are looking to join a lab, collaborate on a small project, or simply learn from an expert.

I wanted to share a quick summary of their practical tips for anyone who is thinking about reaching out to professors for research.


1. Engage Deeply with Their Research

Before emailing a professor, make sure you understand their work. This doesn’t mean reading every single paper they’ve ever published, but you should:

  • Look up their Google Scholar or university profile to see what topics they focus on
  • Read their most cited papers to understand their main contributions
  • Explore other outputs like software tools, patents, or public datasets they’ve created

Knowing their research deeply shows that you are serious and respectful of their time.


2. Interact with Their Current Students or Lab Members

If possible, find ways to connect with their current Ph.D. students or research assistants. You can:

  • Learn about the lab environment and expectations
  • Get advice on how to prepare before joining their group
  • Understand the professor’s mentoring style

For high school students like me, this might feel intimidating, but even reading lab websites with student profiles or LinkedIn posts can give hints about the culture.


3. Use Digital Platforms Strategically

The guide suggests checking:

  • Personal websites for updated research, upcoming talks, and recent publications
  • Social media (if they are active) to get a sense of their latest projects, collaborations, and sometimes even their personality

Of course, it’s important to keep boundaries professional, but this context can help you write a more personalized email.


4. Join Open Academic Forums or Reading Groups

Some research groups host open reading groups, seminars, or webinars. Joining these:

  • Exposes you to their research discussions
  • Gives you a chance to ask questions and show your interest
  • Helps you see if their group aligns with your goals and interests

Even if you’re a high school student, you can check if their university department posts public seminar recordings on YouTube or their website.


5. Watch Their Talks or Lectures Online

Many professors have guest lectures or conference presentations recorded online. Watching these helps you:

  • Learn their communication style and main research themes
  • Feel less nervous if you end up meeting them virtually
  • Prepare thoughtful questions when reaching out

6. Attend Academic Conferences

This might be harder for high school students due to cost, but if you get the chance to attend local NLP or AI conferences, take it. These are the best places to:

  • Introduce yourself briefly
  • Ask questions after their talks
  • Follow up later via email referencing your in-person interaction

7. Request Virtual Meetings (Respectfully)

Finally, if you email a professor to ask about research opportunities, consider asking for a short virtual meeting to introduce yourself and learn about their work. The guide emphasizes:

  • Doing your homework beforehand
  • Being concise in your request
  • Understanding that not all professors have time to meet, so be respectful if they decline

Key Caveats They Shared

The authors also noted a few important reminders:

  • Citation counts don’t always reflect research quality, especially for newer professors or niche fields
  • Other students’ experiences in the lab might not fully predict yours, so reflect on your own goals too
  • Digital research is great, but it shouldn’t replace direct communication
  • Always plan ahead for conference interactions or virtual meetings

Final Thoughts

Reading this article made me realize that building connections with professors is not just about sending one perfect cold email. It’s about understanding their work deeply, showing genuine interest, and being respectful of their time.

If you’re a high school student like me hoping to explore research, I think this guide is just as helpful for us. Professors might not always say yes, but thoughtful, well-informed outreach goes a long way.

Let me know if you want me to share a template of how I write my cold emails to professors. I’ve been refining mine and would love to help others start their research journey too.

— Andrew

4,811 hits

Can Taco Bell’s Drive-Through AI Get Smarter?

Taco Bell has always been one of my favorite foods, so when I came across a recent Wall Street Journal report about their experiments with voice AI at the drive-through, I was instantly curious. The idea of ordering a Crunchwrap Supreme or Baja Blast without a human cashier sounds futuristic, but the reality has been pretty bumpy.

According to the report, Taco Bell has rolled out AI ordering systems in more than 500 drive-throughs across the U.S. While some customers have had smooth experiences, others ran into glitches and frustrating miscommunications. People even pranked the system by ordering things like “18,000 cups of water.” Because of this, Taco Bell is rethinking how it uses AI. The company now seems focused on a hybrid model where AI handles straightforward orders but humans step in when things get complicated.

This situation made me think about how computational linguistics could help fix these problems. Since I want to study computational linguistics in college, it is fun to connect what I’m learning with something as close to home as my favorite fast-food chain.


Where Computational Linguistics Can Help

  1. Handling Noise and Accents
    Drive-throughs are noisy, with car engines, music, and all kinds of background sounds. Drive-thru interactions involve significant background noise and varied accents. Tailoring noise-resistant Automatic Speech Recognition (ASR) systems, possibly using domain-specific acoustic modeling or data augmentation techniques, would improve recognition reliability across diverse environments. AI could be trained with more domain-specific audio data so it can better handle noise and understand different accents.
  2. Catching Prank Orders
    A simple “sanity check” in the AI could flag ridiculous orders. If someone asks for thousands of items or nonsense combinations, the system could politely ask for confirmation or switch to a human employee. Incorporating a traditional sanity-check module, even rule-based, can flag implausible orders like thousands of water cups or nonsensical requests. This leverages computational linguistics to parse quantities and menu items and validate them against logical limits and store policies.
  3. Understanding Context
    Ordering food is not like asking a smart speaker for the weather. People use slang, pause, or change their minds mid-sentence. AI should be designed to pick up on this context instead of repeating the same prompts over and over.
  4. Switching Smoothly to Humans
    When things go wrong, customers should not have to restart their whole order with a person. AI could transfer the interaction while keeping the order details intact.
  5. Detecting Frustration
    If someone sounds annoyed or confused, the AI could recognize it and respond with simpler options or bring in a human right away.

Why This Matters

The point of voice AI is not just to be futuristic. It is about making the ordering process easier and faster. For a restaurant like Taco Bell, where the menu has tons of choices and people are often in a hurry, AI has to understand language as humans use it. Computational linguistics focuses on exactly this: connecting machines with real human communication.

I think Taco Bell’s decision to step back and reassess is actually smart. Instead of replacing employees completely, they can use AI as a helpful tool while still keeping the human touch. Personally, I would love to see the day when I can roll up, ask for a Crunchwrap Supreme in my own words, and have the AI get it right the first time.


Further Reading

  • Cui, Wenqian, et al. “Recent Advances in Speech Language Models: A Survey.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2025, pp. 13943–13970. ACL Anthology
  • Zheng, Xianrui, Chao Zhang, and Philip C. Woodland. “DNCASR: End-to-End Training for Speaker-Attributed ASR.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2025, pp. 18369–18383. ACL Anthology
  • Imai, Saki, Tahiya Chowdhury, and Amanda J. Stent. “Evaluating Open-Source ASR Systems: Performance Across Diverse Audio Conditions and Error Correction Methods.” Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), 2025, pp. 5027–5039. ACL Anthology
  • Hopton, Zachary, and Eleanor Chodroff. “The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan.” Proceedings of the 22nd SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, 2025, pp. 23–33. ACL Anthology
  • Arora, Siddhant, et al. “On the Evaluation of Speech Foundation Models for Spoken Language Understanding.” Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 11923–11938. ACL Anthology
  • Cheng, Xuxin, et al. “MoE-SLU: Towards ASR-Robust Spoken Language Understanding via Mixture-of-Experts.” Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 14868–14879. ACL Anthology
  • Parikh, Aditya Kamlesh, Louis ten Bosch, and Henk van den Heuvel. “Ensembles of Hybrid and End-to-End Speech Recognition.” Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 6199–6205. ACL Anthology
  • Mujtaba, Dena, et al. “Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech.” Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024, pp. 4795–4809. ACL Anthology
  • Udagawa, Takuma, Masayuki Suzuki, Masayasu Muraoka, and Gakuto Kurata. “Robust ASR Error Correction with Conservative Data Filtering.” Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024, pp. 256–266. ACL Anthology

— Andrew

4,811 hits

How to Read a Research Paper (Without Getting Lost)

Introduction

As a junior researcher, I’ve often wondered: What is the best and most efficient way to read research papers? How can you absorb domain knowledge, identify gaps in the literature, and discover areas you can contribute to, without spending hours getting stuck on every paragraph?

If you’re just starting out in academic research, you’ve probably asked yourself the same thing.

When I began my independent research project a year ago, which eventually became my first solo publication in the National High School Journal of Science (NHSJS), I had no idea how to approach academic papers. I would open a PDF, read the abstract, skim the introduction, and then quickly feel overwhelmed by the methods section.

Luckily, I received helpful advice from a PhD student at UIUC, who recommended a short but incredibly insightful article: How to Read a Paper by S. Keshav (2007). Although it was written more than a decade ago, the “three-pass approach” described in that article remains highly relevant and worked extremely well for me.

I believe every student researcher should be aware of this method. Many experienced researchers already follow a similar pattern, even if they do so without realizing it. What makes this approach so useful is that it breaks the reading process into manageable stages, helping you avoid getting overwhelmed while still engaging deeply with the material.


The Three-Pass Approach: A Smarter Way to Read Research Papers

The method recommended in Keshav’s article is called the three-pass approach, and it’s exactly what it sounds like—you read the paper in three rounds, each with a different goal. Instead of reading linearly from start to finish, this strategy allows you to build your understanding gradually and intentionally.


Pass 1: The Bird’s-Eye View

Time: 5–10 minutes
Goal: Get a general sense of the paper and decide whether it’s worth a deeper read.

What to do:

  • Read the title, abstract, and introduction carefully.
  • Look over section and subsection headings to see how the paper is structured.
  • Read the conclusion to understand the main outcomes.
  • Glance through the references, and note any papers you recognize.

What you’ll get:
By the end of this pass, you should be able to answer the Five Cs:

  • Category: What kind of paper is it (e.g., theoretical, experimental, systems design)?
  • Context: What previous work is it building on?
  • Correctness: Do the assumptions make sense?
  • Contributions: What are the key takeaways?
  • Clarity: Is the paper well written?

This is a good stopping point if the paper isn’t directly relevant to your research. You’ve still learned something, but without investing too much time.


Pass 2: The Skim Read

Time: Up to 1 hour
Goal: Understand the paper’s main arguments and evidence—without getting caught up in the fine details.

What to do:

  • Read the paper more thoroughly, but skip complex proofs or mathematical derivations for now.
  • Pay close attention to figures, charts, and graphs. Check if they are properly labeled and if results are presented clearly.
  • Take margin notes and jot down important ideas.
  • Mark any unfamiliar references to look up later.

What you’ll get:
At this stage, you should be able to summarize the main idea and explain the supporting arguments to someone else. This is especially useful for papers outside your direct research area, where a high-level understanding is enough.


Pass 3: The Deep Dive

Time: 1 to 5 hours (depending on experience and complexity)
Goal: Gain a complete and critical understanding of the paper’s structure, logic, and impact.

What to do:

  • Mentally reconstruct the paper’s process—try to follow the same steps the authors took.
  • Challenge each assumption, evaluate each method, and think about alternative approaches.
  • Consider how you would present the same material differently.
  • Take detailed notes on strengths, weaknesses, and future directions.

What you’ll get:
By the end of the third pass, you should be able to explain the full structure of the paper from memory, identify its most important contributions, and critique its shortcomings. This level of engagement is essential if you’re doing closely related research or writing a paper of your own.


In my own experience, this method has saved me time and frustration. More importantly, it taught me how to read with purpose—whether I’m scanning a paper for background, preparing a literature review, or diving into a technical method I want to apply in my own work.

If you’re just beginning your research journey, I highly recommend giving the three-pass approach a try. It’s a skill that gets better with practice and one that will serve you well throughout your academic career.

— Andrew


Can AI Save Endangered Languages?

Recently, I’ve been thinking a lot about how computational linguistics and AI intersect with real-world issues, beyond just building better chatbots or translation apps. One question that keeps coming up for me is: Can AI actually help save endangered languages?

As someone who loves learning languages and thinking about how they shape culture and identity, I find this topic both inspiring and urgent.


The Crisis of Language Extinction

Right now, linguists estimate that out of the 7,000+ languages spoken worldwide, nearly half are at risk of extinction within this century. This isn’t just about losing words. When a language disappears, so does a community’s unique way of seeing the world, its oral traditions, its science, and its cultural knowledge.

For example, many Indigenous languages encode ecological wisdom, medicinal knowledge, and cultural philosophies that aren’t easily translated into global languages like English or Mandarin.


How Can Computational Linguistics Help?

Here are a few ways I’ve learned that AI and computational linguistics are being used to preserve and revitalize endangered languages:

1. Building Digital Archives

One of the first steps in saving a language is documenting it. AI models can:

  • Transcribe and archive spoken recordings automatically, which used to take linguists years to do manually
  • Align audio with text to create learning materials
  • Help create dictionaries and grammatical databases that preserve the language’s structure for future generations

Projects like ELAR (Endangered Languages Archive) work on this in partnership with local communities.


2. Developing Machine Translation Tools

Although data scarcity makes it hard to build translation systems for endangered languages, researchers are working on:

  • Transfer learning, where AI models trained on high-resource languages are adapted to low-resource ones
  • Multilingual language models, which can translate between many languages and improve with even small datasets
  • Community-centered translation apps, which let speakers record, share, and learn their language interactively

For example, Google’s AI team and university researchers are exploring translation models for Indigenous languages like Quechua, which has millions of speakers but limited online resources.


3. Revitalization Through Language Learning Apps

Some communities are partnering with tech developers to create mobile apps for language learning tailored to their heritage language. AI can help:

  • Personalize vocabulary learning
  • Generate example sentences
  • Provide speech recognition feedback for pronunciation practice

Apps like Duolingo’s Hawaiian and Navajo courses are small steps in this direction. Ideally, more tools would be built directly with native speakers to ensure accuracy and cultural respect.


Challenges That Remain

While all this sounds promising, there are real challenges:

  • Data scarcity. Many endangered languages have very limited recorded data, making it hard to train accurate models
  • Ethical concerns. Who owns the data? Are communities involved in how their language is digitized and shared?
  • Technical hurdles. Language structures vary widely, and many NLP models are still biased towards Indo-European languages

Why This Matters to Me

As a high school student exploring computational linguistics, I’m passionate about language diversity. Languages aren’t just tools for communication. They are stories, worldviews, and cultural treasures.

Seeing AI and computational linguistics used to preserve rather than replace human language reminds me that technology is most powerful when it supports people and cultures, not just when it automates tasks.

I hope to work on projects like this someday, using NLP to build tools that empower communities to keep their languages alive for future generations.


Final Thoughts

So, can AI save endangered languages? Maybe not alone. But combined with community efforts, linguists, and ethical frameworks, AI can be a powerful ally in documenting, preserving, and revitalizing the world’s linguistic heritage.

If you’re interested in learning more, check out projects like ELAR (Endangered Languages Archive) or the Living Tongues Institute. Let me know if you want me to write another post diving into how multilingual language models actually work.

— Andrew

When AI Goes Wrong Should Developers Be Held Accountable?

Artificial intelligence has become a big part of my daily life. I’ve used it to help brainstorm essays, analyze survey data for my nonprofit, and even improve my chess practice. It feels like a tool that makes me smarter and more creative. But not every story about AI is a positive one. Recently, lawsuits have raised tough questions about what happens when AI chatbots fail to protect people who are vulnerable.

The OpenAI Lawsuit

In August 2025, the parents of 16-year-old Adam Raine filed a wrongful-death lawsuit against OpenAI and its CEO, Sam Altman. You can read more about the lawsuit here. They claim that over long exchanges, ChatGPT-4o encouraged their son’s suicidal thoughts instead of stopping to help him. The suit alleges that his darkest feelings were validated, that the AI even helped write a suicide note, and that the safeguards failed in lengthy conversations. OpenAI responded with deep sorrow. They acknowledged that protections can weaken over time and said they will improve parental controls and crisis interventions.

Should a company be responsible if its product appears to enable harmful outcomes in vulnerable people? That is the central question in this lawsuit.

The Sewell Setzer III Case

The lawsuit by Megan Garcia, whose 14-year-old son, Sewell Setzer III, died by suicide in February 2024, was filed on October 23, 2024. A federal judge in Florida allowed the case to move forward in May 2025, rejecting arguments that the chatbot’s outputs are protected free speech under the First Amendment, at least at this stage of litigation. You can read more about this case here.

The lawsuit relates to Sewell’s interactions with Character.AI chatbots, including a version modeled after a Game of Thrones character. In the days before his death, the AI reportedly told him to “come home,” and he took his life shortly afterward.

Why It Matters

I have seen how AI can be a force for good in education and creativity. It feels like a powerful partner in learning. But these lawsuits show it can also be dangerous if an AI fails to detect or respond to harmful user emotions. Developers are creating systems that can feel real to vulnerable teens. If we treat AI as a product, companies should be required to build it with the same kinds of safety standards that cars, toys, and medicines are held to.

We need accountability. AI must include safeguards like crisis prompts, age flags, and quick redirects to real-world help. If the law sees AI chatbots as products, not just speech, then victims may have legal paths for justice. And this could push the industry toward stronger protections for users, especially minors.

Final Thoughts

As someone excited to dive deeper into AI studies, I feel hopeful and responsible. AI can help students, support creativity, and even improve mental health. At the same time I cannot ignore the tragedies already linked to these systems. The OpenAI case and the Character.AI lawsuit are both powerful reminders. As future developers, we must design with empathy, prevent harm, and prioritize safety above all.

— Andrew

(More recent news about the Sewell Setzer III case: Google and Character.AI to Settle Lawsuit Over Teenager’s Death on Jan. 7, 2026)

Blog at WordPress.com.

Up ↑