Can AI Save Endangered Languages?

Recently, I’ve been thinking a lot about how computational linguistics and AI intersect with real-world issues, beyond just building better chatbots or translation apps. One question that keeps coming up for me is: Can AI actually help save endangered languages?

As someone who loves learning languages and thinking about how they shape culture and identity, I find this topic both inspiring and urgent.


The Crisis of Language Extinction

Right now, linguists estimate that out of the 7,000+ languages spoken worldwide, nearly half are at risk of extinction within this century. This isn’t just about losing words. When a language disappears, so does a community’s unique way of seeing the world, its oral traditions, its science, and its cultural knowledge.

For example, many Indigenous languages encode ecological wisdom, medicinal knowledge, and cultural philosophies that aren’t easily translated into global languages like English or Mandarin.


How Can Computational Linguistics Help?

Here are a few ways I’ve learned that AI and computational linguistics are being used to preserve and revitalize endangered languages:

1. Building Digital Archives

One of the first steps in saving a language is documenting it. AI models can:

  • Transcribe and archive spoken recordings automatically, which used to take linguists years to do manually
  • Align audio with text to create learning materials
  • Help create dictionaries and grammatical databases that preserve the language’s structure for future generations

Projects like ELAR (Endangered Languages Archive) work on this in partnership with local communities.


2. Developing Machine Translation Tools

Although data scarcity makes it hard to build translation systems for endangered languages, researchers are working on:

  • Transfer learning, where AI models trained on high-resource languages are adapted to low-resource ones
  • Multilingual language models, which can translate between many languages and improve with even small datasets
  • Community-centered translation apps, which let speakers record, share, and learn their language interactively

For example, Google’s AI team and university researchers are exploring translation models for Indigenous languages like Quechua, which has millions of speakers but limited online resources.


3. Revitalization Through Language Learning Apps

Some communities are partnering with tech developers to create mobile apps for language learning tailored to their heritage language. AI can help:

  • Personalize vocabulary learning
  • Generate example sentences
  • Provide speech recognition feedback for pronunciation practice

Courses like Duolingo’s Hawaiian and Navajo offerings are small steps in this direction. Ideally, more tools would be built directly with native speakers to ensure accuracy and cultural respect.


Challenges That Remain

While all this sounds promising, there are real challenges:

  • Data scarcity. Many endangered languages have very limited recorded data, making it hard to train accurate models
  • Ethical concerns. Who owns the data? Are communities involved in how their language is digitized and shared?
  • Technical hurdles. Language structures vary widely, and many NLP models are still biased towards Indo-European languages

Why This Matters to Me

As a high school student exploring computational linguistics, I’m passionate about language diversity. Languages aren’t just tools for communication. They are stories, worldviews, and cultural treasures.

Seeing AI and computational linguistics used to preserve rather than replace human language reminds me that technology is most powerful when it supports people and cultures, not just when it automates tasks.

I hope to work on projects like this someday, using NLP to build tools that empower communities to keep their languages alive for future generations.


Final Thoughts

So, can AI save endangered languages? Maybe not alone. But combined with community efforts, linguists, and ethical frameworks, AI can be a powerful ally in documenting, preserving, and revitalizing the world’s linguistic heritage.

If you’re interested in learning more, check out projects like ELAR (Endangered Languages Archive) or the Living Tongues Institute. Let me know if you want me to write another post diving into how multilingual language models actually work.

— Andrew

My Thoughts on “The Path to Medical Superintelligence”

Recently, I read an article published on Microsoft AI’s blog titled “The Path to Medical Superintelligence”. As a high school student interested in AI, computational linguistics, and the broader impacts of technology, I found this piece both exciting and a little overwhelming.


What Is Medical Superintelligence?

The blog talks about how Microsoft AI is working to build models with superhuman medical reasoning abilities. In simple terms, the idea is to create an AI that doesn’t just memorize medical facts but can analyze, reason, and make decisions at a level that matches or even surpasses expert doctors.

One detail that really stood out to me was how their new AI models also consider the cost of healthcare decisions. The article explained that while health costs vary widely depending on country and system, their team developed a method to consistently measure trade-offs between diagnostic accuracy and resource use. In other words, the AI doesn’t just focus on getting the diagnosis right, but also weighs how expensive or resource-heavy its suggested tests and treatments would be.

They explained that their current models already show impressive performance on medical benchmarks, such as USMLE-style medical exams, and that future models could go beyond question answering to support real clinical decision-making in a way that is both effective and efficient.


What Excites Me About This?

One thing that stood out to me was the potential impact on global health equity. The article mentioned that billions of people lack reliable access to doctors or medical specialists. AI models with advanced medical reasoning could help provide high-quality medical advice anywhere, bridging the gap for underserved communities.

It’s also amazing to think about how AI could support doctors by:

  • Reducing their cognitive load
  • Cross-referencing massive amounts of research
  • Helping with diagnosis and treatment planning

For someone like me who is fascinated by AI’s applications in society, this feels like a real-world example of AI doing good.


What Concerns Me?

At the same time, the blog post emphasized that AI is meant to complement doctors and health professionals, not replace them. I completely agree with this perspective. Medical decisions aren’t just about making the correct diagnosis. Doctors also need to navigate ambiguity, understand patient emotions and values, and build trust with patients and their families in ways AI isn’t designed to do.

Still, even if AI is only used as a tool to support clinicians, there are important concerns:

  • AI could give wrong or biased recommendations if the training data is flawed
  • It might suggest treatments without understanding a patient’s personal situation or cultural background
  • There is a risk of creating new inequalities if only wealthier hospitals or countries can afford the best AI models

Another thought I had was about how roles will evolve. The article mentioned that AI could help doctors automate routine tasks, identify diseases earlier, personalize treatment plans, and even help prevent diseases altogether. This sounds amazing, but it also means future doctors will need to learn how to work with AI systems effectively, interpret their recommendations, and still make the final decisions with empathy and ethical reasoning.


Connections to My Current Interests

While this blog post was about medical AI, it reminded me of my own interests in computational linguistics and language models. Underneath these medical models are the same AI principles I study:

  • Training on large datasets
  • Fine-tuning models for specific tasks
  • Evaluating performance carefully and ethically

It also shows how domain-specific knowledge (like medicine) combined with AI skills can create powerful tools that can literally save lives. That motivates me to keep building my foundation in both language technologies and other fields, so I can be part of these interdisciplinary innovations in the future.


Final Thoughts

Overall, reading this blog post made me feel hopeful about the potential of AI in medicine, but also reminded me of the responsibility AI developers carry. Creating a medical superintelligence isn’t just about reaching a technological milestone. It’s about improving people’s lives safely, ethically, and equitably.

If you’re interested in AI for social good, I highly recommend reading the full article here. Let me know if you want me to write a future post about other applications of AI that I’ve been exploring this summer.

— Andrew

Back from Hibernation — A Paper, a Robot, and a Lot of Tests

It’s been a while—almost three months since my last post. Definitely not my usual pace. I wanted to check in and share why the blog has been a bit quiet recently—and more importantly, what I’ve been working on behind the scenes.

First, April and May were a whirlwind: I had seven AP exams, school finals, and was deep in preparation for the VEX Robotics World Championship. Balancing school with intense robotics scrimmages and code debugging meant there were a lot of late nights and early mornings—and not much time to write.

But the biggest reason for the radio silence? I’ve been working on a research paper that got accepted to NAACL 2025.

Our NAACL 2025 Paper: “A Bag-of-Sounds Approach to Multimodal Hate Speech Detection”

Over the past few months, I’ve had the opportunity to co-author a paper with Dr. Sidney Wong, focusing on multimodal hate speech detection using audio data. The paper was accepted to the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages at NAACL 2025.

You can read the full paper here:
👉 A Bag-of-Sounds Approach to Multimodal Hate Speech Detection

What we did:
We explored a “bag-of-sounds” method, training our model on Mel spectrogram features extracted from spoken social media content in Dravidian languages—specifically Malayalam and Tamil. Unlike most hate speech systems that rely solely on text, we wanted to see how well speech-based signals alone could perform.

How it went:
The results were mixed. Our system didn’t perform well on the final test set—but on the training and dev sets, we saw promise. The takeaway? With enough balanced and labeled audio data, speech can absolutely play a role in multimodal hate speech detection systems. It’s a step toward understanding language in more realistic, cross-modal contexts.

More importantly, this project helped me dive into the intersection of language, sound, and AI—and reminded me just how much we still have to learn when it comes to processing speech from low-resource languages.


Thanks for sticking around even when the blog went quiet. I’ll be back soon with a post about my experience at the VEX Robotics World Championship—stay tuned!

— Andrew

My First Solo Publication: A Case Study on Sentiment Analysis in Survey Data

I’m excited to share that my first solo-authored research paper has just been published in the National High School Journal of Science! 🎉

The paper is titled “A Case Study of Sentiment Analysis on Survey Data Using LLMs versus Dedicated Neural Networks”, and it explores a question I’ve been curious about for a while: how do large language models (like GPT-4o or LLaMA-3) compare to task-specific neural networks when it comes to analyzing open-ended survey responses?

If you’ve read some of my earlier posts—like my reflection on the DravidianLangTech shared task or my thoughts on Jonathan Dunn’s NLP book—you’ll know that sentiment analysis has become a recurring theme in my work. From experimenting with XLM-RoBERTa on Tamil and Tulu to digging into how NLP can support corpus linguistics, this paper feels like the natural next step in that exploration.

Why This Matters to Me

Survey responses are messy. They’re full of nuance, ambiguity, and context—and yet they’re also where we hear people’s honest voices. I’ve always thought it would be powerful if AI could help us make sense of that kind of data, especially in educational or public health settings where understanding sentiment could lead to real change.

In this paper, I compare how LLMs and dedicated models handle that challenge. I won’t go into the technical details here (the paper does that!), but one thing that stood out to me was how surprisingly effective LLMs are—even without task-specific fine-tuning.

That said, they come with trade-offs: higher computational cost, more complexity, and the constant need to assess bias and interpretability. There’s still a lot to unpack in this space.

Looking Ahead

This paper marks a milestone for me, not just academically but personally. It brings together things I’ve been learning in courses, competitions, side projects, and books—and puts them into conversation with each other. I’m incredibly grateful to the mentors and collaborators who supported me along the way.

If you’re interested in sentiment analysis, NLP for survey data, or just want to see what a high school research paper can look like in this space, I’d love for you to take a look:
🔗 Read the full paper here

Thanks again for following along this journey. Stay tuned!

Shared Task at DravidianLangTech 2025

In 2025, I had the privilege of participating in the shared task on Sentiment Analysis in Tamil and Tulu as part of the DravidianLangTech@NAACL 2025 conference. The task was both challenging and enlightening, as it required applying machine learning techniques to multilingual data with varying sentiment nuances. This post highlights the work I did, the methodology I followed, and the results I achieved.


The Task at Hand

The goal of the task was to classify text into one of four sentiment categories: Positive, Negative, Mixed Feelings, and Unknown State. The datasets provided were in Tamil and Tulu, which made it a fascinating opportunity to work with underrepresented languages.


Methodology

I implemented a pipeline to preprocess the data, tokenize it, train a transformer-based model, and evaluate its performance. My choice of model was XLM-RoBERTa, a multilingual transformer capable of handling text from various languages effectively. Below is a concise breakdown of my approach:

  1. Data Loading and Inspection:
    • Used training, validation, and test datasets in .xlsx format.
    • Inspected the data for missing values and label distributions.
  2. Text Cleaning:
    • Created a custom function to clean text by removing unwanted characters, punctuation, and emojis.
    • Removed common stopwords to focus on meaningful content.
  3. Tokenization:
    • Tokenized the cleaned text using the pre-trained XLM-RoBERTa tokenizer with a maximum sequence length of 128.
  4. Model Setup:
    • Leveraged XLMRobertaForSequenceClassification with 4 output labels.
    • Configured TrainingArguments to train for 3 epochs with evaluation at the end of each epoch.
  5. Evaluation:
    • Evaluated the model on the validation set, achieving a Validation Accuracy of 59.12%.
  6. Saved Model:
    • Saved the trained model and tokenizer for reuse.
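As one illustration, the text cleaning in step 2 could look something like this minimal sketch. The stopword list and regex here are placeholders for illustration only; the actual run used language-appropriate resources for Tamil and Tulu rather than this English list.

```python
import re

# Illustrative stopword list; the real run would use
# language-appropriate lists for Tamil and Tulu.
STOPWORDS = {"the", "a", "an", "is", "and", "or", "of"}

def clean_text(text):
    """Lowercase, strip punctuation and emojis, and drop stopwords."""
    text = text.lower()
    # Keep only word characters and whitespace; this also removes emojis
    text = re.sub(r"[^\w\s]", "", text, flags=re.UNICODE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("This is GREAT!!! 🎉 An example of a review."))
# → "this great example review"
```

Keeping the cleaning in one small function made it easy to apply the same transformation consistently to the training, validation, and test splits.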

Results

After training the model for three epochs, the validation accuracy was 59.12%. While there is clear room for improvement, this score shows that the model can capture at least some of the sentiment nuances in low-resource settings like Tamil and Tulu.


The Code

Below is an overview of the steps in the code:

  • Preprocessing: Cleaned and tokenized the text to prepare it for model input.
  • Model Training: Used Hugging Face’s Trainer API to simplify the training process.
  • Evaluation: Compared predictions against ground truth to compute accuracy.
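The evaluation step in the list above boils down to simple label matching. Here is a minimal sketch; the predictions shown are hypothetical examples, not outputs from the actual run.

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the gold labels."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical labels from the four-class scheme
gold = ["Positive", "Negative", "Mixed Feelings", "Unknown State"]
pred = ["Positive", "Negative", "Positive", "Unknown State"]
print(accuracy(pred, gold))  # → 0.75
```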

To make this process more accessible, I’ve attached the complete code as a downloadable file. However, for a quick overview, here’s a snippet from the code that demonstrates how the text was tokenized:

# Tokenize text data using the XLM-RoBERTa tokenizer
def tokenize_text(data, tokenizer, max_length=128):
    return tokenizer(
        data,
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors="pt"
    )

train_tokenized = tokenize_text(train['cleaned'].tolist(), tokenizer)
val_tokenized = tokenize_text(val['cleaned'].tolist(), tokenizer)

This function ensures the input text is prepared correctly for the transformer model.


Reflections

Participating in this shared task was a rewarding experience. It highlighted the complexities of working with low-resource languages and the potential of transformers in tackling these challenges. Although the accuracy could be improved with hyperparameter tuning and advanced preprocessing, the results are a promising step forward.


Download the Code

I’ve attached the full code used for this shared task. Feel free to download it and explore the implementation in detail.


If you’re interested in multilingual NLP or sentiment analysis, I’d love to hear your thoughts or suggestions on improving this approach! Leave a comment below or connect with me via the blog.

Exploring the Intersection of AI and Human Creativity: A Review of Deep Thinking by Garry Kasparov

Recently, I had the opportunity to read Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins by Garry Kasparov. While this book doesn’t directly tie into my work in computational linguistics, it still resonated with me due to its exploration of artificial intelligence (AI), a field closely related to many of my interests. The book combines my passions for chess and technology, and while its primary focus is on AI in the realm of chess, it touches on broader themes that align with my curiosity about how AI and human creativity intersect.

In Deep Thinking, the legendary chess grandmaster Garry Kasparov delves into his personal journey with artificial intelligence, particularly focusing on his famous matches against the machine Deep Blue. This book is not just a chronicle of those historic encounters; it’s an exploration of how AI impacts human creativity, decision-making, and the psychological experience of competition.

Kasparov’s narrative offers more than just an inside look at high-level chess; it provides an insightful commentary on the evolving relationship between humans and technology. Deep Thinking is a must-read for those interested in the intersection of AI and human ingenuity, especially for chess enthusiasts who want to understand the psychological and emotional impacts of playing against a machine.

Kasparov’s main argument is clear: While AI has transformed chess, it still cannot replicate the creativity, reasoning, and emotional depth that humans bring to the game. AI can calculate moves and offer solutions, but it lacks the underlying rationale and context that makes human play unique. As Kasparov reflects, even the most advanced chess programs can’t explain why a move is brilliant—they just make it. This inability to reason and articulate is a crucial distinction he highlights throughout the book, particularly in Chapter 4, where he emphasizes that AI lacks the emotional engagement that a human player experiences.

For Kasparov, the real challenge comes not just from the machine’s power but from its lack of emotional depth. In Chapter 5, he shares how the experience of being crushed by an AI, which feels no satisfaction or fear, is difficult to process emotionally. It’s this emotional disconnect that underscores the difference between the human and machine experience, not only in chess but in any form of creative endeavor. The machine may be able to play at the highest level, but it doesn’t feel the game the way humans do.

Kasparov’s exploration of AI in chess is enriched by his experiences with earlier machines like Deep Thought, where he learns that “a machine learning system is only as good as its data.” This idea touches on a broader theme in the book: the idea that AI is limited by the input it receives. The system is as powerful as the information it processes, but it can never go beyond that data to create something entirely new or outside the parameters defined for it.

By the book’s conclusion, Kasparov pivots to a broader, more philosophical discussion: Can AI make us more human? He argues that technology, when used properly, has the potential to free us from mundane tasks, allowing us to be more creative. It is a hopeful perspective, envisioning a future where humans and machines collaborate rather than compete.

However, Deep Thinking does have its weaknesses. The book’s technical nature and reliance on chess-specific terminology may alienate readers unfamiliar with the game or the intricacies of AI. Kasparov makes an effort to explain these concepts, but his heavy use of jargon can make it difficult for casual readers to fully engage with the material. Additionally, while his critique of AI is compelling, it sometimes feels one-sided, focusing mainly on AI’s limitations without fully exploring how it can complement human creativity.

Despite these drawbacks, Deep Thinking remains a fascinating and thought-provoking read for those passionate about chess, AI, and the future of human creativity. Kasparov’s firsthand insights into the psychological toll of competing against a machine and his reflections on the evolving role of AI in both chess and society make this book a significant contribution to the ongoing conversation about technology and humanity.

In conclusion, Deep Thinking is a compelling exploration of AI’s role in chess and human creativity. While it may be a challenging read for those new to the fields of chess or AI, it offers invaluable insights for those looking to explore the intersection of technology and human potential. If you’re a chess enthusiast, an AI aficionado, or simply curious about how machines and humans can co-evolve creatively, Kasparov’s book is a must-read.
