Shared Task at DravidianLangTech 2025

In 2025, I had the privilege of participating in the shared task on Sentiment Analysis in Tamil and Tulu as part of the DravidianLangTech@NAACL 2025 conference. The task was both challenging and enlightening, as it required applying machine learning techniques to multilingual data with varying sentiment nuances. This post highlights the work I did, the methodology I followed, and the results I achieved.


The Task at Hand

The goal of the task was to classify text into one of four sentiment categories: Positive, Negative, Mixed Feelings, and Unknown State. The datasets provided were in Tamil and Tulu, which made it a fascinating opportunity to work with underrepresented languages.


Methodology

I implemented a pipeline to preprocess the data, tokenize it, train a transformer-based model, and evaluate its performance. My choice of model was XLM-RoBERTa, a multilingual transformer capable of handling text from various languages effectively. Below is a concise breakdown of my approach:

  1. Data Loading and Inspection:
    • Used training, validation, and test datasets in .xlsx format.
    • Inspected the data for missing values and label distributions.
  2. Text Cleaning:
    • Created a custom function to clean text by removing unwanted characters, punctuation, and emojis (a rough sketch of this step appears after this list).
    • Removed common stopwords to focus on meaningful content.
  3. Tokenization:
    • Tokenized the cleaned text using the pre-trained XLM-RoBERTa tokenizer with a maximum sequence length of 128.
  4. Model Setup:
    • Leveraged XLMRobertaForSequenceClassification (from Hugging Face Transformers) with 4 output labels.
    • Configured TrainingArguments to train for 3 epochs with evaluation at the end of each epoch.
  5. Evaluation:
    • Evaluated the model on the validation set, achieving a Validation Accuracy of 59.12%.
  6. Saved Model:
    • Saved the trained model and tokenizer for reuse.

Results

After training the model for three epochs, the validation accuracy was 59.12%. While there is clearly room for improvement, this score suggests the model can capture at least some of the sentiment nuances in low-resource languages like Tamil and Tulu.


The Code

Below is an overview of the steps in the code:

  • Preprocessing: Cleaned and tokenized the text to prepare it for model input.
  • Model Training: Used Hugging Face’s Trainer API to simplify the training process.
  • Evaluation: Compared predictions against ground truth to compute accuracy.

To make this process more accessible, I’ve attached the complete code as a downloadable file. However, for a quick overview, here’s a snippet from the code that demonstrates how the text was tokenized:

# Tokenize text data using the XLM-RoBERTa tokenizer
def tokenize_text(data, tokenizer, max_length=128):
    return tokenizer(
        data,
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors="pt"
    )

train_tokenized = tokenize_text(train['cleaned'].tolist(), tokenizer)
val_tokenized = tokenize_text(val['cleaned'].tolist(), tokenizer)

This function ensures the input text is prepared correctly for the transformer model.
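
For a fuller picture of the remaining steps, here is a hedged sketch of the model setup, training, evaluation, and saving. Only the details mentioned above come from my actual run (xlm-roberta-base with 4 labels, 3 epochs, evaluation each epoch); the label-to-id mapping, the small Dataset wrapper, and the output paths are illustrative assumptions.

# Model setup, training, and evaluation (illustrative sketch)
import numpy as np
import torch
from transformers import (
    Trainer,
    TrainingArguments,
    XLMRobertaForSequenceClassification,
    XLMRobertaTokenizer,
)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=4
)

# Assumed label spellings; the dataset files may use different strings
label2id = {"Positive": 0, "Negative": 1, "Mixed Feelings": 2, "Unknown State": 3}
train_labels = train["label"].map(label2id).tolist()
val_labels = val["label"].map(label2id).tolist()

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized encodings and integer labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": (preds == eval_pred.label_ids).mean()}

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    evaluation_strategy="epoch",   # "eval_strategy" in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=SentimentDataset(train_tokenized, train_labels),
    eval_dataset=SentimentDataset(val_tokenized, val_labels),
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())          # reports validation accuracy

# Save the trained model and tokenizer for reuse
model.save_pretrained("./xlmr-sentiment")
tokenizer.save_pretrained("./xlmr-sentiment")

The downloadable file below contains the exact code I used, so treat this sketch only as a map of how the pieces fit together.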


Reflections

Participating in this shared task was a rewarding experience. It highlighted the complexities of working with low-resource languages and the potential of transformers in tackling these challenges. Although the accuracy could be improved with hyperparameter tuning and advanced preprocessing, the results are a promising step forward.


Download the Code

I’ve attached the full code used for this shared task. Feel free to download it and explore the implementation in detail.


If you’re interested in multilingual NLP or sentiment analysis, I’d love to hear your thoughts or suggestions on improving this approach! Leave a comment below or connect with me via the blog.

Happy New Year 2025! Reflecting on a Year of Growth and Looking Ahead

As we welcome 2025, I want to take a moment to reflect on the past year and share some exciting plans for the future.

Highlights from 2024

  • Academic Pursuits: I delved deeper into Natural Language Processing (NLP), discovering Jonathan Dunn’s Natural Language Processing for Corpus Linguistics, which seamlessly integrates computational methods with traditional linguistic analysis.
  • AI and Creativity: Exploring the intersection of AI and human creativity, I read Garry Kasparov’s Deep Thinking, which delves into his experiences with AI in chess and offers insights into the evolving relationship between humans and technology.
  • Competitions and Courses: I actively participated in Kaggle competitions, enhancing my machine learning and data processing skills, which are crucial in the neural network and AI aspects of Computational Linguistics.
  • Community Engagement: I had the opportunity to compete in the 2024 VEX Robotics World Championship and reintroduced our school’s chess club to the competitive scene, marking our return since pre-COVID times.

Looking Forward to 2025

  • Expanding Knowledge: I plan to continue exploring advanced topics in NLP and AI, sharing insights and resources that I find valuable.
  • Engaging Content: Expect more in-depth discussions, tutorials, and reviews on the latest developments in computational linguistics and related fields.
  • Community Building: I aim to foster a community where enthusiasts can share knowledge, ask questions, and collaborate on projects.

Thank you for being a part of this journey. Your support and engagement inspire me to keep exploring and sharing. Here’s to a year filled with learning, growth, and innovation!

A Book That Expanded My Perspective on NLP: Natural Language Processing for Corpus Linguistics by Jonathan Dunn

Book Link: https://doi.org/10.1017/9781009070447

As I dive deeper into the fascinating world of Natural Language Processing (NLP), I often come across resources that reshape my understanding of the field. One such recent discovery is Jonathan Dunn’s Natural Language Processing for Corpus Linguistics. This book, a part of the Elements in Corpus Linguistics series by Cambridge University Press, stands out for its seamless integration of computational methods with traditional linguistic analysis.

A Quick Overview

The book serves as a guide to applying NLP techniques to corpus linguistics, especially in dealing with large-scale corpora that are beyond the scope of traditional manual analysis. It discusses how methods such as text classification and text similarity can help address linguistic problems such as categorization (e.g., identifying part-of-speech tags) and comparison (e.g., measuring stylistic similarities between authors).

What I found particularly intriguing is its structure, which is built around five compelling case studies:

  1. Corpus-Based Sociolinguistics: Exploring geographic and social variations in language use.
  2. Corpus Stylistics: Understanding authorship through stylistic differences in texts.
  3. Usage-Based Grammar: Analyzing syntax and semantics via computational models.
  4. Multilingualism Online: Investigating underrepresented languages in digital spaces.
  5. Socioeconomic Indicators: Applying corpus analysis to non-linguistic fields like politics and sentiment in customer reviews.

The book is as much a practical resource as it is theoretical. Accompanied by Python notebooks and a stand-alone Python package, it provides hands-on tools to implement the discussed methods—a feature that makes it especially appealing to readers with a technical bent.

A Personal Connection

My journey with this book is a bit more personal. While exploring NLP, I had the chance to meet Jonathan Dunn, who shared invaluable insights about this field. One of his students, Sidney Wong, recommended this book to me as a starting point for understanding how computational methods can expand corpus linguistics. It has since become a cornerstone of my learning in this area.

What Makes It Unique

Two aspects of Dunn’s book particularly resonated with me:

  1. Ethical Considerations: As corpus sizes grow, so do the ethical dilemmas associated with their use. From privacy issues to biases in computational models, the book doesn’t shy away from discussing the darker side of large-scale text analysis. This balance between innovation and responsibility is a critical takeaway for anyone venturing into NLP.
  2. Interdisciplinary Approach: Whether you’re a linguist looking to incorporate computational methods or a computer scientist aiming to understand linguistic principles, this book bridges the gap between the two disciplines beautifully. It encourages a collaborative perspective, which is essential in fields as expansive as NLP and corpus linguistics.

Who Should Read It?

If you’re a student, researcher, or practitioner with an interest in exploring how NLP can scale linguistic analysis, this book is for you. Its accessibility makes it suitable for beginners, while the advanced discussions and hands-on code offer plenty for seasoned professionals to learn from.

For me, Natural Language Processing for Corpus Linguistics isn’t just a book—it’s a toolkit, a mentor, and an inspiration rolled into one. As I continue my journey in NLP, I find myself revisiting its chapters for insights and ideas.

Exploring the Intersection of AI and Human Creativity: A Review of Deep Thinking by Garry Kasparov

Recently, I had the opportunity to read Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins by Garry Kasparov. While this book doesn’t directly tie into my work in computational linguistics, it still resonated with me due to its exploration of artificial intelligence (AI), a field closely related to many of my interests. The book combines my passions for chess and technology, and while its primary focus is on AI in the realm of chess, it touches on broader themes that align with my curiosity about how AI and human creativity intersect.

In Deep Thinking, the legendary chess grandmaster Garry Kasparov delves into his personal journey with artificial intelligence, particularly focusing on his famous matches against the machine Deep Blue. This book is not just a chronicle of those historic encounters; it’s an exploration of how AI impacts human creativity, decision-making, and the psychological experience of competition.

Kasparov’s narrative offers more than just an inside look at high-level chess; it provides an insightful commentary on the evolving relationship between humans and technology. Deep Thinking is a must-read for those interested in the intersection of AI and human ingenuity, especially for chess enthusiasts who want to understand the psychological and emotional impacts of playing against a machine.

Kasparov’s main argument is clear: While AI has transformed chess, it still cannot replicate the creativity, reasoning, and emotional depth that humans bring to the game. AI can calculate moves and offer solutions, but it lacks the underlying rationale and context that makes human play unique. As Kasparov reflects, even the most advanced chess programs can’t explain why a move is brilliant—they just make it. This inability to reason and articulate is a crucial distinction he highlights throughout the book, particularly in Chapter 4, where he emphasizes that AI lacks the emotional engagement that a human player experiences.

For Kasparov, the real challenge comes not just from the machine’s power but from its lack of emotional depth. In Chapter 5, he shares how the experience of being crushed by an AI, which feels no satisfaction or fear, is difficult to process emotionally. It’s this emotional disconnect that underscores the difference between the human and machine experience, not only in chess but in any form of creative endeavor. The machine may be able to play at the highest level, but it doesn’t feel the game the way humans do.

Kasparov’s exploration of AI in chess is enriched by his experiences with earlier machines like Deep Thought, where he learns that “a machine learning system is only as good as its data.” This idea touches on a broader theme in the book: the idea that AI is limited by the input it receives. The system is as powerful as the information it processes, but it can never go beyond that data to create something entirely new or outside the parameters defined for it.

By the book’s conclusion, Kasparov pivots to a broader, more philosophical discussion: Can AI make us more human? He argues that technology, when used properly, has the potential to free us from mundane tasks, allowing us to be more creative. It is a hopeful perspective, envisioning a future where humans and machines collaborate rather than compete.

However, Deep Thinking does have its weaknesses. The book’s technical nature and reliance on chess-specific terminology may alienate readers unfamiliar with the game or the intricacies of AI. Kasparov makes an effort to explain these concepts, but his heavy use of jargon can make it difficult for casual readers to fully engage with the material. Additionally, while his critique of AI is compelling, it sometimes feels one-sided, focusing mainly on AI’s limitations without fully exploring how it can complement human creativity.

Despite these drawbacks, Deep Thinking remains a fascinating and thought-provoking read for those passionate about chess, AI, and the future of human creativity. Kasparov’s firsthand insights into the psychological toll of competing against a machine and his reflections on the evolving role of AI in both chess and society make this book a significant contribution to the ongoing conversation about technology and humanity.

In conclusion, Deep Thinking is a compelling exploration of AI’s role in chess and human creativity. While it may be a challenging read for those new to the fields of chess or AI, it offers invaluable insights for those looking to explore the intersection of technology and human potential. If you’re a chess enthusiast, an AI aficionado, or simply curious about how machines and humans can co-evolve creatively, Kasparov’s book is a must-read.

I am back!

This will be a short post since I’m planning to post a more in-depth discussion on one thing that I’ve been up to over the summer. Between writing a research paper (currently under review by the Journal of High School Science) and founding a nonprofit called Student Echo, I’ve been keeping myself busy. Despite all this, I plan to post shorter updates more frequently here. Sorry for the wait—assuming anyone was actually waiting—but hey, here you go.

Here’s a bit more about what’s been keeping me occupied:
My Research Paper
Title: Comparing Performance of LLMs vs. Dedicated Neural Networks in Analyzing the Sentiment of Survey Responses
Abstract: Interpreting sentiment in open-ended survey data is a challenging but crucial task in the age of digital information. This paper studies the capabilities of three LLMs (Gemini-1.5-Flash, Llama-3-70B, and GPT-4o), comparing them to dedicated sentiment analysis neural networks, namely RoBERTa-base-sentiment and DeBERTa-v3-base-absa. These models were evaluated on their accuracy along with other metrics (precision, recall, and F1-score) in determining the underlying sentiment of responses from two COVID-19 surveys. The results revealed that despite being designed for broader applications, all three LLMs generally outperformed the specialized neural networks, with the caveat that RoBERTa was the most precise at detecting negative sentiment. While LLMs are more resource-intensive than dedicated neural networks, their enhanced accuracy demonstrates their evolving potential and justifies the increased resource costs in sentiment analysis.

My Nonprofit: Student Echo
Website: https://www.student-echo.org/
Student-Echo.org is a student-led non-profit organization with the mission of amplifying students’ voices through student-designed questionnaires, AI-based technology, and close collaboration among students, teachers, and school district educators.
