In 2025, I had the privilege of participating in the shared task on Sentiment Analysis in Tamil and Tulu as part of the DravidianLangTech@NAACL 2025 conference. The task was both challenging and enlightening, as it required applying machine learning techniques to multilingual data with varying sentiment nuances. This post highlights the work I did, the methodology I followed, and the results I achieved.
The Task at Hand
The goal of the task was to classify text into one of four sentiment categories: Positive, Negative, Mixed Feelings, and Unknown State. The datasets provided were in Tamil and Tulu, which made it a fascinating opportunity to work with underrepresented languages.
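Concretely, the four sentiment categories can be mapped to integer ids for a classifier head along these lines (the ordering of the ids here is an illustrative assumption; only the label set itself comes from the task):

```python
# The four sentiment categories from the shared task, mapped to integer ids
# for the classifier head. The specific id ordering is an assumption made
# for illustration.
LABELS = ["Positive", "Negative", "Mixed Feelings", "Unknown State"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```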
Methodology
I implemented a pipeline to preprocess the data, tokenize it, train a transformer-based model, and evaluate its performance. My choice of model was XLM-RoBERTa, a multilingual transformer capable of handling text from various languages effectively. Below is a concise breakdown of my approach:
- Data Loading and Inspection:
  - Used training, validation, and test datasets in `.xlsx` format.
  - Inspected the data for missing values and label distributions.
- Text Cleaning:
  - Created a custom function to clean the text by removing unwanted characters, punctuation, and emojis.
  - Removed common stopwords to focus on meaningful content.
- Tokenization:
  - Tokenized the cleaned text using the pre-trained XLM-RoBERTa tokenizer with a maximum sequence length of 128.
- Model Setup:
  - Leveraged `XLMRobertaForSequenceClassification` with 4 output labels.
  - Configured `TrainingArguments` to train for 3 epochs with evaluation at the end of each epoch.
- Evaluation:
  - Evaluated the model on the validation set, achieving a validation accuracy of 59.12%.
- Saving the Model:
  - Saved the trained model and tokenizer for reuse.
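To make the text-cleaning step above more concrete, here is a minimal sketch of what such a cleaning function might look like. The stopword list and regular expressions are illustrative assumptions, not the exact ones from my pipeline:

```python
import re

# Illustrative stopword set; the actual pipeline used language-specific lists.
STOPWORDS = {"the", "a", "an", "is", "and", "or", "of"}

def clean_text(text: str) -> str:
    """Lowercase the text, strip URLs, punctuation, emojis, and stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation, symbols, and emojis
    text = re.sub(r"[\d_]", " ", text)     # digits and underscores
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

Because `\w` matches Unicode word characters in Python 3, Tamil and Tulu script is preserved while emojis and punctuation are stripped.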
Results
After training the model for three epochs, the validation accuracy was 59.12%. While there is room for improvement, this score demonstrates the model’s capability to handle complex sentiment nuances in low-resource languages like Tamil and Tulu.
The Code
Below is an overview of the steps in the code:
- Preprocessing: Cleaned and tokenized the text to prepare it for model input.
- Model Training: Used Hugging Face’s `Trainer` API to simplify the training process.
- Evaluation: Compared predictions against ground truth to compute accuracy.
To make this process more accessible, I’ve attached the complete code as a downloadable file. However, for a quick overview, here’s a snippet from the code that demonstrates how the text was tokenized:
```python
# Tokenize text data using the XLM-RoBERTa tokenizer
def tokenize_text(data, tokenizer, max_length=128):
    return tokenizer(
        data,
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors="pt"
    )

train_tokenized = tokenize_text(train['cleaned'].tolist(), tokenizer)
val_tokenized = tokenize_text(val['cleaned'].tolist(), tokenizer)
```
This function ensures the input text is prepared correctly for the transformer model.
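The evaluation step, comparing argmax predictions against the gold labels, boils down to something like the following pure-Python sketch (the real pipeline computes this over the model’s output tensors; the function names here are my own):

```python
def argmax(row):
    """Index of the largest logit in one row."""
    return max(range(len(row)), key=row.__getitem__)

def compute_accuracy(logits, labels):
    """Accuracy from per-class logits: argmax each row, compare to gold labels."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == g for p, g in zip(preds, labels))
    return correct / len(labels)
```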
Reflections
Participating in this shared task was a rewarding experience. It highlighted the complexities of working with low-resource languages and the potential of transformers in tackling these challenges. Although the accuracy could be improved with hyperparameter tuning and advanced preprocessing, the results are a promising step forward.
Download the Code
I’ve attached the full code used for this shared task. Feel free to download it and explore the implementation in detail.
If you’re interested in multilingual NLP or sentiment analysis, I’d love to hear your thoughts or suggestions on improving this approach! Leave a comment below or connect with me via the blog.