A Sentiment Analysis Model for Bahraini Dialects: American University of Bahrain (AUBH)


Education_BH: Innovation and Excellence celebrates the remarkable innovations implemented in Bahrain’s schools and universities. We learned about these innovations, the process, the challenges, and the lessons learned from some of the leading educational institutions in the Kingdom. Read more in our latest issue.

This innovation sought to expand Bahrain’s artificial intelligence research and development footprint by creating the first sentiment analysis model for the Bahraini dialects. The model beat many state-of-the-art Arabic sentiment analysis models developed by large international corporations and research institutes. Sentiment analysis is the process of interpreting the emotion (positive, negative, or neutral) that a text carries, using Artificial Intelligence and Natural Language Processing (NLP) techniques. The innovation was created by an AUBH alumnus under the supervision of our faculty members.

In this project, a dataset of 4025 Instagram comments in the Bahraini dialect was collected by web scraping, annotated by three Bahraini volunteers, and carefully pre-processed to serve as input for Machine Learning and Deep Learning algorithms.
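To illustrate how labels from three annotators might be merged into a single label per comment, here is a minimal Python sketch. The article does not say how disagreements between the volunteers were resolved, so majority voting, the function name `majority_label`, and the choice to return `None` on a tie are all assumptions, and the sample annotations are invented.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by most annotators, or None if there is no majority.

    Majority voting is an assumption; the article does not describe
    how the three volunteers' annotations were combined.
    """
    counts = Counter(labels).most_common()
    if not counts:
        return None
    # A tie for first place means no single label wins.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical annotations for three comments (not from the real dataset).
annotations = [
    ["positive", "positive", "neutral"],
    ["negative", "negative", "negative"],
    ["positive", "negative", "neutral"],
]
merged = [majority_label(a) for a in annotations]
```

With three annotators and three classes, a tie can only occur when all three disagree; such comments could be discarded or sent back for review.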

Over 6700 combinations of algorithms, dataset pre-processing steps, feature extraction methods, and dataset splits were extensively examined to determine the best model.
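A search like this can be pictured as the Cartesian product of the available choices. The sketch below is only an illustration: the option lists are hypothetical placeholders, since the article gives the total number of combinations but not the individual options that were combined.

```python
from itertools import product

# Hypothetical option lists; the article reports only the total
# (over 6700), not the individual choices behind it.
algorithms = ["logistic_regression", "naive_bayes", "svm"]
preprocessing = ["raw", "stemmed"]
features = ["tfidf_unigram", "tfidf_unigram_bigram"]
splits = [(80, 20), (70, 30)]

# Every combination of one choice from each list.
grid = list(product(algorithms, preprocessing, features, splits))

# Each entry would be trained and scored in turn; here we only count them.
n_combinations = len(grid)
```

Because the count is the product of the list sizes, adding a handful of options per dimension quickly pushes the total into the thousands.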

In the end, the best-performing model was a Logistic Regression model that used Term Frequency-Inverse Document Frequency (TF-IDF) features over unigrams and bigrams, trained on an 80/20 split of the dataset after stemming with RapidMiner. It achieved an average accuracy of 70.05%.
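For readers unfamiliar with these terms, the following Python sketch shows one simple way to build TF-IDF weights over unigrams and bigrams and to make an 80/20 split. It is not the project's pipeline: the function names, the English placeholder comments, the particular TF-IDF formula, and the unshuffled split are all assumptions, and the Logistic Regression classifier itself is omitted.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all unigrams and bigrams of a token list as strings."""
    out = []
    for n in range(1, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

def tfidf(docs):
    """Weight each term by raw count times (ln(N / df) + 1).

    Libraries differ in smoothing and normalisation; this is one simple variant.
    """
    term_lists = [ngrams(d.split()) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for terms in term_lists for t in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append({t: c * (math.log(n_docs / df[t]) + 1) for t, c in tf.items()})
    return vectors

def split_80_20(items):
    """Put the first 80% in the training part and the rest in the test part.

    A real split would normally shuffle the items first.
    """
    cut = len(items) * 80 // 100
    return items[:cut], items[cut:]

# Placeholder comments in English (not from the real dataset).
docs = ["good service", "bad service", "good staff", "bad staff", "slow service"]
vectors = tfidf(docs)
train_docs, test_docs = split_80_20(docs)
```

In this toy example the bigram "good service" appears in only one comment, so it receives a higher weight than the common unigram "service".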

As for the project's outputs, a research paper has been accepted for publication in a SCOPUS-indexed proceeding and will be presented at the Arab ICT 2024 Conference. Another research paper is in progress and is planned for publication in an international journal. The project also won second place at the Innov8 hackathon, the largest hackathon in Bahrain, which is sponsored by large entities such as Tamkeen and Amazon Web Services (AWS).

In the future, this model, which is currently available as an Application Programming Interface (API), could be the beginning of a startup that specialises in social media analytics, giving companies in Bahrain insights into their customers’ social media interactions.

How was the innovation planned?        

While working as an intern in the Data Analytics department of a well-known bank in summer 2022, the student analysed the sentiment of customers’ comments on the bank’s Instagram page. At the time, she used a ready-made tool published by a university in Abu Dhabi. While evaluating it, however, she noticed that it performed poorly because it had not been trained on text in the Bahraini dialect. She looked for tools whose AI models had been trained on Bahraini text but could not find any, so she decided to develop one herself. That is how the idea for this graduation project came to life.

What were the challenges faced during implementation?  

“Empowering young people to thrive in the digital economy has been so instrumental.” – Ahmed AlHujairy, Chief Executive Officer of ICT Firms in Bahrain.     

During the implementation of the sentiment analysis model for Bahraini dialects, several challenges were encountered. It proved difficult to scrape sufficient data that accurately represented the various Bahraini dialects. Ensuring consistency and reliability in manual annotation by volunteers required substantial coordination and quality control.

Capturing the nuances and subtleties of Bahraini dialects, which are rich and varied, was difficult. Standard Natural Language Processing (NLP) tools are often not designed for dialectal Arabic. With over 6700 combinations of algorithms and parameters to test, selecting the optimal configuration without overfitting or underfitting was a complex task.

Managing computational resources efficiently, especially when dealing with large datasets and computationally intensive deep learning models, was a logistical hurdle.

Give us a brief assessment of your results.       

The project aimed to expand Bahrain’s Artificial Intelligence research and development by creating the first sentiment analysis model for Bahraini dialects using AI and NLP techniques. The data was gathered from Instagram and meticulously processed to ensure data protection and remove extraneous records, resulting in a dataset of 4025 comments annotated by local volunteers. Through rigorous testing of over 6700 combinations of algorithms, preprocessing methods, feature extraction techniques, and dataset splits, the student determined that machine learning models, specifically Logistic Regression using TF-IDF with unigrams and bigrams on an 80/20 dataset split, outperformed deep learning models for this application.

The top machine learning model achieved an average accuracy of 70.05% and an F1-Score of 70.94%. Remarkably, the model performed better than state-of-the-art sentiment analysis APIs. The project went further by deploying the top machine learning and deep learning models on a web application hosted on AWS, integrated into Power BI dashboards for analysing Instagram data, thus providing practical tools for businesses and researchers alike.
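For context on the two metrics, the sketch below computes accuracy and an F1-score for a three-class problem in plain Python. The article does not say how the F1-score was averaged across classes, so macro-averaging is an assumption here, and the labels and predictions are invented examples rather than project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging method is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# Invented labels and predictions, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Accuracy treats every comment equally, while macro-averaged F1 gives each class the same weight, which matters when one sentiment is much rarer than the others.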

A comparative study showed that the logistic regression algorithm outperformed the others and that our models surpassed the two deep learning architectures tested in accuracy. Additionally, we developed a new dataset of Bahraini dialects, which was lacking in the field, and used this unique dataset to inform our models and further research.

The significance of this work extends beyond technical achievements. On a societal level, the model enhances communication between businesses, individuals, and government entities by providing a more accurate understanding of public sentiment. Culturally, it contributes to the preservation and promotion of the Bahraini dialect. From a disciplinary perspective, the project has been recognised by academia with a research paper accepted for presentation and publication in a SCOPUS-indexed proceeding.

This project not only provides the first sentiment analysis model and dataset for Bahraini dialects but also sets a precedent by exploring thousands of algorithmic combinations. Its survey of 35 research papers on Arabic sentiment analysis further positions the project as a comprehensive and recent study in the field.

This research contributes to AI and NLP by providing a solid foundation for future work, including potential expansions into speech-to-text and text-to-speech applications. With the dataset planned for publication, it will become an asset to the research community. Overall, this student-led project stands as a testament to the potential impact and importance of targeted AI research in addressing linguistic and cultural challenges in the Arab world.

In hindsight, what were the most valuable lessons learned while implementing the innovation? Could things have been done differently?

The effort spent collecting and preprocessing the dataset was crucial, and it confirmed that good data is the backbone of effective AI models. Testing over 6700 algorithmic combinations taught the student that simpler models can sometimes yield better performance than their more complex counterparts. Working with native speakers for data annotation was essential, as it highlighted the importance of cultural and contextual understanding in sentiment analysis. Deployment showed that it is imperative to balance cost with performance and highlighted the importance of early planning for computational resources. Involving community stakeholders could provide continuous feedback and improve the model’s relevance and usability.

© Copyright 2024, Gulf Insider All Rights Reserved