IsiNdebele NLP Pipeline
Sisekelo Sinyolo
6/6/2024 · 2 min read
Project description
This project builds on research into low-resource languages and focuses on exploring and enhancing natural language processing (NLP) techniques for the isiNdebele language spoken in Zimbabwe.
The project encompasses a series of Jupyter notebooks that detail the processes of data scraping, cleaning, language modeling, linguistic feature analysis, and visualisation to uncover insights from isiNdebele text data.
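The notebooks are available on GitHub. My conclusion is that further research and practical applications are needed to support and enhance the digital presence of the isiNdebele language.
In this article I'll talk about data collection, data preparation, language modeling, linguistic feature analysis, visualisation and reporting, and ideas for future exploration.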
Data collection
Data was scraped from two Ndebele news websites: Umthunywa and VOA News Ndebele. Content was scraped in accordance with the restrictions mentioned in the robots.txt file.
This data was supplemented by a corpus of literary texts and schoolbooks in the Ndebele language, which I collected for a separate project during my Master's studies.
Tools & Frameworks: Python, BeautifulSoup, Requests
In total, 10 800 distinct articles were retrieved, containing 1 200 014 words.
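To illustrate what respecting robots.txt can look like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It is not the code from the notebooks: the helper name `build_robots_checker`, the example.com URLs, and the sample rules are assumptions made for illustration. In a real run the rules would be downloaded from each site, and only the URLs that pass the check would be fetched with Requests and parsed with BeautifulSoup.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that says whether a URL may be fetched.

    `robots_txt` is the raw text of a site's robots.txt file. Against a
    live site you could instead call parser.set_url(...) and parser.read().
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical rules and URLs, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(SAMPLE_ROBOTS)
candidate_urls = [
    "https://example.com/news/article-1",
    "https://example.com/private/drafts",
]
# Keep only the URLs the site permits crawlers to fetch.
urls_to_scrape = [url for url in candidate_urls if allowed(url)]
```

Filtering the URL list before any request is sent keeps the compliance check in one place, separate from the parsing logic.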
Data preparation
The data cleaning process began with loading the dataset and assessing its structure and completeness, including identifying missing values in the Author column and inconsistencies in the Content field.
Specific cleaning actions included standardizing date formats, removing redundant text from the Content field, and correcting author names based on predefined rules and patterns.
Tools & Frameworks: Python, Pandas
After cleaning, 9 845 articles remained, containing 984 235 words.
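The cleaning steps described above could be sketched in Pandas roughly as follows. The `Author` and `Content` column names come from the dataset; the `Date` column name, the boilerplate pattern, the "By " prefix rule, and the decision to drop rows with empty content are assumptions made for this example, not the actual rules used in the notebooks.

```python
import re

import pandas as pd

# Hypothetical trailing boilerplate; the real redundant text is site-specific.
BOILERPLATE = re.compile(r"\s*(Share this article|Read more)\s*$", re.IGNORECASE)


def to_iso_date(value):
    """Parse one date value and return it as YYYY-MM-DD, or None if unparseable."""
    parsed = pd.to_datetime(value, errors="coerce")
    return parsed.strftime("%Y-%m-%d") if pd.notna(parsed) else None


def clean_articles(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Standardise date formats (assumes a column named "Date").
    out["Date"] = out["Date"].map(to_iso_date)
    # Remove redundant trailing text from the Content field.
    out["Content"] = (
        out["Content"].fillna("").str.replace(BOILERPLATE, "", regex=True).str.strip()
    )
    # Example author rule: strip a leading "By " and mark missing authors.
    out["Author"] = (
        out["Author"]
        .str.replace(r"^\s*by\s+", "", case=False, regex=True)
        .str.strip()
        .fillna("Unknown")
    )
    # Assumption: articles left with no content are dropped.
    return out[out["Content"].str.len() > 0].reset_index(drop=True)


def count_words(df: pd.DataFrame) -> int:
    """Total number of whitespace-separated tokens in the Content column."""
    return int(df["Content"].str.split().str.len().sum())


# Made-up sample rows, for illustration only.
raw = pd.DataFrame(
    {
        "Author": ["By Thandi Moyo", None, "S. Ncube"],
        "Date": ["2024-06-06", "6 June 2024", "not a date"],
        "Content": ["Umbiko wokuqala. Share this article", "Indaba yesibili", None],
    }
)
cleaned = clean_articles(raw)
```

Calling `len(cleaned)` and `count_words(cleaned)` before and after cleaning gives article and word totals of the kind reported above.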
Language modeling
I trained and fine-tuned NLP models specifically for isiNdebele text to capture its unique syntactic and semantic characteristics.
I used several model types, from n-gram models to more advanced deep learning models, to capture and predict isiNdebele language patterns.
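As an illustration of the simplest of these approaches, here is a minimal bigram model with add-one (Laplace) smoothing, written with the standard library only. It is a sketch of the general n-gram technique, not the model trained in the notebooks; the whitespace tokenisation, the function names, and the three toy sentences are assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark sentence boundaries.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    following = bigrams.get(prev, Counter())
    vocab_size = len(unigrams)
    return (following[word] + 1) / (sum(following.values()) + vocab_size)


def predict_next(prev, bigrams):
    """Most frequent word seen after `prev`, or None if `prev` is unseen."""
    following = bigrams.get(prev)
    return following.most_common(1)[0][0] if following else None


# Toy sentences, for illustration only.
corpus = ["ngiyabonga kakhulu", "ngiyabonga kakhulu mama", "ngiyabonga baba"]
unigrams, bigrams = train_bigram_model(corpus)
next_word = predict_next("ngiyabonga", bigrams)
```

Smoothing matters for a small corpus: without it, any word pair that never occurred in training would receive a probability of zero.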
Linguistic feature analysis
I conducted linguistic feature analysis to extract and examine key features of isiNdebele text, including part-of-speech tagging, named entity recognition, and morphological analysis.
I developed custom algorithms and tools to handle the linguistic nuances of isiNdebele, ensuring accurate and meaningful feature extraction.
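To give a flavour of what rule-based morphological analysis can involve, the sketch below guesses a noun class by matching a word against a small table of noun prefixes. This is a deliberately crude heuristic and not the tooling from the notebooks: the prefix table is a simplified, illustrative subset, the function name is hypothetical, and prefix matching alone will mislabel words that merely happen to start with the same letters.

```python
# Simplified, illustrative subset of noun prefixes (an assumption, not a full grammar).
NOUN_PREFIXES = {
    "umu": "class 1/3",
    "um": "class 1/3",
    "aba": "class 2",
    "ama": "class 6",
    "isi": "class 7",
    "izi": "class 8",
    "ubu": "class 14",
    "uku": "class 15",
}

# Try longer prefixes first so that "umu" wins over "um".
_ORDERED_PREFIXES = sorted(NOUN_PREFIXES, key=len, reverse=True)


def guess_noun_class(word):
    """Return (prefix, stem, label) for the first matching prefix, else None."""
    lowered = word.lower()
    for prefix in _ORDERED_PREFIXES:
        stem = lowered[len(prefix):]
        if lowered.startswith(prefix) and stem:
            return prefix, stem, NOUN_PREFIXES[prefix]
    return None


analysis = [guess_noun_class(w) for w in ["umuntu", "abantu", "isiNdebele", "xyz"]]
```

A heuristic like this is best treated as a starting point whose output is checked against a lexicon or hand-annotated examples.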
Visualisation and reporting
I used visualisation techniques to illustrate the distribution and frequency of words, phrases, and linguistic features within the isiNdebele corpus.
I generated interactive charts and graphs to make the data analysis intuitive and accessible, highlighting key patterns and trends in the text data.
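The data behind a word-frequency chart can be computed in a few lines. The sketch below counts the most frequent words and renders them as a plain-text bar chart; it stands in for the interactive charts, whose plotting library is not named here. The function names, the `\w+` tokenisation, and the sample texts are assumptions.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Return the n most frequent lowercase word tokens as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


def render_bars(pairs, width=40):
    """Render (word, count) pairs as text bars scaled to the largest count."""
    if not pairs:
        return ""
    max_count = max(count for _, count in pairs)
    label_width = max(len(word) for word, _ in pairs)
    lines = []
    for word, count in pairs:
        bar = "#" * max(1, round(count / max_count * width))
        lines.append(f"{word:<{label_width}} {bar} {count}")
    return "\n".join(lines)


# Placeholder texts, for illustration only.
pairs = top_words(["a b a", "b a c"], n=2)
chart = render_bars(pairs, width=6)
```

The same `(word, count)` pairs can be passed to a plotting library of choice to produce an interactive version.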
Future exploration
Semantic analysis
Machine translation
Voice to text
Check out these other projects
Semantic analysis
Machine translation
Voice to text
Subscription plan recommender
Wine market analysis