AI’s Slow March Towards a More Inclusive Internet in India

Apeksha Shetty
Jun 25, 2021 · 8 min read
Photo by Ashwini Chaudhary on Unsplash

Madhukanta Patel, 81, got her first smartphone three years ago. For years, she had a feature phone with a numbered keypad that she used to call her friends and family, but eventually, her family decided to upgrade to a brand-new smartphone. They knew that she could only read Gujarati, so they looked for a device that supported vernacular languages. Such devices, once rare, had recently become commonplace. When the device finally arrived, she was overjoyed.

“I love watching Gujarati TV shows, and the new phone allowed me to watch them whenever I wanted,” said Madhukanta. But this instant affection soon cooled. “While my phone’s language was Gujarati, some apps still used English, which I could not understand. I would tap buttons and accidentally call family members,” she explained. Her family would also have to reset her phone quite often because she would unwittingly download apps or videos.

But Madhukanta was hardly alone. When Reliance Jio launched its services in 2016 by offering six months of free data, it sparked a digital revolution, making internet access affordable for all. Until then, the typical Indian internet user was likely to be from urban India and to speak at least Hindi or English. But when connectivity became affordable, millions across the country came online for the first time, only to find that the internet was not a place they could readily access, thanks to the language barrier.

However, over the decades, researchers across India had been working to build better vernacular language systems for digital users with the help of natural language processing, or NLP. This movement finally got a boost when the number of internet users more than doubled from 302 million in 2015 to 636 million in 2019, bringing with it a demographic shift in the profile of the average Indian internet user.

But what exactly is NLP, and how is this technology helping vernacular language speakers across the country?

What is NLP?

Natural Language Processing, or NLP, is a branch of artificial intelligence (AI) that allows machines to understand language as spoken or written by humans, in a way similar to how humans interact with each other. While the idea of intelligent machines understanding humans may seem far-fetched, today, if you’re using English on the internet, chances are that you are already using NLP in your day-to-day life. Whether it’s autocorrect, Google Translate, grammar checkers, or voice assistants like Alexa and Siri, NLP tools empower devices to understand and respond to our needs.
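To make this concrete, here is a minimal sketch of one such everyday task, machine translation. The library and checkpoint named below are assumptions chosen for illustration, not necessarily what any of the tools above use.

```python
# A small, self-contained illustration of one everyday NLP task: machine
# translation. Assumes the Hugging Face `transformers` library; the
# Helsinki-NLP English-to-Hindi checkpoint named here is one publicly
# available option, and any translation model would make the same point.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
result = translate("Where is the nearest railway station?")
print(result[0]["translation_text"])  # prints the Hindi translation
```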

Why is the interest around NLP in India growing?

India is home to 22 official or scheduled languages, and over 19,500 languages or dialects are spoken as mother tongues. English, the dominant language of the internet, is spoken by only about 10% of the country’s population. At the same time, India is home to the largest population of illiterate adults in the world, and a further, uncounted share of the population has low literacy, which makes text-heavy digital environments hard to navigate. Yet until recently, the standard narrative was that there was no need to build better systems for vernacular languages, since most internet users in India were English speakers living in urban parts of the country.

Today, the growth in internet users comes primarily from rural India, and in 2020, for the first time, India had 10% more internet users in rural regions than in urban ones. With this growth has come a rising demand for accessibility in regional languages: 90% of these new users choose to consume content in their mother tongue. The Indian government, too, has pushed for better access to content in vernacular languages through its Digital India initiative and the newly launched National Language Translation Mission. Several start-ups and initiatives have risen to meet this demand and fill the gap by building and improving NLP tools for various Indian languages. Still, they all face a common challenge: limited data.

Understanding the resource gap

To build an AI system, researchers depend on large amounts of data. The kind of data needed depends on the type of system being developed, but the demand for data has only risen with time.

Amounts of data required for NLP systems across time. Courtesy of Monojit Choudhury.

A recent paper on the linguistic diversity of the world’s languages divided them into six categories, numbered 0 to 5 in order of increasing digital resources, as highlighted in the chart below.

Based on a paper by Joshi et al. (2020)

But where do Indian languages fit into this?

According to Dr Monojit Choudhury, a principal researcher with Microsoft Research India, “Hindi, with its large user base, would fit into category 4, where everything looks quite promising. It can be pushed to have the best NLP technologies in a few years. For category 3 languages like Bengali and Tamil, the recent revolution in NLP technology has proven to be promising, which will allow us to build better systems for them. The remaining Indian languages, like Marathi, Telugu, Assamese, and others, fall in category 2 or 1 because there is very limited data available. Languages that are not included in the list of 22 scheduled languages of India fall under category 0, where despite having possibly millions of speakers, there are no digital language resources available for them.”

While Dr Choudhury has pointed out the need for data, it is vital to understand what data is needed to build better AI systems.

What kind of data is needed, and how are researchers filling in this gap?

Researchers working on language technology need two different types of data to build better computer systems that are capable of understanding various languages:

1. Labelled data: This refers to information that has labels or annotations associated with it. A simple example is an audio clip paired with its transcription.

2. Unlabelled data: This refers to any running text, such as the sentences and paragraphs found in articles, blog posts, and so on, as the sketch below illustrates.
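To make the distinction concrete, here is a minimal sketch in Python; the sentences and labels are invented for illustration and are not drawn from any real dataset.

```python
# Labelled data: each example carries an annotation a human had to provide,
# here a (hypothetical) sentiment label attached to a Hindi sentence.
labelled_examples = [
    {"text": "यह फिल्म बहुत अच्छी है", "label": "positive"},   # "this film is very good"
    {"text": "खाना बिल्कुल ठंडा था", "label": "negative"},     # "the food was completely cold"
]

# Unlabelled data: just running text, which can be collected at scale from
# articles or blog posts with no human annotation attached.
unlabelled_corpus = [
    "भारत में बोली जाने वाली भाषाओं की संख्या बहुत बड़ी है।",
    "पिछले कुछ वर्षों में इंटरनेट उपयोगकर्ताओं की संख्या तेज़ी से बढ़ी है।",
]
```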

Machine learning systems usually require a large amount of labelled data to get better at performing tasks like translating words and sentences, identifying parts of speech, and understanding commands. However, this data is expensive and hard to generate. Luckily, there has been a breakthrough in the past couple of years.

“Unlabelled data is where a lot of the progress in NLP has happened in the last three years. Using a backbone trained on unlabelled data, significant progress has been made with only a limited need for labelled data. This is the basic recipe that we want to replicate for Indian languages,” says Dr Mitesh Khapra, an associate professor at IIT Madras. This technique is called ‘transfer learning’: a model is first trained on a task that does not require manual labels, like predicting the next word in a sentence, and is then fine-tuned on meaningful tasks, like translation, that do require labelled data.
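As a rough illustration of that recipe, the sketch below fine-tunes a pretrained multilingual backbone on two labelled sentences. The checkpoint name, the toy task, and the labels are assumptions made for the example, not a description of AI4Bharat’s actual pipeline.

```python
# A rough sketch of the transfer-learning recipe: start from a backbone
# pretrained on unlabelled text, then fine-tune it on a tiny labelled set.
# Assumes the Hugging Face `transformers` and `torch` libraries; the
# checkpoint name and the toy sentiment labels are illustrative stand-ins.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

backbone = "ai4bharat/indic-bert"  # a multilingual backbone pretrained on unlabelled Indian-language text
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=2)

# A handful of labelled examples goes a long way, because the backbone
# already "knows" the language from its unlabelled pretraining.
texts = ["यह फिल्म बहुत अच्छी है", "सेवा बहुत खराब थी"]  # "this film is very good" / "the service was very bad"
labels = torch.tensor([1, 0])                            # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # the model computes the classification loss internally
outputs.loss.backward()
optimizer.step()
```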

Dr Khapra is also a co-founder of AI4Bharat, an initiative to build AI solutions to India's challenges. Their original aim was to serve as a platform where stakeholders could define problems of relevance to India and deliver AI-driven solutions for them. But they soon realised the need to build NLP systems for Indian languages. “Cutting across multiple sectors and domains like healthcare, finance, agriculture, and more, in India, the language barrier remains a huge problem,” said Dr Khapra.

To build a suite of NLP tools, the team at AI4Bharat decided first to collect as much data as they could in regional languages.

“India did not have publicly available data in regional languages, so we decided to crawl the web and scrape data from websites. News websites were the obvious choice because they contained many articles in Indian languages, and the content was also typically in the Indian context, which made them suitable for Indian users,” explained Dr Khapra.
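A toy version of such a crawl might look like the sketch below; the URL and output file are placeholders, and the real effort involved far more engineering than these few lines suggest.

```python
# A toy version of the crawl described above, using the `requests` and
# `beautifulsoup4` libraries. The URL is a placeholder; the real pipeline
# crawled many news sites and also handled boilerplate removal,
# deduplication, and language identification at scale.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/gujarati-news-article"  # hypothetical article page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Keep only the paragraph text; menus, ads, and navigation are noise.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

with open("corpus_gu.txt", "a", encoding="utf-8") as corpus:
    for line in paragraphs:
        if line:  # skip empty paragraphs
            corpus.write(line + "\n")
```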

In the end, they collected a corpus of 9 billion tokens, or words (2.3x the size of English Wikipedia, or 5,000x the length of the Mahābhārata), across 12 languages. Even within this corpus, however, a gap remained between more widely spoken languages like Hindi and Bengali and languages like Odia and Assamese. And one issue with using only news content is that it is not very helpful for developing speech-based systems. The reason? Code-mixing.

Understanding code-mixing

Code-mixing is the phenomenon where people use two or more languages at once. It is commonly observed in multilingual societies. Hinglish, the amalgamation of Hindi and English, is a great example of this. Whether it’s an aunty in the neighbourhood asking, “Studies kaisi chal rahi tumhari?” (“How are your studies going?”) or Bollywood songs that remind us that “Tu hai toh, I’ll be alright!” (“If you’re there, I’ll be alright!”), Indians tend to code-mix quite frequently when speaking.

Bollywood movies often use code-mixed language.

While we can understand code-mixed content because we know both languages, NLP systems are usually trained on one language at a time, which leads to poor performance on code-mixed data. Another issue is that if code-mixing is a spoken-language phenomenon, how do we collect such data? Dr Monojit Choudhury and his team at Microsoft Research Labs found an unusual source of such data: social media.

He explains, “On social media, people write the way they talk. The informal setting means that they tend to code-switch quite often. So, in the past 10 years, all the work we’ve done on code-mixed content, we could only do because we could access and use social media data.”
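To give a flavour of what working with such data involves, here is a toy sketch of token-level language identification for Romanised Hinglish, typically the first step in a code-mixing pipeline. The word lists are invented for illustration and are not how Microsoft Research’s systems actually work.

```python
# A toy sketch of token-level language identification for Romanised
# Hinglish. The tiny word lists are illustrative only; real systems learn
# these decisions from annotated social media data.
HINDI_WORDS = {"kaisi", "chal", "rahi", "tumhari", "hai", "toh", "tu"}
ENGLISH_WORDS = {"studies", "i'll", "be", "alright", "how", "are", "your"}

def tag_tokens(sentence):
    """Label each token as Hindi ('hi'), English ('en'), or unknown ('univ')."""
    tags = []
    for token in sentence.lower().split():
        if token in HINDI_WORDS:
            tags.append((token, "hi"))
        elif token in ENGLISH_WORDS:
            tags.append((token, "en"))
        else:
            tags.append((token, "univ"))
    return tags

print(tag_tokens("Studies kaisi chal rahi tumhari"))
# [('studies', 'en'), ('kaisi', 'hi'), ('chal', 'hi'), ('rahi', 'hi'), ('tumhari', 'hi')]
```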

Speech-based technologies like these hold the key to bringing India’s low-literacy population online. In addition, with voice search queries on the rise, India may soon become the world’s first and largest voice-first internet market. This growth in voice search has led several companies to invest in speech-to-text and voice-assistant technologies for vernacular languages, which may soon make it easier for Indian web users to navigate the digital medium.

But we still have a long way to go as most private companies focus on languages with the largest numbers of speakers. So, what does the future hold for these languages?

Future of NLP in India

The Indian NLP community is trying to close the resource gap between languages, and one newer technique offers hope: multilingual models, which are trained on several languages at once so that they can exploit what those languages share. For example, a model might combine data from Bengali, for which a relatively large corpus is available, with data from Odia and Assamese, languages that share many characteristics because their speakers have long been in close contact. “By training a large model that covers several languages together, it is possible to alleviate some of the problems associated with low-resource languages, but definitely not all,” explains Dr Khapra. “In the end, there is no substitute for data.”
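As a simplified picture of how related languages can share resources, the sketch below trains one subword vocabulary over several corpora using the sentencepiece library; the file names are placeholders, and this is only one ingredient of a multilingual model, not the whole training process.

```python
# A minimal sketch of the shared-vocabulary idea behind multilingual models.
# The corpus file names are placeholders; the point is that related languages
# (here Bengali, Odia, and Assamese) share one subword vocabulary, so patterns
# learned from the higher-resource language can transfer to its relatives.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="bengali.txt,odia.txt,assamese.txt",  # hypothetical monolingual corpora
    model_prefix="indic_shared",
    vocab_size=32000,
    character_coverage=0.9995,                  # keep rare Indic characters in the vocabulary
)

sp = spm.SentencePieceProcessor(model_file="indic_shared.model")
print(sp.encode("আমি বাংলায় কথা বলি", out_type=str))  # a Bengali sentence split into shared subwords
```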

But there is reason to be optimistic as more researchers share and compile databases in Indian languages to make life easier for users like Madhukanta Patel. The government, too, has stepped up and launched a data distribution portal to help with the development of tools and resources for language technology. With these developments, there’s hope that by the time India gains 1 billion internet users, its citizens will not just be able to access the internet but also navigate it with ease.
