How Affordable Tokenisation Will Increase AI Accessibility in Indian Languages

While innovations like ChatGPT are celebrated globally, the cost of deploying such models in non-English languages, especially Indian languages, poses a huge challenge.

The excitement surrounding ChatGPT’s capabilities is palpable, yet questions arise about its adaptability to diverse languages. In a country like India, with a mosaic of languages and dialects, the potential to harness ChatGPT’s power in native languages is an enticing prospect. However, the reality is that the path to achieving this is fraught with hurdles that extend beyond technical complexities.

Tokenisation, a fundamental step in natural language processing, lays the groundwork for the challenges ahead. In essence, tokenisation breaks text into smaller units, called tokens, that an AI model can process.

Tokenisation and its Challenges

The catch is that different languages, particularly those with intricate structures and scripts, demand varying numbers of tokens for the same content. English, which dominates the data most tokenisers are trained on, requires far fewer tokens than languages like Hindi, Kannada, or Telugu.
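As a rough illustration of this gap, the sketch below counts the tokens produced for the same short greeting in English, Hindi, and Kannada using the open-source tiktoken library; the sample sentences are illustrative, and the exact counts will vary with the tokeniser used.

```python
# A minimal sketch comparing token counts for the same phrase across scripts.
# Requires the open-source `tiktoken` package (pip install tiktoken); the
# sample sentences are illustrative and exact counts vary by tokeniser.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokeniser used by several OpenAI models

samples = {
    "English": "How are you today?",
    "Hindi":   "आप आज कैसे हैं?",
    "Kannada": "ನೀವು ಇಂದು ಹೇಗಿದ್ದೀರಿ?",
}

for language, text in samples.items():
    print(f"{language:8s} -> {len(enc.encode(text))} tokens")

# The English phrase typically tokenises into a handful of tokens, while the
# Hindi and Kannada versions need several times as many.
```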

The financial implications of these tokenisation disparities are even more pronounced. The cost of training and querying AI models hinges on token counts, alongside compute and cloud costs. Under OpenAI’s pricing structure, each token has a price attached to it. Herein lies the crux of the matter: languages like Hindi and Kannada require significantly more tokens for the same input, translating into higher costs. For instance, generating an article like this in English using the ‘Ada’ model costs around $1.2, whereas the same article in Hindi would incur approximately $8, and in Kannada, an astonishing $14.5.
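To see how token counts turn into bills, here is a hedged back-of-the-envelope calculation. The per-token price is a hypothetical placeholder rather than OpenAI’s actual published rate, and the token counts are simply back-calculated from the cost figures quoted above.

```python
# Back-of-the-envelope cost sketch. The per-1,000-token price is a hypothetical
# placeholder, and the token counts are back-calculated from the article's
# cost figures, not measured values.
PRICE_PER_1K_TOKENS_USD = 0.002

article_tokens = {
    "English": 600_000,    # ~ $1.20 at the assumed rate
    "Hindi": 4_000_000,    # ~ $8.00
    "Kannada": 7_250_000,  # ~ $14.50
}

english_cost = article_tokens["English"] / 1000 * PRICE_PER_1K_TOKENS_USD
for language, tokens in article_tokens.items():
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS_USD
    print(f"{language:8s}: {tokens:>9,} tokens -> ${cost:5.2f} "
          f"({cost / english_cost:.1f}x the English cost)")
```

The takeaway: with per-token billing, the cost ratio between languages is simply the ratio of their token counts, so a script that tokenises inefficiently pays a direct multiplier on every request.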

These inflated costs paint a challenging picture for developing AI models in non-English languages. To put things into perspective, training GPT-3 in Hindi could potentially cost around $32 million, a colossal figure compared to the original training cost. 

As we navigate the landscape of AI and language models, it’s imperative to acknowledge the hidden costs that language diversity presents. While the strides made in AI are commendable, the road to achieving seamless interactions in non-English languages is riddled with complexities. As we seek to bridge the language gap, we must also bridge the cost gap, ensuring that the marvels of technology are accessible to all, regardless of the language they speak.

A recent study highlights that server costs for language processing services like OpenAI’s vary significantly with the language used. English inputs and outputs are notably cheaper than other languages: Simplified Chinese is twice as expensive, Spanish costs 1.5 times more, and the Shan language is 15 times costlier.

Analyst Dylan Patel shared research from the University of Oxford revealing that processing a sentence written in Burmese with a large language model (LLM) required 198 tokens, while the same sentence in English needed only 17. These tokens represent the computational cost of accessing an LLM through APIs like OpenAI’s ChatGPT or Anthropic’s Claude 2. As a result, the Burmese sentence cost roughly 11 times more to process than its English counterpart.
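The multiplier quoted there follows directly from the token counts; a quick worked check:

```python
# Worked check of the quoted multiplier: under per-token billing, the cost
# ratio between two languages equals the ratio of their token counts.
burmese_tokens = 198
english_tokens = 17

print(f"Burmese vs English: {burmese_tokens / english_tokens:.1f}x")  # ~11.6x, roughly the cited 11x
```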

Help from the Government and Big Tech

This financial barrier is a formidable roadblock for the inclusive internet access envisioned through projects like the Government of India’s Bhashini and Google’s Vaani, hindering the democratisation of AI in India.

The increasing demand for training datasets in native languages has also led to the rise of Bangalore-based nonprofit organisations like Karya, which was earlier incubated within Microsoft and is dedicated to accelerating social mobility in India through AI training and upskilling. Karya’s ‘Labely’ tool performs transcription and annotation, and is designed for simplicity and ease of use in rural India. Through collaborations with local NGOs, Karya has sourced rural talent, and the organisation has completed over 30 million digital tasks.

The organisation’s linear structure ensures just compensation for workers, offering between $5 and $30 per hour based on skill sets. Karya is positioned to revolutionise the linguistic landscape in India and further the Indian AI ecosystem, while also collaborating with big tech companies and universities.

OpenAI also introduced its much-touted ChatGPT Android app in India, targeting a new user base with distinct preferences and needs. The launch aimed to gather user feedback to refine the AI’s responses for improved contextual relevance and cultural sensitivity.

Access to authentic, verbally transmitted knowledge, such as that shared among rural farming communities, adds value to ChatGPT’s training dataset. By interacting with ChatGPT in their local languages, farmers can share and gain knowledge. This collaborative process enables ChatGPT to better comprehend Indian farmers’ unique challenges and requirements.

This approach also positions ChatGPT as a repository of global knowledge. The unique prompts from Indian users generate new content that enriches its dataset, turning it into a comprehensive information hub.

The launch presents an opportunity for ChatGPT to become a widely used app on the Google Play Store, potentially affecting usage patterns of other apps like Google Search. With its mobile availability, user-friendly interface, and voice prompt feature, ChatGPT aims to offer convenience and flexibility to its users, solidifying its presence in the Indian market.

OpenAI Could Race Ahead

Additionally, Microsoft recently took on the task of using AI technology to preserve and empower endangered languages in India. Microsoft’s Project ELLORA focuses on languages with limited written resources and digital presence. While the Indian Constitution recognises 22 major languages, around 19,569 dialects are spoken as mother tongues, and about 192 of these are vulnerable or endangered according to UNESCO.

The project’s goal is to provide language communities with tools and resources to develop their own language technologies. Microsoft is working on the Gondi, Mundari, and Idu Mishmi languages, creating an open-source framework that allows these communities to build technologies themselves. It has developed the Interactive Neural Machine Translation (INMT) tool to aid human translators, with an offline mobile version called INMT-Lite.

In a broader context, while other projects like AI4Bharat and Syspin focus on the major languages recognised by the Constitution, Project ELLORA shifts its focus to languages not covered by these initiatives. This inclusivity could significantly contribute to promoting linguistic diversity and accessibility in India.

Ultimately, OpenAI’s Android app data collection and ELLORA could synergise effectively. OpenAI could benefit from ELLORA’s resources to refine its dataset, incorporating an extensive corpus of specialised, less widely used Indic languages. This strategic synergy could drive OpenAI’s efforts towards enhancing its language models’ capabilities and significantly reducing tokenisation costs in Indic languages.
