From obscurity to opportunity: The Rise of Low-Resource Languages in the LLM realm

March 1, 2024

Elkida Bazaj

Before the tech world became abuzz with the innovations surrounding Large Language Models like GPT, I embarked on a personal and unique exploration within the field of language technology, focusing on Albanian, my native language. In the realm of language technology, Albanian is somewhat of an overlooked treasure, seldom discussed or explored. This brings us to the concept of a "low-resource" language, which essentially means the language lacks substantial tools, data, and digital content. It's akin to attempting to bake a cake with only flour and sugar, missing crucial ingredients like eggs and milk. This scenario describes the struggle of integrating technology with a low-resource language, which, unlike widely-used languages such as English or Spanish, doesn't have enough resources like books, websites, or articles to facilitate proper computational understanding and usage.

Historically, working with technology for languages like Albanian has been exceptionally challenging. Imagine the complexity of teaching a language's vast composition of words, sentences, and grammar rules to a computer when you only have a very limited set of resources at your disposal. It's a bit like trying to learn how to fix a car with just a few pages of an instruction manual. Consequently, executing tasks such as translation or sentiment analysis with such constrained resources was fraught with difficulties, underscoring the significant barriers faced by lesser-known languages in the digital domain.

Then came The Transformers! (and no – we're not talking about cinematic robots)

Transformers revolutionized the way computers understand languages by introducing a new technology that better captures the context within text sequences. It marked a significant step forward by allowing for a more nuanced understanding and generation of text across a diverse linguistic landscape.

Building on this foundation, Large Language Models (LLMs) like GPT leveraged transformer technology to train on vast amounts of text from a wide array of languages and sources. This extensive training enables LLMs to grasp complex language structures and perform a variety of natural language processing tasks, even for languages with minimal available data. The breadth of their training means they can apply learned patterns to low-resource languages, offering a level of comprehension and utility previously out of reach.

Now there’s a big question: When it comes to low-resource languages, is it better to build one multi-language model or a language-specific one?

Large, multilingual models offer significant advantages, primarily due to their scalability and cost-effectiveness. By creating a single model that can process multiple languages, developers can save on resources and time that would otherwise be spent building and maintaining numerous separate models. These large models, trained on vast datasets encompassing a variety of languages, are equipped to handle a wide spectrum of linguistic tasks, from translation to sentiment analysis. However, their broad focus can sometimes be a drawback, especially when dealing with less common languages. The nuances, idiomatic expressions, and cultural specificities of smaller languages may be underrepresented or misunderstood, leading to less accurate or culturally insensitive outputs.

On the other hand, smaller, language-specific models present a different set of advantages and challenges. By focusing on a single language, these models can dive deeper into the linguistic and cultural nuances that make each language unique. They can be finely tuned to understand local dialects, slang, and idioms, providing a level of precision and understanding that large models may not achieve. However, this comes at a cost. Developing and maintaining a separate model for each language requires significantly more resources, both in terms of data and human expertise. Additionally, for low-resource languages, gathering the necessary data to train these specialized models can be a major hurdle, making them less feasible or sustainable in the long run.

For most scenarios, especially in contexts where Albanian is not the only language of interest, starting with a large multilingual model and then fine-tuning it with Albanian-specific data could offer a balanced approach. This way, you leverage the strengths of large models in understanding context and language structures while enhancing its performance for Albanian with targeted data, ensuring better accuracy and cultural relevance.

This leads me to advocate for the development of similar technologies tailored to my language. It’s very important to represent all languages in the digital age, as it levels the playing field, giving everyone, no matter their language, equal access to information and opportunities online. This is key for education, as it allows learners from all linguistic backgrounds to tap into global knowledge bases.

It also helps preserve cultural heritage, keeping traditions and histories alive through generations. It's about more than just words; it's about maintaining the identity and legacy of diverse communities. Embracing linguistic diversity in digital spaces sparks innovation too. Different languages bring unique viewpoints, enriching problem-solving and creative processes. This diversity can lead to breakthroughs that benefit everyone, everywhere.

It’s a foundation for a more informed, connected, and culturally rich digital world.