AI needs to be trained on culturally diverse datasets to avoid bias

By: Vered Shwartz – Assistant Professor, Computer Science, University of British Columbia

Large language models (LLMs), like OpenAI’s ChatGPT, are deep learning artificial intelligence programs. Their capabilities now span a wide range, from writing fluent essays to coding and creative writing. Millions of people worldwide use LLMs, and it would not be an exaggeration to say these technologies are transforming work, education and society.

LLMs are trained by reading massive amounts of texts and learning to recognize and mimic patterns in the data. This allows them to generate coherent and human-like text on virtually any topic.

Because the internet is still predominantly English — 59 per cent of all websites were in English as of January 2023 — LLMs are primarily trained on English text. In addition, the vast majority of the English text online comes from users based in the United States, home to 300 million English speakers.

Learning about the world from English texts written by U.S.-based web users, LLMs speak Standard American English and have a narrow western, North American, or even U.S.-centric, lens.

Model Bias

Impacts of Bias

Culture plays a significant role in shaping our communication styles and worldviews. Just as cross-cultural human interactions can lead to miscommunication, users from diverse cultures who interact with conversational AI tools may feel misunderstood and find the tools less useful.

To be better understood by AI tools, users may adapt their communication styles in a manner similar to how people learned to “Americanize” their foreign accents in order to operate personal assistants like Siri and Alexa.

As more people rely on LLMs to edit their writing, these tools are likely to homogenize how we write. Over time, LLMs run the risk of erasing cultural differences.

Decision-making and AI

AI is already in use as the backbone of various applications that make decisions affecting people’s lives, such as resume filtering, rental applications and social benefits applications.

For years, AI researchers have been warning that these models learn not only “good” statistical associations — such as considering experience as a desired property for a job candidate — but also “bad” statistical associations, such as considering women as less qualified for tech positions.

As LLMs are increasingly used for automating such processes, one can imagine that the North American bias learned by these models can result in discrimination against people from diverse cultures. Lack of cultural awareness may lead to AI perpetuating stereotypes and reinforcing societal inequalities.

LLMs for languages other than English

Developing LLMs for languages other than English is an important effort, and many such models exist. However, there are several reasons why this should be done in parallel to improving LLMs’ cultural awareness and sensitivity.

First, there is a huge population of English speakers outside of North America who are not represented by English LLMs. The same argument holds for other languages. A French-language model would represent the culture of France more than the cultures of other Francophone regions.

Training LLMs for regional dialects — which may capture finer-grained cultural differences — is not a feasible solution either. The quality of LLMs is based on the amount of data available, and as such, their quality would be worse for dialects with little online data.

Second, many users whose native language is not English still choose to use English LLMs. Significant breakthroughs in language technologies tend to start with English before they are applied to other languages. Even then, many languages, such as Welsh, Swahili and Bengali, don’t have enough text online to train high-quality models.

Due to either a lack of availability of LLMs in their native languages, or superior quality of the English LLMs, users from diverse countries and backgrounds may prefer to use English LLMs.

Ways Forward

Our research group at the University of British Columbia is working on enhancing LLMs with culturally diverse knowledge. Together with graduate student Mehar Bhatia, we trained an AI model on a collection of facts about traditions and concepts in diverse cultures.

Before reading these facts, the AI suggested that a person eating a Dutch baby (a type of German pancake) is “disgusting and mean,” and would feel guilty. After training, it said the person feels “full and satisfied.”

We are currently collecting a large-scale image captioning dataset with images from 60 cultures, which will help models learn, for instance, about types of breakfasts other than bacon and eggs. Our future research will go beyond teaching models that culturally diverse concepts exist, toward a better understanding of how people interpret the world through the lens of their cultures.

With AI tools becoming increasingly ubiquitous in society, it is imperative that they go beyond the dominant western and North American perspectives. Businesses and organizations across many sectors of the economy are adopting AI to automate manual processes and make better evidence-informed decisions using data. Making such tools more inclusive is crucial for the diverse population of Canada.
