Arabic is spoken by more than 440 million people worldwide and is the fourth most common language used on the Internet today. Yet the Arabic language is seriously underrepresented online.
Digital content in Arabic accounts for only 1 to 3 percent of all content online, according to a paper, “Digital Arabic Content,” produced by the International Telecommunication Union for a summit in 2012. A recent study by the W3Techs survey firm found that Arabic was the language of fewer than 1 percent of websites it surveyed.
Kareem Darwish, a senior scientist at the Arabic Language Technologies Group at the Qatar Computing Research Institute, in Doha, is part of a team working on tools that use artificial intelligence to change that.
The problem is twofold, Darwish says.
“A limited number of people have the intellectual capacity, the time and financial means to invest in providing high-quality content on a voluntary basis,” he said. “On the other hand, the lack of technological tools that account for the specific characteristics of the Arabic language makes it difficult to retrieve the content when it exists.”
Developing better tools for automatic processing of the Arabic language is not an easy task.
In Arabic, one “root,” a combination of several consonants in a fixed order, can generate numerous words with different meanings. The shape of a letter also varies with its position within the word. Moreover, symbols placed above or below the letters, called diacritics, change the pronunciation, the grammatical function, and sometimes even the meaning of a word. Search systems that do not account for these features return poor results.
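To make the search problem concrete, here is a minimal Python sketch of the kind of normalization a search tool might apply before comparing a query with stored text. It is an illustration only, not the method used by Darwish's team; the function name `normalize_arabic` is hypothetical, and the choice to drop every combining mark is a simplifying assumption. It relies only on Python's standard `unicodedata` module.

```python
import unicodedata

TATWEEL = "\u0640"  # elongation character, carries no meaning of its own


def normalize_arabic(text: str) -> str:
    """Hypothetical helper: fold letter shapes and strip diacritics."""
    # NFKC maps positional presentation forms (initial, medial, final)
    # back to the base letter.
    text = unicodedata.normalize("NFKC", text)
    # Assumption: drop all nonspacing marks (Unicode category Mn), which
    # covers the short-vowel diacritics such as fatha, damma and kasra.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.replace(TATWEEL, "")


# "kataba" (he wrote) and "kutub" (books) share the same three letters
# and differ only in their diacritics.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
kutub = "\u0643\u064F\u062A\u064F\u0628"
# The same three letters written with positional presentation forms.
shaped = "\uFEDB\uFE98\uFE90"

bare = "\u0643\u062A\u0628"
print(normalize_arabic(kataba) == bare)
print(normalize_arabic(kutub) == bare)
print(normalize_arabic(shaped) == bare)
```

After normalization all three strings match, so a search finds them regardless of how they were typed. The cost is that “he wrote” and “books” become indistinguishable, which is one reason simple normalization alone is not enough and richer language tools are needed.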