Multilingual Watson

Learning to understand the human gift of language

DJ McCloskey, IBM Watson Group
Machines use programming languages to at least appear to understand our human languages. IBM Watson is one of the most sophisticated, helping everyone from healthcare providers to sous chefs by using several programming languages and algorithms to read and comprehend natural language. But the system could only answer questions posed in English – until now.

Natural Language Processing architect D.J. McCloskey leads a team “teaching” Watson the fundamental mechanisms to comprender español (understand Spanish), entender português (understand Portuguese), 日本語を理解する (understand Japanese), and many other languages.

“Back in the late 1990s and early 2000s, the notion of a machine reading text was primarily defined by creating search indexes out of the words in text. We wanted to take it one step farther where ‘reading’ actually meant ‘understanding’ the text. So, we created LanguageWare in 2001, a technology that could automate fact extraction from the text,” D.J. said.

LanguageWare established a lightweight, optimized library of functions for processing natural language text, using a set of generalized structures and algorithms that captured the essence of language. Multilingual by design, this foundation gave LanguageWare a way to process text from any language so that a machine could understand the atomic sentence context, and build semantic understanding of that sentence in any language.

But D.J.’s team developed this sophisticated tooling with the mantra “involve the humans” in mind. Humans teach the machine about language, from word morphology (knowing that “run” and “running,” or “goose” and “geese,” are forms of the same word) to the transcribed knowledge of domain experts (learning from a subject’s human masters). The system can then accurately detect facts expressed in text, such as negative reactions to a drug, or an acquisition of one company by another. Today, Watson’s entire suite of cognitive capabilities uses and extends this tooling.
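
To make the morphology idea concrete, here is a minimal sketch of how a hand-taught dictionary plus a simple rule can map “running” to “run” and “geese” to “goose.” The table `IRREGULAR` and the function `lemmatize` are hypothetical names invented for this illustration; this is not LanguageWare's data or API, and the single suffix rule is far cruder than a real morphological analyzer.

```python
# Toy illustration only: hypothetical names, not LanguageWare's dictionaries or API.

# Irregular forms a human "teacher" would supply by hand.
IRREGULAR = {"geese": "goose", "ran": "run", "mice": "mouse"}


def lemmatize(word: str) -> str:
    """Map an inflected English word to a base form using toy rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # "running" -> "run": strip "ing" and undo a doubled final consonant.
    # Short words such as "sing" or "bring" are left alone.
    if w.endswith("ing") and len(w) > 5:
        stem = w[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return w


print(lemmatize("running"))  # run
print(lemmatize("geese"))    # goose
```

The rule deliberately over-simplifies (it would turn “telling” into “tel”), which is exactly the kind of gap that human-supplied dictionaries are meant to close.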

“And in Watson we have employed this capability to capture and apply precise knowledge from oncology experts, providing a way for human experts to teach the system at a deep level,” D.J. said.

Gluing it together with open architecture

These analytics and algorithms work together on top of Apache’s Unstructured Information Management Architecture, or UIMA (“you-ee-mah”). Its open architecture gave LanguageWare back in 2001, and Watson today, a way to combine their analytics with other complementary analytics to rapidly collaborate and prototype new ideas – a way to end up with a whole much greater than the sum of its parts, like the ideas from the Watson Mobile Challenge.
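
The idea of combining independent analytics over a shared document can be sketched roughly as follows. This is an assumption-laden illustration in plain Python, not the UIMA API: `Document`, `Annotation`, `tokenizer`, `drug_spotter`, and `run_pipeline` are all hypothetical names, and the one-word drug list is made up for the example.

```python
# Toy pipeline sketch: hypothetical names, not the Apache UIMA API.
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str   # e.g. "token" or "drug"
    begin: int  # start offset in the document text
    end: int    # end offset (exclusive)
    text: str   # the covered text


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# An annotator reads the shared document and adds its own annotations.
Annotator = Callable[[Document], None]


def tokenizer(doc: Document) -> None:
    """Mark every run of word characters as a token."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end(), m.group()))


def drug_spotter(doc: Document) -> None:
    """Build on the tokenizer's output: tag tokens found in a made-up drug list."""
    drugs = {"aspirin"}
    for a in list(doc.annotations):
        if a.kind == "token" and a.text.lower() in drugs:
            doc.annotations.append(Annotation("drug", a.begin, a.end, a.text))


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    """Run each annotator in order over the same document."""
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(Document("Aspirin caused a rash."), [tokenizer, drug_spotter])
print([a.text for a in doc.annotations if a.kind == "drug"])  # ['Aspirin']
```

Because every annotator only reads and writes the shared document, a new analytic can be dropped into the list without changing the others, which is the property the open architecture is credited with here.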

“I remember trying to convince people of the viability of machines understanding unstructured data, pre-Watson,” said D.J. “And then Watson (and UIMA) happened, and now people believe it can cure cancer, and make our tea!

“Amazingly enough, the power of this technology actually has potential to help do both – and more. Watson can’t cure cancer but we have real solutions where Watson Oncology Advisor helps consultant oncologists improve treatment of cancer patients. And a member of our team recently made Chef Watson’s Korean BBQ lemon cupcakes and they were awesome (with my tea)!”

Parsing languages (other than English)

Another ingredient in Watson’s NLP pantry is its parser. This set of code helps it analyze and understand the written English language down to the grammar and syntax level. For example, Watson’s parser lets the system know “who did what to whom,” as in “the boy kicked the ball.” So, a question about what was kicked will find “the ball” as the receiver of said action.

But not all sentences operate the same way or in the same order.

In English, subjects, verbs, and objects follow a fixed order: “John saw Mary.” John did the seeing and Mary was seen, in subject-verb-object order. In Hindi, however, it is “Jŏna mairī dēkhā,” or “John Mary saw,” a subject-object-verb order. And in Irish, the language of Ireland, where D.J. lives and works, the verb comes first, followed by the subject and then the object: “Chonaic John Máire,” which is “Saw John Mary.”
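
The three example sentences above can be turned into a toy illustration of why a parser needs language-specific knowledge to recover “who did what to whom.” The sketch below assumes a three-word sentence and a fixed word order per language; the table `ORDERS` and the function `assign_roles` are hypothetical names for this example, and real parsing in Watson is of course far more involved.

```python
# Toy illustration: hypothetical names, assumes exactly three words per sentence.

# Basic word order per language code: English, Hindi, Irish.
ORDERS = {"en": "SVO", "hi": "SOV", "ga": "VSO"}
ROLE_NAMES = {"S": "subject", "V": "verb", "O": "object"}


def assign_roles(words, lang):
    """Label each word as subject, verb, or object using the language's order."""
    if len(words) != 3:
        raise ValueError("this sketch only handles three-word sentences")
    order = ORDERS[lang]
    return {ROLE_NAMES[role]: word for role, word in zip(order, words)}


# The glossed examples from the text, word for word.
print(assign_roles(["John", "saw", "Mary"], "en"))
print(assign_roles(["John", "Mary", "saw"], "hi"))
print(assign_roles(["Saw", "John", "Mary"], "ga"))
```

In all three cases the subject comes out as “John” and the object as “Mary,” even though the words arrive in a different order each time.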

D.J.’s team chose Spanish, a widely spoken Romance language, as Watson’s next language to parse. The team hopes, though, to build a generic parser that, once plugged into UIMA, will allow Watson to understand any language.

“We are after the mechanics of language to get to a point where Watson works between languages in a pragmatic way, Watson going global!” D.J. said.

Now, with Watson’s capabilities on BlueMix available to developers all around the world, its ability to process local language just as well as English will be increasingly valuable. New mobile apps could exploit all of Watson’s natural language power on regionally relevant knowledge sources. Ultimately, Watson will be cross-lingual, meaning a question asked in one language can find its answer in another, and that answer is returned to the user translated back into his or her native or preferred language, making the knowledge of the world available to all regardless of language.

