Learning to understand the human gift of language

DJ McCloskey, IBM Watson Group
Machines use programming languages to at least appear to understand our human languages. IBM Watson is one of the most sophisticated, helping everyone from healthcare providers to sous chefs by using several programming languages and algorithms to read and comprehend natural language. But the system could only answer questions posed in English – until now.
Natural Language Processing architect D.J. McCloskey leads a team “teaching” Watson the fundamental mechanisms to comprender español (understand Spanish), entender português (understand Portuguese), 日本語を理解する (understand Japanese), and many other languages.
“Back in the late 1990s and early 2000s, the notion of a machine reading text was primarily defined by creating search indexes out of the words in the text. We wanted to take it one step further, where ‘reading’ actually meant ‘understanding’ the text. So, in 2001 we created LanguageWare, a technology that could automate fact extraction from text,” D.J. said.
LanguageWare established a lightweight, optimized library of functions for processing natural language text, using a set of generalized structures and algorithms that captured the essence of language. Multilingual by design, this foundation gave LanguageWare a way to process text from any language, so that a machine could understand the context of an individual sentence and build a semantic understanding of it in any language.
But D.J.’s team developed this sophisticated tooling with the mantra of “involve the humans” in mind. By letting humans teach the machines everything about language – from word morphology (knowing the difference between “run” and “running,” or “goose” and “geese”) to transcribing the knowledge of domain experts (learning from a subject’s human masters) – the system can accurately detect facts worded in text, such as negative reactions to a drug, or the acquisition of one company by another. Today, Watson’s entire suite of cognitive capabilities uses and extends this tooling.
“And in Watson we have employed this capability to capture and apply precise knowledge from oncology experts, providing a way for human experts to teach the system at a deep level,” D.J. said.
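LanguageWare itself isn’t publicly available, but the morphological side of that teaching – mapping “running” back to “run” and “geese” back to “goose” – is easy to sketch. Here is a minimal Python example using NLTK’s WordNet lemmatizer as a stand-in:

# Minimal sketch of the morphological normalization described above,
# using NLTK's WordNet lemmatizer as a stand-in for LanguageWare.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer needs
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"   (verb inflection)
print(lemmatizer.lemmatize("geese", pos="n"))    # -> "goose" (irregular plural)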
Gluing it together with open architecture
These analytics and algorithms work together on top of Apache’s Unstructured Information Management Architecture, or UIMA (“you-ee-mah”). Its open architecture gave LanguageWare back in 2001 – and gives Watson today – a way to combine its analytics with other complementary analytics, to collaborate rapidly and prototype new ideas, ending up with a whole much greater than the sum of its parts, like the ideas from the Watson Mobile Challenge.
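Apache UIMA itself is a Java framework, and its real API is more involved than fits here. The toy Python sketch below (with invented Document and annotator classes) only illustrates the pattern UIMA formalizes: independent annotators each add stand-off annotations to a shared document, so new analytics can be slotted into a pipeline without touching the others.

# Toy illustration of the UIMA-style annotator pattern (not the actual
# Apache UIMA Java API): each annotator reads the shared document and
# adds its own stand-off annotations, so analytics compose freely.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)  # (type, start, end) spans

class TokenAnnotator:
    def process(self, doc: Document) -> None:
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            doc.annotations.append(("Token", start, start + len(word)))
            pos = start + len(word)

class CompanyAnnotator:
    KNOWN = {"IBM", "Apache"}
    def process(self, doc: Document) -> None:
        for typ, start, end in list(doc.annotations):
            if typ == "Token" and doc.text[start:end] in self.KNOWN:
                doc.annotations.append(("Company", start, end))

# An aggregate "pipeline" is just an ordered list of annotators.
doc = Document("IBM built Watson on Apache UIMA")
for annotator in (TokenAnnotator(), CompanyAnnotator()):
    annotator.process(doc)
print([a for a in doc.annotations if a[0] == "Company"])
# -> [('Company', 0, 3), ('Company', 20, 26)]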
“I remember trying to convince people of the viability of machines understanding unstructured data, pre-Watson,” said D.J. “And then Watson (and UIMA) happened, and now people believe it can cure cancer and make our tea!

“Amazingly enough, the power of this technology actually has the potential to help do both – and more. Watson can’t cure cancer, but we have real solutions where Watson Oncology Advisor helps consultant oncologists improve the treatment of cancer patients. And a member of our team recently made Chef Watson’s Korean BBQ lemon cupcakes, and they were awesome (with my tea)!”
Parsing languages (other than English)
Another ingredient in Watson’s NLP pantry is its parser. This body of code helps the system analyze and understand written English down to the grammar and syntax level. For example, Watson’s parser lets the system know “who did what to whom,” as in “the boy kicked the ball.” So, a question about what was kicked will find “the ball” as the receiver of said action.
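Watson’s parser is proprietary, but the “who did what to whom” idea can be sketched with any dependency parser. Here it is in Python, using the open-source spaCy library as a stand-in (the en_core_web_sm model must be downloaded separately):

# Sketch of "who did what to whom" extraction with a dependency parser,
# using spaCy as a stand-in for Watson's proprietary parser.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("The boy kicked the ball.")

for token in doc:
    if token.dep_ == "nsubj":    # nominal subject: who did it
        print("who:", token.text)
    elif token.dep_ == "ROOT":   # main verb: what was done
        print("did:", token.text)
    elif token.dep_ == "dobj":   # direct object: what it was done to
        print("to what:", token.text)
# -> who: boy / did: kicked / to what: ball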
But not all sentences operate the same way or in the same order.
In English, subjects, verbs, and objects follow a certain order: “John saw Mary.” John did the seeing, while Mary was seen – a subject-verb-object order. In Hindi, however, it is “Jŏna mairī dēkhā,” or “John Mary saw” – a subject-object-verb order. And in Irish, spoken in Ireland where D.J. lives and works, the verb comes first: “Chonaic John Máire,” or “Saw John Mary” – a verb-subject-object order.
D.J.’s team chose Spanish, a widely spoken representative of the Romance languages, as Watson’s next language to parse, but it hopes to build a generic parser that, once plugged into UIMA, will allow Watson to understand any language.
“We are after the mechanics of language, to get to a point where Watson works between languages in a pragmatic way – Watson going global!” D.J. said.
Now, with Watson’s capabilities on Bluemix available to developers all around the world, its ability to process local languages just as well as English will be increasingly valuable. New mobile apps could exploit all of Watson’s natural language power on regionally relevant knowledge sources. Ultimately, Watson will be cross-lingual, meaning questions in one language can find answers in another and be returned to the user, translated back into his or her native or preferred language – making the knowledge of the world available to all, regardless of language.
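That cross-lingual flow is easy to picture as a pipeline: detect the question’s language, translate it into a language the system can search, find the answer, and translate the answer back. The Python sketch below is purely illustrative – the three helpers are hypothetical stubs, not real Watson APIs:

# Toy sketch of the cross-lingual flow described above. The helpers are
# hypothetical stubs standing in for language detection, machine
# translation, and the question-answering core.

def detect_language(text: str) -> str:
    """Stub: pretend every incoming question is Spanish."""
    return "es"

def translate(text: str, source: str, target: str) -> str:
    """Stub: a real system would call a machine-translation service."""
    canned = {("¿Quién vio a Mary?", "es", "en"): "Who saw Mary?",
              ("John saw Mary.", "en", "es"): "John vio a Mary."}
    return canned.get((text, source, target), text)

def find_answer(question_en: str) -> str:
    """Stub: a real system would query English-language knowledge sources."""
    return "John saw Mary."

def cross_lingual_answer(question: str) -> str:
    user_lang = detect_language(question)                       # e.g. "es"
    question_en = translate(question, source=user_lang, target="en")
    answer_en = find_answer(question_en)                        # answered in English
    return translate(answer_en, source="en", target=user_lang)  # back to the user's language

print(cross_lingual_answer("¿Quién vio a Mary?"))  # -> "John vio a Mary."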
More about IBM Watson
Labels: ibmwatson, natural language processing, uima