Statistical Modeling for Character Dialogue

Multiple approaches to generating NL dialogue responses are useful within Hanson Robotics at this stage, e.g.

  • Hand-coded dialogue responses, written in frameworks such as AIML or ChatScript
  • Deep-cognition-based dialogue, such as the approach taken with OpenCog
  • Statistical generation of dialogue responses based on corpora

This page discusses the last of these approaches. It is particularly interesting when creating dialogue for a robot character corresponding to a real-life person (e.g. PKD), for whom there is a significant corpus of text available to drive statistical modeling. But it can also be useful for purely synthetic characters, if it is feasible to generate a corpus of text corresponding to that character.

Supervised Learning Driven Statistical Response Generation

Suppose we want to generate text based on a corpus of appropriate text, in a statistical but generally conversationally appropriate way. (This is not as good as generating text based on real understanding, of course. But it’s useful as a stopgap, and as a way of leveraging corpora relevant to particular characters, e.g. corpora containing statements made by actual people like PKD…)

Software for doing this (based on Lucene and Solr) is described in Chapter 8 of “Taming Text” (readable for free here), and is available for download. André Senna has improved and tweaked this software into a medical QA system for Ben’s Telehealth project (outside the scope of Hanson Robotics, but also OSS code).

I suggest this methodology and code could be used to preprocess corpora related to specific characters, for loading into OpenCog to fuel character-appropriate dialogue.

Basically, in this approach, one identifies a set of N answer types. Then one partially marks up the available corpus, marking up, say, K examples of each of the N answer types.

One also marks up K1 examples of “questions” (or other cues) corresponding to each of the N answer types. Note that these cues need not occur in the same corpus, though they often will. (If the corpus contains mostly answers and few good questions/cues, the cues could be fabricated just for training the models.)

Then one uses these examples to train supervised learning models for the N answer types and the N corresponding cue types.
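
As a rough illustration, training the cue-type models might look like the following sketch in Python with scikit-learn (an assumed stack; the “Taming Text” code itself is Java/Lucene-based, and the cue texts and type labels below are invented):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Toy marked-up training data: cue texts paired with the answer type
  # they should evoke. A real corpus would supply K1 cues per answer type.
  cues = [
      "What do you dream about?",
      "Do you think a machine could ever be conscious?",
      "Tell me about your childhood.",
  ]
  cue_types = ["dreams", "machine-minds", "biography"]

  # TF-IDF features plus a simple linear classifier over the N cue types.
  cue_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
  cue_model.fit(cues, cue_types)

  print(cue_model.predict(["Could a robot really be conscious?"]))

An analogous model would be trained on the marked-up answer examples, so that unlabeled corpus passages can be tagged with the answer type they most plausibly belong to.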

When a cue corresponding to answer type T is observed, an answer judged by the supervised learning models to be of answer type T is pulled out of the corpus. Since there will generally be many such answers, the one with the most semantic similarity to the cue (e.g. the greatest overlap of words) is returned.
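
For the word-overlap ranking, something as simple as the following would do (a sketch; candidates is a hypothetical list holding the corpus answers already classified as type T):

  def word_overlap(a, b):
      # Jaccard-style overlap between the two texts' word sets.
      wa, wb = set(a.lower().split()), set(b.lower().split())
      return len(wa & wb) / max(1, len(wa | wb))

  def best_answer(cue, candidates):
      # Return the type-T answer sharing the most words with the cue.
      return max(candidates, key=lambda ans: word_overlap(cue, ans))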

Further, this approach could be layered on top of LSI-based corpus analysis... LSI tells you the theme of a sentence or passage; this sort of analysis tells you whether a passage has the right sort of structure to fit into the "conversational slot" comprising a certain type of response... So LSI based similarity could be used for fancier semantics-based ranking of the results (instead of just looking at overlapping words).
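
A sketch of that fancier ranking, again assuming scikit-learn, where TruncatedSVD over TF-IDF vectors is the usual LSA/LSI construction (candidate_answers is hypothetical, and n_components must stay below the corpus vocabulary size):

  from sklearn.decomposition import TruncatedSVD
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity
  from sklearn.pipeline import make_pipeline

  # Project the type-T candidate answers into a low-dimensional "theme" space.
  lsi = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))
  answer_vecs = lsi.fit_transform(candidate_answers)

  def rank_by_lsi(cue):
      # Cosine similarity in LSI space, rather than raw word overlap.
      cue_vec = lsi.transform([cue])
      scores = cosine_similarity(cue_vec, answer_vecs)[0]
      return sorted(zip(candidate_answers, scores), key=lambda p: -p[1])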

Integration w/ OpenCog

The results of this sort of supervised learning analysis could be fed into OpenCog; i.e., OpenCog would then get a bunch of utterances labeled with the "response category" they fit into, plus some GroundedSchemaNodes referencing models for recognizing "cue types".
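
Very roughly, that handoff might look like the sketch below, using OpenCog's Python bindings (hedged assumptions: the atom types are taken from the standard AtomSpace type set, and the "py:classify_cue_type" schema name is invented):

  from opencog.atomspace import AtomSpace, types

  atomspace = AtomSpace()

  def load_labeled_utterance(text, response_category):
      # Store an utterance and link it to the response category it fits into.
      utterance = atomspace.add_node(types.SentenceNode, text)
      category = atomspace.add_node(types.ConceptNode, response_category)
      atomspace.add_link(types.InheritanceLink, [utterance, category])
      return utterance

  # A GroundedSchemaNode naming an external cue-type recognizer, so OpenCog
  # processes can invoke the trained model on incoming cues.
  cue_recognizer = atomspace.add_node(
      types.GroundedSchemaNode, "py:classify_cue_type")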

Trying it out for various characters

I suggest we try this first for the PKD character — we can identify N answer types and then mark up a corpus of PKD texts accordingly.

Maybe we can try this for Einstein as well, based on interviews and other writings of his…

After refining the method this way, we can then perhaps write custom corpora for other characters (e.g. Sophia, Alice), writing the texts with the answer types in mind…