Perception tasks

From Hanson Robotics Wiki

The perceptions that the robot must be able to make focus primarily on detecting and interacting with humans in its immediate environment. See face tracking requirements for the task of discovering faces in the robot's visual field.

Perceptions are used to drive attention allocation and behavior. A hierarchy of increasingly more sophisticated implementations is described below.

Face tracking architecture diagram

Basic Mechanical Indication

Mechanically indicate tracked percepts in the robot's environment using the robot's head, eye and hand motions.

  • Machine perception of faces and objects in the robot's environment, mapping the percepts in 3D into the model space and the motion control framework.
  • Build a robot and a virtual rig with motion control scripts that achieve nice tracking behaviour.
  • Attention regulation, via scripted rules that decide what to look at and when, to achieve tracking that looks alive, natural, and meaningful.
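
The scripted attention rules above could be sketched as follows. This is a minimal illustration, not the project's implementation: the `Percept` fields, the `PRIORITY` weights, and the `MIN_DWELL_S` dwell time are all hypothetical. The idea is that faces outrank generic saliency events, and that a minimum dwell time keeps the gaze from jumping between targets on every frame.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rule parameters; real values would be tuned on the robot.
PRIORITY = {"face": 2.0, "saliency": 1.0}
MIN_DWELL_S = 1.0  # keep looking at a visible target at least this long


@dataclass
class Percept:
    percept_id: str
    kind: str        # "face" or "saliency"
    strength: float  # detector confidence or saliency, in 0..1


def score(percept: Percept) -> float:
    """Rule: weight the percept strength by the priority of its kind."""
    return PRIORITY.get(percept.kind, 0.0) * percept.strength


class AttentionSelector:
    """Chooses which percept to look at, with a minimum dwell time."""

    def __init__(self) -> None:
        self.target_id: Optional[str] = None
        self.since = 0.0

    def select(self, percepts: List[Percept], now: float) -> Optional[str]:
        if not percepts:
            self.target_id = None
            return None
        best = max(percepts, key=score)
        visible = {p.percept_id for p in percepts}
        # Hold the current target while it is visible and the dwell
        # time has not yet elapsed.
        holding = (self.target_id in visible
                   and now - self.since < MIN_DWELL_S)
        if not holding and best.percept_id != self.target_id:
            self.target_id = best.percept_id
            self.since = now
        return self.target_id
```

The selected identifier would then be handed to the motion control scripts, which turn it into head and eye motion.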

Perception tasks

The following is a list of tasks that need to be accomplished:

  • Integrate (create ROS bindings for) tracking: face tracking, saliency tracking, OpenNI (backlog: TLD Predator, audio localization).
  • Track multiple objects (faces and saliency events)
  • Use multiple cameras (eye-for-fovea, and body-for-peripheral).
  • Map multiple simultaneous percepts into a coherent 3D spatial model
  • Use the blender/ROS/gazebo model of the neck and eyes to estimate the 3D position of the eye-cam percepts
  • Fuse various perceptual streams into a 3D map. Use face size to estimate distance for now; later maybe use Kinect for more accurate distance estimation and to provide a human skeleton to "hang" various percepts on. Use filtering (exponential, alpha-beta or Kalman) to smooth noisy data.
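
As an illustration of the face-size distance estimate and the filtering mentioned above, here is a minimal sketch. It assumes a pinhole camera model; the focal length `FOCAL_LENGTH_PX` and the average face width `FACE_WIDTH_M` are placeholder values, not calibrated numbers from the robot, and the function and class names are hypothetical.

```python
from dataclasses import dataclass

# Assumed values for illustration only; calibrate for the real camera.
FOCAL_LENGTH_PX = 600.0  # focal length in pixels (assumption)
FACE_WIDTH_M = 0.15      # typical adult face width in meters (assumption)


def face_distance(face_width_px: float) -> float:
    """Pinhole estimate: distance = focal_length * real_width / pixel_width."""
    if face_width_px <= 0:
        raise ValueError("face width must be positive")
    return FOCAL_LENGTH_PX * FACE_WIDTH_M / face_width_px


def face_position(cx_px, cy_px, face_width_px, image_w, image_h):
    """Back-project the face center to (x, y, z) in the camera frame.

    z points forward, x right, y down (the usual camera convention).
    The principal point is assumed to be the image center.
    """
    z = face_distance(face_width_px)
    x = (cx_px - image_w / 2.0) * z / FOCAL_LENGTH_PX
    y = (cy_px - image_h / 2.0) * z / FOCAL_LENGTH_PX
    return (x, y, z)


@dataclass
class AlphaBetaFilter:
    """Alpha-beta filter for one coordinate (value plus rate of change)."""
    alpha: float = 0.5
    beta: float = 0.1
    x: float = 0.0  # estimated value
    v: float = 0.0  # estimated rate of change
    initialized: bool = False

    def update(self, z: float, dt: float) -> float:
        if not self.initialized:
            self.x, self.v, self.initialized = z, 0.0, True
            return self.x
        predicted = self.x + self.v * dt
        residual = z - predicted
        self.x = predicted + self.alpha * residual
        self.v = self.v + (self.beta / dt) * residual
        return self.x
```

One filter instance per coordinate of each tracked percept is enough for a first pass. A Kalman filter could replace it later if the noise characteristics of the detectors are measured.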


Integrate more perception software

  • Integrate face recognition (biometric ID), audio biometric ID, audio affect detection, facial expression detection, hand gesture detection, object detection, proximity detection, and spatial mapping. Usefully fuse these perceptual streams into social interactions, use cases, and robot understanding.
  • Save perception data and logs into a semantic database, including "person objects" that build models of people, their relationships, and their stories (events over a timeline).
  • Drive TLD Predator with bounding boxes from face tracking, OpenNI, face recognition, and/or saliency tracking, and vice versa, to achieve more robust face tracking and faster general saliency tracking. Use multiple instances of TLD Predator, and save their state with a semantic index (tagged as face, saliency, etc.) so that the TLD models can be brought up again when seeing that person or thing.
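
The "person objects" and the semantic index for saved tracker state could be modeled roughly as below. This is a sketch under assumptions: the class names, fields, and the in-memory `PersonStore` are hypothetical stand-ins for whatever database is eventually chosen, and `tracker_models` only stores a path to saved tracker state rather than calling any TLD API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    timestamp: float
    description: str


@dataclass
class Person:
    person_id: str
    name: Optional[str] = None
    # Semantic tag (e.g. "face") -> path of a saved tracker model.
    tracker_models: Dict[str, str] = field(default_factory=dict)
    # Other person's id -> relationship label.
    relationships: Dict[str, str] = field(default_factory=dict)
    # The person's "story": events kept in time order.
    timeline: List[Event] = field(default_factory=list)

    def add_event(self, timestamp: float, description: str) -> None:
        self.timeline.append(Event(timestamp, description))
        self.timeline.sort(key=lambda e: e.timestamp)


class PersonStore:
    """In-memory stand-in for the semantic database."""

    def __init__(self) -> None:
        self._people: Dict[str, Person] = {}

    def get_or_create(self, person_id: str) -> Person:
        if person_id not in self._people:
            self._people[person_id] = Person(person_id)
        return self._people[person_id]

    def find_by_name(self, name: str) -> List[Person]:
        return [p for p in self._people.values() if p.name == name]
```

When a known person reappears, the stored model path for the matching tag can be used to reload that person's tracker state instead of training from scratch.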

More thoughts on the Perception

  • Use Fast Saliency Tracker (FST) to track faces, blobs, and motion.
  • Use the FST bounding boxes to track the faces and saliency events better.
    • Must interface TLD to take the bounding box from other software instead of its UI
    • Must instantiate multiple instances of TLD, to track multiple faces and saliency events.
    • Must save the TLD state/trained models into a database, indexed intelligently (with descriptive metadata) for later use.
      • The database should be a Person Objects database, treating people like instances of a class of objects. Each person has a name, face data, body data, and various other kinds of facts that can be referenced later.
  • Use face-recognition (biometric ID) software to intelligently tag the face events (and save these in the database)
    • Train the face recognizer using a few photos of a given face taken from bounding-box snapshots. Automated tests should validate the quality of the snapshots: the samples should not be too blurry, and should include at least some frontal views of the face.
    • Improve the recognition by tracking features near a recognized face (shirt blobs, to be specific; maybe hat blobs, arm blobs, and hand blobs).
    • Use some conversational software to validate whether the face is new or known, to name the face, and to tag the face accordingly in the database.
  • Author rules for tagging saliency events
  • Fuse the data from multiple camera sources:
    • Consider Octomap for the 3D occupancy grid
    • Eye-cam data (narrow angle, moving a lot). Use the Blender inverse kinematics model to calculate the 3D orientation of the eye cam(s), in order to map their data and percepts into the 3D occupancy grid.
    • Body-cam data (wide angle, moving less). Map the data and percepts from the wide-angle camera mounted in the torso of the robot.
    • Integrate Kinect vision results, to see a skeleton.
    • Fuse the human-data using rules: for example, if two people overlap by more than 25%, then fuse them into one person. Etc.
  • Map the percepts into BGE/Morse to control the virtual rig:
    • To make people models and saliency fields appear and move around in Blender, with probabilistic "heat" or likelihood ratings.
    • These will then drive the Blender animation rules to drive the robotic tracking behavior.
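
The fusion rule mentioned in the list above (merge two person detections when they overlap by more than 25%) can be sketched as follows. The source does not say how overlap is measured, so this example assumes it is the intersection area divided by the area of the smaller box; the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap_fraction(a: Box, b: Box) -> float:
    """Intersection area over the smaller box's area (an assumed metric)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (iw * ih) / smaller if smaller > 0 else 0.0


def merge(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def fuse_detections(boxes: List[Box], threshold: float = 0.25) -> List[Box]:
    """Greedily merge boxes whose overlap exceeds the threshold."""
    fused: List[Box] = []
    for box in boxes:
        current = box
        merged = True
        # Re-scan after every merge, since the grown box may now
        # overlap boxes it did not overlap before.
        while merged:
            merged = False
            for i, other in enumerate(fused):
                if overlap_fraction(current, other) > threshold:
                    current = merge(current, fused.pop(i))
                    merged = True
                    break
        fused.append(current)
    return fused
```

The same rule could be applied to boxes from different sources (eye cam, body cam, Kinect) once they are projected into a common frame.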

The task involves using some of the open source vision processing software mentioned in the relevant software page, and integrating it into a single software component that takes camera output and produces an appropriately annotated 3D occupancy grid.
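
To make "annotated 3D occupancy grid" concrete, here is a minimal pure-Python sketch of a sparse voxel grid whose cells carry labels with likelihoods. It is only an illustration of the data structure; it does not use or imitate the API of any existing occupancy-grid library, and the class and method names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple


class AnnotatedGrid:
    """Sparse voxel grid; each occupied cell maps labels to likelihoods."""

    def __init__(self, resolution: float = 0.1) -> None:
        self.resolution = resolution  # voxel edge length in meters
        self.cells = defaultdict(dict)  # voxel index -> {label: likelihood}

    def key(self, x: float, y: float, z: float) -> Tuple[int, int, int]:
        """Voxel index containing the point (x, y, z)."""
        r = self.resolution
        return (math.floor(x / r), math.floor(y / r), math.floor(z / r))

    def add_percept(self, position, label: str, likelihood: float) -> None:
        """Annotate the voxel at `position`, keeping the highest likelihood."""
        cell = self.cells[self.key(*position)]
        cell[label] = max(cell.get(label, 0.0), likelihood)

    def labels_at(self, position) -> Dict[str, float]:
        """Copy of the annotations for the voxel at `position`."""
        return dict(self.cells.get(self.key(*position), {}))
```

Each perception tool would write its percepts into the same grid, so that a face detection and a saliency event at the same location end up as two labels on one cell.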

The goal is to use these various software tools together, getting them to all output information to the same 3D occupancy grid. For the occupancy grid we can use this software (OctoMap):