Some thoughts on an extensible robotics/AI architecture, leveraging ROS and connecting to OpenCog, have been written up elsewhere.
This page gathers more specific thoughts on how to make a perception-synthesizer architecture fulfilling those general ideas, and using OpenCog as a key tool (but not the only tool). The initial author of the page was Ben Goertzel.
- 1 Perception Synthesizer: Introduction to this High-Level Specification Page
- 2 Particulars
- 3 Incremental Development
- 4 More work: Object and Event Identification
Perception Synthesizer: Introduction to this High-Level Specification Page
This document outlines what a first-pass Perception Synthesizer (PS) component for OpenCog might look like. The goal is to make something useful for Hanson Robotics applications of OpenCog in the near term, and also valuable as the initial version of a more ambitious PS that can be broadly useful for OpenCog in the long term.
The specific use cases that were kept in mind when writing this document are:
- A Hanson Robotics head with a combination of cameras, e.g. face-centered and torso-centered, and optionally an off-board depth camera. Also optionally, sound-source localization.
- A toy RoboSapien robot, with a torso camera and a head camera, and a small mic array with crude sound source localization
- A small toy robot with a crude camera/mic on board, and sometime access to a better camera/mic on the user’s smartphone
- Minecraft-playing agents controlled by OpenCog
- OpenCog-controlled agents in a Unity3D game world
In the robotics use cases, the working assumption is that visual and auditory data processing are done by a combination of external tools, which can be wrapped up in ROS nodes. Mostly these will be open source tools but some proprietary tools might be part of the mix.
It is also considered important that perceptual data from multiple different sources (e.g. different robots) should be able to co-exist in the same Atomspace without confusion.
This is a first draft document (first put together July 17, 2015) and it’s quite possible some important stuff has been left out.
Purpose of the PS
The PS is "simply" a component that aggregates/fuses inputs from multiple sensory-processing subsystems into a unified set of (interlinked) Atomspace-based representations of the world around the robot.
This ultimately also needs to be representable as a 3-space mapping of the perceptual surround (objects, people) for the purposes of location and navigation (and rendered in an operator view for monitoring and troubleshooting purposes).
The data types of greatest immediate concern are video, audio and BVR location-sensing data (IR, LIDAR, etc.). Soon we also want to handle kinesthetic sensory data, both internal (gyroscope, accelerometer, motor failures, etc.) and external (pressure on a robot's skin -- e.g., if someone is holding the robot's mouth shut, does it feel it?).
Basic Design Idea
After extensive discussions, we have decided to use a stripped-down Atomspace for the PS.
Not all of the indexing normally used in the Atomspace may be needed for this use case. On the other hand, this special Atomspace may come along with some particular processes not needed for general Atomspaces.
Multiple Sensoria and Spacetime Domains
In a large, complex OpenCog system connected to multiple different bodies or worlds simultaneously, there might be many different PS Atomspaces. For instance, if the same OpenCog system was controlling a robot body and playing Minecraft at the same time, it would likely be best off with two different PS Atomspaces for the two different embodiments. We can assume that each of the "spacetime domains" involved with an OpenCog system at a certain point in time has a name, represented as a string, e.g. "RoboSapien_123" or "Minecraft_11", etc.
Note that two different sensors associated with the same spacetime domain (e.g. a head camera and eye camera on the same robot, or first and third person view in a video game) will generally correspond to the same PS. But there could be exceptions to this, e.g. an OpenCog controlling several robots acting in the same environment, where the robots' sensoria are overlapping little enough on a moment-by-moment basis that it seems best to synthesize each robot's perceptions separately on a real-time basis, and reconcile them later in the background (in the main Atomspace not the robots' PS's). On the other hand, if one has a fixed-position Kinect plus some sensors on a mobile robot, all looking at the same world, probably one wants to fuse the Kinect output and the robot camera output in the same PS, as there the role of the Kinect is to help get a clearer version of that robot's view of the world.
Where do Perceptions Come From?
For the purpose of this initial design document, we can assume that perceptions come into OpenCog via ROS messages, i.e. into a ROS node associated with OpenCog. The ROS node then, internally, pops these messages off its queue and, one by one, executes a method associated with each message as it gets popped off. For perception messages, the method executed involves creating some Atoms in an Atomspace. In a multi-Atomspace scenario, these methods can indicate WHICH spacetime domain they refer to (so that the Atoms being created can get created in the right PS, corresponding to the appropriate spacetime domain).
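As a rough illustration of this dispatch logic, here is a minimal Python sketch. The message class, handler table, and the list-based stand-in for an Atomspace are all hypothetical; a real implementation would live inside a ROS node callback and create actual Atoms.

```python
# Hypothetical sketch: route incoming perception messages to per-domain PS Atomspaces.
from collections import defaultdict, deque

class PerceptionMessage:
    def __init__(self, domain, kind, payload):
        self.domain = domain    # e.g. "RoboSapien_123" or "Minecraft_11"
        self.kind = kind        # e.g. "object", "sound"
        self.payload = payload

# One PS "Atomspace" (here just a list of atom tuples) per spacetime domain.
ps_atomspaces = defaultdict(list)
queue = deque()

def handle_object(ps, payload):
    ps.append(("ObjectNode", payload["id"]))

def handle_sound(ps, payload):
    ps.append(("SoundNode", payload["id"]))

HANDLERS = {"object": handle_object, "sound": handle_sound}

def pump():
    """Pop messages one by one and run the handler for each, against the PS
    corresponding to the message's spacetime domain."""
    while queue:
        msg = queue.popleft()
        HANDLERS[msg.kind](ps_atomspaces[msg.domain], msg.payload)

queue.append(PerceptionMessage("RoboSapien_123", "object", {"id": "12345"}))
queue.append(PerceptionMessage("Minecraft_11", "sound", {"id": "456"}))
pump()
```

The key point the sketch illustrates is that domain routing is decided per message, so perceptual data from multiple robots or worlds never mixes within a single PS.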
What kinds of perceptions will we actually be dealing with in the near future? For Hanson Robotics applications, we will be dealing with perceptions of people, objects, and sounds, basically. Here,
- People and objects can be represented via ObjectNodes
- Sounds can be represented via SoundNodes
For ObjectNodes identified as a person or a chair, we might have
    InheritanceLink
        ObjectNode "12345"
        ConceptNode "Person"

    InheritanceLink
        ObjectNode "666"
        ConceptNode "chair"
For a SoundNode identified as a screech we might have
    InheritanceLink
        SoundNode "456"
        ConceptNode "screech"
For a SoundNode interpreted via a Speech-to-text tool as “Howdy partner”, we might have
    EvaluationLink
        PredicateNode "Auditory realization"
        UtteranceNode "456"
        SoundNode "888"

    EvaluationLink
        PredicateNode "Textual realization"
        UtteranceNode "456"
        SentenceNode "Howdy partner"
Representing and Storing Space and Time Information
When a perception comes into OpenCog, it may contain time-stamps corresponding to certain events or observations. These can be recorded in an Atomspace like
    AtTimeLink
        TimeNode "1245566"           ; time stamp
        TimeDomainNode "Domain123"
        Node "blahblahblah"
(Note: the use of TimeDomainNode here is new; it was implemented in July 2015 by GSoC student Yishan Chen.)
The desired logic would seem to be:
- When an AtTimeLink is created, an entry in the TimeServer is created correspondingly
- When an AtTimeLink is removed from the Atomspace, the corresponding entry from the TimeServer is not removed (the reason being that keeping records in the TimeServer is much cheaper than keeping a bunch of corresponding Atoms around).
I’m not sure methods for persisting the TimeServer to the BackingStore currently exist; if not, they should be created.
Similar comments pertain to spatial location, where the relevant construct CURRENTLY looks like
    AtLocationLink
        ListLink
            NumberNode "123"          ; x coordinate
            NumberNode "666"          ; y coordinate
            NumberNode "322"          ; z coordinate
        SpaceDomainNode "Domain123"
        Node "blahblahblah"
However, it seems to me that to deal with robotics properly, one needs to modify the AtLocationLink and related construct somewhat. The current approach is OK for video game worlds but the uncertainty associated with robotics may entail slightly different structures.
I would suggest a structure such as
    AtLocationLink
        LocationNode "333"
        SpaceDomainNode "Domain123"
        Node "blahblahblah"
where a LocationNode can represent a location in one of several possible coordinate systems. For instance, we could have
    InheritanceLink
        CoordinateSystemNode "Euclidean_Domain_123"
        ConceptNode "Euclidean"

    EvaluationLink
        PredicateNode "origin"
        ListLink
            CoordinateSystemNode "Euclidean_Domain_123"
            LocationNode "333"

    EvaluationLink
        PredicateNode "scale"
        ListLink
            CoordinateSystemNode "Euclidean_Domain_123"
            NumberNode "10"
            ConceptNode "meters"

    EvaluationLink
        PredicateNode "xcoord"
        ListLink
            CoordinateSystemNode "Euclidean_Domain_123"
            LocationNode "333"
            NumberNode "4"

    EvaluationLink
        PredicateNode "ycoord"
        ListLink
            CoordinateSystemNode "Euclidean_Domain_123"
            LocationNode "333"
            NumberNode "5"

    EvaluationLink
        PredicateNode "zcoord"
        ListLink
            CoordinateSystemNode "Euclidean_Domain_123"
            LocationNode "333"
            NumberNode "6"
    InheritanceLink
        CoordinateSystemNode "Spherical_Domain_123"
        ConceptNode "Spherical"

    EvaluationLink
        PredicateNode "origin"
        ListLink
            CoordinateSystemNode "Spherical_Domain_123"
            LocationNode "444"

    EvaluationLink
        PredicateNode "r"
        ListLink
            LocationNode "444"
            NumberNode "4"

    EvaluationLink
        PredicateNode "theta"
        ListLink
            LocationNode "444"
            NumberNode "2.2"

    EvaluationLink
        PredicateNode "phi"
        ListLink
            LocationNode "444"
            NumberNode "-1.3"
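Within a single spatial view, converting between such coordinate systems is routine trigonometry. Here is a small sketch, assuming the physics convention (theta = polar angle, phi = azimuth); which convention a given sensor actually uses is an assumption that would need to be pinned down per device.

```python
# Sketch: spherical <-> Euclidean conversion within one spatial view.
# Assumes physics convention: theta = polar angle from +z, phi = azimuth in xy-plane.
import math

def spherical_to_euclidean(r, theta, phi):
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x, y, z)

def euclidean_to_spherical(x, y, z):
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0
    phi = math.atan2(y, x)
    return (r, theta, phi)
```

For example, the LocationNode "444" above, with (r, theta, phi) = (4, 2.2, -1.3), could be re-expressed in the Euclidean system of the same view by one call to `spherical_to_euclidean`.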
Note I have used above the notion of the “scale” of a coordinate system. This factor will be used e.g. in calculating distances – “near” and “far” may be calculated relative to this scale. For instance, in the case of an indoor robot, the scale factor might be set equal to the distance across the room.
An important point is that we may have multiple views of the same spatial environment, coming from different sensors. The relation between the views coming from different sensors may or may not be easy to reconcile. This is similar to humans, in that some of our senses are more accurate than others, and a hierarchy of refinement is used wherein if we smell or hear something we try to see or touch it to verify its position. We will want a similar weighted preference for robot senses based on their respective accuracies.
To do this we want to support different “spatial views” corresponding to the same observed world (the same “space domain”). We might have
    ReferenceLink
        SpatialViewNode "RoboSapien_123_belly_camera"
        SpaceDomainNode "RoboSapien_123"

    ReferenceLink
        SpatialViewNode "RoboSapien_123_head_camera"
        SpaceDomainNode "RoboSapien_123"
indicating that the two different spatial views in question correspond to the same world (the same spatial domain). We may then want to say
    AtLocationLink
        LocationNode "333"
        SpatialViewNode "RoboSapien_123_head_camera"
        Node "blahblahblah"
Note that the distinction between different spatial views is not the same as the distinction between different coordinate systems. Within a fixed spatial view, transforming a location among different coordinate systems is just a matter of some standard mathematics. On the other hand, transforming a location from one spatial view to another may be much subtler; e.g. one may not confidently know the spatial relations between the origin of the coordinate system in one spatial view and the origin of the coordinate system in another spatial view (e.g. if one view is from a head-based sensor and another is from an eye-based sensor, one may not have good data on the precise spatial relationship between the head and the eye). We should endeavor to minimize this problem by measuring and calibrating all sensors, and furthermore, through experimentation, determine the quality hierarchy of these devices, allowing the robot to make refinements accordingly.
A Space domain may also (optionally) have a standard reference view, in which case one could say e.g.
    AtLocationLink
        ListLink
            NumberNode "123"          ; x coordinate
            NumberNode "666"          ; y coordinate
            NumberNode "322"          ; z coordinate
        SpaceDomainNode "RoboSapien_123"
        Node "blahblahblah"
Internally, the SpaceServer maintains an octree representation of spatial locations for each 3D spatial view, which allows rapid lookup of spatial relations between entities (what is near X? what is behind X? what is on the spatial path from X to Y?). Octrees are a standard representation for spatial data in game AI and robotics. The octree used inside the 3DspaceMap object that lives inside the SpaceServer in OpenCog used to be a custom octree, but as of July 2015 it is an OctoMap (a third-party software component).
(One reason for keeping SpatialViewNode and TimeDomainNode separate in the above examples, is that one could have domains with space and no time, or time and no space. But one could also just merge the concepts into SpacetimeDomainNode without causing problems, it would seem.)
Forgetting and Memory Transfer
The PS is a specialized “short term memory buffer” – it’s not intended to keep information for a long time. It’s supposed to take in sensory data, fuse it as needed, pass the important stuff along to other memory stores and forget the rest.
To achieve this efficiently, we suggest the PS should implement a simplified version of attention allocation. Atoms in the PS can be assigned STI (short term importance) and LTI (long term importance) values, but these can be managed in a simpler way than in the main Atomspace.
Initially, for simplicity, an Atom can be assigned STI=c1 and LTI=c2 (just some low default values) when it’s entered into the PS. Each Atom entered into the PS also has a time-stamp, recorded in the TimeServer.
The highly simplistic attention allocation suggested for the PS is:
- Every s seconds, the TimeServer is inspected by an Inspector process.
- According to the action of the Inspector process,
- any Atoms in the PS created more than C*s seconds in the past are removed from the PS
- any Atoms in the PS, which have LTI>c2 and have been in the PS at least c*s seconds (where c*s is very small), are moved into the main Atomspace.
(Note, in a more advanced version, instead of deleting the Atoms in the PS with low LTI, these Atoms might get put into some other “experience database” Atomspace, to be saved to disk as a background process.)
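The Inspector logic above can be sketched as follows. The constants, the explicit clock, and the list-based stand-ins for the PS and the main Atomspace are all illustrative placeholders; in the real system the timestamps would live in the TimeServer.

```python
# Sketch of the simplified PS attention allocation (illustrative constants).
C1_STI, C2_LTI = 0.1, 0.1      # default STI/LTI assigned on entry into the PS
C_FORGET = 10                  # atoms older than C*s seconds are removed
C_PROMOTE = 1                  # boosted atoms move out after at least c*s seconds
S = 1.0                        # inspector period s, in seconds

class PSAtom:
    def __init__(self, name, now):
        self.name = name
        self.sti, self.lti = C1_STI, C2_LTI
        self.timestamp = now   # recorded in the TimeServer in the real system

def inspect(ps_atoms, main_atomspace, now):
    """One pass of the Inspector: promote LTI-boosted atoms, forget stale ones."""
    keep = []
    for a in ps_atoms:
        age = now - a.timestamp
        if a.lti > C2_LTI and age >= C_PROMOTE * S:
            main_atomspace.append(a)   # move into the main Atomspace
        elif age > C_FORGET * S:
            pass                       # forget (or push to an "experience database")
        else:
            keep.append(a)
    return keep

ps = [PSAtom("boring", now=0.0), PSAtom("interesting", now=0.0)]
ps[1].lti = 0.9                # LTI boosted by some filter
main = []
ps = inspect(ps, main, now=20.0)
```

After this pass, the boosted atom has moved to the main Atomspace and the unboosted, stale atom has been forgotten, which is exactly the two-way split described above.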
To avoid deletion, then, an Atom must get its LTI boosted fairly quickly after it’s been entered into the PS. How would this happen? Well, we can assume there are a number of “filters” that act on the Atoms in the PS, checking them for certain conditions and increasing their LTI if the conditions are met. Obviously, the parameter s must be set large enough that there has been time for perceptual fusion plus filter-checking to be done, before Atoms are deleted.
Consider for instance the case where a camera is watching 10 people and tracking their current locations. Most of the messages regarding the exact locations of each of the people are not that important and can be forgotten without much loss. But if a person changes location a lot, or moves in an otherwise surprising manner, then this should go into the long-term memory. Or if a person makes a loud noise, then their change in location may also become important-seeming, and should be pushed into the long-term memory. Also, if a new person comes into the scene, then for a period of time after they have arrived, their location is interesting and should be pushed into long-term memory (but after they’ve been around a while, their location is no longer of interest and gets observed and forgotten just like the locations of all the other people around).
Perceptual data about people who talk directly to the robot, engage directly nonverbally (prolonged eye contact, smiling, etc.), or are already known to the robot (i.e. identifiable as someone already stored in long-term memory) is immediately boosted into LTM.
In general we can assume there are certain “cognitive schema” existing in the PS – i.e. Atom-sets acting roughly as “behavior trees”, which are applied to each new Atom entered into the PS, to check if this Atom “passes any of the filters”. What these cognitive schema do, when enacted, is to adjust the LTI values of certain Atoms.
(Note: the Atoms corresponding to these cognitive schema should neither get forgotten nor pushed out of the PS into the main Atomspace. This suggests we may want to add a flag to the AttentionValue indicating that the Atom containing the AttentionValue should be considered “immovable” (not to be moved to a different Atomspace container).)
There are also cognitive schema that should be checked for applicability slightly after an Atom has been entered into the PS, so that they can be checked after some sensor fusion has happened. (This will not generally be a long delay, but it will be a greater than zero delay.) For instance, if a filter says “remember the location of person X, if there was a sound coming from near X’s location,” then one wants to check this filter after the sound and the location have both been registered in the PS. It doesn’t make sense to check it immediately when the location is perceived, since at that point the sound may not have been perceived.
This could be done either via having the filters in question poll periodically, or via TriggerLinks. The latter mechanism seems better. One would have, in the above example, something like
    TriggerLink
        ListLink
            [person's location perceived]
            [sound perceived]
        ["sound near location" schema evaluated]
In this way, whenever a person’s location is perceived, it would be checked whether a sound was perceived recently, and if so the schema would be evaluated (to see if the sound was near the person). And, each time a sound was perceived, a check would be done to see whether a person was perceived nearby, and if so the schema would be evaluated (to see if the person was near the sound). But if neither a person nor a sound was perceived, the schema would not be touched.
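One way to picture this trigger mechanism is the following sketch, where perceiving either percept checks whether the complementary percept arrived recently and, if so, evaluates the schema. The names, the one-dimensional "nearness" check, and the time window are all illustrative assumptions.

```python
# Sketch of trigger-driven filter checking (all names and thresholds illustrative).
RECENT = 0.5                             # time window, seconds
recent = {"location": [], "sound": []}   # lists of (timestamp, data) pairs
fired = []

def sound_near_location(loc, snd):
    """Hypothetical schema: was the sound near the person's location?
    Uses a 1-D coordinate as a stand-in for a real spatial check."""
    return abs(loc - snd) < 1.0

def perceive(kind, t, data):
    """Register a percept; evaluate the schema only if the complementary
    percept was registered within the RECENT window."""
    recent[kind].append((t, data))
    other = "sound" if kind == "location" else "location"
    for (t2, data2) in recent[other]:
        if abs(t - t2) <= RECENT:
            loc, snd = (data, data2) if kind == "location" else (data2, data)
            if sound_near_location(loc, snd):
                fired.append((loc, snd))

perceive("location", t=1.0, data=3.0)
perceive("sound", t=1.2, data=3.4)   # near in time and space, so the schema fires
```

Note that the schema body runs only on the trigger condition; with no recent complementary percept, it is never touched, matching the TriggerLink behavior described above.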
Sensor Fusion
As an example, suppose we have two cameras focusing on overlapping scenes: a fixed-position Kinect, plus a camera on the moving head of a robot. Suppose the head-based camera notices a certain object (perhaps up fairly close) and recognizes it as a book. Suppose the Kinect notices the same object, and doesn’t identify it as a book (or as any particular object type), but does note its 3D location fairly accurately. One thing that needs to happen in the PS is, the perceptions of the same object from these two different cameras, need to be associated with the same Atom.
As another example, suppose a single camera shows a number of people in front of a robot. Suppose that sound source localization indicates a sound coming from a certain direction (making it likely that the sound comes from Bob, who is standing to the left, rather than Jane, who is standing to the right). We need to associate the output of sound-source localization and the output of face identification with the same Atom.
How will this sort of fusion be achieved, operationally?
Starting with the first example, let’s suppose that the Kinect creates an ObjectNode
    AtLocationLink <.8,.9>
        LocationNode "333"
        SpatialViewNode "RoboSapien_123_kinect"
        ObjectNode "77777"

    EvaluationLink
        PredicateNode "xcoord"
        ListLink
            CoordinateSystemNode "Euclidean_RoboSapien_123_kinect"
            LocationNode "333"
            NumberNode "4"

    EvaluationLink
        PredicateNode "ycoord"
        ListLink
            CoordinateSystemNode "Euclidean_RoboSapien_123_kinect"
            LocationNode "333"
            NumberNode "5"

    EvaluationLink
        PredicateNode "zcoord"
        ListLink
            CoordinateSystemNode "Euclidean_RoboSapien_123_kinect"
            LocationNode "333"
            NumberNode "6"
(with a high probability of .8, because the Kinect is good at localization) and the head-mounted camera creates an ObjectNode
    AtLocationLink <.8,.9>
        LocationNode "666"
        SpatialViewNode "RoboSapien_123_head_camera"
        ObjectNode "88888"

    EvaluationLink <.2,.5>
        PredicateNode "r"
        ListLink
            CoordinateSystemNode "Spherical_RoboSapien_123_head_camera"
            LocationNode "666"
            NumberNode "2"

    EvaluationLink <.8,.9>
        PredicateNode "theta"
        ListLink
            CoordinateSystemNode "Spherical_RoboSapien_123_head_camera"
            LocationNode "666"
            NumberNode "-3"

    EvaluationLink <.8,.9>
        PredicateNode "phi"
        ListLink
            CoordinateSystemNode "Spherical_RoboSapien_123_head_camera"
            LocationNode "666"
            NumberNode "-2"
(note that, in this example, a low strength has been assigned to the r-value of the location as assigned by the head camera, because the head camera is not good at determining depth.)
How might it be recognized that the ObjectNodes “77777” and “88888” are actually the same objects?
The mathematical criterion is simple enough: once the coordinate systems of the two spatial views are aligned as well as can be done given current knowledge, then one asks whether the identified locations of “77777” and “88888” are known to be sufficiently close to each other, with sufficient confidence. If so, it may be assumed that the two entities are the same, and they may be fused into the same ObjectNode (e.g. links to “88888” may be replaced with links to “77777”).
The only thing here that isn’t conceptually very simple is the alignment of different spatial views. For instance, this could be recorded in the Atomspace via Atoms such as
    AtTimeLink
        TimeNode "987654321"
        TimeDomainNode "RoboSapien_123"
        EquivalenceLink
            AtLocationLink
                LocationNode "333"
                SpatialViewNode "RoboSapien_123_kinect"
                ObjectNode $X
            AtLocationLink
                LocationNode "999"
                SpatialViewNode "RoboSapien_123_head_camera"
                ObjectNode $X
which establish equivalence between one coordinate system and another. This kind of correspondence could be established
- up-front via calibration, e.g. if one had a fixed-position robot head and then a separate external Kinect viewing the same scene as cameras on the fixed-position robot head
- in real-time, e.g. if one had an external Kinect viewing a scene containing a mobile robot with on-board cameras. Then the external Kinect would observe the robot itself, and keep a record of the robot’s location in its own Kinect-centered coordinate system. Furthermore, the angle of the robot’s head relative to the Kinect coordinate system could be determined via computer vision tools, via checking which angle, if assumed, would yield the greatest similarity between what the robot camera sees and what the Kinect reports.
In both cases, the cross-view correspondence links are being created by some external process outside the Atomspace, running as part of “low level perception”.
The 3DspaceMap should come along with a function that transforms between different views of the same space domain, i.e. with the general functionality
    coordinate-vector crossViewMapping(coordinate-vector, SpatialView view1, SpatialView view2)
This mapping function may need to make use of Atoms encoding correspondence between views, as appropriate.
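Here is a minimal crossViewMapping sketch, assuming each view is related to a shared reference frame by a rigid transform (rotation plus translation). In practice these transforms would be read off the cross-view correspondence Atoms or come from calibration; the particular transforms below are made up for illustration.

```python
# Sketch of crossViewMapping via rigid transforms (illustrative transforms).

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

# view -> (rotation, translation) taking that view's coordinates into a
# shared reference frame; these particular values are hypothetical.
VIEW_TRANSFORMS = {
    "RoboSapien_123_kinect": ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.0, 0.0, 0.0)),
    "RoboSapien_123_head_camera": (
        [[0, -1, 0], [1, 0, 0], [0, 0, 1]],   # 90-degree yaw
        (1.0, 2.0, 0.0)),
}

def cross_view_mapping(vec, view1, view2):
    """Map a coordinate vector from view1's frame into view2's frame."""
    r1, t1 = VIEW_TRANSFORMS[view1]
    r2, t2 = VIEW_TRANSFORMS[view2]
    world = tuple(a + b for a, b in zip(mat_vec(r1, vec), t1))
    back = tuple(a - b for a, b in zip(world, t2))
    return mat_vec(transpose(r2), back)   # a rotation's inverse is its transpose
```

Mapping a point out to another view and back recovers the original coordinates, which is the basic sanity requirement for any such transform pair.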
The fusion process may then be carried out as follows. When an entity is assigned a location in a SpatialView, then
- A check is made whether any other entity has very recently been assigned a nearby location in the same SpatialView.
- A check is made, using the crossViewMapping function, to see whether any other entity has very recently been assigned a nearly-correspondent location in another SpatialView
These checks can be triggered along with entry of data into the SpaceServer. If one of these checks comes out positive, THEN a list of sensor-fusion predicates is applied, and if one of these is triggered then fusion will occur. For instance, one such predicate might be
    BindLink
        AndLink
            LessThanLink
                ExecutionOutputLink
                    GroundedSchemaNode "normalized_distance.scm"
                    ListLink
                        LocationNode $L1
                        LocationNode $L2
                NumberNode ".05"
            AtLocationLink
                LocationNode $L1
                SpatialViewNode $V1
                ObjectNode $X1
            AtLocationLink
                LocationNode $L2
                SpatialViewNode $V2
                ObjectNode $X2
            ReferenceLink
                $V1
                $V
            ReferenceLink
                $V2
                $V
            InheritanceLink
                $V
                ConceptNode "RoboSapien Spatial Domain"
        ExecutionOutputLink
            GroundedSchemaNode "fuse.scm"
            ListLink
                ObjectNode $X1
                ObjectNode $X2
Note, the “normalized distance” function here is assumed to be somewhat fancy, e.g. it is assumed to account for mapping between different spatial views.
This BindLink simply says that we should fuse two objects if they are really close and are part of different spatial views. The criterion for “really close” is set at .05 (in normalized coordinates) for the case of the RoboSapien’s space domain.
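The fusion criterion can be sketched procedurally as follows. The scale table, the threshold, and the link representation are illustrative; the `normalized_distance` stand-in assumes coordinates have already been mapped into a common frame and simply divides by the domain's scale factor.

```python
# Sketch of the fusion predicate (illustrative structures and constants).
import math

DOMAIN_SCALE = {"RoboSapien_123": 5.0}   # e.g. the distance across the room
THRESHOLD = 0.05                          # "really close", normalized coordinates

def normalized_distance(p1, p2, domain):
    """Euclidean distance divided by the domain's scale factor; assumes both
    points are already expressed in a common frame."""
    return math.dist(p1, p2) / DOMAIN_SCALE[domain]

def maybe_fuse(obj1, obj2, p1, p2, view1, view2, domain, links):
    """If the two objects are really close and come from different spatial
    views, replace links to obj2 with links to obj1 (the fuse.scm step)."""
    if view1 != view2 and normalized_distance(p1, p2, domain) < THRESHOLD:
        return [(a if a != obj2 else obj1, b if b != obj2 else obj1)
                for (a, b) in links], True
    return links, False

links = [("88888", "book"), ("77777", "object")]
links, fused = maybe_fuse("77777", "88888", (4, 5, 6), (4.1, 5.0, 6.1),
                          "kinect", "head_camera", "RoboSapien_123", links)
```

After fusion, everything previously linked to "88888" now points at "77777", mirroring the link-replacement step described earlier.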
The list of fusion predicates to be applied, may be collected within a SetLink similar to the elements of a RuleBase as used by the Unified Rule Engine.
(Note: we’ve been focused on space here, but cases of purely temporal sensor fusion will also occur. To handle these, one would set up a similar process based on the TimeServer. E.g., one might have a screech detector and a speech-to-text system operating at the same time, using different microphones, neither connected to any sound-source localization. In this case, if a screech and a sentence were registered at very nearly the same time, one might want to conclude it was a “screeched sentence”.)
In the sound source localization example, if the microphone array is on the robot’s torso along with certain cameras, it would make sense for these to share the same SpatialView, even though they pertain to different sense modalities.
A hierarchy of reliability / quality will also help fusion by providing weights to conflicting assignments, and refinements to coarser data by associating it with more fine-grained. There may still be error, but human sensory fusion has similar issues (we don't always assign the right visual sources to sounds, etc., depending on the situation). Reduction of error is essential, but there won't always be enough data to eliminate it.
Queries and Indexing
Without knowing the full set of “filters” to be used with the PS in practice, it’s hard to know what types of queries the PS will need to support. But it seems clear that filters will need to check whether Atoms satisfying certain simple criteria have been entered into the PS recently. For instance, it will be necessary to figure out
- what SoundNodes have been created in roughly the last half-second
- what ObjectNodes representing people have been created in roughly the last second
This suggests that it will be important for the PS to keep
- an index of the Nodes it contains, keyed on Atom type
- an index of Links by target
For instance, without a PeopleNode, whether a Node represents a person is determinable only via checking whether
    InheritanceLink
        $X
        ConceptNode "person"
is fulfilled. This requires looking up, from the key “person”, which InheritanceLinks point to the ConceptNode “person”.
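A sketch of these two indexes follows; the dictionary-based structures are illustrative stand-ins for whatever the stripped-down Atomspace actually uses internally.

```python
# Sketch: nodes keyed on Atom type, and links keyed on target, so that
# "which Nodes represent people?" is a dictionary lookup rather than a scan.
from collections import defaultdict

nodes_by_type = defaultdict(set)       # e.g. "ObjectNode" -> {names}
links_by_target = defaultdict(set)     # target atom -> {source atoms}

def add_node(atom_type, name):
    nodes_by_type[atom_type].add(name)

def add_inheritance(source, target):
    links_by_target[target].add(source)

add_node("ObjectNode", "12345")
add_node("ObjectNode", "666")
add_node("ConceptNode", "person")
add_inheritance(("ObjectNode", "12345"), ("ConceptNode", "person"))
add_inheritance(("ObjectNode", "666"), ("ConceptNode", "chair"))

# Which Nodes represent a person? Follow the incoming set of ConceptNode "person".
people = links_by_target[("ConceptNode", "person")]
```

Combined with a timestamp per atom, these two indexes suffice for the recency-plus-type queries the filters need.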
Incremental Development
A natural incremental development plan for the above ideas, for robotics applications, would seem to be
- 1) PS with a single sensor (no fusion)
- 2) PS with more than one sensor within the same perceptual view (simple fusion)
- 3) PS with multiple sensors coming from more than one perceptual view (full-on perceptual fusion)
For game-world applications, it would be natural to follow 1) instead with
- 1*) PS with basic object identification
- 1**) PS with basic event identification
-- whereas for robotics, 1* and 1** would naturally come after 3.
More work: Object and Event Identification
In the near-term Hanson Robotics applications, the recognition of objects and events is going to be done via statistical methods, outside the Atomspace, as a kind of perceptual pre-processing.
For game-world applications (e.g. Minecraft) and for further-future robotics applications, we will need to carry out basic object and event recognition within the PS. In humans, basic object recognition and event recognition are, subjectively, effectively instantaneous – we subjectively SEE the objects and events, not the pixels and optical flow, etc.
In a Minecraft-like game world, some relatively simple heuristics can be used to recognize when a set of contiguous blocks should be considered a distinct object. Given a set of identified objects, information-theoretic methods can be used to identify when a distinct event begins or ends (e.g. via the heuristic that event boundaries are local discontinuities in predictability).
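The contiguity heuristic can be sketched as a flood fill over face-adjacent non-air blocks; the block coordinates used in the example are an illustrative assumption.

```python
# Sketch: group face-adjacent blocks into candidate objects via flood fill.
from collections import deque

def segment_objects(blocks):
    """Group face-adjacent block coordinates into connected components,
    each component being a candidate 'distinct object'."""
    blocks = set(blocks)
    objects = []
    while blocks:
        seed = blocks.pop()
        component, frontier = {seed}, deque([seed])
        while frontier:
            x, y, z = frontier.popleft()
            for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                n = (x + dx, y + dy, z + dz)
                if n in blocks:
                    blocks.remove(n)
                    component.add(n)
                    frontier.append(n)
        objects.append(component)
    return objects

# a 2-block pillar and a separate lone block should yield two objects
world = [(0, 0, 0), (0, 1, 0), (5, 0, 0)]
```

A real segmenter would also consult block types (e.g. not merging dirt into a wooden door), but connectivity is the core of the heuristic.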
In a robotics context, object and event identification will generally be done outside the Atomspace, e.g. via a deep learning vision system (or a deep learning hierarchy ingesting multimodal data). However, there is a subtle point here: advanced object and event identification generally involves understanding of context, and this context is substantially contained in the Atomspace. So we will need a dynamic wherein Atoms representing objects and events are placed in the PS by a deep learning hierarchy -- but this deep learning hierarchy occasionally consults the main Atomspace (or special Atomspaces containing long-term perceptual and episodic memory) for contextual information to aid it in its processing.
Think of these stages by analogy with the human brain: the low-level perception of object locations and contours, sorting out overlap, associating sound location with visual confirmation, and other pre-semantic processing is like the visual cortex. The Atomspace-external semantic processes that categorize objects (face, donut, etc.), and their interactions with the Atomspace for getting and setting context, are like the visual association cortex. And the Atomspace processes are like the higher cognitive processes with which the visual association cortex interacts.
This is advanced and not likely to get engineered in 2015, but needs to be mentioned here as it’s part of the overall architecture and larger picture.