Simultaneous Localization and Mapping (SLAM) began as a technique to enable real-time robotic navigation in previously unexplored environments. The created maps, however, were designed for the sole purpose of localizing the robot (i.e. estimating the position and orientation of the robot in relation to the map). Successive systems demonstrated the increasing descriptive power of map representations: in vision-only SLAM solutions these progressed from simple sparse corner-like features to edges, planes and, most recently, fully dense surfaces that abandon the notion of sparse structures altogether. Early sparse representations enjoyed the benefit of being simple to maintain, as features could be added, optimized and removed independently, while being memory- and compute-efficient, making them suitable for robust real-time camera tracking that relies on a consistent map. However, sparse representations are limiting when it comes to interaction: a robot aiming to navigate safely in an environment, for example, would need to sense complete surfaces in addition to empty space. Furthermore, sparse features can only be detected in highly-textured areas and during slow motion. Recent dense methods overcome these limitations: they work in situations where corner features would fail to be detected, such as the blurry images produced by rapid camera motion, and they enable correct reasoning about occlusions and complete 3D surfaces, raising interaction capabilities to new levels. This is only possible thanks to the advent of commodity parallel processing power and the large amounts of memory on Graphics Processing Units (GPUs), which need careful consideration during algorithm design. However, increasing the map density makes creating consistent structures more challenging due to the vast number of parameters to optimize and the interdependencies amongst them.
More importantly, our interest is in making interaction even more sophisticated by abandoning the idea that an environment is a dense monolithic structure in favour of one composed of discrete detachable objects and bounded regions with physical properties and metadata. This work explores the development of a new type of visual SLAM system that represents the map with semantically meaningful objects and planar regions, which we call Dense Semantic SLAM. It enables new types of interaction, where applications can go beyond asking the question of “where am I” towards “what is around me and what can I do with it”. In a way it can be seen as a return to lightweight sparse-based representations, while keeping the predictive power of dense methods with added scene understanding at the object and region levels.
The basic idea behind SLAM is this: suppose I am a robot and I know two fixed points, (x1, y1) and (x2, y2). I can see point1, and a measurement tells me it is 5 m away from me. When I turn my head by 30 degrees, I can see point2, and a measurement tells me it is 10 m away. Knowing my relative position with respect to point1 and point2, I can localize myself.
But here comes the problem: I would have to look back at point1 and point2 every time I move forward just to know where I am, which is very inefficient. So a smart researcher comes up with a solution: as soon as I know where I am with respect to point1 and point2, I add more points: point3, point4, and so on. Because I know my own coordinates, I can measure the locations of these new points (mapping), and I can use those points afterwards for localization.
Therefore, SLAM means estimating both my own location and the coordinates of the fixed points (feature mapping). Because noise is introduced whether we measure distances or compute rotations and traveled distances, SLAM requires a Kalman filter.
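The two-landmark localization idea above can be sketched as circle intersection: knowing the positions of two landmarks and the measured range to each, the robot must sit at one of (at most) two intersection points of the two range circles. A minimal sketch, with function names of our own choosing:

```python
import math

def localize(p1, p2, r1, r2):
    """Intersect two range circles around known landmarks p1 and p2.

    Returns the two candidate robot positions; the ambiguity is normally
    resolved by a bearing measurement or a third landmark.
    """
    d = math.dist(p1, p2)
    # Distance from p1 to the foot of the perpendicular, along the p1->p2 axis.
    a = (r1**2 - r2**2 + d**2) / (2 * d)
    h = math.sqrt(max(r1**2 - a**2, 0.0))  # height above that axis
    ex = ((p2[0] - p1[0]) / d, (p2[1] - p1[1]) / d)  # unit vector p1 -> p2
    mx, my = p1[0] + a * ex[0], p1[1] + a * ex[1]    # foot of perpendicular
    # The two intersections lie symmetrically on either side of the axis.
    return ((mx - h * ex[1], my + h * ex[0]),
            (mx + h * ex[1], my - h * ex[0]))
```

With noisy range measurements the two circles may not intersect exactly, which is one reason the filtering machinery below is needed.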
MonoSLAM can also be seen as real-time structure from motion. The fixed points are visual features, and measuring distance becomes matching features and then triangulating. Therefore,
3D landmark = 2D interest point + depth.
At the beginning, a few initial landmarks must be provided for initialization; otherwise the camera motion cannot be estimated. After the initialization phase, an iterative process begins.
Step 1: 2D interest points must evolve into 3D landmarks when the estimate of their depth is good enough.
Step 2: 3D landmarks contribute to the estimation of the state, which includes both the camera and the map.
The camera and map state is estimated through an Extended Kalman Filter (EKF). After initialization, this filter performs a model-based prediction, compares it to the measurements (camera data), and applies a correction to the global state according to the innovation.
To use camera data, we have to know its resolution, focal length, Complementary Metal–Oxide–Semiconductor (CMOS) sensor dimension, and thus we need to perform a calibration step. The landmarks could be initialized manually, or by using an object of known size at the beginning.
Since we use an EKF, we need a model, and the only model we can have for the camera is a kinematic one rather than a dynamic one. This model assumes constant-velocity motion. However, the camera may not move this way, so we add noise to allow for changes in direction or speed. This is necessary for the EKF to assess the confidence in its estimates.
A common noise model is the additive one, where random Gaussian noise is added to each state component. Here, however, we use a nonlinear, acceleration-based noise model. To allow several possible trajectories around the previous one, we model the linear and angular accelerations as zero-mean Gaussian random variables. The standard deviations of these variables let us tune how uneven the motion can be.
The prediction step randomly chooses one trajectory among these possibilities, and if necessary the correction step will bring the system back on track.
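A constant-velocity prediction with acceleration noise can be sketched per axis as follows; MonoSLAM applies the same idea to 3-D position and quaternion orientation, and the noise level here is an arbitrary placeholder:

```python
import random

def predict(pos, vel, dt, accel_sigma=0.5):
    """Constant-velocity prediction with a zero-mean Gaussian acceleration
    impulse, shown in 1-D for a single axis."""
    a = random.gauss(0.0, accel_sigma)        # unknown acceleration impulse
    new_vel = vel + a * dt                    # velocity drifts with the impulse
    new_pos = pos + vel * dt + 0.5 * a * dt**2
    return new_pos, new_vel
```

With accel_sigma set to zero this reduces to pure constant-velocity motion; increasing it widens the spread of trajectories the EKF considers plausible.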
Landmarks are identified from interest points. The Shi-Tomasi detector is used to find salient points. Detection is based on a gradient analysis of 15x15-pixel patches: the eigenvalues of the local structure tensor (the matrix of summed products of image gradients) are compared against a threshold to assess the quality of a potential interest point. When a new interest point is identified, its 15x15 neighborhood patch is attached to it as a descriptor. The patch is normalized to make detection more robust to illumination changes.
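The Shi-Tomasi score can be computed directly from image gradients. The sketch below uses a small 5x5 window rather than the 15x15 patches described above, and central-difference gradients, purely for brevity:

```python
import math

def shi_tomasi_score(img, cx, cy, half=2):
    """Minimum eigenvalue of the 2x2 structure tensor over a small window
    centered at (cx, cy); high values indicate corner-like points."""
    sxx = sxy = syy = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (img[y][x + 1] - img[y][x - 1]) / 2.0  # horizontal gradient
            iy = (img[y + 1][x] - img[y - 1][x]) / 2.0  # vertical gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    trace = sxx + syy
    disc = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
    return (trace - disc) / 2.0  # Shi-Tomasi: the smaller eigenvalue

# A synthetic corner: bright quadrant in a dark image.
img = [[255.0 if (x >= 8 and y >= 8) else 0.0 for x in range(16)]
       for y in range(16)]
```

On this image the score is positive only at the corner: a flat region has zero gradients, and along a straight edge one eigenvalue vanishes, which is exactly why edges alone are rejected.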
Since there is no depth information in a single image, motion must be used to recover depth. This is done using a particle filter. When an interest point is selected, we only know that it lies on a line called the epipolar line. To estimate where the point lies on this line, the line is uniformly sampled by a given number of particles, each representing a depth hypothesis. A maximal depth hypothesis, related to the size of the environment, limits the search zone.
Once an interest point has been registered, meaning that its patch and epipolar line are known, the algorithm tries to match it in subsequent images. An interest point without any depth information is tracked using correlation over a window. The size of the window is constant, defined by a maximal acceleration in terms of pixels. Interest points that move out of the image or are weakly matched are discarded.
During tracking, after the correlation has matched an interest point, all the particles associated with this point are re-projected into the image. Resampling is performed to promote the hypotheses best supported by the correlation result, and 25% of the particles are resampled uniformly to avoid early convergence. When the variance of the depth hypotheses shrinks below a threshold, the estimate is considered reliable. The interest point is then assigned a depth and becomes a landmark, with a large uncertainty along the epipolar line so that the EKF can modify its position easily if necessary.
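The depth-hypothesis filter can be sketched in one dimension along the ray. Here `pixel_error(d)` stands in for the reprojection discrepancy a depth hypothesis d would predict against the correlation match; the weighting, the 25% uniform resampling, and the variance-based convergence test mirror the scheme above, while the specific parameters are our own illustrative choices:

```python
import math
import random

def estimate_depth(pixel_error, d_max=20.0, n=200, iters=20, sigma_px=1.0):
    """Particle filter over depth hypotheses along the epipolar line."""
    particles = [random.uniform(0.1, d_max) for _ in range(n)]
    for _ in range(iters):
        # Gaussian likelihood of each hypothesis given the pixel error.
        w = [math.exp(-pixel_error(d) ** 2 / (2 * sigma_px ** 2)) + 1e-12
             for d in particles]
        # Resample 75% in proportion to the weights...
        kept = random.choices(particles, weights=w, k=int(0.75 * n))
        # ...and redraw 25% uniformly to avoid premature convergence.
        fresh = [random.uniform(0.1, d_max) for _ in range(n - len(kept))]
        particles = [max(0.1, d + random.gauss(0.0, 0.05)) for d in kept] + fresh
    # Weighted mean and variance of the final hypotheses.
    w = [math.exp(-pixel_error(d) ** 2 / (2 * sigma_px ** 2)) + 1e-12
         for d in particles]
    total = sum(w)
    mean = sum(wi * d for wi, d in zip(w, particles)) / total
    var = sum(wi * (d - mean) ** 2 for wi, d in zip(w, particles)) / total
    return mean, var
```

When the returned variance drops below a threshold, the hypothesis would be promoted to a landmark depth.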
Compared to interest point management (initialization, tracking, depth estimation and upgrade to landmark), landmark tracking is much easier. The EKF prediction gives an estimate of the camera position and orientation, as well as uncertainties on these quantities and on the landmark positions. The position and orientation of the camera relative to the landmarks allow us to predict how the landmarks will be projected onto the image plane, along with an uncertainty area. Both the uncertainty of the camera pose and the uncertainty of the landmark positions are taken into account to compute this area: the larger the uncertainty, the wider the search area in the new image. The search is then simply performed by correlation within this area. The difference between the EKF prediction and the camera observation gives the innovation signal of the filter.
When the algorithm is running, at least one landmark must always be visible. Thus, as soon as the number of visible landmarks falls below a chosen threshold, a search procedure is launched to identify new interest points. This is achieved by searching for interest points in randomly drawn windows. The only constraint on these windows is that they must be far away from the currently tracked landmarks and interest points; beyond that, no effort is spent trying to pick interest points in the places most useful to the algorithm.
The innovation signal is at the core of the EKF correction step, which affects the whole state vector. We previously explained how the innovation signal is obtained, without any details concerning the projection.
The simple pinhole camera model is used to perform reprojection; no distortion correction is applied here. For a landmark yi described by its coordinates in the world frame, we have to express its position relative to the camera.
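The pinhole reprojection can be sketched as follows: transform the world point into the camera frame, then apply the perspective division and the intrinsics. The parameter names are the conventional ones (fx, fy, cx, cy), not names taken from the text:

```python
def project(point_w, R, t, fx, fy, cx, cy):
    """Project a 3-D world point into pixel coordinates with the simple
    pinhole model (no distortion). R (3x3, row-major nested lists) and t
    map world coordinates into the camera frame."""
    # World -> camera frame: p_c = R * p_w + t
    pc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    if pc[2] <= 0:
        raise ValueError("point behind the camera")
    # Perspective division, then focal lengths and principal point.
    u = fx * pc[0] / pc[2] + cx
    v = fy * pc[1] / pc[2] + cy
    return u, v
```

The difference between this predicted (u, v) and the measured feature location is exactly the innovation fed to the EKF correction.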
Correction is based on the innovation. A simple additive noise is finally added to the observation model, which lets us tune the confidence of the EKF in the observations. To increase the robustness of the correction algorithm, landmarks that are weakly matched (bad correlation results) or unmatched (not seen when they are predicted to be visible) over several frames are removed from the map. The EKF is very likely to get lost if the landmark association is bad.
To avoid wasting energy, the robot needs to move methodically, and thus incorporates an omnidirectional camera (128 x 1024 resolution) used for sparse feature tracking and mapping (together with a few other sensors). Successful academic algorithms such as MonoSLAM or PTAM may still need more than a decade of real-world testing and tuning before they reach a mature prototype.
Perhaps surprisingly, most of the robot systems across this wide spectrum of requirements and implementations can be described as segments of a single processing pipeline. The simplest systems implement only the earliest stages of the pipeline. More demanding systems implement deeper stages, until we reach the point of machine intelligence, where all the stages of the pipeline are present, and may be coded into one giant neural-network model.
The individual stages are easy to describe, if not easy to implement. In the first stage, a usually simple algorithm extracts features from the image. Features, in this context, are easily-detectable patterns, such as edges, corners, or wavelets, that tend to be stable attributes of the object on which they occur, even when externalities like position or lighting change. The line- and arc-segments in printed characters or the patterns of light and dark regions in a human face are examples. Depending on what kind of features you are trying to extract, the best tool may be, for example, a convolution kernel scanning across the image, or a gradient-based matrix analysis looking for points at which the color or intensity of pixels changes dramatically.
In the next stage, the system looks for patterns in the list of features. A simple system might look just for spatial relationships—is there a loop near the top of that vertical bar? Or the system might look for temporal relationships—how has each feature moved between this frame and the previous one: are there features that always move together? This search for patterns can be repeated to search for patterns among the patterns, and so on. These searches for patterns can be done by rule-based algorithms, or they can be done by convolutional networks.
The next stage is more challenging: locating objects. In the abstract sense, an object is a persistent set of patterns that appear to be grouped together by some associating factor, such as being next to each other, being all the same color, being surrounded by a boundary of edges, or all moving together, to cite some examples. Again, object extraction can be done by rule-based systems—although this can get quite complex—or by convolutional neural networks.
The next stage attempts to classify the objects: to name them and attach attributes to them. Classification can use observed information such as position, shape, and color; a-priori information about the situation, such as knowing that all the objects must be decimal digits; or learned parameters in a neural network.
Finally, a last stage in the most ambitious systems will use the classified objects and the relations between them to create a predictive model of the scene: what the objects are doing and the probable next sequence of events. This stage may also attach value judgments to those events. Generally, it is implemented with Kalman filters or neural-network layers.
The first step in nearly all vision systems is feature extraction. The kind of features the system seeks depends, reasonably enough, on what it is trying to accomplish. But there are some general categories.
Perhaps the simplest are purely artificial features: patterns printed on objects specifically for the benefit of the vision system. Examples include bar codes, Quick Response (QR) codes, and alignment patterns. A bar-code reader simply keeps scanning until it happens to scan a pattern that produces a valid binary string. QR and alignment patterns are different, in that they allow a simple algorithm to detect the existence and orientation of the pattern and to extract data from it.
In the real world, few things come with QR codes or alignment patterns. So extraction often has to find distinctive features in the scene it receives. Nearly all of the techniques to do this depend upon calculating rates of change in intensity or color between adjacent pixels. A locus of pixels where the rate of change is greatest strongly suggests itself as a feature. In this way a character recognizer can reduce a page of handwriting to a list of strokes, their locations and orientations. A more general feature extractor can reduce a photographic scene to a kind of line drawing composed of edges and corners, or to a cloud of distinctive points.
One interesting special case is used primarily in facial recognition. Haar-like features—essentially sets of alternating light and dark rectangles—can be a useful abstraction for such images as faces, where distinct edges and corners may not be obvious and stable. For example, faces exhibit a long, narrow, light-colored horizontal rectangular area across the cheeks, with a similar-sized dark rectangle just above it across the eyes. Rule-based algorithms can extract these rectangles from even a quite soft image of a face by computing the sum of the pixel values within each rectangle and maximizing the difference between the two sums. By imposing a number of such rectangle sets on different parts of the face, an index based on the locations, shapes, and pixel-sums of the rectangles can be produced that is relatively stable for a given face, but differs for faces that humans judge to be different.
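The rectangle sums that Haar-like features rely on are cheap to compute with an integral image (summed-area table), where any rectangle sum costs four lookups regardless of its size. A minimal sketch of the two-rectangle feature described above, with our own function names:

```python
def integral_image(img):
    """Summed-area table with a zero border row/column:
    ii[y][x] = sum of img over rows < y and cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y),
    using four lookups into the integral image."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: the upper band minus the band
    directly below it (e.g. the eye region against the cheeks)."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x, y + h, w, h)
```

A face detector evaluates thousands of such features at many positions and scales, which is only feasible because each one costs a handful of lookups.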
Whatever the algorithm, the feature extractor’s job is to produce a list of the features in the image, often with additional attributes about each feature. This list gets sent deeper into the pipeline for pattern recognition.
The next stages in the vision pipeline will organize the extracted features into patterns that seem to have stable associations, and then to infer from those patterns of features the existence of objects in the scene. For example, in optical character readers, a group of strokes that are connected–or nearer to each other than to any other strokes–probably constitutes a character.
The challenge of binding together features to form objects becomes much more formidable in systems working with continuous-tone images such as photographs or video. These systems may take several successive layers of pattern extraction, attempting to group features by proximity, color, surrounding edges, or other rules suggested by a-priori knowledge of the subject. Once features have been grouped into patterns, the system often must apply further rules to identify patterns of features with objects. Or the system may use trained convolutional neural network layers to find patterns and infer objects.
A 3D approach adapted to a single-instruction, multiple-data (SIMD) vision-processing chip: rather than isolating objects and then measuring the distance to them, the algorithm identifies distinctive features in a scene (a cloud of feature points, Jacobs called them) through a series of matrix operations based on pixel-value gradients. By finding the pixels where the gradient change is greatest, the algorithm essentially acts as a corner detector.
The algorithm performs the extraction a second time on an image captured after the camera has moved a little. By comparing the locations of the feature points in the two images, the algorithm figures out which feature points match up between the two images, and develops motion vectors, not unlike the motion vectors used in video compression. A statistical solver uses these vectors to estimate camera motion, the distance to each feature point, and any motion that cannot be accounted for by parallax and therefore must represent movement of the objects on which the feature points lie.
Vision-based system: The algorithm is based on a vertically-mounted camera and a ring-shaped lens, producing a 360-degree image that reaches from the floor around the robot to well up the walls of the room. A vision processor populates this cylindrical surface with distinctive feature points. From consecutive images the system computes motion vectors, extracts objects, and performs simultaneous localization and mapping (SLAM) using Kalman filters.
An enormous amount of testing is the price of rule-based algorithms. Unless you can constrain the environment so tightly that you can enumerate all possible situations, you face the chance that your rules will encounter a new situation and do something undesirable. This open-ended risk has led some researchers to revive an older approach to poorly-defined problems: neural networks.
Unlike systems that apply predefined mathematical functions and classification rules to images to extract features, identify and classify objects, and create predictive models, neural networks are simply consecutive layers of arrays of nodes—called, by analogy to biology, neurons. The output of each neuron at time t+1 is a function—usually non-linear and sometimes slightly randomized—of the outputs of some (or all) of the neurons in the previous layer at time t.
Rather than choosing the input function for each neuron, you use a directed search process to train the neural network. In essence, after deciding the general kind of function you want each neuron to perform, you randomize all the parameters in all the neuron functions, present the first layer of the network with an image, and see how close the network comes to correctly classifying the image. For example, one output of the final stage of the network might indicate the presence or absence of a lion in the image. You measure the error between the actual network output and the correct classification, and then use that error to adjust the function parameters through the network in a direction that reduces the error. And repeat—over an enormous number of training images.
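The training loop described above can be shrunk to a single logistic neuron on a toy dataset; the "lion present?" output becomes a simple OR of two binary inputs, and every name and hyperparameter here is an illustrative choice:

```python
import math
import random

def train_neuron(samples, labels, lr=0.5, epochs=200):
    """Train one logistic neuron by gradient descent: randomize the
    parameters, measure the error against the correct label, and nudge
    the weights in the direction that reduces it. Repeat."""
    random.seed(0)
    w = [random.uniform(-0.1, 0.1) for _ in range(len(samples[0]))]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            out = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
            err = out - y                      # error signal
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Toy stand-in for "is there a lion?": OR of two binary inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
```

Multi-layer networks use the same idea, with the error propagated backwards through the layers to compute each parameter's adjustment.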
Neural networks were first tried for vision processing decades ago. But two problems stymied progress. First, as the image resolution increased, the number of neurons and connections exploded. Second, as the network grew, training times became intolerably long.
But two advances have relieved these problems. The first is the emergence of Convolutional Neural Networks (CNNs). CNNs replace the neurons in the first few layers with small convolution engines. Where each neuron might be connected to a very large number of pixels or inputs from the previous layer, a convolution engine takes as input only a small number of signals, such as a 16-by-16 array of pixels. And instead of an arbitrary function that must be trained, the convolution engine does only one thing: it convolves its inputs with a kernel. Training is only needed to establish the values in the kernel.
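The convolution engine itself is a few lines: each output pixel is the kernel-weighted sum of the input window under it. The sketch below computes valid-mode output only, and (like most deep-learning frameworks) implements cross-correlation without flipping the kernel:

```python
def convolve2d(img, kernel):
    """Slide the kernel over the image; each output pixel is the
    kernel-weighted sum of the window it covers. Only the kernel
    values would be learned during training."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(kernel[dy][dx] * img[y + dy][x + dx]
                            for dy in range(kh) for dx in range(kw))
    return out
```

With a vertical-edge kernel, the output responds strongly exactly where a vertical intensity edge passes under the window, which is how early CNN layers end up acting as learned feature extractors.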
One measure of the importance of CNNs comes from an industry benchmark, the ImageNet Large Scale Visual Recognition Challenge, an open challenge sponsored by a group of interested researchers that includes Facebook, Google, Stanford University, and the University of North Carolina. This large set of test images, accompanied by an even larger set of training images, has become the standard by which still-image recognition systems are measured.
CNNs have quickly come to dominate the ImageNet challenge, producing far more accurate object classifications than rule-based systems. But the training effort needed to achieve these results can be enormous. To accelerate the learning process, Baidu uses a 16-layer CNN in which only the final three or four layers are made up of fully-connected neurons. But the real effort goes into the training. Baidu has expanded the already exhaustive set of training images by generating variants of the ImageNet training set, varying the color and lighting and introducing optical distortions. The result is a set of about two billion training images: a seemingly impossible task for even a supercomputer-based CNN model to digest.
That brings up the second advance—hardware acceleration. The Baidu CNN is implemented on a custom supercomputer. During training, the system uses a stochastic gradient descent algorithm to refine the convolution kernels and neuron input functions. While the CNN model itself apparently runs on CPUs, the training algorithm executes on a bank of 32 Nvidia Tesla graphics processing units (GPUs), linked on a private Infiniband network. Once the training is done, the computing load for executing the CNN model drops significantly.
IBM, meanwhile, is taking a quite different approach to hardware acceleration. Rather than execute CNN models on heterogeneous CPU/GPU systems, IBM’s TrueNorth program is staying with fully-connected neural networks but executing the models on an array of custom chips that directly emulate axons, synapses, and neuron cells.
The current chip packs the equivalent of a million neurons, with 256 million potential synaptic connections, onto each die. The full computer is a vast array of these chips connected through a network that allows an axon from any neuron to connect to a synapse of any other neuron. The behavior and threshold of each synapse is programmable, as is the behavior of each neuron. IBM has shown that the system is flexible enough to emulate any of the documented behaviors of biological neurons.
Aside from being fully connected and very flexible, the TrueNorth design has another important characteristic. Unlike models executing on stored-program CPUs, the neuron chip operates at relatively low frequencies, using quite simple internal operations. This allows a single chip to run on 100 mW, making mobile devices entirely feasible even with the existing 45 nm silicon-on-insulator process.
TrueNorth models can either be programmed with predetermined functions, or they can be architected and then trained with conventional back-propagation techniques. Using the latter approach, IBM researchers have demonstrated quite impressive results at classifying scenes.
This CRC Press News presents an analysis of the VISUAL MONO-SLAM algorithm, developed by Davison et al., used to create sparse consistent 3D maps in real time from images perceived by a monocular hand-held camera. To explain the workings of VISUAL MONO-SLAM, foundations concerning camera models and lens distortion are presented, followed by an excursion into visual features and image processing techniques. The concept of the Extended Kalman filter is introduced, and it is shown how an Extended Kalman filter can be used to obtain both a 6D pose estimate for the camera and position estimates for feature points in a 3D coordinate frame. An encoding for 3D point estimates using inverse depth is presented, allowing immediate feature initialization without any prior knowledge about the depth of the feature point. It is shown that this encoding performs well even for features at great depth showing little or no parallax, in contrast to the conventional XYZ encoding. To save computational load, a mechanism for converting features showing high parallax from inverse-depth encoding to the common three-dimensional XYZ encoding is discussed. An implementation using OpenCV and OpenGL is used to evaluate the discussed methods in a simulation, on provided sample image sequences, and with a real-time camera.
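The conversion from inverse-depth encoding to XYZ can be sketched as follows. The feature is stored as the camera position at first observation, the azimuth/elevation of the viewing ray, and the inverse depth rho; the ray-direction formula below follows the convention of Civera et al.'s inverse-depth parametrization, and other axis conventions exist:

```python
import math

def inverse_depth_to_xyz(x0, y0, z0, theta, phi, rho):
    """Convert an inverse-depth feature (anchor position, azimuth theta,
    elevation phi, inverse depth rho) to an XYZ point: the point lies
    at depth 1/rho along the ray from the anchor."""
    m = (math.cos(phi) * math.sin(theta),   # ray direction from the angles
         -math.sin(phi),
         math.cos(phi) * math.cos(theta))
    d = 1.0 / rho                           # depth along the ray
    return (x0 + d * m[0], y0 + d * m[1], z0 + d * m[2])
```

Note that rho near zero encodes a point near infinity gracefully, which is exactly why low-parallax features behave well in this encoding, while the conversion to XYZ is only performed once the parallax (and hence rho) is well constrained.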
Computer vision approaches are increasingly used in mobile robotic systems, since they make it possible to obtain a very good representation of the environment using low-power, cheap sensors. In particular, it has been shown that they can compete with standard solutions based on laser range scanners for the problem of simultaneous localization and mapping (SLAM), where the robot has to explore an unknown environment while building a map of it and localizing itself in that same map. Experimental results in real scenarios as well as on standard datasets show that the algorithm is able to track the trajectory of the robot and build a consistent map of small environments, while running in near real-time on a standard PC.
Future work will be devoted to improving the robustness of the approach at larger scales by implementing a way to detect and manage loop closures. Moreover, other sensors will be included when available, such as accelerometers, magnetometers, gyroscopes, and wheel odometers. Finally, more complex applications of the Mono-SLAM algorithm are under consideration. These include a multi-robot extension of the algorithm, the integration of high-level semantic information, and an extension of the solution to multi-camera systems.