Existing computing solutions for Level 4 autonomous driving often consume thousands of Watts, dissipate enormous amounts of heat, and cost tens of thousands of dollars. These power, heat, and cost barriers make autonomous driving technologies difficult to transfer to the general public. Through inventive problem solving, an autonomous driving computing architecture and software stack can be developed that is modular, secure, dynamic, high-performance, and energy-efficient. For example, a simulated system on an ARM mobile SoC consumes 11 W on average and is able to drive a mobile vehicle at 5 miles per hour. With more computing resources, the simulated system would be able to process more data and would eventually satisfy the needs of a production-level autonomous driving system.
Here, we attempt to develop some initial understandings of the following questions:
which computing units are best suited for which kinds of workloads;
taking an extreme case, whether a mobile processor would be sufficient to perform the tasks in autonomous driving; and
how to design an efficient computing platform for autonomous driving.
The aforementioned autonomous driving computing stack provides several benefits:
modular: more ROS nodes can be added if more functions are required;
secure: ROS nodes provide a good isolation mechanism that prevents nodes from impacting each other;
highly dynamic: the run-time layer can schedule tasks for maximum throughput, lowest latency, or lowest energy consumption;
high-performance: each heterogeneous computing unit is used for the task it suits best, achieving the highest performance;
energy-efficient: the most energy-efficient computing unit can be used for each task, for example, a DSP for feature extraction.
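To make the "modular" and "secure" points concrete, here is a minimal publish/subscribe sketch (hypothetical, not the actual ROS API): nodes communicate only through topics, so a new consumer node attaches to a topic without any change to the existing publisher.

```python
# Minimal pub/sub sketch (hypothetical; not the ROS API) illustrating
# modularity: adding a node requires no change to existing nodes.

class Bus:
    def __init__(self):
        self.subs = {}                      # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subs.setdefault(topic, []).append(callback)

    def publish(self, topic, msg):
        for cb in self.subs.get(topic, []):
            cb(msg)

bus = Bus()
detections = []
tracks = []

# an object-recognition node publishes detections; one node consumes them
bus.subscribe("/objects", detections.append)
# adding a second consumer (e.g. a tracking node) touches nothing above
bus.subscribe("/objects", lambda m: tracks.append(("track", m)))

bus.publish("/objects", {"class": "car", "bbox": (10, 20, 50, 60)})
```

Because nodes only see messages, a faulty subscriber cannot corrupt the publisher's state, which is the isolation property described above.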
The reason we could deliver high performance on an ARM mobile SoC is that we can utilize the heterogeneous computing resources of the system, using the best-suited computing unit for each task to achieve the best possible performance and energy efficiency. However, there is a downside as well: we could not fit all the tasks into such a system, for example,
lane-change prediction,
cross-road traffic prediction, etc.
In addition, the autonomous driving system needs the capability to upload raw sensor data and processed data to the cloud; however, the amount of data is so large that it would take all of the available network bandwidth.
The aforementioned functions (object tracking, lane-change prediction, cross-road traffic prediction, data uploading, etc.) are not needed all the time. For example,
the object tracking task is triggered by the object recognition task, and
the traffic prediction task is, in turn, triggered by the object tracking task.
The data uploading task is not needed all the time either since uploading data in batches usually improves throughput and reduces bandwidth usage.
If we designed an ASIC chip for each of these tasks, it would be a waste of chip area; an FPGA is thus a perfect fit for these tasks. We could have one FPGA chip in the system and have these tasks time-share it. It has been demonstrated that, using partial-reconfiguration techniques, an FPGA soft core can be swapped in a few milliseconds, making real-time time-sharing possible.
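The time-sharing argument can be sketched as a toy cost model: a fixed partial-reconfiguration overhead is paid only when the loaded soft core changes, so back-to-back requests of the same task pay it once. All timing numbers here are illustrative assumptions.

```python
# Toy model of time-sharing one FPGA among occasionally-triggered tasks.
# The few-millisecond partial-reconfiguration overhead is a fixed cost
# paid only when the loaded soft core changes; numbers are assumptions.

RECONFIG_MS = 3.0   # assumed partial-reconfiguration time

def run_on_fpga(requests, exec_ms):
    """requests: task names in arrival order; exec_ms: task -> run time (ms).
    Returns total time including reconfiguration on task switches."""
    loaded, total = None, 0.0
    for task in requests:
        if task != loaded:              # switch soft cores
            total += RECONFIG_MS
            loaded = task
        total += exec_ms[task]
    return total

# two tracking requests back-to-back pay the reconfiguration cost once
t = run_on_fpga(["tracking", "tracking", "upload"],
                {"tracking": 5.0, "upload": 8.0})
```

Here t is 24.0 ms (3 + 5 + 5 + 3 + 8): the middle request reuses the already-loaded core, which is why infrequent tasks can share one chip.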
With respect to the computing stack for autonomous driving, at the level of the computing platform layer, an SoC architecture consists of
an I/O subsystem that interacts with the front-end sensors;
a DSP to pre-process the image stream to extract features;
a GPU to perform object recognition and some other deep learning tasks;
a multi-core CPU for planning, control, and interaction tasks;
an FPGA that can be dynamically reconfigured and time-shared for data compression and uploading, object tracking, and traffic prediction, etc.
These computing and I/O components communicate through shared memory.
On top of the computing platform layer, we could have a run-time layer to map different workloads to the heterogeneous computing units through OpenCL, and to schedule different tasks at runtime with a run-time execution engine.
On top of the run-time layer, we have an operating system layer utilizing Robot Operating System (ROS) design principles: a distributed system consisting of multiple ROS nodes, each encapsulating a task in autonomous driving.
Let’s explore the edges of the envelope and understand how well an autonomous driving system could perform on the aforementioned ARM mobile SoC. A vision-based autonomous driving system can be implemented on this mobile SoC. Here, we can utilize
the DSP for sensor data processing tasks, such as feature extraction and optical flow (the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene);
the GPU for deep learning tasks, such as object recognition;
two CPU threads for localization tasks, to localize the vehicle in real-time;
one CPU thread for real-time path planning;
another CPU thread for obstacle avoidance.
Note that multiple CPU threads can run on the same CPU core if a CPU core is not fully utilized.
Surprisingly, through virtual testing, the performance turns out to be quite impressive when simulating this system's implementation on the ARM mobile SoC with hardware-in-the-loop (HIL) testing. The localization pipeline is able to process 25 images per second, almost keeping up with image generation at 30 images per second. The deep learning pipeline is capable of performing 2 to 3 object recognition tasks per second. The planning and control pipeline is designed to plan a path within 6 ms. When running the full system, the SoC consumes 11 W on average. With this system, we would be able to drive the vehicle at around 5 miles per hour without any loss of localization, quite a remarkable feat, considering that this ran on a mobile SoC. With more computing resources, the system should be capable of processing more data and allowing the vehicle to move at a higher speed, eventually satisfying the needs of a production-level autonomous driving system.
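As a quick sanity check on these figures, the distance the vehicle travels between successive localization updates at this speed follows from a unit conversion alone (no assumptions beyond the numbers above):

```python
# At 5 mph with the localization pipeline running at 25 images per second,
# how far does the vehicle travel between successive localization updates?

MPH_TO_MPS = 1609.344 / 3600          # miles per hour -> meters per second

speed_mps = 5 * MPH_TO_MPS            # ~2.24 m/s
dist_per_update = speed_mps / 25      # ~0.09 m between updates
```

Roughly 9 cm per update, which is consistent with driving at this speed without loss of localization.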
To match workloads to computing units, we need to understand which computing units are best fitted to convolution and feature extraction workloads, the most computation-intensive workloads in autonomous driving scenarios. We could conduct a Design of Experiments (DoE) on an off-the-shelf ARM mobile SoC consisting of a four-core CPU, a GPU, and a DSP. To study the performance and energy consumption of this heterogeneous platform, we implement the DoE to optimize feature extraction and convolution tasks on the CPU, GPU, and DSP based on measured chip-level energy consumption.
First, we can implement a convolution layer, which is commonly used, and is the most computation-intensive stage, in object recognition and object tracking tasks. Let’s summarize the performance and energy consumption results:
when running on the CPU, each convolution takes about 8 ms to complete, consuming 20 mJ;
when running on the DSP, each convolution takes 5 ms to complete, consuming 7.5 mJ;
when running on a GPU, each convolution takes only 2 ms to complete, consuming only 4.5 mJ.
These results confirm that the GPU is the most efficient computing unit for convolution tasks, both in performance and in energy consumption.
Next, we implemented feature extraction, which generates feature points for the localization stage; this is the most computation-expensive task in the localization pipeline. Let's summarize the performance and energy consumption results:
when running on a CPU, each feature extraction task takes about 20 ms to complete, consuming 50 mJ;
when running on a GPU, each feature extraction task takes 10 ms to complete, consuming 22.5 mJ;
when running on a DSP, each feature extraction task takes only 4 ms to complete, consuming only 6 mJ.
These results confirm that the DSP is the most efficient computing unit for feature-processing tasks, both in performance and in energy consumption. Note that we did not simulate the implementations of other tasks in autonomous driving, such as localization, planning, and obstacle avoidance, on GPUs and DSPs, as these tasks are control-heavy and would not execute efficiently on GPUs and DSPs.
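The measured numbers above can be organized as a small table, and a simple minimizing lookup over it reproduces both conclusions (GPU for convolution, DSP for feature extraction). This is also the kind of lookup a run-time scheduler could perform when dispatching tasks:

```python
# Measured results from the text: task -> unit -> (latency_ms, energy_mJ)
RESULTS = {
    "convolution":        {"CPU": (8, 20),  "DSP": (5, 7.5),   "GPU": (2, 4.5)},
    "feature_extraction": {"CPU": (20, 50), "GPU": (10, 22.5), "DSP": (4, 6)},
}

def best_unit(task, metric="latency_ms"):
    """Pick the unit minimizing latency (default) or energy for a task."""
    idx = 0 if metric == "latency_ms" else 1
    return min(RESULTS[task], key=lambda u: RESULTS[task][u][idx])
```

For these measurements the latency-optimal and energy-optimal choices coincide, so a single dispatch table serves both scheduling policies.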
In this Computer Architecture Design Exploration section, to understand how chip makers attempt to solve these problems, we look at the existing autonomous driving computing solutions provided by different chip makers. To understand the main points in autonomous driving computing platforms, let's also look at an existing computing hardware implementation of a Level 4 autonomous car from a leading autonomous driving company.
Let’s examine some existing computing solutions targeted for autonomous driving.
When we performed the DoE-based simulation, the Nvidia PX platform was the leading GPU-based solution for autonomous driving.
Each PX 2 consists of two Tegra SoCs and two Pascal graphics processors.
Each GPU has its own dedicated memory, as well as specialized instructions for Deep Neural Network acceleration.
To deliver high throughput, each Tegra connects directly to the Pascal GPU using a PCI-E Gen 2 x4 bus (total bandwidth: 4.0 GB/s). In addition, the dual CPU-GPU cluster is connected over Gigabit Ethernet, delivering 70 Gigabits per second. With optimized I/O architecture and DNN acceleration, each PX2 is able to perform 24 trillion deep learning calculations every second. This means that, when running AlexNet deep learning workloads, it is capable of processing 2,800 images/s.
Texas Instruments’ TDA provides a DSP-based solution for autonomous driving.
A TDA2x SoC consists of two floating point C66x DSP cores and four fully programmable Vision Accelerators, which are designed for vision processing functions.
The Vision Accelerators provide an eight-fold acceleration on vision tasks compared to an ARM Cortex-A15 CPU, while consuming less power.
Similarly, the CEVA XM4 is another DSP-based autonomous driving computing solution, designed for computer vision tasks on video streams. The main benefit of using the CEVA-XM4 is energy efficiency: it requires less than 30 mW to process a 1080p video at 30 frames per second.
Altera’s Cyclone V SoC is an FPGA-based autonomous driving solution that has been used in Audi products.
Altera’s FPGAs are optimized for sensor fusion, combining data from multiple sensors in the vehicle for highly reliable object detection.
Similarly, Zynq UltraScale MPSoC is also designed for autonomous driving tasks.
When running Convolutional Neural Network tasks, it achieves 14 images/s/Watt, which outperforms the Tesla K40 GPU (4 images/s/Watt). Also, for object tracking tasks, it reaches 60 fps on a live 1080p video stream.
MobilEye EyeQ5 is a leading ASIC-based solution for autonomous driving. EyeQ5 features fully programmable accelerators; each of the four accelerator types in the chip is optimized for its own family of algorithms, including signal-processing and machine-learning tasks.
This diversity of accelerator architectures enables applications to save both computational time and energy by using the most suitable core for every task. To enable system expansion with multiple EyeQ5 devices, EyeQ5 implements two PCI-E ports for inter-processor communication.
LiDAR is capable of producing over a million data points per second with a range of up to 200 meters. However, it is very costly: a high-end LiDAR sensor costs tens of thousands of dollars. Thus, let's explore an affordable yet promising alternative: vision-based autonomous driving.
The localization method in LiDAR-based systems heavily utilizes a particle filter, while vision-based localization utilizes visual odometry techniques. These two different approaches are required to handle the vastly different types of sensor data.
The point clouds generated by LiDAR provide a “shape description” of the environment; however, it is hard to differentiate individual points.
By using a particle filter, the system compares a specific observed shape against the known map to reduce uncertainty.
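To make the idea concrete, here is a minimal one-dimensional particle-filter sketch: particles encode position hypotheses, each is weighted by how well its predicted observation matches the actual measurement against the known map, and resampling concentrates particles near the true position. The single-landmark map, noise levels, and particle count are illustrative assumptions, not part of any real localization system.

```python
# Minimal 1-D particle filter: weight hypotheses by agreement with a
# map-based observation, then resample. All numbers are illustrative.
import random

random.seed(0)
LANDMARK = 50.0                     # known map: a single landmark at x = 50

def observe(x):                     # signed offset from the landmark
    return LANDMARK - x

def step(particles, measurement, noise=1.0):
    # weight each particle by agreement between its predicted observation
    # and the actual measurement
    weights = [1.0 / (1e-6 + abs(observe(p) - measurement)) for p in particles]
    # resample in proportion to weight, with a little diffusion noise
    return [random.choices(particles, weights)[0] + random.gauss(0, noise)
            for _ in particles]

particles = [random.uniform(0, 100) for _ in range(500)]
truth = 30.0                        # true (hidden) vehicle position
for _ in range(20):
    particles = step(particles, observe(truth))

estimate = sum(particles) / len(particles)   # converges near x = 30
```

After a few iterations the particle cloud collapses around the true position, which is the uncertainty-reduction behavior described above.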
In contrast, for vision-based localization, the observations are processed through a full pipeline of image processing to extract salient points and the salient points’ descriptions, which is known as feature detection and descriptor generation.
This allows us to uniquely identify each point and apply these salient points to directly compute the current position.
In detail, vision-based localization undergoes the following simplified pipeline:
by triangulating stereo image pairs, we first obtain a disparity map which can be used to derive depth information for each point.
by matching salient features between successive stereo image frames, we can establish correlations between feature points in different frames. We can then estimate the motion between the past two frames.
by comparing the salient features against those in the known map, we can also derive the current position of the vehicle.
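The first step of the pipeline above relies on the classic pinhole-stereo relation Z = f·B/d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity. A tiny sketch, with the focal length, baseline, and disparity values being illustrative assumptions:

```python
# Depth from a stereo disparity value via the pinhole-stereo relation.
# Focal length, baseline, and disparity here are assumed example values.

def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Z = f * B / d: larger disparity means a closer point."""
    return focal_px * baseline_m / disparity_px

z = depth_from_disparity(10.0)    # 700 * 0.12 / 10 = 8.4 m
```

Applying this per pixel over the disparity map yields the dense depth information used by the later stages.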
Compared to a LiDAR-based approach, a vision-based approach introduces several highly parallel data processing stages, including
disparity map generation,
Gaussian Blur, etc.
These sensor data processing stages heavily utilize vector computations and each task usually has a short processing pipeline, which means that these workloads are best suited for DSPs. In contrast, a LiDAR-based approach heavily utilizes the Iterative Closest Point (ICP) algorithm, which is an iterative process that is hard to parallelize, and thus more efficiently executed on a sequential CPU.
Autonomous driving is a highly complex system that consists of many different tasks. In order to achieve autonomous operation in urban situations with unpredictable traffic, several real-time systems must interoperate, including sensing, perception, and decision-making systems.
Note that existing successful implementations of autonomous driving are mostly LiDAR-based: they rely heavily on LiDAR for mapping, localization, and obstacle avoidance, while other sensors are used for peripheral functions.
Normally, an autonomous vehicle consists of several major sensors. Since each type of sensor presents advantages and drawbacks, the data from multiple sensors must be combined for increased reliability and safety. These sensors can include the following:
The GPS/IMU system helps the autonomous vehicle localize itself by reporting both inertial updates and a global position estimate at a high rate.
GPS is a fairly accurate localization sensor; however, its update rate is slow, at only about 10 Hz, and it is thus not capable of providing real-time updates.
Conversely, an IMU’s accuracy degrades with time, and thus cannot be relied upon to provide reliable position updates over long periods of time.
However, an IMU can provide updates more frequently, at 200 Hz or higher, satisfying the real-time requirement. Assuming a vehicle traveling at 60 miles per hour, less than 0.2 meters is traveled between two position updates; this means that the worst-case localization error is less than 0.2 meters.
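The worst-case figure above checks out with a one-line calculation (unit conversion only, using the speed and update rate stated in the text):

```python
# Distance traveled between two IMU updates at 60 mph with a 200 Hz rate.

MPH_TO_MPS = 1609.344 / 3600      # miles per hour -> meters per second

gap_m = 60 * MPH_TO_MPS / 200     # ~0.134 m, under the 0.2 m bound
```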
By combining both GPS and IMU, we can provide accurate and real-time updates for vehicle localization. Nonetheless, we cannot rely on this sole combination for localization for three reasons:
its accuracy is only about one meter;
the GPS signal has multipath problems, meaning that the signal may bounce off buildings, introducing more noise;
GPS requires an unobstructed view of the sky and would thus not work in environments such as tunnels.
LiDAR is used for mapping, localization, and obstacle avoidance.
It works by bouncing a laser beam off surfaces and measuring the reflection time to determine distance. Due to its high accuracy, it is used as the main sensor in most autonomous vehicle implementations. LiDAR can be used to
produce high-definition maps,
localize a moving vehicle against high-definition maps,
detect obstacles ahead, etc.
Normally, a LiDAR unit, such as the Velodyne 64-beam laser, rotates at 10 Hz and takes about 1.3 million readings per second. There are two main problems with LiDAR:
when there are many suspended particles in the air, such as rain drops and dust, the measurements may be extremely noisy.
a 64-beam LiDAR unit is quite costly.
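The figures above also imply the per-rotation point budget and the horizontal angular resolution of such a unit. These are derived back-of-the-envelope numbers, not vendor specifications:

```python
# Derived figures for a 64-beam unit rotating at 10 Hz with ~1.3M readings/s.

readings_per_s, rot_hz, beams = 1.3e6, 10, 64

per_rotation = readings_per_s / rot_hz     # 130,000 points per rotation
per_beam = per_rotation / beams            # ~2,031 points per beam per rev
angular_res_deg = 360 / per_beam           # ~0.18 degrees horizontally
```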
Cameras are mostly used for object recognition and object tracking tasks, such as traffic light detection, pedestrian detection, etc.
To enhance autonomous vehicle safety, existing implementations usually mount eight or more 1080p cameras around the car, such that we can use cameras to detect, recognize, and track objects in front of, behind, and on both sides of the vehicle. These cameras usually run at 60 Hz, and, when combined, would generate around 1.8 GB of raw data per second.
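The roughly 1.8 GB per second figure can be sanity-checked with a short calculation, assuming (an assumption, not stated above) 16 bits per pixel, as in a YUV 4:2:2 stream:

```python
# Raw data rate of eight 1080p cameras at 60 Hz, assuming 16 bits/pixel.

bytes_per_frame = 1920 * 1080 * 2          # 16 bits per pixel (assumed)
rate = bytes_per_frame * 60 * 8            # 8 cameras at 60 Hz
rate_gib_per_s = rate / 2**30              # ~1.85 GiB/s
```

The result lands near the stated figure, which illustrates why uploading raw camera data would saturate any available network bandwidth.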
The radar and sonar system is mostly used as the last line of defense in obstacle avoidance. The data generated by radar and sonar shows the distance to the nearest object in front of the vehicle’s path. Once an object is detected close ahead and there is a danger of collision, the autonomous vehicle should apply the brakes or turn to avoid the obstacle. Therefore, the data generated by radar and sonar does not require much processing and is usually fed directly to the control processor, bypassing the main computation pipeline, to implement such “urgent” functions as swerving, applying the brakes, or pre-tensioning the seatbelts.
After getting sensor data, we feed the data into the perception stage to understand the vehicle’s environment. The three main tasks in autonomous driving perception are localization, object detection, and object tracking.
Localization is a sensor-fusion process in which GPS/IMU and LiDAR data can be combined to generate a high-resolution infrared reflectance ground map. To localize a moving vehicle relative to these maps, we could apply a particle filter method to correlate the LiDAR measurements with the map. The particle filter method has been demonstrated to achieve real-time localization with 10-centimeter accuracy and to be effective in urban environments. However, the high cost of LiDAR could limit its wide application.
In recent years, however, we have seen the rapid development of vision-based Deep Learning technology, which achieves significant object detection and tracking accuracy. The Convolutional Neural Network (CNN) is a type of Deep Neural Network (DNN) that is widely used in object recognition tasks. A general CNN evaluation pipeline usually consists of the following layers:
The Convolution Layer which contains different filters to extract different features from the input image.
Each filter contains a set of “learnable” parameters that will be derived after the training stage.
The Activation Layer which decides whether to activate the target neuron or not.
The Pooling Layer which reduces the spatial size of the representation to reduce the number of parameters and consequently the computation in the network.
The Fully Connected Layer where neurons have full connections to all activations in the previous layer. The convolution layer is often the most computation-intensive layer in a CNN.
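A minimal NumPy sketch of the four layers above, applied to one tiny grayscale input: the filter and weight values are random stand-ins for learned parameters, and the input and kernel sizes are illustrative assumptions.

```python
# Minimal forward pass: convolution -> activation -> pooling -> fully
# connected, on an 8x8 input with random stand-in weights.
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel):                        # convolution layer (valid mode)
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):                                    # activation layer
    return np.maximum(x, 0)

def max_pool(x, size=2):                        # pooling layer
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))            # "learnable" parameters
features = max_pool(relu(conv2d(img, kernel)))  # 8x8 -> 6x6 -> 3x3
w_fc = rng.standard_normal((features.size, 2))
logits = features.reshape(-1) @ w_fc            # fully connected layer
```

Note how the nested loops in `conv2d` dominate the arithmetic, which is why the convolution layer is the most computation-intensive stage.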
Object tracking refers to the automatic estimation of the trajectory of an object as it moves.
After the object to track is identified using object recognition techniques, the goal of object tracking is to automatically follow the object's trajectory from then on.
This technology can be used to track nearby moving vehicles as well as people crossing the road to ensure that the current vehicle does not collide with these moving objects.
In recent years, deep learning techniques have demonstrated advantages in object tracking compared to conventional computer vision techniques.
Specifically, by using auxiliary natural images, a stacked Auto-Encoder can be trained offline to learn generic image features that are more robust against variations in viewpoints and vehicle positions.
Then, the offline trained model can be applied for online tracking.
Based on the understanding of the vehicle’s environment, the decision stage can generate a safe and efficient action plan in real-time. The tasks in the decision stage mostly involve probabilistic processes and Markov chains.
One of the main challenges for human drivers when navigating through traffic is to cope with the possible actions of other drivers which directly influence their own driving strategy.
This is especially true when there are multiple lanes on the road or when the vehicle is at a traffic change point.
To make sure that the vehicle travels safely in these environments, the decision unit generates predictions of nearby vehicles, and decides on an action plan based on these predictions.
To predict actions of other vehicles, one can generate a stochastic model of the reachable position sets of the other traffic participants, and associate these reachable sets with probability distributions.
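A minimal sketch of that idea: sample the other vehicle's possible accelerations, propagate its position over a short horizon, and turn the sampled reachable positions into a probability distribution over distance bins. All dynamics numbers (initial speed, acceleration range, horizon, bin width) are illustrative assumptions.

```python
# Stochastic reachable-set sketch: Monte Carlo over assumed accelerations,
# binned into a probability distribution over future positions.
import random

random.seed(1)

def reachable_distribution(v0=10.0, horizon_s=2.0, n=10000, bin_m=5):
    counts = {}
    for _ in range(n):
        a = random.uniform(-3.0, 2.0)               # assumed accel range, m/s^2
        x = v0 * horizon_s + 0.5 * a * horizon_s**2 # constant-accel kinematics
        b = int(x // bin_m) * bin_m                 # bin start, in meters
        counts[b] = counts.get(b, 0) + 1
    return {b: c / n for b, c in sorted(counts.items())}

dist = reachable_distribution()   # e.g. {10: ..., 15: ..., 20: ...}
```

The decision unit can then weigh its own candidate maneuvers against this distribution rather than against a single predicted position.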
Planning the path of an autonomous, agile vehicle in a dynamic environment is a very complex problem, especially when the vehicle is required to use its full maneuvering capabilities.
A brute force approach would be to search all possible paths and utilize a cost function to identify the best path.
However, the brute force approach would require enormous computation resources and may be unable to deliver navigation plans in real-time.
In order to circumvent the computational complexity of deterministic, complete algorithms, probabilistic planners have been utilized to provide effective real-time path planning.
As safety is the paramount concern in autonomous driving, at least two levels of obstacle avoidance mechanisms need to be deployed to ensure that the vehicle will not collide with obstacles.
The first level is proactive, and is based on traffic predictions.
At runtime, the traffic prediction mechanism generates measures like time to collision or predicted minimum distance, and based on this information, the obstacle avoidance mechanism is triggered to perform local path re-planning.
If the proactive mechanism fails, the second-level, the reactive mechanism, using radar data, will take over.
Once the radar detects an obstacle, it will override the current control to avoid the obstacles.
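The proactive trigger described above can be sketched with a simple time-to-collision test over a radar-style range and closing-speed reading; the 2-second threshold is an illustrative assumption.

```python
# Time-to-collision trigger for local path re-planning; threshold assumed.

def time_to_collision(range_m, closing_speed_mps):
    """Seconds until contact at constant closing speed; inf if opening."""
    return range_m / closing_speed_mps if closing_speed_mps > 0 else float("inf")

def needs_replan(range_m, closing_speed_mps, ttc_threshold_s=2.0):
    return time_to_collision(range_m, closing_speed_mps) < ttc_threshold_s

alarm = needs_replan(30.0, 20.0)    # TTC = 1.5 s, below the threshold
```

When this first-level check fires, local re-planning is attempted; the reactive radar override remains as the independent second level.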
An autonomous vehicle must be capable of sensing its environment and safely navigating without human input. Indeed, the US Department of Transportation's National Highway Traffic Safety Administration (NHTSA) has formally defined five different levels of autonomous driving:
Level 0: the driver completely controls the vehicle at all times; the vehicle is not autonomous at all.
Level 1: semi-autonomous; most functions are controlled by the driver, however some functions such as braking can be done automatically by the vehicle.
Level 2: the driver is disengaged from physically operating the vehicle by having no contact with the steering wheel and foot pedals. This means that at least two functions, cruise control and lane-centering, are automated.
Level 3: there is still a driver who may completely shift safety-critical functions to the vehicle and is not required to monitor the situation as closely as for the lower levels.
Level 4: the vehicle performs all safety-critical functions for the entire trip, and the driver is not expected to control the vehicle at any time since this vehicle would control all functions from start to stop, including all parking functions.
Level 3 and Level 4 autonomous vehicles must sense their surroundings by using multiple sensors, including LiDAR, GPS, IMU, cameras, etc. Based on the sensor inputs, they need to be able to localize themselves and, in real-time, make decisions about how to navigate within the perceived environment.
Due to the enormous amount of sensor data and the high complexity of the computation pipeline, autonomous driving places extremely high demands on computing power and electrical power consumption.
Existing designs often require equipping an autonomous car with multiple servers, each with multiple high-end CPUs and GPUs. These designs come with several problems:
First, the costs are extremely high, thus making autonomy unaffordable to the general public.
Second, power supply and heat dissipation become a problem as this setup consumes thousands of Watts, consequently placing high demands on the vehicle’s power system.
In this CRC Press News, we have explored computer architecture techniques for autonomous driving.
First, we have introduced the tasks involved in current LiDAR-based autonomous driving.
Second, we have explored how vision-based autonomous driving, a rising paradigm for autonomous driving, is different from the LiDAR-based counterpart.
Then, we have examined existing system implementations for autonomous driving.
Next, considering different computing resources, including CPU, GPU, FPGA, and DSP, we have further explored the most suitable computing resource for each task.
Based on the simulation results of running autonomous driving tasks on a heterogeneous ARM mobile SoC, we have explored a system architecture for autonomous driving that is capable of delivering high levels of computing performance.
In summary, in this CRC Press News, we have described the computing tasks involved in autonomous driving and examined existing autonomous driving computing platform implementations. To enable autonomous driving, the computing stack must simultaneously provide sufficient computing performance, low power consumption, and low thermal dissipation. We have also discussed possible approaches to designing computing platforms that will meet these needs.