SiaSearch — A Tool to Tame the Data Flood of Intelligent Vehicles

January 23, 2020

Transferring self-driving vehicles from a well-controlled research setting into the real world is a crucial step which in recent years has proven to be harder than originally expected (see, for example, the valuation cut of Waymo by Morgan Stanley due to developmental delays, or Cruise’s announcement about delays in bringing L4 to market).

Whether we are talking about automated highway driving or level 4–5 autonomy, an autonomous vehicle has to safely tackle a large variety of traffic situations, ranging from everyday situations to rare corner cases. In order to develop and test the perception and navigation algorithms of an autonomous vehicle (AV), engineers rely on a large amount of previously recorded data. However, this data is typically hard to access, resulting in a time-consuming process of data triage. You can only begin to imagine how long it would take to find specific situations of interest, such as cut-in maneuvers on highways or jaywalkers in the city. In the following post we will introduce our solution to the exciting challenge of making autonomous driving data easily accessible.

Manually recorded driving data continues to be crucial for the development of AVs. While simulations are widely used too, simulated data alone is not enough for the development of most systems. Moreover, simulated data is typically derived from real data in order to identify and replicate the underlying interaction models.

In order to accelerate development of autonomous vehicles, the large available data lake needs to be easily accessible for engineers.

Yet companies that operate autonomous and intelligent vehicles on the road certainly do not face a shortage of recorded data. A single vehicle with a typical sensor setup of cameras, radars, IMU, GPS and potentially lidars records data at a rate of 1 to 10 TB per hour of driving. With a fleet of 20 vehicles, this means that around 1 PB of new data is recorded every single week! Even if only half of it is stored permanently, this results in storage costs of millions of USD per year. In addition, it is still unclear what data will be legally required to validate and certify the safety of automated vehicles, which typically results in a rather conservative, “self-regulated” data retention policy.
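
As a rough back-of-the-envelope check (every figure below is an illustrative assumption, not a measured number), the arithmetic looks like this:

```python
# Back-of-the-envelope estimate; every figure here is an illustrative assumption.
vehicles = 20           # deployed fleet size
hours_per_week = 25     # recording hours per vehicle per week (assumed)
tb_per_hour = 2         # recording rate, within the 1 to 10 TB/h range
retention = 0.5         # fraction of the data kept permanently
usd_per_tb_month = 20   # rough object-storage price (assumed)

tb_per_week = vehicles * hours_per_week * tb_per_hour
print(f"New data per week: ~{tb_per_week / 1000:.1f} PB")

# Retained data accumulates over the year, so on average roughly half of the
# end-of-year volume is in storage at any given time.
retained_tb_after_year = tb_per_week * 52 * retention
yearly_storage_cost = retained_tb_after_year / 2 * usd_per_tb_month * 12
print(f"Storage bill in year one: ~${yearly_storage_cost / 1e6:.1f}M")
```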

Deciding what to store and where to store it (on-premises vs. cloud) is not the only issue with such huge amounts of data. Engineers need to be able to easily query the data and decide on its relevance, which requires information about the content of the recordings. As of today, understanding the content of the sensor data relies heavily on manual labor (in-car tagging with a tablet application or post-collection manual review of the data). This can easily eat up more than 50% of an engineer’s working time, significantly slowing down the pace of development and adding considerable costs to the data-handling supply chain.

In autonomous driving, extremely large amounts of data are recorded. Finding and selecting the required data for development needs to be automated as manual labor is slow, cumbersome and does not scale.

Connecting engineers to data with SiaSearch

So how can we solve the data overload problem? Is there a way to get the required data from recording to the engineer faster and cheaper? Can an engineer seamlessly access the whole database of recordings within fractions of a second? Is there a way to automate the data–triage process? With SiaSearch, the answer to all these questions is yes!

We have developed SiaSearch, the leading search engine for autonomous driving and ADAS data. SiaSearch allows users to query for specific driving data and access the exact sequences they are looking for, instead of writing cumbersome ETL jobs themselves. Let’s look at an example: A motion planning engineer is interested in assessing the system’s behavior during left turns. More specifically, she is interested in unprotected left turns at night, during rain, at a crowded intersection with many surrounding pedestrians. In a classical setup she would have to manually trawl through the data until she has found a satisfactory number of situations. With SiaSearch, on the other hand, the results of that search are provided in fractions of a second.
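
To make this concrete, here is a minimal sketch of what such a query could look like through a programmatic interface. The endpoint, attribute names and response format are hypothetical and only illustrate the idea; they are not the actual SiaSearch API.

```python
# Minimal sketch with a hypothetical REST endpoint, attribute names and
# response format; the real SiaSearch API may look different.
import requests

QUERY_URL = "https://siasearch.example.internal/api/v1/query"  # hypothetical

filters = {
    "maneuver": "unprotected_left_turn",
    "time_of_day": "night",
    "weather": "rain",
    "min_pedestrian_count": 10,   # proxy for "crowded intersection"
}

response = requests.post(QUERY_URL, json=filters, timeout=30)
response.raise_for_status()

# Each hit is a drive segment: a recording id plus a time interval.
for segment in response.json()["segments"]:
    print(segment["drive_id"], segment["start_ts"], segment["end_ts"])
```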

SiaSearch is a search engine for autonomous driving data. The raw data is indexed in order to make it searchable for the users.

So how does SiaSearch work? It analyzes the raw driving data in two stages: In the first stage, the raw data is processed by a scalable and efficient ETL pipeline; using intelligent extractors, the data is analyzed, catalogued, and loaded into a database. In the second stage, engineers access the data through a web-based interface or a programmatic API.

Extracting the data content

In the first stage, we use a large-scale data processing pipeline to extract relevant information from the raw data and to build a queryable data catalogue by running extractors for a variety of semantic attributes and driving situations. Efficient distributed processing lets the pipeline run around 25x faster than real time, which means SiaSearch can process and tag one hour of driving in under 3 minutes, ensuring minimal waiting times and fast iterations. The data does not need to be labeled or manually pre-selected before it enters the pipeline.
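
To give an intuition for what an extractor produces, here is a simplified stand-in (not SiaSearch’s actual code) that turns a raw per-frame signal into queryable attribute intervals:

```python
# Simplified stand-in for a semantic attribute extractor (not SiaSearch's
# actual code): it turns a raw per-frame signal, here pedestrian counts,
# into queryable time intervals.
from typing import Iterable, Iterator, Tuple

def extract_intervals(timestamps: Iterable[float],
                      values: Iterable[int],
                      threshold: int = 5,
                      attribute: str = "crowded_scene") -> Iterator[Tuple[str, float, float]]:
    """Yield (attribute, start_ts, end_ts) spans where the signal exceeds the threshold."""
    start = prev_ts = None
    for ts, value in zip(timestamps, values):
        if value >= threshold and start is None:
            start = ts
        elif value < threshold and start is not None:
            yield (attribute, start, prev_ts)
            start = None
        prev_ts = ts
    if start is not None:
        yield (attribute, start, prev_ts)

# These rows would then be loaded into the searchable data catalogue.
rows = list(extract_intervals([0.0, 0.1, 0.2, 0.3, 0.4], [2, 7, 9, 6, 1]))
print(rows)  # [('crowded_scene', 0.1, 0.3)]
```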

Accessing the data

In the second stage, the user accesses the data through our web-based GUI or our programmatic API. By selecting from an intuitive directory of semantic attributes, users can conduct granular queries without any prior experience with the platform. Under the hood, SiaSearch runs an optimized query on the data catalogue created in the first stage. It finds the time intervals relevant to the user query and returns the corresponding drive segments. Additionally, SiaSearch also clusters scenarios based on their similarity, making the search experience more intuitive and the results even more relevant.
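
Assuming the catalogue is, conceptually, a table of attribute intervals (an illustration rather than the real schema), the core lookup can be pictured as an interval intersection:

```python
# Toy illustration of the catalogue lookup: intersect the time intervals of
# two attributes to find segments that satisfy both. The schema is hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE intervals (drive_id TEXT, attribute TEXT, start_ts REAL, end_ts REAL)")
con.executemany("INSERT INTO intervals VALUES (?, ?, ?, ?)", [
    ("drive_042", "rain",                  10.0, 90.0),
    ("drive_042", "unprotected_left_turn", 30.0, 45.0),
    ("drive_042", "unprotected_left_turn", 70.0, 95.0),
])

# Segments where an unprotected left turn overlaps a rainy interval.
hits = con.execute("""
    SELECT a.drive_id, MAX(a.start_ts, b.start_ts), MIN(a.end_ts, b.end_ts)
    FROM intervals a JOIN intervals b
      ON a.drive_id = b.drive_id
     AND a.attribute = 'unprotected_left_turn'
     AND b.attribute = 'rain'
     AND a.start_ts < b.end_ts AND b.start_ts < a.end_ts
""").fetchall()
print(hits)  # [('drive_042', 30.0, 45.0), ('drive_042', 70.0, 90.0)]
```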

SiaSearch allows the users to easily search for the data they want and provides the results in fractions of a second.

How can SiaSearch be integrated?

SiaSearch is built in a highly modular way. It is fully dockerized and can be easily deployed on cloud platforms or on-premises using Kubernetes. Via its programmatic API, it can also be integrated into existing data pipelines. The GUI is web-based and therefore easily accessible, and it allows for straightforward sharing of queries and data with colleagues. In order to keep data exchange to a minimum, SiaSearch is designed to run entirely within the customer’s infrastructure. This allows our clients to maintain full control and ownership over their data.

As many AV engineers with different tasks and backgrounds rely on data in their day-to-day development, SiaSearch users are quite diverse. In the following we outline three standard use cases of the product.

Assembling new training data for perception model training

As a perception engineer, you probably know it is not only about big data but also about smart data. In order to develop reliable models, the training (as well as validation and test) data needs to be properly distributed. Moreover, datasets need to be continuously updated in order to cover a representative distribution of the real world. Therefore, engineers need an efficient way of finding and selecting the data they need to improve their models. SiaSearch offers just that, either through the web interface or in an integrated fashion through our API. The prospective training data found through SiaSearch can be forwarded to labeling (ground truth annotation) with the click of a button.
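
As a toy illustration of the “smart data” idea, the following sketch checks whether a training set matches a target distribution of conditions; the condition tags and target mix are made-up examples:

```python
# Toy dataset-balance check; the condition tags and target mix are made-up
# examples, not real project numbers.
from collections import Counter

# Condition tags of the samples already in the training set (from the catalogue).
training_tags = ["day", "day", "day", "night", "day", "rain", "day", "day"]

target_mix = {"day": 0.50, "night": 0.25, "rain": 0.25}
counts = Counter(training_tags)
total = sum(counts.values())

for condition, target in target_mix.items():
    actual = counts[condition] / total
    if actual < target:
        missing = round((target - actual) * total)
        print(f"Underrepresented: {condition} ({actual:.0%} vs. target {target:.0%}), "
              f"query at least {missing} more samples")
```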

Assembling a new test dataset

Everyone in the domain of autonomous and intelligent vehicles agrees that testing vehicles purely on the road is not feasible. Recordings and simulation need to be combined efficiently, typically in a setup called scenario-based testing. Based on real-world recordings and statistics, suitable test scenarios must be defined and either simulated or detected in real data. To work towards a complete catalogue of scenarios and driving situations, SiaSearch makes it easy to find the scenarios still missing from your driving data and to align them with the simulated ones.
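
Conceptually, this boils down to a coverage check between the scenario catalogue required for testing and what has actually been recorded; the scenario names below are invented for illustration:

```python
# Toy coverage check: compare the scenario catalogue required for testing
# against the scenarios actually found in the recorded data.
required_scenarios = {
    "cut_in_highway", "unprotected_left_turn", "jaywalking_pedestrian",
    "construction_zone", "emergency_vehicle_approach",
}
recorded_scenarios = {
    "cut_in_highway", "unprotected_left_turn", "jaywalking_pedestrian",
}

missing = required_scenarios - recorded_scenarios
print("Scenarios to record or simulate:", sorted(missing))
# Scenarios to record or simulate: ['construction_zone', 'emergency_vehicle_approach']
```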

Quickly finding scenarios for experiments

Does the following situation sound familiar? You are working on a new functionality and want to test it in a specific scenario. Instead of being limited to the few recorded failure cases of the AV, SiaSearch lets you access all the recorded data at once. In addition, we cluster the metadata, allowing for an intuitive similarity search. This enables the user to find additional similar situations and to explore datasets efficiently. For example, think about the possibility of finding scenarios similar to any disengagement of the autonomous system: comparing scenarios in which the system disengaged to similar scenarios in which it didn’t will tell you much more about the disengagement and failure cases.
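
A minimal sketch of such a metadata-based similarity search, assuming scenes are encoded as small attribute vectors (the encoding and values are invented for illustration):

```python
# Minimal sketch of metadata-based similarity search: find recorded scenes
# whose attribute vectors are closest to a disengagement scene. The feature
# encoding below is an illustrative assumption.
import math

# Hypothetical attribute vectors: [speed_kmh/100, pedestrian_count/10, rain, night]
scenes = {
    "drive_007@t=132s": [0.55, 0.8, 1.0, 1.0],
    "drive_019@t=844s": [0.30, 0.1, 0.0, 0.0],
    "drive_023@t=412s": [0.50, 0.7, 1.0, 1.0],
}
disengagement = [0.52, 0.9, 1.0, 1.0]

# Rank scenes by Euclidean distance to the disengagement scene.
ranked = sorted(scenes.items(), key=lambda kv: math.dist(kv[1], disengagement))
for scene_id, features in ranked[:2]:
    print(scene_id, round(math.dist(features, disengagement), 3))
```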
