Future Vision Transport : Design an Autonomous Vehicle

Context

"Future Vision Transport" is a company building embedded computer vision systems for autonomous vehicles.

The goal here is to build a model able to classify each pixel of an image into one of the given categories of objects, and expose this model's API as a web service. This problem is known as "Semantic Segmentation" and is a challenge for the autonomous vehicle industry.

We will compare different models (varying architecture, augmentation method and image resizing) and their performances on the Cityscapes dataset. We will use AzureML and Azure App Service to deploy our models.

State of the art

In Computer Vision, the problem of Semantic Segmentation is to classify each pixel of an image into one of the given categories of objects.

Deep Neural Network (DNN) models

The basic principle of DNNs for Semantic Segmentation is the following :

Encoder / Decoder architecture
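The encoder progressively downsamples the image to extract increasingly abstract features, and the decoder upsamples these features back to the input resolution to produce one class prediction per pixel. As a minimal illustrative sketch (not one of the models evaluated below), such an encoder / decoder network could be written in Keras as :

```python
import tensorflow as tf
from tensorflow.keras import layers

def minimal_encoder_decoder(input_shape=(256, 256, 3), n_classes=8):
    inputs = tf.keras.Input(shape=input_shape)
    # Encoder : convolutions + downsampling extract increasingly abstract features
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    # Decoder : transposed convolutions restore the spatial resolution of the input
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    # One softmax probability per pixel and per class
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```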

Over time, research teams have produced more sophisticated models, with more layers, connections between layers and different convolution kernels, in order to improve prediction accuracy.

The major models are :

FCN (2015)

U-Net (2015)

PSPNet (2017)

DenseNet (2018)

DeepLabv3+ (2018)

Results on the Cityscapes Dataset

The Cityscapes Dataset is a large dataset of images captured by a car dash camera in the city, with their corresponding labels.

Semantic Segmentation on Cityscapes State of the Art

Project modules

We will use the Python programming language, and present the code and results in this JupyterLab notebook.

We will use the usual libraries for data exploration, modeling and visualisation :

We will also use libraries specific to the goals of this project :

Exploratory data analysis (EDA)

We are going to load the data and analyse the distribution of each variable.

Load data

Let's download the data from the Cityscapes dataset and look at what it contains.

The dataset is composed of the following image files :

We are not going to use the polygon shapes, but we will use the instance ids to build the ground truth images.

There are :

According to the Cityscapes dataset documentation, the images are of size 2048x1024 and are in RGB format.

We will not use the object labels (32 labels), but the 8 label categories : "void", "flat", "construction", "object", "nature", "sky", "human" and "vehicle".
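As a sketch of how this mapping can be built, the official cityscapesscripts helper exposes, for each raw label id, the corresponding category id (the `to_category_mask` helper name is ours) :

```python
import numpy as np
from cityscapesscripts.helpers.labels import labels  # official Cityscapes label definitions

# Lookup table from raw label id (0..33) to category id (0 = "void" ... 7 = "vehicle")
ID_TO_CATEGORY = np.zeros(max(label.id for label in labels) + 1, dtype=np.uint8)
for label in labels:
    if label.id >= 0:  # skip the "license plate" label, which has id -1
        ID_TO_CATEGORY[label.id] = label.categoryId

def to_category_mask(label_id_image):
    """Convert a *_labelIds.png array into an 8-category ground-truth mask."""
    return ID_TO_CATEGORY[label_id_image]
```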

Now that we have an idea of what the dataset contains, let's understand it further by computing statistics.
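One straightforward statistic is the share of pixels belonging to each of the 8 categories over the whole training set; a sketch, reusing the `to_category_mask` helper above (`ground_truth_paths` is a placeholder for the list of *_labelIds.png files) :

```python
from collections import Counter

import numpy as np
from PIL import Image

CATEGORIES = ["void", "flat", "construction", "object", "nature", "sky", "human", "vehicle"]

pixel_counts = Counter()
for mask_path in ground_truth_paths:  # placeholder : list of *_labelIds.png paths
    mask = to_category_mask(np.array(Image.open(mask_path)))
    values, counts = np.unique(mask, return_counts=True)
    pixel_counts.update(dict(zip(values.tolist(), counts.tolist())))

total = sum(pixel_counts.values())
for category_id, name in enumerate(CATEGORIES):
    print(f"{name:>12} : {100 * pixel_counts[category_id] / total:.2f} % of pixels")
```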

We can see that the most important category for an autonomous vehicle ("human") is the least represented in the dataset.

This means the problem is imbalanced, and we have to take into account that the most important class is under-represented.

Models selection and training

We are going to evaluate the performance of different models, with different parameters :

The training experiment will be run in AzureML, and the results will be stored in the AzureML environment.
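For illustration, submitting such a run with the AzureML SDK could look like the sketch below (the compute target, environment file and training script names are assumptions) :

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()  # reads the workspace config.json

env = Environment.from_conda_specification(
    name="segmentation-env", file_path="environment.yml")  # hypothetical conda file

config = ScriptRunConfig(
    source_directory="./src",
    script="train.py",                 # hypothetical training script
    compute_target="gpu-cluster",      # hypothetical compute target name
    environment=env,
    arguments=["--model", "deeplabv3plus", "--img-size", "256", "--augmentation", "0"],
)

run = Experiment(ws, "semantic-segmentation").submit(config)
run.wait_for_completion(show_output=True)
```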

Selected models architectures

We can observe the architecture of the three selected models :

| Model name | FCN-8 | U-Net | DeepLabv3+ |
| --- | --- | --- | --- |
| Total params | 69,775,768 | 2,060,424 | 2,143,304 |
| Trainable params | 69,773,848 | 2,056,648 | 2,110,216 |
| Non-trainable params | 1,920 | 3,776 | 33,088 |

Loss and metric

There are multiple ways to evaluate the model's performance.

| Loss | Metric | Intersection vs. Union | Confusion Matrix | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Sparse Categorical Cross-entropy | Pixel Accuracy | $-$ | $\frac{TP + TN}{TP + FP + TN + FN}$ | Easy to interpret. | Bad with imbalanced target classes. |
| Dice | F1 | $\frac{2 \lvert A \cap B \rvert}{\lvert A \rvert + \lvert B \rvert}$ | $\frac{2 TP}{2 TP + FP + FN}$ | Good with imbalanced target classes. | Not easy to interpret. |
| Jaccard | Intersection over Union (IoU) | $\frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | $\frac{TP}{TP + FP + FN}$ | Easy to interpret. Good with imbalanced target classes. | $-$ |

We are going to use the Jaccard Index as the model's loss and the Mean IoU as the model's metric.
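These can be implemented in Keras as follows (a sketch; `model` stands for any of the segmentation models above, and the built-in MeanIoU metric expects class indices, hence the small wrapper) :

```python
import tensorflow as tf

def jaccard_loss(y_true, y_pred, smooth=1.0):
    """Soft Jaccard (1 - IoU) loss, averaged over classes and batch."""
    intersection = tf.reduce_sum(y_true * y_pred, axis=(1, 2))
    union = tf.reduce_sum(y_true + y_pred, axis=(1, 2)) - intersection
    iou = (intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(iou)

class MeanIoUFromProbs(tf.keras.metrics.MeanIoU):
    """MeanIoU expects class indices, so argmax the one-hot / softmax tensors first."""
    def update_state(self, y_true, y_pred, sample_weight=None):
        return super().update_state(
            tf.argmax(y_true, axis=-1), tf.argmax(y_pred, axis=-1), sample_weight)

model.compile(optimizer="adam", loss=jaccard_loss,
              metrics=[MeanIoUFromProbs(num_classes=8)])
```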

Training process

The code is available in the train.ipynb file :
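For reference, the on-the-fly image augmentation can be sketched as follows (assuming a tf.data pipeline of (image, mask) pairs with a channel dimension on the mask; the exact transformations used in train.ipynb may differ) :

```python
import tensorflow as tf

def augment(image, mask):
    # Apply the same random horizontal flip to the image and to its ground-truth mask
    flip = tf.random.uniform(()) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    # Photometric change applied to the image only
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, mask

# train_ds is the (image, mask) training tf.data.Dataset
train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```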

Training results

The training results are available in the AzureML experiment.

FCN-8 is slightly faster to train than DeepLabV3+ and U-Net.

DeepLabV3+ performs much better than FCN-8 and U-Net.

Adding image augmentation has only a slight effect on the training time.

Adding image augmentation reduces the models' performance : ~ -20% Mean IoU on average with augmentation vs. without.

This can be explained by the fact that, with augmentation, the models are trained on a dataset that is no longer representative of the validation dataset; the models trained on augmented images therefore underfit the validation dataset.

Changing the input image size doesn't seem to have a significant effect on training time.

Increasing the input image size greatly improves the models' performance, especially for DeepLabV3+.

Overall, we can conclude that the best results are obtained with DeepLabV3+, without image augmentation.

Model deployment and testing

In order to provide the predictions service as requested, we will deploy our model in production. We will use two approaches :

Deployment as AzureML Endpoint

The code is available in the deploy.ipynb file :
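For illustration, deploying a registered model as a web service with the AzureML SDK can be sketched as follows (the model, environment and service names are assumptions) :

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="deeplabv3plus-256")  # hypothetical registered model name
env = Environment.from_conda_specification(
    name="inference-env", file_path="environment.yml")  # hypothetical conda file

inference_config = InferenceConfig(
    entry_script="score.py",       # hypothetical scoring script
    source_directory="./src",
    environment=env,
)
deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4)

service = Model.deploy(ws, "segmentation-endpoint", [model],
                       inference_config, deploy_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```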

Test the Endpoints

We can test the Endpoint by calling the API with a sample image. The results are visible in the predict.ipynb file :
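Calling the Endpoint can be sketched as follows (the scoring URI, authentication key and JSON payload format depend on the actual scoring script and are assumptions here) :

```python
import base64

import requests

scoring_uri = "https://<endpoint>/score"  # placeholder : URI given by AzureML
api_key = "<endpoint key>"                # placeholder : authentication key

with open("sample_image.png", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("utf-8")}

response = requests.post(
    scoring_uri,
    json=payload,
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
)
print(f"prediction time : {response.elapsed.total_seconds():.2f}s")
predicted_mask = response.json()  # per-pixel category ids, format defined by score.py
```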

| Model | Prediction time |
| --- | --- |
| U-Net 64px | 5.76 s |
| U-Net 64px with augmentation | 5.65 s |
| DeepLab 64px | 5.94 s |
| DeepLab 64px with augmentation | 5.97 s |
| U-Net 128px | 9.83 s |
| U-Net 128px with augmentation | 9.55 s |
| DeepLab 128px | 14.38 s |
| DeepLab 128px with augmentation | 11.08 s |
| U-Net 256px | 41.91 s |
| U-Net 256px with augmentation | 32.52 s |
| DeepLab 256px | 27.02 s |
| DeepLab 256px with augmentation | 23.85 s |

We can see that the prediction quality improves with the input size, but the prediction time also increases : from ~6s with 64x64 pixels to more than 40s with 256x256 pixels, until the API fails with a timeout error (60s limit) for inputs > 320 pixels.

We can also see that the DeepLabV3+ model at 256 pixels is the only model to correctly predict the "human" pixels.

Deployment as Flask webapp

We developed a simple web application based on the Flask framework. The code is available in the ../webapp directory :
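A minimal version of the prediction route can be sketched as follows (the model path and route name are assumptions; the actual code in ../webapp may differ) :

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)
model = tf.keras.models.load_model("model.h5", compile=False)  # hypothetical model file

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded image, resize it to the model's input size and normalise it
    image = Image.open(request.files["image"].stream).convert("RGB").resize((256, 256))
    x = np.expand_dims(np.asarray(image) / 255.0, axis=0)
    # Argmax over the softmax output gives one category id per pixel
    mask = np.argmax(model.predict(x)[0], axis=-1)
    return jsonify({"mask": mask.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```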

Prediction time comparison

Let's compare the time required to predict the segmentation mask for the input image on the two deployment approaches, as well as a local prediction.

We can see that the local prediction is dramatically faster. This is mostly due to the image transfer time required to query the remote Endpoint and API. This is the rationale behind Edge Computing : running predictions locally on IoT devices is much faster than sending the data to a server and running them there.

Conclusion

We have been able to evaluate the performance of different models, with different parameters, and to deploy them in production.

Two main challenges remain :

In order to mitigate those two problems, we could :