
Testing MATLAB's Computer Vision Toolbox

MATLAB has many different toolboxes for various applications, such as signal processing, control engineering, AI, and image processing. To find out how quickly and easily they can be used, a small project was created based on a practical example that uses the toolboxes for image processing, computer vision, and deep learning.

The project involves using a camera to show a robot or gripper arm the exact positions of apples that have fallen from a tree so that they can then be picked up. To do this, the device must first recognise what is and is not an apple in an image.

MATLAB was used to process images of different apples in order to obtain training data for two different object detectors implemented in MATLAB. These detectors should then be able to recognise apples on the ground. This was tested by evaluating the detectors' output on a webcam input and on a smartphone camera using MATLAB Mobile. We succeeded in creating, evaluating, and applying two functioning object detectors in a short time and with little training data.

Figure 1: Process for detecting apples

Preprocessing

Contrast 

To achieve good, consistent contrast, the brightness of all pixels is first analysed using a histogram, which sorts the pixels by value and counts them. The histogram makes it easy to see whether an image is predominantly dark or light. To give all images similar brightness and contrast, the darkest 1% of pixels is shifted to the minimum and the brightest 1% to the maximum; this stretches the remaining pixels across the entire brightness range, which makes dark images lighter and vice versa. To obtain the brightness of an image in RGB format, it must first be converted to the HSV colour space, where the V channel directly holds the brightness of each pixel. Once the V values have been adjusted, the image can be converted back to RGB.
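In MATLAB, this step can be sketched with "rgb2hsv", "stretchlim" and "imadjust" from the Image Processing Toolbox; the helper name here is our own:

```matlab
% Normalise brightness via the HSV value channel.
% stretchlim's default tolerance saturates the bottom and top 1%.
function out = adjustContrast(rgb)
    hsv = rgb2hsv(rgb);                        % RGB -> HSV
    v   = hsv(:,:,3);                          % V = per-pixel brightness
    hsv(:,:,3) = imadjust(v, stretchlim(v));   % stretch to full range
    out = hsv2rgb(hsv);                        % back to RGB
end
```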

Figure 2: Histogram of brightness before (left) and after (right) contrast adjustment
Figure 3: Image on the left is the original, image on the right has had its contrast adjusted
Figure 4: Image on the left is the original, image on the right has had its contrast adjusted

Size Adjustment

The size of the images must be adjusted because the input of the object detectors is smaller than the images. The detector input is 704x1280, chosen to match the highest available webcam resolution as closely as possible. The training data was recorded with a smartphone at resolutions of up to 2112x4608, so it must be scaled down to stay within the input size. To obtain more training data, the images were also cropped at the edges, which makes the apples in the image relatively larger. The final resize to 704x1280 happens either in the training-data augmenter or in the object detector itself.
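A minimal sketch of the downscaling step (the file and variable names are ours); "bboxresize" keeps the labelled boxes consistent with the resized image:

```matlab
% Scale a large training image so it fits the 704x1280 detector input
% while preserving the aspect ratio.
img    = imread("apple_001.jpg");                 % hypothetical file
target = [704 1280];                              % [rows cols]
scale  = min(target ./ [size(img,1) size(img,2)]);
imgSmall = imresize(img, scale);
% The ground-truth boxes must be scaled by the same factor:
% boxesSmall = bboxresize(boxes, scale);
```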

Detectors and Training

Training Data

In order to train object detectors, as much training data as possible is required. Since collecting large datasets is not always feasible, the same data is randomly modified (augmented) so that the object detectors train more robustly. There are many ways to modify images without making their content unrecognisable. For example, saturation, contrast or brightness can be altered slightly and the image content remains clear. Alternatively, the image can be rotated, mirrored or distorted, yielding new usable data for training the object detectors.

First, new images are taken and labelled with bounding boxes and labels using the MATLAB app “Image Labeler”. The bounding boxes mark where an apple is in the image. This training data is stored in datastores to use as little memory as possible during training.

Next, a callback function is created in which the training data is cropped to the correct size and randomly augmented. The callback is called every time the datastore is read during training, which keeps training efficient and memory usage low. The random augmentation covers colour value, contrast, brightness, rotation, scaling and shearing; crucially, the bounding boxes must be transformed along with the image. Finally, the images are scaled to the input size and bounding boxes that have become too small are deleted. As a result, very small apples are no longer recognised, but the anchor boxes become more uniform, which increases accuracy.
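The callback described above can be sketched as follows (function and variable names are ours); "jitterColorHSV", "randomAffine2d", "imwarp" and "bboxwarp" are the relevant toolbox functions:

```matlab
% Augmentation callback: called lazily on every datastore read.
% In/out: {image, boxes, labels} as produced by the combined
% image and box-label datastores.
function out = augmentData(in)
    img = in{1};  boxes = in{2};  labels = in{3};
    % random colour, saturation, brightness and contrast jitter
    img = jitterColorHSV(img, "Hue",0.05, "Saturation",0.2, ...
                         "Brightness",0.2, "Contrast",0.2);
    % random rotation, scaling, shearing and mirroring
    tform = randomAffine2d("Rotation",[-15 15], "Scale",[0.9 1.1], ...
                           "Shear",[-5 5], "XReflection",true);
    rout = affineOutputView(size(img), tform);
    img  = imwarp(img, tform, "OutputView", rout);
    % warp the boxes with the same transform; drop boxes that
    % end up mostly outside the image
    [boxes, idx] = bboxwarp(boxes, tform, rout, "OverlapThreshold", 0.25);
    out = {img, boxes, labels(idx)};
end
% attached to the datastore with:
% trainingData = transform(combine(imds, blds), @augmentData);
```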

Figure 5: Training images with bounding boxes after random changes

ACFObjectDetector

The ACFObjectDetector is an object detector that finds objects in images using aggregate channel features, following the approach described by Dollár et al. (1). Roughly speaking, so-called channels are computed from the images: a channel can be a single colour component, for example, or a filtered version of the image that emphasises edges and gradients. Features are aggregated from these channels and used to train boosted decision trees that identify objects in images.

Figure 6: Structure of the ACF detector from Dollár et al. (1)

The ACF detector is trained using the "trainACFObjectDetector" function. It was trained with 10 stages and the augmented training data, and then the "detect" function was used to attempt to recognise apples in images. If any were recognised, the bounding boxes and their confidence levels were obtained.
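The ACF workflow condenses to a few calls (the variable names and the confidence threshold of 25 mirror the text above; "trainingData" is assumed to hold the labelled images):

```matlab
% Train the ACF detector with 10 stages, then run it on a test image.
acfDetector = trainACFObjectDetector(trainingData, "NumStages", 10);

I = imread("test_scene.jpg");                     % hypothetical file
[bboxes, scores] = detect(acfDetector, I);
keep = scores > 25;                               % confidence threshold
annotated = insertObjectAnnotation(I, "rectangle", ...
                                   bboxes(keep,:), scores(keep));
imshow(annotated)
```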

Figure 7: Left: Histogram of detected apples based on their confidence level / Right: Image in which apples were detected but only marked if their confidence level was above 25

After training, the detector can be evaluated to see how well it performs. The augmented training data was reused for this purpose, since it changes slightly each time it is read. Ideally, however, separate test data should be used to obtain a more reliable picture.

The "evaluateObjectDetection" function makes it very easy to evaluate object detectors. The function directly generates data that can be used to determine the accuracy of the detector, for example as a precision-recall plot. Precision is calculated from true positives (TP) and false positives (FP): Precision = TP / (TP + FP). Recall is calculated from true positives (TP) and false negatives (FN): Recall = TP / (TP + FN).
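The evaluation can be sketched as follows (variable names are ours; the exact fields of the returned metrics object may vary by MATLAB release):

```matlab
% Run the detector over a datastore of test images, then evaluate.
results = detect(acfDetector, testData);
metrics = evaluateObjectDetection(results, testData);

% Precision/recall curve for the first (and only) class, "apple":
p = metrics.ClassMetrics.Precision{1};
r = metrics.ClassMetrics.Recall{1};
plot(r, p); xlabel("Recall"); ylabel("Precision");
```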

It has been found that the ACF detector is not particularly good at recognising apples. This could be because apples vary greatly in size and also have different colours and shapes. Another reason could be that very little training data was used.

Figure 8: Precision and recall curve of the ACF detector
Figure 9: Apples detected from the augmented training data, with their confidences, by the ACF detector

YOLOv4ObjectDetector

The YOLOv4 (You Only Look Once version 4) object detector is a deep learning network that specialises in real-time object detection with bounding boxes (2). Various pre-trained networks serve as the backbone, such as CSPResNext50, CSPDarknet53 and EfficientNet-B3. The neck mixes and combines the features from the backbone so that the head can detect the objects. The neck consists of a spatial pyramid pooling (SPP) and a path aggregation network (PAN). YOLOv3 is used as the head, which can recognise objects using anchor boxes. 

MATLAB's "yolov4ObjectDetector" comes with two pre-trained backbones that were trained on the COCO dataset. In this project, "tiny-yolov4-coco" was chosen because it is smaller and therefore faster.

Figure 10: Structure of an object detector from Bochkovskiy et al. (2)

To create a YOLOv4 detector in MATLAB, you first calculate the anchor boxes, define the input size, and choose a pre-trained backbone. We chose "tiny-yolov4-coco"; another option would be "csp-darknet53-coco". The number of anchor boxes depends on the backbone; in this case, there are six. Their sizes are estimated from the bounding boxes of the training data after these have been scaled down to the input size. The input size must be divisible by 32, and since the webcam has a resolution of 720x1280, 704x1280 was chosen.
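These steps can be sketched as follows (variable names are ours; splitting the six anchors across the two detection heads of the tiny backbone, largest first, is one common convention):

```matlab
inputSize = [704 1280 3];                 % divisible by 32

% Estimate six anchor boxes from the rescaled training boxes.
anchors = estimateAnchorBoxes(trainingData, 6);

% Split anchors across the two heads of tiny-yolov4-coco.
[~, order] = sort(anchors(:,1) .* anchors(:,2), "descend");
anchorBoxes = {anchors(order(1:3),:); anchors(order(4:6),:)};

detector = yolov4ObjectDetector("tiny-yolov4-coco", "apple", ...
                                anchorBoxes, "InputSize", inputSize);
```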

The same augmented training data was used for training as for the ACF detector. The training options were largely left at their default values, with "MiniBatchSize" set to 1, as very little training data was available and this produced the best results. "MaxEpochs" was set to 120 so that training would not take too long.
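The training call itself is short (the solver choice and learning rate below are assumptions; the other options stay at their defaults as described above):

```matlab
options = trainingOptions("adam", ...
    "MiniBatchSize", 1, ...               % tiny dataset -> batch of 1
    "MaxEpochs", 120, ...
    "InitialLearnRate", 1e-3);            % assumed learning rate

detector = trainYOLOv4ObjectDetector(trainingData, detector, options);
```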

The YOLOv4 object detector was evaluated in the same way as the ACF detector to make the two easier to compare. The results clearly show that the more powerful YOLOv4 detector is significantly better at recognising apples. However, it still struggles when apples are occluded or appear too small in the image. In some cases an apple is detected twice, but this is easily remedied with the "selectStrongestBbox" function.

Figure 11: Precision and recall curve of the YOLOv4 detector
Figure 12: Detected apples from the augmented training data with their confidence levels as determined by the YOLOv4 detector

Applications

Webcam

The MATLAB Computer Vision Toolbox is useful for live detection of apples. Together with the Image Acquisition Toolbox's "videoinput" function, it is easy to read frames from the webcam, run the object detectors on each frame, and mark any apples found. Displaying the annotated frames produces a live video with marked apples. Unfortunately, processing is relatively slow because no GPU was used, resulting in only about 5 frames per second.
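A minimal live-detection loop looks roughly like this (the adaptor name passed to "videoinput" depends on the platform and is an assumption here):

```matlab
vid = videoinput("winvideo", 1);          % Image Acquisition Toolbox
vid.ReturnedColorspace = "rgb";

while true
    frame = getsnapshot(vid);
    [bboxes, scores] = detect(detector, frame);
    % merge duplicate detections of the same apple
    [bboxes, scores] = selectStrongestBbox(bboxes, scores);
    frame = insertObjectAnnotation(frame, "rectangle", bboxes, scores);
    imshow(frame); drawnow;               % ~5 fps without a GPU
end
```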

MATLAB Mobile

Smartphones are interesting for another application, as they have a camera and are highly mobile. MATLAB Mobile makes it possible to run MATLAB code on a smartphone, access MATLAB Drive, and take photos with the smartphone camera. A small programme was therefore written for MATLAB Mobile that takes a photo and searches it for apples using an object detector stored on MATLAB Drive. Finally, the photo with the marked apples can be viewed directly on the smartphone.
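A sketch of that programme (the .mat file name is an assumption; "mobiledev", "camera" and "snapshot" come from MATLAB's smartphone sensor support):

```matlab
m   = mobiledev;                          % connect to the smartphone
cam = camera(m, "back");                  % rear camera
img = snapshot(cam, "manual");            % user triggers the photo

% Load the trained detector from MATLAB Drive and search for apples.
s = load("appleDetector.mat", "detector");    % hypothetical file name
[bboxes, scores] = detect(s.detector, img);
imshow(insertObjectAnnotation(img, "rectangle", bboxes, scores));
```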

Conclusion

With MATLAB, projects can be completed quickly because it includes many ready-made functions that can perform most tasks with flexible settings. There is also documentation for each function and often examples of how to use it. However, this documentation is not always very detailed and only explains what you can do, not why. This is particularly problematic for complex and large functions such as the training functions of object detectors. Nevertheless, with the help of the examples, object detectors for images or videos can be created relatively quickly, even with relatively little training data.

Glossary

Anchor Boxes  

Box sizes calculated from the training data that tell the object detector the typical dimensions of the objects to be found.

Backbone  

Part of an object detector, responsible for feature extraction.

Bounding Boxes  

Frame that fits as closely as possible around the object.

Callback Function  

A function passed to another function, which calls it at a defined point (here: each time the datastore is read during training).

Channels  

Usually part of the image, such as a colour channel like RGB or edge enhancement generated by filters.

Confidence  

How confident the object detector is that it has correctly found an object.

Datastores  

A data collection that can be very large and still allows you to process your data without overloading the memory.

Decision Trees  

Hierarchical decision-making structure that classifies inputs based on learned decision rules.

Deep Learning Network  

A machine learning model based on artificial neural networks that learns statistical patterns from data.

Features  

These are generated from channels through training and are characteristics that help object detectors find objects.

Head  

This is the last part of a detector and attempts to find objects from the data collected by the neck.

Neck  

Mixes the data from the backbone with networks and passes it on to the head.

Precision  

Measures how many of the objects found by the detector are actually correct objects: TP / (TP + FP).

Recall  

Measures how many of the actual objects present were found by the detector: TP / (TP + FN).

Stages  

Training rounds of the ACF detector; more stages mean longer training and typically a more discriminative detector.

Used MATLAB Toolboxes

- Image Processing Toolbox
- Computer Vision Toolbox
- Deep Learning Toolbox
- Computer Vision Toolbox Model for YOLO v4 Object Detection
- Image Acquisition Toolbox

References

1. Piotr Dollár, Ron Appel, Serge Belongie, Pietro Perona. Fast Feature Pyramids for Object Detection. 2009.

2. Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao. YOLOv4: Optimal Speed and Accuracy of Object Detection. 2020.