MATLAB has many different toolboxes for various applications, such as signal processing, control engineering, AI, and image processing. To find out how easy and quick they are to use, a small project was created based on a practical example that uses the toolboxes for image processing, computer vision, and deep learning.
The project involves using a camera to show a robot or gripper arm the exact positions of apples that have fallen from a tree so that they can then be picked up. To do this, the device must first recognize what is and is not an apple in an image.
MATLAB was used to process images of different apples in order to obtain training data for two different object detectors provided by MATLAB. These detectors should then be able to recognize apples on the ground. This was tested by running the detectors on input from a webcam and from a smartphone camera using MATLAB Mobile. We succeeded in creating, evaluating, and applying two functioning object detectors in a short time and with little training data.
Preprocessing
Contrast
To achieve good, consistent contrast, the brightness of all pixels is generally analysed using a histogram. A histogram counts how many pixels have each brightness value, which makes it easy to see whether the image is rather dark or light. To ensure that all images have the same brightness and contrast, the darkest 1% of pixels are mapped to the minimum and the brightest 1% to the maximum. The remaining values are stretched across the entire brightness range, which makes dark images lighter and vice versa. To obtain the brightness of an image in RGB format, it must first be converted to the HSV colour space, where the V channel directly holds the brightness of each pixel. Once the V channel has been adjusted, the image is converted back to RGB.
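The contrast adjustment described above could be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes the Image Processing Toolbox functions "rgb2hsv", "stretchlim", "imadjust" and "hsv2rgb", and it uses the demo image peppers.png that ships with MATLAB as a stand-in for an apple photo.

```matlab
% Sketch: stretch the brightness (V) channel so that the darkest 1% and
% the brightest 1% of pixels are clipped to the ends of the range.
rgb = imread('peppers.png');           % stand-in image (assumption)

hsv = rgb2hsv(rgb);                    % H, S and V as doubles in [0, 1]
v   = hsv(:, :, 3);                    % V channel = pixel brightness

limits = stretchlim(v, [0.01 0.99]);   % brightness at the 1% and 99% marks
hsv(:, :, 3) = imadjust(v, limits);    % map [limits(1), limits(2)] onto [0, 1]

rgbAdjusted = hsv2rgb(hsv);            % back to RGB (double, range [0, 1])

% Show the original and the adjusted image side by side.
montage({rgb, im2uint8(rgbAdjusted)});
```

Because only the V channel is changed, hue and saturation of the apples stay untouched.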
Size Adjustment
The images must be resized because the input of the object detectors is smaller than the images. The input size of the object detectors is 704x1280, chosen to match the highest possible resolution of the webcam. The training data was recorded with a smartphone and has a resolution of up to 2112x4608, so it must be scaled down to avoid exceeding the input size. To obtain more training data, the images were also cropped at the edges, which enlarges the apples relative to the image. The final resize to 704x1280 is done either in the training data augmenter or in the object detector.
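As a rough sketch of the resizing step, the following scales a full-resolution image and one bounding box down to the detector input size. The blank image and the box coordinates are made-up placeholders; "imresize" and "bboxresize" are the toolbox functions assumed here.

```matlab
% Sketch: scale an image and its bounding box to the detector input size.
targetSize = [704 1280];                   % detector input: rows, columns
img = zeros(2112, 4608, 3, 'uint8');       % placeholder for a smartphone photo
box = [1800 900 360 300];                  % hypothetical box [x y width height]

% Separate scale factors for rows and columns: [rowScale colScale].
scale = targetSize ./ [size(img, 1) size(img, 2)];

imgSmall = imresize(img, targetSize);      % resize the image
boxSmall = bboxresize(box, scale);         % resize the box by the same factors

disp(scale)
```

The two scale factors differ here (about 0.33 vertically and 0.28 horizontally) because the smartphone photo and the detector input do not share the same aspect ratio. Cropping the photo to the target aspect ratio first avoids this distortion.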
Detectors and Training
Training Data
Training object detectors requires as much training data as possible. Since large amounts of data are not always available, the existing data is randomly modified so that the object detectors are trained more robustly. There are many ways to modify images without making their content unrecognisable. For example, saturation, contrast or brightness can be slightly altered without changing what the image shows. The image can also be rotated, mirrored or distorted to obtain new usable training data.
First, new images are taken and labelled with bounding boxes and labels using the MATLAB app “Image Labeler”. The bounding boxes mark where an apple is in the image. This training data is stored in datastores to use as little memory as possible during training.
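To illustrate the datastore structure, the sketch below builds a tiny combined datastore from two blank placeholder images and hand-written boxes. In the project the boxes come from the Image Labeler; the file names, box values and the class name "apple" here are purely illustrative.

```matlab
% Sketch: combine images and box labels into one datastore for training.
folder = tempname;
mkdir(folder);
files = cell(2, 1);
for k = 1:2
    files{k} = fullfile(folder, sprintf('apple%d.png', k));
    imwrite(zeros(704, 1280, 3, 'uint8'), files{k});   % placeholder image
end

% One table column per class; each cell holds M-by-4 boxes [x y w h].
apple = {[100 200 80 80]; [300 400 60 60; 500 100 90 90]};

imds = imageDatastore(files);             % images are read from disk on demand
blds = boxLabelDatastore(table(apple));   % boxes and class labels
ds   = combine(imds, blds);               % each read returns {image, boxes, labels}

sample = read(ds);
disp(size(sample{2}))                     % boxes of the first image
```

Only the file paths and the boxes are held in memory; the images themselves are loaded when the datastore is read.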
Next, a callback function is created in which the training data is cropped to the correct size and randomly augmented. The callback function is called every time the datastore is read during training, so augmented images never have to be stored, which keeps memory usage low. The random augmentation covers colour, contrast, brightness, rotation, scaling and shearing. The bounding boxes must be transformed together with the images so that they still enclose the apples. Finally, the images are scaled to the input size and bounding boxes that are too small are deleted. This means that apples that are too small are no longer recognised, but the anchor boxes become more uniform, which increases accuracy.
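A callback of this kind might look like the sketch below. The function name "augmentApples", its parameters and all jitter ranges are assumptions chosen for illustration, not the values used in the project. It relies on "jitterColorHSV", "randomAffine2d", "affineOutputView", "imwarp", "bboxwarp", "imresize" and "bboxresize".

```matlab
function data = augmentApples(data, targetSize, minBoxSize)
% Hypothetical augmentation callback for use with transform(), e.g.
%   augmentedDs = transform(ds, @(d) augmentApples(d, [704 1280], 20));
% data is the N-by-3 cell {image, boxes, labels} returned by read().
for k = 1:size(data, 1)
    img    = data{k, 1};
    boxes  = data{k, 2};
    labels = data{k, 3};

    % Random colour changes (ranges are illustrative, not tuned).
    if size(img, 3) == 3
        img = jitterColorHSV(img, 'Hue', 0.05, 'Saturation', 0.2, ...
            'Brightness', 0.2, 'Contrast', 0.2);
    end

    % Random mirroring, rotation, scaling and shearing. The image and
    % the boxes are warped with the same transformation.
    tform   = randomAffine2d('XReflection', true, 'Rotation', [-10 10], ...
        'Scale', [1 1.2], 'XShear', [-5 5]);
    outView = affineOutputView(size(img), tform, 'BoundsStyle', 'centerOutput');
    warped  = imwarp(img, tform, 'OutputView', outView);
    [warpedBoxes, kept] = bboxwarp(boxes, tform, outView, ...
        'OverlapThreshold', 0.25);

    % Fall back to the unwarped sample if every box left the frame.
    if ~isempty(kept)
        img    = warped;
        boxes  = warpedBoxes;
        labels = labels(kept);
    end

    % Scale image and boxes to the detector input size.
    scale = targetSize ./ [size(img, 1) size(img, 2)];
    img   = imresize(img, targetSize);
    boxes = bboxresize(boxes, scale);

    % Delete boxes whose width or height is below the minimum.
    bigEnough  = boxes(:, 3) >= minBoxSize & boxes(:, 4) >= minBoxSize;
    data(k, :) = {img, boxes(bigEnough, :), labels(bigEnough)};
end
end
```

Because the function is attached with "transform", a fresh random variant is generated each time an image is read.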
ACFObjectDetector
The ACFObjectDetector is an object detector that finds objects in images using aggregate channel features, following the principles described by Dollár et al. (1). Roughly speaking, so-called channels are computed from the images. A channel can be a colour, for example, or it can be derived using filters (for example Gaussian filters) to capture edges and borders. Features are derived from the channels and used to train decision trees that identify objects in images.
The ACF detector is trained using the "trainACFObjectDetector" function. It was trained with 10 stages and the augmented training data, and then the "detect" function was used to attempt to recognise apples in images. If any were recognised, the bounding boxes and their confidence levels were obtained.
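Training and applying the ACF detector might look like this sketch. The wrapper function "trainAppleAcf" and its arguments are hypothetical; "trainingData" is assumed to be a table or datastore of labelled images, and "testImage" an RGB image.

```matlab
function [detector, bboxes, scores] = trainAppleAcf(trainingData, testImage)
% Hypothetical wrapper: train an ACF detector and run it on one image.

% Train with 10 stages, as described in the text.
detector = trainACFObjectDetector(trainingData, 'NumStages', 10);

% bboxes is M-by-4 ([x y width height]); scores holds one confidence
% value per box. Both are empty if nothing was detected.
[bboxes, scores] = detect(detector, testImage);
end
```

Higher scores indicate more confident detections, so a threshold on "scores" can be used to discard weak candidates.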
After training, the detector can be evaluated to see how well it performs. The augmented training data was reused for this purpose, as it changes slightly each time it is read. Ideally, however, separate test data should be used to obtain a more reliable picture.
The "evaluateObjectDetection" function makes it very easy to evaluate object detectors. It directly returns metrics that can be used to determine the accuracy of the detector, for example as a precision-recall plot. Precision is calculated from true positives (TP) and false positives (FP): Precision = TP / (TP + FP). Recall is calculated from true positives (TP) and false negatives (FN): Recall = TP / (TP + FN).
The evaluation showed that the ACF detector is not particularly good at recognising apples. This could be because apples vary greatly in size and also have different colours and shapes. Another reason could be that very little training data was used.
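As a small worked example of the two formulas, assume a detector that finds 8 real apples, marks 2 things that are not apples, and misses 4 apples. The counts are invented purely for illustration.

```matlab
% Worked example with hypothetical counts.
tp = 8;   % true positives: apples that were found
fp = 2;   % false positives: detections that are not apples
fn = 4;   % false negatives: apples that were missed

precision = tp / (tp + fp);   % share of detections that are correct
recall    = tp / (tp + fn);   % share of apples that were found

fprintf('Precision = %.2f, Recall = %.2f\n', precision, recall);
% prints: Precision = 0.80, Recall = 0.67
```

A precision-recall plot repeats this calculation for many confidence thresholds, which shows how the two values trade off against each other.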
YOLOv4ObjectDetector
The YOLOv4 (You Only Look Once version 4) object detector is a deep learning network that specialises in real-time object detection with bounding boxes (2). Various pre-trained networks can serve as the backbone, such as CSPResNeXt50, CSPDarknet53 and EfficientNet-B3. The neck mixes and combines the features from the backbone so that the head can detect the objects. The neck consists of a spatial pyramid pooling (SPP) module and a path aggregation network (PAN). YOLOv3 is used as the head, which recognises objects using anchor boxes.
MATLAB's "yolov4ObjectDetector" comes with two pre-trained backbones that were trained on the COCO dataset. In this project, "tiny-yolov4-coco" was chosen because it is smaller and therefore faster.
To create a YOLOv4 detector in MATLAB, the anchor boxes must first be estimated, the input size defined, and a pre-trained backbone selected. We picked "tiny-yolov4-coco" as the backbone; the other option would be "csp-darknet53-coco". The number of anchor boxes depends on the backbone; in this case, there are six. Their sizes are estimated from the bounding boxes of the training data after these have been scaled down to the input size. The input size must be divisible by 32, and since the webcam has a resolution of 720x1280, 704x1280 was chosen (704 is the largest multiple of 32 that does not exceed 720).
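The setup steps could be sketched as follows. The wrapper function "createAppleYolo" and the class name "apple" are assumptions; "trainingData" is assumed to be a datastore that returns image, boxes and labels already scaled to the input size. The split of the six anchor boxes into two groups of three follows the two detection heads of the tiny network.

```matlab
function detector = createAppleYolo(trainingData)
% Hypothetical helper: set up an untrained-for-apples YOLOv4 detector.

% Round the webcam resolution down to multiples of 32.
camSize   = [720 1280];
inputSize = [floor(camSize / 32) * 32, 3];     % gives [704 1280 3]

% Estimate six anchor boxes from the training boxes.
numAnchors = 6;
anchors = estimateAnchorBoxes(trainingData, numAnchors);

% Sort by area so the larger boxes go to the first detection head.
[~, order] = sort(anchors(:, 1) .* anchors(:, 2), 'descend');
anchors     = anchors(order, :);
anchorBoxes = {anchors(1:3, :); anchors(4:6, :)};

% Build the detector on the pre-trained tiny backbone.
detector = yolov4ObjectDetector('tiny-yolov4-coco', {'apple'}, ...
    anchorBoxes, 'InputSize', inputSize);
end
```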
The same augmented training data was used for training as for the ACF detector. The training settings were largely left at their default values, with "MiniBatchSize" set to 1, as very little training data was used and this produced the best results. "MaxEpochs" was set to 120 so that training would not take too long.
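A training call with these settings might look like the sketch below. The wrapper "trainAppleYolo" is hypothetical, and the "adam" solver is an assumption, since the text does not name the solver that was used.

```matlab
function detector = trainAppleYolo(trainingData, detector)
% Hypothetical wrapper: fine-tune a YOLOv4 detector on the apple data.

% Solver choice is an assumption; batch size and epochs follow the text.
options = trainingOptions('adam', ...
    'MiniBatchSize', 1, ...
    'MaxEpochs', 120, ...
    'Verbose', true);

detector = trainYOLOv4ObjectDetector(trainingData, detector, options);
end
```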
The YOLOv4 object detector was evaluated in the same way as the ACF detector to make it easier to compare them. The results clearly show that the more powerful YOLOv4 detector is significantly better at recognising apples. However, the detector still has difficulties when the apples are partly covered or too small in the image. In some cases, apples are also recognised twice, but this can be easily remedied with the "selectStrongestBbox" function.
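The effect of "selectStrongestBbox" can be shown with a few made-up detections, where the first two boxes cover the same apple. The box coordinates and scores are invented for illustration.

```matlab
% Sketch: remove duplicate detections of the same apple.
bboxes = [100 100 80 80;     % apple A
          105 102 80 80;     % apple A again, slightly shifted
          400 300 60 60];    % apple B
scores = [0.9; 0.6; 0.8];

% Boxes that overlap a stronger box by more than 50% are dropped.
[keptBoxes, keptScores] = selectStrongestBbox(bboxes, scores, ...
    'OverlapThreshold', 0.5);

disp(size(keptBoxes, 1))     % number of remaining detections
```

The two boxes on apple A overlap by roughly 84% (intersection over union), so only the one with the higher score survives and two detections remain.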
Applications
Webcam
The MATLAB Computer Vision Toolbox is useful for creating live detection of apples. The "videoinput" function from the Image Acquisition Toolbox makes it easy to read frames from the webcam, which are then examined by the object detectors and annotated with any apples found. Displaying the annotated frames results in a live video with marked apples. Unfortunately, processing is relatively slow because no GPU was used, resulting in a video with only about 5 frames per second.
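A live detection loop along these lines is sketched below. The function name "runLiveDetection" is hypothetical, and the adaptor name 'winvideo' with device ID 1 is an assumption that depends on the operating system and the connected camera.

```matlab
function runLiveDetection(detector, numFrames)
% Hypothetical live loop: grab frames, detect apples, show the result.

vid = videoinput('winvideo', 1);        % adaptor and device ID are assumptions
vid.ReturnedColorSpace = 'rgb';         % make sure frames arrive as RGB
cleanup = onCleanup(@() delete(vid));   % release the camera when done

hImage = [];
for k = 1:numFrames
    frame = getsnapshot(vid);
    [bboxes, scores] = detect(detector, frame);

    % Draw a labelled rectangle around every detected apple.
    if ~isempty(bboxes)
        frame = insertObjectAnnotation(frame, 'rectangle', bboxes, scores);
    end

    % Create the image once, then only swap its pixel data.
    if isempty(hImage)
        hImage = imshow(frame);
    else
        set(hImage, 'CData', frame);
    end
    drawnow limitrate
end
end
```

Updating the existing image instead of calling "imshow" for every frame keeps the display overhead small, so most of the time per frame is spent in the detector.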
MATLAB Mobile
Smartphones are interesting as a further application, as they have a camera and are very mobile. MATLAB Mobile allows MATLAB code to be run from a smartphone. It also provides access to MATLAB Drive and offers additional functions for taking photos with the smartphone camera. A small program was therefore written for MATLAB Mobile that takes a photo and uses an object detector stored on MATLAB Drive to search for apples. Finally, the photo can be viewed on the smartphone with the apples marked.
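Such a program might look roughly like the sketch below. The function name "detectApplesOnPhone", the file name passed to it and the variable name "detector" inside that file are assumptions; the camera access uses the "mobiledev", "camera" and "snapshot" functions of MATLAB Mobile.

```matlab
function detectApplesOnPhone(detectorFile)
% Hypothetical MATLAB Mobile script: take a photo and mark the apples.
% detectorFile is a MAT-file on MATLAB Drive that contains a variable
% named "detector" (both names are assumptions).

m     = mobiledev;                  % connection to the phone's sensors
cam   = camera(m, 'back');          % use the rear camera
photo = snapshot(cam, 'manual');    % user presses the shutter

loaded = load(detectorFile);
[bboxes, scores] = detect(loaded.detector, photo);

if ~isempty(bboxes)
    photo = insertObjectAnnotation(photo, 'rectangle', bboxes, scores);
end

% Show the annotated photo.
image(photo);
axis image off
end
```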
Conclusion
With MATLAB, projects can be completed quickly because it includes many ready-made functions that can perform most tasks with flexible settings. There is also documentation for each function and often examples of how to use it. However, this documentation is not always very detailed and only explains what you can do, not why. This is particularly problematic for complex and large functions such as the training functions of object detectors. Nevertheless, with the help of the examples, object detectors for images or videos can be created relatively quickly, even with relatively little training data.
Glossary
Used MATLAB Toolboxes
- Image Processing Toolbox
- Computer Vision Toolbox
- Deep Learning Toolbox
- Computer Vision Toolbox Model for YOLO v4 Object Detection
- Image Acquisition Toolbox
References



