Visual Object Detection For Autonomous UAV Cinematography

The popularization of commercial, battery-powered, camera-equipped, Vertical Take-off and Landing (VTOL) Unmanned Aerial Vehicles (UAVs) during the past decade has significantly affected aerial video capturing operations in various domains. UAVs are affordable, agile and flexible, with the ability to access otherwise inaccessible spots. However, their limited resources prevent computational cinematography techniques from operating with high accuracy and at real-time speed on such devices. State-of-the-art object detectors and feature extractors are, thus, studied in this work, aiming to find a trade-off between performance and speed that will allow UAV exploitation for intelligent cinematography purposes. Experimental evaluation on three newly introduced datasets of rowing boats, cyclists and parkour athletes is performed, and evidence is provided that even limited-resource autonomous UAVs can indeed be used for cinematography applications.


Introduction
The use of camera-equipped Unmanned Aerial Vehicles (UAVs) for covering public sport events, such as bicycle or boat races, parkour shows and football games, as well as for media production, surveillance, search and rescue operations, etc., is becoming increasingly popular, since UAVs are capable of shooting spectacular videos that would otherwise be very difficult and costly to obtain. Visual analysis tasks may, thus, be of assistance in UAV-based intelligent cinematography [5,12,14,16], e.g., for detecting and tracking a desired target, or even in flight-safety-related tasks [21], such as obstacle detection and avoidance. Technological progress has led to the production of numerous commercially available UAVs with similar cognitive autonomy and perceptual capabilities, but the limited computational hardware, the possibly high camera-to-target distance and the fact that both the UAV/camera and the target(s) are moving make achieving both high accuracy and real-time performance rather challenging [9,10,11].
The most promising state-of-the-art approach towards achieving real-time performance on the restricted computational hardware on-board a UAV is to use one-stage deep neural detectors, structured around the concept of "anchors". Such detectors, e.g., the Single-Shot Detector (SSD) [7] and You Only Look Once (YOLOv2) [17], are based on the notion of a convolutional Region Proposal Network (RPN), simultaneously regressing the pixel coordinates of multiple visible object Regions-of-Interest (ROIs) (in the form of spatial offsets from the predefined anchors) and assigning class labels to them. SSD [7] is a single-stage multi-object detector, meaning that a single feed-forward image pass suffices for the extraction of multiple ROIs with coordinate and class information, without internal ROI pooling. In its original form, the detector relied on the VGG16 [18] architecture for feature extraction, with a number of layers added on top of it, so as to extract better-defined boxes. Two versions were proposed, one requiring an input image size of 300 × 300 pixels and one requiring 500 × 500 pixels, with the latter producing better results in terms of detection precision while being significantly slower than the former. In [3], SSD was used as a meta-architecture for single-stage object detection and was compared to region-based detectors. The experimental evaluation conducted in that work showed that, when combined with the MobileNet [2] and Inception v2 [4] feature extractors, SSD is faster than any region-based detector.

Similar in nature to SSD, YOLO [17] is a widely used object detector, whose popularity may be attributed to its simplicity, stemming from its ability to detect multiple objects with a single forward image pass, in combination with its speed, which surpasses that of SSD. YOLO relies on a custom architecture for feature extraction and is pretrained on the publicly available ImageNet [1] dataset. Its fully convolutional architecture [7,8] allows the network to be trained and deployed at any resolution, although the input size must be a multiple of 32 (the network's total subsampling factor), and sizes yielding an odd-dimensioned output heatmap (e.g., 416 × 416, producing a 13 × 13 grid) are preferred, so that the heatmap divides the image into equally sized regions with a single central cell. Each such region is responsible for detecting any object whose center lies within it, by fitting precomputed anchor boxes to the ground-truth ROIs. Thus, the input size affects not only the size of the produced heatmaps, and consequently the speed of the detector, but also the maximum number of boxes that can be detected.
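To make the anchor-based decoding concrete, the following Python sketch shows how a YOLOv2-style detector maps raw network outputs to pixel-space boxes; the stride value follows the subsampling factor mentioned above, while the function names and the anchor sizes in the example are illustrative assumptions rather than the exact implementation evaluated here.

    # A minimal sketch of YOLOv2-style anchor decoding; anchors and names
    # are hypothetical, not the priors used by the detectors in this work.
    import numpy as np

    STRIDE = 32  # the network's total subsampling factor

    def grid_size(input_res):
        # The input resolution must be a multiple of 32; e.g., 416 -> 13.
        assert input_res % STRIDE == 0
        return input_res // STRIDE

    def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h):
        # Decode raw outputs (tx, ty, tw, th) of grid cell (cx, cy) into a
        # box center and size in pixels, following the YOLOv2 formulation.
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        bx = (cx + sigmoid(tx)) * STRIDE  # center x, offset within the cell
        by = (cy + sigmoid(ty)) * STRIDE  # center y, offset within the cell
        bw = anchor_w * np.exp(tw)        # width, scaled from the anchor prior
        bh = anchor_h * np.exp(th)        # height, scaled from the anchor prior
        return bx, by, bw, bh

    # A 416 x 416 input yields a 13 x 13 heatmap, whose odd dimension
    # guarantees the existence of a single central cell.
    print(grid_size(416))  # 13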
These widely used, heavily studied, lightweight neural architectures, along with the fact that autonomous UAV usage for cinematography purposes is becoming mainstream, motivated this study, which aims to identify the circumstances under which these architectures can operate on such resource-limited devices, making them suitable for use in intelligent cinematography applications. To this end, an extensive experimental evaluation is performed, testing the detectors paired with different feature extractors and recording the accuracy and time performance achieved for several input image resolutions, in search of a trade-off between detection accuracy and speed. Evaluation is performed on three use cases, corresponding to real-life applications suitable for autonomous UAV coverage. The created datasets, which are publicly available and can be downloaded from http://www.aiia.csd.auth.gr/LAB_PROJECTS/MULTIDRONE/AUTH_MULTIDRONE_Dataset.html, the adopted protocols and the obtained results are subsequently discussed.
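Since detection speed is reported throughout this work in Frames per Second (FPS), a simple timing loop of the following form suffices for such measurements; this is a generic sketch in which detect stands for an arbitrary detector inference callable, not the exact benchmarking code used in the experiments.

    # A minimal sketch of FPS measurement; `detect` is a placeholder for
    # any detector inference function, not an actual library API.
    import time
    import numpy as np

    def measure_fps(detect, input_res, n_runs=100, n_warmup=10):
        # Time repeated forward passes on a random image of the given
        # square resolution; warm-up runs are excluded, so that lazy
        # initialization and GPU caching do not distort the measurement.
        frame = np.random.randint(0, 256, (input_res, input_res, 3),
                                  dtype=np.uint8)
        for _ in range(n_warmup):
            detect(frame)
        start = time.perf_counter()
        for _ in range(n_runs):
            detect(frame)
        return n_runs / (time.perf_counter() - start)

    # Example usage with a dummy detector stub:
    print(measure_fps(lambda img: None, 300))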

Use Cases
In this section, the adopted experimental protocols are discussed and results are reported on the following three use cases: rowing boat race, cycling race and parkour. These scenarios were selected because their media coverage process benefits greatly from exploiting autonomous UAVs for filming and broadcasting. All time-dependent measurements were made on an NVIDIA Jetson TX2 computing board, i.e., a common embedded AI hardware platform which is easily deployable on drones. All three datasets were manually collected, annotated and made public, as, to the best of our knowledge, no publicly available datasets of such data currently exist.

Lightweight Rowing Boat Detection
Lightweight convolutional object detectors [13,15,19,20] were evaluated on the rowing boat detection task. Average Precision (AP) and speed results on a Jetson TX2 module, obtained using the SSD detector coupled with the MobileNet v1 and Inception v2 backbone feature extraction networks, are summarized in Table 1, while rowing boat detection examples are given in Fig. 1.
The Inception v2 extractor seems to be faster than MobileNet v1, while also leading to more accurate detections. Moreover, as expected, decreasing the input image resolution speeds up execution but deteriorates detection performance, as the relatively small input sizes used to train the detectors are responsible for most of the arising false negatives. This is due to the fact that the objects to be detected may sometimes be rather small, e.g., boats far away from the camera; lowering the input image resolution shrinks such small targets to tiny sizes, rendering them indistinguishable even to the human eye.
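For reference, the AP figures reported in Table 1 follow the standard area-under-the-precision-recall-curve definition, sketched below in the PASCAL-VOC-style interpolated form; this is a generic formulation, assumed rather than taken from the evaluation scripts actually used.

    # A minimal sketch of Average Precision (AP) as the area under the
    # precision-recall curve, with the usual monotone interpolation.
    import numpy as np

    def average_precision(recalls, precisions):
        # Pad the curve, make the precision envelope non-increasing from
        # right to left, then integrate precision over the recall axis.
        r = np.concatenate(([0.0], recalls, [1.0]))
        p = np.concatenate(([0.0], precisions, [0.0]))
        for i in range(len(p) - 2, -1, -1):
            p[i] = max(p[i], p[i + 1])
        idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
        return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))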

Bicycle Detection
The SSD detector was also evaluated on another single-class problem, that of detecting bicycles in a cycling race. To this end, a dataset consisting of about 12k images was gathered from cycling events and about 77k cyclists were annotated, along with their bicycles. As most of the shots are aerial, the annotated objects (i.e., professional bicycles) are small relative to the image size and can easily be confused with other vehicles, such as motorcycles, especially in distant shots, while many partially occluded, as well as motion-blurred, instances are also included in the dataset. Finally, it should be noted that, despite the fact that both the cyclist ("person") and "bicycle" classes are popular in datasets such as COCO [6] and ImageNet, on which the detectors have been pretrained, in this scenario a target is considered to exist only when both sub-objects are detected close to each other. Figure 2 illustrates the performance of SSD with the MobileNet v1 and Inception v2 backbones. As expected, for a fixed number of false positive detections, e.g., 500, performance rises dramatically as resolution increases, from 32.2% to 65.1% for MobileNet and from 34.5% to 64.5% for Inception. By allowing more false positives, the Inception models achieve higher recall rates, while the MobileNet v1 model at an input resolution of 192 × 192 pixels, running at around 22 FPS with a 56.2% recall rate (at 500 false positives), offers a very good trade-off between speed and accuracy.
The same results are also summarized in Table 2 for both backbones and all input sizes. Bicycle detection examples are depicted in Figure 3.
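The "both sub-objects close to each other" rule used above to declare a cyclist target can be sketched as follows; the center-distance criterion and the 50-pixel threshold are illustrative assumptions, since the exact proximity test is not specified here.

    # A minimal sketch of pairing "person" and "bicycle" detections into
    # cyclist targets; boxes are (x1, y1, x2, y2) tuples in pixels, and
    # the distance threshold is an assumed, hypothetical value.
    def center(box):
        return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

    def cyclist_targets(persons, bicycles, max_dist=50.0):
        # Keep a person/bicycle pair only if their box centers lie within
        # `max_dist` pixels of each other; each person pairs at most once.
        targets = []
        for p in persons:
            px, py = center(p)
            for b in bicycles:
                bx, by = center(b)
                if ((px - bx) ** 2 + (py - by) ** 2) ** 0.5 <= max_dist:
                    targets.append((p, b))
                    break
        return targets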

Parkour Athlete Detection
The single-stage detector YOLOv2 was also trained to detect parkour athletes. More specifically, YOLOv2, pretrained on the public COCO object detection dataset, was finetuned with parkour data extracted from publicly available YouTube videos.
In detail, 8 videos were manually annotated with parkour athlete ROIs, resulting in a dataset of 30624 images, 28372 of which were used for training and 2252 for validation purposes. Model testing was performed on footage specifically captured for this purpose at Bothkamp, providing 4987 additional video frames. The adopted training protocol was the following: a one-class implementation of the YOLOv2 detector, pretrained on the COCO dataset, was finetuned in order to detect only parkour athlete instances. To this end, only athlete annotations were used for training, and the COCO "person" class weights were employed for network initialization, so that the detector would become capable of detecting athletes performing parkour as "persons". Training sessions for several input image resolutions were conducted, and the obtained results are presented in Table 3, along with the respective processing speeds for algorithm execution on an NVIDIA Jetson TX2 board. The reported metrics are mean Average Precision (mAP), F1-measure and Frames per Second (FPS), in order of appearance. Parkour athlete detection examples are depicted in Fig. 4.
It can easily be noticed that, as the input image resolution drops, processing speed increases, while the best mAP and F1 results are obtained at an image resolution of 416 × 416 pixels. This can be attributed to the fact that, as the image resolution grows beyond 416 × 416 pixels, the increase in true positives (TP) becomes smaller than the increase in false positives (FP), thus resulting in lower mAP scores.
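For completeness, the F1-measure reported in Table 3 is the harmonic mean of precision and recall; the sketch below computes it from raw TP/FP/FN detection counts at a fixed IoU threshold, using illustrative numbers rather than values from the table.

    # A minimal sketch of the F1-measure from detection counts; the
    # example numbers are illustrative, not taken from Table 3.
    def f1_measure(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2.0 * precision * recall / (precision + recall)

    # Example: 900 true positives, 100 false positives, 150 missed athletes.
    print(round(f1_measure(900, 100, 150), 3))  # 0.878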

Conclusions
This paper studied the use of state-of-the-art CNN-based visual object detectors, namely SSD and YOLO, on autonomous UAVs for cinematography applications, under the assumption of limited resources. A trade-off between the obtained accuracy and the required execution time was sought, and experiments on three newly introduced datasets, consisting of rowing, cycling and parkour data, respectively, indicated that for relatively low-resolution input images, rather satisfactory detection accuracy can be obtained, while also achieving real-time or near real-time execution speed on an NVIDIA Jetson TX2 module. This is made feasible with the aid of the fastest feature extraction neural architectures currently available, namely MobileNets and Inception v2. The obtained results can thus be considered to provide evidence that, despite their limited resources, UAVs can be employed effectively for computational cinematography and embedded visual analysis tasks.