What are the classic object tracking algorithms in computer vision?

I believe many people here, like me at one point, are looking for a better object tracking algorithm, or want a deeper understanding of this area. Although the question asks about classic tracking algorithms, perhaps what we really need are not the trackers that were once brilliant but have since washed up on the beach, but those that are about to become classics, or the current trackers with the best balance of usability, speed, and performance. My own interest is the correlation filter direction in object tracking, so below I will introduce the trackers I know, especially correlation filter methods, share some algorithms I consider better, and offer my own views along the way.

▌ Part I: Target Tracking Snapshot

Let's start with a few SOTA trackers to get a glimpse of where this direction is going. Everything starts with the benchmark published in 2013. If you ask anyone which tracking algorithms have been the most impressive in recent years, most people will throw two papers by Wu Yi at you: OTB50 and OTB100 (OTB50 refers to OTB-2013 and OTB100 to OTB-2015; 50 and 100 are the numbers of videos, which makes them easy to remember):

Wu Y, Lim J, Yang M H. Online object tracking: A benchmark [C]// CVPR, 2013.

Wu Y, Lim J, Yang M H. Object tracking benchmark [J]. TPAMI, 2015.

A top conference paper extended into a top journal paper, plus more than 1480+320 citations: the influence is self-evident. These are the benchmarks every tracking paper must run. Test code and sequences can be downloaded from the Visual Tracker Benchmark (http://cvlab.hanyang.ac.kr/tracker_benchmark/). OTB50 includes 50 sequences, all manually annotated:

The two papers compared 29 top trackers from 2012 and earlier on the benchmark, including OAB, IVT, MIL, CT, TLD, and Struck, which everyone is familiar with. Before this there was no commonly recognized benchmark, every paper claimed to be the best, and nobody knew which methods were actually usable. The significance of this benchmark is therefore enormous: it directly promoted the development of tracking algorithms. It was later extended to OTB100 and published in TPAMI, with 100 sequences that are more difficult and more authoritative; we refer to the OTB100 results here. First, the speed and publication year of the 29 trackers (algorithms with a good speed/performance trade-off are marked):

Next, the results (for more detail, the paper itself is clearer):

Straight to the conclusion: on average, Struck, SCM, and ASLA rank in the top three. Special mention goes to CSK, which showed the potential of correlation filtering to the world for the first time, ranking fourth while running at a stunning 362 FPS. The second fastest is the classic algorithm CT at 64 FPS (SCM, ASLA, and the like are the sparse-representation methods that were hottest in that era). If you are interested in even earlier algorithms, another classic survey is recommended (I'm not interested and haven't read it, anyway):

Yilmaz A, Javed O, Shah M. Object tracking: A survey [J]. CSUR, 2006.

Algorithms before 2012 are basically covered by that survey. Ever since AlexNet in 2012, CV has seen tremendous changes in every field, so I guess you definitely want to know what happened between 2013 and 2017. Sorry, I don't know either. But we can be sure of one thing: papers from 2013 onward almost always cite the OTB50 paper, so with the help of Google Scholar's "cited by" feature, we get the following results:

Only the most-cited papers are listed here: Struck's extension to TPAMI, the three major correlation filter methods KCF, CN, and DSST, and the VOT challenges. This is only a demonstration; try it yourself if you are interested. (The underlying logic: for a given paper, the work before it can be found in its references, and the work after it can be found in who cites it. Citation counts alone don't prove anything, but good methods basically earn everyone's use, respect, and recognition. You can also restrict the time range to view related papers from a given period, e.g., 2016-2017 for the newest work, though the quality then needs careful screening. The same trick works for important papers in other directions: follow these steps and you will know who the big names are, then focus on tracking their work.) From this we can see that the latest progress in the field is, without a doubt, correlation filtering, and the notable correlation filter algorithms include SAMF, LCT, HCF, SRDCF, and so on.

Of course, citation counts also depend on time, so it's recommended to look year by year. In addition, the latest OpenCV 3.2 includes, besides TLD, several very recent tracking algorithms in the OpenCV Tracking API:

The TrackerKCF interface implements KCF and CN; the influence is evident. There is also GOTURN, a deep-learning method that is fast but slightly weak in accuracy, still worth a look. For the latest papers in the tracking direction, follow the three major conferences (CVPR/ICCV/ECCV) and arXiv.

▌ Part II: Background Introduction

Next, some background on object tracking. The tracking discussed here is common single-object tracking: a rectangular box is given in the first frame (manually annotated in the benchmarks; in practice it is usually the output of a detection algorithm), and the tracking algorithm must then keep following this box in subsequent frames. Here are the requirements VOT places on tracking algorithms:

Object tracking usually faces several difficulties (from Wu Yi's slides at VALSE): appearance deformation, illumination changes, fast motion and motion blur, and similar-looking background clutter:

Plus out-of-plane rotation, in-plane rotation, scale change, occlusion, out-of-view situations, and so on:

Because of these conditions, tracking is hard. Besides OTB, the other commonly used benchmark is the VOT challenge database mentioned earlier (tracking's answer to ImageNet), which has been held four times. VOT2015 and VOT2016 both include 60 sequences, all freely downloadable:

VOT Challenge | Challenges: http://votchallenge.net/challenges.html

Kristan M, Pflugfelder R, Leonardis A, et al. The visual object tracking vot2013 challenge results [C]// ICCV, 2013.

Kristan M, Pflugfelder R, Leonardis A, et al. The Visual Object Tracking VOT2014 Challenge Results [C]// ECCV, 2014.

Kristan M, Matas J, Leonardis A, et al. The visual object tracking vot2015 challenge results [C]// ICCV, 2015.

Kristan M, Leonardis A, Matas J, et al. The Visual Object Tracking VOT2016 Challenge Results [C]// ECCV, 2016.

Differences between OTB and VOT: OTB includes 25% grayscale sequences, while VOT is all color sequences; this is one reason many color-feature algorithms perform differently on the two. The evaluation metrics of the two libraries also differ; see the papers for details. VOT sequences generally have higher resolution, which will matter in the analysis later. For a tracker, if a paper shows good results on both libraries (preferably OTB100 and VOT2016), it is definitely very good (tuning the parameters separately for the two libraries — fine, I accept and recognize it~~). If only one is run, I personally prefer VOT2016, because its sequences are finely annotated and its evaluation metrics are better (it is a competition after all; its evaluation methodology has been published in TPAMI). The biggest difference: OTB runs either from a random start frame or with random perturbations added to the initial rectangle, which the authors argue better matches boxes coming from a detector; VOT initializes from the first frame and, after each tracking failure (when the predicted box no longer overlaps the ground-truth box), re-initializes five frames later. VOT focuses on short-term tracking and takes the position that tracking and detection should not be separated: the detector will re-initialize the tracker multiple times.

One addition: OTB was published in 2013, so it is fully visible to algorithms from 2013 onward, and papers do tune their parameters on it, especially those that only run OTB. If the key parameters are given directly to two decimal places, it is recommended you measure the method yourself first (not that people are dishonest — I've just been burned too often~). The VOT database is updated every year, re-annotated at every turn, with the evaluation metric changed at every turn, which is relatively hard on that year's algorithms, so the results are relatively more reliable. (I believe many people, like me, read each paper and feel every piece of work is great and indispensable — without this paper the earth would surely explode and the universe restart. So, just as everyone learned from years of the ILSVRC competition in deep learning, third-party results are more persuasive. That's why, in the algorithm analyses that follow, I use competition ranking + whether the code is open source + measured performance as my standard.)

Visual object tracking methods are widely recognized as falling into two categories: generative methods and discriminative methods. The currently popular ones are discriminative methods, also called tracking-by-detection. For the sake of completeness, here is a brief introduction to both.

Generative methods model the target region in the current frame, then find the region most similar to the model in the next frame as the predicted position; famous examples include the Kalman filter, the particle filter, and mean-shift. For example, knowing from the current frame that the target region is 80% red and 20% green, the search in the next frame hunts, like a headless fly, for the region that best fits this color ratio. A recommended algorithm is ASMS, vojirt/asms (https://github.com/vojirt/asms):

Vojir T, Noskova J, Matas J. Robust scale-adaptive mean-shift for tracking [J]. Pattern Recognition Letters, 2014.

ASMS and DAT, dubbed the "Human Duo" (copyright reserved), are color-only algorithms and very fast: they ranked 20th and 14th in VOT2015, and 32nd and 31st in VOT2016 (mid-level). ASMS is the official recommended real-time algorithm of VOT2015, averaging 125 FPS. It adds scale estimation to the classic mean-shift framework: on top of the classic color histogram feature it adds two priors (scale does not change drastically + slightly favoring larger scales) as regularization terms, plus a reverse-scale consistency check. The author provides C++ code. In the era of correlation filtering and deep learning, seeing mean-shift still hit the list this high is no small feat (moved to tears~~). Measured performance is not bad; if you have a special interest in generative methods, this one is highly recommended. (Some algorithms can't even match this speed — the rooftop is on the 24th floor, thanks.)
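To make the generative idea concrete, here is a minimal color-histogram mean-shift loop using OpenCV's built-in calcBackProject and meanShift. This is plain mean-shift, not ASMS itself (no scale estimation, priors, or consistency check); the video path and the first-frame box are placeholders:

```python
# Plain mean-shift color tracking sketch (not ASMS). The video path and the
# first-frame box (x, y, w, h) below are hypothetical placeholders.
import cv2

cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()
x, y, w, h = 200, 150, 60, 80                      # hypothetical first-frame box

# Model the target as a hue histogram (the "80% red, 20% green" idea above).
roi_hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([roi_hsv], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-projection: per-pixel likelihood under the target color model.
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # Mean-shift climbs to the nearby mode of that likelihood map.
    _, track_window = cv2.meanShift(back_proj, track_window, term_crit)
```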

Discriminative methods: most of the methods in OTB50 belong to this category. It is CV's classic recipe of image features + machine learning: the target region in the current frame provides positive samples, the background region provides negative samples, a machine learning method trains a classifier, and the next frame uses the trained classifier to find the optimal region:

The biggest difference from generative methods is that the classifier is trained with machine learning and the background is used during training, so the classifier can focus specifically on distinguishing foreground from background; discriminative methods are therefore generally better than generative ones. For example, during training you not only tell the tracker that the target is 80% red and 20% green, but also that there is orange in the background, so take extra care not to confuse them — such a classifier knows more and performs relatively better. Tracking-by-detection is very similar to detection itself, e.g., HOG+SVM in classic pedestrian detection, or Haar + structured-output SVM in Struck, and tracking likewise needs multi-scale traversal search for scale adaptation. The differences are that tracking demands much faster online machine learning, and its detection range and number of scales are much smaller.
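As a toy illustration of this recipe (not Struck itself, which uses Haar features and a structured-output SVM), here is a sketch with HOG features and a linear SVM. It assumes grayscale uint8 frames as NumPy arrays, and that the search window stays inside the image bounds:

```python
# Toy tracking-by-detection: HOG features + linear SVM, purely illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def patch_feat(img, x, y, w, h):
    return hog(img[y:y+h, x:x+w], pixels_per_cell=(4, 4))

def train(img, x, y, w, h, n_neg=20, rng=np.random.default_rng(0)):
    X, Y = [patch_feat(img, x, y, w, h)], [1]        # target box = positive
    H, W = img.shape
    while Y.count(0) < n_neg:                        # shifted boxes = negatives
        nx, ny = rng.integers(0, W - w), rng.integers(0, H - h)
        if abs(nx - x) > w // 2 or abs(ny - y) > h // 2:
            X.append(patch_feat(img, nx, ny, w, h)); Y.append(0)
    return LinearSVC().fit(X, Y)

def detect(clf, img, x, y, w, h, radius=16, step=4):
    best, best_score = (x, y), -np.inf               # small sliding-window search
    for dy in range(-radius, radius + 1, step):
        for dx in range(-radius, radius + 1, step):
            s = clf.decision_function([patch_feat(img, x + dx, y + dy, w, h)])[0]
            if s > best_score:
                best, best_score = (x + dx, y + dy), s
    return best
```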

None of this should be surprising: in most cases detection and recognition algorithms are too expensive to run on every frame, and this is exactly where a lower-complexity tracking algorithm fits — just re-run the detector to re-initialize the tracker whenever it drifts, or at fixed intervals. Actually, what I really want to say is that FPS is the single most important damn metric; an algorithm slow as death can go die (classmates, don't be so extreme — speed can be optimized). Among classic discriminative methods I recommend Struck and TLD, both of which run in real time: Struck was the best method before 2012, and TLD is a representative classic long-term tracker whose ideas are well worth learning:

Hare S, Golodetz S, Saffari A, et al. Struck: Structured output tracking with kernels [J]. IEEE TPAMI, 2016.

Kalal Z, Mikolajczyk K, Matas J. Tracking-learning-detection [J]. IEEE TPAMI, 2012.

"The waves of the Yangtze push from behind, and the waves in front die on the beach." These new waves are correlation filtering and deep learning. Correlation filter methods are referred to as CF, or DCF for discriminative correlation filter (note the difference from the specific DCF algorithm below); they include some methods already mentioned and are the focus of what follows.

Deep-ConvNet-based methods: since deep learning is currently hard to deploy, they are not recommended here. For more information see Winsty's homepage, Naiyan Wang - Home (link 1); VOT2015 champion MDNet, Learning Multi-Domain Convolutional Neural Networks for Visual Tracking (link 2); VOT2016 champion TCNN (link 3); the fast SiamFC, the fully-convolutional Siamese tracker at roughly 80 FPS (link 4); and GOTURN, davheld/GOTURN, at over 100 FPS (link 5) — note all these speeds are on a GPU. The ResNet-based SiamFC-R performed well on VOT2016, and I am very optimistic about its follow-up development; the interested can also go to VALSE and hear the authors explain it themselves, VALSE-20160930-LucaBertinetto-Oxford-JackValmadre-Oxford-pu (link 6). As for GOTURN, the accuracy is relatively poor, but the advantage is that it runs very fast at 100 FPS; if the accuracy catches up later, it will be formidable. For those doing research, deep learning is the key direction, and being able to balance speed as well is even better.

Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking [C]// CVPR, 2016.

Nam H, Baek M, Han B. Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242, 2016.

Bertinetto L, Valmadre J, Henriques JF, et al. Fully-convolutional siamese networks for object tracking [C]// ECCV, 2016.

Held D, Thrun S, Savarese S. Learning to track at 100 fps with deep regression networks [C]// ECCV, 2016.

In the end, the powerful end-to-end firepower of deep learning is still far from fully unleashed in the tracking direction; the gap to correlation filter methods is not yet large ("Sure, I'm slow, but if my results weren't always good, what would be the point of my existence?" The revolution has not yet succeeded; comrades must still work hard). Another problem deserving attention: the tracking benchmarks have no strict training/test split, so deep learning methods that require offline training must take great care about whether their training sets contain similar sequences. It was not until VOT2017 that the organizers stipulated that methods may not be trained on sequences similar to the test set.

Finally, two strongly recommended resources. Qiang Wang maintains benchmark_results, foolwood/benchmark_results (https://github.com/foolwood/benchmark_results): a large collection of top-performance comparisons on the OTB libraries, with code for many papers; the great man also implemented and open-sourced CSK, KCF, and DAT in C++, plus his own DCFNet paper with source code. Students who can't find code should follow this closely.

@H Hakase maintains correlation filter resources, HakaseH/CF_benchmark_results (https://github.com/HakaseH/TBCF): detailed classification plus paper and code resources. Don't miss it — the correlation filter coverage is very comprehensive and very careful!

(You two above: on seeing this, remember to pay me the advertising fee — 10% discount~~)

▌ Part III: Correlation Filtering

Here come the most classic high-speed correlation filter trackers: CSK, KCF/DCF, and CN. Many people first learned about CF because, like me, they were attracted by the following picture:

This is the experimental result of KCF/DCF on OTB50 (the arXiv preprint circulated in April 2014, before OTB100 was published). In both precision and FPS it crushed Struck, the best method on OTB50 at the time, and next to the barely real-time Struck and TLD, the high speed of KCF/DCF was all the more striking. KCF/DCF is in fact the multi-channel-feature improvement of CSK, which had already shone on OTB. Also note the super-fast 615 FPS MOSSE (serious speeding — here's your ticket): the first correlation filter method in object tracking, and truly the first demonstration of CF's potential. In the same period as KCF there was also CN, the color-feature method that caused a stir at CVPR 2014, itself a multi-channel color-feature improvement of CSK. From MOSSE (615 FPS) to CSK (362) to KCF (172), DCF (292), CN (152), CN2 (202): the speed keeps dropping while the results keep improving, yet everything stays at high-speed levels:

Bolme DS, Beveridge JR, Draper BA, et al. Visual object tracking using adaptive correlation filters [C]//CVPR, 2010.

Henriques JF, Caseiro R, Martins P, et al. Exploiting the circulant structure of tracking-by-detection with kernels [C]// ECCV, 2012.

Henriques JF, Caseiro R, Martins P, et al. High-Speed Tracking with Kernelized Correlation Filters [J]. IEEE TPAMI, 2015.

Danelljan M, Shahbaz Khan F, Felsberg M, et al. Adaptive color attributes for real-time visual tracking [C]// CVPR, 2014.
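Before the individual papers, a minimal MOSSE-style single-channel correlation filter may help fix the core recipe these methods share: train in the Fourier domain against a Gaussian target response, detect by correlation, update with a fixed learning rate. This is a sketch, not the authors' code; MOSSE's random affine training perturbations are omitted, and patches are fixed-size 2-D float arrays:

```python
# Minimal MOSSE-style correlation filter sketch (single channel, NumPy only).
import numpy as np

def gaussian_response(h, w, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))   # peak at (0, 0)

def preprocess(patch):
    p = np.log(patch.astype(np.float64) + 1.0)               # log compression
    p = (p - p.mean()) / (p.std() + 1e-5)
    return p * np.outer(np.hanning(p.shape[0]), np.hanning(p.shape[1]))

class MosseFilter:
    def __init__(self, patch, lam=1e-2, lr=0.125):
        self.lam, self.lr = lam, lr
        self.G = np.fft.fft2(gaussian_response(*patch.shape))
        F = np.fft.fft2(preprocess(patch))
        self.A = self.G * np.conj(F)                 # filter numerator
        self.B = F * np.conj(F) + lam                # filter denominator

    def detect(self, patch):
        F = np.fft.fft2(preprocess(patch))
        resp = np.real(np.fft.ifft2(F * (self.A / self.B)))
        dy, dx = np.unravel_index(resp.argmax(), resp.shape)
        return resp, (dy, dx)    # peak location = shift (wraps past size / 2)

    def update(self, patch):
        F = np.fft.fft2(preprocess(patch))
        self.A = (1 - self.lr) * self.A + self.lr * self.G * np.conj(F)
        self.B = (1 - self.lr) * self.B + self.lr * (F * np.conj(F) + self.lam)
```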

CSK and KCF are both the work of João F. Henriques (Oxford University), and they influenced much later work: the ridge regression at the core, the approximate dense sampling via cyclic shifts, and the detailed derivation of the whole correlation filter algorithm, plus the closed-form solution of ridge regression, the kernel trick, and multi-channel HOG features.
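For reference, the closed-form solutions at the heart of these papers, in the notation of the KCF paper (hats denote DFTs, $\odot$ element-wise products, ${}^{*}$ complex conjugation, $\lambda$ the ridge regularizer): linear ridge regression over all cyclic shifts (the DCF case) and its kernelized version (the KCF case) are

$$\hat{w} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda}, \qquad \hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda},$$

and detection on a new patch $z$ evaluates the response $\hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha}$ — all element-wise operations in the Fourier domain, which is where the speed comes from.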

Martin Danelljan (Linköping University) extended CSK with the multi-channel color feature Color Names (CN) and obtained very good results; the algorithm is simply referred to as CN:

CN: Coloring Visual Tracking (project page)

MOSSE is correlation filtering with a single-channel grayscale feature; CSK builds on MOSSE by adding dense sampling (with padding) and the kernel trick; KCF builds on CSK by extending to multi-channel HOG gradient features; and CN builds on CSK by extending to multi-channel Color Names. HOG is a gradient feature while CN is a color feature, and the two are complementary, so HOG+CN has become the standard hand-crafted feature combination in tracking for the past two years. Finally, based on the KCF/DCF experimental results, two questions are worth discussing:

1. Why is the speed gap so small between KCF with single-channel grayscale features and KCF with multi-channel HOG features?

First, the author uses fHOG, the fast HOG implementation from Piotr's Computer Vision Matlab Toolbox, written in C with SSE optimization. For questions about fHOG, see the paper Object Detection with Discriminatively Trained Part Based Models.

Second, HOG commonly uses a cell size of 4, which means that for a 100*100 image the HOG feature map is only 25*25, while raw grayscale pixels, after normalization, remain 100*100 dimensional. A simple estimate: the cost of the 27-channel HOG feature is 27*625*log(625) = 47180, and the cost of the single-channel gray feature is 10000*log(10000) = 40000 — similar in theory, consistent with the table.

Looking at the code, you will also see that when the author expands the target region (padding), the extracted image block is first downsampled by a factor of 2, e.g., to 50*50, and the cost drops sharply to 2500*log(2500) = 8495. You might then think that downsampling even more would lower the cost further, but that comes at the expense of tracking precision. For example, if a 200*200 image block is first downsampled to 100*100 and HOG is then extracted, the feature resolution drops to 25*25, meaning the response map resolution is also 25*25: a 1-pixel shift in the response map moves the tracking box 8 pixels in the original image, reducing precision. When extreme precision is not required, a little accuracy can be traded for frame rate (but it really can't be downsampled much further).
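These back-of-envelope numbers are easy to reproduce; the text's estimates amount to cost ≈ channels * n * log10(n):

```python
# Reproducing the FFT cost estimates quoted above.
from math import log10

def fft_cost(channels, n_pixels):
    return channels * n_pixels * log10(n_pixels)

print(fft_cost(27, 25 * 25))    # HOG, cell size 4: 25x25 map, 27 ch -> ~47180
print(fft_cost(1, 100 * 100))   # raw grayscale 100x100             -> 40000
print(fft_cost(1, 50 * 50))     # grayscale downsampled by 2        -> ~8495
```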

2. With HOG features, which of KCF and DCF is better?

Most people think KCF beats DCF, with higher accuracy on every attribute. But look at it from another angle: taking DCF as the baseline, KCF with the kernel trick improves mean precision by only 0.4% while FPS drops by 41% — isn't that surprising? Besides the total number of image-block pixels, KCF's extra complexity comes mainly from the kernel trick. Therefore, among the CF methods below, those without the kernel trick will be called DCF and those with it KCF (half a spoiler). Of course, CN here also uses the kernel trick — but note that this was both the first and the last time Martin Danelljan ever used it...

This raises the question: just how strong is this high-end-sounding kernel trick, and how much can it really improve things? Here we must mention another masterpiece from Winsty:

Wang N, Shi J, Yeung DY, et al. Understanding and diagnosing visual tracking systems[C]// ICCV, 2015.

Summarized in one sentence: don't be dazzled by the variety of machine learning methods — those are all smoke; in tracking algorithms, features matter most (and for this paper I'm a fan of Uncle Wang, haha). The above are the three most classic high-speed algorithms, CSK, KCF/DCF, and CN, all recommended.

▌ Part IV: Scale Adaptation in 2014

VOT and OTB both appeared in 2013, but VOT2013 had too few sequences, and the code of the winner PLT cannot be found, so it is skipped as having little reference value. Straight to the VOT2014 challenge, VOT2014 Benchmark (http://votchallenge.net/vot2014/index.html). That year there were 25 carefully chosen sequences and 38 algorithms. The flames of the deep learning war had not yet reached tracking, so the protagonist could only be CF, newly risen and dominant. Here are the top entries and details:

The top three are all correlation filter (CF) methods. Third-place KCF is already familiar, with slight differences: multi-scale detection and sub-pixel peak estimation were added, and because VOT sequences have higher resolution (so the detected image blocks are relatively large), this KCF runs at only 24.23 EFO (roughly 66.6 FPS). Speed here is measured in Equivalent Filter Operations (EFO), the same unit used in VOT2015 and VOT2016; listed once here for reference (the actual speed of MATLAB-implemented trackers is higher):

In fact, apart from slightly different features, the core of the top three is multi-scale detection extended on top of KCF, outlined as follows:

Scale change is one of the most basic and common problems in tracking. The previously introduced KCF/DCF and CN have no scale update: if the target shrinks, the filter learns a lot of background; if the target grows, the filter ends up following a local texture of the target. Either situation can easily produce unexpected results, leading to drift and failure.

SAMF, ihpdep/samf (https://github.com/ihpdep/samf), the work of Yang Li of Zhejiang University, is based on KCF with HOG+CN features. Its multi-scale method runs the translation filter on image blocks at multiple scales and takes the translation and scale where the response is largest (see the sketch after the citation):

Li Y, Zhu J. A scale adaptive kernel correlation filter tracker with feature integration [C]// ECCV, 2014.
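The SAMF-style scale search can be sketched in a few lines, reusing a fixed-size translation tracker such as the MOSSE sketch above (assumed to expose detect(patch) -> (response_map, peak)). The 7 scale factors are illustrative, padding is omitted for brevity, and the frame is assumed to be a float32 grayscale array:

```python
# SAMF-style multi-scale detection sketch: one translation filter, several
# resized copies of the search patch, joint (translation, scale) argmax.
import cv2

SCALES = (0.985, 0.99, 0.995, 1.0, 1.005, 1.01, 1.015)

def detect_multiscale(tracker, frame, cx, cy, w, h):
    best = None
    for s in SCALES:
        patch = cv2.getRectSubPix(frame, (int(w * s), int(h * s)), (cx, cy))
        patch = cv2.resize(patch, (w, h))            # filter size stays fixed
        resp, peak = tracker.detect(patch)
        if best is None or resp.max() > best[0]:
            best = (resp.max(), s, peak)
    _, s, peak = best
    return s, peak                                   # best scale + translation
```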

Martin Danelljan's DSST, Accurate Scale Estimation for Visual Tracking, uses only HOG features: DCF handles translation detection, while a separate MOSSE-like correlation filter is trained specifically to detect scale changes — pioneering the translation filter + scale filter combination. It was later extended into TPAMI as the accelerated version fDSST. Very + very + highly recommended:

Danelljan M, Häger G, Khan F, et al. Accurate scale estimation for robust visual tracking [C]// BMVC, 2014.

Danelljan M, Hager G, Khan FS, et al. Discriminative Scale Space Tracking [J]. IEEE TPAMI, 2017.

A brief comparison of these two scale-adaptive methods:

Which scale detection method is better, DSST's or SAMF's?

First, a joke: after Martin Danelljan proposed DSST, his own follow-up papers never used it again (until the latest CVPR ECO-HC adopted fDSST for acceleration).

Although both SAMF and DSST can keep up with ordinary scale changes of the target, SAMF has only 7 scales and is relatively coarse, while DSST alone has 33 scales, fine and accurate;

DSST first detects the optimal translation and then re-detects the optimal scale — step-by-step optimality; SAMF detects translation and scale together, so both are optimal simultaneously — and often the local optimum and the global optimum are not the same;

DSST splits tracking into two sub-problems, translation and scale, so different methods and features can be used for each — more flexible. But it requires training an extra filter, and each frame samples 33 image blocks for scale, each needing feature computation, windowing, FFT, and so on, making the scale filter much slower than the translation filter. SAMF needs only one filter, with no extra training or storage, and one feature extraction plus FFT per scale — but since its image blocks are larger, the per-scale computation is higher than DSST's.

So DSST's scale detection is not always better than SAMF's. In fact, SAMF beat DSST in both VOT2015 and VOT2016 — mainly because its features are better, but at least its scale method is not holding it back. Overall, DSST's approach is novel and faster, while SAMF is equally capable and more accurate.

Does DSST really need 33 scales?

DSST's standard 33 scales are very, very sensitive, and it does not tolerate reducing the scale count: even if you enlarge the scale step accordingly, the scale filter can no longer keep up with scale changes. A possible explanation: the scale filter is trained on one-dimensional samples with no cyclic shift, meaning there are only 33 samples, each updated once per frame. Reduce the sample count and training becomes insufficient, so the classifier's discriminative power drops severely — unlike the translation filter, which has a very large number of shifted samples (personal view; discussion welcome). In short, please do not try to drastically reduce the number of scales; if you leave the scale filter's 33 and 1.02 alone, all is well.
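A sketch of how a DSST-style scale sample is built makes the small-sample argument concrete: 33 patches from a scale pyramid around the current box, each contributing one feature column of a 1-D sample, with a 1-D cosine window over the scale axis and no cyclic shift. Here `feat` is an assumed feature extractor (e.g., HOG on the resized patch), and 33 and 1.02 are DSST's default scale count and step:

```python
# DSST-style 1-D scale sample construction sketch.
import numpy as np
import cv2

N_SCALES, STEP = 33, 1.02
SCALES = STEP ** (np.arange(N_SCALES) - N_SCALES // 2)

def scale_sample(frame, cx, cy, w, h, feat, model_size=(32, 32)):
    cols = []
    for s in SCALES:
        patch = cv2.getRectSubPix(frame, (int(w * s), int(h * s)), (cx, cy))
        cols.append(feat(cv2.resize(patch, model_size)).ravel())
    X = np.stack(cols, axis=1)                 # shape: (feature_dim, 33)
    return X * np.hanning(N_SCALES)[None, :]   # cosine window over scales
```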

These are the two recommended scale detection methods, referred to below as DSST-style multi-scale and SAMF-style multi-scale. If you care more about speed, the accelerated fDSST and a 3-scale SAMF (like the KCF in VOT2014) are the better choices; if you care more about accuracy, the 33-scale DSST and the 7-scale SAMF are more appropriate.

▌ Part V: Boundary Effects

Next, the VOT2015 challenge, VOT2015 Challenge | Home. This year there were 60 carefully chosen sequences and 62 trackers. The biggest surprise: deep learning began its assault on the tracking field. MDNet took the championship outright, the deep-feature correlation filter DeepSRDCF came second, and SRDCF, which addresses only the boundary effect using plain HOG features, came fourth.

As the influence of the VOT challenge grew, the organizers were thoughtful: classics and the state of the art gathered together, a hundred schools of thought contending — 62 trackers in an imperial-city PK, a duel atop Mount Hua. Besides the deep learning and correlation filter methods already described, there is EBT, which combines object proposals (class-agnostic object region detection), in third place (EBT: the untold story of Proposal and Tracking — Zhihu column, https://zhuanlan.zhihu.com/p/26654891); the mean-shift color algorithm ASMS was the recommended real-time algorithm; the other color algorithm DAT, mentioned earlier, also appears; and note that the Struck in 9th place is not the original Struck. Classic methods such as OAB, STC, CMT, CT, and NCC can be seen near the bottom — the classics have been left behind by the times.

Before introducing SRDCF, let's analyze the weaknesses of correlation filtering: in general, CF methods do not handle rapid deformation and rapid motion well.

Rapid deformation is a problem mainly because CF is a template-style method. As analyzed earlier, the correlation filter is a template: if the target deforms rapidly, a HOG-based gradient template certainly cannot keep up, and if the color changes rapidly, a CN-based color template cannot keep up either. This also interacts with the model update strategy and update rate. With fixed-learning-rate linear weighted updating: if the learning rate is too large, partial or brief occlusion, or any inaccurate detection, pollutes the model with background information, which accumulates until the model elopes with the background, never to return; if the learning rate is too small, the target deforms while the template stays the same, and the target eventually becomes unrecognizable.

Rapid motion is mainly a boundary-effect problem: the erroneous samples produced by the boundary effect leave the classifier with insufficient discriminative power. Let's discuss the training stage and the detection stage separately.

In the training stage, synthetic samples reduce discriminative power. If no cosine window is added, the shifted samples look like this:

Apart from the single original sample, all the others are "synthetic": for a 100*100 image block, only 1 in 10,000 samples is real, and such a sample set cannot train anything at all. If a cosine window is added, then because the image's edge pixel values all become zero, a cyclically shifted sample that keeps the target intact can be considered reasonable; only when the target center approaches the edge do the samples with the target crossing the boundary become truly wrong. The number of such false-but-reasonable samples rises to roughly 2/3 (one-dimensional case, padding = 1). But we must not forget that 1/3 of the samples (about 3000 out of 10000) are still unreasonable, and these reduce the classifier's discriminative power. Moreover, the cosine window is not "free": it sets the pixels in the image block's edge region to zero, filtering out a lot of the background information the classifier was supposed to learn. The background the classifier sees during training becomes very limited — we added a cosine window to block the boundary and thereby further reduced its discriminative power (Is it right? It was not God who drew the curtain before my eyes — it was the cosine window).
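A few lines make the counting above concrete: in 1-D, dense sampling by cyclic shift turns one n-pixel signal into n training samples, of which only shift 0 comes from a real image, and the Hann (cosine) window zeroes the borders so that shifts keeping the central target intact at least look plausible:

```python
# Cyclic shifts as "synthetic" training samples, 1-D illustration.
import numpy as np

n = 100
x = np.random.rand(n)                                  # 1-D "image block"
samples = [np.roll(x * np.hanning(n), k) for k in range(n)]
# samples[0] is the only real one; the other 99 are synthetic.
```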

In the detection stage, correlation filters are weak at detecting fast-moving targets. The training image block and the detection image block must be the same size — if you trained a 100*100 filter, you can only detect a 100*100 region. If you plan to enlarge the detection region with more padding, you gain nothing except higher complexity. The target's motion may be its own movement or the camera's; by where the target lands in the detection region, there are four cases:

If the target stays near the center, detection is accurate and succeeds.

If the target moves near the boundary but does not cross it, then after the cosine window is applied, some target pixels are filtered out; the response there is no longer guaranteed to be the global maximum. Worse, the test sample now closely resembles the unreasonable samples seen in training, so detection is quite likely to fail.

If part of the target has already moved out of the region, then adding the cosine window may well filter out the only remaining target pixels, and detection fails.

If the entire target has moved out of the region, failure is guaranteed.

That is the boundary effect. Two main methods are recommended for solving it, though SRDCF is relatively slow and not suitable for real-time applications.

Martin Danelljan's SRDCF, Learning Spatially Regularized Correlation Filters for Visual Tracking. Main idea: since the boundary effect occurs near the boundary, ignore the boundary pixels of all shifted samples — equivalently, constrain the filter coefficients near the boundary to be close to 0:


Danelljan M, Hager G, Shahbaz Khan F, et al. Learning spatially regularized correlation filters for visual tracking [C]// ICCV. 2015.

SRDCF builds on DCF with SAMF-style multi-scale, uses a larger detection region (padding = 4), and adds spatial-domain regularization that penalizes filter coefficients in the boundary region; since there is no closed-form solution, it is optimized iteratively with the Gauss-Seidel method. The enlarged detection region (1.5 -> 4) and the iterative optimization (giving up the closed-form solution) leave SRDCF at only 5 FPS, but the results were very good — the baseline of 2015.
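For reference, SRDCF's objective in (roughly) the paper's notation: a spatial weight $w$ that grows toward the boundary penalizes filter coefficients there,

$$\varepsilon(f) = \sum_{k=1}^{t} \alpha_k \Big\| \sum_{l=1}^{d} x_k^{l} * f^{l} - y_k \Big\|^{2} + \sum_{l=1}^{d} \big\| w \cdot f^{l} \big\|^{2},$$

where the $\alpha_k$ are per-sample weights, $*$ is circular correlation, and $l$ indexes feature channels; with a constant $w$ this falls back to the standard multi-channel DCF loss.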

The other approach is Hamed Kiani's family of modified MOSSE algorithms: CFLB, Correlation Filters with Limited Boundaries, based on grayscale features, and BACF, Learning Background-Aware Correlation Filters for Visual Tracking, based on HOG features. The main idea is to use a larger detection image block and a relatively smaller filter, increasing the proportion of real samples — equivalently, the filter is zero-padded to the size of the detection image. There is likewise no closed-form solution, and ADMM is used for iterative optimization:

Kiani Galoogahi H, Sim T, Lucey S. Correlation filters with limited boundaries [C]// CVPR, 2015.

Kiani Galoogahi H, Fagg A, Lucey S. Learning Background-Aware Correlation Filters for Visual Tracking [C]// ICCV, 2017.

CFLB uses only single-channel grayscale features; although fast at 167 FPS, its performance is far below KCF, so it is not recommended. The newer BACF extends to multi-channel HOG features, exceeds SRDCF in performance, and is relatively fast at 35 FPS — highly recommended.

In fact, these two solutions are similar: detect and update with larger image blocks, and train the filter over a relatively small scope. The difference is that SRDCF's filter coefficients transition smoothly to 0 from center to edge, while CFLB directly fills the filter's edges with 0.

VOT2015's second place was also a correlation filter, DeepSRDCF, combined with deep features; but deep features are very slow — nowhere near high speed on a CPU, not real time. Although its performance is very high, it is not recommended here for now; we'll skip it first.

▌ Part VI: Color Histograms and Correlation Filtering

The VOT2016 challenge, VOT2016 Challenge | Home, kept VOT2015's 60 sequences, but this time the re-annotation was fairer and more reasonable. There were 70 entries this year, and as expected deep learning dominated: 8 pure CNN methods and 6 CF methods combining deep features mostly occupy the top spots, with other CF methods present as well. Most importantly, the organizers, in good conscience, actually published 38 of the trackers — the download address is the VOT2016 Challenge | Trackers page (mother need never worry about me not finding code again~). Note the links: some are source archives, some are binaries; one test and you'll know which, making comparison and research easy, so hurry up and try them. Now the results of the competition (only the first 60 are listed here):

Methods introduced earlier, or otherwise important ones, are highlighted. C-COT, correlation filtering combined with multi-layer deep features, ranked first, while the CNN method TCNN was the official VOT2016 champion — its author also wrote VOT2015 champion MDNet. The pure color methods DAT and ASMS sit at mid-level (measured, the two actually perform very close to each other); for the other trackers, see the paper. Now for speed: SMACF has no public code; ASMS is still as fast as ever; and among the top 10 there are two relatively fast methods, 5th-place Staple and its improved version STAPLE+ in 9th, with STAPLE+ being this year's recommended real-time algorithm. First, congratulations to Luca Bertinetto — both SiamFC and Staple performed very well. Then, three minutes of silence for the big names (verbatim from the VOT2016 paper):

This was particularly obvious in case of SiamFC trackers, which runs orders higher than realtime (albeit on GPU), and Staple, which is realtime, but are incorrectly among the non-realtime trackers.

So VOT2016 had an own goal: Staple runs at 80 FPS on a CPU according to its paper, so how can its EFO here be only 11? Fortunately, both Staple and STAPLE+ are among the published trackers. Measured: although my computer is no match for Luca Bertinetto's, I can still run Staple at 76 FPS. Even funnier, STAPLE+ is about 7-8x slower than Staple yet its EFO is 4x higher. What on earth happened?

First, Staple's code: if you download Staple and set params.visualization = 1, Staple by default calls the Computer Vision System Toolbox to display the sequence; if you happen not to have that toolbox, it falls back to imshow(im) for every frame, which is very, very slow, while setting params.visualization = 0 makes it fly (author, did the Monkey King send you here to troll us?). It's suggested you replace the display code with the corresponding part of DSST, and it will run and display at normal speed.

Then STAPLE+'s code: its improvements over Staple include additionally extracting HOG features from the color probability map, raising the feature count to 56 channels (Staple has 28), and adding a large-displacement optical-flow motion-estimation response to translation detection. That's why it is slower — and necessarily much slower.

So most likely the VOT organizers swapped the EFO values of Staple and STAPLE+, and the real-time recommendation of VOT2016 should be 5th-place Staple: correlation filtering combined with a color method, no deep features, no CNN, running at 80 FPS yet still ranking fifth. This is what I'll mainly introduce next — one of the most NIUBILITY tracking algorithms of 2016, Staple (which directly makes the crowd of deep learning methods ranked below it question their life choices).

Color features: color is a very important feature in tracking — no matter how many people stand together, a target wearing clothes of a different color stands out immediately. The CN from CVPR 2014 introduced earlier is the template color method within the CF framework; here, with fanfare, is the statistical color method DAT (Learning, Recognition, and Surveillance @ ICG), 15 FPS, recommended:

Possegger H, Mauthner T, Bischof H. In defense of color-based model-free tracking [C]// CVPR, 2015.

DAT computes and normalizes the color histograms of the foreground target and the background region — this is the color probability model of foreground and background. In the detection stage, Bayes' rule gives the probability that each pixel belongs to the foreground, producing a pixel-level color probability map; add suppression of similarly colored nearby objects, and you get the target region. A minimal sketch of the pixel-probability part follows.
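This sketch covers only the pixel-level probability map (DAT's distractor suppression is omitted). `img` is a uint8 HxWx3 image and `fg_mask` a boolean HxW target mask; 16 bins per channel is an assumption, not necessarily DAT's setting:

```python
# DAT-style per-pixel foreground color probability via Bayes' rule.
import numpy as np

def color_prob_map(img, fg_mask, bins=16):
    idx = (img // (256 // bins)).reshape(-1, 3)        # quantize colors
    flat = np.ravel_multi_index(idx.T, (bins, bins, bins))
    fg = np.bincount(flat[fg_mask.ravel()], minlength=bins**3).astype(float)
    bg = np.bincount(flat[~fg_mask.ravel()], minlength=bins**3).astype(float)
    p_fg = (fg + 1.0) / (fg + bg + 2.0)                # smoothed P(fg | color)
    return p_fg[flat].reshape(img.shape[:2])
```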

If one sentence had to introduce Luca Bertinetto's (Oxford) Staple (Staple tracker), it would be: combine the template-feature method DSST (based on DCF) with the statistical-feature method DAT:

Bertinetto L, Valmadre J, Golodetz S, et al. Staple: Complementary Learners for Real-Time Tracking [C]// CVPR, 2016.

We analyzed earlier that correlation filter template features (HOG) are poor for rapid deformation and rapid motion, but relatively good for motion blur, illumination change, and so on; color statistical features (the color histogram) are insensitive to deformation and, not belonging to the correlation filter framework, have no boundary effect, so rapid motion is no problem either — but they are weak against illumination change and similarly colored backgrounds. In short, the two are complementary: DSST and DAT can be combined:

The two frameworks combine efficiently and seamlessly: DSST at 25 FPS plus DAT at 15 FPS somehow reach 80 FPS together. The DSST framework splits tracking into translation detection and scale detection, and DAT is added to the translation part: the correlation filter produces one response map, the pixel-level foreground probability produces another, and the two are linearly weighted into the final response (sketched below). The rest is similar to DSST, with the translation filter, scale filter, and color probability model all updated by linear weighting at a fixed learning rate.
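A sketch of this fusion step: the correlation-filter response and the (resized) color-probability response are blended linearly. The merge factor 0.3 follows the Staple paper's default; the two maps are assumed to be roughly comparable in magnitude:

```python
# Staple-style linear fusion of CF response and color-probability response.
import cv2

def fuse_responses(cf_resp, color_resp, merge_factor=0.3):
    color_resp = cv2.resize(color_resp, cf_resp.shape[::-1])
    return (1 - merge_factor) * cf_resp + merge_factor * color_resp
```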

Another method combining correlation filtering with color probability is CSR-DCF from CVPR 2017, which proposes spatial reliability and channel reliability; without deep features, its performance approaches C-COT, at a respectable 13 FPS:

Lukežič A, Vojíř T, Čehovin L, et al. Discriminative Correlation Filter with Channel and Spatial Reliability [C]// CVPR, 2017.

The binary mask from CSR-DCF's spatial reliability is similar to the mask matrix P in CFLB: it adaptively selects the target region that is easier to track and reduces the boundary effect. Where previous multi-channel features were simply summed, CSR-DCF uses a weighted sum over channels, with channel reliability as the adaptive weights. It is optimized iteratively with ADMM; one can view CSR-DCF as a combination of DAT and CFLB.

Among VOT2016's correlation filters there is also first-place C-COT (don't ask me why first place isn't the champion — I don't know either), which, like DeepSRDCF, we skip for now.

▌ Part VII: Long-Term Tracking and Tracking Confidence

Many of the CF algorithms mentioned so far, the VOT challenges included, target the short-term tracking problem: we only care whether tracking is accurate over a short span (say 100-500 frames). In real applications, we want correct tracking for longer — minutes or tens of minutes. That is the long-term tracking problem.

Long-term means we want the tracker to track correctly over a long period. The methods introduced above are unsuited to this setting; correct long-term tracking requires a short-term tracker + detector combination.

In one sentence: long-term means equipping an ordinary tracker with a detector — when tracking is found to have gone wrong, the built-in detector is called to re-detect and correct the tracker.

A representative long-term method in the CF direction is Chao Ma's LCT, chaoma99/lct-tracker:

Ma C, Yang X, Zhang C, et al. Long-term correlation tracking[C]// CVPR, 2015.

On top of DSST's translation correlation filter (Rc) and scale correlation filter, LCT adds a third correlation filter (Rt) responsible for the confidence of the detected target. The Online Detector module is the random fern classifier used in TLD (changed to an SVM in the code). The third, confidence filter resembles MOSSE in having no padding, its features skip the cosine window, and it runs after translation detection.

If the maximum response is below the first threshold (the motion threshold), translation detection is deemed unreliable and the detection module is invoked to re-detect. Note that the re-detection result is not always adopted: it is accepted only when its maximum response is 1.5 times larger than that of the first detection; otherwise, the translation result stands.

If the maximum response exceeds the second threshold (the appearance threshold), translation detection is deemed sufficiently trustworthy, and only then are the third correlation filter and the random fern classifier updated online at a fixed learning rate. Note that the first two correlation filters are updated every frame at a fixed learning rate, exactly as in DSST. A sketch of this logic follows.
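The two-threshold logic as pseudocode-in-Python; the threshold values and the detector interface here are placeholders, not LCT's actual numbers:

```python
# LCT-style two-threshold confidence logic (placeholder values/interfaces).
MOTION_THRESH, APPEARANCE_THRESH = 0.15, 0.38          # hypothetical values

def lct_step(conf_max, translation_box, detector, model):
    box = translation_box
    if conf_max < MOTION_THRESH:                       # translation unreliable
        redet = detector.redetect()                    # assumed interface
        # accept re-detection only if clearly stronger than tracking
        if redet.response_max > 1.5 * conf_max:
            box = redet.box
    if conf_max > APPEARANCE_THRESH:                   # confident enough
        model.update_confidence_filter_and_fern(box)   # assumed interface
    return box
```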

With its detection mechanism, LCT should in theory handle occlusion and out-of-view cases better; it runs at 27 FPS. The experiments only cover OTB-2013, where tracking precision is very high; according to other papers, LCT is slightly worse on OTB-2015 and VOT, perhaps because its two core thresholds are not adaptive. For long-term tracking, both TLD and LCT are worth studying.

Next, tracking confidence. A tracking algorithm needs to reflect how reliable each tracking result is — this matters a lot, otherwise you can lose the target without even knowing it. Generative methods have their similarity measures; discriminative methods have the classification probabilities of their machine learning models. Two kinds of indices reflect the confidence of correlation filter methods: the maximum response value we've already seen, and the shape of the response pattern we haven't — or indices that combine both.

LMCF (from MM Wang's object tracking column on Zhihu) proposes multi-peak detection and high-confidence updating:

Wang M, Liu Y, Huang Z. Large Margin Object Tracking with Circulant Feature Maps [C]// CVPR, 2017.

High-confidence updating: the model is updated only when tracking confidence is relatively high, avoiding contamination of the target model while also improving speed. The first confidence index is the maximum response score Fmax, i.e., the maximum response value (also mentioned in Staple and LCT). The second is the average peak-to-correlation energy (APCE), which reflects the degree of fluctuation of the response map and the confidence level of the detection — (possibly) the best index currently available. Recommended:

There are more confidence indices: the peak-to-sidelobe ratio (PSR) from MOSSE, computed from the correlation peak together with the mean and standard deviation of the sidelobes outside an 11*11 peak window. Recommended (both indices are sketched below):
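Both indices on a response map r, following the definitions above (PSR from MOSSE: peak vs. sidelobe statistics outside an 11x11 peak window; APCE from LMCF: squared peak-to-trough margin over the mean squared fluctuation):

```python
# PSR and APCE confidence indices for a correlation response map r.
import numpy as np

def psr(r, win=11):
    py, px = np.unravel_index(r.argmax(), r.shape)
    mask = np.ones(r.shape, dtype=bool)            # sidelobe = all but peak window
    mask[max(0, py - win // 2):py + win // 2 + 1,
         max(0, px - win // 2):px + win // 2 + 1] = False
    side = r[mask]
    return (r[py, px] - side.mean()) / (side.std() + 1e-8)

def apce(r):
    return (r.max() - r.min()) ** 2 / np.mean((r - r.min()) ** 2)
```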

CSR-DCF's spatial reliability was mentioned earlier; it also uses two similar indices for channel reliability. The first is again the per-channel maximum response peak, i.e., Fmax; the second is the ratio between the second and first major modes of the response map, reflecting the expressiveness of the dominant mode in each channel's response, though it requires maximum detection first:

▌ Part VIII: Convolutional Features

This last part is Martin Danelljan's show: mainly his series of works, especially correlation filter methods combined with deep features. All the code is on his homepage (Visual Tracking), so I won't paste it piece by piece.

Danelljan M, Shahbaz Khan F, Felsberg M, et al. Adaptive color attributes for real-time visual tracking [C]// CVPR, 2014.

In CN he proposed the very important multi-channel color feature Color Names, used in the CSK framework to very good effect; he also proposed the accelerated CN2, which applies a PCA-like adaptive dimensionality reduction to the feature channels (10 -> 2), with a smoothing term adding a cost for jumping across different feature subspaces — that is, the covariance matrix in the PCA is updated linearly to keep the projection matrix from changing too much.

Danelljan M, Hager G, Khan FS, et al. Discriminative Scale Space Tracking [J]. IEEE TPAMI, 2017.

DSST won VOT2014 and pioneered the translation-filter + scale-filter formulation. fDSST accelerates DSST: PCA reduces the channels of the translation filter's HOG features (31 -> 18), QR decomposition reduces the scale filter's roughly 1000*17 features to 17*17, and finally triangular interpolation (frequency-domain interpolation) interpolates the scale count from 17 back to 33 for more precise scale localization.

SRDCF was fourth in VOT2015. To mitigate the boundary effect, it enlarges the detection region and adds a spatial constraint term to the objective, optimized iteratively with Gauss-Seidel, plus Newton iterations for precise sub-grid target localization in translation detection.

Danelljan M, Hager G, Shahbaz Khan F, et al. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking [C]// CVPR, 2016.

SRDCFdecon builds on SRDCF to fix the sample and learning-rate problems. Previous correlation filters all updated the model by linear weighting with a fixed learning rate, which is simple and requires no storage of past samples, but inaccurate localization, occlusion, or background disturbance pollutes the model and causes drift. SRDCFdecon instead stores past samples (image blocks containing positive and negative samples) and adds sample weight parameters and a regularization term to the objective. It is optimized by alternating convex search: first fix the sample weights and optimize the model parameters with Gauss-Seidel iterations, then fix the model parameters and optimize the sample weights by convex quadratic programming.

Danelljan M, Hager G, Shahbaz Khan F, et al. Convolutional features for correlation filter based visual tracking [C]// ICCVW, 2015.

DeepSRDCF was second in VOT2015: it replaces SRDCF's HOG features with deep features from a single convolutional layer of a CNN (i.e., the layer's activations), bringing a huge improvement. It uses the imagenet-vgg-2048 network; VGG networks transfer well, and MatConvNet comes from the VGG group, so calling it from MATLAB is very convenient. The paper also tests how different convolutional layers perform on the tracking task:

Layer 1 performs best, with layers 2 and 5 next. Higher convolutional layers carry more semantic information but fewer texture details; one reason layers 1 through 4 get progressively worse is that the feature map resolution keeps dropping, while layer 5 bounces back because it contains complete semantic information with strong discriminative power (it was built for recognition, after all).

Be careful to distinguish deep features here from deep-learning-based methods. Deep features come from an image classification network pre-trained on ImageNet, with no fine-tuning, so there is no overfitting problem. Deep-learning-based methods, by contrast, mostly need end-to-end training or fine-tuning on tracking sequences, and with limited sample count and diversity they can easily overfit.

Ma C, Huang JB, Yang X, et al. Hierarchical convolutional features for visual tracking [C]// ICCV, 2015.

Also worth mentioning is Chao Ma's HCF, which combines multi-layer convolutional features to improve results: it uses the activations of VGG19's Conv5-4, Conv4-4, and Conv3-4 as features, all rescaled to the image block resolution. Although the paper describes coarse-to-fine target localization, the code is more direct: the responses of the three layers are linearly combined with fixed weights 1, 0.5, and 0.02 as the final response. Despite the multi-layer convolutional features, it neither addresses the boundary effect nor goes beyond simple linear weighting, and HCF ranked only 28th in VOT2016 (DeepSRDCF, with single-layer deep features, was 13th).

Danelljan M, Robinson A, Khan FS, et al. Beyond correlation filters: Learning continuous convolution operators for visual tracking [C]// ECCV, 2016.

C-COT won VOT2016. It combines SRDCF's spatial regularization with SRDCFdecon's adaptive sample weights, and extends DeepSRDCF's single-layer deep features to multi-layer deep features (VGG layers 1 and 5). To cope with the different resolutions of different convolutional layers, it proposes a continuous spatial-domain interpolation operator: before training, the feature maps are implicitly interpolated in the frequency domain into a continuous spatial domain, making it convenient to integrate multi-resolution feature maps while keeping localization precision high. The objective is optimized iteratively by conjugate gradient descent, faster than Gauss-Seidel; the adaptive sample weights use prior weights directly, without the alternating convex optimization; and detection uses Newton iterations to optimize the target position.

Note that none of SRDCF, SRDCFdecon, DeepSRDCF, or C-COT runs in real time. This series of works gets better and better, but also more and more complex. Just as correlation filtering was getting ever slower and losing its speed advantage, Martin Danelljan slammed on the brakes with ECO at CVPR 2017 — the great man showing us what "both good and fast" means, staying true to the original aspiration:

Danelljan M, Bhat G, Khan FS, et al. ECO: Efficient Convolution Operators for Tracking [C]// CVPR, 2017.

ECO is the accelerated version of C-COT, speeding things up along three axes: model size, sample-set size, and update strategy. It is 20x faster than C-COT while EAO improves by 13.3% — more for less. Best of all, the hand-crafted-features version ECO-HC runs at 60 FPS. Enough hype; here are the specifics.

First, fewer model parameters: a factorized convolution operator is defined, similar in effect to PCA. It is initialized with PCA, and the projection matrix is optimized only in the first frame and reused in all later frames — in short, supervised dimensionality reduction. With deep features, the model parameters shrink by 80%.

Second, fewer samples: a compact generative model of the sample set uses a Gaussian Mixture Model (GMM) to merge similar samples, building a more representative and diverse sample set; the number of samples to store and optimize drops to 1/8 of C-COT's.

Third, a changed update strategy: a sparser updating scheme optimizes the model parameters only every 5 frames, which not only speeds things up but also improves robustness to sudden changes and occlusion. The sample set is still updated every frame, so sparse updating does not miss sample changes during the interval.

Of course, ECO's success has many more details, some of which I don't fully understand either — in short, it's just that strong. ECO's experiments cover four benchmarks (VOT2016, UAV123, OTB-2015, and TempleColor) and it ranks first on all of them, without overfitting problems. In terms of pure performance, ECO is currently the best correlation filter algorithm, and possibly the best tracking algorithm, period. In the hand-crafted-features version ECO-HC, the dimensionality-reduction part takes the original 42-dimensional HOG+CN features down to 13 dimensions; the rest is similar. Experimentally, ECO-HC surpasses most deep learning methods, and the paper reports 60 FPS on a CPU.

Last is Luca Bertinetto's CFNet, End-to-end representation learning for Correlation Filter based tracking: besides combining correlation filters with deep features as above, the correlation filter can also be trained end-to-end inside a CNN:

Valmadre J, Bertinetto L, Henriques JF, et al. End-to-end representation learning for Correlation Filter based tracking [C]// CVPR, 2017.

On the basis of SiamFC, the correlation filter is made a layer of the CNN; the key contribution is deriving the forward and backward passes of the CF layer. The two-conv-layer CFNet runs at 75 FPS on a GPU, but its overall performance is not particularly dazzling — perhaps because the CF layer's boundary effect is hard to handle. I'm taking a wait-and-see attitude.

▌ Part IX: CVPR and ICCV 2017 Results

Below are the CVPR 2017 tracking results. Perhaps the great MD would say: not a single one of them can put up a fight!

Following the same table format, here are the ICCV 2017 papers compared against ECO: alas, still not one that can put up a fight!

▌ Part X: Recommended Researchers

To round things out, the two groups currently contributing the most in the correlation filter direction (with both innovation and code):

University of Oxford: João F. Henriques and Luca Bertinetto. Representative works: CSK, KCF/DCF, Staple, CFNet (also SiamFC, Learnet).

Linköping University: Martin Danelljan. Representative works: CN, DSST, SRDCF, DeepSRDCF, SRDCFdecon, C-COT, ECO.

There is also plenty of excellent work from domestic universities, which I won't list one by one.
