I believe many readers here, myself included at one point, are looking for a better target tracking algorithm, or want a deeper understanding of this area. Although the question asks about classic tracking algorithms, what we may actually need is not the trackers that were once brilliant but have since been washed up on the beach, but rather those about to become classics, or those currently offering the best balance of usability, speed, and performance. My own focus is the correlation filter branch of target tracking. Below I will introduce the target tracking work I know, especially correlation filter methods, share some algorithms I consider good, and offer my views along the way.
▌ Part I: Target Tracking Snapshot
Let's start with a few SOTA trackers to get a snapshot of this direction. Everything begins with the benchmark database of 2013. If you ask someone which tracking algorithms have been most impressive in recent years, most people will point you to Wu Yi's two papers, OTB50 and OTB100 (OTB50 refers to OTB-2013 and OTB100 to OTB-2015; 50 and 100 denote the number of videos, for easy memory):
Wu Y, Lim J, Yang M H. Online object tracking: A benchmark [C]// CVPR, 2013.
Wu Y, Lim J, Yang M H. Object tracking benchmark [J]. TPAMI, 2015.
Both got the top-conference-then-top-journal treatment, and with roughly 1480+320 citations their impact is self-evident: this is now the must-run database for anyone doing tracking. Test code and sequences can be downloaded from the Visual Tracker Benchmark (http://cvlab.hanyang.ac.kr/tracker_benchmark/). OTB50 includes 50 sequences, all manually labeled:
The two papers compared 29 top trackers from 2012 and earlier on the database, including OAB, IVT, MIL, CT, TLD, and Struck, which are familiar to everyone. Before this there was no commonly recognized database, papers promoted themselves, and nobody knew which method was actually usable, so the significance of this benchmark is enormous: it directly propelled the development of tracking algorithms. It was later extended to OTB100 for TPAMI, with 100 sequences, more difficult and more authoritative; the results cited here are from OTB100. First, the speed and publication year of the 29 trackers (algorithms with notably good speed/performance are marked):
Next, the results (for more detail, the paper itself is clearer):
Straight to the conclusion: on average, Struck, SCM, and ASLA perform best, occupying the top three. Special mention goes to CSK, which showed the world the potential of correlation filtering for the first time, ranking fourth while running at an astonishing 362 FPS. The second-fastest is the classic algorithm CT at 64 FPS (SCM, ASLA, and the like are sparse representation, the hottest topic of that era). If you are interested in even earlier algorithms, another classic survey is recommended (for what it's worth, I'm neither interested in nor have read them):
Yilmaz A, Javed O, Shah M. Object tracking: A survey [J]. CSUR, 2006.
That covers the algorithms up to 2012. Ever since AlexNet in 2012, CV has seen tremendous changes in every field, so you surely want to know what happened between 2013 and 2017. I'm sorry, I don't know either — but we can be sure that papers from 2013 onward almost always cite the OTB50 paper, so with the help of the "cited by" feature in Google Scholar, the following results are obtained:
Only the most-cited works are shown here: Struck's extension to TPAMI, the three major correlation filter methods KCF, CN, and DSST, and the VOT competitions. This is only a demonstration; try it yourself if interested. (The rationale: for any paper, the work before it can be seen in its references, and the work after it can be seen in who cites it. Citation counts by themselves prove nothing, but a good method is generally used — respected and recognized — by everyone. You can also restrict the time range to find papers from a given period, e.g., 2016-2017 for the latest work, though paper quality then needs careful screening. The same procedure works for important papers in other directions: follow these steps, learn who the leading researchers are, then focus on tracking their work.) From this we can tell that the clearest recent progress in target tracking is correlation filtering, and among correlation filter algorithms we find SAMF, LCT, HCF, SRDCF, and so on.
Of course, citation counts are also a function of time, so it is best to check year by year. In addition, the latest OpenCV 3.2 includes several very recent tracking algorithms besides TLD — see the OpenCV Tracking API (https://):
The TrackerKCF interface implements KCF and CN; the influence is evident. There is also GOTURN, a deep-learning-based method that is fast but slightly less accurate, and worth a look. For the latest papers in this direction, follow the three major conferences (CVPR/ICCV/ECCV) and arXiv.
▌ Part II: Background Introduction
Next, some background. The target tracking discussed here is ordinary single-target tracking: a rectangular box is given in the first frame — manually annotated in benchmarks, and in practice usually the output of a detection algorithm — and the tracking algorithm must then follow this box in subsequent frames. Here are VOT's requirements for a tracking algorithm:
Target tracking usually faces several difficulties (from Wu Yi's slides at VALSE): appearance deformation, illumination changes, fast motion and motion blur, and similar-looking background clutter:
Plus out-of-plane rotation, in-plane rotation, scale change, occlusion, out-of-view situations, and so on:
These conditions are what make tracking difficult. Besides OTB, the other commonly used benchmark is the database of the VOT competition (tracking's analog of ImageNet), which has been held for four years. VOT2015 and VOT2016 both include 60 sequences, all freely downloadable:
VOT Challenge | Challenges (http://votchallenge.net/challenges.html)
Kristan M, Pflugfelder R, Leonardis A, et al. The visual object tracking vot2013 challenge results [C]// ICCV, 2013.
Kristan M, Pflugfelder R, Leonardis A, et al. The Visual Object Tracking VOT2014 Challenge Results [C]// ECCV, 2014.
Kristan M, Matas J, Leonardis A, et al. The visual object tracking vot2015 challenge results [C]// ICCV, 2015.
Kristan M, Leonardis A, Matas J, et al. The Visual Object Tracking VOT2016 Challenge Results [C]// ECCV, 2016.
Differences between OTB and VOT: OTB includes 25% grayscale sequences, while VOT is all color sequences, which is one reason many color-feature algorithms perform differently on the two; the evaluation metrics of the two libraries also differ (see the papers for details); and VOT sequences generally have higher resolution, which matters in the analysis later. For a tracker, if a paper shows good results on both libraries (preferably OTB100 and VOT2016), it is definitely very good (tuning parameters separately for each library is acceptable to me). If only one is run, I personally prefer VOT2016, because its sequences are finely labeled and its evaluation metrics are better (it is a competition after all, and its evaluation methodology has been published in TPAMI). The biggest difference: OTB initializes from random frames, or from the ground-truth rectangle perturbed with random noise, which the authors argue better matches initialization by a detection algorithm; VOT initializes from the first frame and, on every tracking failure (prediction box and ground-truth box no longer overlap), re-initializes 5 frames later. VOT emphasizes short-term tracking and holds that tracking and detection should never be separated: the detector will initialize the tracker multiple times.
A note: OTB was published in 2013, so it is transparent to algorithms published after 2013 — papers can tune their parameters on it, especially those that only run OTB. If key parameters are given directly to two decimal places, it is recommended you measure for yourself first (not to be cynical, but I've been burned more than once). The VOT competition database is updated every year, re-labeled at every turn, and its evaluation metrics keep changing, which is harder on that year's algorithms, so the results are relatively more reliable. (I believe many people, like me, read each paper and find every work excellent and indispensable — as if without it the Earth would explode and the universe would reboot. So, just as everyone has learned from years of the ILSVRC competition and the development of deep learning, third-party results are more persuasive. I therefore use competition ranking + open-source availability + measured performance as the standard for picking a few algorithms to analyze.)
Visual object tracking methods are widely divided into two categories: generative methods and discriminative methods. The currently dominant category is discriminative methods, also called tracking-by-detection. For completeness, a brief introduction of both follows.
Generative methods model the target region in the current frame, then find the region most similar to the model in the next frame as the predicted position; famous examples include the Kalman filter, the particle filter, and mean-shift. For example, knowing from the current frame that the target region is 80% red and 20% green, the search algorithm then hunts, like a headless fly, for the region in the next frame that best matches this color ratio. Recommended algorithm: ASMS, vojirt/asms (https://github.com/vojirt/asms):
Vojir T, Noskova J, Matas J. Robust scale-adaptive mean-shift for tracking [J]. Pattern Recognition Letters, 2014.
ASMS and DAT — dub them the "color duo" (copyright reserved) — are color-only algorithms and very fast, ranking 20th and 14th in VOT2015 and 32nd and 31st in VOT2016 respectively (mid-level). ASMS is the official recommended real-time algorithm of VOT2015, averaging 125 FPS. Under the classical mean-shift framework it adds scale estimation, augments the classical color histogram feature with two priors (the scale does not change abruptly + is possibly biased large) as regularization terms, and adds a reverse-scale consistency check. The author provides C++ code. In the era of correlation filtering and deep learning, it is no small feat to still see mean-shift this high on the list (tears of nostalgia). Measured performance is not bad; if you have a particular interest in generative methods, it is highly recommended. (Some algorithms can't even beat this — the rooftop is on the 24th floor, thank you.)
Discriminative methods: most of the methods in OTB50 are of this type. They use CV's classic recipe of image features + machine learning: the target region in the current frame is the positive sample, the background region provides negative samples, a machine learning method trains a classifier, and the next frame uses the trained classifier to find the optimal region:
The biggest difference from generative methods is that the classifier is trained with machine learning, and that background information is used in training, so the classifier can focus on distinguishing foreground from background; discriminative methods are therefore generally better than generative ones. For example, during training we not only tell the tracker that the target is 80% red and 20% green, but also warn it that there is orange in the background — take extra care not to confuse them. A classifier that knows more information gets relatively better results. Tracking-by-detection is very similar to detection: classic pedestrian detection uses HOG+SVM, Struck uses Haar features with a structured-output SVM, and tracking likewise needs multi-scale traversal search for scale adaptation. The only differences are that tracking demands much faster online machine learning, and its detection range and scale space are smaller.
This is not surprising. In most application scenarios, a detection or recognition algorithm of non-trivial complexity cannot be run on every frame; that is exactly where a lower-complexity tracking algorithm fits — just re-run detection periodically, or whenever tracking drifts, to re-initialize the tracker. Frankly, FPS is the single most important metric; an algorithm that is too slow is simply dead (don't be so extreme, classmates — speed can be optimized). Among classic discriminative methods I recommend Struck and TLD, both real-time. Struck was the best method up to 2012; TLD is the representative classic long-term tracker, and its ideas are well worth studying:
Hare S, Golodetz S, Saffari A, et al. Struck: Structured output tracking with kernels [J]. IEEE TPAMI, 2016.
Kalal Z, Mikolajczyk K, Matas J. Tracking-learning-detection [J]. IEEE TPAMI, 2012.
As the Yangtze's later waves push the earlier ones forward, the front waves have been left on the beach. The later waves are correlation filtering and deep learning. Correlation filter methods are abbreviated CF, also called discriminative correlation filters, abbreviated DCF — note the distinction from the specific DCF algorithm below. These include the methods mentioned earlier and are the focus of the rest of this article.
Deep-ConvNet-based methods: because deep learning is currently hard to deploy, they are not recommended here. For more information see Winsty's homepage, Naiyan Wang - Home (link 1); VOT2015 champion MDNet, Learning Multi-Domain Convolutional Neural Networks for Visual Tracking (link 2); VOT2016 champion TCNN (link 3); and on the fast side, the SiamFC Siamese FC tracker at around 80 FPS (link 4) and GOTURN (davheld/GOTURN) at over 100 FPS (link 5) — note that both speeds are on GPU. The ResNet-based SiamFC-R performed well on VOT2016, and I am very optimistic about its follow-up development; those interested can also go to VALSE and hear the authors explain it themselves, VALSE-20160930-LucaBertinetto-Oxford-JackValmadre-Oxford (link 6). As for GOTURN, its accuracy is relatively poor, but the advantage is that it runs very fast at 100 FPS; if the accuracy catches up later, that will be plenty. For students doing research, deep learning is the key direction — ideally while keeping speed in balance.
Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking [C]// CVPR, 2016.
Nam H, Baek M, Han B. Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242, 2016.
Bertinetto L, Valmadre J, Henriques JF, et al. Fully-convolutional siamese networks for object tracking [C]// ECCV, 2016.
Held D, Thrun S, Savarese S. Learning to track at 100 fps with deep regression networks [C]// ECCV, 2016.
In the end, the full END2END power of deep learning is still far from unleashed in target tracking; the gap between it and correlation filter methods is not large. (As deep learning might put it: my speed is slow by nature, but my accuracy had better always win — otherwise what is the point of my existence? The revolution has not yet succeeded; comrades must keep striving.) Another issue needing attention: tracking databases have no strict train/test split, so deep learning methods requiring offline training must take great care that their training sets contain no similar sequences. Only with VOT2017 did the organizers formally forbid training on similar sequences.
Finally, two strongly recommended resources. Qiang Wang maintains benchmark_results, foolwood/benchmark_results (https://github.com/foolwood/benchmark_results): a large collection of top-method performance comparisons on the OTB libraries, with paper code available throughout. He has also implemented and open-sourced CSK, KCF, and DAT in C++, plus his own DCFNet paper and source code — students who can't find code should follow closely.
@H Hakase maintains a correlation filter resource list, HakaseH/CF_benchmark_results (https://github.com/HakaseH/TBCF), with detailed classification and paper/code resources. Don't pass it by — its coverage of correlation filter algorithms is very comprehensive and very careful!
(You two above: upon seeing this, please contact me to settle the advertising fee — 10% discount.)
▌ Part III: Correlation Filtering
Let's introduce the most classic high-speed correlation filter tracking algorithms: CSK, KCF/DCF, and CN. Many people first learned of CF through — and, like me, were probably drawn in by — the following picture:
This is the experimental result of KCF/DCF on OTB50 (the arXiv version circulated in April 2014, before OTB100 was published). Its precision and FPS crushed Struck, then the best method on OTB50; having grown used to the barely-real-time Struck and TLD, the blazing speed of KCF/DCF seemed almost too good to be true. KCF/DCF is in fact the multi-channel-feature improvement of CSK, which had already shone on OTB. Also note the super-fast 615 FPS of MOSSE (serious speeding — here is your ticket), the first correlation filter method in target tracking and truly the first demonstration of CF's potential. Contemporary with KCF is CN, the color-feature method that caused a stir at CVPR 2014, which is essentially the multi-channel color-feature improvement of CSK. From MOSSE (615 FPS) to CSK (362 FPS) to KCF (172 FPS), DCF (292 FPS), CN (152 FPS), and CN2 (202 FPS): the speed keeps dropping while the results keep improving, yet always remain at high-speed levels:
Bolme DS, Beveridge JR, Draper BA, et al. Visual object tracking using adaptive correlation filters [C]//CVPR, 2010.
Henriques JF, Caseiro R, Martins P, et al. Exploiting the circulant structure of tracking-by-detection with kernels [C]// ECCV, 2012.
Henriques JF, Rui C, Martins P, et al. High-speed tracking with kernelized correlation filters [J]. IEEE TPAMI, 2015.
Danelljan M, Shahbaz Khan F, Felsberg M, et al. Adaptive color attributes for real-time visual tracking [C]// CVPR, 2014.
CSK and KCF are both the work of João F. Henriques (Oxford University) and influenced much later work: the core ridge regression, the approximate dense sampling via cyclic shifts, and the detailed derivation of the entire correlation filter algorithm. KCF adds the closed-form solution of ridge regression with the kernel trick, plus multi-channel HOG features.
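What all of these share is ridge regression solved in the Fourier domain, where the circulant structure of shifted samples diagonalizes the problem. Below is a minimal single-channel MOSSE/DCF-style sketch in NumPy — an illustration of the idea, not the authors' MATLAB code; the windowed patch `x`, Gaussian label map `y`, and regularizer `lam` are assumed inputs:

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    """Single-channel correlation filter: per-frequency closed-form ridge solution.
    x: (H, W) windowed feature patch; y: (H, W) Gaussian-shaped desired response."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)   # filter in the Fourier domain

def detect(w_hat, z):
    """Correlate the learned filter with a new patch z; the peak gives the shift."""
    response = np.real(np.fft.ifft2(w_hat * np.fft.fft2(z)))
    return np.unravel_index(np.argmax(response), response.shape)
```

KCF/DCF and CN build on exactly this skeleton with multi-channel features (and, for KCF and CN, a kernelized version of the same closed-form solution).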
Martin Danelljan (Linköping University) extended CSK with the multi-channel color feature Color Names (CN) and obtained very good results; the algorithm is referred to simply as CN (Coloring Visual Tracking, http://).
MOSSE is correlation filtering on single-channel grayscale features; CSK builds on MOSSE with dense sampling (plus padding) and the kernel trick; KCF extends CSK with multi-channel gradient HOG features; and CN extends CSK with multi-channel Color Names. HOG is a gradient feature and CN a color feature, and the two are complementary, so HOG+CN has become the standard hand-crafted feature combination in tracking algorithms over the past two years. Finally, based on the KCF/DCF experimental results, two questions are worth discussing:
1. Why is the speed gap between the KCF with single-channel grayscale features and the KCF with multi-channel HOG features so small?
First, the authors use fHOG, the fast HOG implementation from Piotr's Computer Vision Matlab Toolbox — C code with SSE optimization. For questions about fHOG, see the paper Object Detection with Discriminatively Trained Part Based Models.
Second, HOG commonly uses a cell size of 4, meaning a 100*100 image yields a HOG feature map of only 25*25, while raw grayscale pixels (normalized) remain 100*100. A quick calculation: the complexity of the 27-channel HOG feature is about 27*625*log(625) ≈ 47180, and that of the single-channel grayscale feature is 10000*log(10000) = 40000 — similar in theory, consistent with the table.
Looking at the code, you will also see that when the target region is expanded, the extracted image block is first downsampled by a factor of 2 to 50*50, so the complexity drops sharply to 2500*log(2500) ≈ 8495. You might then think that downsampling even more would lower the complexity further, but that comes at the expense of tracking precision. For example, if the image block is 200*200 and first downsampled to 100*100, HOG extraction reduces the resolution to 25*25; the response map is then also 25*25, meaning a 1-pixel shift in the response map moves the tracking box 8 pixels in the original image, which hurts precision. When high precision is not required, a little accuracy can be traded for frame rate (but it really cannot be downsampled any further).
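To make the arithmetic above reproducible, here is a tiny sketch of the cost proxy being used (base-10 log, matching the numbers in the text):

```python
import math

def cf_cost(pixels, channels):
    """Rough FFT-dominated cost proxy: channels * N * log10(N), N = total pixels."""
    return channels * pixels * math.log10(pixels)

print(cf_cost(25 * 25, 27))    # 27-channel HOG map, 25x25    -> ~47180
print(cf_cost(100 * 100, 1))   # single-channel gray, 100x100 -> 40000
print(cf_cost(50 * 50, 1))     # after 2x downsampling        -> ~8495
```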
2. With HOG features, which is better: KCF or DCF?
Most people would say KCF: its accuracy on every attribute is above DCF's. But look at it from another angle, taking DCF as the baseline: adding the kernel trick in KCF improves mean precision by only 0.4% while FPS drops by 41% — isn't that surprising? Besides the total number of image-block pixels, KCF's extra complexity comes mainly from the kernel trick. Therefore, in what follows, CF methods without the kernel trick are called DCF for short, and those with it KCF (half a spoiler there). Of course, CN also uses the kernel trick — but note that this was the last time Martin Danelljan ever used it...
This raises the question: just how powerful is this kernel-trick thing, and how much improvement can it really buy? Here we must mention another masterpiece of Winsty's:
Wang N, Shi J, Yeung DY, et al. Understanding and diagnosing visual tracking systems[C]// ICCV, 2015.
Summarized in one sentence: don't be dazzled by the variety of machine learning methods — those are all secondary; the features are the most important part of a target tracking algorithm (yes, this paper made me a fan of Uncle Wang, haha). The above are the three most classic high-speed algorithms — CSK, KCF/DCF, and CN — all recommended.
▌ Part IV: Scale Adaptation in 2014
VOT and OTB both appeared in 2013, but the VOT2013 sequence set was too small and the code of the first-place PLT cannot be found, so it is skipped as having little reference value. Straight to the VOT2014 competition, VOT2014 Benchmark (http://votchallenge.net/vot2014/index.html): that year had 25 carefully chosen sequences and 38 algorithms. The flames of the deep learning war had not yet reached tracking, so the protagonist could only be the newly ascendant CF. Here are the top entries in detail:
The top three are all correlation filter (CF) methods. Third-place KCF is already familiar, with slight differences here: multi-scale detection and sub-pixel peak estimation are added, and because VOT sequences generally have higher resolution (the detected and updated image blocks are larger), KCF's speed is only 24.23 EFO (converted, about 66.6 FPS). The speed unit here is Equivalent Filter Operations (EFO), also used to measure algorithm speed in VOT2015 and VOT2016; the table is listed once here for reference (the actual speed of MATLAB-implemented trackers is higher):
In fact, apart from slightly different features, the core of the top three is all KCF-based extended multi-scale detection, outlined as follows:
Scale change is a basic and common problem in tracking. KCF/DCF and CN, introduced earlier, have no scale update: if the target shrinks, the filter learns a lot of background information; if the target grows, the filter ends up following a local texture of the target. Either case is likely to produce unexpected results, leading to drift and failure.
SAMF, ihpdep/samf (https://github.com/ihpdep/samf), by Yang Li of Zhejiang University, is based on KCF with HOG+CN features. Its multi-scale approach runs the translation filter over image blocks at multiple scales and takes the translation position and scale with the maximum response:
Li Y, Zhu J. A scale adaptive kernel correlation filter tracker with feature integration [C]// ECCV, 2014.
Martin Danelljan's DSST, Accurate scale estimation for visual tracking (http://), uses only HOG features: DCF handles translation detection, while a separate MOSSE-like correlation filter is trained specifically to detect scale changes — pioneering the translation filter + scale filter design. He later extended it in TPAMI as the accelerated version fDSST. Very, very highly recommended:
Danelljan M, Häger G, Khan F, et al. Accurate scale estimation for robust visual tracking [C]// BMVC, 2014.
Danelljan M, Hager G, Khan FS, et al. Discriminative Scale Space Tracking [J]. IEEE TPAMI, 2017.
A brief comparison of these two scale-adaptation methods:
Which scale-detection approach is better, DSST's or SAMF's?
First, a joke: after Martin Danelljan proposed DSST, his own follow-up papers never used it again (until the recent CVPR ECO-HC adopted fDSST for acceleration).
Although both SAMF and DSST can keep up with ordinary scale changes of the target, SAMF has only 7 scales and is relatively coarse, while DSST has 33 scales and is finer and more accurate;
DSST first finds the optimal translation and then re-detects the optimal scale — a step-wise optimum; SAMF detects translation and scale together, so translation and scale are optimal jointly — and the step-wise local optimum and the global optimum are often not the same;
DSST splits tracking into two sub-problems, translation and scale, each of which can use different methods and features — more flexible — but it requires training an extra filter: every frame's scale estimation must sample 33 image blocks and separately compute features, apply windows, run FFTs, and so on, so the scale filter is much slower than the translation filter. SAMF needs only one filter, with no extra training or storage, computing features and an FFT once per scale per detection — but when the image block is larger, its computation costs more than DSST's.
So DSST's scale detection is not always better than SAMF's; in fact, SAMF beat DSST in both VOT2015 and VOT2016. Admittedly that is mainly because its features are better, but at least its scale method is no worse. Overall, DSST's approach is very novel and faster, while SAMF is equally good and more accurate.
Must DSST use 33 scales?
DSST's standard 33 scales are very, very sensitive. It is easy to reduce the number of scales, but even if you increase the step size correspondingly, the scale filter will no longer keep up with scale changes. A possible explanation: the scale filter is trained on one-dimensional samples with no cyclic shift, meaning each update contributes just 1 real sample to a set of only 33; reduce the sample count and training becomes insufficient, and the classifier's discriminative power drops severely — unlike the translation filter, which has a very large number of shifted samples (personal view; discussion welcome). In short, please do not try to drastically reduce the number of scales; if you use the scale filter, 33 scales and a step of 1.02 work well.
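For reference, the DSST-style scale pyramid is just a geometric series of resize factors around the current size; a sketch (the centering convention is mine, the 33/1.02 values follow the paper):

```python
import numpy as np

n_scales, scale_step = 33, 1.02       # DSST's standard settings
# Resize factors centered on 1.0: ..., 1.02**-2, 1.02**-1, 1, 1.02, 1.02**2, ...
scale_factors = scale_step ** (np.arange(n_scales) - n_scales // 2)
# Each frame: resize the target-sized patch by every factor, extract features,
# and let the 1-D scale filter pick the factor with the highest response.
```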
These are the two recommended scale-detection approaches, hereafter called DSST-style multi-scale and SAMF-style multi-scale. If you care more about speed, the accelerated fDSST and a 3-scale SAMF (like the KCF entry in VOT2014) are the better choices; if you care more about precision, the 33-scale DSST and the 7-scale SAMF are more appropriate.
▌ Part V: Boundary Effects
Next, the VOT2015 competition, VOT2015 Challenge | Home: this year had 60 carefully selected sequences and 62 trackers. The biggest surprise was that deep learning began its assault on tracking: MDNet took the championship outright, the deep-feature correlation filter DeepSRDCF came second, and SRDCF — which only tackles the boundary effect, using only HOG features — came fourth.
As the VOT competition's influence grew, the organizers thoughtfully brought the classic and the cutting-edge together — a hundred schools contending, 62 trackers dueling in the imperial city. Besides the deep learning and correlation filter methods already described, there is EBT, which combines object proposals (class-agnostic object region detection), in third place (EBT: the secret Proposal and Tracking have to tell - Zhihu column, https://zhuanlan.zhihu.com/p/26654891); the mean-shift color algorithm ASMS was the recommended real-time algorithm; the other color algorithm DAT, mentioned earlier, also appears; and note that the Struck in 9th place is not the original Struck. Classic methods such as OAB, STC, CMT, CT, and NCC can be seen near the bottom — the classics have been left behind.
Before introducing SRDCF, let's first analyze the weaknesses of correlation filtering: in general, CF methods do not track rapid deformation or rapid motion well.
Rapid deformation fails mainly because CF is a template-class method. This is fairly easy to understand given the earlier analysis: if the target deforms rapidly, a HOG-based gradient template certainly cannot keep up, and if the colors change rapidly, a CN-based color template cannot keep up either. It is also tied to the model update strategy and rate: with fixed-learning-rate linear weighted updating, if the learning rate is too large, any partial or short occlusion or inaccurate detection lets the model learn background information that accumulates until the model elopes with the background, never to return; if the learning rate is too small, the target deforms while the template stays the old template, and the target becomes unrecognizable.
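For concreteness, the fixed-learning-rate linear update being criticized here is just an exponential moving average of the model; a sketch (the value 0.02 is a typical assumption, not from any single paper):

```python
def update_model(model, new_model, lr=0.02):
    """Fixed-learning-rate linear update used by most CF trackers.
    Too large lr: occlusion or misdetection contaminates the model -> drift.
    Too small lr: the template lags behind target deformation."""
    return (1.0 - lr) * model + lr * new_model
```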
Rapid motion fails mainly because of the boundary effect: the erroneous samples produced by the boundary effect leave the discriminator with insufficient discriminative power. The training stage and the detection stage are discussed separately below.
In the training stage, synthetic samples reduce discriminative power. Without a cosine window, the shifted samples look like this:
Except for the original sample itself, all the others are "synthetic": of the 10,000 shifted samples of a 100*100 image block, only 1 is real, and such a sample set cannot train anything useful. If a cosine window is added, then since the image's edge pixel values all become zero, any sample in which the object stays intact through the cyclic shift can be considered reasonable; only those samples in which the target crosses the boundary — when the target center is close to the edge — are wrong. The proportion of reasonable samples thus rises to roughly 2/3 (one-dimensional case with padding = 1). But we must not forget that even so, 1/3 of the samples (3000/10000) remain unreasonable, and they reduce the classifier's discriminative power. Moreover, the cosine window is not "free": it sets the pixels in the border region of the image block to 0, filtering out a lot of background information the classifier actually needs to learn. The background the discriminator sees during training is therefore very limited — we added a cosine window to block the background, further reducing the classifier's discriminative power (is that right? "It wasn't God who drew the curtain before my eyes — it was the cosine window").
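The cosine window in question is just a 2-D Hann window multiplied onto the feature patch before the FFT; a small runnable sketch of how it zeroes the border:

```python
import numpy as np

def cosine_window(h, w):
    """2-D Hann window: 1 at the patch center, falling to 0 at the edges."""
    return np.outer(np.hanning(h), np.hanning(w))

patch = np.random.rand(100, 100)            # stand-in for a feature patch
windowed = patch * cosine_window(100, 100)
# Border pixels are now ~0: cyclically shifted samples that keep the object
# intact look "reasonable", but genuine background context is also lost.
```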
In the detection stage, correlation filtering is weak at detecting fast-moving targets. The training image block and the detection image block must be the same size: if you trained a 100*100 filter, you can only detect within a 100*100 region, and if you plan to add more padding to enlarge the detection region, you gain nothing but extra complexity. The target's motion may be its own movement or the camera's; by the target's position within the detection region, there are four cases:
If the target stays near the center, detection is accurate and succeeds.
If the target moves near the border without crossing it, the cosine window filters out some target pixels, and there is no guarantee that the response here is still the global maximum; moreover, the test sample now closely resembles the unreasonable samples seen in training, so detection is likely to fail.
If part of the target has already moved out of the region, the cosine window may filter out the only remaining target pixels, and detection fails.
If the entire target has moved out of the region, failure is certain.
That is the boundary effect. Two main lines of work that solve it are recommended here; note that SRDCF is relatively slow and unsuitable for real-time applications.
Martin Danelljan's SRDCF, Learning Spatially Regularized Correlation Filters for Visual Tracking. The main idea: since the boundary effect occurs near the boundary, ignore the boundary-region pixels of all shifted samples — equivalently, constrain the filter coefficients near the boundary to be close to 0:
Learning Spatially Regularized Correlation Filters for Visual Tracking (http://)
Danelljan M, Hager G, Shahbaz Khan F, et al. Learning spatially regularized correlation filters for visual tracking [C]// ICCV. 2015.
SRDCF is based on DCF with SAMF-style multi-scale, uses a larger detection region (padding = 4), and adds spatial-domain regularization that penalizes the filter coefficients of the boundary region. Because there is no closed-form solution, it is optimized iteratively with the Gauss-Seidel method. The enlarged detection region (1.5 -> 4) and the iterative optimization (which destroys the closed-form solution) leave SRDCF at only 5 FPS, but its results were very good — the 2015 baseline.
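For reference, my transcription of SRDCF's objective: the usual ridge penalty $\lambda\lVert f\rVert^2$ is replaced by a spatial weight map $w$ that grows large toward the borders,

$$
\varepsilon(f)=\sum_{k=1}^{t}\alpha_k\Big\lVert\sum_{l=1}^{d} x_k^l * f^l-y_k\Big\rVert^2+\sum_{l=1}^{d}\big\lVert w\cdot f^l\big\rVert^2,
$$

where $x_k^l$ is channel $l$ of sample $k$, $\alpha_k$ is the per-sample weight, $y_k$ the desired response, and the pointwise product $w \cdot f^l$ pushes boundary-region filter coefficients toward zero.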
The other line is Hamed Kiani's modified-MOSSE algorithms: CFLB (Correlation Filters with Limited Boundaries), based on grayscale features, and BACF (Learning Background-Aware Correlation Filters for Visual Tracking), based on HOG features. The main idea is to use a larger detection image block and a relatively smaller filter so as to raise the proportion of real samples among the shifts — equivalently, to zero-pad the filter up to the size of the detection image. There is no closed-form solution here either; ADMM is used for iterative optimization:
Kiani Galoogahi H, Sim T, Lucey S. Correlation filters with limited boundaries [C]// CVPR, 2015.
Kiani Galoogahi H, Fagg A, Lucey S. Learning Background-Aware Correlation Filters for Visual Tracking [C]// ICCV, 2017.
CFLB uses only single-channel grayscale features; although it is fast at 167 FPS, its performance is far below KCF's, so it is not recommended. The newer BACF extends the features to multi-channel HOG, with performance exceeding SRDCF and a relatively fast speed of 35 FPS — highly recommended.
These two solutions are actually similar: detect and update on larger image blocks while training filters of relatively small support. The difference is that SRDCF's filter coefficients transition smoothly from the center toward 0 at the edge, while CFLB directly fills the filter's border with 0.
In VOT2015 the second-place correlation filter was DeepSRDCF, SRDCF combined with deep features. Because deep features are very slow — nowhere near high speed on a CPU, not even real-time — it is not recommended here despite its very high performance; we skip it for now.
▌ Part VI: Color Histograms and Correlation Filtering
The VOT2016 competition, VOT2016 Challenge | Home, kept VOT2015's 60 sequences, but this time re-labeled them more fairly and reasonably. There were 70 entries this year. As expected, deep learning now dominates: 8 pure CNN methods and 6 CF methods combined with deep features mostly occupy the top ranks, along with one more plain CF method. Most importantly, the conscientious organizers actually published 38 trackers — some as source code, some as binaries, with their homepages — downloadable from the VOT2016 Challenge | Trackers page (never again worry about not finding source code; among the download links some are source archives and some are binaries, and one test run makes it obvious which). This makes comparison and research easy, so hurry and try them. Now to the competition results (only the top 60 are listed here):
The methods introduced earlier, or otherwise important ones, are highlighted. C-COT, correlation filtering combined with multi-layer deep features, ranked first, while the CNN method TCNN was the VOT2016 champion; its author also won VOT2015 with MDNet. The pure color methods DAT and ASMS both sit at a medium level (in my tests the two actually perform very close to each other); for other trackers, please consult the paper. Now for speed: SMACF has no public code; ASMS is still as fast as ever; and among the top 10 there are two relatively fast methods, Staple at 5th and its improved version STAPLE+ at 9th — STAPLE+ being this year's recommended real-time algorithm. First, congratulations to Luca Bertinetto: both SiamFC and Staple performed very well. Then, three minutes of silence for the great man (from the VOT2016 paper itself):
This was particularly obvious in case of SiamFC trackers, which runs orders higher than realtime (albeit on GPU), and Staple, which is realtime, but are incorrectly among the non-realtime trackers.
VOT2016 actually had a mix-up here: in its paper Staple runs at 80 FPS on a CPU, so how is its EFO here only 11? Fortunately, both Staple and STAPLE+ have public code. In my own tests — my computer is no match for Luca Bertinetto's — Staple still ran at 76 FPS, and, even funnier, STAPLE+ is roughly 7-8 times slower than Staple yet its EFO is 4 times higher. What on earth happened?
First, look at Staple's code. If you download Staple and set params.visualization = 1, Staple by default calls the Computer Vision System Toolbox to display the sequence images; if you happen not to have that toolbox, it falls back to imshow(im) on every frame, which is extremely slow, whereas setting params.visualization = 0 makes it fly (author, are you kidding us?). It is recommended to replace the display code with the corresponding part of DSST's code, after which it runs and displays at normal speed.
Now look at STAPLE+'s code. Its improvements over Staple include additionally extracting HOG features from the color probability map, raising the features to 56 channels (Staple has 28), and adding the response of large-displacement optical-flow motion estimation to translation detection — that is why it is so slow, and it is bound to be much slower.
So most likely the VOT organizers swapped the EFO values of Staple and STAPLE+. VOT2016's recommended real-time algorithm should be the 5th-place Staple: correlation filtering combined with a color method, no deep features, no CNN, running at 80 FPS and still ranking fifth. This is the main subject of what follows — one of the most impressive target tracking algorithms of 2016, Staple (which made the crowd of deep learning algorithms ranked behind it question their life choices).
On color features: color is a very important feature in target tracking — no matter how many people stand together, a target wearing clothes of a different color stands out immediately. CN, from CVPR 2014 and introduced earlier, is the template-style color method within the correlation filter framework; here let me formally introduce the statistical color feature method DAT (Learning, Recognition, and Surveillance @ ICG), at 15 FPS. Recommended:
Possegger H, Mauthner T, Bischof H. In defense of color-based model-free tracking [C]// CVPR, 2015.
DAT computes and normalizes the color histograms of the foreground target and the background region — these are the foreground and background color probability models. In the detection stage, a Bayesian method estimates the probability that each pixel belongs to the foreground, yielding a pixel-level color probability map; adding suppression of similarly-colored objects near the edges then yields the target region.
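The per-pixel step is just Bayes' rule over two discretized color histograms; a minimal sketch (for simplicity this assumes equal foreground/background priors, whereas DAT itself weights by region areas):

```python
import numpy as np

def foreground_probability(pixel_bins, hist_fg, hist_bg):
    """Pixel-level color probability: P(fg | color) via Bayes' rule.
    hist_fg / hist_bg: normalized color histograms of the target box and the
    surrounding background ring; pixel_bins: each pixel's histogram bin index."""
    p_fg = hist_fg[pixel_bins]
    p_bg = hist_bg[pixel_bins]
    return p_fg / (p_fg + p_bg + 1e-8)   # epsilon guards empty bins
```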
If Luca Bertinetto's (Oxford University) Staple tracker had to be described in one sentence, it would be: combine the template-feature method DSST (DCF-based) with the statistical-feature method DAT:
Bertinetto L, Valmadre J, Golodetz S, et al. Staple: Complementary Learners for Real-Time Tracking [C]// CVPR, 2016.
We analyzed earlier that correlation filter template-class features (HOG) handle fast deformation and fast motion poorly, but do well under motion blur, illumination change, and the like; color statistical features (the color histogram) are insensitive to deformation and, not belonging to the correlation filter framework, have no boundary effect, so fast motion is no problem either — but they do poorly under illumination change and similarly-colored backgrounds. In short, the two classes of methods are complementary; that is, DSST and DAT can be combined to complement each other:
The two frameworks combine efficiently and seamlessly: a 25 FPS DSST plus a 15 FPS DAT, and the combination astonishingly reaches 80 FPS. The DSST framework splits tracking into two problems, translation detection and scale detection, and DAT is added to the translation part: correlation filtering yields one response map, the pixel-level foreground probability yields another, and the two are linearly weighted into the final response map. The rest is similar to DSST; the translation filter, scale filter, and color probability model are all updated by linear weighting at fixed learning rates.
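The merging step at the heart of Staple is a single linear blend of the two response maps; a sketch (the merge factor 0.3 follows my recollection of the public Staple code's default — treat it as an assumption):

```python
def staple_response(cf_response, color_response, merge_factor=0.3):
    """Staple-style fusion: blend the DCF template response with the
    pixel-level color-histogram response (both maps on the same grid)."""
    return (1.0 - merge_factor) * cf_response + merge_factor * color_response
```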
Another method combining correlation filtering with color probability is CSR-DCF from CVPR 2017, which proposes spatial reliability and channel reliability; without deep features its performance approaches C-COT, at a respectable 13 FPS:
Lukežič A, Vojíř T, Čehovin L, et al. Discriminative Correlation Filter with Channel and Spatial Reliability [C]// CVPR, 2017.
The binary mask produced by spatial reliability in CSR-DCF resembles the mask matrix P in CFLB: it adaptively selects target regions that are easier to track while reducing the boundary effect. Multi-channel features used to be summed directly, whereas CSR-DCF uses a weighted sum over channels, with channel reliability as the adaptive weighting coefficient. Optimization is iterative via ADMM; CSR-DCF can be seen as the combination of DAT and CFLB.
VOT2016's correlation filters also include the first-ranked C-COT (don't ask me why first place isn't the champion — I don't know either), which, like DeepSRDCF, we skip for now.
▌ Part VII: Long-term Tracking and Tracking Confidence
Many of the CF algorithms mentioned so far, including the VOT competitions, target the short-term tracking problem: we only care whether tracking stays accurate over a short span (say 100-500 frames). In practical applications, however, we want correct tracking to last longer — several minutes or tens of minutes. That is the long-term tracking problem.
Long-term tracking means hoping the tracker can keep tracking correctly over a long period. The methods introduced above are, by this analysis, unsuited to that scenario; correct long-term tracking requires a short-term tracker working together with a detector.
In one sentence: long-term tracking gives an ordinary tracker its own detector, and whenever tracking is found to have gone wrong, the detector is invoked to re-detect the target and correct the tracker.
Here is a representative long-term method in the CF direction: Chao Ma's LCT, chaoma99/lct-tracker:
Ma C, Yang X, Zhang C, et al. Long-term correlation tracking[C]// CVPR, 2015.
On top of DSST's translation correlation filter Rc and scale correlation filter, LCT adds a third correlation filter Rt responsible for estimating tracking confidence. Its Online Detector module is the random fern classifier used in TLD (changed to an SVM in the released code). The third, confidence filter is like MOSSE with no padding, its features carry no cosine window, and it is applied after translation detection.
If the maximum response is below the first threshold (call it the motion threshold), translation detection is deemed unreliable and the detection module is invoked to re-detect. Note that re-detection results are not always adopted: the re-detected result is accepted only when its maximum response is 1.5 times larger than that of the first detection; otherwise the translation-detection result is kept.
If the maximum response exceeds the second threshold (call it the appearance threshold), translation detection is considered trustworthy enough, and only then are the third correlation filter and the random fern classifier updated online at a fixed learning rate. Note that the first two correlation filters are updated as in DSST: online, every frame, at a fixed learning rate.
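Put together, LCT's per-frame control flow around its two thresholds looks roughly like this sketch (the function and parameter names are hypothetical; the thresholds and the 1.5 acceptance ratio follow the description above):

```python
def lct_confidence_step(peak, motion_thresh, appearance_thresh,
                        redetect, accept_ratio=1.5):
    """One frame of LCT's two-threshold logic (illustrative sketch).
    peak: confidence-filter max response at the translation-detection result.
    redetect: callable running the online detector, returning (peak, box)."""
    box = None                                # None -> keep translation result
    if peak < motion_thresh:                  # translation result unreliable
        new_peak, new_box = redetect()
        if new_peak > accept_ratio * peak:    # adopt only if clearly better
            box = new_box
    update_models = peak > appearance_thresh  # high-confidence update only
    return box, update_models
```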
By adding a detection mechanism, LCT should in theory handle occlusion and out-of-view cases better; its speed is 27 FPS. Its experiments ran only on OTB-2013, with very high tracking precision; according to other papers, LCT does slightly worse on OTB-2015 and VOT, possibly because its two core thresholds are not adaptive. For long-term tracking, both TLD and LCT are worth consulting.
Next, tracking confidence. A tracking algorithm needs to reflect how reliable each tracking result is — this matters a great deal, since otherwise the target may be lost without the tracker ever noticing. Generative methods have similarity measures; discriminative methods have the classification probabilities of their machine learning models. Two kinds of indicators can reflect the tracking confidence of correlation filter methods: the maximum response value (seen before) and the response pattern (not seen before), or an indicator combining both.
LMCF (from MM Wang's Zhihu column on target tracking algorithms) proposed multi-peak detection and high-confidence updating:
Wang M, Liu Y, Huang Z. Large Margin Object Tracking with Circulant Feature Maps [C]// CVPR, 2017.
High-confidence updating: the tracking model is updated only when tracking confidence is relatively high, avoiding contamination of the target model while also improving speed. The first confidence indicator is the maximum response score Fmax, i.e., the maximum response value (mentioned in both Staple and LCT). The second is the average peak-to-correlation energy (APCE), which reflects the fluctuation of the response map and the confidence of the detection; this is (possibly) the best indicator available. Recommended:
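The APCE definition from the LMCF paper, where $F_{\max}$, $F_{\min}$, and $F_{w,h}$ are the maximum, minimum, and per-position values of the response map:

$$
\mathrm{APCE}=\frac{\left|F_{\max}-F_{\min}\right|^{2}}{\operatorname{mean}\Big(\sum_{w,h}\big(F_{w,h}-F_{\min}\big)^{2}\Big)}
$$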
Another tracking confidence indicator is the peak-to-sidelobe ratio (PSR) from MOSSE, computed from the correlation peak together with the mean and standard deviation of the sidelobe outside an 11*11 window around the peak. Recommended:
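PSR takes only a few lines; a runnable sketch following the description above:

```python
import numpy as np

def psr(response, exclude=11):
    """Peak-to-Sidelobe Ratio: (peak - sidelobe mean) / sidelobe std.
    The sidelobe is everything outside an 11x11 window around the peak."""
    r, c = np.unravel_index(np.argmax(response), response.shape)
    peak = response[r, c]
    mask = np.ones(response.shape, dtype=bool)
    h = exclude // 2
    mask[max(0, r - h):r + h + 1, max(0, c - h):c + h + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)
```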
CSR-DCF, mentioned above for its spatial reliability, also uses two similar indicators for channel reliability: the first is again the maximum response peak of each channel, i.e., Fmax; the second is the ratio between the second and first major modes in the response map, reflecting how expressive the main mode of each channel's response is — though it requires maximum detection first:
▌ Part VIII: Convolutional Features
This final stretch is Martin Danelljan's own show, mainly introducing his series of works, especially correlation filter methods combined with deep features. The code is all on his homepage (Visual Tracking), so it is not linked piece by piece here.
Danelljan M, Shahbaz Khan F, Felsberg M, et al. Adaptive color attributes for real-time visual tracking [C]// CVPR, 2014.
CN proposed the very important multi-channel color feature Color Names and achieved very good results in the CSK framework. It also proposed an accelerated version, CN2: a PCA-like adaptive dimensionality reduction of the feature channels (10 -> 2), with a smoothing term that penalizes jumps across different feature subspaces — that is, the covariance matrix in the PCA is updated linearly to keep the projection matrix from changing too much.
Danelljan M, Hager G, Khan FS, et al. Discriminative Scale Space Tracking [J]. IEEE TPAMI, 2017.
DSST was first in VOT2014 and pioneered the translation filter + scale filter design. fDSST accelerates DSST: PCA reduces the channels of the translation filter's HOG features (31 -> 18), QR decomposition reduces the scale filter's roughly 1000*17 features to 17*17, and finally trigonometric (frequency-domain) interpolation expands the number of scales from 17 back to 33 for more precise scale localization.
SRDCF was fourth in VOT2015. To mitigate the boundary effect it enlarges the detection region and adds a spatial constraint term to the optimization objective, solved iteratively with the Gauss-Seidel method; Newton iterations handle sub-grid precise target localization in translation detection.
Danelljan M, Hager G, Shahbaz Khan F, et al. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking [C]// CVPR, 2016.
SRDCFdecon builds on SRDCF to improve the sample set and learning rate. Earlier correlation filters all updated the model by linear weighting at a fixed learning rate — simple, and no past samples need be stored — but inaccurate localization, occlusion, or background disturbance contaminates the model and causes drift. SRDCFdecon instead stores past samples (image blocks containing positive and negative samples) and adds per-sample weight parameters plus a regularization term to the objective, optimized by alternating convex search: first fix the sample weights and optimize the model parameters iteratively with Gauss-Seidel, then fix the model parameters and optimize the sample weights by convex quadratic programming.
Danelljan M, Hager G, Shahbaz Khan F, et al. Convolutional features for correlation filter based visual tracking [C]// ICCVW, 2015.
DeepSRDCF was second in VOT2015; it replaces the HOG features in SRDCF with the deep features of a single convolutional layer of a CNN (i.e., the network's activations), which improves results enormously. It uses the imagenet-vgg-2048 network: VGG networks transfer well, and MatConvNet comes from the VGG group, so calling it from MATLAB is very convenient. The paper also tested how different convolutional layers perform on the tracking task:
Layer 1 performs best, with layers 2 and 5 next. Higher convolutional layers carry more semantic information but less texture detail; one reason performance degrades from layer 1 through 4 is the ever-lower resolution of the feature maps, while layer 5 bounces back because it contains complete semantic information with strong discriminative power (it was built for recognition, after all).
Note the distinction between the deep features here and deep-learning-based methods: deep features come from an image classification network pre-trained on ImageNet, with no fine-tuning step, so there is no overfitting problem. Deep-learning-based methods, by contrast, mostly require end-to-end training or fine-tuning on tracking sequences, and with limited sample quantity and diversity they can easily overfit.
Ma C, Huang JB, Yang X, et al. Hierarchical convolutional features for visual tracking [C]// ICCV, 2015.
Also worth mentioning is Chao Ma's HCF, which improves results by combining multi-layer convolutional features: the activations of VGG19's Conv5-4, Conv4-4, and Conv3-4 serve as features, all scaled to the image-block resolution. Although per the paper the target should be localized coarse-to-fine, the code is more direct: the responses of the three layers are linearly combined with fixed weights 1, 0.5, and 0.02 into the final response. Despite the multi-layer features, HCF neither addresses the boundary effect nor goes beyond overly simple linear weighting, and it ranked only 28th in VOT2016 (DeepSRDCF, with single-layer deep features, was 13th).
Danelljan M, Robinson A, Khan FS, et al. Beyond correlation filters: Learning continuous convolution operators for visual tracking [C]// ECCV, 2016.
C-COT, first in VOT2016, combines SRDCF's spatial regularization with SRDCFdecon's adaptive sample weights, and extends DeepSRDCF's single-layer deep features to multi-layer deep features (VGG layers 1 and 5). To cope with the differing resolutions of different convolutional layers, it proposes a continuous spatial-domain interpolation operator: before training, the feature maps are implicitly interpolated in the frequency domain into a continuous spatial domain, making it easy to integrate multi-resolution feature maps while keeping localization precise. The objective is optimized iteratively by conjugate gradient descent, faster than Gauss-Seidel; the adaptive sample weights directly use prior weights, with no alternating convex optimization; and detection uses Newton iterations to optimize the target position.
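For reference, my transcription of the interpolation operator at C-COT's core: a discrete feature channel $x^d$ with $N_d$ samples is mapped onto the continuous interval $[0, T)$ through an interpolation kernel $b_d$,

$$
J_d\{x^d\}(t)=\sum_{n=0}^{N_d-1} x^d[n]\, b_d\!\left(t-\frac{T}{N_d}\,n\right),
$$

so feature maps of any resolution $N_d$ live in the same continuous domain before the filter is learned.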
Note that none of SRDCF, SRDCFdecon, DeepSRDCF, or C-COT runs in real time. This series of works kept getting better but also more complex, and just as correlation filtering was getting slower and losing its speed advantage, Martin Danelljan slammed on the brakes with ECO at CVPR 2017, showing us what "both good and fast, without forgetting where you came from" means:
Danelljan M, Bhat G, Khan FS, et al. ECO: Efficient Convolution Operators for Tracking [C]// CVPR, 2017.
ECO is the accelerated version of C-COT, speeding things up along three axes: model size, sample-set size, and update strategy. It is 20x faster than C-COT while actually improving quality — EAO up by 13.3% — and best of all, the hand-crafted-features version ECO-HC runs at 60 FPS. Enough hype; the specifics:
First, fewer model parameters: a factorized convolution operator is defined, with an effect similar to PCA. It is initialized with PCA, the projection matrix is optimized only on the first frame, and all later frames reuse it directly — in short, supervised dimensionality reduction. With deep features, model parameters drop by 80%.
Second, fewer samples: a compact generative model of the sample set merges similar samples using a Gaussian Mixture Model (GMM), building a more representative and diverse sample set; the number of samples to store and optimize falls to 1/8 of C-COT's.
Third, a changed update strategy: a sparser updating scheme runs the optimization of model parameters only once every 5 frames, which not only speeds up the algorithm but also improves stability under abrupt changes and occlusion. The sample set itself is still updated every frame, so sparse updating does not miss the sample variations in between.
ECO's success of course rests on many more details, some of which I don't fully understand either — in short, it is very strong. ECO's experiments ran on four benchmarks (VOT2016, UAV123, OTB-2015, and TempleColor) and ranked first on all of them, with no overfitting problem. Purely on performance, ECO is currently the best correlation filter algorithm, and possibly the best target tracking algorithm. The hand-crafted-features version, ECO-HC, reduces the original 42-dimensional HOG+CN features to 13 dimensions, with the rest similar; experimentally ECO-HC beats most deep learning methods, and the paper reports 60 FPS on a CPU.
Finally, from Luca Bertinetto, CFNet (End-to-end representation learning for Correlation Filter based tracking): besides combining correlation filtering with deep features as above, the correlation filter can now also be trained end-to-end inside a CNN:
Valmadre J, Bertinetto L, Henriques JF, et al. End-to-end representation learning for Correlation Filter based tracking [C]// CVPR, 2017.
Building on SiamFC, the correlation filter becomes a layer of the CNN; the key contribution is the derivation of the forward- and back-propagation formulas for the CF layer. A two-convolutional-layer CFNet runs at 75 FPS on a GPU, but its overall performance is not especially stunning — perhaps the CF layer's boundary effect is hard to handle. I'm taking a wait-and-see attitude.
▌ Part IX: CVPR and ICCV 2017 Results
Below are the target tracking results from CVPR 2017. Perhaps the great MD would like to say: not one of them can put up a fight!
Following the table above, here are the ICCV 2017 paper results compared against ECO: alas, still not one that can put up a fight!
▌ Part X: Recommended Researchers
To round things out, the two groups currently contributing the most in the correlation filter direction (with both innovation and code) are:
Oxford University: Joao F. Henriques and Luca Bertinetto. Representative works: CSK, KCF/DCF, Staple, CFNet (also SiamFC, Learnet).
Linköping University: Martin Danelljan. Representative works: CN, DSST, SRDCF, DeepSRDCF, SRDCFdecon, C-COT, ECO.
There is also plenty of excellent work from universities in China, not listed one by one here.