# Computer Vision for Engineering and Science
Computer vision algorithms are running on our phones, cars, and
even our refrigerators.
As cameras are added to more and more devices, the need for
people with computer vision experience is growing rapidly.
That's why MathWorks created Computer Vision for Engineering and
Science on Coursera.
In this three-course specialization, you'll complete projects like
aligning satellite images, training models that identify road signs,
and tracking objects, even as they move out of view.
>> Sounds exciting.
Let's see how you gain those skills.
In course 1, you'll learn the fundamentals of computer vision.
You'll apply a variety of algorithms to extract useful features from images.
These features are used in many applications,
like image registration, classification, and tracking.
By the end of course 1, you'll detect, extract, and
match features to align and stitch together images like these.
In course 2 of the specialization,
you'll use these images with popular machine learning algorithms
to train image classification and object detection models.
However, training a model is only one part of the workflow.
To achieve good results, you'll learn to properly prepare your images for
machine learning and evaluate the trained model on test images.
Importantly, the skills you gain also apply to deep learning,
where feature extraction is done by the network during training.
>> And speaking of deep learning,
there are a growing number of models already available.
In course 3, you'll import and use common deep learning models, like
YOLO, to perform object detection.
Detecting objects is often the first step in a larger workflow. For example,
detection is used with motion prediction to differentiate and
track objects over time.
At the end of the specialization, you'll apply tracking to count the number of cars
going in each direction on a busy road.
>> To be successful in these courses,
it'll help to have some prior image processing experience.
If you're brand new to working with image data, we recommend also enrolling in our
Image Processing for Engineering and Science specialization on Coursera.
Computer vision is an exciting and growing field.
The specialization will give you the skills to succeed in a world where images
and cameras are more important than ever.
In many applications, from
autonomous systems engineering to scientific research,
you'll need to differentiate between
objects and track them over time.
Before you can track objects,
you have to detect them. In this course,
you'll start by detecting objects
in videos using pre-trained models,
including deep neural networks.
Many general and special-purpose object detection models
are readily available for use in MATLAB.
Sometimes though, using a machine or
deep learning model is unnecessary and inefficient.
You'll also review image
processing-based techniques
to segment objects of interest
and you'll learn new tools like
optical flow to detect motion and moving objects.
However, as far as detection is concerned,
every new frame in a video is a whole new world.
Objects in one frame have
no connection to previous frames.
That's where object tracking comes in.
Object tracking enables you to
distinguish objects over time,
reduce the effects of flawed detections,
and keep track of objects
that are temporarily obscured from view.
At the end of this course,
you'll analyze highway traffic flow using
the detection and tracking techniques you
learned. Let's get started.
### Meet Your Instructors
Amanda Wang is an Online Content Developer at MathWorks. She earned a B.S. in Mathematics with Computer Science and a B.S. in Business Analytics from MIT in 2020. In addition to developing MATLAB-based courses with the Online Course Development team, she is currently pursuing an M.S. in Computer Science from the University of Illinois Urbana-Champaign.
Isaac Bruss is a Senior Online Content Developer at MathWorks. He earned his Ph.D. from the University of Massachusetts Amherst in 2015, performing research in a number of projects related to biophysics. One such project involved using confocal microscope videos to track the migration of nanoparticles tethered to a surface using DNA. Most recently, he taught undergraduate physics at Hampshire College. Now at MathWorks, he happily supports and designs MATLAB-based online courses.
Matt Rich is a Senior Online Content Developer at MathWorks. He holds a Ph.D. and M.S. in Electrical Engineering from Iowa State University. His Ph.D. research developed new methods to design control systems over networks with communication interrupted by random processes. His MS research focused on system identification and robust control methods for small UAVs with uncertain physical parameters. Prior to his current role, he worked supporting MathWorks Model-Based Design tools in industry and academia.
Megan Thompson is a Senior Online Content Developer at MathWorks. She earned her Ph.D. in bioengineering from the University of California at Berkeley and San Francisco in 2018. As a medical imaging research scientist, she used image processing to study concussions in football, dementia, schizophrenia and more. Now at MathWorks, she designs and supports MATLAB-based online courses to help others analyze data and chase their own answers.
Brandon Armstrong is a Senior Team Lead in Online Course Development at MathWorks. He earned a Ph.D. in physics from the University of California at Santa Barbara in 2010. His research in magnetic resonance has been cited over 1000 times, and he is a co-inventor on 4 patents. He is excited to create courses on image and video processing as he owns a green screen just for fun!
#### Course files and MATLAB
There are many reasons to detect
objects in images and videos.
Autonomous driving
or driver assistance systems identify pedestrians,
lane markings, traffic signs, and other vehicles.
Medical professionals need to isolate
abnormalities or patterns
that can indicate injuries or disease.
Researchers in the biological sciences need
to detect moving cells to study their behaviors,
and industrial quality control systems must
first locate objects before checking for defects.
Fortunately, there's a growing number of
pre-trained detectors that can solve problems like these.
In this lesson, you will apply a pre-trained object
detector to a video clip
and create an annotated video file with your results.
Annotating a video means adding a bounding box
around the objects of interest in each frame of the file.
To do this, you will apply
an object detector to each frame.
The detector returns the location
and size of the bounding box for each of
the objects of the type to be detected.
Then you can overlay the box on
the frame to highlight the object.
Finally, you'll create a new video file
from the annotated frames.
While it is certainly possible to create
and train your own detector,
it's often unnecessary.
Many general and special purpose detectors
are readily available.
Using models created by
computer vision specialists
allows you to focus on your application.
Most detectors are created using
either classical machine learning
or a deep learning neural network.
Aggregate Channel Features, or
ACF, is a modern machine learning algorithm.
MATLAB provides pre-trained versions of
this model for detecting people and vehicles.
These are very task-specific,
but that helps keep the model size
small and prediction speed fast.
Deep learning-based detectors are much more general,
but require considerably more
computational resources to use.
MATLAB provides two popular families of detectors:
Region-based Convolutional Neural Networks, or R-CNN,
and You Only Look Once, or YOLO.
These can detect any of the 80 classes of objects
from the Common Objects in Context, or COCO, dataset.
Each class of model also has versions trained for
specialized tasks like vehicle detection.
Other general-purpose and specialized models
are available from the MATLAB Deep Learning Model Hub
on GitHub, which you can import into MATLAB.
The YOLO detectors were
the first deep learning models to
achieve real-time object detection.
Let's use one to detect cars in this dash cam footage.
Start by loading the YOLO vehicle detector.
This command can take a while to run,
so place it before a section break
so you can run the rest of the code separately.
Use the VideoReader function to import the video file,
and the VideoWriter function to save the output video.
Frames can be read sequentially with
the readFrame function or
specific frames can be read with the read function.
To test the detector,
read in and view a sample frame of the video.
There's one car in the frame;
let's see if the detector finds it.
Use the detect function with the detector
and the image as inputs.
The result is the size and location of any bounding boxes
and a score for the strength of each detection.
While different detectors have unique scoring scales,
a higher score is always a more likely match.
The insertObjectAnnotation function will
add the bounding box to the frame.
Here we'll use the detection score as the label.
This works well when one or
more of the objects are detected.
But if there are none,
the function will return an error.
Wrap the annotation function in
an if statement to make sure
there is an annotation to add.
By surrounding the detection
and annotation commands with a while loop,
we'll apply the detector to the entire video.
Rather than view each frame as it is created,
we can write it
to a new video file using the VideoWriter.
Be sure to open and close
the VideoWriter before and after use.
After the code is run,
you can view the movie with the implay function.
This general workflow can also
be applied to images that do not form a video.
If an image datastore is used in place of the video file,
the same steps can be used.
The result will now be a new series of image files
with the bounding boxes
superimposed over the objects of interest.
Many classes of objects are readily
detected by pre-trained models available with MATLAB.
They are a great way to quickly
get started with object detection.
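The workflow from this lesson can be condensed into a short script like the sketch below. It assumes a pre-trained vehicle detector, for example one loaded with `vehicleDetectorYOLOv2` (any detector object with a `detect` method follows the same pattern), and the file names are placeholders.

```matlab
% Sketch of the detect-and-annotate loop described above.
% Assumes a pre-trained vehicle detector; file names are placeholders.
detector  = vehicleDetectorYOLOv2;            % load once, before a section break
vidReader = VideoReader("dashcam.mp4");
vidWriter = VideoWriter("dashcamAnnotated.mp4","MPEG-4");

open(vidWriter)
while hasFrame(vidReader)
    frame = readFrame(vidReader);
    [bboxes,scores] = detect(detector,frame);  % bounding boxes and confidence scores
    if ~isempty(bboxes)                        % annotate only when something is found
        frame = insertObjectAnnotation(frame,"rectangle",bboxes,scores);
    end
    writeVideo(vidWriter,frame)
end
close(vidWriter)

implay("dashcamAnnotated.mp4")                 % view the annotated result
```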
## Using Pre-trained Deep Learning Models
1. Follow the instructions below to install the YOLO version 4 object detection model.
2. Open the detectObjectsWithDeepLearning.mlx file included with the course to use a deep learning model to detect objects.
Accessing Pre-trained models
Some special-purpose detectors are included in MATLAB, like the YOLO vehicle detector shown in the video. Larger, more general models are listed on the MATLAB Deep-Learning Model Hub.
Installing a deep-learning model
• Use the Add-On Manager in MATLAB to install a model listed on the MATLAB Deep-Learning Model Hub. The Add-On Manager is located in the Home tab, as shown below.
• Search for "Computer Vision model for yolo." You will see something like the image below. You may need to scroll or Filter by Source for MathWorks.
• Select the Computer Vision Toolbox Model for YOLO v4 Object Detection created by MathWorks.
• Follow the instructions to install the model.
• Open the detectObjectsWithDeepLearning.mlx script included with the course files to see how to use this model!
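Once the add-on is installed, using the model looks roughly like the sketch below; the detectObjectsWithDeepLearning.mlx script is the reference, and the model name, image file, and threshold here are assumptions.

```matlab
% Minimal sketch of using an installed YOLO v4 model on a single image.
detector = yolov4ObjectDetector("tiny-yolov4-coco");   % or "csp-darknet53-coco"
I = imread("street.png");                              % placeholder image file
[bboxes,scores,labels] = detect(detector,I,Threshold=0.4);
annotated = insertObjectAnnotation(I,"rectangle",bboxes,string(labels)+": "+scores);
imshow(annotated)
```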
There are a growing number of
pre-trained models available to perform object detection.
If you find a model that provides
sufficient quality detection for
your use case, you're good to go.
But what if you don't? You could train a model.
Keep in mind, though, that training models,
especially deep neural networks,
can take significant effort and
resources, which may not always be readily available.
In many circumstances,
image processing approaches will be more than sufficient.
If you can consistently distinguish your objects of
interest using
intuitive visual features such as brightness,
color, shape, or size,
then it is often more efficient to design
an image processing algorithm to segment them.
For example, consider fluorescence microscopy.
In this video, amoebae are
attempting to digest yeast cells.
The yeast cells are fluorescently
labeled, so you should be able to
segment them using their relative size
along with color information or grayscale intensity.
Let's go into MATLAB and walk through one workflow
to segment the yeast in the video
using the grayscale intensity.
First, read the video into MATLAB,
extract a sample frame,
and convert it to grayscale.
Next, open the Image Segmenter
app and load the grayscale image into the app.
In this case, there's a strong difference
between the bright fluorescence
and the rest of the image.
Use the manual threshold on
the grayscale intensity to roughly segment the yeast.
Here, we're missing the center of the yeast,
so use the Fill Holes feature to fill this region in.
Now the remaining artifacts
are smaller in size than the yeast cell.
This means you can eliminate
them with a morphological opening.
This looks good. To apply
these steps to other frames in
the video, export a function.
You could save this as a dedicated function file.
But in this case, let's just copy it into the bottom
of our script so we can call it anywhere in the script.
To apply these steps to the color frames of the video,
create a copy of the input with a new name,
assign the grayscale conversion
back to the original variable,
and finally, update the masked image
to use the color version.
Don't forget to replicate
the binary mask in all three color planes.
Now let's test this function on
the sample frame and view
the results to make sure it works.
It's a good idea to test
your segmentation function on a few other frames.
Let's try another one.
One more for good measure. Looks good.
To add a bounding box and label like you saw earlier,
first use the regionprops function
to return the bounding box for the segmentation.
Then use this result with
the insertObjectAnnotation function to
add a labeled bounding box.
Here we'll use the label yeast.
Finally, check that the result is what you expect.
Now you're ready to process all the frames in
the video as you've seen previously.
Create a new VideoWriter object,
open it, and close it.
Then, between the open and close commands,
use a for loop to iterate through the video frames.
Finally, write each frame to
the new video. There you have it.
You've detected the highly fluorescent yeast cells
using classical image processing techniques.
In this video, we only scratched
the surface of the image processing methods
available in MATLAB,
many of which have useful apps
capable of generating code, like you saw here.
If you're unfamiliar with segmenting images in MATLAB,
or feel like you could use a refresher on
the functions and apps available for segmentation,
we have an image processing specialization
available on Coursera as well.
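A minimal sketch of the kind of segmentation function described above is shown below. The threshold value and disk radius are assumptions for illustration; the Image Segmenter app exports a similar function with values tuned to your frames.

```matlab
% Sketch of a threshold-based segmentation function for the color frames.
function [BW,maskedImage] = segmentYeast(RGB)
    I  = im2gray(im2double(RGB));          % convert the color frame to grayscale
    BW = I > 0.5;                          % manual intensity threshold (assumed value)
    BW = imfill(BW,"holes");               % fill the dark center of the yeast
    BW = imopen(BW,strel("disk",5));       % opening removes small bright artifacts
    maskedImage = RGB;
    maskedImage(~repmat(BW,[1 1 3])) = 0;  % replicate the mask across all color planes
end
```

Inside the frame loop, you would then call something like `[mask,maskedFrame] = segmentYeast(readFrame(vid))` before annotating and writing the frame.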
Navigate to the Module 1 folder and open the reviewOfSegmentingImages.mlx file. Work through the script to review some approaches to segmentation.
Module Assessment
You used pre-trained deep-learning models and image segmentation in this module to detect objects. Now, it's time to apply these skills. The assessment is broken into two parts:
1. A quiz where you'll apply the YOLO pre-trained network to an image of cars on a busy street. The image you'll work with is a single frame from a video used for the final project.
2. A coding assignment where you need to segment curling stones from the background.
Part 1
Start by taking the quiz. You'll apply the tiny and large base-network YOLO models to a single frame from a video of vehicles on a busy street and investigate the accuracy of the detections. The provided "detectObjectsWithDeepLearning.mlx" reading will be helpful for the quiz.
Bonus: Try applying the tiny YOLO detector to the video "MathWorksTraffic.mp4" included in the course files. How does the detector do? What if you lower the detection "Threshold" value? It will take 1-3 seconds to detect every frame using the tiny YOLO detector. If you use the full YOLO detector, it may take an hour without a GPU.
Part 2
In Part 2, you'll identify curling stones (curling is a winter Olympic sport) by segmenting them from the background. We encourage you to work in MATLAB, using apps like the Image Segmenter app, or writing code to do the segmentation. Once you're satisfied with your segmentation, copy your code into the online grader for assessment.
### Project: Segment an Image
An image of curling stones is included with the course files. Segment the image so that the curling stone pixels are true and the background pixels are false. We recommend working in MATLAB and copying your code into the online grader.
________________________________________
This course uses a third-party app, Project: Segment an Image, to enhance your learning experience. The app will reference basic information like your Coursera ID.
Motion detection is a common task in many applications.
For example, by detecting motion, you can estimate the trajectory of objects,
helping you determine if a person is crossing the street or
safely walking on the sidewalk.
Other applications include camera stabilization and
helping autonomous systems map their surroundings, just to name a few.
There are several ways to detect motion in a video clip.
A moving object against a stationary background can be
segmented using background subtraction.
This method can be implemented with basic image processing techniques:
by subtracting a static background image from each frame,
you can isolate moving objects.
This, of course, requires a static background.
Feature-based motion detection works similarly to image registration.
You detect and extract features from an object in one frame,
then match those features in later frames.
By doing so, the translation and rotation of an object can be computed.
However, for this approach to work, you first isolate
the object of interest, like a face, before extracting features.
In template matching, you select a portion of an image and
search the following frames for that pattern of pixels.
This method is especially useful for stabilizing jittery video,
where orientation and lighting are consistent between frames.
You determine the motion by keeping track of the template location in each frame.
You'll use template matching later in this course. All three of these
techniques require you to do some processing to detect motion:
creating a static background image,
detecting the object of interest, or identifying an object that appears
in the same orientation throughout the video to use as a template.
However, in some applications, none of these approaches will work.
So what do you do?
Optical flow is a powerful technique to determine motion.
It uses the differences between subsequent video frames and
the gradient of those frames to estimate a velocity vector for every pixel.
Thus you don't need to first identify an object or static background.
You can then annotate the video by adding the velocity vectors to each frame.
Objects moving right or left are easy to distinguish by the large
arrows pointing in the direction of motion.
Velocity arrows indicating motion towards or away from the camera are less
obvious. Objects moving away from the camera will have outlines getting smaller,
so the edges will have velocities that converge. Objects moving towards
the camera will get larger, and the velocity vectors will diverge.
There is a key constraint with optical flow.
The illumination of the scene must be approximately constant.
Because optical flow uses the difference in pixel intensities between
frames, a shadow or change in lighting could appear as motion.
This affects the other approaches to motion detection as well,
but it is still possible to match features or
a template with some changes in illumination.
You already have the skills to determine motion
using background subtraction and feature matching.
Next, you'll learn to apply optical flow and template matching.
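To make the background-subtraction idea concrete, here is a minimal sketch assuming a truly static camera; the file name, frame indices, and threshold are placeholders.

```matlab
% Minimal background-subtraction sketch for a static camera.
vid        = VideoReader("staticCamera.mp4");
background = im2gray(im2double(read(vid,1)));   % assume frame 1 shows only the background
frame      = im2gray(im2double(read(vid,100))); % a later frame containing moving objects
moving     = abs(frame - background) > 0.1;     % threshold the difference to isolate motion
imshow(moving)
```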
### Concept Check: Introduction to Motion Detection
Not only is a jittery video unpleasant to look at,
but it also complicates
further analysis like object detection and segmentation.
Here, you'll learn the steps
needed to stabilize a shaky video.
Specifically, you will
correct the common problem of camera shake in
the X-Y plane when there's
a stationary object to use as a reference.
This assumption means the stationary object does not change
size or orientation due to camera motion.
The process of stabilizing
such videos consists of the following steps:
motion estimation,
camera motion estimation, and video correction.
Consider how the location of an object
changes from frame to frame in a video.
The position of an object in
the current frame is its position in the previous frame
plus the apparent motion between frames.
You estimate the motion using
techniques like optical flow or template matching.
Assuming the camera moves only in the X-Y plane,
the apparent motion is the sum of
the camera motion and object motion.
This is where it helps to have
a stationary object in the video.
Then the component of
the motion due to the object moving is zero,
and the apparent motion is entirely due to camera motion.
The last step is to perform video correction.
You stabilize the current frame
by subtracting the camera motion.
However, to stabilize
the video with respect to the original frame,
you need to account for the cumulative motion of
the camera from all previous frames.
Thus, the stabilized frame is
the current frame minus
the cumulative camera motion from all previous frames.
To practice this process,
we have provided an example video and
script in the course materials.
Here we'll cover the main components.
The camera is moving in the X-Y plane,
but there are stationary objects that
can be used for motion estimation.
The corner of this traffic sign might make for
a good object because it
appears throughout the entire video
and is not easily
confused with other objects in its vicinity.
Window corners or this pole are not
good choices because they are not
unique and will be difficult
to correctly match between frames.
We'll use a template-matching
algorithm to establish the apparent motion,
which is usually done on grayscale images.
Template-matching works by providing a template from
a reference image and finding
the closest match in a new target image.
Like spatial filtering,
the template is moved across the image
and the sum of squared differences or SSD
is calculated for each pixel in the target image.
The position with the smallest SSD
corresponds to the position of
the template in the new image.
In MATLAB, you do this using
the vision.TemplateMatcher object.
While not necessary,
it's a good idea to specify
a region of interest to perform the search.
This is more efficient as the algorithm
will search only in the specified region,
and using an ROI helps avoid
an incorrect match from a similar region in the image.
To find the position of the best match,
you call the TemplateMatcher by
its name and pass the following inputs:
the target image, the template image,
and the rectangular region of interest.
The region is given as a four-element vector
containing the pixel locations of the upper-left corner
and the width and height of the rectangle.
The output is the position in
pixels of the best match
of the template in the target image.
The position points to the center of the template.
Because the sign is stationary,
any difference in position
is due to the motion of the camera.
Use the current and previous template positions
to keep track of the cumulative motion of the camera.
Then use the imtranslate function
to shift the current frame by that amount.
Here, the current frame was translated
five pixels in the positive X and Y direction,
resulting in a small black border on the edges.
After applying the correction to all frames,
notice that the edges of
the video move because
of the translation applied to each frame.
To correct this,
crop the stabilized video to show
only the parts of the scene that appear in every frame.
Now, you might be thinking,
"Hey, that is still a shaky video."
Let's look again, but this time with
a box around the starting location of two signs.
The stabilized video has much less motion than the original.
As the camera moves,
parts of the background move in and out of view,
giving the appearance of motion.
This should be expected if
your video has a varying background.
You now know the main concepts and functions
needed to stabilize a video using template matching.
To see a full implementation,
including how to keep track of the cumulative motion,
refer to the provided examples.
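The pieces fit together roughly as in the sketch below. It assumes a template and search region cropped by hand from the first frame; the crop coordinates and file names are made up, and simpleVideoStabilization.mlx is the full reference implementation.

```matlab
% Condensed sketch of video stabilization with template matching.
vid      = VideoReader("ShakyStreet.mp4");
frame1   = im2gray(readFrame(vid));
template = imcrop(frame1,[318 142 24 24]);          % hypothetical stationary sign corner
roi      = [300 120 60 60];                         % hypothetical search region [x y w h]

matcher  = vision.TemplateMatcher("ROIInputPort",true);
loc0     = matcher(frame1,template,roi);            % reference position of the template

vidOut = VideoWriter("stabilized.mp4","MPEG-4");
open(vidOut)
while hasFrame(vid)
    frame = readFrame(vid);
    loc   = matcher(im2gray(frame),template,roi);
    cameraMotion = double(loc - loc0);              % cumulative shift relative to frame 1
    stabilized   = imtranslate(frame,-cameraMotion);
    writeVideo(vidOut,stabilized)
end
close(vidOut)
```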
### Considerations when Template Matching
You just saw how to apply template matching to stabilize a video. Template matching worked well, but there was still some apparent motion. In this reading, you’ll investigate why a video can still exhibit significant motion even if reference objects that the template is attempting to match are perfectly stabilized. You’ll also take a closer look at the assumptions behind template matching and make your code robust when the assumptions are not strictly satisfied.
After you complete this reading, review the two accompanying files: simpleVideoStabilization.mlx and robustVideoStabilization.mlx
## Key assumptions and why they’re necessary
In the "Stabilizing Video with Template Matching" video, there was one explicitly stated assumption and two implicit assumptions:
1. The camera only moves in the plane of the image. Motion in the perpendicular direction will change the apparent size of an object throughout the video. Thus, a stationary object in the template image will appear to change size, causing the matching to fail.
2. The illumination of the object in the template remains consistent throughout the video. If the illumination of the object significantly changes, the difference between the template and the reference object in later frames will be significant and more prone to mismatch.
3. The reference object stays inside a specified part of the image (region of interest, or ROI) or is visible in the entire video if not using an ROI. An ROI is recommended to constrain the template search. Using an ROI is more efficient and protects against matching to a similar object elsewhere in the image. If the reference object leaves the ROI during the video, an incorrect match will be returned.
How to account for violations in the assumptions
Fortunately, these assumptions need not be strictly satisfied to get good results. To make your code robust, update your template image and ROI every frame (or periodically based on some conditions).
For example, suppose there is a lot of in-plane or out-of-plane motion and/or lighting changes in the video, but these differences are small between frames. Then, instead of using the same template image for the entire video, updating the template every frame will still satisfy assumptions 1 and 2. You do this by cropping a small region out of the current frame to use as the template for the next frame.
The same idea can be used to update the ROI if there is significant motion throughout the video. Instead of using a fixed ROI, update the ROI based on the new position of the template. This way, the ROI moves with the reference object, making it less likely to make a wrong match.
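For example, the per-frame update might look like the fragment below, which continues the earlier stabilization sketch (matcher, gray, template, and roi are assumed to already exist); robustVideoStabilization.mlx shows the complete version.

```matlab
% Sketch of per-frame template and ROI updates inside the stabilization loop.
loc = matcher(gray,template,roi);                 % [x y] center of the best match
tSz = size(template);                             % template height and width
template = imcrop(gray,[loc(1)-tSz(2)/2, loc(2)-tSz(1)/2, tSz(2), tSz(1)]);
roi(1:2) = loc - roi(3:4)/2;                      % re-center the search region on the match
```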
Code examples
Two examples are provided with the course files for the ShakyStreet video you saw in the lesson. Both the simple and robust approaches give similar results. First, review the simple version, simpleVideoStabilization.mlx, to understand the code and process of video stabilization. Once you’re comfortable with the example, go through the robust version, robustVideoStabilization.mlx, to update the template and ROI for each frame. Examine the stabilized video. Notice that even the robust version still has apparent motion around the signs.
## Why is there still movement in the stabilized video?
Varying Distance Away from Camera
We used a sign close to the camera in the ShakyStreet video as the template image. However, there are very distinct objects at varying distances from the camera. The animation below illustrates an extreme example where, initially, the red oval starts to the left of the triangle in the image. After the camera moves, the red oval appears to the right of the triangle. This means that objects further from the camera appear to move less than objects close to the camera. Translating frames cannot resolve this issue, and the two objects will appear to move relative to each other. Video stabilization will work best when all objects are a similar distance away from the camera.
These values are used to estimate
the horizontal and vertical velocities for each pixel.
Because there is one equation and two unknowns,
this system is under-determined.
However, solution algorithms have
been developed to estimate the velocities.
MATLAB provides several methods,
including Horn-Schunck, Lucas-Kanade, and Farneback.
In this lesson, you will detect the motion of
the pedestrians in this dash cam footage
using the Farneback method.
This modern algorithm has good performance and speed,
and works on a range of applications.
Your final result will be
a video with the moving objects segmented in each frame.
The syntax and workflow are the
same for all optical flow methods.
If the Farneback method doesn't
work for your application,
you can switch methods easily.
To get started, create
VideoReader and VideoWriter objects.
You'll also need an optical flow solver.
Let's focus on the portion of
the video where the car is stopped at the crosswalk,
between frames 96 and 565,
which were found using the Video Viewer app.
Read in the first frame and view it with imshow.
There are several pedestrians in view,
as well as both parked and moving cars.
The optical flow solver must
be initialized with the first frame.
This compares the frame to
an all-black image to
create a base for future calculations.
Read in the next frame and apply the flow solver again.
The optical flow variable stores the previous frame,
so only the current frame is needed.
You view the resulting velocity vectors
using the plot command to add them to the image.
However, even this low resolution video
has hundreds of thousands of pixels.
It would be impossible to see a vector for each one.
To solve this, use
the DecimationFactor name value pair
to reduce the number of vectors shown.
For this video, showing
every 15th vector in the x and y-direction will work.
Also, since frames are one-thirtieth of a second apart,
the motions will be very slight.
The ScaleFactor will increase
the length of these vectors to be more visible.
At this point, the pedestrian motion is clearly
visible and the direction
and speed can be roughly determined by eye.
That's a pretty good accomplishment
with only a few lines of code.
There are a few problems, though:
the buildings are not moving,
but motion vectors are present.
All optical flow applications will have this type of
noise due to the sensitivity of pixel level calculations.
How can you fix this?
Use the velocity magnitude as
a threshold to filter out the low-level noise.
The optical flow solution contains
the horizontal and vertical components of the velocities,
as well as the magnitude and direction.
A histogram of the magnitude distribution
can be used to select the cutoff value.
There are a lot of pixels with very small velocities,
since most of the frame is stationary.
After some trial and error,
a threshold of 0.5 works for this video.
Recall that the goal is to detect moving pedestrians.
Rather than looking at the velocity vectors,
create a mask using the threshold value.
This will highlight the moving objects.
This is a good start.
The stationary buildings are no longer showing motion.
Raising the threshold will
remove some of the other noise,
but will also remove
the slower moving portions of the pedestrians.
Since this is now a binary segmentation,
image processing can be used to clean it up.
Morphological opening removes most of
the noise while preserving
the outlines of the large regions.
Then, use region analysis to filter
the mask so that it includes only areas above 500 pixels.
Now, the moving car and pedestrians are well segmented.
There's also a reflection of
a pedestrian on the hood of the vehicle.
You could eliminate this by using a region of
interest to remove the hood of the car from the frame.
Now, apply the workflow to the entire video,
just like in object detection.
Detecting moving objects is
just one application of optical flow.
It is a powerful technique with
many complex applications beyond
the scope of this course.
It's also used in physical applications,
including flow velocimetry and 3D mapping.
Optical flow is often used in deep learning workflows
like activity classification and image creation.
It is also used to create new frames between
existing video frames, improving video quality.
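A condensed sketch of this workflow is shown below. The frame indices, threshold, structuring element, and minimum area follow the values mentioned in the lesson, while the file name is a placeholder.

```matlab
% Sketch of the Farneback optical-flow workflow on two frames.
vid       = VideoReader("dashcamCrosswalk.mp4");
flowModel = opticalFlowFarneback;

frame = im2gray(read(vid,96));
estimateFlow(flowModel,frame);                        % initialize with the first frame

frame = im2gray(read(vid,97));
flow  = estimateFlow(flowModel,frame);

imshow(frame), hold on
plot(flow,DecimationFactor=[15 15],ScaleFactor=10)    % show every 15th, lengthened vector
hold off

mask = flow.Magnitude > 0.5;                          % threshold out low-velocity noise
mask = imopen(mask,strel("disk",3));                  % morphological opening removes speckle
mask = bwareafilt(mask,[500 Inf]);                    % keep only regions larger than 500 pixels
```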
### Practice Applying Optical Flow
You now know several ways to detect motion in videos and, in particular, the diverse utility of optical flow.
Navigate to the Module 2 folder and open the file applyingOpticalFlow.mlx. Work through the live script to calculate the optical flow and practice using the results in several different ways.
Project: Introduction to Applying Optical Flow to Determine the Direction of Traffic
In this module, you have seen multiple techniques for detecting motion. Although the module project asks you to use optical flow, depending on your images, there is usually more than one correct approach.
In this module's project, you will apply what you have learned to detect moving cars from several frames of a video portraying a busy highway. This footage was taken using a camera on a tripod, so there is little need for camera stabilization.
In the first external MATLAB Grader tool, you will use optical flow to create a mask isolating the fastest-moving objects (cars) in each frame.
In the second external MATLAB Grader tool, you will apply this mask to calculate the velocity in the x-direction of each car and determine how many cars are moving in each direction.
Proceed to the next external MATLAB Grader tool to get started. You are encouraged to develop your code in MATLAB and copy it over to the Grader tool when you are ready. The images you are asked to perform optical flow on are included in the course files download.
If you get stuck, refer to the applyingOpticalFlow.mlx live script reading, which performs similar operations on moving cars, bicycles, and pedestrians.
Good luck!
Project: Applying Optical Flow to Detect Moving Objects
Copy your detection code into the online grader to see if it gives the correct result.
Object tracking is an integral part of autonomous systems engineering,
scientific research and countless other applications.
Consider this cartoon example of a couple of moving objects.
For each frame, in addition to object detection,
tracking involves repeating three main steps in a cycle.
The first is predicting a new location estimate for
all the currently tracked objects, aptly referred to as tracks,
using existing detections, estimates, and motion models.
In this example,
we assume the two objects continue moving with a roughly constant velocity.
The next step is to use the track location predictions along with all
the new object detections to match or assign detections to existing tracks,
as well as determine which are left unassigned.
Here, two of the detections were assigned to tracks one and
two based on relative location to the predictions for each, and
another was left unassigned since it was nowhere near any predicted track location.
The third step is to use those results to update existing track
estimates as well as initialize or remove tracks for new or
lost objects, and the process is ready to repeat again.
That all may seem complicated, but the core processes of object tracking are so
intuitive and automatic for our minds that you probably take them for granted.
Consider navigating busy city streets: you observe or detect moving cars and
pedestrians periodically, but you also have predictions about what is going to
happen in the immediate future based on your understanding of those things.
If you look away briefly, when you look back, you expect to see the same cars
traveling steadily forward just a bit further ahead than before.
Similarly, while your view of a moving pedestrian may become obstructed,
you predict that you will see them emerge again in a new location
based on their previous velocity.
Now, you might be thinking, sure, I do that, but why is this so important for
computer vision?
Do we really need more than detection?
Well, even perfect detections represent an object at just one moment in time.
A computer needs tracking to connect the detections and
recognize one object across frames.
Also remember, the output from an object detection algorithm is often just
bounding boxes or the centroids of labeled pixel regions.
In practice, this output is often noisy;
tracking helps smooth out the effects of detection noise.
And as you know, objects can be obscured in some frames,
causing detections to be lost.
Tracking helps prevent loss or confusion of objects in these scenarios.
In the following lesson, you'll learn to implement each of the key steps of object
tracking and combine them into a complete processing loop.
Let's get started.
Object tracking is done using a recursive loop.
Assume you've detected objects in a given video frame. Tracking
involves predicting the locations of known objects, called tracks,
assigning matches between the detections and existing tracks,
updating the existing track estimates, initializing tracks for
new objects, and removing tracks for lost objects.
This set of steps is repeated over and over at each frame.
In this video, we'll take you through each of these main blocks in more detail, and
you'll learn the components and
processing steps required in each to build a functioning object tracking loop.
I know this looks like a lot, and it is, but don't worry,
we'll go through it all piece by piece. By the end of this video,
you'll see that while there are a lot of parts,
the job each does is relatively intuitive.
The two main things you need at each frame are the set of detections and
the set of tracks.
These are the core variables of the tracking process.
Each detection will include information like bounding boxes,
centroids, and/or other information on where things appear.
It can also include additional data like object size or
metadata associated with each detection.
Each track contains everything you want to know about
a specific object over time.
You'll need a fair amount of additional information for each track:
things like an identifier, a running count of frames in which the object has
been detected, as well as other information.
We'll circle back to this when we get to the details of the update process.
The real core of each track is the estimator. You use the estimator
to predict the location of a track and compare that with new detections.
If a track is assigned a detection, the estimator is then updated using
the detection information before making a new prediction.
Remember, at each new frame you need to predict the motion of
objects to know where you expect them to be.
You use these predictions to determine which, if any, detections should
be assigned to each track, but predictions are almost never perfect, and
detections are often noisy or even missing from some frames.
You use the assignment results to either update the estimated
track location if a detection was assigned, or
just use the prediction if there was no detection for the track that frame.
In this course, you'll use a Kalman filter as the estimator for each track.
The Kalman filter is a powerful and widely used tool to estimate the true
values of unknown and/or noisy variables in dynamic systems.
While the details are beyond the scope of this course, here
the Kalman filter in each track generates a location prediction at
every frame using assumed equations of motion.
Then the Kalman filter in each detected track uses assumptions on measurement
noise and motion model uncertainty, and a recursively updated measure
of internal estimation error, to combine the prediction and
detection into an updated track location.
The new track location will shift in favor of either the prediction or
detection based on the amount of uncertainty.
You'll set parameters for
these uncertainties, or noise levels, as well as the type of motion
prediction model to use, when you initialize new tracks. For now,
let's go back to the overall process.
We've covered the sets of detections and
tracks, the predict block, and the most crucial part of the update block.
Now let's see how to use the predictions to assign incoming detections to tracks.
The assignment process
takes the set of all detections in each frame and
the set of all existing tracks and determines which pairs to assign.
Notice that there may be unassigned detections, such as when a new
object appears or there is a false detection.
There may also be unassigned tracks, for
example tracks that have been obscured from view or
that the detector failed to find in a given frame.
You need to evaluate multiple possible combinations of assignments between
existing tracks and a variable set of new detections and determine the best one.
You do this by calculating a cost value for
every possible assignment, as well as the cost of non-assignment,
which means leaving either a track or detection unassigned.
The assignment costs should be based on proximity, but can also
incorporate additional information like size, shape, or even color.
In this course, we'll assign costs using measures of distance from each prediction
to each detection, obtained from the tracks'
Kalman filters, so you won't have to come up
with the assignment cost values on your own.
Once you have all the costs, you minimize the total cost over all
possible combinations of assignment and non-assignment.
Don't worry,
this optimization is solved for you by a single function in MATLAB. Here,
the minimum sum of costs is reached by assigning
detection two to track one, detection three to track two, and
leaving both detection one and track three unassigned.
This reflects a scenario where, due to their distance apart,
the third track was more likely obscured or
lost, while the first detection more likely represents a new object, or
possibly noise, rather than a huge error detecting the third track.
Alright, we're almost done, just a few things left in the update block.
Let's go through them.
The first thing to do is update the track locations.
You saw this a bit earlier.
For those tracks assigned a detection, you use the Kalman filter update process; for
those tracks not assigned a detection,
you use the Kalman filter prediction. Next,
remember that track metadata I promised we'd circle back to when we got to
the update block?
Well, here we are.
Track metadata can include almost anything you want.
However, this usually includes things like a unique identifier,
total detection count, overall age, consecutive missing
count, confirmation status, and whether it is currently detected or not.
A unique identifier is used to maintain and differentiate tracks across frames.
The total number of frames in which a track has been detected is used as
a threshold to determine whether it is reliable enough to be considered
a confirmed track.
This helps deal with noisy, sporadic detections by keeping new tracks
unconfirmed until they've been detected enough times. The number of
consecutive frames the track has gone undetected,
the total number of frames it has existed, and
the number of frames in which it was detected are used to determine
when to consider a track lost and delete it. Next,
any detections that were not assigned are considered potential new objects,
so you create a new track for each, initializing a Kalman filter and
setting initial metadata. Since detections can be false,
this is where the confirmed status we just mentioned comes in.
You'll only confirm tracks after a certain detection count. And
finally, you'll use data on age, detected frames, and
consecutive missing frames to delete tracks that do not seem to be reappearing.
Well, you made it.
This was a lot to take in, so don't worry if everything isn't perfectly clear yet;
you'll get plenty of step-by-step practice and
reference materials in the rest of this lesson.
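To make the assignment step concrete, here is a small, self-contained example using MATLAB's assignDetectionsToTracks function with a made-up cost matrix that mirrors the scenario described above.

```matlab
% Rows are tracks, columns are detections, and each entry is the distance
% from a track's prediction to a detection (made-up values).
costMatrix = [ 90   5  60 ;     % track 1 is close to detection 2
               75  80   4 ;     % track 2 is close to detection 3
               95  85  70 ];    % track 3 is far from everything
costOfNonAssignment = 20;       % smaller values leave more pairs unassigned

[assignments,unassignedTracks,unassignedDetections] = ...
    assignDetectionsToTracks(costMatrix,costOfNonAssignment);

% assignments          -> [1 2; 2 3]  (track 1 <- detection 2, track 2 <- detection 3)
% unassignedTracks     -> 3
% unassignedDetections -> 1
```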
There are many processes and
parameters you need to define to build a functioning object tracker.
In this video, we'll take you through an example of how to implement each
of these steps in MATLAB.
We'll test the overall algorithm out by tracking these cells.
Importantly, you have access to the code and the video to experiment with on
your own, and to use as a template to track objects in other scenarios.
You implement the overall tracking process in a loop over each frame.
You use individual functions to perform object detection,
track prediction, detection-to-track assignment, and track updates.
Of course you'll want to have some way to save and display your results as well.
We could add this to our diagram as an additional block.
The loop purposefully has a modular structure.
This enables you to substitute in a new function for a particular step,
for example detection, while leaving the rest of the code unchanged,
or add additional functions, for example to analyze your results at every frame.
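A skeleton of that loop might look like the sketch below; the helper function names are placeholders standing in for the course functions, and the point is the modular structure rather than the exact names.

```matlab
% Skeleton of the modular tracking loop: detect, predict, assign, update.
vid    = VideoReader("cells.mp4");
tracks = table();                                   % empty table so the loop runs with no tracks yet
while hasFrame(vid)
    frame      = readFrame(vid);
    detections = detectObjects(frame);              % detection (any method from earlier lessons)
    tracks     = predictTracks(tracks);             % predict new locations for existing tracks
    [tracks,detections] = assignDetections(tracks,detections);
    tracks     = updateTracks(tracks,detections);   % update, create, and delete tracks
    % ... display or save results here ...
end
```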
For now, we'll focus on the core tracking algorithm you've seen previously.
While detection is a critical part of tracking,
it's often developed independently and has been covered in previous materials.
We'll focus on the implementation details of the prediction,
assignment and update processes here.
Now, notice the set of tracks is initialized before the loop as
an empty table.
This is to enable the loop to run when there are no tracks yet.
Let's see what happens for the very first frame with detected objects.
There are no tracks to make predictions on yet.
So, nothing will happen in the predict function.
Similarly, since there are no tracks to assign detections to,
all detections will be passed through the assignment function as unassigned.
Initialization of new tracks for
unassigned detections happens in the update function.
So, let's examine this function first by right-clicking it and
selecting open on the update tracks function.
The first thing we do in this function is define a counter variable to use
as a track identifier.
We use a persistent variable so that we can accumulate the count of how
many tracks we've made through multiple updates.
Since this is the first time through and the tracks table is empty,
the code to update the track locations, update track metadata and
delete lost tracks is all skipped on this iteration.
Don't worry, we'll be coming back to all of this in the following frame iteration.
Now, here you assign new tracks to the unassigned detections.
For each unassigned detection,
initialize a Kalman filter with the configureKalmanFilter function.
This function requires a set of parameters for a given tracking scenario.
Here, we use the detected centroid to define the initial location.
Set the filter type to assume constant velocity, and choose estimates for
initial error, motion noise and measurement noise.
While the values for these parameters are largely going to come down to trial and
error, remember that the relative size of the motion noise and
measurement noise affects how much your filter trusts its own internal model or
the detections respectively.
After that, you initialize track data, create variables for
the detected, tracked, and predicted locations, and
initialize them all to the first detected location.
Then, create a variable for track age and set it to 1,
a count of detected frames and set it to 1,
a count of consecutive undetected frames and set it to 0,
a detected flag and set it to true, and a confirmed status and set it to false.
Create a new track entry using all of these variables to form a single-row
table, add it to the tracks table, and increment the track ID.
Okay, let's go back to the overall process and assume we're on the next video frame,
and we have a new set of detections.
Now, on this iteration, the predict function will make predictions for
the tracks initialized previously.
Let's see how.
This function takes the set of tracks in, predicts a location for each,
and then outputs the updated tracks.
In MATLAB, this only takes a few lines of code.
Just loop through the table of tracks, and use the predict function on the Kalman
filter in each track to get a new predicted location.
The next function in the loop is detection to track assignment.
This function takes in the sets of detection and tracks, and
updates both with assignment information.
The first thing you do here is get the cost for
each possible assignment by looping through the tracks table and using
the distance function of the Kalman filter with every detection location as an input.
Next, you set a cost of non-assignment. Like the Kalman filter parameters,
this may take some trial and error when working with a new video.
Just remember, the smaller this cost, the more likely you are to leave tracks and
detections unassigned.
Then, you solve the assignment optimization using
the assignDetectionsToTracks function included with MATLAB.
The assignments output specifies index pairs for the tracks and
detections that were assigned to each other in each row.
In the rest of this function, set the detected status to false for
undetected tracks, and true for
detected tracks using the first column of the assigned index pairs.
Then, add the detected centroid location to each assigned track using
the columns of the assigned index pairs.
Finally, set the assigned status to false for unassigned detections,
and true for assigned detections, using the second column of the assigned index pairs.
Now, this time we have both detections and tracks when reaching the update block.
So let's take another look at that function.
You've already seen the creation of the track ID counter.
This time though, the tracks table is not empty.
So for each track that is currently detected,
you update the track location using the correct function,
with the KalmanFilter and detected location as inputs.
Note that this does not simply replace the track location with the detection.
It uses the prediction, detection and
estimates of uncertainty to create a new track location.
For any tracks that were not detected in this frame,
you update the track location to be equal to the predicted location.
Next, you update the track metadata.
Start by incrementing the age for all tracks.
Then, increment the total number of detected frames for
all tracks that are currently detected.
Then, use the total detection count to confirm tracks that have met a threshold.
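To tie the Kalman filter calls together, here is a minimal sketch for a single track using configureKalmanFilter, predict, distance, and correct; the centroid and noise values are made-up placeholders you would tune for your own video.

```matlab
% Per-track Kalman filter sketch: configure, predict, score, and correct.
centroid = [120 85];                              % detected centroid of a new object
kf = configureKalmanFilter("ConstantVelocity",centroid, ...
        [200 50], ...                             % initial location and velocity error
        [100 25], ...                             % motion noise: trust in the motion model
        100);                                     % measurement noise: trust in detections

predicted = predict(kf);                          % step 1: predict the next location

newDetection = [124 87];                          % a detection assigned to this track
cost      = distance(kf,newDetection);            % assignment cost used in the cost matrix
corrected = correct(kf,newDetection);             % step 3: combine prediction and detection
```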