Aerial Image Segmentation with PyTorch  About Guided Projects In this Guided Project, you will get assigned a cloud desktop that has all the required software pre-installed. This will allow you to follow along with the instructor to complete the above mentioned tasks. After all, we learn best with active, hands-on learning.

# Computer Vision course materials

## Module 1: Getting Started with OpenCV in Python
## Module 2: Video IO and GUI
## Module 3: Binary Image Processing
## Module 4: Image Enhancement and Filtering
## Module 5 : Advanced Image Processing and Computational Photography
## Module 6 : Geometric Transforms and Image Features
## Module 7 : Image Segmentation and Recognition
## Module 8 : Video Analysis

A fully functioning object detection system 
that runs on an embedded system. 
Computer vision, or the idea of having computers 
interpret images and videos 
has been around since the 1960s. 
A lot of advancements in technology 
are possible because of computer vision. 
One of the most notable applications 
in recent years is self-driving cars, 
which rely on a variety of 
sensors to survey their surroundings. 
A few of these sensors often involve cameras and require 
complex vision models in order to detect 
various objects and road signs 
to assist the driving algorithms. 
Without such models, the car would not be able to figure 
out what's a pedestrian or another vehicle to avoid. 
Computer vision has plenty of uses in manufacturing too. 
You could use anomaly detection 
on images to look for things 
like rust or mechanical parts not moving correctly. 
Many robots rely on cameras and 
vision algorithms to figure out where to place an object, 
turn a screw, or weld two pieces of metal together. 
For example, this engineer built 
a robotic arm in his garage that 
automatically looks for the charging port on 
his Tesla and attaches the charging plug. 
He does this with a Raspberry Pi 4. 
Vision systems can also be 
used to identify different types 
of components that need to be sorted 
or picked for an assembly process. 
This is something we will explore in this course. 
Computer vision can be used for analyzing 
satellite images to look for things 
like wildfires and deforestation. 
My Wyze camera here has 
a built-in person detection model that can 
alert me whenever it sees a person in the frame. 
This has a lot of 
potential security applications where you 
may not want to have someone 
say watching a screen all the time. 
As machine learning and 
computer vision technology gets 
better and more efficient, 
we can start running these algorithms on 
embedded systems that allows us to create smart sensors 
that are capable of making 
decisions without needing to stream 
raw video data out to 
a more powerful computer all the time. 
In this course, we'll start by going 
over what makes up a digital image 
and how we can use that information 
as input to a neural network. 
We'll also give a brief overview 
of how neural networks operate. 
But I highly recommend taking my introduction to 
embedded Machine Learning course 
first if you have not done so already, 
to get more information about 
general machine learning on embedded systems. 
We'll start by going over image classification, 
where we'll construct a simple neural network that 
attempts to predict the main subject of the image. 
Note that it won't be able to identify 
multiple objects in that image 
or tell us where they're located. 
We'll use Edge Impulse to 
create and train the model and then I'll 
show you how to deploy it to 
the OpenMV camera as well as a Raspberry Pi. 
Note that in this course, 
I recommend having some experience 
with Python as I'd like to 
have you do some work in 
Google Colab to examine and manipulate images, 
curate your datasets, and analyze your models. 
My goal is to give you enough examples and reference 
material so that you can 
successfully complete any of the projects. 
But knowing some Python 
ahead of time will definitely help. 
For the initial release of this course, 
I plan to show the Raspberry Pi 4 
and OpenMV camera H7 plus. 
You can get most of the examples from 
this course working on either of these boards, 
note that deploying a model to 
the Raspberry Pi requires writing code in Python, 
and deploying a model to 
the OpenMV requires writing micro Python. 
The open NV-IDE with micro Python also supports 
the Arduino Portenta which 
should work for most of this course. 
However, note that the camera on 
the vision shield only does greyscale, 
which will limit the capabilities of some vision models. 
Once we have a trained model, 
I'll show you how to write some micro Python code for 
the open-end V or Python code for 
the Raspberry Pi to use the model. 
You're welcome to try using other boards for this course, 
but I likely won't be able to help you 
troubleshoot any issues you might run across. 
I recommend using the discussion forum 
in this course to ask me 
and fellow students about the content 
and projects found in this course. 
If you run into technical issues with Edge Impulse, 
I highly recommend posting something to 
their forums at forums.edgeimpulse.com. 
If you ask in the discussion forum for this course, 
there's a good chance I'm just going to copy 
your message to the Edge Impulse forum anyway. 
You will likely get a faster answer from 
the Edge Impulse team if you post 
there for technical help using their tool. 
It should be possible to write 
C++ programs to perform inference. 
However, I found it a lot easier just to stick 
with Python so we can focus on the concepts. 
That being said, if I happen to 
find some examples or create some that are 
in C++ that run on something like 
the Arduino Nano 33 or the Arduino Portenta, 
I'll make sure to include them 
in the recommended reading sections. 
From there, we'll dive into 
convolutional neural networks to see how they work and 
why they make for better image classification models 
than regular dense neural networks. 
Finally, we'll look at 
several popular object detection models that can be 
used to locate and 
identify more than one object in an image. 
I'll show you how to train one such model to identify 
objects of your choice and then 
deploy it to an embedded system. 
Note that at this time, 
object detection models are still quite slow, 
even on something like a Raspberry Pi 4. 
At the initial release of this course, 
the object detection model only runs 
on single-board computers like the Pi. 
Once there is support for them on microcontrollers, 
I will update the course to hopefully show it 
working on something like the OpenMV camera. 
My goal is to give you 
enough tools and knowledge so you can get started 
creating your own embedded vision systems 
with Machine Learning. Let's get started.
While I plan to cover most of 
the topics in this course myself, 
I've invited some guests to talk about 
their active areas of research 
and to showcase some projects. 
Computer vision and machine learning 
are very popular topics right now. 
I think it'll be worthwhile to 
see what some other people are working on. 
Mat Kelcey is 
an applied at machine learning research engineer 
at Edge Impulse. 
He has previously worked at Amazon and Google, 
and he has advised a number of 
startup companies when it comes 
to implementing machine learning. 
His interests, include 
deep reinforcement learning for robotics, 
information extraction and search ranking. 
We'll hear from him about how features from models can be 
reused to create self supervised learning systems. 
We'll also hear from Dmitry Maslov, 
who is a computer vision engineer 
with a background in machine learning and 
robotics and who works for 
the Seeed Studio single-board computer department. 
He also runs the Hardware.ai YouTube channel, 
that talks about various applications 
of artificial intelligence in robots. 
He'll give us a demonstration of 
his latest project that uses 
multi-stage inference to detect cars in a video, 
and then identify the type of car. 
This is me. I run 
my own freelance and consulting business where I 
make courses like this and help 
companies create technical content. 
I used to work at SparkFun Electronics 
first as an engineer designing products, 
and then as a content creator, 
making videos and writing blogs. 
I'm currently enjoying working with 
embedded systems and machine learning to 
teach these concepts as well 
as make fun projects like this. 
This is an open envy camera running 
a machine learning model that looks 
for a particular Lego brick. 
It's a bit slow, 
but the idea is to have something that would 
save you time searching through such a pile. 
It uses a convolutional neural network, 
which is something we will cover in this course. 
I hope that this project, 
along with the projects of our guest instructors, 
will inspire you to use 
embedded computer vision in your next project.

Please note that a large amount of foundational math goes into training neural networks and using them for inference. This course does not assume a background in such knowledge, and as such, it is meant as a course in applying machine learning (using various tools and libraries) in embedded systems without needing to understand the finer details of neural networks (and other machine learning algorithms).
In this course, we will focus on applying machine learning tools, techniques, and models to computer vision problems. 
We first review several concepts around neural networks, including training, evaluation, and deployment. We then dive into how convolutional neural networks operate in order to classify digital images. Finally, we cover object detection systems. You will have the opportunity to train, test, and deploy your own deep learning models to a microcontroller and/or single board computer to perform live image classification and object detection.

Syllabus
Here is a broad outline of the topics that will be covered in the course:
•	What is computer vision (CV)?
•	How can machine learning (ML) be used to accomplish CV tasks
•	Ethics and limitations of CV
•	How digital images are created and stored
•	How digital images can be manipulated and transformed in code
•	Using embedded ML to solve CV problems
•	Data collection and curation
•	Using the Edge Impulse tool to create and train an embedded ML model
•	Convolution as a way to filter digital images
•	Pooling as a way to downsample digital images
•	Using convolution, pooling, and dense neural networks to create a convolutional neural network (CNN)
•	How CNNs can be used to classify digital images
•	Training a CNN
•	Deploying a CNN to an embedded system (microcontroller and/or single board computer)
•	Performing continuous image classification using a CNN
•	Data augmentation to increase the accuracy of an image classification model
•	Transfer learning
•	Object detection
•	Evaluating an object detection model
•	Image segmentation

Required Hardware
You are welcome to take this course without attempting the projects, as they are not graded. However, I highly recommend doing the projects (or at least running the provided solutions on your embedded system(s)) to get the most out of the course. In my experience, challenging yourself with hands-on projects is where the real learning occurs.
Here are your options for hardware (you only need to choose one of these options):
•	None: you can use your computer and smartphone to capture images to complete some of the projects. You will not be able to complete any projects that require deploying machine learning models to an embedded system.
•	OpenMV Camera: the best option for using a microcontroller for this course. It runs MicroPython, which makes completing the projects easier (as the syntax is the same as Python). I recommend the OpenMV H7 Plus model, but the OpenMV H7 should work for most projects (you will likely need a micro SD card). Important: at this time, the OpenMV Camera does NOT run object detection models, so you will not be able to complete the final project in the course.
•	Raspberry Pi 4 with Pi Camera: the Raspberry Pi 4 is a single board computer that will work for all projects in the course. It supports full Python. Some webcams may work for the projects, but due to the variety of such cameras, I will not be able to help troubleshoot issues with them. As a result, I recommend using the official Pi Camera Module v2, and project solutions are written for the Pi Camera. Note that you will need a micro SD card, USB-C power cable and likely a keyboard, mouse, and monitor to use the Raspberry Pi.
Note that it might be possible to accomplish some or all projects in the course using hardware not listed above. However, I will likely not be able to help you troubleshoot issues if you use other hardware.
I chose to use primarily Python and MicroPython for the course so that we can focus on the concepts of computer vision and machine learning using a single language. Translating a Python (or MicroPython) program to C/C++ is possible, but it usually requires effort outside the scope of this course. You are welcome to try implementing some of the projects in C/C++, but I doubt I will be able to assist with any issues you run into.
If I am able to get the projects in the course to run on other boards (such as the Arduino Portenta, Arduino Nano 33 BLE Sense, ESP32 Cam, etc.), I will list them here and update the project descriptions.
I recommend searching on the following sites for the recommended hardware:
Global
•	Seeed Studio
•	Digi-Key Electronics
•	Mouser Electronics
Australia
•	Pakronics
India
•	Fab.to.Lab
United Kingdom (UK)
•	Cool Components LTD
United States (US)
•	Adafruit
•	SparkFun Electronics
•	If you have any questions regarding the material and quizzes or you run into technical problems with the projects, I recommend searching in the Discussion Forums first to see if other students had the same question. If you do not find a satisfactory answer, please create a new post. I will try to answer within a few days.
•	I also encourage you to help other students if you see an unanswered question in the forums and you know the answer!
•	If you run into technical issues with the Edge Impulse tool, I recommend posting your question or issue to the Edge Impulse forum. There is is a good chance that I will not be able to replicate your exact issue, as I do not have administrative access to the Edge Impulse tool (i.e. I cannot see your project). Additionally, I will likely copy-and-paste your question to that forum anyway, and the Edge Impulse staff is much faster (and more experienced) than I am at assisting people with such problems.
•	Computer vision is the science and engineering of 
•	teaching computers to assign meaning to images and video. 
•	The idea of capturing 
•	an image has been around for a long time, 
•	and digital cameras have been around since the 1970s. 
•	To capture an image, 
•	we need some sensor. 
•	Most modern digital cameras have 
•	a complimentary metal oxide semiconductor 
•	or CMOS image sensor. 
•	You'll sometimes find charge-coupled device sensors, 
•	but these are usually found in older digital cameras. 
•	Light entering the camera can be bent or 
•	refracted through lenses and 
•	possibly bounced off a mirror, 
•	as in the case of my DSLR camera. 
•	Either way, as light strikes the sensor, 
•	tiny sections of the sensor respond to 
•	the amount and color of that portion of light. 
•	Each of these tiny portions, known as pixels, 
•	generate an electrical signal 
•	proportional to the amount of color and light hitting it. 
•	A computer or microcontroller reads 
•	these electrical signals and stores them 
•	as numerical values in an array. 
•	Often, you'll find three arrays: one for red, 
•	one for green, and one for blue. 
•	These arrays which represent the full-colored photo 
•	that we just took are saved to some non-volatile memory, 
•	such as an SD card. 
•	Sometimes you'll find that these arrays are 
•	compressed in a way that saves storage space, 
•	even if it means losing some of the information in them. 
•	For example, JPEG images are compressed this way. 
•	We can then plug the SD card or maybe it's 
•	our phone into our computer and view the image. 
•	The computer knows how to read those stored arrays of 
•	numbers and convert their individual values 
•	into colors on our screen. 
•	This allows us to view the image we captured. 
•	The more pixels in the image, 
•	the greater the detail we can make out. 
•	This is not the only way to 
•	construct a digital image however. 
•	A variety of sensors can be used to create 
•	a digital representation of the world around us. 
•	An image sensor in a camera works with visible light, 
•	so it's similar to how we might see with our eyes. 
•	However, we can use 
•	infrared sensors to create an array of infrared values, 
•	which is great for seeing things at night or 
•	looking at the relative temperature of objects. 
•	We can also use things like radar to get an idea 
•	of how far away things are or map out terrain, 
•	or maybe we use something like 
•	ultrasound to get a cross-section inside our bodies. 
•	In all of these cases, 
•	we are producing digital images. 
•	While this hopefully gives you an idea of 
•	how digital images are captured and stored, 
•	simply recording something isn't computer vision. 
•	In all these cases, 
•	it requires a human to interpret the digital images. 
•	Computer vision is when we have a computer automatically 
•	interpret and assign meaning 
•	to images or parts of an image. 
•	For example, we can use 
•	computer vision to locate the trees in this photo. 
•	Or maybe we automatically identify 
•	potential energy leaks in 
•	a home from a captured infrared image. 
•	Computer vision might be able to identify 
•	dangerous lava flow routes from a terrain map. 
•	Health care workers could rely on computer vision to 
•	automatically identify 
•	potential issues when taking x-rays, 
•	ultrasounds, or CAT scans. 
•	I'm not a doctor or ultrasound tech, 
•	so I don't actually know what I'm 
•	looking at in this particular image, 
•	but you get the idea. 
•	Most people credit Larry Roberts as 
•	being the founder of the field of computer vision. 
•	His 1963 PhD thesis, 
•	machine perception of three-dimensional solids, 
•	proposes methods for extracting information about 
•	3D objects from a simple two-dimensional image or photo. 
•	From here, a whole field of study was born that attempts 
•	to automate the process of 
•	extracting meaning from images. 
•	Throughout the 1970s, 
•	the British neuroscientist, David Marr, 
•	published a number of papers that describe 
•	how images captured by two eyes can 
•	be constructed into three-dimensional representations 
•	of scenes in the brains of living creatures. 
•	From there, researchers have worked 
•	to automate this process in computers. 
•	For example, we can use two cameras mounted 
•	a fixed distance from each other 
•	to take photos of the same scene. 
•	These photos will be ever so slightly different 
•	from each other thanks to how the cameras are separated, 
•	much like human eyes. 
•	Here is an example taken from 
•	this ArduCam stereo HAT for Raspberry Pi. 
•	The two grayscale images are from each of the cameras. 
•	With some math, it produces the image on the left. 
•	Greens and blues are objects that are farther 
•	away and the orange and red blobs 
•	are images that are closer to the cameras. 
•	This is known as a depth map and 
•	it helps us figure out where objects are in 
•	relation to the cameras without relying on 
•	distance sensors like ultrasound or LiDAR. 
•	The process of extracting 
•	three-dimensional information using a pair of 
•	cameras set at a fixed distance from 
•	each other is known as stereoscopic vision. 
•	Another common objective in 
•	computer vision is to find 
•	the boundaries between objects. 
•	This is often accomplished 
•	using edge detection algorithms, 
•	which filter an image and 
•	output one or more images such as these. 
•	You can see how only the edges of 
•	the objects in the photo are shown, 
•	much like someone drawing a sketch of the scene. 
•	You can choose to pick up more or less detail in 
•	the edges depending on 
•	the particular algorithm and parameters used. 
•	Image segmentation is another popular area 
•	of study in computer vision. 
•	Various algorithms exist to help divide a picture into 
•	various parts or objects to 
•	assist in providing meaning to that image. 
•	The goal of most image 
•	segmentation algorithms is to assign 
•	a value to each pixel 
•	and group associated pixels together. 
•	These groupings can be colored and redrawn as shown in 
•	the right image to help detect or 
•	classify objects in that image. 
•	While all of these are great examples of computer vision, 
•	we haven't yet seen how 
•	machine learning fits into the picture. 
•	As I just showed, 
•	computer vision is not 
•	the same thing as machine learning. 
•	However, machine learning can be 
•	a very useful tool for computer vision, 
•	and computer vision can be 
•	a very useful tool for machine learning. 
•	Both fields are usually considered 
•	to be a part of artificial intelligence. 
•	They are different from each other, 
•	but there is some overlap. 
•	In this course, we will focus on 
•	using machine learning to accomplish 
•	computer vision goals but there are plenty of 
•	things in computer vision that we will not cover. 
•	Specifically, we will go over 
•	image classification and 
•	object detection using neural networks. 
•	Image classification is the process of 
•	attempting to comprehend an entire image. 
•	For example, we might train a classifier to recognize 
•	the first image as that of a dog 
•	and the second image as that of a cat. 
•	It would not be able to tell you 
•	where in the image each animal was found, 
•	just that the image contained that animal. 
•	However, it would likely 
•	struggle with an image like this, 
•	which contains instances of both animals. 
•	It would make a guess based on 
•	prominent features in the image 
•	and depend on where the model 
•	looks in the image for those features. 
•	Object detection is 
•	a harder problem than image classification, 
•	but it allows us to identify things 
•	in a picture and where they are located. 
•	It also allows us to 
•	identify more than one object per image, 
•	which is a big limitation of image classification. 
•	OpenMV comes with a person detection example. 
•	Here, if the camera thinks 
•	there is a person in the frame, 
•	it will update the label in 
•	the output image to show that. 
•	This could be useful for determining 
•	if someone is at your front door or 
•	maybe monitoring a room to 
•	automatically control the lights and air conditioning. 
•	Now, let's say you've been 
•	tasked with designing a new smart lighting, 
•	heating, ventilation, 
•	and air conditioning system for an office building. 
•	Rather than old passive infrared sensors, 
•	you've decided to deploy person detection cameras, 
•	which you found to be much more reliable at 
•	determining when someone is actually in the room. 
•	However, you need 30 of them 
•	to cover all the office spaces. 
•	You could stream all this video data to 
•	a central server on the network or across the Internet. 
•	This server would be in charge of doing 
•	the vision processing to determine if 
•	a person was in each frame. 
•	Let's calculate what kind of 
•	bandwidth you might require to do that. 
•	We'll assume each camera needs a modest 240 by 
•	240 pixel resolution to 
•	correctly identify people in a frame. 
•	We don't need color, so each frame is 
•	a grayscale image where each pixel is an eight-bit value. 
•	We'll need 30 cameras, 
•	and we'll say that each camera really only 
•	needs to take a photo once every second. 
•	This isn't a live video stream, 
•	we just want to know if someone is in the room every 
•	second to make changes to 
•	the lights and air conditioning systems. 
•	Under these conditions, we'd need around 13.8 megabits 
•	per second of network capacity 
•	devoted entirely to this new sensor system. 
•	Of course, there are ways to 
•	compress the images to reduce this sum, 
•	but you get the idea. 
•	While modern Wi-Fi can support 
•	at least 10 times that amount, 
•	it's still seems like a big waste. 
•	Alternatively, we could move 
•	that classification problem to the cameras themselves. 
•	These smart cameras could be just like 
•	the OpenMV camera demo I showed you a moment ago. 
•	Each camera would perform 
•	whatever inference was necessary to determine 
•	if a person was in the frame 
•	and just send that result to the server. 
•	Now, we essentially need one bit for that value. 
•	Was a person in the frame or not? 
•	Thirty bits per second is a lot less 
•	than 13.8 megabits per second. 
•	I'm making some assumptions about 
•	minimum packet length and message headers, 
•	but you get the idea. 
•	As you can see, using micro-controllers or 
•	low-power computers can save 
•	bandwidth and processing power on remote servers. 
•	This form of embedded computer vision 
•	offers an alternative to streaming 
•	raw or even compressed data to 
•	a centralized location for processing. 
•	Self-driving cars rely on a variety of sensors 
•	to help them navigate the roads and avoid collisions. 
•	Most used cameras with a combination of 
•	radar or distance sensors to help 
•	create a clear picture of objects in the distance. 
•	The car needs to use computer vision and 
•	likely some machine learning 
•	to figure out what's around it. 
•	For example, it needs to be able to 
•	read road signs and signals, 
•	watch for other cars and avoid pedestrians. 
•	Object detection can help 
•	the car see these things and take 
•	appropriate actions like turning on 
•	a light or stopping for pedestrians. 
•	You can't necessarily guarantee 
•	an Internet connection in cars, 
•	so a lot of this has to be 
•	computed in the car's computer. 
•	While a car can transport a powerful computer, 
•	it's still somewhat of an embedded system. 
•	I hope this helps illustrate the need for using 
•	embedded machine learning to tackle 
•	some computer vision problems. 
•	Alex Fred-Ojala talked about the ethics of 
•	data acquisition and machine learning 
•	algorithms in the introductory course. 
•	Let's revisit that topic 
•	and see how it applies to computer vision. 
•	Alex talked about the three pillars that help 
•	create trust in an artificial intelligence system. 
•	The system should follow laws, 
•	be robust against any sort of attack, 
•	and guarantee high reliability. 
•	They should also guarantee fairness by 
•	not promoting any sort of discrimination, 
•	biases or social injustices. 
•	For example, here is a tweet that went viral in 2017 
•	showing how a soap dispenser 
•	struggles to work with a dark-skinned individual. 
•	Bias in a soap dispenser is pretty benign and that 
•	soap dispenser likely wasn't 
•	using machine learning anyway. 
•	However, you can see how this might be a problem for 
•	critical computer vision systems like self-driving cars. 
•	If you were designing a system to 
•	work with and for people, 
•	make sure you take everyone into 
•	account when training and testing the model, 
•	not just people who look like you. 
•	There's also the notion of privacy. 
•	Are you creating a new type of 
•	smart security camera that says, 
•	records every face that walks by a corner. 
•	Let's say you then attempt to identify each person by 
•	matching their face to 
•	available photos on their social media account. 
•	Even if this is legal, 
•	you have to consider the privacy implications 
•	of this type of project. 
•	Do these people consent to having their faces and 
•	possibly names recorded whenever they walk by the corner? 
•	For more information on ethical and trustworthy AI, 
•	I highly recommend checking out 
•	the European Union's AI Alliance page. 
•	They offer some good guidance on 
•	the various factors that make up an ethical AI system. 
•	Licenses.ai has some good templates 
•	for creating end-user license agreements. 
•	These cover various ethical concerns 
•	that someone creating 
•	an AI system might have and hopefully prevent its misuse. 
•	Edge Impulse uses a similar responsible AI license 
•	that outlines how you may or may not use their tool. 
•	I definitely recommend reading 
•	through this license before getting started. 
•	I hope this has helped give you 
•	an understanding of how embedded machine learning 
•	can fit into computer vision and how you 
•	can use it to create responsible AI systems. 
•	Now I think it's time we dive into some technical stuff.
•	Before we go over using images with machine learning, 
•	I'd like to cover some concepts about how 
•	digital images are made and stored on your computer. 
•	Some of you may be familiar with these concepts already, 
•	but I find that it provides 
•	a useful vocabulary when working with images. 
•	Let's start with a simple grayscale photo. 
•	Then let's zoom way 
•	in on a portion of this elephant's ear. 
•	Digital photos are made up of a grid of 
•	simple building blocks known as 
•	picture elements or pixels for short. 
•	This grid of pixels can be expressed 
•	as a simple two-dimensional array of values. 
•	Let's take an even smaller subset 
•	of these pixels to examine. 
•	One way to express these pixel values is by 
•	using a number between zero and one, 
•	where zero is black and one is white. 
•	This could be interpreted as the amount of 
•	light being given off or reflected by each pixel. 
•	White is 100 percent or one. 
•	However, storing and doing math with 
•	floating point numbers like this 
•	is often difficult for computers, 
•	especially low power devices like microcontrollers. 
•	So one way to handle that is to quantize 
•	these values to some integer values 
•	that fit nicely into bytes. 
•	For example, we can quantize 
•	those 0-1 percentage values to one byte or eight bits. 
•	Now, zero is black and 255 is white. 
•	However, this means that 
•	only 256 shades of 
•	gray can be represented by these values. 
•	This is known as bit depth or colored depth. 
•	Each pixel or element of this array is 
•	an eight-bit number that 
•	describes the shade of gray to be displayed. 
•	Higher bit depth for these grayscale images 
•	means that more shades of gray can be displayed. 
•	Remember that we were only looking at 
•	a small piece of the whole image. 
•	The original image contains 2,290 
•	pixel columns and 1,487 pixel rows. 
•	In other words, the image is 2,250 
•	pixels wide and 1,487 pixels high. 
•	These dimensions are known as 
•	the resolution of the image. 
•	You will almost always see 
•	resolution expressed as width by height. 
•	You can find how much space this raw photo 
•	would take up by multiplying the width by the height, 
•	by the number of bytes per pixel. 
•	If stored raw, this photo would 
•	need around 3.4 megabytes. 
•	However, most image formats need 
•	some bytes reserved for header information, 
•	so this might be higher. 
•	Additionally, many image formats like 
•	JPEG use one or more algorithms to compress the image, 
•	resulting in a smaller file size. 
•	Lossy compression like JPEG 
•	can lose some information in the data, 
•	resulting in a slightly imperfect picture. 
•	We won't get into compression in this course, 
•	but know that it's what allows us to store 
•	digital images in smaller files, 
•	than what we calculated here. 
•	Let's examine our five by 
•	four segment of the elephant's ear using Python. 
•	We'll use NumPy to store our number arrays. 
•	NumPy is incredibly popular 
•	in the machine learning community as 
•	it's free and offers 
•	efficient ways to perform matrix operations. 
•	We'll also need pyplot from 
•	the matplotlib library to view our array as an image. 
•	PIL, short for Python imaging library, 
•	is a common Python libraries use 
•	to read and write to various image files. 
•	All three of these packages 
•	should come pre-installed in Colab. 
•	Remember that you can press "Shift 
•	Enter" to run a cell in Colab. 
•	Next, I'll upload the image. 
•	I'll open the image in an editing program. 
•	I cropped out the tiny section 
•	from the elephant photo that we looked at earlier. 
•	I saved this test image in bitmap or 
•	BMP format with a bit depth of eight bits. 
•	The bitmap format is not compressed, 
•	so it's useful for storing and working with raw images. 
•	If I zoom in, you can see that it really 
•	is just a five by four grayscale image. 
•	In Colab, I can click on the 
•	"File browser" and click the "Upload" button. 
•	I then find the five by four 
•	bitmap image and click "Open". 
•	If we go up one folder, 
•	you can see that we're working with 
•	a Linux instance on a remote computer. 
•	Colab is limited to 
•	essentially python and a few system calls, 
•	but it's very helpful for working with 
•	things like TensorFlow for machine learning. 
•	Our files will be stored in the content directory. 
•	I save the path to 
•	the uploaded file in this image path variable. 
•	Note that you can also right-click on 
•	the file and select "Copy path". 
•	Next, I use PIL to open the image. 
•	PIL attempts to automatically 
•	scale the values in the pixels. 
•	We need to call the convert function with 
•	the L parameter to keep them 
•	in the eight-bit gray-scale format. 
•	This image object has 
•	some extraneous information as 
•	it's unique to the PIL library. 
•	However, we can call the NumPy array as 
•	array function to convert it to a NumPy array. 
•	All NumPy arrays have a dot shape 
•	attribute that we can 
•	print to see the shape of the array, 
•	even though image resolution is given as width by height, 
•	two-dimensional NumPy array shapes 
•	are given as number of rows first, 
•	followed by number of columns. 
•	This means arrays are given as height by width. 
•	If I talk about image resolution, 
•	it will be width by height. 
•	If I talk about two-dimensional NumPy arrays, 
•	it will be height first, then width. 
•	When we print the array, 
•	you can see the pixel values that we saw earlier. 
•	These go between zero and 255. 
•	We can also normalize 
•	the array by dividing all the values by 
•	255 to convert the pixels to that 0-1 scale. 
•	You normally don't want to store 
•	an image in this floating point format, 
•	but this will be helpful later when 
•	working with some neural network inputs. 
•	Finally, we can use pyplot to draw the image for us. 
•	Note that we want to draw the eight bit grayscale image, 
•	and we need to tell imshow to 
•	use the grayscale map for drawing and that it should 
•	expect a minimum value of zero and 
•	a maximum value of 255 for each pixel. 
•	For the first couple modules in this course, 
•	we will stick to grayscale images. 
•	In many computer vision applications, 
•	you will find that color is probably not necessary. 
•	However, in some cases, 
•	it is necessary as it can 
•	convey extra information about the image. 
•	Let's zoom in on a section of this sea turtle. 
•	Here, we have another five by four section of pixels, 
•	but they're in color this time. 
•	Instead of a bit depth of eight bits, 
•	each pixel now contains 24 bits of information. 
•	Eight bits describe the amount of red in the pixel, 
•	eight bits are for green, 
•	and eight bits are for blue. 
•	Now, each pixel has three bytes needed to describe it. 
•	You can see that the bluish pixels have 
•	more of the blue channel than the others, 
•	and the reddish ones have 
•	more of the red channel present. 
•	As with the grayscale images, 
•	we can use more bits to 
•	describe colors than what we're showing here, 
•	but you'll often run into 
•	three bytes per pixel for many color images. 
•	Sometimes, you'll see an Alpha channel present. 
•	This determines the transparency of each pixel, 
•	and is common in image formats like PNG. 
•	We won't need to worry about 
•	the Alpha channel for this course. 
•	The red, green, blue, 
•	or RGB color model 
•	uses additive light to describe colors. 
•	The higher the value of one of those color channels, 
•	the more light is emitted in that color. 
•	We can combine the three different colors 
•	to produce any other color. 
•	When all three are at their max, 
•	they combine to create white. 
•	Computers use this model to 
•	interpret RGB images and then light up 
•	pixels on our monitors to 
•	display images in a variety of colors. 
•	I created a colab script to load a color image, 
•	just like we did for the grayscale image. 
•	However, I'm using that five-by-four pixel sample 
•	from the edge of the turtle shell. 
•	I use PIL to open the image, 
•	but I need to convert it to RGB format this time, 
•	then I convert it to a NumPy array. 
•	The first three elements are the red, 
•	green, and blue values of the first pixel. 
•	The next three elements belong to 
•	the second pixel in the first row. 
•	This group describes the five pixels in the first row. 
•	This continues to the last row of pixels. 
•	Here, you can see that the final pixel 
•	has more red and less than green and blue. 
•	We'll verify that in a minute. 
•	Now, let's draw the channels separately. 
•	You could extract each plot, 
•	but Matplotlib has a habit of 
•	coloring grayscale images in an odd manner. 
•	We're going to create three copies of the original array. 
•	In the first, we'll set all of 
•	the green and blue values to zero. 
•	In the second, we set all of 
•	the red and blue values to zero. 
•	Then in the third, we set red and green to zero. 
•	Notice that I can index into the arrays as follows. 
•	A colon means give me everything from that axis. 
•	Colon, colon zero is 
•	a two-dimensional array containing 
•	all the values in the red channel. 
•	Colon colon one would be 
•	all the values in the green channel. 
•	Finally, we print the channels separately. 
•	You can see how there's more red in 
•	the bottom right and more blue in the top left. 
•	There's a bright stripe of green going 
•	diagonally from the bottom left to the top right. 
•	Now, let's print all of these channels together. 
•	I hope you can see how 
•	those channels combined to form this image. 
•	There's more blue in the top left, 
•	mostly green in the middle diagonal, 
•	and a lot of red in the bottom right. 
•	I find it easiest to think about color images as 
•	a collection of three different two-dimensional arrays. 
•	When those arrays get combined, 
•	the computer is capable of producing 
•	nearly any color in the visible spectrum. 
•	As with grayscale images, 
•	there are lots of ways to compress them, 
•	but we won't get into that. 
•	Having the extra information in 
•	color channels can be useful but I recommend 
•	seeing if grayscale will meet your needs first as 
•	it uses less data and less computing power. 
•	I hope this helps you get an idea of 
•	how images are stored on your computer.

In order to create an image classifier, 
we first need to collect some data. 
There are plenty of pre-made datasets 
out there that include thousands of images, 
but I encourage you to try collecting your own. 
I will show you how to do this using the OpenMV camera, 
as well as a smartphone, 
but you are welcome to collect 
digital images in any way you see fit. 
The goal is the same. 
You'll want to collect around 50 images of 
the same object for each class you want to identify. 
I recommend starting with 
three or four classes so you can 
see how to work with multiple classes. 
You can choose to identify anything you want. 
Clothing, fruit, animals, and so on. 
There should be a large difference between the shapes of 
the objects as our model will be fairly simple. 
For example, the model might have 
trouble classifying breeds of dogs, 
but it has a good chance of working if it's trying to 
pick between dog and cat classes. 
Each photo needs to be scaled and 
cropped to 96 by 96 pixels. 
They can be colored or grayscale, 
but we will ultimately convert everything to 
grayscale to make the model 
smaller and easier to understand. 
We will also resize or scale these images to make 
them smaller before feeding them to our neural network. 
Additionally, you will want them in bitmap or PNG format, 
as those are commonly used uncompressed formats. 
The object you're trying to identify 
should be mostly centered in the image, 
and take up a large portion of the frame. 
Multiple photos should have 
the same object in a similar position, 
with similar lighting, 
and the same background every time. 
You'll also want to keep the camera at 
about the same distance from the subject each time. 
The background should be the same among all your classes. 
If you want to train a model to identify something 
in a variety of situations, lighting conditions, 
and positions, you're going 
to need a lot more than 50 images, 
probably on the order of a few thousand. 
You'll also likely need 
a more complex model which we'll explore later. 
But for now, try to keep everything about the same. 
I collected photos of 
a few different electronic components: a resistor, 
a capacitor, a diode, and an LED. 
I have 50 photos of each 
stored in a folder named after the class. 
Note that I only used one component for each. 
I didn't try to use different sizes, 
shapes, or colors of LEDs, for example. 
Also, I highly recommend 
collecting some photos of just the background. 
This will be its own class. 
Many times, you'll find that you want to 
identify when something is in the frame or not, 
such as detecting a person in a room. 
You'll want photos of the empty room, 
or of the white background with 
no electronic components, in my case. 
You are welcome to use my dataset 
if you do not want to collect your own. 
Head 
to github.com/shawnhymel/computer-vision- 
with-embedded-machine-learning. 
Click on the Datasets folder, 
and download the electronic-components ZIP file. 
BMP files are good for examining raw data, 
but you'll ultimately need 
the PNG files for uploading to Edge Impulse. 
Unzip it somewhere on your computer. 
Feel free to look through the folders. 
Each folder has 50 color images in them. 
I kept them in color in case you 
wanted to try working with color images, 
but we'll be converting them to grayscale in 
a future project to train the actual classifier. 
Note that the images are fairly 
similar with little variation. 
I tried to keep the component body 
close to the center of the image. 
The leads point either left or right, 
the capture photos with the OpenMV or 
Portenta, head to openmv.io. 
Go to downloads, and download the latest OpenMV IDE. 
Run the installer, accepting all the defaults. 
If you're working with the OpenMV H7 basic model, 
you'll want to use a microSD card as there's 
not enough internal storage to store images. 
Make sure it has been formatted 
with the FAT32 file system. 
Plug the SD card into the OpenMV camera, 
and plug the board into your computer with a USB cable. 
If you're finding that your photos are not in focus, 
you can adjust the focus of the lens by 
unscrewing the set screw and twisting the top. 
This might take some experimentation 
to get the images to look great. 
In the same GitHub repo, 
go to the Data Collection folder, OpenMV, 
and view the raw code for 
ImageCapture.py. Copy this code. 
Paste the code into a new file in the OpenMV IDE. 
Feel free to look through this code, 
and see how we capture and store images to the SD card. 
Note that we initialize the camera with a 
320 by 240 QVGA resolution, 
but we crop it to 96 by 96, 
which is what ultimately gets displayed and stored. 
Whenever we run this program, 
it will show what the camera sees in the upper right. 
Count down from three, 
then snap a photo, 
which it saves to the internal storage or SD card. 
Let's run it to collect a couple of samples. 
Click the "Connect" button. 
If asked, agree to update the firmware on your board. 
Click the "Serial Terminal" button to 
open a console connected to the board. 
Click the "Run" button. 
It will take a moment to initialize, 
so use that time to frame your object. 
You should see it count down from three. 
When it reaches one, 
it should flash black for a moment to 
let you know that the photo is being saved. 
It should print the name of 
the path and file to the serial terminal. 
If you want to take another photo, 
you'll have to run the program again 
by clicking the Stop button and clicking "Start." 
Continue doing this until you get your entire sample set. 
Remember, try to keep the object in 
about the same position with the same background, 
same lighting, and about 
the same distance from the camera every time. 
Note that I took pictures of my electronic components on 
a white background with a desk lamp 
nearby to provide consistent lighting. 
You can move the camera around a little bit, 
but try to make sure the object is centered and in 
about the same orientation or position for each photo. 
To copy your files, 
you'll need to either remove the SD card and plug it into 
a reader on your computer or cycle 
power on the open MV board. 
This should cause it to 
enumerate as a mass storage device. 
Go into the drive and copy the photos. 
I recommend taking a group of photos of the first object, 
copying those photos out to 
a directory with that object's name, 
and then repeating the process for the next object. 
You can also capture 
these images with your phone or webcam. 
Take a bunch of photos of your objects. 
Note that the cameras on 
most modern phones have a pretty high resolution. 
As a result, you'll want to open each image in 
an editing program and crop and scale as necessary. 
Save them as 96 by 
96 bitmap or PNG files in a folder with the class label. 
Finally, you can also use your Raspberry Pi. 
I recommended the Raspberry Pi 4 as that 
seems to work best with the edge impulse examples. 
I'll be using the official Pi camera as I know that 
works with the Pi and 
the examples in the rest of the course. 
You're welcome to try a USB webcam, 
but I can't guarantee it will work. 
Since I probably don't have your exact model of webcam, 
it will be really difficult for me to help 
troubleshoot any issues you might run into. 
So I recommend using the Pi Camera if you can, 
as I won't be able to help you with webcams. 
You'll want to load the latest version of Raspbian 
onto a micro SD card and plug it into your Pi. 
Make sure the camera is connected and 
plug in the USB-C connector for power. 
In a terminal, run, sudo Raspi-config. 
Go to interfacing options and select "Camera." 
Click "Yes" to enable the Raspberry Pi camera interface. 
You might need to reboot the Pi when you're done. 
For this course, you'll want to make sure you are working 
with Python Version 3.7 or newer. 
From the same GitHub repository 
we were looking at earlier, 
you can find two scripts for 
previewing the camera and capturing images. 
Copy the code from the previous script 
into a new Python document. 
When you run it, 
it will simply provide an output of 
what the camera sees in a preview window. 
The capture script will 
save whatever is in that preview, 
so make sure your objects are in focus and can be seen. 
If you're working with close-up objects 
like electronic components, 
you can adjust the focus by carefully twisting 
the lens housing with some needle nose pliers. 
When you're done, click on the console and 
press control C to exit the preview script. 
If you have the preview window selected, 
you need to press the Q key to exit. 
Then enter python Pi camera 
capture.py to run the capture script. 
This will give you the same viewfinder and you'll have 
about five seconds to put 
the subject in the preview window. 
The script will take a photo and 
automatically save it as an image file. 
Just like with our Open MV example, 
these files will be sequentially numbered. 
I recommend using the PNG format 
as it will save us a step later on. 
PNG is a compressed image format, 
but it is lossless, 
which means we can fully recover 
the uncompressed image using some math. 
Feel free to look through 
the preview and capture scripts. 
The capture script is similar to the one on the Open MV, 
but it uses the Pi Camera Package. 
Note that this only works with 
the Raspberry Pi camera and not a USB webcam. 
The script starts the camera 
and displays a preview window. 
It then counts down from five and then saves 
the last image captured from the camera to a file. 
There are a bunch of pre-made image 
datasets out there that machine 
learning folks like to play with. 
You're welcome to try using one of them, 
but I find that some are very trivial, 
like the MNIST digit set. 
Others like Fashion MNIST 
don't look anything like real photos. 
Some have so many images with thousands of photos 
that it can be unwieldy to work 
with for this intro project. 
Also, many of these pre-made datasets 
do not contain background samples, 
which makes it harder to do things like object detection. 
In reality, if you're training 
an embedded machinery model for a vision project, 
you're probably going to have some special use case in 
mind where you'll need to collect 
your own dataset anyway. 
Because of that, I recommend trying to take 
your own photos and construct a dataset if you can. 
It might take some time to collect all of these photos, 
so be patient with it. 
We're going to need them in our upcoming project.

## Python and Numpy Help
We will be using Python (and the microcontroller variant MicroPython) throughout most of this course. If you are new to Python or you need a refresher, I highly recommend checking out one (or more) of these articles and tutorials:
NumPy is a popular open-source Python package used for performing various array and matrix operations. It is often used with SciPy for statistics, research, and machine learning. We will use some NumPy throughout the course to examine and manipulate images (which are essentially just arrays of numbers). I recommend checking out at least one of these tutorials to get a feel for using NumPy:

## Project - Load and Manipulate Images
Introduction
Welcome to the first project of the course! In these projects, I’ll ask you to implement many of the concepts we covered in previous lectures. Some will require you to use Python in Google Colab to perform file and image manipulation, and others will require you to create a machine learning model in Edge Impulse to be deployed to your embedded system.
To start, we’ll collect some images to create your dataset, review some Python and Numpy commands, and practice manipulating those images (e.g. viewing the pixel values, drawing on them, and resizing them).

Project - Load and Manipulate Images
Introduction
Welcome to the first project of the course! In these projects, I’ll ask you to implement many of the concepts we covered in previous lectures. Some will require you to use Python in Google Colab to perform file and image manipulation, and others will require you to create a machine learning model in Edge Impulse to be deployed to your embedded system.
To start, we’ll collect some images to create your dataset, review some Python and Numpy commands, and practice manipulating those images (e.g. viewing the pixel values, drawing on them, and resizing them).

### Option 3: Smartphone
You should not need to perform any special steps to configure your smartphone to take photos.
Capture Dataset
For this module and the next, you will need a dataset for image classification. I highly recommend following along with this guide to create your own dataset. 

Note that for this process, you will want to capture at least 50 images for each class of object. I recommend aiming for the following:
•	There should only be one object in each image
•	The object should be at about the same distance, angle, and lighting in each image
•	You can move the camera or object around some, but try to maintain a high level of similarity between shots
•	Use the same background for all images
•	I recommend collecting various images (at least 50) of the background to create a “background” class
Option 1: Raspberry Pi
If you are using the Raspberry Pi, you will need to install libatlas, Numpy, and OpenCV (for Python):
sudo apt install libatlas3-base
python -m install numpy
python -m pip install opencv-python

Save the document as something like pi-cam-preview.py.
Run the script with the following:
python pi-cam-preview.py
You should see a window pop up showing you whatever the Pi Camera sees. I recommend using this time to adjust the focus of the Pi Camera. Use a pair of needle-nose pliers to carefully rotate the lens on the camera module. Make sure that your subject is in focus before proceeding.
I used electronic components in my dataset, so I needed to adjust the focus so that close-up objects were not blurry.
Once you are happy with the focus, stop the program by pressing ‘q’ (if the preview window has focus) or pressing ‘ctrl-c’ (if the terminal has focus).

python pi-cam-capture.py
Point your Pi Camera at your first object. After about 5 seconds, the script will capture a still image and automatically save it to your Raspberry Pi (with a name like 0.png).
 
Repeat this process until you have at least 50 images for each category.
Option 2: OpenMV
If you are using the OpenMV Cam, open the OpenMV IDE. You can run the default program to get a preview of what the camera sees. Press the connect button (bottom-left of the IDE) to first open a connection to the camera. Then, press the run button (also in the bottom-left) to run the program. You should see a preview of the captured image stream in the top-right of the IDE. You can press the zoom button just above the preview to make the preview bigger.
 
Use that time to adjust the focus. Depending on the lens and sensor, you may need to loosen a set screw or loosen the set nut in order to twist the lens. Turn the lens until your subject is in focus.
I used electronic components in my dataset, so I needed to adjust the focus so that close-up objects were not blurry.
Once you are happy with the focus, stop the program by clicking on the stop button in the bottom-left of the IDE.

Plug the SD card into your OpenMV Cam.
Run the program. It will count down from 3 and save a still image to the SD card with a name like 0.bmp.
 
Repeat this process until you have at least 50 images for each category.
Smartphone
You can use your smartphone (or any other camera) to capture images for this process. You will want to capture at least 50 images for each class of object.
Use an image editing program (such as GIMP) to crop and zoom each image as necessary. Your images should each be 96x96 pixels, and the object should be centered as much as possible.
 
Dataset Curation
Once you have collected all of your images, you will want to organize them into a set of folders on your computer. I highly recommend creating the following directory structure (note that the images might be .bmp or .png, depending on the method used to capture them):
•	Name of dataset
•	|- Class 1
•	|--- 0.bmp
•	|--- 1.bmp
•	|--- …
•	|- Class 2
•	|--- 0.bmp
•	|--- 1.bmp
•	|---…
•	|- …
This method helps you quickly know which image belongs to which class, and you can use Python to read the folder name to discover the class label. Here is how I organized my dataset:
 
The names of the individual images do not matter, as these will be randomized in a later step. You want every image to be the same size (96x96 pixels), same bit depth (24-bit color or 8-bit grayscale), and same format (PNG or BMP). We will convert everything to grayscale in a later step.
Note: If you are using my pre-made electronic components image set, simply download it and unzip it. All images should be in appropriately named folders as described above.

Conclusion
Learning to use Python (or some other high-level language, like MATLAB) can be incredibly useful when working with machine learning. Even in embedded machine learning, practitioners will often use Python (or tools like Edge Impulse) to examine data, curate datasets, extract features, train models, etc. before moving to the embedded system.
Once you have a working model, you’ll often need to test the feature extraction and inference pipeline before deploying the model to an embedded system. As a result, a high-level language (e.g. Python) with a linear algebra system (e.g. Numpy) can help you quickly test everything. Once you are sure that everything works in Python, you can begin to convert the system to a lower-level language, like C++.
[MUSIC] The first problem we're going to tackle is 
image classification. 
It's one of the most important challenges in computer vision because it 
attempts to answer the question, what am I looking at? 
Remember that a digital image is an array of numbers. 
There might be three of these two dimensional arrays for a color image. 
If we load this image as an array into NumPy, 
it would have a shape of 1536 by 2048 by 3. 
Here's the problem. 
As humans, our brains are wired for image recognition. 
We can easily tell that this is an image of a cat. 
If we shift all the pixels over by one, we can still tell it's a cat. 
However, if we just did a strict comparison of the arrays, 
the computer would tell you it's a very different image. 
Image classification is the process of comprehending or 
assigning a label to an image as a whole. 
For example, 
we might want to train a model to identify the type of animal present in this photo. 
A well trained model would tell us that it's a cat. 
We are not so worried about figuring out where the cat is in the photo. 
We just care that it's a photo of a cat. 
The most basic classifier is the binary classifier, 
where you only have two classes. 
This might be identifying cat versus dog. 
This is known as a one-versus-one classifier. 
However, if you feed an image to this classifier that's neither a cat nor 
a dog, the model will still predict one of those two classes. 
Alternatively, you can train a model to positively identify a cat as 
opposed to anything else that's not a cat, 
which is known as a one-versus-rest or one-versus-many classifier. 
Here, cats would be considered the positive case, and 
everything else would be the negative case. 
Our brains are much more capable than doing simple binary classification. 
However, binary classifiers can still be incredibly useful in computer vision 
if all you need to do is, for example, identify if there's a person in the frame, 
or if a part of an image contains something that looks like, say, a tumor. 
In addition to binary classifiers, we also have multiclass classifiers. 
Rather than saying something is either cat or not cat, 
we want our model to be able to determine if the image is of a cat, dog, person, 
dolphin, bird, empty field, and so on. 
This is closer to how our brains work. 
We are constantly identifying animals, people, and other objects. 
However, it turns out to be quite difficult to create 
a single model that does this. 
There are two basic approaches to tackling the multiclass problem. 
The first is to expand the one-versus-one technique and 
training model for each possible pairing of the desired classes. 
For example, the first model would compare cat versus dog and 
give us the probabilities that it thought the input image belong to each of those 
labels. 
Then it would compare cat versus human and give the probabilities. 
It would continue doing this for 
all cat versus other classes it had been trained for. 
Then it would compare dog to each of the classes to get those probabilities. 
It would continue doing this until every possible pairing had been analyzed. 
The class with the most predictions or votes would be labeled as the winner and 
be given as the output of the multiclass classifier. 
As you add more classes, you will need to train more models for this to work. 
The other approach is to perform as many one-versus-rest comparisons as 
there are classes. 
So you would compute the probability that the input image is a cat versus not cat. 
Then you'd compute the probability that the same image is a dog versus not a dog. 
You'd continue this for all the classes. 
Whichever class has the highest probability would be declared the winner 
and offered as the most likely label for the input image. 
As you can see, this requires far fewer models than the one-versus-one method. 
In fact, we would only need to train one classifier for each class. 
Well, neural networks and deep learning is the most popular way to 
accomplish this task at the moment, and it's what will be focusing on. 
Please remember that it's just one possible tool of many for 
solving classification problems. 
I'd like to talk about a few other algorithms for a minute, 
just so you are aware of what else is out there. 
The first is the k-nearest neighbors algorithm. 
Here, let's say we plot out the pixel values for all of our images. 
Note that for even something simple like our 5 by 4 pixel grayscale image, 
that would be a plot with 20 dimensions, which is kind of difficult for 
humans to visualize. 
So we're going to assume that our images have two pixels to make this visual 
representation easier. 
These are grayscale images with a bit depth of 8 bits, so 
each pixel can be between 0 and 255. 
Let's say that these are extremely low resolution photos of cats with white 
coats and low resolution photos of dogs with dark coats, 
you might expect something like this. 
The cat images would generally be lighter than the dog images. 
Each point on this chart represents a single two-pixel image, and 
we've labeled our training data as such. 
Note that there's really no training in this algorithm. 
All the processing is in the inference stage. 
So we capture a new two-pixel photo and 
we want to figure out which class it belongs to. 
In k-nearest neighbors, 
we would compare it to the closest images in the training set. 
With k=1, we find which of the training samples is closest to our unknown 
sample by computing the Euclidean distance to all the training samples. 
Let's say that we find the Euclidean distance to be 
the shortest to this sample, which has the ground truth label of dog. 
As a result, we would assign that label to our new sample and 
predict that it's an image of a dog with k=2. 
We would compare the unknown sample to the two closest labels and 
assign the label based on a simple vote. 
However, this would be a 50/50 split. 
So you need to find a way to break the tie, 
such as by looking at whatever the closest neighbor is. 
Alternatively, you could stick to odd numbers for k. 
With k=3, we'd look at the three closest samples in this plot. 
Here we have two dogs labels and one cat label. 
There are more dog labels, so we would assign dog as the label to the new sample. 
There are a number of problems with this algorithm, however. 
K-nearest neighbors suffers at high dimensions, 
which is common when working with large images. 
It also isn't really examining the subject of the image, but 
rather looking at pixel values. 
Images with similar colored backgrounds might be classified the same 
because the pixels and the backgrounds are so close in value. 
Finally, inference is actually the computationally expensive part. 
For embedded systems, we usually want most of the effort on training, and 
require as few resources as possible to do inference. 
Next is the support vector machine. 
This algorithm tries to find the best line or 
plane that linearly separates one class from another. 
However, its real power comes from something called the kernel trick, 
where we can use a mathematical transform to change the number of dimensions 
that these samples are plotted on. 
For example, we'd find a way to take these two dimensional points and 
plot them in three dimensional or four dimensional space. 
By raising the number of dimensions, it becomes easier to find a clear separation 
between the different groupings. 
Then, when a new sample is submitted to the model for 
classification, we would just see which side of the line or plane it falls on. 
This is on the cat side, so we'd give the image the label of cat. 
Support vector machines work well with high dimensions, but 
they lose some effectiveness when the number of training samples is less than 
the number of input dimensions of each sample. 
Remember that the number of input dimensions for images is determined by 
the total number of pixels times the number of color channels. 
For many applications, support vector machines work well, but 
are often outperformed by neural networks. 
There are other classifiers that I didn't talk about, like decision trees and 
linear classifiers, but let's look at neural networks. 
Neural networks, or artificial neural networks, 
were loosely modeled based on how neurons in the brain are connected. 
Layers consist of one or 
more nodes that perform a series of mathematical operations on the input data. 
The input to the first set of nodes would be our features. 
These features can be direct pixel values or something else, like a filtered image, 
or maybe the frequency of how often a particular color occurs. 
You will generally see all the input data to a neural network be normalized to 
values between 0 and 1 or between negative 1 and 1. 
This helps ensure that all the features are on the same scale. 
It can help make training faster, and 
assist in finding the optimal parameters for each node during the training process. 
No math is performed on these input features in the first layer. 
Each of these input values are copied to every node in the next layer. 
Any layers that are not input or output layers are considered to be hidden layers. 
The outputs from one layer are fed as features into the next layer. 
If we are doing a simple binary classifier, the final node 
would output the probability that the input image belonged to that class. 
Following our previous cat versus dog example, it might give us the probability 
that it thought an image belonged to the cat class. 
If that probability was below 0.5, 
we would say that the image should have the label dog. 
If it's above 0.5, we'd assign the label cat. 
Now let's extend that to the multiclass problem. 
For a neural network, the number of input nodes would be equal to the number of 
features we want to feed the network. 
We could have any number of layers or nodes in the middle, 
depending on our application. 
The output nodes would be equal to the number of classes we want to identify. 
Each output node would essentially be in charge of performing a one-versus-rest 
classification to determine the likelihood that the input image belonged to that 
particular class. 
Unfortunately, the output of each of these nodes might be scaled in 
such a way that we can't get a good sense of the probability of membership. 
We can use the softmax function as the last layer in our network to combine 
the outputs of the last set of nodes in order to give us a set of probabilities. 
It scales and normalizes the outputs as necessary, 
such that when you sum up all the probabilities, they will equal 1. 
So now we can interpret the output as being a set of probabilities. 
For example, let's say we have a trained neural network and 
we feed it a new unseen image. 
The output might look something like this. 
We can say that the neural network thinks the image is 63% likely to be a dog, 
26% likely to be a human, and basically negligible probability for anything else. 
From there, we can select the label with the highest probability or 
set a threshold of something like 0.5 or 0.6. 
If the output probability of a particular class is over that threshold, 
we can say that the model thinks the input image belongs to that class. 
While neural networks are amazingly versatile and useful, they're far from 
perfect, as they can easily suffer from a variety of biases and other issues. 
One of the biggest challenges in computer vision is lighting. 
If you train your model using well lit subjects like the one on the left, 
then the model will likely look for similar types of images. 
It will struggle with poorly lit images or ones where the subject is backlit. 
Our eyes and brains are incredibly capable of still picking out the cats in the right 
two images, but many computer vision models struggle with them if similar items 
weren't included in the training set. 
This means you either need to make sure your model is trained with 
images that contain a variety of lighting, or 
you need to find a way to ensure your subject is well lit during inference. 
Cats have an affinity for taking on a variety of shapes. 
Even most human children can still tell that these are photos of cats, but 
a machine learning model might struggle with them if it wasn't trained to look for 
cats in various poses. 
Additionally, the cat in the middle has its eyes closed. 
This is a variation that might prove problematic if the neural network was 
trained to look for certain shapes, position, and colors of the cat's eyes. 
What if the cat was looking away? 
Will the model still recognize it? 
As humans, we still instantly recognize that image as a cat. 
We want our model to be robust enough to figure out that all of these images 
are of cats. 
Another common problem is occlusion, where something is blocking or 
covering up part of the subject in the image. 
Once again, we can easily recognize the cats in these photos, but 
a model may struggle if part of the cat is hidden. 
Something else that I've run into is that models trained on photos do not perform 
very well with paintings or illustrations, like the cat on the far right. 
They have a hard time generalizing the shape of something. 
This is true for even many complex neural networks. 
Finally, is this a picture of a cat or a dog? 
Most humans can see that this picture contains multiple cats and 
multiple dogs, but a model trained to look for 
one or the other in an entire image would really struggle with something like this. 
One way to tackle this problem might be to only look at one section of the image at 
a time and determine if the animal in question is present. 
Then slide this window over to perform inference on another section. 
This is known as a sliding window, 
which will examine in more detail when we talk about object detection. 
Just because we don't have human-level comprehension of images yet 
doesn't mean that image classification isn't useful. 
One popular use case is determining if a human is present in the frame of 
a security camera. 
Person detection can be useful for businesses and homes. 
Rather than having someone watch security feeds all the time, 
you can just send a notification with a small video clip if a person was detected. 
We don't even need to know exactly where in the frame the human was, 
just that there was one present. 
We can do something similar for animals to if we're say, trying to identify wildlife. 
Image classification is a powerful tool, and it's a great start for 
trying to automatically assign meaning to pictures. 
[MUSIC]
I'd like to take a moment and review neural networks. 
These are concepts that I covered in 
the intro to embedded machine learning course. 
If you'd like to get more details about 
the things that I'm talking about in this lecture, 
I recommend going back to 
that course and reviewing those lectures. 
In supervised learning, we 
start with a set of training data, 
each sample consists of a set of 
features and an associated label. 
The features are given by x and the label is given by y. 
Features might be a collection of 
pixels that form an image. 
A label is what's in that image, 
like a cat or a dog. 
The goal is to find some kind 
of function where we plug in x, 
and it gives us the correct label. 
In situations where there is 
no easily derived function to map x to y, 
as in the case of 
many observed phenomena or 
things with lots of probabilities, 
we can rely on machine learning to help us get close. 
To do that, we train a model 
which will act as our function. 
Remember that a model is 
usually nothing more than a series of 
mathematical operations performed on 
the input data in order to produce an output. 
That output is known as a prediction, 
and it's often given as y-hat. 
By using a training algorithm, 
we can automatically adjust the parameters 
inside the model so that for any given x, 
y-hat will hopefully be close to 
the ground truth label for that feature set. 
From this, we want the model to generalize the features, 
and patterns in the data well enough so that we 
can give them model 
new data that's not in the training set, 
and it will still be able to assign 
the correct label to that set. 
This is known as inference. 
We can set aside some of 
our dataset to be used for validation 
and testing to make sure that 
the model has generalized well enough. 
There are other forms of learning, 
such as unsupervised learning 
and semi-supervised learning. 
But we're going to stick to 
supervised learning in this course. 
A feature is an individual measurable property 
or characteristic of a phenomenon. 
For example, the x, y, 
and z measurements from an accelerometer 
taken at a point in time might be used as features. 
These would make up the features for one sample. 
In this case, a single sample X 
would have three features, 
and we would say that X has three dimensions. 
If we were to plot this sample as a point, 
we would need x, y, 
and z axes to show how 
this sample differs from other samples. 
However, this feature does not capture how 
an accelerometer might be moving through time, 
so maybe we use one second's worth of 
accelerometer data that's been sampled 
at 100 hertz with x, 
y, and z data, 
that would give us 300 features for one sample. 
To plot a single sample, 
to compare it to other samples, 
we need a 300 dimensional plot, 
which is a little difficult for humans to visualize. 
However, 300 features is still perfectly 
valid as an input to a machine learning model. 
Just remember that if we construct a model 
that expects a given number of features, 
we should always use 
that number of features in our dataset. 
Otherwise, the model will 
not be able to work with the data. 
We can also perform some math to transform 
or extract certain features from the raw data. 
For example, we could compute the Fourier transform of 
the accelerometer data to see what 
the frequency components of that data looks like. 
If we're mostly interested in 
certain frequencies of that accelerometer, 
this can help them model learn what to 
look for much more quickly and 
can help reduce the amount of data 
that needs to be fed to the model. 
We don't need to be limited to 
just one set of features, too. 
We could also, say, 
compute the power spectral density 
of our measurements and 
use that along with 
the frequency data as our set of features. 
Just remember that the more features 
you send to the model, 
the more computationally expensive it becomes. 
As a result, you often want to 
spend some time thinking about, 
and experimenting with different features to figure out 
what works best for 
your particular model and application. 
For images, we might use 
raw pixel values as our features. 
I show them here as being 
normalized between zero, and one, 
we might need to flatten the two-dimensional image into 
a one-dimensional vector in 
order to feed it to the model. 
We would obviously use 
more features if we wanted to include the red, 
green, and blue channels for a color image. 
Alternatively, we could filter the image to 
pick out things like color blobs or edges, 
and feed that to the model. 
We can even have the model automatically 
figure out which features to pick out in an image, 
which we'll see when we talk 
about convolutional neural networks. 
As you can see, as the image gets larger, 
the number of features per sample also gets larger. 
We can easily have 
hundreds or thousands of dimensions for a single sample, 
when we use images as the input to our model. 
To start, we usually want to collect a dataset. 
This could be downloading a set made by 
someone else or creating your own. 
For supervised learning, you will need to 
assign a label to every sample in that set. 
This might be the name of 
the folder each group of samples is 
stored in a list of files and labels in a text file, 
or simply a separate array in Python 
that stores the labels in the same order as the samples. 
Then you'll usually want to 
randomly divide the dataset into test, 
validation, and training sets. 
The test set should be put aside and not 
touched until you are ready to 
report on the results of your model. 
Any tweaks made to the model based on things learned from 
the test set could result in 
the model overfitting to the dataset, 
including the test set. 
The validation set is often used during 
training to test how well the model is performing, 
and to adjust model hyperparameters. 
Hyperparameters are things like the size, 
and shape of the model, 
number of training iterations, and so on. 
The training set is used during 
the training process to automatically update 
the parameters in the model using 
whichever training algorithm you choose. 
Depending on the size of the dataset, 
setting aside 10-20 percent 
of the samples for a test set, 
another 10-20 percent for the validation set, 
and leaving the rest for 
training is usually a good start. 
Let's review what's going on inside a neural network. 
A node or neuron makes up 
the fundamental building block of a basic neural network. 
A weighted sum of the inputs is first calculated. 
The inputs could be our features if 
this is a node in the top layer of the network, 
or it might be outputs from other nodes. 
Note that there is no x_0 as the weighted sum includes 
a bias term which is often given by b or w_0. 
This is added to the weighted sum of the inputs. 
The weights, and bias terms make up the parameters that 
will automatically be adjusted 
during the training process. 
As a result, w_0, w_1, 
and so on are usually 
randomized when the network is first created, 
and then take on new values 
throughout the training process. 
The output of the weighted sum is then 
transformed through an activation function. 
The activation function acts to add 
some level of non-linearity in the network. 
Without it, the network would simply be 
a collection of linear equations, 
and would struggle with 
anything beyond basic classification. 
Some activation functions, 
like the hyperbolic, tangent, 
or sigmoid functions also 
act to squash or constrain the output. 
For example, the sigmoid function ensures that 
the output of the node is between 0 and 1. 
One of the most popular activation functions 
is the Rectified Linear Unit, 
or ReLU for short. 
The ReLU function is very easy to implement in code, 
as all you do is just convert any negative values to 0, 
and leave positive values alone. 
This makes it very efficient while still 
providing a good amount of non-linearity. 
When we connect these nodes together, 
we create a neural network. 
This is an example of 
a Dense Neural Network where the outputs of 
all the nodes in one layer are used as 
the inputs to every node in the next layer. 
The input values or features make up the input layer. 
For example, if we are using 
the three values from the accelerometer as our features, 
we would have three inputs to this network. 
Neural networks are fairly robust, 
and can work with all sorts of data, 
so long as the values are numbers. 
Anything between negative infinity, 
and positive infinity can theoretically be 
used as an input feature to a neural network. 
However, you start to run into issues with 
training times when you mix different scales. 
For example, let's say you're using 
one axis accelerometer data as one feature, 
and a light sensor output as 
another feature to a neural network. 
The accelerometer might have a scale of something like 
negative 20-20 meters per second squared, 
whereas the light sensor could go 
from 0-120 thousand lux. 
Because the features are on different scales, 
the training time would potentially 
be much longer than necessary. 
You also might lose some floating point number accuracy 
with really large ranges on some processor architectures. 
To alleviate these issues, 
you will often see 
the input feature values being 
normalized between 0 and 1, 
or sometimes negative 1 and 1. 
For our purposes, this might 
mean taking our grayscale pixel values, 
and mapping the 0-255 values to be between 0 and 1. 
Because the pixel values have similar ranges, 
you might see neural networks work with raw pixel values, 
either normalized or non normalized. 
Each of these three values would be 
copied to every node in the next layer. 
Hidden layers are any layers 
that are not the input values, 
or provide the final output of the network. 
This particular network only has one hidden layer. 
If we are creating a multi-class classifier, 
we can replace the final activation function 
in the output layer with the softmax function, 
which combines all of the weighted sums from the nodes in 
a way to give us a measure 
of probabilities of each class. 
This collection of probabilities 
would make up our prediction, 
which is often represented as y-hat. 
During training, we would compare this y-hat output 
with the ground truth label assigned to the input sample. 
To perform training, we first initialize 
all the weights in each node to a random value. 
We then take one of the training samples, 
and feed the features as inputs to the network. 
Next, we do a forward pass where we calculate 
the outputs of each layer to feed to the next layer. 
The output of the whole network will be a prediction. 
We compare that prediction to 
the known good value from that sample, 
and calculate how different they are. 
The value that gives that difference 
is known as the loss value, 
and we calculate this value with the loss function. 
We use this loss value in the backpropagation algorithm 
to go backwards through 
the network to update each of the weights. 
When we update the weights in this manner, 
we hope that the neural network will 
be more accurate next time. 
We continue doing that for all 
of the samples in our training set, 
which is known as one epic or epoch. 
You'll come across both pronunciations. 
The auto-generated captions struggle 
with the spelling sometimes, 
so I'll do my best to stick to epoch. 
After each epoch, we can run the validation set through 
the network using just forward 
passes to see how well it performs. 
Note that the models' weights are not 
updated with backpropagation during validation. 
We repeat the forward passes and 
backpropagation with the training data over and 
over again until the accuracy of the model 
reaches an acceptable or desired level. 
I'd like to review some terminology 
around the training process in case you 
run into it later in the course 
or in some of the recommended readings. 
Gradient descent is 
the most popular algorithm used to adjust 
the parameters in most neural networks 
during the backpropagation training step. 
It computes the gradient of the loss function given after 
a forward pass with a training sample and 
uses that information to 
update the parameters in the model. 
Epoch is when all of the training samples 
have been passed through the model during training. 
You'll often see accuracy and loss 
plots use epochs as the time access. 
Batch gradient descent is when 
you pass all the training samples through 
a model and compute 
the average gradient to use for gradient descent. 
Stochastic gradient descent is 
computing the gradient and upping 
the parameters once each time you do 
a forward pass with a new training sample. 
Mini-batch gradient descent is a mix 
of batch and stochastic gradient descent. 
You compute the loss and average gradient for a group 
of training samples and use 
that to update the parameters. 
You then move on to the next group in 
the training set and repeat the process. 
You'll often want to watch for when a model is 
underfitting or overfitting during training. 
Underfitting occurs when the model performs 
poorly on both the training and validation datasets. 
It means that the model was unable to generalize 
the features to accurately predict the classes. 
To fix it, you'll want to try getting more data, 
working with different features, 
training for longer or trying a more complex model. 
With a good fit, 
the model should have good accuracy for 
both the training and validation sets. 
What level of accuracy is good enough is dependent 
on your particular needs for your project or product. 
Overfitting happens when the model is able 
to predict labels for the training data well, 
but fails to generalize what it's learned to unseen data, 
which includes the validation set. 
Here we expect the training accuracy to be high, 
but the validation or test accuracy to be low. 
To fix overfitting, you can 
gather more data, try early stopping, 
reducing model complexity, adding 
regularization terms or inserting dropout layers, 
assuming you're working with a neural network. 
Note that sometimes you may not have 
the right data or right features and you need to go 
back to figuring out what's the right type of data 
to work with as input for your model. 
Instead of accuracy, you might also see 
the loss value given over time 
or epochs of the training process. 
You should expect it to basically 
be the opposite of accuracy. 
Over time, the loss should decrease. 
Training and validation sets will show 
little decrease in loss if the model is underfit and 
there will be a big separation between 
the training and validation loss values 
if the model is overfit. 
Finally, let's review how we might 
evaluate the effectiveness of a classifier. 
One popular way to do this is the confusion matrix. 
Across the top, we list the labels 
that the model predicts for each sample. 
Down the side, we list the actual classes. 
We usually want to compute 
the confusion matrix for the validation or test set. 
For each sample from that set that we feed to the model, 
we record the output. 
If the predicted label was bird, 
but the actual label was cat, 
we add 1 to that cell. 
Once we've gone through the whole set, 
we'll end up with something like this. 
The true positives are the number of 
predictions that match the actual class. 
We can compute the total model accuracy, 
but an examination of 
each class can yield some other useful metrics. 
For example, we might start by focusing 
on the metrics for just the dog class. 
The number of true positives would be 199. 
The true negatives are when the classifier correctly 
predicts something other than our intended class. 
A false positive occurs when 
the predicted label is our intended class, 
but the actual label was something else. 
Finally, a false negative is when 
the actual label is our intended class like dog, 
but the model predicted something else. 
Using this information, we can 
compute a number of metrics. 
For example, the total system accuracy for 
this dataset is calculated by summing 
all correctly predicted classes and 
dividing that by the total number of samples in the set. 
Note that this has nothing to do with the intended class. 
We sum 205, 199, 223, 
and 186 to get the total number of correct predictions. 
We can calculate the per class accuracy by adding 
the true positives to the true negatives and then 
dividing that number by the total number of samples. 
For example, the dog class would have a 
92.7 percent accuracy with this model and dataset. 
In some instances, you'll have 
an imbalance among your classes. 
As a result, it's often better to use 
the F_1 score for a class rather than 
accuracy as it takes 
the false positive and false negative rates into account. 
To compute the F_1 score for a class, 
you divide 2 times the number of true positives by 
2 times the number of true positives added to 
the number of false positives and false negatives. 
For our example, the dog class would have 
an F_1 score of 0.845. 
The closer the F_1 score is to one, 
the better your classifier performed 
on that dataset for that class. 
We can compute the accuracy and 
the F_1 score for each class to compare them. 
As you can see, they differ slightly. 
To get the total F_1 score for the whole classifier, 
we can compute the average of the F_1 scores. 
Again, we can compare that to 
the total system accuracy for that dataset. 
Both of these numbers can give you some insights into 
how well the model performs with unseen data, 
assuming you are using the validation or test sets. 
I hope this helps give you an idea of how neural 
networks work and how we can evaluate them. 
Don't worry too much about the math as we will be using 
tools to do most of the complex calculations for us.

Previously, we went over how 
a basic neural network operates, 
how you train one, 
and how you can evaluate its performance. 
I'm going to show you one way you might 
implement this using Python and Keras. 
If you head to the GitHub repositories for this course, 
you can go to the training an image classifier with 
Keras folder and find a link to this Colab. 
I recommend working through 
this exercise so you can get a feeling of how 
dataset curation and model training 
works in Keras even better, 
you might want to manually copy all of the code into 
your own Colab and play with 
the values to see what's going on. 
I want you to see how you might 
traditionally use this machine learning framework 
and you can compare this code 
to the advanced mode in Edge Impulse. 
Personally, I like to do 
data curation and evaluate various features in 
Python before uploading my datasets 
to Edge Impulse to do the model training. 
To run the script, 
just click on the first code cell and 
then press "Shift and Enter". 
You might get a warning from Google saying 
it's not authored by Google and that's fine, 
just click "Run Anyway". 
This will run one cell at a time and you will 
automatically be placed into the next cell. 
Press "Shift, Enter" again to run it. 
Note, here that we need to set 
the dataset path and 
this whole thing is basically a Linux instance. 
On the left side, 
we can click on the little folder icon and that 
gives us a view into the files 
and folders of this file system. 
We can go up and you see 
a lot of the folders it looks just like 
a Linux instance and 
our files will be kept in the content directory. 
Click on that, right-click and click "New Folder". 
We want to keep our datasets in something called dataset. 
Then I'm going to 
create a new folder in that and I'm going to name 
these folders after my labels 
for the things that I'm trying to identify. 
I'm doing electronic components. 
My first one will be capacitor, 
then I'm going to have a diode, 
and then LED, and a resistor. 
This may be enough for your use cases. 
However, I always like to have a background or 
noise or something that's basically 
not any of the things that I'm looking for. 
You'll find that this does not 
exist in many pre-made datasets, 
so I like to add it when I'm 
collecting my own datasets because a lot of times 
you'll take a picture or a video of something and 
the items you're trying to identify are 
just not in that photo. 
One of your classes should be background 
or noise or whatever it 
is you're looking for just is not present. 
I'm going to call this one background. 
I'm going to right-click on the folders, 
I'm going to click "Upload", 
I'm going to go into wherever I keep 
my dataset that I've collected. 
For me I have it in Python datasets and I have them in 
the electronic components. This is background. 
I'm going to click 
"Ctrl A" to highlight all of them and click "Open". 
This will upload all of the files to the Colab instance. 
I'm going to repeat this process for each of my classes. 
Here's capacitor. 
Remember that this is a Linux instance 
that will only run once while you're logged in. 
As soon as you log out or the run-time is disconnected, 
you have to re-upload all of your data. 
That's because Google does 
a refresh of all of the files on 
the system whenever you disconnect 
from the run-time environment of this Jupyter notebook. 
If you click down, you can actually 
see all of the files I got uploaded. 
We're going to do LED next 
and then last for me is resistor. 
This can be your own dataset. 
You can use my electronic components. 
Feel free to experiment with different types of 
photos or maybe it's M-NEST fashion M-NEST, 
some type of pre-made dataset. 
They should more or less work for you, 
but you might have to make a few changes 
to this settings. 
Now that we have 
our dataset under the dataset directory in content, 
we can point our dataset path to there. 
We're going to resize our images. 
All of my images are 96 by 96 color images. 
We're going to scale those down to 
28 by 28 pixels and we're 
going to actually use them as gray-scale or we're 
going to convert them to gray-scale instead. 
That just makes the machine learning model 
smaller and easier to work with. 
If you don't need large resolution to classify objects, 
then don't use it because it just wastes 
a lot of processing power and time. 
Sometimes you will find that dark images 
or dark backgrounds work better with 
lighter subjects on them. 
This can help improve your accuracy. 
I'm not going to do this for now, 
but feel free to experiment with this. 
All this does is just invert the image, 
so white pixels become black. 
Vice versa. As we saw earlier, 
we're going to set aside 20 percent 
of this data for validation, 
and 20 percent for testing. 
This random seed affects the random library in Python. 
This way, with a particular seed, 
you should be able to 
replicate what I've done here exactly, 
and random will follow the same sequence 
of numbers that it generates. 
We're going to press "Shift Enter" to run that cell. 
Everything's been set for our settings. 
The next thing we want to do is load 
our images as numpy arrays. 
I'm not going to go through this code 
in a lot of detail because you shouldn't be 
really working with this as we're going to be 
moving to edge impulse fairly soon. 
But I wanted to show you 
what this code would look like if you were 
going to use Keras 
in order to train a machine learning model. 
That way, you can see what Keras looks like, 
and then compare it to what's in 
the advanced settings in edge impulse. 
This just goes through each of the folders, 
it finds the files in those folders, 
it reads them using the PIL library, 
and then it converts that image file into a numpy array. 
It then appends that array to our collection of samples, 
which I call x, in this case, 
uppercase x because it's a three-dimensional array. 
Eventually, that will get converted to 
a much larger array, 
that's a numpy array, but right now, 
it's just a list of numpy arrays.
Play video starting at :6:26 and follow transcript6:26
In addition to this, each 
time one of the images is added, 
it also adds the label, 
which we get from the name of 
the folder that the sample is located in. 
I'm going to go ahead and run this. 
One thing I like to do is to always print out 
something from each cell to tell me 
that it worked properly. 
Here, I added 50 images from the led class, 
background passer resistor diode. 
That looks correct because I 
collected 50 images of each, 
and then uploaded those. 
It also lists out the labels for me, 
so I know what my labels are, 
and it gives me a total number of samples. 
This should just be a total of these individual classes. 
Next, we're going to convert the labels to numbers. 
Right now, they're stored as strings, 
so there will be 50 background strings in our y array, 
and then 50 capacitor strings, 
and so on, because all I did was tacked 
on the name of the folder to the array. 
What we're going to do is change those to numbers. 
Here it is. You can see, 
it's actually out of order a little bit. 
Resistor came first because I can't 
necessarily control how the labels are read, 
if they're going to be alphabetical order, 
and that's all right, 
which is actually here 
when I'm listing out the directory. 
This may or may not be alphabetical order 
when it's listed out in this for loop. 
You can see the labels, 
the ground truth answers or 
my ys are just strings before, 
and then they become labels, 
and it's alphabetical order. 
Background will be zero, 
capacitor will be 1, 
diode is 2, 
3, and 4. 
I need to run this cell. 
You can see led came first this time, 
and that is label 3.
Play video starting at :8:11 and follow transcript8:11
Background is 0, and so on. 
Now, instead of strings, they are numbers. 
This makes it a little easier to work with, 
especially in a machine learning model when we are 
working with numbers rather than strings. 
Next up, we have our collection 
of samples and the associated labels. 
We want to shuffle 
them because right now they're in order. 
All the leds, including the samples, 
are in the same order, 
and then all of the background, 
and the capacitors, 
and then resistors, and diodes are somewhere in there. 
They are in this exact order, 
and that is a terrible order 
to feed to our network for training. 
We want to shuffle it up a bit. 
Also, more importantly, 
we want to extract the test and validation sets, 
and those should be a random sampling 
from this larger set. 
The best way I have found to do that is 
to zip the x and the ys together. 
The first element in 
the x-list is associated 
with the first element in the y-list, 
and then the second is associated with the second, 
so that when we shuffle them, 
that ordering stays the same, 
or that pairing stays the same. 
Then we unzip them 
again so that the pairing stays the same, 
but everything's all shuffled. 
Then we figure out where we want to 
slice out from that array in order to get our test set, 
and we take 20 percent off the top. 
We take 20 percent from the beginning of the array, 
then we take the next 20 percent for our validation set, 
and then the rest, which should be 60 percent, 
goes to our training set. 
This keeps the correct ys with 
the correct xs or 
the correct labels with the correct samples, 
but it shuffles them up and 
extracts the test and validation data. 
Finally, for later on, 
we're going to need to know the number 
of training samples that's 
going to help us in some future calculations, 
and then we print out everything that's been done. 
We pull out 50 samples for our test set, 
50 samples for our validation set, 
and then 150 samples for our training set.
Play video starting at :10:21 and follow transcript10:21
When I'm working with new data, 
I always like to view what one of the samples looks like. 
It gives me an idea for what I'm working 
with and tells me what the numbers are. 
Up here, I've got the index. 
You can index into 
the x training set 
and take a view of what 
that raw data looks like as well as, 
well, print an image because we're working with images. 
You can see here, I've got my capacitor 
and it looks like 
we are working with gray scale, which is good. 
The value of the pixels should be between zero and 255. 
I set that as a min and max here so that it's 
plotted appropriately when I'm using Matplotlib. 
You are welcome to try adjusting 
this index so we'll pick something else, 
let's say, I don't know, 142. 
I can run the cell again. 
LED, sure enough, 
that's an LED, and 10, diode, diode. 
It looks like the ground truth labels match up, 
and these images are still in the 96 by 96 array format. 
I created a function here to help me resize 
a list of images because we're going 
to call that for each of the sets. 
All it's doing is just calling this resize which comes 
from the skimage.transform library. 
That allows us to just resize the images using 
whatever resizing algorithm that they have by 
default and we're going to keep anti 
aliasing on by default, 
but you can turn that off if you want. 
It just helps get rid of 
some of these jagged edges in the image, 
which is more useful when you're sizing up, 
but I find it useful when you're sizing 
down too. We're going to use this function. 
We're then going to resize all of the images 
in the training validation and test set. 
Then I like to view the images again. 
You can see here, it's a much smaller image. 
It's still the capacitor, 
but it's a lot more pixelated 
because we've scaled it down. 
We can see that it's a 28 by 28 image now, 
it's more pixelated, but it's a lot smaller and 
it should still be identifiable for what this object is. 
You will notice that the resize command in 
the resize images function that we wrote takes the 
zero to 255 gray scale values and 
converts them into floating point values 
between zero and one. 
It does that normalization process for 
us which means we don't have to do it manually. 
Keep that in mind because these are 
the numbers that we're feeding to the neural network, 
not zero through 255. 
Next, we want to convert a list of samples 
into our Numpy arrays. 
It's going to take a list as an input and output 
a Numpy array as opposed to 
that list because when we go to train using Keras, 
it's expecting a larger Numpy array that contains all of 
the training samples rather than a list of Numpy arrays. 
I'm going to run that cell. 
You will see that our training set now is 
one large array instead of a list of arrays. 
It is three-dimensions, 
it is 150 in the first dimension. 
Instead of 150 samples in the list, 
we now have one larger array, 
and the images are now 
just arrays stacked on top of that. 
You can imagine, 
this as one large three-dimensional array 
and the width and height, for example, 
would be the width and height of each image, 
and then the depth would be 150 sets each of the images. 
Then same thing with y, 
rather than being a list of values, 
it's now one long array. 
We do the same thing for the other sets. 
I can't remember if I ran the cell or not. I did. 
You always look at the number above to 
see if it ran because a cell that has not been 
run in a new notebook will not have 
a number in the spot with a play button. 
The next thing we need to do is flatten 
each image to a one-dimensional vector. 
For our simple dense neural network demonstration here, 
it cannot work with two-dimensional arrays like this. 
What we do is we just take the first row and 
then take the second row and tack 
that onto the end, and then take the third row, 
tack that onto the end of 
the second row, take the next row, 
tack that on, and it just continues like that to 
squish this down into one long array. 
When we run this cell, 
you'll see that it's no longer 
a three-dimensional array but rather, 
a two-dimensional array because 
this width and height got squished to one array. 
Now, it's just a collection of 
784 pixel values that have been normalized. 
Remember, these are floating point values 
between zero and one. 
The same thing happens to all of 
our training validation and test sets. 
Finally, we need to do one-hot encoding. 
This should be the last step before 
we actually get to building our model. 
One-hot encoding takes our numbers 
for our labels which would be 0, 1, 2, 3, 4, 
and converts them into a series of 
probabilities of each class, 
if that even remotely begins to make sense. 
In the first index we have something 
that aligns with label 0. 
The second index is label 1, 
label 2, label 3 and label 4. 
The number of values in 
this one-hot encoding lines up 
with the number of classes that we have. 
We have 0, 1, 2, 3, 4, that's five classes, 
so we should have five elements in this array, 
and this then becomes 
our actual answer or our ground truth label. 
Rather than having a single value, 
we've got a collection of values and these 
correspond to the probabilities of each class. 
Label 1, which was 
our capacitor because alphabetical order, 
rather than being the number 1 or 
number 2 as being the output to the array, 
becomes the index for the highest probability number. 
Because we know that 
this instance of a capacitor is label 1, 
the second or index one element 
of this array should be probability 1. 
It's guaranteed to be a capacitor, 
and then all the other 
probabilities should be set to zero. 
It's zero probability that it's the background, 
100 percent probability that it's a capacitor, 
zero probability that it's a diode, 
zero that it's an LED and zero that it's a resistor. 
You can see how this works for another example. 
Let's take four, which we know should be the resistor. 
That will be zero probability for everything else, 
and 100 percent probability that it is the resistor. 
These are the answers that are 
used to train the network, 
we don't expect to get 01000, for example, 
coming out of the network during 
inference because it's going to give us a list of 
probabilities that come from 
the softmax layer when we are done with inference. 
You will see that when we go to perform inference on 
an unseen validation or test set sample. 
Once again, let's see if I run this cell, I did not, 
so we run this cell and I left the answers in here, 
so you can get an expectation of what to look for. 
But you still need to run 
every cell otherwise, parts of it won't work. 
We then construct our Keras model. 
This is a fairly simple model, 
it consists of 64 nodes of a densely connected layer. 
We're using the ReLU activation. 
I did put dropout layers in here, 
which we looked at in the previous course, 
in the introduction course, 
followed by another layer of 64 nodes. 
Same thing, ReLU activation, 
25 percent dropout and then the final layer 
consists of the same number of nodes as we have classes. 
For example, this is no longer 10 nodes, 
in fact this is a typo in this, 
this would actually be 3, 4, 
5 nodes because that's the number of classes we 
have and the softmax activation where 
each node should output a number after 
going through the softmax function here that 
corresponds to the probability that the model 
believes the sample belongs to that class. 
That's why we have five nodes, 
one for each class. 
We then set some training settings here, 
the categorical cross entropy is 
generally used when you have more than one class, 
so we're not doing a binary classifier. 
This would change if you do have a binary classifier. 
The atom optimizer is fairly common, 
and then we're going to record 
the accuracy metrics as we're training the model. 
We then print out a model summary, 
which I always like to do. 
You can get an idea of the number of 
parameters that are in this model, 
which is based on the number of nodes, 
how those are interconnected, 
the input shape of your sample and so on. 
The higher the parameters, 
the more computations you have to do. 
Then we train the model. 
This is where a lot of 
your hyperparameter tuning can come in outside of, 
of course, playing with 
the size and shape of the network. 
Batch sizes adjust how fast things change, 
I'm just going to say 32 to start. 
You can also play with the number of epics for training. 
Start with like a 100 to 200, see how that does, 
and you'll be able to judge when you plot 
the accuracy of the validation set. 
Let's run this and we wait for it to train. 
We don't have many training samples to work with, 
so this will be fairly quick. 
If you have a lot of training samples in your dataset, 
this might take a little longer. 
When it's done, I always like the plot the training and 
validation loss as well as the accuracy. 
This just plots those for us and makes it 
all nice with blue and orange markers, 
so we can see and look for overfitting and underfitting. 
When we plot those, we see that 
the model looks like it fits fairly nicely. 
There's a little bit of overfitting because the accuracy 
of the training data is a little bit in 
the accuracy of the validation data. 
But it looks like it's good 
enough sitting at around 90 percent. 
It's where it maxes out. 
That's not bad considering this as 
a densely connected neural network 
and we don't have a lot of training samples to work with. 
Then we can use Keras to predict 
a single label from 
our validation set or 
feel free to do it with the test set. 
But you should really be using 
the validation set if you're still trying to tune 
hyperparameters and figuring out 
what model works for your data. 
To do this, I'm going to pick 
an index into the validation set, 
just like we did earlier when 
plotting the individual images. 
We need to convert it to 
a two-dimensional arrays that have 
a one-dimensional vector because 
this is what Keras expects. 
We actually just add a 0 with dimension to a 2D array. 
It's really odd, 
but then we feed that array consisting of 
our image data to our model. 
It's a parameter to model.predict. 
That's all we need to do to perform inference. 
We then find the index of the highest score. 
Remember how I said it's going to be printed 
out in an array of 
five elements corresponding to 
the probabilities of each label. 
We just find the highest one and then 
we output that prediction and print 
that versus the actual label because we should know what 
the actual label of the validation sample is. 
We run that. If you look across, 
here's the actual output of the model. 
It is five values 
corresponding to the probabilities of each class. 
Class 0 has a very low probability, 
same with classes 2, 
3, and 4. 
The highest probability has about 99.8 percent. 
That's how much the model believes that 
this sample belongs to the capacitor class. 
Sure enough it is correct, 
the actual label is capacitor. 
You can continue to do this for many of 
the samples or all of the samples in your validation set. 
Feel free to do that with some of the test set as well. 
I'm just going to, look like we have a bunch 
of capacitors in a row here. We go background. 
Yes, this changes the first element now changes 
to be the highest and so on. 
Here's an actual miss. 
This is a false classification. 
It thinks it's a diode, 
but really it's a capacitor. 
You can look here. 
Diode is 0, 1, 2. 
It's about 40 percent sure it's a diode and it 
looks like it's about 16 percent 
sure that it's a capacitor. 
This makes it about 35 percent sure it's the background. 
It's struggling with this image. 
It really doesn't know. 
You can see how it's not going to be perfect. 
To explore that we can 
actually print out a confusion matrix. 
This just helps us print that out. 
It goes through all of 
the validation set and computes the confusion matrix. 
We've got the predicted labels across the top, 
actual labels down the bottom, 
you should see the largest number 
for each of the predicted and 
actual labels be in 
this diagonal because that 
means it's predicting correctly, 
but it does have a few where it's not. 
You can see some of these misclassifications 
off to the side here that are 
not part of, that's not one. 
This is one. That these are not classified correctly. 
That means the model is not completely accurate, 
but it's actually doing a pretty good job. 
We can use model. 
evaluate in Keras to get a accuracy. 
I did this on the validation set. 
This is another typo that I will 
make sure it's correct it in the code. 
I can run that validation loss. Then here's the accuracy. 
It's about 90 percent accurate on the validation set. 
Then I'm going to run this on the test set because 
I would only use this up to 
this point to update my hyperparameters, 
play with the network, try to find something 
that's getting any better than say, 90 percent accuracy. 
When I'm happy with that, I would 
then run it on the test set. 
As you can see, I'm using the test set 
here as opposed to the validation set, 
which should give me a little bit different number. 
It actually looks like it's better on 
the test set, about 94 percent. 
That's a lot of that is because I have 
a very small dataset to work. 
With larger numbers, 
hopefully these would be closer, 
but that's not necessarily always the case. 
You can have a model that starts to overfit to 
the validation set as well. Be careful with that. 
I hope this gives you 
an idea of how you might get started using Keras. 
Don't worry about trying to follow everything here. 
I do recommend either running 
this or manually typing it into 
your own Colab instance and working 
through each of the cells to see what's going on. 
This is going to give us an idea of what's 
happening under the hood and edge impulse, 
especially as we get into 
## image classification and uploading our own datasets.
Image Classification and Neural Networks
(Optional) Recommended Reading
You can upload images directly to Edge Impulse, 
which we will do later, 
but I want to show you how to use 
the ingestion service so that you can automate 
the curation process and upload 
extracted features separately if needed. 
You can find the code for this section by 
heading to the GitHub repository for 
the course and looking in 
the using Colab to curate and upload a dataset folder. 
Click the link there to open 
the Jupyter Notebook in Colab. 
This is going to be very similar to 
the example we saw in a previous lecture where we 
uploaded our images to 
specific folders in Colab and then used Colab to 
run a script that trained a Keras model 
in order to identify or classify those images. 
However, this time, instead of 
creating a Keras classifier, 
we're going to do the same curation process, 
the same feature extraction, 
to resize those images. 
Then instead of training that model, 
we're going to send them to 
Edge Impulse using their ingestion service. 
To start, we are going to need to create 
a dataset folder just 
like we did in the previous lecture. 
In our file system here, 
right click, make sure you're in the Content folder. 
We're here, I'm going to right click and say New folder. 
I'm going to call it dataset without the capital letters. 
Dataset, all lowercase. 
Right click on the dataset folder, 
we're going to create a folder for each of our labels. 
For me that's going to be background,
Play video starting at :1:45 and follow transcript1:45
diode, capacitor, LED, and resistor.
Play video starting at :1:53 and follow transcript1:53
Here we're going to upload all of our images, 
so I'm going to Right click, say "Upload." 
Go to the bitmap images that I have saved from 
our open envy camera capture or however you 
decided to curate those images to 
start to get an initial dataset. 
I'm going to go to background control header, 
select all of them, 
click "Open" and I'm going to continue doing that 
for all of the folders in here, 
capacitor, diode, LED, and resistor. 
With that done, I'm going to start running the cells. 
Notice instead of Keras, 
we're going to keep the skimage transform library 
because we need to resize each of the images. 
This will handle all of 
the interpolation for us to 
make sure it's resized appropriately. 
I'm very thankful for being able to 
use Python because it has access to 
a number of these libraries that makes things like 
image transformation very easy and straight forward. 
I'm going to run this to import all the libraries. 
Notice here that I need to copy in my API 
and HMAC keys from Edge Impulse. 
Let's head to Edge Impulse. 
Create a account if you don't already have one, 
and here you'll be asked to make 
a project or select from your existing projects. 
If you have existing projects, 
scroll down, select "Create new project." 
I'm going to call 
mine electroniccomponents-dnn for dense neural network.
Play video starting at :3:16 and follow transcript3:16
Click "Okay", click "X" on the pop-up to close it out, 
and we should be at the dashboard. 
In this dashboard, you're going to want to head to keys. 
You're going to want to double click on the key. 
Notice that it is going to actually copy the whole key, 
even though it looks like it's not copying the key. 
Copy that, head back to 
our Colab script and paste that in. 
Notice it does paste in the whole script, 
and we're going to do the same thing for the HMAC key. 
We're going to need both of these in order to 
upload our samples to Edge Impulse. 
Most of the rest of this should look very familiar. 
It should point to the dataset folder 
that we just created in order to 
find all of the images that have been 
divided up into our labels. 
We have the target width and height. 
We're going to resize the images from 
96 by 96 to 28 by 28, 
as smaller seems to work fine and it saves on 
computational cycles for both training and inference. 
We're also going to set aside 20 percent for test. 
Notice that we don't have a validation set, 
as Edge Impulse will 
automatically extract a validation set, 
I believe it's 20 percent from 
the training set that we will upload, 
so leave out the validation set 
for doing this ingestion service. 
The other thing I want to point out, 
the reason we're doing this and you don't 
need to use the ingestion service. 
However, it is available for 
us to use if you have your own data, 
say sitting on your computer or somewhere on 
the Internet that you want to send to Edge Impulse. 
This is showing you 
both how to use the ingestion service, 
which means that some of 
this code will be fairly complicated. 
Don't feel obligated to 
understand all of the code at this time. 
It's just really to demonstrate how we can, 
one option to upload data to Edge Impulse. 
It's actually a lot easier to send 
regular images in PNG or JPEG formats, 
for example, to Edge Impulse, 
which we will do in a later lecture. 
However, Edge Impulse doesn't 
really work with images in that way, 
and flatten them for us, 
so if we want to work with 
raw data instead of images, we can do this. 
We can use the ingestion service to upload 
our own raw data that's been manipulated, 
curated, or we do 
our own feature extraction outside of Edge Impulse, 
and I'm just going to demonstrate that to you here. 
There's no need to copy my code, 
but if you want to see what's going on, 
feel free to run it, 
play with it, try writing your own. 
As the ingestion service can be 
extremely useful when you're working with your own data. 
That being said, let's continue. 
We will run this to do our settings. 
It looks like I accidentally 
typed an extra character here. 
That ran, looks good. 
This is divided up into a few sections, 
so this next part, 
these few cells, should look very 
familiar from the previous lecture, 
where we are first going through each of the images, 
loading them using the PIL library, 
and then converting that image data into a NumPy array. 
We are extracting the label that comes 
from the name of the folder in Colab. 
Notice that we're not going to convert 
that label to a number which we had to do 
in the kairos example 
because we're not going to be 
training the model directly here. 
This label is actually best in 
string format when it's uploaded to Edge Impulse, 
because that's going to display 
the strings for us and it will 
handle converting those two numbers 
during the training process. 
After running this, I should get 
50 images from each of my classes 
because that's the number of images 
that I took for each label. 
It also has the labels in string format, 
which is great, 250 total images. 
Next, we are going to zip 
the samples and the labels together. 
We're going to shuffle them and we're going 
to extract the test set, 
which is the first 20 percent of that shuffled set. 
The rest are going to become our training set 
or will become the training set in Edge Impulse. 
Edge Impulse will divide that into 
validation and training sets for us. 
We don't need to worry about doing that here. 
We should see that the number of 
test samples is 20 percent of everything, 
and the rest belong to the training samples.
Play video starting at :7:38 and follow transcript7:38
Let's go down to the next cell. 
I always like to view one of 
the training samples or more 
to get an idea of what I'm working with. 
For these images, it should be zero to 
255 grayscale values for each pixel. 
This should be in a 96 by 96 array 
depending on how you took the images, 
whether it was with open Envy, 
your cell phone and crop them and feel free 
to try it a different index to 
see what those images look like. 
I am just going to continue down here. 
This is a function that is essentially a wrapper for 
the resize function that will 
resize the image that is given to it. 
We're just going to go through the list 
of everything in one of the datasets, 
and resize them all to that desired width and height, 
which is going to be 28 by 28 for us. 
You can see I'm passing in these parameters here, 
that's 28 and 28. 
We're going to do this for the training set 
and the test set. 
Then I'm going to view the same image, 
and it should look just like the same here, 
but it's been resized or scaled. 
Instead of 96 by 96, 
it is now 28 by 28. 
Also notice that this 
resize function that we saw earlier, 
that handles mapping the zero to 
255 grayscale values to the 
normalized zero to one floating point value. 
We don't need to worry about normalizing. 
That has already been done for us. 
Then we convert that list, 
which is a list of arrays into one giant NumPy arrays. 
Instead of a list of 28 by 28 NumPy arrays, 
we now have one giant NumPy array, 
where each of the integer arrays 
is basically just been stacked on top of each other. 
Instead of a list of 200, 
we now have just one array that is 200 by 28 by 28. 
The same thing for the test set, 
but with 50 samples instead. 
Next, we want to go through and 
flatten each of the images. 
Instead of 28 by 28, 
those arrays just get completely flattened into 
a single 784 element array by 
appending one row next to each other to create 
one long vector instead of a two-dimensional array. 
With that, we get into some helper functions. 
The ingestion service for Edge Impulse expects things 
in particular formats when 
we upload it using that service. 
One of them is JSON, 
which we create a wrapper for here, 
and you can see all of 
the different fields that are needed,
Play video starting at :10:9 and follow transcript10:09
they get filled out in this function. 
We will create that function, 
and that just fills out the fields, 
but leaves the values empty 
that we fill out with the image data, 
or whatever sample data you want. 
Note that when you're using this ingestion service, 
it doesn't have to be imaged data. 
It can be any sensor, 
it can be from an accelerometer, 
a sound sensor, a temperature sensor. 
The ingestion service is really 
agnostic to what kind of data it's being sent. 
We can send it floating point data, 
we can send it integer data, it's just numbers. 
You want to make sure that 
your samples have the same size. 
If you're working with, say, two-dimensional arrays, 
they need to always be 28 by 28 for every sample. 
Or if you're working with one-dimensional vectors, 
if you have, say, 
three axes accelerometer data, 
your samples should always have three values in them, 
because that's what the model is going to expect. 
Especially when we train it from Edge Impulse, 
you want to make sure that all of your samples 
have the same size and shape. 
Next, we have the send sample function. 
This handles taking that JSON data and then 
actually uploading it to 
the end points, which are given here. 
Ingestion.edgeimpulse.com, API testing data. 
If you're using HTTPS, 
then you will need to create 
a signature that is done here. 
That just gets appended to 
the message before it's sent out. 
The response value is returned from this function, 
so you can compare it to say, 
HTTP okay, to make sure that everything 
was received correctly by Edge Impulse. 
If you'd like to learn more about 
the Edge Impulse ingestion service head to 
docs.edgeimpulse.com/reference at the top, 
you should see this whole section about data ingestion. 
It's really just a bunch of HTTP or 
HTTPS post requests that you construct, 
and send those along 
to that server endpoint that was listed. 
These are your endpoints and you can do 
API testing or API training to 
split between training and test sets. 
They give you some example usage if you 
want to create your own post requests, 
which is essentially what we're doing, 
just, we're doing it in Python. 
An idea of how to work with HTTPS. 
There are a number of formats that are supported. 
You could upload files directly in 
Edge impulse with CSV, 
so comma separated values, 
JPEG or PNG for images or WAV for audio files. 
If you're not using that, you can still 
use this ingestion service 
to upload any arbitrary numbers you want, 
so long as those samples are the same size and shape. 
You can encode the data in CBOR or 
JSON and they give you 
examples of the fields you need to fill out. 
This should look very similar to our JSON wrapper. 
This creates those initial fields for us and it leaves 
values empty and we just put 
the data right in that values field. 
CBOR is something similar, 
so you're welcome to use CBOR instead, 
and I believe you can also use CSV, 
but CSV is great if you have 
actual files saved as.csv files that you collected, 
you can examine those in a spreadsheet. 
A lot of machine learning likes to use 
CSV to store data because it's fairly simple. 
Here's an example of how you might store 
something in CSV format to upload. 
But for now we're going to stick with JSON format 
because it's easy enough to work with in Python.
Play video starting at :13:45 and follow transcript13:45
We're going to define our send sample function, 
and then we have our upload sample. 
This is a little different because it's 
essentially just a wrapper for send sample. 
We could call send sample directly, 
however, I'm doing everything with threads. 
The reason I'm doing everything with threads is because 
I found that using 
a single thread in order to upload data takes 
a long time because as you upload each data, 
you have to wait for send sample to 
return this response value before checking it. 
But if we say spin up 20 threads, 
we can run them all in while loops concurrently and 
others can be running while 
one is waiting for this response code. 
To do that, we're going to 
spin up the 20 threads down here. 
Each thread runs this while loop. 
Before doing that, we made sure to put 
our sample and associated label inside of a queue. 
So we create a queue, we 
put the sample and the label in that queue, 
and then each thread pulls from that queue, 
which is just a global variable. 
All the threads know about this one queue, 
each one pulls from it, 
sends that off, and they keep doing 
this until the queue is empty. 
Then once the queue is empty, 
the threads exit, 
and then we finish with this whole cell. 
Let's create our function 
which is our threads essentially. 
We're going to run this cell which is going to put 
the training set and 
associated training labels in the queue, 
spin up the threads, 
and then wait for those threads to be done. 
If there's any failures uploading, 
it should be printed to the console below us. 
We can actually go to 
the Data Acquisition view 
in Edge Impulse to watch this happen. 
When I Shift Enter to run this thread, 
we can see the labels and 
samples coming in at a rapid pace.
Play video starting at :15:38 and follow transcript15:38
Since I don't have many samples, 
that should actually be fairly 
quick and we can see that we 
have the labels associated correctly. 
We have LED, resistor, 
background, capacitor, and diode. 
Notice that Edge Impulse assumes, at least for now, 
that anything that isn't image or 
audio data that's been specially labeled as such, 
it assumes that it's some sort of time series, 
which is why you're seeing 33 seconds. 
It thinks that I believe there's a millisecond 
between each of these samples which we can change. 
But it's really not necessary since 
this is not time series data. 
Things will still work for us, 
it's just a minor labeling difference. 
Don't worry about seeing the 30 seconds up here. 
You can click on these samples and 
see all the different pixel values. 
Remember, it's been normalized between zero and one. 
Now that that's done, 
we need to do the exact same thing for the test data. 
This is the same exact sequence that we just did where we 
put the test data and labels into the queue, 
spin up a bunch of threads 
and wait for those threads to be done. 
We're going to run that. We're going 
to head over to test data, 
and we can see that data has already come in. 
If you're not familiar with 
multithreading or concurrent programming, 
don't worry about this too much it's 
not completely necessary for this course, 
please just use this uploader script as a template. 
If you want to use the ingestion service 
as a starting point, 
you can create your own curation and 
uploading script if you desire. 
This is really just a demonstration. 
You're not going to be tested on this, 
this is just so that you have something 
to work with if you want to use that service. 
In the future, we're going to be 
uploading images directly to Edge Impulse 
because Edge Impulse has a number of tools 
that work much better 
with image data than it does 
with this raw data that we have here. 
We will see that it's got filters, 
it's got other things that help us create better models, 
but this is just a demonstration so that you can 
see how Edge Impulse compares 
to training a model from scratch in Keras and where 
the similarities are and how 
Edge Impulse makes that training process easier for us. 
Feel free to look through your data. 
If you notice that there's an imbalance among 
your classes or you messed 
something up and you put your training set into 
your test bin and vice versa, 
you can head to your dashboard, 
click out of the popup, 
and you can scroll down to rebalance dataset. 
This will attempt to rebalance the training and 
test set so that you have about 20 percent 
in your test set. 
Remember that this button is 
here if you mess something up 
where one of the samples gets put into the wrong bin. 
But with that, we are ready to go into training, 
which we will do in a future lecture.

In a previous lecture, 
I showed you how to create 
a simple dense neural network in 
Keras and then train it as an image classifier. 
Now that we have uploaded 
raw image data to Edge Impulse, 
we're going to use it to train 
a very similar dense neural network 
and see how that training process compares. 
In Edge Impulse, 
you want to make sure that you have 
your data uploaded to the data acquisition view. 
You can go through and make sure 
that the classes are correct. 
You've got samples in 
each of the classes for training data, 
and if you go to test data, 
you've got samples in there that 
you can test with as well. 
We uploaded all of this from 
a Colab script and note that 
we've flattened all of these images. 
Rather than a two-dimensional image, 
we've got a one-dimensional vector where 
each value corresponds to one of the pixels. 
It's just been flattened out to one vector right now. 
These values should be between zero and one, 
since our Colab script normalize those values for us, 
and you should have a total of 
784 values for this vector assuming 
you're working with a 28 by 
28 resolution photo for each of your samples. 
From here we want to go to Impulse Design and 
an impulse in Edge Impulse is our pipeline. 
It's incoming data. 
We transform it in some way during processing. 
This is to extract the features 
and then we create a learning block, 
which is the actual machine learning 
model that we will train. 
I pointed out earlier in 
a previous lecture that Edge Impulse considers 
any non two-dimensional data or data that's been 
labeled as something other than raw. 
It considers it to be a time series data, 
even though we're not really working 
with time series here, 
it thinks it's time series. 
We want to change where it says 
1,000 milliseconds to the size of the vector. 
In this case it's 784. 
Even though it says 784 milliseconds, 
it doesn't really have units. 
It's unitless, it's just a number of pixels. 
If you go to Data Acquisition, 
you can click on this and you can see that it's 784. 
It thinks each one of these is 
a millisecond and that's okay, 
we can work with that. 
We're going to try to trick 
Edge Impulse here into believing this is time series, 
but it won't ultimately 
matter because it's going to train 
the same type of network that we did earlier in Colab. 
Click to add a processing block. 
The only thing we're going to do here 
is just keep the raw data. 
We've already flattened it. 
Even though we can have Edge Impulse flattened for us, 
I just wanted to show you what it's like to work 
with raw data in Edge Impulse. 
Click to "Add raw data" that 
basically just passes everything, 
that is the data right to the learning block. 
It's not going to do any processing. 
We want to use a Keras neural network, save the impulse. 
The first thing we want to do is extract 
features even though there are no real features. 
The one thing that the raw feature extraction block 
can do is it can scale for us. 
We want to keep the scaling the same, 
since it's already been normalized. 
So just click "Save parameters," 
it will bring us to generate features. 
Click "Generate features," and 
wait while it generates all those features, 
which is really nothing. 
It's just essentially copying those raw values 
into some other container that's the same values. 
When it's done, you'll get this nice feature explorer. 
Even though there are more than three-dimensions, 
the feature extraction process performs 
some transform in order to 
plot these on three dimensions and 
gives us an idea of the separation of classes. 
They're fairly separated. 
You can see them in individual groups here, 
which means we should be able to 
train a classifier on them. 
But we might run into some issues with 
diodes and resistors looking similar, which makes sense. 
They have long component bodies 
with two leads on either side, 
and it looks like we might have an issue with background 
and one of these resistor values, 
in fact, that may be a bad photo 
that we can possibly track down later, 
but let's work with this for now. 
It looks like there is enough separation 
to get us a decent model. 
We go to the neural network classifier window. 
We're going to set these to be the same as 
the classifier we created earlier in Colab. 
Let's scroll up to look for our hyperparameters. 
We're doing 200 epochs. 
Number of epochs, so that is 200. 
The learning rate is not something we set here. 
But if you look at the Keras API, 
you can see that with the Adam optimizer, 
which is what we used, you can see it's set here. 
It has a learning rate default of 0.001. 
Let's set that 0.001. 
The minimum confidence rating is 
a threshold during testing. 
Things can get tagged as uncertain 
if they don't meet that minimum confidence rating. 
This has really nothing to do with Keras but it is 
used by Edge Impulse during testing and validation. 
We can just leave this alone for right now. 
Edge Impulse gives us a default neural network. 
We want to change it to match 
our dense neural network that we have set here. 
This consists of an input layer, 
two hidden layers that are 64 nodes each. 
After each of those layers we have 
a 25 percent dropout layer, 
and then it goes right into our number of classes, 
our output layer with the soft max function. 
We're going to edit this. 
We're going to say 64 neurons for the first hidden layer, 
64 neurons for the second layer. 
It already has the output layer, 
which has the five nodes, 
one for each of our classes. 
We want to add some dropout layers. 
Let's click on "Add layer." 
We're going to add a dropout layer. 
This needs to be 0.25. 
We're going to drag this up to be 
under that first dense layer. 
We're going to add another dropout layer. 
This is going to help with overfitting. 
Not that there's much to be 
concerned with this particular model. 
In fact, this model is probably 
overly complicated for what we're trying to do. 
But once again, it's just a demonstration 
to show that we can create something 
in Edge Impulse that looks like what we created in Keras. 
In fact, if we go to 
the Keras expert mode in Edge Impulse, 
you can see that the underlying code, 
is the same more or less 
as the code that we created in Keras here. 
It's just creating a sequential model 
and adding the layers, 
and that's what's going on right here. 
Let's go back to visual mode, 
switch, and the network looks okay. 
We're going to click "Start training" 
and we're going to wait for that to happen. 
Remember that the idea of this is really just 
to show you how Edge Impulse 
is accomplishing more or less 
the same thing that we can do in code with Keras. 
But there are a few benefits to using Edge Impulse. 
One is visibility to collect 
data and visualize it and have 
it automatically extract features for us 
without needing to write that code ourself, 
training is made simpler with 
this graphical interface and when we 
go to deploy the model to a microcontroller, 
it wraps everything up for us in a very nice, 
easy to use library that can be very 
difficult to create using TensorFlow instead. 
When the training is done, 
it should spit out the model for us as well as give us 
a performance measurement of the validation set. 
It's got 95 percent accuracy and 
0.1 loss on that validation set, 
which I believe is actually a little 
better than what we saw in 
our Keras code if we look 
here wherever the validation set is the lowest. 
It might have gotten lucky, 
gotten a model that does really well on the validation 
set and that loss was very low at 0.1. 
Edge Impulse is going to pick 
that model it does early stopping. 
In fact, it trains for those full 200 epochs, 
but figures out which of the model has 
the best validation loss and goes with that. 
That's the one that delivers for us. 
It already has early stopping built in. 
This confusion matrix looks really good. 
There's only a few misses that we see here. 
In fact, we can go to 
model testing and test it on our test set. 
These are the samples that we set aside during 
curation and we did not test them until right now, 
it looks like the model does very, 
very well on the test set. 
In fact, I'm very happy with this performance, 
94 percent is really good. 
Now once again, we probably could use a lot more data. 
Remember that there are only 50 samples per class in 
our total dataset and even that got 
divided up into test and training sets. 
I'm happy with this performance 
and we can go ahead and deploy this, 
which I'll show you in another lecture.

## Python and Edge Impulse Documentation
### Project - Extract Features and Train Model
Introduction
Now that you have created a dataset, let’s extract features and send them to Edge Impulse to train our first model. You have a couple of options for this project:
1: Run the pre-made Google Colab script to extract features and upload raw features to your Edge Impulse project
2: Edit the pre-made Colab to curate the dataset in a different manner. This will be more difficult, but it will give you practice using Python to manipulate files.
Either option is fine. I recommend choosing the second option if you want a harder challenge, as you will actually need to read, understand, and modify my code.
Create Edge Impulse Project
If you have not done so already, create a profile on https://www.edgeimpulse.com/. Click your profile picture, and click Create new project.
Give your project a name. If you get a pop-up window, close it. Go to Dashboard > Keys. You will need to copy both the API Key and the HMAC Key into your Colab notebook in the next part.
 
Note: Even though the key(s) might look truncated, if you double-click on the key string and copy (e.g. ctrl+c), the entire key will be copied.

Option 1: No Editing (Easy Mode)
Simply follow the directions in the notebook. You will need to paste in the API Key and HMAC Key from your Edge Impulse project.
Additionally, you will need to upload your image dataset as shown in the description to create a particular directory structure. The individual folders should have the same names as the labels.
Option 2: Edit the Notebook (Hard Mode)
Instead of a multi-class classifier that uniquely identifies all of your classes, let’s say you want to train a model that identifies one of your classes versus the others.
For example, instead of
•	background
•	capacitor
•	diode
•	led
•	resistor
as your classes, let's say you wanted to identify just resistors. You might have resistor vs. other (binary classifier). However, I might recommend keeping the background class (multi-class classifier). So, you would end up with:
•	background
•	other
•	resistor
The easy way to do this would be to copy all the files from the other classes into an “other” folder and create a new dataset. However, this might be infeasible when you are working with very large datasets.
You should be able to do this in code: as you load each image, pay attention to the folder. If it’s “background” or “resistor,” simply leave the label alone. If it’s anything else, change the label to “other.”
I recommend modifying the cell that starts with the comment ### Load images as Numpy arrays to accomplish this task.
I have 50 images in each category. When I modify this cell to be resistor vs. other vs. background, it should show that there are 50 images in resistor, 50 images in background, and 150 images in other.

Train Model
Go back to your Edge Impulse project and click on Data acquisition. Feel free to look through your data to make sure everything was uploaded properly.
Note: Edge Impulse thinks we’re using “time series” data for this project, as there’s no way to upload raw without any sort of units at this time (it’s either time series or 2D images). Remember, we’ve normalized and flattened our images to 1D vectors! This is a bit of a hack, as we are tricking Edge Impulse to think our raw 1D vectors are “time series” data. However, it will still work. Normally, you’d want Edge Impulse to do feature extraction for you on images, which we’ll explore later.
 
Click on Impulse design and you should see a Time series data block. By default, the Window size is set to 1000 ms. However, we only have 784 data points per sample, so we need to change it to 784.
Add a Raw Data block for your Processing block and a Classification (Keras) block for your Learning block. Click Save Impulse.
 
Click on Raw Data in the explorer pane on the left side. Click on the Generate features tab, and click Generate features. Wait a moment while Edge Impulse extracts features (there are no real features to extract here--it’s just wrapping our vectors up in a way that can easily be used by Keras and reducing some dimensions for the Feature Explorer).
Feel free to look at the Feature Explorer window to see if your classes are easily separable.
 
Click on NN Classifier. I recommend changing the number of training cycles to 100, as 30 did not seem to be enough to have the model converge for me. Click Start training. Once training is done, take a look at the accuracy score and confusion matrix for the validation set.
Scroll through the training output to compare the training loss/accuracy to the validation loss/accuracy.
 
How well did your model perform? It looks like there was some overfitting for my model, as the training accuracy was 95% and the validation accuracy was 90%.
Try changing the number of neurons in the first layer, adding additional dense layers, or adding dropout layers to see if that helps to fix any issues you might have, such as overfitting or underfitting. Note that as you make your model more complex, you might need to increase the number of training cycles.
 
For example, I added a dropout layer (with 0.25 dropout) and increased the training cycles to 200. Those changes seemed to help my model performance
Note: Remember that we are working with a relatively small and simple dataset here! In future lectures, we will explore better models and learn how to increase the size of our dataset to make our model more robust.
(Optional) Train a Model with Keras

### Conclusion
Flattening an image to a 1D array (and normalizing the pixel values to be between 0.0 and 1.0) is a good start for an image classifier. However, you will likely find that this model does not generalize very well, as the salient “features” (raw pixel values) that the model uses for decision-making are unique to particular locations in the image.
In future lectures, we will discuss how to make a more robust model that looks at groupings of pixels rather than single pixel values to make its decision.
Embedded systems also consists of some devices that run Lennox, 
like the raspberry pi. 
So I'd like to show you how to run Inference on the pi before we move on 
to microcontrollers. 
Note that what I'm going to show you requires the raspberry pi camera. 
It's possible to convert this demo to another Lennox based device or 
using a webcam, but it will require some effort to start you'll want to 
head to docs.edgeimpulse.com. 
This will take you to the documentation and 
getting started guides on the edge impulse site, scroll down and look for 
raspberry pi four, in here it'll walk you through how to set everything up. 
I'm going to be doing everything from a remote terminal. 
You can do it headless. 
You don't need a full graphical interface to do a lot of this. 
But I find it very helpful when we are previewing what the camera sees. 
It can be helpful to have some sort of graphical interface to do that. 
I am VN seed in right now. 
So this would be what you would see on your pies screen.
Play video starting at :1:9 and follow transcript1:09
And if you scroll down on that edge impulse docks page, 
you should see a section labeled installing dependencies. 
This is what we're going to run and you want to run one of these at a time. 
So I'm going to grab the first one.
Play video starting at :1:23 and follow transcript1:23
I make sure I have a console open just anywhere in the home directory, 
paste that in. 
That's going to update your aptitude package manager as well as install nojs 
which we're going to need to run some of the edge impulse tools grab the next line, 
which is a bunch of dependencies for running these tools.
Play video starting at :1:57 and follow transcript1:57
And then we're going to use NPM to install the edge impulse Lennox tools.
Play video starting at :2:8 and follow transcript2:08
When that's done, if you haven't already done so 
you'll want to run sudo raspi-config you probably would have run this when you 
were setting up your raspberry pi to begin to do things like connecting to a network, 
changing your password and unlocking any sort of interfacing options. 
If you have not done this, go into interfacing options, go to camera and 
when asked if you'd like to enable the camera interface say yes.
Play video starting at :2:38 and follow transcript2:38
Note that you will probably have to reboot your pie when this is done and 
you say finish. 
And also note that this is for the raspberry pi camera only if you are using 
something like a webcam, that's going to be a different process. 
The reason I'm not going to show webcams is because I can't control 
all of the drivers and things that go into making any webcam work on Lennox, 
I can probably verify for one. 
Whereas I know that the pie cam is a known good thing that will actually work for 
these demos. 
So you might be able to get a webcam working. 
But I'm going to show everything through the pie cam and I probably won't be 
able to help you with webcam because webcams and Lennox drivers can be iffy. 
So I recommend using the pi camera. 
If you're working with a raspberry pi, the next thing we want to do is 
install the python SDK for edge impulse tools on Lenox. 
So if you scroll down you'll see a link to the Lennox SDK there's one for python. 
So let's click on that and 
there's some instructions on installing dependencies for the raspberry pi. 
The SDK allows us to write code using edge impulse libraries. 
What we just installed was a collection of edge impulse tools that allow us to do 
things like connect to our project across the internet and 
download the trained model. 
So we actually need both in order to work on the raspberry pi to start. 
We're going to grab this group of dependencies, we're going to copy those. 
We're going to run those in our console.
Play video starting at :4:13 and follow transcript4:13
Say yes if you're asked to install anything 
when that's done run the pip installer for 
the edge impulsive Lennox SDK. 
Note that if pip3 does not work on your 
particular installation it might be pip. 
So you might have to replace pip with something like python-m pip install 
edge impulse Lennox-I and then that address that https address earlier, 
that's a pip three doesn't work but it looks like it is working for me. 
Finally I believe Open CV does come with one of these packages but 
just to be sure I'm going to install it. 
So I'm going to call python-m pip install open CV-python 
that's already been installed on my machine. 
It should be similar for you. 
If not that way you guarantee you get open CV. 
Open CV is a computer vision library that has a number of languages that support it, 
including python. 
We're going to need it to do some basic image manipulation but 
it can do a whole lot more than that. 
Now that we have all of our dependencies installed, we can go ahead and 
download our trained model file and 
create an inference program using python and the edge impulse STK for python. 
When working on a raspberry pi, 
I like to keep everything inside of a project folder in my home directory. 
So if you don't have one already, you can go ahead and make your projects. 
It already exists for me. 
So I'm just going to go in there inside of that. 
I'm going to make a directory called electronic components. 
DNN for dense neural network. 
That's why I'm going to keep this particular project and 
I'm going to go into that directory, 
from there we're going to run the edge impulse Lennox runner tool.
Play video starting at :6:6 and follow transcript6:06
This is going to allow us to download our trained model file. 
I recommend for this step using dash dash clean argument as that allows us to 
reconnect to edge impulse to our account and pick a new project every time 
without it you're going to be connected always to a particular project. 
Which is great if you're trying to make this as part of a pipeline for 
your machine learning project or end product. 
But since we're just working with these one off projects at the moment, 
I recommend using dash dash clean and dash dash clean is also a way to reconnect. 
If you somehow drop connection or you're on a different account, 
we're going to use dash dash download to tell it to download the model file 
associated with a particular project and we need to give it a name. 
In this case I'm going to call it modelfile.eim and 
that's going to download it in the current directory, from here it's going to ask you 
to sign into your account and then it's going to ask you to pick your project. 
As you can tell, I have a lot of projects that I've worked with. 
I'm looking for electronic components DNN. 
And that's the one where we train the model on the electronic component images 
and we're using a basic dense neural network for that. 
If you do an LS, 
you should see a .eimmodelfile stored in that local directory. 
Note that this model file is not just your model. 
It also contains the DSP code, the feature extraction and a number of optimizations, 
but it should work on most Linux distributions, 
which makes it really good for single board computers. 
I'm going to show you code in another editor. 
So hopefully it's a little bit cleaner on your screen or monitor so 
that you can see what I'm entering. 
But don't worry, this is being saved to my raspberry pi. 
You can see it's going to projects electronic components, DNN and 
DNN in static features test.py. 
One of the things I like to do before just throwing raw images from a camera 
into the model to perform inference. 
I like to test from known good features and 
compare it to how the model performed in wherever we trained it. 
In this case it would be edge impulse. 
So I'm going to go into edge impulse. 
I'm going to log in once logged in, I'm going to to go to my electronic components 
CNN project the one that we created earlier. 
And I'm going to go to model testing. 
This is my test set and all the components or images of components in the test set. 
And I'm going to pick one say capacitor and this successfully 
classified it as a capacitor and I can say show classification.
Play video starting at :8:33 and follow transcript8:33
It will perform classification on the server and give me the output. 
In this case it's showing me that 0% Background, capacitor and so on. 
This is the output of the model that I'm looking for. 
The probabilities that it belongs to each of the classes. 
Not just that it thinks it's a capacitor. 
What I care about are these numbers Mhm And edge impulse gives us raw features. 
These are the things that are going into the model, which is what we want to copy. 
This should look very much like our pixel values. 
Just normalize to be between zero and one. 
I'm going to click copy to get all of them and 
we can start creating our code at the top. 
I'm going to import os time. 
I'm also going to import the edge impulse Lennox runner. 
The impulse runner module is what we need from the SDK and 
we're going to point it to the location of our model file feel free to make these 
parameters or whatever for your own program but to keep things simple, 
I'm going to do it just as global variables in this script. 
This name should line up with the model file that we just downloaded and 
here we are going to copy in our feature. 
So I'm just going to paste in our features. 
Yes, it looks really nasty. 
You could put this in a different file and read in the file if you want to. 
But once again, to keep this simple, 
I'm just going to make it one long variable here. 
Next I like to print something to the console just to let me know that it's 
working. 
We're only going to be performing inference once just to make sure that 
this input of features, outputs those exact probabilities or 
something close to what we saw in edge impulse. 
You'll notice that the impulse runner module will attempt to load files 
from wherever this module is stored on the file system. 
We wanted to load the model file that I am that is relative to this particular 
program. 
So we have to call these os dot path functions in order to make that happen. 
So that the module knows where to look when trying to find that model file. 
Then we want to load the model file and to do that. 
We just call edge impulse and the path to the model file and that's going to load 
it in load all the DSP code the feature, extraction the model itself and 
get that ready for inference when working with that model. 
I like to put everything in a try catch or 
try except block in case something fails, we can see what's going on. 
And to begin, we're going to print out the model information, the model that 
dot EIM file has information about the model, including the name and 
who created it, which is based on what you entered an edge impulse. 
The next thing we're going to do is time how long inference takes And 
the magical inference function is right here Runner dot classify and 
we're going to send it features and you notice that features is a list of numbers. 
It's these normalized Pixel values. 
It should be 700 and 84 values in a flat one dimensional list. 
And we're going to take that list and send it to runner dot classified. 
The result is what we're going to care about. 
But in addition to doing that, 
I'm also going to time how long this particular function takes.
Play video starting at :11:45 and follow transcript11:45
Next up, we're going to display the predictions to do that. 
The result is returned as an object and 
it's got a number of fields in it because it works as a dictionary. 
So we want the result from that object and 
we want the classification from that particular entry. 
Those are going to become our predictions and all I'm going to do is just loop 
through those the labels are stored in the predictions themselves. 
So we don't need to worry about keeping a list somewhere. 
It's going to first print the key and then the value of that key for 
each of the predictions which should be the label and then the probability. 
Next up, I'm going to print out how long it took to perform that inference and 
remember it's not any of the manipulations of the data before then it's just this 
function.
Play video starting at :12:34 and follow transcript12:34
And then we're going to print out how long the timing took according to 
the edge impulse model itself because that's keeping track of time and 
we can find that in the timing entry of the result object. 
Whether or not this try except block works. 
I'm not going to have an except block in case something fails, 
it's just going to exit anyway. 
But either way it's going to call runner dot stop in order to stop that tool from 
running in the background. 
Go ahead and save this and let me return to my raspberry pi. 
And I'm going to do an LS and you can see there is a static features, 
python script that has been created and I'm going to run that. 
You can see it successfully ran. 
It's going to give me the probabilities of each label including the background 
capacitor diode led and resistor. 
You can see that capacitor is the highest at 98% and 
then the next one after that is it may be kind of thinks it could be an L. 
E D but at a very, very low percentage. 
So this is 98% and about 2%. 
And if we go back to our Edge impulse project, this is the sample that we used. 
And sure enough it lines up with background capacitor diode led with 
L E D being the one that it possibly could be. 
So those predictions lined up, which means that hopefully our model is working as 
intended as we trained it and it's capable of taking in a known good test sample and 
giving us known good predictions that match up with what we saw in Edge impulse. 
Now we are ready for live inference. 
I'm going to clear my console so you can see that from scratch to do live capture. 
We're going to need some python libraries and to make sure you can see the code. 
I'm going to change the coloring to python. 
We're going to need Os sis and time like we've done before, 
we need open C V, which is C V two in python. 
That's the python wrapper for open C V will also need Numpy to help us with some 
of the re sizing and flattening the data. 
We're going to need pie camera just as we've done for the pie camera capture. 
And we're going to need the edge impulse Lennox Runner like we did for 
the static Inference in order to perform Inference on live images. 
Next I'm going to define some settings like we did before we've got the model 
file that we will import. 
We can choose to draw the frame rate on the screen. 
I'm going to set the resolution of the captured image to 96 by 96 but 
this is going to be re scaled to 28 by 28. 
Next, as we've done before, 
we're going to point our impulse runner to the model path. 
In this case we're using the basic runner, 
there is an image impulse runner that you can use. 
But since we're going to be manipulating and 
resizing our images to a one dimensional feature array, 
we're going to use the basic dense neural network impulse runner in this 
case as it's not going to be performing any image manipulation for us. 
We just want to use the edge impulse library here to perform inference. 
Not feature extraction. 
We will look at using the image impulse runner later when we have edge impulse 
perform that feature extraction for us. 
Keep in mind that this is still all being done locally on the pie. 
We're not sending any data to the edge impulse servers. 
Next, we're going to initialize the model and 
print out any error codes in case something happens where it can't 
initialize it I'm going to continually print the frame rate to our preview, 
so you get an idea of how fast this is running.
Play video starting at :16:6 and follow transcript16:06
Then I'm going to start the camera and configure the camera with our resolution 
width and height, which should be 96 by 96. 
This is a pretty small preview but it should be good enough for what we need. 
This will be displayed in the preview but 
keep in mind that we're going to be resizing this to 28 by 28. 
Next up, in order to provide a continuous frame capture from the pi camera, 
we're going to just create the stream basically, 
anytime we see a frame from the capture continuous function. 
The code in this for loop gets run, this basically turns into a while true loop and 
frame is just a new frame from the camera. 
Next up, we're going to get a time stamp so we can compute the frame rate, 
that's if you choose to display it.
Play video starting at :16:54 and follow transcript16:54
Then we're going to get the image NumPy array from our frame, so 
our image at this point is a collection of numbers. 
In fact we can print this to the screen to get an idea of 
what's contained in that NumPy array. 
I'm going to print the very first value which should be an RGB value and 
I'm going to print everything in there, so there should be three numbers that we see. 
To get this to work with a preview, we need to call cv2.imshow, 
which displays the image in the preview. 
We need to clear the stream after each time we display it, 
the preview will not work without this line. 
I'm going to calculate the frame right here, 
which we're not going to be printing to the screen right now. 
These lines are just something that's needed in order to get everything 
else to work. 
And then I also need to call cv2.waitkey and this allows me to press some key, 
in this case, I'm going to allow the user to press q if they have the preview 
window highlighted in order to exit out of the preview.
Play video starting at :17:58 and follow transcript17:58
If they do press q then all of the windows will be destroyed and 
the program will exit.
Play video starting at :18:5 and follow transcript18:05
You can also push CTRL C on the command line that's running 
the script in order to exit, so you have both options. 
Note that if you just try to click close on the preview window, it will close 
the preview window and a new one will open as we've not exited this for loop.
Play video starting at :18:20 and follow transcript18:20
Let's save this on our Pi, I have my Pi map to a network location, 
I'm going to go into projects, my components, dense neural network and 
I'm going to save it as DNN live inference Pi camera. 
Since that was the first one I was working with, I'm going to overwrite that.
Play video starting at :18:41 and follow transcript18:41
Say yes and let's look on our pi. 
On my pi I'm going to go into my project directory, and then I'm going to go into 
my electronic components DNN, for the dense neural network project.
Play video starting at :18:56 and follow transcript18:56
From there, I'm going to run the script that I just made which is DNN live 
inference for the Pi camera.
Play video starting at :19:4 and follow transcript19:04
When I run that, you should see a preview window pop up and this is the very first 
pixel, should be the upper left corner here and that is showing me the values. 
In this case, coming directly from the pi camera library, 
this pixel value is an RGB value, that is 8 bits. 
So each of these values is red, green blue, 
although in this case it might be blue green red depending on the order. 
But if I move the camera around you can see that, 
it goes from 0 to 255 depending on the white value and 
there is a little bit of auto gain or auto balancing that goes on here. 
So if I make it something completely black, it should be all zeros, 
completely white is 255, and let's go to something red in that upper right corner. 
I got my LED here, so it does look like it's BGR, so 
it's going to be blue, green, red. 
So if I put in something blue in this upper corner here, 
you can see how that first value goes up, so this is blue, green, red in that order. 
To get out of this, 
you can push CTRL C on the script that's running in the console or 
you can just push the q key if you have the frame preview highlighted. 
And from there, let's go back to our script to add the inference part.
Play video starting at :20:25 and follow transcript20:25
We no longer want to print out the raw RGB or BGR values, 
so let's get rid of that line.
Play video starting at :20:32 and follow transcript20:32
And these next few lines are going to resize, reshape, 
recolor that image into a feature vector that the model expects. 
In this case we're going to start by resizing the image to 28 by 28, 
which is given by these parameters that we set up top here. 
And by the way, feel free to change all of this to arguments if you want to turn this 
script into something like a tool. 
I'm just putting them as constants in our settings here just to save some code 
space, but you are welcome to turn them into arguments. 
I just wanted to make the script a little easier to read, in this next line, 
we're going to convert the image to Grayscale. 
To do that, we call the open cv color RGB2GRAY, but 
we actually don't want RGB here as we just found out those values RBGR. 
And because it waits the individual colors a little differently in order to 
create the grayscale image, we want to use BGR.
Play video starting at :21:27 and follow transcript21:27
Next we're going to take that two dimensional array that represents our 
image and convert it to a 1D vector. 
So those 28 rows, just get stacked next to each other to create one long vector. 
And then we also want to divide each value by 255 to create 
a floating point number between zero and one. 
This becomes our feature vector that gets fed into the model.
Play video starting at :21:52 and follow transcript21:52
Note that edge impulse doesn't work exactly with a NumPy array, 
it wants to use the numbers in a list format. 
So we have to convert them to a list in order to feed them to the edge impulse 
model, from there, we're going to perform inference. 
All we need to do is call our runner.classify and 
then send it the features, which is just a list containing our grayscale values 
in floating point format between zero and one. 
If this fails, for some reason, I'm going to print out an exception so 
that we can see what's going on and then I'm going to display the raw result. 
This will help you see what's going on so that we can extract the correct values. 
Let's save this and go back to our pi, we'll run it again, 
and we should see a direct output from our inference. 
I'm going to stop this so we can read it, let's scroll up and 
you can see one of the outputs here.
Play video starting at :22:50 and follow transcript22:50
This is very much like a JSON format or 
a Python dictionary that allows us to index into it using these keys. 
For example, we can just call result in order to get this key and 
then classification to get the set of keys here and 
then we can call our label to get the value of that label. 
In this case, since our camera was looking at the background, 
I expect to see the background confidence score or the probability b close to one, 
which is what we saw here. 
This is all you would need to get started if you just want these numbers to make 
whatever you need to do with your embedded system. 
However, for the purposes of testing, let's print these results, 
let's find the highest label and print that along 
with its confidence score to the preview much like we do for the frame rate. 
So from here, I'm going to head back to our code. 
You'll notice up top that I set the results to none in case 
this inference fails and results will still be none when it comes down here. 
So, to deal with that, I need to make sure that the results are not none, in order to 
continue Ideally nothing should be printed here, you should just see a blank output. 
But if inference fails you should also see an error printed to the screen. 
Next I'm going to go through all of the predictions found in the results. 
And remember that we can index into these so I need to index into result and 
then from that value index into classification. 
That should give me a list of the predictions which includes all of my 
labels and their predicted probability values. 
I'm going to loop through all of those and then find the max_label and max_val. 
Next I'm going to draw the predicted label on the bottom of the preview.
Play video starting at :24:38 and follow transcript24:38
I need to pass it in the image that I'm going to draw 
the max_label that we just found here. 
0 is x of 0 and then the resolution height -20 which means I'm going to 
leave space for the probability value to be printed. 
And in this case for putText, the y value here is the bottom left point on the text. 
But you can probably just play around with these numbers to figure out where the text 
gets drawn on the screen or in that window. 
Then I'm going to draw the predicted class's confidence score, 
in this case it's the probability since we put it through a soft max function. 
I'm going to rounded to 2 decimal places. 
In this case instead of -20 I'm going to use -2 pixels from 
the bottom of that preview window which should give me enough space for 
this HERSHEY_PLAIN text that we're using from open cv. 
Note that in both of these cases I'm drawing the text in white which is given 
by this 255, 255, 255 value. 
And that should be good enough to give me my predicted classes and 
its confidence score. 
In addition to doing this I always want to draw the frame rate on the screen, 
assuming we have draw FPS set to true which we do up in my settings here. 
And this should put the frames per second or framerate at the top of the screen. 
So we should see both the framerate, how fast inferences being performed 
along with how fast we can capture the images and convert them to a feature set. 
Let's save this and run it on the Pie. 
Note that what I'm going to show you requires the Raspberry Pi Camera. 
It's possible to convert this demo to another Lennox based device or 
using a webcam but it will require some effort. 
To do this next test I'm going to hold the Pi Camera above my electronic components 
on a white sheet of paper which should match how I captured the training set. 
Note that lighting can matter so 
I have a desk lamp shining light onto the components. 
When we run this again you should see the preview window show up in addition 
to the frames per second. 
You should see the highest predicted label shown at the bottom along with its 
prediction score. 
And this is ideally the probability of that class versus the others. 
So now I'm going to move the camera around and 
the white background should auto gain balance to something more grayish so 
you can hopefully continue to read those values. 
And if you notice here I've got a diode I'm going to bring it close to my diet and 
it should predict it. 
However, it does confuse it with a resistor but 
if I hold it in the exact right spot it thinks it's a diode. 
And if I move it around it confuses it with the background or 
another resistor which means the dense neural network is giving a lot of 
weight to certain pixel values. 
Probably something in the center as you see when I move the diode 
off center it thinks it's background. 
And at certain distances or 
certain positions it confuses it with a resistor or even a capacitor. 
The other thing to note here is if you have bad lighting,
Play video starting at :27:55 and follow transcript27:55
Sometimes that can affect things. 
I have a lamp shining on my components here, so 
you can see I turn the lamp on and off. 
However, the auto gain of the Pie Camera does adjust and 
it still allows it to mostly work. 
I'm going to leave this light on and 
then move it to my next component which is my LED. 
You can see here it does a good job of seeing that it's an LED. 
However if I turn it off axis it confuses it with a capacitor or 
if I move too far away, so your components need to have a particular lighting. 
They need to be a certain distance from the camera and 
a certain orientation in the frame. 
This is very hard to control for with the data set and 
the neural network that we're using in this example. 
Here is a resistor, I need to move my page around a little bit and 
in fact my network struggles a bit with the resistor. 
It thinks it's a capacitor in a lot of cases and it really confuses it for 
a diode depending on how I have it positioned. 
It really struggles with this,
Play video starting at :29:5 and follow transcript29:05
I can't even find a good position for it to detect it. 
And then if we move to capacitor it has no problem detecting the capacitor. 
I think it's because it's such a unique shape that the pixel values in 
certain areas mean certain things that the neural network picks up on. 
But it struggles with this resistor, it confuses resistors and 
diodes just because the leads are coming out on either side. 
There we go. 
It thinks the diet is a resistor when it's over to the left side and 
when this is over to the left side it struggles, it's still really struggles. 
And this might be an issue with I need to focus my camera little better but 
I can probably find a very particular position where the resistor is in 
the frame and it thinks it's a resistor. 
As I mentioned, it's looking for very particular pixel values in certain spots. 
So it's not really good at detecting overall shapes. 
It's really bad at generalizing beyond our training sets. 
Even if we use something like data augmentation, 
which makes dense neural networks not a great fit for doing image classification 
on live data as you can see by this example I'm showing here. 
In the next section, we'll get into creating better models for 
image recognition. 

I showed you how to perform inference with 
a trained model on a Raspberry Pi. 
We're going to take that knowledge and 
apply it to a microcontroller. 
We'll continue to use 
MicroPython as it's easy to see what's going on. 
While the model might fit onto 
the flash memory of the OpenMV Cam, 
I recommend using an SD card 
as it has a lot more space to work with. 
This will hopefully guarantee that 
the model and code all fits. 
The OpenMV Cam will enumerate the SD card on 
our computer so we can just write files directly to it. 
Make sure you've removed any images you have 
saved on there from the previous capture step. 
For what we're going to be doing, 
there is no real Edge Impulse library that we can 
just pour directly to our OpenMV camera. 
That's because we did a dense neural network 
and there's no convolutional side of that. 
Edge Impulse doesn't directly support this, 
but we're going to make it work as 
a demonstration to show how you can 
perform inference with a dense neural network for images. 
Sign in, go to your project, 
you can click out of the data selection pop-up. 
What we're actually looking for is 
the trained model file that's on the dashboard. 
We have a few options here, 
all of them are the TensorFlow Lite model 
files that we're going to use. 
Even though we train the model to 
work with floating-point numbers, 
the OpenMV library that we're going to use to feed 
this network is expecting 
eight-bit quantized values as the input. 
As a result, we actually want to download 
the TensorFlow Lite model that has 
been quantized to eight bit. 
Click the 'Download button' there, 
and one thing I like to do is use 
a program like netron.app. 
We can open the model file,
Play video starting at :1:55 and follow transcript1:55
find where you downloaded it, 
click to open it, 
and Netron lets' us view the model. 
Here we can click on the individual layers 
and it will tell us what 
the inputs and outputs are 
and how some of the things are connected. 
Our input needs to be one long vector; 
784 elements of eight-bit values. 
Even though it gives us a formula 
here for that quantization, 
we're going to feed it 0-255 values 
and hope it works for our demonstration purposes. 
If we were really doing due diligence here, 
we would make sure that 
this quantization works going from 
our floating-point pixel values 
to the correct grayscale mapping. 
But since it can only be 0-255 anyway, 
we're going to hope that this works. 
Feel free to click on the other layers 
to see how things are connected. 
Ideally, this model has been fully quantized, 
which means even the parameters themselves are 
quantized to eight bits to make 
the math a lot faster on hardware that does 
not support floating-point operations. 
It looks like the output is also fully quantized, 
so we may have to convert that to 
floating-point numbers to get our probabilities. 
Now that we have the model downloaded, 
we can go to our downloads, 
we can copy it, 
and we're going to go to our SD card 
where we keep our main. 
You want to delete everything else on 
here other than your main Python script, 
and we're going to paste in our model. 
I recommend giving it a name like trained.tflite; 
so.tflite is the official tflite model file 
, you'll also see.lite. 
I've seen both of those 
;.light seems to be the newer one, 
but either.tflite or.light will work. 
We also want a label's file, 
and these are just your labels in alphabetical order, 
the same way that we presented them to Edge Impulse, 
separated by a new line. 
We have background, 
capacitor, diode, led, and resistor. 
Just save that file and we should be good to go. 
Now we can write our program 
using MicroPython in OpenMV. 
Before running live inference using the camera, 
I like to check to make sure the model is 
working and performing as intended. 
We can copy the features from our project in 
Edge Impulse and then paste 
them into a script that will read them, 
send them to the model for inference, 
and output the probabilities. 
Those probabilities should match 
what Edge Impulse tells us 
it believes the probabilities are from, 
say, one of the test samples. 
To start, I'm going to delete all of the Beginning code, 
and we're going to import the, image, 
time, and TensorFlow libraries. 
Then we're going to use some settings here. 
The first is the location of the.tflite file; 
make sure this matches up with 
what we saved on the SD card, 
so trained.tflite and we also have labels.txt, 
which we're going to read for our labels. 
There's a constant we'll be using later on to 
set a grayscale instead of RGB values, 
and that is 2. 
We're going to set the image color to 2, 
to tell it to do grayscale.
Play video starting at :5:15 and follow transcript5:15
Next we're going to copy in the features, 
and this is just going to be an array.
Play video starting at :5:23 and follow transcript5:23
Go into your Edge Impulse project, 
go to Model Testing. 
We're going to look at one of the test samples, 
so let's take capacitor here, 
and click Show Classification. 
This will cause Edge Impulse to 
classify this individual sample. 
It will also give us the raw features 
that went to the model to perform inference, 
so we're just going to click the button to copy that, 
and we're going to paste that into 
our features list here. 
Make sure you close out that array. 
Now we can perform inference with 
this static list of features.
Play video starting at :6:1 and follow transcript6:01
I like to say something to the console first just 
to let me know that the program is running. 
We're then going to strip the new line out of 
the text file that contains our labels 
and we're going to convert them into 
a list that is stored in the labels variable.
Play video starting at :6:20 and follow transcript6:20
Next, we're going to create a blank image. 
Note that because we're not running 
the sensor or the camera, 
we're going to create an empty image and 
then shove the features into that empty image. 
In this case, it's 784 long array. 
That's one deep. 
This is going to create a container for our features. 
This is a bit of a hack or a kluge way to 
send features to the TensorFlow Lite model for inference. 
But this is how the open envy TensorFlow Lite library 
expects features to be fed to 
it in an image format like this. 
Next, we're going to fill those pixels in 
that empty image container with our feature set. 
We need to convert them to eight bit values. 
We're going to take each of the floating point numbers, 
multiply it by 255, 
make sure it doesn't go below 0 or above 255, 
and we're going to put them into 
the individual pixel elements 
of this blank image. 
I realize there's a bit of a bug in this demo here, 
so we're going to change it to 
pixel value to make sure that 
the pixel value is being constrained between 0 and 255.
Play video starting at :7:39 and follow transcript7:39
Next, we're going to run inference, 
and this is just by calling tf.classify. 
We're going to send it the model file, 
which is just the location of the.tf lite file, 
as well as this image that now has our features in it. 
Note that the TensorFlow library in 
open MV gives us back a number of objects. 
The idea is to use this to look for a number of images 
or classify or detect 
images or objects in a larger image, 
and it would return to us a list of objects. 
However, we're only expecting one object to be returned. 
We're just going to take that very first object 
and then grab the output, 
which should be our predictions containing 
the probabilities per label.
Play video starting at :8:33 and follow transcript8:33
Then we're going to print out 
those probabilities along with 
how long it took to perform inference, 
which we timed here. 
Then just print a new line so we can 
separate it when we're looking at the console. 
Let's connect our open MV. 
I'm going to bring up a serial terminal 
and I'm going to run the code.
Play video starting at :9: and follow transcript9:00
Here it looks like inference is working, 
we've got our labels. 
Unfortunately looks like it came with 
a return character which we can deal with later. 
But for now what we care about are these predictions. 
These are not exactly the same as what we saw in 
our project 98 for capacitor 0.02 for LED, 
and we've got 97 and 0.04, 
0.03, which is very 
close and in fact probably 
doable for what we're looking to do. 
It did correctly predict that as a capacitor, 
which means it's working. 
The reason it's not exact is likely because we're losing 
some information when we convert 
the floating point values to 8-bit values. 
Then those are being fed to 
a completely quantized network which 
maybe changing some of the original values. 
But it does look like it's working. 
We're getting something very 
close and we're going to call 
this good enough for this demonstration. 
Now that we proved that our model can work, 
at least with one of the samples, 
I highly recommend trying it with 
multiple tests samples to make sure you're getting 
the values you expect or something close enough to 
where you're confident the model 
will work for deployment. 
Now, we're able to go back, 
delete what we have written here. 
I would recommend saving it, 
but we can now implement 
live inference in our open MV cam. 
To start, we're going to need very similar packages, 
but we're going to also need the 
sensor package this time, 
since we're capturing things from 
the camera, we need a few settings. 
We're going to set the width and height 
of our frame to now be 28 by 28. 
Since that is what's going to our model, 
I did not find a good way to resize images in open MV, 
which might mean you're going to 
have to adjust the focus on 
your lens to be able to capture something that's small. 
We're going to set the width and height so that 
even though it's taking a larger image, 
the only thing that's being recorded 
is the 28 by 28 image here.
Play video starting at :11: and follow transcript11:00
Next, we're going to extract labels 
from the label file, 
and I learned my lesson last time and added 
a strip for any returning characters that it might find. 
While it's not necessary, 
I'm going to add the timer here so that you can look at 
FPS values if you care to 
see how fast inferences running. 
But note that we are going to be doing 
something a little wonky. 
We're going to create an empty image container 
here so that we can flatten 
our array that we get from 
capturing the image from the camera. 
We're going to flatten that to a one-dimensional vector, 
store it into this copy, 
and then send that to the model for inference. 
This is not a great way to do it. 
It's really not very efficient, 
but it lets you see 
what's going on when we're working with 
dense neural networks as opposed to something 
like a convolutional neural network.
Play video starting at :11:50 and follow transcript11:50
Here is our main while loop. 
I'm going to have an inference counter so 
that we don't spam 
the serial terminal when we output our probabilities. 
The tick-timer is useful for 
measuring frames per second, 
so it's a good idea to update that if you 
plan to print out frames per second. 
We're going to first capture an image from the camera, 
this gives us our raw data in grayscale format, 
that's eight-bit values for each of the pixels.
Play video starting at :12:21 and follow transcript12:21
Then we're going to go through each of those and flatten 
that two-dimensional array that 
we get from the image here, 
and we're going to store it in the empty image container 
that's given by image flat. 
We're going to loop through each pixel in 
this and then store it into 
a one-dimensional vector here, 
so that the y-value is always zero. 
Then we're going to send this flattened image 
to our model, 
which the tf.classify reads in 
the model file and sends it this as input.
Play video starting at :12:53 and follow transcript12:53
Just as we saw with the static test, 
we should only get one object out, 
since we're only doing one classification 
per image capture, 
and we're going to get the output probabilities 
of that to print out as our predictions. 
Next, we want to find 
the label with the highest probability, 
so we're going to look at our predictions, 
find the maximum value, 
and then use that to get the index of our labels. 
I'm going to draw the label with 
the highest probability along 
with the value to our viewfinder up here, 
note that this is going to become 28 by 28 pixels, 
so some of the labels may 
end up getting truncated, but that's okay. 
It allows us to see what's going on. 
Feel free to remove this if you 
don't like seeing it up here in the viewfinder, 
or another thing you can do is capture a larger image, 
draw a little viewfinder, 
and only perform inferences 
on that window on the viewfinder 
so you can see everything else surrounding your target. 
But to keep things simple, 
I'm just cropping into 28 by 28.
Play video starting at :14:1 and follow transcript14:01
I want to print out all of 
the probabilities to the console, 
but I noticed that it streams by a little too quickly, 
to correct that I'm only going to print them 
out once every 10 inferences.
Play video starting at :14:19 and follow transcript14:19
It's just going to be the label from the label's file 
along with its associated probability number.
Play video starting at :14:29 and follow transcript14:29
Then finally, you can uncomment this if you want to 
get a reading of how fast inference is being performed. 
Keep in mind that also includes capturing an image and 
then manually copying it to an empty image buffer, 
which is not very efficient, 
but we can see how well this is performing. 
Let's go ahead and run this. 
Give it a moment. You can 
see it's performing the predictions here. 
It's also giving us the highest probability 
along with that associated label. 
I'm going to take my camera and I'm going to 
point it at an electronic component. 
You can see it's not really graded. 
Doesn't know what this is. 
I can adjust my lens by twisting the lens cap to get 
a better focus for these 28 by 28 pixels. 
If I hold it right here, 
you can see that it is showing capacitor, 
but if I move it around too much, 
it says diode LED, it's not sure. 
I have to hold it in 
the exact position that 
mimics what was in that training set, 
and you can get the sense that a 
dense neural network is really 
maybe not the best model to use for image classification. 
Here it is identifying the LED. 
But once again, if I move it a little off of the frame, 
it's not quite sure what to do, 
and that's because really the dense neural network is 
looking for individual pixels in particular places, 
and then saying, "Oh, if I see a dark pixel here, 
and a light pixel there, that's probably an LED." 
It's not really looking for shapes, 
which is what we're going to get into 
in the next section. 
You might think this is a resistor, 
but it's dark enough that it's thinking 
whenever I see this dark blob 
somewhere in the middle here, 
I can correctly classify it as a diode. 
Then the same thing happens here. 
It's a dark blob, 
but I need to get a little bit 
closer and it starts to figure out that, 
it's a little lighter, which means 
it's actually a resistor. 
Of course the white background is just the background. 
But if I start to aim 
the camera at other things around my desk, 
maybe a darker background, 
it's not very sure. 
We never trained it on 
backgrounds that weren't a white sheet of paper. 
That's how you can see that it's not looking for 
shapes so much as pixel values.
Play video starting at :16:42 and follow transcript16:42
Just for fun. Let's go ahead and 
figure out how fast inference is performing. 
I'm going to comment out the probabilities down 
here and print out the frames per second. 
I'm going to let this fire up again. 
You can see, even though I'm doing 
this very inefficient copying of an array, 
I'm getting something close to 60 frames per second. 
That means inference is performing 
very quickly on this microcontroller, 
which includes MicroPython and all this other overhead. 
It's still quite impressive 
that we can do something like this. 
In the next section, we're going to be looking 
at optimizing some of this, 
along with examining other models like 
a convolutional neural network 
that should perform better than 
a dense neural network because it can help us identify 
individual shapes and various features 
in the images themselves, 
rather than just looking for collection of pixel values.

## Edge Impulse and OpenMV Documentation
## Project - Deploy DNN Image Classifier
Introduction
Now that we have a trained model, let’s deploy it to one of our embedded systems. We will use the model you trained on Edge Impulse with the dataset you collected in previous projects.
You have 2 options for this project: performing inference on a Raspberry Pi or performing inference on an OpenMV Camera.
Required Hardware
You will need one of the following hardware setups for this section:
•	Raspberry Pi 4, SD card, Pi Camera
•	OpenMV Cam H7 or H7 Plus (with SD card)
Note: You are welcome to try other embedded systems not listed here. However, I cannot promise that they will work and I likely will not be able to help you troubleshoot any issues you may come across.
Option 1: Live Inference on Raspberry Pi
If you are using a Raspberry Pi, you will want to follow these instructions. More information about installing the Edge Impulse runner on Linux can be found here, and information about installing the Edge Impulse Python package can be found here.
Open a terminal and enter the following commands to install the Edge Impulse command line tools and required libraries:
curl -sL https://deb.nodesource.com/setup_12.x | sudo bash -
sudo apt install -y gcc g++ make build-essential nodejs sox gstreamer1.0-tools gstreamer1.0-plugins-good gstreamer1.0-plugins-base gstreamer1.0-plugins-base-apps
npm config set user root && sudo npm install edge-impulse-linux -g --unsafe-perm sudo apt install -y libatlas-base-dev libportaudio0 libportaudio2 libportaudiocpp0 portaudio19-dev 
python -m pip install edge_impulse_linux -i https://pypi.python.org/simple
We also need the Python version of OpenCV (which comes with Numpy) to assist with various image manipulation operations. If you have not done so already, install OpenCV with pip:
python -m pip install opencv-python
Create a folder to hold your program and model file:
mkdir -p Projects/image-classification-dnn
cd Projects/image-classification-dnn
Use the Edge Impulse tool to download your model file. 
edge-impulse-linux-runner --clean --download modelfile.eim
Note: The model file we get is a .eim file that includes both the model and any necessary feature extraction methods. For now, those feature extraction methods just pass raw data to the model for inference.
Your job is to create a program that continuously captures images from the camera (e.g. Pi Camera), adjusts the images as necessary, and performs inference to classify the image. You should output the predicted class to either the console or draw it on the preview window.
Try pointing your camera at your target class objects to see if the model is able to classify them correctly.
 
Here are some pointers to help get you started:
You are welcome to reference the Inference on a Single Board Computer lecture to see how to do this.
from edge_impulse_linux.runner import ImpulseRunner
 
# Settings
model_file = "modelfile.eim"            # Trained ML model from Edge Impulse
draw_fps = True                         # Draw FPS on screen
res_width = 96                          # Resolution of camera (width)
res_height = 96                         # Resolution of camera (height)
img_width = 28                          # Resize width to this for inference
img_height = 28                         # Resize height to this for inference
 
# The ImpulseRunner module will attempt to load files relative to its location,
# so we make it load files relative to this program instead
dir_path = os.path.dirname(os.path.realpath(__file__))
model_path = os.path.join(dir_path, model_file)
 
# Load the model file
runner = ImpulseRunner(model_path)
 
# Initialize model
try:
 
    # Print model information
    model_info = runner.init()
    print("Model name:", model_info['project']['name'])
    print("Model owner:", model_info['project']['owner'])
    
# Exit if we cannot initialize the model
except Exception as e:
    print("ERROR: Could not initialize model")
    print("Exception:", e)
    if (runner):
            runner.stop()
    sys.exit(1)

### Option 2: Live Inference on OpenMV Cam
If you have been following along with the projects, you should not need to install anything new for OpenMV. However, you will need to download the raw TensorFlow Lite model from your Edge Impulse project.
Head to the Dashboard of your Edge Impulse project. Scroll down, and you should see a list of available raw models.
Note: These models do not contain the feature extraction code, as they are just files containing the trained machine learning model. You should only use the models listed here if you plan to write your own feature extraction section.
Download the TensorFlow Lite (int8 quantized) model.
 
Copy this .tflite file to the root directory of the SD card on your OpenMV (note that you may not need to use an SD card for this part if your OpenMV Cam has enough internal storage for the model file and your code).
I recommend renaming the model to something like “trained.tflite” (either way, you’ll need to know the name of the file so you can load it in your program).
Create a new text document and list out your classes in alphabetical order, with each class on a separate line.
 
Save this file with a name like “labels.txt.” It should be saved in the root directory of the SD card.
 
In a new OpenMV IDE document, create a program that continuously captures images from the camera, adjusts the images as necessary, and performs inference to classify the image. You should output the predicted class to either the console or draw it on the preview window.


objs = tf.classify(model_file, img_flat)
Conclusion
As you point the camera at your various objects, how accurate is your model in the real world? Are you surprised by this outcome?
You may find that your model is not very effective unless the object is framed in a very particular way and with very particular lighting. That’s because we’re using a simple dense neural network and only a few training samples. In future lectures, we’ll see how we can use convolutional neural networks and data augmentation to create a more robust model that works even when the object is not perfectly framed like one of the training samples.

Most grayscale images are made up of an array of 8 bit values where each 
value is a number between 0 and 255, denoting the amount of white or 
brightness in that pixel. 
We can normalize these values to be between 0 and 
1 in order to express the brightness as a percentage of the amount of white. 
These pixels are combined to create a grayscale image like the one shown here, 
we express the resolution of the image as with by height in number of pixels. 
We also introduced the concept of bit depth, 
which gives the number of bits used to describe each pixel. 
We can use a collection of three arrays to create a color image where each array 
is a color channel, one for red, one for green, and one for blue. 
When combined, we can create nearly every visible color. 
You will often find these color channels as having a bit depth of 8 bits each, 
which means a color image would have a bit depth of 24 bits. 
We then introduce the concept of image classification and how a machine learning 
model can be trained to recognize what an image is showing on the whole. 
This can be a binary classifier where it tries to identify between two different 
images or determine one class versus everything else. 
We also have multi class classification where a model attempts to 
pick the correct label for an image from more than two options. 
While there are many machine learning algorithms and models out there, we will 
focus on using neural networks to perform image classification and object detection. 
In this course we can do basic image classification with a neural network by 
flattening an image to a one dimensional vector and 
feeding those values as inputs to the first set of nodes in the network. 
We then talked briefly about how a neural network works and how we can train one. 
A single note in a neural network consists of a weighted sum of inputs, followed by 
some non-linear activation function that transforms the single numerical output. 
Training consists of feeding a sample from the training 
set through the network to get a predicted value. 
We then compare the predicted output to the known good output value of 
the training sample and use the back propagation algorithm to automatically 
update the weights in each of the nodes. 
We continue this process with all of the training samples until the network is 
correctly predicting labels to an acceptable level. 
We can look at the accuracy of the model as it trains over time to 
see how well it's performing and if there's under fitting or overfitting. 
If there's under fitting, it means the model is not generalizing well and 
we often need to get more training data, try different features, or 
adjust the complexity of the model. 
If it's overfitting, we can try early stopping, gathering more data, 
reduce the model complexity, or 
try some techniques like regularization or drop out layers to reduce overfitting. 
In addition to accuracy, we can look at the loss of the training and 
validation sets as the model trains. 
These plots should look something like the inverse of the accuracy. 
Where accuracy should increase over time as the model learns, 
we expect the loss to fall as loss is a measure of how different the predicted 
values were from the ground truth values. 
A confusion matrix can help us visualize how well a model performs by comparing 
the predicted labels to the actual labels. 
We can compute per class accuracy is to help give us an idea of where a model 
might be struggling and where to focus our efforts. 
Additionally, we can use the measurements of true positives, true negatives, 
false positives, and false negatives to calculate a number of metrics like 
the F1 score, which can give us greater insights into how the model is performing. 
We saw what a code based training effort might look like using in Google Colab. 
Here we resized a group of images and flatten them to a one dimensional vector. 
We trained a simple dense neural network which gave us about 90% 
accuracy at predicting images of our electronic components. 
We then uploaded the same feature set a collection of one dimensional vectors 
to edge impulse to see how to use that tool to accomplish the same task. 
Finally, we deployed the trained model to two systems. 
The first was a raspberry pi where we could perform live inference on images 
captured with the pie camera. 
The next system was the open MV Camera, where we used micro python running on 
a micro controller to capture images and perform the same inference. 
Output was a set of probabilities and a predicted class. 
We saw how a dense neural network was limited in its ability to predict classes 
from live data as the objects had to have similar lighting, framing, distance and 
focus as the training set. 
The network is looking for 
combinations of pixels to be certain values in order to make its predictions. 
In the next module, we're going to introduce the concept of convolution and 
filters in order to have a model look for things like patterns, edges and shapes. 
The idea is that this will allow us to create a more robust model for 
image classification. 


## Share Your Image Classification Project
what you did for your image classification project:
•	What types of images did you attempt to classify (e.g. what were your classes)?
•	How well did your model perform on the validation and test sets?
•	How well did the model perform on live inference?
•	You may also optionally post any photos of the setup you wish to share!
•	In computer vision and image processing, 
•	convolution is a way to 
•	mathematically filter images to do things like blur, 
•	sharpen or edge detection. 
•	I'll show you examples of those later, 
•	but first, let's see how convolution is performed. 
•	Convolution is the process of adding each element or 
•	pixel in an image to its neighbors weighted by a kernel. 
•	A kernel is a small matrix 
•	that does not change throughout 
•	the convolution operation and 
•	holds the weights for this summation process. 
•	I usually see kernels as three-by-three, 
•	or five-by-five matrices although other sizes do exist, 
•	the numbers in the kernel determine 
•	the type of filtering that we do. 
•	I'll pick these for this example. 
•	Next, we take a section of 
•	pixels the same size as the kernel, 
•	starting with the top left of the image, 
•	each pixel is multiplied 
•	with the corresponding kernel element. 
•	All the values in 
•	this element-wise multiplication are then added 
•	up to give us the first value in 
•	our output image or feature map. 
•	The window then slides over by 
•	one pixel and the operation is performed again. 
•	We use the same kernel, 
•	but a different subsection of the image this time, 
•	this gives us our second value. 
•	The window slides again to give us the third value, 
•	when it reaches the end of the row, 
•	the window slides down one pixel, 
•	and starts again on a new row. 
•	This process continues until we have used all of 
•	the pixels in the original image to 
•	create a new output image. 
•	All of this assumes a stride of one, 
•	which means the window moves over or 
•	down only one pixel at a time. 
•	With a stride of two, 
•	we would start by computing 
•	the weighted sum of the window set in the upper left, 
•	just like we did before, 
•	then we would slide the window over by two pixels, 
•	and repeat the calculation. 
•	We would move the window down by 
•	two pixels as well to repeat, 
•	but since we cannot do that, 
•	the convolution is done, 
•	and we end up with a two by 
•	one array for the output image instead, 
•	you'll also notice that the output image in any of 
•	these cases is smaller than the input image. 
•	There are a few ways we can account for this, 
•	which I'll talk about in a minute, 
•	but first, here's 
•	the general formula for image convolution. 
•	Capital I is the input image matrix, 
•	and capital K is the kernel 
•	we index into each of these arrays, 
•	and some the element-wise multiplication to get 
•	a value for one of the pixels of the output image o. 
•	Often in the math world, 
•	you'll see that matrices and 
•	vectors start with one for the first element. 
•	However, I've shown the equation for 
•	starting with zero as the first index, 
•	as this should be easier to implement in code. 
•	S is the stride, 
•	H which I'll show in a 
•	second is the height of the image, 
•	and W is the width. 
•	Capital M is the number of rows in the kernel, 
•	and capital N is the number of columns. 
•	Lowercase I is the row index of the output image, 
•	and it goes from zero to the floor of the image height 
•	minus the kernel height divided by the stride plus 1. 
•	Lowercase j is the column index of the output image, 
•	and it counts from zero to the floor of 
•	the image width minus 
•	the kernel width divided by the stride plus 1. 
•	Here is an example of 
•	our previous convolution example with 
•	the zero-based indices marked for each matrix, 
•	the output image would be 
•	a two-by-three matrix if we set the stride to one. 
•	Let's say that we're working with 
•	a really odd kernel like this, 
•	which causes the output value to be really large, 
•	if we're working with 
•	eight-bit gray-scale values for our input image, 
•	then we would usually want to work with 
•	eight-bit gray-scale values for the output image. 
•	These values must be between 0 and 255, 
•	so we usually set anything above 255 to 255, 
•	and anything below 0 to 0. 
•	You'll sometimes hear this 
•	referred to as clamping of value, 
•	in this case, we'd be clamping 
•	the value between 0 and 255. 
•	If you're curious, the output image 
•	would look something like this 
•	as all values would just be clipped to 255. 
•	If you're working with floating-point 
•	values for your pixels, 
•	which can sometimes happen in machine learning, 
•	you'll also want to clip or clamp those. 
•	Zero to one is the normal range, 
•	but you might sometimes see negative one to one as well. 
•	Now, let's talk about padding. 
•	Vincent, do Milan made these great animations, 
•	and shared them on GitHub, 
•	I'll use them to demonstrate how padding works. 
•	There are a few types of padding that 
•	you can use on the input matrix, 
•	we've been looking at what's known as valid padding, 
•	which essentially means there's 
•	no padding of the input image, 
•	only the valid elements of the input array are used. 
•	In this instance, the input image is 
•	a four by four matrix and the kernel is three-by-three. 
•	The kernel isn't shown, 
•	but we can assume that 
•	the element-wise multiplication and summation are 
•	occurring to give us the element in 
•	the output array shown at the top with a stride of one, 
•	the output array is a two-by-two matrix. 
•	We saw this example earlier, 
•	where we convolve a four-by-five input matrix 
•	with a three-by-three kernel and use a stride of two. 
•	Notice that while the top three rows are used, 
•	the bottom row is skipped because we would 
•	not be able to move our window down by two, 
•	we only use the elements in the input matrix that are 
•	valid with the given window size and stride setting. 
•	In same padding, we pad 
•	the input matrix with elements in order to 
•	give us an output matrix that is 
•	the same size as the input. 
•	The padded elements are shown by the dotted lines. 
•	For example, here we show 
•	a five-by-five input matrix 
•	with a three-by-three kernel, 
•	and a stride of one with 
•	these elements added around the input matrix, 
•	the output matrix will be five-by-five. 
•	Note that even with padding, 
•	if you set the stride to more than one, 
•	you still may end up with 
•	an output matrix that is smaller than the input matrix. 
•	In some libraries like Kairos, 
•	if you specify same padding, 
•	it will attend to add as many padded rows and columns as 
•	necessary to give you an output matrix 
•	that is the same size as the input matrix. 
•	There are a few ways to fill the padded elements. 
•	One common approach is to just 
•	set all of the padded elements to zero. 
•	This technique is probably the most common and it's 
•	what kairos does when you specify same padding. 
•	Another option is to copy the values on 
•	the edges of the matrix out to the padding. 
•	This has the effect of making 
•	the image look a little bigger than it is 
•	and preventing some skewing of the data at 
•	the borders that might occur if we use zero padding. 
•	In some applications, you don't need to maintain 
•	the information at the borders 
•	so you can get away with no padding. 
•	In fact, it's more 
•	computationally efficient to work without padding. 
•	Not only does it save you convolution steps, 
•	it also gives you a smaller output matrix 
•	which might be beneficial for future steps. 
•	For example, if you're using 
•	this filtered output image as 
•	an input to a neural network, 
•	having a smaller matrix means 
•	fewer input dimensions and 
•	therefore fewer computations in each node. 
•	However, there are times when you want to use padding. 
•	For example, if you're just looking to 
•	filter an image to blur or sharpen it, 
•	then you will likely want to 
•	maintain the same size and shape. 
•	Keeping the same shape also means you can perform 
•	more convolution steps without 
•	continually losing information at the borders. 
•	And sometimes you might 
•	determine that the information at the edges of 
•	the image really is important 
•	to whatever vision project you're doing. 
•	So you need to use padding to keep that relevant data. 
•	If you're working with color images, 
•	you would first break the image 
•	apart into its separate red, 
•	green, and blue color channels. 
•	Each channel would then act like 
•	the grayscale images we were looking at earlier. 
•	A separate kernel would be convolved with 
•	each channel to produce three separate output images. 
•	These output images represent 
•	the new RGB channels and could be 
•	combined to create the new filtered color image. 
•	Note that the kernel can be the same for each channel, 
•	but it doesn't have to be. 
•	You could use different kernel values 
•	for each channel if you wanted to. 
•	This would create some interesting effects 
•	as you might blur one color, 
•	sharpen another and do edge detection on the third. 
•	When working with color images, 
•	you no longer have a simple two-dimensional kernel. 
•	Since you have one kernel for each channel, 
•	you can stack the kernels together 
•	to create a three dimensional array. 
•	In this example, the first three by 
•	three Matrix would be for the red channel. 
•	The second three by three matrix would be for green, 
•	and the third three by three matrix would be for 
•	blue when working with something like NumPy, 
•	SciPy or kairos, 
•	it's often easy to store these kernels together. 
•	So you would likely store them as 
•	a single three by three by three array. 
•	Using convolution, we can create 
•	a variety of filters for our image. 
•	Let's look at a few filters that 
•	have been applied to this elephant photo. 
•	Note that I'm using a small image, 
•	200 by 130 pixels so that 
•	our three by three kernels will have a noticeable effect. 
•	First up is the Gaussian blur. 
•	This simply mixes the pixels in the window 
•	in such a way to soften transitions and edges, 
•	we can also use the kernel we saw 
•	earlier to create a sharpened image. 
•	Note that sharpening an image using this method will 
•	also amplify any noise present in the image. 
•	We can also do this funky emboss effect 
•	to make it look like the Elephant 
•	is carved out of the page. 
•	In addition to forming the basis for 
•	fun effects for things like Instagram filters, 
•	we can use convolution to do edge detection. 
•	The first is a basic outline that picks 
•	out any areas of harsh transition between light and dark. 
•	The next is a Sobel filter, 
•	which is a combination of two filters that pick 
•	out edges in the X direction and Y direction. 
•	Edge detection is extremely 
•	useful in computer vision as it helps 
•	us determine the boundaries between 
•	objects in the image and the background. 
•	It can also help reduce noise. 
•	This is the kind of information that 
•	a Machine Learning model can use to pick out 
•	objects in an image or extract features 
•	like patterns and shapes to use for classification. 
•	We can set the kernel parameters manually if we 
•	know what kind of filter or filters we want to use. 
•	Alternatively, we can use 
•	Learning algorithms to set them automatically 
•	during the training process if we are using 
•	something like a convolutional neural network. 
•	In this case, the algorithm essentially tries a bunch of 
•	different filters and picks the ones that 
•	give it the best features to meet its goal, 
•	like classification or object detection. 
•	I hope this helps you get a sense of 
•	how image convolution works, 
•	how we can use it to create fun filtering effects, 
•	and how it forms the basis for 
•	feature extraction for many computer vision applications.

In addition to convolution, 
we can use other operations like pooling to down sample the image. 
This has the effect of reducing the size of the data as well as creating 
a consistent feature map that stays the same regardless of small 
variations in position of the subject. 
There are a few popular techniques for pooling. 
The first is average pooling, to perform average pooling a window is applied 
over the top left section of pixels on the input image. 
You'll often find windows in two by two or three by three, but they can be any shape. 
The pixel values under that window are averaged to produce a single output 
value which is stored in the output image. 
The window slides over to the next set of pixels. 
In most cases, the horizontal stride is the same as the window width and 
the vertical stride is the same as the window height. 
You will usually want to avoid overlapping the window in pooling as the idea 
is to down sample the image. 
Notice that I'm skipping the last column as I'm unable to fit a window 
without overlap. 
This assumes a valid padding scheme, 
which we talked about in the convolution lecture. 
With valid padding, we do not add any padding to the input image. 
You could also perform same padding, this would add additional rows and columns of 
zeros or other values so that you're not losing information on the edges. 
Also, notice that I'm rounding the output values when storing them to my 
output array. 
For pooling with arbitrary data, you would likely not need to round or 
truncate the data. 
However, since I want my output to act like an image, 
I need the values to be eight bit integers, so I round the values and 
make sure they're between zero and 255. 
Another popular form of pooling you'll see is max pooling. 
It works similarly to average pooling in that we use a sliding window 
to compress the features in an image. 
This time instead of computing the mean of each group of pixels, 
we simply pick the element with the highest value and keep that. 
We slide the window to the next group of pixels and repeat the process. 
We keep doing this until we have found the maximum values for 
all of the window positions. 
There's less math involved in max pooling which potentially makes it more efficient. 
It also allows us to pick out the most salient or important features in an image. 
You could also have min pooling where the minimum value is kept but 
max pooling is much more popular. 
Something else that you might come across is global pooling. 
This is where the window is the size of the entire image. 
You can have global average pooling where the average of all the pixel values is 
calculated. 
Once again, I'm rounding to store the output as an eight bit integer. 
You could also do global max pooling where only the maximum pixel value of 
the whole image is kept. 
Global pooling is used to aggressively down sample an image or 
to simply denote the existence of a feature. 
When it comes to convolutional neural networks, 
you'll often find pooling operations done after convolution, which means 
that polling is done on a filtered image that does something like edge detection. 
Pooling allows us to compress the image in a way that keeps those highlighted 
features from the convolution step. 
Let's take a look at a filtered image of the same elephant we saw 
in the convolution lecture. 
Here, we did a simple outline or edge detection filter. 
If we perform a max pooling operation on this image using a two by two window, 
we get something that looks like this. 
First, notice that the output images much smaller as we've reduced 
the number of pixels by 1/4. 
If we then decide to feed this filtered and pooled image to a neural network, 
that's a lot fewer dimensions and parameters to work with. 
Average pooling makes the image look like it's simply been down sampled. 
Some of the edges might get lost, especially it seems on diagonal or 
curved lines. 
Now, let's compare this to the max pooling operation. 
As you can see the edges are much more pronounced. 
This is because edges are higher valued numbers in the matrix, 
much closer to 255 than say zero. 
As a result, the max pooling operation keeps those in the output image rather 
than trying to average them. 
When it comes to picking out features after a filter like an edge detector. 
The max pooling operation does a better job of highlighting those sharp 
features than average pooling. 
However, max pooling can lose some of the finer 
details as it simply drops the non-highest values in each window. 
If those details or smooth transitions between areas in the image are important 
to your application or model, then you may need to use average pooling. 
That being said, I see max pooling used much more often with convolutional neural 
networks for image classification. 
Another important feature of pooling is that it provides some translation in 
variance for feature sets. 
This means that different images can have the main subject in slightly different 
locations. 
And the output feature set generated by convolution and pooling can 
produce the same or similar feature maps, for example, let's say that we're 
looking at a zoomed in portion of an edge that's on the elephant we saw earlier. 
A max pooling operation would produce an output like this. 
Now, let's say we have another image of an elephant but 
the edge is shifted by a few pixels. 
The max pooling operation would give us something like this, 
not exactly the same as before but it's similar. 
Both outputs let us know that there is an edge in that general area. 
This helps generalize the location of features in similar images. 
When it comes to machine learning, it can help train a more robust model that looks 
for features in general areas rather than have the model look for 
exact pixel values in certain locations. 
I hope this demonstrates how convolution and 
pooling can work together to extract features from images. 
This combination forms the basis for 
the first half of a convolutional neural network. 

## Project - Convolution and Pooling
Introduction
In this project, you will implement a basic convolution filter and a max pooling function. These implementations do not need to be particularly efficient, as they are just to show you how the operations are performed. You are welcome to use for loops to write your functions (no need to use strictly Numpy vector/matrix operations for maximum efficiency).
Image Convolution
Feel free to review the lecture on Image Convolution to see how convolution works. The following image might also be helpful in seeing how convolution should be performed. Note that this assumes no padding (valid padding).
 
#### Pooling
Feel free to review the lecture on Pooling. You will need to implement the Max Pooling operation.
Conclusion
Convolution and pooling are two incredibly important operations in computer vision, and they can be used to create a feature extraction process as part of a convolutional neural network (CNN), which we will explore in a future lecture. Convolution filter kernel parameters can be automatically tuned during training, which makes them a powerful tool for machine learning!
At this point, we've seen all of 
the basic building blocks that make up 
a simple convolutional neural network or CNN for short. 
Let's see how those pieces fit together 
to create a robust image classifier. 
Here is a very simple example 
of a convolutional neural network and 
one that you might see doing 
basic image classification in an embedded system. 
We'll assume that our input is a grayscale image. 
Remember that a grayscale image is 
just a two-dimensional array of numbers. 
These numbers might be floats 
normalized between zero and one. 
They can also be unsigned 
eight bit integers between zero and 255. 
In some fully optimized models, 
which we'll see from Edge Impulse, 
the model might expect 
a quantized input where the input values 
are normalized and cast to 
a signed eight-bit integer value. 
In this case, the values will be from negative 
128 to positive 127. 
The library we download from Edge Impulse should 
handle this quantization and 
normalization process for us, 
but it's a good idea to pay attention to 
what the model is looking for as an input. 
Note that the input should always be the same shape. 
If you trained a model to accept 
28 by 28 grayscale images, 
you should always feed it 28 by 28 grayscale images. 
Next is one or more convolution layers 
where each is followed by a max pooling layer. 
Note that a layer shown here is not 
a single convolution or pooling operation, 
each layer may contain 
dozens or hundreds of simultaneous convolution or 
pooling operations to generate a number of 
feature maps that are passed on to the next layer. 
Using multiple stacked convolution and 
pooling layers like this allows the model to 
extract more complex features from 
the image before sending 
those features to the classifier. 
After going through convolution and pooling, 
the features are then stacked and 
flattened to create a one-dimensional vector. 
This vector is then fed to 
a dense neural network to 
perform the actual classification. 
The neural network can be more than one layer, 
but sometimes one layer is good enough, 
as I've shown in this diagram. 
The output goes through 
a softmax operation to give us our output predictions. 
This flattening and dense neural network step might look 
similar to the simple classifier 
we trained at the beginning, however, 
instead of feeding it raw pixel values, 
we're now giving it features that have been 
computed from the convolution and pooling layers. 
These features ideally include things like edges, blobs, 
patterns, and so on, 
which is closer to how we humans interpret images. 
Let's break down these layers. 
First, let's look at the convolution layer. 
Each layer will have some settings associated with it, 
and the settings can be treated as hyperparameters. 
This means you can adjust them before training, 
but they are not automatically 
updated during the training process. 
Two, popular settings are the number of 
filters in the layer and the shape of the kernel. 
A three-by-three kernel is fairly common. 
Let's say we have four filters in this layer, 
each filter has a stride of one and we would use 
same padding so that the output features 
are the same shape as the input image. 
Our image would be filtered in 
parallel four different times 
with four different kernels. 
Each of these kernels would 
extract a different set of features. 
For example, a few of 
the filters might perform edge detection, 
while another blurs the image to get 
an overall placement of the object. 
However, these kernel values are not set manually. 
They are randomized initially and they are 
updated incrementally as part 
of the model training process. 
In effect, the model is Learning which features it should 
look for to achieve the desired classification outcome. 
Another important difference between 
a convolution layer in a machine learning model and 
basic convolution is that 
the convolution layer also has a bias term. 
This is a simple constant value that is 
added to every element in the output array. 
The constant is the same with any filter, 
but each filter can have 
a different constant value for the bias term. 
Like the kernel values, 
the bias term is also a parameter 
that gets updated during model training. 
These output images are very 
contrived for this example as they're using 
manually set kernel values to 
create a particular filtered output. 
While we might think that the model is 
looking for edges or blobs or shapes, 
in reality, the features end 
up looking something like this. 
The kernel values are learned and 
can often look random and weird. 
Sometimes it can be difficult to figure out exactly 
what features a particular filter 
has been trained to pick out. 
These images were created by using kernel values from 
a real CNN that was trained to 
classify these electronic components. 
The filtered images are then 
sent through an activation function, 
just like you would find in a neural network node. 
Without activation, the filtered images are just 
created through a series of 
multiplication and addition steps. 
Activation adds nonlinearity into 
the process to assist with training and 
potentially help normalize the output to 
be between two values like zero and one. 
One popular activation function is the sigmoid function. 
This converts any number into 
a value between zero and one. 
In recent years, the rectified linear unit, or ReLU, 
has become more popular as it can 
make training faster and is often more 
computationally efficient than 
other activation functions like sigmoid. 
Each pixel value in the filtered image is sent to 
the activation function and the result is 
the output of that convolution step in the layer. 
While we can depict these output arrays as images, 
it's usually more helpful to think of them as 
arrays of features rather than images. 
When working with floating point values 
for this convolution step, 
the values for the pixels 
and weights can be almost anything. 
As a result, the ReLU function will operate as intended. 
You just need to convert any value 
that's less than zero to zero. 
If you try to imagine these feature maps as images, 
then you'll probably wonder how ReLU works on 
8-bit pixels with values between zero and positive 255. 
In a fully quantized network, 
weights and bias terms are often 
stored as signed 8-bit integers. 
In each layer, the input image or feature map 
gets quantized to 8-bit signed integers as 
well in order to make 
the filtering and activation functions 
faster on processors without 
special hardware floating point units. 
Rather than 0-255, 
the pixel values might be mapped to negative 128 
to positive 127 and 
store it as 8-bit signed integers as well. 
This way, negative values are 
possible during the convolution step, 
and they can be sent through the activation function. 
For ReLU, any values less than 
zero are converted to zero. 
For the purposes of visualizing this step, 
it simply means that any shade 
darker than 50 percent gray becomes 50 percent gray, 
assuming this is how 
the quantization process set 
the midpoint of our pixel values. 
The quantization process can get 
fairly complicated in TensorFlow Lite, 
as a different zero-point 
could be set for an activation function. 
This might help give more precision for activations of 
filtered images with more light colors 
like the one on the far right. 
While the filtered images and 
activation feature maps look similar, 
if you look closely, 
you can see that any of 
the dark shades have been converted to a 
middle gray to demonstrate 
how this ReLU function might work. 
What's left is a collection of prominent features 
for that particular convolution filter and activation. 
Note that you may often see one of these filter and 
activation functions referred to as a neuron or node. 
This can help you think of the individual units 
in a convolution layer in terms of a neural network. 
I'll shorten each of these nodes to 
a circle which contains the convolution filter, 
adding the bias terms to each value 
and the ReLU activation step. 
This gives us a feature map as an output for each node. 
While we could send these feature maps 
to another convolution layer 
or write to a fully connected neural network 
for classification, 
you'll sometimes find a pooling step after convolution. 
Max pooling is one of 
the most popular pooling operations, 
and it's what we'll use in our examples. 
Here, the pooling operation is performed 
on each of the filtered images to produce 
a new output feature map that has been 
down-sampled and has 
the most salient features picked out. 
Each of these new feature maps is now 14 by 14 pixels. 
Rather than thinking of them as 
a separate collection of feature maps, 
you can also stack them together to create 
a three-dimensional array of feature maps. 
This array would have the dimensions of 14 by 14 by 4. 
This collection of feature maps is used as 
the input to the next convolution layer. 
Here we'll use two filters instead of four, 
but the other hyper-parameters stay the same. 
Each filter or node works the same way as before. 
The collection of images are convolved with a kernel and 
a bias term is added to 
all pixel values to produce a new feature map. 
There's one big difference here between 
this layer and the first convolution layer. 
The input is now 
several images or a three-dimensional array, 
depending on how you look at it. 
Instead of a single three-by-three kernel 
to filter the image, 
each node now has a collection of 
kernels equal to the number of input feature maps. 
Each of these four kernels has its own set of 
parameters that are determined during model training. 
Even though a single filter might 
say it has a single three-by-three kernel, 
it really has more than that 
depending on the number of input images. 
Let's see what's happening in the node on the left. 
Here, four separate kernels are being 
used to filter each of the input images. 
Rather than output four separate new images, 
the values under the windows after being 
multiplied by the weights in the kernel are 
being summed together plus a bias term to give us 
a single new value in 
the new image or feature map in this case. 
In other words, this is how 
groups of pixels are being filtered and 
combined across several images to give 
us a new pixel value in the output image. 
The same thing happens in the other filter. 
Next, a non-linear function like ReLU is 
applied to these arrays to 
give us the outputs of this layer. 
Each of these feature maps is a combination 
of the feature maps that were supplied to this layer. 
After that, we perform 
another max-pooling operation on 
each of the two feature maps. 
This gives us down-sampled features 
that are each 7 by 7, 
assuming we use a pooling window 
of 2 by 2 with a stride of two. 
These are distilled features of the original image. 
Ideally, as the network trains, 
it settles on weights and the various kernels 
that help pick out the best set of features. 
The final step in this section of 
the network is to flatten the feature maps. 
Here, each pixel or array value is 
stacked end to end to create one long vector. 
When the end of one feature map is reached, 
it simply moves on to the next. 
Instead of a 7 by 7 by 2 feature map, 
we now have a 1 by 98 vector of features. 
This is very much like how we flattened 
our image when we tried 
the dense neural network for classification. 
The difference is that instead of using 
the raw pixel values as input, 
we're using these features which should represent 
salient points in the input image 
that a neural network will care about. 
Part of the appeal in using convolution is that 
the input image can ideally be 
compressed into a set of important features. 
If you were to use the raw flattened image, 
you would have 784 values going into the next section, 
as opposed to 98. 
You have to do some computation 
in this convolution section, 
but it potentially saves you 
some calculations down the road. 
This is especially true if 
you're working with large images. 
A 200 by 200-pixel image would be 40,000 input values. 
Doing some convolution and pooling steps could 
potentially reduce that down to a few 100. 
Convolution layers allow you to 
condense an image to important features, 
possibly by using a lot of convolution 
and pooling layers to create a deep network. 
These layers take into account 
the relative position and neighbors of 
each pixel value rather than just 
assume that the raw values are the features. 
At this point, the features are 
then fed to a dense neural network, 
which we covered in a previous lecture. 
It works the exact same way. 
One or more layers of 
fully connected nodes sum weighted values of 
their inputs and apply 
a non-linear activation function to the output. 
This classifier section can be 
as deep or shallow as you'd like. 
In simple, tiny machine learning applications, 
you may only need one layer as shown here. 
My electronic components images 
would probably have five classes, 
but I've shown three here as an example. 
The output layer of 
your dense neural network would likely pass through 
a softmax operation in order to give 
you the output probability of each class. 
I've listed the dimensions at 
the input and flattened stage to give you 
an idea of how the feature extraction section 
can reduce the dimensions. 
Note that this assumes four filters in 
the first convolution layer and 
two filters in the second convolution layer. 
If you use more filters in the last layer, 
there will obviously be more features at 
the flattened operation going 
into the classifier section. 
I encourage you to play with the architecture and 
various hyper-parameters like the size 
and number of filters, 
number of layers, depth of 
classifier network, and so on. 
I hope this helps give you an understanding of what's 
going on inside a convolutional neural network. 
You may or may not see 
an improved accuracy over just a dense neural network, 
but the convolutional layers can help reduce 
the total number of operations 
inside the network which makes the model more efficient.
## Digging Deeper into CNNs

•	Now that you've seen the inner workings of a convolutional neural network. 
•	Let's see how we can use edge impulse to train one for this demonstration. 
•	I'm going to be using the bit mapped images that I collected from my open MV 
•	camera. 
•	If you'll notice, go into your data set and pay attention to 
•	the format of the images, mine were all taken in BMP orbit map format. 
•	Edge impulse requires PNG or 
•	Jpeg images at least at the time that I recorded this in the future. 
•	That may change but since I have BMP only from my open MV cam I need to convert 
•	these two PNG or JpeG I'm going to go with PNG because it's a lossless format. 
•	You can convert them individually manually but that takes up a bunch of time or 
•	you can find some program on Windows. 
•	There's this program called bulk image converter. 
•	I'm going to use that. 
•	So I'm going to go where I stored my data set which is in python data sets, 
•	electronic components BMP select that. 
•	I'm going to include the subdirectories. 
•	I'm going to convert BMP to PNG. 
•	I'm not going to delete them and I'm going to click start. 
•	That should be pretty quick as the files are pretty small. 
•	You'll notice what happens is if you go into the folder it puts it in 
•	the same folder so all the PNG files are now in here. 
•	Next I'm going to go to edge impulse. 
•	I'm going to log into my account and I'm going to create a new project. 
•	I'm going to call this one, electronic components-cnn, 
•	when it's done it should drop me into the dashboard, I'll click out of this pop up. 
•	And I'm going to go to data acquisition, I'm going to click 
•	upload data since I already have it stored on my computer. 
•	I'm going to choose files by the way. 
•	You can see the supported files here, 
•	you can do what we did earlier in collab and convert it to raw seaboard Jason. 
•	RCSV format where it's just the raw pixel values whether that's 0 to 1 in 
•	floating format or 0 to 2 55 in grayscale or even color. 
•	If you don't mind having some extra dimensions in their wave is for 
•	sound files. 
•	PNG or jpeg is for our image files. 
•	That's what supported at this time. 
•	I'm going to go to my data set which I keep in python right now, 
•	I'm going to go into the BMP folder and 
•	I'm going to go into my first class which is background. 
•	I'm going to select all of them which would be all of the PNG images click open. 
•	I'm going to automatically have it split between testing and training sets and 
•	I'm going to enter a label. 
•	You can name the files with something like background dot and 
•	then the number of the file and then that will allow it to infer the class 
•	as you can see here anything before the dot to determine that label. 
•	But since I don't have the files names like that. 
•	I'm going to enter the label manually, 
•	I'm going to repeat this process for each of my classes. 
•	So here the capacitors, change the name or change the class name.
•	Play video starting at :3:19 and follow transcript3:19
•	And then we're going to do this again for the diode.
•	Play video starting at :3:27 and follow transcript3:27
•	Once more for the LED and 
•	you'll notice that the images are color images right now. 
•	I don't mind storing them as color images in edge impulse we're going to use edge 
•	impulse to convert them to grayscale before training our model. 
•	And then the last one is the resistor. 
•	When that's done, I can click data acquisition again and 
•	I can see all of the different images that have been 
•	stored in my edge impulse project. 
•	And I should see the five different labels LED resistor, background capacitor, 
•	and diode. 
•	If I go to test data you should see where some of the data set was split apart, 
•	about 20% should appear in test data. 
•	That may not always be exactly the case but this is a good mixture here. 
•	This is a fairly small dataset. 
•	I highly recommend getting more samples than what I have here but 
•	this is a good place to start just to see if we can train something. 
•	I don't expect it to be very accurate. 
•	Kind of like what we saw with the dense neural network. 
•	From there I go to impulse design, it's going to ask me to input the image width 
•	and height because it's going to resize the image. 
•	So for these I want to resize them to 28 x 28 and then you can pick how the re sizing 
•	is done where it scales versus where it crops, you can squash it, you can fit. 
•	I'm just going to leave it fit because it should keep the 1 to 1 ratio. 
•	When it scales from the square 96 by 96 to the 28 by 28. 
•	I'm going to add a processing block that's going to be an image. 
•	It's going to do some very, very basic processing and 
•	then I'm going to add a learning block. 
•	It recommends doing transfer learning but we're not going to do that. 
•	We're going to come back to transfer learning in a later lecture because for 
•	right now I want to show you how to use a basic neural network to 
•	accomplish this task. 
•	So for neural network, click add, save your impulse go to image so 
•	that we can extract features from these images. 
•	Or rather than extracting exactly features, 
•	we're going to convert them to grayscale and resize them. 
•	So rather than RGB, color, depth, click grayscale and 
•	that's the only thing we have to work with right here. 
•	So save parameters and then click generate features. 
•	This will then create our grayscale images that are now 28 by 28 each and 
•	store them in our project. 
•	That's what we're going to use to train our classifier and 
•	remember that's what our classifier, our model when it's done, 
•	being trained will expect 28 by 28 pixel grayscale values as input. 
•	So keep that in mind when we go to deploy the model to our embedded device. 
•	When it's done, you should be able to look around the future 
•	explorer keep in mind that this is not all of the dimensions. 
•	There are some special math going on in the back end to kind of squish those 28 x 
•	28 dimensions down to three dimensions. 
•	So we get an idea of how different the images are from this. 
•	There appears to be very good separation among the classes. 
•	With the exception of diodes and resistors, these are mingled together 
•	a good bit because many of the resistors end up looking like diodes and vice versa. 
•	So I would expect the model is going to have a hard time distinguishing these 
•	unless we get more data in here and 
•	we could get better separation among the classes. 
•	But for now this should be good enough to try a few things. 
•	Let's go to the neural network classifier. 
•	The number of training cycles. 
•	I find that for a basic two dimensional convolutional neural network, 
•	10 cycles is not enough. 
•	This is the number of epics that are required to train the model. 
•	I usually go with something like 100 learning rate is fine to leave 
•	at the default. 
•	The minimum confidence rating is something that edge impulse does when it's trying to 
•	test and evaluate the network so we can leave that alone. 
•	And I'm going to go ahead and show the default network architecture. 
•	Note that you can add extra layers here. 
•	You can add more convolution and pooling layers. 
•	You can also adjust some of the hyper parameter so 
•	I can click on this to adjust the number of filters in 
•	the concept lecture we showed this as having four filters. 
•	And this is having two filters and there was no dropout layer here. 
•	But a lot of these convolutional neural networks need a lot more 
•	filters than what I showed in that concept video. 
•	Because of that, we're going to leave the default of 32 filters for 
•	the first layer and 16 filters for the next layer. 
•	You can also adjust the kernel size, kernels can be rectangular, 
•	it could be a four by six array. 
•	However, edge impulse only supports square colonel. 
•	So 3 means a 3 by 3 colonel for each filter. 
•	And then there's also the number of convolutional layers. 
•	You can stack convolutional layers prior to a pooling layer and 
•	it comes with a pooling layer which I believe is a max pooling layer. 
•	We can check that in just a second after those 4 layers which is 
•	convolutional pooling. 
•	Convolutional pooling, there's a flatten, there's also a dropout layer which helps 
•	with overfitting this is where some of the connections are dropped or 
•	disconnected prior to going to the dense neural network. 
•	And this only happens during training. 
•	This has the effect of helping with overfitting. 
•	So I'm going to leave this in but you can feel free to play with this rate. 
•	This is the percentage of connections that are dropped going to this dense neural 
•	network. 
•	There is only one layer in this dense neural network and 
•	that is a 5 node neural network, one node for each of the features and 
•	that's going to be our actual classifier. 
•	Everything up until that is our feature extraction that is trained automatically. 
•	If we go into karate expert mode you can see exactly 
•	what's going on with the different layers. 
•	I was corrected is a max pooling layer as you can see here and 
•	this is how you would use karate to construct your neural network. 
•	If you wanted to do this in python, 
•	you would add a two dimensional convolutional layer. 
•	The number of filters, the kernel size and the activation. 
•	You can see that they are using rail, you just like we showed and 
•	the padding is same as opposed to valid, which means that the output array size. 
•	And shape from this convolutional step should be the same as the input. 
•	After that there's a max pooling another two dimensional convolutional layer 
•	with 16 filters. 
•	This time, one more max pooling than a flat layer, a dropout layer. 
•	And then a dense layer which has a number of nodes equal to the number of classes 
•	which should be five followed by the soft max operation which gives us 
•	the probability of each class as the output. 
•	There are some hyper parameters here if you wanted to play with those. 
•	This is the atom optimizer which is a specific algorithm used to update 
•	the parameters in the model based on the difference between the expected 
•	outcome of each individual sample being fed through. 
•	And it's actual ground truth labels. 
•	We can control the batch size which is the number of samples being fed to the model 
•	at a time prior to updating the parameters. 
•	And this is where the training and validation data sets are being set up so 
•	that they can be used for training and testing down in this dot fit function. 
•	Let's go back to our basic view and 
•	from there all we have to do is click start training. 
•	I like to watch the training output in this window over here as it's going. 
•	What I'm looking for is accuracy and that is accuracy of the model on the training 
•	set and comparing that to the validation or val accuracy. 
•	Both of these numbers should be going up. 
•	You can also look at the loss which should be going down. 
•	But I like watching the accuracy for now as you can see accuracy is about 90 or 
•	over 90%. 
•	And the validation accuracy hovers around 90% because there is a difference between 
•	the accuracy and the validation accuracy. 
•	I can assume that there is some overfitting going on where the model is 
•	better at predicting samples in the training set than it is at predicting 
•	samples in the validation set. 
•	There are a number of ways you can address. 
•	Overfitting some of that might be playing with your network hyper parameters which 
•	includes maybe adding layers or even removing layers. 
•	Changing the number of filters, changing the number of neurons putting in dropout 
•	layers, adding densely connected layers. 
•	You can try different architectures and 
•	configurations to see if you can reduce some of that overfitting as you can see. 
•	Our accuracy on the training set came up to be about 95% but 
•	the accuracy on the validation set was a good bit lower. 
•	The other big thing that we can do is get more data. 
•	As I mentioned earlier, there's only 50 samples per class in this and 
•	we should really be looking at having more than that. 
•	A few 100 samples per class at least to give us a better idea of what's going on 
•	and hopefully reduce overfitting in our model. 
•	When we looked at the feature explorer earlier, 
•	I noted that diodes and resistors looked very similar and 
•	they both are actually kind of similar. 
•	If you think about it. 
•	From an object perspective, they're both cylindrical bodies with two leads or 
•	a lead coming out of each end and I could see how a machine might confuse them. 
•	Which means we have to figure out ways to train the machine better to look 
•	at the difference between those. 
•	And some of that might be better features where maybe we're using smaller colonel 
•	sizes to get those. 
•	These are just things to play with. 
•	But I did determine that we can use a convolutional neural network to mostly 
•	correctly identify the differences between or among these electronic components. 
•	We saw this a little bit in the concept video where I talked about quantization, 
•	where the actual weights and input features were quantized to 8 bit or 
•	32 bit values. 
•	And then it was using integer math to compute everything inside of the network, 
•	which means you don't have to rely on floating point emulation or 
•	floating point units. 
•	Which is specialized hardware in some microcontrollers and microprocessors. 
•	So that allows us to more efficiently run these networks on hardware like 
•	microcontrollers that may not have access to floating point units. 
•	This could save us some energy. 
•	This could save us some cost. 
•	A variety of reasons why you would want to do into your math instead of 
•	floating point math. 
•	But if you wanted to run the original network you could click the float 32, 
•	an optimized network or the un quantized version. 
•	And you can see it's about the same. 
•	So quantization doesn't lose us a whole lot of accuracy. 
•	So there's no reason not to go with the quantized version here, 
•	especially if we're going to run this on a micro controller. 
•	Once you're happy with the results here, I'm going to call this good enough for 
•	your purposes. 
•	I recommend playing with these. 
•	Try different numbers. 
•	Try smaller colonel sizes and see if you can get your accuracy up. 
•	I also recommend getting more data if you can but I'm pretty happy with 90%. 
•	It's very similar to what we saw with the densely connected neural network. 
•	However, ideally the convolutional neural network should have fewer parameters. 
•	It may have similar math or computational requirements but 
•	it should have fewer parameters which means it will take up less memory when 
•	we go to store this model. 
•	That may not always be the case. 
•	It depends on the number of nodes, it depends on the number of layers. 
•	There are a lot of factors that go into it but ideally convolutional neural networks 
•	should give a similar accuracy with fewer parameters and 
•	fewer steps to be performed. 
•	Once I'm happy with this, I can go to model testing and 
•	I'm just going to click classify all this is my test set that was held out and 
•	never used for training or validation. 
•	I'm going to wait a moment while that tests all of them and when I'm done. 
•	Hopefully I get something similar to my validation said, yes, about 90% accuracy. 
•	Once again it's having trouble with both resistors. 
•	Actual label is resistor as well as falsely predicting other things as 
•	resistors. 
•	Now that we have a fully trained classifier and 
•	hopefully you're happy with the accuracy and results of it. 
•	You can go to deployment and deploy it to your particular microcontroller or 
•	embedded system. 
•	There's a raw CC plus library. 
•	You could use an Arduino library as well as a micro python version for open MV. 
•	And even one for Lennox will be exploring these in a future lecture. 

## Project - Training a CNN
Introduction
Now that we have covered how convolutional neural networks (CNNs) work, let’s train one! We’re going to use the same dataset from the first module to train a CNN model on Edge Impulse rather than a simple dense neural network.
Convert Images
At this time, Edge Impulse only accepts .png and .jpg images. If you originally collected your dataset as .bmp files, you will need to convert them to one of those two categories. 
I recommend keeping the directory structure the same so that you can determine the label of each sample by just looking at the name of the containing folder.
Upload Dataset
The good news is that we can upload image data directly to Edge Impulse rather than having to manually extract features using Colab.
Start a new Edge Impulse project. Go to Data acquisition. Click the button to Upload existing data. Click Choose Files and select all images for one class (e.g. resistor class). Leave Automatically split between training and testing selected. For Label, enter a label for your first category (e.g. “resistor”).
 
Click Begin upload. Repeat this process for the rest of the categories in your dataset. When you are done, go to Data acquisition. Make sure that you have a good representation of samples in your training and test sets.
 
Create Impulse
Head to Impulse design. Change the Image data setting to 28 x 28 (width x height). If your original images are not all square, you can adjust the Resize mode as needed to crop or squish the images.
 Add an Image processing block. Add a Classification (Keras) learning block.
 
Click Save Impulse.
Extract Features
Go to Image under Impulse design on the left-side navigation bar. Change Color depth to Grayscale. Click Save parameters. 
 
On the next screen, click Generate features. Wait until the feature generation script is complete. Here, images are being converted to grayscale and 28x28 pixels.
 
Train!
Go to NN Classifier. I recommend changing the Number of training cycles to something between 100 and 200 (as we are working with a small dataset). Click Start training. When training is done, how well did your model perform?
 
Try adjusting the model hyperparameters, such as number of training cycles, number of filters, and kernel sizes. You can try adding and removing layers. For example, I added a dropout layer after the first convolution layer and changed the second convolution layer to have 28 filters. This seemed to provide a better accuracy on the validation data.
 
Conclusion
Comparing the accuracy results of this model with those of the dense neural network example might be misleading. Remember that we are working with a relatively small dataset (50 samples per class), and the samples are likely randomly distributed between training and test sets. Moving a single sample in or out of the training set could cause the validation accuracy results to sway by 0.5% (assuming 200 samples in the training set). A larger dataset would help prevent overfitting and give you a much better idea of the model’s viability. We will explore ways to create larger datasets in a future lecture and project.
one of the frustrating things about a convolutional neural network is that it 
can be difficult to figure out what's going on when the model classifies 
a given image, I'm going to show you a couple of techniques that you can use 
to get a sense of why a model made a particular decision. 
We'll start with our simple convolutional neural network that's been trained to 
recognize images of electronic components. 
For both of these methods, we need to remove the softmax layer at the very end. 
That's because the methods require the raw output values from the final layer of 
the dense neural network to compute the necessary heat maps. 
The first method is known as a saliency map here. 
We compute the loss between the raw output of the network and 
the ground truth label values of a particular input image. 
The saliency map algorithm works its way backward through the network to figure 
out which pixels in the original image where the most important or 
salient in helping the model make the classification. 
These maps can look a little noisy but 
it can be very instructive to see where the network was looking in the image. 
The second method is a class activation map here, 
we figure out which features from the final convolution layer were the most 
important in helping the network make its classification specifically will use 
the grad cam algorithm which is short for gradient weighted class activation map. 
Note that the resolution of the heat map will likely be different than that of 
the input image. 
That's because it's a feature map that results from one or more convolution or 
pooling steps. 
In fact, it's resolution should match the output resolution of a single filter from 
the final convolution layer. 
In this case we would expect a 14 by 14 pixel heat map from our particular model. 
I'm not going to go into the math behind either of these algorithms 
as it includes a good amount of calculus to compute gradients. 
In fact, 
I don't even have a great understanding of how each of these algorithms work but 
they can still be useful tools for debugging a convolutional neural network. 
What you can do is start looking at the saliency maps and 
cam images for your test or validation sets here, I've created overlays for 
both maps that appear on top of the original image. 
This will help you get a sense of where the model is looking 
to make its classification. 
This resistor worked fairly well, although the spots in the corners of the saliency 
maps tell me that it's looking at the background some to make a prediction. 
Next is a known good led image whose label was predicted correctly. 
The saliency map shows that it was looking mostly at the corner and 
some of the edges of the LED. 
The cam image is really interesting. 
The feature map around the edges seem to be the most important. 
Finally, here is an example of a misclassified LED. 
As you can see, the leads of the components are cropped out of the shot. 
Both maps again show the model focusing on the edges of the component. 
However, the cam heatmap is closer to that of the resistor where there are some 
curved edges near the center of the frame with a light bar at the bottom indicating 
some level of importance. 
What I also found interesting is that the model does not seem to 
care about the leads of either part at all. 
It focuses mostly on the component body to make its decision. 
Let's say you're trying to make a classifier that identifies dogs in 
an open field. 
Your training set looks like a collection of dogs on the left and 
a bunch of open fields like the ones on the right. 
Let's say you have a ton of training data and 
your model performs wonderfully on the validation and test sets. 
However, when you go to deploy the model, 
you find it performs terribly in the field. 
Why is that? 
In reality? 
Your model is probably not detecting dogs but rather it's looking for 
light backgrounds versus dark backgrounds. 
A saliency map or cam can help you spot that. 
If the heat maps mostly focus on the background rather than your subject, 
you can determine that your model is looking for 
something else that differentiates your images rather than what you are intending. 
Let's take a look at how we can use our model from edge impulse to view saliency 
and class activation maps. 
We'll start with the trained model in edge impulse from here will need to 
download the trained model in order to use it in our demo code. 
So head to your dashboard, click out of the pop up, you'll want the tensorflow 
saved model file as that includes a save point that we can load into keras to use 
for classification and computing the gradients necessary for each of the maps. 
So let's download that file, it's going to be a dot zip file. 
If you go into the Git repo for the course, there should be a section for 
the convolutional neural network visualizations and 
if you go in there there will be a co lab jupiter notebook. 
If you click on open and co lab, the page might freeze. 
So what I like to do is click on it and say open link in new tab. 
This will bring you to collab with the notebook. 
This will give you some information about the notebook and 
let's run it one cell at a time. 
If you get the warning about the notebook not being offered by google click 
run anyway after that first cell runs with our imports we're 
going to want to point to our saved model dot zip file to do that. 
We'll want to click on the file Explorer, right click in the area and 
click upload in downloads, go to wherever you download the dot zip file, 
click it and say open click OK. 
And that will appear in the files area. 
You want to right click on it and click copy path and 
this is what you want to paste for your zip path. 
This will unzip the file. 
And if we refresh the file Explorer, 
you should see the saved model folder appear here. 
And this is a collection of files that make up your model including a save 
point and some other files that are used by Caracas to create your model from here. 
We want to specify an input image, go to your edge impulse project and 
there are a couple of places that we can get image data from. 
First of all. 
If you click on the image tab, you will see the screen. 
We don't want the raw features as this is the information for 
the 96 by 96 color image. 
What we want is the processed features as this has been scaled down to 28 by 
28 which we specified in the impulse design and 
converted to grayscale as well as normalized to being between zero and one. 
This is what we want. 
So we can copy this to the clipboard and we can go back into our co lab. 
I've got an example image here. 
I just want to delete this, paste in my features and 
then I want to change the true index to whichever index, match my labels here. 
Note that you're going to probably need to update these labels in order to match 
whatever labels you have for your project. 
Remember that edge impulse keeps these in alphabetical order.
Play video starting at :7:8 and follow transcript7:08
If we look at this, 
this is a resistor which means the true label should be resistor. 
For me that is 01234. 
I'm going to change that to four and I have a note here to say that this is 
a resistor and save models should point to whatever got unzipped here. 
You probably don't need to change that. 
Note that if you go to model testing, 
click classify all if they have not been classified and 
you can see what the model predicted for each of the test set, samples.
Play video starting at :7:40 and follow transcript7:40
In fact, instead of using this known good resistor here, 
that was part of the training set. 
Let's use one of the test samples and this is up to you on what you want to look at. 
You can look at validation, you can look at test, you can look at training. 
I'm going to take a peek at this resistor that the model thought was a capacitor. 
So I click on the little options menu here, 
say show classification given a moment and it will pop up with the image and 
the output of the classification, which it was very sure that this was a capacitor. 
We don't want raw features, we want the processed feature. 
So let's click those and we're going to paste them into the image data here. 
Make sure we close off the list, make sure we close off the list with a bracket. 
This is still a resistor. 
So four corresponds the index of resistor here, 
we're going to run that sell the next cell will load the model and 
give you some information about the model so feel free to look at it here. 
It should match the model in your edge impulse project which is noted here. 
Then we need to reshape the image as what was given was just a list of features. 
Note that you will probably need to change the width and the height variables here 
to match whatever was set in the impulse design page here. 
So 28 x 28 is your within your height. 
You want to set them here? 
28 by 28 that's what determines how the image gets resized. 
So instead of being one long list, it's now a 28 by 28 array. 
Yeah, we ran this, this is what our image looks like The Keras 
model that we're going to be using in our algorithms, 
expects a bunch of these together and color channels. 
So we need to actually put this in a four dimensional array. 
But if you notice it's still just a 28 x 28 image that has a dimension of one for 
the color channels and then a dimension of one for the number of samples since it's 
only one sample that's all we're working with right now. 
We'll run that. 
We also want to do a forward pass of the model with this input image. 
That should give us the predictions which should match our test sample. 
If we go here we go to our resistor that looks like a capacitor. 
We show classification.
Play video starting at :10: and follow transcript10:00
Mhm and it should look something like this, tensorflow takes 
a moment to start up the first time you run it, so be patient with it and 
sure enough, we round that up to one and that looks about right. 
It really thinks that this is a capacitor.
Play video starting at :10:18 and follow transcript10:18
As I mentioned earlier in the slides, we need to remove that softmax activation 
function that's in the last layer of the densely connected neural network, 
that's part of our convolutional neural network. 
And to do that, we just call model, 
go to the very last layer which is given by this negative one index and 
we change the activation to none that removes that softmax for us. 
First up we're going to show the saliency map, this relies on the back and 
tensorflow library to compute gradients for us and 
we're going to use the gradients of the loss with respect to the input image in 
order to generate that heat map. 
This becomes our function, which we can just call here where we give it images 
which is really just our one image given here. 
Yeah, our model file, 
the only thing that's changed about that is changing the activation layer and 
then the actual class of the image which we got from the test set. 
This should be four, which corresponds to resistor. 
That's used to compute the loss between the predicted value and 
the actual value during this gradient function. 
Up here we'll run that. 
We then show the saliency map which gives you an idea of where the network was 
looking at the individual pixel level on the input image. 
And then we can overlay that image with our original image. 
We have this on top of the resistor and 
you can see it's looking at the top edges here. 
It's not looking at the component body. 
In fact, it's completely ignoring the bottom side and 
ignoring the leads, it seems except for maybe this area over here and 
it's also looking a little bit at the background, which is interesting. 
My thought process here is that the network is actually looking for 
values to be not background. 
Somewhere in the center because this resistor is high up. 
It thinks this is a capacitor because that's where it's expecting the capacitor 
component body to be. 
Since it takes up more space on this image. 
Once again, that's a hypothesis that I would need to test and 
maybe you get additional training data in order to feed the network more 
images like this where the resistor is not centered all of 
the time to teach the network that hey, a resistor can be up here. 
You need to look at other features in order to determine that this is a resistor 
and not a capacitor. 
Let's look at grad camp. 
Once again we're using the back and tensorflow library to compute 
the gradients this time the gradient is the output neuron or 
the output node with respect to the output feature map of the last convolution layer, 
which is what we saw in the slides were looking at the feature map of that final 
convolution layer before it goes to that final max pooling layer and 
before it gets flattened to go to the densely connected neural network, 
we use that to produce our heat map. 
And in order to call this function, 
we need to tell it where that final convolution layer is. 
So what I'm going to do is print out all of the layers in my network and 
I'm going to go backwards through them to find where the final convolution layer is. 
The name of that should be con two D underscore one. 
It prints that out and sure enough, that's the name of the final convolution layer. 
If you have three or four, 
it should still point to that last convolution layer that's in the network. 
We then call the function with the name of our last convolution layer that 
creates our heat map and then we can draw that as you can see it's focusing once 
again on the resistor in the upper part of the image and 
let's overlay that with our original image. 
This gives you a slightly different look at how the model is operating and 
where it's focusing its attention. 
With both the saliency and class activation maps, you can see that 
the model is focusing on the upper part where the component body is. 
However, you would want to compare this with something like a capacitor 
to see if the model is indeed focusing on that upper part for capacitors. 
This gives you some insight as to how the model is working and 
can help you debug potential heirs, especially for your test set images or 
your validation set images where the model failed to predict the correct class. 
In a future lecture, I'm going to go over data augmentation where we can 
use this relatively small training validation and test set to create or 
automatically generate new samples that can be fed into 
the model to hopefully correct some of these issues. 
Hopefully these heat maps generated by the saliency and grad cam functions can give 
you an idea of what you need to do during that data augmentation. 
So for example, 
if I wanted to be able to classify resistors in the upper part of the image, 
perhaps some of that augmentation would be let's create more images out of the ones 
that already exist where the resistor is in the upper part or lower part. 
So we can start to move things around in data augmentation based on 
information we have learned from these mapping

One way to deal with an inaccurate model or 
one that's overfitting is to give it more training data. 
If you can't or don't want to collect more data manually, 
then you can use something known as data augmentation, 
where you automatically generate 
slightly modified versions of 
your original images to create a new dataset. 
I'm going to show you some 
very simple image transformations 
that you can do in order to augment your data. 
This is running a Colab that you can find 
in the course GitHub repository. 
To begin, make sure you've connected to the runtime. 
Then I'm going to upload a single sample, 
and I'm going to continue to use 
this first resistor sample that we collected. 
Once that's uploaded, what we can do is 
set the path to the image, so 0.bmp. 
You're welcome to keep the seed 
as the same or change the seed, 
so you can get the same exact results that 
I get assuming you're using the same input. 
Hopefully that works for most of 
the random calls that you'll find in this Colab. 
We'll run this. 
Then we're going to load our image as a NumPy array, 
which you saw way back when 
we were doing data collection. 
Now that our image is a collection 
of numbers in an array format, 
we can use that to get 
the height and the width of the image. 
Then we're going to view the image. 
I always like to do this just to make sure I 
know what kind of data I'm working with. 
As you can see, it's a 96 by 
96 and there's three color channels. 
I'm going to keep these images as color images. 
Then when we upload them again to Edge impulse, 
I'll use Edge impulse to convert it to a grayscale image. 
This is in case you want to work with 
this dataset as color images instead of just grayscale. 
You can see that the individual color channels are still 
eight-bit values going from 0-255. 
Here's an output of that image. 
It should be exactly the same as our bitmap image. 
You can double-click these. 
Sometimes it'll show you an image preview over here, 
but it looks like in this case 
it doesn't work with bitmap, 
so it just tries to download it. 
One of the simplest transforms 
you can do is a simple flip. 
Here we're going to flip the image horizontally, 
we're going to flip the image vertically, 
and then we're going to flip the image 
horizontally and vertically. 
What this does is, not only does it 
move your subject around a little bit, 
but it also accounts for any variations in 
light that the classifier might be picking up on. 
For example, resistors might be 
shinier than some of the other components, 
so that the neural network is looking for 
this little reflection in 
the resistor at this particular point in the image, 
or it's looking for a highlight or a shadow. 
By flipping it, 
especially vertically, 
all those things are now in different spots. 
It tells the classifier that, hey, 
you need to learn different things about 
this object than maybe where a reflection point is. 
Hopefully this will help kick in 
the gear in order to pick up on 
the shape of our subject 
rather than other features that we don't care about, 
like background or reflection points. 
Flipping horizontally can have 
this effect too especially 
if it's looking for things like, 
the gold band is always on the right because I 
forgot to flip the resistor when I was 
taking images myself, 
and instead it goes ahead and flips it 
so now the gold band is on the left telling 
the classifier that it shouldn't always look for one of 
these bands and that it should 
try to pick up on something else. 
Same thing with some of the other color bands. 
I really should be using resistors with 
varying degree of color bands because there's 
a good chance that these color bands 
are being picked up by the classifier. 
But moving on from there, 
the next thing we can do is rotation. 
This takes the image and rotates it. 
Here I'm doing a 45, 90, 
and 135 degree rotation, 
but you can really work with anything. 
You could do 15 degrees, 
and that affects it in a different way. 
Note here that I'm using mode Edge. 
This tells the transformation function in 
this case rotate whenever it rotates 
the image to fill in any blank spots 
with whatever pixel values it finds along the edge. 
Another way to look at this is 
if we change that to constant, 
it's going to fill it in with a constant value, 
in this case black or zero. 
As you can see, the image is rotated, 
but these black bars may not be useful. 
In fact, they may be 
detrimental if you're not working with 
a dark background like I am 
here where I've got this white background. 
It's going to tell the classifier 
to look at these dark backgrounds, 
or it might tell the classifier 
to look at these dark backgrounds. 
If you've got a dark background, great. 
If you've got edges that can be extrapolated 
out and create something that looks like a decent image, 
then going with edge here is 
something that I completely recommend. 
As you can see, it tries to fill it in with 
either the background or the leads of this resistor, 
which is totally fine. 
I'm okay with it doing that. Then same thing here. 
The image actually ends about here, 
but it goes ahead and copies 
the pixel values from that edge outwards. 
It ends up looking like leads on a resistor, 
which is exactly what I want. 
This may or may not work for you, 
but it works great for these electronic components. 
Then we have scaling and randomly cropping. 
You might also see this referred to as random zooms. 
What we're going to do is 
actually increase the size of this picture. 
So instead of 96 by 96, 
it becomes 120 by 120 or 240 by 240, 
however big you want to make it. 
So this image is now bigger and we're going to 
crop a section of that resized image 
to create something like a window that might look 
at this part of the resistor or this part, 
and it has the effect of moving your camera 
around or maybe a little closer to the resistor. 
In fact, you can see even it looks blurry here, 
which replicates the idea of moving my camera 
around so that the resistor is not 
always completely centered. 
This is a great way to make it look like the subjects 
not always in the center of your frame 
without having to take a bunch of extra pictures 
doing that. You can zoom out. 
There are transforms for that, 
but because you don't know what pixel values are 
outside the edges here, you're going to have to guess. 
You can either make them all one color 
like black or a dark color, 
or you can use that edge extrapolation where it just 
copies that pixels out and for something like this, 
the leads look like they would just extend out, 
which is fine and the resistor gets a little bit smaller. 
That's another way to do it, 
but the scaling and cropping works pretty well. 
Next, we have translation. 
This is where your object is 
just moved to a different part. 
We're not zooming in, 
we're not changing the size, 
it just moves it around and this 
helps the classifier figure 
out the shape of the object rather than 
looking for it in a particular position. 
This is another great way to train 
in more robust classifier 
where your subjects are not necessarily dead center. 
I actually do a random translation with 
this code where this resistor might be up here, 
or it might be over here up to 
about a fourth of the way over that way, 
I'm not cropping too much of my subject out of the frame. 
I'm actually doing this using the 
random.random function, not necessarily setting. 
I'm going to put the subject in this corner, this corner, 
this corner, this corner that is another way to do it. 
But rather just take a random smattering 
because if you have more than one original images, 
you should end up with a good amount of 
random translations throughout your entire data set. 
Finally, the last one I want to 
talk about is adding noise. 
There are a number of different types 
of noise you will find in images. 
If you're working in darker environments, 
or if you've got your ISO or 
Gain setting crank very high on your camera, 
you're going to have something that looks like 
this center image where it's very noisy. 
Noise is amplified in order to capture it, 
which is something that you'll run into often, 
especially with low-light photography. 
But you can also run into it 
in any instances, if you've got, say, 
a bad camera or a cheap camera, 
introducing some noise is 
a fantastic way to emulate that. 
Especially if you're trying to train a data set that was 
captured from a different camera 
that may not have that noise. 
This is part of the reason I always recommend 
capturing your training validation and test 
data sets from the same sensor 
that you're planning to use in deployment. 
This type of noise shown in 
the center here is known as Gaussian noise. 
It's what you'll commonly find added to 
images when you increase the Gain. 
So this is a good one to use. 
Another type of noise you might come 
across is known as salt and pepper noise. 
This is evidenced by this image on the right here. 
Salt and pepper noise is different from Gaussian noise. 
It sometimes results from bad or 
poor analog to digital conversion 
when you're capturing an image. 
It shows up more often in 
black and white photos from what I've seen. 
But it's still something that you can have 
and because it's different than Gaussian noise, 
it might be worthwhile adding into 
your training set just to give your 
model something else to look at where it tries to 
learn the shapes of things rather 
than trying to pick out individual pixel values 
which who knows exactly what value they 
will be with a bunch of noise like this and 
the noise will be different from image to image. 
Even with just the simple demo I showed you, 
that is 12 new images from just that one original image. 
If we include the original image, 
that means we can increase our data set by 13 fold. 
However, this is only a brief sampling 
of some possible augmentations. 
I didn't even include 
stacking the transformations like flipping, 
cropping, and adding noise to create one image. 
You could exponentially increase 
the size of your data set if you 
start creating four rotated images 
for every flipped image. 
There is a whole field of study around data augmentation 
and it gets into more advanced 
transforms that we didn't cover here, 
such as zooming along one-dimension 
and playing with the color balance and contrast. 
I recommend starting with 
some simple transforms at 
first to see what works and what doesn't. 
This might mean training multiple models and 
comparing their validation or 
test results with one another. 
I uploaded my augmented dataset 
to a new project in Edge Impulse. 
If we go into Data Acquisition, 
you can see that here, 
I've got over 2,500 items in the training set 
and 655 items in the test set. 
You can see the number of items in each of these. 
This is a lot more than my original set, 
which was just 50 of each class. 
Now, I've got over 500 in each class and 
most of that is being used to train my new classifier. 
If we take a look at the impulse, 
I'll first go into the feature extraction area. 
You can see that this is the augmented data. 
Here's one with noise, 
and I've converted them all to grayscale 
28 by 28 resolution, 
just like the original training that I did. 
The model should be the same. 
It's expecting the exact same input data 
and it's actually going to be the same architecture. 
I left it as the default Edge 
Impulse convolutional neural network 
with two convolutional layers then a flattened dropout, 
and then our five nodes for the classifier. 
I train this using 100 training cycles. 
You can see that it's actually a 
lot more accurate than before. 
That's because I'm working with a lot more data 
and more variability in the data so that the model can 
learn the features that I'm trying to 
teach it and not things like looking 
at the background or if there's just two leads, 
things that were in the original photos but might not be 
fully represented in different camera angles, 
lighting conditions, or even if 
you're moving the object around in the frame. 
As you can see here, the model 
still struggles a little bit with 
the diode and a resistor and the LED in the capacitor. 
But it's a lot better than what it used to be. 
If I go to model testing, 
I've classified all of my test data. 
You can see there are some misses here and there, 
but it's still maintaining about 97 percent accuracy, 
which gives me much more confidence that 
this model is going to work better 
when I go to deploy it. 
Something else you can do is look back at 
our saliency and class activation maps 
to see what the model is looking at. 
This time, it struggled with capacitors and LEDs 
before or trying to determine 
the difference between resistors and diodes, 
it looked like it was trying to 
examine the backgrounds for some LEDs. 
In fact, let's take a look at 
this LED that was misclassified and we 
can probably determine what was going on. 
First of all, you see that it got shifted, 
so it's missing this round cap, 
which makes me believe that this image might be 
shifted a little too much during 
data augmentation, which is okay, 
I'm happy to have some things that aren't classified 
properly because it helps 
us figure out what the model is looking at. 
I'm going to copy these features. 
We know that this is an LED. 
I'm going to go into my Colab script 
here and start running it. 
The other thing I need to do is download my saved model, 
which I'll do here. 
I'm going to upload the model.
Play video starting at :13:15 and follow transcript13:15
Since I've already downloaded this before, 
I need to rename it because these parentheses do 
not work well in Colab from what I've determined. 
There's the name of my model. 
I'm going to copy the path, paste it here. 
That should unzip the model for me. 
Then I need to get 
the processed features because I 
accidentally removed everything on my clipboard. 
I'm going to put them here.
Play video starting at :13:48 and follow transcript13:48
This wasn't LED, 0, 1, 
2, 3 LED is correct. 
I'm going to run through these 
so we can take a look at what 
the saliency map looks like as 
well as the class activation map. 
If we look at the saliency map, 
we can see that it's focusing on 
the pixels in those bottom part here. 
Which might be LED, might be resistor. 
We're not quite sure, 
but it is missing this round part of the LED. 
Interestingly enough, 
the classifier is looking in the center here, 
and I can see why it might think that's 
a resistor because it creates 
this oblong shape right in the center 
and it's missing that cap or any other edges. 
That would make sense why this got misclassified. 
That being said, it's a lot better than 
trying to look up top and down below 
for the background which it still might do but this 
is a good start and you get 
an idea of what's going on with the model. 
Data augmentation does not always work, 
but it can be useful if your images are fairly similar. 
Especially if you're working in 
a controlled environment where 
the background doesn't change 
much like what we've got here with 
the different components on the white background, 
your best bet is to always collect 
more data that represents your target use case. 
But failing that data augmentation can be quite 
powerful to help you train more robust models.
## CNN Visualizations and Data Augmentation
#### Project - Data Augmentation
Introduction
Data augmentation is an incredibly powerful technique we can use to increase the size of our dataset without having to manually collect more data. In this project, you will create a few image transform functions to generate new images for your original dataset.

Work through the cells, which will create the output directory. You will need to write one or more functions in the Transform Functions section. Each function you create should take a 2D Numpy array as an input (parameter) along with any other parameters you might need (e.g. number of rotations). 
Your functions must return one or more 2D Numpy arrays corresponding to the transforms. The array(s) must be stored in a list, even if it’s just one array (e.g. [transformed_img]).

In the first cell under the Perform Transforms section, you will need to call your custom functions. They should be called inside the append function. For example: 
1
img_tfs.append(your_function_name(img_array, params))
This function calls your function with the image array (and any other necessary parameters--you are welcome to pass any parameters you might want). It then appends the output to the img_tfs list.
The img_tfs list is ultimately flattened and returned. In the following cell, each image file is opened and all of your transform functions are called to create a list of transformed images. These new images are each saved as a new file in the output directory.
The new image filenames have a _x numbering scheme appended. For example, if the original image is 0.png, 0_0.png is a copy of the original image, 0_1.png is the first transformed image, 0_2.png is the second transformed image, and so on.
 
The final cell copies all of the output directories and files into a compressed file named augmented_dataset.zip. Right-click on the .zip file in the file browser and click Download.

### Train New CNN Model
Unzip the downloaded augmented_dataset.zip file. Start a new project in Edge Impulse and upload the new images, making sure to assign the labels as given by the folder names.
Follow the process from the Training CNN on Edge Impulse project to create an impulse. Set the target width and height of the Image data to be 28 x 28. Use an Image processing block and a Neural Network (Keras) learning block.
Make sure that the Color depth is set to Grayscale in the Image processing block. Generate features.
In your NN Classifier block, start with the default model architecture, and change the number of training cycles to something in the 50-100 range. 
Note: You may run out of free server time in Edge Impulse if you have too many training cycles and too many training samples.
Train your model and compare the performance results with those you got in the Training CNN on Edge Impulse project. How did the two models compare?
I found that even with the default model, I got better accuracy with less underfitting using fewer training cycles. It seems that gathering more data is a good solution to many machine learning problems!

#### Conclusion
While data augmentation is great if you have a small dataset, it should not be seen as a replacement for manually collecting data. Transforms are only so good at creating different rotations, skews, translations, etc. of the same image. They struggle at creating new images with truly different lighting, different backgrounds, different types of the same object, etc.
However, for our simple image classification project, data augmentation can give us a substantial boost in model performance and robustness!
When it comes to training 
complex machine learning models, 
there is a slick trick we 
can use called transfer learning. 
With this, we reuse a part of a model that's been 
previously trained on a different but similar dataset. 
Let's look at 
our very simple convolutional 
neural network as an example. 
Let's say, we fully trained this model 
to identify electronic components. 
The feature extraction portion of this model, 
which consists of all the convolution and pooling layers, 
has been trained to recognize certain shapes and 
patterns that make up those electronic components. 
This includes basic shapes that are 
large in the frame based on the training data. 
Now, let's say, we wanted to train 
this same model to recognize something else like nuts, 
bolts, and screws, 
we could retrain all of the filters and weights. 
For a small model like this, 
retraining everything would not take it a long time. 
However, when we start to look at 
much larger and more complex models, 
fully training all of the weights might 
take hours or even days, 
so it makes sense to 
reuse parts of the model where we can. 
To reuse a model known as transfer learning 
we would freeze whichever layers we want to keep. 
For things like image classification, 
we often want to freeze the convolution 
and pooling layers as we assume that 
they have been trained well enough on images that are 
somewhat close in structure to our new dataset. 
Freezing means that when we 
run the training process next time, 
the parameters in the frozen layers 
are not allowed to change. 
Next, we remove the layers that were not 
frozen and replace them with whatever we want. 
This might be a totally new flattened layer 
with different fully connected nodes. 
You could even add 
more than one fully connected layer 
or other layers if you desire. 
Note that our outputs change too, 
as we might want to recognize 
a different number of classes. 
We then run the training process 
again on this model using our new dataset, 
which consists of nuts, 
bolts, and screws. 
This time the convolution and 
pooling layer parameters are not updated. 
Only the new layers are updated. 
We are essentially saying that we're happy 
with the features that the first part of the model 
has extracted for a given problem and we're going to 
use that for our new, but similar problem. 
The dense neural network classifier is 
the only thing that gets updated at this point. 
When training is done, 
you might see a second training 
process occur called fine-tuning. 
Here, we first unfreeze any layers. 
We then run the training process 
again with a very low training rate, 
which means parameters are 
not allowed to change very much. 
This causes all the parameters to 
adjust slightly as necessary to 
pick up any minor variations in 
the features from the original training set. 
At this point, we hope that the model is 
capable of correctly identifying screws, nuts, 
and bolts instead of electronic components, 
and we'd want to test 
its accuracy on a validation or test set. 
Note that we have to use a dataset that has 
similar features to our original dataset. 
I would not expect this model to work well 
identifying cats versus dogs as 
the feature extraction portion 
has been trained to look for 
very simple shapes in 
electronic components rather than 
complex features like eyes, 
ears, nose, and so on. 
There are two main reasons 
why you'd want to do transfer learning. 
The first is that it saves you a lot of time. 
Some of the more complex models out there can 
take hours or days to train from scratch. 
Using transfer learning can get 
this time down to a few minutes. 
Next is that we can work with smaller training datasets. 
Because we assume that 
the feature extraction portion of the model 
has been trained on a similar dataset 
that's close enough to our problem, 
we only need to train the classifier portion. 
As a result, we can often get away with 
a few dozen training samples rather 
than a few thousand per class. 
Before we move on to discussing MobileNet, 
I'd like to make a quick note about model diagrams. 
I will try to draw my model layers going from top to 
bottom with the input at 
the top and the outputs at the bottom. 
This is to mimic how you might see the model 
described in Keras or in the Netron app. 
However, in many machine learning articles and papers, 
you might come across the reverse with 
the input at the bottom and the output at the top. 
An article might refer to 
replacing the top layers during transfer learning, 
for example, in this case, 
they're referring to the dense layers that make 
up the classifier portion of the network, 
which happened to appear at the top 
of this particular diagram. 
You'd have to look at any provided diagrams 
or infer from context whether 
the author is thinking about the network model 
being top to bottom or bottom to top. 
You might also see left to right sometimes, 
which just adds to the confusion. 
The models all work the same mathematically though. 
I generally try to stick to top 
to bottom for any content that I create. 
Over the years, researchers 
have put together collections of 
pre-made datasets for people 
to use with machine learning. 
These are popular with competitions or 
for using as the basis for purpose-built models. 
Classifying images is such a big problem in 
the machine learning world that there are 
many such datasets for images. 
One of the most popular datasets is CIFAR, 
and there are two versions. 
One with 10 classes of objects and 
another with 100 classes of objects. 
The images are all the same size 
and have all been labeled, 
which makes it perfect for training a model. 
Another large collection is the ImageNet dataset, 
which contains over 14 million images. 
These images have all been labeled with 
one of 100 to classes. 
An interesting feature of ImageNet 
is that the labels are part of a hierarchy. 
This can help you train a model that recognizes 
just a dog as a label or a specific breed of dog. 
COCO, which stands for Common Objects in Context, 
is another image dataset with over 300,000 images 
and 1.5 million objects 
that have been labeled in those images, 
which pixels make up the objects are 
also noted in the ground truth information. 
This is useful for training a segmentation model. 
We'll talk about object detection and 
image segmentation in the next module. 
Along with collections of datasets, 
researchers in industry and 
academia have created families 
of machine learning models designed 
to tackle specific problems. 
As image classification is 
one particular problem that's in very high demand, 
there are a number of models that have been 
developed specifically for image classification. 
Because ImageNet has such a large number 
of classes and a huge number of samples, 
it provides a great benchmark to test 
the accuracy and efficiency of these models. 
This particular paper compares the accuracy and 
computational requirements for many 
of the top neural networks. 
Resnet and VGG are two very popular models. 
You can see that they have decent accuracy. 
Some of the more accurate models 
also require many more operations. 
Since we are working with embedded systems 
with limited resources, 
we want to use a model that is closer to the left side of 
the graph as this should 
be less computationally expensive. 
We will be looking at MobileNetV2, 
as it was specifically developed for 
embedded and smartphone applications. 
While it might not be 
the most accurate model for 
image classification out there, 
it is very efficient while still performing 
decently well on such classification tasks. 
Here is what a version of 
MobileNetV2 trained on the ImageNet dataset looks like. 
As you can see, it's quite a bit more complicated than 
the basic convolutional neural network 
we were looking at earlier. 
It's designed to work with larger images color as well as 
grayscale with many more complex features 
and many more classes. 
There are some layers in here 
that we did not specifically cover, 
but this is still essentially a collection 
of convolution and pooling operations. 
This depthwise convolution operation 
performs convolution on 
the 32-dimensional feature maps that 
come from the previous convolution operation, 
but then it leaves the third dimensional load, 
creating a new stack of 32 convolved feature maps. 
The output from some of the layers are added to 
the output of future layers in 
what's known as a residual block. 
These skips resemble how 
the pyramidal cells in our brains work. 
They also help fight a problem 
in training known as vanishing gradient, 
where the parameter stop updating due to a very 
small gradient being calculated during backpropagation. 
MobileNetV2 also contains a number 
of bottlenecks where the number of 
neurons or convolution filters in 
a layer is fewer than the layer above it. 
For example, the output of one layer might be 
fewer two-dimensional feature maps 
than the layer before it. 
Then you'll see the number of maps get larger again. 
These advanced techniques help the model 
learn complex features of images. 
At the end of the feature extraction section, 
you'll see that we're left with a bunch 
of seven by seven feature maps. 
A global average pooling operation 
is applied to each of these feature maps to 
create our flattened array where 
each element is an average of the feature map values. 
From here, we can perform our classification using 
a densely connected neural network to make 
a prediction of the label of the input image. 
When working with transfer learning, 
we often want to start with 
a pre-trained network like this one. 
All of the weights for the convolution filters 
have already been set 
by training this model on 
a known good dataset like ImageNet. 
Here is a version of MobileNetV2 that 
I used for transfer learning in Edge Impulse. 
What's interesting is that we can 
change the input to MobileNet somewhat. 
Instead of a 224 by 224 color image as input, 
I modified this one to accept 48 by 48 grayscale images. 
The convolution layers still 
work the same as they're not really 
dependent on having a particular size 
and shape for their input. 
The only thing we have to be concerned with is 
that at the end of the feature extraction chain, 
after the convolution layers 
reduce the dimensions of the feature maps, 
individual output feature maps are 
not smaller than one by one pixels. 
This means that for most applications of MobileNetV2, 
the smallest input shape is 
32 by 32 pixels from a grayscale image. 
Edge Impulse removed the average pooling layer for 
this particular model and added 
two fully connected or dense layers for classification. 
This is the section that gets 
trained during transfer learning. 
However, all of the parameters get tweaked 
a little bit during the fine tuning stage. 
I used this model to identify when my dog was present in 
the frame or if it was just an image of my empty office. 
That's why there's only two outputs. 
However, you can use MobileNet and transfer learning to 
create a model that can predict any number of classes, 
of course, within reason and 
the capabilities of mobile net in your processor. 
Keras, as well as Edge Impulse, 
comes with a collection of 
pre-trained MobileNetV1 and V2 models. 
Note that common pre-trained MobileNetV2 models 
were trained on ImageNet were cocoa images with 
specific resolutions like 96 by 96 or 224 by 224 pixels. 
That means if you don't use one 
of those resolutions for your input, 
you might lose some accuracy 
as that's what the parameters were trained on. 
Something else that you can adjust 
is this Alpha hyperparameter. 
This controls the number of filters in each layer. 
A larger Alpha value means 
more computations and a more complex model. 
This could mean more accuracy, 
or it might mean more overfitting. 
You should try a few values of 
Alpha to see what works best, 
or start off with the lowest Alpha to 
see if that works well enough for your problem. 
Here is the original paper that proposed MobileNetV2. 
In it, you can find a chart that gives you an idea of 
its accuracy on the ImageNet dataset 
compared to other models. 
As you can see, version 2 
generally works better than version 1. 
You can increase the resolution of the input image, 
which can help with accuracy. 
However, it often means more computations are required. 
Once again, for embedded work, 
you'll want to find the most efficient model capable 
of performing your task with your desired accuracy. 
Transfer learning, which reuses 
parts of a previously trained in neural network, 
is an incredibly powerful technique. 
It allows us to quickly create models 
that meet our needs and with less training data.

## Digging Deeper into Transfer Learning

•	Now that we've seen how transfer learning works in theory, 
•	let's put it into practice using edge impulse. 
•	Will be using MobileNetV2 that's been pre trained on the cocoa image data set and 
•	then tweak it for our project. 
•	To start, you'll want to collect a new data set. 
•	You're welcome to use the old data set that we were using in the first part of 
•	the course. 
•	But I recommend trying something new here. 
•	One of the advantages of transfer learning with a complicated model like mobile net, 
•	is that we can do things like have complex backgrounds and have our shape or 
•	object or objects that we want to detect, within those complex backgrounds, and 
•	by that I mean a variety of background shots. 
•	Here is a bunch of shots from my office just at different angles, 
•	different positions, places where I would predict I might find my dog, 
•	and you should do something similar. 
•	I'm going to be creating a dog detector specifically whenever my dog is spotted in 
•	the frame. 
•	I'm going to get notified from my open mV, my Arduino, whatever it is, 
•	my embedded system that I'm training this model for, and I have a bunch of different 
•	photos of my office from a various number of angle zoom levels and what not. 
•	It's all around this backpack as you can see, because that's where I'm going 
•	to hopefully see my dog and it's kind of gray. 
•	I also got a number of photos of my dog in various positions and I'm using 
•	the open mV to capture these but feel free to use whichever device you'd like. 
•	These are all done with a 96 by 96 resolution and color, but we're going to 
•	ultimately convert these two probably something like 48 by 48 and grayscale. 
•	And as you can see, I just have my dog in a variety of positions here. 
•	Hopefully I can create something that spots the dog and you need about 50 
•	samples per class and I only have two classes here, background and dog. 
•	But you're welcome to do a multi class classifier, I know that in the edge 
•	impulse documentation, they show you how to do something similar that 
•	is background, plant and lamp, and you can identify one of those three. 
•	So it can be binary, it can be multi class, 
•	just make sure you have about 50 samples of each, and as you can see, I don't need 
•	a perfectly white background as we're using a fairly complex model here. 
•	Once I'm done with that, you'll notice that these are in bit map format to work 
•	with edge impulse whatever format you were using, I recommend converting them to png, 
•	and I've got them here, they're all nicely labeled, here's the ones with my dog. 
•	Once again, 96 x 96 resolution, and here are the ones with the background and 
•	just a variety of positions here as you can see I've got a few more here than just 
•	my backpack, it's all around my office and from here we can go to edge impulse. 
•	Once an edge impulse sign into your account and create a new project just like 
•	we've done before, I'm going to call mine, dog detector. 
•	Once at your dashboard click out of the pop up. 
•	We don't need anything from here, we can go right to data acquisition, and 
•	upload your samples that you collected. 
•	Remember that you need png files Hm-mm. 
•	I'm going to go to wherever I keep my dataset, 
•	png version of my dog detection data set. 
•	I'm going to start with background, I'm going to select all of them, 
•	click open, I'll let it automatically split between training and 
•	testing and I'm going to enter the label of background. 
•	I'm going to do the same thing again for my next class which is dog, 
•	change the label to dog, and upload again, and you'll want to do this for all 
•	of the classes that you have and it could be two, it could be one, whatever it is.
•	Play video starting at :3:47 and follow transcript3:47
•	Once it's done I recommend clicking on data acquisition and making sure you have 
•	your two classes with 50 items between the training data, and the test data. 
•	Once you're happy with that, go to impulse design you can stick with 96 by 96, 
•	but I found that for some of the lower end embedded systems 96 by 96 is a little big, 
•	so this didn't even work well with my open MVH7V non plus version, but 
•	it would probably work with the plus version. 
•	So to make things work, I recommend going 48 x 48, and 
•	remember 96 x 96 is a common format that mobile net was trained for, so 
•	I might lose a little bit of accuracy by downscaling or resizing the image to 
•	something that's smaller, but hopefully it will still work for our purposes. 
•	Just I recommend working with something that is the lowest resolution possible 
•	that gives you the accuracy that is good enough for what you need because then 
•	it means you can work with lower resources on your embedded system. 
•	Next we're going to add a processing block, we're going to stick with image and 
•	then we're going to add a learning block, instead of using a regular neural network 
•	this time, we're going to use transfer learning, and this allows us to use or 
•	start with a pre trained network, click save impulse, and then go to image, 
•	where we're going to do more or less feature extraction for our images. 
•	In this case we're going to start with the color images, 
•	we're going to go to gray scale and they're going to be converted to 48 by 48. 
•	They're going to be resized from 96 by 96 to 48 by 48, 
•	and that will be the input images for our training. 
•	So click save parameters, and then click generate features. 
•	Once that's done, you should see the Feature explorer appear and 
•	you can take a look at where your classes are, 
•	remember that this is just a very simplified visualization of your images 
•	because there are more than three dimensions in your image. 
•	There are hundreds or thousands of dimensions and these just get reduced to 
•	something that you can kind of get an idea of certain aspects of the image, and 
•	it looks like it's going to be fairly separable here, which means that we should 
•	be able to train a decent model, also, don't forget because we're working with 48 
•	by 48 images, you will need to supply the model a 48 by 48 grayscale image during 
•	inference whenever you go to deploy that to your embedded system. 
•	Let's go to transfer learning, I'm going to bump this up to 50 cycles, 
•	I find that 20 is a little low sometimes because I like to just train things for 
•	longer and see what we get. 
•	The learning rate is fine, I'm going to choose to not do basic data augmentation 
•	at this time, and it looks like it starts us with mobile net V 2.0.35 and 
•	0.35 here, is that alpha parameter, which just determines the number of parameters 
•	of the number of filters in that feature extraction portion of mobile net, 
•	which means that we can potentially get a better model that can identify 
•	more complex features with more of those filters but 
•	it means more resources and we run the risk of overfitting. 
•	You can click on the first button to edit the number of neurons in that final dense 
•	layer to help with your classifier, once it's done with the feature extraction that 
•	gets added on and trained separately as part of the transfer learning process. 
•	There's also a dropout layer that you can affect, 0.1 seems to be fine, or 
•	you can delete the whole model. 
•	I'm going to choose to use a different model here. 
•	In fact, I'm going to look for MobileNetV2 which generally works better than 
•	MobileNetV1, and I'm going to pick one honestly with the lowest alpha rate here, 
•	simply because I'm working with such a simple data set and 
•	I want to use as few resources as possible for 
•	RAM and ROM which are shown here for each of the potential models. 
•	As you can see as you start going up in image resolution, this is the resolution 
•	that the input expects but it can be changed, the minimum is 32 x 32 pixels, 
•	we're going to use 48 by 48 because the parameters that resulted from training 
•	this model on ImageNet, we're set for this resolution. 
•	So expect a slight drop in accuracy, as you go up in the resolution as well as 
•	the alpha, you're going to start using more resources. 
•	In fact, here's one that uses over a megabyte of RAM, so keep that in mind when 
•	choosing a model and what kind of embedded system you're working with. 
•	Even the smallest model that we have uses a good amount of RAM and flash memory, 
•	so pay attention because MobileNet still requires a fairly beefy processor 
•	with lots of resources and usually a fast clock speed in order to work. 
•	So I'm going to choose this one with 0.05 alpha, I'm going to click add, 
•	that appears here and I'm going to click start training. 
•	This is going to show the training output on the top right, and 
•	we just wait for a while until this is done. 
•	You'll see that it queues up on their server, and it's going to train the model, 
•	whenever it says training model, 
•	that means it's training the classifier portion that's after the feature 
•	extraction that we saw when we were looking at MobileNetV2. 
•	So, those parameters are going to stay the same because they were trained on Image 
•	net, which is that large database of images and we're just training that 
•	classifier portion to work with our project here, dog versus background. 
•	Next it's going to fine tune, which means that it unlocks or unfreezes all of 
•	the parameters Sets, a very low learning rate and just tweaks them a little bit. 
•	It's not going to take very many epics in order to do this, when it's done, 
•	you should take a look at the accuracy and validation accuracy. 
•	Hopefully these are over something like 90%, which is really, really good for 
•	something like this. 
•	And as you can see this trained very quickly in comparison to our fully 
•	training model, even though this is a much more complicated neural network. 
•	When it's done, it's going to show you the validation set performance 88.9% is not 
•	bad, it looks like it's troublesome with a few background and 
•	dog images on the validation set, 
•	you can see those in red dots here where it misclassified those. 
•	But this is not a bad starting spot, in fact, 
•	I would be happy to deploy this knowing that some frames may be misclassified, and 
•	I can go back and look to see which images might do that, and 
•	order to do that we can go to testing and use the model to perform inference with 
•	our test set that we set aside at the beginning. 
•	Let's classify all of those and see what happens. 
•	When it's done you can see that we have an accuracy of 92.3 and the only thing that 
•	it missed is that one of the dog images it considered it to be uncertain, 
•	It didn't even misclassify it, it just didn't meet the threshold. 
•	So we can look at that to see where it was misclassified or 
•	try to figure out what was going on here. 
•	When you load it, it looks like my dog was looking away, 
•	there's no identifying features of eyes or mouth that are clearly in the frame. 
•	So I can see why this might have struggled, in fact, it's still thought 
•	it was a dog, but it just didn't meet that threshold and it counted it as uncertain. 
•	I believe it needs to have a 0.6 threshold in order to count it as I know that 
•	this is a dog so it struggled a bit, this makes total sense. 
•	So I believe the model needs to have the dog looking at the camera or 
•	at least a solid profile picture of the eyes, ears, snout, in order for 
•	the feature extraction portion to pick up on those features because that's what 
•	it was trained for, when it was trained with ImageNet and 
•	I'm okay with this being misclassified for my particular deployment. 
•	The other thing you can do is try playing with some of 
•	the hyper parameters that you can find in the learning section. 
•	So some of the options here, where you can pick the number of neurons, and 
•	you can adjust the dropout rate that may help or 
•	you might have to try a different model, maybe a different resolution for 
•	your particular project to get an accuracy that you are happy with. 
•	When you are done had to deployment and you have a number of options to work with, 
•	you can do a strict CC++ library to work on any number of systems, 
•	assuming you have some way to compile C++ files. 
•	The other thing you can try an Arduino library, 
•	there are a few Arduino boards powerful enough to support mobileNet. 
•	The portentous is one of those, but if you're working with the portentous or 
•	the open mV camera, I recommend using the open mV because it's much easier to work 
•	in micro python, but if you need CC++, you want that absolute efficiency and 
•	execution speed, then try the Arduino or the CC++ library. 
•	Finally, there's also a way to deploy it to Lennox boards, 
•	if you want to get this working on a single board computer. 
•	Once you've selected the library or output that you like, click build, 
•	it's going to create the library, and it will download it for you, 
•	from there you can deploy it to your system of choice and give it a shot. 
•	I copied the trained.tflite and labels.txt file to the SD card 
•	of my open mV camera so that those would load into this program. 
•	I also changed the frame from 96 by 96 to 48 by 48 because 
•	I have no good way to easily resize images captured in the frame over here. 
•	And that's just something that is with open mV, there are python libraries or 
•	you can write your own way to resize images as you capture them but 
•	capturing a pure 48 by 48 image seems to work well enough. 
•	I'm also capturing in grayscale this time, so 
•	that the image is ready to be fed directly to my model. 
•	Much of this should look very similar to the original densely connected 
•	neural network that we did earlier. 
•	If you were using the open mV cam, once we have the image we can feed it and 
•	the model file to this tf.classify function, that classifies the image for 
•	us and gives us the prediction back as a collection of objects. 
•	We look at the first one since we're only doing a single classification, 
•	we're not trying to do several classifications or object detection yet. 
•	With those predictions we can print out the probabilities in our serial terminal, 
•	I'm only doing that once every few inferences. 
•	So you're missing a few when things are being printed to the console and 
•	that's okay. 
•	However, I am drawing the label and the probability of that label to 
•	the background image in open mV, so we can take a look at what's going on. 
•	When I point this around the room, you can kind of see that it works pretty well. 
•	It's looking at my crazy rug here, there is dog and it identifies my dog as dog, 
•	and it looks like it identifies dog even without the face, 
•	but it struggles a little bit. 
•	My guess is if it has a face, if it has the face in the frame, 
•	it definitely knows that it's my dog. 
•	However, if it's just the body or part of the coat, 
•	then it struggles a bit probably because it's looking at various patterns like 
•	the black stripes on the light background. 
•	If I move around my office some more, you can see some boxes and back to dog and 
•	if we come back to this area where there are some cables hanging down, 
•	you can see it actually misclassified as those as dog, 
•	which is part of what makes me believe that it can identify the backside of my 
•	dog because those stripes on the back are dark against a lighter background. 
•	I did not have these cables hanging down as part of the training set so 
•	it doesn't know that these should not be classified as dogs. 
•	That's something I can learn by testing in a live environment, 
•	I should capture these, put those in my training set so 
•	that it does not miss classify them as my dog, but 
•	I can set up my dog recognition machine here to automatically open a door or 
•	give pippen here some water or food or what not, 
•	pippen pippen, who was my good boy, that is dog. 
•	So in addition to dog, you can also make a person classification machine to identify 
•	if people are at your door, say if they're ringing your doorbell or 
•	about to, maybe set up something to prevent somebody stealing a package or 
•	if somebody's in your home and you need to know or you want to turn on lights or 
•	turn on a fan or the air conditioner, 
•	this is another way to do a detection to see if somebody is in an area. 
•	I hope this helps you see how you can use transfer learning to train a model that's 
•	been trained on other objects in order to very quickly produce something for 
•	an embedded system. 

### Project - Transfer Learning
Transfer learning is a powerful tool in our machine learning toolbelt. It allows us to re-use models that were previously trained on larger and more complex datasets. We can freeze portions of the model while we train the rest of it on our new dataset. This, in turn, allows us to train with much less data and fewer epochs.
In this project, you will gather a new dataset, upload it to Edge Impulse, and train a model using transfer learning.
Collect Data
Use one of the data collection techniques shown in the Loading and Manipulating Images project (e.g. using your Raspberry Pi, OpenMV Camera, or your cell phone). This time, you can collect images of more complex subjects and backgrounds.
For example, I wanted to identify whenever my dog was in my office (and visible to the camera). So, I captured 50 samples of my office from various angles.
 
I then captured 50 samples of my dog in various positions around my office and from different angles.
 
You should collect at least 50 samples of each category. You are welcome to use more than 2 classes to create a multiclass classifier. 
I also recommend having each image be color and have a 96x96 resolution. We will resize the images and convert them to grayscale in Edge Impulse. Convert all images to .png (if necessary).

Transfer Learning in Edge Impulse
Create a new project in Edge Impulse. Upload your images, making sure to label them appropriately.
Create a new impulse. Feel free to change the image width and height in the Image data block. Note that for transfer learning, the width and height must be at least 32 x 32. 
Because the original model was trained on a dataset using images with 96x96 or 160x160, it’s recommended that you scale your images to one of those resolutions. However, you’re welcome to try other resolutions (i.e. models with smaller input resolutions will require fewer computational resources), but you may see a decrease in accuracy. 48x48 seemed to work well enough for my dog classification model.
Add a Transfer Learning (Images) block. Save our impulse.
 
Go to the Image screen on the navigation bar. Change the Color depth to Grayscale. Click Save parameters followed by Generate features on the next screen.
When that’s done, head to the Transfer learning page (from the navigation bar). I recommend setting the Number of training cycles to something like 100.
Feel free to click Choose a different model to see the available pre-trained neural networks.
 
Model names are given as follows:
<model-name/version> <resolution> <alpha>
I recommend choosing one of the MobileNetV2 models with an input resolution close to your desired resolution. For example, 96x96 is closer to my 48x48 images, so I chose that.
The alpha parameter describes the width of the network. A lower alpha means fewer convolutional filters per layer whereas a higher alpha means more convolution filters per layer. Fewer filters require fewer computational resources but may be less accurate.
I recommend experimenting with different models to find one that works best for your dataset and requires the least amount of flash, RAM, and computation time. MobileNetV2 96x96 0.1 seemed to work well for my particular classification needs.
Click Start training. When it’s done take a look at the confusion matrix and accuracy results. How well did the model perform on your validation data?
 
Remember that for transfer learning, we are working with a model that was pre-trained on an existing dataset (such as ImageNet). The weights in the feature extraction portion (e.g. convolution stages) of the model were optimized to classify only the image categories present in that dataset.
If your particular dataset contains objects that were not in that original dataset (such as exotic animals or uncommon machinery), the model may not perform very well. I recommend sticking to common objects, like trees, people, pets, cars, etc. You can find a list of such common objects in the ImageNet labels list. 
Feel free to adjust the hyperparameters to try getting a more accurate model, such as:
•	Input dimensions
•	Color vs. grayscale
•	Model type and version
•	Alpha
You can also try using the data augmentation script you developed in the previous project to expand your current dataset!
#### Conclusion
Transfer learning is incredibly useful, but it has limitations. If you are trying to identify uncommon objects or classify non-photographic images (e.g. spectrograms), models pre-trained on sets like ImageNet may not work well. However, if you want to quickly create a robust model that identifies common objects, starting with a pre-trained CNN might be the way to go.
## Project - Deploy CNN Image Classifier
Now that you have trained two different convolutional neural networks, you will deploy one of them to an embedded device. You have the choice of doing this on the Raspberry Pi or OpenMV Camera.
For this project, you must choose to work with either your original CNN (trained from augmented data) or the one trained via transfer learning. Either is acceptable.
#### Option 1: Raspberry Pi
Create a folder to hold your program and model file:
mkdir -p Projects/image-classification-cnn
cd Projects/image-classification-cnn
Download the model file for your project:
edge-impulse-linux-runner --clean --download modelfile.eim
When asked, choose the project you wish to download the model from (e.g. CNN with augmented data or transfer learning).
Your mission is to write a program that performs live inference: for each image that is captured from the camera, perform inference and output the results (either to the console or on the preview window).
I recommend using the program from the Deploy Image Classifier project as a starting point. You may want to perform a static image inference test first (e.g. load a known good image from your test set and perform inference). If you get stuck, you can see my static CNN demo here.
Edge Impulse has an ImageImpulseRunner module that I recommend using to perform feature extraction prior to sending the features to the model for inference. This will resize and convert the image to grayscale as necessary. You may want to check out this example to see how to use it.
When you are done, your program should be able to classify images in real-time.

#### Option 2: OpenMV Camera
Download your model file from your Edge Impulse project. Either go to Dashboard and download the TensorFlow Lite (int8 quantized) model, or go to Deployment and download the OpenMV library. If you download the library, you will want to unzip it and copy the model file.
Move the model file to the SD card of your OpenMV Camera. Change the name to something like trained.tflite.
Your mission is to write a program that performs live inference: for each image that is captured from the camera, perform inference and output the results (either to the serial terminal or on the preview in OpenMV IDE).
I recommend using the program from the Deploy Image Classifier project as a starting point. You may want to perform a static image inference test first (e.g. load a known good image from your test set and perform inference). Note that you will need to make a few changes:
•	You will likely need to change the width and height of the sensor.set_windowing() function to the width and height of the target images you set in Edge Impulse
•	You can pass the captured image directly to the tf.classify() function rather than having to copy it to a separate (vector) buffer, as the model now works with 2D data
•	You are welcome to adjust the img.gamma_corr() function to adjust the brightness and contrast of the captured images
If everything works, you should be able to point the camera at your target object(s) and have it classify it correctly. 

#### Conclusion
I hope that this exercise helped you see how to use a trained CNN for inference on an embedded system to perform image classification. As you might have noticed, getting the model to work with live inference can be quite difficult! If you are struggling to get the model to recognize your objects, I recommend playing with the lighting, focus, and contrast. If none of that works, you can try augmenting your dataset or capturing new images for training.
I know that we covered a lot in this module. 
Let's take a moment to review what we've learned. 
First, we talked about how image convolution works. 
This is accomplished by sliding a window 
across the image with a given stride and 
performing an element-wise weighted sum 
of all the values underneath the window. 
This produces a single output in 
a new array that constitutes a feature map. 
Here is the general form of convolution, 
assuming we're using a stride of one without any padding 
of the input array and using 
a kernel larger than one by one, 
we'll end up with an output array 
that's smaller than the input array. 
We could also pad the input array or image with 
zeros or copy the values 
at the edges to make the input bigger. 
Usually, we would want to add enough padding so that 
the output array is the same size as the input array, 
this is known as same padding. 
No padding is known as valid padding. 
Convolution is an incredibly important method that allows 
us to filter an image in a variety of manners as shown, 
we can blur, sharpen, 
or create interesting effects like this embossed image. 
We can also use convolution to detect edges in the image, 
which forms the basis for 
how many computer vision techniques detect objects, 
including convolutional neural networks. 
Another important function you will see is pooling, 
and there are two common types, average and max. 
In average pooling, a window slides over 
the whole image and 
the pixel values under that window are averaged together. 
This helps to reduce 
the dimensions of an image or feature map. 
The stride in a pooling operation is almost 
always the same as the width and height of the window. 
In this example, the window is 
two-by-two and the stride is also two. 
Max pooling is similar, 
but instead of averaging the values under the window, 
the maximum value is chosen instead, 
this helps to reduce the dimensions of 
an image while highlighting the most salient features. 
While both pooling methods have their uses, 
max pooling is more common after 
convolution as it helps to 
pick out the most important features. 
We then saw how these operations fit 
together to create a convolutional neural network. 
A series of convolution and pooling operations can be 
stacked together to create a feature extraction section. 
The weights in the convolution filters 
become the parameters in this stage, 
which get automatically updated during training, 
this helps the model learn the most important features in 
images in the training set which 
are fed to the classifier section. 
Note that each convolution layer 
may have more than one filter. 
For example, 
if that first two-dimensional convolution layer 
has 16 filters, 
then 16 feature maps will be 
generated that are passed on to 
the next max pooling layer. 
Once the image has been filtered 
and pooled one or more times, 
the feature maps are rolled out or flattened to create 
a one-dimensional array or vector 
that is sent to a simple fully connected, 
or dense neural network. 
This network acts as the classifier in 
order to predict the label of the input feature set. 
When trained, this model will 
ideally figure out which features are the most 
important in any given image and use those features as 
calculated by convolution and pooling to 
classify the original input image. 
Next, we saw how we could modify 
a trained convolutional neural network 
to produce a series of 
visualizations to help us see what's going on. 
In both cases, we need to 
remove the softmax activation of 
the last layer as the math 
requires the raw output of the neural network. 
We first looked at saliency maps, 
in these the pixels that are determined to have 
the biggest effect on 
the decision-making process are highlighted. 
The second was a gradient weighted class activation map, 
where the features from 
the final convolution step are 
shown and the elements that 
had the biggest effect on 
the decision-making process in 
the neural network are highlighted. 
We can use the information 
in these maps to help us figure 
out how a convolutional neural network 
is making decisions. 
Specifically, it tells us 
where the network was looking in 
each input image or feature map to make its decision. 
Knowing that the model is looking at 
a particular area is a great way to 
identify potential biases or 
problem areas worth exploring. 
One way to fight model biases 
is to use data augmentation. 
Here we can perform a number of image transforms on 
a single training image to produce 
dozens or hundreds of copies of that one image. 
For example, with 12 simple transformations, 
we can multiply our training set by 13, 
going from 250 images to 3,250 images. 
If a model was looking at a certain area of 
the image for a particular background or glint of light, 
this process can help tremendously 
by moving or flipping the object in question. 
In other words, it keeps 
the model on its toes when training. 
Finally, we looked at transfer learning 
where we can take a pre-trained model, 
freeze some of the parameters, 
and train a new classifier. 
This can help us reuse a model that's 
been trained on a similar set of data. 
For example, we looked at MobileNetV2 that was trained on 
the popular ImageNet dataset that contains millions of 
images categorized into 1,000 classes. 
We freeze the feature extraction set of parameters, 
removed the classifier section, 
and add our own that 
corresponds to the specific problem we want to solve, 
such as making a dog or a person detector. 
We then train using the whole model, 
but only allowing the parameters in 
the final few layers to update. 
A final fine-tuning step can be 
performed where we unfreeze all of 
the layers and run a few training epochs to 
allow all of the parameters to update a tiny amount. 
This allows any parameters to shift 
a small amount as needed to better work with our data. 
Transfer learning allows us to train 
complicated models much more quickly, 
assuming we are working in a similar problem space. 
For example, using a model pre-trained on 
ImageNet to identify common objects like people, 
cars, animals, and so on, 
you would not want to perform transfer learning 
on such a model to identify things like 
frequency spectrum grams as the images are 
vastly different in their composition 
than the original training set. 
Additionally, since most of 
the perimeters are already set, 
you can get away with far fewer training epochs. 
Training MobileNet from scratch might 
take 10,000 epochs or more, 
but you can do transfer learning 
in just a few dozen epochs. 
Transfer learning can save you a lot of time and energy. 
Now that we've seen how to classify images 
using both simple and complex neural networks, 
we're going to look at object detection 
in the next module. 
That's where we start to be able to 
locate an object in an image.
## Share Your CNN Classifier Project

You may choose to discuss whichever model you decided to deploy to your embedded system (e.g. simple CNN or CNN trained through transfer learning).
•	Did your model work better or worse with data augmentation?
•	What types of images did you attempt to classify with transfer learning?
•	How well did your model perform on the validation and test sets?
•	How well did the model perform on live inference?
•	You may also optionally post any photos of the setup you wish to share!


I know that we covered a lot in this module. 
Let's take a moment to review what we've learned. 
First, we talked about how image convolution works. 
This is accomplished by sliding a window 
across the image with a given stride and 
performing an element-wise weighted sum 
of all the values underneath the window. 
This produces a single output in 
a new array that constitutes a feature map. 
Here is the general form of convolution, 
assuming we're using a stride of one without any padding 
of the input array and using 
a kernel larger than one by one, 
we'll end up with an output array 
that's smaller than the input array. 
We could also pad the input array or image with 
zeros or copy the values 
at the edges to make the input bigger. 
Usually, we would want to add enough padding so that 
the output array is the same size as the input array, 
this is known as same padding. 
No padding is known as valid padding. 
Convolution is an incredibly important method that allows 
us to filter an image in a variety of manners as shown, 
we can blur, sharpen, 
or create interesting effects like this embossed image. 
We can also use convolution to detect edges in the image, 
which forms the basis for 
how many computer vision techniques detect objects, 
including convolutional neural networks. 
Another important function you will see is pooling, 
and there are two common types, average and max. 
In average pooling, a window slides over 
the whole image and 
the pixel values under that window are averaged together. 
This helps to reduce 
the dimensions of an image or feature map. 
The stride in a pooling operation is almost 
always the same as the width and height of the window. 
In this example, the window is 
two-by-two and the stride is also two. 
Max pooling is similar, 
but instead of averaging the values under the window, 
the maximum value is chosen instead, 
this helps to reduce the dimensions of 
an image while highlighting the most salient features. 
While both pooling methods have their uses, 
max pooling is more common after 
convolution as it helps to 
pick out the most important features. 
We then saw how these operations fit 
together to create a convolutional neural network. 
A series of convolution and pooling operations can be 
stacked together to create a feature extraction section. 
The weights in the convolution filters 
become the parameters in this stage, 
which get automatically updated during training, 
this helps the model learn the most important features in 
images in the training set which 
are fed to the classifier section. 
Note that each convolution layer 
may have more than one filter. 
For example, 
if that first two-dimensional convolution layer 
has 16 filters, 
then 16 feature maps will be 
generated that are passed on to 
the next max pooling layer. 
Once the image has been filtered 
and pooled one or more times, 
the feature maps are rolled out or flattened to create 
a one-dimensional array or vector 
that is sent to a simple fully connected, 
or dense neural network. 
This network acts as the classifier in 
order to predict the label of the input feature set. 
When trained, this model will 
ideally figure out which features are the most 
important in any given image and use those features as 
calculated by convolution and pooling to 
classify the original input image. 
Next, we saw how we could modify 
a trained convolutional neural network 
to produce a series of 
visualizations to help us see what's going on. 
In both cases, we need to 
remove the softmax activation of 
the last layer as the math 
requires the raw output of the neural network. 
We first looked at saliency maps, 
in these the pixels that are determined to have 
the biggest effect on 
the decision-making process are highlighted. 
The second was a gradient weighted class activation map, 
where the features from 
the final convolution step are 
shown and the elements that 
had the biggest effect on 
the decision-making process in 
the neural network are highlighted. 
We can use the information 
in these maps to help us figure 
out how a convolutional neural network 
is making decisions. 
Specifically, it tells us 
where the network was looking in 
each input image or feature map to make its decision. 
Knowing that the model is looking at 
a particular area is a great way to 
identify potential biases or 
problem areas worth exploring. 
One way to fight model biases 
is to use data augmentation. 
Here we can perform a number of image transforms on 
a single training image to produce 
dozens or hundreds of copies of that one image. 
For example, with 12 simple transformations, 
we can multiply our training set by 13, 
going from 250 images to 3,250 images. 
If a model was looking at a certain area of 
the image for a particular background or glint of light, 
this process can help tremendously 
by moving or flipping the object in question. 
In other words, it keeps 
the model on its toes when training. 
Finally, we looked at transfer learning 
where we can take a pre-trained model, 
freeze some of the parameters, 
and train a new classifier. 
This can help us reuse a model that's 
been trained on a similar set of data. 
For example, we looked at MobileNetV2 that was trained on 
the popular ImageNet dataset that contains millions of 
images categorized into 1,000 classes. 
We freeze the feature extraction set of parameters, 
removed the classifier section, 
and add our own that 
corresponds to the specific problem we want to solve, 
such as making a dog or a person detector. 
We then train using the whole model, 
but only allowing the parameters in 
the final few layers to update. 
A final fine-tuning step can be 
performed where we unfreeze all of 
the layers and run a few training epochs to 
allow all of the parameters to update a tiny amount. 
This allows any parameters to shift 
a small amount as needed to better work with our data. 
Transfer learning allows us to train 
complicated models much more quickly, 
assuming we are working in a similar problem space. 
For example, using a model pre-trained on 
ImageNet to identify common objects like people, 
cars, animals, and so on, 
you would not want to perform transfer learning 
on such a model to identify things like 
frequency spectrum grams as the images are 
vastly different in their composition 
than the original training set. 
Additionally, since most of 
the perimeters are already set, 
you can get away with far fewer training epochs. 
Training MobileNet from scratch might 
take 10,000 epochs or more, 
but you can do transfer learning 
in just a few dozen epochs. 
Transfer learning can save you a lot of time and energy. 
Now that we've seen how to classify images 
using both simple and complex neural networks, 
we're going to look at object detection 
in the next module. 
That's where we start to be able to 
locate an object in an image.
So let's actually see how this example looks in real life. 
Let's switch the model to healthy, just to be sure. 
So it starts to create data.
Play video starting at ::9 and follow transcript0:09
And in the notebook, you see data arriving, one dot for each sample. 
At the same time, they're using the dashboard functionality of 
the IBM Watson IOT platform, in order to see what's going on.
Play video starting at ::21 and follow transcript0:21
So we see real time, time series plots and displaying the data. 
Later we expect anomaly score and a lot to be populated. 
Remember we've sent a no alert message to the UI in the notebook.
Play video starting at ::34 and follow transcript0:34
Once the buffer is filled with 3,000 elements, 
data gets sent to the neural network for training.
Play video starting at ::39 and follow transcript0:39
This is called account-based tumbling window. 
Other variants of stream processing windows are time based, or 
session based windows. 
All of them can be tumbling or 
sliding, but we will cover stream computing in a separate class very soon.
Play video starting at ::53 and follow transcript0:53
So now the window was full and 
was sent downstream to the neural network for training. 
It takes a little while, and as Keras and TensorFlow are started, and 
now are training stuff. 
Here in training, we send the current loss upstream to the queue. 
So we should be able to see it in the dashboard.
Play video starting at :1:9 and follow transcript1:09
And as we can see, you started with an anomaly score of 0.04, and 
are slowly getting down. 
This works as expected, since we're looking at known and healthy data. 
Our training is finished, and 
usually the next windows of data would have been processed. 
But we've limited the tests data generated to 3,000 events, so 
we need to switch to program and reset.
Play video starting at :1:31 and follow transcript1:31
By the higher amplitude we can make out that the pattern of data has changed. 
But the cool thing is that with LSTM autoencoder based neural network, 
we don't need to tell the neural network how healthy and broken data looks like. 
It finds out by itself, a bit scary but totally awesome. 
And again, we need to wait until the window is full. 
So training has just started, and 
we immediately see that an alert has been raised.
Play video starting at :1:57 and follow transcript1:57
This is reflected in the dashboard as well.
Play video starting at :2: and follow transcript2:00
And in addition, we see a clear spike in loss. 
But we notice as well that loss rapidly goes down, 
since the neural network now gets used to broken data, and sooner or later, 
it considers broken data as healthy, as well. 
Therefore, the system has automatically reset itself to the last known healthy 
state by reloading the pre-trend model from disk.
Play video starting at :2:21 and follow transcript2:21
Those warnings are due to the fact that communication with the message queue in 
the Keras callback handler take some time, but we can ignore those. 
So this was an example of an end-to-end scenario. 
If you're happy with the solution, we could easily export the model to 
IBM Watson machine learning, because running everything from a notebook for 
a production scenario is not a good idea.
Play video starting at :2:41 and follow transcript2:41
In a later lecture, we will also learn how to make use of Apache Spark for 
parallel neural network training and scoring. 
So stay tuned.

All tools used and presented in this course are open source or based on open source. Therefore it should be possible to run and pass the assignments outside the IBM Cloud. On Face-2-Face trainings though we've experienced quite some additional work in configuring and maintaining such environments. Therefore we highly recommend to use the IBM Cloud - as it is free, no credit card is needed to use it and you can scale up to 21 CPU cores and 84 GB of main memory. Please understand that in the discussion forum we can only support learners on the IBM Cloud – since we know this environment very well and it is the same for all learners. IBM Cloud is constantly changing and improving. But therefore the UI and setup procedures are constantly changing as well. For that reason we are keeping an updated video ready for you on a weekly basis. These are by purpose hosted outside Coursera since based on their nature we can't always ensure the high quality standards of Coursera. Sometimes they are recorded during periods of business travel. 

This programming assignment consists of three parts which require you to fill in some variables in the code and then submit. 
Note: A former version of the assginment required you to setup a NodeRED based test data generator and stream data via IBM Watson IoT Platform / MQTT to the neural network.
I'm very excited to welcome you to our training session on 
Sequence Prediction with Keras and LSTM. 
First of all, we need, of course, data.
Play video starting at ::13 and follow transcript0:13
And I have decided that we can go to FRED, 
the Federal Reserve Bank on St. Louis.
Play video starting at ::23 and follow transcript0:23
And download here Crude Oil Prices for 
oil Brent in Europe. 
This data is measured since 1987, 
and we can download the entire data set,
Play video starting at ::44 and follow transcript0:44
what I have already done. 
We can have a first look at this data. 
We see that it has several peaks, 
one was in September 1990. 
And the biggest one in June 2008, 
then it went down, then it went up, 
and so on, and so on, and so on. 
This data, we can download by clicking on this button, 
and then we can import it into our development environment. 
And here, we have to upload the data into the Cloud Object Storage.
Play video starting at :1:37 and follow transcript1:37
We are using Cloud Object Storage to ensure 
that the data is located in merged environment.
Play video starting at :1:48 and follow transcript1:48
As you'll see, I have already uploaded the data. 
This is very simple. 
You'll click on browse here. 
And it will be uploaded, and then you click on Insert to Code. 
It will insert some lines of code and in the end, if you execute this code, 
you are getting the data as a Pandas DataFrame. 
You see it has two columns, 
it has 7,993 rows. 
And here, I have to mention the data source, 
that the data is coming from US Energy Information Administration. 
In the first step, we have to do some minor data pre-processing.
Play video starting at :2:37 and follow transcript2:37
And this is because the data contains some innate rules. 
And in the CC file, every time where the innate 
is in the data, they have put a point in it, 
so we have to remove all those points.
Play video starting at :3: and follow transcript3:00
What we can do now, and we can see, if you bring the shape, 
we can see that around 200 rows were removed. 
First, what we can do, we can print out, we can visualize our data.
Play video starting at :3:21 and follow transcript3:21
And this is done by calling plot function.
Play video starting at :3:28 and follow transcript3:28
And here, we can see similar representation of the data, 
as we've already seen on the home page of FRED, 
a Federal Reserve Bank, all those peaks and so on, and so on. 
And yes, with this we are finishing the data upload part. 
And we are looking forward to see you in the next section. 
Stay tuned, and enjoy sequence prediction, bye-bye.

## CNN Classifier Project
You may choose to discuss whichever model you decided to deploy to your embedded system (e.g. simple CNN or CNN trained through transfer learning).
•	Did your model work better or worse with data augmentation?
•	What types of images did you attempt to classify with transfer learning?
•	How well did your model perform on the validation and test sets?
•	How well did the model perform on live inference?
•	You may also optionally post any photos of the setup you wish to share!
Up until now, we've only been talking about image classification where we want 
to assign a single label to the main content of an image. 
But what happens if we have multiple objects of interest in one image. 
For example, let's say we trained a classifier to 
identify an image as either a cat or a dog. 
Then I feed this to the model. 
Is this a picture of a cat or a dog? 
The model is incapable of handling two objects so it will focus on the most 
salient features based on how it's been trained to make a decision. 
And it will classify this image which contains both a cat and 
a dog as either cat or dog, but not both. 
Remember the model we trained earlier to identify electronic components. 
I gave it these two images as test data to see what it would do. 
Both images contain a resistor and a capacitor. 
In the first image, it still successfully classified the resistor but 
it was not very sure of that choice. 
Even though the resistor was in the position, it might expect to find 
resistors right in the center of the frame with the leads going out to either side. 
When I moved the camera down slightly, the resistor was no longer in that spot. 
In this case the model focused on the capacitor this time and 
classified the image as such. 
Note that it won't always be what's in the center of the image, 
what the model focuses on has to do with how you train the model. 
If the resistors were always in the top left of the frame for the training data, 
then it might focus their when trying to classify something as a resistor. 
But for this model it happens to focus on what's near the center as that's how all 
the images in the training set were structured. 
What we've been doing up until now is known as image classification. 
Image classification is where we train a model to 
predict the class of an entire image. 
This works best when there is a single loan subject of one of our classes. 
When you introduce multiple objects in the image, 
the classification methods start to break down image classification also 
can't tell you exactly where in the image the object is located. 
So it's best if all the objects are framed in a similar manner for 
all the images you're working with. 
However, we can use other methods to find the location of objects. 
This is known as object localization. 
Note that with localization we don't care about the type of object that's being 
detected. 
We just want to know where some kind of object of interest is located. 
That's where we introduce object detection with object detection, 
we want to both locate something in the frame as well as classify it. 
Object detection can tell us where multiple instances of an object 
are in an image as well as discern among several classes of objects. 
Most object detection methods will attempt to both classify objects in an image and 
give us an idea of their location and relative size in the frame. 
They will do this by using what's known as a bounding box. 
A bounding box is a set of coordinates in the image that could be an X, Y. 
Width and height or the location of two opposite corners. 
The idea is that you could use this information to draw an overlay on 
the original image to get an idea where those objects are located in the frame. 
Along with the bounding box, you should also get a predicted class and 
confidence level of that class. 
For example, a well trained object detection model might give us a bounding 
box for the cat with a predicted confidence of 98%. 
As well as a bounding box for 
the dog with a predicted confidence of the dog class of 83%. 
With a little bit of math, you can find the center of these bounding boxes which 
should give you an idea where the middle of the object is. 
This is great if you say want to have a self driving car, avoid these objects or 
maybe want to have a pan tilt camera automatically follow them by moving so 
that the center of the frame lines up with the center of one of the objects. 
One easy way to do object detection is to do a simple sliding window with 
inference performed for each window. 
Let's say, I'm trying to determine the location of my dog in this frame. 
This works best if the window is the same size as your training set resolution, 
we then feed whatever is under this window to our image classification model and 
it performs inference to give us a prediction. 
Here it might predict background as the class. 
Then we slide the window over and 
send the new set of pixels under the window to the model. 
Note that I have a little bit of overlap between the two windows. 
You can adjust the amount of overlap by setting a stride value usually measured 
in number of pixels. 
If your stride is larger than your window width, you won't have any overlap. 
If it's less you'll have some overlap. 
Note that more overlap means a better chance of detecting your object, but 
it means that you'll need to perform more inferences per still image, 
which could really slow down your frame rate. 
We continue this process until the window highlights my dog here. 
Here the model should predict dog as the class. 
From here we could draw a bounding box, 
the size of the window to denote that a dog was detected. 
However, let's say that we keep sliding the window and 
marking the locations of dog detections. 
We might end up with something like this. 
There were three true positives that detected areas of my dog, 
one false positive as the model thought my hanging cables were a dog and 
one false negative where my dogs back legs were missed. 
This might be good enough for 
your project as you can still successfully see that a dog was present in the image. 
Knowing that the cables caused a false positive hit, 
we might go back with extra training data to get a better model. 
Let's say we do that and end up with something like this. 
We still see three instances of the dog class. 
However, with some math, we can determine that bounding boxes of the same class 
are overlapping and assume that it must be one large object. 
We can then find the extent of these bounding boxes to create a larger, 
more inclusive bounding box. 
You could average the individual classifications together to get some kind 
of total confidence score for the detected object. 
However, this might not work if you had multiple instances of dog next to each 
other. 
Note that this method might think overlapping objects are a single object. 
As you can see this is a basic approach to doing object detection, but 
it has a lot of problems. 
Will examine more advanced techniques for handling overlapping objects later. 
Something else you can do is resize the image 
under the window to the expected resolution of the models input. 
For example, maybe we take a bigger section that's 96 by 96 resize it to 
48 by 48 before performing inference, this would keep inference fast and 
we could scan larger sections of the image. 
It also might allow us to look for dog objects that are closer to the camera. 
However, it might miss dog objects that are farther away as the model was not 
trained on images with this scale. 
While it might allow us to highlight the dog without computing the extent of 
several bounding boxes. 
We run the risk of missing the dog altogether if the stride is too high, 
the dog is too far away or if the model is not picking out the correct features due 
to the re sizing operation. 
Even with a small overlap as shown here, 
we'd still need to perform 48 inferences per still image and I'd still need 
to drop the right and bottom edges without doing some sort of padding or 
less than normal stride amount for the final windows along the edges. 
Let's say we have a fairly small model that can perform inference in 
100 milliseconds and we don't need to resize our windows. 
We would still need to spend around 4.8 seconds per image to locate our object. 
That would give us about 0.208 frames per second if we wanted to do this on 
a live video feed. 
Your project might be fine with such a low frame rate, such as if you just 
needed to check for the presence of a person or animal every minute or so. 
However, this assumes that we can find what we're looking for 
in the frame using a square bounding box. 
Let's say we're working on a self driving car that needs to watch for 
pedestrians and other cars. 
These square search boxes would only be capable of covering a portion of 
the object in question. 
Even if the classifier identified the piece as belonging to a person or car, 
we'd still have to combine them in some way to denote something 
as just one object rather than several. 
That can be a tricky problem. 
One option is to scan the image with a variety of window sizes ranging from 
small to large. 
While this might help spot single instances better you run into two issues, 
it's hard to determine the exact location of tall or flat objects. 
For example, is the identified person in the middle, left or 
right of the bounding box shown. 
You would not be able to tell if you just got the coordinates and 
size of the bounding box. 
It also means many more inferences need to be 
performed to scan the image with different window sizes. 
Taking this a step further, 
what if we scanned the image with a variety of bounding box, shapes and sizes? 
I've removed the potential boxes for the car as it only cluttered the image. 
Here each location or 
every few pixels could be scanned with six or more differently shaped windows. 
The area under the window could be resized and skewed as necessary to match the input 
dimensions of the convolutional neural network. 
The network would be trained to know how to work with scaled images like this 
from here. 
We could select the bounding box with the highest score for that object, ideally we 
would end up with something like this to predict the location of our pedestrian. 
However, as you probably guessed, this is incredibly time intensive for 
both training and inference. 
Instead of sliding many windows over the same image, 
forcing us to perform inference hundreds or even thousands of times. 
What if we used another algorithm to propose possible windows for us? 
This might be a clustering algorithm that groups similar pixels together or 
even another convolutional neural network that picks out 
what it thinks are interesting areas in the input image. 
Note that they may or may not contain objects. 
They're just proposed windows that contain something interesting. 
We then send these windows to our convolutional neural network for 
classification. 
This process of identifying potential regions of interest is known as region 
proposal. 
Using a convolutional neural network to perform region proposal is known as 
a region proposal network. 
These areas are known as regions of interest. 
Hopefully our classifier, which we trained to identify vehicles and 
people would be able to correctly identify the cars and pedestrians in this image. 
Would combine or pick the best window for our bounding boxes and 
end up with something like this. 
This saves us a lot of time over the sliding window method to identify 
objects in an image. 
Region proposal as a way to feed an image classifier forms the basis for the region 
based convolutional neural network model, which will look at in a future lecture. 
Even though object detection methods may have many different parts, 
we can treat them as a single machine learning model. 
You can often train some or all of its parts during the training step. 
The ultimate goal of these object detection models is to locate and 
classify all of the objects of interest in the photo. 
For example, let's say I train this model to identify my dog, his tug toy and 
his ball. 
The output of such a model will often look something like this. 
You'll get a list of zero or more objects where each object contains the predicted 
label and information about the bounding box. 
If you're working with python, this might be a list of python objects you'll have to 
loop through or it could be something like a Jason string. 
The x, y location, it might be the center of the box or it could be the top left. 
You'll have to read the documentation for 
whichever model you're using to determine that. 
The class prediction will likely come with a probability score from the softmax 
output. 
With this information we can then draw the predicted boxes on the image. 
We could also use the x and y information to perform some task like shine a laser 
for a cat to play with or move a camera on a servo to face a particular object. 
The width and height of the bounding box might also provide some 
insights into how close an object is to the camera, 
assuming we have some information about the size of that object. 
There are a number of popular models that perform object detection like this. 
These models can be very complicated, and so 
the inner working details of them are outside the scope of this course. 
However, in a future lecture will briefly go over a few of the models and 
demonstrate how to use one of them in edge impulse. 
I will also make sure to list some articles in the recommended reading 
sections if you'd like to dig more into these concepts and models. 

## Project - Sliding Window Object Detection
Sliding window object detection is a straightforward approach to finding objects in a larger image. You take a windowed section of an image and perform inference on it with your trained model.
Description
Your job in this project is to create a live sliding window object detection system on one of the embedded systems. You only need to identify objects of one of your classes--no need to create different bounding boxes for all of your classes (e.g. I’m just trying to identify the “dog” class in my example, even if my model was trained to identify “background,” “cat,” “plant,” etc.).
For each frame captured (this can be a low resolution, like 160x120), crop out a windowed sub-image that is the same size as your CNN input (e.g. 48x48). Perform inference on that windowed section. Slide the window over by some amount (e.g. 24 pixels) and repeat inference. 
 
Keep doing this until you have performed inference on all available windowed sub-images in the captured frame. You may skip the right and bottom edges if a full window does not fit over them (e.g. valid padding).
 
For each windowed sub-image, compare the output probability of the target class to some threshold (e.g. 0.6). If the probability is over that amount, draw a white rectangle over that section in the original image. Display the original frame and your bounding boxes in the preview window.
 
Required Hardware
You will need one of the following hardware setups for this section:
•	Raspberry Pi 4, SD card, Pi Camera
•	OpenMV Cam H7 or H7 Plus (with SD card)
Note: You are welcome to try other embedded systems not listed here. However, I cannot promise that they will work and I likely will not be able to help you troubleshoot any issues you may come across.
Option 1: Raspberry Pi
Create a folder to hold your program and model file:
mkdir -p Projects/sliding-window-object-detection
cd Projects/sliding-window-object-detection
Download the model file for your project:
edge-impulse-linux-runner --clean --download modelfile.eim
When asked, choose the project you wish to download the model from. This can be your CNN trained with augmented data or the MobileNet classifier you trained using transfer learning.
Pick only one class that you would like to search for (e.g. I chose “dog” for my example).
Starting with this code, change the settings to match your project (e.g. the target_label). You might also need to adjust the sensor width and height, the window width and height (the window width and height should match the width and height of the model’s input dimensions), and the stride.
You will need to slide a window over each captured image. Extract features from the windowed sub-image, which will convert it to grayscale and crop/scale as necessary:
Perform inference on each windowed section:
The number of windows in the x direction is stored in num_horizontal_windows and the number of windows in the y direction is stored in num_vertical_windows. You will likely need a couple of for loops using this information to accomplish this project.
Each time the output probability from inference on the window matches or exceeds target_threshold, you should print out the x, y, width, and height of that window. This information is known as a “bounding box.” You may end up with multiple boxes per image.
Draw a white rectangle on the image for each bounding box hit (output probability meets or exceeds the target threshold). Display the original image and the bounding boxes in the preview window.

Option 2: OpenMV
Choose one of your models from an Edge Impulse project. This can be your CNN trained with augmented data or the MobileNet classifier you trained using transfer learning. Download the model, and save it as trained.tflite on your OpenMV storage.
Pick only one class that you would like to search for (e.g. I chose “dog” for my example).
Starting with this code, change the settings to match your project (e.g. the target_label). You might also need to adjust the sensor width and height, the window width and height (the window width and height should match the width and height of the model’s input dimensions), and the stride.
You will need to slide a window over each captured image. Perform inference (e.g. tf.classify(model_file, windowed_img)) on each windowed section. 
The number of windows in the x direction is stored in num_horizontal_windows and the number of windows in the y direction is stored in num_vertical_windows. You will likely need a couple of for loops using this information to accomplish this project.
Each time the output probability from inference on the window matches or exceeds target_threshold, you should print out the x, y, width, and height of that window. This information is known as a “bounding box.” You may end up with multiple boxes per image.
Draw a white rectangle on the image for each bounding box hit (output probability meets or exceeds the target threshold). Display the original image and the bounding boxes in the preview window.

Conclusion
As you probably noticed, the sliding window technique for object detection suffers from a number of issues:
•	Computationally expensive--I got less than 1 fps on the OpenMV for only 20 windows per frame!
•	Does not handle small or large objects in the frame
•	Extra work to combine multiple bounding boxes to identify tall or long objects
In future lectures, we will go over how advanced machine learning models can be used to solve some of these issues.

Before we look at any object detection models, 
let's take a moment to go over some performance metrics. 
You'll want to have a basic understanding of these metrics in 
order to evaluate the different models that are out there. 
In the first course we introduced the concept of a confusion matrix and 
how we can use it to calculate a number of metrics. 
The confusion matrix is usually filled out with predicted and 
actual label counts from the validation or test set. 
Here, let's say that I'm only looking at the metrics for the dog class as I'm 
evaluating a model that tries to detect any dog ball or tug toy items in an image. 
If you recall from that first course, we can calculate the precision score for 
one class by dividing the number of true positives by the sum of the true positives 
and false positives. 
With that we're trying to get a measure of the percentage of positive predictions 
that was actually correct. 
Another popular metric is recall where we calculate the proportion of actual 
positives that was identified correctly. 
To get that we divide the number of true positives by the some of the true 
positives and false negatives. 
Something else we covered in that first course was the idea of setting a threshold 
for the output confidence score of a particular class in order to label 
the class positively or negatively. 
You often find a threshold of .5 to start. 
For example, if our model gives us a confidence score of 0.5 or higher for 
the dog class, then we would assign the label dog to that item, 
which would be a single region of interest taken from a larger image. 
In many machine learning systems, you won't often find a clean divide between 
the different labels and you'll end up with some false positives and 
false negatives regardless of what threshold you choose. 
With this threshold we can calculate the precision and recall values using 
the given number of true positive, false positive and false negative results. 
We can then plot that point on a graph of recall versus precision. 
As you move the threshold up, you generally expect the number of true 
positives to decrease as well as the number of false positives. 
However this also means you gain more false negatives. 
We can then calculate a new set of precision and 
recall values with this threshold. 
With a smaller set of false positives, you generally expect precision to be really 
good but your recall will be terrible as you have a lot of false negatives, 
that might give us something like this On the precision recall plot. 
As you lower the threshold, you generally expect the number of true positives to 
increase while decreasing the number of false negatives. 
However, it also means you see more false positives from this. 
We expect the precision to be low but the recall to be high. 
So we might end up with a point like this on the curve. 
We would continue sliding the threshold around between zero and one and 
calculating the various precision and recall scores. 
We plot all of them on a precision recall curve and 
we might get something like this. 
Note that this is not the same as the receiver operating characteristic curve 
that we saw in the first course as that plot uses true positive rate and 
false positive rate instead of precision and recall. 
After plotting all of the precision and recall points, 
we would end up with a curve like this. 
From here we can calculate the average precision for our dog class. 
The formal definition of average precision is the area under our precision recall 
curve which requires computing the integral of the plot. 
This would give us a number between zero and one and the closer to one the better 
the model is at predicting the dog class with few false positives and 
few false negatives. 
However, because this line was measured, it's likely impossible or 
nearly impossible to get a good mathematical equation to describe it and 
even more unlikely that you'd be able to compute the integral. 
As a result, a few papers describe methods to sample points along the curve and 
compute the average to get an approximation of the average precision. 
One popular method is detailed in the paschal visual object classes challenge 
paper published in 2009. 
Here, 11 precision values are sampled that are taken from equally spaced 
points along the recall access. 
These precision scores are added together and 
divided by 11 to give us an approximation of the average precision score. 
Once again, this is a number between zero and 
one where one indicates a model with perfect or near perfect classification. 
You might see other ways used to approximate the average precision score, 
but this 11 point seems to be one of the most popular. 
Remember that this is only for our dog class. 
We would need to compute the average precision scores for 
the other classes as well. 
If we're trying to judge the total effectiveness of the model, 
we'll go over a way to combine those scores. 
But let's talk about how this works with object detection first as we've only been 
talking about classification problems so far. 
When creating your training validation and test data sets, 
you will need to hand draw bounding boxes and label each of the associated objects. 
Let's say that this is one such example in my test set and 
I've drawn the bounding box is shown here. 
Bounding box information is usually not stored in the image itself. 
Rather you'll often find a separate file that contains information about 
the bounding boxes. 
There could be one text file per image or one large file that associates 
image files with their respective bounding boxes and object labels. 
We run this image without the ground truth bounding box information through our 
object detection model and the model gives us these boxes and labels as predictions. 
An object detectors classifier will have a non object class that it is capable of 
predicting which I'll call background. 
You will generally not get background class predictions as output from the model 
as this would be every region of interest that was not classified as containing one 
of our desired labels. 
And as a result, would have a large imbalance in labels favoring 
the background class, which might influence our recall and precision scores. 
Rather, we can check label accuracy by comparing the predicted classes to 
the ones noted in the ground truth bounding boxes. 
That provides us with a simple accuracy score much like we've seen for other 
classification problems, and we don't need to take the background class into account. 
For example, as the ball object was not detected, that would be a false negative. 
The toy box is obviously not covering enough of the toy object. 
Measuring how closely the predicted bounding boxes match our ground truth 
boxes requires a little math. 
One popular way to measure the success of bounding box predictions is to 
use the intersection over union or IoU formula. 
The intersection is a measure of the area that the predicted bounding box overlaps 
with the ground truth box. 
This could be the number of pixels within height and inches and so on. 
The area doesn't have to be rectangular if you're working with a nonstandard bounding 
box shape. 
The union is the area under both boxes if you combine them into one shape, 
you compute the IoU by dividing the area of the overlap by the total area of 
the union. 
This is a ratio and gives you a score between zero and one, 
zero means that there is no overlap. 
And that the predicted box isn't touching any area of the ground truth box. 
One means that the predicted box perfectly overlaps the ground truth box. 
From here, you can set a threshold as to whether you count the predicted box 
as being good enough to detect the intended object. 
A common IoU threshold for models is 0.5 but you may also see 0.7 sometimes so 
an object is considered detected if there is an IoU score at or above 0.5 
when comparing the predicted bounding box to the ground truth bounding box. 
Trying to get an IoU score of one for every predicted bounding box in an image 
is a goal of many object detection models and 
may even be used as a loss function during training for some models. 
Let's go back to our image example, 
here the detected dog object might have an IoU of something like .9. 
If we set the IoU threshold to 0.5, 
this would be considered a true positive detection and classification. 
However, let's say our toy box has an IoU of only 0.2 as a result, 
this prediction would be considered a false positive. 
Because of the ball object was not detected at all but 
there was a ground truth label for it, this would be a false negative. 
True negatives are all the parts of the image where we detected no objects and 
there were no ground truth objects listed. 
Because there are a lot of true negatives in most object detection systems we 
generally ignore this metric while focusing on precision and recall. 
Let's go back to our precision recall curve. 
We would only count a detection as a true positive if the IoU value between 
that prediction and the ground truth label was at or over 0.5. 
Because this IoU threshold would affect our precision and 
recall values We can keep it constant for now. 
Remember that this is a plot from the perspective of just one of our classes. 
We would need to generate precision recall curves for 
all the other classes that our model is capable of identifying. 
In this case we would create a precision recall curve for the ball as well as for 
the tug toy and they might look something like this. 
From there we can introduce the concept of the mean average precision score or mAP. 
All we need to do is calculate the average precision scores for 
each of the classes and then average them together. 
Note once again that this is still assuming our IoU threshold is 0.5. 
Using a single IOU threshold might give you a good enough mAP score. 
The pascal VOC 2007 challenge used an mAP value with a single 
IoU threshold of 0.5 for its scoring system. 
Often you will see mAP scores reported with 
a single IoU threshold of 0.5 like this. 
Let's go back to looking at the precision recall curve of just the dog class for 
a moment. 
If we vary the iou threshold then we would expect the curve to change. 
Generally as you increase the IoU threshold, the model has to be 
more precise in its bounding box location to be counted as a true positive. 
As a result you would expect the average precision or 
the area under the curve to decrease. 
You might end up with something like this for the dog class. 
The cocoa 2017 challenge introduced a new scoring metric where the mean 
average precision was now an average of all the mean average precision values 
over 10 different IoU thresholds ranging from 0.5 to 0.95. 
In other words, all average precision scores from all classes and 
a set of IoU thresholds were averaged together. 
This method rewarded models that were better at precise 
localization of the objects in the images. 
However you'll still often find that many mAP 
scores assume an IoU threshold of 0.5 as shown in the top equation. 
I hope that this has helped you get an understanding of how the mean average 
precision is calculated as it's one of the most popular scoring methods used when 
comparing object detection models. 
Previously we looked at a brute force method of locating objects in 
an image by sliding a window across the image and 
then performing inference on each of the sub images underneath that window. 
Now let's look at a few models that attempt to do all of that 
much more quickly and efficiently. 
We saw that instead of sliding differently shaped windows across an image, we could 
use some other algorithm to propose various regions of interest in that image. 
This forms the basis for the region based convolutional neural network or 
our CNN first proposed by UC Berkeley researchers in 2014, 
here is the original our CNN paper. 
I'll make sure to include a link to it in the recommended reading section if you'd 
like to dig into it. 
But it can be fairly complex. 
The first figure does a great job at explaining how the proposed our CNN model 
works. 
It combines several methods to produce this object detection structure. 
It first proposes around 2000 regions of interest in an image. 
These regions are warped or 
scaled to match the expected input of the convolutional neural network and 
then inferences performed on all of these proposed regions. 
The region proposal method is not specified in the paper, but 
they offer several ideas on what you might use, such as selective search or 
constrained parametric min cuts. 
We won't get into these methods but know that our CNN was a very important 
breakthrough in solving object detection problems. 
The other interesting part of our CNN was the use of a regression model to better 
predict bounding boxes. 
It would learn how much to vary the x, y coordinates and width and 
height measurements of the box based on the supplied proposed region and 
ground truth labels. 
My dog enjoys playing tug and fetch and he will sometimes bring me either his toy or 
ball depending on which one he wants to play with. 
Let's say I want to train a model that identifies my dog, his tug toy and 
his favorite fetch ball. 
This is a slightly silly way to test object detection, but it's a fun way to 
see how well I can pick out three different objects on an embedded system. 
Note that going forward and at least for the time being, 
you will need to use color images. 
Many of the popular object detection models rely on color information to help 
determine regions of interest or to make object classifications. 
You can technically use grayscale images but you'll need to find a way to copy 
the grayscale array to each red, green and blue arrays. 
A more useful application might be something like this. 
Edge impulse has been working with researchers to track elephant movements 
in the wild part of this also includes watching for 
poachers to help protect the animals. 
Cameras could be set up and alert authorities whenever an elephant and human 
are in close proximity embedded object detection would allow us to do this. 
You could also set something up to notify you if there's a person to close to 
a package on your porch. 
This might alert you of a new delivery or a possible theft but 
back to our CNN example here is the basic structure of the model. 
A region proposal algorithm finds around 2000 sub image portions that are fed 
to a convolutional neural network. 
Note that the CNN used here is just for feature extraction. 
In the original R-CNN paper, the features are fed to a support vector machine or 
SVM instead of a densely connected neural network for classification. 
The output of this part of the model is a prediction of the class of that 
particular region of interest. 
Note that the region could contain just background so an extra classes added to 
denote that the region does not contain one of our desired classes. 
The location information of the regions along with the feature maps and 
class predictions are fed to a regression model that tells 
us how we should update the location and shape of the bounding boxes. 
DX and DY give us how far horizontally and 
vertically we need to move the bounding box. 
DW and DH are how much we should adjust the bounding boxes with and 
height from the original region of interest proposal. 
This information should ideally give us a much better idea of where to draw 
the bounding box rather than relying solely on the original region of interest 
location. 
The biggest problem with this architecture is that once you've identified all 
possible regions of interest, you still need to perform feature extraction and 
classification on each region. 
This could mean up to 2000 runs of the right side of the model shown here. 
In 2015, a Microsoft Researcher proposed a modified 
version of R-CNN called Fast R-CNN here. 
The features were extracted separately using a convolutional neural network. 
This would create a feature map like we've seen previously. 
Region proposal would still be done on the same input image using a method like 
selective search. 
However, the region information is used to extract sections of the feature map 
rather than the original image. 
A method called region of interest pooling takes this information combines features 
as necessary and generates smaller feature maps with all the same size. 
These new feature sets are passed through a few fully connected or dense neural 
network layers and then flattened to create one dimensional vectors. 
This information is sent to two different dense neural networks. 
The first provides the classification of the pooled feature subsection and 
the second performs regression to provide a better location and 
shape of the bounding box. 
The information we get out of this model is the same as R-CNN but 
it's much faster as we no longer need to compute new feature maps with 
convolutional layers for each proposed region. 
A year later researchers proposed the faster R-CNN model instead of 
performing region proposal on the input image. 
The model does region proposal on the feature map. 
Instead, they introduced the concept of a region proposal network which is 
a neural network that can output a series of boxes 
containing regions of interest along with their object this score. 
The score is an indication of how much the network believes there is an object 
in the proposed region. 
These regions are used by the ROI pooling layer to create 
a series of feature maps with the same size just like before. 
The rest of the model works just like fast R-CNN to output object classifications and 
bounding box locations. 
Note that the convolutional neural network that acts as a feature extractor could be 
the feature extractor portion from another popular classifier architecture like 
VGG19 or MobileNet. 
We can even use pretrained weights from those networks to allow us to do 
transfer learning on the rest of the model. 
The big issue is that even though this architecture is faster by doing region 
proposals on the feature map instead of the images, 
we still need to perform full classification on each proposed region. 
In 2016 the same year that faster R-CNN was proposed. 
A different group of researchers put forth a new type of architecture that didn't 
rely on region proposals or multiple classification steps on those regions. 
In essence, they used the location information embedded in varying 
sized feature maps to locate objects in the input image. 
They called this approach the single shot multi box detector or SSD for 
short, a known good convolutional neural network like VGG 19 or 
MobileNet would be the feature extractor for this model. 
Like with faster R-CNN, you would need to drop the classifier portion of the network 
and just keep the convolutional layers which ultimately produce a series 
of feature maps with the most salient features highlighted. 
This feature extractor would form what's known as the backbone of the model. 
More convolution layers follow this backbone that continue to reduce the size 
of the output feature map each of these feature maps works as a sort of 
localization grid. 
Each cell is used as a center point for multiple anchor boxes with different 
aspect ratios that are used to match with objects in the feature map. 
As the feature maps gets smaller the grid cells and 
corresponding anchor boxes increase in size relative to the input image. 
This helps the model identify objects of varying shapes and sizes in the frame. 
Each output series of feature maps from the convolution layers shown in blue along 
with the feature maps generated by the backbone CNN are fed to 
convolutional classifiers known as heads shown in green on the right side. 
These heads produce a series of classifications for 
each of the default anchor boxes. 
Each head also contains a regression model used to update the X, Y coordinates and 
width and height of the default anchor box to create a bounding box. 
All these predictions go through a non maximum suppression step, 
which we'll talk about in a second. 
The output is a list of bounding box objects along with their associated class 
predictions. 
Without the region proposal step used in the previous models, 
the SSD loses some of its accuracy. 
However, the SSD architecture is extremely fast as it can produce a series 
of object predictions with a single forward pass, 
which is where it gets the name single shot. 
SSDs are well suited for mobile and embedded applications because of their 
speed and computational efficiency, even if it means they're not quite as accurate 
because SSD outputs many different classifications in a single pass. 
We need a way to deal with potential overlapping bounding boxes. 
That's where non maximum suppression comes in. 
Let's say we have a series of bounding boxes for the dog label like this. 
Any classifications below a threshold have already been filtered out. 
So we're only dealing with boxes that have over say a 0.5 confidence score. 
Even then we still have several boxes that overlap on the same object. 
To start, we select the predicted bounding box with the highest 
confidence score which is 0.92 in this example. 
We then calculate the intersection over union score for 
each box that intersects with the highest confidence box. 
Finally, any box with an IoU score over some threshold is removed. 
You'll often find an IoU threshold of around six for non max suppression. 
What we're left with should be the best bounding box for the object in question. 
This is repeated for all classes and all non intersecting boxes of the same class. 
The idea is that we should ultimately get an output of bounding boxes and 
confidence scores that uniquely identify the various objects in the input image. 
The hope is that objects can be of any shape or size and 
any distance from the camera. 
However, SSD struggles with very small objects in the frame as there 
are simply not enough pixels to sufficiently described that object. 
Also default anchor boxes may not be small enough to catch some objects. 
The only way to deal with that right now is to increase the size of the input 
image and retrain your model to work with higher resolution images. 
Also note that many of these object detection models rely on 
pretrained backbone networks like MobileNet. 
These networks were trained on a known good image set like Image Net or COCO. 
However, if you're trying to identify objects that were not represented in 
the data set used for pre-training, you might run into problems as 
the convolutional filters were never trained to look for 
your particular custom object. 
For example, I don't think my specific dog tug toy is in the data set. 
So I expect an SSD model pre trained on Image Net or 
COCO to struggle finding this object if I only rely on transfer learning. 
The ball might also prove difficult as it's rather small with such a low 
resolution here is a pre-trained SSD model with a MobileNetV1 backbone. 
The MobileNet feature extractor can be seen at the top of the model in neutron. 
You can see where the output of this series of convolution layers goes to 
a pair of convolution heads that give us a set of object classifications and 
bounding box locations. 
The SSD convolution layers also feed into similar pairs of convolution heads 
that give the same kind of object detection predictions. 
All of these predictions are fed into a post processing step that performs non 
maximum suppression and 
combines the predictions into a list of object labels and box locations. 
The following paper from 2017 gives an overview of various object detection 
models at the time. 
Figure 2 gives a great depiction of how well the various detectors performed 
using mean average precision as the scoring system. 
We looked at faster R-CNN and SSD and you can see various versions of those 
detectors with different feature extractor backbones. 
SSD with a MobileNetV1 backbone is not very accurate compared to some of 
the others but it's one of the fastest which makes it popular for 
embedded systems. 
There are other models out there that we didn't cover like R-FCN, which stands for 
region based fully connected network. 
Note that the models here were trained on the COCO image data set with non 
maximum suppression IoU threshold set 2.6. 
The mean average precision score is taken as an average over Iou 
thresholds ranging from 0.5 to 0.95. 
We can make improvements to SSD by using more advanced backbones like MobileNetV2. 
Additionally, we can introduce other concepts like feature pyramid 
network or FPN. 
The blue squares in these images are the feature maps that are generated from one 
or more convolutional layers. 
The top left shows how we can resize images, 
run them through a series of convolutional filters and then use those feature maps 
to make predictions about the contents of the images. 
We see something similar with b, in the top right. 
However, here we use convolutional layers that reduce the dimensions of each 
subsequent feature map and we only make our predictions with that final layer. 
This is how our traditional convolutional neural network classifier works. 
We can combine this with a region proposal system to get an object detector just 
like we saw with R-CNN. 
Next we have the parameter feature hierarchy which predicts features based on 
all of the intermediate feature maps. 
This is how our SSD model works. 
The paper proposes the image given by (d) which is the feature pyramid network. 
In this a new series of convolutional layers are introduced that actually up 
sample the smaller feature maps. 
Feature maps from the first set of layers are combined with the upsampled feature 
maps before being sent to the classifier heads. 
This has the effect of expanding salient features from the filtered feature maps 
into higher resolution maps with the intent of raising average precision and 
small object detection. 
Here is a look at a single shot detector, 
I trained to look for instances of dog ball and tug toy. 
The top part is the MobileNetV2 backbone which contains a series of 
residual blocks. 
Some of the feature maps from these residual networks are copied to sections 
below, which make up the feature pyramid network. 
Here you can see that 10 by 10 maps from this convolution layer are then 
continuously upsampled using more convolution. 
Feature maps from earlier in the network are added to these upsampled feature maps 
which go all the way up to 40 x 40. 
These combined feature maps are sent to prediction heads and 
the smaller feature maps are also sent to prediction heads. 
The output of these heads are combined to give us a list of detected objects and 
their bounding boxes after a round of non maximum suppression. 
YOLO or you only look once is another popular model that we didn't cover. 
This paper from 2020 covers version four of the YOLO model and 
introduces some advanced techniques that make it faster and more accurate. 
At the bottom of the paper, 
you can see a chart comparing YOLO to other models including SSD. 
You can see how it compares to SSD and other object detection models. 
Note that the size and speed of the model will depend on the backbone used 
as well as other features and layers in the model. 
Both SSD and YOLO are popular for their small size and speed and 
they are getting more accurate with each revision. 
Don't worry if much of this seems really confusing. 
This was supposed to be a very broad overview of the last few years 
of research into object detection models and 
I left out a few concepts that go into creating these models. 
If you'd like to dig deeper into them, I'll make sure to link a few articles and 
original papers in the recommended reading section. 
let's train our own in edge impulse. 
Object detection models take up a lot more space and 
require a lot more computation power than simple image classification networks. 
As a result, you need a more powerful board when you go to deploy the model, 
I recommend the Raspberry Pi 4 with the Pi camera for the project in this module. 
TensorFlow lite for microcontrollers does not support some of 
the operations required at the time this course was released. 
So while the open MVH-7+ might have enough memory and processing power to 
run an object detection model, it simply won't work for this project right now. 
If and when support for object detection models is added to open MV, 
I will update the materials in the course to include it. 
To collect the data set you want to capture a bunch of images that contain 
the objects you're trying to look for. 
You'll also want to have them in PNG format 
as that's what edge impulse works with. 
In this case, I am trying to train an object detector that looks for 
my dog, his tug toy or a ball as you can see in this image. 
The idea is that every now and then my dog brings me one of these toys, whether it's 
a ball or the tug toy that's the little red figure 8 to me when he wants to play. 
And I want to train a detector that lets me know that the dog is near one of these 
toys and possibly wants to play. 
I don't know how ultimately useful this is, but 
feel free to train your own detector for things like people, 
other animals, cars, fruits, other foods or maybe road signs. 
If you're trying to make something like a self driving car, 
notice that these images are for color, the object detection model that 
we're going to be using expects full color models, not gray scale. 
So, don't try to save on space by converting all your images to gray scale. 
They need to be 320 x 320 pixels as that's what the model expects. 
You can have edge impulse crop or resize your images, but 
I'd like to capture things in the exact format that's expected by the model. 
So everything I've collected is 320 x 320 and 
I have a variety of different rooms, backgrounds, lighting angles, 
the dogs in different positions, the toys and the ball are in different positions. 
These help to train the model to look for the objects in various positions and 
lighting conditions. 
I also try to include a variety of things like, 
just the toy by itself, the toy and the dog, the dog by itself or 
maybe all three in the frame, which you can see in this image here. 
You'll want to have about 50 instances of each object, 
that doesn't mean you'll have a total of 150 images if you have three classes. 
But spread across these images, 
you'll want to have about 50 samples of those objects. 
If you have 50 images and each of those images contain your three objects, 
then you only need 50 images. 
If you're working with the open MV, 
I recommend using the image capture script I gave you earlier in the course, 
here you'll want to keep it as RGB 565, which is color rather than gray scale. 
And you'll want to change the width of the frame and 
height of the frame to 320 x 320. 
Additionally, you need to adjust the frame size as QVGA captures things in 320 x 240, 
and so we're just going to set this to VGA and 
then we're going to crop down to 320 x 320.
Play video starting at :3:43 and follow transcript3:43
When you run this you should see a color image show up in your Viewfinder and after 
about 3 seconds, it will capture an image and stored on the SD card of your open MV. 
From there just take out all of your images and 
store them somewhere on your computer, so that you can upload them to edge impulse. 
Note that if you're capturing from open MV, 
everything is probably going to be in BMP format, so you want to convert it to PNG. 
Finally, if you're doing things with the Raspberry Pi, 
I recommend running the Pi Cam capture script, 
here you'll want to change the image with 2 320 and the image height to 320. 
From there, Just run the script on your Raspberry Pi and 
it should capture color images in 320 x 320 resolution. 
Then just copy those images from your Pi, make sure they contain the objects you 
intend to look for and we're going to upload those to edge impulse. 
Head to edge impulse, log into your account and let's create a new project. 
I'm going to call this one ball-dog-toy-detection. 
Feel free to be a little more creative than me and you're naming schemes, but 
this at least lets me know what kind of things I'm looking for. 
From here, I'm going to exit out of the pop up screen, 
on your dashboard you're going to need to scroll down and head to Project Info. 
There look for labeling method and we're going to change this to bounding 
boxes as that's going to change how we label our items. 
Note that we're going to be labeling or providing ground truth bounding boxes 
in edge impulse rather than using a third party program, or trying to do it 
manually by creating a file with a bunch of labels and bounding box information. 
From here, head to data acquisition, and then click to upload your files. 
You're going to want to leave it as automatically split between training and 
testing and in further label from the file name, this does not really apply 
since we're going to be making bounding boxes on each of the images. 
Select choose files and find where you've created your dataset, 
highlight all of your images. 
I do recommend going through these to make sure that your objects are clearly 
visible, as soon as an object moves out of the frame that might make it a little 
harder to detect. 
Unless that's something you're looking for 
where you're okay detecting part of an object in the image. 
Let's upload these, we'll click to begin upload and 
wait just a second while it uploads all of the images. 
From there, head back into data acquisition and 
you should see a new tab that's appeared called labelling queue, 
click on that and you can start labeling your objects. 
All you'll do is click and drag to create a bounding box and 
you'll enter a label, here I've got a dog. 
And I can move the bounding box around, 
you'll want to make sure that it barely captures your object in its entirety, 
without leaving too much extra space around it. 
And I'm going to move it over and then I'm going to draw another bounding box for 
my next object which I will label as a toy. 
Here I'm going to move the bounding box again, so it's just capturing the object 
and I'm going to adjust its size and I'm going to click save labels. 
In the next screen I'm going to do the same thing, 
I'm going to find where my objects are, and 
I'm going to move my bounding boxes around and do the same thing with dog. 
Resize that one, click save labels and continue. 
You'll want to keep doing this for all of your images and for 
all of the objects in your images. 
This might take some time, so be patient as you go through it. 
When you're done, I recommend going back into training data and 
you can look through all of your images here. 
You can see the labels that are associated with the objects in those images and 
you can see the number of labels for 
objects in those images up in the right corner here. 
You can also do the same with your test data, if you made a mistake and 
you need to relabel something, 
you can click on the drop down menu on that image and select edit labels. 
This will bring you to a new screen where you can move the labels around, 
add new labels, change the label names and so on. 
I don't need to update this, so I'm just going to x out of that. 
You can also get a preview of where all the labels are on your objects, so 
I recommend taking a look at those to make sure they're good, 
if you missed any go back and redo them. 
When we're done, head to impulse design. 
Take a note of the text here, for object detection at least for right now, we must 
use input image sizes of 320 x 320, so I'm going to leave this to be the default. 
If you don't have it already, you want to add an image processing block and 
you may not have anything here, so you'll want to add object detection images. 
Save your impulse and then head over to your feature extraction tab. 
We want to keep the color depth at RGB because the object detection model 
expects color images, it will not work with gray scale at least at this time. 
Save these parameters and then click generate features. 
They should not do a whole lot other than maybe tagging the images and 
creating a file or something that contains all of the bounding box information. 
It additionally compresses the dimensions of the various objects found in 
the images, so we can see them in the feature explorer here. 
You should see some decent separation among the object classes. 
However, I don't see a whole lot between the toy and the ball, which means 
I'm not sure that this model is going to work very well but we will see. 
From there, head to your object detection tab, 
I like to increase the number of cycles, I find that 25 is a little low to work with, 
something in the 50 to 100 range works pretty well for me. 
I'm going to leave the learning rate and the score threshold alone for now, 
I don't have many options at this moment. 
There's only one model that edge impulse supports at this time for 
object detection and that is a single shot multi box detection or 
SSD architecture with a MobileNetV2 backbone used for feature extraction. 
It also contains a feature pyramid network and 
that's all converted to a TensorFlow lite model when it's done being trained. 
In this case we are using transfer learning, so 
it should train a lot faster in a matter of minutes rather than hours or even days. 
Do note that edge impulse says it's able to locate up to 10 objects with an image. 
So there is a limit to the number of objects it can detect at least 
on this version of the model. 
We are going to go ahead and use that model and let's begin our training, so 
just click start training. 
In a previous lecture, I talked about how the backbone for 
this SSD model can be pre-trained on a known good set of images.
Play video starting at :10:25 and follow transcript10:25
In this case the MobileNetV2 backbone is trained on the cocoa set of images. 
If this set does not contain some of the things that I'm looking for, 
like maybe the tug toy is not in that training set. 
The filter parameters may have a hard time picking it out, 
whereas dog is likely one of the images in that training set. 
So the filter parameters kind of know what to look for and there's a lot of 
complexity that goes into animals, they have repeating patterns like fur, 
they have legs, they've got faces, ears, eyes, nose, mouth, maybe a tale. 
There are things that the parameters can look forward to help determine that that 
is a particular animal like dogs. 
So I expect this model is going to pick up 
my dog a lot more easily than it does say the tug toy. 
The ball is an interesting one, simply because the ball is a very simple shape. 
I would figure that some of the simple lines that go into creating the ball at 
least when it comes to creating an image of a ball should be pretty simple and 
straightforward. 
However, it's a weird color which I'm not sure matters in this case, but 
the idea of a basic circle might be actually tough for 
something that's looking for more complex features. 
We'll see when this is done. 
The other thing that concerns me is that at least for 
320 x 320 resolution photos the ball is relatively small. 
I think it's on the order of 10 x 10 or maybe 20 x 20 pixels, 
which going into the network is going to be a relatively low number of features 
coming out on the other side of all those convolution filters. 
However, that Feature Pyramid Network, 
that FPN portion of the SSD should allow the network to work with smaller objects. 
We'll see if it works for the ball object in this case, 
with training done we can look at the models performance. 
Here you can see that it scores about a 73.1% with the validation set and 
that's what the 8-bit quantized model. 
We can also look at the float 32 but you'll notice that it takes a little more 
flash, in this case the precision goes up to about 85%. 
Generally when working with a lot of these embedded systems, 
you'll want to use the quantized model where possible. 
The other thing that I recommend doing is going to model testing and 
then clicking classify all. 
Wait for 
just a moment while it tries to detect all of the objects in the test data set.
Play video starting at :12:47 and follow transcript12:47
As you can see it's only performing at about a 67% which is not great. 
I recommend taking a look through your test data set to see where the model is 
failing. 
In this case, it's looking like it's failing on ball and toy, 
we can select show classification and I'm going to do that for 
a number of these images where it failed to label one of the objects. 
So let's open these up, 
once you have them open you can see where the ground truth labels occur. 
In this case it looks like the model tried to classify the ball with 
an 83% confidence. 
However, it completely missed the toy and that's likely because the toy 
was shown on its side here, and if we keep looking at the images, 
it missed the ball when the ball was next to the dog. 
Once again, it missed the ball when the ball was next to the dog, 
it did successfully classify the toy object here.
Play video starting at :13:41 and follow transcript13:41
It's looking like the ball is not classified whenever the ball is next to my 
dogs light colored fur, and that could just be because the feature extractor 
isn't seeing any edges or it's considering it to be part of the dog. 
I'm not exactly sure what's going on, but this does appear to be a common problem. 
And here it just completely misses the toy even though the toys in plain view, 
it misses the toy but the ball was found, the ball wasn't near his light colored 
fur but it was on this light colored piece of rug, but it did miss the toy. 
That's interesting, once again it missed the toy and 
this might be too small of an object because it's so far away. 
It looks like the model is struggling to find the toy. 
And once again the toys missed but it did double classify the dog, 
which is not great. 
It looks like the non maximum suppression failed here as it considered 
this 0.54 object to be the dog, even though that is not correct. 
From what I could tell it struggles with the toy object and it struggles sometimes 
when the ball is next to light colored objects like my dogs fur. 
The toy object I'm not particularly surprised about, simply because I don't 
think it's something that was in the original training set. 
So I'm asking the model to try to identify the toy using filters that weren't 
trained on something that looks like this toy. 
That is something important I recommend keeping in mind, that you should 
probably stick to objects that were part of that original training set and 
using those as a starting point for what kind of objects you should be looking for. 
You can get away with training new objects that aren't in those original 
classifications but the model might struggle to find those. 
When you're happy with the results and in this case I could probably go back and 
train a bunch of images with the ball next to my dog in various positions, 
in order to try to get the object detection to find the ball in those cases. 
And an option for the toy, I have to use a different toy, one that looks like 
an object that was part of the original training set classification scheme. 
Or I might have to train that backbone from scratch which at this time 
edge impulse does not support, so that would be a little more involved. 
when you're done, head down to deployment and 
you can download a library containing all of the feature extraction and 
model information ready to be deployed to your system of choice. 
And in this case remember that object detection takes a lot more RAM and 
flash memory, so you will want to use a very powerful microcontroller or 
something like a single board computer.

## Digging Deeper into Object Detection
Now that we have a trained object detection model, let's download 
it onto our Raspberry Pi and then use it for live inference to start. 
We want to make a new project folder, I'm going to call mine ball 
dog toy protection, and I'm going to go into that directory. 
Once again, I'm going to run the edge impulse Lennox runner with the clean 
parameter that allows me to select a new projec. 
And then tell it to download the model file from my object detection project.
Play video starting at ::39 and follow transcript0:39
Sign in with your username and your password. 
You'll want to select your object detection project, and 
then the tool should automatically download your model file. 
From there, 
you can check that the file does appear in your Linux file system to start. 
We're going to import some python packages and modules as we've done before. 
We're going to need OS system time open C V, the two pi camera libraries as well as 
the edge impulse runner notice here that we're using specifically the Image runner. 
Which is different from what we saw in the dense neural network, 
as this one allows us to extract features from images from there. 
We're going to need to load in our model file that we just downloaded. 
And we're going to need to set the resolution width and height, 
which I will keep to 320x320, which is what the model expects. 
As we've done before we're going to load the model file.
Play video starting at :1:34 and follow transcript1:34
And then we're going to initialize it, 
if this initialization process causes any problems, 
will throw an exception printed out to the console and exit the program. 
Because object detection is normally very computational e expensive. 
I'm going to keep track of the frames per second and print it to the preview window, 
so we can see how quickly this whole thing is running next. 
I'm going to start the pi camera and as I've mentioned before, 
you're welcome to try this with a webcam. 
But this code is going to be a little different since we're not using the pi 
camera library. 
Webcams and Lennox have traditionally had a somewhat difficult history 
where there weren't a lot of drivers that supported webcams. 
I think that's different in this day and age. 
However, because there are so many webcams out there, I won't be able to help you 
debug whatever particular version of USB web camera that you're using. 
So I recommend using the pi camera. 
If you're taking this course, 
I should be able to help you debug any problems you might have with it. 
From there, we'll configure the camera with 320 x 320 resolution. 
We're going to set up a for loop that essentially runs forever. 
That gives us a new frame in this frame variable each time it runs and 
that's coming directly from the pi camera from there. 
We need to get a time stamp of the current time so 
that we can print out the frames per second onto the preview. 
We're then going to get a NumPy array from this frame variable with that. 
We can convert it into the features that the edge impulse model expects and 
to do that. 
We're going to call the runner object, 
which comes from our model file that we imported with the image impulse runner. 
And we're going to call this get features from image function and 
pass it our NumPy array that represents our image. 
Remember that this should be an array of pixel values where each pixel 
is represented by a blue, green, red value between 0 to 255. 
We should not need to perform any other feature extraction or 
image manipulation sends. 
The image should be in the correct resolution with the correct color and 
BGR values, which means after we call the get features from image. 
They should be converted to whatever the edge impulse model expects in this 
features variable. 
And we're going to pass that to the runner dot classify in order 
to perform classification using our object detection model. 
I always like to print out the raw output from this model in order to see how I can 
look into it or index into it to get the values that I need. 
If you were to print those to the console, 
you should see that we get a list of bounding boxes. 
If there are no bounding boxes, it should just be an empty list. 
And so this four B box in bee boxes just shouldn't run using the information 
from this bounding box list. 
We can extract the X and Y coordinates, as well as the width and height. 
In order to be able to draw, 
our own bounding box on top of the detected object. 
You can also use this information for 
whatever embedded system project you are trying to do, 
like point a laser pointer in front of a cat, throw a ball for a dog. 
Figure out where cars are in an image in order to avoid them, and so on. 
For testing, 
I'm just going to draw a bounding box over the detected object to do that. 
I'm going to use the open CV rectangle function. 
Give it the image where I want to draw that rectangle, and 
then give it the coordinates of the X and Y. 
That were from the original bounding box. 
That should be the top left corner and then the bottom right corner, 
given by X one and Y one. 
Which should just be the X value plus the width and the Y value plus the height.
Play video starting at :5:23 and follow transcript5:23
I'm going to draw these bounding boxes as pure white. 
So I'm going to use 255, 255, 255 but 
feel free to say assign a new color for each unique label or each unique class. 
So that you can have say all dog objects be highlighted in red and 
all ball objects be highlighted in green. 
But I will leave that up to you to try. 
I'm just going to paint everything in white just to make it simple for 
this script right now. 
The next thing I'm going to do is draw the object label. 
I'm going to change this to label, and 
I'm going to draw the score at the top of the bounding box. 
So you should be able to see the class that each bounding box should 
represent along with the confidence score or probability to get those. 
All we do is just call the bounding box that we get from the list of bounding 
boxes, index in using the label key. 
And we do the same thing, but using the value key that should give us the class 
label along with that classes or label's particular confidence score. 
And we can just print those inside of the bounding box given by the X and Y. 
Of the bounding box. 
Finally, even if I don't have any bounding boxes that resulted from inference, 
I am going to draw the frame rate in frames per second on the frame. 
This gives me an idea of how quickly this whole thing is running, 
which includes capturing an image. 
Converting it to the correct features and 
then running infants in order to find those bounding boxes. 
As we've done before, we need to call the I M show or 
m show from the OpenCV library. 
Give it the image and 
this makes it appear as a preview window with the name frame to get it to work. 
We also need to truncate the raw capture from the frame that we saw earlier. 
This prepares it for the next frame and allows the frame to continually update, 
giving us something like a video stream. 
Next we calculate the frame rate that gets updated on the preview in the next 
iteration of this for loop and 
then give it some way to quit out of the program by pressing the Q key. 
If we have that preview window highlighted, 
if we do press the Q key on that preview window will want open CV. 
To destroy all of the open windows just to create a nice 
way to clean everything up next. 
I'm going to save this on my Raspberry Pi in my ball dog toy 
detection folder as that should have my dot EIM model file.
Play video starting at :7:47 and follow transcript7:47
And I'm going to call it live detection pie cam, let's save that and 
now go over to the Raspberry Pi to see if it works.
Play video starting at :7:58 and follow transcript7:58
The script that we just wrote should be saved on the pie, and 
you can see it here as live detection pie cam dot py.
Play video starting at :8:7 and follow transcript8:07
Now we're going to run it, and I'm going to move my pie cam 
around to see if it will detect things as my dog walks around. 
You can see it as detecting dog correctly.
Play video starting at :8:22 and follow transcript8:22
For the most part, if he lies down, it's about seven, 
it's picking up the ball every now and then. 
And it's grabbing the toy, 
although it looks like it's getting to bounding boxes sometimes for 
the toy and the dog when I move the ball toy and well dog to a farther location. 
You can see that it still detects the dog decently. 
Well It detects the toy, okay, but I'm starting to get about a 0.5 confidence, 
which means it's starting to lose it, and it's not detecting the ball at all. 
This just goes to show that smaller objects do not perform very well with 
this model, 
even though we had a lot of training data showing the ball farther away. 
As a result, I recommend using a model like this with such a small resolution for 
objects that are closer up. 
Or larger in the frame that just take up more 
of the frame where the model can pick out individual features on that object. 
However, it is still detecting my dog pretty well. 
Additionally, you can see that the frame rate is about 
1.71.8 frames per second, which is pretty slow. 
And that's about as good as we're going to get on the Raspberry Pi. 
If you introduce things like a hardware accelerator, or 
you switch to something like an NVIDIA Jetson nano. 
Where you have hardware architectures that accelerate or are designed to run things 
like neural network, you can expect a better frame rate. 
However, this is going to be the best we can do for now, but 
about two frames per second is pretty good. 
I hope this helps you get started, training and 
deploying your own object detection model on something like a single board computer. 
Let's take a quick look at image segmentation. 
It's an advanced concept that we won't spend much time on in this course, 
but I want you to know that it's out there as an active area of research in 
computer vision. 
The idea behind image segmentation is to separate each object in an image 
by examining its individual pixels. 
As such, 
each pixel in the image is given a label as belonging to one of the objects. 
For example, in this photo, we might want to separate the different objects. 
You'll often see image segmentation outputs as another image with 
the same dimensions as the input image, but 
the pixels classified to their respective objects. 
For example, the laptop is black, the wall is white, the table is yellow, 
the power cable is red and the glass is blue. 
We would want to have a good idea of the individual classes were looking for 
when constructing such a model or filter to produce a segmented image. 
If we were trying to identify keys on a keyboard as opposed to anything 
else in the image, the output might look a lot different. 
Object detection gives us bounding boxes with x and y coordinates. 
Image segmentation gives us more information about the shapes of 
the different objects. 
A bounding box might be good enough, but sometimes you need more information. 
For example, image segmentation might help medical researchers look for 
cancerous cells and isolate them from normal cells. 
In that case, the shape of the cell could be very important in determining its type 
and exact location. 
This is an image of cancer cells that have been colorized by the National Institute 
of Health. 
Image segmentation could help us identify such cells and 
give us similar coloration automatically to help identify cancer in patients. 
This paper combines segmentation with the faster R-CNN 
model to help identify pedestrians more accurately. 
This is potentially useful in security cameras and self driving cars. 
One of the easiest forms of image segmentation you can do is a simple 
threshold, convert an image to greyscale and set a threshold value. 
Any pixel over that value is one color and everything else is another. 
Assuming my grayscale capacitor image here is stored as a numpy 
array in the AMG variable, 
this code snippet will convert anything less than a value of 145 to white. 
Everything else will be black. 
This produces a very basic segmented image of my capacitor. 
Note that this only works on objects that have a high contrast with their 
background. 
You could set up multiple bands of color or grayscale values but 
this doesn't work well with objects that have lots of complex shapes and 
colors against similarly complex backgrounds. 
Another popular technique is to use K-means clustering to group pixels with 
similar color values or other features together. 
However, this only works well when the features are guaranteed to be similar 
within each class or object. 
Imagine a person wearing different colors for their shirt and 
pants while K-means might pick out the different articles of clothing, 
it might struggle to see the whole person. 
In 2015, 
UC Berkeley researchers proposed something called a fully convolutional network. 
Here, they used the feature extraction portion of a convolutional network to 
create a feature map. 
Note that this section could be something like VGG or 
mobile net without the classifier network usually found at the end of the model. 
This feature extractor is also known as an encoder. 
The feature map is then sent to another convolution layer that produces an output 
the same resolution as the original input image. 
This step is known as a decoder and 
it's in charge of assigning a label to each pixel in the output segmented array. 
What we've been looking at so far is known as semantic segmentation. 
This is where objects of the same class are given a single value. 
As we see in figure six of this paper, 
objects of the same type are highlighted in one colour. 
For example, person, horse and car are different objects that get the colours 
pink, magenta and grey respectively. 
This type of classification isn't great for 
determining the number of instances of each object present in an image, but, 
it can tell us where some objects are located. 
Note that training this model is like training other convolutional 
neural networks that we've been looking at in this course. 
The difference is that the ground truth output is a segmented image map where 
the pixels have been given values that correspond to the object they represent. 
The idea is that the models convolutional filters would automatically update 
their weights during the training process so that an input image like the one on 
the right would produce an output like the one on the left. 
Other similar models have been proposed. 
For example, researchers created this unit model for 
doing biomedical image segmentation. 
Similar to the fully convolutional network, unit contains an encoder that 
is much like the convolution layers we've been looking at in the course. 
An input image goes through a series of convolutional filters and 
pulling layers to produce a smaller feature map. 
This feature map then goes through a decoder which up samples the feature map 
to create a segmented image the same size as the original input image. 
The trick with u-net is that it can contaminates feature maps from various 
stages of the encoder side to same sized feature maps on the decoder side. 
Once again, 
the model outputs a segmented image with the same resolution as the input image. 
As you can see in the results section, 
u-net has been successfully used to identify different parts of cells. 
While there are plenty of models out there that improve on each other when it 
comes to image segmentation, the last one I want to show you is a model called 
mask R-CNN which was proposed in 2018 by the Facebook Ai research group. 
Mask R-CNN does something a little different than the semantic image 
segmentation models we were just looking at. 
It's attempting to do what's called instant segmentation. 
The output images in figure five do a great job at explaining what this means. 
As you can see, 
it's not only doing image segmentation to identify the types of objects and 
which pixels make up those objects, but it's also doing object detection. 
This means that it can uniquely identify each object of a class in the image. 
Instead of just highlighting all the pixels that make up the person class, 
it can identify the individual instances of the person class. 
It does this by building on the faster R-CNN model. 
It adds a new mask R-CNN system to the faster R-CNN 
model that uses a full convolutional network to create 
a mask of pixels from each proposed region of interest. 
From this, the model outputs a class prediction bounding box and 
pixel segmentation for each object found in the image. 
Don't worry if you didn't fully follow this lecture. 
Right now, image segmentation is mostly being used in large computers to assist 
with complex computer vision problems. 
As a result, it's a little out of scope for our course. 
However, I fully expect to see image segmentation making its way to embedded 
systems in the near future. 
Multi-stage inference requires 
training and using multiple machine learning models 
such that the output of one model 
becomes the input to the next model. 
Dmitry Maslov from Seeed Studio is going 
to talk about multi-stage inference and give 
a demonstration using a Raspberry Pi-based platform 
known as the reTerminal. 
Hello, everyone. 
My name is Dmitry Maslov from 
Seeed Studio and Hardware.ai. 
Let's talk about multi-stage inference. 
You have already familiarized yourself with 
object detection and image classification tasks. 
Multi-stage inference in computer vision most of 
the time involves a combination of these two. 
What is the main benefit of using multi-stage inference? 
Object detection networks in general and 
the ones used on embedded devices in particular, 
are not very good at distinguishing between 
multiple similar classes of objects. 
What we very often do is use 
object detection network to 
detect a large class of objects. 
For example, a face or a car or a dog, 
and then crop out the image and rescale it. 
The results of detection to 
be fed to image classification network, 
which gives a much more nuanced output. 
For example, the emotion on the face, 
car model, or dog breeds. 
Another popular use case for multi-stage inference 
is OCR or optical character recognition. 
Instead of performing detection of 
every character in the whole image, 
which is resource-intensive and prone to errors, 
most often, we first detect the text, 
and then recognition is performed on pieces of 
texts cropped out of large image. 
For this exercise, let's take 
car model recognition as an example. 
In Edge Impulse train a car detector. 
This should be fairly easy 
since we're using transfer learning 
and the base model already was trained to detect cars. 
It contains the necessary feature maps. 
For image recognition model training, 
we're going to use a subset of 
Stanford Cars Dataset with six car model classes. 
I picked six classes of cars that I think I will be 
able to encounter while 
walking outside here in Shenzhen, China. 
After training the model and optionally checking 
the detection and recognition accuracy on 
peak tourists that you can take with 
the web camera or find on the Internet, 
lets deploy these two models on an embedded device. 
I have used reTerminal, 
a Raspberry Pi 4 Compute Module 
Based Development Boards with 
a touchscreen in a sturdy plastic case. 
It comes in handy while 
on field trips like the one I'm going to take. 
Raspberry Pi 4 Compute Module has 
the same CPU as Raspberry Pi 4 Development Boards, 
but has an option to include onboard EMMC memory. 
The module in RI terminal has four gigabytes of 
RAM and 32 gigabytes of EMC.
Play video starting at :3:54 and follow transcript3:54
First of all, you will need to download the models 
for car detector and car classifier, 
which you can do by using 
Edge Impulse Linux runner lugging first 
to the project for car detection 
and then around the following comments, 
which is Edge Impulse Linux runner 
download car detector.ien.
Play video starting at :4:16 and follow transcript4:16
Then log in to your car classifier project. 
You'll need to run clean command, clear command. 
Then you'll need to login into 
car classifier project and download 
the model for car classification. 
Then if you look at the multi-stage inference script, 
you'll see that it's actually very similar to the scripts 
for object detection and image classification. 
Basically, it combines these two. 
You'll see that the main function 
of this script is very similar, 
with the exception that we obviously have two models now. 
We're getting the past for detection molecule, 
for the classification model, 
and then creating the context 
for both of the models at the same time. 
Then we get the information about 
the models and print it out. 
Then we find the suitable web camera here and 
then here is for loop that is going to 
acquire the images from web camera and then return 
the detection results and the processed image. 
Then what we do is, 
for each bounding box of the object that we found, 
we are cropping that bounding box out of the image and 
then resize it to the classification model input size. 
Then we're performing the inference 
on that small image with the classification model. 
Then finally we do some pretty printing here and 
also I added the drawer result function, 
which obviously does what it says exactly here. 
It draws the both detection bounding box 
and the classification result, 
and you already saw how it works in the previous video. 
In conclusion, do we really need 
multi-stage inference and could 
object detection model do 
both detection and fine grade classification? 
There are some models that can perform 
detection and large class 
size classification pretty well, 
such as YOLO9000, 
not suitable for any on 
embedded devices with constrained resources. 
For now, multi-stage inference is best technique 
we have in our computer vision as 
for now for these type of application.

Mat Kelcey is 
a machine learning researcher at Edge Impulse 
and he's going to talk to us 
about his active area of research. 
By breaking a neural network 
down into components or layers, 
we can reuse intermediate outputs from 
those layers to create a self-supervised learning system. 
Hey, my name is Mat. 
I'm willing to do a short talk 
on reusing representations. 
What do I mean by representations? 
Let's go through a quick high-level view 
of what we've to think about machine learning, 
which is this idea of our goal is to learn this functions 
f and the function 
we're trying to learn takes some input. 
We're going to call this X on the left 
and tries to map to it some output, 
and I mean results. 
When I say representation, 
I'm thinking about any of 
the internal features that 
a network like this or 
function like this may learn along the way. 
One particular thing of interests 
with these functions is that they often ran compression. 
I can think about this function f is 
going from some high-dimensional X, 
like an image to some very low dimensional y. 
Say for example, I'm trying to do 
classification where I'm saying 
this is one of 10 classes. 
Now, neural networks have another logical breakdown. 
We often go by the name of say for example layers, 
where this f is actually 
composed of many little f's that talk to each 
other and the first f is 
responsible of taking the input X and changing it in 
some fundamental way that then becomes input to the 
next sub-function f. When I think about representations, 
I'm thinking about what are these internal structures, 
the internal views of the data as it gets 
passed along from this X on the left to the y. 
Now, modern neural network when 
these f's may have many of these layers, 
100 plus or more. 
But for this talk, I want to really talk 
about one of the ones 
that's more to the right-hand side, 
it's one that's getting closer to the output. 
As I said, because there may be many, many of these f's, 
you can think about the functions 
on the left-hand side being 
maybe 99 plus percent of 
the entire model as 
a whole and this last little bit at the end, 
I can think maybe it's the last 
few transformations from whatever 
this internal encoding of 
the network has learned to the final output. 
I'm going to think about it in that way 
that when we're training these functions, 
we're actually training two functions; 
one like an encoder, 
so there's learning this internal representation. 
That's this is first-place 
and that's the bulk of the network. 
The bulk of network goes from X down to 
E and E has been co-trained, 
jointly trained with this other part to then 
output this final prediction of what y might be. 
Why have I split it like that? 
Well, there's a lot of ways that are using 
one part or the other decomposition in interesting ways. 
Probably the most successful and very well known way 
is an idea called transfer learning. 
If I think about starting with a very large dataset B, 
B here denotes a very big dataset 
and I jointly train these two functions, 
the encoder and then a part 
on the end, maybe it's a classifier, 
a small classifier of B and 
if I'm a large Internet company, 
I might have trained this very, 
very large encoder on a lot of lot of data. 
Then why do I care about that as a user, 
maybe that only has a small amount of data? 
Well, there's general adage in 
machine learning that says 
the larger the network we made, 
the more examples of data we need to train it. 
If my x here is similar to my X, 
I can take a bit of a shortcut and I can say, well, 
let's just reuse the existing encoder that maybe 
someone has made about to 
me and then only train the second bit. 
Under that general idea, which is true that 
the less parameters we need to train, 
the unless data we need, 
I can get a good result with only a small amount 
of data under the assumption that my input was the same. 
If this is a large vision model that's 
been trained on something like ImageNet, 
which has a large diversity of natural images and 
I should be able to get away with 
training with a small amount of data. 
That's a really great thing for me. 
This is the common fundamental setup 
in transfer learning. 
Well, this talk's going to be a little bit more about 
and which I think is really interesting 
is the next wave of this type of work, 
which is around this idea of contrasting learning. 
In particular, it's talking about 
this function that we're learning here at E, 
in the context of other datasets 
and I'm going to go through 
one fundamental thing that 
allows us to do a lot of what would follow. 
Let's assume we've got a dataset and this 
is at the ask instance 
with this dataset. I don't know what it could be. 
Let's say we doing classification images 
into one of 10 classes and this is an image of a tree. 
I've got some particular image of a tree X and I've 
got some label that is currently tree. 
This represent one instance in my dataset. 
What if I had another instance in my dataset? 
Now, I'm going to call this the j instance. 
Again, it would be an image of something and a class. 
There's a particular interesting thing, 
say that for this selection of baseline, 
j instance, that the y's were the same. 
Say that they were both trees for example. 
Then under this idea that 
this network is about trying to compress 
things and smoothly doing it towards the output. 
Then it seems intuitive to say that if these 
two y's are the same, 
same tree, then I would expect at 
this point of the encoding 
for the encodings to be like each other. 
Because if I'm thinking about the second part as being 
a classifier that needs to operate on this encoding, 
then if they were both trees, 
then the encodings need to be the same. 
Fundamentally, that idea that the encodings 
need to be the same when the class is the same, 
can hold no matter what the class is. 
It's originally strong assumption to make, 
but an interesting one to say 
that I can start to use this as a learning objective. 
I can say, given the two instances are the same class, 
I want their encodings to be the same, 
without even knowing what the class is. 
That's very powerful and we can 
use that in a lot of different ways. 
One of the fundamental ways that is 
being used now in a lot of 
state art models is this idea that 
focuses around how augmentation works in images. 
Let's have a quick review of what that means. 
It's really common thing, we're probably 
with the same image classification, 
or lot of image networks, 
where I can take an image like 
this picture on the left-hand side. 
If I'm doing classification, 
I can take that image and I 
can do lots of transformations of 
that image that don't break 
the idea that the class is still the same. 
We might rotate the image, 
I might do a little bit of a crop, 
or change the color in this case, saturation contrast. 
But the fundamental idea of 
image augmentation is that I can 
do all these transformations. 
They put me in a different part of pixels likes, 
without fundamentally changing what 
the salient object class is. 
In this case it will be dog. 
But by doing all these mutations, 
I keep it as dog. 
We can use this idea that I can do image augmentation to 
make instances of the data 
where without even knowing what the class is, 
the class has not changed. 
Let's concrete and talk about that. 
If I then say, 
I'm going to take a picture of my dog here, 
and I'm going to apply two random augmentations. 
It's a really good thing about this augmentation approach 
is, it's stochastic. 
Maybe I've rotated one way, 
I have rotated the other way, 
I've change different amounts of light. 
If I think about these as instances 
of training that I want to feed into a network. 
If I'm only interested in learning 
the encoder part as 
a pre-step to doing 
some transfer learning maybe later on, 
what I can say is these two images through the encoder, 
I want them to be the same. 
Without even knowing what the original class was, 
I'm going to say this is going to 
be part of my learning objective. 
I want the network when it's learning this encoding, 
to when it takes this image, 
and when it takes this image, 
return the same encoding. 
Now I don't actually care where, 
if I think about this in a spatial sense, 
I think about encoding as being a point in space which is 
common in the embedding view. 
I don't care where they are, I just want 
them to be close to each other. 
Now, we have a bit of dangerous position here, 
neural networks emitted optimization, 
and the optimizable nicely degenerate everything to zero. 
What I'll actually do commonly in 
these contrasting cases is 
to pick another example 
where I don't want things to be inside. 
For example, a picture of this cat. 
When I take the cat and I change it, 
I want this to be an encoding 
that as an objective is different to the other ones. 
I can treat this as, I want these two, 
the rhythm is going to be the same encoding. 
I'll treat that as part of a loss-function 
and at the same time add components 
to that loss to say 
that the red and the blue ones should be different. 
Now, I can do this with 
instances where I don't know the label just by saying, 
take any image from 
a large dataset and do two augmentations. 
Take any other image from 
a dataset and do the augmentation. 
More often than not, they will not be the same class. 
In general, in expectation, 
this idea holds that even without knowing labels, 
I can make these self labeled type approaches. 
I say these assignment, this is very, very, powerful. 
This idea of using this as a pre-training step, 
like a learning representation step 
before basically training a classifier, 
is common in state of 
the art image classification models at the moment. 
Pre-training step on a very, 
very large unlabeled dataset. 
Now this is contrasting using an idea of augmentation. 
I'm saying that I've got 
cases where two things should be the same, 
and I've got cases where things should be different. 
But where contrastive learning is 
really interesting around our representations is 
when you use much more novel ideas 
about what is similar and what isn't similar. 
My favorite example came from robotics. 
This is the first example I saw 
of this idea of contrastive learning. 
Say, for example, 
I have a robot arm that's reaching onto 
a tray and is picking objects up. 
Now I want to learn some sort of network, 
pre-train the network, understand 
something about the pose of this arm. 
By the pose of the arm, I mean 
the description in joint space where this arm is. 
For example, our arms, 
when you think about the rotation and bend. 
My arm is located in 
space according to some series of numbers. 
I want to learn an encoding of the arm. 
What I can do, which I think is fascinating idea, 
I'm going to actually take a scene with a robot arm, 
in this case simulated, 
reaching into a tray and coming back up. 
I'm going to actually have two 
cameras viewing this robot arm. 
The reason I do that is because I 
have an idea about time. 
Things being at the same point 
in time should be the same, 
whereas things under the idea 
that the arm is always moving, 
two things that are 
different in time should be different. 
I think, if I'm interested in using a dataset like 
this to learn a network 
or pre-training a network that's good to 
learn a representation around 
the arm pose, I can do the following. 
I can take these two cameras that are recording at 
the same time and I can take two of 
the images from different cameras at the same time. 
There's an invariant that holds here that I 
know even though in 
pixel space things look very different, 
the arm is in the same position because 
the camera images were taken at the same time. 
I can use these two images as an example, 
they go to the network to say, 
I don't care where you put the encoding, 
I just need these two to be next to each other. 
You can see this is actually potentially quite 
difficult because in pixel space, 
these look fundamentally different. 
It's like an augmentation we're talking about, 
we've been making a change that's 
invariant to the class, 
but it's a lot more difficulty here, it's 
because the pixels are quite different. 
In the same way we had the example 
before of things that are same, 
under the assumption the arm is always moving, 
we can also say that if the arm is moving, 
then two images from 
the same camera at different times 
should have different arm pose. 
I want the encodings for these to be different. 
Now you can really see here this is 
difficult because it's only the arm 
that's moving in this case. 
For these two to have different encodings, 
the network really needs to 
focus on the thing that's changing, 
which is the arm, even though in 
general pixel space, these are quite similar. 
This example of contrasting over time comes from 
a paper called Time-Contrastive Networks 
by an ex-colleague, Pierre. 
This is the one that really got me 
into contrastive learning, therefore, 
such an incorrect result because we've 
got an effectively endless amount of data here. 
All we need to do is move this arm around, 
have two cameras that we know are taking at the time, 
and use that contrastive invariance to say, 
two images at the same time 
should have the same encoding, 
two images at different times need different encodings. 
Because more often than not that generally holds true, 
we can use that as a learning objective 
to pre-train a very large network. That's fascinating. 
Now, there's so many cases 
where you might want to use this. 
This may work on other cases in vision, for example. 
This is a quick image I took from a screenshot from 
an NVIDIA DRIVE Labs and it shows 
these object tracking they're doing. 
I didn't really pick up because of that. 
I just picked it because this is a good example of 
the different frames and videos 
you might have on a self-driving car. 
Then the idea of contrastive learning could hold here as 
well because we know the time is the same. 
For example, any two pairs of these videos shooting 
some my field pre-training act as things that 
represent the world as 
a whole at the same time, that should be the same. 
You can think about the ones on the top 
right here, the forward-facing cameras. 
Any two pairs of those should be 
similar enough encoding that we could 
use this as a pre-training step for a network, 
which is very powerful constructing, David. 
But really the contrastive learning idea 
is that what can you imagine? 
What can you put around your data to 
say that there are two things that are the same? 
For example, two microphones in 
the same room that are recording audio. 
Even though they'll fundamentally 
have different signals because again, 
it's at the same time, 
you should expect things to be the same, 
and so an encoding should be the same. 
It's a fundamentally really super powerful idea. 
That was fundamentally my idea of talking about 
reusing representations 
that have come from pre-training, 
particularly in contrastive learning, 
and how this is basically one of the next ways 
of what we see for contrastive learning. 
Thanks so much for listening.
## Digging Deeper into Advanced Topics
Project - Deploy Object Detection Model
Introduction
Object detection is a complex system that involves identifying potential objects of interest in an image, classifying them, and giving their locations to the user. Identifying objects of interest in a live video stream is a powerful tool in many computer vision applications.
In this project, you will have the opportunity to collect data, train an object detection model, and deploy it to an embedded system.
Required Hardware
At this time, object detection models only run on computers (including single board computers) and smartphones. Some of the required TensorFlow operations are not supported in TensorFlow Lite for Microcontrollers. If and when it is possible to run object detection on microcontrollers (e.g. OpenMV Camera), I will update this project.
You will need the following hardware setups for this section:
•	Raspberry Pi 4, SD card, Pi Camera
Note: You are welcome to try other embedded systems not listed here. However, I cannot promise that they will work and I likely will not be able to help you troubleshoot any issues you may come across.

Choose one or more classes that you wish to identify. I recommend starting with something simple, like 3 classes, as it will save you some work having to label everything.
Take photos where one or more such objects are in the photo. For my project, a photo might include just 1 dog. Another photo might include 1 dog, 1 ball, and 2 tug toys. Make sure you take photos in a variety of environments, backgrounds, lighting conditions, angles, etc.
Here is an example of my dataset that includes various instances of my 3 objects: dog, ball, toy.
 
Aim to have enough photos such that you have at least 50 instances of each object class (this could be 50 or more total images, depending on how many objects you have in each image).

### Train Object Detection Model
Start a new Edge Impulse project. In Dashboard, scroll down to Project info. Change Labeling method to Bounding boxes (object detection).
 
Go to Data acquisition and upload all of your images. You can leave Automatically split between training and testing as well as Infer from filename (for Label) selected, as we will be supplying our own labels through bounding boxes.
 
Once uploading is done, click on Data acquisition again. Click on Labeling queue at the top, which will walk you through creating bounding boxes for your images. Click and drag on the image to create a bounding box (and fill in the label when asked).
 
Create bounding boxes for all objects in your images. 
If you make a mistake, keep going until the end. When you’re done, click on the Training data or Testing data tab to find the image with the error. Find the image, click the 3 dots to the side of the image name, and click Edit labels. You will be presented with a pop-up window that allows you to edit the bounding box and label information.
 
Go to Impulse design. Change the Image data to have a resolution of 320x320 (at this time, Edge Impulse only supports one object detection model, which requires a 320x320 input resolution).
Add an Image block for your processing block and Object Detection (Images) for your learning block. Click Save Impulse.
 
Click on Image in the navigation bar on the left side of the screen. Make sure that Color depth is set to RGB. Click Save parameters and then click Generate Features on the next screen.
 
Once features have been extracted, click on Object detection in the navigation bar. At this time, you do not have many options for object detection models. Use the default MobileNetV2 SSD FPN-Lite 320x320 model. 
I recommend changing the Number of training cycles to 50. Click Start training.
 
You are welcome to go to Model testing to see how well your model performs on your test data. As you can see from my example, the model had a hard time identifying small objects, like the ball in this test image:
 
When you are happy with the performance of your model, you can deploy it to your embedded system.
Object Detection on the Raspberry Pi
Create a folder to hold your program and model file:
mkdir -p Projects/object-detection
cd Projects/object-detection
Download the model file for your project:
edge-impulse-linux-runner --clean --download modelfile.eim
When asked, choose the project you wish to download the model from. Choose the project that you just created with the object detection model.
Create a Python program that captures an image from the camera and performs inference to locate all objects. Note that bounding box information will be output from the Edge Impulse library. You can save them in a list as follows:
1
bboxes = res[‘result’][‘bounding_boxes’]

When you run the program, point the camera at various objects, and bounding boxes should be drawn on the preview window. Additionally, the inference results, including bounding box and class probabilities, should be printed to the console.
 
Conclusion
Object detection is a great start to many computer vision projects, including autonomous vehicles, animal or people trackers, robot vision, etc. However, you might notice that it can be painfully slow. Object detection requires a huge amount of computing resources, and so you should expect 1-5 frames per second using this particular object detection model on a Raspberry Pi 4.

In this module, we covered some advanced computer vision concepts for 
you to consider. 
We spent most of the time talking about object detection, 
which is an important concept in embedded vision. 
In most cases of object detection, we are trying to uniquely identify each 
instance of an object type in an image and then figure out its location. 
We do that by generating what's known as a bounding box that consists of x-y 
coordinates of the object and the width and height of the object in pixels. 
Note that some bounding boxes give you center coordinates and 
some give you the top left of the box. 
Either way, you can use that information to figure out where 
the approximate center of the object is for your project. 
This can be very useful for something like having 
a camera automatically track a person or animal or having a self-driving 
vehicle automatically avoid pedestrians and other cars. 
One obvious way to accomplish object detection is to slide a window over 
the whole image and perform inference on whatever sub-image is under that window. 
If the object is found, 
you can highlight wherever that window was to create a series of bounding boxes. 
However, this has a number of problems. 
A single object might generate more than one hit. 
Without additional steps, the system might see my dog as three different dog objects. 
Also, it will generally only find objects that fit completely inside the window. 
If an object is too large or too small in the frame, it will be missed. 
You could slide windows of different sizes over the image to look for large, small, 
near, and far objects. 
However, this illustrates the third problem with this approach. 
It's very computationally expensive. 
You have to perform inference on every single window which could take 
a long time. 
This might be okay if you only need to look for 
objects in a still image once every minute or so, but 
you likely won't get a usable frame rate on live video streams. 
That's where we introduced this concept of region proposal. 
Modern object detection algorithms focus on only a few areas of the input 
image that might contain objects. 
Image classification is performed on these regions, 
which cuts down on the number of times inference needs to be performed. 
That gives us a more complex model that can be trained and 
used to perform object detection. 
We feed the model our input image and 
it outputs a list of objects that it detected. 
Each object contains the predicted class, a probability or 
confidence score of that class, and a bounding box coordinates and size. 
We can use that information to draw the bounding box 
on top of our image to give us an idea of what the model found. 
To train such a model, 
we need to provide the training process with ground- truth bounding boxes and 
labels for each object of interest in a set of training and test images. 
From here, information about the regions can be combined or 
the window with the highest confidence can be selected to determine where the objects 
appear in the image. 
From there, we looked at how we can evaluate object detection models. 
To begin, we would graft the recall versus precision for 
each label as we moved the classification threshold from 0 to 1. 
At each threshold value, we can calculate a precision and 
recall score from the validation or test set. 
We then estimate the area under the precision-recall curve by averaging 11 
different points. 
To determine if a bounding box is considered good enough, we would compare 
it to the ground-truth bounding box and calculate the intersection over union. 
This is where we first find the area of overlap between the two boxes and 
then divide that value by the area encompassed by both boxes. 
If this IoU value is above some threshold, normally 0.5, 
we would say that the bounding box is a true positive for that particular object. 
This is, of course, assuming that the predicted bounding box and 
ground-truth box are of the same class. 
The mean Average Precision is a common metric used for 
measuring the performance of an object detection model. 
One way to compute the mAP score is to calculate the average precision for each 
class at a given IoU threshold, like 0.5, and then find the mean of those scores. 
Some newer papers and competitions rely on finding the mean of the average 
precision scores for all classes across a range of IoU thresholds. 
This helps to give higher scores to models that are more precise with their bounding 
box placement. 
We then looked at a number of popular object detection models. 
The first was a region-based convolutional neural network. 
Here, a region proposal algorithm identified areas of interest that were 
sent to a convolutional neural network and classifier. 
This information was also sent to a regression model that updated the location 
and shape of the regions to create better bounding boxes. 
Next, we examined Fast R-CNN that built upon the ideas from R-CNN. 
Here, the region proposal happens in parallel with the convolutional stages 
of the model. 
So that feature extraction only needed to be performed once, 
thus saving some time during training and inference. 
Faster R-CNN takes this a step further and 
performs region proposal on the feature map instead of the raw input image. 
To do this, the authors use a neural network for region proposal, 
instead of some other algorithm like selective search. 
They called this approach a region proposal network. 
Finally, we looked at the Single Shot Multibox Detector 
that performs feature extraction, region proposal, 
and determines object location in one pass through the network, 
instead of needing to do classification on each region of interest. 
By performing classification and 
bounding box regression on the output feature map of each convolution stage, 
the image is essentially divided up into a series of different sized grids, 
where each cell has a variety of potential default bounding boxes. 
The classification heads determine if the proposed bounding boxes belong to 
one of the classes or the background. 
Redundant overlapping bounding boxes are removed in the non-maximum 
suppression stage, so that we end up with ideally one box per object. 
The output should give us unique bounding box and classification information for 
each object in the input image. 
Note that at this time, 
the SSD model we use in the course only works with color images, so 
we can't save on processing time by converting everything to grayscale. 
I showed you how to use transfer learning in Edge Impulse to train your own object 
detection model and then deploy it to a Raspberry Pi. 
Finally, we touched on some advanced concepts that while outside the scope of 
the course for now are important to keep in mind if you are working on an embedded 
computer vision project. 
These are just a taste of what's actively being researched in computer vision. 
I briefly talked about the idea of image segmentation and 
how it can be used to discern object shape. 
I also went over a few papers that showcased some popular image segmentation 
models. 
We also heard from Dmitri Maslov, 
who gave us a multi-stage machine learning demonstration. 
This helps us see how we can use multiple machine learning models in tandem to meet 
our needs, rather than trying to come up with a single model for our project. 
Finally, we got to hear from Matt Kelsey, who showed us how we can 
reuse representations or encodings from the middle of a machine learning model. 
We can use these representations to create a self-supervised learning system. 
Object detection is exciting. 
And we're just now starting to see it work on embedded systems, thanks 
to a bunch of research that went into optimizing both hardware and software. 

## Share Your Object Detection Model
•	What types of objects did you attempt to detect?
•	How well did your model perform on the validation and test sets?
•	How well did the model perform on live inference?
•	You may also optionally post any photos of the setup you wish to share!
Participation is optional
Computer vision is a very large field with 
lots of potential applications. 
I hope that this course has helped you get started using machine learning 
techniques for computer vision on embedded systems. 
And I hope you had some fun along the way. 
I know that I enjoyed making the projects as well as learning about the current 
state of research in the field. 
If you'd like to dive more into computer vision with machine learning, I highly 
recommend this advanced computer vision course by Laurence Moroney and Eddie Shoe. 
They go through many of the topics covered here, but 
they also cover image segmentation in greater detail. 
The focus is on creating systems for larger computers, so 
many of the applications won't be designed for embedded systems. 
Adrian Rosebrock runs the pie image search website which is a fantastic resource 
that covers classic computer vision and deep learning concepts. 
He has free articles and books that you can buy to help you on your journey. 
If you run into issues with the Edge impulse site or various tools, 
I highly recommend posting to the Edge impulse forum. 
I find the people there are very helpful and responsive. 
Tinyml.org is a great place to keep up to date with various research and 
development in the embedded machine learning community. 
They host weekly talks by researchers and 
industry leaders as well as host conferences a few times each year. 
You can also check out their community forum where people discuss the talks and 
various embedded machine learning topics. 
There is a lot of research still being done in machine learning and 
computer vision right now as they are very popular fields. 
Because new models and techniques are being proposed every year, 
you might want to keep an eye out for new papers on Cornell'sarchive.org page. 
For example, here is a relatively recent paper that provides a history and 
overview of popular deep learning techniques used for image segmentation. 
We only barely scratched the surface by looking at mask our CNN. 
Don't worry if some of these papers are confusing. 
It often takes a lot of time, effort and 
background knowledge to understand the content. 
You can often find well written articles on sites like towards data science that 
break down these complex research papers into simpler terms. 
The last thing I want to mention is the idea of automating your machine learning 
process which is something we only barely touched on in the course. 
I wanted you to collect your image data manually so 
that you could examine the files on your computer or play with them in co lab. 
However, Edge impulse has a number of tools that help you automate the upload 
process so that you can collect data and send it right to your project from there. 
They also offer a full API so you can script the training and 
deployment process. 
I always recommend doing a few machine learning projects by hand first so 
you get an idea of how to look for over and 
under fitting as well as how to tune the hyper parameters. 
Once you get the hang of that, 
you can start to automate the process if you want to. 
For example, you could create a unique model for each system that you deploy. 
I also recommend checking out Edge impulse's blog and their Twitter page for 
fun and interesting projects that involve embedded machine learning. 
You can also find me on twitter in addition to being able to contact 
me using the Coursera tools. 
The pairing of computer vision and machine learning is a powerful combination and 
I hope that this course has given you a start to using them in your own projects. 
I want to thank you for being a student in this course, 
sticking with it till the end and 
I wish you luck in your embedded machine learning journey.