Microsoft's Cognitive Services offer pretrained networks that let citizen developers quickly build things like image recognition. This short post shows how it is done and how I created a real-time people recognition system. The goal of this solution is to recognize people in our office and possibly inform our beloved founders about it. What I used: a Python script, a Full HD webcam and the Cognitive Services Custom Vision API.

The steps are pretty straightforward:

Get some training data.
Set up the cognitive service.
Train the model and set up an endpoint. We need a prediction endpoint that can be asked for predictions.
Write a Python script that 1. gets the image from the camera, 2. asks for a prediction and receives the answer, 3. draws rectangles around people, 4. shows the result.

Getting the data

As described by Microsoft, training data should contain around 15 images per class that you want to recognize. A class can be e.g. a cat or a dog. In this case, we are going to look for the silhouette of a person. Because I train only one class, I took around 25 pictures of our Clouds On Mars office as a training sample. I left around 5 as test data.
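The split between training and test pictures can also be done by script. Below is a minimal sketch, assuming the photos are a flat list of files; the file names and the helper split_images are hypothetical and not part of the original setup.

```python
import random


def split_images(paths, n_test=5, seed=0):
    """Shuffle the image paths and hold out n_test of them for testing."""
    shuffled = sorted(paths)               # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, same result on every run
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names: 30 office photos in total.
images = [f"office_{i:02d}.jpg" for i in range(30)]
train, test = split_images(images)
print(len(train), len(test))  # 25 5
```

The training files are the ones to upload and tag; the held-out ones can be used later in Quick Test.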

Setting up a Cognitive Service

1000 prediction calls are free and you can set it up here. It is very intuitive.

After logging in create a new project:

Select Object Detection (preview):

Training the algorithm

To train the algorithm you need to provide it with pictures in which your objects are tagged. First, add images and start tagging by drawing rectangles around the objects to identify:

After tagging the photos, click “Train” and review the performance of the algorithm:

Go to Quick Test to check the performance on a test photo:

To get the prediction endpoint, go to settings. It will be under “Prediction Key”:

Writing a Python script

To connect with Cognitive Service Custom Vision API you need to install a new python module. In your command line run:

pip install azure-cognitiveservices-vision-customvision

Next, we need a module to handle images and the camera. OpenCV, the open source computer vision module, is perfect for the job and super useful. I have used it before in my flappy bird project.

pip install opencv-python

As to the script: getting the camera feed is fairly easy. I have not managed to feed frames directly into the service, so I worked on files instead; this might have been a bottleneck.

import cv2

# Open the second camera (index 1) and request a Full HD frame size.
cam = cv2.VideoCapture(1)
cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

# Grab one frame, save it to disk and load a copy to draw on.
ret_val, img = cam.read()
cv2.imwrite('cam.png', img)
draw = cv2.imread('cam.png')

# Getting the prediction (predictor is created in the whole script further down):

with open("cam.png", mode="rb") as test_data:
    results = predictor.predict_image('<PROJECT ID>', test_data)

# Reading the JSON answer and drawing rectangles:

for prediction in results.predictions:
    if prediction.probability > 0.5:
        box = prediction.bounding_box  # coordinates are fractions of the frame size
        print("\t" + prediction.tag_name + ": {0:.2f}%".format(prediction.probability * 100), box.left * 1920, box.top * 1080, box.width, box.height)
        screen1 = cv2.rectangle(draw, (int(box.left * 1920), int(box.top * 1080)), (int((box.left + box.width) * 1920), int((box.top + box.height) * 1080)), (0, 255, 255), 2)
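The coordinate arithmetic in the rectangle call is easier to check when pulled out into a small function. The sketch below follows the same scaling as the snippet above, where the service returns left, top, width and height as fractions of the frame; the helper name to_pixel_box and its default frame size are illustrative assumptions, not part of the Custom Vision SDK.

```python
def to_pixel_box(left, top, width, height, frame_w=1920, frame_h=1080):
    """Convert a normalized (0..1) bounding box into the two pixel corners
    that cv2.rectangle expects: (top-left), (bottom-right)."""
    x1 = int(left * frame_w)
    y1 = int(top * frame_h)
    x2 = int((left + width) * frame_w)
    y2 = int((top + height) * frame_h)
    return (x1, y1), (x2, y2)


# A box starting a quarter of the way across and halfway down the frame,
# half a frame wide and a quarter of a frame high.
print(to_pixel_box(0.25, 0.5, 0.5, 0.25))  # ((480, 540), (1440, 810))
```

With such a helper the drawing step would become cv2.rectangle(draw, *to_pixel_box(box.left, box.top, box.width, box.height), (0, 255, 255), 2).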

Showing it all:

cv2.imshow('AI',screen1)

The whole script:

from azure.cognitiveservices.vision.customvision.prediction import prediction_endpoint
import cv2

predictor = prediction_endpoint.PredictionEndpoint('<YOUR KEY HERE>')

# Open the second camera (index 1) and request a Full HD frame size.
cam = cv2.VideoCapture(1)
cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
while True:
    ret_val, img = cam.read()
    if not ret_val:
        break  # the camera returned no frame
    # Save the frame to disk and load a copy to draw on.
    cv2.imwrite('cam.png', img)
    draw = cv2.imread('cam.png')
    with open("cam.png", mode="rb") as test_data:
        results = predictor.predict_image('<PROJECT ID>', test_data)

    for prediction in results.predictions:
        if prediction.probability > 0.5:
            box = prediction.bounding_box  # coordinates are fractions of the frame size
            print("\t" + prediction.tag_name + ": {0:.2f}%".format(prediction.probability * 100), box.left * 1920, box.top * 1080, box.width, box.height)
            # cv2.rectangle draws directly onto the draw image.
            cv2.rectangle(draw, (int(box.left * 1920), int(box.top * 1080)), (int((box.left + box.width) * 1920), int((box.top + box.height) * 1080)), (0, 255, 255), 2)

    # Show the frame even when nothing was detected.
    cv2.imshow('AI', draw)

    if cv2.waitKey(1) == 27:
        break  # esc to quit
cam.release()
cv2.destroyAllWindows()
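The script above writes each frame to cam.png and reads it back before sending it. One possible way around that disk round trip, sketched here and not tested against the service, is to encode the frame in memory with cv2.imencode and wrap the bytes in io.BytesIO. The helper name frame_to_stream and its encode parameter are hypothetical, and whether predict_image accepts such a stream in place of an open file is an assumption.

```python
import io


def frame_to_stream(frame, encode=None):
    """Return an in-memory PNG stream for a camera frame, without touching disk.

    encode is an optional function frame -> bytes; by default the frame is
    PNG-encoded with OpenCV.
    """
    if encode is None:
        import cv2  # imported lazily so the helper can be tried without OpenCV

        def encode(f):
            ok, buf = cv2.imencode(".png", f)  # buf is a 1-D uint8 array
            if not ok:
                raise RuntimeError("PNG encoding failed")
            return buf.tobytes()

    return io.BytesIO(encode(frame))


# Usage sketch inside the loop (assumption: the SDK reads any file-like object):
#     results = predictor.predict_image('<PROJECT ID>', frame_to_stream(img))
```

If this works, draw could simply be img.copy(), so the file would no longer be needed at all.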

Final effect:

Some conclusions:

The price seems low, but turns out to be very expensive, especially when designing real-time solutions. One camera in e.g. a store would cost around $25 a day.
The performance is not satisfactory: we had to wait a couple of seconds for a prediction, which rules out any solution that emphasizes speed of response.
It is a very easy and clean way to test solutions that might later be custom made to fit a certain project and its requirements.
Overall – fun.
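To get a feel for how a per-call price adds up for a camera that runs all day, here is a small back-of-the-envelope sketch. The price of $2 per 1,000 predictions is an assumption for illustration only and does not come from this post; check the current pricing before relying on it. The function names are hypothetical as well.

```python
def daily_cost(seconds_between_calls, hours_per_day=24.0, price_per_1000=2.0):
    """Cost in dollars of one camera sending a prediction call at a fixed interval.

    price_per_1000 is an assumed price per 1,000 calls, not an official figure.
    """
    calls_per_day = hours_per_day * 3600 / seconds_between_calls
    return calls_per_day * price_per_1000 / 1000


def interval_for_budget(budget, hours_per_day=24.0, price_per_1000=2.0):
    """Seconds to wait between calls so that one day stays within the budget."""
    calls_per_day = budget / price_per_1000 * 1000
    return hours_per_day * 3600 / calls_per_day


# One call every 7 seconds, around the clock, at the assumed price:
print(round(daily_cost(7), 2))  # 24.69
```

Under that assumed price, a budget of about $25 a day corresponds to roughly one prediction every 7 seconds, which is far from real time.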

Michał Ćwiok

Analytics, clouds, privacy and stuff.

Developing AI camera image recognition system in an hour