Computer Vision Techniques: Implementing Mask R-CNN on Malaria Cells Data

Nidhi Bansal
Published in Analytics Vidhya · Nov 13, 2019


Malaria Cells Detection using Mask R-CNN

In today’s world, computer vision is one of the most powerful and complex fields of Artificial Intelligence. We will explore various applications and techniques of computer vision as we go along.

Computer vision is the field of computer science that tries to replicate the capability of human vision. It enables a computer to detect objects in images and videos. Thanks to deep learning and various CNN and R-CNN techniques, this is possible nowadays.

There are 5 major Computer Vision Techniques:

  1. Image classification
  2. Image Classification with Localization
  3. Object Detection
  4. Image segmentation: Semantic Segmentation
  5. Image segmentation: Instance Segmentation

1. Image Classification

In an image classification problem, each image belongs to a single category. We need to define the set of labels and predict the category of each image.

Let’s formulate a problem:

  • Let’s say our input data is the MNIST dataset of N images, where each image is categorized into one of 10 labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
  • Then, we use this input as a training dataset to train a classifier to learn the model.
  • Lastly, we evaluate our model by asking it to predict labels for a new set of images. Then, we can compare the true labels with the labels predicted by the classifier.

Here, we can use a multi-class classifier or a CNN (Convolutional Neural Network) as our classifier.
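As a rough illustration (not from the original article; the layer sizes and number of epochs are arbitrary choices), a small Keras CNN for MNIST classification might look like this:

# A minimal Keras CNN for MNIST digit classification (illustrative sketch)
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0   # scale pixel values to [0, 1]
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),   # one output per digit class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))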

MNIST images predicted using a CNN (numbers in red show the predicted values)

In the above picture, we have trained a CNN on the MNIST data and predicted the labels of some images.

Output: The output of image classification is class labels or class IDs.

Question: What if we want to locate the object in the image?

2. Image Classification with Localization

Let’s say we have images of dogs and cats and we classify them using a CNN. What if we also want to know their locations in the images?

This is a more challenging version of image classification.

Output: The output is the class label plus the location of the object in the image, which is given by a bounding box. The bounding box is a rectangular or square box drawn around the object.

Picture showing the bounding box of a cell in a Malaria Cells image
Difference between image classification and image classification with localization (Source: Internet)

Question: What if there are different types of objects in a single image?

3. Object Detection

In image classification with localization, an image can have multiple objects, but they all share the same class label. If an image has different objects with different class labels, then identifying their class labels and bounding boxes comes under object detection.

Difference between the 3 techniques of computer vision (Source: Internet)

Output: It gives a class label and a bounding box for each object in the image.

Object detection on a Malaria cells image (it gives a class label and bounding box for each cell in the image). In this image, some cells are red blood cells and some are rings (infected cells).

Object detection can be done using four algorithms:

  1. CNN- A Convolutional Neural Network works like our eyes to detect edges and hence define the boundary of an object. Yes, we can use a CNN to detect objects in an image (a minimal sketch follows this list). The steps of using a CNN for object detection are:
  • First, we take an image as input.
  • Then, we divide the image into various regions; for example, a 100×100 pixel image is divided into various 10×10 pixel regions using the sliding window technique.
  • We consider each region as a separate image.
  • We pass each region to the CNN to classify it into one of the classes.
  • Once we get the corresponding class for each region, we can combine the regions to get the original image with detected objects.
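A rough sketch of this sliding-window idea (the window size, stride, classifier, and threshold below are placeholders, not values from this article):

# Sliding-window detection with an image classifier (illustrative sketch;
# a Keras-style classifier with a predict() method is assumed)
import numpy as np

def sliding_window_detect(image, classifier, window=10, stride=10, threshold=0.9):
    detections = []
    h, w = image.shape[:2]
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            # crop one region and classify it on its own
            crop = image[top:top + window, left:left + window]
            probs = classifier.predict(crop[np.newaxis, ...])[0]
            class_id = int(np.argmax(probs))
            if probs[class_id] >= threshold:
                detections.append((class_id, (top, left, top + window, left + window)))
    return detections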

The problem with this CNN approach is that we need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!

2. R-CNN (Region-based CNN)- Instead of working on a massive number of regions, the R-CNN algorithm proposes a bunch of boxes in the image and checks whether any of these boxes contains an object. R-CNN uses selective search to extract these boxes from an image; these boxes are called ROIs (Regions of Interest).
Let’s understand selective search: there are basically four cues in an image: varying scales, colors, textures, and enclosure. Selective search identifies these patterns in an image and, on that basis, proposes various regions.

Steps in Selective Search:
Take an image -> initial segmentation -> combine similar segments (on the basis of color, texture, or size similarity, and shape compatibility) -> ROIs (Regions of Interest)

R-CNN(Source:Internet)

Steps followed in R-CNN to detect objects:

  1. R-CNN makes use of selective search to create about 2,000 ROIs (Regions of Interest).
  2. The regions are warped/reshaped into fixed-size images so that they match the CNN input size, and each region is fed into the CNN individually. The CNN extracts features for every region.
  3. The extracted features of every region are then passed to SVMs to classify objects versus background. For each class, we train one binary SVM.
  4. Finally, the extracted features of every region are also passed to a linear regression model to generate tighter bounding boxes for each identified object in the image.

Fewer but higher-quality ROIs make R-CNN faster and more accurate than the sliding-window CNN.

Problems with R-CNN:

Training an R-CNN model is expensive and slow because:

  • Extracting 2,000 regions for each image based on selective search.
  • Extracting features using the CNN for every image region: if we have N images, the number of CNN forward passes will be N × 2,000.
  • Training 3 models (the CNN, the SVMs and the regressor) for object detection makes R-CNN slow and computationally expensive.

3. Fast R-CNN
The solution to the problems of R-CNN is Fast R-CNN. In R-CNN, the CNN is run approximately 2,000 times for every image (once per region). In Fast R-CNN, a single CNN pass is used per image and all the features are extracted at once, which reduces the computational time.

Steps followed in Fast R-CNN to detect objects:

  1. Firstly, we use a feature extractor (a CNN) to extract features for the whole image.
  2. In parallel, we also use an external region proposal method, such as selective search, to create ROIs.
  3. Then we combine both ROIs and corresponding feature maps to form patches for object detection.
  4. We warp/reshape the patches to a fixed size using ROI pooling, as required by the input of the FC layers, and feed them to the fully connected layers.
  5. A softmax layer is used on top of the fully connected network to predict classes. Alongside the softmax layer, a linear regression layer is also used in parallel to output bounding box coordinates for the predicted classes.

By not repeating the feature extraction, Fast R-CNN cuts down the processing time significantly.

Problems with Fast R-CNN: Fast R-CNN still uses an external region proposal method (selective search) to find ROIs, which makes it time-consuming.

4. Faster R-CNN
Faster R-CNN is a modified version of Fast R-CNN. The major difference between them is that Fast R-CNN uses selective search for generating Regions of Interest, while Faster R-CNN uses a “Region Proposal Network” (RPN).

RPN takes image feature maps as input and generates ROIs.

Faster R-CNN

Steps followed in Faster R-CNN to detect objects

  1. Firstly, we use a feature extractor (a CNN) to extract features for the whole image.
  2. The Region Proposal Network is applied to these feature maps. It returns the object proposals, i.e. ROIs, along with their objectness scores.
  3. Then both the ROIs and the corresponding feature maps are passed through the ROI pooling layer. We warp/reshape the proposals to a fixed size, as required by the input of the FC layers, and feed them to the fully connected layers.
  4. A softmax layer is used on top of the fully connected network to predict classes. Alongside the softmax layer, a linear regression layer is also used in parallel to output bounding box coordinates for the predicted classes.

RPN uses a sliding window over the feature maps generated by the CNN, and at each window it generates, say, k anchor boxes (fixed boundary boxes) of different shapes and sizes (a minimal sketch of anchor generation follows this list). For each anchor box, RPN predicts two things:

  • The first is the probability that the anchor contains an object (it does not consider which class the object belongs to).
  • The second is the bounding box regressor output for adjusting the anchor to better fit the object.
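As a rough sketch of what anchor generation looks like (the scales and aspect ratios below are illustrative, not the values used by any particular implementation):

# Generate k anchor boxes (y1, x1, y2, x2) centred on one sliding-window position
def make_anchors(center_y, center_x, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for scale in scales:
        for ratio in ratios:          # ratio = height / width; the box area stays scale**2
            h = scale * (ratio ** 0.5)
            w = scale / (ratio ** 0.5)
            anchors.append((center_y - h / 2, center_x - w / 2,
                            center_y + h / 2, center_x + w / 2))
    return anchors                    # k = len(scales) * len(ratios) anchors per position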

Question: What about the actual shape of objects?

4. Image Segmentation: Semantic Segmentation

Object detection tells us the class label and the bounding box of each object, but it does not tell us the actual shape of each object.

So, here Image segmentation comes into the picture.

Image segmentation creates a pixel-wise mask for each object, so it gives us the exact shape of objects.

Question: Where do we need image segmentation?

Answer: Below are some applications of image segmentation:

  1. Malaria or cancer cell detection: in the medical world, correctly detecting cells in an image tells us about the disease.
  2. Self-driving cars: to drive a car, we need to know the actual shape of the objects in front of or beside us, and self-driving cars need this too.
  3. Locating objects in satellite imagery.

Image segmentation is of two types:

i. Semantic Segmentation: Every pixel in the image belongs to one particular class — car, building, window, etc. — and all pixels belonging to a particular class are assigned a single color. For example, it might segment an image so that the background is one class, all cars are one class, and all people are one class. So there are a total of 3 classes, and the picture is segmented into 3 masks of 3 colors.

Semantic segmentation vs instance segmentation (image taken from the internet)

There are several architectures that implement semantic segmentation, such as FCN (Fully Convolutional Network) and encoder-decoder architectures (e.g. the U-Net architecture).

Question: What if we want to detect each object of the same class/type separately?

ii. Instance Segmentation: It segments each object separately.

5. Image Segmentation: Instance Segmentation

Different instances of the same class are segmented individually in instance segmentation. In the above image, different instances of the same class (animals) have been given different labels.
The bounding boxes of many objects overlap with each other, so the mask helps in detecting the exact shape of each object.
One of the algorithms used for instance segmentation is Mask R-CNN.

Mask R-CNN:

Mask R-CNN is basically built on top of Faster R-CNN. It performs pixel-level image segmentation.

Mask R-CNN(Source: Internet)

Steps followed in Mask R-CNN to detect objects

  1. Firstly, we use a feature extractor (a CNN) to extract features for the whole image.
  2. The Region Proposal Network is applied to these feature maps. It returns the object proposals, i.e. ROIs, along with their objectness scores.
  3. Then both the ROIs and the corresponding feature maps are passed through the ROI Align layer. In Mask R-CNN, the ROI Align layer is used instead of ROI pooling. The ROI Align layer is designed to fix the location misalignment caused by quantization in ROI pooling: a region of interest is mapped accurately from the original image onto the feature map without rounding to integers.
  4. A softmax layer is used on top of the fully connected network to predict classes. Alongside the softmax layer, a linear regression layer is also used in parallel to output bounding box coordinates for the predicted classes.
  5. The output of the ROI Align layer also goes separately to a convolutional branch to predict the masks.

Loss Function in Mask R-CNN:

The loss in Mask R-CNN consists of the loss due to the RPN (Region Proposal Network) and the loss due to classification, localization and the segmentation mask.

1. Loss(RPN)= RPN_Class Loss + RPN_BBox Loss

2. Loss(Mask R-CNN)= Loss(class labels prediction) + Loss(Bounding Box prediction) + Loss (Mask Prediction)

Total Loss= Loss(RPN) + Loss(Mask R-CNN)

So, our optimization problem is to minimize the total loss.

Implementing Mask R-CNN on Malaria Cells Data

I have taken the Malaria Cells data from Kaggle. The link is below:
Data Source: https://www.kaggle.com/kmader/malaria-bounding-boxes

The data consists of images (.png or .jpg format). There are 2 sets of images, consisting of 1208 training and 120 test images.

Labels: The data consists of two classes of uninfected cells (RBCs and leukocytes) and four classes of infected cells (gametocytes, rings, trophozoites, and schizonts). The data has a heavy imbalance towards uninfected RBCs, which make up over 95% of all cells, versus uninfected leukocytes and infected cells.

A class label and a set of bounding box coordinates are given for each cell in a JSON file.
I have trained a model using Mask R-CNN (Mask Regional Convolutional Neural Network).
I learned Mask R-CNN from the link below, which trains on a kangaroo object detection dataset. Some code snippets are adapted from this reference. #Ref: https://machinelearningmastery.com/how-to-train-an-object-detection-model-with-keras/

Mask R-CNN gives 3 outputs:
1. Class IDs
2. Bounding boxes of objects/cells
3. Masks of objects/cells

This case study is divided into 5 steps:
1. Install Mask R-CNN for Keras
2. Prepare data set for Object Detection
3. Train Mask R-CNN Model for Malaria Cell Detection
4. Evaluate Mask R-CNN Model
5. Detect Cells in new photos

1. Install Mask R-CNN

i. Clone or download repository from: https://github.com/matterport/Mask_RCNN

ii. Open cmd. Change to Mask_RCNN directory and run install script:
cd Mask_RCNN
python setup.py install
We will get a successful installation message.

iii. To check whether Mask R-CNN is successfully installed:
Run command: >pip show mask-rcnn
Output: Name: mask-rcnn
Version: 2.1
Summary: Mask R-CNN for object detection and instance segmentation
Home-page: https://github.com/matterport/Mask_RCNN
Author: Matterport
Author-email: waleed.abdulla@gmail.com
License: MIT
Location: *\anaconda3\lib\site-packages\mask_rcnn-2.1-py3.7.egg
Requires:
Required-by:

We are now ready to use this library.

2. Prepare Data Set for Object Detection

Sample image:

Image from Malaria Bounding Box dataset

Sample training.json file:

Sample json file

Here, an image has r × c pixels, and the minimum and maximum values of r and c define the bounding box vertices.

To create the dataset, we need to extract the minimum and maximum r and c values, with the corresponding category, for each bounding box of every image, and assign an image_id to every image. A minimal sketch of this step is shown below.
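This sketch assumes the JSON fields follow the sample shown above (a pathname under image, and objects with a category and a bounding_box containing minimum/maximum r and c values):

# Flatten training.json into one row per bounding box
import json
import pandas as pd

rows = []
with open('training.json') as f:
    annotations = json.load(f)

for image_id, entry in enumerate(annotations):
    path = entry['image']['pathname']           # field names assumed from the sample above
    for obj in entry['objects']:
        box = obj['bounding_box']
        rows.append({'image_id': image_id,
                     'path': path,
                     'category': obj['category'],
                     'min_r': box['minimum']['r'], 'min_c': box['minimum']['c'],
                     'max_r': box['maximum']['r'], 'max_c': box['maximum']['c']})

df = pd.DataFrame(rows)
print(df.head())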

Top 5 entries of dataframe created

Next, we write the dataset functions required by Mask R-CNN.

Code snippet
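A sketch of such a dataset class is below. It follows the mrcnn.utils.Dataset pattern from the reference tutorial; the six class names and the DataFrame columns are assumptions carried over from the preparation step above.

# Dataset wrapper in the style of the mask-rcnn library's mrcnn.utils.Dataset
# (the class names and DataFrame columns below are assumptions from the steps above)
import numpy as np
from skimage.io import imread
from mrcnn.utils import Dataset

CLASS_NAMES = ['red blood cell', 'leukocyte', 'gametocyte', 'ring', 'trophozoite', 'schizont']

class MalariaDataset(Dataset):
    def load_dataset(self, df, image_dir):
        # register the six cell classes (class id 0 is reserved for the background)
        for i, name in enumerate(CLASS_NAMES):
            self.add_class('dataset', i + 1, name)
        # add one entry per image, keeping only boxes whose category we registered
        for image_id, group in df.groupby('image_id'):
            group = group[group['category'].isin(CLASS_NAMES)]
            if group.empty:
                continue
            self.add_image('dataset', image_id=image_id,
                           path=image_dir + group['path'].iloc[0],
                           boxes=group[['min_r', 'min_c', 'max_r', 'max_c']].values,
                           labels=group['category'].tolist())

    def load_mask(self, image_id):
        # build one box-shaped binary mask per annotated cell
        info = self.image_info[image_id]
        h, w = imread(info['path']).shape[:2]
        masks = np.zeros((h, w, len(info['boxes'])), dtype='uint8')
        class_ids = []
        for i, (r1, c1, r2, c2) in enumerate(info['boxes']):
            masks[int(r1):int(r2), int(c1):int(c2), i] = 1
            class_ids.append(self.class_names.index(info['labels'][i]))
        return masks, np.asarray(class_ids, dtype='int32')

    def image_reference(self, image_id):
        return self.image_info[image_id]['path']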

Then prepare the train and test dataset:

Code snippet to prepare train and test dataset
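A minimal sketch, assuming train_df and test_df are the per-split DataFrames built earlier and image_dir points at the extracted Kaggle folder:

# Build and prepare the train and test datasets
train_set = MalariaDataset()
train_set.load_dataset(train_df, image_dir='malaria/')
train_set.prepare()
print('Train: %d images' % len(train_set.image_ids))

test_set = MalariaDataset()
test_set.load_dataset(test_df, image_dir='malaria/')
test_set.prepare()
print('Test: %d images' % len(test_set.image_ids))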

Let’s test whether image loading, masking, and bounding boxes work properly.

Code snippet to test image loading, masking and bounding box with class labels
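A minimal sketch of this check, using the mask-rcnn utilities (the image_id value matches the output shown below):

# Visual check: load one image, its masks, and the derived bounding boxes
from mrcnn.utils import extract_bboxes
from mrcnn.visualize import display_instances

image_id = 15
image = train_set.load_image(image_id)
mask, class_ids = train_set.load_mask(image_id)
bbox = extract_bboxes(mask)   # one (y1, x1, y2, x2) box per instance
display_instances(image, bbox, mask, class_ids, train_set.class_names)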

Output of the above code:

Output display of image_id 15

The visualization of the image shows the mask, the bounding box (as a dotted box) and the class ID.

3. Train Mask R-CNN Model for Malaria Cell Detection

(i). The first step is to define the configuration for training the model:
We define a MalariaConfig class, which extends the mrcnn.config.Config class.

MalariaConfig class
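A sketch of what this configuration class can look like (the step count and image settings are illustrative and should be tuned for your GPU):

# Training configuration
from mrcnn.config import Config

class MalariaConfig(Config):
    NAME = 'malaria_cfg'       # used in checkpoint file names, e.g. mask_rcnn_malaria_cfg_0005.h5
    NUM_CLASSES = 1 + 6        # background + 6 cell classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    STEPS_PER_EPOCH = 100

config = MalariaConfig()
config.display()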

(ii). Train the model
Now, we will train the model starting from pre-trained weights. The first step is to download the model file (architecture and weights) for the pre-fit Mask R-CNN model.
Download the model weights to a file with the name ‘mask_rcnn_coco.h5‘ from the matterport GitHub repository of Mask R-CNN into your current working directory.

Now, define the model by creating an instance of the mrcnn.model.MaskRCNN class, specifying that the model will be used for training by setting the ‘mode‘ argument to ‘training‘, and using the config we defined above.

Load the weights from mask_rcnn_coco.h5, which we downloaded.

Now train the model (code sketch shown below).

Training Mask R-CNN Model
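A sketch of the training code, following the reference tutorial’s pattern (the model_dir is illustrative):

# Define the model in training mode, load the COCO weights, and train the head layers
from mrcnn.model import MaskRCNN

model = MaskRCNN(mode='training', model_dir='./', config=config)
# skip the output layers whose shapes depend on the number of classes
model.load_weights('mask_rcnn_coco.h5', by_name=True,
                   exclude=['mrcnn_class_logits', 'mrcnn_bbox_fc', 'mrcnn_bbox', 'mrcnn_mask'])
model.train(train_set, test_set, learning_rate=config.LEARNING_RATE,
            epochs=5, layers='heads')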

Training the model takes approximately 2–3 hours. I am using a GPU (NVIDIA GeForce GTX 1080 with Max-Q Design).

A model checkpoint is saved at the end of every epoch. Since the loss decreases with every epoch, we will use the epoch 5 file mask_rcnn_malaria_cfg_0005.h5 to evaluate model performance.

4. Evaluate Mask R-CNN Model

The first step is to define a new configuration for evaluating the model. See the code below.

Prediction configuration code
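A sketch of the prediction configuration (one image per GPU so we can evaluate images one at a time):

# Inference configuration
from mrcnn.config import Config

class PredictionConfig(Config):
    NAME = 'malaria_cfg'
    NUM_CLASSES = 1 + 6
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config_pred = PredictionConfig()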

Next, we can define the model with config_pred and set the ‘mode‘ argument to ‘inference‘ instead of ‘training‘.
Then, we can load the weights from our saved model file ‘mask_rcnn_malaria_cfg_0005.h5‘ in the current working directory.

Code for model Evaluation
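A sketch of the evaluation code, adapted from the reference tutorial’s pattern (the evaluate_model helper is not part of the mask-rcnn library itself):

# Evaluate the model with mean average precision (mAP)
from numpy import expand_dims, mean
from mrcnn.model import MaskRCNN, load_image_gt, mold_image
from mrcnn.utils import compute_ap

def evaluate_model(dataset, model, cfg):
    APs = []
    for image_id in dataset.image_ids:
        # load ground-truth image, class ids, boxes and masks
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(
            dataset, cfg, image_id, use_mini_mask=False)
        # scale pixel values the same way as during training and make a one-image batch
        sample = expand_dims(mold_image(image, cfg), 0)
        # run detection
        yhat = model.detect(sample, verbose=0)[0]
        # average precision at IoU threshold 0.5 for this image
        AP, _, _, _ = compute_ap(gt_bbox, gt_class_id, gt_mask,
                                 yhat['rois'], yhat['class_ids'],
                                 yhat['scores'], yhat['masks'])
        APs.append(AP)
    return mean(APs)

# define the model in inference mode, load the trained weights, and evaluate both splits
model = MaskRCNN(mode='inference', model_dir='./', config=config_pred)
model.load_weights('mask_rcnn_malaria_cfg_0005.h5', by_name=True)
print('Train mAP: %.3f' % evaluate_model(train_set, model, config_pred))
print('Test mAP: %.3f' % evaluate_model(test_set, model, config_pred))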

plot_actual_vs_predicted is a function defined to plot the actual images alongside the predicted images. A sketch of it is shown below.

Code for plot of actual vs predicted image
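A rough sketch of such a plotting function (the layout and number of images are illustrative):

# Show ground-truth masks next to the model's predicted boxes for a few images
import matplotlib.pyplot as plt
from numpy import expand_dims
from mrcnn.model import mold_image

def plot_actual_vs_predicted(dataset, model, cfg, n_images=3):
    for i in range(n_images):
        image = dataset.load_image(i)
        mask, _ = dataset.load_mask(i)
        sample = expand_dims(mold_image(image, cfg), 0)
        yhat = model.detect(sample, verbose=0)[0]
        # left column: actual image with the ground-truth masks overlaid
        plt.subplot(n_images, 2, i * 2 + 1)
        plt.imshow(image)
        plt.imshow(mask.max(axis=-1), cmap='gray', alpha=0.3)
        plt.title('Actual')
        # right column: the same image with the predicted bounding boxes
        plt.subplot(n_images, 2, i * 2 + 2)
        plt.imshow(image)
        plt.title('Predicted')
        ax = plt.gca()
        for y1, x1, y2, x2 in yhat['rois']:
            ax.add_patch(plt.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       fill=False, color='red'))
    plt.show()

plot_actual_vs_predicted(test_set, model, config_pred)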

The performance of a model on an object recognition task is often evaluated using mAP and IoU.

We are predicting bounding boxes so we can determine how well the predicted and actual bounding boxes overlap. This can be calculated by dividing the area of the overlap by the total area of both bounding boxes, or the intersection divided by the union, referred to as “intersection over union,” or IoU. A perfect bounding box prediction will have an IoU of 1.
It is standard to assume a positive prediction of a bounding box if the IoU is greater than 0.5, e.g. they overlap by 50% or more.
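For reference, a minimal IoU computation for two boxes in (y1, x1, y2, x2) format:

# Intersection over Union (IoU) for two boxes given as (y1, x1, y2, x2)
def iou(box_a, box_b):
    y1, x1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y2, x2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0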

Precision refers to the percentage of the correctly predicted bounding boxes (IoU > 0.5) out of all bounding boxes predicted in the image. Recall is the percentage of the correctly predicted bounding boxes (IoU > 0.5) out of all objects in the image.
As we make more predictions, the recall will increase, but precision will drop or become erratic as we start making false positive predictions. The recall (x) can be plotted against the precision (y) for each number of predictions to create a curve or line. We can take the maximum precision at each recall level on this curve and average these values to get the average precision (AP).
The average or mean of the average precision (AP) across all of the images in a dataset is called the mean average precision, or mAP.

The mask-rcnn library provides mrcnn.utils.compute_ap to calculate the AP and other metrics for a given image. These AP scores can be collected across a dataset and averaged to give an idea of how good the model is at detecting objects in that dataset.

RBCs and a trophozoite are predicted

From the above actual and predicted images, we can see that most of the cells in the actual image are predicted. In this example image, red blood cells and a trophozoite are predicted.

After evaluating the model, we got:
mAP on the training data: 0.830
mAP on the test data: 0.806

5. Detect Cells in new photos

I downloaded some malaria cell images from the internet and ran the model on them. These images are not part of the training or test datasets.

Here, are the results:

Actual Image 1
Predicted Image 1

In the above new image 1, a trophozoite (shown with a red mask) is predicted with a score of 0.7443.

Actual Image 2
Predicted Image 2

In the above new image 2, a trophozoite (shown with a red mask) is predicted with a score of 0.723.

The above predicted images show the bounding boxes and masks of many cells, and many RBCs and trophozoites are detected and predicted correctly.

Code

For the code, check my GitHub profile.

Conclusion

In this blog, we have discussed various Computer Vision techniques.

Mask R-CNN was discussed in detail and applied to the malaria cells data.
It works fairly well at identifying red blood cells and parasite-infected cells.

Scope of Improvement

  1. The mask shape can be improved to match the exact shape of the cell.
  2. The implemented model does not detect all infected cells; it needs some improvement.

If you found my article useful, give it a 👏 and help others find it. Remember, you can clap up to 50 times (by pressing on the 👏 icon for longer). And it’s a great way to give feedback!
