Semantic Image Segmentation using Fully Convolutional Networks

Humans have the innate ability to identify the objects they see in the world around them. The visual cortex in our brain can distinguish between a cat and a dog effortlessly, in almost no time. This is true not only for cats and dogs but for almost all the objects we see. But a computer is not as smart as a human brain and cannot do this on its own. Over the past few decades, deep learning researchers have tried to bridge this gap between the human brain and the computer through a special type of artificial neural network called the Convolutional Neural Network (CNN).

After a lot of research on mammalian brains, researchers found that specific parts of the brain get activated by specific types of stimuli. For example, some parts of the visual cortex get activated when we see vertical edges, some when we see horizontal edges, and others when we see specific shapes, colors, faces, etc. ML researchers imagined each of these parts as a layer of a neural network and considered the idea that a large network of such layers could mimic the human brain.
This intuition gave rise to the CNN, a type of neural network whose building blocks are convolutional layers. A convolutional layer is nothing but a set of weight matrices, called kernels or filters, which are used to perform the convolution operation on a feature matrix such as an image.

2D convolution is a fairly simple operation: you start with a kernel and ‘stride’ (slide) it over the 2D input data, performing an element-wise multiplication with the part of the input it is currently on, and then summing the results into a single output cell. The kernel repeats this process for every location it slides over, converting one 2D matrix of features into another.
The step size by which the kernel slides over the input feature matrix is called the stride. In the animation below, the input matrix has been padded with an extra strip of zeros on all four sides to ensure that the output matrix is the same size as the input matrix. This is called (zero) padding.
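As an illustration, here is a minimal NumPy sketch of this operation, with stride and zero-padding as parameters (a naive loop, not the optimized routines deep learning frameworks actually use):

```python
import numpy as np

def conv2d(x, kernel, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation, as CNNs use it)."""
    if padding:
        x = np.pad(x, padding)  # add a strip of zeros on all four sides
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
# With padding=1 and stride=1, the output keeps the 4x4 input size.
print(conv2d(x, k, stride=1, padding=1).shape)  # (4, 4)
```

Without padding, the same call shrinks the output to 2×2, which is why same-size outputs need the zero strip.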

Image segmentation is the task of partitioning a digital image into multiple segments (sets of pixels) based on some characteristics. The objective is to simplify or change the image into a representation that is more meaningful and easier to analyze.
Semantic Segmentation refers to assigning a class label to each pixel in the given image. See the below example.

Note that segmentation is different from classification. In classification, the complete image is assigned a class label, whereas in segmentation, each pixel in the image is classified into one of the classes.

Having a fair idea about convolutional networks and semantic image segmentation, let’s jump into the problem we need to solve.

Severstal is among the top 50 producers of steel in the world and Russia’s biggest player in efficient steel mining and production. One of the key products of Severstal is steel sheets. The production process of flat sheet steel is delicate. From heating and rolling, to drying and cutting, several machines touch flat steel by the time it’s ready to ship. To ensure quality in the production of steel sheets, today, Severstal uses images from high-frequency cameras to power a defect detection algorithm.
Through this competition, Severstal expects the AI community to improve the algorithm by localizing and classifying surface defects on a steel sheet.

Business objectives and constraints

  1. A defective sheet must be predicted as defective, since there would be serious quality concerns if we misclassify a defective sheet as non-defective, i.e., a high recall value for each of the classes is needed.
  2. We need not give the results for a given image in the blink of an eye. (No strict latency concerns)

2.1. Mapping the business problem to an ML problem

Our task is to

  1. Detect/localize the defects in a steel sheet using image segmentation and
  2. Classify the detected defects into one or more classes from [1, 2, 3, 4]

To put it together, it is a semantic image segmentation problem.

2.2. Performance metric

The evaluation metric used is the mean Dice coefficient. The Dice coefficient can be used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The formula is given by:

Dice = (2 × |X ∩ Y|) / (|X| + |Y|)

where X is the predicted set of pixels and Y is the ground truth.
Read more about Dice Coefficient here.
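A quick NumPy illustration of the Dice coefficient on binary masks (the small epsilon is an assumption to keep the empty-mask case well-defined):

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Dice = 2*|X ∩ Y| / (|X| + |Y|) for binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

truth = np.array([[0, 1, 1], [0, 1, 0]])
pred  = np.array([[0, 1, 0], [0, 1, 0]])
print(dice_coefficient(pred, truth))  # 2*2 / (2 + 3) ≈ 0.8
```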

2.3. Data Overview

We have been given a zip folder of size 2GB which contains the following:

  • train_images — a folder containing 12,568 training images (.jpg files)
  • test_images — a folder containing 5,506 test images (.jpg files). We need to detect and localize defects in these images
  • train.csv — training annotations which provide segments for defects belonging to ClassId = [1, 2, 3, 4]
  • sample_submission.csv — a sample submission file in the correct format, with each ImageId repeated 4 times, one for each of the 4 defect classes.

More details about data have been discussed in the next section.

The first step in solving any machine learning problem should be a thorough study of the raw data. This gives a fair idea about what our approaches to solving the problem should be. Very often, it also helps us find some latent aspects of the data which might be useful to our models.
Let’s analyze the data and try to draw some meaningful conclusions.

3.1. Loading train.csv file

train.csv tells which type of defect is present at which pixel locations in an image. It contains the following columns:

  • ImageId: image file name with .jpg extension
  • ClassId: type/class of the defect, one of [1, 2, 3, 4]
  • EncodedPixels: the ranges of defective pixels in an image in the form of run-length-encoded pixels (pixel number where the defect starts <space> pixel length of the defect).
    e.g. ‘29102 12’ implies the defect starts at pixel 29102 and runs for a total of 12 pixels, i.e. pixels 29102, 29103, …, 29113 are defective. The pixels are numbered from top to bottom, then left to right: 1 corresponds to pixel (1,1), 2 corresponds to (2,1), and so on.
train_df.ImageId.describe()

count      7095
unique     6666
top        ef24da2ba.jpg
freq       3
Name: ImageId, dtype: object
  • There are 7095 data points corresponding to 6666 steel sheet images containing defects.

3.2. Analyzing train_images & test_images folders

Number of train and test images
Let’s get some idea about the proportion of train and test images and check how many train images contain defects.

Number of train images : 12568
Number of test images : 5506
Number of non-defective images in the train_images folder: 5902
  • There are more images in the train_images folder than unique image Ids in train.csv. This means, not all the images in the train_images folder have at least one of the defects 1, 2, 3, 4.

Sizes of train and test images
Let’s check if all images in train and test are of the same size. If not, we must make them of the same size.

{(256, 1600, 3)}
{(256, 1600, 3)}
  • All images in train and test folders have the same size (256×1600×3)

3.3. Analysis of labels: ClassId

Let’s see how train data is distributed among various classes.

Number of images in class 1 : 897 (13.456 %)
Number of images in class 2 : 247 (3.705 %)
Number of images in class 3 : 5150 (77.258 %)
Number of images in class 4 : 801 (12.016 %)
  • The dataset looks imbalanced.
  • The number of images with class 3 defect is very high compared to that of other classes. 77% of the defective images have class 3 defects.
  • Class 2 is the least occurring class, only 3.7 % of images in train.csv belong to class 2.

Note that the sum of the percentage values in the above analysis is more than 100, which means some images have defects belonging to more than one class.

Number of labels tagged per image

Number of images having 1 class label(s): 6239 (93.594%)
Number of images having 2 class label(s): 425 (6.376%)
Number of images having 3 class label(s): 2 (0.03%)
  • The majority of the images (93.6%) have only one class of defects.
  • Only 2 images (0.03%) have a combination of 3 classes of defects.
  • The rest of the images (6.37%) have a combination of 2 classes of defects.
  • No image has all 4 classes of defects.

Before we move ahead to training deep learning models, we need to convert the raw data into a form that can be fed to the models. Also, we need to build a data pipeline, which would perform the required pre-processing and generate batches of input and output images for training.

As the first step, we create a pandas dataframe containing the filenames of train images under an ImageId column and the EncodedPixels of each image under one or more of four class columns, depending on the ClassId of the image in train.csv. The images that do not have any defects have all these 4 columns blank. Below is a sample of the dataframe:

4.1. Train, CV split 85:15

I train my models on 85% of the train images and validate on the remaining 15%.

(10682, 5)
(1886, 5)
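A minimal sketch of such a split with NumPy (the helper name `train_cv_split` is hypothetical, and the exact counts depend on how the split rounds, so they may differ by one from the shapes printed above):

```python
import numpy as np

def train_cv_split(n_rows, cv_fraction=0.15, seed=42):
    """Shuffle row indices and split them 85:15 into train and CV sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_cv = int(n_rows * cv_fraction)
    return idx[n_cv:], idx[:n_cv]

train_idx, cv_idx = train_cv_split(12568)
print(len(train_idx), len(cv_idx))  # 10683 1885
```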

4.2. Utility Functions for converting RLE encoded pixels to masks and vice-versa

Let’s visualize some images from each class along with their masks. The pixels belonging to the defective area in the steel sheet image are indicated by yellow color in the mask image.
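A minimal sketch of these utilities, assuming the competition's column-major, 1-indexed pixel ordering described earlier (the full versions are in the repository):

```python
import numpy as np

def rle_to_mask(rle, shape=(256, 1600)):
    """Decode run-length-encoded pixels ('start length start length ...')
    into a binary mask. Pixels are numbered column-major, starting at 1."""
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    nums = list(map(int, rle.split()))
    for start, length in zip(nums[0::2], nums[1::2]):
        mask[start - 1:start - 1 + length] = 1
    return mask.reshape(shape, order='F')  # column-major, per the competition

def mask_to_rle(mask):
    """Encode a binary mask back into the run-length string."""
    pixels = mask.reshape(-1, order='F')
    padded = np.concatenate([[0], pixels, [0]])
    runs = np.where(padded[1:] != padded[:-1])[0] + 1  # starts and ends
    runs[1::2] -= runs[0::2]                           # ends -> lengths
    return ' '.join(str(x) for x in runs)

mask = rle_to_mask('29102 12')
print(mask.sum())         # 12 defective pixels
print(mask_to_rle(mask))  # '29102 12'
```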

Our deep learning model takes a steel sheet image as input (X) and returns four masks (Y) (corresponding to the 4 classes) as output. This implies that, for training our model, we need to feed batches of train images and their corresponding masks to the model.
I have generated masks for all the images in the train_images folder and stored them into a folder called train_masks.

4.3. Data generator

The data pipeline applies pre-processing and augmentation to the input images and generates batches for training.
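The actual pipeline (with disk I/O and augmentation) is in the repository; as a rough sketch of what such a generator does, assuming images and masks are already in memory as arrays:

```python
import numpy as np

def batch_generator(images, masks, batch_size=8, shuffle=True, seed=0):
    """Yield (X, Y) batches: X normalized images, Y per-class binary masks."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(images))
    while True:  # Keras fit_generator-style infinite loop
        if shuffle:
            rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            X = images[batch].astype(np.float32) / 255.0  # pre-processing
            Y = masks[batch].astype(np.float32)           # (H, W, 4) per image
            yield X, Y

# Example with random data shaped like the half-size steel images:
images = np.random.randint(0, 256, (32, 128, 800, 3), dtype=np.uint8)
masks = np.random.randint(0, 2, (32, 128, 800, 4), dtype=np.uint8)
X, Y = next(batch_generator(images, masks))
print(X.shape, Y.shape)  # (8, 128, 800, 3) (8, 128, 800, 4)
```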

4.4. Defining metric and loss function

I have used a hybrid loss function which is a combination of binary cross-entropy (BCE) and dice loss. BCE corresponds to binary classification of each pixel (0 indicating false prediction of defect at that pixel when compared to the ground truth mask and 1 indicating correct prediction). Dice loss is given by (1- dice coefficient).
BCE dice loss = BCE + dice loss
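A NumPy sketch of this loss (the actual training code uses an equivalent Keras tensor version, with a smoothing term assumed here to avoid division by zero):

```python
import numpy as np

def dice_coef(y_true, y_pred, smooth=1.0):
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

def bce_dice_loss(y_true, y_pred, eps=1e-7):
    """BCE dice loss = binary cross-entropy + (1 - dice coefficient)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() finite
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return bce + (1.0 - dice_coef(y_true, y_pred))

y_true = np.array([0., 1., 1., 0.])
good = bce_dice_loss(y_true, np.array([0.1, 0.9, 0.9, 0.1]))
bad = bce_dice_loss(y_true, np.array([0.9, 0.1, 0.1, 0.9]))
print(good < bad)  # True: better predictions give lower loss
```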

There are several models/architectures that are used for semantic image segmentation. I have tried two of them in this case study: i) U-Net and ii) Google's DeepLab V3+.

5.1. First cut Solution: U-Net for Semantic Image Segmentation

This model is based on the research paper U-Net: Convolutional Networks for Biomedical Image Segmentation, published in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox of the University of Freiburg, Germany. In this paper, the authors build upon an elegant architecture called the "Fully Convolutional Network". They used it for segmentation of neuronal structures in electron microscopic stacks and a few other biomedical image segmentation datasets.

5.1.1. Architecture
The Architecture of the network is shown in the image below. It consists of a contracting path (left side) and an expansive path (right side). The expanding path is symmetric to the contracting path giving the network a shape resembling the English letter ‘U’. Due to this reason, the network is called U-Net.

The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled. In this path, the model captures the important features of the image (such as the defects in a steel sheet) and discards the unimportant ones, reducing the resolution at each convolution+maxpool step.
In the expansive path, every step consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes (4 in our case).
In order to localize precisely, high-resolution features from the contracting path are cropped and combined with the upsampled output and fed to a successive convolution layer which will learn to assemble more precise output.

  • Instead of using 64 filters in the first layer, I have used only 8 (the number of filters in the subsequent layers changes accordingly). This results in a less complex model, which is faster to train than the model with 64 filters.
  • The original size of the steel sheet images is 256x1600. Larger images consist of more pixels and hence require more computation for convolution, pooling, etc. I have resized the images to half size (128x800) due to computational resource constraints.
  • I have added a small dropout of 0.2 after each convolution block to prevent the model from overfitting.

The code for U-Net model is available in my GitHub repository.

5.1.2. Training
I have trained the model using Keras Adam optimizer with the default learning rate for 50 epochs. The loss function that the optimizer tries to minimize is bce_dice_loss, defined earlier in section 4.4.

I have used Keras model checkpoint to monitor the validation dice coefficient as the training progresses and save the model with the best validation dice coefficient score. TensorBoard has been used to dynamically plot the loss and score while training.

5.1.3. Training plots

5.1.4. Testing
The figure below shows a sample image from validation data alongside its ground truth mask and predicted mask.

Since Kaggle requires us to submit predictions on the original-size images and not on the half-size ones, I have rebuilt the model with input size = (256, 1600, 3) and loaded it with the weights of the model trained on 128×800 images. I have taken this liberty because fully convolutional networks are fairly robust to different input sizes.

  • The Dice coefficient score was pretty good when I uploaded the predictions of this model to Kaggle. I got a score of 0.80943 on the Private leaderboard and 0.81369 on the Public leaderboard.

5.2. Final Solution: DeepLab V3+

DeepLab is a state-of-the-art semantic segmentation model designed and open-sourced by Google in 2016. Since then, multiple improvements have been made to the model, including DeepLab V2, DeepLab V3, and the latest DeepLab V3+.
DeepLab V3+ is based on the paper Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, published in 2018 by Google.

5.2.1. Architecture
Similar to U-Net discussed earlier, DeepLab V3+ is also an encoder-decoder architecture. The major difference is that it uses Atrous convolution instead of simple convolution. We would learn more about Atrous convolution later in this section.

The encoder module encodes multi-scale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries.

Atrous Convolution
Atrous convolution is a generalized form of the standard convolution operation that allows us to explicitly control the filter's field-of-view in order to capture multi-scale information. In the case of two-dimensional signals, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as follows:

y[i] = Σ_k x[i + r·k] · w[k]

where the atrous rate r determines the stride with which we sample the input signal. Note that standard convolution is a special case in which r = 1. The filter's field-of-view is adaptively modified by changing the dilation/atrous rate.
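A toy 1D NumPy version makes the role of the rate r concrete: the same 3-tap filter samples adjacent inputs at r = 1, and inputs two apart at r = 2, widening its field-of-view without adding weights.

```python
import numpy as np

def atrous_conv1d(x, w, rate=1):
    """y[i] = sum_k x[i + rate*k] * w[k]; rate=1 recovers standard convolution."""
    k = len(w)
    span = rate * (k - 1) + 1  # effective field-of-view of the filter
    out_len = len(x) - span + 1
    return np.array([sum(x[i + rate * j] * w[j] for j in range(k))
                     for i in range(out_len)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(atrous_conv1d(x, w, rate=1))  # samples adjacent inputs
print(atrous_conv1d(x, w, rate=2))  # same filter, field-of-view widened to 5
```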

Depth-wise Separable Convolution
Depth-wise separable convolution drastically reduces computation complexity by dividing a standard convolution into two sub-parts —
i. Depth-wise convolution
ii. Point-wise convolution.

The first part is depth-wise convolution that performs a spatial convolution independently for each input channel. It is followed by a point-wise convolution (i.e., 1×1 convolution), which is employed to combine the output from the depth-wise convolution.

Let us understand this with the help of an example. Suppose we have an image of size 12×12 composed of 3 channels. We want to apply a convolution of 5×5 on this input and get an output of 8×8×256.

In the first part, depth-wise convolution, we convolve the input image without changing its depth, using 3 kernels of shape 5×5×1. This gives an 8×8×3 intermediate output.

The point-wise convolution is so named because it uses a 1×1 kernel or a kernel that iterates through every single point. This kernel has a depth of however many channels the input image has; 3 in our case. Therefore, we iterate a 1×1×3 kernel through our 8×8×3 image, to get an 8×8×1 image.

To get the 8×8×256 output, we simply increase the number of 1×1×3 kernels to 256.
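The multiplication counts for this 12×12×3 → 8×8×256 example can be tallied directly, showing why the separable version is so much cheaper:

```python
# Multiplications for the 12x12x3 -> 8x8x256 example with a 5x5 kernel.
H_out, W_out, C_in, C_out, K = 8, 8, 3, 256, 5

standard = C_out * H_out * W_out * (K * K * C_in)   # 256 full 5x5x3 kernels
depthwise = C_in * H_out * W_out * (K * K)          # 3 kernels of 5x5x1
pointwise = C_out * H_out * W_out * (1 * 1 * C_in)  # 256 kernels of 1x1x3
separable = depthwise + pointwise

print(standard)   # 1228800
print(separable)  # 53952
print(round(standard / separable, 1))  # ~22.8x fewer multiplications
```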

Encoder Architecture
DeepLab V3+ encoder uses Xception architecture with the following modifications —

  • We add more layers in the middle flow
  • All the max pooling operations are replaced by depth-wise separable convolutions with striding
  • Extra batch normalization and ReLU are added after each 3×3 depth-wise convolution.

The output of the encoder is a feature map 16 times smaller than the input. This is compensated for by the decoder, which up-samples the encoder feature map by 4x twice (refer to the model architecture diagram).

5.2.2. Training
I have trained the model using Keras Adam optimizer with the default learning rate for 47 epochs. The loss function that the optimizer tries to minimize is bce_dice_loss, defined earlier in section 4.4

As in the case of U-Net, I have saved the weights of the model with the best validation dice_coefficient.

5.2.3. Training plots

5.2.4. Testing
The figure below shows some sample images from validation data alongside their ground truth mask and predicted mask.

Rebuilding the model with the original input size (256, 1600, 3) and loading the weights of the model trained on half-size images did not work well in this case. I had to use a different strategy: I used the trained model to predict on 128×800 images and resized the predicted masks to 256×1600. This approach worked very well for DeepLab V3+.
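A sketch of this resizing step, using simple nearest-neighbour upsampling in place of whatever interpolation the actual code uses:

```python
import numpy as np

def upscale_mask(mask, factor=2):
    """Nearest-neighbour upsampling: repeat each pixel `factor` times per axis."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

pred = np.random.randint(0, 2, (128, 800, 4), dtype=np.uint8)  # half-size prediction
full = upscale_mask(pred)
print(full.shape)  # (256, 1600, 4)
```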

Results comparison and Final submission

My final submission was DeepLab V3+, which gave a decent score on both the Private and Public leaderboards.

This Kaggle competition was a popular one and many people have solved this problem using different approaches. However, most of them have used one or another variant of U-Net or similar encoder-decoder architectures.

I have used simple U-Net as my first cut solution, which gives a decent performance on test data, thanks to the train on half-size and predict on full-size strategy.

I have implemented DeepLab V3+, a state-of-the-art technique for semantic image segmentation, from scratch. It helped me improve my score from 0.809 to 0.838.

  • Some hyper-parameter tuning can be done with U-Net.
  • Other image segmentation architectures like U-Net++, SegNet, and Mask R-CNN can be tried.
  • The idea of Transfer Learning with various backbones trained on large datasets can be utilized.

Thank you for reading such a long blog, I appreciate your patience. I thoroughly enjoyed writing it, hope you enjoyed reading it too.

I have skipped most of the code as I didn’t want to overwhelm the readers with code. Please refer to my GitHub repository for the complete Keras code.

If you have any queries, suggestions or discussions, please feel free to drop them in the comments section below. I will try to address them to the best of my knowledge.
You can connect with me on LinkedIn, here’s my profile.

AttendSeg: on-device semantic segmentation

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

A new neural network architecture designed by artificial intelligence researchers at DarwinAI and the University of Waterloo will make it possible to perform image segmentation on computing devices with limited power and compute capacity.

Segmentation is the process of determining the boundaries and areas of objects in images. We humans perform segmentation without conscious effort, but it remains a key challenge for machine learning systems. It is vital to the functionality of mobile robots, self-driving cars, and other artificial intelligence systems that must interact and navigate the real world.

Until recently, segmentation required large, compute-intensive neural networks. This made it difficult to run these deep learning models without a connection to cloud servers.

In their latest work, the scientists at DarwinAI and the University of Waterloo have managed to create a neural network that provides near-optimal segmentation and is small enough to fit on resource-constrained devices. Called AttendSeg, the neural network is detailed in a paper that has been accepted at this year’s Conference on Computer Vision and Pattern Recognition (CVPR).

Object classification, detection, and segmentation

One of the key reasons for the growing interest in machine learning systems is the problems they can solve in computer vision. Some of the most common applications of machine learning in computer vision include image classification, object detection, and segmentation.

Image classification determines whether a certain type of object is present in an image or not. Object detection takes image classification one step further and provides the bounding box where detected objects are located.

Segmentation comes in two flavors: semantic segmentation and instance segmentation. Semantic segmentation specifies the object class of each pixel in an input image. Instance segmentation separates individual instances of each type of object. For practical purposes, the output of segmentation networks is usually presented by coloring pixels. Segmentation is by far the most complicated type of classification task.

image classification vs object detection vs semantic segmentation

The complexity of convolutional neural networks (CNNs), the deep learning architecture commonly used in computer vision tasks, is usually measured by the number of parameters they have. The more parameters a neural network has, the more memory and computational power it requires.

RefineNet, a popular semantic segmentation neural network, contains more than 85 million parameters. At 4 bytes per parameter, it means that an application using RefineNet requires at least 340 megabytes of memory just to run the neural network. And given that the performance of neural networks is largely dependent on hardware that can perform fast matrix multiplications, it means that the model must be loaded on the graphics card or some other parallel computing unit, where memory is more scarce than the computer’s RAM.

Machine learning for edge devices

Due to their hardware requirements, most applications of image segmentation need an internet connection to send images to a cloud server that can run large deep learning models. The cloud connection can pose additional limits to where image segmentation can be used. For instance, if a drone or robot will be operating in environments where there’s no internet connection, then performing image segmentation will become a challenging task. In other domains, AI agents will be working in sensitive environments and sending images to the cloud will be subject to privacy and security constraints. The lag caused by the roundtrip to the cloud can be prohibitive in applications that require real-time response from the machine learning models. And it is worth noting that network hardware itself consumes a lot of power, and sending a constant stream of images to the cloud can be taxing for battery-powered devices.

For all these reasons (and a few more), edge AI and tiny machine learning (TinyML) have become hot areas of interest and research both in academia and in the applied AI sector. The goal of TinyML is to create machine learning models that can run on memory- and power-constrained devices without the need for a connection to the cloud.

AttendSeg architecture

With AttendSeg, the researchers at DarwinAI and the University of Waterloo tried to address the challenges of on-device semantic segmentation.

“The idea for AttendSeg was driven by both our desire to advance the field of TinyML and market needs that we have seen as DarwinAI,” Alexander Wong, co-founder at DarwinAI and Associate Professor at the University of Waterloo, told TechTalks. “There are numerous industrial applications for highly efficient edge-ready segmentation approaches, and that’s the kind of feedback along with market needs that I see that drives such research.”

The paper describes AttendSeg as “a low-precision, highly compact deep semantic segmentation network tailored for TinyML applications.”

The AttendSeg deep learning model performs semantic segmentation at an accuracy that is almost on-par with RefineNet while cutting down the number of parameters to 1.19 million. Interestingly, the researchers also found that lowering the precision of the parameters from 32 bits (4 bytes) to 8 bits (1 byte) did not result in a significant performance penalty while enabling them to shrink the memory footprint of AttendSeg by a factor of four. The model requires little above one megabyte of memory, which is small enough to fit on most edge devices.
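The memory arithmetic in this paragraph and the earlier RefineNet paragraph can be checked quickly (using decimal megabytes, as the article does):

```python
def param_memory_mb(n_params, bytes_per_param=4):
    """Approximate parameter memory in (decimal) megabytes."""
    return n_params * bytes_per_param / 1e6

print(param_memory_mb(85_000_000))    # RefineNet at 32 bits: 340.0 MB
print(param_memory_mb(1_190_000, 1))  # AttendSeg at 8 bits: 1.19 MB
```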

“[8-bit parameters] do not pose a limit in terms of generalizability of the network based on our experiments, and illustrate that low precision representation can be quite beneficial in such cases (you only have to use as much precision as needed),” Wong said.

AttendSeg vs other networks

Attention condensers for computer vision

AttendSeg leverages “attention condensers” to reduce model size without compromising performance. Self-attention mechanisms are a family of techniques that improve the efficiency of neural networks by focusing on the information that matters. Self-attention techniques have been a boon to the field of natural language processing. They have been a defining factor in the success of deep learning architectures such as Transformers. While previous architectures such as recurrent neural networks had a limited capacity on long sequences of data, Transformers used self-attention mechanisms to expand their range. Deep learning models such as GPT-3 leverage Transformers and self-attention to churn out long strings of text that (at least superficially) maintain coherence over long spans.

AI researchers have also leveraged attention mechanisms to improve the performance of convolutional neural networks. Last year, Wong and his colleagues introduced attention condensers as a very resource-efficient attention mechanism and applied them to image classifier machine learning models.

“[Attention condensers] allow for very compact deep neural network architectures that can still achieve high performance, making them very well suited for edge/TinyML applications,” Wong said.

Attention condenser architecture

Machine-driven design of neural networks

One of the key challenges of designing TinyML neural networks is finding the best performing architecture while also adhering to the computational budget of the target device.

To address this challenge, the researchers used “generative synthesis,” a machine learning technique that creates neural network architectures based on specified goals and constraints. Basically, instead of manually fiddling with all kinds of configurations and architectures, the researchers provide a problem space to the machine learning model and let it discover the best combination.

“The machine-driven design process leveraged here (Generative Synthesis) requires the human to provide an initial design prototype and human-specified desired operational requirements (e.g., size, accuracy, etc.) and the MD design process takes over in learning from it and generating the optimal architecture design tailored around the operational requirements and task and data at hand,” Wong said.

For their experiments, the researchers used machine-driven design to tune AttendSeg for Nvidia Jetson, hardware kits for robotics and edge AI applications. But AttendSeg is not limited to Jetson.

“Essentially, the AttendSeg neural network will run fast on most edge hardware compared to previously proposed networks in literature,” Wong said. “However, if you want to generate an AttendSeg that is even more tailored for a particular piece of hardware, the machine-driven design exploration approach can be used to create a new highly customized network for it.”

AttendSeg has obvious applications for autonomous drones, robots, and vehicles, where semantic segmentation is a key requirement for navigation. But on-device segmentation can have many more applications.

“This type of highly compact, highly efficient segmentation neural network can be used for a wide variety of things, ranging from manufacturing applications (e.g., parts inspection / quality assessment, robotic control) to medical applications (e.g., cell analysis, tumor segmentation), satellite remote sensing applications (e.g., land cover segmentation), and mobile applications (e.g., human segmentation for augmented reality),” Wong said.


A 2021 guide to Semantic Segmentation


Deep learning has been very successful when working with images as data and is currently at a stage where it works better than humans on multiple use-cases. The most important problems that humans have been interested in solving with computer vision are image classification, object detection and segmentation, in increasing order of difficulty.

In the plain old task of image classification we are just interested in getting the labels of all the objects that are present in an image. In object detection we go a step further and try to find, along with the objects present in the image, the locations at which they are present, with the help of bounding boxes. Image segmentation takes it to a new level by trying to accurately find the exact boundary of each object in the image.

In this article we will go through this concept of image segmentation, discuss the relevant use-cases, different neural network architectures involved in achieving the results, metrics and datasets to explore.

What is image segmentation

We know an image is nothing but a collection of pixels. Image segmentation is the process of classifying each pixel in an image as belonging to a certain class, and hence can be thought of as a classification problem per pixel. There are two types of segmentation techniques:

  1. Semantic segmentation :- Semantic segmentation is the process of classifying each pixel as belonging to a particular label. It doesn't differentiate between different instances of the same object. For example, if there are 2 cats in an image, semantic segmentation gives the same label to all the pixels of both cats
  2. Instance segmentation :- Instance segmentation differs from semantic segmentation in that it gives a unique label to every instance of a particular object in the image. As can be seen in the image above, all 3 dogs are assigned different colours, i.e. different labels. With semantic segmentation, all of them would have been assigned the same colour.

So where would we need this kind of algorithm?

Use-cases of image segmentation

Handwriting Recognition :- Junjo et al. demonstrated how semantic segmentation is used to extract words and lines from handwritten documents in their 2019 research paper on recognising handwritten characters

Google portrait mode :- There are many use-cases where it is essential to separate foreground from background. For example, in Google's portrait mode the background is blurred out while the foreground remains unchanged, giving a pleasing depth effect.

YouTube stories :- Google recently released YouTube Stories, a feature that lets content creators show different backgrounds while creating stories.

Virtual make-up :- Applying virtual lipstick is now possible with the help of image segmentation.

Virtual try-on :- Virtual try-on of clothes is an interesting feature which used to require specialized in-store hardware to create a 3D model. With deep learning and image segmentation the same can be achieved using just a 2D image.

Visual Image Search :- The idea of segmenting out clothes is also used in image retrieval algorithms in eCommerce. For example, Pinterest and Amazon allow you to upload a picture and retrieve similar-looking products by running an image search on the segmented clothing.

Self-driving cars :- Self-driving cars need a complete, pixel-perfect understanding of their surroundings. Hence image segmentation is used to identify lanes and other critical information.

Nanonets helps fortune 500 companies enable better customer experiences at scale using Semantic Segmentation.

Methods and Techniques

Before the advent of deep learning, classical machine learning techniques like SVM, Random Forest and K-means Clustering were used to solve the problem of image segmentation. But as with most image-related problems, deep learning has worked comprehensively better than the existing techniques and is now the norm for semantic segmentation. Let's review the techniques being used to solve the problem.

Fully Convolutional Network

The general architecture of a CNN consists of a few convolutional and pooling layers followed by a few fully connected layers at the end. The Fully Convolutional Network (FCN) paper, released in 2014, argues that the final fully connected layer can be thought of as a 1x1 convolution that covers the entire region.

Hence the final dense layers can be replaced by convolution layers achieving the same result. The advantage of doing this is that the size of the input need not be fixed anymore. With dense layers the input size is constrained, so any differently sized input has to be resized. By replacing the dense layers with convolutions, this constraint disappears.
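A minimal NumPy sketch of this equivalence (random weights, shapes chosen purely for illustration): a dense layer over a 1x1 feature map and a 1x1 convolution compute the same thing, but only the convolution accepts a larger input:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out = 512, 10
W = rng.normal(size=(c_out, c_in))          # dense-layer weights

def conv1x1(feat, W):
    # feat: (c_in, H, W) -> (c_out, H, W); a 1x1 conv is just a
    # per-pixel matrix multiply over the channel dimension.
    return np.einsum('oc,chw->ohw', W, feat)

# On a 1x1 feature map this reproduces the fully connected layer...
feat = rng.normal(size=(c_in, 1, 1))
dense_out = W @ feat[:, 0, 0]
conv_out = conv1x1(feat, W)
assert np.allclose(dense_out, conv_out[:, 0, 0])

# ...but unlike a dense layer it also accepts a larger input,
# producing a coarse class heatmap instead of a single prediction.
big = rng.normal(size=(c_in, 7, 7))
print(conv1x1(big, W).shape)   # (10, 7, 7)
```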

When a bigger image is provided as input, the output produced is a feature map rather than a single class output as for a normal-sized input. Moreover, the final feature map behaves like a heatmap of the required class, i.e. the position of the object is highlighted in the feature map. Since the feature map is a heatmap of the required object, it carries valid information for our use-case of segmentation.

Since the feature map obtained at the output layer is down-sampled due to the set of convolutions performed, we want to up-sample it using an interpolation technique. Bilinear up-sampling works, but the paper proposes learned up-sampling with deconvolution, which can even learn a non-linear up-sampling.

The down-sampling part of the network is called an encoder and the up-sampling part a decoder. This is a pattern we will see in many architectures: reducing the size with an encoder and then up-sampling with a decoder. In an ideal world we would not down-sample with pooling and would keep the same size throughout, but that would lead to a huge number of parameters and would be computationally infeasible.

Although the results obtained are decent, the output is rough and not smooth. The reason is the loss of information at the final feature layer, which is down-sampled 32 times by the convolution layers. It is very difficult for the network to do 32x up-sampling from this little information. This architecture is called FCN-32.

To address this issue, the paper proposed 2 other architectures, FCN-16 and FCN-8. In FCN-16, information from the previous pooling layer is used along with the final feature map, so the network only has to learn 16x up-sampling, which is better than FCN-32. FCN-8 improves on this further by including information from one more earlier pooling layer.


U-Net builds on top of the fully convolutional network described above. It was built for biomedical image segmentation. It also consists of an encoder which down-samples the input image to a feature map, and a decoder which up-samples the feature map back to the input image size using learned deconvolution layers.

The main contribution of the U-Net architecture is its shortcut connections. We saw above that FCN, because it down-samples the image in the encoder, loses a lot of information which can't easily be recovered in the decoder. FCN tries to address this by taking information from pooling layers before the final feature layer.

U-Net proposes a new approach to this information-loss problem. It sends information to every up-sampling layer in the decoder from the corresponding down-sampling layer in the encoder, as can be seen in the figure above, capturing finer information while keeping the computation low. Since the layers at the beginning of the encoder contain more spatial information, they bolster the up-sampling operation of the decoder by providing fine details of the input images, improving the results considerably. The paper also suggested a novel loss function which we will discuss below.


DeepLab, from a group of researchers at Google, proposed a multitude of techniques to improve the existing results and get finer output at lower computational cost. The 3 main improvements suggested as part of the research are

1) Atrous convolutions
2) Atrous Spatial Pyramidal Pooling
3) Conditional Random Fields usage for improving final output
Let's discuss each of these.

Atrous Convolution

One of the major problems with the FCN approach is the excessive downsizing due to consecutive pooling operations. Due to the series of pooling layers, the input image is down-sampled by 32x and then up-sampled again to get the segmentation result. Down-sampling by 32x causes a loss of information which is crucial for getting fine output in a segmentation task. Deconvolution to up-sample by 32x is also a computation- and memory-expensive operation, since additional parameters are involved in the learned up-sampling.

The paper proposes the use of atrous convolution, also called hole or dilated convolution, which captures a larger context with the same number of parameters.

Dilated convolution works by increasing the effective size of the filter, inserting zeros (called holes) between the filter parameters. The number of zeros inserted between the parameters is called the dilation rate. A rate of 1 is just normal convolution. With a rate of 2, one zero is inserted between every pair of parameters, making a 3x3 filter span a 5x5 area. It thus captures the context of a 5x5 convolution while having only 3x3 parameters. Similarly, at rate 3 the receptive field grows to 7x7.
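The receptive-field arithmetic above can be checked directly, along with a toy 1D dilated convolution (a sketch of the idea, not DeepLab's implementation):

```python
import numpy as np

def effective_kernel(k, rate):
    # A k x k filter dilated by `rate` spans k + (k-1)*(rate-1) input cells.
    return k + (k - 1) * (rate - 1)

assert effective_kernel(3, 1) == 3   # plain convolution
assert effective_kernel(3, 2) == 5   # one zero between taps
assert effective_kernel(3, 3) == 7

# 1D dilated convolution: same 3 weights, wider context.
def dilated_conv1d(x, w, rate):
    k = len(w)
    span = k + (k - 1) * (rate - 1)
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, rate=2))  # each output sums x[i], x[i+2], x[i+4]
```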

In DeepLab, the last pooling layers are changed to have stride 1 instead of 2, keeping the down-sampling rate to only 8x. Then a series of atrous convolutions are applied to capture the larger context. For training, the labelled output mask is down-sampled by 8x so each pixel can be compared. For inference, bilinear up-sampling is used to produce an output of the same size, which gives decent results at lower computational/memory cost, since bilinear up-sampling needs no parameters, as opposed to deconvolution.


Spatial Pyramid Pooling is a concept introduced in SPPNet to capture multi-scale information from a feature map. Before SPP, input images at different resolutions were supplied and the computed feature maps used together to get multi-scale information, but this takes more computation and time. With Spatial Pyramid Pooling, multi-scale information can be captured with a single input image.

With the SPP module the network produces 3 outputs of dimensions 1x1 (i.e. GAP), 2x2 and 4x4. These are flattened and concatenated into a 1D vector, thus capturing information at multiple scales. Another advantage of SPP is that input images of any size can be provided.
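A rough NumPy sketch of the pooling (max pooling here; exact pooling choices vary) showing that the output length is fixed regardless of input size:

```python
import numpy as np

def spp(feat, levels=(1, 2, 4)):
    # feat: (C, H, W). Pool to each grid size and flatten; the result
    # length depends only on C and the levels, not on H or W.
    c, h, w = feat.shape
    parts = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                patch = feat[:, i*h//n:(i+1)*h//n, j*w//n:(j+1)*w//n]
                parts.append(patch.max(axis=(1, 2)))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
# Two inputs of different spatial size -> identical output length.
v1 = spp(rng.normal(size=(256, 32, 32)))
v2 = spp(rng.normal(size=(256, 20, 24)))
print(v1.shape, v2.shape)   # (5376,) (5376,)  = 256 * (1 + 4 + 16)
```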

ASPP takes the concept of fusing information from different scales and applies it to Atrous convolutions. The input is convolved with different dilation rates and the outputs of these are fused together.

As can be seen, the input is convolved with 3x3 filters of dilation rates 6, 12, 18 and 24 and the outputs are concatenated together since they are of the same size. A 1x1 convolution output is also added to the fused output. To provide global information, the GAP output is up-sampled and added as well. The fused output of the varied dilation rates, the 1x1 convolution and the GAP branch is passed through a 1x1 convolution to get the required number of channels.

Since the image to be segmented can be of any size, the multi-scale information from ASPP helps in improving the results.

Improving output with CRF

Pooling helps reduce the number of parameters in a neural network, but it also brings a property of invariance along with it. Invariance is the quality of a neural network being unaffected by slight translations in its input. Due to this property, the segmentation output obtained by a neural network is coarse and the boundaries are not concretely defined.

To deal with this, the paper proposes the use of a graphical model, the CRF. A Conditional Random Field operates as a post-processing step and tries to sharpen the boundaries of the results. It works by classifying a pixel based not only on its own label but also on the labels of neighbouring pixels. As can be seen in the above figure, the coarse boundary produced by the neural network becomes more refined after passing through the CRF.

DeepLab-v3 introduced batch normalization and suggested multiplying the dilation rate by (1, 2, 4) inside each ResNet block. Adding image-level features to the ASPP module, as covered in the ASPP discussion above, was also proposed as part of this paper.

DeepLab-v3+ proposed a decoder instead of plain bilinear 16x up-sampling. The decoder takes a hint from the decoders used by architectures like U-Net, which take information from encoder layers to improve the results. The encoder output is up-sampled 4x using bilinear up-sampling and concatenated with low-level features from the encoder; after a 3x3 convolution, the result is again up-sampled 4x. This approach yields better results than a direct 16x up-sampling. A modified Xception architecture is also proposed for the encoder instead of ResNet, and depthwise separable convolutions are used on top of atrous convolutions to reduce the number of computations.

Global Convolution Network

Semantic segmentation involves performing two tasks concurrently

i) Classification
ii) Localization

Classification networks are created to be invariant to translation and rotation, giving no importance to location information, whereas localization involves getting accurate details of location. Thus these two tasks are inherently contradictory. Most segmentation algorithms give more importance to localization, i.e. the second task in the above figure, and thus lose sight of global context. In this work the authors propose a way to give importance to the classification task as well, while not losing the localization information.

The authors propose to achieve this by using large kernels in the network, enabling dense connections and hence more information. This is achieved with the help of a GCN block, as can be seen in the above figure. A GCN block can be thought of as a k x k convolution filter where k can be larger than 3. To reduce the number of parameters, the k x k filter is split into two branches, k x 1 followed by 1 x k, and 1 x k followed by k x 1, whose outputs are summed. By increasing k, larger context is captured.
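Ignoring channels, the parameter savings of the factorization can be checked with simple arithmetic (illustrative numbers, not the paper's exact layer sizes):

```python
# Parameter count of a dense k x k filter vs the GCN factorization
# (two branches: k x 1 followed by 1 x k, and 1 x k followed by k x 1),
# with channels ignored for simplicity.
def dense_params(k):
    return k * k

def gcn_params(k):
    return 2 * (k * 1 + 1 * k)   # two branches, two thin filters each

for k in (7, 15):
    print(k, dense_params(k), gcn_params(k))
# For k = 15: 225 weights shrink to 60, while the receptive field
# stays 15 x 15.
```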

In addition, the authors propose a Boundary Refinement block, similar to a residual block in ResNet, consisting of a shortcut connection and a residual connection that are summed to get the result. Having a Boundary Refinement block was observed to improve the results at the boundaries of the segmentation.

Results showed that the GCN block improved the classification accuracy of pixels closer to the centre of an object, indicating the improvement caused by capturing long-range context, whereas the Boundary Refinement block helped improve the accuracy of pixels closer to the boundary.

See More Than Once – KSAC for Semantic Segmentation

The DeepLab family uses ASPP to capture information from multiple receptive fields with different atrous convolution rates. Although ASPP has been significantly useful in improving segmentation results, the architecture has some inherent problems. There is no information shared across the parallel layers in ASPP, which affects the generalization power of the kernels in each layer. Also, since each layer caters to a different set of training samples (smaller objects to smaller atrous rates, bigger objects to bigger atrous rates), the amount of data for each parallel layer is smaller, affecting overall generalizability. In addition, the number of parameters in the network increases linearly with the number of parallel layers, which can lead to overfitting.

To handle these issues, the authors propose a novel network structure called Kernel-Sharing Atrous Convolution (KSAC). As can be seen in the above figure, instead of a different kernel for each parallel layer as in ASPP, a single kernel is shared across them, improving the generalization capability of the network. Using KSAC instead of ASPP saves 62% of the parameters when dilation rates of 6, 12 and 18 are used.
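A rough parameter comparison for the parallel 3x3 branches only (channels folded into a constant; the paper's 62% figure covers the whole head, so these numbers are purely illustrative):

```python
# ASPP keeps one 3x3 kernel per dilation rate; KSAC shares a single
# 3x3 kernel across all rates.
k, rates = 3, (6, 12, 18)

aspp_kernels = len(rates) * k * k   # one kernel per rate
ksac_kernels = 1 * k * k            # a single shared kernel

print(aspp_kernels, ksac_kernels)   # 27 9
# Adding more rates to KSAC costs no extra kernel weights:
assert ksac_kernels == k * k
```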

Another advantage of KSAC is that the number of parameters is independent of the number of dilation rates used. Thus we can add as many rates as we like without increasing the model size. ASPP gives its best results with rates 6, 12, 18, but accuracy decreases with rates 6, 12, 18, 24, indicating possible overfitting; KSAC accuracy still improves considerably, indicating enhanced generalization capability.

This kernel sharing technique can also be seen as an augmentation in the feature space since the same kernel is applied over multiple rates. Similar to how input augmentation gives better results, feature augmentation performed in the network should help improve the representation capability of the network.

Video Segmentation

For use-cases like self-driving cars and robotics there is a need for real-time segmentation of the observed video. The architectures discussed so far are designed mostly for accuracy rather than speed, so applying them on a per-frame basis would be very slow.

Also, consecutive frames in a video generally overlap heavily, which could be used to improve both results and speed but is ignored by per-frame analysis. Using these cues, let's discuss architectures specifically designed for videos.


Spatio-Temporal FCN proposes using an FCN along with an LSTM to do video segmentation. We have already seen how an FCN can extract features for segmenting an image. LSTMs are a kind of neural network which can capture sequential information over time. STFCN combines the power of the FCN with the LSTM to capture both spatial and temporal information.

As can be seen in the above figure, STFCN consists of an FCN and a spatio-temporal module followed by deconvolution. The feature map produced by the FCN is sent to the spatio-temporal module, which also receives input from the previous frame's module. Based on both inputs, the module captures temporal information in addition to spatial information, and its output is up-sampled to the original image size using deconvolution, similar to how it is done in FCN.

Since the FCN and LSTM work together as part of STFCN, the network is end-to-end trainable and outperforms single-frame segmentation approaches. There are similar approaches where the LSTM is replaced by a GRU, but the concept of capturing both spatial and temporal information is the same.

Semantic Video CNNs through Representation Warping

This paper proposes the use of optical flow across adjacent frames as an extra input to improve segmentation results.

The suggested approach can be plugged into any standard architecture. The key ingredient is the NetWarp module. To compute the segmentation map, the optical flow between the current frame and the previous frame, Ft, is calculated and passed through a FlowCNN to get Λ(Ft); this process is called flow transformation. The result is passed through a warp module which also takes as input the feature map of an intermediate layer computed by passing the frame through the network. This gives a warped feature map, which is then combined with the intermediate feature map of the current layer, and the entire network is trained end to end. This architecture achieved SOTA results on the CamVid and Cityscapes video benchmark datasets.

Clockwork Convnets for Video Semantic Segmentation

This paper proposes to improve the execution speed of a segmentation network on videos by exploiting the fact that semantic information in a video changes slowly compared to pixel-level information. So the information in the final layers changes at a much slower pace than in the beginning layers, and the paper suggests updating different layers at different rates.

The above figure compares the rate of change for a mid-level layer, pool4, and a deep layer, fc7. On the left, since there is a lot of change across the frames, both layers show a change, but the change for pool4 is higher. On the right, there is not much change across the frames, so pool4 shows marginal change whereas fc7 shows almost none.

The research uses this observation to suggest that when there is not much change across frames there is no need to compute the features/outputs again, and the cached values from the previous frame can be used. Since the rate of change varies across layers, different clocks can be set for different sets of layers. When a clock ticks, the new outputs are calculated; otherwise the cached results are used. The rate of clock ticks can be statically fixed or dynamically learnt.
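A hypothetical sketch of the clock-based caching (the stage functions, periods and caching policy here are illustrative stand-ins, not the paper's implementation):

```python
# Each stage re-runs only when its clock ticks; otherwise its cached
# output is reused for the current frame.
def run_clockwork(frames, stages, periods):
    cache = [None] * len(stages)
    outputs = []
    for t, frame in enumerate(frames):
        x = frame
        for i, (stage, period) in enumerate(zip(stages, periods)):
            if t % period == 0 or cache[i] is None:
                cache[i] = stage(x)       # clock ticked: recompute
            x = cache[i]                  # otherwise reuse cached result
        outputs.append(x)
    return outputs

# Toy stages: shallow layers tick every frame, deep layers every 3rd.
calls = {'shallow': 0, 'deep': 0}
def shallow(x):
    calls['shallow'] += 1
    return x + 1
def deep(x):
    calls['deep'] += 1
    return x * 2

run_clockwork(list(range(6)), [shallow, deep], periods=[1, 3])
print(calls)   # {'shallow': 6, 'deep': 2}
```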

Low-Latency Video Semantic Segmentation

This paper improves on the above by adaptively selecting the frames for which to compute the segmentation map, rather than using a fixed timer or heuristic to decide when to reuse the cached result.

The paper proposes dividing the network into 2 parts: low-level features and high-level features. The cost of computing low-level features is much less than that of higher features. The research suggests using the low-level features as an indicator of change in the segmentation map: the authors observed a strong correlation between changes in low-level features and changes in the segmentation map. So, to decide whether the higher features need to be recomputed, the difference in lower features across 2 frames is computed and compared against a threshold. This entire process is automated by a small neural network whose task is to take the lower features of two frames and predict whether the higher features should be computed. Since the decision is based on the input frames, it is dynamic compared to the above approach.
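A sketch of the adaptive scheduling idea (all names, the feature-difference measure and the fixed threshold are illustrative; the paper learns the decision with a small network rather than using a hand-set threshold):

```python
import numpy as np

# Compare cheap low-level features of the current frame against the last
# key frame; recompute the expensive head only when they differ enough.
def segment_stream(frames, low_net, high_net, threshold):
    key_low, key_out, outputs = None, None, []
    for frame in frames:
        low = low_net(frame)                       # always cheap
        if key_low is None or np.abs(low - key_low).mean() > threshold:
            key_low, key_out = low, high_net(low)  # expensive, rare
        outputs.append(key_out)
    return outputs

high_calls = []
low_net = lambda f: f
def high_net(low):
    high_calls.append(1)
    return low * 10

# Static scene for 4 frames, then a scene change on the 5th.
frames = [np.ones(4), np.ones(4), np.ones(4) * 1.01,
          np.ones(4), np.ones(4) * 5]
segment_stream(frames, low_net, high_net, threshold=0.5)
print(len(high_calls))   # 2: once at the start, once at the change
```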

Segmentation for point clouds

Data coming from a sensor such as lidar is stored in a format called a point cloud. A point cloud is nothing but an unordered set of 3D data points (or points of any dimension). It is a sparse representation of the scene in 3D, and a CNN can't be directly applied in such a case. Any architecture designed to deal with point clouds must also take into account that it is an unordered set and hence can have many possible permutations, so the network should be permutation invariant. The points in the cloud are also related by the distances between them: nearby points generally carry information useful for segmentation tasks.


PointNet is an important paper in the history of research on point clouds, using deep learning to solve the tasks of classification and segmentation. Let's study the architecture of PointNet.

The input of the network for n points is an n x 3 matrix. This is mapped to n x 64 using a shared multi-layer perceptron (a fully connected network applied per point), which is then mapped to n x 64, then n x 128 and n x 1024. Max pooling is applied to get a 1024-dimensional global vector, which is converted to k class outputs by passing through MLPs of sizes 512, 256 and k, as in any classification network.

Classification deals only with the global features, but segmentation needs local features as well. So the local features from the intermediate n x 64 layer are concatenated with the global features to get an n x 1088 matrix, which is sent through MLPs of 512 and 256 to get to n x 256, and then through MLPs of 128 and m to give m class outputs for every point in the point cloud.

The network also includes an input transform and a feature transform, whose task is not to change the shape of the input but to add invariance to affine transformations, i.e. translation, rotation, etc.
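The permutation-invariance argument can be checked with a minimal sketch (random weights standing in for PointNet's shared MLP):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared layer: the same weights applied to every point independently,
# followed by max pooling over the point dimension, a minimal sketch
# of PointNet's symmetric-function idea.
W = rng.normal(size=(3, 64))

def global_feature(points):
    per_point = np.maximum(points @ W, 0)   # shared layer + ReLU, (n, 64)
    return per_point.max(axis=0)            # max pool -> (64,) global vector

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(100)]

# Reordering the points does not change the global feature.
assert np.allclose(global_feature(cloud), global_feature(shuffled))
```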


A-CNN proposes the use of annular convolutions to capture spatial information. We know from CNNs that convolution operations capture the local information essential to understanding an image. A-CNN devised a new convolution, the annular convolution, which is applied to neighbourhood points in a point cloud.

The architecture takes as input n x 3 points and finds normals for them, which are used for ordering the points. A subsample of points is taken using the FPS algorithm, resulting in ni x 3 points. On these, annular convolution is applied to increase the dimensionality to 128. Annular convolution is performed on the neighbourhood points, which are determined using a KNN algorithm.

Another set of the above operations is performed to increase the dimensions to 256. Then an MLP is applied to change the dimensions to 1024, and pooling is applied to get a 1024-dimensional global vector, similar to PointNet. This entire part is considered the encoder. For classification, the encoder's global output is passed through an MLP to get c class outputs. For the segmentation task, both the global and local features are considered, as in PointNet, and then passed through an MLP to get m class outputs for each point.


Let's discuss the metrics which are generally used to evaluate the results of a segmentation model.

Pixel Accuracy

Pixel accuracy is the most basic metric for validating the results. It is the ratio of correctly classified pixels to the total number of pixels:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

The main disadvantage of this metric is that the result can look good when one class overpowers the other. For example, if the background class covers 90% of the input image, we can get 90% accuracy by just classifying every pixel as background.
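This failure mode in a few lines of NumPy (toy 10x10 masks):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    return (pred == gt).mean()

# 10x10 ground truth: 90 background (0) pixels, 10 object (1) pixels.
gt = np.zeros((10, 10), dtype=int)
gt[0, :] = 1
pred = np.zeros_like(gt)   # predict "background" everywhere

print(pixel_accuracy(pred, gt))   # 0.9 despite missing the object entirely
```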

Intersection Over Union

IOU is defined as the ratio of the intersection of the ground truth and predicted segmentation outputs to their union. When calculating for multiple classes, the IOU of each class is calculated and their mean is taken. It is a better metric than pixel accuracy: if every pixel is predicted as background in a 2-class example, the IOU is (90/100 + 0/100)/2, i.e. 45%, a much better representation of the result than the 90% accuracy.
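A minimal per-class IoU computation reproducing the 45% figure (toy masks):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union else 0.0)
    return np.mean(ious)

gt = np.zeros((10, 10), dtype=int)
gt[0, :] = 1
pred = np.zeros_like(gt)           # everything predicted as background

# Background IoU = 90/100, object IoU = 0/10 -> mean 0.45,
# a far more honest score than the 0.9 pixel accuracy.
print(mean_iou(pred, gt, 2))   # 0.45
```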

Frequency weighted IOU

This is an extension of the mean IOU discussed above, intended to account for class imbalance when one class, such as the background, dominates most of the images in a dataset. Instead of taking the plain mean of the per-class results, a weighted mean is taken based on the frequency of each class's region in the dataset.
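One common formulation (following the frequency-weighted IoU of the FCN paper, where each class IoU is weighted by its pixel frequency; a sketch with toy masks):

```python
import numpy as np

def freq_weighted_iou(pred, gt, num_classes):
    total = gt.size
    score = 0.0
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        freq = (gt == c).sum() / total       # class share of the pixels
        if union:
            score += freq * inter / union    # frequency-weighted, not mean
    return score

gt = np.zeros((10, 10), dtype=int)
gt[0, :] = 1
pred = np.zeros_like(gt)
print(freq_weighted_iou(pred, gt, 2))   # 0.9 * 0.9 + 0.1 * 0 = 0.81
```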

F1 Score

F1 score, the metric popularly used in classification, can be used for segmentation tasks as well to deal with class imbalance.

Average Precision

The area under the precision-recall curve for a chosen IOU threshold, averaged over the different classes, is used for validating the results.

Loss functions

A loss function guides the neural network towards the optimum. Let's discuss a few popular loss functions for the semantic segmentation task.

Cross Entropy Loss

The simple average of the cross-entropy classification loss over every pixel in the image can be used as the overall loss function. But this again suffers from class imbalance, which FCN proposes to rectify using class weights.

U-Net tries to improve on this by giving more weight to pixels near the borders of objects compared to interior pixels, making the network focus more on identifying borders rather than producing a coarse output.
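A sketch of per-pixel cross-entropy with a weight map (the border-weight map here is a made-up stand-in for U-Net's distance-based weighting):

```python
import numpy as np

def weighted_pixel_ce(probs, gt, weights):
    # probs: (H, W, C) softmax outputs; gt: (H, W) integer labels;
    # weights: per-pixel weight map (H, W), e.g. larger near borders.
    h, w = gt.shape
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], gt]
    return (weights * -np.log(p_true)).mean()

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(4, 4))   # (4, 4, 3), sums to 1
gt = rng.integers(0, 3, size=(4, 4))

uniform = np.ones((4, 4))
border = np.ones((4, 4))
border[:, 0] = 5.0     # upweight a made-up "border" column

print(weighted_pixel_ce(probs, gt, uniform))
print(weighted_pixel_ce(probs, gt, border))      # border errors cost more
```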

Focal Loss

Focal loss was designed to make the network focus on hard examples by giving them more weight, and to deal with the extreme class imbalance observed in single-stage object detectors. The same can be applied to semantic segmentation tasks as well.
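A sketch of the focusing effect (the per-pixel probabilities are made up; gamma = 2 as in the focal loss paper):

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    # p_true: probability assigned to the correct class per pixel.
    # (1 - p)^gamma shrinks the loss on easy, confident pixels.
    return ((1 - p_true) ** gamma * -np.log(p_true)).mean()

easy = np.array([0.9, 0.95, 0.99])   # well-classified pixels
hard = np.array([0.1, 0.2, 0.3])     # hard pixels

# Relative to plain cross-entropy, focal loss discounts the easy
# pixels far more aggressively than the hard ones.
ce = lambda p: (-np.log(p)).mean()
print(focal_loss(easy) / ce(easy))   # tiny ratio
print(focal_loss(hard) / ce(hard))   # much larger ratio
```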

Dice Loss

The Dice coefficient is nothing but the F1 score, and this loss function directly tries to optimize it. Similarly, a direct IOU score can be optimized as well.

Tversky Loss

It is a variant of Dice loss which gives different weights to FN and FP.
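Minimal soft Dice and Tversky losses on a toy mask (note the alpha/beta convention, i.e. which side weights FN vs FP, varies across papers; here alpha weights false negatives):

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-7):
    # Soft Dice over a probability map; equals 1 - F1 for hard masks.
    inter = (pred * gt).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def tversky_loss(pred, gt, alpha=0.7, beta=0.3, eps=1e-7):
    # alpha weights false negatives, beta false positives;
    # alpha = beta = 0.5 recovers Dice.
    tp = (pred * gt).sum()
    fn = ((1 - pred) * gt).sum()
    fp = (pred * (1 - gt)).sum()
    return 1 - (tp + eps) / (tp + alpha * fn + beta * fp + eps)

gt = np.array([1.0, 1.0, 1.0, 0.0])
pred = np.array([1.0, 0.0, 0.0, 1.0])   # two FNs, one FP

print(round(dice_loss(pred, gt), 3))      # 0.6
print(round(tversky_loss(pred, gt), 3))   # higher: FNs penalised harder
```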

Hausdorff distance

It is a technique used to measure the similarity between the boundaries of the ground truth and the prediction. It is calculated by finding the maximum distance from any point on one boundary to the closest point on the other. Directly reducing a boundary loss function is a recent trend and has been shown to give better results, especially in use-cases like medical image segmentation where identifying the exact boundary plays a key role.

The advantage of a boundary loss compared to a region-based loss like IOU or Dice loss is that it is unaffected by class imbalance, since only the boundary, not the entire region, is considered for optimization.

The two terms considered here are for two boundaries i.e the ground truth and the output prediction.
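The max-min construction can be written in a few lines (toy boundary point sets; real boundaries would be extracted from the masks):

```python
import numpy as np

def hausdorff(a, b):
    # a, b: (n, 2) arrays of boundary points. Symmetric Hausdorff
    # distance: the larger of the two directed max-min distances.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Ground-truth boundary vs a prediction shifted by one pixel in x.
gt = np.array([[0, 0], [1, 0], [2, 0]], dtype=float)
pred = gt + np.array([1.0, 0.0])

print(hausdorff(gt, pred))   # 1.0
```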

Annotation tools

LabelMe :-

Image annotation tool written in python.
Supports polygon annotation.
Open Source and free.
Runs on Windows, Mac, Ubuntu or via Anaconda, Docker
Link :-

Computer Vision Annotation Tool :-

Video and image annotation tool developed by Intel
Free and available online
Runs on Windows, Mac and Ubuntu
Link :-

Vgg image annotator :-

Free open source image annotation tool
Simple html page < 200kb and can run offline
Supports polygon annotation and points.
Link :-

Rectlabel :-

Paid annotation tool for Mac
Can use core ML models to pre-annotate the images
Supports polygons, cubic-bezier, lines, and points
Link :-

Labelbox :-

Paid annotation tool
Supports pen tool for faster and accurate annotation
Link :-


In this section, let's discuss various popular and diverse public datasets one can use to get started with training.

Pascal Context

This dataset is an extension of the Pascal VOC 2010 dataset; it goes beyond the original by providing annotations for the whole scene and has 400+ classes of real-world data.

Link :-

COCO Dataset

The COCO-Stuff dataset adds pixel-level annotations to 164k images of the original COCO dataset and is a common benchmark. It covers 172 classes: 80 thing classes, 91 stuff classes and 1 class 'unlabeled'.

Link :-

Cityscapes Dataset

This dataset consists of segmentation ground truths for roads, lanes, vehicles and other objects on the road. It contains 30 classes collected across 50 cities under different environmental and weather conditions. It also has a video dataset of finely annotated images which can be used for video segmentation. KITTI and CamVid are similar datasets which can be used for training self-driving cars.

Link :-

Lits Dataset

The dataset was created as part of a challenge to identify tumor lesions in liver CT scans. It contains 130 training CT scans and 70 testing CT scans.

Link :-

CCP Dataset

Clothing Co-Parsing is a dataset created as part of the research paper Clothing Co-Parsing by Joint Image Segmentation and Labeling. The dataset contains 1000+ images with pixel-level annotations for a total of 59 tags.

Source :-

Pratheepan Dataset

A dataset created for the task of skin segmentation, based on images from Google, containing 32 face photos and 46 family photos.

Link :-

Inria Aerial Image Labeling

A dataset of aerial segmentation maps created from public-domain images. It covers 810 sq. km and has 2 classes: building and not-building.

Link :-


This dataset contains the point clouds of six large-scale indoor areas from 3 buildings, along with over 70000 images.

Link :-


We have discussed a taxonomy of different algorithms for semantic segmentation of images, videos and point clouds, along with their contributions and limitations. We also looked at ways to evaluate the results and at datasets to get started with. This should give a comprehensive understanding of semantic segmentation as a topic.

To get a list of more resources for semantic segmentation, get started with

Further Reading

You might be interested in our latest posts on:

3D Graph Neural Networks for RGBD Semantic Segmentation

FusionNet: A Deep Fully Residual Convolutional Neural Network for Image Segmentation in Connectomics

1 Introduction

The brain is considered the most complex organ in the human body. Despite decades of intense research, our understanding of how its structure relates to its function remains limited (Lichtman and Denk, 2011). Connectomics research seeks to disentangle the complicated neuronal circuits embedded within the brain. This field has gained substantial attention recently thanks to the advent of new serial-section electron microscopy (EM) technologies (Briggman and Bock, 2012; Hayworth et al., 2014; Eberle and Zeidler, 2018; Zheng et al., 2018; Graham et al., 2019). The resolution afforded by EM is sufficient for resolving tiny but important neuronal structures that are often densely packed together, such as dendritic spine necks and synaptic vesicles. These structures are often only tens of nanometers in diameter (Helmstaedter, 2013). Figure 1 shows an example of such an EM image and its cell membrane segmentation. Such high-resolution imaging results in enormous datasets, approaching one petabyte for only the relatively small tissue volume of one cubic millimeter. Therefore, handling and analyzing EM datasets is one of the most challenging problems in connectomics.

FIGURE 1. An example EM image (left) and its manually extracted cellular membrane segmentation result (right) from the ISBI 2012 EM segmentation challenge (Arganda-Carreras et al., 2015). Scale bar (green): 500 nm.

Early connectomics research focused on the sparse reconstruction of neuronal circuits (Bock et al., 2011; Briggman et al., 2011), meaning they focused reconstruction efforts on a subset of neurons in the data using manual or semi-automatic tools (Jeong et al., 2010; Sommer et al., 2011; Cardona et al., 2012). Unfortunately, this approach requires too much human interaction to scale well over the vast amount of EM data that can be collected with new technologies. Therefore, developing scalable and automatic image analysis algorithms is an important and active research direction in the field of connectomics.

Although some EM image processing pipelines use conventional, light-weight pixel classifiers [e.g., RhoANA (Kaynig et al., 2015)], the majority of automatic image segmentation algorithms for connectomics rely on deep learning. Earlier automatic segmentation work using deep learning mainly focused on patch-based pixel-wise classification with a convolutional neural network (CNN) for affinity map generation (Turaga et al., 2010) and cell membrane probability estimation (Ciresan et al., 2012). However, one limitation of applying a conventional CNN to EM image segmentation is that per-pixel network deployment becomes prohibitively expensive considering the tera-scale to peta-scale EM data size. For this reason, more efficient and scalable deep neural networks are important for image segmentation of the large datasets that can now be produced. One approach is to extend a fully convolutional neural network (FCN) (Long et al., 2015), which uses encoding and decoding phases similar to an autoencoder, for the end-to-end semantic segmentation problem (Ronneberger et al., 2015; Chen et al., 2016a).

The motivation of the proposed work stems from our recent research effort to develop a deeper neural network for end-to-end cell segmentation with higher accuracy. We observed that, like conventional CNNs, a popular deep neural network for end-to-end segmentation known as U-net (Ronneberger et al., 2015) is limited by gradient vanishing with increasing network depth. To address this problem, we propose two extensions of U-net: using residual layers in each level of the network and introducing summation-based skip connections to make the entire network much deeper. Our segmentation method produces an accurate result that is competitive with similar EM segmentation methods. The main contribution of this study can be summarized as follows:

• We introduce an end-to-end automatic EM image segmentation method using deep learning. The proposed method combines a variant of U-net and residual CNN with novel summation-based skip connections to make the proposed architecture, a fully residual deep CNN. This new architecture directly employs residual properties within and across levels, thus providing a deeper network with higher accuracy.

• We demonstrate the performance of the proposed deep learning architecture by comparing it with several EM segmentation methods listed in the leader board of the ISBI 2012 EM segmentation challenge (Arganda-Carreras et al., 2015). Our method outperformed many of the top-ranked methods in terms of segmentation accuracy.

• We introduce a data enrichment method specifically built for EM data by collecting all the orientation variants of the input images (eight in the 2D case, including all combinations of flipping and rotation). We used the same augmentation process for deployment: the final output is a combination of eight different probability values, which increases the accuracy of the method.

• We demonstrate the flexibility of the proposed method on two different EM segmentation tasks. The first involves cell membrane segmentation on a fruit fly (Drosophila) EM dataset (Arganda-Carreras et al., 2015). The second involves cell nucleus feature segmentation on a whole-brain larval zebrafish EM dataset (Hildebrand et al., 2017).

2 Related Work

Deep neural networks (LeCun et al., 2015) have surpassed human performance in solving many complex visual recognition problems. Systems using this method can flexibly learn to recognize patterns in images (Krizhevsky et al., 2012), with deeper layers hierarchically corresponding to increasingly complex features (Zeiler and Fergus, 2014). A major drawback of deep neural networks is that they often require a huge amount of training data. To overcome this issue, researchers have collected large databases containing millions of images spanning hundreds of categories (Russakovsky et al., 2015). Largely thanks to such training datasets, many advanced architectures have been introduced, including VGG (Simonyan and Zisserman, 2014) and GoogleNet (Szegedy et al., 2015). With these architectures, computers are now able to perform even more complex tasks, such as transferring artistic styles from a source image to an unrelated target (Gatys et al., 2016). To leverage these new capabilities, researchers are actively working to extend deep learning methods for analyzing biomedical image data (Cicek et al., 2016). Developing such methods for automatic classification and segmentation of different biomedical image modalities, such as CT (Zheng et al., 2015) and MRI (Isin et al., 2016), is leading to faster and more accurate decision-making in laboratory and clinical settings.

Similarly, deep learning has been quickly adopted by connectomics researchers to enhance automatic EM image segmentation. One of the earliest applications to EM segmentation involved the straightforward application of a convolutional neural network (CNN) for pixel-wise membrane probability estimation (Ciresan et al., 2012), an approach that won the ISBI 2012 EM segmentation challenge (Arganda-Carreras et al., 2015). As more deep learning methods are introduced, automatic EM segmentation techniques evolve and new groups overtake the title of state-of-the-art performance in such challenges. One notable recent advancement was the introduction of a fully convolutional neural network (FCN) (Long et al., 2015) for end-to-end semantic segmentation. Inspired by this work, several modified FCNs have been proposed for EM image segmentation. One variant combined multi-level upscaling layers to produce a final segmentation (Chen et al., 2016a). Additional post-processing steps such as lifted multi-cut (Beier et al., 2016; Pape et al., 2019) further refined this segmentation result.

Another approach added skip connections for concatenating feature maps into a “U-net” architecture (Ronneberger et al., 2015). While U-net and its variants can learn multi-contextual information from input data, the depth of the network they can construct is limited by the vanishing gradient problem. On the other hand, the addition of shortcut connections and direct summations (He et al., 2016) allows gradients to flow across multiple layers during the training phase. Fusing the U-net design with such summation-based skip connections creates a fully residual CNN, similar to Fully Convolutional Residual Networks (FC-ResNets) (Drozdzal et al., 2016) and Residual Deconvolutional Networks (RDN) (Fakhry et al., 2017). These related studies inspired us to propose a fully residual CNN for analyzing connectomics data.

Work that leverages recurrent neural network (RNN) architectures can also accomplish this segmentation task (Stollenga et al., 2015). Instead of simultaneously considering all surrounding pixels and computing responses for the feature maps, RNN-based networks treat the pixels as a list or sequence with various routing rules and recurrently update each feature pixel. In fact, RNN-based membrane segmentation approaches are crucial for connected component labeling steps that can resolve false splits and merges during the post-processing of probability maps (Ensafi et al., 2014; Parag et al., 2015).

3 Methods

3.1 Network Architecture

Our proposed network, FusionNet, is based on the architecture of a convolutional autoencoder and is illustrated in Figure 2. It consists of an encoding path (upper half, from 640 × 640 to 40 × 40) that retrieves features of interest and a symmetric decoding path (lower half, from 40 × 40 to 640 × 640) that accumulates the feature maps from different scales to form the segmentation. Both the encoding and decoding paths consist of multiple levels (i.e., resolutions). Four basic building blocks are used to construct the proposed network. Each green block is a regular convolutional layer followed by rectified linear unit activation and batch normalization (omitted from the figure for simplicity). Each violet block is a residual layer that consists of three convolutional blocks and a residual skip connection. Each blue block is a maxpooling layer located between levels only in the encoding path to perform downsampling for feature compression. Each red block is a deconvolutional layer located between levels only in the decoding path to upsample the input data using learnable interpolations. A detailed specification of FusionNet, including the number of feature maps and their sizes, is provided in Table 1.

FIGURE 2. The proposed FusionNet architecture. An illustration of the encoding path (top to middle) and the decoding path (middle to bottom). Each intermediate residual block contains a residual skip connection within the same path, while the nested residual skip connections connect two different paths.

TABLE 1. Architecture of the proposed network.

One major difference between the FusionNet and U-net architectures is the way in which skip connections are used (Figure 3). In FusionNet, each level in the decoding path begins with a deconvolutional block (red) that un-pools the feature map from a coarser level (i.e., resolution), then merges it by pixel-wise addition with the feature map from the corresponding level in the encoding path by a long skip connection. There is also a short skip connection contained in each residual block (violet) that serves as a direct connection from the previous layer within the same encoding or decoding path. In contrast, U-net concatenates feature maps using only long skip connections. Additionally, by replacing concatenation with addition, FusionNet becomes a fully residual network, which resolves some common issues in deep networks (i.e., gradient vanishing). Furthermore, the nested short and long skip connections in FusionNet permit information flow within and across levels.
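The practical difference between the two merge strategies can be seen from the feature-map shapes alone. The following minimal NumPy sketch (toy arrays, not actual network code; sizes are arbitrary) shows that concatenation doubles the channel count while summation preserves it:

```python
import numpy as np

# Toy feature maps from corresponding encoder/decoder levels:
# shape (channels, height, width). Values are arbitrary.
encoder_feat = np.random.rand(64, 160, 160)
decoder_feat = np.random.rand(64, 160, 160)

# U-net style: concatenate along the channel axis -> channel count doubles,
# and the following convolution must learn to fuse the two halves.
unet_merge = np.concatenate([encoder_feat, decoder_feat], axis=0)

# FusionNet style: pixel-wise addition -> channel count is preserved and the
# encoder features act as an identity (residual) term for the decoder.
fusionnet_merge = encoder_feat + decoder_feat

print(unet_merge.shape)       # (128, 160, 160)
print(fusionnet_merge.shape)  # (64, 160, 160)
```

Because addition passes the encoder features through as an identity term, the decoder only needs to learn a residual correction, which is what makes the network fully residual.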

FIGURE 3. Difference between the core connections of U-net (Ronneberger et al., 2015) (left) and FusionNet (right). Note that FusionNet is a fully residual network due to the summation-based skip connections and is a much deeper network.

In the FusionNet encoding path, the number of feature maps doubles whenever downsampling is performed. After passing through the encoding path, the residual block at the bridge level (i.e., the 40 × 40 layer) starts to expand feature maps into the following decoding path. In the decoding path, the number of feature maps is halved at every level, which maintains network symmetry. Note that there are convolutional layers both before and after each residual block. These convolutional layers serve as gateways that adjust the number of feature maps entering and leaving each residual block. The placement of these convolutional layers on either side of the residual block makes the entire network perfectly symmetric (see Figure 2).
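As simple bookkeeping, the doubling/halving rule yields the following per-level feature-map counts (illustrative only, assuming a base of 64 kernels at the finest resolution; consult Table 1 for the exact FusionNet specification):

```python
# Feature-map counts per level under the doubling/halving rule. The base of
# 64 kernels and the level count are illustrative; see Table 1 for the exact
# FusionNet specification.
base, levels = 64, 5  # resolutions: 640 -> 320 -> 160 -> 80 -> 40 (bridge)

encoding = [base * 2**i for i in range(levels)]  # doubles at each downsample
decoding = encoding[::-1]                        # halves at each upsample

print(encoding)  # [64, 128, 256, 512, 1024]
print(decoding)  # [1024, 512, 256, 128, 64]
```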

FusionNet performs end-to-end segmentation from the input EM data to the output segmentation label prediction. We train the network with pairs of EM images and their corresponding manually segmented label images as input. The training process involves comparing the output prediction with the input target labels using a mean-absolute-error (MAE) loss function to back-propagate adjustments to the connection weights. We considered the network sufficiently trained when its loss function values plateaued over several hundred epochs.

3.2 Data Augmentation

Our system involves data augmentation in multiple stages during both the training and deployment phases.

For training:

• The order of the image and label pairs is shuffled and organized with three-fold cross-validation to improve the generalization of our method.

• Offline, all training images and labels are first reoriented to produce an enriched dataset.

• Online, elastic field deformation is applied to both images and corresponding labels, followed by noise addition to only the images.

For prediction:

• Offline, input images are reoriented as for training.

• Inference is performed on all reoriented images separately, then each intermediate result is reverted to the original orientation, and all intermediate results are averaged to produce the final prediction.

Boundary extension is performed for all input images and labels. We describe each augmentation step in more detail in the following subsections.

Reorientation enrichment: Different EM images typically share similar orientation-independent textures in structures such as mitochondria, axons, and synapses. We reasoned that it should therefore be possible to enrich our input data with seven additional image and label pairs by reorienting the EM images, and in the case of training, their corresponding labels. Figure 4 shows all eight orientations resulting from a single EM image after performing this data enrichment, with an overlaid letter “g” in each panel to provide a simpler view of the generated orientation. To generate these permutations, we rotated each EM image (and corresponding label) by 90°, 180°, and 270°. We then vertically reflected the original and rotated images. For training, each orientation was added as a new image and label pair. For prediction, inference was performed on each of these data orientations separately, then each prediction result was reverted to the original orientation before averaging to produce the final accumulation. Our intuition here is that, based on the equivariance of isotropic data, each orientation will contribute equally toward the final prediction result. Note that because the image and label pairs are enriched eight times by this process, other on-the-fly linear data augmentation techniques such as random rotation, flipping, or transposition are unnecessary.
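The enrichment described above can be sketched with NumPy as follows (a minimal illustration of the eight orientations and of the revert-and-average step used at prediction time):

```python
import numpy as np

def enrich_orientations(img):
    """Return the eight orientation variants described in the text:
    rotations by 0/90/180/270 degrees plus a vertical flip of each."""
    rotations = [np.rot90(img, k) for k in range(4)]
    flips = [np.flipud(r) for r in rotations]
    return rotations + flips

img = np.arange(16).reshape(4, 4)  # toy asymmetric "EM image"
variants = enrich_orientations(img)
print(len(variants))  # 8
# For an asymmetric image, all eight variants are distinct.
assert len({v.tobytes() for v in variants}) == 8

# At prediction time, each per-orientation output is reverted to the original
# orientation before averaging. Here the "predictions" are the variants
# themselves, so reverting and averaging must recover the input exactly.
preds = [np.flipud(v) if i >= 4 else v for i, v in enumerate(variants)]
reverted = [np.rot90(p, -(i % 4)) for i, p in enumerate(preds)]
final = np.mean(reverted, axis=0)
assert np.array_equal(final, img)
```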

FIGURE 4. Eight reoriented versions of the same EM image. The original image is outlined in blue. By adding these reoriented images, the input data size is increased by eight times.

Elastic field deformation: To avoid overfitting (i.e., network remembering the training data), elastic deformation was performed on the entire enriched image dataset for every training epoch. This strategy is common in machine learning, especially for deep networks, to overcome limitations associated with small training dataset sizes. This procedure is illustrated in Figure 5. We first initialized a random sparse 12 × 12 vector field whose amplitudes at the image border boundaries vanish to zero. This field was then interpolated to the input size and used to warp both EM images and corresponding labels. The flow map was randomly generated for each epoch. No elastic field deformation was performed during deployment.
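A minimal sketch of this deformation, assuming SciPy is available. The 12 × 12 grid matches the text, while the displacement amplitude is an illustrative choice:

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_warp(image, label, grid=12, amplitude=8.0, rng=None):
    """Warp an image/label pair with one shared random displacement field
    (sketch of the procedure in the text; amplitude is illustrative)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape

    # Sparse random displacement field whose amplitudes vanish at the borders.
    field = rng.uniform(-amplitude, amplitude, size=(2, grid, grid))
    field[:, [0, -1], :] = 0.0
    field[:, :, [0, -1]] = 0.0

    # Interpolate the sparse field up to the full image resolution.
    dy = zoom(field[0], (h / grid, w / grid), order=1)
    dx = zoom(field[1], (h / grid, w / grid), order=1)

    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = [yy + dy, xx + dx]

    warped_image = map_coordinates(image, coords, order=1, mode="reflect")
    # Nearest-neighbor interpolation keeps the warped label binary.
    warped_label = map_coordinates(label, coords, order=0, mode="reflect")
    return warped_image, warped_label

img = np.random.default_rng(0).random((96, 96))
lbl = (img > 0.5).astype(np.uint8)
wi, wl = elastic_warp(img, lbl)
assert wi.shape == img.shape and set(np.unique(wl)) <= {0, 1}
```

Note that the label is warped with nearest-neighbor (order = 0) interpolation so that it stays binary, while the image uses linear interpolation; both use the same random field so image and label stay aligned.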

FIGURE 5. Elastic field deformation example. A randomly sparse vector field (A) is generated for each training image and label pair. This sparse vector field is then used to warp both the original image data (B, left) and its corresponding label (C, left) to form an augmentation pair consisting of warped image data (B, middle) and warped label (C, middle). The difference between the original and warped images (B, right) and labels (C, right) show the effect of deformation.

Random noise addition: During the training phase only, we randomly added Gaussian noise (mean µ = 0, fixed variance) to each EM input image but not to its corresponding label.
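A short sketch of this step; the noise standard deviation here is an illustrative assumption, since the exact variance is not specified above:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((512, 512))          # normalized EM image (toy)
label = (image > 0.5).astype(np.uint8)  # corresponding binary label

sigma = 0.05  # noise standard deviation (illustrative assumption)
noisy_image = image + rng.normal(0.0, sigma, size=image.shape)

# The label is left untouched: only the network input is perturbed.
assert not np.array_equal(noisy_image, image)
assert np.array_equal(label, (image > 0.5).astype(np.uint8))
```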

Boundary extension: FusionNet accepts an input image size of 512 × 512. Each input image, and in the case of training its corresponding label, was automatically padded with mirror reflections of itself across the image borders (radius = 64 px) to maintain similar statistics for pixels near the edges. This padding is the reason why FusionNet starts with a 640 × 640 image, which is 128 px larger along each dimension than the original input. Because we performed convolution with a 3 × 3 kernel in “SAME” mode, the final segmentation has the same padded size. To account for this, the final output prediction was cropped to eliminate the padded regions.
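The padding arithmetic can be verified directly with NumPy's reflect-mode padding:

```python
import numpy as np

image = np.random.rand(512, 512)

# Mirror-pad by 64 px on every side, as described above: 512 + 2*64 = 640.
padded = np.pad(image, pad_width=64, mode="reflect")
print(padded.shape)  # (640, 640)

# After "SAME"-mode segmentation, the 640x640 output is cropped back to the
# original extent; cropping the padded input recovers it exactly.
cropped = padded[64:-64, 64:-64]
assert np.array_equal(cropped, image)
```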

3.3 Experimental Setup

FusionNet was implemented using the Keras open-source deep learning library (Chollet, 2015). This library provides an easy-to-use, high-level programming API written in Python, with Theano or TensorFlow as a back-end engine. The model was trained with the Adam optimizer and a decaying learning rate of 2e−4 for over 50,000 epochs to harness the benefits of heavy elastic deformation on the small annotated datasets. FusionNet has also been translated to PyTorch and pure TensorFlow for other applications, such as image-to-image translation (Lee et al., 2018) and MRI reconstruction (Quan et al., 2018). All training and deployment presented here was conducted on a system with an Intel i7 CPU, 32 GB RAM, and an NVIDIA GeForce GTX 1080 GPU.

3.4 Network Chaining

FusionNet by itself performs end-to-end segmentation from the EM data input to the final prediction output. In typical real-world applications of end-to-end segmentation approaches, however, manual proofreading by human experts is usually performed in an attempt to “correct” any mistakes in the output labels. We therefore reasoned that concatenating a chain of several FusionNet units could serve as a form of built-in refinement, similar to proofreading, that resolves ambiguities in the initial predictions. Figure 6 shows an example case with four chained FusionNet units (FusionNetW4). To impose a target-driven approach across the chained network during training, we calculate the loss between the output of each separate unit and the training labels. As a result, chained FusionNet architectures have a single input and multiple outputs, where the end of each decoding path serves as a checkpoint between units, each attempting to produce a better segmentation result.

FIGURE 6. FusionNetW4, a chain of four concatenated FusionNet units.

Since the architecture of each individual unit is the same, the chained FusionNet model can be thought of as similar to an unfolded Recurrent Neural Network (RNN), with each FusionNet unit akin to a single feedback cycle but with weights that are not shared across cycles. Each FusionNet can be considered a V-cycle in the multigrid method (Shapira, 2008) commonly used in numerical analysis, where the contraction in the encoding path is similar to restriction from a fine to a coarse grid, the expansion in the decoding path is similar to prolongation toward the final segmentation, and the skip connections play a role similar to relaxation. The simplest chain of two V-cycle units forms a W shape, so we refer to FusionNet chains using a “FusionNetW” terminology. To differentiate various configurations, we use a superscript to indicate how many FusionNet units are chained and a subscript to show the initial number of feature maps at the original resolution. For example, FusionNetW4 with subscript 64 would signify a network that chains four FusionNet units, each with the base number of convolution kernels (in Keras, the nb_filters parameter) set to 64. We chose four units as the maximum chain length here ad hoc, to roughly match the memory available on our GPU. We also used 64 as the base number of convolution kernels in every case to match the backbone architecture of U-net. During training, the weights of each FusionNet unit are updated independently, as opposed to the RNN strategy of averaging the gradients from shared weights. For the example FusionNetW4 case, we trained with the input images S and corresponding manual labels L. Each FusionNet unit in FusionNetW4, indexed as FusionNetW4[i] where i = 1, 2, 3, or 4, generates the prediction P[i] by minimizing the MAE loss between its prediction values and the target labels L. For each epoch, we incrementally train FusionNetW4[i] and fix its weights before training FusionNetW4[i + 1].
This procedure can be summarized as follows:
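The incremental schedule can be illustrated with a toy example in which each FusionNet unit is replaced by a single additive parameter (a drastic simplification, for illustration only): unit i is trained with a few MAE gradient steps while units 1 through i − 1 stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.random((32, 32))   # target labels (toy)
S = L - 1.0                # "input": the labels shifted by a constant offset

def mae(p, t):
    return np.abs(p - t).mean()

# Each "unit" is one additive parameter, a drastically simplified stand-in
# for one FusionNet in the chain. Unit i is trained while earlier units
# stay frozen, mirroring the schedule described above.
losses = []
x = S
for i in range(4):         # four chained units, as in FusionNetW4
    b = 0.0
    for _ in range(5):     # a few gradient steps on the MAE loss
        grad = np.sign(x + b - L).mean()
        b -= 0.05 * grad
    x = x + b              # freeze this unit; feed its output to the next one
    losses.append(mae(x, L))

# Later checkpoints in the chain produce successively better predictions.
print([round(l, 3) for l in losses])  # [0.75, 0.5, 0.25, 0.0]
assert all(b <= a for a, b in zip(losses, losses[1:]))
```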

The loss training curves decrease as i increases, eventually converging as the number of training epochs increases.

4 Results

4.1 Fruit Fly Data

The fruit fly (Drosophila) ventral nerve cord EM data used here was captured from a first instar larva (Cardona et al., 2010). Training and test datasets were provided as part of the ISBI 2012 EM segmentation challenge1 (Arganda-Carreras et al., 2015). Each dataset consisted of a 512 × 512 × 30 volume acquired at anisotropic resolution with transmission EM. These datasets were originally chosen in part because they contained noise and small image alignment errors that frequently occur in serial-section EM. For training, the provided dataset included EM image data and publicly available manual segmentation labels. The first 20 of 30 slices of the training volume were used for training and the last 10 slices were used for validation. For testing, the provided dataset included only EM image data, while segmentation labels were kept private for the assessment of segmentation accuracy (Arganda-Carreras et al., 2015). Test segmentations were produced for all 30 slices of the test volume and were then uploaded for comparison to the hidden ISBI Challenge segmentation labels.

Figure 7 illustrates the probability map extraction results on test data without any post-processing (middle) and with lifted multi-cut (LMC) algorithm post-processing (right) (Beier et al., 2017), which thins the probability map. As this shows, our chained FusionNet method is able to remove extraneous structures belonging to mitochondria (appearing as dark shaded textures) and vesicles (appearing as small circles). Uncertain regions in the prediction results without post-processing appear as blurry gray smears (highlighted by pink boxes). In cases like this, the network must decide whether or not the highlighted pixels should be segmented as membrane, but the region is ambiguous because of membrane smearing, likely due to anisotropy in the data.

FIGURE 7. Example results of cellular membrane segmentation on test data from the ISBI 2012 EM segmentation challenge (slice 22/30) illustrating an input EM image (left), the probability prediction from our chained network (middle), and the thinned probability prediction after applying LMC (Beier et al., 2017) post-processing (right). Pink boxes highlight uncertain regions that are ambiguous because of membrane smearing, likely due to anisotropy in the data.

FusionNet approaches outperformed several other methods in segmenting the ISBI 2012 EM challenge data by several standard metrics. These metrics include foreground-restricted Rand scoring after border thinning (Vrand) and foreground-restricted information-theoretic scoring after border thinning (Vinfo) (Arganda-Carreras et al., 2015). Quantitative comparisons with other methods are summarized in Table 2. Even using a single FusionNet unit, we achieved better results than many well-known methods, such as U-net (Ronneberger et al., 2015), network-in-network (Lin et al., 2014), fused-architecture (Chen et al., 2016a), and long short-term memory (LSTM) (Stollenga et al., 2015) approaches. A chained FusionNet with two units performed even better, surpassing the performance of many previous state-of-the-art deep learning methods (Chen et al., 2016b; Drozdzal et al., 2016). These results confirm that chaining a deeper architecture with a residual bottleneck helps to increase the accuracy of the EM segmentation task. Both with and without LMC post-processing, our method ranks among the top 10 in the ISBI 2012 EM segmentation challenge leaderboard (as of June 2020).

TABLE 2. Accuracy of various segmentation methods on the Drosophila EM dataset (ISBI 2012 EM segmentation challenge leaderboard, June 2020). Bold values correspond to the method presented here.

4.2 Zebrafish Data

The zebrafish EM data used here was taken from a publicly available database2. It was captured from a 5.5 days post-fertilization larval specimen. This specimen was cut into ∼18,000 serial sections and collected onto a tape substrate with an automated tape-collecting ultramicrotome (ATUM) (Hayworth et al., 2014). A series of images spanning the anterior quarter of the larval zebrafish was acquired from 16,000 sections using scanning EM (Hildebrand, 2015; Hildebrand et al., 2017). All 2D images were then co-registered into a 3D volume using an FFT signal whitening approach (Wetzel et al., 2016). For training, two small sub-volume crops were extracted from a near-final iteration of the full volume alignment in order to avoid deploying later segmentation runs on training data. Two training volumes that contained different tissue features were chosen. One volume was 512 × 512 × 512 and the other was 512 × 512 × 256. The blob-like features of interest—neuronal nuclei—were manually segmented as area-lists in each training volume using the Fiji (Schindelin et al., 2012) TrakEM2 plug-in (Cardona et al., 2012). From each of these two training volumes, three quarters were used for training and one quarter was used for validation. These area-lists were exported as binary masks for use in the training procedure. For accuracy assessments, an additional non-overlapping 512 × 512 × 512 testing sub-volume and corresponding manual segmentation labels were used.

To assess the performance of our method on this segmentation task, we first deployed it on the 512 × 512 × 512 test volume alongside the U-net (Ronneberger et al., 2015) and RDN (Fakhry et al., 2017) methods. Figure 8 displays volume renderings of the zebrafish test set EM data, its manual cell nucleus segmentation, and segmentation results from U-net, RDN, and our method. As this shows, our method introduced fewer false predictions than U-net and RDN. Table 3 compares U-net, RDN, and our method using three quality metrics: foreground-restricted Rand scoring after border thinning (Vrand), foreground-restricted information-theoretic scoring after border thinning (Vinfo), and the Dice coefficient (Vdice). By all of these metrics, our method produced more accurate segmentation results.
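Of the three metrics, the Dice coefficient is the simplest to state: Vdice = 2|P ∩ T|/(|P| + |T|) for a predicted binary mask P and ground-truth mask T. A minimal NumPy implementation with toy masks for illustration:

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient V_dice = 2|P ∩ T| / (|P| + |T|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

# Toy example: a 4x4 "nucleus" and a prediction shifted by one pixel.
truth = np.zeros((8, 8), dtype=np.uint8); truth[2:6, 2:6] = 1  # 16 px
pred = np.zeros((8, 8), dtype=np.uint8); pred[3:7, 3:7] = 1    # 16 px, shifted

print(dice(pred, truth))  # 2*9 / (16 + 16) = 0.5625
```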

FIGURE 8. Visual comparison of the larval zebrafish EM volume segmentation. (A) Input serial-section EM volume. (B) Manual segmentation (ground truth). (C) U-net (Ronneberger et al., 2015) result. (D) RDN (Fakhry et al., 2017) result. (E) Our result. Red arrows indicate errors.

TABLE 3. Segmentation accuracy on a test volume from the zebrafish EM dataset. Bold values correspond to the method presented here.

We also deployed the trained network to the complete set of 16,000 sections of the larval zebrafish brain, about 1.2 terabytes in data size. Figure 9 shows EM dataset cross-sections in the transverse (top, x-y) and horizontal (bottom, x-z) planes of the larval zebrafish overlaid with the cell nucleus segmentation results. The transverse view overlay also shows the sphericity of each segmented cell nucleus in a blue-to-red color map, which can help to visually identify the location of false positives.

FIGURE 9. Cell nucleus segmentation results overlaid onto zebrafish EM volume cross-sections through the transverse (top, blue to red color map varies with cell sphericity) and horizontal (bottom) planes.

5 Conclusions

In this paper, we introduced a deep neural network architecture for image segmentation with a focus on connectomics EM image analysis. The proposed architecture, FusionNet, extends the U-net and residual CNN architectures to develop a deeper network for a more accurate end-to-end segmentation. We demonstrated the flexibility and performance of FusionNet in membrane- and blob-type EM segmentation tasks.

Several other approaches share similarities with FusionNet, particularly in concatenated chain forms. Chen et al. proposed concatenating multiple FCNs to build a RNN that extracts inter-slice contexts (Chen et al., 2016b). Unlike FusionNet, this approach takes as input multiple different resolutions of the raw image to produce a single segmentation output and uses a single loss function. Wu proposed iteratively applying a pixel-wise CNN (ICNN) to refine membrane detection probability maps (MDPM) (Wu, 2015). In this method, a regular CNN for generating MDPM from the raw input images and an iterative CNN for refining MDPM are trained independently. In contrast, FusionNet is trained as a single chained network. Additionally, FusionNet can refine errors in MDPM more completely using a chained network (i.e., by correcting errors in the error-corrected results) and scales better to larger image sizes due to the end-to-end nature of the network. More in-depth analyses into why chaining approaches are beneficial to improve the prediction accuracy of such deep networks will be an important goal for future work.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author Contributions

TMQ developed the methods and performed the experiments with input from W-KJ and DGCH. Project supervision and funding were provided by W-KJ. All authors wrote the paper.


Funding

This work was supported by a NAVER Fellowship to TMQ and by a Leon Levy Foundation Fellowship in Neuroscience to DGCH.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Acknowledgments

We thank Woohyuk Choi and Jungmin Moon for assistance in creating the larval zebrafish EM volume renderings. This manuscript has been released as a pre-print (Quan et al., 2016).

References

Arganda-Carreras, I., Turaga, S. C., Berger, D. R., Cireşan, D., Giusti, A., Gambardella, L. M., et al. (2015). Crowdsourcing the Creation of Image Segmentation Algorithms for Connectomics. Front. Neuroanat. 9, 142. doi:10.3389/fnana.2015.00142

Beier, T., Andres, B., Köthe, U., and Hamprecht, F. A. (2016). “An Efficient Fusion Move Algorithm for the Minimum Cost Lifted Multicut Problem,” in Proceedings of ECCV 2016, Amsterdam, Netherlands, October 8–16, 2016 (Springer, Cham), 715–730.

Beier, T., Pape, C., Rahaman, N., Prange, T., Berg, S., Bock, D. D., et al. (2017). Multicut Brings Automated Neurite Segmentation Closer to Human Performance. Nat. Methods 14, 101–102. doi:10.1038/nmeth.4151

Bock, D. D., Lee, W.-C. A., Kerlin, A. M., Andermann, M. L., Hood, G., Wetzel, A. W., et al. (2011). Network Anatomy and In Vivo Physiology of Visual Cortical Neurons. Nature 471, 177–182. doi:10.1038/nature09802

Briggman, K. L., and Bock, D. D. (2012). Volume Electron Microscopy for Neuronal Circuit Reconstruction. Curr. Opin. Neurobiol. 22, 154–161. doi:10.1016/j.conb.2011.10.022

Briggman, K. L., Helmstaedter, M., and Denk, W. (2011). Wiring Specificity in the Direction-Selectivity Circuit of the Retina. Nature 471, 183–188. doi:10.1038/nature09818

Cardona, A., Saalfeld, S., Preibisch, S., Schmid, B., Cheng, A., Pulokas, J., et al. (2010). An Integrated Micro- and Macroarchitectural Analysis of the Drosophila Brain by Computer-Assisted Serial Section Electron Microscopy. PLoS Biol. 8 (10), e1000502. doi:10.1371/journal.pbio.1000502

Cardona, A., Saalfeld, S., Schindelin, J., Arganda-Carreras, I., Preibisch, S., Longair, M., et al. (2012). TrakEM2 Software for Neural Circuit Reconstruction. PLoS ONE 7 (6), e38011. doi:10.1371/journal.pone.0038011

Chen, H., Qi, X., Cheng, J., and Heng, P. (2016). “Deep Contextual Networks for Neuronal Structure Segmentation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, February 12–17, 2016 (AAAI Press), 1167–1173.

Google Scholar

Chen, J., Yang, L., Zhang, Y., Alber, M., and Chen, D. Z. (2016). “Combining Fully Convolutional and Recurrent Neural Networks for 3D Biomedical Image Segmentation,” in Proceedings of NIPS 2016, Barcelona, Spain, September 5, 2016 (Curran Associates, Inc.), 3036–3044.

Google Scholar

Cicek, O., Abdulkadir, A., Lienkamp, S. S., Brox, T., and Ronneberger, O. (2016). “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,ˮ in Proceedings of MICCAI 2016, Athens, Greece, October 17–21, 2016 (Springer, Cham), 424–432.

CrossRef Full Text | Google Scholar

Ciresan, D., Giusti, A., Gambardella, L. M., and Schmidhuber, J. (2012). “Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images,ˮ in Proceedings of NIPS 2012, Stateline, NV, December 3–8, 2012 (Curran Associates Inc.), 2843–2851.

Google Scholar

Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., and Pal, C. (2016). “The Importance of Skip Connections in Biomedical Image Segmentation,ˮ in Proceedings of DLMIA 2016, Athens, Greece, October 21, 2016 (Springer, Cham), 179–187.

CrossRef Full Text | Google Scholar

Eberle, A. L., and Zeidler, D. (2018). Multi-Beam Scanning Electron Microscopy for High-Throughput Imaging in Connectomics Research. Front. Neuroanat. 12, 112. doi:10.3389/fnana.2018.00112

PubMed Abstract | CrossRef Full Text | Google Scholar

Ensafi, S., Lu, S., Kassim, A. A., and Tan, C. L. (2014). “3D Reconstruction of Neurons in Electron Microscopy Images,” in Proceedings of IEEE EMBS 2014, Chicago, Illinois, August 27–31, 2014 (IEEE), 6732–6735.

Google Scholar

Fakhry, A., Peng, H., and Ji, S. (2016). Deep Models for Brain EM Image Segmentation: Novel Insights and Improved Performance. Bioinformatics 32, 2352–2358. doi:10.1093/bioinformatics/btw165

PubMed Abstract | CrossRef Full Text | Google Scholar

Fakhry, A., Zeng, T., and Ji, S. (2017). Residual Deconvolutional Networks for Brain Electron Microscopy Image Segmentation. IEEE Trans. Med. Imaging 36, 447–456. doi:10.1109/tmi.2016.2613019

PubMed Abstract | CrossRef Full Text | Google Scholar

Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). “Image Style Transfer Using Convolutional Neural Networks,ˮ in Proceedings of IEEE CVPR 2016, Las Vegas, NV, June 27–30, 2016 (IEEE), 2414–2423.

Google Scholar

Graham, B. J., Hildebrand, D. G. C., Kuan, A. T., Maniates-Selvin, J. T., Thomas, L. A., Shanny, B. L., et al. (2019). High-throughput Transmission Electron Microscopy With Automated Serial Sectioning. doi:10.1101/657346 Preprint. Available at: (Accessed June 2, 2019).

CrossRef Full Text | Google Scholar

Hayworth, K. J., Morgan, J. L., Schalek, R., Berger, D. R., Hildebrand, D. G. C., and Lichtman, J. W. (2014). Imaging ATUM Ultrathin Section Libraries With WaferMapper: A Multi-Scale Approach to EM Reconstruction of Neural Circuits. Front. Neural Circuits 8, 68. doi:10.3389/fncir.2014.00068

PubMed Abstract | CrossRef Full Text | Google Scholar

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep Residual Learning for Image Recognition,ˮ in Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 27–30, 2016 (IEEE), 770–778.

Google Scholar

Helmstaedter, M. (2013). Cellular-Resolution Connectomics: Challenges of Dense Neural Circuit Reconstruction. Nat. Methods 10, 501–507. doi:10.1038/nmeth.2476

PubMed Abstract | CrossRef Full Text | Google Scholar

Hildebrand, D. G. C. (2015). Whole-Brain Functional and Structural Examination in Larval Zebrafish. PhD thesis. Cambridge (MA): Harvard University, Graduate School of Arts and Sciences.

Google Scholar

Hildebrand, D. G. C., Cicconet, M., Torres, R. M., Choi, W., Quan, T. M., Moon, J., et al. (2017). Whole-Brain Serial-Section Electron Microscopy in Larval Zebrafish. Nature 545, 345–349. doi:10.1038/nature22356

PubMed Abstract | CrossRef Full Text | Google Scholar

Hirsch, P., Mais, L., and Kainmueller, D. (2020). “Patchperpix for Instance Segmentation,” in Proceedings of ECCV 2020, Glasgow, United Kingdom, August 23–28, 2020 (Springer, Cham), 288–304.

CrossRef Full Text | Google Scholar

Isin, A., Direkoglu, C., and Sah, M. (2016). Review of MRI-Based Brain Tumor Image Segmentation Using Deep Learning Methods. Procedia Comput. Sci. 102, 317–324. doi:10.1016/j.procs.2016.09.407

CrossRef Full Text | Google Scholar

Jeong, W.-K., Beyer, J., Hadwiger, M., Blue, R., Law, C., Vázquez-Reina, A., et al. (2010). Secrett and NeuroTrace: Interactive Visualization and Analysis Tools for Large-Scale Neuroscience Data Sets. IEEE Comput. Graphics Appl. 30, 58–70. doi:10.1109/MCG.2010.56

PubMed Abstract | CrossRef Full Text | Google Scholar

Kaynig, V., Vazquez-Reina, A., Knowles-Barley, S., Roberts, M., Jones, T. R., Kasthuri, N., et al. (2015). Large-scale Automatic Reconstruction of Neuronal Processes From Electron Microscopy Images. Med. Image Anal. 22, 77–88. doi:10.1016/

PubMed Abstract | CrossRef Full Text | Google Scholar

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “ImageNet Classification With Deep Convolutional Neural Networks, in Proceedings of NIPS 2012, Stateline, NV, December 3–8, 2012 (Curran Associates Inc.), 1097–1105.

Google Scholar

Lee, G., Oh, J.-W., Kang, M.-S., Her, N.-G., Kim, M.-H., and Jeong, W.-K. (2018). “DeepHCS: Bright-Field to Fluorescence Microscopy Image Conversion Using Deep Learning for Label-Free High-Content Screening, ˮ in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018, Granada, Spain, September 16–20, 2018 (Springer International Publishing), 335–343.

CrossRef Full Text | Google Scholar

Lichtman, J. W., and Denk, W. (2011). The Big and the Small: Challenges of Imaging the Brain’s Circuits. Science 334, 618–623. doi:10.1126/science.1209168

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, M., Chen, Q., and Yan, S. (2014). “Network in Network, ˮ in Proceedings of ICLR 2014. arXiv:1312.4400v3.

Google Scholar

Long, J., Shelhamer, E., and Darrell, T. (2015). “Fully Convolutional Networks for Semantic Segmentation,ˮ in Proceedings of IEEE CVPR 2015, Boston, MA, June 7–12, 2015 (IEEE), 3431–3440.

Google Scholar

Pape, C., Matskevych, A., Wolny, A., Hennies, J., Mizzon, G., Louveaux, M., et al. (2019). Leveraging Domain Knowledge to Improve Microscopy Image Segmentation with Lifted Multicuts. Front. Comput. Sci. 1, 657. doi:10.3389/fcomp.2019.00006

CrossRef Full Text | Google Scholar

Parag, T., Ciresan, D. C., and Giusti, A. (2015). “Efficient Classifier Training to Minimize False Merges in Electron Microscopy Segmentation,ˮ in Proceedings of IEEE ICCV 2015, Santiago, Chile, December 7–13, 2015 (IEEE), 657–665.

Google Scholar

Quan, T. M., Hildebrand, D. G. C., and Jeong, W.-K. (2016). FusionNet: A Deep Fully Residual Convolutional Neural Network for Image Segmentation in Connectomics. arXiv preprint arXiv:1612.05360.

Google Scholar

Quan, T. M., Nguyen-Duc, T., and Jeong, W.-K. (2018). Compressed Sensing MRI Reconstruction Using a Generative Adversarial Network With a Cyclic Loss. IEEE Trans. Med. Imaging 37, 1488–1497. doi:10.1109/tmi.2018.2820120

PubMed Abstract | CrossRef Full Text | Google Scholar

Ronneberger, O., Fischer, P., and Brox, T. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation. ˮin Proceedings of MICCAI 2015, Munich, Germany, October 5–9, 2015 (Springer, Cham), 234–241.

CrossRef Full Text | Google Scholar

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vision 115, 211–252. doi:10.1007/s11263-015-0816-y

CrossRef Full Text | Google Scholar

Y. Shapira (Editor) (2008). Matrix-Based Multigrid. New York, NY: Springer US.

Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., Pietzsch, T., et al. (2012). Fiji: an Open-Source Platform for Biological-Image Analysis. Nat. Methods 9, 676–682. doi:10.1038/nmeth.2019

PubMed Abstract | CrossRef Full Text | Google Scholar

Shen, W., Wang, B., Jiang, Y., Wang, Y., and Yuille, A. (2017). “Multi-stage Multi-Recursive-Input Fully Convolutional Networks for Neuronal Boundary Detection,ˮ in IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 22–29, 2017 (IEEE), 2391–2400.

Google Scholar

Simonyan, K., and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556

Google Scholar

Sommer, C., Strähle, C., Köthe, U., and Hamprecht, F. A. (2011). “Ilastik: Interactive Learning and Segmentation Toolkit,ˮ in Proceedings of IEEE ISBI 2011, Chicago, IL, March 30–April 2, 2011 (IEEE), 230–233.

Google Scholar

Stollenga, M. F., Byeon, W., Liwicki, M., and Schmidhuber, J. (2015). “Parallel Multi-Dimensional LSTM, with Application to Fast Biomedical Volumetric Image Segmentation,ˮ in Proceedings of NIPS 2015, June 24, 2015, Montreal, QC (Curran Associates, Inc.), 2998–3006.

Google Scholar

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). “Going Deeper with Convolutions,ˮ in Proceedings of IEEE CVPR, Boston, MA, June 7–12, 2015 (IEEE) 1–9.

Google Scholar

Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., et al. (2010). Convolutional Networks can Learn to Generate Affinity Graphs for Image Segmentation. Neural Comput. 22, 511–538. doi:10.1162/neco.2009.10-08-881

PubMed Abstract | CrossRef Full Text | Google Scholar

Weiler, M., Hamprecht, F. A., and Storath, M. (2017). “Learning Steerable Filters for Rotation Equivariant cnns,ˮ in Computer Vision and Pattern Recognition, Honolulu, HI, July 21–26, 2017 (IEEE), 849–858.

Google Scholar

Wetzel, A. W., Bakal, J., Dittrich, M., Hildebrand, D. G. C., Morgan, J. L., and Lichtman, J. W. (2016). “Registering Large Volume Serial-Section Electron Microscopy Image Sets for Neural Circuit Reconstruction Using FFT Signal Whitening,ˮ in Proceedings of AIPR Workshop 2016, Washington, D.C., United States, October 18–20, 2016 (IEEE), 1–10.

Google Scholar

Wiehman, S., and Villiers, H. D. (2016). “Semantic Segmentation of Bioimages Using Convolutional Neural Networks,ˮ in Proceedings of IJCNN 2016, Vancouver, BC, July 24–29, 2016 (IEEE), 624–631.

Google Scholar

Wolf, S., Bailoni, A., Pape, C., Rahaman, N., Kreshuk, A., Köthe, U., et al. (2019). The Mutex Watershed and its Objective: Efficient, Parameter-Free Image Partitioning. IEEE Trans. Pattern Anal. Mach. Intell. doi:10.1109/TPAMI.2020.2980827

CrossRef Full Text | Google Scholar

Wu, X. (2015). An Iterative Convolutional Neural Network Algorithm Improves Electron Microscopy Image Segmentation. arXiv preprint arXiv:150605849.

Google Scholar

Xiao, C., Liu, J., Chen, X., Han, H., Shu, C., and Xie, Q. (2018). “Deep Contextual Residual Network for Electron Microscopy Image Segmentation in Connectomics,ˮ in IEEE 15th International Symposium On Biomedical Imaging (ISBI 2018), Washington, D.C., United States, April 4–7, 2018 (IEEE), 378–381.

Google Scholar

Zeiler, M. D., and Fergus, R. (2014). “Visualizing and Understanding Convolutional Networks,ˮ in Proceedings of ECCV 2014, Zurich, Switzerland, September 6–12, 2014 (Springer, Cham), 818–833.

CrossRef Full Text | Google Scholar

Zheng, Y., Liu, D., Georgescu, B., Nguyen, H., and Comaniciu, D. (2015). “3D Deep Learning for Efficient and Robust Landmark Detection in Volumetric Data,ˮ in Proceedings of MICCAI 2015, Munich, Germany, October 5–9, 2015 (Springer, Cham), 565–572.

CrossRef Full Text | Google Scholar

Zheng, Z., Lauritzen, J. S., Perlman, E., Robinson, C. G., Nichols, M., Milkie, D., et al. (2018). A Complete Electron Microscopy Volume of the Brain of Adult drosophila Melanogaster. Cell 174, 730–743. doi:10.1016/j.cell.2018.06.019e22

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhu, Y., Torrens, Y., Chen, Z., Zhao, S., Xie, H., Guo, W., and Zhang, Y. (2019). “Ace-net: Biomedical Image Segmentation with Augmented Contracting and Expansive Paths,ˮin International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, October 13–17, 2019 (Springer, Cham).712–720.

CrossRef Full Text | Google Scholar

Keywords: connectomic analysis, image segmentation, deep learning, refinement, skip connection

Citation: Quan TM, Hildebrand DGC and Jeong W-K (2021) FusionNet: A Deep Fully Residual Convolutional Neural Network for Image Segmentation in Connectomics. Front. Comput. Sci. 3:613981. doi: 10.3389/fcomp.2021.613981

Received: 04 October 2020; Accepted: 12 April 2021;
Published: 13 May 2021.

Copyright © 2021 Quan, Hildebrand and Jeong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Won-Ki Jeong, [email protected]



How to do Semantic Segmentation using Deep learning

This article is a comprehensive overview including a step-by-step guide to implement a deep learning image segmentation model.

We shared a new updated blog on Semantic Segmentation here: A 2021 guide to Semantic Segmentation

Nowadays, semantic segmentation is one of the key problems in the field of computer vision. Looking at the big picture, semantic segmentation is one of the high-level tasks that paves the way towards complete scene understanding. The importance of scene understanding as a core computer vision problem is highlighted by the fact that an increasing number of applications rely on inferring knowledge from imagery. Some of those applications include self-driving vehicles, human-computer interaction, and virtual reality. With the popularity of deep learning in recent years, many semantic segmentation problems are being tackled using deep architectures, most often Convolutional Neural Networks, which surpass other approaches by a large margin in terms of accuracy and efficiency.

What is Semantic Segmentation?

Semantic segmentation is a natural step in the progression from coarse to fine inference. The origin could be located at classification, which consists of making a prediction for a whole input. The next step is localization/detection, which provides not only the classes but also additional information regarding the spatial location of those classes. Finally, semantic segmentation achieves fine-grained inference by making dense predictions, inferring labels for every pixel, so that each pixel is labeled with the class of its enclosing object or region.

example of semantic segmentation in street view
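The dense-prediction idea can be made concrete with a small NumPy sketch (an illustration added here, not part of the original post): a segmentation model outputs one score per class per pixel, and the label map is simply the per-pixel argmax over the class axis.

```python
import numpy as np

# Hypothetical 2x2 image, 3 classes (e.g., road, car, background).
# A segmentation network outputs one score per class per pixel.
scores = np.array([
    [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
])  # shape (H=2, W=2, num_classes=3)

# Dense prediction: label every pixel with its highest-scoring class.
label_map = scores.argmax(axis=-1)
print(label_map)  # per-pixel labels: [[1 0], [2 1]]
```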

It is also worthwhile to review some standard deep networks that have made significant contributions to the field of computer vision, as they are often used as the basis of semantic segmentation systems:

  • AlexNet: Toronto’s pioneering deep CNN that won the 2012 ImageNet competition with a top-5 test accuracy of 84.6%. It consists of 5 convolutional layers, max-pooling layers, ReLUs as non-linearities, 3 fully-connected layers, and dropout.
  • VGG-16: This Oxford model achieved 92.7% top-5 accuracy in the 2014 ImageNet competition. It uses a stack of convolution layers with small receptive fields in the first layers instead of a few layers with big receptive fields.
  • GoogLeNet: This Google network won the 2014 ImageNet competition with 93.3% top-5 accuracy. It is composed of 22 layers and a newly introduced building block called the inception module. The module consists of a Network-in-Network layer, a pooling operation, a large-sized convolution layer, and a small-sized convolution layer.
  • ResNet: This Microsoft model won the 2015 ImageNet competition with 96.4% top-5 accuracy. It is well known for its depth (152 layers) and for introducing residual blocks, which address the problem of training a really deep architecture through identity skip connections, so that layers can copy their inputs to the next layer.
Analysis of Deep Neural Network Models
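The residual idea is simple enough to sketch in a few lines of NumPy (a toy illustration with made-up weights, not ResNet's actual layers): the block computes a transformation F(x) and adds the unmodified input back, so with near-zero weights the block defaults to the identity.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual block: two linear layers with a ReLU, plus an
    identity skip connection (y = F(x) + x)."""
    h = np.maximum(0, x @ w1)   # first layer + ReLU
    fx = h @ w2                 # second layer: the "residual" F(x)
    return fx + x               # identity skip connection

x = np.array([1.0, 2.0])
# With zero weights, F(x) is zero and the block is an identity map,
# which is what makes very deep stacks of such blocks trainable.
w_zero = np.zeros((2, 2))
print(residual_block(x, w_zero, w_zero))  # [1. 2.]
```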

What are the existing Semantic Segmentation approaches?

A general semantic segmentation architecture can be broadly thought of as an encoder network followed by a decoder network:

  • The encoder is usually a pre-trained classification network like VGG or ResNet.
  • The task of the decoder is to semantically project the discriminative features (lower resolution) learnt by the encoder onto the pixel space (higher resolution) to get a dense classification.

Unlike classification where the end result of the very deep network is the only important thing, semantic segmentation not only requires discrimination at pixel level but also a mechanism to project the discriminative features learnt at different stages of the encoder onto the pixel space. Different approaches employ different mechanisms as a part of the decoding mechanism. Let’s explore the 3 main approaches:

1 — Region-Based Semantic Segmentation

The region-based methods generally follow the “segmentation using recognition” pipeline, which first extracts free-form regions from an image and describes them, followed by region-based classification. At test time, the region-based predictions are transformed to pixel predictions, usually by labeling a pixel according to the highest scoring region that contains it.
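That last step, converting region-based predictions to pixel predictions, can be sketched with hypothetical regions and scores (a NumPy illustration only; the masks, scores, and class ids are invented for the example):

```python
import numpy as np

H, W = 4, 4
# Two hypothetical free-form regions as boolean masks, each with a
# classifier score and a predicted class (0 is background).
regions = [
    {"mask": np.zeros((H, W), bool), "score": 0.9, "cls": 1},
    {"mask": np.zeros((H, W), bool), "score": 0.6, "cls": 2},
]
regions[0]["mask"][0:2, 0:3] = True
regions[1]["mask"][1:4, 1:4] = True   # overlaps the first region

# Label each pixel with the class of the highest-scoring region that
# contains it: paint regions in ascending score order, so higher
# scores overwrite lower ones where they overlap.
labels = np.zeros((H, W), int)
for r in sorted(regions, key=lambda r: r["score"]):
    labels[r["mask"]] = r["cls"]
```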

R-CNN architecture - general framework

R-CNN (Regions with CNN features) is one representative work among the region-based methods. It performs semantic segmentation based on object detection results. To be specific, R-CNN first utilizes selective search to extract a large quantity of object proposals and then computes CNN features for each of them. Finally, it classifies each region using class-specific linear SVMs. Compared with traditional CNN structures, which are mainly intended for image classification, R-CNN can address more complicated tasks, such as object detection and image segmentation, and it has even become an important basis for both fields. Moreover, R-CNN can be built on top of any CNN benchmark structure, such as AlexNet, VGG, GoogLeNet, and ResNet.

For the image segmentation task, R-CNN extracted 2 types of features for each region: a full-region feature and a foreground feature, and found that concatenating them together as the region feature led to better performance. R-CNN achieved significant performance improvements due to using highly discriminative CNN features. However, it also suffers from several drawbacks for the segmentation task:

  • The features are not compatible with the segmentation task.
  • The features do not contain enough spatial information for precise boundary generation.
  • Generating segment-based proposals takes time and greatly affects the final performance.

Due to these bottlenecks, recent research has proposed methods to address the problems, including SDS, Hypercolumns, and Mask R-CNN.

2 — Fully Convolutional Network-Based Semantic Segmentation

The original Fully Convolutional Network (FCN) learns a mapping from pixels to pixels, without extracting region proposals. The FCN network pipeline is an extension of the classical CNN. The main idea is to make the classical CNN take arbitrary-sized images as input. The restriction of CNNs to accept and produce labels only for specific-sized inputs comes from the fully-connected layers, which are fixed. Contrary to them, FCNs only have convolutional and pooling layers, which give them the ability to make predictions on arbitrary-sized inputs.

Fully convolutional Network (FCN) Architecture

One issue with this specific FCN is that, by propagating through several alternating convolutional and pooling layers, the resolution of the output feature maps is downsampled. Therefore, the direct predictions of FCN are typically in low resolution, resulting in relatively fuzzy object boundaries. A variety of more advanced FCN-based approaches have been proposed to address this issue, including SegNet, DeepLab-CRF, and Dilated Convolutions.
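A quick back-of-the-envelope calculation shows why the direct FCN output is coarse, assuming a VGG-style encoder with five stride-2 poolings (the (160, 576) image size is the one used in the exercise later in this post):

```python
# Each of VGG16's 5 max-pooling layers halves the spatial resolution,
# so the final feature map is 1/32 of the input in each dimension.
h, w = 160, 576
for _ in range(5):
    h, w = h // 2, w // 2
print(h, w)  # 5 18: one raw prediction per 32x32 patch of input pixels
```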

3 — Weakly Supervised Semantic Segmentation

Most of the relevant methods in semantic segmentation rely on a large number of images with pixel-wise segmentation masks. However, manually annotating these masks is quite time-consuming, frustrating, and expensive. Therefore, some weakly supervised methods have recently been proposed that are dedicated to fulfilling semantic segmentation by utilizing annotated bounding boxes.

semantic segmentation

For example, Boxsup employed the bounding box annotations as a supervision to train the network and iteratively improve the estimated masks for semantic segmentation. Simple Does It treated the weak supervision limitation as an issue of input label noise and explored recursive training as a de-noising strategy. Pixel-level Labeling interpreted the segmentation task within the multiple-instance learning framework and added an extra layer to constrain the model to assign more weight to important pixels for image-level classification.

Doing Semantic Segmentation with Fully-Convolutional Network

In this section, let’s walk through a step-by-step implementation of the most popular architecture for semantic segmentation: the Fully-Convolutional Net (FCN). We’ll implement it using the TensorFlow library in Python 3, along with other dependencies such as NumPy and SciPy. In this exercise, we will label the pixels of a road in images using FCN. We’ll work with the Kitti Road Dataset for road/lane detection. This is a simple exercise from Udacity’s Self-Driving Car Nanodegree program; you can learn more about the setup in this GitHub repo.

Kitti road dataset for semantic segmentation

Here are the key features of the FCN architecture:

  • FCN transfers knowledge from VGG16 to perform semantic segmentation.
  • The fully connected layers of VGG16 are converted to fully convolutional layers using 1x1 convolution. This process produces a class presence heat map in low resolution.
  • The upsampling of these low-resolution semantic feature maps is done using transposed convolutions (initialized with bilinear interpolation filters).
  • At each stage, the upsampling process is further refined by adding features from coarser but higher-resolution feature maps from lower layers in VGG16.
  • Skip connections are introduced after each convolution block to enable the subsequent block to extract more abstract, class-salient features from the previously pooled features.
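The transposed-convolution upsampling with bilinear initialization can be illustrated in one dimension with plain NumPy (a didactic sketch; the real model uses TensorFlow's 2D transposed convolution layers): each input value stamps a scaled copy of the kernel into the output at stride intervals, and a bilinear kernel makes this equivalent to linear interpolation.

```python
import numpy as np

def bilinear_kernel(size):
    """1D bilinear interpolation kernel: the standard initialization
    for upsampling transposed convolutions."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return 1 - np.abs(np.arange(size) - center) / factor

def transposed_conv1d(x, kernel, stride):
    """Naive 1D transposed convolution: every input value stamps a
    scaled copy of the kernel into the output, offset by the stride."""
    k = len(kernel)
    out = np.zeros(len(x) * stride + k - stride)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

kernel = bilinear_kernel(4)  # [0.25, 0.75, 0.75, 0.25]
# Upsample a tiny ramp by 2x; the interior values interpolate linearly
# between the inputs ('same' padding would crop the border cells).
up = transposed_conv1d(np.array([0.0, 2.0, 4.0]), kernel, stride=2)
print(up)
```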

There are 3 versions of FCN (FCN-32, FCN-16, FCN-8). We’ll implement FCN-8, as detailed step-by-step below:

  • Encoder: A pre-trained VGG16 is used as an encoder. The decoder starts from Layer 7 of VGG16.
  • FCN Layer-8: The last fully connected layer of VGG16 is replaced by a 1x1 convolution.
  • FCN Layer-9: FCN Layer-8 is upsampled 2 times to match dimensions with Layer 4 of VGG16, using transposed convolution with parameters (kernel=(4,4), stride=(2,2), padding=’same’). After that, a skip connection is added between Layer 4 of VGG16 and FCN Layer-9.
  • FCN Layer-10: FCN Layer-9 is upsampled 2 times to match dimensions with Layer 3 of VGG16, using transposed convolution with parameters (kernel=(4,4), stride=(2,2), padding=’same’). After that, a skip connection is added between Layer 3 of VGG16 and FCN Layer-10.
  • FCN Layer-11: FCN Layer-10 is upsampled 8 times to match dimensions with the input image size, so we get the actual image size back with depth equal to the number of classes, using transposed convolution with parameters (kernel=(16,16), stride=(8,8), padding=’same’).
FCN-8 Architecture
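As a sanity check on the decoder's shape arithmetic (using the rule that a 'same'-padded transposed convolution with stride s multiplies each spatial dimension by s, and the image_shape of (160, 576) used later in this post):

```python
# VGG16's layer 7 output is 1/32 of the input resolution.
h, w = 160 // 32, 576 // 32  # (5, 18)
h, w = h * 2, w * 2          # FCN Layer-9:  matches VGG layer 4 (1/16)
h, w = h * 2, w * 2          # FCN Layer-10: matches VGG layer 3 (1/8)
h, w = h * 8, w * 8          # FCN Layer-11: back to the input size
print(h, w)                  # 160 576
```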

Step 1

We first load the pre-trained VGG-16 model into TensorFlow. Taking in the TensorFlow session and the path to the VGG folder (which is downloadable here), we return the tuple of tensors from the VGG model, including the image input, keep_prob (to control dropout rate), layer 3, layer 4, and layer 7.

VGG16 function

Step 2

Now we focus on creating the layers for a FCN, using the tensors from the VGG model. Given the tensors for VGG layer output and the number of classes to classify, we return the tensor for the last layer of that output. In particular, we apply a 1x1 convolution to the encoder layers, and then add decoder layers to the network with skip connections and upsampling.

Layers function

Step 3

The next step is to optimize our neural network, aka building TensorFlow loss functions and optimizer operations. Here we use cross entropy as our loss function and Adam as our optimization algorithm.

Optimize function
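The original post shows the optimize function only as a screenshot; as a stand-in for its core computation, here is a NumPy version of mean softmax cross-entropy (the function name and toy logits are illustrative; in the actual exercise this is computed by tf.nn.softmax_cross_entropy_with_logits and minimized with Adam):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy.
    logits: (N, C) raw scores for N pixels, labels: (N,) class ids."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the true class per pixel.
    return -log_softmax[np.arange(len(labels)), labels].mean()

# 3 pixels, 2 classes (road / not-road); the 1st two are confidently
# correct, the 3rd is an uncertain 50/50 prediction.
logits = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
labels = np.array([0, 1, 0])
print(cross_entropy(logits, labels))  # approximately 0.3157
```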

Step 4

Here we define the train_nn function, which takes in important parameters including the number of epochs, batch size, loss function, optimizer operation, and placeholders for input images, label images, and learning rate. For the training process, we also set keep_probability to 0.5 and learning_rate to 0.001. To keep track of the progress, we also print out the loss during training.

Step 5

Finally, it’s time to train our net! In this run function, we first build our net using the load_vgg, layers, and optimize function. Then we train the net using the train_nn function and save the inference data for records.

Run function

As for our parameters, we choose epochs = 40, batch_size = 16, num_classes = 2, and image_shape = (160, 576). After doing 2 trial passes with dropout = 0.5 and dropout = 0.75, we found that the 2nd trial yields better results with better average losses.

semantic segmentation training sample results

To see the full code, check out this link:

If you enjoyed this piece, I’d love it if you shared it 👏 and spread the knowledge.

You might be interested in our latest posts on:

Semantic Segmentation Overview - Train a Semantic Segmentation Network Using Deep Learning.

Training instance segmentation neural network with synthetic datasets for crop seed phenotyping


In order to train a neural network for plant phenotyping, a sufficient amount of training data must be prepared, which requires a time-consuming manual data annotation process that often becomes the limiting step. Here, we show that an instance segmentation neural network aimed at phenotyping the barley seed morphology of various cultivars can be sufficiently trained purely by a synthetically generated dataset. Our attempt is based on the concept of domain randomization, where a large number of images are generated by randomly orienting the seed objects on a virtual canvas. The trained model showed 96% recall and 95% average precision against the real-world test dataset. We show that our approach is also effective for various crops including rice, lettuce, oat, and wheat. Constructing and utilizing such synthetic data can be a powerful method to alleviate human labor costs for deploying deep-learning-based analysis in the agricultural domain.


Deep learning1 has attracted wide attention in both the scientific and industrial communities. In the computer vision field, deep-learning-based techniques using convolutional neural networks (CNNs) are actively applied to various tasks, such as image classification2, object detection3,4, and semantic/instance segmentation5,6,7. Such techniques have also been influencing the field of agriculture. This involves image-based phenotyping, including weed detection8, crop disease diagnosis9,10, fruit detection11, and many other applications as listed in a recent review12. Meanwhile, combining features from images with those of environmental variables enabled a neural network to predict plant water stress for automated control of greenhouse tomato irrigation13. The numerous and high-context data generated in the relevant fields seem to have a high affinity with deep learning.

However, one of the drawbacks of using deep learning is the need to prepare a large amount of labeled data. The ImageNet dataset as of 2012 consists of 1.2 million and 150,000 manually classified images in the training dataset and validation/test dataset, respectively14. Meanwhile, the COCO 2014 Object Detection Task consists of 328,000 images containing 2.5 million labeled object instances of 91 categories15. This order of annotated dataset is generally difficult to prepare for an individual or a research group. In the agricultural domain, it has been reported that a sorghum head detection network can be trained with a dataset consisting of 52 images with an average of 400 objects per image16, while a crop stem detection network was trained starting from 822 images17. These case studies imply that the amount of data required for a specialized task may be less compared with a relatively generalized task, such as the ImageNet classification and COCO detection challenges. Nonetheless, the necessary and sufficient amount of annotation data to train a neural network is generally unknown. Although many techniques to decrease the labor cost, such as domain adaptation or active learning, are widely used in plant/bio science applications18,19,20, the annotation process is highly stressful for researchers, as it is like running a marathon without knowing where the goal is.

A traditional way to minimize the number of manual annotations is to learn from synthetic images, which is occasionally referred to as sim2real transfer. One of the important advantages of using a synthetic dataset for training is that the ground-truth annotations can be obtained automatically without the need for human labor. A successful example can be found in person image analysis methods that use an image dataset with synthetic human models21 for various uses such as person pose estimation22. Similar approaches have also been used for the preparation of training data for plant image analysis. Isokane et al.23 used synthetic plant models for the estimation of branching patterns, while Ward et al. generated artificial images of Arabidopsis rendered from 3D models and utilized them for neural network training in leaf segmentation24.

One drawback of the sim2real approach is the gap between the synthesized images and real scenes, e.g., nonrealistic appearances. To counter this problem, many studies attempt to generate realistic images from synthetic datasets, such as by using generative adversarial networks (GANs)25,26. In the plant image analysis field, Giuffrida et al.27 used GAN-generated images to train a neural network for Arabidopsis leaf counting. Similarly, Arsenovic et al. used StyleGAN28 to create training images for plant disease image classification29.

On the other hand, an advantage of the sim2real approach is the capability of creating a (nearly) infinite amount of training data. An approach that bridges the sim2real gap by leveraging this advantage is domain randomization, which trains deep networks using large variations of synthetic images with randomly sampled physical parameters. Although domain randomization is somewhat related to data augmentation (e.g., randomly flipping and rotating the images), the synthetic environment enables the representation of variations under many conditions, which is generally difficult to attain with straightforward data augmentation techniques for real images. An early attempt at domain randomization generated images using different camera positions, object locations, and lighting conditions, similar to a technique applied to control robots30. For object recognition tasks, Tremblay et al.31 proposed a method to generate images with randomized textures on synthetic data. In the plant-phenotyping field, Kuznichov et al. recently proposed a method to segment and count the leaves of not only Arabidopsis but also avocado and banana, by using synthetic leaf textures placed at various sizes and angles so as to mimic images acquired in real agricultural scenes32. Collectively, the use of synthetic images has a huge potential in the plant-phenotyping research field.
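The domain-randomization recipe described here can be caricatured in a few lines of NumPy (everything below, the canvas size, rectangular "seeds", and instance counts, is an invented stand-in for the paper's actual rendering pipeline): objects are pasted at random locations, and the per-instance ground-truth masks fall out of the generation process for free.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic_sample(n_seeds=5, size=64):
    """Paste randomly placed rectangular 'seeds' onto a blank canvas.
    Because we control the rendering, the instance masks require no
    manual annotation; later seeds may occlude earlier ones."""
    canvas = np.zeros((size, size), np.uint8)
    for inst in range(1, n_seeds + 1):
        h, w = rng.integers(4, 9, size=2)       # random seed extent
        y = rng.integers(0, size - h + 1)
        x = rng.integers(0, size - w + 1)
        canvas[y:y + h, x:x + w] = inst
    # Ground-truth instance masks, derived directly from the canvas.
    masks = [canvas == i for i in range(1, n_seeds + 1)]
    return canvas, masks

canvas, masks = make_synthetic_sample()
```

A real pipeline would paste photographed seed crops with random rotation and scale instead of rectangles, but the key property is the same: labels are generated alongside the image.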

Seed shape, along with seed size, is an important agricultural phenotype. It is a yield component of crops and is affected by environmental conditions in the later developmental stages. Seed size and shape can be predictive of germination rates and the subsequent development of plants33,34. Genetic alteration of seed size contributed a significant increase in thousand-grain weight in contemporary cultivated barley germplasm35. Several studies report the enhancement of rice yield by using seed width as a metric36,37. Moreover, others have used elliptic Fourier descriptors, which represent a closed contour as a set of variables, to successfully characterize the seed shapes of various species38,39,40,41. Morphological parameters of seeds are therefore a powerful metric both for crop-yield improvement and for biological studies. However, including the reports above, many previous studies have evaluated seed shape by qualitative metrics (e.g., whether the seeds resemble the parental phenotype), by vernier caliper, or by manual annotation with image-processing software. Such phenotyping is generally labor-intensive and cannot fully exclude quantification errors that differ by annotator. To execute a precise and large-scale analysis, automation of the seed-phenotyping step is preferred.

In recent years, several studies have reported systematic analyses of plant seed morphology by image analysis. Ayoub et al. characterized barley seeds in terms of area, perimeter, length, width, F circle, and F shape based on digital-camera images42. Herridge et al. used the particle analysis function of ImageJ to quantify and differentiate the seed size of Arabidopsis mutants from the background population43. The SmartGrain software was developed for high-throughput phenotyping of crop seeds and successfully identified a QTL responsible for seed length in rice44. Miller et al. reported a high-throughput image analysis to measure morphological traits of maize ears, cobs, and kernels45. Wen et al. developed image analysis software that measures seed shape parameters such as width, length, and projected area, as well as color features of maize seeds, and found a correlation between these physical characteristics and seed vigor46. Moreover, commercially available products such as the Germination Scanalyzer (Lemnatec, Germany) and the PT portable tablet tester (Greenpheno, China) also aim to quantify the morphological shape of seeds. However, the aforementioned approaches require the seeds to be sparsely oriented for efficient segmentation. When seeds are densely sampled and physically touching each other, they are often detected as a unified region, leading to an abnormal seed shape output. This requires the user to manually reorient the seeds in a sparse manner, a potential barrier to securing a sufficient number of biological replicates in high-throughput analysis. In such situations, deep-learning-based instance segmentation can overcome this problem by segmenting the respective seed regions regardless of their orientation. Nonetheless, the annotation process described above was thought to be the potential limiting step.

In this paper, we show that a synthetic dataset in which the combination and orientation of seeds are artificially rendered is sufficient to train an instance segmentation neural network to process real-world images. Moreover, applying our pipeline enables us to extract morphological parameters at a large scale, precisely characterizing barley natural variation from a multivariate perspective. The proposed method can alleviate the labor-intensive annotation process and thus enable rapid development of deep-learning-based image analysis pipelines in the agricultural domain, as illustrated in Fig. 1. Our method is closely related to sim2real approaches with domain randomization: we generate a large number of training images by randomly placing synthetic seeds with actual textures while varying their orientation and location.

The conventional method requires manual labeling of images to generate the training dataset, whereas our proposed method substitutes this step with a synthetic dataset for the crop seed instance segmentation model.


The contribution of this study is twofold. First, this is the first attempt to use a synthetic dataset (i.e., a sim2real approach) with domain randomization for crop seed phenotyping, which can significantly decrease the manual labor of data creation (Fig. 1). Second, we propose the first method that can handle densely sampled (e.g., physically touching) seeds using instance segmentation.


Preparation of barley seed synthetic dataset

Examples of seed images captured by the scanner are shown in Fig. 2a. The morphology of barley seeds is highly variable between cultivars in terms of size, shape, color, and texture. Moreover, the seeds randomly come into contact with, or partially overlap, each other. Determining an optimal binarization threshold may isolate the seed regions from the background; however, conventional segmentation methods such as watershed require an extensive, per-cultivar search for suitable parameters to efficiently segment single-seed areas for morphological quantification, and establishing such a pipeline requires extensive effort by an expert. Employing a sophisticated segmentation method (in our case, instance segmentation using Mask R-CNN7) is indeed a choice for successfully separating individual seeds. However, Mask R-CNN requires annotations of bounding boxes, which circumscribe each seed, and mask images that necessarily and sufficiently cover the seed area (Supplementary Fig. 1). Given the abundance of seeds per image (Fig. 2a), the annotation process was expected to be labor-intensive.

a Images of barley seeds scanned from 20 cultivars. Cultivar names are described in white text in each image. These images were also used as a real-world test dataset in Table 1. b Scheme of generating synthetic images. Images are generated by combining actual scanned seed images over the background images onto the virtual canvas. Simultaneously generated ground-truth label (mask) is shown at the bottom, in which each seed area is marked with a unique color.


Figure 2b shows the seed image pool and synthesized dataset obtained using the proposed method (see “Methods” for details). Instead of labeling real-world images for use as a training dataset, Mask R-CNN was trained using the synthetic dataset (examples shown at the bottom of Fig. 2b), which is generated from the seed and background image pool (Fig. 2b top) using a domain randomization technique.

Model evaluation

We show herein the visual results and a quantitative evaluation of object detection and instance segmentation by Mask R-CNN. The trained Mask R-CNN model outputs a set of bounding box coordinates and mask images of seed regions (raw output) (Fig. 3a, top row). Examples of visualized raw output obtained from the real-world images show that the network can accurately locate and segment the seeds regardless of their orientation (Fig. 3b; Supplementary Fig. 2). Table 1 summarizes the quantitative evaluation using the recall and AP measures (see “Methods” for details). The efficacy of seed detection was evaluated using recall values computed for bounding box coordinates at a 50% Intersection over Union (IoU) threshold (Recall50). The model achieved averages of 95% and 96% on the synthetic and real-world test datasets, respectively, indicating that the trained model can locate seeds with a very low false-negative rate. From the average precision (AP) values, computed on mask regions at varying mask IoU thresholds, comparable AP50 values were achieved on the synthetic (96%) and real-world (95%) datasets. At higher IoU thresholds (AP@[.5:.95] and AP75), the values on the synthetic test dataset (73%) exceeded those on the real-world test dataset (59%). These results suggest that the model segments seed regions better in the synthetic than in the real-world images. The higher values on the synthetic dataset possibly derive from data leakage, in which the same seed images appear in the training dataset, albeit with different orientations and combinations. However, considering the visual output (Fig. 3b) and the AP50 value (95%) on the real-world test dataset, we judged that seed morphology can be sufficiently determined from real-world images.
The relatively low AP at high IoU thresholds on the real-world test dataset possibly derives from subtle variation in the manual annotation of seed mask regions. It is noteworthy that when the Mask R-CNN model was trained with the manually annotated seeds, the network performed poorly in segmenting the seed regions (Supplementary Fig. 3). This was especially apparent when the seeds were physically touching each other and forming a dense cluster, which further supports the efficiency of domain randomization.
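The Recall50 metric used above can be reproduced from bounding boxes alone. A minimal sketch of IoU and recall-at-threshold, as a simplified stand-in for the COCO-style evaluation actually used (greedy one-to-one matching and per-mask IoU are omitted):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at(gt_boxes, pred_boxes, thr=0.5):
    """Fraction of ground-truth boxes matched by some prediction at IoU >= thr."""
    matched = sum(1 for g in gt_boxes if any(iou(g, p) >= thr for p in pred_boxes))
    return matched / len(gt_boxes) if gt_boxes else 0.0
```

For example, two 10x10 boxes shifted by half their width overlap with IoU 1/3, so at the 0.5 threshold that prediction would not count as a match.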

a Summary of the image analysis pipeline. b Examples of the graphical output of the trained Mask R-CNN on real-world images. Different colors indicate an individual segmented seed region. Note that even if the seeds are overlapping or touching each other, the network can discriminate them as independent objects. c Examples of detected candidate regions to be filtered in the post-processing step, indicated by red arrows. Black arrowheads indicate the input image boundary. d Probability density of the seed areas of the raw and filtered output. e Scatterplot describing the correlation of the seed area measured by the pipeline (inferenced seed area) and by manual annotation (ground-truth seed area). Each dot represents the value of a single seed. Black and gray lines indicate the identity and the 10% error threshold lines, respectively. The proportion of seeds with lower or higher than 10% error is also displayed.


Post processing

As described in the Methods section, we introduced a post-processing step that eliminates detections in the raw output that are unsuitable for further analysis. This process removes seeds occluded by physical overlap, regions incompletely segmented by the neural network, non-seed objects such as dirt or awn debris, and seeds partly cut off at the border of the scanned area (Fig. 3c). Figure 3d shows the distribution of the seed area before and after post processing. Even though the seed area itself was not used as a filtering criterion, the area values in the respective cultivars shift from a long-tailed to a normal distribution, which well reflects the characteristics of a homogeneous population (Fig. 3d). A comparison of the filtered output (inferenced seed area) with hand-measured (ground-truth) values displays a strong correlation, with a Pearson correlation of 0.97 (Fig. 3e). These results suggest that the filtered output values obtained from our pipeline are reliable for further phenotypic analyses.
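One of the filtering rules described here, dropping seeds cut off at the scanned-area boundary, can be sketched on bounding boxes. This is an illustrative simplification: the paper's full post processing also handles occlusion and incomplete masks, which require the mask pixels themselves.

```python
def touches_border(box, width, height, margin=1):
    """True if a bounding box (x1, y1, x2, y2) touches the image boundary."""
    x1, y1, x2, y2 = box
    return x1 <= margin or y1 <= margin or x2 >= width - margin or y2 >= height - margin

def filter_detections(boxes, width, height):
    """Keep only detections that lie fully inside the scanned area."""
    return [b for b in boxes if not touches_border(b, width, height)]
```

Because this filter is independent of the network, it can be swapped or extended (e.g., with solidity or overlap checks) without retraining.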

Morphological characterization of barley natural variation

Our pipeline learns from synthetic images, which eases the preparation of the training dataset and enables large-scale analysis across multiple cultivars or species. To highlight these advantages, we herein demonstrate an array of analyses that morphologically characterize the natural variation of barley seeds and highlight biological features that can guide further investigation. We selected 19 of the 20 cultivars used to train the neural network, but acquired a new image, not used for training or testing, for the further analysis. One accession, H602, was excluded because the rachis could hardly be removed by husk threshing; therefore, the detected area did not reflect the true seed shape. From the pipeline, we obtained 4464 segmented seed images in total (an average of 235 seeds per cultivar).

As simple and commonly used morphological features, the seed area, width, length, and length-to-width ratio per cultivar were extracted from the respective images and are summarized in Fig. 4a–d. With a sufficient number of biological replicates, we can not only compare inter-cultivar differences (e.g., in median or average) but also consider intra-cultivar variance. We applied analysis of variance (ANOVA) with Tukey's post hoc test to assess statistical differences between cultivars. Many cultivars that visually display similar distribution patterns or medians were nevertheless assigned to statistically different groups (e.g., K735 and K692 in Fig. 4a). To gain further insight into the morphology of barley cultivars characterized by various descriptors, we performed multivariate analyses.
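The one-way ANOVA underlying these comparisons reduces to a ratio of between-group to within-group variance. A minimal pure-Python sketch of the F statistic; in practice the Tukey post hoc step is done with a statistics package (e.g., statsmodels) and is omitted here:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA over a list of sample groups."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    k, n = len(groups), len(all_vals)
    # Between-group sum of squares: group-mean deviations from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: deviations from each group's own mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F indicates that between-cultivar differences dominate within-cultivar scatter; the post hoc test then identifies which cultivar pairs differ.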

Box-and-whisker plots overlaid with swarm plots (colored dots), grouped by barley cultivar. a Seed area, b seed width, c length, and d length-to-width ratio. Diamonds represent outliers. Statistical differences were determined by one-way ANOVA followed by Tukey post hoc analysis. Different letters indicate significant differences (p < 0.05).


First, we show the results of a principal component analysis (PCA) using eight predefined descriptors (area, width, length, length-to-width ratio, eccentricity, solidity, perimeter length, and circularity). The first two principal components (PCs) explained 88.5% of the total variation (Fig. 5a, b). Although there were no discrete boundaries, the data points tended to form cultivar-specific clusters in the latent space, indicating that cultivars can be classified to a certain extent according to the said descriptors (Fig. 5a). Variables such as seed length (L) and perimeter length (PL) mainly constituted the first PC, with seed circularity (CS) oriented in the opposite direction, while seed width (W) and length-to-width ratio had a major influence on PC2 (Fig. 5b). This is exemplified by the distribution of the slenderest cultivar, B669, and the circular-shaped J647 at the far right and far left of the latent space. Notably, while width (W) mainly constituted PC2, the direction of its eigenvector differs from that of length (L). The moderate Pearson's correlation between length and width (0.5, p < 0.01) (Supplementary Fig. 4) also implies that genes controlling both or either of size and length may coexist in the determination of barley seed shape, as reported in rice47.
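The explained-variance ratios reported for the PCA come from the eigenvalues of the covariance matrix of the descriptors. For two variables the eigenvalues have a closed form, which this illustrative sketch uses; the actual analysis used eight descriptors and a standard PCA implementation:

```python
import math
from statistics import mean

def pca2_explained(xs, ys):
    """Explained-variance ratios of the two principal components of 2-D data,
    from the eigenvalues of the 2x2 sample covariance matrix (closed form)."""
    mx, my = mean(xs), mean(ys)
    n = len(xs) - 1
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] via trace/determinant.
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + root, tr / 2 - root
    total = l1 + l2
    return l1 / total, l2 / total
```

For perfectly correlated data the first component carries all the variance; the 88.5% figure above is the analogous sum of the first two ratios over eight descriptors.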

a, b Principal component analysis (PCA) with morphological parameters of barley seeds. Each point represents the data point of the respective seed. The colors correspond to those defined in the color legend displayed below (e). Mean PC1 and PC2 values of each cultivar are plotted as a large circle with text annotations in (a). Eigenvectors of each descriptor are drawn as arrows in (b). LWR length-to-width ratio, E eccentricity, L seed length, PL seed perimeter length, AS seed area, W seed width, S solidity, CS seed circularity. c, d PCA with elliptic Fourier descriptors (EFDs). The colors and point annotations of (c) follow those of (a). Interpolations of the latent space followed by reconstruction of the contours are displayed in (d). e, f Latent space visualization of a variational autoencoder (VAE). The colors and point annotations of (e) follow those of (a). Interpolations of the latent space followed by image generation using the generator of the VAE are displayed in (f).


Next, we extracted the contour shapes of seeds using elliptic Fourier descriptors (EFDs) followed by PCA (Fig. 5c, d), an approach also used in other seed morphological studies38,39. Compared with the PCA based on the eight morphological descriptors in Fig. 5a, the distributions of the respective seeds were relatively condensed and the cultivar clusters were intermixed (Fig. 5c), possibly because size information is lost upon normalization, leaving EFDs with only the contour shape. Interpolating the latent space along the PC1 axis clearly highlights differences in seed slenderness (Fig. 5d; Supplementary Fig. 5a). PC2 did not show an obvious change in shape compared with PC1 (Fig. 5d); however, it seemed to be involved in the sharpness of the edge shape in the longitudinal direction (Supplementary Fig. 5a). Although further verification is required, rendering the average contours that represent the shapes of the respective cultivars suggests differences in such metrics (Supplementary Fig. 5b).
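The contour reconstructions in Fig. 5d follow from the elliptic Fourier series: each harmonic contributes an ellipse, and the contributions are summed over the contour parameter. A sketch of the reconstruction step in the standard Kuhl–Giardina form (extraction of the coefficients from a contour, often done with a library such as pyefd, is omitted):

```python
import math

def efd_reconstruct(coeffs, a0=0.0, c0=0.0, n_points=100):
    """Reconstruct a closed contour from elliptic Fourier coefficients.

    `coeffs` is a list of per-harmonic tuples (a_n, b_n, c_n, d_n);
    (a0, c0) is the contour centroid.
    """
    pts = []
    for k in range(n_points):
        t = 2.0 * math.pi * k / n_points
        x, y = a0, c0
        for n, (a, b, c, d) in enumerate(coeffs, start=1):
            x += a * math.cos(n * t) + b * math.sin(n * t)
            y += c * math.cos(n * t) + d * math.sin(n * t)
        pts.append((x, y))
    return pts

# A single harmonic (2, 0, 0, 1) traces an ellipse with semi-axes 2 and 1.
ellipse = efd_reconstruct([(2.0, 0.0, 0.0, 1.0)])
```

Latent-space interpolation then amounts to inverse-transforming interpolated PC scores back to coefficient tuples and rendering them with this function.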

Finally, we trained a variational autoencoder (VAE) for latent space visualization48. Unlike the methods based on shape descriptors (i.e., the eight simple features or EFDs), the VAE takes the segmented seed images as input and can thus obtain a representation that describes the dataset well without predefined features (see Methods for details). We expected such a network to learn high-level features (complex phenotypes) such as texture, in addition to the contour shape and morphological parameters handled in Fig. 5a–d. The learned representation can be visualized in a two-dimensional scatterplot similar to a PCA (Fig. 5e). Compared with the PCA-based methods, the VAE appears to cluster the cultivars in the latent space more distinctly. While predefined morphological descriptors extract a limited amount of information from an image, a VAE handles the entire image and hence can, in principle, learn more complex biological features. Overall, Z1 tends to be involved in seed color (i.e., brightness) and size, while Z2 is involved in seed length (Fig. 5f). These results suggest the potential power of deep learning for further phenotypic analysis, in addition to well-established morphological analyses.

Application in various crop seeds

We further verified the efficacy of our approach on other crop seeds. We newly trained our model to analyze the seed morphology of wheat, rice, oat, and lettuce, using synthetic datasets generated for each species (Fig. 6, top row). Processing the real-world images resulted in a clear segmentation for each species, regardless of seed size, shape, texture, color, or background (Fig. 6, middle and bottom rows). These results strongly suggest the high generalization ability of the presented method.

Synthetic data of the respective species were generated (top row), and the neural networks were independently trained. The inference results for the real-world input images (middle row) are visualized (bottom row). The cultivar name for each species is overlaid.



In this research, we showed that a synthetic dataset can successfully train an instance segmentation neural network to analyze real-world images of barley seeds. The values obtained from the image analysis pipeline were comparable to those of manual annotation (Fig. 3e), thus achieving high-throughput quantification of seed morphology in various analyses. Moreover, our pipeline requires only a limited number of seed images to be added to the pool from which the synthetic dataset is created. This is labor-efficient and practical compared with labeling the large number of images usually required for deep learning.

To make full use of synthetic data for deep learning, we must precisely understand what types of features are critical to represent the real-world dataset. In the case of seed instance segmentation, we presumed that the network must learn a representation that separates physically touching or overlapping seeds into individual objects. Therefore, in designing the synthetic images, we prioritized a dataset containing numerous patterns of seed orientation rather than massive patterns of seed textures. The result that the model performed well on the test dataset (Fig. 3b; Supplementary Fig. 2, Table 1) suggests that this presumption was legitimate to a certain extent. However, because the neural network itself is a black box, we cannot go beyond ex post facto reasoning. Recently, there have been attempts to understand learned representations in biological contexts using various interpretation techniques10,49. Extending such approaches to instance segmentation neural networks such as ours will help verify the authenticity of both the synthesized dataset and the trained neural network in future studies.

Notably, model performance is expected to be greatly influenced by the image resolution and variance of the seed images used to create the synthetic data, as well as by the number of images in the training dataset. Optimal parameters will also depend on the cultivars that constitute the test dataset. In this study, we used a fixed condition for synthetic dataset generation in order to demonstrate the effectiveness of domain randomization for seed phenotyping. However, in practical situations where users build and execute customized pipelines, a parameter search may determine the minimal dataset requirements and thereby reduce computation cost. Moreover, introducing additional image augmentation techniques into the synthetic dataset, such as random color shift and zoom, should lead to a more robust model.

We introduced post processing to exclude nonintegral mask regions prior to phenotypic analysis (Fig. 3a, bottom row and Fig. 4c, d). Theoretically, if we added category labels to the synthetic dataset indicating whether the respective regions are suitable for analysis, the neural network might learn to discriminate such integrity itself. However, this increases the complexity of synthetic data generation, and misdetected or incompletely segmented mask regions still could not be excluded. We consider heuristic-based post processing a simple yet powerful approach. Nonetheless, our outlier removal assumes that the seed population is homogeneous, and it is important to verify whether such filtering remains valid for heterogeneous populations. Notably, SmartGrain also introduces a post-processing step involving repeated binary dilation and erosion, which was reported to be effective in analyzing the progenies of two rice cultivars in QTL analysis44. As the post processing is independent of the neural network in our pipeline, designing and verifying various methods will be important for expanding the functionality of the analysis pipeline.

The shape and size of seeds (grains) are important agronomic traits that determine the quality and yield of crops50. In recent years, a number of genes have been identified and characterized through genetic approaches accompanied by laborious phenotyping. In previous studies, researchers manually measured the shape and size of seeds, which is time-consuming and error-prone and restricted the number of seeds that could be analyzed. Researchers used to subjectively select several seeds that seemed to represent the population, and for this reason small phenotypic differences between genotypes could not be detected. Our pipeline can phenotype a large number of seeds without requiring a sparse seed arrangement at image acquisition, and can thereby obtain a large amount of data in a short period of time. This allows easy and sensitive detection of both obvious and subtle phenotypic differences between cultivars, supported by statistical verification (Fig. 4) or by the dimensionality reduction of multivariate parameters introduced herein (Fig. 5a–d). Moreover, a VAE, which requires a sufficient amount of data to fully exert its power to learn the representation of a dataset, also becomes applicable with the data obtained by our approach (Fig. 5e, f). Large-scale analysis across various cultivars gives researchers yet another option to execute such analyses as demonstrated. This will be a breakthrough in identifying agronomically important genes, especially for molecular genetic research such as genome-wide association studies (GWAS), quantitative trait locus (QTL) analysis, and mutant screening, and will thus open a new path to identify genes that were difficult to isolate by conventional approaches.

Moreover, the application of our pipeline is not restricted to barley: it extends to various crops, such as the seeds of wheat, rice, oat, and lettuce (Fig. 6). Our results strongly suggest that our approach is applicable in principle to any variety or species, and it is thus expected to accelerate research in various fields with similarly laborious issues. One example is the characterization of, and gene isolation from, seeds of wild species. Cultivated lines possess limited genetic diversity owing to bottlenecks during domestication and breeding; therefore, many researchers turn to wild relatives as a source of genes for improving agronomic traits. Because the appearance of wild-species seeds is generally more diverse than that of cultivated varieties, developing a universal measurement method has been difficult. Another example is the analysis of undetached seeds of small florets (e.g., wheat). Although the shapes of small florets can be manually quantified from an image of a scanned spikelet, automated quantification has not been realized owing to the excess of non-seed objects (e.g., glume, awn, and rachis) in the image. Applying domain randomization to synthesize a training dataset for such images could enable a neural network to quantify seed phenotypes from them.

Collectively, we have shown the efficacy of synthetic data, based on the concept of domain randomization, for training neural networks for real-world tasks. Recent technical advances in computer vision enable the generation of realistic images, or even realistic "virtual reality" environments, and will thus provide more solutions to current image analysis challenges in the agricultural domain. We envision that collaboration between plant and computer scientists will open new perspectives for building workflows valuable for plant phenotyping, leading to a further understanding of plant biology through the full use of machine learning and deep-learning methods.


Plant materials

Barley seeds used in this research are 19 domesticated barley (Hordeum vulgare) accessions and one wild barley (H. spontaneum) accession: B669, Suez (84); C319, Chichou; C346, Shanghai 1; C656, Tibet White 4; E245, Addis Ababa 40 (12-24-84); E612, Ethiopia 36 (CI 2225); I304, Rewari; I335, Ghazvin 1 (184); I622, H.E.S. 4 (Type 12); I626, Katana 1 (182); J064, Hayakiso 2; J247, Haruna Nijo; J647, Akashinriki; K692, Eumseong Covered 3; K735, Natsudaikon Mugi; N009, Tilman Camp 1 (1398); T567, Goenen (997); U051, Archer; U353, Opal; H602, wild barley. All details of the said cultivars can be obtained from the National BioResource Project (NBRP). Meanwhile, seeds of rice (Oryza sativa, cv. Nipponbare), oat (Avena sativa, cv. Negusaredaiji), lettuce (Lactuca sativa, cv. Great Lakes), and wheat (Triticum aestivum cv. CS, Chinese Spring; N61, Norin 61; AL, Arina (ArinaLrFor))51; and Syn01, a synthetic hexaploid wheat line Ldn/KU-2076 generated by a cross between tetraploid wheat Triticum turgidum cv. Langdon and Aegilops tauschii strain KU-207652, were used in this report.

Image acquisition

All the barley seeds were threshed using a commercial table-top threshing system (BGA-RH1, OHYA TANZO SEISAKUSHO & Co., Japan). The seed images were captured on an EPSON GT-X900 A4 scanner with the supplied software, without image enhancement. Seeds were spread uniformly on the glass and scanned at 7019 × 5100 px at 600 dpi against a blue paper background. For image acquisition of rice, oat, lettuce, and wheat seeds, an overhead scanner (ScanSnap SV600, Fujitsu, Japan) was used with an image size of 3508 × 2479 px at 300 or 600 dpi.
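The scan resolution ties pixel measurements to physical units: a length in pixels is divided by the dots-per-inch and multiplied by 25.4 mm per inch. A small helper illustrating the standard conversion:

```python
def px_to_mm(pixels, dpi):
    """Convert a pixel length to millimetres given the scan resolution."""
    return pixels / dpi * 25.4

def px2_to_mm2(pixels_sq, dpi):
    """Convert a pixel area to square millimetres (the conversion is squared)."""
    return pixels_sq * (25.4 / dpi) ** 2
```

At 600 dpi, 600 px corresponds to exactly 25.4 mm, so downstream seed length and area values can be reported in physical units.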

Synthetic image generation

Single-seed images per cultivar (400 in total; 20 seed images for each of 20 cultivars) were isolated and saved as individual image files. These 400 seeds were manually annotated and used to create the non-domain-randomized training dataset used in Supplementary Fig. 3. The following describes the procedure of synthetic image generation.

First, the background regions of the seed images were removed such that the pixel values outside the seed area were set to (0, 0, 0) in RGB. As a result, 400 background-free images were prepared to constitute a "seed image pool". For the background images, four images of fixed size 1024 × 1024 were cropped from the actual background used in the seed scanning process and prepared as a "background image pool".
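Pasting a background-removed seed onto the canvas uses standard alpha compositing: each output pixel is a weighted mix of foreground and background. A per-pixel sketch of the blending rule; real code would operate on whole arrays (e.g., with NumPy or PIL's alpha_composite):

```python
def alpha_blend(fg, bg, alpha):
    """Blend one RGB foreground pixel over a background pixel.

    `alpha` runs from 0.0 (fully transparent, background kept)
    to 1.0 (opaque seed pixel).
    """
    return tuple(round(f * alpha + b * (1.0 - alpha)) for f, b in zip(fg, bg))
```

Blurring the alpha values along the seed perimeter, as described below, makes this transition gradual and suppresses cut-out artifacts.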

The synthetic image generation process is described as follows. First, an image was randomly selected from the background image pool and pasted onto a virtual canvas of size 1024 × 1024. Second, an image was randomly selected from the seed image pool and a rotation angle was randomly set. After rotation, the x and y coordinates at which the image was to be pasted were randomly determined; the coordinate values were restricted to a range, dependent on the selected seed image size and its rotation angle, such that the image did not exceed the canvas. Third, the seed image was pasted onto the canvas at the determined position. When pasting, alpha masks were generated and used for alpha blending such that the area outside the seed remained transparent and did not affect the canvas image. Moreover, using the alpha mask, the seed perimeter was Gaussian blurred to decrease artifacts resulting from the background removal of the seed image. Notably, if the canvas region where the image was to be pasted already contained a seed image, the overlapping proportion of the seed areas was calculated; if it exceeded 0.25, the pasting was canceled and another coordinate was chosen. This overlap threshold was determined empirically from the seed overlap observed in actual situations. A maximum of 70 pasting trials were performed to generate a single image.
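The rejection-sampling placement loop above can be sketched with axis-aligned boxes standing in for pixel masks; the actual pipeline computes overlap on rotated masks, and "overlap proportion" is taken here as overlap area over the smaller seed's area, which is one plausible reading. The 0.25 overlap limit and 70-trial cap follow the procedure described above.

```python
import random

def overlap_ratio(a, b):
    """Overlap area divided by the smaller box area; boxes are (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller else 0.0

def place_seeds(rng, canvas=1024, seed_size=80, max_trials=70, max_overlap=0.25):
    """Randomly place seed boxes, rejecting any placement that overlaps an
    already-placed seed by more than `max_overlap`."""
    placed = []
    for _ in range(max_trials):
        x = rng.uniform(0, canvas - seed_size)
        y = rng.uniform(0, canvas - seed_size)
        box = (x, y, x + seed_size, y + seed_size)
        if all(overlap_ratio(box, p) <= max_overlap for p in placed):
            placed.append(box)
    return placed

layout = place_seeds(random.Random(0))
```

Each accepted box corresponds to one paste operation (and one uniquely colored region in the simultaneously generated mask).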

During the synthetic image generation, a mask that has the same image size as the synthetic image was created by first creating a black canvas and coloring the seed region with unique colors based on the coordinate of the placing object. The coloring was performed when the seeds were randomly placed in the synthetic image. If a seed to be placed were overlapping an existing seed, the colors in the corresponding region in the mask image were replaced by the foreground color.

The above procedure generates images of size 1024 × 1024 with seeds randomly oriented inside the canvas region. In real-world images, however, seeds adjacent to the image border are cut off. To replicate this situation, the borders of the synthetic images were cropped to obtain the final images. The generated synthetic dataset comprises 1200 pairs of synthetic and mask images, each of size 768 × 768, which were used for neural network training.

Model training

We used a Mask R-CNN7 implementation on the Keras/Tensorflow backend ( The configuration predefined by the repository was used, including the network architectures and losses. The residual network ResNet10153 was used for feature extraction. Starting from ResNet101 weights obtained by training on the MS COCO dataset, we fine-tuned on our synthetic seed image dataset for 40 epochs using stochastic gradient descent with a learning rate of 0.001 and a batch size of 2. Of the 1200 synthetic images, 989 were used for training, 11 for validation, and 200 for the synthetic test dataset. No image augmentation was performed during training. The synthetic training data have a fixed image size of 768 × 768; however, the input size of the network is not fixed, so images of variable size can be fed at inference. The network outputs a set of bounding boxes and candidate seed mask regions, each with a probability value; a threshold of 0.5 was used to isolate the final mask regions.

Real-world test dataset for model evaluation

While the synthetic test dataset was generated according to the method described in the previous section, we also prepared a real-world test dataset consisting of 20 images, in which each image contained seeds derived from a homogeneous population (Fig. 2a). Each image had a size of 2000 × 2000. AP50, AP75, and AP@[.5:.95] per image (cultivar), as well as the mean AP over all images, were calculated. As each image contains ~100 seeds on average and all images were acquired under the same experimental conditions, we used one image per cultivar for model evaluation. Ground-truth labels of the real-world test dataset were manually annotated with Labelbox54. For reference, we also prepared 200 synthetic images for testing (the synthetic test dataset), which were not used for model training or validation.

Metrics for model evaluation

To assess the accuracy of object detection using Mask R-CNN, we used two metrics that were also used in the evaluation of the original report7. Although these are common measures in object recognition and instance segmentation, for example in the MS COCO15 and Pascal VOC55 datasets, we briefly recap them here for clarity. In the experiments, the evaluation metrics were calculated using the code in the Mask R-CNN distribution.

Recall: We first measured recall, which evaluates how well the objects (i.e., seeds) are detected and is obtained as the ratio of true-positive matches to the total number of ground-truth objects. A detection was counted as correct when the intersection-over-union (IoU) between the ground-truth and predicted bounding boxes exceeded 0.5 (Fig. 7a). In other words, for each ground-truth bounding box, a detected bounding box overlapping by more than 50% was counted as a true positive. Hereafter, we denote this recall measure as Recall50.
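The bounding-box IoU and the Recall50 computation described above can be sketched as follows (function names are illustrative, not taken from the evaluation code):

```python
def box_iou(a, b):
    """IoU of two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def recall50(gt_boxes, pred_boxes, thresh=0.5):
    """Fraction of ground-truth boxes matched by a prediction with IoU > thresh."""
    matched = 0
    used = set()
    for gt in gt_boxes:
        for i, pred in enumerate(pred_boxes):
            if i not in used and box_iou(gt, pred) > thresh:
                matched += 1
                used.add(i)  # each prediction can match only one ground truth
                break
    return matched / len(gt_boxes)
```

For example, with two ground-truth boxes and only one of them detected, `recall50` returns 0.5.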

a The intersection-over-union (IoU) definitions for bounding boxes and masks. b The average precision (AP) defined as the area under the curves (AUC), shown as the area marked with slanted lines.


Average precision (AP) using mask IoUs: The drawbacks of the recall measure are that it does not penalize false-positive detections and that it evaluates overlaps of bounding boxes, which are a poor approximation of the object shape. We therefore calculated the average precision (AP) using mask IoUs, which measures detection accuracy (in terms of both recall and precision) and also provides a rough measure of mask generation accuracy. To compute APs, we first compute the IoU between instance masks (mask IoU), as shown in Fig. 7a. AP is then obtained from the numbers of correct (true positive) and wrong (false positive) detections determined at a given mask-IoU threshold. Figure 7b summarizes the computation of AP. We sort the detected instances by class score (in our case, the confidence that the detected object is a seed) in descending order. For the nth instance, precision and recall at the mask-IoU threshold are calculated over the subset of instances from the 1st to the nth detection. Repeating this for each instance yields the precision-recall curve shown in Fig. 7b. AP is defined as a rectangle approximation of the area under this curve (AUC), shown as the area marked by slanted lines in the figure. AP thus takes values from 0.0 to 1.0 (i.e., 100%). We evaluated APs at multiple mask-IoU thresholds: AP50 and AP75 are computed at mask-IoU thresholds of 0.5 and 0.75, respectively. AP75 is a stricter measure than AP50 because it requires correct matches with more accurate instance masks. Following the MS COCO evaluation, we also measured AP@[.5:.95], the average of the APs at IoU thresholds from 0.5 to 0.95 in steps of 0.05.
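The rank-by-rank AP computation described above can be sketched as follows. `average_precision` is an illustrative name; its input is the list of true-positive flags for the detections already sorted by descending class score:

```python
def average_precision(is_true_positive, n_ground_truth):
    """AP from detections sorted by descending class score.

    `is_true_positive[i]` flags whether the i-th highest-scoring detection
    matches a ground-truth instance at the chosen mask-IoU threshold.
    """
    tp = 0
    prev_recall = 0.0
    ap = 0.0
    for n, hit in enumerate(is_true_positive, start=1):
        tp += hit
        precision = tp / n
        recall = tp / n_ground_truth
        # rectangle approximation of the area under the precision-recall curve
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

AP@[.5:.95] is then simply the mean of this value over the mask-IoU thresholds 0.5, 0.55, …, 0.95.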

Quantification of seed morphology

The main application of seed instance segmentation is to quantify seed phenotypes for analyzing and comparing morphological traits. From the mask image, morphological variables of seed shape such as area, width, and height were calculated using the measure.regionprops module of the scikit-image library. To analyze the characteristics of seeds across cultivars, principal component analysis (PCA) was applied to these variables. In the “Results” section, we also present analyses using different types of descriptors, computed by elliptic Fourier descriptors (EFD) and a variational autoencoder (VAE), both described below.
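As a rough illustration of the measured variables, area and bounding-box width/height can be obtained from a binary mask as below. This is a simplified NumPy stand-in; the actual analysis used skimage.measure.regionprops, which computes many more properties:

```python
import numpy as np

def seed_morphology(mask):
    """Area and bounding-box width/height of a binary seed mask
    (a simplified stand-in for skimage.measure.regionprops)."""
    ys, xs = np.nonzero(mask)
    area = int(mask.sum())
    height = int(ys.max() - ys.min() + 1)
    width = int(xs.max() - xs.min() + 1)
    return {"area": area, "width": width, "height": height}

mask = np.zeros((10, 10), dtype=int)
mask[2:7, 3:9] = 1  # a 5 x 6 rectangular "seed"
props = seed_morphology(mask)  # {'area': 30, 'width': 6, 'height': 5}
```

A matrix of such variables, one row per seed, is what PCA was applied to.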

Post processing: selection of isolated seeds: The instance segmentation network outputs a set of bounding boxes and candidate seed areas as mask images, in which some seeds overlap each other. To analyze seed morphology (or for further phenotyping applications), it is necessary to select seeds that are isolated (i.e., not partly hidden) from neighboring seed instances. To select such seeds, a post-processing step was introduced. First, the bounding box coordinates were checked to determine whether they reside within a 5 px margin of the image; bounding boxes protruding into the margin were removed. Second, using the solidity of each mask (the ratio of the region-of-interest area to its convex hull area) as a metric, the lower 25% quantile was used as a threshold to remove outliers. Similarly, further outliers were removed using the 5% lower and 95% upper quantiles of the length-to-width ratio. These thresholds were determined empirically during the analysis.
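The quantile-based outlier removal can be sketched as follows (a simplified stand-in for the actual pipeline; the 5 px bounding-box margin check is omitted, and the function name is illustrative):

```python
import numpy as np

def filter_isolated_seeds(solidity, aspect_ratio):
    """Keep seeds above the lower 25% solidity quantile and within the
    5-95% quantile band of the length-to-width ratio."""
    solidity = np.asarray(solidity)
    aspect_ratio = np.asarray(aspect_ratio)
    s_lo = np.quantile(solidity, 0.25)
    a_lo, a_hi = np.quantile(aspect_ratio, [0.05, 0.95])
    keep = (solidity >= s_lo) & (aspect_ratio >= a_lo) & (aspect_ratio <= a_hi)
    return keep
```

Seeds with low solidity (typically partly occluded, so their mask deviates from its convex hull) and seeds with extreme aspect ratios are dropped in one pass.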

Elliptic Fourier descriptors (EFD): EFD56 has been used to quantify the contour shape of seeds38 by approximating the contour as a set of ellipses. To compute the EFD, segmented seed images were first converted to binary mask images in which background pixels were 0 and seed pixels were 1. Next, the contour of the seed was detected with the find_contours module of the scikit-image library. The detected contours were converted to EFD coefficients using the elliptic_fourier_descriptors module of the pyefd library ( with 20 harmonics and with normalization so as to be rotation- and size-invariant. The output was flattened, converting the array from shape 4 × 20 to 80. As the first three coefficients are always (or nearly) equal to 1, 0, and 0 due to the normalization, they were discarded from further analysis, leaving a total of 77 variables used as descriptors for principal component analysis (PCA).
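The flattening and truncation of the coefficient array can be illustrated as follows; random values stand in for the pyefd output (pyefd itself is not invoked here), with the first-harmonic entries fixed as the normalization would fix them:

```python
import numpy as np

# Placeholder coefficients standing in for the normalized pyefd output:
# 20 harmonics, 4 coefficients (a_n, b_n, c_n, d_n) per harmonic.
coeffs = np.random.randn(20, 4)
coeffs[0, :3] = (1.0, 0.0, 0.0)  # fixed by the normalization

flat = coeffs.flatten()   # 80 values in total
descriptors = flat[3:]    # drop the three constant values -> 77 variables
```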

Variational autoencoder (VAE): An autoencoder (AE) is a type of neural network with an encoder–decoder architecture that embeds high-dimensional input data (e.g., images) into a low-dimensional latent vector such that the input can be correctly decoded from that vector. The variational autoencoder (VAE)48 is a variant of the AE in which the distribution in the latent space is encouraged to fit a prior distribution (e.g., a Gaussian distribution, N(0,1)). In a generative model, the low-dimensional parameters of the latent space are often used as a nonlinear approximation (i.e., dimensionality reduction) of the dataset. As with other approximation methods such as PCA, the latent parameters estimated by a VAE can be used to interpolate the data distribution; input data with different characteristics (e.g., different species) are often better separated in this space57 than with conventional methods (e.g., PCA), without using ground-truth class labels during training. We used a VAE with a CNN-based encoder–decoder network to visualize the latent space. In brief, the network receives an RGB image of shape 256 × 256 × 3. In the encoder, the input is passed through four convolution layers with 32, 64, 128, and 256 filters, respectively. Because we fit the latent space to a Gaussian distribution, the log variance and the mean of the latent variables are computed after fully connected layers. In the decoder, the encoder output is passed through four deconvolution layers with 256, 128, 64, and 32 filters, respectively. Finally, a convolution layer with three filters converts the data back to an RGB image of the same shape as the input. In our analysis, we used the two-dimensional latent space (i.e., the final output of the VAE encoder) to visualize the compressed features of the input image.
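The step that links the encoder outputs (mean and log variance) to the latent vector is the standard reparameterization trick of the VAE; a minimal NumPy sketch, independent of the actual Keras implementation:

```python
import numpy as np

def sample_latent(mu, log_var, eps=None):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    The encoder outputs the mean and log-variance of the latent
    distribution; sampling this way keeps the sampling step
    differentiable with respect to the encoder parameters.
    """
    if eps is None:
        eps = np.random.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# For a two-dimensional latent space, mu and log_var have shape (batch, 2)
z = sample_latent(np.zeros((4, 2)), np.zeros((4, 2)))
```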

Statistics and reproducibility

Numbers of barley seeds analyzed per cultivar for the evaluation of seed morphology in this study are as follows: 157, B669; 353, C319; 395, C346; 208, C656; 143, E245; 159, E612; 207, I304; 223, I335; 245, I622; 169, I626; 300, J064; 189, J247; 351, J647; 267, K692; 279, K735; 264, N009; 219, T567; 196, U051; 140, U353. R (ver. 3.5.1) was used for ANOVA and Tukey's post hoc HSD tests to evaluate the statistical differences in their morphological parameters.

Software libraries and hardware

Computational analysis in this study was performed using Python 3.6. Keras (ver. 2.2.4) with the Tensorflow (ver. 1.14.0) backend was used for deep-learning-related processes. A single GPU (GeForce GTX 1080 Ti, NVIDIA) was used for model training and inference. Each training epoch took about 186 s; at inference, an average of 3.9 s per image was required to process the real-world test dataset. OpenCV3 (ver. 3.4.2) and scikit-image (ver. 0.15.0) were used for morphological calculations on the candidate seed regions as well as basic image processing.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Synthetically generated and real-world datasets can be obtained from the following GitHub repository (

Code availability

Code to reproduce the deployment of the trained Mask R-CNN and multivariate analysis is formatted as IPython notebooks and can also be obtained from the GitHub repository ( Other data and information regarding the paper are available upon reasonable request.


  1. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (The MIT Press, 2016).
  2. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
  3. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 580–587 (IEEE, 2014).
  4. Girshick, R. Fast R-CNN. in IEEE International Conference on Computer Vision (ICCV) 1440–1448 (IEEE, 2015).
  5. Ronneberger, O., Fischer, P. & Brox, T. in Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 9351 (eds Navab, N., Hornegger, J., Wells, W. M. & Frangi, A. F.) 234–241 (Springer International Publishing, 2015).
  6. Shelhamer, E., Long, J. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2017).
  7. He, K., Gkioxari, G., Dollar, P. & Girshick, R. Mask R-CNN. in IEEE International Conference on Computer Vision (ICCV) 2980–2988 (IEEE, 2017).
  8. Milioto, A., Lottes, P. & Stachniss, C. Real-time blob-wise sugar beets vs weeds classification for monitoring fields using convolutional neural networks. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. IV-2/W3, 41–48 (2017).
  9. Mohanty, S. P., Hughes, D. P. & Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 7, 1419 (2016).
  10. Ghosal, S. et al. An explainable deep machine vision framework for plant stress phenotyping. Proc. Natl Acad. Sci. USA 115, 4613–4618 (2018).
  11. Bresilla, K. et al. Single-shot convolution neural networks for real-time fruit detection within the tree. Front. Plant Sci. 10, 611 (2019).
  12. Kamilaris, A. & Prenafeta-Boldú, F. X. Deep learning in agriculture: a survey. Comput. Electron. Agric. 147, 70–90 (2018).
  13. Kaneda, Y., Shibata, S. & Mineno, H. Multi-modal sliding window-based support vector regression for predicting plant water stress. Knowl.-Based Syst. (2017).
  14. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
  15. Lin, T.-Y. et al. in European Conference on Computer Vision (ECCV), Vol. 8693 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer International Publishing, 2014).
  16. Guo, W. et al. Aerial imagery analysis—quantifying appearance and number of sorghum heads for applications in breeding and agronomy. Front. Plant Sci. 9, 1544 (2018).
  17. Jin, X. et al. High-throughput measurements of stem characteristics to estimate ear density and above-ground biomass. Plant Phenomics 2019, 4820305 (2019).
  18. Ghosal, S. et al. A weakly supervised deep learning framework for sorghum head detection and counting. Plant Phenomics 2019, 1525874 (2019).
  19. Chandra, A. L., Desai, S. V., Balasubramanian, V. N., Ninomiya, S. & Guo, W. Active learning with point supervision for cost-effective panicle detection in cereal crops. Plant Methods 16, 34 (2020).
  20. Nath, T. et al. Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nat. Protoc. 14, 2152–2176 (2019).
  21. Varol, G. et al. Learning from synthetic humans. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4627–4635 (IEEE, 2017).
  22. Doersch, C. & Zisserman, A. Sim2real transfer learning for 3D pose estimation: motion to the rescue. in Annual Conference on Neural Information Processing Systems (NeurIPS) (eds Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. & Garnett, R.) 12949–12961 (Curran Associates, Inc., 2019).
  23. Isokane, T., Okura, F., Ide, A., Matsushita, Y. & Yagi, Y. Probabilistic plant modeling via multi-view image-to-image translation. in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2906–2915 (IEEE, 2018).
  24. Ward, D., Moghadam, P. & Hudson, N. Deep leaf segmentation using synthetic data. in BMVC 2018 Workshop on Computer Vision Problems in Plant Phenotyping (CVPPP) (2018).
  25. Goodfellow, I. J. et al. Generative adversarial networks. in Annual Conference on Neural Information Processing Systems (NeurIPS) (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 2672–2680 (Curran Associates, Inc., 2014).
  26. Shrivastava, A. et al. Learning from simulated and unsupervised images through adversarial training. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2242–2251 (IEEE, 2017).
  27. Giuffrida, M. V., Scharr, H. & Tsaftaris, S. A. ARIGAN: synthetic arabidopsis plants using generative adversarial network. in IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) 2064–2071 (IEEE, 2017).
  28. Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks. in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 4401–4410 (IEEE, 2019).
  29. Arsenovic, M., Karanovic, M., Sladojevic, S., Anderla, A. & Stefanovic, D. Solving current limitations of deep learning based approaches for plant disease detection. Symmetry 11, 939 (2019).
  30. Peng, X. B., Andrychowicz, M., Zaremba, W. & Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. in IEEE International Conference on Robotics and Automation (ICRA) 3803–3810 (IEEE, 2018).
  31. Tremblay, J. et al. Training deep networks with synthetic data: bridging the reality gap by domain randomization. in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 1082–1090 (IEEE, 2018).
  32. Kuznichov, D., Zvirin, A. & Honen, Y. Data augmentation for leaf segmentation and counting tasks in Rosette plants. in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 1–8 (IEEE, 2019).
  33. Temiño, P. R., Gómez, A. B. & Pintos, R. M. Relationships among kernel weight, early vigor, and growth in maize. Crop Sci. 39, 654–658 (1999).
  34. Elwell, A. L., Gronwall, D. S., Miller, N. D., Spalding, E. P. & Brooks, T. L. D. Separating parental environment from seed size effects on next generation growth and development in Arabidopsis. Plant Cell Environ. 34, 291–301 (2011).
  35. Sakuma, S. et al. Extreme suppression of lateral floret development by a single amino acid change in the VRS1 transcription factor. Plant Physiol. 175, 1720–1731 (2017).
  36. Song, X.-J., Huang, W., Shi, M., Zhu, M.-Z. & Lin, H.-X. A QTL for rice grain width and weight encodes a previously unknown RING-type E3 ubiquitin ligase. Nat. Genet. 39, 623–630 (2007).
  37. Weng, J. et al. Isolation and initial characterization of GW5, a major QTL associated with rice grain width and weight. Cell Res. 18, 1199–1209 (2008).
  38. Williams, K., Munkvold, J. & Sorrells, M. Comparison of digital image analysis using elliptic Fourier descriptors and major dimensions to phenotype seed shape in hexaploid wheat (Triticum aestivum L.). Euphytica 190, 99–116 (2013).
  39. Ohsawa, R., Tsutsumi, T., Uehara, H., Namai, H. & Ninomiya, S. Quantitative evaluation of common buckwheat (Fagopyrum esculentum Moench) kernel shape by elliptic Fourier descriptor. Euphytica 101, 175–183 (1998).
  40. Iwata, H., Ebana, K., Uga, Y., Hayashi, T. & Jannink, J.-L. Genome-wide association study of grain shape variation among Oryza sativa L. germplasms based on elliptic Fourier analysis. Mol. Breed. 25, 203–215 (2010).
  41. Eguchi, M. & Ninomiya, S. Evaluation of soybean seed shape by elliptic Fourier descriptors. in World Conference on Agricultural Information and IT 1047–1052 (IAALD AFITA, 2008).
  42. Ayoub, M., Symons, J., Edney, J. & Mather, E. QTLs affecting kernel size and shape in a two-rowed by six-rowed barley cross. Theor. Appl. Genet. 105, 237–247 (2002).
  43. Herridge, R. P., Day, R. C., Baldwin, S. & Macknight, R. C. Rapid analysis of seed size in Arabidopsis for mutant and QTL discovery. Plant Methods 7, 3 (2011).
  44. Tanabata, T., Shibaya, T., Hori, K., Ebana, K. & Yano, M. SmartGrain: high-throughput phenotyping software for measuring seed shape through image analysis. Plant Physiol. 160, 1871–1880 (2012).
  45. Miller, N. D. et al. A robust, high-throughput method for computing maize ear, cob, and kernel attributes automatically from images. Plant J. 89, 169–178 (2017).
  46. Wen, K. X., Xie, Z. M., Yang, L. M. & Sun, B. Q. Computer vision technology determines optimal physical parameters for sorting Jindan 73 maize seeds. Seed Sci. Technol. 43, 62–70 (2015).
  47. Li, N., Xu, R., Duan, P. & Li, Y. Control of grain size in rice. Plant Reprod. 31, 237–251 (2018).
  48. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. in International Conference on Learning Representations (ICLR) (2014).
  49. Toda, Y. & Okura, F. How convolutional neural networks diagnose plant disease. Plant Phenomics 2019, 9237136 (2019).
  50. Shomura, A. et al. Deletion in a gene associated with grain size increased yields during rice domestication. Nat. Genet. 40, 1023–1028 (2008).
  51. Singla, J. et al. Characterization of Lr75: a partial, broad-spectrum leaf rust resistance gene in wheat. Theor. Appl. Genet. 130, 1–12 (2017).
  52. Takumi, S., Nishioka, E., Morihiro, H., Kawahara, T. & Matsuoka, Y. Natural variation of morphological traits in wild wheat progenitor Aegilops tauschii Coss. Breed. Sci. 59, 579–588 (2009).
  53. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
  54. Labelbox, Inc. Labelbox: the leading training data platform. at
  55. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010).
  56. Kuhl, F. P. & Giardina, C. R. Elliptic Fourier features of a closed contour. Comput. Graph. Image Process. 18, 236–258 (1982).
  57. Doersch, C. Tutorial on variational autoencoders. Preprint at (2016).



We thank Labelbox for providing academic access for dataset labeling. We thank Ms. Yoko Tomita at Nagoya University for assistance with the labor-intensive annotation of the ground-truth test dataset. We also thank Dr. Miya Mizutani for comprehensive discussion and critical reading of the paper. The graphical abstract in Fig. 1 was rendered by Dr. Issey Takahashi of the Research Promotion Division in ITbM of Nagoya University. Dr. Shunsaku Nishiuchi provided Nipponbare rice seeds used in this study. Dr. Toshiaki Tameshige amplified and provided wheat seeds. Dr. Kentaro Shimizu amplified and provided wheat Arina seeds, and Drs. Shigeo Takumi and Yoshihiro Matsuoka established, amplified, and provided synthetic wheat Ldn/KU-2076 (Syn01) seeds. This work was supported by Japan Science and Technology Agency (JST) PRESTO [grant nos. JPMJPR17O5 (Y.T.) and JPMJPR17O3 (F.O.)], JST CREST [grant no. JPMJCR16O4 (H.T., D.S., and S.O.)], MEXT KAKENHI [nos. 16H06466 and 16H06464 (H.T.), 16KT0148 (D.S.), and 19K05975 (J.I.)], and JST ALCA [no. JPMJAL1011 (T.K.)]. All the barley materials were provided by the National BioResource Project (NBRP: Barley).

Author information


  1. Japan Science and Technology Agency, 4-1-8 Honcho, Kawaguchi, Saitama, 332-0012, Japan

    Yosuke Toda & Fumio Okura

  2. Institute of Transformative Bio-Molecules (WPI-ITbM), Nagoya University, Chikusa, Nagoya, 464-8602, Japan

    Yosuke Toda & Toshinori Kinoshita

  3. Department of Intelligent Media, Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan

    Fumio Okura

  4. Kihara Institute for Biological Research, Yokohama City University, Maioka 641-12, Totsuka, Yokohama, 244-0813, Japan

    Jun Ito & Hiroyuki Tsuji

  5. Institute of Plant Science and Resources, Okayama University, Chuo 2-20-1, Kurashiki, Okayama, 710-0046, Japan

    Satoshi Okada & Daisuke Saisho


Y.T. directed and designed the study, wrote the program codes, generated the synthetic test dataset, and performed the experiments with assistance from F.O., H.T., D.S., and K.T. H.T. and D.S. collected and scanned the barley seed images, and J.I. collected wheat images. Y.T. annotated the test dataset. Y.T., H.T., and D.S. were involved in the conceptualization of this research. Y.T., F.O., and H.T. wrote the paper with assistance from D.S., K.T., J.I., and S.O., and all coauthors verified its scientific validity.

Corresponding author

Correspondence to Yosuke Toda.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit




Image Segmentation Using Mask R-CNN

Computer Vision as a field of research has seen a lot of development in recent years. Ever since the introduction of Convolutional Neural Networks, the state of the art in domains such as classification, object detection, and image segmentation has constantly been challenged. With the aid of sophisticated hardware providing very high computational power, these neural network models are being employed in real time in emerging fields such as Autonomous Navigation.

Our topic of focus today will be a sub-field of Computer Vision known as Image Segmentation. To be more precise, we’ll be performing Instance Segmentation on an image or video. Okay, that was a lot of technical jargon. Let’s reel it back a bit and understand what those terms mean. Each of these would require a post in itself, and we won’t dive too deep into them now; I might write separate articles on each of them in the near future. For now, let’s understand what Image Segmentation is, in a simplified manner. Also, if you do not know what Object Detection is, I suggest you read up on it, as that will make the upcoming concepts easier to understand. I have also written a concise article on implementing an Object Detection algorithm; should you be interested, you can find it in my profile.

What is Image Segmentation?

Image segmentation is the process of classifying each pixel in the image as belonging to a specific category. Though there are several types of image segmentation methods, the two types of segmentation that are predominant when it comes to the domain of Deep Learning are:

  • Semantic Segmentation
  • Instance Segmentation

Let me draw a concise comparison between the two. In Semantic Segmentation every pixel of an object belonging to a particular class is given the same label/color value. On the other hand, in Instance Segmentation every pixel of each object of a class is given a separate label/color value. Take a look at the image below and read the previous sentence once again to understand it clearly. I hope it makes sense now :)
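The distinction can be made concrete with a toy example: in a semantic label map all pixels of a class share one value, while an instance label map gives each object its own id:

```python
import numpy as np

# A 1 x 6 strip of pixels containing two objects of the same class
semantic = np.array([1, 1, 0, 0, 1, 1])   # same label for every object pixel
instance = np.array([1, 1, 0, 0, 2, 2])   # a distinct label per object

# Semantic segmentation cannot tell the two objects apart...
assert len(np.unique(semantic[semantic > 0])) == 1
# ...while instance segmentation can.
assert len(np.unique(instance[instance > 0])) == 2
```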


Mask R-CNN (Mask Region-based Convolutional Neural Network) is an instance segmentation model. In this tutorial, we’ll see how to implement it in Python with the help of the OpenCV library. If you are interested in learning more about the inner workings of this model, I’ve given a few links in the reference section down below. They will help you understand the functionality of these models in great detail.

We begin by cloning (or downloading) the given repository:

Make sure you have all the listed dependencies installed in your Python environment, and don’t forget to run the setup command afterwards. We’ll be using a model pre-trained on the COCO dataset; the weights can be downloaded from here, and the class names can be obtained from the file in my repository. Now let’s start by creating a file within the cloned or downloaded repository and importing the required libraries.

We will be using our own custom class and hence will be inheriting the existing class and overriding the values of its variables. Note that you can set these values according to the capability of your GPU.
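A sketch of such a custom config is shown below. To keep the snippet self-contained, a minimal stand-in replaces the real base class (in Matterport's code you would inherit from mrcnn.config.Config, which also derives further values such as the effective batch size); the overridden values here are illustrative:

```python
class Config:
    """Minimal stand-in for mrcnn.config.Config, with a few of its
    real attribute names and default values."""
    NAME = None
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2
    NUM_CLASSES = 1
    DETECTION_MIN_CONFIDENCE = 0.7

class InferenceConfig(Config):
    # Override according to your GPU capability and the COCO class count
    NAME = "coco_inference"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1          # run inference one image at a time
    NUM_CLASSES = 1 + 80        # background + 80 COCO classes
    DETECTION_MIN_CONFIDENCE = 0.6

config = InferenceConfig()
```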

Now we’ll create a function that takes care of reading the class labels as well as initializing the model object according to our custom config. We also specify a color-to-object mapping so that semantic segmentation can be implemented too (if required). We’ll talk about this in a bit. The mode of our model should be set to ‘inference’ as we are going to use it for testing directly. The path to the pre-trained weights is supplied so that they can be loaded into the model.

The next function will be the most crucial part of this tutorial. Don’t be scared by its length; it’s the simplest one yet! We now pass the test image to the detect function of our model object, which performs all the detection and segmentation of objects for us. We then have two options: we can either assign every object of a class the same color, or assign every object, irrespective of its class, a distinct color. Since this post is focused on implementing instance segmentation, I chose the latter. I have given you the flexibility to do both so that you can strengthen your understanding of the same. This can be done by toggling a parameter.
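The per-instance versus per-class coloring can be sketched like this. It's a simplified stand-in for the actual function, with an illustrative parameter name (instance_colors) for the toggle:

```python
import numpy as np

def colorize(masks, class_ids, instance_colors=True):
    """Build an RGB overlay from boolean instance masks.

    `masks` has shape (H, W, N), as returned by Mask R-CNN's detect();
    instance_colors=True gives every object its own color (instance
    segmentation), False reuses one color per class (semantic-style).
    """
    rng = np.random.default_rng(0)
    h, w, n = masks.shape
    class_palette = {c: rng.integers(0, 255, 3) for c in set(class_ids)}
    overlay = np.zeros((h, w, 3), dtype=np.uint8)
    for i in range(n):
        color = (rng.integers(0, 255, 3) if instance_colors
                 else class_palette[class_ids[i]])
        overlay[masks[:, :, i]] = color  # paint this instance's pixels
    return overlay
```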

The code above sketches the bounding box of the object while also segmenting it at the pixel level. It also provides the class that object belongs to along with the score. I’ve also given you the option to visualize the output using the internal visualization function present in Matterport’s MRCNN implementation. This can be accessed using the boolean parameter shown above.

Now that we have our model and the code to process the output which it produces, we only need to read and pass in the input image. This is done by the function given down below. The parameter allows us to save the processed image with the objects segmented.

This can now also be easily extended to video and live web-cam feed as shown by the function here.

As you can see in the image above, every object is segmented. The model detects multiple instances of the class book and hence assigns each instance a separate color.

That is it guys! You now have a working instance segmentation pipeline for you to put to use. The entire code for this project as well as a clean and easy to use interface can be found in my repository given here. Additional steps to use the same can also be found there. I do realize that some of you might not have a CUDA compatible GPU or rather no GPU at all. So, I have also provided a Colab notebook that can be used to run the code. Integrating web-cam usage in Colab still eludes me and hence inference can only be done on images and video files for now. Matterport’s Mask R-CNN code supports Tensorflow 1.x by default. If you are using TF2.x, you are better off forking/cloning my repository directly as I have ported the code to support TF2.x.

I suggest that you read up on the R-CNN architectures (especially Faster R-CNN) to completely understand the working of Mask R-CNN. I’ve given links to some good resources in the references section down below. Also, having a sound understanding of how Object Detection works would make it easy to comprehend the crux of Image Segmentation.

~~SowmiyaNarayanan G

Mind Bytes:-

“The secret of getting ahead is getting started.” — Mark Twain.


Feel free to reach out to me if you have any doubts, I am here to help. I am open to any sorts of criticism to improve my work so that I can better cater to the needs of explorers such as yourself in the future. Don’t hesitate to reach out to let me know what you think. You can also connect with me on LinkedIn.


