I Introduction
With the development of 3D sensing technologies, RGBD data with spatial information (depth, 3D coordinates) has become easily accessible. As a result, RGBD semantic segmentation for high-level scene understanding becomes increasingly important, benefiting a wide range of applications such as autonomous driving [icnet], SLAM [bescos2018dynaslam], and robotics. Owing to the effectiveness of Convolutional Neural Networks (CNNs) and the additional spatial information, recent advances demonstrate enhanced performance on indoor scene segmentation tasks [fcn, deeplab]. Nevertheless, a significant challenge remains due to the complexity of indoor environments and the extra effort required to handle spatial data, especially for applications that require real-time inference.
A common approach treats 3D spatial information as an additional input of the network to extract features, and then combines them with the features of RGB images [rdfnet, fcn, eigen2015predicting, ma2017multi, fusenet, wang2016learning], as shown in Fig. 1 (a). This approach achieves promising results at the cost of significantly increasing the number of parameters and the computational time, thus being unsuitable for real-time tasks. Meanwhile, several works [fcn, fusenet, gupta2014learning, lstmcf, rdfnet] encode raw spatial information into three channels (HHA) composed of horizontal disparity, height above ground, and norm angle. However, the conversion from raw data to HHA is also time-consuming [fusenet].
It is worth noting that indoor scenes have more complex spatial relations than outdoor scenes. This requires a stronger adaptive ability of the network to deal with geometric transformations. However, due to the fixed structure of the convolution kernel, the 2D convolution in the aforementioned methods cannot inherently adapt to spatial transformations or adjust its receptive field, limiting the accuracy of semantic segmentation. Although this can be alleviated by revised pooling operations and prior data augmentation [deform, deformablev2], a better spatially adaptive sampling mechanism for conducting convolution is still desirable.
Moreover, the color and texture of objects in indoor scenes are not always representative. Instead, the geometric structure often plays a vital role in semantic segmentation. For example, to distinguish a fridge from a wall, the geometric structure is the primary cue due to their similar texture. However, such spatial information is ignored by 2D convolution on RGB data. Depth-aware convolution [dcnn] has been proposed to address this problem. It forces pixels with depths similar to that of the kernel center to have higher weights than others. Nevertheless, this prior is handcrafted and may lead to suboptimal results.
It can be seen that there is a contradiction between the fixed structure of 2D convolution and the varying spatial transformations, along with the efficiency bottleneck of separately processing RGB and spatial data. To overcome the limitations mentioned above, we propose a novel operation, called Spatial information guided Convolution (SConv), which adaptively changes according to the spatial information (see Fig. 1 (b)).
Specifically, this operation can generate convolution kernels with different sampling distributions adapted to the spatial information, boosting the spatial adaptability and the receptive field regulation of the network. Furthermore, SConv establishes a link between the convolution weights and the underlying spatial relationship of the corresponding pixels, incorporating geometric information into the convolution weights to better capture the spatial structure of the scene.
The proposed SConv is lightweight yet flexible and achieves significant performance improvements with only a few additional parameters and little extra computation, making it suitable for real-time applications. We conduct extensive experiments to demonstrate the effectiveness and efficiency of SConv. We first design an ablation study and compare SConv with deformable convolution [deform, deformablev2] and depth-aware convolution [dcnn], exhibiting the advantages of SConv. We also verify the applicability of SConv to spatial transformations by testing its influence with different types of spatial data: depth, HHA, and 3D coordinates. We demonstrate that spatial information is more suitable for generating offsets than the RGB features used by deformable convolution [deform, deformablev2]. Finally, benefiting from the adaptability to spatial transformations and the effectiveness of perceiving spatial structure, our network equipped with SConv, named Spatial information Guided convolutional Network (SGNet), achieves high-quality results with real-time inference on the NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets.
We highlight our contributions as follows:

We propose a novel SConv operator that can adaptively adjust its receptive field while effectively adapting to spatial transformations, and can perceive intricate geometric patterns at low cost.

Based on SConv, we propose a new network, SGNet, that achieves competitive RGBD segmentation performance in real time on the NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets.
II Related Work
II-A Semantic Segmentation
Recent advances in semantic segmentation benefit greatly from the development of convolutional neural networks (CNNs) [imagenet, deep]. FCN [fcn] is the pioneer of leveraging CNNs for semantic segmentation. It leads to convincing results and serves as the basic framework for many tasks. With the research efforts in the field, recent methods can be classified into two categories according to the network architecture: atrous convolution-based methods [deeplab, multi] and encoder-decoder based methods [refinenet, deeplabv3plus, segnet, deconvnet].
Atrous convolution:
The standard approach relies on strided convolutions or pooling in the CNN backbone to enlarge the receptive field. However, this reduces the resolution of the resulting feature map [deeplabv3], and many details are lost. One line of work exploits atrous convolution to alleviate this conflict by enlarging the receptive field while keeping the resolution of the feature map [deeplabv3, deeplab, deeplabv3plus, multi, denseaspp]. We use an atrous convolution-based backbone in the proposed SGNet.
Encoder-decoder architecture: The other line of work utilizes the encoder-decoder structure [deconvnet, segnet, refinenet, deeplabv3plus, psp], which learns a decoder to gradually recover the prediction details. DeconvNet [deconvnet] employs a series of deconvolutional layers to produce a high-resolution prediction. SegNet [segnet] achieves better results by using the pooling indices of the encoder to guide the recovery process in the decoder. RefineNet [refinenet] fuses low-level encoder features into the decoder to refine the prediction. While this line of methods can achieve more precise results, it requires longer inference time.
II-B RGBD Semantic Segmentation
How to effectively use the extra geometric information (depth, 3D coordinates) is the key to RGBD semantic segmentation. A number of works focus on extracting more information from geometry, which is treated as an additional input in [eigen2015predicting, ma2017multi, fusenet, wang2016learning, hu2019acnet]. A two-stream network is used in [ma2017multi, fusenet, wang2016learning, lstmcf, rdfnet] to process the RGB image and the geometry information separately and combine the two results in the last layer. These methods achieve promising results at the expense of doubling the parameters and computational cost. 3D CNNs or 3D KNN graph networks are also used to take geometry information into account [song2017semantic, song2016deep, qi20173d]. Besides, various deep learning methods on 3D point clouds [pointnet, pointnet++, chen2019lsanet, spidercnn, spectral_graph_conv, pointcnn] have been explored. However, these methods cost a lot of memory and are computationally expensive. Another stream incorporates geometric information into explicit operations. Cheng et al. [local] use geometry information to build a feature affinity matrix acting in average pooling and unpooling. Lin et al. [cascaded] split the image into different branches based on geometry information. Wang et al. [dcnn] propose depth-aware CNN, which adds a depth prior to the convolution weights. Although it improves feature extraction by convolution, the prior is handcrafted rather than learned from data. Other approaches, such as multi-task learning [jiao2019geometry, wang2015towards, hoffman2016learning, kokkinos2017ubernet, eigen2015predicting, Zhang_2019_CVPR] or spatio-temporal analysis [he2017std2p], are further used to improve segmentation accuracy. The proposed SConv aims to efficiently utilize spatial information to improve the feature extraction ability. It can significantly enhance performance with high efficiency, using only a small number of additional parameters.
II-C Dynamic Structure in CNN
Using dynamic structures to deal with the varying input of a CNN has also been explored. Dilated convolution is used in [multi, deeplab] to increase the receptive field size without reducing the feature map resolution. The spatial transformer network [stn] adapts to spatial transformations by warping the feature map. Dynamic filter [dynamicfilter] adaptively changes its weights according to the input. Besides, self-attention based methods [selective, nonlocal, senet] generate attention maps from the intermediate feature map to adjust the response at each location or to capture long-range contextual information adaptively. Some generalizations of convolution from 2D images to 3D point clouds have also been presented. PointCNN [pointcnn] is a seminal work that enables CNNs on sets of unordered 3D points. There are other improvements [chen2019lsanet, spidercnn, spectral_graph_conv] on utilizing neural networks to effectively extract deep features from 3D point sets. Deformable convolution [deform, deformablev2] can generate different sampling distributions with adaptive weights. Nevertheless, its input is an intermediate feature map rather than spatial information. Our work experimentally verifies in Sec. IV that better results can be obtained based on spatial information.
III SConv and SGNet
In this section, we first elaborate on the details of Spatial information guided Convolution (SConv), which is a generalization of conventional RGBbased convolution by involving spatial information in the RGBD scenario. Then, we discuss the relation between our SConv and other approaches. Finally, we describe the network architecture of Spatial information Guided convolutional Network (SGNet), which is equipped with SConv for RGBD semantic segmentation.
III-A Spatial Information Guided Convolution
For completeness, we first review the conventional convolution operation. We use $\mathbf{X}_{c,(m,n)}$ to denote a tensor, where $c$ is the index corresponding to the first dimension, and $(m,n)$ indicates the two indices for the second and third dimensions. Non-scalar values are highlighted in bold for convenience. For an input feature map $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$, we describe it in 2D for simplicity, thus $\mathbf{X} \in \mathbb{R}^{H \times W}$. Note that it is straightforward to extend to the 3D case. The conventional convolution applied on $\mathbf{X}$ to get $\mathbf{Y}$ can be formulated as follows:

$$\mathbf{Y}_{(m,n)} = \sum_{(i,j) \in \mathcal{R}} \mathbf{W}_{(i,j)} \cdot \mathbf{X}_{(m+i,\, n+j)}, \tag{1}$$

where $\mathbf{W}$ is the convolution kernel with size $K \times K$, $(m,n)$ is the 2D convolution center, and $\mathcal{R}$ denotes the kernel distribution around $(m,n)$. For a $3 \times 3$ convolution:

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}. \tag{2}$$

From the above equations, we can see that the convolution kernel is constant over $(m,n)$. In other words, $\mathbf{W}$ and $\mathcal{R}$ are fixed, meaning the convolution is content-agnostic.
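To make the notation concrete, the following minimal sketch (single channel, no padding, plain PyTorch; for illustration only, not how convolution is implemented in practice) evaluates Eq. (1) with the fixed 3×3 grid of Eq. (2) and checks it against the built-in operator:

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 kernel distribution R of Eq. (2): offsets (i, j) around the center (m, n).
R = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]

def conv2d_naive(X, W):
    """Eq. (1): Y[m, n] = sum over (i, j) in R of W[i, j] * X[m + i, n + j] (no padding)."""
    H, Wd = X.shape
    Y = torch.zeros(H - 2, Wd - 2)
    for m in range(1, H - 1):
        for n in range(1, Wd - 1):
            Y[m - 1, n - 1] = sum(W[i + 1, j + 1] * X[m + i, n + j] for i, j in R)
    return Y

X = torch.randn(8, 8)   # single-channel feature map
W = torch.randn(3, 3)   # kernel weights, shared for every center (m, n)
ref = F.conv2d(X[None, None], W[None, None])[0, 0]
print(torch.allclose(conv2d_naive(X, W), ref, atol=1e-5))  # True
```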
In the RGBD context, we want to involve 3D spatial information efficiently by using adaptive convolution kernels.
We first generate the offsets according to the spatial information, and then use the spatial information corresponding to the given offsets to generate new, spatially adaptive weights. Our SConv requires two inputs. One is the feature map $\mathbf{X}$, the same as in conventional convolution. The other is the spatial information $\mathbf{S} \in \mathbb{R}^{C_s \times H \times W}$. In practice, $\mathbf{S}$ can be HHA ($C_s = 3$), 3D coordinates ($C_s = 3$), or depth ($C_s = 1$). The method of encoding depth into 3D coordinates and HHA is the same as in [qi20173d]. Note that the input spatial information is not included in the feature map.
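For readers who want to reproduce the 3D-coordinate input, below is a minimal sketch of the standard pinhole back-projection that lifts a depth map to per-pixel 3D coordinates. The intrinsics fx, fy, cx, cy are placeholder values; this mirrors the usual depth-to-XYZ encoding rather than the exact preprocessing of [qi20173d].

```python
import torch

def depth_to_xyz(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) to per-pixel 3D coordinates (3, H, W)
    with the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return torch.stack((x, y, depth), dim=0)  # usable as S with C_s = 3

depth = torch.rand(480, 640) * 5.0                                  # dummy depth in meters
xyz = depth_to_xyz(depth, fx=519.0, fy=519.0, cx=320.0, cy=240.0)   # placeholder intrinsics
print(xyz.shape)  # torch.Size([3, 480, 640])
```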
As the first step of SConv, we project the input spatial information into a high-dimensional feature space, which can be expressed as:

$$\mathbf{S}' = \mathcal{T}(\mathbf{S}), \tag{3}$$

where $\mathcal{T}$ is a spatial transformation function and $\mathbf{S}' \in \mathbb{R}^{C' \times H \times W}$, which has a higher dimension than $\mathbf{S}$ (i.e., $C' > C_s$).
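A possible instantiation of the spatial transformation $\mathcal{T}$ is sketched below, written to match the Conv(3, 64) → Conv(64, 64) → Conv(64, 64) description given later in Sec. III-C; the 3×3 kernel size is our assumption.

```python
import torch
from torch import nn

# Sketch of the spatial projection T in Eq. (3): lift S (here 3 channels, e.g. 3D
# coordinates) into a higher-dimensional space S'. Channel sizes follow the
# Conv(3, 64) -> Conv(64, 64) -> Conv(64, 64) description in Sec. III-C.
spatial_projection = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

S = torch.randn(1, 3, 60, 80)      # spatial information (e.g., 3D coordinates)
S_prime = spatial_projection(S)    # (1, 64, 60, 80): higher-dimensional than S
print(S_prime.shape)
```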
Then, we take the transformed spatial information into consideration, perceive its geometric structure, and generate the distribution (offsets of the pixel coordinates along the $x$ and $y$ axes) of the convolution kernel at each location $(m,n)$. This process can be expressed as:

$$\boldsymbol{\Delta} = \mathcal{G}(\mathbf{S}'), \tag{4}$$

where $\boldsymbol{\Delta} \in \mathbb{R}^{2K^2 \times H' \times W'}$, $H'$ and $W'$ represent the feature map size after convolution, and $K \times K$ is the kernel size. $\mathcal{G}$ is a nonlinear function which can be implemented by a series of convolutions.
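The offset generator $\mathcal{G}$ of Eq. (4) can be sketched as a single convolution (as stated in Sec. III-C) that outputs $2K^2$ channels, i.e. one (x, y) offset per kernel sampling position; the 3×3 kernel and the zero initialization (starting from the regular grid) are our assumptions.

```python
import torch
from torch import nn

K, stride = 3, 1   # kernel size and stride of the convolution being guided

# Sketch of the offset generator G in Eq. (4): a single convolution mapping S' to
# 2*K*K channels, an (x, y) offset for each of the K*K sampling positions at every
# output location (H', W').
offset_generator = nn.Conv2d(64, 2 * K * K, kernel_size=3, padding=1, stride=stride)
nn.init.zeros_(offset_generator.weight)   # start from the regular grid (common practice)
nn.init.zeros_(offset_generator.bias)

S_prime = torch.randn(1, 64, 60, 80)
offsets = offset_generator(S_prime)       # (1, 18, 60, 80)
print(offsets.shape)
```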
After generating the kernel distribution for each location $(m,n)$ using $\mathcal{G}$, we boost the feature extraction ability by establishing a link between the geometric structure and the convolution weights. More specifically, we sample the geometric information of the pixels corresponding to the convolution kernel after shifting:

$$\tilde{\mathbf{S}}_{(m,n)} = \big\{ \mathbf{S}'_{(m+i+\Delta^x_{(m,n),(i,j)},\; n+j+\Delta^y_{(m,n),(i,j)})} \;\big|\; (i,j) \in \mathcal{R} \big\}, \tag{5}$$

where $\boldsymbol{\Delta}_{(m,n)}$ is the spatial distribution of the convolution kernel at $(m,n)$, and $\tilde{\mathbf{S}}_{(m,n)}$ is the spatial information corresponding to the receptive field of the convolution kernel centered on $(m,n)$ after the transformation.
Finally, we generate the convolution weights according to the sampled spatial information as follows:

$$\hat{\mathbf{W}}_{(m,n)} = \mathcal{H}\big(\tilde{\mathbf{S}}_{(m,n)}\big) \odot \mathbf{W}, \tag{6}$$

where $\mathcal{H}$ is a nonlinear function that can be implemented as a series of convolution layers with nonlinear activation functions, $\mathbf{W}$ indicates the convolution weights, which can be updated by the gradient descent algorithm, and $\hat{\mathbf{W}}_{(m,n)}$ denotes the spatially adaptive weights for the convolution centered at $(m,n)$. Overall, our generalized SConv is formulated as:

$$\mathbf{Y}_{(m,n)} = \sum_{(i,j) \in \mathcal{R}} \hat{\mathbf{W}}_{(m,n),(i,j)} \cdot \mathbf{X}_{(m+i+\Delta^x_{(m,n),(i,j)},\; n+j+\Delta^y_{(m,n),(i,j)})}. \tag{7}$$
We can see that $\mathcal{H}$ establishes the correlation between the spatial information and the convolution weights. Moreover, the convolution kernel distribution is also related to the spatial information through $\mathcal{G}$. Note that $\hat{\mathbf{W}}_{(m,n)}$ and $\boldsymbol{\Delta}_{(m,n)}$ are not constant, meaning the generalized convolution is adaptive to different $(m,n)$. Also, as $\boldsymbol{\Delta}$ is typically fractional, we use bilinear interpolation to compute the sampled values as in [deform, stn]. The main formulae discussed above are labeled in Fig. 2.
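Putting Eqs. (3)–(7) together, the sketch below builds an SConv-like layer on top of torchvision's deform_conv2d, which already performs the bilinear sampling at fractional offsets. The offsets come from the projected spatial information, and the per-position modulation passed as mask plays the role of the spatially adaptive weights of Eq. (6), since modulating the sampled inputs is equivalent to scaling the kernel weights per sampling position. For brevity the adaptive weights are computed directly from S' rather than from the offset-sampled spatial features of Eq. (5); the layer sizes, the sigmoid, and this simplification are our assumptions, not the authors' exact implementation.

```python
import torch
from torch import nn
from torchvision.ops import deform_conv2d

class SConvSketch(nn.Module):
    """Hedged sketch of an SConv-like layer (Eqs. (3)-(7)) built on deform_conv2d.

    project  : Eq. (3), spatial projection T
    offset   : Eq. (4), kernel distribution from spatial information
    weight_g : Eqs. (5)-(6), spatially adaptive weights, passed as the modulation mask
               (simplified: computed from S' directly, not from the offset-sampled S').
    """

    def __init__(self, in_ch, out_ch, spatial_ch=3, k=3, stride=1):
        super().__init__()
        self.k, self.stride = k, stride
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))  # learnable W
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        self.project = nn.Sequential(                                 # Eq. (3)
            nn.Conv2d(spatial_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.offset = nn.Conv2d(64, 2 * k * k, 3, padding=1, stride=stride)  # Eq. (4)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)
        self.weight_g = nn.Sequential(                                # Eq. (6), simplified
            nn.Conv2d(64, 64, 3, padding=1, stride=stride), nn.ReLU(inplace=True),
            nn.Conv2d(64, k * k, 3, padding=1),
        )

    def forward(self, x, s):
        s = self.project(s)                      # S' of Eq. (3)
        off = self.offset(s)                     # offsets of Eq. (4)
        a = torch.sigmoid(self.weight_g(s))      # spatially adaptive weights
        # deform_conv2d does the bilinear sampling at fractional offsets; the mask
        # modulates the sampled inputs, equivalent to scaling W per sampling position.
        return deform_conv2d(x, off, self.weight, mask=a,
                             stride=self.stride, padding=self.k // 2)

x = torch.randn(1, 256, 60, 80)   # intermediate RGB feature map
s = torch.randn(1, 3, 60, 80)     # spatial information (e.g., 3D coordinates)
print(SConvSketch(256, 256)(x, s).shape)  # torch.Size([1, 256, 60, 80])
```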
III-B Relation to Other Approaches
2D convolution is a special case of the proposed SConv without geometry information and the corresponding offsets. Specifically, without geometry information, the center point and its neighboring points have a fixed positional relation in image space, and there is no varying spatial relation to capture. This also shows that our SConv is fully compatible with the 2D case. For the RGBD case, by introducing spatially adaptive weights our SConv can extract features at the point level and is not limited to the discrete grid, as shown in Fig. 3. Deformable convolution [deform, deformablev2] also alleviates this problem by generating different sampling distributions with adaptive weights. Nevertheless, its distributions are inferred from 2D feature maps instead of 3D spatial information as in our case. We verify through experiments that our method achieves better results than deformable convolution [deform, deformablev2]. Compared with 3D KNN graph-based methods, our SConv selects neighboring pixels adaptively instead of relying on a KNN graph, which is inflexible and computationally expensive.
III-C SGNet Architecture
Our semantic segmentation network, called SGNet, is equipped with SConv and consists of a backbone and a decoder. The structure of SGNet is illustrated in Fig. 4. We use ResNet101 [resnet] as our backbone and replace the first and the last two conventional convolutions (3×3 filter) of each layer with our SConv. We add a series of convolutions to further extract features and then use bilinear upsampling to produce the final segmentation probability map, which corresponds to the decoder part of SGNet. The $\mathcal{T}$ in Equ. (3) is implemented as three convolution layers, i.e., Conv(3, 64), Conv(64, 64), and Conv(64, 64), with nonlinear activation functions. The $\mathcal{G}$ in Equ. (4) and the $\mathcal{H}$ in Equ. (6) are implemented as a single convolution layer and two convolution layers, respectively. The SConv implementation is modified from deformable convolution [deformablev2, deform]. We add deep supervision between layer 3 and layer 4 to improve the network optimization capability, which is the same as PSPNet [psp].
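To make the placement concrete, the sketch below lists the 3×3 convolutions that would be swapped in a torchvision ResNet101, reading "the first and the last two convolutions of each layer" as the conv2 of the first and last two bottleneck blocks (matching the layer3_0 / layer3_21 / layer3_22 naming of Tab. I); this reading is our assumption, SConvSketch refers to the hypothetical module sketched at the end of Sec. III-A, and the plumbing of the spatial input through the backbone is omitted.

```python
import torchvision

# Enumerate the 3x3 convolutions (conv2) of the first and last two bottleneck blocks
# of each ResNet layer; these are the candidates for replacement with SConv.
backbone = torchvision.models.resnet101()
for name in ("layer1", "layer2", "layer3", "layer4"):
    layer = getattr(backbone, name)
    for idx in sorted({0, len(layer) - 2, len(layer) - 1}):
        conv = layer[idx].conv2
        print(f"replace {name}[{idx}].conv2: {conv.in_channels} -> {conv.out_channels}, "
              f"stride {conv.stride}")
        # layer[idx].conv2 = SConvSketch(conv.in_channels, conv.out_channels,
        #                                stride=conv.stride[0])  # hypothetical swap
```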
IV Experiments
In this section, we first validate the performance of SConv by analyzing its usage in different layers, conducting an ablation study and comparisons with its alternatives, and evaluating the results of using different input information to generate offsets. Then we compare our SGNet equipped with SConv with other state-of-the-art semantic segmentation methods on the NYUDv2 and SUNRGBD datasets. Finally, we visualize the depth-adaptive receptive field in each layer and the segmentation results, demonstrating that the proposed SConv can well exploit spatial information.
SConv  layer3_0  layer3_1  layer3_2  layer3_20  layer3_21  layer3_22  other layers  mIoU(%)  param(M)  FPS 

43.0  56.8  37  
✓  47.0  56.9  37  
Baseline  ✓  ✓  ✓  46.6  57.2  36  
(ResNet101)  ✓  ✓  ✓  46.5  57.2  36  
✓  ✓  ✓  47.8  57.2  36  
✓  ✓  ✓  ✓  49.0  58.3  28 
Datasets and metrics: We evaluate the performance of the SConv operator and the SGNet segmentation method on two public datasets:

NYUDv2 [nyud]: This dataset has 1,449 RGB images with corresponding depth maps and pixel-wise labels. 795 images are used for training, while 654 images are used for testing, following [split]. The 40-class setting is used for experiments.

SUNRGBD [sunrgbd, sunrgbd2]: This dataset contains 10,335 RGBD images with semantic labels organized into 37 categories. 5,285 images are used for training, and 5,050 images are used for testing.
We use three common metrics for evaluation, including pixel accuracy (Acc), mean accuracy (mAcc), and mean intersection over union (mIoU). The three metrics are defined as follows:

$$\text{Acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}, \quad \text{mAcc} = \frac{1}{N} \sum_i \frac{n_{ii}}{t_i}, \quad \text{mIoU} = \frac{1}{N} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}, \tag{8}$$

where $n_{ij}$ is the number of pixels which are predicted as class $j$ with ground truth $i$, $N$ is the number of classes, and $t_i = \sum_j n_{ij}$ is the number of pixels whose ground truth class is $i$. The depth map is used as the default format of spatial information unless specified otherwise.
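For reference, the metrics of Eq. (8) can be computed from a confusion matrix as in the following small sketch (a toy example; classes absent from the ground truth are not treated specially):

```python
import torch

def segmentation_metrics(conf):
    """Acc, mAcc and mIoU of Eq. (8) from a confusion matrix, where conf[i, j] is the
    number of pixels of ground-truth class i predicted as class j."""
    conf = conf.double()
    tp = conf.diag()                 # n_ii
    gt = conf.sum(dim=1)             # t_i: pixels whose ground truth is class i
    pred = conf.sum(dim=0)           # pixels predicted as class i
    acc = tp.sum() / gt.sum()
    macc = (tp / gt.clamp(min=1)).mean()
    miou = (tp / (gt + pred - tp).clamp(min=1)).mean()
    return acc.item(), macc.item(), miou.item()

# Toy confusion matrix for N = 3 classes.
conf = torch.tensor([[50, 2, 3],
                     [4, 40, 6],
                     [1, 5, 30]])
print(segmentation_metrics(conf))  # (0.851..., 0.847..., 0.733...)
```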
Implementation details: We use dilated ResNet101 [resnet] pretrained on ImageNet [imagenetc] as the backbone network for feature extraction, following [deeplab], and the output stride is 16 by default. The whole system is implemented in PyTorch. The SGD optimizer is adopted for training with the "poly" learning rate policy used in [deeplab, deeplabv3plus], i.e., $lr = lr_{\text{init}} \cdot (1 - \frac{iter}{iter_{\max}})^{0.9}$, where the initial learning rate is 5e-3 for NYUDv2 and 1e-3 for SUNRGBD, and the weight decay is 5e-4. This learning policy updates the learning rate every 40 epochs for NYUDv2 and every 10 epochs for SUNRGBD. We use the ReLU activation function, and the batch size is 8. Following [rdfnet], we employ general data augmentation strategies, including random scaling, random cropping, and random flipping. During testing, we downsample the image to the training crop size, and the prediction map is upsampled to the original size. We use the cross-entropy loss on both datasets, and re-weight the training loss of each class on SUNRGBD [jiang2018rednet] due to its extremely unbalanced label distribution. We train the network for 500 epochs on the NYUDv2 dataset and 200 epochs on the SUNRGBD dataset on two NVIDIA 1080Ti GPUs.
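As a rough sketch of the training schedule, the snippet below applies the "poly" policy with an assumed power of 0.9 and refreshes the learning rate only every 40 epochs, which is our reading of the update granularity stated above; it is illustrative rather than the authors' exact training script.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# "Poly" policy of [deeplab, deeplabv3plus]: lr = lr_init * (1 - e / max_epoch) ** power,
# with power = 0.9 assumed, refreshed every `step` epochs (40 for NYUDv2, 10 for SUNRGBD).
max_epoch, step, power = 500, 40, 0.9
model = nn.Conv2d(3, 3, 3)                       # stand-in for SGNet
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, weight_decay=5e-4)
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda e: (1 - (e // step) * step / max_epoch) ** power)

for epoch in range(max_epoch):
    # ... one pass over the training set with optimizer.step() per batch ...
    scheduler.step()
print(optimizer.param_groups[0]["lr"])           # learning rate after the last epoch
```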
IV-A Analysis of SConv
We design ablation studies on the NYUDv2 [nyud] dataset. ResNet101 with a simple decoder and deep supervision is used as the baseline.
Replace convolution with SConv: We evaluate the effectiveness of SConv by replacing the conventional convolution (3×3 filter) in different layers. We first replace convolutions in layer 3 and then extend the observed rules to other layers. The FPS (frames per second) is tested on an NVIDIA 1080Ti with the input image size following [dcnn]. The results are shown in Tab. I.
Model  Acc.  mAcc.  mIoU 

Baseline  72.1  54.6  43.0 
Baseline+OG  73.9  58.2  46.3 
Baseline+SP+OG  75.2  60.0  48.4 
Baseline+SP+WG  74.5  58.4  46.8 
Baseline+SP+OG+WG  75.5  60.9  49.0 
Model  Acc.  mAcc.  mIoU. 

Baseline  72.1  54.6  43.0 
Baseline+DCV2  73.0  56.1  44.5 
Baseline+HHANet  73.5  56.8  45.4 
Baseline+DAC  73.8  57.1  45.4 
Baseline+SP+WG  74.5  58.4  46.8 
Baseline+SConv(SGNet)  75.5  60.9  49.0 
Information  Acc.  mAcc.  mIoU. 

Depth  75.5  60.9  49.0 
RGB Feature  73.9  58.5  46.4 
HHA  75.7  60.8  48.9 
3D coordinates  75.3  61.2  48.5 
Network  Backbone  MS  SI  Acc.  mAcc.  mIoU.  fps  param (M) 

FCN [fcn]  2VGG16  HHA  65.4  46.1  34.0  8  272.2  
LSDGF [local]  2VGG16  HHA  71.9  60.7  45.9      
RefineNet [refinenet]  ResNet152  ✓    73.6  58.9  46.5  16  129.5 
RDFNet [rdfnet]  2ResNet152  ✓  HHA  76.0  62.8  50.1  9  200.1 
RDFNet [rdfnet]  2ResNet101  ✓  HHA  75.6  62.2  49.1  11  169.1 
CFNet [cascaded]  2ResNet152  ✓  HHA      47.7     
3DGNN [qi20173d]  VGG16  HHA    55.2  42.0  5  47.2  
DCNN [dcnn]  2VGG16  HHA    56.3  43.9  13  92.0  
DCNN [dcnn]  VGG16  Depth    53.6  41.0  26  47.0  
DCNN [dcnn]  2ResNet152  Depth    61.1  48.4      
ACNet [hu2019acnet]  2ResNet50  Depth      48.3  18  116.6  
SGNet  ResNet101  depth  75.5  60.9  49.0  28  58.3  
SGNet8s  ResNet101  depth  76.4  62.7  50.3  12  58.3  
SGNet  ResNet101  ✓  depth  76.4  62.1  50.3  28  58.3 
SGNet8s  ResNet101  ✓  depth  76.8  63.1  51.0  12  58.3 
We can draw the following two conclusions from the results in Tab. I. 1) The inference speed of the baseline network is fast, but its performance is poor. Replacing convolution with SConv improves the results of the baseline network with only slightly more parameters and computational time. 2) Apart from the first convolution in layer 3, whose stride is 2, replacing the later convolutions is more effective. The main reason is likely that spatial information can better guide the downsampling operation in the first convolution. Thus we choose to replace the first convolution and the last two convolutions of each layer with SConv. We generalize the rules found in layer 3 to other layers and achieve better results. The above experiments show that our SConv can significantly improve network performance with only a few additional parameters. It is worth noting that our network has no separate spatial information stream; the spatial information only affects the distribution and weights of the convolution kernels.
We also show the per-category IoU improvement of SConv in Fig. 5. Our SConv improves the IoU of most categories, especially for objects lacking representative texture information such as mirror, board, and bathtub. There are also clear improvements for objects with rich spatial transformations, such as chairs and tables. This shows that our SConv makes good use of spatial information during inference.
Architecture ablation: To evaluate the effectiveness of each component of the proposed SConv, we design ablation studies. The results are shown in Tab. II. By default, we replace the first convolution and the last two convolutions of each layer according to Tab. I. We can see that the offset generator (OG), spatial projection module (SP), and weight generator (WG) of SConv all contribute to the improvement of the results.
Comparison with alternatives: Most methods [fusenet, jiang2018rednet, rdfnet, hu2019acnet] use a two-stream network to extract features from the two modalities and then combine them. Our SConv instead focuses on improving the feature extraction process of the network by utilizing spatial information. Here we compare our SConv with a two-stream network, deformable convolution [deformablev2, deform], and depth-aware convolution [dcnn]. We use a simple baseline consisting of a ResNet101 network with deep supervision and a simple decoder.
We add an additional ResNet101 network, called HHANet, to extract HHA features and fuse them with the baseline features at the final layer, forming a two-stream network. To compare with depth-aware convolution (DAC) and deformable convolution (DCV2), we replace the first convolution and the last two convolutions of each layer in the baseline, in the same way as in SGNet. The results are shown in Tab. III. Our SConv achieves better results than the two-stream network, deformable convolution, and depth-aware convolution, demonstrating that SConv can effectively utilize spatial information. The baseline equipped with the weight generator (Baseline+SP+WG) also achieves better results than depth-aware convolution, indicating that learning weights from spatial information is necessary.
Network  Backbone  MS  SI  Acc.  mAcc.  mIoU.  fps  param (M) 

LSDGF [local]  2VGG16  HHA    58.0        
RefineNet [refinenet]  ResNet152  ✓    80.6  58.5  45.9  16  129.5 
3DGNN [qi20173d]  VGG16  ✓  HHA    57.0  45.9  5  47.2 
RDFNet [rdfnet]  2ResNet152  ✓  HHA  81.5  60.1  47.7  9  200.1 
CFNet [cascaded]  2ResNet152  ✓  HHA      48.1     
DCNN [dcnn]  2VGG16  HHA    53.5  42.0  12.5  92.0  
ACNet [hu2019acnet]  2ResNet50  HHA      48.1  8  272.2  
SGNet  ResNet101  depth  81.0  59.6  47.1  28  58.3  
SGNet  ResNet101  ✓  depth  81.8  60.9  48.5  28  58.3 
Spatial information comparison: We also evaluate the impact of different formats of spatial information on SConv. The results are shown in Tab. IV. We can see that depth information leads to results comparable with HHA and 3D coordinates, and better results than the intermediate RGB features used by deformable convolution [deform, deformablev2]. This shows the advantage of using spatial information for offset and weight generation over RGB features. However, converting depth to HHA is time-consuming [fusenet]. Hence, 3D coordinates and the depth map are more suitable for real-time segmentation with SGNet.
IV-B Comparison with the State of the Art
We compare our SGNet with other stateoftheart methods on NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets. The architecture of SGNet is shown in Fig. 4.
NYUDv2 dataset: The comparison results can be found in Tab. V and Fig. 6. The input image size for single-scale speed testing follows [dcnn]. We tested the single-scale speed of the other methods under the same conditions on an NVIDIA 1080Ti. Note that some methods in Tab. V do not report their parameter counts or release source code, so we only list their mIoU. Instead of using additional networks to extract spatial features, our SGNet achieves competitive performance with real-time inference. This benefits from SConv, which makes use of spatial information efficiently with only a small amount of extra parameters and computation. Moreover, our SConv achieves good results without using HHA information, making it suitable for real-time tasks. After applying the same multi-scale testing as RDFNet [rdfnet] and CFNet [cascaded], we exceed all methods based on complicated data fusion and achieve state-of-the-art performance with fewer parameters, which verifies the efficiency of our SConv in utilizing spatial information. At the expense of slightly more inference time, changing the output stride of SGNet to 8 (denoted SGNet8s) yields better results than the other methods, including RDFNet, which uses multi-scale testing, HHA information, and two ResNet152 backbones. Using the multi-scale testing adopted by other methods further improves SGNet's performance.
SUNRGBD dataset: The comparison results on the SUNRGBD dataset are shown in Tab. VI. It is worth noting that some methods in Tab. V did not report results on the SUNRGBD dataset. The inference time and parameter numbers of the models in Tab. VI are the same as those in Tab. V. Our SGNet achieves competitive results in real time compared with models that do not have real-time performance.
IV-C Qualitative Performance
Visualization of the receptive field in SConv: An appropriate receptive field is very important for scene recognition. We visualize the input-adaptive receptive field of SGNet in different layers generated by SConv. Specifically, we obtain the receptive field of each pixel by summing up the norms of its offsets during the SConv operation; we then normalize each value to [0, 255] and visualize the result as a grayscale image. The results are shown in Fig. 7. The brighter the pixel, the larger the receptive field. We observe that the receptive fields of different convolutions vary adaptively with the depth of the input image. For example, in layer1_1 the receptive field is inversely proportional to the depth, which is the opposite of layer1_2. The combination of the adaptive receptive fields learned at each layer can help the network better parse indoor scenes with complex spatial relations.
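A sketch of this visualization procedure is given below, assuming the offsets of one SConv layer are available as a (1, 2K², H, W) tensor with interleaved y/x channels (the channel order is an assumption):

```python
import numpy as np
import torch
from PIL import Image

def receptive_field_map(offsets, out_path="rf_layer.png"):
    """Visualize the adaptive receptive field of one SConv layer.

    `offsets` is the (1, 2*K*K, H, W) offset tensor of Eq. (4) for one image, with
    interleaved y/x channels (assumed order). The per-position offset norms are summed
    at every pixel, rescaled to [0, 255] and saved as a grayscale image
    (brighter = larger receptive field).
    """
    off = offsets.detach()[0]                        # (2*K*K, H, W)
    dy, dx = off[0::2], off[1::2]                    # (K*K, H, W) each
    rf = torch.sqrt(dy ** 2 + dx ** 2).sum(dim=0)    # per-pixel sum of offset norms
    rf = (rf - rf.min()) / (rf.max() - rf.min() + 1e-8) * 255.0
    Image.fromarray(rf.cpu().numpy().astype(np.uint8), mode="L").save(out_path)

receptive_field_map(torch.randn(1, 18, 60, 80))      # toy offsets for a 3x3 kernel
```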
Qualitative comparison results: We show qualitative comparison results on the NYUDv2 test set in Fig. 8. For the visual results in Fig. 8 (a), the bathtub and the wall have insufficient texture and cannot be easily distinguished by the baseline method. Some objects exhibit reflections, such as the table in Fig. 8 (b), which is also challenging for the baseline. SGNet, however, can recognize them well by incorporating spatial information with the help of SConv. The chairs in Fig. 8 (c, d) are hard to recognize from RGB data alone due to the low contrast and confusing texture, while they are easily recovered by SGNet thanks to the equipped SConv. Meanwhile, SGNet recovers the objects' geometric shapes nicely, as demonstrated by the chairs in Fig. 8 (e). We also show qualitative results on the SUNRGBD test set in Fig. 9. It can be seen that our SGNet also achieves precise segmentation on SUNRGBD.
V Conclusion
In this paper, we propose a novel Spatial information guided Convolution (SConv) operator. Compared with conventional 2D convolution, it can adaptively adjust the convolution weights and distributions according to the input spatial information, resulting in better awareness of the geometric structure with only a few additional parameters and computation cost. We also propose the Spatial information Guided convolutional Network (SGNet) equipped with SConv, which yields real-time inference speed and achieves competitive results on the NYUDv2 and SUNRGBD datasets for RGBD semantic segmentation. We also compare the performance of using different inputs to generate offsets, demonstrating the advantage of using spatial information over RGB features. Furthermore, we visualize the depth-adaptive receptive field in each layer to show its effectiveness. In the future, we will investigate the fusion of information from different modalities and the adaptive change of the SConv structure simultaneously, making these two approaches benefit each other. We will also explore the application of SConv in different fields, such as pose estimation and 3D object detection.