1Department of Electronics and Communication, Vidya Academy of Science and Technology, Technical Campus, Kilimanoor, India.
2Department of Electronics and Communication, College of Engineering Trivandrum, Trivandrum, India
Semantic segmentation is a challenging problem in computer vision. In recent years, its performance has been considerably enhanced by cutting-edge techniques. This paper presents an advanced semantic segmentation methodology that uses the PSPNet (Pyramid Scene Parsing Network) architecture augmented with atrous convolution networks and a spatial attention module. The primary objective is to improve segmentation accuracy by integrating spatial attention mechanisms with the PSPNet framework, in combination with atrous convolution networks. The spatial attention module selectively highlights pertinent spatial regions within feature maps, enhancing the ability of the model to capture intricate details crucial for precise segmentation. Experimental evaluations are carried out on two datasets: the Stanford Background Dataset and the Aerial Semantic Segmentation Drone Dataset. The proposed method achieves an mIoU of 85.25% on the Stanford Background Dataset, underscoring the efficacy of integrating spatial attention mechanisms and atrous convolution networks within the PSPNet architecture for semantic segmentation tasks and advancing the state of the art in this domain.
Semantic segmentation is a fundamental computer vision challenge that involves classifying each pixel in an image into object categories such as "person," "car," "building," etc. Semantic segmentation provides a comprehensive understanding of the scene by dividing it into meaningful parts based on object categories, in contrast to image classification, which gives a single label to a whole image. In computer vision, semantic segmentation [2] plays an important role in different disciplines such as autonomous driving and medical imaging. Systems for object detection and recognition are aided by the use of semantic segmentation, and the accuracy of localization and identification is improved by the accurate drawing of object boundaries. It also applies to scene understanding, such as autonomous driving, robotics, surveillance systems, and scene parsing, since it helps robots efficiently understand spatial layouts and semantic contexts. Semantic segmentation enhances diagnosis, treatment planning, and medical research in medical imaging by analysing MRI, CT images, and histopathology slides. In the field of semantic segmentation, various innovative strategies have evolved to improve accuracy and efficiency. Fully Convolutional Networks (FCNs) [24] replace fully connected layers with convolutional layers to make spatially dense predictions. U-Net, well known for its use in medical imaging, employs an encoder-decoder design with skip connections to preserve localization and contextual information. SegNet has a similar design, but it upsamples using max-pooling indices from the encoder step. DeepLab [26] employs atrous convolutions to gather multi-scale contextual information and Atrous Spatial Pyramid Pooling (ASPP) [25] to improve feature extraction. The PSPNet [1] includes a pyramid pooling module for gathering contextual information at multiple scales, which improves scene interpretation.
Semantic segmentation faces several challenges, including handling complex scenes, maintaining fine-grained details, computational efficiency, and dealing with varied object scales. Using PSPNet with atrous convolution and a spatial attention module effectively addresses several of these key challenges. Atrous convolution, used in the atrous spatial pyramid pooling (ASPP) module, collects multi-scale information while maintaining resolution, resulting in detailed and high-quality segmentation. The ASPP module is made up of parallel atrous convolutions with varying dilation rates, which help capture features at different scales and incorporate multi-level contextual information. This paper introduces a unique semantic segmentation technique that improves the PSPNet [1] design by adding a spatial attention module alongside atrous convolution. The PSPNet [3] is a well-known architecture designed for semantic segmentation applications. Its salient feature is its capacity to employ pyramid pooling modules to capture contextual information at various scales, thereby enabling an improved understanding of scenes. By including a spatial attention module, this method enhances the capabilities of the PSPNet network and improves its overall performance. The effectiveness of this strategy in improving segmentation accuracy and performance has been extensively tested on the Stanford Background Dataset [6], [11]. The PSPNet obtains context from a variety of receptive fields by dividing the input feature map into sub-regions and carrying out pooling operations with varying kernel sizes.
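The parameter-free widening of the receptive field that atrous convolution provides can be illustrated with the standard single-layer effective-receptive-field formula, rf = k + (k - 1)(d - 1). The following minimal sketch (not from the paper, purely illustrative) evaluates it at the dilation rates used later:

```python
def dilated_rf(kernel_size: int, dilation: int) -> int:
    """Effective receptive field of one dilated (atrous) convolution:
    k + (k - 1) * (d - 1). The parameter count stays at k * k."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3x3 kernel keeps 9 weights regardless of dilation,
# while its receptive field widens with the rate:
for rate in (1, 3, 6, 9):
    rf = dilated_rf(3, rate)
    print(f"rate {rate}: receptive field {rf}x{rf}")
```

At rates 3, 6, and 9 a 3x3 kernel sees 7x7, 13x13, and 19x19 neighbourhoods respectively, which is why stacking rates captures medium-range to wide context without extra parameters.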
Fig. 1 Architecture of Proposed Methodology
Furthermore, in order to maintain spatial resolution while efficiently broadening the receptive field without adding more parameters, PSPNet uses dilated convolutions [3], which can capture both fine-grained information and broad contextual features from images. The experimental results indicate that combining the PSPNet and atrous network with the attention module yields greater segmentation performance. The remainder of the paper is organized as follows: Section 2 discusses previous related work, Section 3 gives a brief description of the methodology used in this research, Section 4 presents the experimental analysis and final results, Section 5 presents an ablation study, and Section 6 concludes the work.
2 Literature Review
Semantic segmentation must effectively manage contextual information, particularly when attention mechanisms are not optimal. Current research focuses on improving contextual understanding through attention mechanisms. To address these issues, recent research has proposed novel algorithms that incorporate atrous convolution networks and attention mechanisms into semantic segmentation frameworks. By incorporating attention modules into established frameworks, these solutions can increase contextual comprehension and robustness to changing lighting conditions. This section reviews the pivotal contributions and methodologies proposed in recent years, focusing on various network architectures and attention mechanisms that have enhanced the performance and efficiency of semantic segmentation models. In the context of multi-class image segmentation, Guangzhe Zhao et al. [12] introduced an architecture incorporating a bilateral U-Net network model with a spatial attention mechanism. This model uses a lightweight MobileNetV2 as the backbone network for hierarchical feature extraction and proposes an Attentive Pyramid Spatial Attention (APSA) module. However, the generalization of ASPP (Atrous Spatial Pyramid Pooling) is poor and does not allow effective segmentation on datasets with many categories and rich feature information. To recover detailed information by merging pooling index maps from the encoder with high-level feature maps, Qi Zhao et al. [13] introduced an end-to-end attention-based semantic segmentation network known as SSAtNet. Since capturing fine-grained details remains complex, a semantic segmentation [14] algorithm integrating an attention mechanism
Fig. 2 Spatial Attention Module
is introduced. Dilated convolution is employed to preserve image resolution and capture detailed information. Recently, spatial pyramid techniques have been used for pixel labeling and combined with attention mechanisms. Hanchao Li et al. [15] introduced a method that replaces complicated dilated convolution operations and performs better segmentation. A Successive Pooling Attention Module (SPAM) and a Feature Fusion Module (FFM) [16] are also used to extract high-level and low-level features using the initial 11 layers of ResNet50. The importance of semantic segmentation has been highlighted in a number of recent works [17], [15], but these do not focus on managing contextual information, and their segmentation effectiveness for small target objects is poor. To address edge splitting and small-object disappearance in complex scene images, attention modules were adopted; they improve the model's ability to focus on crucial information by dynamically weighting different regions of the input data. In summary, the previous works show that managing contextual information is challenging, especially when attention mechanisms are not operating at their best, and that adaptability to varied lighting conditions is limited. Attention modules address these problems by increasing the model's capacity to focus on key information through dynamic weighting of the input, improving the extraction of critical spatial and contextual information and resulting in more accurate and robust performance.
3 Model Architecture
The proposed method improves the PSPNet architecture by adding a spatial attention module to complement the pyramid pooling modules and dilated convolutions. These components facilitate feature extraction by gathering contextual information at different scales and extending the receptive field while maintaining spatial accuracy. The spatial attention module improves precise segmentation by emphasising key spatial locations in feature maps. Fig. 1 illustrates the proposed architecture for semantic segmentation, which starts with an input image processed by a data preprocessing block and a feature extraction network, a convolutional neural network (CNN) based on ResNet-50. This network collects important elements from the image, producing an initial feature map that serves as the basis for further processing. This feature map passes through spatial pyramid pooling (SPP), a critical component for collecting context at multiple scales; the PSPNet [5] uses this SPP to improve feature representation. In PSPNet, spatial pyramid pooling is achieved via average pooling to acquire global context information by downscaling the feature map, together with atrous convolutions (also known as dilated convolutions) at various dilation rates. Convolutions with rates of 3, 6, and 9 are used to capture characteristics at varied spatial resolutions, emphasising medium-range to wide contextual information, and a 3x3 convolution reduces the dimensionality of the feature map, ensuring both local context capture and dimensionality reduction. Finally, the upsampled features from all region sizes are concatenated [7] along the channel dimension and fed into a 1x1 convolutional layer [4]. Table 1 shows the detailed architecture of the proposed methodology. In this study, a spatial attention module (Fig. 2) [21] is incorporated that enhances the performance of the model for semantic segmentation.
By employing this spatial attention module to emphasise relevant spatial regions within the feature maps, the model can prioritise features that are critical for precise segmentation. Overall, the accuracy and performance of the semantic segmentation model are enhanced by the inclusion of the spatial attention module [9], [10], making it more valuable in real-world applications.
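A key claim above is that "same"-padded atrous convolution widens context while preserving spatial resolution. The following NumPy sketch (a naive single-channel loop, purely illustrative; the paper's actual layers are learned multi-channel convolutions) verifies this for the rates 3, 6, and 9 used in the ASPP branches:

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """'Same'-padded 2D atrous convolution on a single-channel map.
    Padding by rate * (k - 1) // 2 on each side keeps the spatial
    size of the output identical to the input."""
    k = kernel.shape[0]
    pad = rate * (k - 1) // 2
    xp = np.pad(x, pad)                      # zero-pad both spatial axes
    out = np.zeros_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            for a in range(k):               # kernel taps are spaced
                for b in range(k):           # 'rate' pixels apart
                    out[i, j] += kernel[a, b] * xp[i + a * rate, j + b * rate]
    return out

# A 40x40 feature map (the ResNet-50 output resolution in Table 1)
# keeps its spatial size at every dilation rate:
x = np.random.rand(40, 40)
box = np.ones((3, 3)) / 9.0
for rate in (3, 6, 9):
    assert atrous_conv2d(x, box, rate).shape == x.shape
```

Each branch therefore produces a (40, 40) map regardless of rate, which is what allows the ASPP outputs to be concatenated along the channel axis.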
Table 1 Details of Proposed Architecture
| Layer | Input Shape | Output Shape | Filter Size |
|---|---|---|---|
| Input Image | (320,320,3) | (320,320,3) | - |
| ResNet-50 | (320,320,3) | (40,40,2048) | Multiple |
| Avg Pool | (40,40,2048) | (40,40,256) | - |
| ASPP Conv 1x1 | (40,40,2048) | (40,40,256) | 1x1x2048 |
| ASPP Conv 3x3 (r=3) | (40,40,2048) | (40,40,256) | 3x3x2048 |
| ASPP Conv 3x3 (r=6) | (40,40,2048) | (40,40,256) | 3x3x2048 |
| ASPP Conv 3x3 (r=9) | (40,40,2048) | (40,40,256) | 3x3x2048 |
| ASPP Concatenation | (40,40,1024) | (40,40,1024) | - |
| ASPP Conv 1x1 (red.) | (40,40,1024) | (40,40,512) | 1x1x1024 |
| Global Avg Pool | (40,40,512) | (1,1,512) | 40x40x512 |
| Conv 3x3 | (40,40,512) | (40,40,512) | 3x3x512 |
| Upsample to Input | (40,40,512) | (320,320,512) | - |
| Spatial Attention | (320,320,512) | (320,320,512) | - |
| Final Conv 1x1 | (320,320,512) | (320,320,9) | 1x1x512 |
Using spatial pyramid pooling, the feature maps from various scales are concatenated along the channel dimension. This stage is critical because it merges multi-scale contextual information into a single comprehensive feature map that includes both local and global features at multiple spatial resolutions. The merged feature map is then analysed by a spatial attention mechanism, which guarantees that the network focuses on the sections of the feature map that are important to the segmentation task, boosting the accuracy and resilience of the segmentation predictions. The incorporation of the PSPNet [19] into this module takes advantage of its capability in gathering diverse contextual information, improving the spatial attention mechanism's capacity to refine and emphasise important features. The refined feature map generated by the spatial attention mechanism goes through a final 1x1 convolution step. This layer is critical because it reduces the number of channels in the feature map to match the number of output classes needed for the segmentation task; the 1x1 convolution guarantees that the feature map is properly formatted for producing accurate segmentation predictions. The end result is a predicted segmentation map in which each pixel of the input image is assigned a class label corresponding to the recognised objects or areas. This comprehensive strategy, which combines multi-scale context capture, feature integration, and spatial attention refinement, significantly improves the semantic segmentation process, yielding highly accurate and detailed segmentation results. The application of PSPNet in this architecture dramatically improves the network's capacity to grasp and analyse complicated scenes by including a wide variety of contextual information in the segmentation process.
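The exact form of the spatial attention module in Fig. 2 is not fully specified here; a common formulation (e.g., CBAM-style) pools across channels and derives a per-location weight in (0, 1). The sketch below is a simplified, assumption-laden version that replaces the usual learned 7x7 convolution with a plain average of the pooled maps, purely to illustrate the reweighting mechanics:

```python
import numpy as np

def spatial_attention(feat):
    """Simplified spatial-attention sketch (assumed CBAM-like form).
    feat: (H, W, C) feature map."""
    avg_map = feat.mean(axis=-1)   # channel-wise average pooling -> (H, W)
    max_map = feat.max(axis=-1)    # channel-wise max pooling     -> (H, W)
    # The real module would mix these maps with a learned 7x7 conv;
    # here we simply average them before the sigmoid, for illustration.
    attn = 1.0 / (1.0 + np.exp(-(avg_map + max_map) / 2.0))  # in (0, 1)
    return feat * attn[..., None]  # reweight every spatial location

# Shape is preserved, as the Spatial Attention row of Table 1 requires:
feat = np.random.rand(40, 40, 512).astype(np.float32)
out = spatial_attention(feat)
assert out.shape == feat.shape
```

Because the attention weights lie in (0, 1), the module can only suppress or pass through activations per location, which is how it emphasises segmentation-relevant regions without changing the tensor shape.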
Fig. 3 Result on Stanford Background Dataset (Original Image, Ground Truth, and Predicted Image)
4 Experimental Setup
This section presents the comprehensive evaluation of the proposed methodology, including its performance on the Stanford Background Dataset and a comparison with different approaches.
4.1 Dataset
The performance evaluation of the proposed method is done using two datasets:
4.1.1 Stanford Background Dataset
The performance evaluation of the proposed method was conducted through extensive experiments on the Stanford Background Dataset [11], [20], a well-established benchmark for semantic segmentation tasks. The collection includes 715 images drawn from four public datasets: LabelMe, MSRC, PASCAL VOC, and Geometric Context. The images were chosen based on the following criteria: they depict outdoor scenes, are approximately 320 x 240 pixels, contain at least one foreground object, and have the horizon located inside the image. Amazon Mechanical Turk was used to produce semantic and geometric labels.
4.1.2 Aerial Semantic Segmentation Drone Dataset
The Semantic Drone Dataset [23] focuses on the semantic understanding of urban scenes to increase the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from a nadir (bird's-eye) view acquired at an altitude of 5 to 30 meters above ground. A high-resolution camera was used to acquire images at a size of 6000x4000 px (24 Mpx). The training set contains 400 publicly available images, and the test set is made up of 200 private images.
4.2 Data Preprocessing
The experimental setup includes thorough data preprocessing, augmentation techniques, and model training strategies to ensure robust and reliable results. Prior to training, the dataset underwent preprocessing by resizing images uniformly, normalizing pixel values, and converting pixel-wise annotations into formats suitable for model training. Augmentation methods such as random flips, rotations, and scaling were utilized to diversify the training data, thereby enhancing the model's generalization across various scenarios. In the training phase, best practices such as transfer learning with a pre-trained backbone network, learning rate scheduling, and early stopping were implemented to stabilize training and mitigate overfitting.
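The normalization and flip augmentation described above can be sketched as follows. This is a hypothetical minimal version (resizing is omitted, since it would require an image library; the function name and signature are illustrative, not from the paper); note that the flip must be applied to the image and its mask together so pixel labels stay aligned:

```python
import numpy as np

def preprocess(img, mask, hflip=False):
    """Hypothetical preprocessing sketch: scale uint8 pixels to [0, 1]
    and optionally apply a horizontal flip to image and mask jointly."""
    img = img.astype(np.float32) / 255.0
    if hflip:
        img = img[:, ::-1, :]   # flip width axis of the image
        mask = mask[:, ::-1]    # flip the label mask identically
    return img, mask

# A 320x320 RGB image with a 9-class label mask, as in Table 1:
img = np.random.randint(0, 256, (320, 320, 3), dtype=np.uint8)
mask = np.random.randint(0, 9, (320, 320), dtype=np.int64)
img_f, mask_f = preprocess(img, mask, hflip=True)
```

During training one would draw `hflip` at random per sample; at evaluation time augmentation is disabled so predictions align with the ground truth.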
4.3 Evaluation Metrics
The performance of the proposed method is evaluated using three metrics: mIoU (Mean Intersection over Union), PA (Pixel Accuracy), and MPA (Mean Pixel Accuracy). They are defined as

PA = (Σ_i C_ii) / (Σ_i T_i)

MPA = (1/N) Σ_i (C_ii / T_i)

mIoU = (1/N) Σ_i [ X_ii / (T_i + Σ_j X_ji − X_ii) ]

where,
– N is the number of classes,
– X_ii is the number of true positive pixels for class i,
– C_ii is the number of correctly predicted pixels for class i,
– T_i is the total number of pixels for class i,
– X_ji is the number of pixels predicted as class i that are actually class j.
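These three metrics can all be read off a confusion matrix. The sketch below (illustrative; `conf[i, j]` counts pixels of true class i predicted as class j) computes PA, MPA, and mIoU from such a matrix:

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, MPA, and mIoU from an N x N confusion matrix where
    conf[i, j] = pixels of true class i predicted as class j."""
    correct = np.diag(conf).astype(float)        # C_ii / X_ii
    total_true = conf.sum(axis=1).astype(float)  # T_i (row sums)
    total_pred = conf.sum(axis=0).astype(float)  # pixels predicted as each class
    pa = correct.sum() / conf.sum()
    mpa = (correct / total_true).mean()
    # IoU_i = X_ii / (T_i + predicted_i - X_ii): union = truth + prediction - overlap
    miou = (correct / (total_true + total_pred - correct)).mean()
    return pa, mpa, miou

# Toy 2-class example: 8 of 10 class-0 pixels and 9 of 10 class-1 pixels correct.
conf = np.array([[8, 2],
                 [1, 9]])
pa, mpa, miou = segmentation_metrics(conf)
```

On this toy matrix PA and MPA both come to 0.85, while mIoU is lower (the mean of 8/11 and 9/12), illustrating why mIoU is the stricter metric reported in Tables 2 to 6.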
4.4 Experimental Results and Analysis
Table 2 presents the results of semantic segmentation [22] on the Stanford Background Dataset, where the mIoU metric is employed to evaluate the accuracy of object shape prediction by the proposed models. The proposed method achieves an mIoU of 85.25%. Compared with the previous method in [6], this is 7.84% higher.
Table 2 Performance comparison of methods on the Stanford Background Dataset
| Method | mIoU (%) |
|---|---|
| DeepLabV3 [27] | 74.33 |
| DeepLabv3+LoAd [28] | 75.05 |
| UNET | 64.2 |
| Proposed method | 85.25 |
As shown by the higher mIoU score in Table 3 [22], the proposed method outperforms previous methods on semantic segmentation tasks. The proposed method uses ResNet-50 [18] as the backbone architecture, which captures multi-scale features through its hierarchical structure.
Fig. 4 Results of the three approaches
Table 3 Performance comparison of different methods on semantic segmentation
| Method | Backbone | mIoU (%) |
|---|---|---|
| FCN-8s [4] | - | 65.3 |
| PSPNet [3] | ResNet-101 | 43.29 |
| PSANet [28] | ResNet-101 | 43.7 |
| RefineNet [29] | ResNet-101 | 73.6 |
| DeepLab V2 [30] | ResNet-101 | 70.4 |
| Proposed method | ResNet-50 | 85.25 |
Fig. 3 shows the results obtained on the Stanford Background Dataset, indicating an improvement in segmentation performance using the proposed methodology, with an accuracy of 83.04%. Table 4 compares DeepLabv3+, Adversarial, and the proposed PSP approach on the Stanford Background Dataset, focusing on per-class pixel accuracy and mIoU. The proposed PSP method outperforms the previous methods, with greater accuracy in difficult classes such as Mountain (61.66%) and an overall mIoU of 80.04%. This indicates that the PSP technique is more effective for semantic segmentation on this dataset.
Fig. 5 Result on Aerial Drone Dataset (Original Image, Ground Truth, and Predicted Image)
Table 4 Performance Comparison on Stanford Background Dataset
| Method | Sky | Tree | Road | Grass | Water | Building | Mountain | Foreground | mIoU% |
|---|---|---|---|---|---|---|---|---|---|
| DeepLabv3+ [27] | 89.38 | 72.21 | 87.28 | 77.44 | 72.70 | 80.03 | 48.64 | 66.92 | 74.33 |
| Adversarial [6] | 89.35 | 72.54 | 87.31 | 77.53 | 72.78 | 80.04 | 49.18 | 66.69 | 74.43 |
| Proposed Method | 74.04 | 84.19 | 92.28 | 87.72 | 90.14 | 78.83 | 61.66 | 78.65 | 80.04 |
5 Ablation Study
In this study on semantic segmentation using the Stanford Background dataset, we performed an ablation study to evaluate the impact of various network enhancements. Initially, using a PSPNet, we achieved a mIoU of 74.3%. To improve accuracy, we integrated an attention mechanism into the PSPNet, which increased the mIoU to 76%. Further enhancing the network, we combined PSPNet with atrous convolution and an attention mechanism, specifically incorporating a spatial attention module. This approach significantly boosted the mIoU to 85.25%.
Table 5 Ablation Study Results
| Method | mIoU (%) |
|---|---|
| PSPNet | 74.3 |
| PSPNet + Attention Mechanism | 76.0 |
| PSPNet + Atrous Convolution + Attention Mechanism (with Spatial Attention Module) | 85.25 |
Fig. 4 shows the output of the three approaches. The PSP network with the attention mechanism, and the PSP network with atrous convolution and the attention module, produce remarkable segmentation output. To further analyze the performance of the proposed approach, we compare results on another dataset, the Aerial Semantic Segmentation Drone Dataset [23]. Here the experimental analysis was done only with the PSP network, which achieved an mIoU of 63.0%; on the Stanford Background Dataset the PSP network achieves an mIoU of 74.3%. Fig. 5 shows the result on the Aerial Drone Dataset, and Table 6 compares the two datasets.
Table 6 Comparison of Methods on Different Datasets
| Dataset | Method | mIoU (%) |
|---|---|---|
| Aerial Drone Dataset | PSPNet | 63.0 |
| Stanford Background Dataset | PSPNet | 74.3 |
6 Conclusion
The findings of this paper underscore the efficacy of leveraging the PSP network in semantic segmentation tasks, augmented with essential mechanisms for feature extraction and context understanding. By integrating a spatial attention module alongside an atrous convolution network, the approach demonstrates substantial improvements in segmentation accuracy and performance. Through experimental analysis, we achieved a segmentation accuracy of 83.04%, showcasing the effectiveness of the incorporated features in capturing contextual information at multiple scales. The utilization of spatial attention mechanisms allows the model to selectively emphasize relevant spatial regions, while atrous convolution facilitates the extraction of contextual features crucial for accurate segmentation, thereby offering improved accuracy and robustness for various computer vision applications. Future work might include incorporating more advanced attention mechanisms, improving multi-scale context aggregation, and optimizing the model for real-time applications.