High-quality images are a core requirement in the age of high-definition digital media and AI-driven vision systems. Low-resolution imagery, caused by sensor limitations, transmission constraints, or environmental conditions, hinders tasks such as object detection, facial recognition, and semantic segmentation. Single-image super-resolution (SISR) is therefore an important ill-posed inverse problem: it aims to recover the missing high-frequency information of a single LR image to produce an approximate HR image. Traditional algorithms such as bicubic interpolation are simple upscalers but introduce artifacts such as blurring and aliasing, especially at scaling factors of 4 or higher. Deep learning has transformed SISR: convolutional neural networks (CNNs) trained end-to-end with pixel-wise losses (e.g., MSE) achieve high PSNR but produce visually dull results. Generative adversarial networks (GANs) address this by adding discriminators that enforce natural image statistics, as in SRGAN [1] and ESRGAN [2], trading distortion measures for perceptual quality. Nevertheless, current GAN-based models, including our earlier ResGAN-SR [3], have the following
limitations: (i) fixed-scale training cannot adapt to multi-scale scenarios; (ii) purely CNN-based architectures ignore the long-range pixel correlations essential to complex textures; (iii) the assumption of ideal bicubic degradation does not hold for real images with composite blurs, noise, and JPEG artifacts; and (iv) as with other adversarial models, training is unstable and prone to artifacts. To overcome these shortcomings, we propose TransResGAN-SR, an improved framework that combines Transformer-based global attention with residual learning in a GAN setting. Key innovations include:
- A dual-branch generator that interleaves residual blocks with Swin Transformer blocks [4] for effective multi-scale feature extraction.
- An adaptive degradation module that employs dynamic kernels learned through a meta-network to handle real-world inputs without paired data.
- A perceptual loss augmented with VGG features [5], an adversarial loss, and diffusion priors [6] for improved texture synthesis.
- A multi-scale training scheme supporting arbitrary upscaling factors.

Quantitative and qualitative experiments on benchmark datasets confirm the effectiveness of TransResGAN-SR, which surpasses state-of-the-art (SOTA) methods on both distortion (PSNR, SSIM) and perceptual (MOS, LPIPS) measures. This work not only resolves the weaknesses of the earlier ResGAN-SR but also extends SISR to resource-limited settings.
2. Related Work
2.1. Classical and Learning-Based SISR
Early SISR relied on interpolation (e.g., bilinear, Lanczos) and reconstruction priors (e.g., sparsity [7]). Learned mappings between LR-HR patches [8] followed but struggled at large scales due to limited expressiveness. CNNs marked a paradigm shift: SRCNN [9] introduced end-to-end learning, and deeper models such as VDSR [10] added global residual learning. EDSR [11] removed batch normalization for greater training stability and achieved SOTA PSNR, while RDN [12] improved feature reuse through dense connections.
2.2. GAN-Driven Advances
MSE optimization yields overly smooth images; perceptual losses based on VGG features align better with human perception [13]. SRGAN [1] introduced GANs for photo-realistic SR, and ESRGAN [2] improved on it with relativistic discriminators and perceptual priors. Real-ESRGAN [14] tackled unpaired real-world degradations using high-order degradation modeling. Recent GAN variants include DAF-GAN [15], which uses lightweight fusion, and DS-GAN [16] with smoothness-inducing IGMRF priors. For remote sensing, FBD-KAN [17] incorporates Kolmogorov-Arnold networks.
2.3. Transformer Integration in SISR
Transformers [18] capture long-range dependencies through self-attention. IPT [19] first brought Transformers to SR, and SwinIR [4] employed shifted windows. Hybrid designs combine CNNs with attention [20]. Within GANs, SRTransGAN [21] uses Transformers in the generator, and T-GAN [22] targets medical images. MAFT [23] fuses multiple attention mechanisms for SR, and SRDDGAN [24] combines GANs with diffusion.
2.4. Real-World and Multi-Scale Challenges
RealSR [25] datasets expose the degradation mismatch between synthetic and real LR images. BSRGAN [26] learns blind degradation kernels. GAN adaptations to this setting remain few, and multi-scale techniques such as MDSR [27] train shared networks across scales. Our model builds on ResGAN-SR [3], adding Transformers [4,18], degradation modeling [26], and diffusion priors [6,24] to obtain a single multi-scale, real-world SISR model.
3. Proposed Method
3.1. Methodological Overview
Single-image super-resolution (SISR) is an ill-posed inverse problem: restoring high-frequency detail requires both local feature modeling and global contextual knowledge. Conventional convolutional neural networks rely mainly on local receptive fields, which limits their ability to learn long-range spatial dependencies. Residual learning stabilizes deep optimization by reformulating reconstruction as residual prediction, while Transformer-based self-attention introduces global feature interaction. Motivated by these observations, the proposed TransResGAN-SR integrates residual learning, Transformer attention, and adversarial optimization to deliver robust multi-scale super-resolution under real-world degradations.
3.2. Problem Formulation
Given a low-resolution (LR) image ILR produced from a high-resolution image IHR by a degradation operator D comprising blur with kernel k, downsampling ↓s, and additive noise η, the degradation process is modeled as:
ILR = D(IHR) = (IHR ⊗ k) ↓s + η (1)
TransResGAN-SR learns a generator G such that:
ISR = G(ILR, s) ≈ IHR (2)
for scaling factors s ∈ {2, 4, 8}, optimizing perceptual fidelity under real degradation D.
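As a concrete illustration of the degradation model in Eq. (1), the following minimal NumPy sketch blurs a synthetic HR image with a Gaussian kernel, decimates it by stride s, and adds Gaussian noise. The kernel size, sigma, and noise level are illustrative assumptions, not values from this paper:

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Isotropic Gaussian blur kernel k (illustrative choice)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(hr, s=4, sigma=1.5, noise_std=0.01, rng=None):
    """I_LR = (I_HR ⊗ k) ↓s + η : blur, downsample, add noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = gaussian_kernel(sigma=sigma)
    pad = k.shape[0] // 2
    padded = np.pad(hr, pad, mode="reflect")
    # Direct 2-D convolution; slow but dependency-free for a sketch.
    H, W = hr.shape
    blurred = np.zeros_like(hr)
    for i in range(H):
        for j in range(W):
            blurred[i, j] = np.sum(padded[i:i + 2*pad + 1, j:j + 2*pad + 1] * k)
    lr = blurred[::s, ::s]                            # ↓s: stride-s decimation
    return lr + rng.normal(0.0, noise_std, lr.shape)  # + η

hr = np.random.default_rng(1).random((32, 32))
lr = degrade(hr, s=4)
print(lr.shape)  # (8, 8)
```

Real-world variants (Section 2.4) would replace the fixed Gaussian with sampled blur, noise, and compression pipelines.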
3.3. Generator Architecture
The generator is a hybrid residual–Transformer network. Feature Extraction: an initial convolution layer maps ILR into a 64-channel feature space. Hybrid Trunk: residual blocks (Conv–ReLU–Conv with skip connections) alternate with Swin Transformer blocks [4]; shifted-window self-attention reduces the cost of attention to O(HW) in the number of pixels. Each Swin block is defined as:
xˆ = MSA(LN(x)) + x (3)
x′ = MLP(LN(xˆ)) + xˆ (4)
where MSA denotes multi-head self-attention and LN layer normalization. Degradation Module: a meta-network predicts adaptive kernels k from ILR and applies dynamic convolution:
f′ = k ∗ f (5)
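The dynamic convolution of Eq. (5) can be sketched as below. The `predict_kernel` function is a hypothetical stand-in for the meta-network (a real implementation would learn it end-to-end), and only a single-channel feature map is shown:

```python
import numpy as np

def predict_kernel(lr_feat, ksize=3):
    """Hypothetical meta-network stand-in: derive a normalized ksize×ksize
    kernel from simple statistics of the LR features."""
    stats = np.abs(lr_feat).mean() + np.arange(ksize * ksize).reshape(ksize, ksize)
    k = np.exp(-stats)      # positive weights
    return k / k.sum()      # normalize so the kernel sums to 1

def dynamic_conv(f, k):
    """f' = k * f : convolve feature map f with the predicted kernel k."""
    pad = k.shape[0] // 2
    padded = np.pad(f, pad, mode="edge")
    out = np.empty_like(f)
    H, W = f.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + 2*pad + 1, j:j + 2*pad + 1] * k)
    return out

f = np.random.default_rng(0).random((16, 16))   # toy feature map
k = predict_kernel(f)
fp = dynamic_conv(f, k)
print(fp.shape)  # (16, 16)
```

The key property is that k is input-dependent: each LR image yields its own kernel, letting the module adapt to unknown degradations.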
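The pre-norm residual structure of Eqs. (3)-(4) can likewise be sketched with untrained random weights. This minimal version attends over a single window of tokens and omits window shifting and relative position bias:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LN over the feature dimension."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def msa(x, Wq, Wk, Wv, Wo, heads=2):
    """Multi-head self-attention over one window of N tokens."""
    N, d = x.shape
    dh = d // heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(heads):
        sl = slice(h * dh, (h + 1) * dh)
        attn = q[:, sl] @ k[:, sl].T / np.sqrt(dh)
        attn = np.exp(attn - attn.max(-1, keepdims=True))   # softmax
        attn /= attn.sum(-1, keepdims=True)
        out[:, sl] = attn @ v[:, sl]
    return out @ Wo

def swin_block(x, params):
    """Eqs. (3)-(4): pre-norm attention and MLP, each with a residual add."""
    Wq, Wk, Wv, Wo, W1, W2 = params
    x_hat = msa(layer_norm(x), Wq, Wk, Wv, Wo) + x    # Eq. (3)
    h = np.maximum(layer_norm(x_hat) @ W1, 0.0)       # MLP hidden layer
    return h @ W2 + x_hat                             # Eq. (4)

rng = np.random.default_rng(0)
N, d = 16, 8   # a 4×4 window of 8-dim tokens (toy sizes)
params = [rng.normal(0, 0.1, s) for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
x = rng.normal(size=(N, d))
y = swin_block(x, params)
print(y.shape)  # (16, 8)
```

Because attention is restricted to fixed windows, cost grows linearly with the number of windows, which is the source of the O(HW) complexity noted above.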
Nitin Varshney*
10.5281/zenodo.18899050