CFAT: Unleashing Triangular Windows for Image Super-resolution

Abstract

Transformer-based models have revolutionized the field of image super-resolution (SR) by harnessing their inherent ability to capture complex contextual features. The overlapping rectangular shifted window technique used in transformer architecture nowadays is a common practice in super-resolution models to improve the quality and robustness of image upscaling. However, it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses, we propose a non-overlapping triangular window technique that synchronously works with the rectangular one to mitigate boundary-level distortion and allows the model to access more unique sifting modes. In this paper, we propose a Composite Fusion Attention Transformer (CFAT) that incorporates triangular-rectangular window-based local attention with a channel-based global attention technique in image super-resolution. As a result, CFAT enables attention mechanisms to be activated on more image pixels and captures long-range, multi-scale features to improve SR performance. The extensive experimental results and ablation study demonstrate the effectiveness of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB performance improvement over other state-of-the-art SR architectures. Find the code Here.

1. Introduction

Efficient transmission or storage of images in band- or memory-limited systems heavily relies on diverse image compression techniques. However, in the real world, these image compression phenomenons are lossy. It is challenging to retrieve the original visual quality from its compressed correspondent. Image super-resolution (SR) [10], a reverse engineering approach in contrast to compression, aims to recover high-resolution (HR) images from their low-resolution (LR) counterparts. Despite extensive and consistent research efforts, SR models still struggle to eliminate the visual artifacts in recovered images entirely. In early 2014, the evolution of deep neural networks enhanced SR capabilities with CNN as a mainframe skeleton [10].

Figure 1. Proposed CFAT vs other SOTA models. RW/TW: Rectangular/Triangular Window, MSA: Multi-Head Attention, (D): Dense, (SD): Shifted Dense, (S): Sparse, (O): Overlapping

These approaches [9, 19, 44, 45] frequently come with substantial computational and memory requirements regardless of their effectiveness. The transformer-based techniques have stood out as the most successful SR approaches among many emerging models, showcasing state-of-the-art performances in the recent era.

The surge in popularity of transformer-based architectures is attributed to their natural proficiency in leveraging long-range dependencies among image features. Leading approaches such as SwinIR[25], Swin2SR[8], ESRT[28], HST[21], HAT[6] and ART[43] advocate the above benefit. They predominantly rely on a hierarchical attention mechanism, ensuring the reconstruction of HR outputs from their LR counterparts with enhanced precision. SwinIR [27] pioneered the application of the shifted window technique in the domain of image restoration [25], focusing mainly on image super-resolution. Swin2SR [8] redefined the SwinIR to enhance the performance after effectively tackling data and training-related challenges. However, both [25] and [8] consistently concentrate exclusively on specific image patches for self-attention. The dilated window approach introduced in ART [43] aims not only to resolve the fixed window problem as mentioned earlier but also to alleviate computational load while concurrently boosting super-resolution performance. Nevertheless, the quick internal switching between dense and sparse attention layers, each tailored to specific receptive fields, imposes constraints on overall performance. HAT [5] outperforms other prominent super-resolution architectures by a significant 1 dB margin, achieved through integrating global channel attention with local window-based self-attention. The incorporation of the overlapping shifted window technique effectively addresses concerns related to repetitive image patches, while the introduction of channel attention substantially enhances super-resolution performance. Drawing inspiration from the CNN-transformer model, ACT [40] strategically positions both CNN and transformer within two distinct streams to concurrently leverage global and local features. The harmonious, complementary features on both branches interchange relatable information with one another through a channel split fusion block based on crossscale token attention (CSTA). The above-discussed state-of-the-art transformer-based methods use a rectangular shifted window-based self-attention mechanism. The repetitive use of rectangular window attention at each layer is susceptible to causing boundary-level distortion, as the participation of neighbors for boundary pixels is limited. In addition to this, the symmetrical structure of the rectangular window restricts the range of unique shifting modes available for covering the entire image patches. The presence of boundary-level distortion and a limited range of shifting modes in the shifted rectangular window technique could potentially hinder its effectiveness, extending beyond super-resolution to other computer vision tasks.

To tackle the challenges mentioned earlier, we propose a pioneering triangular window technique. This novel window technique helps boost the performance for super-resolution tasks and all computer vision applications where the rectangular window technique is predominantly used. This paper seamlessly incorporates the proposed triangular window with the conventional rectangular window in a transformer framework called Composite Fusion Attention Transformer (CFAT). Integrating a channel-based global attention unit with the above triangular-rectangular window-based local self-attention further improves the performance of single-image SR. The above attention pair belongs to the non-overlapping attention category and effectively leverages local and global image features. To exploit the overlapping image features, we place an overlapping cross-fusion attention block (OCFAB) at the end of each unit block called the Dense Window Attention Block (DWAB). The proposed model, CFAT, offers several unique advantages: (i) The combined window-based self-attention technique eliminates the boundary level distortion issue that arises during traditional rectangular window-based self-attention. (ii) Embracing a triangular window offers a broader range of shifting modes, thereby expanding spatial features and enhancing overall performance. (iii) OCFAB leverages overlapping spatial image features, resulting in a further enhancement of performance.

The proposed CFAT network delivers superior SR results across multiple benchmark datasets, notably outperforming recent transformer-based SR techniques [21, 27]. The novelty of our paper in comparison to other state-of-the-art methods like SwinIR [27], ART [43], and HAT [5] is outlined in Fig. 1. Our contributions in this paper can be summarized as:

1. We are the first to introduce the triangular windowing mechanism in the computer vision task. We smoothly integrate it with traditional rectangular windows to employ non-overlapping self-attention in single-image SR. This combination phases out the constraints associated with the conventional rectangular window approaches like boundary-level distortion and restricted unique shifting modes.

2. This window mechanism is beneficial not only in super-resolution tasks but also in various other computer vision applications that implement the rectangular window technique in their mainframe.

3. We propose a CFAT that exploits local and global image features through overlapping and non-overlapping spatial and channel attention, respectively.

4. Through comprehensive evaluations on multiple benchmark SR datasets, our method has delivered superior performance compared to other state-of-the-art methods.

2. Related Works

2.1. CNN Based Super-Resolution

Initially, most single-image super-resolution (SISR) approaches relied heavily on convolutional neural networks (CNNs) due to their remarkable spatial feature extraction capability over traditional machine learning approaches. SRCNN [11] stands as a pioneering effort, leveraging a shallow three-layer CNN architecture to map low-resolution (LR) images to their high-resolution (HR) counterparts. However, a sharp decline in performance is observed while increasing the number of CNN layers intended for extracting more complex features. EDSR [26] is an SR architecture based on the Residual Network (ResNet) framework, which mitigates this limitation by including shortcut connections. Introducing channel attention mechanisms in super-resolution, RCAN [44] presents a unique residual-in-residual (RIR) design. RDN [45] utilizes a residual dense block (RDB) that efficiently extracts ample local features through densely interconnected CNN layers. SAN [9] is built upon a second-order attention network that takes SR performance to a new height. LatticeNet [29] incorporates the Fast Fourier Transform (FFT) technique to construct a lattice design, enabling lightweight super-resolution. [12, 14, 18, 20, 26, 34], and [17] are other prominent CNN-based SR models that demonstrate notable advancement in performances over their predeces-

Figure 2. The overall architecture of CFAT with all internal units.

sors. These models incorporate advanced CNN architectures, such as residual blocks [26, 44], dense blocks [45], and attention mechanisms [14, 34], to enhance feature generalization, ultimately leading to superior outcomes. While CNN-based methods have seen significant success in SISR, they do come with limitations. Specifically, they can only extract local features from images, and they often entail high computational costs for both training and deployment.

2.2. Vision Transformer (ViT) Based SR

The transformer architecture [37] operates on the principle of self-attention, a technique that generalizes features in terms of a global spatial perspective and can capture long-range dependencies between them. ViT, introduced in [13], is evident as the first work to substitute standard convolution with the transformers in high-level vision tasks. Following this, in many computer vision applications, including image recognition [24, 33], object detection [3, 7], image segmentation [2, 15], and image SR [30, 35, 41, 46], transformer-based architecture replaces conventional CNN-based models to elevate their performances. Building on transformer models, SwinIR[25] employs the Swin transformer with window-based self-attention for image restoration. Its improved version, Swin2SR [8], utilizes the SwinV2 transformer specifically for compressed image super-resolution and restoration tasks. Meanwhile, ESRT [28] introduces efficient transformers that achieve competitive results at a reduced computational cost. ART [43] incorporates both sparse and dense attention modules within a single network that aim to refine image restoration results. In a recent advancement, [6] introduced a transformer architecture, HAT, which combines channel attention with overlapping and non-overlapping self-attention mechanisms to activate more pixels, thereby generalizing both global and local spatial features. Although the Vision Transformer has shown its superiority in modeling long-range dependencies, there are still many works that demonstrate that the convolution can help the transformer to achieve better visual representation [22, 38, 39]. Drawing inspiration from the CNNtransformer model, ACT [40] strategically places both CNN and transformer within two distinct streams to concurrently leverage global and local features. The above SR models adopt a shifted rectangular window technique to limit their spatial features for effective localized self-attention. In contrast, these models experience boundary-level distortion and encounter limitations in shifting modes when exclusively relying on the rectangular window technique. In this paper, we propose a pioneering triangular window technique that works alongside a rectangular one to wipe out these drawbacks in SISR.

3. Proposed Method

3.1. Overall Architecture

As presented in Fig.2, the entire network can be split into three segments: the head, body, and tail. The head module is responsible for shallow feature extraction, the body extracts the deep features, and the tail module reconstructs the HR images from LR counterparts at the output stage. The in-depth depiction of the above three is as follows:

3.2. Head Module - Shallow Feature Extractor:

We feed the input image, ILR ∈ RH×W ×Cin to a feature extractor to get shallow output , where H, , and C are the height, width, input channel count and output channel count of the respective image. This ex- tractor is a single convolution layer of kernel size .

3.3. Body Module - Deep Feature Extractor:

The alternative units of Dense Window Attention Blocks (DWAB) and Sparse Window Attention Blocks (SWAB) are ultimately combined with a convolution layer to form the deep extractor module. The shallow output from the head module, Fsh ∈ RH×W ×C , is passed through this deep extractor to yield deep features Fdp ∈ RH×W ×C . Each unit of DWAB and SWAB is comprised of both rectangular and our proposed triangular window attention units in dense and sparse configurations, respectively.

3.3.1 Dense Window Attention Blocks (DWAB)

Each unit of DWAB implements both non-overlapping and overlapping dense window attention that helps to integrate deep and diverse features into a single network. Our design incorporates multiple units of non-overlapping attention units, called Dense-Hybrid Window Attention Block (D-HWAB), initially to get deep features and place an overlapping attention unit, called Overlapping Cross Fusion Attention Block (OCFAB), at the end to get diverse features as well. A 3 × 3 convolution layer at the end makes the features richer and more effective, as cited in [22, 38, 39]. Mathematically, the DWAB can be represented as

where, , and fconv stands for non-overlapping, overlapping and convolution operations, respectively.

Dense-Hybrid Window Attention Block (D-HWAB): The underlying module advocates dense attention, alternatively combining shifted rectangular and novel triangular transformer units. The Multi-head Self Attention (MSA) carried out in both units is called Shifted-Dense Rectangular Window MSA ((SD)RW-MSA) and Shifted-Dense Triangular Window MSA ((SD)TW-MSA). Both of them execute non-overlapping self-attention. The combination of channel attention features from Channel Wise Attention Block (CWAB) and spatial attention features from (SD)RW-MSA and (SD)TW-MSA in both transformer units, as shown in Fig.2, elevates the overall performance. The overall computation can be represented as

where, FDA, f xrect , and f xtri are dense attention features, rectangular and triangular window dense attention, respectively. The output feature from rectangular transformer unit can be expressed as

where, Fin, Fint , and Fout are input, intermediate and output features respectively. , and fCA are the feature transformation in (SD)RW-MSA, first LayerNorm, second LayerNorm and CWAB respectively. Similarly, the output feature from triangular transformer unit can be expressed as

where, is the feature transformation of (SD)TW- MSA. The in equation 4 and in equation 6 are two hyper-parameter to limit the dominancy of channel attention in of rectangular and triangular transformer units respectively.

Overlaping Cross Fusion Attention Block (OCFAB): This block overlaps the features across neighboring windows and establishes cross-attention between them to improve performance further [6]. The underlying attention is called Overlapping Cross Fusion Attention (OCFA). The OCFAB is implemented by a sliding window approach called unfolding operation, using a kernel of size , a stride of R, and padding of kR2 zeros to make the sizes compatible for overlapping. The input feature of dimension H × W × C are divided into HWR2 overlapping windows of size . Here, is determined by

where ’k’ is a constant regulating overlapping ratio and R represents the size of window before overlapping. During MSA, the query is calculated by a linear layer, whereas key and value are produced by a linear layer followed by an unfolding operation. The OCFAB block is present at the end of both DWAB and SWAB, as shown in Fig. 2.

Channe-Wise Attention Block (CWAB): Most studies show the practice of using standard convolution while executing squeeze-excitation channel attention [6]. However, we use depthwise-pointwise convolution that helps to reduce the squeeze factor without increasing the computational burden. Therefore, we design CWAB with a GELU activation function sandwiched between two depthwise-pointwise convolutional layers followed by a channel-wise attention module at the end.

3.3.2 Sparse Window Attention Block (SWAB)

Like DWAB, SWAB promotes non-overlapping and overlapping window attention on sparse windows instead of dense ones. This block also includes four operations as shown in Fig.2: (i) non-overlapping rectangular sparse window attention, (ii) non-overlapping triangular sparse window attention, (iii) overlapping cross-fusion attention, and (iv) convolution operation. The latter two are the same as discussed in section 3.3.1, but the former two are a little different in terms of rectangular and triangular window formation compared to DWAB. First, we prepare sparse windows from the input of dimension H × W × C by maintaining a regular gap, called interval size I, between consecutive input features. After that, we follow the standard rectangular and triangular window techniques to implement respective attention. Section 4 provides more details about triangular sparse attention. We follow a similar procedure in implementing rectangular sparse attention.

3.3.3 Tail Module - HQ Image Reconstruction

In the final stage, we obtain the upscaled input signal () by passing the deep features () from body through a pixel-shuffle layer sandwiched between a pair of convolution layers. This is represented by ISR = fT (Fdp) .

4. Triangular Window Technique

Design: Considering an input image Iin of dimension H × W × C , we first divide this input into HWM 2 rectangu- lar windows of size . Then, each rectangular patch is again split into four triangular windows as displayed in Fig.3. In this figure, we take M as 32, which provides four triangular masks of linear dimension . These four triangular windows are called upper, right, lower, and left triangular windows. The corresponding rectangular windows are also shown in this figure.

Advantages over Rectangular Window: In computer vision, the attributes of a pixel depend on itself and its neighbor as well. As a result, the edge pixels of the rectangular window are not as effectively explored as the interior pixels during self-attention, leading to distortions at the boundaries. The distortion will become so pronounced that it will ultimately deteriorate model performance. The above problem can be prominently seen in both rectangular and triangular windows while implemented alone. Therefore, we combine the rectangular window-based MSA with our proposed triangular one in series. This alternative configuration of rectangular and triangular transformers mitigates the edge-level distortion of one another.

Many studies confirm that the shifted rectangular window, in conjunction with the non-shifted version, can vastly enhance the SR performance [6, 8, 25, 43]. However, we can observe a restriction in shifting the rectangular window as repetition occurs due to its isometric geometry. From Fig.3, it is noticeable that the coverage length of the triangular window is more than the rectangular one in xdimension. Due to this extended coverage, the triangular windows are allowed more non-identical shifts than rectangular windows. As shown in Fig.4, the rectangular window has non-identical shifting modes of 0 and 8, whereas the triangular window has 0, 8, 16, and 24. The availability of more unique shifting modes in triangular windows enhances the model’s performance further than rectangular ones by mitigating the edge-related artifacts at the boundaries, aligning the features that contribute to improving localization accuracy and providing greater adaptability to noncentralized image patterns.

Figure 3. A rectangular and triangular patch in 32 × 32 window.

Figure 4. Shifting modes of rectangular and triangular windows in a 64 × 64 image patch

Due to the structural heterogeneity that exists between rectangular and triangular window patches, as shown in Fig. 3, the spatial features that participated in the alternative connection of triangular and rectangular self-attention inside DWAB or SWAB are different from each other. These diverse activated features from successive triangular and rectangular attention leverage more diverse spatial features to participate in HR reconstruction. It boosts the performance further.

Computational Cost: The first step of MSA used in transformer calculates query q = xwq , key and value v = xwv matrices for an input x ∈ RHW ×C and the corresponding three trainable parameter matrices wq ∈ RC×Cq, wk ∈ RC×Ck , and wv ∈ RC×Cv . In our case, we consider and HW ≫ C . For mathematical convenience, we assume the number of heads equals 1. Considering space complexity for (i) qkT /√C equal to , (ii) exponential and row sum operations of softmax equal to , (iii) division operation of softmax equal to O([HW]2) , and value multiplication operation equal to O([HW]2C , the approximate computational complexity of MSA is

The above equation confirms the high computational cost of deploying ViTs. The rectangular window technique is the only solution to lower the complexity so far. However, we proposed a triangular window technique that not only reduces the computational burden but also yields better results than a rectangular counterpart while working alone and even produces superior performance while employed synchronously with a rectangular window. In this paper, we introduce two variants of attention relating to the proposed triangular window: (i) dense triangular attention and (ii) sparse triangular attention.

The dense triangular attention has a 1D token size of , where L is the triangular window equivalent of rectangular window token L × L . The computational cost of the proposed dense TW-MSA is

The effective sequence length in sparse triangular attention is the same as dense but for a wider receptive field. The interval length S controls the sparsity in the triangular window during the attention mechanism. S can be compared with the stride in convolution operation. The computational cost of sparse TW-MSA in terms of parameter S is

The computational cost of dense TW-MSA and sparse TWMSA are drastically improved as in equation 10 and ( HWS )2 ≪ (HW)2 in equation 11, respectively.

5. Experiments

5.1. Datasets and Performance Matrices

We train the proposed model with two widely used DIV2K[36] and Flicker2K [26] datasets to avoid overfitting while training only on DIV2K and evaluate on standard SR benchmarks: Set5[1], Set14[42], BSD100[31], Urban100[16], and Manga109[32]. We evaluate the proposed SR model using PSNR and SSIM on the Y channel in YCbCr Space.

5.2. Experimental Settings

We perform geometric data augmentation where each training sample is replaced by its flipped or rotated ( 90◦, 180◦ , and 270◦ ) equivalent. The augmented input samples are then cropped into 6464 patches before passing through the model. While configuring the proposed CFAT, we put 3 DWAB and 3 SWAB in an alternative order. We maintain 4 D-HWAB and 4 S-HWAB inside each DWAB and SWAB unit, respectively. Here, each D-HWAB and SHWAB unit represents a unique shift size, i.e., 0, 8, 16, and 24. We take 180 channels, while attention heads and window sizes are set to 6 and 16 for all units of (SD)RWMSA, (S)TW-MSA, and OCFA. Other hyperparameter values are discussed in the ablation study. We set the batch size to 32. We use ADAM optimizer for model training with β1 = 0.9 and β2 = 0.999. The weight decay is set to zero. The total iteration during training is set to 250K. The initial learning rate is set to 2e-4, which is reduced to half

after [112.5K, 175K, 200K, 225K] iterations. We use loss for model training. We take the pre-trained model and finetune it for resolution. We also introduced a more compact CFAT, namely CFAT-S, with 144 channels, featuring a depthwise-pointwise convolution in CWAB. The experiments are performed using Python 3.10.11 based PyTorch 2.0.1 deep learning framework on an Ubuntu 20.04.2 machine with A100-PCIE-40GB GPU enabled with Nvidia CUDA 10.1.243 and CuDNN 8.1.0. Table 1. Performance based on various triangular & rectangular window size, and shift size. L: Linear, S: Square.

Table 2. Ablation study for α, β , and channel counts.

Table 3. Performance based on overlapping constant.

Table 4. Ablation study for CWAB, RWAB, and TWAB.

5.3. Ablation Study

5.3.1 Impact of Various Model Hyperparameters :

To realize the impact of various model hyperparameters, we conduct all our experiments on BSD100 dataset for a scale of and a epoch of 70 and compute the model’s performance in terms of PSNR.

Window Size and Shift Size : In [6], the authors reveal that optimum rectangular window size can activate more pixels that in turn elevate model performances. [23] and [6] investigate their model with square patch sizes of (8, 12) and (8, 16) respectively. In this paper, we inspect our model on the square patch of (8, 12, 16) for the rectangular and linear patch of (8*8, 16*16) for triangular windows. The experimental results are shown in Tab. 1. Again, we evaluate the performance with (8, 16, 24, variable) shift and without shift (0), which is also mentioned in this table. Here, the ’variable’ refers to varying shift size from in succession. From the above, we can conclude that the model yields the best PSNR for 16∗16 triangular linear-window, rectangular square-window, and variable shift size.

Interval Size : Interval size (I) plays a decisive role in estimating model performance [43]. A smaller I refers to the wide receptive field with high computation, and a larger I implies the opposite. However, we prioritize the performance and compute model results by varying the I. We can observe from Tab. 3 that the model gives the best performances for interval sizes 0 and 2. However, we chose I = 2 as it allows us to design our model with fewer computations.

α, β , Channel Counts and CWAB: To examine the influence of weighing factor and , we measure the model outcome for four different values (0, 1, 0.1, 0.01) of and separately. As displayed in Tab. 2, these experiments specify the optimum value of and as 0.01 and 0.015, respectively. We also scrutinize the effect of channel count in CFAT as shown in Tab. 2. We select 180 as the channel counts for CFAT as the rise in channel non-linearly increases the model parameters, and also, the model performance slowly gets saturated after 180.

Overlapping Constant : In OCFAB, we use a constant k to determine the overlapping range between two consecutive windows during cross-attention. We study the impact of various overlapping ratios by adjusting k values ranging from 0 to 0.75, and the results are presented in Tab. 3. Here, k = 0 corresponds to a standard Transformer block. The results displayed in this table indicate that the model delivers optimal performance for k = 0.5.

5.3.2 Impact of CWAB :

We have performed three experiments to study the importance of CWAB. From Tab. 4, it is quite evident that the PSNR value of CFAT without CWAB is 0.1dB lower than CFAT using CWAB with standard convolution. However, we get a slight performance advantage when it comes to model outcomes with CWAB using depthwise-pointwise (DP) convolutions and lower squeeze factor. So, we get a better result without a computational trade-off, as the latter two have the same complexity.

5.3.3 Impact of Triangular and Rectangular Windows:

In CFAT, RWAB and TWAB modules carry significant roles in CFAT realization. From Tab. 4, we can observe a significant drop of 0.21 ∼ 0.16 in PSNR for our model without TWAB or RWAB. These results confirm our design of including both units in the final model. They combinely truncate the effect of edge-level distortion that leads to enhanced performance. In this table, we also observe a narrow advantage for TWAB over RWAB when individually evaluated due to the feasibility of extra shifting modes.

5.4. Comparisons with State-of-the-art Methods

Quantitative Analysis : We select CNN-based EDSR[26], SAN[9], and HAN[34] with transformer-based IPT[4], SwinIR[25], Swin2SR[8], ACT[40], ART[43], EDT[23], HAT[6] state-of-the-art architectures to compare our model quantitatively in term of PSNR and SSIM. Tab. 5 presents the quantitative comparison of CFAT based on scale factors , and for both DIV2K and DF2K (DIV2K+Flickr2K) training datasets. HAT[6] is considered the previous best model whose outcome outplays any other SOTA models. However, due to the undistort and rich feature exploration capability of CFAT, it yields superior performances than HAT [6] for all scale factors across all five benchmark datasets. The superior results confirm the significance of using triangular window attention in CFAT. The small variant of CFAT, called CFAT-s, is also very competitive while comparing its outcomes with some well-known transformer-based architectures like IPT [4], SwinIR [25] and Swin2SR [8].

Visual Analysis : For visual assessment, we select an image from each of the four distinct testing datasets: “img 003” from Set5, “img 072” from Urban100, “img 013” from Set14, and “img 005” from BSD100. CFAT retrieves all these images with less blur and fair lattice content, as shown in Fig.5. The PSNR and SSIM value for each image in this figure confirms our model’s superiority over all SOTA models. Our model reduces the pixel level distortion more efficiently than any other CNN- and transformer-based leading models due to the inclusion of dense and sparse triangular window attention.

Comparison based on Computational Cost : From Tab. 6, we observe that the performance of CFAT outperforms other SOTA models with a balanced trade-off between parameter and multi-add count. Our model has a few more parameters than HAT [6] but shows fewer counts of multiadds while giving superior results on the BSD100 dataset.

Table 6. Model comparison based on computational cost

6. Conclusion

We propose a rectangular and triangular window attentionbased SR architecture called Composite Fusion Attention Transformer (CFAT). This combinational design eliminates boundary-level distortion problems and opens the

Table 5. Quantitative comparison of the CFAT with various state-of-the-art SR methods. red and green color indicate the best and second best respectively

Figure 5. Visual Comparison of CFAT with other state-of-the-art methods.

gate for integrating more non-identical shifting modes. Dense and sparse attention in both windows allows interaction between local and global image features on the same platform. We also combine more diverse features through non-overlapping triangular and rectangular window-based self-attention and overlapping window-based cross-attention. Extensive experiments on multiple datasets show the effectiveness of CFAT. By incorporating the novel triangular window attention in dense, sparse, and shifted configuration, CFAT outperforms the other state-of-the-art models qualitatively and quantitatively.

References

[1] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012. 6

[2] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xi- aopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision (ECCV), pages 205–218. Springer, 2022. 3

[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision (ECCV), pages 213–229. Springer, 2020. 3

[4] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yip- ing Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Computer Vision and Pattern Recognition (CVPR), pages 12299–12310. IEEE/CVF, 2021. 7, 8

[5] Xiangyu Chen, Xintao Wang, Jiantao Zhou, and Chao Dong. Activating more pixels in image super-resolution transformer. arxiv 2022. arXiv preprint arXiv:2205.04437, 1, 2022. 2, 8

[6] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In Computer Vision and Pattern Recognition (CVPR), pages 22367–22377. IEEE/CVF, 2023. 1, 3, 4, 5, 6, 7

[7] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haib- ing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems (NeurIPS), 34:9355–9366, 2021. 3

[8] Marcos V Conde, Ui-Jin Choi, Maxime Burchi, and Radu Timofte. Swin2sr: Swinv2 transformer for compressed image super-resolution and restoration. In European conference on computer vision (ECCV), pages 669–687. Springer, 2022. 1, 3, 5, 7, 8

[9] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Computer Vision and Pattern Recognition (CVPR), pages 11065–11074. IEEE/CVF, 2019. 1, 2, 7, 8

[10] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision (ECCV), pages 184–199. Springer, 2014. 1

[11] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence (PAMI), 38(2):295–307, 2015. 2

[12] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerat- ing the super-resolution convolutional neural network. In European conference on computer vision (ECCV), pages 391– 407. Springer, 2016. 2

[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3

[14] Yong Guo, Jian Chen, Jingdong Wang, Qi Chen, Jiezhang Cao, Zeshuai Deng, Yanwu Xu, and Mingkui Tan. Closedloop matters: Dual regression networks for single image super-resolution. In Computer Vision and Pattern Recognition (CVPR), pages 5407–5416. IEEE/CVF, 2020. 2, 3

[15] Gao Huang, Yulin Wang, Kangchen Lv, Haojun Jiang, Wen- hui Huang, Pengfei Qi, and Shiji Song. Glance and focus networks for dynamic visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 45(4): 4605–4621, 2022. 3

[16] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Sin- gle image super-resolution from transformed self-exemplars. In Computer Vision and Pattern Recognition (CVPR), pages 5197–5206. IEEE/CVF, 2015. 6

[17] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accu- rate single image super-resolution via information distillation network. In Computer Vision and Pattern Recognition (CVPR), pages 723–731. IEEE/CVF, 2018. 2

[18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Computer Vision and Pattern Recognition (CVPR), pages 1646–1654. IEEE/CVF, 2016. 2

[19] Xiangtao Kong, Hengyuan Zhao, Yu Qiao, and Chao Dong. Classsr: A general framework to accelerate super-resolution networks by data characteristic. In Computer Vision and Pattern Recognition (CVPR), pages 12016–12025. IEEE/CVF, 2021. 1

[20] Christian Ledig, Lucas Theis, Ferenc Husz´ar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photorealistic single image super-resolution using a generative adversarial network. In Computer Vision and Pattern Recognition (CVPR), pages 4681–4690. IEEE/CVF, 2017. 2

[21] Bingchen Li, Xin Li, Yiting Lu, Sen Liu, Ruoyu Feng, and Zhibo Chen. Hst: Hierarchical swin transformer for compressed image super-resolution. In European Conference on Computer Vision (ECCV), pages 651–668. Springer, 2022. 1, 2

[22] Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guan- glu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2023. 3, 4

[23] Wenbo Li, Xin Lu, Jiangbo Lu, Xiangyu Zhang, and Jiaya Jia. On efficient transformer and image pre-training for lowlevel vision. arXiv preprint arXiv:2112.10175, 3(7):8, 2021. 6, 7, 8

[24] Y Li, K Zhang, J Cao, R Timofte, and L LocalViT Van Gool. Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021. 3

[25] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration us-

ing swin transformer. In International Conference on Computer Vision (ICCV), pages 1833–1844. IEEE, 2021. 1, 3, 5, 7, 8

[26] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Computer Vision and Pattern Recognition Workshops (CVPR-W), pages 136–144. IEEE/CVF, 2017. 2, 3, 6, 7, 8

[27] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In International Conference on Computer Vision (ICCV), pages 10012–10022. IEEE, 2021. 1, 2

[28] Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Lin- lin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. In Computer Vision and Pattern Recognition (CVPR), pages 457–466. IEEE/CVF, 2022. 1, 3

[29] Xiaotong Luo, Yuan Xie, Yulun Zhang, Yanyun Qu, Cui- hua Li, and Yun Fu. Latticenet: Towards lightweight image super-resolution with lattice block. In European conference on computer vision (ECCV), pages 272–289. Springer, 2020. 2

[30] Ziwei Luo, Haibin Huang, Lei Yu, Youwei Li, Haoqiang Fan, and Shuaicheng Liu. Deep constrained least squares for blind image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17642–17652, 2022. 3

[31] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision (ICCV), pages 416–423. IEEE, 2001. 6

[32] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76:21811–21838, 2017. 6

[33] Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. Adavit: Adaptive vision transformers for efficient image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 12309–12318. IEEE/CVF, 2022. 3

[34] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. In European conference on computer vision (ECCV), pages 191–207. Springer, 2020. 2, 3, 7, 8

[35] Yajun Qiu, Qiang Zhu, Shuyuan Zhu, and Bing Zeng. Dual circle contrastive learning-based blind image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology, 2023. 3

[36] R Timofte, S Gu, J Wu, and L NTIRE Van Gool. challenge on single image super-resolution: Methods and results. In Computer Vision and Pattern Recognition (CVPR), pages 18–22. IEEE/CVF, 2018. 6

[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017. 3

[38] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In International Conference on Computer Vision (ICCV), pages 22–31. IEEE, 2021. 3, 4

[39] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Doll´ar, and Ross Girshick. Early convolutions help transformers see better. Advances in Neural Information Processing Systems (NeurIPS), 34:30392–30400, 2021. 3, 4

[40] Jinsu Yoo, Taehoon Kim, Sihaeng Lee, Seung Hwan Kim, Honglak Lee, and Tae Hyun Kim. Enriched cnn-transformer feature aggregation networks for super-resolution. In Winter Conference on Applications of Computer Vision (WACV), pages 4956–4965. IEEE/CVF, 2023. 2, 3, 7, 8

[41] Lei Yu, Xinpeng Li, Youwei Li, Ting Jiang, Qi Wu, Hao- qiang Fan, and Shuaicheng Liu. Dipnet: Efficiency distillation and iterative pruning for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1692–1701, 2023. 3

[42] Roman Zeyde, Michael Elad, and Matan Protter. On sin- gle image scale-up using sparse-representations. In International Conference Curves and Surfaces, pages 711–730. Springer, 2012. 6

[43] Jiale Zhang, Yulun Zhang, Jinjin Gu, Yongbing Zhang, Linghe Kong, and Xin Yuan. Accurate image restoration with attention retractable transformer. arXiv preprint arXiv:2210.01427, 2022. 1, 2, 3, 5, 7, 8

[44] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In European conference on computer vision (ECCV), pages 286–301. Springer, 2018. 1, 2, 3

[45] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Computer Vision and Pattern Recognition (CVPR), pages 2472–2481. IEEE/CVF, 2018. 1, 2, 3

[46] Qiang Zhu, Pengfei Li, and Qianhui Li. Attention retractable frequency fusion transformer for image super resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1756–1763, 2023. 3

Designed for Accessibility and to further Open Science