
Video super-resolution based on deformable 3D convolution group fusion

The aim of VSR is to reconstruct the HR reference frame by using the spatio-temporal information in the LR video frame sequence. The network structure in this study consists of a shallow feature extraction module, a deformable 3D convolution intra-group fusion module, an inter-group attention module, and a reconstruction module.

First, we divided the input video frames into temporal groups, making full use of the different frame rates of the groups. The video frames of each group were then fed into a 3D convolution shallow feature extraction module for preliminary feature extraction, which also models the motion state of each group. The extracted preliminary features were passed to the deformable 3D convolution intra-group fusion module, which exploits the adaptive motion compensation capability of deformable 3D convolution to extract and merge the features of the video frames within each group. To integrate the information across groups, a temporal attention mechanism between groups was introduced for inter-group fusion. Finally, the fused feature map was fed into the reconstruction module, which consists of six cascaded residual blocks and a sub-pixel layer, and the resulting residual map was added to the upsampled reference frame \(\hat{I}_t\). The overall network structure is shown in Fig. 1.

Fig. 1

The proposed network structure.
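To make the overall data flow concrete, the following is a minimal PyTorch-style sketch of the pipeline described above. The submodule names (shallow, intra_fusion, inter_fusion, reconstruction) and the 4x scale factor are placeholders assumed for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupFusionVSR(nn.Module):
    """Sketch of the grouped VSR pipeline; submodules are supplied externally."""
    def __init__(self, shallow, intra_fusion, inter_fusion, reconstruction, scale=4):
        super().__init__()
        self.shallow = shallow                # 3D-conv shallow feature extraction
        self.intra_fusion = intra_fusion      # deformable-3D-conv intra-group fusion
        self.inter_fusion = inter_fusion      # temporal-attention inter-group fusion
        self.reconstruction = reconstruction  # residual blocks + sub-pixel layer
        self.scale = scale

    def forward(self, groups, reference):
        # groups: list of N tensors, each (B, C, 3, H, W) -- one temporal group
        # reference: (B, C, H, W) -- the LR reference frame I_t^l
        group_feats = [self.intra_fusion(self.shallow(g)) for g in groups]
        fused = self.inter_fusion(group_feats)               # inter-group fusion
        residual = self.reconstruction(fused)                # HR residual map
        upsampled = F.interpolate(reference, scale_factor=self.scale,
                                  mode='bicubic', align_corners=False)
        return upsampled + residual                          # final HR frame
```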

Temporal grouping

In previous work, the input to VSR was the reference frame \(I_{t}^{l}\) and its 2N neighboring frames \(\{I_{t-N}^{l}, \ldots, I_{t-1}^{l}, I_{t+1}^{l}, \ldots, I_{t+N}^{l}\}\). These traditional approaches relied mainly on optical flow, which led to inefficient temporal fusion of neighboring frames and insufficient use of temporal information.

On this basis, we divided the 2N neighboring frames into N groups according to their temporal distance from the reference frame. The original sequence was rearranged into \(\{G_1, \ldots, G_N\}\), \(n \in [1:N]\), where \(G_n = \{I_{t-n}^{l}, I_{t}^{l}, I_{t+n}^{l}\}\) is the sub-sequence consisting of the previous frame \(I_{t-n}^{l}\), the reference frame \(I_{t}^{l}\), and the next frame \(I_{t+n}^{l}\). Taking a 7-frame video sequence as an example, \(I_{4}^{l}\) is the reference frame and the other frames are neighboring frames. These 7 frames are divided into three groups \(\{I_{3}^{l}, I_{4}^{l}, I_{5}^{l}\}\), \(\{I_{2}^{l}, I_{4}^{l}, I_{6}^{l}\}\) and \(\{I_{1}^{l}, I_{4}^{l}, I_{7}^{l}\}\) according to their frame rates. Since the motion information of neighboring frames differs across temporal distances, their contributions are not the same. Such a grouping method can therefore explicitly and efficiently integrate neighboring frames from different temporal distances. The reference frame in each group can guide the network so that information extraction and fusion in the subsequent intra-group fusion modules become more efficient.
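The grouping itself is a simple re-indexing of the input frames. Below is a minimal sketch under the 7-frame setting described above; tensor shapes and the function name are illustrative assumptions.

```python
import torch

def make_temporal_groups(frames: torch.Tensor, num_groups: int = 3):
    """Return groups G_n = {I_{t-n}, I_t, I_{t+n}} for n = 1..num_groups.

    frames: (B, 7, C, H, W); index 3 is the reference frame I_4^l.
    """
    t = frames.shape[1] // 2  # index of the reference frame (here 3)
    groups = []
    for n in range(1, num_groups + 1):
        group = torch.stack([frames[:, t - n],   # previous frame I_{t-n}^l
                             frames[:, t],       # reference frame I_t^l
                             frames[:, t + n]],  # next frame I_{t+n}^l
                            dim=1)               # (B, 3, C, H, W)
        groups.append(group)
    return groups  # [G_1, G_2, G_3], ordered from slow to fast frame rate

frames = torch.randn(1, 7, 3, 64, 64)
g1, g2, g3 = make_temporal_groups(frames)
print(g1.shape)  # torch.Size([1, 3, 3, 64, 64])
```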

Feature extraction and fusion module

In this module, features of the temporally grouped frames described above were first extracted and temporally aligned by 3D convolution. The extracted feature maps were then sent to the intra-group fusion module, where deformable 3D convolution was used for feature fusion while preserving the temporal dimension, before being passed to the subsequent deep fusion module.

Deformable 3D convolution

Deformable 3D convolution first appeared in TDAN15, and its implementation can be divided into two steps: sampling features with a three-dimensional convolution kernel, and weighted summation of the sampled values. The features obtained by a convolution kernel with a dilation factor of 1 can be expressed as

$$y(p_{0}) = \sum\limits_{n=1}^{N} {w(p_{n}) \cdot x(p_{0} + p_{n})}$$

(1)

Here \(p_0\) denotes a position in the output feature map, and \(p_n\) denotes the n-th position in the \(3 \times 3 \times 3\) convolution sampling grid. Deformable 3D convolution can be obtained by transforming the ordinary 3D convolution. Its spatial receptive field is expanded by learned offsets, and the size of the sampling grid is N = 27. Input features of size C × T × W × H are first passed through a 3D convolution to generate offset features of size 2N × T × W × H. Because the deformation is two-dimensional and purely spatial, the number of channels of these offset features is defined as 2N. The learned offsets are then used to guide the ordinary 3D sampling grid to perform the spatial deformation, generating the deformable 3D sampling grid. Finally, the deformable 3D sampling grid is used to generate the output features. The above process can be expressed by the following formula
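The offset-generation step can be illustrated with a short shape check. The sketch below assumes an input feature with 64 channels and a 3-frame temporal group; only the two spatial offset components are predicted per sampling point, so the offset field has 2N = 54 channels.

```python
import torch
import torch.nn as nn

N = 27                                    # 3*3*3 sampling positions
offset_conv = nn.Conv3d(in_channels=64,   # C of the input feature (assumed)
                        out_channels=2 * N,
                        kernel_size=3, padding=1)

x = torch.randn(1, 64, 3, 64, 64)         # (B, C, T, H, W)
offsets = offset_conv(x)                  # (B, 2N, T, H, W)
print(offsets.shape)                      # torch.Size([1, 54, 3, 64, 64])
```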

$$y(p_{0}) = \sum\limits_{n=1}^{N} {w(p_{n}) \cdot x(p_{0} + p_{n} + \Delta p_{n})}$$

(2)

Where \(\Delta p_n\) represents the offset corresponding to the n-th position of the 3 × 3 × 3 convolution sampling grid. The offsets are usually fractional, so the precise sampled values must be obtained by bilinear interpolation. Through the above process, deformable 3D convolution can integrate temporal and spatial information to extract more detailed information from the video sequence. The schematic diagram of deformable 3D convolution is shown in Fig. 2. Since the offset layer obtains the offset vectors from the current input feature map, deformable 3D convolution is well suited to video super-resolution tasks involving deformation or motion.
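The bilinear interpolation used for fractional offsets can be illustrated with a small helper that samples a feature map at a non-integer location. This is an illustrative example of the interpolation step only, not the authors' implementation of the full deformable sampling.

```python
import math
import torch

def bilinear_sample(feat: torch.Tensor, x: float, y: float) -> torch.Tensor:
    """Sample a (C, H, W) feature map at the fractional location (x, y)."""
    _, h, w = feat.shape
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # interpolate along x on the two enclosing rows, then along y
    top = (1 - wx) * feat[:, y0, x0] + wx * feat[:, y0, x1]
    bottom = (1 - wx) * feat[:, y1, x0] + wx * feat[:, y1, x1]
    return (1 - wy) * top + wy * bottom   # interpolated value, shape (C,)

feat = torch.arange(16.0).reshape(1, 4, 4)   # a single-channel 4x4 feature map
print(bilinear_sample(feat, 1.5, 2.25))      # tensor([10.5000])
```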

Fig. 2

Schematic diagram of deformable 3D convolution.

Intra-group fusion based on deformable 3D convolution

After the shallow 3D convolution feature extraction for each group, further deep feature extraction and fusion were required. An intra-group fusion module was applied to each group. The module consists of a spatial feature extractor and a deformable 3D convolution layer. The spatial feature extractor contains three units, each consisting of a 3 × 3 convolution layer and a batch normalization layer. Each convolution layer has an appropriate dilation rate to model the unique inter-frame motion of its group, and this dilation rate is determined by the frame rate of the group: frames with a large temporal difference exhibit large motion, which corresponds to a large dilation rate, while frames with a small temporal difference exhibit small motion, which corresponds to a small dilation rate. Then, deformable 3D convolution residual blocks with five 3 × 3 × 3 convolution kernels were used to fuse the spatio-temporal features. Finally, the fused features were fed into a two-dimensional dense block containing 18 two-dimensional units, which extracts the intra-group features and generates the feature map \(F_{n}^{g}\). In this way, the inter-frame information within each group is deeply fused and the spatio-temporal information can be used efficiently, as shown in Fig. 3.

Fig. 3

Intra-group fusion based on deformable 3D convolution.
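A minimal sketch of the per-group spatial feature extractor is given below: three convolution units whose dilation rate grows with the temporal distance of the group. The channel count, the use of batch normalization with ReLU, and the mapping of groups to dilation rates 1, 2, 3 are assumptions for illustration.

```python
import torch
import torch.nn as nn

def make_spatial_extractor(channels: int, dilation: int) -> nn.Sequential:
    """Three conv + norm units with a group-specific spatial dilation rate."""
    layers = []
    for _ in range(3):
        layers += [nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                             padding=(0, dilation, dilation),
                             dilation=(1, dilation, dilation)),
                   nn.BatchNorm3d(channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Group G_1 (small motion) -> dilation 1, G_2 -> 2, G_3 (large motion) -> 3.
extractors = nn.ModuleList([make_spatial_extractor(64, d) for d in (1, 2, 3)])
x = torch.randn(1, 64, 3, 64, 64)            # shallow features of one group
print(extractors[2](x).shape)                # shape preserved: (1, 64, 3, 64, 64)
```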

Inter-group fusion module based on the temporal attention mechanism

After intra-group fusion was carried out for each group, the features of the three groups had to be integrated, so a temporal attention mechanism between groups was introduced. Temporal attention has often been used in SR and image processing16,17,18,19. It is also beneficial for VSR, since it allows the model to weight frames at different time steps differently. In the preceding temporal grouping, the video frame sequence was divided into three groups according to different frame rates, and the groups contain a large amount of complementary information. In general, a group with a slow frame rate is more informative, since its neighboring frames are more similar to the reference frame. At the same time, groups with fast frame rates can capture details that are missing in nearby frames. Therefore, temporal attention can effectively integrate the feature information of groups with different temporal intervals.

For each group, a single-channel feature map \(F_{n}^{a}\) was computed by applying a 3 × 3 convolution layer to the corresponding group feature map. The resulting maps \(F_{1}^{a}\), \(F_{2}^{a}\) and \(F_{3}^{a}\) were concatenated, and the attention maps \(M(x,y)\) were then calculated at each spatial position by applying a softmax function along the temporal axis.

$$M_{n}(x,y)_{j} = \frac{e^{F_{n}^{a}(x,y)_{j}}}{\sum\nolimits_{i=1}^{N} e^{F_{i}^{a}(x,y)_{j}}}$$

(3)

The attention-weighted feature \(\tilde{F}_{n}^{g}\) for each group was calculated as

$$\tilde{F}_{n}^{g} = M_{n}(x,y)_{j} \odot F_{n}^{g},\quad n \in [1:N]$$

(4)

Where \(M_{n}(x,y)_{j}\) represents the weight of the temporal attention mask at position \((x,y)_{j}\), \(F_{n}^{g}\) represents the feature generated by the intra-group fusion module for group n, and \(\odot\) denotes element-wise multiplication.
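A compact sketch of Eqs. (3) and (4) is given below. The channel count, the use of one 3 × 3 convolution per group (rather than a shared one), and the module name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Softmax attention along the temporal (group) axis, per spatial position."""
    def __init__(self, channels: int = 64, num_groups: int = 3):
        super().__init__()
        # one 3x3 conv per group, each producing a single-channel map F_n^a
        self.attn_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=3, padding=1)
             for _ in range(num_groups)])

    def forward(self, group_feats):
        # group_feats: list of N tensors F_n^g, each of shape (B, C, H, W)
        maps = [conv(f) for conv, f in zip(self.attn_convs, group_feats)]
        attn = torch.softmax(torch.cat(maps, dim=1), dim=1)   # (B, N, H, W), Eq. (3)
        # Eq. (4): element-wise weighting of each group feature by its mask
        return [attn[:, n:n + 1] * f for n, f in enumerate(group_feats)]

ta = TemporalAttention()
feats = [torch.randn(1, 64, 64, 64) for _ in range(3)]
weighted = ta(feats)
print(weighted[0].shape)   # torch.Size([1, 64, 64, 64])
```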

The aim of the inter-group fusion module is to aggregate the information from the different temporal groups and generate an HR residual map. In order to make full use of the attention-weighted group features, we concatenated these feature maps along the temporal axis and fed them into a three-dimensional dense block. A convolution layer with a 1 × 3 × 3 kernel was used at the end of the three-dimensional dense block to reduce the number of channels, and a two-dimensional dense block was placed afterwards for further fusion, as shown in Fig. 4.

Fig. 4
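The sketch below illustrates this inter-group fusion step. The plain 3D convolutions stand in for the 3D dense block, the channel count is assumed, and the final averaging over the temporal axis is a simplification used here only to hand a 2D feature map to the subsequent 2D dense block.

```python
import torch
import torch.nn as nn

class InterGroupFusion(nn.Module):
    """Stack attention-weighted group features on the temporal axis and fuse them."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(                       # stand-in for the 3D dense block
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        # 1x3x3 kernel at the end, as described in the text
        self.reduce = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                padding=(0, 1, 1))

    def forward(self, weighted_feats):
        # weighted_feats: list of N tensors (B, C, H, W) -> stack to (B, C, N, H, W)
        x = torch.stack(weighted_feats, dim=2)
        x = self.reduce(self.body(x))                    # (B, C, N, H, W)
        return x.mean(dim=2)                             # collapse temporal axis (simplified)

fusion = InterGroupFusion()
out = fusion([torch.randn(1, 64, 64, 64) for _ in range(3)])
print(out.shape)   # torch.Size([1, 64, 64, 64])
```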

Finally, a reconstruction module similar to that of single-image super-resolution upsamples the fully fused features through a depth-to-space operation. The fused features were fed into six cascaded residual blocks and a sub-pixel convolution layer for reconstruction, producing the corresponding residual map. The final HR video frame was then generated by adding this residual map to the reference frame upsampled by bicubic interpolation.
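A minimal sketch of this reconstruction step is shown below: six cascaded residual blocks, a sub-pixel (PixelShuffle) layer, and the bicubic skip connection. The channel counts, the residual block design, and the 4x scale factor are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))     # residual connection

class Reconstruction(nn.Module):
    def __init__(self, channels: int = 64, scale: int = 4, out_channels: int = 3):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(6)])
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, out_channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))                      # sub-pixel (depth-to-space) layer
        self.scale = scale

    def forward(self, fused, lr_reference):
        residual = self.upsample(self.blocks(fused))     # HR residual map
        base = F.interpolate(lr_reference, scale_factor=self.scale,
                             mode='bicubic', align_corners=False)
        return base + residual                           # final HR frame

recon = Reconstruction()
hr = recon(torch.randn(1, 64, 64, 64), torch.randn(1, 3, 64, 64))
print(hr.shape)   # torch.Size([1, 3, 256, 256])
```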