
The intelligent fault identification method based on multi-source information fusion and deep learning

Research area and data sources

The study area is located in the southern part of Jinzhai County, Lu’an City, Anhui Province, in the Dabie Mountains. The geographical coordinates of the study area are 31°06′N-31°26′N and 115°29′E-116°1′E, with an elevation range of 163 m to 1714 m.

The region has significant topographic relief, diverse geomorphological forms, and complex structural conditions. Topographic features include large elevation variations, substantial relief, deep surface incisions, and diverse slopes and aspects. Geomorphic features include narrow, deeply incised valleys and high-altitude peaks. Structural features reflect complex tectonics, with diverse strata composed mainly of acidic, basic, and intermediate rocks. Soil characteristics in the study area also play a role in understanding fault formation and identification. The soils are primarily clay, loamy, and sandy soils, which exhibit varying degrees of permeability and moisture retention. These soil properties influence surface erosion patterns, water runoff, and vegetation growth, which in turn can affect the interpretation of geomorphic features in fault mapping. Furthermore, soil moisture content and texture may affect the reflectance values observed in remote sensing imagery, thus influencing the accuracy of fault detection. The interaction between soil, geology, and topography provides a more comprehensive understanding of fault processes, because soil characteristics can either amplify or obscure fault-related surface features, depending on local environmental conditions.

This region is located within the Dabie Mountains tectonic belt and is influenced by the Tancheng–Lujiang fault zone. As a major tectonic feature in Eastern China, the Tancheng–Lujiang fault zone has had a profound impact on the geological evolution of the area. Prolonged tectonic activity and crustal stress have produced a series of well-developed fault structures with distinct fault characteristics. An overview of the study area is shown in Fig. 1. The region is a typical representative of the tectonic features of the Dabie Mountains tectonic belt, and it also reflects the topographic, geomorphic, and structural characteristics commonly observed in other orogenic belts. It is therefore well suited to training and verifying the proposed fault identification method under complex topographic and geomorphic conditions.

The primary research data include ASTER GDEM (30 m resolution), Landsat 8 OLI_TIRS remote sensing imagery (RSI), and geological map data of the study area. RSI accurately reflects surface information, providing data for extracting and analyzing fault spectral features. After preprocessing, a band ratio algorithm is applied to the Landsat 8 OLI_TIRS imagery to extract fault spectral features [72]. ASTER GDEM is used in digital terrain analysis and geological structure identification [73,74]. After preprocessing, topographic and geomorphic factors are calculated from ASTER GDEM using digital terrain analysis methods; these factors reflect the topographic and geomorphic features of faults. Geological map data provide structural feature information, such as lithology distribution. The data sources are listed in Table 1.

Fig. 1

The overview of the study area. Maps were created using ArcGIS 10.5 (Environmental Systems Research Institute, USA. https://www.esri.com/).

Table 1 The data sources for fault identification.

Data analysis and processing

To describe the morphological features of faults more accurately and identify them quickly, we extract and analyze feature information from several data sources to support subsequent intelligent identification. This paper adopts a multi-source information fusion approach: remote sensing imagery, DEM, and geological map data are analyzed and processed to extract the spectral, topographic, geomorphic, and structural features of faults. Using sample training and fusion algorithms, we integrate these features to enhance the morphological information of faults. Finally, intelligent fault identification is achieved using deep learning image recognition techniques. The overall process is illustrated in Fig. 2.

Fig. 2

Workflow diagram of fault identification. The diagram was created using Microsoft Visio 2021 (Microsoft Corporation, USA. https://www.microsoft.com/visio).

First, from the perspectives of the spectral, topographic, geomorphic, and structural features of faults, we select 16 factors related to fault identification (RSI, EL, SCD, RA, TR, DOS, AS, SOS, SOA, COEV, CU, ESD, SL, LI, VL, and TPI). Second, we select points on and off the faults to construct a training sample set and use four machine learning models (SVM, CART, ANN, and BN) to predict the importance of the influencing factors. Based on their importance, we integrate the spectral, topographic, geomorphic, and structural features into multi-source information, resulting in four feature fusion outputs, represented as four regional fault identification maps. Third, we use a deep learning model based on CNN to identify faults from the fault identification maps, followed by noise reduction, line refinement, and smoothing. Fourth, we use Accuracy, True Positive Rate (TPR), F1-score, Area Under Curve (AUC), and Gini Index (Gini) to evaluate the performance of the machine learning models, and Val_Accuracy, Validation Precision (Val_Precision), F1-score, and Val_Loss to assess the performance of the deep learning model. Finally, intelligent identification of regional fault structures is achieved.
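As a rough illustration of the evaluation metrics named above, the following sketch computes Accuracy, TPR, precision, and F1-score from confusion-matrix counts. The function names are hypothetical, and the relation Gini = 2·AUC − 1 is an assumption (a common convention); the text does not state how the Gini index is derived.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, TPR (recall), precision and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tpr
    f1 = 2 * precision * tpr / denom if denom else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "precision": precision, "f1": f1}


def gini_from_auc(auc):
    """Assumed convention: Gini = 2 * AUC - 1 (not confirmed by the source)."""
    return 2.0 * auc - 1.0


# Example: 40 fault points found, 5 missed, 10 false alarms, 45 correct rejections.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Here "positive" means a sample labeled as lying on a fault, so TPR is the share of fault samples that the model recovers.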

In this study, the fault identification data include RSI, DEM, and geological map data. We focus on the spectral, topographic, geomorphic, and structural features of faults, selecting 16 influencing factors related to fault identification for multi-source information fusion to preserve and enhance the morphological features of faults. These 16 influencing factors are described in Table 2.

Table 2 Introduction of influencing factors.

The 16 influencing factors can be calculated using RSI-based digital processing and digital terrain analysis. Because the range and distribution of pixel values differ greatly between raster layers, the influencing factor layers must be reassigned to maintain data consistency in subsequent raster calculations and analyses. The results after reassigning the 16 influencing factors are shown in Fig. 3. The dataset used in this study is built from three data sources (remote sensing imagery, digital elevation model (DEM), and geological map) through feature extraction and fusion, with the aim of fault identification. It includes spectral features extracted from remote sensing imagery, topographic and geomorphic features extracted from the DEM, and structural features obtained from the geological map. These features comprise 16 influencing factors: Remote sensing imagery, Elevation, Surface cutting depth, Relief amplitude, Terrain roughness, Degree of slope, Aspect, Slope of slope, Slope of aspect, Elevation standard deviation, Coefficient of elevation variation, Curvature, Slope length, Lithology, Valley line, and Topographic position index. Fusing these multi-source features represents the morphological characteristics of faults more comprehensively, thereby improving identification accuracy.

Fig. 3

The 16 influencing factors for fault identification: (a) Remote sensing imagery. (b) Elevation. (c) Surface cutting depth. (d) Relief amplitude. (e) Terrain roughness. (f) Degree of slope. (g) Aspect. (h) Slope of slope. (i) Slope of aspect. (j) Elevation standard deviation. (k) Coefficient of elevation variation. (l) Curvature. (m) Slope length. (n) Lithology. (o) Valley line. (p) Topographic position index. Maps were created using ArcGIS 10.5 (Environmental Systems Research Institute, USA. https://www.esri.com/).

The interpretation method based on RSI can reflect the true terrain and landforms on a macro scale. In RSI processing, multispectral image fusion is an important technique that generates new images to enhance specific features. A fault is a linear structure formed by rock fracturing, so enhancing linear features is crucial in RSI processing. Experimental comparative analysis showed that Eq. (1) effectively highlights linear elements in RSI:

$$\text{RSI}=\frac{\text{NIR}-\text{Red}-\text{Green}}{\text{NIR}+\text{Red}}$$

(1)

where Green is the green band, Red is the red band, and NIR is the near-infrared band. The green band effectively identifies vegetation; the red band is used for vegetation monitoring, soil feature analysis, and identifying lithology, strata, structures, and landforms; the near-infrared band is sensitive to exposed rock types owing to its good band independence.
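A minimal sketch of the band ratio in Eq. (1), applied pixel by pixel to three co-registered band grids. Plain nested lists are used to keep the example dependency-free; the function name and the handling of a zero denominator (returning 0.0) are assumptions for illustration, not part of the described method.

```python
def rsi_index(nir, red, green, eps=1e-9):
    """Eq. (1) per pixel: (NIR - Red - Green) / (NIR + Red)."""
    result = []
    for nir_row, red_row, green_row in zip(nir, red, green):
        row = []
        for n, r, g in zip(nir_row, red_row, green_row):
            denom = n + r
            # Assumed fallback: pixels with NIR + Red == 0 are set to 0.0.
            row.append((n - r - g) / denom if abs(denom) > eps else 0.0)
        result.append(row)
    return result


# Example with a 1 x 2 reflectance grid.
ratio = rsi_index(nir=[[0.5, 0.0]], red=[[0.1, 0.0]], green=[[0.1, 0.2]])
```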

Elevation provides vertical information of the terrain. Faults significantly alter the surface elevation, creating features such as escarpments, cliffs, and rift valleys.

SCD is the difference between the average elevation and the minimum elevation within a given area, reflecting the degree of cutting at a point. Fault structures sever the surface, causing subsidence or uplift on either side. SCD is therefore closely related to faults and serves as an important indicator. It is calculated by Eq. (2):

$$\text{SCD}=H_{\text{mean}}-H_{\text{min}}$$

(2)

RA is the difference between the maximum and minimum elevations within a specific area, providing a realistic reflection of terrain and landforms. It is calculated by Eq. (3):

$$\text{RA}=H_{\text{max}}-H_{\text{min}}$$

(3)
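Eqs. (2) and (3) are both statistics of the elevations inside an analysis area. The sketch below evaluates them in a square moving window around one DEM cell; the window shape, its radius, and the function names are assumptions, since the text does not specify the analysis area.

```python
def window_values(dem, row, col, radius=1):
    """Elevations inside a square window, clipped at the grid border."""
    rows, cols = len(dem), len(dem[0])
    return [dem[r][c]
            for r in range(max(0, row - radius), min(rows, row + radius + 1))
            for c in range(max(0, col - radius), min(cols, col + radius + 1))]


def scd_and_ra(dem, row, col, radius=1):
    """Return (SCD, RA): H_mean - H_min and H_max - H_min in the window."""
    values = window_values(dem, row, col, radius)
    h_mean = sum(values) / len(values)
    h_min, h_max = min(values), max(values)
    return h_mean - h_min, h_max - h_min


dem = [[100, 110, 120],
       [105, 115, 125],
       [110, 120, 130]]
scd, ra = scd_and_ra(dem, 1, 1)
```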

TR describes the surface roughness and is used to indicate erosion and relief. Fault zones with irregular fractures and debris increase surface roughness. It is calculated by Eq. (4):

$$\text{TR}=\frac{1}{\cos\left(S\times 3.14159/180\right)}$$

(4)

where S is the slope factor (degrees).
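As a small worked example of Eq. (4), the sketch below converts a slope in degrees to radians and takes the reciprocal of its cosine. The function name is hypothetical, and `math.radians` is used in place of the literal factor 3.14159/180.

```python
import math


def terrain_roughness(slope_deg):
    """Eq. (4): TR = 1 / cos(S), with the slope S given in degrees."""
    return 1.0 / math.cos(math.radians(slope_deg))


flat = terrain_roughness(0.0)    # flat ground gives the minimum value
steep = terrain_roughness(60.0)  # roughness grows as the slope steepens
```

TR equals 1 on flat ground and increases without bound as the slope approaches 90°, which is why steep fault scarps stand out in this layer.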

DOS indicates surface steepness. Fault zones generally exhibit significant slope changes, with steep transitions between different slope degrees or inclined landforms. It is calculated by Eq. (5):

$$\text{DOS}=\tan^{-1}\sqrt{f_x^2+f_y^2}$$

(5)

where \(f_x\) is the rate of elevation change in the X direction and \(f_y\) is the rate of elevation change in the Y direction.

AS reflects geological structural features such as strata dip and fault line direction. Fault-related valleys often align with structural lines, and AS information helps identify these features. It is calculated by Eq. (6):

$$\text{AS}=\tan^{-1}\left(\frac{f_y}{f_x}\right)$$

(6)

where \(f_x\) is the rate of elevation change in the X direction and \(f_y\) is the rate of elevation change in the Y direction.
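The slope and aspect formulas of Eqs. (5) and (6) can be sketched for a single cell as follows. Note one deliberate substitution: `math.atan2(fy, fx)` is used instead of a plain arctangent of \(f_y/f_x\), so that \(f_x = 0\) does not cause a division by zero and the quadrant is resolved. The resulting angle is measured counterclockwise from the +X axis, which is an assumption and not necessarily the compass convention of a given GIS package.

```python
import math


def slope_and_aspect(fx, fy):
    """Slope (Eq. 5) and aspect (Eq. 6) in degrees from the two gradients."""
    slope_deg = math.degrees(math.atan(math.hypot(fx, fy)))
    # atan2 replaces atan(fy / fx); result mapped to [0, 360).
    aspect_deg = math.degrees(math.atan2(fy, fx)) % 360.0
    return slope_deg, aspect_deg


slope, aspect = slope_and_aspect(1.0, 0.0)
```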

SOS measures the rate of slope change over distance or position. Faults create discontinuities in terrain, leading to drastic slope changes. It is calculated by Eq. (7):

$$\text{SOS}=\text{slope}\left[\text{slope}\left(\text{DEM}\right)\right]$$

(7)
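Eq. (7) is a composition: the slope operator is applied once to the DEM and once more to the resulting slope grid. The sketch below illustrates this with central differences on interior cells; the finite-difference scheme, the cell size, and the zero-filled border are simplifying assumptions, so values in the outer two rings of cells are not meaningful.

```python
import math


def slope_grid(grid, cell_size=1.0):
    """Slope in degrees by central differences; border cells are left at 0.0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            fx = (grid[r][c + 1] - grid[r][c - 1]) / (2 * cell_size)
            fy = (grid[r + 1][c] - grid[r - 1][c]) / (2 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(fx, fy)))
    return out


def slope_of_slope(dem, cell_size=1.0):
    """Eq. (7): SOS = slope[slope(DEM)]."""
    return slope_grid(slope_grid(dem, cell_size), cell_size)


plane = [[float(c) for c in range(5)] for _ in range(5)]  # uniform incline
scarp = [[0.0, 0.0, 0.0, 10.0, 10.0] for _ in range(5)]   # abrupt step
sos_plane = slope_of_slope(plane)
sos_scarp = slope_of_slope(scarp)
```

On the uniform incline the slope is constant, so SOS at the center cell is zero; on the step-like surface the slope changes abruptly and SOS at the center is large.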

SOA measures the rate of change in aspect direction. Faults can cause abrupt changes or significant shifts in surface slope. Near fault zones, layer displacement and sliding result in prominent features such as fault scarps and steep slopes, increasing SOA. It is calculated by Eq. (8):

$$\text{SOA}=\left(\text{SOA1}+\text{SOA2}\right)+\left|\text{SOA1}-\text{SOA2}\right|/2$$

(8)

where SOA1 is the positive aspect change rate and SOA2 is the negative aspect change rate.

Fault zones often exhibit significant terrain relief, and statistical analysis of ESD can identify variations in the landscape, thereby highlighting potential fault locations.

Faults, as linear surface features, influence surrounding terrain relief. COEV serves as an indirect indicator of fault impact. It is calculated by Eq. (9):

$$\text{COEV}=\text{ESD}/H_{\text{mean}}$$

(9)
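A minimal sketch of Eq. (9) for the elevations of one analysis window. The use of the population standard deviation for ESD is an assumption (the text does not say whether the population or sample form is used), and the function name is hypothetical.

```python
import statistics


def coev(window_elevations):
    """Eq. (9): COEV = ESD / H_mean for the elevations of one window."""
    h_mean = statistics.mean(window_elevations)
    esd = statistics.pstdev(window_elevations)  # assumed: population std. dev.
    return esd / h_mean


flat_area = coev([100, 100, 100])  # no relief, so no variation
rough_area = coev([90, 110])       # ESD = 10, H_mean = 100
```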

CU expresses local surface geometry, describing slope change degree and direction. Fault zones manifest as convex or concave shapes. CU helps determine if a point is a peak or a depression, aiding in terrain feature identification.

Faults often cause changes in surface elevation. SL can identify faults, as SL typically increases significantly due to sudden surface height changes caused by faults. It is calculated by Eq. (10):

$$\text{SL}=\text{DEM}/\sin\left(S\times 3.14159/180\right)$$

(10)

Different lithologies (LI) exhibit distinct stress and deformation features under varying geological conditions, influencing fault formation and evolution. Some lithologies fracture easily, while others with high elasticity resist fracturing.

VL marks terrain relief boundaries. Fault zones create step-like terrain; VL describes these features accurately and helps determine fault position and direction.

In fault identification, TPI extracts changes between elevation areas, reflecting terrain position changes and indicating possible fault locations. Combined with other terrain parameters such as DOS, CU, and COEV, TPI enhances fault identification accuracy and reliability. It is calculated by Eq. (11):

$$\text{TPI}=Z-Z_{\text{mean}}$$

(11)

where \(Z\) is the elevation value of the cell, and \(Z_{\text{mean}}\) is the average elevation of neighboring cells.
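Eq. (11) can be sketched for one DEM cell as follows. The square neighborhood, its radius, and the exclusion of the center cell from the mean are assumptions for illustration; the grid is assumed to be larger than a single cell.

```python
def tpi(dem, row, col, radius=1):
    """Eq. (11): cell elevation minus the mean elevation of its neighbours."""
    rows, cols = len(dem), len(dem[0])
    neighbours = [dem[r][c]
                  for r in range(max(0, row - radius), min(rows, row + radius + 1))
                  for c in range(max(0, col - radius), min(cols, col + radius + 1))
                  if (r, c) != (row, col)]
    return dem[row][col] - sum(neighbours) / len(neighbours)


dem = [[10, 10, 10],
       [10, 18, 10],
       [10, 10, 10]]
peak = tpi(dem, 1, 1)    # positive: the cell sits above its surroundings
corner = tpi(dem, 0, 0)  # negative: the cell sits below its surroundings
```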

Method for multi-source information fusion

Construction of training sample set for fault identification

The machine learning-based method involves constructing a training sample set and calculating the importance of influencing factors. First, a training sample set is constructed, consisting of positive samples and negative samples. Positive samples are pixel points on the fault line, representing the target category to be learned by the machine learning model. Negative samples are pixel points not on the fault line, representing samples that do not belong to the target category. By training with a large number of positive and negative samples, the model can learn to distinguish features and patterns between the target and non-target categories, enabling accurate prediction of the importance of the 16 influencing factors. A fault is a linear structure, and the fault line is the intersection of the fault plane with the ground surface, representing the surface trace of the fault. The extension direction of the fault line indicates the strike of the fault. To select positive samples, we define the pixel area within a certain buffer zone around the fault line as the selection area for positive samples, as shown in Fig. 4a. Point data is selected from the surrounding area of the fault line to generate positive samples for the machine learning training sample set. The ratio of selected negative samples to positive samples is 1:1.2, increasing the number of negative samples to improve the accuracy of the machine learning model in predicting the importance of influencing factors. The negative samples are shown in Fig. 4b. During the selection of negative samples, the following exclusions were made: (a) points within the fault line buffer zone; (b) areas identified from remote sensing image as having fault features and potentially being latent faults. These cases are illustrated in Fig. 4c and d, respectively. This approach ensures a more precise selection of negative samples, thus improving the model’s prediction accuracy.

Fig. 4

Selecting positive and negative samples: (a) Positive samples selection. (b) Negative samples selection. (c) Exclude negative samples within the buffer zone. (d) Exclude negative samples with fault features.
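The buffer-based separation of sample candidates described above can be sketched as follows. Points within a buffer distance of the fault polyline become positive candidates and the rest become negative candidates. The function names, the planar coordinates, and the buffer width are hypothetical; the second exclusion rule (areas interpreted as latent faults on imagery) relies on manual interpretation and is not modeled here.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_fault(p, fault_line):
    """Distance from p to a fault trace given as a list of vertices."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(fault_line, fault_line[1:]))


def split_candidates(points, fault_line, buffer_dist):
    """Inside the buffer: positive candidates; outside: negative candidates."""
    positives, negatives = [], []
    for p in points:
        inside = distance_to_fault(p, fault_line) <= buffer_dist
        (positives if inside else negatives).append(p)
    return positives, negatives


fault = [(0.0, 0.0), (10.0, 0.0)]
pos, neg = split_candidates([(5.0, 1.0), (5.0, 5.0), (12.0, 0.0)], fault, 2.0)
```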

Prediction of importance of influencing factors in multi-source feature information based on machine learning

This paper uses four machine learning methods (SVM, CART, ANN, and BN) to predict the importance of the 16 influencing factors. The input of each model consists of the 16 influencing factors, and the output is the degree of importance of each factor.

SVM is a supervised learning method founded on statistical learning theory, which achieves data classification by constructing an optimal separating hyperplane or optimal nonlinear decision boundary. In this paper, faults are regarded as one category and non-faults as another. SVM classifies fault and non-fault data by finding the optimal decision boundary. The application of the SVM model in fault identification is described by Eqs. (12)–(16):

Let D be a set of sample data consisting of points on fault lines and points not on fault lines:

$$D=\left\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\right\},\quad y_i\in\{-1,+1\}$$

(12)

The hyperplane can be represented as:

$$w^{\text{T}}x+b=0$$

(13)

To ensure correct classification of all samples and a margin between classes, the following constraint is required:

$$y_i\left(w^{\text{T}}x_i+b\right)\geq 1,\quad i=1,2,\dots,n$$

(14)

The margin between the two classes, that is, the distance between the two hyperplanes that pass through the support vectors on either side of the separating hyperplane, is \(\frac{2}{\|w\|}\). Maximizing this margin is equivalent to minimizing \(\|w\|^2\), so the problem of constructing the optimal hyperplane is transformed into the following optimization problem subject to the constraint in Eq. (14):

$$\min\ \frac{1}{2}\|w\|^{2}$$

(15)

This constrained optimization problem is solved with the Lagrange multiplier method by introducing the Lagrange function:

$$L=\frac{1}{2}\|w\|^{2}-\sum_{i=1}^{n}\alpha_i\left(y_i\left(w^{\text{T}}x_i+b\right)-1\right)$$

(16)

where \(\alpha_i \geq 0\) are the Lagrange multipliers, which can be obtained using Lagrange duality.
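To make the constraint of Eq. (14) and the margin \(2/\|w\|\) concrete, the sketch below checks whether a candidate hyperplane \((w, b)\) satisfies the constraint for every sample and computes the corresponding margin. It is only a feasibility check under hypothetical names and toy data; it does not solve the optimization problem of Eq. (15).

```python
import math


def functional_margins(w, b, samples):
    """y_i * (w^T x_i + b) for every (x_i, y_i) pair, as in Eq. (14)."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in samples]


def is_feasible(w, b, samples):
    """True if every sample satisfies y_i * (w^T x_i + b) >= 1."""
    return all(m >= 1.0 for m in functional_margins(w, b, samples))


def margin_width(w):
    """Distance between the two supporting hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Toy data: +1 for a fault sample, -1 for a non-fault sample.
samples = [((2.0, 0.0), +1), ((-2.0, 0.0), -1)]
ok = is_feasible((0.5, 0.0), 0.0, samples)
width = margin_width((0.5, 0.0))
```

Shrinking \(w\) widens the margin but eventually violates Eq. (14), which is the trade-off the optimization resolves.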

CART is a decision tree algorithm applicable to both classification and regression tasks. In the CART algorithm, each node contains a feature and a threshold, and the dataset is divided into two subsets based on that feature and threshold. By recursively dividing the dataset into smaller subsets, a decision tree is ultimately constructed. When calculating the importance of each influencing factor, the algorithm searches for the split that maximally reduces impurity. The application of the CART model in fault identification is described by Eqs. (17)–(18):

Let D be a set of sample data consisting of points on fault lines and points not on fault lines. The splitting point is determined using the Gini index, which is calculated as follows:

$$\text{Gini}\left(D\right)=1-\sum_{i}p_i^{2}$$

(17)

where \(p_i\) denotes the probability that a sample point belongs to class i.

Based on feature a, the dataset is divided into v subsets for each influencing factor, and the feature with the lowest Gini index is chosen as the node. Finally, the importance levels of fault influencing factors are calculated.

$$\text{Gini}\left(D,a\right)=\sum_{v}\frac{\left|D_v\right|}{\left|D\right|}\text{Gini}\left(D_v\right)$$

(18)
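The two Gini formulas, Eqs. (17) and (18), can be illustrated with a short sketch for a single binary split on one influencing factor. The function names and the toy values are hypothetical; a full CART implementation would repeat this search over all factors and thresholds.

```python
from collections import Counter


def gini(labels):
    """Eq. (17): 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Eq. (18) for a binary split: size-weighted Gini of the two subsets."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# One factor (e.g. a reassigned terrain value) and fault (1) / non-fault (0) labels.
factor = [1, 2, 8, 9]
labels = [0, 0, 1, 1]
good_split = gini_split(factor, labels, threshold=5)  # separates the classes
poor_split = gini_split(factor, labels, threshold=1)  # leaves a mixed subset
```

The threshold with the lower weighted Gini index is preferred, which matches the rule of choosing the feature with the lowest Gini index as the node.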

ANN is a computational model that simulates the structure and functionality of the human brain’s neural network. It consists of a hierarchical structure of multiple neurons, including at least three layers: input layer, hidden layer(s), and output layer. Each neuron receives input signals from other neurons, processes them through weighted sums and transformations, and generates output signals. ANN can perform various types of learning, encompassing supervised, unsupervised, and reinforcement learning, making it suitable for various tasks such as data processing, classification, recognition, and prediction. In fault identification, the 16 influencing factors are taken as input variables for the ANN, while the target variable indicates whether it is a fault or not. Finally, the importance levels of the 16 influencing factors are obtained.

BN is a machine learning method that uses probability and graph theory to describe the dependencies between variables and to perform probabilistic inference and prediction. BN excels in handling uncertainty and missing data and is widely applied across various fields. Naive Bayes assumes that the input variables are mutually independent given the class, an assumption that does not hold when determining the importance of influencing factors. This paper therefore relies primarily on the tree-augmented naive Bayes (TAN) network for model implementation. The joint distribution function of a BN is given by Eq. (19):

$$P\left(X,Y\right)=P\left(Y\right)\prod_{i=1}^{n}P\left(X_i\mid Y\right)$$

(19)

where the values of \(Y\) represent faults or non-faults, and \(X_i\) represent the various influencing factors in fault identification.

Enhancement of fault feature information through multi-source information fusion

The machine learning-based method will determine the importance of 16 influencing factors, including spectral, topographic, geomorphic, and structural factors related to fault identification. These factors reflect the spectral, topographic, geomorphic, and structural features of fault.

The multi-source information fusion method in this paper is based on the weight of each influencing factor. First, four machine learning methods are used to quantify the importance of the influencing factors in the fault identification process. Second, the fault information reflected by each influencing factor is weighted according to that factor's weight and superimposed, so as to retain and enhance the fault feature information. This yields four feature fusion results, expressed as four regional fault identification maps. The fault identification maps reflect the morphological characteristics of faults more comprehensively and improve the accuracy and reliability of fault identification. The calculation method for the fault identification maps generated through multi-source information fusion is shown in Fig. 5.

Fig. 5

Enhanced characterization information for fault.
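The weighting and superposition step can be sketched as a normalized weighted sum of the reassigned factor layers. This is only one plausible reading: the authors' exact procedure is the one shown in Fig. 5 and may differ. The function name, the layer names, and the weights below are hypothetical.

```python
def fuse_layers(layers, weights):
    """Weighted overlay: sum of layers scaled by their normalized importance."""
    names = list(layers)
    total = sum(weights[name] for name in names)
    rows, cols = len(layers[names[0]]), len(layers[names[0]][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for name in names:
        w = weights[name] / total  # assumed: weights are normalized to sum to 1
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += w * layers[name][r][c]
    return fused


# Two reassigned factor layers (1 x 2 grids) and illustrative importances.
layers = {"RA": [[1, 3]], "DOS": [[5, 7]]}
weights = {"RA": 1.0, "DOS": 3.0}
fault_map = fuse_layers(layers, weights)
```

Running the same function with the importance values from each of the four machine learning models would give four fused maps, one per model.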

Fault identification method based on the convolutional neural network model

Deep learning methods have demonstrated significant potential in fault identification. This study uses convolutional neural networks to intelligently identify faults from fault identification maps generated by integrating multi-source information. We employed a convolutional neural network (CNN) model, a deep learning image recognition model primarily used for pixel segmentation and object detection. U-Net, a typical convolutional neural network [75], is defined by its encoder-decoder architecture. The encoder-decoder structure is a deep learning network architecture commonly used for image processing tasks, consisting of an encoder and a decoder. The encoder gradually transforms the input image into higher-level feature representations, capturing both local and global features through convolution and pooling operations. The decoder progressively converts the features extracted by the encoder back into feature maps of the same size as the input image, using upsampling and deconvolution operations to remap the feature representations and generate the final output image.

The model recognizes faults in images by training on labeled grid features and applying the learned features. The fault identification process using this model includes the following steps: (1) collecting training samples from images using the Region of Interest (ROI) tool; (2) constructing target recognition label grids using ROI; (3) training the fault identification model with the label grids; (4) identifying faults using the trained model; (5) post-processing the classification results to achieve more accurate and reliable fault identification outcomes. The CNN model improves recognition performance by incorporating additional skip connections, residual connections, and attention mechanisms, enabling multi-scale feature recognition in images [68]. This allows each pixel in the fault identification map to be classified, facilitating the rapid and accurate identification of faults from images. The loss function of the CNN model is based on pixel-wise cross-entropy. The loss function is used to adjust model parameters or weights, optimizing recognition performance through continuous training [76]. The cross-entropy loss function is formulated as follows:

$$E=\sum_{o}m\left(x\right)\log\left(p_{n\left(x\right)}\left(x\right)\right)$$

(20)

In Eq. (20), \(p_{n(x)}\) represents the softmax function, o denotes the sample space composed of pixels, x denotes a sample pixel, and \(m(x)\) is the weight of that pixel. Figure 6 illustrates the CNN model architecture.

Fig. 6

The architecture of Convolutional Neural Network model: The input image is the Fault Identification Map (in this example, the slice size is 572 × 572 × 3), and the output image is the Fault identification Result.
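The weighted pixel-wise cross-entropy of Eq. (20) can be sketched in a few lines. Two points are assumptions rather than statements from the text: \(n(x)\) is read as the true class label of pixel x (the usual U-Net convention), and the sum is negated so that the returned value is a loss to be minimized. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class scores of one pixel."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(pixel_logits, true_classes, pixel_weights):
    """Negated Eq. (20): -sum over pixels of m(x) * log(p_{n(x)}(x))."""
    loss = 0.0
    for logits, cls, weight in zip(pixel_logits, true_classes, pixel_weights):
        # Assumed: cls is the true label n(x) of the pixel (0 = background, 1 = fault).
        loss -= weight * math.log(softmax(logits)[cls])
    return loss


# Two pixels, two classes; the fault pixel carries a larger weight.
loss = weighted_cross_entropy(
    pixel_logits=[[2.0, 0.0], [0.0, 3.0]],
    true_classes=[0, 1],
    pixel_weights=[1.0, 2.0],
)
```

Giving fault pixels a larger weight \(m(x)\) is one way to counter the imbalance between thin fault traces and the much larger background.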

In this paper, a convolutional neural network is used to identify faults from fault identification maps that integrate multi-source information. The dataset for the deep learning experiment consists of the four fault identification maps that fuse multi-source information. The pre-training parameters are set as follows: Augment Rotation is 'no', Number of Epochs is 25, Blur Distance ranges from 0 (min) to 2 (max), Class Weight ranges from 1 (min) to 3 (max), and Loss Weight is 0.5. All other parameters are set to their defaults. This architecture is specifically designed for a single-class training workflow and includes 5 “levels,” with each level consisting of 27 convolutional layers. Each level corresponds to a unique pixel resolution within the model.

The general steps for fault identification using the CNN model are as follows: (1) To enable the model to extract specific targets, feature samples need to be labeled. In this study, fault line ROIs were used to create label grids for fault labeling. These label grids, referred to as training grids, are used to train the deep learning model. We select two typical fault lines in the region as training data, based on actual faults already labeled in the geological map data: the first fault is 25 km long, strikes NE, and has obvious fault features; the second fault is 6 km long, strikes SE, and also has obvious fault features. (2) An initial TensorFlow model is set up by defining the model name, slice size, number of bands used for training, and other parameters. In this study, the initial model is defined with a slice size of 464 × 464 pixels and 1 band. (3) After the model is initialized, training parameters are set in the Train TensorFlow Pixel Model module to train the model. (4) Image recognition: the trained model is applied to identify faults throughout the study area, and the accuracy of fault identification is verified against the actual fault lines in the geological map and the field survey.