Predictive lithological mapping through machine learning methods : a case study in the Cinzento Lineament , Carajás Province , Brazil

*Corresponding author Iago Sousa Lima Costa E-mail address: iago.costa@cprm.gov.br


Introduction
The geological mapping in tropical areas with dense rain forest vegetation and thick supergene covers is challenging as rock exposure is very poor, and access to outcrops can be difficult.The use of remote sensing and airborne geophysics in these areas is essential to recognize and delimitate lithological units in the construction of geological maps, which are usually drawn manually by the interpreter.However, these maps are highly dependent on the prior knowledge of the interpreter; in other words, different interpreters can create different maps from the same data.The use of Machine Learning Algorithms (MLA) is a valuable tool to automatically recognize patterns among highdimensional data to mitigate bias and to speed up interpretations, especially in regions where geophysical and remote-sensing data are widely available.Therefore, MLA represents an efficient method for producing greenfield geological maps or improving existing maps (e.g.Cracknell et al. 2014;Harris and Grunsky 2015;Kuhn et al. 2018;Radford et al. 2018).
The Carajás Mineral Province (CMP), in the southeastern Amazonian Craton, Brazil, is an example of such difficulties.Classic geological mapping at semi-detail scales (1:100.000,1:50.000) is viable, especially in deforested areas, yet strongly based in remote sensing and airborne geophysics interpretation.Furthermore, vast areas covered by rainforest and with poor access are completely mapped through somehow biased geological-geophysical interpretation.The CMP is one of the largest mineral provinces in the world, with giant iron deposits and also rich in other resources such as copper, gold, PGE, chromium, nickel, manganese, REE, uranium and tin; which reinforces the necessity of delivering precise geological maps to the society, government and mining industry.Therefore, the Geological Survey of Brazil (CPRM) maintains a permanent geological mapping and metallogenic program in the CMP since 2008.
One of the areas recently surveyed by CPRM is the Cinzento Lineament region (Figure 1a), situated in the northernmost part of the Carajás Copper-Gold Belt (Tallarico et al. 2005).The Cinzento Lineament region hosts several IOCG (Iron Oxide -Copper -Gold), VMS (Volcanic Massive Sulfides), and granite-related copper-gold and polymetallic deposits, such as Salobo, Furnas, Pojuca and Paulo Afonso.Remote sensing and geophysical data were strongly used to elaborate the recently published geological map (Figure 1b   2018), as about 50% of the area comprises an environmental conservation unit with very limited road access.
In this work, we evaluated several supervised MLA in order to build a Predictive Lithological map of the Cinzento Lineament, based on a dataset that includes airborne magnetometric, radiometric and SRTM (Shuttle Radar Topography Mission) data.We built a predictive map with enough equivalence to the geological chart that was developed by combining traditional surveying tools (fieldwork and manual interpretation of remote sensing and airborne geophysical images).

Geological Framework
In general lines, there are three main rock crystallization or depositional ages in the CMP: in the Mesoarchean (3.06-2.83Ga), Neoarchean (2.76-2.55Ga) and Paleoproterozoic (2.10-1.88Ga); all of them represented in the lithostratigraphic framework of the Cinzento Lineament region (Table 1).The Mesoarchean lithologies include orthogneisses, greenstone belts and granitoids developed under an accretionarycollisional system with peak metamorphism at 2.85 Ga (Tavares et al. 2018;Machado et al. 1991).
The Mesoarchean units are the basement of the Neoarchean Itacaiúnas Supergroup (DOCEGEO 1988), which comprises a volcano-sedimentary sequence with mafic and felsic volcanic rocks in its lowermost unit (Parauapebas Formation), followed by thick BIF layers (Carajás Formation) and, at the top, clastic and chemical metasedimentary rocks interbedded with fewer volcanic rocks and BIF (represented by the Igarapé Cigarra and Salobo-Pojuca formations in the study area).Coeval bimodal magmatism is represented by A-type granitic plutons and mafic-ultramafic intrusions, such as the Gelado Metagranite (Barbosa 2004).The Paleoproterozoic sedimentary cover is represented by the clastic Águas Claras Formation, while plutonic units include A-type granitic bodies of the Serra dos Carajás Intrusive Suite, such as the Pojuca and Cigano granites (Dall'Agnol et al. 2005).
The Mesoarchean and Neoarchean units are strongly structured in the WNW-ESE along the study area.For Costa and Siqueira (1990), this feature is the result of a long-term strike-slip fault system with multiple reactivations, namely the Cinzento Lineament.Tavares et al. (2017), however, reinterpreted the tectonic features of the Cinzento Lineament as the result of a complex multistage tectonic evolution, marked by the overlapping of several compressive and extensional phases.For Tavares et al. (2017), the main structures are understood as deep discontinuities related to the basement framework, reactivated as extensional faults during the deposition of the Itacaiúnas Supergroup and the emplacement of the Gelado Metagranite, then again reactivated during the Paleoproterozoic Transamazonian Orogenic Cycle as ductile reverse shear zones, between 2.10 and 2.07 Ga (Tavares et al. 2018).All the tectonic system was later affected by a second Paleoproterozoic orogeny, known as the Sereno Event (1.98-1.95Ga, see Tavares et al. 2018), of intracontinental character and ductile-brittle behavior.The collapse of the Paleoproterozoic orogenies also produced a later extensional reactivation, in a brittle, fluid-dominant environment, contemporary to the emplacement of the A-type plutons of the Serra dos Carajás Intrusive Suite.

Data
The airborne geophysical data are a compilation of three projects contracted by the Geological Survey of Brazil: Oeste de Carajás, Rio Maria and Tucuruí.These projects were carried out between 2003 and 2009 and have flight lines with a spacing of 500 m (N-S direction) and control lines with a spacing of 10 km (E-W direction).The airborne magnetic data were interpolated into a grid using the bi-directional method and the radiometric data with the minimum curvature method (Briggs 1974), both with a 125 m cell.
The airborne magnetic data are widely used in geological mapping due to the usually good variation of magnetic susceptibility in rocks.In machine learning applied to predictive lithology, the most commonly used magnetic processing is the Reduced-To-Pole (RTP - Baranov 1957;Gunn 1975) since the RTP anomalies have a clear spatial correlation with the lithological units.However, at low magnetic latitudes (as in the Cinzento Lineament region) the RTP shows poor performance, retaining part of the initial dipoles and creating biased artifacts (Silva 1986).
Another approach is to use the Total Gradient (Nabighian 1972) which centralizes the anomaly above the magnetic source.However, the Total Gradient commonly shows lateral dispersions, making it difficult to define the geological boundaries between units sharply.
To overcome these issues, we suggest a new approach using a superficial horizontal slice from the Magnetic Vector Inversion (MVI - Ellis et al. 2012).The MVI directly inverts the magnetization vector accommodating effects of low latitudes and remanent magnetization (Johnson and Aisengart 2014).Figure 2a shows the result of the MVI to a depth of 100 meters.
The radiometric data are dependent on sources up to 30 cm beneath the Earth's surface, which makes these data highly correlated with geological field data.As the airborne radiometric data were obtained at different periods, they were initially compensated using a linear relationship between the data from an intersection area between adjacent projects.The radiometric data show the contents of Potassium (K), Uranium (eU) and Thorium (eTh), which were added to our database.We did not use ratios among the elements to not overparameterize the final model.Figure 2b displays an RGB ternary image with K (red), eTh (green) and eU (blue).Our database also included a digital elevation model derived from the SRTM (Figure 2c).Despite being very useful in other studies (e.g.Kuhn et al. 2018;Cracknell et al. 2014), our preliminary analysis showed a significant bias in Landsat-8 data (Figure 2d).The preliminary results showed a high correlation between lithological units covered by dense vegetation, due to its homogeneous coverage.Moreover, the lithological units appeared highly controlled in the boundaries between dense vegetation and anthropized areas (e.g.mines, tailings dams, cities).Consequently, the Landsat data were not used in predictive models.All data (airborne geophysical and SRTM) were reprojected to the coordinate system SIRGAS 2000 -UTM zone 22 south.Furthermore, we performed downsample to cells with 250 meters in all grids.This pre-processing resulted in 55,305 instances in which each one comprises all input data.Since we performed a supervised classification, it was necessary to take a small representative data do train the model (Hastie et al. 2009).In theory, records of field observation are the best data to compose training data.In practice, in a geological survey, the units are not sampled homogeneously, which can bias the model for the classes (or lithological unit) with the most significant number of samples (Japkowicz and Stephen 2002;Cracknell et al. 2014).Alternatively, it is possible to use random samples extracted from previous geological maps such as training data.Therefore, we followed previous works (Cracknell et al. 2014;Kuhn et al. 2018) and used 100 random samples per lithological unit.Another approach could be to define the number of samples by the area of the lithological unit; however, this procedure can bias the predictive model for classes with a large area (Japkowicz and Stephen 2002).Therefore, our training data resulted in 1400 samples (Figure 3) with their classes extracted randomly from the current geological map (Oliveira et al. 2018), which represents 2.5% of the total sample amount (55,305 samples).This rate is close to those of Cracknell et al. (2014) and Kuhn et al. (2018), which used 2.6% and 1.6% of the total sample amount, respectively.

Evaluation of Machine Learning Algorithms
Several supervised MLA can provide satisfactory predictions from a small number of training data samples.In this work, we perform an initial analysis of five MLA through k-fold Cross-validation Accuracy.This approach splits the instances into several folds and uses one fold at a time as a trainer to predict the remainder.Our work evaluated the Cross-validation Accuracy with ten folds for the MLAs: Naive Bayes, k-Nearest Neighbors (Cover and Hart 1967), Support Vector Machines (Vapnik 1998), Artificial Neural Networks and Random Forest (Breiman 2001).In agreement with Cracknell and Reading (2014), the Random Forest showed the best accuracy for predictions for our database (Table 2).

Random Forest Algorithm
The Random Forest algorithm (RF) is an ensemble classification method that uses bootstrap aggregation to create multiple decision trees (Breiman 2001).This approach randomly selects 2/3 of the training samples with replacement to generate a data set ("in-bag" data) of the same size as the training data.The in-bag data classify a decision tree through the Gini index (Breiman et al. 1984) that determines the best parting for a given class, while the remaining data ("out-of-bag" data) are used to validate the model.The significant advantage of the RF is that the prediction of a class is a function of the average of the multiple decisions trees, thus improving the predictions and mitigating errors from outliers.In our predictive model, we have reached a suitable accuracy with 100 trees without a maximum limit in depth for each one (Table 3).
Figure 4 shows some of the multiple trees used in this work viewed as Pythagorean trees (Beck et al. 2014).In this representation, the size of the rectangles is proportional to the number of samples, and the most probably class defines the colors.
The RF also can provide a rank of the input variables concerning the importance in the predictions.Therefore, we perform some accuracy cross-validation tests for each addition of a variable follow the rank (Table 4).The Thorium   (eTh) has shown to have the most significant importance in the predictions, which is compatible since it represents the radioelement with less mobility and consequently the most related to the lithological units.

Results
The Random Forest algorithm shows, as a result, a class probability map for each class defined in the training data (e.g. Figure 5).These probabilities can be spatially categorized in the most probable class creating a Predictive map. Figure 6 shows the comparison between the current geological map (Figure 6a), the spatial distribution of the training points (Figure 6b) and the RF Predictive Lithological map with airborne geophysics and SRTM (Figure 6c).The RF Predictive map with remote data predicted 52.7% of the current geological data (correct predictions/total predictions).This mismatch between the geological map and the Predictive map may represent misclassified samples in the training data, inappropriate predictions and/or the degree of new information that can be taken from the remote data to improve the current geological map.
Table 5 shows the predicted percentage for each lithological unit and the second most related.The smaller classes (and fewer samples) tend to have a better recovery due to its lower variation.For example, the Paleoproterozoic Pojuca Granite (PP3γpo) has a predicted percentage of 81.1%; however, the sample amount of this unit (106 samples) is remarkably lower concerning larger units as the Igarapé Gelado Metagranite (A4γ2gl) which has 23,553 samples.In this case, the Igarapé Gelado Metagranite represent a lithological unit with considerable compositional variation, and in some regions has similarities with other units such as Cigano Granite (PP3γci -11.8%).In the same way, a small part of the occurrence of felsic metavolcanic rocks (A4ppf) presents great similarities with orthogneisses of the Xingu Complex (A3xi -19.7%).These regions may represent a class error; however, they also may represent lithofacies variations or small contrasting lithologies within the Igarapé Gelado Metagranite (A4γ2gl) or portions of orthogneisses preserved within felsic metavolcanic rocks (A4ppf), respectively.
Despite the satisfactory recovery (52.7%), the Predictive map with remote data (Figure 6c) presents a considerable content of high-frequency noise.Cracknell and Reading (2014) proposed that the use of spatial coordinate constraints could significantly reduce this high-frequency content.Therefore, we inserted constrains containing the spatial coordinates of the instances in order to mitigate noise and improve model recovery.Figure 6d shows the Predictive Lithological map with remote data and spatial correlation.This predictive map recovered 78.7% of the geological map, which represents a compatible recovery in comparison with previous works.For example, Yu et al. (2012) obtained a prediction of 62.2% using Support Vector Machines (SVM) and an 11 x 11 majority filter.Kuhn et al. (2018) and Cracknell et al. (2014) through the Random Forest algorithm predicted 76% and 78%, respectively.Table 6 shows a substantial increase in recovery when spatial constraints are inserted.The predicted percentage ranges from 59.6 to 98.1%, while for the predictive map without spatial correlation it varies between 36.9 and 81.1% (Table 5).

Discussions
The Random Forest produced two predictive maps: one with remote data only (Figure 6c), and one with remote data and spatial coordinates (Figure 6d).In the first case (only with remote data), the predictive map shows a considerable amount of noise; however, it also showed potential areas for reevaluation.In the northwest portion, the geological map (Figure 6a) shows an expressive occurrence of the Igarapé Gelado Metagranite (A4γ2gl), however, the predictive map shows that this region has more similarities with the Mafic Parauapebas (A4ppm) or with Salobo-Pojuca (A4pj / A4pjp) formations or even shows that the occurrence of metavolcano-sedimentary rocks in the northwest portion is much more common than in the southeast portion as is shown on the current geological map (Oliveira et al. 2018).Considering that this entire northwestern portion comprises the area of environmental protection and thus was barely investigated with the geological mapping, the current interpretation of this portion of the map is mainly reflected by the interpretation of the remote sensors and airborne geophysical images.Therefore, it is suggested to re-evaluate the possibility to improve the interpretation with the new data obtained with the predictive map of this work.Also in the northwest, the     geological map shows an intrusive Paleoproterozoic body related to Cigano Granite (PP3γci); however, the predictive map does not recognize this Paleoproterozoic body and proposes a Salobo-Pojuca Formation (A4pj / A4pjp) instead.
The predictive map with remote data only showed several new portions of Laterite Cover (Elm) that can be validated by its clear signature in the SRTM (plateaus) and radiometric data (High eTh with low K and eU).
The predictive map with remote data and spatial coordinates has considerably increased the recovery compared to the geological map (78.7 %).This predictive map correlated very well with the geological map and proposed some changes, which can bring a considerable gain in the refinement in the boundaries of the geological units.However, this map brought little information about new lithological units; in other words, the insertion of the spatial correlation may have suppressed, besides noise, several features with possible geological logic.Furthermore, small units such as Pojuca Granite (PP3γpo), which have a dense cluster of sample training, seem to attract solutions that do not fit the unit.

Conclusions
The machine learning methods allow bringing a new approach for data interpretation.Although the definition of the parameters can produce bias, this approach allows reproducing the predictive models since the parameters are defined.In this work, the Random Forest Algorithm showed the best performance among several methods of machine learning, in agreement with Cracknell and Reading (2014).The Random Forest make it possible to generate two predictive maps: one with remote data only, and another with remote data and a spatial constraint.The first one showed a recovery of 52.7 % concerning the geological map besides several possible new lithological units, however with a high noise content.The second one obtained a better recovery of 78.7 % with several suggestions in the reevaluation of the limit of several units; however, did not bring information about new lithologies.In fact, both Predictive maps with remote data only and with spatial coordinates have their advantages and disadvantages and need to be used together to enrich the geological map, especially in regions with scarce outcrops such as the Cinzento Lineament region.

Figure 1
Figure 1 -a) Tectonic compartmentalization of the Southern Amazonian craton proposed by Vasquez et al. (2008) highlighting the region of the Cinzento Lineament.b) Lithological Map of Cinzento Lineament (Oliveira et al. 2018).

Figure 2 -
Figure 2 -The data used as input in the machine learning algorithms in this study.a) Magnetic Vector Inversion, b) Ternary RGB map (K, eTh and eU), C) Shuttle Radar Topography Mission (SRTM) and d) false color (RGB) of Landsat 8 combining bands 4 (red), 3 (green) and 2 (blue).

Figure 3 -
Figure 3 -Location of training points used as input in the Machine Learning Algorithms.The training points database is composed of 100 random samples from each class (or lithological unit) resulting in 1400 samples.

Figure 4 -
Figure 4 -Examples of Pythagorean trees from the multiple trees used in this work.Colors show the most likely classes on each node and the size of the rectangle is proportional to the number of samples in each group.

Figure 5 -
Figure 5 -An example of class probability (in percentage) for the lithological unit A4γ2gl from the Predictive Lithological map with remote data only.

Figure 6 -
Figure 6 -Comparison between the a) Lithological Map of Cinzento Lineament from Oliveira et al. (2018), b) training points, c) Predictive Lithological Map using remote data only, and d) the Predictive Map with remote data and spatial correlation.The limits of the geological map were superimposed on the predictive map to improve the association.

Table 3 -
Evaluation of Cross-validation Accuracy (%) with the number of trees.The bold text indicates the point where the accuracy stops progressing.

Table 4 -
Cross-validation accuracy % for each influence rank in the Random Forest.

Table 6 -
The recovery of the predicted samples of the Predictive Lithological Map using remote data and spatial correlation.The last column shows the unit that most relates to the main class.

Table 5 -
The recovery of the predicted samples of the Predictive Lithological Map using remote data only.The last column shows the unit that most relates to the main class.