QSAR modeling for predicting the antifungal activities of gemini imidazolium surfactants against Candida albicans using GA-MLR methods

This report presents a quantitative structure-activity relationship (QSAR) analysis of gemini imidazolium surfactants against Candida albicans. Mordred software was used to calculate various types of molecular descriptors. The data set contains 70 gemini imidazolium surfactant structures and was divided into a training set (75%) and a test set (25%) for the validation step. A genetic algorithm combined with multiple linear regression (GA-MLR) was used to investigate the correlation between the molecular descriptors and the antifungal activity of the gemini imidazolium surfactants. The best GA-MLR model, consisting of two topological descriptors (GATS4se and BalabanJ), shows good fitting and internal validation, with $R^2$ = 0.9073, $Q^2_{LOO}$ = 0.8941, and $Q^2_{LMO}$ = 0.8908. It was also confirmed by the external validation procedure, with $R^2_{test}$ = 0.8988 and $\mathrm{RMSE}_{test}$ = 0.3557, indicating that the obtained model is robust and reliable for predicting the antifungal activity of gemini imidazolium surfactants. The GA-MLR QSAR model could be a useful tool for the initial development and design of novel gemini imidazolium surfactants as antifungal agents.


INTRODUCTION
The genus Candida is responsible for about 80% of fungal infections in the hospital environment and is a relevant cause of bloodstream infections. The most invasive species, which are responsible for severe cases of candidiasis, include Candida albicans, Candida parapsilosis, Candida glabrata, and Candida krusei. The most commonly reported clinical presentations of candidiasis are cutaneous-mucous, visceral, and allergic (Fisher-Hoch and Hutwahner, 1995; Pfaller and Diekema, 2007).
Antifungal agents are an option for treating oral candidiasis, but fewer antifungal drugs are available than antibacterial agents because fungi are eukaryotic organisms whose cells resemble mammalian cells, which makes selecting suitable antifungal targets difficult (Mayer et al., 2013). In addition, resistance has emerged to several antifungal drugs, including fluconazole, ketoconazole, and itraconazole (Lewis et al., 2012). Therefore, the development of new antifungal drugs to treat candidiasis is important.
In recent years, cationic gemini surfactants, which contain two head groups and two aliphatic chains connected by a spacer, have been exploited (Menger and Keiper, 2000; Rosen and Tracy, 1998). Compared with the corresponding monomeric surfactants under the same conditions, they exhibit lower cytotoxicity, better surface properties, and a better ability to bind negatively charged substances (Brycki et al., 2017; Sharma et al., 2017; Shukla and Tyagi, 2006). Cationic gemini surfactants also have good antimicrobial activity (Bao et al., 2017; Tatsumi et al., 2014). Several studies have shown that the antimicrobial activity of cationic gemini imidazolium chloride surfactants depends on their structure, for example, the length of the alkyl groups (Rath and Bai, 2016). The hydrophobic portion (alkyl chains) of the imidazolium cationic gemini surfactant interacts with the cell membrane, inducing membrane damage that leads to cell lysis and death (Dolezal et al., 2016). The results of Pałkowski et al. (2014), Butorac et al. (2011), and Kamboj et al. (2012) show that imidazolium cationic gemini surfactants have potential as antimicrobial agents.
Designing a new drug is a time-consuming, expensive, and very complex process. Researchers are therefore challenged to find strategies that are both effective and economical for producing new drugs. One such strategy is the computer-aided drug design (CADD) approach (Yu and MacKerell, 2017). The relatively low cost of CADD compared with traditional drug discovery methods makes it attractive for saving the cost and time required to develop new drug compounds (Shim and MacKerell, 2011).
Ligand-based drug design and structure-based drug design are the two main categories of CADD methods (Yu and MacKerell, 2017). Ligand-based drug design uses information on the physicochemical properties of experimentally known active compounds as a basis for designing new compounds. Among these methods is quantitative structure-activity relationship (QSAR) analysis, which relates molecular structures and properties to activity through chemoinformatic methods (Cherkasov et al., 2014; Kubinyi, 1995; Roy et al., 2015).
Considering the above, this work aimed to conduct a QSAR study of the gemini imidazolium surfactants synthesized by Pałkowski et al. (2014) as potential antifungal agents against C. albicans. Using the most important physicochemical parameters (descriptors), models can be obtained that help in planning the synthesis of new gemini imidazolium surfactants with better antifungal activity.

Dataset
Pałkowski et al. (2014) synthesized 70 gemini imidazolium chlorides with antifungal activity against C. albicans. The reported antifungal activity values [minimum inhibitory concentration (MIC), the lowest concentration of surfactant that inhibits the growth of microorganisms, in mol/L] were converted into the corresponding pMIC values (pMIC = -log MIC). The full dataset was divided into a training set (75%) and a test set (25%) based on the diversity of antifungal activity (Table 1).
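The conversion and the split can be illustrated with a minimal Python sketch. The text does not describe the exact splitting procedure, so the rank-based rule below (sort by pMIC and send every fourth compound to the test set) is only an assumption chosen to spread the test set over the activity range; the function names and the MIC values are hypothetical.

```python
import math

def to_pmic(mic_mol_per_l):
    """Convert a MIC given in mol/L to pMIC = -log10(MIC)."""
    return -math.log10(mic_mol_per_l)

def activity_ranked_split(pmic_values, every=4):
    """Assumed splitting rule: rank compounds by pMIC and move every
    `every`-th ranked compound to the test set (about 25% for every=4)."""
    order = sorted(range(len(pmic_values)), key=lambda i: pmic_values[i])
    train, test = [], []
    for rank, idx in enumerate(order):
        (test if rank % every == every - 1 else train).append(idx)
    return train, test

# Hypothetical MIC values for 70 compounds (not the published data).
mic = [10 ** (-(2 + 0.04 * i)) for i in range(70)]
pmic = [to_pmic(m) for m in mic]
train_idx, test_idx = activity_ranked_split(pmic)
print(len(train_idx), len(test_idx))
```

For 70 compounds this rule gives 53 training and 17 test compounds, which is close to the 75%/25% proportions stated above.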

Descriptors calculation
Initially, all geometries were drawn and optimized using the ab initio Hartree-Fock method with the 3-21G basis set, as implemented in the Gaussian 09 software. Next, the Gaussian output files were used by the Mordred software to calculate various classes of descriptors, such as constitutional, topological, and WHIM (Weighted Holistic Invariant Molecular) descriptors (Moriwaki et al., 2018). The descriptors were then filtered by removing those with constant values and those highly correlated with other descriptors. The filtered descriptors were used to build the QSAR models.
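The filtering step can be sketched as follows. This is an illustrative NumPy version, not the procedure used by the authors' software: the correlation cutoff of 0.95 is an assumption (the text gives no threshold), and the function name and the toy descriptor table are hypothetical.

```python
import numpy as np

def filter_descriptors(X, names, corr_cutoff=0.95):
    """Drop constant columns, then drop any column whose absolute Pearson
    correlation with an already kept column reaches `corr_cutoff`."""
    X = np.asarray(X, dtype=float)
    # Step 1: keep only descriptors that actually vary across compounds.
    varying = [j for j in range(X.shape[1]) if X[:, j].max() > X[:, j].min()]
    # Step 2: greedy pass that keeps the first descriptor of each correlated group.
    kept = []
    for j in varying:
        redundant = any(
            abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) >= corr_cutoff for k in kept
        )
        if not redundant:
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Toy table: "b" is constant and "c" is a linear copy of "a".
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
table = np.column_stack([a, np.full(5, 7.0), 2 * a + 1, [5.0, 1.0, 4.0, 2.0, 3.0]])
X_kept, names_kept = filter_descriptors(table, ["a", "b", "c", "d"])
print(names_kept)
```

The greedy pass keeps whichever descriptor of a correlated pair appears first, so the result depends on column order; other tie-breaking rules are equally defensible.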

QSAR modeling and validation
The QSAR models were developed by a combination of genetic algorithms with multiple linear regression (GA-MLR) methods using the QSARINS software (Gramatica et al., 2013). The selection of variables was carried out using the GA technique; in this way, consistent models are obtained through an optimization process that considers the value of statistical parameters such as the correlation coefficient and standard deviation (Rogers and Hopfinger, 1994). Then, the descriptors generated based on these parameters were correlated with antifungal activity through the MLR method for the construction of the QSAR models.
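The idea of GA-driven variable selection with an MLR fitness can be sketched in a few lines of Python. This is a simplified illustration and not the QSARINS implementation: the fitness is the plain $R^2$ of the fit, the population size, number of generations, and mutation rate are arbitrary assumptions, and all names and data are hypothetical.

```python
import numpy as np

def r2_mlr(X, y, cols):
    """R^2 of a least-squares fit of y on the chosen columns plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, n_vars=2, pop_size=20, generations=30, p_mut=0.3, seed=0):
    """Search for the n_vars-descriptor subset with the highest R^2."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    pop = [tuple(sorted(rng.choice(n_desc, size=n_vars, replace=False).tolist()))
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda c: r2_mlr(X, y, c), reverse=True)
        history.append(r2_mlr(X, y, pop[0]))
        parents = pop[: pop_size // 2]          # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            # Crossover: draw the child's descriptors from both parents.
            pool = sorted(set(parents[i]) | set(parents[j]))
            child = rng.choice(pool, size=n_vars, replace=False).tolist()
            # Mutation: swap one descriptor for one not yet in the child.
            if rng.random() < p_mut:
                free = [d for d in range(n_desc) if d not in child]
                child[int(rng.integers(n_vars))] = int(rng.choice(free))
            children.append(tuple(sorted(child)))
        pop = parents + children
    best = max(pop, key=lambda c: r2_mlr(X, y, c))
    return best, r2_mlr(X, y, best), history

# Synthetic demonstration data (not the surfactant dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(53, 12))
y_demo = 1.0 + 2.0 * X_demo[:, 3] - 1.5 * X_demo[:, 7] + 0.1 * rng.normal(size=53)
best, score, history = ga_select(X_demo, y_demo)
print(best, round(score, 4))
```

Because the best half of each generation is carried over unchanged, the best fitness can never decrease from one generation to the next. A cross-validated fitness could replace the plain $R^2$ to penalize chance fits.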
The validation of statistical models is an important stage in QSAR-based drug design because it ensures that the equations obtained have predictive power and are sufficiently reliable to describe the structural changes associated with biological activity (Kiralj and Ferreira, 2009; Gramatica, 2007).

RESULTS AND DISCUSSION
The aim of this study was to find the correlation between the structural parameters of gemini imidazolium surfactants and their antifungal activity against C. albicans. In QSAR analysis, the structural parameters of a compound are expressed by molecular descriptors. To obtain these descriptors, we first drew a 3D model of each gemini imidazolium surfactant and then optimized its geometry using the ab initio Hartree-Fock method with the 3-21G basis set. The equilibrium geometries were used as input to the Mordred software to calculate the molecular descriptors. As a result, 835 molecular descriptors were generated.
The variable selection process using the GA technique followed by the MLR method led to a GA-MLR model consisting of two molecular descriptors. The obtained model is shown in Eq. 1, and its statistical parameters are listed in Table 2. The antifungal activities (pMIC) predicted by this model for the gemini imidazolium surfactants are summarized in Table 1. The plot of predicted against experimental antifungal activity (pMIC) in Figure 1 has a slope close to 1, which means that the model provides a good level of prediction.
$$\mathrm{pMIC} = 23.1793 + 0.628 \times \mathrm{GATS4se} - 0.7791 \times \mathrm{BalabanJ} \quad \text{(Eq. 1)}$$
The coefficients of the molecular descriptors in Eq. 1 suggest that the 2D autocorrelation descriptor GATS4se (Geary coefficient of lag 4 weighted by Sanderson electronegativity) and the Balaban topological index (BalabanJ) are the descriptors with the greatest influence on the antifungal activity of gemini imidazolium surfactants. The positive coefficient of GATS4se indicates that an increase in GATS4se leads to an increase in antifungal activity. BalabanJ has a negative coefficient, which indicates that an increase in BalabanJ leads to a decrease in antifungal activity.
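The sign interpretation of Eq. 1 can be checked with a short Python sketch that evaluates the equation exactly as printed above. The descriptor values passed in are hypothetical placeholders, not values from Table 1, so only the differences between predictions are meaningful here.

```python
def predict_pmic(gats4se, balaban_j):
    """Evaluate Eq. 1 with the coefficients as reported in the text."""
    return 23.1793 + 0.628 * gats4se - 0.7791 * balaban_j

# Hypothetical descriptor values used only to show the direction of each effect.
base = predict_pmic(1.0, 1.5)
more_gats = predict_pmic(1.1, 1.5)      # GATS4se raised by 0.1
more_balaban = predict_pmic(1.0, 1.6)   # BalabanJ raised by 0.1

print(more_gats - base)      # positive: predicted activity increases
print(more_balaban - base)   # negative: predicted activity decreases
```

Raising GATS4se by 0.1 changes the prediction by 0.628 × 0.1, while raising BalabanJ by 0.1 changes it by -0.7791 × 0.1, independent of the starting values because the model is linear.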
The correlation matrix of the selected descriptors shows that the correlation between GATS4se and BalabanJ is very low (Table 3). This indicates that there is no significant intercorrelation between the descriptors used to develop the model. Additionally, the residuals of the pMIC values predicted by Eq. 1 are plotted against the experimental pMIC values in Figure 2. All residuals lie between -1 and 1, which indicates that Eq. 1 has good accuracy and reliability for predicting the antifungal activity of gemini imidazolium surfactants against C. albicans.
Based on the validation parameters of the model (Table 2), the model satisfies the requirements proposed by Golbraikh et al. (2002) and Roy et al. (2012). The values of $R^2$ (0.9073) and $Q^2_{LOO}$ (0.8941) are high, showing that the model is significant and robust for predicting the antifungal activity of gemini imidazolium surfactants. The difference between $R^2$ and $Q^2_{LOO}$ ($R^2 - Q^2_{LOO}$ = 0.0132) is within the limit suggested by Kiralj and Ferreira (2009), which indicates that the model is not overfitted. The low value of the LOF parameter (LOF = 0.1397) likewise implies a well-fitted model without overfitting.
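How $R^2$ and $Q^2_{LOO}$ relate can be sketched with NumPy. The sketch assumes the common definition $Q^2_{LOO} = 1 - \mathrm{PRESS}/\mathrm{TSS}$ with TSS taken around the training-set mean, and it uses the standard least-squares identity that the leave-one-out residual equals the ordinary residual divided by $(1 - h_{ii})$; the function name and data are hypothetical.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R^2, Q^2_LOO) for an MLR model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                 # hat matrix of the fit
    resid = y - H @ y                         # ordinary residuals
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - resid @ resid / tss
    loo_resid = resid / (1.0 - np.diag(H))    # leave-one-out residuals
    q2 = 1.0 - loo_resid @ loo_resid / tss
    return r2, q2

# Synthetic two-descriptor data (not the surfactant dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(53, 2))
y_demo = 3.0 + 0.6 * X_demo[:, 0] - 0.8 * X_demo[:, 1] + 0.3 * rng.normal(size=53)
r2, q2 = r2_and_q2_loo(X_demo, y_demo)
print(round(r2, 4), round(q2, 4), round(r2 - q2, 4))
```

Since each leave-one-out residual is at least as large in magnitude as the ordinary residual, $Q^2_{LOO}$ can never exceed $R^2$; a small gap, such as the 0.0132 reported above, is what one hopes to see for a model that is not overfitted.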
Validation of the final model consists mainly of internal and external validation. The LOO and LMO cross-validation procedures were used for internal validation. After generating and evaluating the model, the applicability domain (AD) was examined to confirm that the obtained model can be considered reliable. The Williams plot, or leverage approach, was used to assess the influence of each compound on the model (Gramatica, 2007). The leverage value ($h_i$) shows the distance of a compound from the centroid of X and is defined as
$$h_i = x_i^{T}\,(X^{T}X)^{-1}\,x_i$$
where $x_i$ is the descriptor vector of compound i and X is the descriptor matrix of the training set. The critical leverage value ($h^{*}$) is defined as
$$h^{*} = \frac{3(p+1)}{n}$$
where p is the number of descriptors in the model and n is the total number of compounds in the training set. As shown in Figure 3, all compounds lay within the applicability domain, with leverage values lower than the threshold ($h^{*}$ = 0.170). This indicates that no compound in the dataset fell outside the AD as an outlier.
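The leverage calculation can be sketched with NumPy under the standard definitions $h_i = x_i^{T}(X^{T}X)^{-1}x_i$ and $h^{*} = 3(p+1)/n$. The descriptor matrix below is synthetic, the function names are hypothetical, and a training-set size of 53 is assumed only because it reproduces the reported threshold of 0.170 for a two-descriptor model.

```python
import numpy as np

def leverages(X_train, X_query=None):
    """Leverage x^T (X^T X)^-1 x of each query row relative to the training set.
    With no query matrix, the training compounds themselves are evaluated."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = X_train if X_query is None else np.asarray(X_query, dtype=float)
    xtx_inv = np.linalg.inv(X_train.T @ X_train)
    # Row-wise quadratic form without building the full hat matrix.
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def critical_leverage(p, n):
    """Warning threshold h* = 3(p + 1) / n."""
    return 3.0 * (p + 1) / n

# Synthetic descriptor matrix: 53 compounds, 2 descriptors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(53, 2))
h = leverages(X_demo)
h_star = critical_leverage(p=2, n=53)
print(round(h_star, 3), int(np.sum(h > h_star)))
```

Compounds with a leverage above the threshold would be flagged as structurally distant from the training set, so their predictions should be treated with caution.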
The Y-scrambling method was employed as a randomization test to confirm that the correlation between antifungal activity and the selected descriptors did not arise by chance (Rücker et al., 2007). The test is passed when the average values of $R^2_{Yscr}$ and $Q^2_{Yscr}$ obtained from the scrambled responses are both lower than the $R^2$ and $Q^2$ of the model. As shown in Figure 4, the $R^2$ and $Q^2$ values of the model are higher than the $R^2_{Yscr}$ and $Q^2_{Yscr}$ values, which indicates that the model is not the result of a chance correlation.
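A minimal Y-scrambling sketch in Python is given below. It only tracks $R^2$ (the $Q^2$ variant follows the same pattern), the number of permutations is an arbitrary assumption, and the data and function names are hypothetical.

```python
import numpy as np

def r2_fit(X, y):
    """R^2 of a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def y_scrambling(X, y, n_rounds=100, seed=0):
    """Refit the model on randomly permuted responses and return
    (R^2 of the real model, mean R^2 over the scrambled models)."""
    rng = np.random.default_rng(seed)
    scrambled = [r2_fit(X, rng.permutation(y)) for _ in range(n_rounds)]
    return r2_fit(X, y), float(np.mean(scrambled))

# Synthetic two-descriptor data with a real signal (not the surfactant dataset).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(53, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.1 * rng.normal(size=53)
r2_model, r2_yscr = y_scrambling(X_demo, y_demo)
print(round(r2_model, 3), round(r2_yscr, 3))
```

Permuting the responses breaks any real link between descriptors and activity, so a scrambled average close to the model's own $R^2$ would point to a chance correlation.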

CONCLUSION
The GA-MLR analysis showed that two descriptors play a key role in the antifungal activity of gemini imidazolium surfactants, namely GATS4se (Geary coefficient of lag 4 weighted by Sanderson electronegativity) and BalabanJ (Balaban topological index). The obtained QSAR model is significant and robust, does not show chance correlation, and has strong predictive ability ($R^2$ = 0.9073, $Q^2_{LOO}$ = 0.8941, and $R^2_{test}$ = 0.8988). The AD analysis indicates that the structures in the dataset are adequately represented by the chemical space of the model. Thus, the predicted activity values can be considered reliable.

ACKNOWLEDGMENTS
This project was financially supported by Universitas Gadjah Mada (UGM) through a Rekognisi Tugas Akhir (RTA) program in 2020.

CONFLICT OF INTEREST
The authors declared that they have no conflicts of interest.

AUTHOR CONTRIBUTIONS
All authors made substantial contributions to conception and design, acquisition of data, or analysis and interpretation of data; took part in drafting the article or revising it critically for important intellectual content; agreed to submit to the current journal; gave final approval of the version to be published; and agree to be accountable for all aspects of the work. All the authors are eligible to be authors as per the International Committee of Medical Journal Editors (ICMJE) requirements/guidelines.

ETHICAL APPROVALS
This study does not involve experiments on animals or human subjects.

PUBLISHER'S NOTE
This journal remains neutral with regard to jurisdictional claims in published institutional affiliation.