Immunoinformatics approach in designing SARS-CoV-2 vaccine from experimentally determined SARS-CoV T-cell epitopes

The rapid transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) resulted in the coronavirus disease 2019 pandemic. This has caused a global health emergency that warrants accelerated vaccine development. Herein, immunoinformatics was utilized in evaluating experimentally validated SARS-CoV CD4+ and CD8+ epitopes retrieved from the database as a potential vaccine against SARS-CoV-2. The protein sequences of SARS-CoV-2 were retrieved. Then, multiple sequence alignments and protein variability analysis of retrieved were conducted to obtain highly conserved sequences. SARS-CoV epitopes having a 100% overlap with the highly conserved protein sequence of SARS-CoV-2 were further analyzed to identify major histocompatibility complex (MHC) allele binders. Epitopes with significant matches to known human protein sequences were excluded to avoid cross-reactivity. Population coverage (PC) for an optimized set of CD4+ and CD8+ epitopes was also estimated. SARS-CoV-2 epitopes were docked with structures of identified MHC alleles. Binding energy and dissociation constant were calculated to analyze the stability of the epitope-MHC complex. To further evaluate the stability of interaction, the root mean square deviation plot was obtained using molecular dynamics simulation. This work identified a highly conserved potential SARS-CoV-2 vaccine comprising three CD4+ epitopes GAALQIPFAMQMAYRF, MAYRFNGIGVTQNVLY, and QALNTLVKQLSSNFGAI with worldwide PC of 81.81% and seven CD8+ epitopes RLNEVAKNL, VLNDILSRL, GMSRIGMEV, LLLDRLNQL, MEVTPSGTWL, RRPQGLPNNTASWFT, and LQLPQGTTL with global PC of 89.49%.


INTRODUCTION
The coronavirus disease 2019 (COVID-19) outbreak was first reported at Wuhan, China, in December 2019 (Du Toit, 2020), is caused by the severe acute respiratory coronavirus 2 (SARS-CoV-2) infections. It has resulted in over millions of infected cases and hundreds of thousands of deaths worldwide in just a matter of few months. This pandemic has severely paralyzed the global economy for several months and has almost exhausted many healthcare service facilities around the world. The high transmission rate of SARS-CoV-2 makes it further difficult to contain. Although the fatality rates of the Middle East respiratory syndrome coronavirus (MERS-CoV) and severe acute respiratory syndrome coronavirus (SARS) are higher than SARS-CoV-2, the transmission of the SARS-CoV-2 virus is much faster than that of the previous human coronaviruses, resulting to even higher death rates (Cascella et al., 2020). SARS-CoV-2 belongs to the subfamily Coronavirinae, family Coronaviridae, order Nidovirales, and genus Betacoronavirus. Coronaviruses are enveloped viruses containing positive sense single-stranded RNA as genetic material. It has approximately 26-32 kb, containing six open reading frames (ORFs): ORF1ab, ORF3a, ORF6, ORF7a, ORF8, and ORF10, wherein ORF1 encodes for 16 nonstructural proteins (NSP1-16) and four structural proteins which include spike (S), envelope (E), membrane (M), and nucleocapsid (N) (Ahmed et al., 2020;Lu et al., 2020). A study showed that T-cell responses elicited by the structural proteins of SARS are more immunogenic than its NSP (Li et al., 2008).
Phylogenetic studies showed that SARS-CoV-2 shared 80.0% sequence identity with SARS (Gralinski and Menachery, 2020;Zhou et al., 2020). A proteomic analysis conducted to compare the sequences of SARS-CoV-2 with SARS showed that proteins in SARS-CoV-2 share 95%-100% identity with SARS , except for ORF8 and ORF10 which have no homologous proteins in SARS (Chan et al., 2020). Among these proteins is the nucleocapsid (N). Being the most abundant and highly conserved protein in coronaviruses, it shares 90.52% identity with SARS . Similarities between these two coronaviruses can be used as an advantage in designing drugs and vaccines against SARS-CoV-2. In this regard, the therapeutic agents used to treat SARS can be potentially used as COVID-19 treatment .
Recently, several COVID-19 vaccines have finished phase III clinical trials; however, these cannot guarantee an outright success upon administration to larger and more diverse populations around the world. Studies conducted to develop drugs and vaccines are far from over until SARS-CoV-2 is finally tamed in most populations worldwide. Witnessing the devastating effects of COVID-19, the rapid development of antiviral drugs and vaccines is imperative. Data mining and immunoinformatics have efficiently hastened and reduced the cost required in experimental immunology to be able to identify vaccine candidates and biomarkers (Oyarzún et al., 2016;Vaishnav et al., 2015). Herein, this work aims to develop a candidate SARS-CoV-2 vaccine using immunoinformatics tools for the efficient identification of potential vaccines from experimentally validated T-cell (CD4+ and CD8+) epitopes of SARS which were retrieved from the viral sequence database.

Identification of experimentally validated SARS T-Cell epitopes
Experimentally validated SARS CD4+ and CD8+ epitopes were retrieved from the Immune Epitope Database and Analysis Resource (IEDB). Epitopes were restricted to "severe acute respiratory syndrome-related coronavirus [human coronavirus (strain SARS)]" with "humans" as the host, specifically binding to "MHC class I" for CD8+ epitopes or "MHC class II" for CD4+ epitopes and "T-cell assays" with only positive results included.

Retrieval of SARS-CoV-2 protein sequences, processing, and alignment
All SARS-CoV-2 amino acid sequences with corresponding proteins in the list of retrieved SARS epitopes were acquired from the Virus Pathogen Database and Analysis Resource (ViPR). Sequences collected from humans, starting from December 2019 to August 2020, were retrieved. Only those with complete genomes were included, and the minimum coding sequence (CDS) limit was employed per protein. Identical sequences were also removed. The list of retrieved sequences were clustered using a 0.99 cut-off in CD-HIT suite (http://weizhonglilab.org/cdhit_suite/cgi-bin/index.cgi?cmd=cd-hit) to obtain a set of representative sequences per protein. The representative sequences were then aligned using Clustal Omega (https:// www.ebi.ac.uk/Tools/msa/clustalo/) to obtain multiple sequence alignments. Protein Variability Server was used to identify highly conserved sequences in SARS-CoV-2 proteins. Shannon variability threshold H ≤ 0.5 was used. Sequence variability was masked using the first sequence in the alignment as a reference. Fragments with ≥9 residues were selected. All acquired SARS epitopes having 100% overlap with the retrieved SARS-CoV-2 reference protein sequences were retained for further evaluation.

Identification of MHC binding, cross-reactivity, and population coverage (PC)
Restricted MHC I alleles binding to CD8+ epitopes were identified in silico using MHC I Binding Tool NetMHCcons in IEDB which predicts binding to any known MHC I molecule integrating three well-known methods: PickPocket, NetMHC, and NetMHCpan, for more accurate results (Karosiene et al., 2012). In screening for binding alleles, the most frequent MHC I alleles (Weiskopf et al., 2013) were used with IC 50 < 500 nmol/dm 3 , which classifies epitopes as good binders (Jensen et al., 2018). To identify MHC II alleles binding to CD4+ epitopes, SMM-align (NetMHCII) method in IEDB MHC II Binding Tool was used. This was reported to outperform other state-of-the-art MHC II prediction algorithms (Nielsen et al., 2007). The most frequent MHC II alleles (Greenbaum et al., 2011) with IC 50 < 500 nmol/dm 3 were used. To avoid potential cross-reactivity in humans, proteinprotein BLAST (BLASTp) was utilized to identify epitopes with a significant match to known human protein sequences. The PC for each set of CD4+ and CD8+ epitopes was estimated using the PC tool in IEDB. In addition, PC in areas where infection rates of SARS-CoV-2 are high, including the USA, Brazil, Mexico, France, India, Iran, Spain, and Russia, was also calculated. This tool helps to efficiently maximize the PC of epitopes while minimizing the number of epitopes that must be included in a vaccine (Bui et al., 2006). Candidate CD4+ and CD8+ vaccine epitopes were selected to give optimal worldwide PC. Binding energy and dissociation constant were calculated in the PRODIGY webserver at 37°C. This tool uses intermolecular contacts and noninterface surface properties for predictive models (Xue et al., 2016). The stability of the interaction of an epitope with its MHC allele was further evaluated using molecular dynamics simulation, wherein the root mean square deviation (RMSD) plot per residue was obtained. This process was carried out in the MDWeb server (http://mmb.irbbarcelona.org/MDWeb/) using C-alpha Brownian dynamics in 100 ps time, with 0.01 ps time change, 3.8 Ǻ distance between alpha carbon atoms, 10 steps output frequency, and 40 kcal/mol Ǻ 2 force constant. This method used GROMACS MD setup with solvation, employing the forcefield Amber-99sb (Hospital et al., 2012).

Identification of SARS-CoV-2 epitopes
A total of 70 SARS T-cell epitopes (43 CD4+ and 27 CD8+) were retrieved from IEDB. The set of 43 CD4+ consists of five ORF3a, four N, and 34 S epitopes, while the set of 27 CD8+ consists of two ORF3a, 15 N, and 10 S epitopes. Reported close homology of protein sequences between SARS and SARS-CoV-2 has been the basis of this work in seeking putative T-cell epitopes which can be potentially used as an ensemble of the vaccine against SARS-CoV-2. To identify which among these retrieved SARS epitopes overlap with SARS-CoV-2, unique protein sequences of ORF3a, N, and S in SARS-CoV-2 were retrieved from ViPR. Overall, 438 ORF3a, 529 N, and 899 S sequences were isolated from different countries worldwide, starting from December 2019 to August 2020. These sequences were clustered, aligned, and highly conserved reference sequence for each protein was obtained. One important factor to consider in designing a vaccine to avoid epitope immune escape is to focus on highly conserved sequences so that more effective immunity can be developed against a broader range of SARS-CoV-2 isolates around the world. To determine the highly conserved reference sequence for each protein in this study, Shannon's variability threshold ≤ 0.5 was used (Molero-Abraham et al., 2013;Shannon, 1997). From 70 retrieved SARS epitopes, only 17 epitopes (4 CD4+ and 13 CD8+) have 100% overlap with the highly conserved reference SARS-CoV-2 sequences. These were further evaluated (Tables 1  and 2) for cross-reactivity with the human proteome. Among all blasted CD4+ epitopes, GAALQIPFAMQMAYRF has the highest E-value = 1.5e−1 with percentage identity (66.67%), while 3 CD8+ epitopes have 100% identity: LALLLLDRL (E-value = 3.3e−1), ILLNKHIDA (E-value = 1.9e−1), and ALNTPKDHI (E-value = 5.4e−1). Matches with E-values < 1.0e−30 can be cross-reactive in some allergic individuals (Hileman et al., 2002). In this work, E-values of epitopes blasted against the human protein in databases are > 1; thus, at best, these epitopes have poor matches and are less likely to cause autoimmune reactions in humans.

Identification of MHC binding, cross-reactivity, and PC
The MHC alleles binding to the 17 identified SARS-CoV-2 epitopes were obtained from IEDB. Table 3 shows epitopes with their corresponding MHC allele binders at IC 50 ≤ 500 nM. CD8+ and CD4+ epitopes with the highest global PC are VLNDILSRL (67.81%) and GAALQIPFAMQMAYRF (75.40%), respectively. Candidate SARS-CoV-2 vaccines were finally selected to include the minimum number of epitopes that can give the optimal worldwide % PC for each set of CD4+ and CD8+ epitopes. Results turned to include 10 epitopes (7 CD8+    in the calculation of the IEDB tool, the estimated % PC for the set of CD4+ epitopes could be larger in reality. These results show that the 10 epitopes identified in this work are potential SARS-CoV-2 vaccines which can be utilized as an immunotherapeutic agent in many areas severely affected by COVID-19.

Docking and molecular dynamics simulation of MHC-epitope complexes
Retrieved MHC structures were docked and refined with their corresponding epitopes. Epitope-MHC docked models with the lowest energy were chosen as representative docked structures. Figure 1  binding groove of each MHC molecule. To support data on efficient binding of epitopes with their predicted MHC alleles, binding energy (ΔG) and dissociation constant (Kd) were calculated for each epitope-MHC docked complex ( Table  4). The lower the value of binding energy and dissociation constant, the more stable and stronger the binding of the epitope with the MHC structure. All epitope-MHC docked structures have negative ΔG values for binding energy, which indicate that the complex formation is favorable and stable (Paul et al., 2014;Sebastián et al., 2017). Dissociation constants calculated for the docked structures in this study fall within the known Kd range of most peptide-MHC binding structures (Chang et al., 2006) which further provide evidence on the stability of identified candidate vaccine epitopes binding to their MHC alleles. Among the three CD4+ vaccine epitopes, QALNTLVKQLSSNFGAI has the greatest binding affinity and forms the most stable complex with HLA-DRB1*01:01 as indicated by the lowest Kd and binding energy. For CD8+ candidate vaccines, RRPQGLPNNTASWFT has the greatest binding affinity and the most stable complex formed with HLA-B*53:01. To further investigate the stability of epitope-MHC docked structures, molecular dynamics simulation was conducted to obtain the RMSD plot of each complex. Figure    All residues from each complex have RMSD ≤ 1.0 Å, which indicates the best interaction between the docked ligand and protein (Fu et al., 2018).

CONCLUSION
Reports on closely similar protein sequences of SARS with SARS-CoV-2 have been the basis of this work in seeking putative T-cell epitopes which can be potentially used as an ensemble of vaccine against SARS-CoV-2. This work identified 10 candidate vaccine epitopes having 100% overlap with the highly conserved sequences of S and N in SARS-CoV-2 and possessing large PC worldwide and for areas with known high SARS-CoV-2 mortality rates. The proposed set of epitopes include RLNEVAKNL, VLNDILSRL, GMSRIGMEV, LLLDRLNQL, MEVTPSGTWL, RRPQGLPNNTASWFT, LQLPQGTTL, GAALQIPFAMQMAYRF, MAYRFNGIGVTQNVLY, and QALNTLVKQLSSNFGAI. The evaluations of binding energy, binding affinity, and molecular dynamics simulations showed stability of docked epitopes, with their respective MHC alleles, which were identified in this study. Overall, results from this work suggest that the identified epitopes can be used to form a vaccine against SARS-CoV-2, which are nonetheless anticipated to be further examined in vivo.