FastDTI: Drug-Target Interaction Prediction using Multimodality and Transformers

Recent advances in machine learning have proved effective in drug discovery by predicting which drugs are likely to interact with a protein target of a given disease, which helps to prioritize drug development and re-purposing efforts. State-of-the-art techniques in Drug-Target Interaction (DTI) prediction are often computationally expensive and can only be trained on small specialized datasets. In this paper, we propose a novel architecture, called FastDTI, that utilizes transformers and graph neural networks pre-trained in a self-supervised manner on large-scale (unlabeled) data, and that additionally allows for embedding multimodal input representations of both drug and protein properties. An extensive empirical study demonstrates that our approach outperforms state-of-the-art DTI methods on the KIBA benchmark dataset, while training about 200 times faster.


Introduction
The recent COVID-19 pandemic, which is not the first and will likely not be the last pandemic [18], has caused devastating disruptions to health, society, and the economy worldwide. It has shown that the inability to rapidly develop effective treatments for a new disease is a significant shortcoming in the world's ability to respond to a pandemic, given that the pipeline for developing new drugs is too slow in most scenarios. One way to speed up the process of finding treatments for a novel disease is to re-purpose existing drugs [3].
However, this can be highly challenging due to the massive number of chemical compounds that form the candidate drugs for a newly emerged disease [9]. Hence, we need practical techniques to accurately identify which drugs among the thousands of candidates are the best options for further testing.
By now, a large amount of data has been collected on both drugs and drug targets (i.e., proteins), as well as on the interactions between them [23]. Using this data, machine learning models can replace additional laboratory testing and act as a decision-support system for the problem of Drug-Target Interaction (DTI) prediction, which can speed up drug development and hence the world's response to pandemics.
The efficacy of machine learning, particularly deep learning approaches, for DTI prediction has already been demonstrated in the literature [2,27,32,1,30,21,22]. Early models utilize techniques similar to those used for recommender systems, based on the idea that similar drugs interact with similar targets [9,10]. However, these techniques fail to cope with the "cold-start" problem, i.e., drugs or proteins that have no known interactions, which is the case in the problem at hand. Docking [14] is an alternative method in which interactions are determined by a physical simulation using spatial models of the drug and protein. Nevertheless, 3D data on either the drugs or their targets is often not available, which limits the benefit of these methods in practice [28,7].
On the other hand, state-of-the-art deep learning techniques employ various architectures, such as convolutional neural networks [2], recurrent neural networks [30], graph neural networks [15], and transformers [4], to automatically discover complex features in the input, and have been shown to be effective for DTI prediction. Although these models perform well on small specialized datasets that only deal with a single type of protein, they are greatly limited by their computational complexity and, as a result, cannot be trained on general (non-specialized) DTI datasets. Therefore, we aim to get the best of both worlds by developing a model that is both accurate and fast, and that can be trained on large amounts of DTI data instead of being restricted to a single type of protein (out of the numerous types of proteins [13]). Besides, existing work rarely uses more than one modality for the drug or the protein, although additional modalities may improve the performance, as DeepH-DTA [2] has indicated by using two modalities for the drug.
In this paper, we address these two shortcomings, namely the computational complexity and the lack of multimodality for both drug and protein representations, and propose a novel approach, called FastDTI, that improves the training time as well as the predictive performance. We employ recent developments in natural language processing and graph neural networks to create a model that can leverage different modalities of the drug and protein input, including their properties, in one model. Pretrained submodels are used to embed the sequences and/or graph representations during the pre-processing step, which reduces the computational cost of training. Additionally, the modalities of drug properties and protein properties are introduced, providing valuable information to the model for making more accurate predictions.
Our extensive empirical study shows that FastDTI outperforms state-of-the-art DTI methods on the KIBA benchmark dataset, while training about 200 times faster.

Related Work
The application of deep learning to the DTI prediction problem has significantly improved upon traditional machine learning methods [31], and it is a notable candidate for training on the large amount of available data [9]. The efficacy of deep learning is evident from many existing works [2,27,32,1,30,21,22], which employ various architectures such as convolutional neural networks [2], recurrent neural networks [30], graph neural networks [15], and transformers [4] to address the problem.
Among the state-of-the-art approaches based on deep learning, DeepH-DTA [2] is the most accurate model, primarily due to using both graph neural networks and three modalities of the input. Almost all other models leverage two modalities: one for the drug and one for the protein. For the protein, DeepH-DTA uses the protein sequence, and for the drug, both the SMILES sequence [26] and a graph representation of the chemical structure, leading to three modalities in total. In addition, it employs a sophisticated graph neural network named heterogeneous graph attention (HGAT) [25], which had not been applied to DTI prediction before. However, DeepH-DTA is limited by its computational complexity. The authors report that it would take over four days to train on the small KIBA dataset [24], even if powerful hardware were used.
Other state-of-the-art DTI prediction approaches include GraphDTA [15] and DeepDTA [16]. GraphDTA introduces the use of graph neural networks, achieving decent accuracy at the cost of high complexity. Conversely, DeepDTA [16] is a simple and fast model, which has low complexity at the cost of low accuracy. In this paper, we present an approach that is both accurate and efficient, while incorporating more than one modality for the drugs as well as for the proteins.

FastDTI
In this section, we propose a novel technique for Drug-Target Interaction (DTI) prediction, called FastDTI, that aims to achieve the performance of state-of-the-art methods while significantly decreasing their computational complexity. Hence, it makes it possible to go beyond small specialized datasets and utilize the large amounts of data that were previously inaccessible for DTI prediction problems.
To this end, we take into account two main principles to ensure both high accuracy of the predictions and low computational requirements: (i) to offload the computation to the pre-processing step whenever possible, in particular for sequence and graph embedding representations, and (ii) to incorporate additional modalities for both drugs and proteins. The former follows from the main bottleneck in the existing work, where the processing of the sequences and graphs is very expensive. We address this problem by employing pretrained models to compute embeddings of such complex representations. These embeddings can be computed during the pre-processing step, which dramatically reduces the workload during the training, without sacrificing the performance of the model.

[Figure 1: Overall network architecture of FastDTI, with an example property input {"length": 31, "mass": 36511, "pl_area": 58}.]
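The pre-processing principle described above can be illustrated with a minimal sketch. It assumes a hypothetical `CountingEncoder` that stands in for an expensive pretrained model; the class, the function names, and the toy interaction values are illustrative only and are not taken from the FastDTI implementation. The point is that the encoder runs once per unique drug or protein, rather than once per pair in every epoch.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Vector = List[float]


class CountingEncoder:
    """Hypothetical stand-in for an expensive pretrained encoder.

    It returns two cheap, deterministic features and counts how often it is
    called; a real encoder would return a learned embedding instead.
    """

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, sequence: str) -> Vector:
        self.calls += 1
        return [float(len(sequence)), float(len(set(sequence)))]


def precompute_embeddings(
    items: Iterable[str], encoder: Callable[[str], Vector]
) -> Dict[str, Vector]:
    """Run the encoder once per unique input and cache the result."""
    return {item: encoder(item) for item in set(items)}


def build_training_rows(
    pairs: List[Tuple[str, str, float]],
    drug_cache: Dict[str, Vector],
    protein_cache: Dict[str, Vector],
) -> List[Tuple[Vector, float]]:
    """Look up cached embeddings and concatenate them for each labelled pair."""
    return [(drug_cache[d] + protein_cache[p], y) for d, p, y in pairs]


# Toy (drug SMILES, protein sequence, interaction value) triples; values are made up.
pairs = [
    ("CCO", "MKT", 11.2),
    ("CCO", "MVLS", 12.1),
    ("c1ccccc1", "MKT", 10.5),
    ("c1ccccc1", "MVLS", 11.9),
]

encoder = CountingEncoder()
drug_cache = precompute_embeddings((d for d, _, _ in pairs), encoder)
protein_cache = precompute_embeddings((p for _, p, _ in pairs), encoder)
rows = build_training_rows(pairs, drug_cache, protein_cache)
```

In this toy setting the encoder is invoked once for each of the two drugs and two proteins, however many pairs or epochs follow; training then only touches the cached vectors.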
The second principle enhances the accuracy by providing additional information to the model in the form of drug and protein properties. Previous methods typically leverage only one modality for the drug and one for the protein. Instead, we add two more modalities to FastDTI; compared to DeepH-DTA [2], which incorporates three modalities, this leads to a total of five modalities. The properties used by FastDTI are each chosen because they provide valuable information for the prediction task. Some are computable, meaning that they can be derived from the protein or drug sequences, while others are obtained from measurements. The properties of both drugs and targets can be expanded further by incorporating additional attributes using expert knowledge, which is out of the scope of this work.
The overall network architecture of FastDTI is pictured in Figure 1. On the left is the component that learns a representation for the drugs, consisting of three sub-components, one for each of the drug modalities. The first modality (right) is a pretrained transformer model called ChemBERTa [5] for the SMILES sequence of the drug, which is based on the RoBERTa architecture [12] and trained on a large dataset of unlabeled SMILES sequences. The second modality (middle) is a pretrained graph neural network called Grover [20] for the graph structure of the drug, which is based on a GTransformer architecture and trained on a large dataset of unlabeled drug graphs. The third one (left) is a dense layer for the chemical properties of the drugs, which consist of their sequence length, molecular weight, and polar surface area. Similarly, on the right is the component that learns the protein representations. It leverages a pretrained model, ProtBERT [8], based on BERT [6] and trained on a large dataset of protein sequences, to process those sequences (right), and a dense layer for the protein properties (left), which include their sequence length, molecular weight, isoelectric point, subcellular location, superkingdom, and function.
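The data flow through the drug and protein components can be sketched in plain Python. This is only an illustration under stated assumptions: the layer sizes are toy values, the ReLU activation, the absence of biases, and the single hidden layer in the prediction head are all assumptions, and the pretrained embeddings are replaced by random vectors. None of the names below come from the FastDTI code.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

# Toy sizes chosen for readability; they are not the dimensions used in the paper.
D_SMILES, D_GRAPH, D_PROTEIN, D_HIDDEN = 8, 8, 8, 4
N_DRUG_PROPS, N_PROTEIN_PROPS = 3, 6


def init_dense(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weights for a dense layer with n_in inputs and n_out outputs (no bias)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(x: Vector, weights: Matrix, relu: bool = True) -> Vector:
    """Apply a dense layer, optionally followed by a ReLU activation (assumed)."""
    if len(x) != len(weights[0]):
        raise ValueError("input size does not match the layer")
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


def make_model(seed: int = 0) -> Dict[str, Matrix]:
    """Create the trainable layers; the pretrained encoders are not part of them."""
    rng = random.Random(seed)
    return {
        "drug_props": init_dense(N_DRUG_PROPS, D_HIDDEN, rng),
        "protein_props": init_dense(N_PROTEIN_PROPS, D_HIDDEN, rng),
        "drug_merge": init_dense(D_SMILES + D_GRAPH + D_HIDDEN, D_HIDDEN, rng),
        "protein_merge": init_dense(D_PROTEIN + D_HIDDEN, D_HIDDEN, rng),
        "head_hidden": init_dense(2 * D_HIDDEN, D_HIDDEN, rng),
        "head_out": init_dense(D_HIDDEN, 1, rng),
    }


def forward(model: Dict[str, Matrix], smiles_emb: Vector, graph_emb: Vector,
            drug_props: Vector, protein_emb: Vector, protein_props: Vector) -> float:
    """Combine the five modalities into one predicted interaction value."""
    # Drug component: SMILES embedding, graph embedding, and encoded properties.
    drug_in = smiles_emb + graph_emb + dense(drug_props, model["drug_props"])
    drug = dense(drug_in, model["drug_merge"])
    # Protein component: sequence embedding and encoded properties.
    protein_in = protein_emb + dense(protein_props, model["protein_props"])
    protein = dense(protein_in, model["protein_merge"])
    # Prediction component on the concatenated drug and protein representations.
    hidden = dense(drug + protein, model["head_hidden"])
    return dense(hidden, model["head_out"], relu=False)[0]


# Random vectors stand in for precomputed embeddings and scaled property values.
rng = random.Random(1)
inputs = [
    [rng.random() for _ in range(n)]
    for n in (D_SMILES, D_GRAPH, N_DRUG_PROPS, D_PROTEIN, N_PROTEIN_PROPS)
]
model = make_model()
score = forward(model, *inputs)
```

Because the sequence and graph embeddings enter as fixed vectors, only the small dense layers would need to be trained in such a setup, which is consistent with the reduced training cost the text describes.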
Once the encodings of both the proteins and the drugs from the different modalities are computed, the obtained representations are concatenated and fed into the DTI prediction component, with several

Empirical Study

Experimental Setups
In this section, we evaluate the performance of FastDTI compared to several state-of-the-art techniques for the problem of DTI prediction on both small- and large-scale data. Accordingly, we first conduct the experiments on the KIBA (Kinase Inhibitor BioActivity) dataset [24], which is a relatively small benchmark dataset for comparing DTI prediction methods, including the models that are computationally expensive and cannot be easily evaluated on larger-scale datasets. We prepare the data according to the procedure described in [10], for a fair comparison of DTI prediction models. In addition, to examine the performance of FastDTI on a larger dataset, we use the STITCH data [23], which is by far the most extensive DTI data that exists and encompasses a vast number of proteins across numerous organisms, not just humans. Table 1 summarizes the statistics of both datasets. The data is split into training (80%) and testing (20%) sets, and we employ 5-fold cross-validation on the training set to tune the hyperparameters. After model selection, most dense layers have 1024 neurons, except for the layers after the first two concatenations, which have 512.
Dropout is set to 0.3 for the KIBA dataset, but 0.0 for the STITCH data. In addition, we utilize four metrics to compare the performance of the models: (i) Mean Squared Error (MSE), (ii) Concordance Index (CI), (iii) r_m^2 [19], and (iv) seconds per epoch (for the KIBA dataset and only for training time). Furthermore, we compute the standard deviation for the CI and r_m^2 measures. The standard deviation of the MSE of most models is unfortunately not available, so it is omitted for FastDTI as well. Moreover, a selection of baseline models is made from the best-performing state-of-the-art deep learning techniques, as well as two methods that are not based on deep learning. The references to the baseline methods are indicated in Table 2.

Experimental Results

Overall Performance
In the first experiment, we compare FastDTI to the state-of-the-art DTI prediction approaches on the KIBA dataset. This is especially important because most existing work cannot train on the STITCH dataset, due to high computational requirements. However, the results on the time complexity in this experiment can be generalized to a larger dataset such as STITCH. The evaluation results for the baseline methods are taken from the values reported by their authors (which use the same setup as this experiment), except for "seconds per epoch". This metric is measured independently to ensure that the comparison is made on the same hardware, which is a cloud-based machine using an NVIDIA RTX A5000 graphics card and an AMD Ryzen Threadripper 1950X processor. Moreover, FastDTI is trained 10 times, each for 1000 epochs.
The overall performance of the different techniques in terms of the four above-mentioned measures is outlined in Table 2. The experimental results show that FastDTI performs well on the KIBA dataset and outperforms all the competitors on all three evaluated metrics. Additionally, a t-test is carried out to compare the r_m^2 of FastDTI with that of DeepH-DTA, which leads to the conclusion that FastDTI has a higher r_m^2 with p < 0.0001. Furthermore, our approach is over 200 times faster than the second most accurate model among the state-of-the-art methods, indicating that our approach manages to avoid the trade-off between accuracy and training time.

Table 2: FastDTI compared to various state-of-the-art models, as well as the traditional methods SimBoost and KronRLS. Note that seconds per epoch is not a valid metric for KronRLS and SimBoost, because they are not trained using epochs. Additionally, the authors of C-A DTI did not report the r_m^2 or make their code available for measuring the seconds per epoch.

Model           MSE     CI (std)        r_m^2 (std)     Seconds per epoch
C-A DTI [11]    0.175   0.874 (0.001)   N/A             N/A
DeepDTA [16]    0.181   0.868 (0.004)   0.711 (0.021)   162
ML-DTI [29]     0.196   0.862 (0.006)   0.727 (0.012)   25
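For reference, two of the metrics reported in Table 2 can be computed as in the sketch below. The concordance index here follows the common pairwise definition, in which pairs with equal true values are skipped and tied predictions count as one half; the text does not spell out the exact variant used, so this choice is an assumption. The r_m^2 metric [19] is omitted, and the function names and toy values are illustrative.

```python
from itertools import combinations
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs whose predictions are ordered like the true values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # pairs with equal true values carry no ordering information
        comparable += 1
        if t_i < t_j:
            # Orient the pair so that p_i belongs to the larger true value.
            p_i, p_j = p_j, p_i
        if p_i > p_j:
            concordant += 1.0
        elif p_i == p_j:
            concordant += 0.5  # tied predictions count as half (assumed convention)
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy values, not taken from KIBA.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.5, 3.0]
mse = mean_squared_error(y_true, y_pred)
ci = concordance_index(y_true, y_pred)
```

In the toy example, five of the six comparable pairs are ordered correctly, so the concordance index is 5/6, while the MSE is dominated by the single large error on the last point. The pairwise loop is quadratic in the number of samples, which is fine for a sketch but slow on large test sets.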

Performance on STITCH data
The second experiment explores how FastDTI fares on the STITCH data, which is significantly bigger than the KIBA dataset. We examine to what extent our approach is able to predict the STITCH interaction values in a reasonable time. To this end, FastDTI is trained on the STITCH data for 50 epochs and the predicted values are compared to the true values in terms of MSE.
The evaluation result shows that FastDTI achieves an MSE of 0.094 on the test set, indicating that the prediction error is small on average. To better understand this value, the predictions from our model are visually compared against the true values in Figure 2. The plot shows a random subset of 10,000 points from the STITCH data, because the full data is too large, and the scores are converted from log-odds back to probabilities for better interpretability. The figure shows that the model appears to have a harder time making predictions compared to the first experiment, which might be due to the fact that the STITCH data is more complex than the KIBA dataset. However, note that the MSE is not comparable between the KIBA and STITCH datasets, because they express the interaction strength in different units. The STITCH set is much less filtered and contains several orders of magnitude more drugs and proteins, including numerous types of proteins, while the KIBA dataset exclusively contains kinases. Nevertheless, the model performs promisingly well on this large-scale data.
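The conversion from log-odds back to probabilities mentioned above is the logistic function. A minimal sketch follows, assuming natural-logarithm log-odds; the text does not state the base or any scaling applied to the STITCH scores, and the function names are illustrative.

```python
import math


def log_odds_to_probability(score: float) -> float:
    """Map a log-odds score to a probability with the logistic function."""
    # Two branches avoid overflow in exp() for large negative or positive scores.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def probability_to_log_odds(p: float) -> float:
    """Inverse mapping, defined only for probabilities strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


probabilities = [log_odds_to_probability(s) for s in (-2.0, 0.0, 2.0)]
```

A log-odds score of 0 corresponds to a probability of 0.5, and scores that are symmetric around 0 map to probabilities that are symmetric around 0.5.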

Impact of Multimodality
In this experiment, we study the impact of multimodality to verify whether the additional modalities of drug and protein properties benefit the performance. Furthermore, the usefulness of each individual property is evaluated using the weights of the model. To do so, we compare the performance of the model with and without properties, and the effect of each property is examined from the weights that the model assigns to it during training. Hence, we use two configurations of the model for this experiment: FastDTI in its full form with five modalities, and FastDTI without drug and protein properties. Each model is trained 10 times for 50 epochs on the STITCH dataset. The weights of the properties in the trained model are then extracted by computing the sum of absolute values for each property. For properties using a one-hot encoding, the sum is taken over each dimension of the encoding as well.
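The weight-based measure described above can be sketched as follows. The sketch assumes that the weights in question belong to the dense layer that receives the properties, stored as one row per neuron and one column per input feature; the text does not specify the layer or the storage layout, so both are assumptions, and the property names, column indices, and weight values are toy examples.

```python
from typing import Dict, List, Sequence


def property_importance(
    weights: Sequence[Sequence[float]],
    columns: Dict[str, List[int]],
) -> Dict[str, float]:
    """Sum of absolute weights per property.

    `weights` has one row per neuron and one column per input feature.
    `columns` maps each property to its input columns; a one-hot encoded
    property owns several columns, and the sum runs over all of them.
    """
    return {
        name: sum(abs(row[c]) for row in weights for c in cols)
        for name, cols in columns.items()
    }


# Toy layer with two neurons and four inputs; values are made up.
toy_weights = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.4, 0.0, -0.1],
]
toy_columns = {
    "sequence_length": [0],
    "molecular_weight": [1],
    "superkingdom": [2, 3],  # one-hot encoded over two categories
}
importance = property_importance(toy_weights, toy_columns)
```

One caveat with such a measure, whatever the exact layer, is that summed absolute weights depend on the scale of each input and on the number of one-hot dimensions, so they are best read as a rough indication rather than a calibrated importance score.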
The performance in terms of average MSE over the 10 runs on the test set is 0.104 without properties and 0.094 with properties, showing a 0.010 advantage (with a standard deviation of 0.003) from incorporating the properties. In addition, Figure 3 displays the weights of the trained network per property for drugs as well as proteins. For the drug properties, the sequence length has the highest weight, followed by the molecular weight, and finally the polar surface area. Note that the sequence length for the drug is the length of the SMILES representation. Meanwhile, for the proteins, the subcellular location has the highest weight, and the superkingdom has the lowest weight by a large margin, which denotes lower importance.
The results show that the model decreases the average error by 0.010 when the additional modalities of drug and protein properties are incorporated. This indicates that using properties enhances the model's predictive performance, which is a valuable addition to DTI prediction. Nonetheless, the weights for the drug properties show some variation in their usefulness. Interestingly, the polar surface area has the lowest weight, even though it is hypothesized to be highly significant for DTI prediction because it indicates the size of the molecule. In contrast, the molecular weight has a significant effect on the model, and the sequence length even more so. Similarly, the results for the protein properties show considerable variance. The superkingdom has a relatively low weight, while subcellular location and molecular function have high weights. This potentially indicates that the superkingdom does not provide any significant information that the model can use for DTI prediction, unlike the other properties.
To summarize, it is clear that the model benefits from additional properties. However, not every property is equally useful. Future work could therefore use expert knowledge to design better properties and further enhance the performance of the model for DTI prediction.

Conclusion
We presented FastDTI, a novel approach to fast but accurate drug-target interaction prediction that leverages pretrained models to compute embeddings of the input while including additional modalities in the form of drug and protein properties. Previous models either achieve good performance at the cost of time complexity or a good training time at the cost of performance. FastDTI achieves a statistically significant improvement in performance compared to the state-of-the-art methods while being several orders of magnitude faster to train. Additionally, FastDTI is the first deep learning DTI model to be trained on the STITCH dataset, which encompasses a far wider range of proteins, allowing for a general DTI prediction model. The empirical results indicate that our approach outperforms the competitors in terms of predictive performance on the small benchmark dataset, while greatly reducing the computational cost of training.
In the future, FastDTI can be improved by further fine-tuning the pretrained models for the DTI prediction task, at the cost of extra training time, or by incorporating additional properties for the drugs and proteins. Moreover, the application of FastDTI to other related problems, such as protein-protein interaction prediction, can be explored as future work.

Table 1: Datasets used for the verification of the properties of FastDTI.