by Dzvinka Yarish

Error Correction in SMILES Representations of Novel Molecules


Over the last several years, the application of deep learning to de novo design of chemical structures has become a promising and rewarding research field. However, according to the research by Rafael Gómez-Bombarelli, the percentage of valid molecules generated by the autoencoder model ranges from 70% down to less than 1%. Thus, to obtain a given number of new molecules, one may need to generate many times that number: at a 1% validity rate, roughly 100 candidates per valid molecule.
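The arithmetic behind overgeneration can be sketched in a few lines. This is only an illustration: the function name `candidates_needed` and the example rates are our assumptions, and the estimate treats every decoded string as independently valid with the same probability.

```python
import math

def candidates_needed(target_valid: int, validity_rate: float) -> int:
    """Expected number of decoded strings needed to get `target_valid` valid ones.

    Assumes each decoded string is valid independently with probability
    `validity_rate` (a simplification made for this sketch).
    """
    if not 0.0 < validity_rate <= 1.0:
        raise ValueError("validity_rate must be in (0, 1]")
    return math.ceil(target_valid / validity_rate)

# Illustrative rates only, spanning the range mentioned in the text.
for rate in (0.70, 0.10, 0.01):
    needed = candidates_needed(1000, rate)
    print(f"validity {rate:.0%}: about {needed} candidates for 1000 valid molecules")
```

The lower the validity rate, the more of the decoder's output is discarded, which is why raising that rate matters.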


In the majority of research settings, known molecules with desired properties are used as the starting point of the search, and their number is often limited. Therefore, to exploit the novel structure generation pipeline efficiently, increasing the share of output SMILES (Simplified Molecular Input Line Entry System) strings that encode valid molecules becomes a requirement.

José Miguel Hernández-Lobato has made a successful attempt to tackle the issue by replacing the regular Variational Autoencoder (VAE) with the Grammar VAE. The main idea is to convert SMILES strings into parse trees of a predefined context-free grammar and train the model on them. While this setting substantially increases the number of valid SMILES strings in the output, it does not catch errors that can only be identified with context.
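One illustration of a context-dependent error (our example, not necessarily the case the Grammar VAE authors focus on) is an unmatched ring-closure label: in `c1ccccc` the ring opened by `1` is never closed, and spotting that requires remembering which labels are still open. The checker below is a deliberately simplified sketch. The function name is hypothetical, it only looks at ring-closure labels, and it is not a SMILES validator.

```python
from typing import Set

def unmatched_ring_closures(smiles: str) -> Set[str]:
    """Return ring-closure labels that were opened but never closed.

    Simplified on purpose: handles single-digit labels and two-digit
    `%nn` labels, skips digits inside bracket atoms such as [13CH4],
    and checks nothing else about the string.
    """
    open_labels: Set[str] = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            label = None
            two = smiles[i + 1 : i + 3]
            if ch.isdigit():
                label = ch
            elif ch == "%" and len(two) == 2 and two.isdigit():
                label = "%" + two
                i += 2  # skip the two digits of the label
            if label is not None:
                # First occurrence opens the ring, second one closes it.
                open_labels ^= {label}
        i += 1
    return open_labels

for s in ("c1ccccc1", "c1ccccc", "C1CC2CCC1"):
    print(s, sorted(unmatched_ring_closures(s)))
```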


In our research, we propose to increase the number of correct SMILES strings in the autoencoder output by adding a separate recurrent seq2seq neural network with an attention mechanism to the generation pipeline. We have developed two models with the same architecture and trained them on data containing different types of errors in SMILES strings:

  • On errors made by our Autoencoder while producing new molecules
  • On random errors, making use of the denoising Autoencoder approach
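For the second kind of training data, pairs of (corrupted, clean) strings can be produced by injecting random character-level noise into known SMILES strings. The sketch below shows one way this could look. The alphabet, the three edit operations, and the function names are assumptions made for illustration, not a description of the exact noise used in our experiments.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical character alphabet for illustration; a real vocabulary
# would be collected from the training data.
ALPHABET = list("CNOSFclnos()[]=#123456")

def corrupt(smiles: str, n_errors: int = 1,
            rng: Optional[random.Random] = None) -> str:
    """Apply `n_errors` random edits: insert, delete, or substitute a character."""
    rng = rng or random.Random()
    chars = list(smiles)
    for _ in range(n_errors):
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert" or not chars:
            chars.insert(rng.randint(0, len(chars)), rng.choice(ALPHABET))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            # May pick the same character again, leaving the string unchanged.
            chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return "".join(chars)

def make_training_pairs(clean: Sequence[str], n_errors: int = 1,
                        seed: int = 0) -> List[Tuple[str, str]]:
    """Build (noisy input, clean target) pairs for a denoising seq2seq model."""
    rng = random.Random(seed)
    return [(corrupt(s, n_errors, rng), s) for s in clean]

pairs = make_training_pairs(["CCO", "c1ccccc1", "CC(=O)O"])
for noisy, target in pairs:
    print(f"{noisy!r} -> {target!r}")
```

Random noise of this kind is cheap to generate in any quantity, while errors collected from the autoencoder itself reflect the mistakes the corrector will actually see.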

We report a 20 percent increase in the number of valid SMILES strings after applying our error-correction models. Furthermore, 67 percent of all unique generated SMILES strings (those not present in the PubChem and ChEMBL databases of chemical structures) were obtained via error correction. Hence, the error-correction models do not simply map every invalid SMILES string to the closest valid one seen during training. At the same time, the distributions of various molecular descriptors for the corrected SMILES strings stay close to those calculated for the input structures.
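How figures of this kind can be tallied is sketched below. Everything here is a stand-in: `toy_is_valid` only checks that parentheses balance, `toy_correct` only appends missing closing parentheses, and `known` is a tiny made-up set. In practice the validity check would come from a cheminformatics toolkit parser, the corrector would be the trained seq2seq model, and the known set would come from databases such as PubChem and ChEMBL.

```python
from typing import Callable, Dict, Iterable, List, Set

def tally(generated: Iterable[str],
          is_valid: Callable[[str], bool],
          correct: Callable[[str], str],
          known: Set[str]) -> Dict[str, float]:
    """Count valid strings before and after correction, and the share of
    novel unique strings that only appear thanks to correction."""
    valid_before: List[str] = []
    rescued: List[str] = []
    for s in generated:
        if is_valid(s):
            valid_before.append(s)
        else:
            fixed = correct(s)
            if is_valid(fixed):
                rescued.append(fixed)
    novel_direct = set(valid_before) - known
    novel_rescued = set(rescued) - known - novel_direct
    total_novel = len(novel_direct) + len(novel_rescued)
    return {
        "valid_before": len(valid_before),
        "valid_after": len(valid_before) + len(rescued),
        "novel_share_from_correction":
            len(novel_rescued) / total_novel if total_novel else 0.0,
    }

# Toy stand-ins (assumptions for this sketch only).
def toy_is_valid(s: str) -> bool:
    return s.count("(") == s.count(")")

def toy_correct(s: str) -> str:
    return s + ")" * max(0, s.count("(") - s.count(")"))

stats = tally(["CCO", "CC(=O", "CC(C", "CCO"], toy_is_valid, toy_correct, known={"CCO"})
print(stats)
```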


This field of research is still developing, yet the results so far are promising. The process of finding a new treatment or cure consists of two stages, the first of which is identifying drug candidates. Since medications often take 10-15 years to create and produce, speeding up this first stage could dramatically decrease the time spent on producing a new medicine. AI speeds up the discovery process, which could revolutionize pharmaceutical companies in the long run.
