The design of chemical formulations is a challenging, high-dimensional problem. Tens of thousands of ingredients are typically available for use, yet only a tiny fraction end up in any given formulation. Deformulation, the problem of reverse engineering the precise amount of each ingredient starting from just a list of ingredients, is similarly challenging but is a key capability for staying up-to-date with industry competitors. Here, we take advantage of a large, curated formulations dataset from CAS, a division of the American Chemical Society, which offers a consistent and highly structured representation of formulations and of the chemical identities of their components. Using this dataset, we show that a variational autoencoder neural network learns meaningful representations of formulations in various product classes, such as antiperspirants and oral care. Furthermore, the model can be used in conjunction with a two-step sampling algorithm to generate accurate ingredient amount suggestions for deformulation. Deformulation using a variational autoencoder produces estimates that are significantly more accurate than those of nearest neighbor methods, extrapolates better to formulations that differ significantly from previously seen formulations, and provides a way to leverage large datasets for industrially relevant capabilities.
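To make the deformulation task and the nearest neighbor baseline mentioned above concrete, a minimal sketch follows. It is an illustration only and not the method used in the paper: the Jaccard similarity measure, the fallback rule for ingredients missing from the neighbor, the function names, and the example data are all assumptions introduced here.

```python
from typing import Dict, List

# Hypothetical representation: ingredient name -> weight fraction (sums to 1).
Formulation = Dict[str, float]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two ingredient sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbor_deformulate(
    ingredients: List[str], known: List[Formulation]
) -> Formulation:
    """Estimate amounts for `ingredients` from the most similar known formulation.

    `known` must be non-empty. Similarity is measured on ingredient sets only.
    """
    query = set(ingredients)
    # Pick the known formulation whose ingredient set overlaps most with the query.
    best = max(known, key=lambda f: jaccard(query, set(f)))

    # Copy amounts for ingredients that also appear in the neighbor.
    shared = {name: best[name] for name in ingredients if name in best}
    # Assumption: ingredients absent from the neighbor get the mean copied amount.
    fallback = sum(shared.values()) / len(shared) if shared else 1.0
    raw = {name: shared.get(name, fallback) for name in ingredients}

    # Renormalize so the estimated fractions sum to 1.
    total = sum(raw.values())
    if total == 0:
        return {name: 1.0 / len(raw) for name in raw}
    return {name: amount / total for name, amount in raw.items()}


# Made-up example data, for illustration only.
known_formulations = [
    {"water": 0.7, "glycerin": 0.2, "fragrance": 0.1},
    {"ethanol": 0.9, "fragrance": 0.1},
]
estimate = nearest_neighbor_deformulate(
    ["water", "glycerin", "menthol"], known_formulations
)
```

A baseline of this kind can only copy amounts from formulations it has already seen, which is one way to read the abstract's claim that the variational autoencoder extrapolates better to formulations unlike those in the dataset.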
Sevgen, Emre, Edward Kim, Brendan Folie, Ventura Rivera, Jason Koeller, Emily Rosenthal, Andrea Jacobs, and Julia Ling. “Toward Predictive Chemical Deformulation Enabled by Deep Generative Neural Networks.” Industrial & Engineering Chemistry Research 60, no. 39 (October 6, 2021): 14176–84. https://doi.org/10.1021/acs.iecr.1c00634.