Edited on March 18 to clarify results of mean predictor approach. Thanks to Dogus Cubuk for suggesting the change.
In my previous post, I talked about getting materials data ready for training machine learningbased models. Here, I will take the next step and actually use the dataset we built previously to create predictive models.
Feature Engineering
Let’s examine the top few band gaps in our dataset:
LiH,2.981 BeH2,5.326 B9H11,2.9118 B2H5,6.3448 BH3,5.3234 B5H7,3.5551 H34C19,5.4526 H3N,4.3287 H2O,5.5175 HF,6.7187
Our goal now is to construct a set of input features that a machine learning model could use to predict these band gaps. The features should contain patterns that a machine learning algorithm could identify, if provided with several thousand training examples of how changes in those features influence band gap. For example, what is it about the alkali hydride LiH that causes its band gap to be about 3 eV, whereas the alkaline earth hydride BeH2 has a much wider gap of over 5 eV? Feature engineering is precisely where our expertise and intuition as materials scientists enters the modeling process.
These features must be characteristics we either know or can compute for any material whose band gap we wish to predict. This statement implies some important constraints: For example, we might wish to use another property such as formation energy to predict band gap, but we only know formation energies for a relatively small number of compounds. Likewise, crystal structure is extraordinarily important in governing materials behavior, but if we want to use crystal structure features in our model, we can only model materials whose crystal structures are known a priori. We could imagine using machine learning or another approach to first predict crystal structure, and then use the predicted crystal structure as input for a band gap model. Here, rather than daisychaining models, we will keep things simple and attempt to simply map a chemical formula (e.g., NaCl) directly to a value for band gap.
Let us begin with an extremely simple and naive representation of a chemical compound for our band gap model, which we will later improve upon by injecting some physical insight. We will define a material using a vector wherein the nth component of the vector represents the atomic fraction of the element having atomic number n+1 in the material. Thus, pure H would be:
[1, 0, 0, 0, …]
Beryllium hydride (BeH2) would be:
[0.67, 0, 0, 0.33, 0, 0, …]
This representation has the advantage of simplicity, but the disadvantage that it is a rather “bloated” way to express a chemical compound. Clearly, for most materials, the vast majority of the components of the composition vector will be 0. However, this approach is good enough to illustrate next steps.
from pymatgen import Composition, Element from numpy import zeros, mean # Training file containing band gaps extracted from Materials Project # created in previous blog post and linked here trainFile = open("bandgapDFT.csv","r").readlines() # Input: pymatgen Composition object # Output: length100 vector representing any chemical formula def naiveVectorize(composition): vector = zeros((MAX_Z)) for element in composition: fraction = composition.get_atomic_fraction(element) vector[element.Z  1] = fraction return(vector) # Extract materials and band gaps into lists, and construct naive feature set materials = [] bandgaps = [] naiveFeatures = [] MAX_Z = 100 # maximum length of vector to hold naive feature set for line in trainFile: split = str.split(line, ',') material = Composition(split[0]) materials.append(material) #store chemical formulas naiveFeatures.append(naiveVectorize(material)) #create features from chemical formula bandgaps.append(float(split[1])) #store numerical values of band gaps
Model Building
We now have a (naive) way of converting each material in our training set into a vector for our machine learning model to crunch, and equivalently we can express any new chemical formula with this representation for the purposes of making predictions. Time for the fun part!
You’re probably salivating with anticipation as to which amazing machine learning algorithm we are going to use to model band gap. I have some anticlimactic news: Unless you work at Google and are training machine vision systems to autonomously recognize cats in millions of images, don’t worry too much about using the latest and greatest algorithms. While you may be on the cutting edge of materials informatics, you are likely not on the cutting edge of machine learning (nor should you be; you’re a materials scientist!).
Before we construct any real models, let us establish a baseline for accuracy with a trivial approach: simply guessing the mean of the band gap distribution.
# Establish baseline accuracy by "guessing the average" of the band gap set # A good model should never do worse. baselineError = mean(abs(mean(bandgaps)  bandgaps)) print("The MAE of always guessing the average band gap is: " + str(round(baselineError, 3)) + " eV")
The mean predictor gives the following result:
The MAE of always guessing the average band gap is: 0.728 eVA sophisticated model should absolutely never do worse than this, or something is very, very wrong! With that in mind, we will start our real modeling effort with a straightforward approach; after all, in the words of Einstein, everything should be made as simple as possible, but not simpler. Let’s begin by attacking our data with a linear ridge regression model. Yes, you read that correctly: We are starting off with a linear model. Before we throw the kitchen sink at the problem and turn to more complex models, we should evaluate the efficacy of a basic approach. Indeed, linear techniques such as logistic regression candespite their simplicityperform strikingly well in comparison to more sophisticated algorithms, and offer the advantages of speed and interpretability.
Let’s see how a linear ridge regressor does in modeling band gaps using our naive feature set:
# Train linear ridge regression model using naive feature set from sklearn import linear_model, cross_validation, metrics, ensemble #alpha is a tuning parameter affecting how regression deals with collinear inputs linear = linear_model.Ridge(alpha = 0.5) cv = cross_validation.ShuffleSplit(len(bandgaps),\ n_iter=10, test_size=0.1, random_state=0) scores = cross_validation.cross_val_score(linear, naiveFeatures,\ bandgaps, cv=cv, scoring='mean_absolute_error') print("The MAE of the linear ridge regression band gap model using the naive feature set is: "\ + str(round(abs(mean(scores)), 3)) + " eV")
You will see from the code that we are using an approach called tenfold crossvalidation to determine the quality of our model. One of the most fundamental tenets of statistics and machine learning is that one cannot evaluate the quality of a model by examining the error it achieves on the data to which it was fitted. A simple way to understand this concept is to consider the fact that a polynomial regression with enough terms can perfectly fit any training data. However, the question then would be, what happens if we introduce new data and would like such an overfitted model to generalize? A polynomial regression with too many terms will almost certainly fail such a test. In this respect, machine learning models are no different.
Crossvalidation is a rigorous approach to avoiding overfitting. In nfold cross validation (in this case, n=10), we partition our training data into n equallysized chunks, fit our model to the data in n1 of those chunks, and use the resulting model to predict the heldout nth chunk. We perform this procedure n times and average the resulting error on the heldout data, reporting a chosen error statistic such as mean squared error or mean absolute error (our choice here). An overfitted model will perform poorly in crossvalidation, as it will tend to give very large errors on the heldout data.
When we perform this crossvalidation procedure using our linear ridge regression model, the linear model manages to outperform our guessthemean approach from above:
The MAE of the linear ridge regression band gap model using the naive feature set is: 0.47 eVLet’s also take a look at the fitted regression coefficients (just a few of the 100 total are shown here):
# Let's see which features are most important for the linear model print("Below are the fitted linear ridge regression coefficients for each feature (i.e., element) in our naive feature set") linear.fit(naiveFeatures, bandgaps) # fit to the whole data set; we're not doing CV here print("element: coefficient") for i in range(MAX_Z): element = Element.from_Z(i + 1) print(element.symbol + ': ' + str(linear.coef_[i]))
Running the script gives the following output (excerpted):
O: 2.28865291048 F: 3.95035733949 Cl: 2.81918301294 Ti: 0.675957059589 V: 0.704965141051Examining these coefficients reveals an intuitively satisfying result: electronegative elements such as O, Cl, and F tend to strongly increase compounds’ band gaps, whereas metallic elements such as V and Cr are generally associated with smaller band gaps.
Feature Engineering Revisited
We were honest with ourselves by calling our first approach to feature engineering naive; we can probably do better than an unwieldy 100component vector for representing materials’ compositions. Let us now construct an alternative feature set, which builds some basic chemical concepts into the resulting model. This feature set will be far more compact than our 100component composition vector. In particular, we will:

Order the two elements in the binary compound according to their atomic fraction abundance in the compound

Express the stoichiometry of a binary compound simply by the ratio of the more abundant element to the less abundant element

Calculate the difference in electronegativity between the two elements in the binary

Include the periodic table group numbers of each element in the binary
Within this feature space, BeH2 now becomes [ratio, electronegativity_difference, more_abundant_element_group, less_abundant_element_group]:
[2.0, 0.63, 1.0, 2.0]
Here is the code to create these features for each material in our training data:
# Create alternative feature set that is more physicallymotivated physicalFeatures = [] for material in materials: theseFeatures = [] fraction = [] atomicNo = [] eneg = [] group = [] for element in material: fraction.append(material.get_atomic_fraction(element)) atomicNo.append(float(element.Z)) eneg.append(element.X) group.append(float(element.group)) # We want to sort this feature set # according to which element in the binary compound is more abundant mustReverse = False if fraction[1] > fraction[0]: mustReverse = True for features in [fraction, atomicNo, eneg, group]: if mustReverse: features.reverse() theseFeatures.append(fraction[0] / fraction[1]) theseFeatures.append(eneg[0]  eneg[1]) theseFeatures.append(group[0]) theseFeatures.append(group[1]) physicalFeatures.append(theseFeatures)
We have thus “compressed” our previous representation by a factor of 25. Let’s see how well this new feature set works with our same linear ridge regression from before:
The MAE of the linear ridge regression band gap model using the naive feature set is: 0.47 eV The MAE of the linear ridge regression band gap model using the physical feature set is: 0.664 eVOur crossvalidation MAE actually increased! How can this be? Well, we have changed from a linear regression with 100 coefficients to a linear regression with only four coefficients. It appears that five degrees of freedom (four coefficients plus a constant) are too few to reasonably model band gaps with a linear model. This outcome invites the obvious next step...
Going Beyond a Linear Model: BlackBox Machine Learning
As of yet, it is not clear whether our clever new feature set is actually an improvement over the naive composition vector. Indeed, the crossvalidated MAE of our linear ridge regression got worse with the much smaller, physicallymotivated feature vector. However, a major change in approach remains untested: switching to a nonlinear model that in fact has no functional form.
For the purposes of this exercise, we are going to construct a random forestan ensemble of decision trees. It turns out that, while a single decision tree is often a poor classifier, a collection of many decision trees trained on different subsets of data can be very powerful for modeling data. Random forests have a number of tuning parameters (as with most machine learning algorithms, unfortunately), but here we highlight only the number of trees in the forest; we will choose 10 for computational expediency. Our code will thus construct 10 independent decision trees and average the band gap predictions from each of them. The code to do so is very simple:
rfr = ensemble.RandomForestRegressor(n_estimators=10) #try 10 trees in the forest
Here is the performance of a random forestbased band gap model:
The MAE of the nonlinear random forest band gap model using the naive feature set is: 0.366 eV The MAE of the nonlinear random forest band gap model using the physical feature set is: 0.265 eVWe observe that a random forestbased approach outperforms linear ridge regression with both feature sets, and the physicallymotivated fourfeature vector outperforms the naive 100component composition vector. The random forest trained on the physical features is actually decent at estimating band gaps of materials!
Conclusion
At the conclusion of this exercise, we have created a random forestbased model that can predict the band gaps of binary compounds with a crossvalidated MAE of under 0.3 eV. Not too shabby! From here, the two main levers we have to further enhance our model are: (1) adding more training data (read: email Materials Project and tell them to do more DFT calculations!) and (2) further feature engineering. Probably our simple fourfeature approach is not yet optimized. Again, this is where your background as a materials scientist is essential!
We should also put our results here in context. Machine learning is indeed a potent tool for analyzing materials data; when used properly, it can produce powerful, original insights. But it is just that: another tool in the materials scientist’s toolbox. Machine learning (or any other computational technique) is not a substitute for scientific judgment or common sense. As with all computational models, garbage in equals garbage out. However, with a combination of robust training data and insightful features, you can build very fast and very effective models of materials behaviorwithout a supercomputer, and without a computer science degree.
Have a new machine learning or statisticsbased model of materials? Developed a great feature or descriptor for modeling a particular property? Found a useful materials data set for modelbuilding purposes? Let us know—we’d love to host it on our platform, and reference it back to you!