In this episode, Dr. Bryce Meredig and Dr. Wolverton discuss:
- The evolution of Dr. Wolverton’s research and his group’s focus on computational materials modeling and machine learning.
- The challenges and opportunities for computational methods and informatics to accelerate new materials discovery.
- The different methods and tools the Wolverton Group develops to assist in materials research and development.
- Applications of machine learning to materials research.
- The prospects of machine learning and data-driven methods to explain new physics and chemistry.
“I think of Materials Informatics as the application of data-driven tools to solve problems in materials science and engineering. The advent of the field and why we can define it now is because of data.” — Dr. Christopher Wolverton
Dr. Christopher Wolverton is the Jerome B. Cohen Professor of Materials Science and Engineering at Northwestern University. Before joining the faculty, he worked at the Research and Innovation Center at Ford Motor Company, where he was group leader for the Hydrogen Storage and Nanoscale Modeling Group. He received his BS in physics from the University of Texas at Austin, his PhD in physics from the University of California, Berkeley, and performed postdoctoral work at the National Renewable Energy Laboratory (NREL). His research interests include computational studies of a variety of energy-efficient and environmentally friendly materials via first-principles atomistic calculations, high-throughput and machine learning tools to accelerate materials discovery, and “multiscale” methodologies for linking atomistic and microstructural scales. He is a Fellow of the American Physical Society.
Bryce Meredig: Welcome to DataLab, a Materials Informatics podcast with Bryce Meredig, Chief Science Officer at Citrine Informatics.
Bryce Meredig: In this episode of DataLab we talk with Dr. Chris Wolverton, the Jerome B. Cohen Professor of Materials Science and Engineering at Northwestern University. We discuss the work of Chris’s research group at Northwestern, one of the first computational materials research groups to focus on machine learning, and the impact of machine learning on the future of materials research and development.
Bryce Meredig: Chris, thanks so much for joining us here.
Chris Wolverton: Sure thing. Thanks for having me.
Bryce Meredig: Well I’d like to start by just talking in a little more detail about your background, and of course, we have a long-standing personal connection. I was one of your PhD students, and started in your group in the fall of 2007, so we’ve known each other for over 10 years. That was shortly after you joined Northwestern, and I don’t know where I would fall in your academic genealogy, because there’s some question about whether I would be the 1.5th student, or second student in your group. I’m not sure how we would do that accounting, or if that’s something you’ve ever thought about, but certainly one of the earlier ones. It’s been really interesting to see the work in the group evolve over that period of time.
Bryce Meredig: Could you share with us a little bit more about the kinds of research questions that you’re interested in right now?
Chris Wolverton: Sure. Well, maybe a little bit about the evolution of that topic. When I came to Northwestern I had just come from Ford Motor Company, I had spent almost a decade there working on automotive related problems, also using computational tools. And roughly speaking, I spent about half of my time there, during the first half, focused on aluminum alloys, and specifically aluminum alloy castings, engine blocks, and cylinder heads, and trying to use computational tools to advance our processing and performance of those alloys.
Chris Wolverton: And then in the second half of my time there I got much more interested in hydrogen storage and was really focused on the problem of trying to find new hydride materials that would store hydrogen and release it under the right conditions, and so forth. This was really much more of a materials discovery problem, because we didn’t really know what hydrogen storage material we wanted. That really got me interested in the area of materials discovery, and so when I came to Northwestern, suddenly I realized I wasn’t constrained to work on only automotive problems and realized there are a lot of materials discovery problems that need solving.
Chris Wolverton: Today I’m very much interested in energy related problems. Almost everything that we do is somehow energy related, so we have a lot of work on lithium battery materials. A fair amount of work we’ve been doing over the years on thermoelectric materials as well. That’s another materials discovery area, or materials design area.
Chris Wolverton: As you know, because you sort of started our group in this area, we’re very much interested in materials informatics, and data-driven approaches these days. Both energy related problems, and the generic materials problems.
Bryce Meredig: Going back to the comment you made about hydrogen storage as a materials discovery area, just the other day I was driving behind a Toyota Mirai, so we have hydrogen cars out in production right now. It seems to be making a bit of a comeback. What’s your thinking on the current state of hydrogen storage? Is that something you’ve kept up on?
Chris Wolverton: I haven’t kept up on that too much, to be honest. I mean, I did work on hydrogen storage for a number of years after coming to Northwestern still. I do think it’s actually a fascinating research problem. Scientifically it’s a very, very interesting problem.
Chris Wolverton: And maybe it serves as a good example of the kinds of problems we want to solve in general that … hydrogen storage is one of these areas where we know almost exactly what we want out of the material. It has to store a certain amount of hydrogen by weight, and by volume to be competitive with gasoline. It has to come out at the right temperature, the material has to release hydrogen at the right temperature. You have to be able to get it in again at the right pressure. And you want it to be cheap, and fast, and obviously shouldn’t be dangerous or have safety problems.
Chris Wolverton: We know exactly the properties we want out of the material, the only thing else we really agree on in that field is that there is no such material. We don’t have that material that has all of those properties simultaneously.
Chris Wolverton: It’s a good example of the kinds of things that come up in a lot of fields, where we have this set of constraints, and we want to satisfy all of them, or maybe we have to satisfy all of them, in one material. Therein lies the difficulty of materials discovery.
Chris Wolverton: Specifically, on the issue of how things have kept up on the hydride front, I think, to be honest, the auto companies never really lost interest, even when the funding agencies did, a little bit. But it does seem like now the automakers are really thinking about redesigning the vehicle so that they can get away with compressed hydrogen. The reason is that it’s very difficult to find the material that has all of those attributes.
Chris Wolverton: As far as I know, the Mirai and others still have compressed hydrogen.
Bryce Meredig: Yeah, I think that’s right. You mentioned the challenges associated with materials discovery, specifically this notion of trying to identify materials that meet potentially many conflicting targets at the same time. And of course, your group is one that applies a very broad suite of computational tools to these questions.
Bryce Meredig: Can you share a little bit more about the kinds of methods your group uses, and why each one plays a role in the research that you do?
Chris Wolverton: Sure. I can try. My group, as you say, is entirely computational. We don’t have any experiments, we don’t have a laboratory, except a laboratory that houses computers. The tools that we use are almost entirely atomistic tools. We have atoms in almost all of our simulations. That means necessarily that the length scale of the problems is atomistic. We tend to use things like density functional theory as our real workhorse. That’s the main tool that I would say everyone in the group uses, definitely.
Chris Wolverton: And then we use other tools as the need arises. We do a lot of calculations of phase stability and phase diagrams, thermodynamic quantities. For things like that you need to be able to calculate thermodynamic functions, like entropy, and that’s sometimes difficult with something like density functional theory, because it’s a zero-temperature theory. So, we rely on other tools to deal with the thermodynamics, like cluster expansions, and Monte Carlo, and things like that.
Chris Wolverton: Then the other new suite of tools that we’re really starting to use more often nowadays are the data-driven tools: machine learning, and neural nets, and things like that.
Bryce Meredig: Why, in your view, are those data-driven tools interesting from a research standpoint? What kinds of questions do they help address that, for example, density functional theory alone might not? Or how do they fit into the overall tool set?
Chris Wolverton: I think that the advent of that goes back a little bit to, at least my thinking on that, goes back to high-throughput density functional theory. Say over the past decade or so, maybe even a little bit longer than that, people have been putting together these high-throughput DFT databases. Most notably things like the Materials Project, and AFLOW out of Duke University. In our group we’ve developed this Open Quantum Materials Database, or OQMD.
Chris Wolverton: These databases came about because, well DFT goes faster, but computer power just became much more readily available. In some sense faster, but also there was just more of it. Because of that, it was just very easy for us. Where before we could maybe do one, or ten, or a dozen calculations on a dozen different materials, now suddenly we could do thousands, or even hundreds of thousands of calculations.
Chris Wolverton: That started making people think about, “Well, why don’t we just do calculations of every known inorganic material, store it in a database, and then we can be done with that particular type of calculation.” And then the next thing you have to ask is how many inorganic materials are there, and roughly speaking, there’s about 100,000. We can debate about what the exact number is, but it’s around that.
Chris Wolverton: That means it is possible to do these kinds of calculations, for most of the known materials at least, and store them. Those data sets sort of gave rise, I think in a lot of senses, to these data-driven ideas, because suddenly there were data sets out there that were big enough to do interesting things with. For example, take some work that you did early on. We could take a data set like that and actually look at the energetics of all the compounds in the data set, and see whether we could train a machine learning model that would actually connect chemical composition with the energy of the compound.
Chris Wolverton: In other words, if we gave a machine learning model enough examples of chemical composition and energy, could we eventually teach it to make that connection, and then predict for a new chemical composition that we’ve never told it about, what is the energy of that likely to be?
Chris Wolverton: If you can do that, and it turns out you can to some fidelity, the exciting thing is now you can actually use that machine learning model. And you can scan over, not just hundreds of thousands of compositions, but millions or even billions of compositions because these things are so much faster than DFT.
Chris Wolverton: Now suddenly you can screen over large parts of composition space that we never even knew how to explore before, because we don’t even know if there are compounds in those regions. I think that’s one of the areas that this kind of data-driven approach has enabled. It’s sort of just enabled us to screen things faster, and that’s just allowed us to go bigger and wider in our search space.
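To make the screening idea concrete, here is a minimal sketch under stated assumptions. It is not the group's actual workflow: the elements `A`, `B`, `C`, the function `toy_energy` (a stand-in for an expensive DFT calculation, with invented pair interactions), and the nearest-neighbour surrogate are all hypothetical choices made for illustration. A small set of "expensive" labels trains the surrogate, which is then used to scan a much finer composition grid.

```python
import math

ELEMENTS = ("A", "B", "C")  # hypothetical placeholder elements


def toy_energy(x):
    """Stand-in for an expensive DFT calculation (invented pair interactions)."""
    a, b, c = x
    return -1.0 * a * b - 0.5 * b * c + 0.2 * a * c


def grid(steps):
    """All compositions (a, b, c) on a regular grid with a + b + c = 1."""
    points = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j
            points.append((i / steps, j / steps, k / steps))
    return points


def knn_predict(train, x, k=3):
    """Inverse-distance-weighted mean of the k nearest training energies."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    num = den = 0.0
    for xi, energy in nearest:
        d = math.dist(xi, x)
        if d == 0.0:          # exact match with a training composition
            return energy
        num += energy / d
        den += 1.0 / d
    return num / den


# Small training set: the only place the "expensive" function is called.
train = [(x, toy_energy(x)) for x in grid(5)]

# Screening: a much finer grid, evaluated with the cheap surrogate only.
candidates = grid(50)
best = min(candidates, key=lambda x: knn_predict(train, x))
print(dict(zip(ELEMENTS, best)))
```

The surrogate here is deliberately crude; the point is only the division of labour between a few costly labels and a large, cheap scan. A real study would use richer composition descriptors and a stronger model.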
Bryce Meredig: Yeah, you mentioned the work that we did on training machine learning to predict the thermodynamics of compounds. One of the things that I found so interesting about that work was you could see how the machine learning models were figuring out simple—in some cases simple, in some cases more subtle—chemical rules that make sense. For example, the models learned right away that bonds between fluorine or oxygen and metals are very strong and tend to lead to very stable compounds.
Bryce Meredig: Right now, in the materials informatics field, I would say there are many people interested in this idea of to what extent can machine learning learn physics or learn chemistry.
Bryce Meredig: Have you seen any interesting work on those lines? Or what do you think the prospects are for that?
Chris Wolverton: I think that’s a super important question, super interesting question, and a lot of people, like you say, really want to know the answer. Can machine learning actually teach us something, not just be predictive?
Chris Wolverton: My group actually hasn’t done a whole lot of work in that area. Partially for reasons I’ll tell you in just a moment. And not because I don’t think the question is interesting, but I have to admit that I haven’t seen a lot of examples that I can really point to and say, “Yes. We’ve definitely learned some new physics from that.” I would be very interested to know if you have examples.
Chris Wolverton: But let me tell you the thing that’s happened more than once in my group. When we’ve tried sometimes to actually learn physical rules from these machine learning models, for example we might train a machine learning model with certain descriptors, and then we look and see which are the most important descriptors to describe this property. And then we sit around and brainstorm and decide, yeah, we can actually, retrospectively, we can figure out why that descriptor was actually the most important, and this did really teach us something, which is kind of cool.
Chris Wolverton: Then the next day, the student comes back and says, “Oh, sorry to tell you, I made a mistake. And in fact, I had to retrain the machine learning model and now actually it’s a different descriptor that’s actually more important.” Then we brainstorm again and come up with a new idea.
Chris Wolverton: Because of this I’m a little pessimistic, I guess, about this. And I think that the danger, or the difficulty, I guess, is that I think these machine learning models are very, very good at determining correlation. That’s what they’re supposed to do. They’re finding correlations, hidden correlations maybe, in these data sets.
Chris Wolverton: But, as you know, correlation is not causation. I think that often times we fall victim to this. We see the correlation, and we assume that that’s the physical cause, and then we invent a physical reason for that correlation.
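The correlation-versus-causation point can be shown with a toy example. In this sketch all data are invented: the target property is constructed to depend only on `descriptor_a`, while `descriptor_b` has no causal role and merely tracks `descriptor_a`. Both still correlate strongly with the target, which is why a descriptor ranking alone cannot say which one is the physical cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented toy data: the property depends ONLY on descriptor_a.
descriptor_a = [float(i) for i in range(10)]
# descriptor_b plays no causal role; it just follows descriptor_a with a wobble.
descriptor_b = [a + (0.5 if i % 2 == 0 else -0.5)
                for i, a in enumerate(descriptor_a)]
target = [2.0 * a for a in descriptor_a]

corr_a = pearson(descriptor_a, target)
corr_b = pearson(descriptor_b, target)
print(f"causal descriptor:     r = {corr_a:.3f}")
print(f"non-causal descriptor: r = {corr_b:.3f}")
```

With descriptors this entangled, a small change in the training data or the model could plausibly swap which one looks "most important", much like the retraining anecdote above.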
Chris Wolverton: That’s how I think about this issue. I think it’s super important, but very difficult because of this issue of correlation and causation. So, I will throw this back at you. I’m interested to know, can you point to examples where you think machine learning has really taught us some physical rules?
Bryce Meredig: I actually tend to agree with what you said there, that this is a question that everyone is asking right now, and there are very few compelling answers to, or examples. At least, as of right now.
Bryce Meredig: I think a lot of times this discussion gets caught up in the overall question of can machine learning satisfactorily explain its predictions to us in a way that makes sense to a domain expert? And of course, I think there are good reasons why we might want machine learning to be able to do that, especially when … you know, in the case of sequential learning, or active learning, for example. Scientists are going into the lab and potentially dedicating a significant amount of time and money to pursuing the predictions from a machine learning model, and they want to have some sort of intuitive grasp for why the model is making the predictions that it is making.
Bryce Meredig: But then you can turn that around and you can say, “Well, if I as a human user can’t understand the machine learning predictions, does that mean that they’re wrong?” I think that there’s this intense desire that I’ve seen on the part of people in the community to be able to look inside the black box and try to rationalize the predictions that are coming out. But I certainly haven’t heard many examples where we have done that in a convincing way.
Bryce Meredig: I also haven’t heard broad agreement as far as what explainability, or what learning physics from machine learning would actually be.
Chris Wolverton: Yeah. I have seen, I think, kind of an evolution in the field. Back in the days when you were still a graduate student and working on this problem, so the early days of this field, I got this question a lot, and actually I got a lot of push back when I would talk about this. There were a lot of people who felt like if machine learning doesn’t teach you physics, and you can’t interpret the results, then it’s useless.
Chris Wolverton: In those days I would push back on that argument and say, “Well, if machine learning is predictive, it’s predicting new materials. Even if I can’t tell you why, it’s still valuable.” I have to say, I think more people have come around to that way of thinking. I very rarely get this question anymore saying, “If you can’t explain these results, then it’s worthless.”
Chris Wolverton: I think part of that is that people have actually realized that this is a much, much harder problem than we thought, the interpretability of machine learning data.
Bryce Meredig: That’s right. I’ve had the same experience, empirically, with how often referee reports for a paper that we submit contain critiques like, “This isn’t really science,” or, “This isn’t really physics,” which I think you and I have both seen and heard in the past. I think that’s becoming much rarer as these tools become much more mainstream in the community. So, maybe it is, to your point, a case where as long as we’re realistic about what these tools are giving us, and what they’re not, they can play a very important role in the materials scientist’s tool belt.
Bryce Meredig: But, we also-
Chris Wolverton: Right.
Bryce Meredig: … need to be cautious to not have unrealistic expectations for them.
Bryce Meredig: A lot of it comes down to education and training, in our experience, where if we’re realistic with people about what can machine learning do, what can’t it do, then most reasonable scientists are able to slot it in and see how it can help with their work. But it’s certainly not the Hollywood AI, let me sit down at the computer and have it synthesize a new material for me. We’re very, very far away from that right now.
Chris Wolverton: Right, right. I oftentimes thought it was funny that people insisted on this issue of interpretation. If I don’t have any idea how Google works, and I can’t explain how Google works, does that mean I’m not going to use the search results I get from Google?
Chris Wolverton: I mean, probably I’m still going to use them. If they’re very good at finding the data that I want, it doesn’t really matter to me, actually, whether Google can explain how it works or not. Of course, the people at Google can probably explain how it works.
Bryce Meredig: Well I think another instance of that would be a self-driving car. Now admittedly, I think the future of self-driving cars maybe we’re finding out is a little bit further away than we thought, but my hunch is that people are going to be happy to drive these cars, even if the cars don’t output detailed analytics about how they’re making decisions on the road. People will just generally build trust in them and start to focus on other things while they’re in the car.
Chris Wolverton: Exactly. I think that’s an excellent analogy.
Bryce Meredig: You mentioned a couple times that some of the work that we did together was early on in the materials informatics boom, I think, that we’re seeing right now.
Chris Wolverton: Yeah.
Bryce Meredig: I wanted to get your opinion on just terminology, which is how do you think about the definition of materials informatics, or these data-driven methods, machine learning? How does this all fit together? Are these words synonyms? Are they interchangeable? How would you explain that to somebody who is, let’s say, not a computationalist, is coming into this without a lot of background?
Chris Wolverton: Okay, it’s a fair question. Admittedly, I don’t spend a lot of time thinking about things like that, but, I think, yeah, it would be hard for me to actually answer what’s the definition of materials informatics without using the phrase data-driven.
Chris Wolverton: I mean, I think of materials informatics as the application of data-driven tools to solve problems in materials science and engineering. In some sense I think that the advent of the field, and the reason why we even can define the word now, is because of data. Because we have more data now than we used to, and even in cases where we don’t have more data, it’s collected and classified in ways that are searchable and analyzable. That’s also very helpful.
Chris Wolverton: I think the advent of data and specifically data in collected databases has really made the field come about, but that’s generally speaking, I guess I would say that the materials informatics is, again, all about data-driven tools, and the data-driven tools came about because now suddenly we have data. That’s the short answer.
Bryce Meredig: Do you think there’s a real change going on right now in the materials research field in general with respect to people’s attitudes towards data? I know that the ideas of publishing research data, open data, open access, these have been conversation topics for a long time. In your view, is there a real shift going on right now, or are we still in the mode of imagining what an ideal future might look like?
Chris Wolverton: Oh, no. I would say the answer to that question is absolutely, yes. I think it’s undebatable that that’s yes. It’s been a number of years, for example, when funding agencies would ask for things like data management plans, and things like this. I think whenever this started five, six, seven years ago, people might roll their eyes when they said, “Oh, I have to come up with a data management plan.”
Chris Wolverton: I think now people understand, actually, why that’s really important, and why just having their data in lab notebooks, or published in figures in papers, is not enough. I think generally people understand that.
Chris Wolverton: The solutions are not entirely clear, like how we get around that, but I definitely think that there’s been a huge shift in the sense that people understand that this is important, and this is likely to be the future of these kinds of approaches.
Bryce Meredig: When you think about the materials informatics field itself, and of course this is an area that you’ve been working in for eight plus years now, for quite a while. How have you seen the community in general change, or the kinds of problems that people are working on change? How has it evolved since these early days? And in your view, are there interesting domains that are just opening up now, and directions that are just opening up where people are starting to demonstrate some pretty cool results?
Chris Wolverton: Yeah. Yes to all of those questions. In the early days, there were certainly people working in this field before me, so I can’t say about the first applications. But the first applications from my group we did relatively simple things. We had these large DFT databases that we were training, and trying to predict relatively straightforward and simple materials properties. Things like formation energy, or elastic constants, or band gap, or things like that.
Chris Wolverton: These were simple in the sense that they’re simple functions of composition. If I specify the composition, then in principle you can tell me the formation energy or the band gap. I think one of the things that’s evolved is that people have realized that there’s no reason we have to focus on these relatively simple materials properties, we can train machine learning models on much more complicated properties. Things that have to do with microstructure, or mechanical properties, or bulk metallic glasses, or things like this.
Chris Wolverton: These are materials properties that are much, much more difficult to compute. They might not even be simple functions of composition. For example, if I tell you a composition of an alloy, and then ask you, “Tell me the mechanical properties of this alloy.” You can’t do it, because I have to tell you more information. I have to tell you how the alloy was processed, and heat treated, and all of these things.
Chris Wolverton: It’s not just a simple function of composition, and so these are inherently much more complicated properties. But there’s no real reason, if you have sufficient data, and that’s a big if, that you can’t train machine learning models of these more complicated properties.
Chris Wolverton: I’ve seen many more examples of those kinds of things come about, and also just the size of the field. It’s growing so fast. New people come into the field, and new people have new ideas. It just seems like every day, or every week, you see a new paper on some new topic. People applying machine learning, or some kind of learning approach, to a new problem that you’ve never thought of before.
Bryce Meredig: It certainly is, I think, clear that there’s a massive boom of interest going on right now in materials informatics, and these data-driven approaches in the field. Do you think we’re at risk at all of falling prey to too much hype in this field, and the expectations starting to outstrip what’s realistic today?
Bryce Meredig: I know that’s something when I look around at the volume of papers being published, the amount of funding going into the field. I do become concerned that we’re losing sight of what’s realistic, at least in the near term.
Chris Wolverton: Sure. Yeah, I agree with you about that. I mean to kind of conflate this question with the one you asked before, in the early days the hype was zero. Not only was there no hype, but when you and I submitted our first paper on this we would get referee reports back saying, “Why are you doing this? Why would anyone ever want to do machine learning on materials science? It makes no sense.”
Chris Wolverton: In those days the hype was zero, and so definitely we’ve seen that evolve, and the hype has grown. I would probably say that today the hype is higher than the reality. I mean, on one hand yes, I think you’re right. We have to be concerned about that, and we have to keep based in reality, and keep our minds on what’s realistic in this field.
Chris Wolverton: On the other hand I’m not terribly worried about it in the sense that I think this is kind of a normal progression in a field that starts exploding. The hype always goes up faster than the actual advance in the field. I think this will correct at some point in the future, because I think the field is advancing, and it is advancing actually rapidly. Actual results that we’re able to achieve are advancing pretty rapidly.
Chris Wolverton: Once some of the unrealistic hype starts to decay a little bit, I think the reality will catch up. I think that’s a normal progression.
Bryce Meredig: Yeah, I think that’s definitely true. And I would say, from my perspective, that the best way to show people what is possible, and get more people excited about the possibilities that are realistic today, is compelling case studies where the predictions, let’s say from materials informatics, are borne out in the laboratory. I think some of the most exciting examples on those lines have come out of your group, and I wanted to talk specifically about the bulk metallic glass work that you published recently in collaboration with SLAC and folks, a few other places, NIST, and elsewhere.
Bryce Meredig: Could you tell us a bit about that work, and why you view it as so exciting? I mean, certainly it got picked up by a number of mainstream news outlets in addition to the scientific media.
Chris Wolverton: Right, right. Yeah. I think that is exciting work. The advent I guess of that story, at least from my perspective, is there was another graduate student in my group, Logan Ward, who actually came to the group having worked before, he was at Ohio State before, and he had worked on metallic glasses. Very interested in metallic glasses.
Chris Wolverton: I think he was very disappointed-
Bryce Meredig: That was his master’s thesis, right?
Chris Wolverton: Yeah, that’s right. That’s right. I think he was very disappointed to learn that I really didn’t know very much about metallic glasses, and didn’t care about them very much. I think he was really secretly hoping that he could convince me to let him work on that.
Chris Wolverton: Instead he worked on machine learning, as you know, and these kind of informatics approaches. But he couldn’t get rid of the love for metallic glasses, and so towards the end of his degree said, “Well, we don’t really have a way of predicting these … whether a certain composition is going to form a metallic glass from DFT, at least not easily, but there’s no need to train machine learning models on DFT, we could train it on any data source we have.”
Chris Wolverton: He went to a simple handbook, looked up the Landolt-Börnstein handbook, and found that there are thousands of examples of compositions where people have tried to make metallic glasses. Sometimes they did, and sometimes they didn’t. He trained a machine learning model to this data, entirely experimental data, entirely empirical, and built kind of a simple classification model. You give it a chemical composition, and then you ask, “Is a glass likely to form for that composition?” From there, he was able to actually predict, not only chemical compositions where glasses would form, but more specifically alloy systems, composed, say, of three, or four, or five elements, where metallic glasses are likely to form.
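A classifier of this kind can be sketched in a few lines. Everything specific below is an assumption for illustration: the elements `P`, `Q`, `R`, their `SIZE` values, and the labelling rule in `toy_outcome` are invented stand-ins for handbook entries, and a k-nearest-neighbour vote is swapped in because the conversation does not say which model was actually used.

```python
import math

# Invented "size" descriptor for hypothetical elements (not real data).
SIZE = {"P": 1.00, "Q": 1.02, "R": 1.30}


def features(comp):
    """Composition-weighted mean and spread of the toy size descriptor."""
    mean = sum(frac * SIZE[el] for el, frac in comp.items())
    var = sum(frac * (SIZE[el] - mean) ** 2 for el, frac in comp.items())
    return (mean, math.sqrt(var))


def toy_outcome(comp):
    """Stand-in for a recorded experiment: did this composition form a glass?

    Assumption: an invented rule (large size spread -> glass) replaces
    real handbook outcomes.
    """
    return features(comp)[1] >= 0.1


def binary_records(a, b):
    """Labelled binary compositions with fraction f of element b, f = 0.1 ... 0.9."""
    records = []
    for i in range(1, 10):
        comp = {a: 1 - i / 10, b: i / 10}
        records.append((features(comp), toy_outcome(comp)))
    return records


# Training data: outcomes for three binary systems only.
TRAIN = (binary_records("P", "Q")
         + binary_records("P", "R")
         + binary_records("Q", "R"))


def glass_fraction(comp, k=3):
    """Fraction of the k nearest training records that formed a glass."""
    x = features(comp)
    nearest = sorted(TRAIN, key=lambda rec: math.dist(rec[0], x))[:k]
    return sum(label for _, label in nearest) / k


# Query a ternary composition the model has never seen, and a near-ideal binary.
print(glass_fraction({"P": 0.4, "Q": 0.3, "R": 0.3}))
print(glass_fraction({"P": 0.55, "Q": 0.45}))
```

Averaging `glass_fraction` over a grid of compositions within one alloy system would give a rough per-system score, which is one way to move from individual compositions to whole systems as described above.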
Chris Wolverton: This got the interest of some of our colleagues, as you mentioned at SLAC and at NIST, so we were able to predict some systems where it looked like machine learning was telling us, “You should definitely form glasses in that system.” And yet nobody had ever seen them before.
Chris Wolverton: We teamed up with these folks who do high-throughput experiments. They can deposit libraries of samples and interrogate them in high-throughput manners so they can actually synthesize, and characterize, and explore large numbers of different chemical compositions. All in a single experiment.
Chris Wolverton: They took some of the systems where we had predicted from machine learning that these glasses should form, made them in the laboratory, and indeed found that in some cases those compositions do indeed form glass.
Chris Wolverton: A pretty good success of machine learning, but not a quantitative success, a qualitative success in the sense that glasses appeared where in some cases we said they would, but to be honest, we didn’t get the details of those predictions right. We couldn’t tell them specifically where in the ratio of those three materials would you form a glass.
Chris Wolverton: But we realized very quickly that, having done this high-throughput experiment, they now had all of this new data. That was actually a non-trivial amount of new data that we could feed back into our machine learning model, revise the model, and make a new prediction.
Chris Wolverton: For the systems where we fed the data back, of course the prediction gets better, because you have a lot more data for that specific system, so that’s not so surprising. But the thing that was surprising was that even for systems where we didn’t have additional new data, the predictions seemed to get better. We tested this against experiment.
Chris Wolverton: This gave rise to this idea of a closed loop, or almost an autonomous system. You could imagine a setup where you do these high-throughput experiments, you feed the results into the machine learning model, and the model predicts new experiments to do, or new chemical compositions to explore. You feed that back into the high-throughput experiments, and you just go around and around this loop, and iterate on this approach. The model will, hopefully, just get better and better over time.
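The closed loop described here can be sketched in a few lines of Python. This is only an illustrative toy, not the group's actual workflow: the one-dimensional "composition" axis, the hidden glass-forming window in `run_experiment`, the nearest-neighbor model, and the rule of proposing experiments where the model is least certain are all assumptions made for the example.

```python
def run_experiment(x):
    """Stand-in for a high-throughput measurement (hypothetical ground truth):
    a glass forms only inside a hidden composition window."""
    return 0.35 <= x <= 0.60

def predict(data, x):
    """1-nearest-neighbor model: copy the label of the closest measured composition."""
    nearest = min(data, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(data, grid):
    """Fraction of grid compositions the current model labels correctly."""
    hits = sum(predict(data, x) == run_experiment(x) for x in grid)
    return hits / len(grid)

def propose_experiments(data):
    """Suggest the midpoint between each pair of neighboring measurements that
    disagree; for this model, that is where its prediction is least certain."""
    ordered = sorted(data)
    return [(a[0] + b[0]) / 2 for a, b in zip(ordered, ordered[1:]) if a[1] != b[1]]

def closed_loop(rounds=5):
    grid = [i / 200 for i in range(201)]
    data = [(x, run_experiment(x)) for x in (0.0, 0.5, 1.0)]  # small starting data set
    history = [accuracy(data, grid)]
    for _ in range(rounds):
        for x in propose_experiments(data):        # model proposes compositions
            data.append((x, run_experiment(x)))    # "experiment" returns a label
        history.append(accuracy(data, grid))       # model is re-evaluated on new data
    return history

history = closed_loop(rounds=5)
print(history[0], history[-1])
```

In this toy setting each pass narrows the interval that brackets a glass-forming boundary, so the model ends up far more accurate than it started. A real loop would swap in an actual model, real measurements, and a more careful rule for choosing the next experiments.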
Chris Wolverton: I think people were very excited about that work, first of all just because metallic glasses are difficult to predict, and so being able to predict them was exciting. Also, there is this notion, more generally, of machine learning coupled with high-throughput experimentation in an iterative way. You can see lots of potential for that in lots of different properties; it doesn’t have to be metallic glasses.
Chris Wolverton: As a matter of fact, I guess I should say that we, together with some high-throughput experimentalists at Cornell and Caltech, and others, have just started a new project to try to develop this autonomous closed-loop system, but for the synthesis of new materials: to try to predict the synthetic conditions under which you can actually grow a particular type of material.
Chris Wolverton: The idea is the same. The robot, or the high-throughput experiment, will try something at first to synthesize this new material, and will probably fail. It then feeds this information into some kind of machine learning approach, which ultimately will make new predictions about new synthesis conditions. Hopefully we can go around this loop in a completely autonomous way, where we don’t actually have to rely on human intervention, and zero in on synthetic conditions for new materials.
Chris Wolverton: That’s a new project that we’re working on.
Bryce Meredig: I think this idea of machine learning guiding an iterative materials design process, ideally with a closed loop like you were describing, is extremely important. I would say, from my perspective, it is just starting to be widely appreciated by the materials informatics community.
Bryce Meredig: In the early days we had the tendency as a community to use machine learning in the same ways we used density functional theory. Which is to say we built a model, used that model to screen a large number of compounds, ranked them by some figure of merit that we were interested in, and then said, “Okay, the top 10, or the top 50 on this list are the ‘winners’.”
Bryce Meredig: In contrast, machine learning, in my view, is much more effective in this iterative mode. And of course, at Citrine we do work on sequential learning, which is a manifestation of that same idea. But I think there’s enormous promise in this idea of a machine learning model that is constantly getting smarter from the results of new experiments and/or new simulations that we’re doing. Such a model can help us very efficiently explore a large search space, and bulk metallic glasses are, of course, potentially an astronomically large search space.
Chris Wolverton: Right, right. I would say I agree with that. Of course, in my group we still do a lot of the former, as you described it, trying to apply machine learning in the same style that we would apply DFT. But I think that work is still necessary, because even in these closed-loop, iterative approaches we need the machine learning models to be good. We might test those in the simple, sequential way before we actually put them into this iterative loop.
Bryce Meredig: Yeah, that’s right. I think the key is to what extent your initial machine learning model, or especially your initial training data set, spans the search space that you’re interested in.
Chris Wolverton: Right.
Bryce Meredig: The more that it does, the more that you’re effectively interpolating, and the better the machine learning is going to be at doing the one-pass screen. What I think we’ve seen is that in problems where you are venturing out into regimes of potentially very different physics from your training data, you are relying on machine learning to be a torch that is helping you explore a dark cave. It can maybe help you see the wall in front of you, but it’s not going to tell you in one step where the exit of the cave is.
Bryce Meredig: I think it can be useful in both of those modes, and it really depends on the details of the particular application.
Chris Wolverton: Right, exactly. In your analogy, I think the machine learning, in one iteration, would tell you, “Turn right.” Or, “Turn left.”
Bryce Meredig: Exactly.
Chris Wolverton: In the next iteration it would tell you, “Go forward a few steps.” Or something like that.
Chris Wolverton: One of the reasons why I think that in the early days DFT data sets were actually so good for initially exploring these ideas of machine learning is that DFT can actually produce negative results. Suppose you wanted to train a model of energetics to find low-energy, stable compounds, like you did.
Chris Wolverton: You can train on a DFT data set, and DFT is capable of producing not only atomic configurations that give rise to low energy that the machine learning needs to learn, but it also can produce atomic arrangements that give rise to crazy high energy.
Chris Wolverton: In other words, it can tell you what arrangements make reasonable, sensible chemistries. But it can also tell you which chemical arrangements make no sense, and that’s really important for machine learning. It needs to actually see examples that don’t make any sense.
Chris Wolverton: For example, in the bulk metallic glass case, if we only gave it systems that formed bulk metallic glasses, it could never learn anything. It would just think everything is a metallic glass. We need these examples of negative results, and I think those in some ways are easier to get from computation than they sometimes are from experiment.
Chris Wolverton: You know, you can’t necessarily tell an experimentalist, “Well, go into the laboratory and get me examples of crazy chemistries that make no physical sense.” Even though that might be important for machine learning.
Chris Wolverton: These computational data sets I think sometimes can be really important for that reason. They span the positive/negative space of properties better in some ways than experimental data sets.
Bryce Meredig: I was going to say, very often we might be in a position where we’re starting a new machine learning project, and the way we get our initial training set is by harvesting data from the published literature. But of course, to your point, those results will be strongly biased towards higher-performing materials in most cases, and thus are not representative, potentially at all, of the “natural distribution of materials.” If you truly were throwing darts at a dartboard, what would that distribution look like? So you are often in a position where the accuracy of your initial machine learning model might not be a good indicator of the ease or difficulty of the discovery problem, simply because the underlying materials come from a highly biased distribution.
Chris Wolverton: Exactly. That was almost entirely the point I was going to make. Just as an example, we actually explored that. We took a DFT data set of the perovskite crystal structure, covering all chemical compositions in the perovskite crystal structure. We used that to train a machine learning model to see whether it could predict which chemical compositions would form stable perovskites, and which would not.
Chris Wolverton: We tested this in two ways. In one case, we trained the machine learning model only on the known perovskites, only on the experimentally known perovskites that exist in the literature. In another case, we trained it on exactly the same amount of data, but where we just randomly selected chemical compositions from our DFT data set.
Chris Wolverton: Now that we’ve explained the answer, maybe it’s not so surprising that the model trained on experimentally observed chemical compositions was actually worse than the one that was trained on random compositions, because those random compositions spanned the space better.
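The effect described here can be illustrated with a small Python toy. It is not the perovskite study itself: the single descriptor `x`, the stability window in `is_stable`, the nearest-neighbor classifier, and the sample size are all hypothetical choices made only to show why a training set drawn from known, stable compounds can do worse than an equally sized random one.

```python
import random

def is_stable(x):
    """Hypothetical ground truth: stable only inside a narrow descriptor window."""
    return 0.4 <= x <= 0.6

def predict(train, x):
    """1-nearest-neighbor classifier over (descriptor, label) pairs."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, grid):
    """Fraction of grid points whose predicted label matches the ground truth."""
    return sum(predict(train, x) == is_stable(x) for x in grid) / len(grid)

def compare(n=40, seed=0):
    rng = random.Random(seed)
    grid = [i / 200 for i in range(201)]
    # "Literature-like" training set: every sample lies in the stable window,
    # so the model never sees a negative example.
    known = [(x, is_stable(x)) for x in (rng.uniform(0.4, 0.6) for _ in range(n))]
    # Same number of samples, drawn uniformly from the whole descriptor range.
    rand = [(x, is_stable(x)) for x in (rng.random() for _ in range(n))]
    return accuracy(known, grid), accuracy(rand, grid)

known_acc, random_acc = compare()
print(f"trained on known-stable only: {known_acc:.3f}")
print(f"trained on random samples:    {random_acc:.3f}")
```

Because the first training set contains only stable examples, that model labels every composition as stable and is right only on the small stable fraction of the grid. The random set includes unstable examples as well, which is what lets the second model locate the boundary.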
Chris Wolverton: Anyways, that’s one thing about the computational data sets that I think is important.
Bryce Meredig: Okay, great. Well, thank you Chris. We are actually running out of time, and I just want to thank you again for joining us on the podcast. This is a great experiment that we’re doing. But it’s wonderful to have a guest like you, and I’m sure that the folks who listen to this will really value the insights and input that you were able to provide here.
Bryce Meredig: We really appreciate it.
Chris Wolverton: Sure thing, thank you very much.
Bryce Meredig: Thanks for listening to DataLab. If you have questions or an idea for an episode, contact our team at firstname.lastname@example.org.