The Materials Data Science Book

Introduction to Data Mining, Machine Learning, and Data-Driven Predictions for Materials Science and Engineering


Below is the book’s table of contents, including abstracts for all chapters:

Part I:
Introduction and Foundations
1.   A Brief History of Data and Data Science Abstract:    Data science as we know it today is an interdisciplinary subject where a number of different disciplines and techniques meet. It is a highly dynamic discipline, rapidly evolving and strongly influenced by technological developments (e.g., in terms of data processing infrastructure) and conceptual advances (e.g., machine learning frameworks and platforms), but at the same time with roots that go back several hundred years. In this chapter, we start by giving a brief overview of the historical events and aspects with particular relevance for today's field of data science. It will be seen that a number of concepts (such as Ockham's razor) have survived the centuries and still have their place in data science. However, data science as we know it today could only develop in the last third of the 20th century; it is a child of many different scientific disciplines, old and new.
1.1
Where do all the Numbers Come From?
1.2
The Ancient Roots of Data Science
1.3
The First True Data Scientist
1.4
The More Recent Roots of Data Science
1.5
Data Science and Machine Learning in the 20th Century
1.6
Summary and Conclusion
1.7
References
2.   From Data Science to Materials Data Science Abstract:    In the previous chapter, we saw that data science has ancient roots in many disciplines. For example, it is closely related to statistical theory and mathematics, data visualization, machine learning and artificial intelligence, and computer science and programming. Domain knowledge plays a special role: it is a great source of interesting datasets and problems, which, in turn, can also benefit greatly from data analysis. In this chapter, we will give an overview of materials science problems that have been successfully addressed using, for example, machine learning approaches. We will also discuss the specifics of materials science data and shed light on the often-repeated phrase “turning data into knowledge”.
2.1
What is Data Science, and how is it related to Machine Learning and AI?
2.2
Data Science and Machine Learning in Materials Science and Engineering
2.3
From Data and Information to Knowledge
2.4
The Curse of Dimensionality
2.5
Summary and Conclusion
2.6
References
3.   What You Should Know about Data, Math, and Computing Abstract:    Data mining and machine learning are intimately connected through the availability of well-curated data, high-level programming languages, data visualization, mathematics, and statistics. At the intersection of these different fields, a number of different notations are typically used, which sometimes makes the machine learning (ML) and data mining literature a bit difficult to access for someone who is new to the field. In this chapter, we start with an overview of the most important notations and conventions, followed by an introduction to the foundations for describing datasets mathematically and conceptually, as well as in Python code.
3.1
General Conventions and Notations
3.2
Sets, Tuples, Vectors and Arrays
3.3
Representation of Data in Statistics and Machine Learning
3.4
Summary and Conclusion
3.5
References
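To make Chapter 3's idea of describing a dataset "mathematically, conceptually, and in Python code" concrete, here is a minimal sketch of the usual convention: a feature matrix X with one row per sample and one column per feature, plus a target vector y. The feature values and their interpretation are invented for illustration and are not taken from the book's datasets.

```python
import numpy as np

# Hypothetical dataset with n = 4 samples and p = 3 features,
# stored as a feature matrix X (one row per sample, one column per feature)
# and a target vector y (one value per sample).
X = np.array([
    [7.85, 210.0, 0.30],   # e.g. density, stiffness, Poisson ratio of sample 1
    [2.70,  70.0, 0.33],
    [8.96, 117.0, 0.34],
    [4.51, 110.0, 0.32],
])
y = np.array([250.0, 95.0, 210.0, 380.0])  # e.g. a measured target property

print(X.shape)   # (4, 3): n samples, p features
print(X[0])      # first sample as a 1D array (a feature vector)
print(X[:, 1])   # second feature across all samples (a column of X)
```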
4.   Materials Science Datasets and Data Generation Abstract:    In this chapter, we explain how to obtain high-quality datasets for statistical and machine learning experiments. In particular, we present our strategy for generating artificial datasets that have a clear materials science context, thus defining a set of materials science benchmark datasets that are used throughout this text. In addition, this chapter gives recommendations on how to obtain further datasets for machine learning, e.g., from some of the numerous online resources.
4.1
Introduction
4.2
Dataset MDS-1: Tensile Test with Parameter Uncertainties
4.3
Dataset MDS-2: Microstructure Evolution with the Ising Model
4.4
Dataset MDS-3: Cahn-Hilliard Model
4.5
Dataset MDS-4: Properties of Chemical Elements
4.6
Dataset MDS-5: Nanoindentation of a Cu-Cr Composite
4.7
Dataset DS-1: The Iris Flower Dataset
4.8
Dataset DS-2: The Handwritten Digits Dataset
4.9
Online Resources for Obtaining Training Data
4.10
References
Part II:
A Primer on Probabilities, Distributions, and Statistics
5.   Combinatorics and Probabilities Abstract:    Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events. Without these two fields, it would be possible neither to characterize nor to understand datasets or certain machine learning algorithms (e.g., Bayesian methods rely strongly on conditional probabilities). Being able to count events, to create combinations of events, and to compute the probability of occurrence of particular events is an important prerequisite for probability and statistics. Therefore, in Section 5.1, we start by introducing the foundations of combinatorics together with the most important mathematical formulations. We will then continue with deriving and computing probabilities (Section 5.2) using discrete and continuous random variables (Section 6.1).
5.1
Combinatorics
5.2
Probabilities
5.3
Conditional Probabilities, Product Rule, and Bayes’ Theorem
5.4
Summary
5.5
Exercises
5.6
References
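As a small illustration of the counting and probability ideas summarized in the abstract of Chapter 5, the following sketch uses only Python's standard library; the scenario (selecting specimens from a batch) is a made-up example, not one of the book's worked problems.

```python
from math import comb, factorial

# Number of ways to arrange 5 distinct samples in a measurement queue:
print(factorial(5))          # 120 permutations

# Number of ways to choose 3 specimens out of 10 for testing (order irrelevant):
print(comb(10, 3))           # 120 combinations

# Probability of an event as "favorable outcomes / all outcomes":
# drawing 2 defect-free parts from a batch of 10 that contains 3 defective ones.
favorable = comb(7, 2)       # choose 2 of the 7 good parts
total = comb(10, 2)          # choose any 2 of the 10 parts
print(favorable / total)     # ≈ 0.467
```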
6.   Random Variables and Probability Functions Abstract:    Random variables and probability functions are two important ingredients of statistics as well as of many machine learning methods. In this chapter, we will see how these concepts relate to the sample space on the one hand, and how we can use them for making complex calculations about “distributions” on the other hand.
6.1
Random Variables
6.2
Introduction of Probability Functions
6.3
Discrete Probability Distributions
6.4
Continuous Probability Distributions
6.5
Multivariate Discrete and Continuous Distributions
6.6
Bivariate Distributions as a Special Case
6.7
Summary
7.   Expectation, Variance, and Moments Abstract:    Writing down all values of a random variable is, even if possible, not always useful because these numbers might still need to be reduced to “something simpler” that can be directly understood and interpreted. In this chapter, we introduce expectation values, variances, and various types of moments of random variables as measures that condense a whole dataset into a single value.
7.1
Expected Values of Discrete Random Variables
7.2
Variance and Standard Deviation
7.3
Raw Moments
7.4
Central Moments
7.5
Standardized Moments
7.6
Exercises
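The following minimal sketch illustrates how the condensed measures named in Chapter 7 (expectation value, variance, and higher moments) can be estimated from data with NumPy and SciPy; the simulated hardness values are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
# Hypothetical measurements, e.g. repeated hardness values of one specimen:
x = rng.normal(loc=150.0, scale=5.0, size=1000)

print(np.mean(x))            # sample mean (estimate of the expectation value)
print(np.var(x, ddof=1))     # sample variance (second central moment, unbiased)
print(np.std(x, ddof=1))     # standard deviation
print(stats.skew(x))         # standardized third central moment (skewness)
print(stats.kurtosis(x))     # standardized fourth central moment (excess kurtosis)
```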
8.   Introduction to Statistics Abstract:    Through the previous chapters we have obtained a solid foundation concerning probabilities and distributions. Now the question might arise how these are related to statistics, what statistics actually is, and why yet another, separate scientific discipline is needed. In this chapter, we explain how different “types” of statistics are related. We then cover a number of sampling strategies and discuss important concepts such as the law of large numbers and the central limit theorem, before we derive and explain how the relation between variables can be quantified using the covariance and the correlation.
8.1
And What Now Is Statistics?
8.2
The Sample and the Population
8.3
Two Flavors: Descriptive and Inferential Statistics
8.4
Sampling Strategies
8.5
The Law of Large and Truly Large Numbers
8.6
Central Limit Theorem
8.7
Relations between Multivariate Variables: Covariance and Correlation
8.8
Exercises
8.9
References
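As a small illustration of two concepts from Chapter 8, the sketch below shows the law of large numbers for a simulated coin flip and the covariance and correlation between two hypothetical variables; all data are synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Law of large numbers: the sample mean approaches the expectation value
# (here 0.5 for a fair coin) as the sample size grows.
for n in (10, 1_000, 100_000):
    print(n, rng.integers(0, 2, size=n).mean())

# Covariance and correlation between two hypothetical variables,
# e.g. applied load x and measured elongation y:
x = rng.normal(10.0, 2.0, size=500)
y = 0.8 * x + rng.normal(0.0, 1.0, size=500)
print(np.cov(x, y))          # 2x2 covariance matrix
print(np.corrcoef(x, y))     # 2x2 correlation matrix (Pearson r off-diagonal)
```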
9.   Exploratory Data Analysis Abstract:    Exploratory Data Analysis (EDA) is an important item in the toolbox of statistics that is often used at the beginning of any data analysis workflow. Its purpose is to ensure that the data is "in good shape", to obtain an overview of the content of the dataset, and to characterize the data using simple mathematical measures as well as data visualization. This chapter starts by explaining the goals and the corresponding methods. It then introduces two preliminary steps that should precede any "proper" EDA: an initial exploration of the data files and a quick first visualization. Subsequently, we introduce descriptive statistics and exploratory data visualization as the two main constituents of EDA.
9.1
The Why, the When, and the How
9.2
Two Preliminary Steps
9.3
Descriptive Statistics
9.4
Data Visualization
9.5
Exercises
9.6
References
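A typical EDA session as outlined in Chapter 9 might start along the following lines with pandas; the file name tensile_tests.csv and its columns are placeholders, not one of the book's datasets.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a hypothetical CSV file of tensile test results.
df = pd.read_csv("tensile_tests.csv")

print(df.shape)          # number of rows (samples) and columns (variables)
print(df.dtypes)         # quick check of the data types
print(df.isna().sum())   # missing values per column
print(df.describe())     # descriptive statistics: mean, std, quartiles, ...

# Quick initial visualization: histograms and pairwise scatter plots.
df.hist(bins=30)
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()
```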
10. Commonly Encountered Distributions in Materials Science and Engineering Abstract:    Discrete and continuous distribution functions are ubiquitous in materials science, engineering, and the natural sciences. In this chapter, an overview of commonly encountered distribution types is given, along with their mathematical forms and statistical characteristics such as expectation values and moments. Python examples show how to use the probability and cumulative distribution functions, and how to sample random values from such distributions. Furthermore, examples of typical applications in materials science, engineering, and physics are given.
10.1
About the Following Discrete and Continuous Distributions
10.2
Discrete Uniform Distribution
10.3
Bernoulli Distribution
10.4
Binomial Distribution
10.5
Geometric Distribution
10.6
Poisson Distribution
10.7
Normal Distribution
10.8
Bivariate Normal Distribution
10.9
Multivariate Normal Distribution
10.10
The Relation between Covariance Matrix and Multivariate Normal Distribution
10.11
Lognormal Distribution
10.12
Exponential Distribution
10.13
Logistic Distribution
10.14
References
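Chapter 10's Python examples for probability functions, cumulative functions, and random sampling typically follow the pattern sketched below; this sketch assumes scipy.stats and arbitrary parameter values and is not taken verbatim from the book.

```python
import numpy as np
from scipy import stats

# The pattern for the normal distribution; parameter values are arbitrary.
mu, sigma = 400.0, 25.0                  # e.g. a fracture strength in MPa
dist = stats.norm(loc=mu, scale=sigma)

print(dist.pdf(400.0))                   # probability density at x = 400
print(dist.cdf(450.0))                   # P(X <= 450)
print(dist.ppf(0.5))                     # inverse CDF (median) = 400
print(dist.mean(), dist.var())           # expectation value and variance

samples = dist.rvs(size=10_000, random_state=0)   # draw random values
print(samples.mean(), samples.std())     # should be close to mu and sigma

# The same interface works for the other distributions covered here, e.g.
# stats.binom(n=20, p=0.1), stats.poisson(mu=3), stats.lognorm(s=0.5), ...
```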
Part III:
Classical Machine Learning
11.   Introduction and General Concepts of Machine Learning and Data Science Abstract:    As a first step towards understanding how and why different machine learning algorithms work, we begin this chapter with a discussion and definition of what machine learning is. This is followed by a bird’s-eye overview of different commonly used machine learning methods and their application ranges, with hardly any equations. Subsequently, a number of basic techniques, important concepts, accompanying tools, and terminologies that are specific to machine learning are introduced, providing the foundation for all subsequent chapters.
11.1
The Definition(s) of Machine Learning
11.2
How and what do machines learn?
11.3
Introduction of the General Machine Learning Workflow
11.4
Data Collection
11.5
Data Preprocessing
11.6
A Taxonomy of Machine Learning Models
11.7
Error Measures for Numerical Data
11.8
Similarity Measures for Classification Problems
11.9
Exercises
11.10
References
12.   A First Approach to Machine Learning With Linear Regression Abstract:    Linear regression is one of the most accessible machine learning methods and has strong roots in the field of statistics. Problems of interest concern the numerical relationship between the input variables (or features) and the output variables (or target variables). In this chapter, we introduce machine learning regression analysis with the goal of inferring the functional relation between features and target variables, which helps us to understand important aspects of the data. As a prerequisite, it is explained how trained models are used to make predictions. Furthermore, a number of more general concepts and notations are introduced which are also of importance for later chapters.
12.1
The Roots of Regression Analysis
12.2
General Concepts and Important Terminology
12.3
Simple Linear Regression
12.4
Computational Aspects of Vectorization
12.5
A Worked Example of Simple Linear Regression
12.6
Multiple Linear Regression Models
12.7
Exercises
12.8
References
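A minimal sketch of the fit-and-predict workflow for simple linear regression described in Chapter 12, here using scikit-learn on synthetic data; the "true" relation y = 2x + 1 plus noise is invented for illustration, and the book may instead derive the estimator by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=0)
X = rng.uniform(0.0, 10.0, size=(100, 1))           # single feature, 100 samples
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.5, 100) # noisy linear relation

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # estimated slope and intercept (≈ 2 and 1)
print(model.predict([[5.0]]))           # prediction for a new input
print(model.score(X, y))                # coefficient of determination R²
```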
13.   Advanced Methods and Topics of Regression Abstract:    The previous chapter introduced all conceptual and numerical foundations for solving linear regression problems in the context of machine learning. There, the emphasis was on simple formulations that are easy to understand. However, machine learning regression offers many more methods and tools than those introduced so far. To this end, we will introduce model formulations that can easily be generalized and, additionally, be efficiently implemented as vectorized Python code. Furthermore, the concept of basis functions opens up a multitude of more advanced regression models such as piecewise formulations or non-parametric kernel regression.
13.1
Non-linear Model Behavior with Linear Regression
13.2
Generalized Formulations and Vectorization for Multiple Linear Regression
13.3
Generalized Formulation of Linear Regression With Basis Functions
13.4
Formulation of Chosen Cases in Terms of Basis Functions
13.5
Semi- and Non-Parametric Regression
13.6
Further Nonlinear Regression Models
13.7
Summary and Conclusion
13.8
Exercises
13.9
References
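To illustrate the basis-function idea of Chapter 13, the following sketch fits a cubic polynomial with a model that remains linear in its coefficients; the scikit-learn pipeline and the synthetic data are illustrative assumptions, not the book's own implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Polynomial basis functions expand x into (1, x, x², x³); the regression
# model is still linear in its coefficients (weights of the basis functions).
rng = np.random.default_rng(seed=0)
x = np.linspace(-3.0, 3.0, 80).reshape(-1, 1)
y = 0.5 * x[:, 0]**3 - x[:, 0] + rng.normal(0.0, 0.5, 80)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))                          # prediction at a new point
print(model.named_steps["linearregression"].coef_)     # basis-function weights
```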
14.   Supervised Classification Abstract:    Classification problems are, besides regression problems, another large class of machine learning problems. They belong to the category of supervised learning, and the goal is to learn how to sort data into different categories. While the machine learning (ML) methods are quite different from those used for regression, the fundamental aspects of learning are similar. In this chapter, we start with a detailed introduction to classification based on intuitive, rule-based classification methods. We then cover a number of the most commonly used machine learning classification methods: the class of nearest neighbor classifiers, the Gaussian naive Bayes model, and support vector machines.
14.1
Introduction to Supervised Classification
14.2
Rule-Based Classification Methods and Decision Trees
14.3
Notions and Concepts for Classification Problems
14.4
Nearest Neighbor Classifier
14.5
Gaussian Naive Bayes Model
14.6
Support Vector Machines
14.7
Exercises
14.8
References
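The three classifier families named in Chapter 14 can be compared in a few lines; the sketch below uses scikit-learn and dataset DS-1 (the iris flower dataset), with default hyperparameters chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit each classifier and report its accuracy on the held-out test data.
for clf in (KNeighborsClassifier(n_neighbors=5), GaussianNB(), SVC(kernel="rbf")):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```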
15.   Unsupervised Learning Abstract:    Unsupervised learning methods do not require labeled data for training a model, making predictions, or inference. Therefore, datasets for unsupervised learning consist only of numerical features and do not contain any target values. In this chapter, we discuss two large classes of methods: clustering methods, which try to find data points that “have something in common”, and dimensionality reduction methods, which transform the data into another representation that might be significantly “simpler” or that reveals certain features of the dataset.
15.1
Introduction to Dimensionality Reduction
15.2
Principal Component Analysis: Theoretical Background and Derivations
15.3
Application Aspects and Examples of PCA
15.4
Further Methods for Dimensionality Reduction
15.5
Clustering
15.6
Materials Science Examples
15.7
Exercises
15.8
References
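A minimal sketch of the two unsupervised method classes of Chapter 15, dimensionality reduction (PCA) and clustering (k-means), again on the iris data and using scikit-learn; the preprocessing and parameter choices are illustrative assumptions. Note that neither method uses the labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)      # standardize features first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)                # project 4 features onto 2 components
print(pca.explained_variance_ratio_)              # variance captured per component

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
print(kmeans.labels_[:10])                        # cluster assignment of the first samples
print(kmeans.cluster_centers_)                    # cluster centers in the PCA plane
```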
16.   Machine Learning Techniques Abstract:    The success of machine learning benefits from the availability of various numerical techniques and specialized technical concepts that help to make the training process efficient and effective. Often it is these very techniques and technical aspects that make a problem solvable in the first place. This chapter introduces some of these techniques and covers topics ranging from feature engineering and feature importance, over training, testing, and cross validation, to the concept of baseline models.
16.1
Feature Engineering and Feature Importance
16.2
Data Splitting, Cross Validation, and Statistical Resampling
16.3
Baseline Models
16.4
References
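The data splitting, cross validation, and baseline-model ideas of Chapter 16 can be sketched as follows with scikit-learn; the choice of a majority-class dummy baseline and a support vector classifier is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

# A trained model is only useful if it clearly beats a trivial baseline.
X, y = load_iris(return_X_y=True)

baseline = DummyClassifier(strategy="most_frequent")   # predicts the majority class
model = SVC()

print(cross_val_score(baseline, X, y, cv=5).mean())    # baseline accuracy (≈ 0.33 here)
print(cross_val_score(model, X, y, cv=5).mean())       # model accuracy over 5 folds
```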
Part IV:
Artificial Neural Networks and Deep Learning
17.   From the Perceptron to Artificial Neural Networks Abstract:    The spectacular success of today’s artificial neural network applications has its roots in many small advances combined with a number of pivotal ideas. This chapter starts by introducing the biological neuron, which has been the motivation for creating computational models of increasing complexity. By following the historical development of early models, we explain the steps leading towards modern artificial neural networks and deep learning techniques.
17.1
A First Model of a Neuron
17.2
The Rosenblatt Perceptron
17.3
The ADALINE model
17.4
Increasing the Complexity: Assemblies of Neurons
17.5
Summary and Historical Remarks
17.6
Exercises
17.7
References
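As a small illustration of the perceptron idea discussed in Chapter 17, the sketch below implements the classical Rosenblatt learning rule for the logical AND problem; the data, learning rate, and number of epochs are illustrative choices, not the book's worked example.

```python
import numpy as np

# Linearly separable toy problem: the logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)      # weights
b = 0.0              # bias
eta = 1.0            # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        prediction = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        error = target - prediction
        w += eta * error * xi                # update only when misclassified
        b += eta * error

print(w, b)                                  # learned weights and bias
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # [0, 0, 0, 1]
```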
18.   A Gentle Introduction to Deep Learning Abstract:    The so-called “first AI winter” lasted from the early 1970s until 1980, during which only very limited research activities took place. The main reason was the drastic reduction of research funding, which resulted from the general disappointment about the severe limitations of the early perceptron and simple neural network approaches described in Chapter 17. This chapter introduces the relevant concepts and derives the important method formulations that were required to accelerate the developments that led to today’s deep learning. Among those are new types of activation functions, the concept of backpropagation, and techniques such as training with mini-batches. These are then implemented as Python code and used for machine learning experiments that help to understand the introduced theoretical concepts and methods.
18.1
Overview of the Historical Developments
18.2
Activation Functions
18.3
Backpropagation – Introduction and Example
18.4
General Formulation of Backpropagation
18.5
Python Implementation and Example for the Fully Connected Network
18.6
Further Concepts and Techniques
18.7
Less Is More: The Concept of Dropout
18.8
Example: Microstructure Classification and Property Prediction
18.9
Exercises
18.10
References
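The following sketch condenses the main ingredients of Chapter 18 (an activation function, the forward pass, backpropagation, and a mini-batch gradient-descent update) into one NumPy training step for a tiny fully connected network; all sizes and data are arbitrary, and the book's own implementation will differ in structure and detail.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Mini-batch of 8 samples with 3 features each, scalar targets (synthetic).
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)   # input -> hidden layer
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)   # hidden -> output layer
eta = 0.1                                             # learning rate

# Forward pass.
z1 = X @ W1 + b1
a1 = sigmoid(z1)
y_hat = a1 @ W2 + b2                      # linear output layer
loss = np.mean((y_hat - y) ** 2)          # squared-error loss

# Backward pass (backpropagation of the loss gradient).
grad_yhat = 2.0 * (y_hat - y) / len(X)
grad_W2 = a1.T @ grad_yhat
grad_b2 = grad_yhat.sum(axis=0)
grad_a1 = grad_yhat @ W2.T
grad_z1 = grad_a1 * a1 * (1.0 - a1)       # sigmoid'(z) = a (1 - a)
grad_W1 = X.T @ grad_z1
grad_b1 = grad_z1.sum(axis=0)

# Mini-batch gradient-descent update of all parameters.
W1 -= eta * grad_W1; b1 -= eta * grad_b1
W2 -= eta * grad_W2; b2 -= eta * grad_b2
print(loss)
```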
19.   Advanced Deep Learning Architectures and Techniques Abstract:    In the previous two chapters, we developed a variety of techniques and concepts that led to the first successful deep learning models. This was only the starting point for a whole avalanche of new network architectures and sophisticated training techniques. In this chapter, we explore a number of state-of-the-art deep learning methods of relevance to materials science and physics. These range from convolutional neural networks and autoencoders to generative adversarial networks and physics-informed neural networks. A number of concrete examples help to understand the power and the limitations of such networks. Finally, a brief summary of ongoing developments is given.
19.1
Convolutional Neural Networks
19.2
Deep Learning Techniques
19.3
Two Examples for Deep Learning in Microscopy
19.4
Autoencoder – or: How to Learn With Networks Without Supervision
19.5
Generative Adversarial Networks – or: How to Create Data?
19.6
Physics Informed Machine Learning and Beyond
19.7
Summary and Conclusion
19.8
Exercises
19.9
References
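A convolutional neural network of the kind discussed in Chapter 19 could, for example, be defined as sketched below using TensorFlow/Keras; the input size, layer widths, number of classes, and the framework itself are assumptions for illustration and are not prescribed by the book.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Small CNN for classifying grayscale micrographs into 4 hypothetical
# microstructure classes.
model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 1)),          # 64x64 grayscale image
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.25),                     # the dropout idea from Chapter 18
    layers.Dense(4, activation="softmax"),    # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then follow the usual pattern, e.g.:
# model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```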
Part V:
Supplementary Material and Appendix
A.   Linear Algebra for Machine Learning Abstract:    In the following, we give an overview of the most important definitions and concepts of linear algebra. This serves as a refresher for those readers who have already taken a class in linear algebra. Even though a big part of this book has been written such that readers without a deep background in vector, matrix, and tensor calculus and analysis should still be able to follow, there are a number of derivations where vector calculus is indispensable. Nonetheless, this appendix is not a rigorous mathematical introduction; it only aims to introduce all relevant notions and definitions.
A.1
Vector Calculus
A.2
Matrices and Matrix Operations
A.3
Derived Properties, Further Theorems, and Advanced Definitions
A.4
Matrix-related Operations that only exist in NumPy
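The linear algebra operations reviewed in Appendix A map directly onto NumPy, as the following minimal sketch shows; the matrix entries are arbitrary.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

print(A.T)                      # transpose
print(A @ v)                    # matrix-vector product
print(A @ A)                    # matrix-matrix product
print(np.linalg.inv(A))         # inverse (A is non-singular)
print(np.linalg.det(A))         # determinant = 5
print(np.trace(A))              # trace = 5
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)              # eigenvalues of A
print(np.linalg.solve(A, v))    # solve the linear system A x = v
```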
B.   Proofs and Derivations Abstract:    A number of selected proofs and derivations are given that are not required for understanding the main text but which can nonetheless be useful for understanding some relations and theorems in more detail.
B.1
Proofs of the Theorems about Expectation Values
B.2
Proofs of Some Theorems about Variances
B.3
Simplification of Pearson Moment Coefficient of Skewness
B.4
Proofs and Additional Derivations for Distributions
B.5
References