Introduction to Data Mining, Machine Learning, and Data-Driven Predictions for Materials Science and Engineering
Below is the book’s table of contents,
including abstracts for all chapters:
- Part I:
- Introduction and Foundations
1. A Brief History of Data and Data Science
Data science as we know it today is an interdisciplinary subject where a
number of different disciplines and techniques meet. It is a highly dynamic
discipline, rapidly evolving and strongly influenced by technological
developments (e.g., in terms of data processing infrastructure) and conceptual advances (e.g., machine learning frameworks and platforms), but at the same time with roots that go back several hundred years.
In this chapter, we start by giving a brief overview of the historical
events and aspects with particular relevance for today's field of data science. It will be seen that a number of concepts (such as Ockham's razor) have survived the centuries and still have their place in data science.
However, data science as we know it today could only develop in the last
third of the 20th century; it is a child of many different
scientific disciplines, old and new.
- 1.1
- Where do all the Numbers Come From?
- 1.2
- The Ancient Roots of Data Science
- 1.3
- The First True Data Scientist
- 1.4
- The More Recent Roots of Data Science
- 1.5
- Data Science and Machine Learning in the 20th Century
- 1.6
- Summary and Conclusion
- 1.7
- References
2. From Data Science to Materials Data Science
In the previous chapter, we saw that data science has ancient roots in many
disciplines. For example, it is closely related to statistical theory and mathematics,
data visualization, machine learning and artificial intelligence, and computer science
and programming. Domain knowledge plays a special role: it is a great source for
interesting datasets and problems, which, on the other hand, can also benefit greatly
from data analysis. In this chapter, we will give an overview of materials science
problems that have been successfully addressed using, for example, machine learning
approaches. We will also discuss the specifics of materials science data and shed
light on the often repeated phrase “turning data into knowledge”.
- 2.1
- What is Data Science, and how is it related to Machine Learning and AI?
- 2.2
- Data Science and Machine Learning in Materials Science and Engineering
- 2.3
- From Data and Information to Knowledge
- 2.4
- The Curse of Dimensionality
- 2.5
- Summary and Conclusion
- 2.6
- References
3. What You Should Know about Data, Math, and Computing
Data Mining and Machine Learning are intimately connected through the availability of well-curated data, high-level computer programming languages, data visualization, mathematics and statistics. At the intersection between these different fields, typically a number of different notations are used, which sometimes makes machine learning (ML) and data mining literature a bit difficult to access for someone who is new to this field. In this chapter we start with an overview of the most important notations and conventions, followed by an introduction of the foundations for describing datasets mathematically and conceptually, as well as in Python code.
- 3.1
- General Conventions and Notations
- 3.2
- Sets, Tuples, Vectors and Arrays
- 3.3
- Representation of Data in Statistics and Machine Learning
- 3.4
- Summary and Conclusion
- 3.5
- References
4. Materials Science Datasets and Data Generation
In this chapter, we explain how to obtain high-quality datasets for statistical
and machine learning experiments. In particular, we present our strategy for generating
artificial datasets that have a clear materials science context, thus defining a set of
materials science benchmark datasets that are used throughout this text. In addition,
this chapter gives recommendations on how to obtain further datasets for machine
learning, e.g., from some of the numerous online resources.
- 4.1
- Introduction
- 4.2
- Dataset MDS-1: Tensile Test with Parameter Uncertainties
- 4.3
- Dataset MDS-2: Microstructure Evolution with the Ising Model
- 4.4
- Dataset MDS-3: Cahn-Hilliard Model
- 4.5
- Dataset MDS-4: Properties of Chemical Elements
- 4.6
- Dataset MDS-5: Nanoindentation of a Cu-Cr Composite
- 4.7
- Dataset DS-1: The Iris Flower Dataset
- 4.8
- Dataset DS-2: The Handwritten Digits Dataset
- 4.9
- Online Resource for Obtaining Training Data
- 4.10
- References
- Part II:
- A Primer on Probabilities, Distributions, and Statistics
5. Combinatorics and Probabilities
Probability deals with predicting the likelihood of future events, while statistics
involves the analysis of the frequency of past events. Without these two fields,
it would be possible neither to characterize nor to understand datasets or certain
machine learning algorithms (e.g., Bayesian methods strongly rely on conditional
probabilities). Being able to count events, to create combinations of events, and to
compute the probability of occurrence of particular events, are important prerequisites
for probability and statistics. Therefore, in Section 5.1, we start by introducing
the foundations of combinatorics together with the most important mathematical
formulations. We will then continue with deriving and computing probabilities
(Section 5.2) using discrete and continuous random variables (Section 6.1).
- 5.1
- Combinatorics
- 5.2
- Probabilities
- 5.3
- Conditional Probabilities, Product rule, and Bayes’ theorem
- 5.4
- Summary
- 5.5
- Exercises
- 5.6
- References
6. Random Variables and Probability Functions
Random variables and probability functions are two important ingredients
in statistics as well as for many machine learning methods. In this chapter, we will
see how these concepts relate to the sample space on the one hand, and how we can
use them for making complex calculations about “distributions” on the other hand.
- 6.1
- Random Variables
- 6.2
- Introduction of Probability Functions
- 6.3
- Discrete Probability Distributions
- 6.4
- Continuous Probability Distributions
- 6.5
- Multivariate Discrete and Continuous Distribution
- 6.6
- Bivariate Distributions as a Special Case
- 6.7
- Summary
7. Expectation, Variance, and Moments
Writing down all values of random variables is, even if possible, not always
useful because these numbers still might need reduction to “something simpler” that
can be directly understood and interpreted. In this chapter, we introduce expectation
values, variances and various types of moments of random variables as measures
that condense a whole dataset into a single value.
- 7.1
- Expected Values of Discrete Random Variables
- 7.2
- Variance and Standard Deviation
- 7.3
- Raw Moments
- 7.4
- Central Moments
- 7.5
- Standardized Moments
- 7.6
- Exercises
8. Introduction to Statistics
Through the previous chapters we have obtained a solid foundation concerning
probabilities and distributions. Now the question might arise how those are
related to statistics, what statistics is anyway, and why yet another, separate
scientific discipline is needed. In this chapter, we explain how different “types” of
statistics are related. We then cover a number of sampling strategies, discuss
important concepts such as the law of large numbers and the central limit theorem,
before we derive and explain how the relation between variables can be quantified
using the covariance and the correlation.
- 8.1
- And What Now Is Statistics?
- 8.2
- The Sample and the Population
- 8.3
- Two Flavors: Descriptive and Inferential Statistics
- 8.4
- Sampling Strategies
- 8.5
- The Law of Large and Truly Large Numbers
- 8.6
- Central Limit Theorem
- 8.7
- Relations between Multivariate Variables: Covariance and Correlation
- 8.8
- Exercises
- 8.9
- References
9. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an important item in the toolbox of statistics
that is often used at the beginning of any data analysis workflow. Its purpose is
to ensure that the data is "in good shape", to obtain an overview of the content
of the dataset, and to characterize the data using simple mathematical measures as
well as data visualization. This chapter starts by explaining the goals and the
corresponding methods. This is followed by introducing two steps that should be
the first ones before any "proper" EDA begins: the initial exploration of the data
files as well as a quick initial visualization. Subsequently, we introduce
descriptive statistics and explorative data visualization as the two main
constituents of EDA.
- 9.1
- The Why, the When, and the How
- 9.2
- Two Preliminary Steps
- 9.3
- Descriptive Statistics
- 9.4
- Data Visualization
- 9.5
- Exercises
- 9.6
- References
10. Commonly Encountered Distributions in Materials Science and Engineering
Discrete and continuous distribution functions are ubiquitous in materials
science, engineering and natural sciences. In this chapter, an overview of commonly
encountered distribution types is given, along with their mathematical forms and the
statistical characteristics such as expectation values and moments. Python examples
show how to use these probability and cumulative functions and how to sample
random values from such distributions. Furthermore, examples of typical applications
in materials science, engineering and physics are given.
- 10.1
- About the Following Discrete and Continuous Distributions
- 10.2
- Discrete Uniform Distribution
- 10.3
- Bernoulli Distribution
- 10.4
- Binomial Distribution
- 10.5
- Geometric Distribution
- 10.6
- Poisson Distribution
- 10.7
- Normal Distribution
- 10.8
- Bivariate Normal Distribution
- 10.9
- Multivariate Normal Distribution
- 10.10
- The Relation between Covariance Matrix and Multivariate Normal Distribution
- 10.11
- Lognormal Distribution
- 10.12
- Exponential Distribution
- 10.13
- Logistic Distribution
- 10.14
- References
- Part III:
- Classical Machine Learning
11. Introduction and General Concepts of Machine Learning and Data Science
As a first step towards understanding how and why different machine
learning algorithms work, we will begin this chapter with a discussion and definition of
what machine learning is. This is followed by a bird’s-eye overview of
commonly used machine learning methods and their application ranges,
with hardly any equations. Subsequently, a number of basic techniques, important
concepts, accompanying tools and terminologies that are specific to machine learning
are introduced, providing the foundation for all subsequent chapters.
- 11.1
- The Definition(s) of Machine Learning
- 11.2
- How and what do machines learn?
- 11.3
- Introduction of the General Machine Learning Workflow
- 11.4
- Data Collection
- 11.5
- Data Preprocessing
- 11.6
- A Taxonomy of Machine Learning Models
- 11.7
- Error Measures for Numerical Data
- 11.8
- Similarity Measures for Classification Problems
- 11.9
- Exercises
- 11.10
- References
12. A First Approach to Machine Learning With Linear Regression
Linear regression is one of the most accessible machine learning methods and
has strong roots in the field of statistics. Problems of interest concern the
numerical relationship between the input variables (or features) and the output
variables (or target variables). In this chapter, we introduce machine learning
regression analysis with the goal of inferring the functional relation between
features and target variables, helping us to understand important aspects of the data.
As a prerequisite, it is explained how trained models are used to make predictions.
Furthermore, a number of more general concepts and notations are introduced which
are also of importance for later chapters.
- 12.1
- The Roots of Regression Analysis
- 12.2
- General Concepts and Important Terminology
- 12.3
- Simple Linear Regression
- 12.4
- Computational Aspects of Vectorization
- 12.5
- A Worked Example of Simple Linear Regression
- 12.6
- Multiple Linear Regression Models
- 12.7
- Exercises
- 12.8
- References
13. Advanced Methods and Topics of Regression
The previous chapter introduced all conceptual and numerical foundations
for solving linear regression problems in the context of machine learning. There,
the emphasis was on simple formulations that are easy to understand. However,
machine learning regression offers many more methods and tools than already
introduced. To this end, we will introduce model formulations that can easily be
generalized and that additionally can also be efficiently implemented as vectorized
Python code. Furthermore, the concept of basis functions offers a multitude of more
advanced regression models such as piecewise formulations or non-parametric kernel regression.
- 13.1
- Non-linear Model Behavior with Linear Regression
- 13.2
- Generalized Formulations and Vectorization for Multiple Linear Regression
- 13.3
- Generalized Formulation of Linear Regression With Basis Functions
- 13.4
- Formulation of Chosen Cases in Terms of Basis Functions
- 13.5
- Semi- and Non-Parametric Regression
- 13.6
- Further Nonlinear Regression Models
- 13.7
- Summary and Conclusion
- 13.8
- Exercises
- 13.9
- References
14. Supervised Classification
Classification problems are, besides regression problems, another large
class of machine learning problems. They belong to the category of supervised
learning, and the goal is to learn how to sort data into different categories. While the
machine learning (ML) methods are quite different from those used for regression, the
fundamental aspects of learning are similar. In this chapter we start with a detailed
introduction of classification based on intuitive, rule-based classification methods.
We then cover a number of the most commonly used machine learning classification
methods: the class of nearest neighbor classifiers, the Gaussian naive Bayes model,
and support vector machines.
- 14.1
- Introduction to Supervised Classification
- 14.2
- Rule-Based Classification Methods and Decision Trees
- 14.3
- Notions and Concepts for Classification Problems
- 14.4
- Nearest Neighbor Classifier
- 14.5
- Gaussian Naive Bayes Model
- 14.6
- Support Vector Machines
- 14.7
- Exercises
- 14.8
- References
15. Unsupervised Learning
Unsupervised learning methods do not require labeled data for training a
model, making predictions or inference. Therefore, datasets for unsupervised learning
only consist of numerical features but do not contain any target values. In this chapter
we discuss two large classes of methods: clustering methods that try to find data points
that “have something in common” and dimensionality reduction methods which
transform the data into another representation that might be significantly “simpler”
or that reveals certain features of the dataset.
- 15.1
- Introduction to Dimensionality Reduction
- 15.2
- Principal Component Analysis: Theoretical Background and Derivations
- 15.3
- Application Aspects and Examples of PCA
- 15.4
- Further Methods for Dimensionality Reduction
- 15.5
- Clustering
- 15.6
- Materials Science Examples
- 15.7
- Exercises
- 15.8
- References
16. Machine Learning Techniques
The success of Machine Learning benefits from the availability of various
numerical techniques and specialized technical concepts that help to make the training
process efficient and effective. In many cases it is precisely those techniques and technical
aspects that make a problem solvable in the first place. This chapter introduces some
of these techniques and covers topics ranging from feature engineering and feature
importance, training, testing and cross validation, to the concepts of baseline models.
- 16.1
- Feature Engineering and Feature Importance
- 16.2
- Data Splitting, Cross Validation, and Statistical Resampling
- 16.3
- Baseline Models
- 16.4
- References
- Part IV:
- Artificial Neural Networks and Deep Learning
17. From the Perceptron to Artificial Neural Networks
The spectacular success of today’s artificial neural network applications
has its roots in many small advances combined with a number of pivotal ideas. This
chapter starts by introducing the biological neuron, which has been the motivation for
creating computational models of increasing complexity. By following the historical
developments of early models, we explain the steps leading towards modern artificial
neural networks and deep learning techniques.
- 17.1
- A First Model of a Neuron
- 17.2
- The Rosenblatt Perceptron
- 17.3
- The ADALINE model
- 17.4
- Increasing the Complexity: Assemblies of Neurons
- 17.5
- Summary and Historical Remarks
- 17.6
- Exercises
- 17.7
- References
18. A Gentle Introduction to Deep Learning
The so-called “first AI winter” lasted from the early 1970s until 1980, during
which only very limited research activities took place. The main reason was the drastic
reduction of research funding which resulted from the general disappointment about
the severe limitations of the early perceptron and simple neural network approaches
described in Chapter 17. This chapter introduces the relevant concepts and derives important
method formulations that were required to accelerate the developments that led
to today’s deep learning. Among those are new types of activation functions, the
concept of backpropagation and techniques such as training with mini-batches. This
is then implemented as Python code and used for machine learning experiments
which help to understand the introduced theoretical concepts and methods.
- 18.1
- Overview of the Historical Developments
- 18.2
- Activation functions
- 18.3
- Backpropagation – Introduction and Example
- 18.4
- General Formulation of Backpropagation
- 18.5
- Python Implementation and Example for the Fully Connected Network
- 18.6
- Further Concepts and Techniques
- 18.7
- Less Is More: The Concept of Dropout
- 18.8
- Example: Microstructure Classification and Property Prediction
- 18.9
- Exercises
- 18.10
- References
19. Advanced Deep Learning Architectures and Techniques
In the previous two chapters, we developed a variety of techniques and
concepts that led to the first successful deep learning models. This was only the starting
point for a whole avalanche of new network architectures and sophisticated training
techniques. In this chapter, we explore a number of state-of-the-art deep learning
methods of relevance to materials science and physics. These range from convolutional
neural networks and autoencoders to generative adversarial networks and physics-informed
neural networks. A number of concrete examples help to understand the
power and the limitations of such networks. Finally, a brief summary of ongoing
developments is given.
- 19.1
- Convolutional Neural Networks
- 19.2
- Deep Learning Techniques
- 19.3
- Two Examples for Deep Learning in Microscopy
- 19.4
- Autoencoder – or: How to Learn With Networks Without Supervision
- 19.5
- Generative Adversarial Networks – or: How to Create Data?
- 19.6
- Physics Informed Machine Learning and Beyond
- 19.7
- Summary and Conclusion
- 19.8
- Exercises
- 19.9
- References
- Part V:
- Supplementary Material and Appendix
A. Linear Algebra for Machine Learning
In the following, we give an overview of the most important definitions and concepts
of linear algebra. This serves as a refresher for those readers who already took a
class in linear algebra. Even though a big part of this book has been written such that
readers without a deep background in vector, matrix and tensor calculus and analysis
should still be able to follow, there are a number of derivations where vector calculus
is indispensable.
Nonetheless, this appendix is not a rigorous mathematical introduction. It only
aims to introduce all relevant notions and definitions.
- A.1
- Vector Calculus
- A.2
- Matrices and Matrix Operations
- A.3
- Derived Properties, Further Theorems, and Advanced Definitions
- A.4
- Matrix-related Operation that only exists in NumPy
B. Proofs and Derivations
A number of selected proofs and derivations are given that are not required
for understanding the main text but can nonetheless be useful for
understanding some relations and theorems in more detail.
- B.1
- Proofs of the Theorems about Expectation Values
- B.2
- Proofs of Some Theorems about Variances
- B.3
- Simplification of Pearson Moment Coefficient of Skewness
- B.4
- Proofs and Additional Derivations for Distributions
- B.5
- References