The Materials Data Science Book

Introduction to Data Mining, Machine Learning, and Data-Driven Predictions for Materials Science and Engineering


Below is the book’s table of contents, including abstracts for all chapters:

Part I:
Introduction and Foundations
1.   A Brief History of Data and Data Science Abstract:    Data science as we know it today is an interdisciplinary subject where a number of different disciplines and techniques meet. It is a highly dynamic discipline, rapidly evolving and strongly influenced by technological developments (e.g., in terms of data processing infrastructure) and conceptual advances (e.g., machine learning frameworks and platforms), but at the same time with roots that go back several hundred years. In this chapter, we start by giving a brief overview of the historical events and aspects with particular relevance for today's field of data science. It will be seen that a number of concepts (such as Ockham's razor) have survived the centuries and still have their place in data science. However, data science as we know it today could only develop in the last third of the 20th century; it is a child of many different scientific disciplines, old and new.
1.1
Where do all the Numbers Come From?
1.2
The Ancient Roots of Data Science
1.3
The First True Data Scientist
1.4
The More Recent Roots of Data Science
1.5
Data Science and Machine Learning in the 20th Century
1.6
Summary and Conclusion
1.7
References
2.   From Data Science to Materials Data Science Abstract:    In the previous chapter, we saw that data science has ancient roots in many disciplines. For example, it is closely related to statistical theory and mathematics, data visualization, machine learning and artificial intelligence, and computer science and programming. Domain knowledge plays a special role: it is a great source of interesting datasets and problems, which, in turn, can also benefit greatly from data analysis. In this chapter, we will give an overview of materials science problems that have been successfully addressed using, for example, machine learning approaches. We will also discuss the specifics of materials science data and shed light on the often-repeated phrase “turning data into knowledge”.
2.1
What is Data Science, and how is it related to Machine Learning and AI?
2.2
Data Science and Machine Learning in Materials Science and Engineering
2.3
From Data and Information to Knowledge
2.4
The Curse of Dimensionality
2.5
Summary and Conclusion
2.6
References
3.   What You Should Know about Data, Math, and Computing Abstract:    Data mining and machine learning are intimately connected through the availability of well-curated data, high-level programming languages, data visualization, mathematics, and statistics. At the intersection of these different fields, a number of different notations are typically used, which sometimes makes the machine learning (ML) and data mining literature a bit difficult to access for someone who is new to the field. In this chapter, we start with an overview of the most important notations and conventions, followed by an introduction to the foundations for describing datasets mathematically and conceptually, as well as in Python code.
3.1
General Conventions and Notations
3.2
Sets, Tuples, Vectors and Arrays
3.3
Representation of Data in Statistics and Machine Learning
3.4
Summary and Conclusion
3.5
References
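To make Chapter 3's idea of describing a dataset "mathematically, conceptually, and in Python code" concrete, here is a minimal sketch of the usual convention: a feature matrix X with one row per sample and one column per feature, plus a target vector y. The feature values and their interpretation are invented for illustration and are not taken from the book's datasets.

```python
import numpy as np

# Hypothetical dataset with n = 4 samples and p = 3 features,
# stored as a feature matrix X (one row per sample, one column per feature)
# and a target vector y (one value per sample).
X = np.array([
    [7.85, 210.0, 0.30],   # e.g. density, stiffness, Poisson ratio of sample 1
    [2.70,  70.0, 0.33],
    [8.96, 117.0, 0.34],
    [4.51, 110.0, 0.32],
])
y = np.array([250.0, 95.0, 210.0, 380.0])  # e.g. a measured target property

print(X.shape)   # (4, 3): n samples, p features
print(X[0])      # first sample as a 1D array (a feature vector)
print(X[:, 1])   # second feature across all samples (a column of X)
```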
4.   Materials Science Datasets and Data Generation Abstract:    In this chapter, we explain how to obtain high-quality datasets for statistical and machine learning experiments. In particular, we present our strategy for generating artificial datasets that have a clear materials science context, thus defining a set of materials science benchmark datasets that are used throughout this text. In addition, this chapter gives recommendations on how to obtain further datasets for machine learning, e.g., from some of the numerous online resources.
4.1
Introduction
4.2
Dataset MDS-1: Tensile Test with Parameter Uncertainties
4.3
Dataset MDS-2: Microstructure Evolution with the Ising Model
4.4
Dataset MDS-3: Cahn-Hilliard Model
4.5
Dataset MDS-4: Properties of Chemical Elements
4.6
Dataset MDS-5: Nanoindentation of a Cu-Cr Composite
4.7
Dataset DS-1: The Iris Flower Dataset
4.8
Dataset DS-2: The Handwritten Digits Dataset
4.9
Online Resources for Obtaining Training Data
4.10
References
Part II:
A Primer on Probabilities, Distributions, and Statistics
5.   Combinatorics and Probabilities Abstract:    Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events. Without these two fields, it would be possible neither to characterize nor to understand datasets or certain machine learning algorithms (e.g., Bayesian methods rely strongly on conditional probabilities). Being able to count events, to create combinations of events, and to compute the probability of occurrence of particular events is an important prerequisite for probability and statistics. Therefore, in Section 5.1, we start by introducing the foundations of combinatorics together with the most important mathematical formulations. We will then continue with deriving and computing probabilities (Section 5.2) using discrete and continuous random variables (Section 6.1).
5.1
Combinatorics
5.2
Probabilities
5.3
Conditional Probabilities, Product Rule, and Bayes’ Theorem
5.4
Summary
5.5
Exercises
5.6
References
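As a small illustration of the counting and probability ideas summarized in the abstract of Chapter 5, the following sketch uses only Python's standard library; the scenario (selecting specimens from a batch) is a made-up example, not one of the book's worked problems.

```python
from math import comb, factorial

# Number of ways to arrange 5 distinct samples in a measurement queue:
print(factorial(5))          # 120 permutations

# Number of ways to choose 3 specimens out of 10 for testing (order irrelevant):
print(comb(10, 3))           # 120 combinations

# Probability of an event as "favorable outcomes / all outcomes":
# drawing 2 defect-free parts from a batch of 10 that contains 3 defective ones.
favorable = comb(7, 2)       # choose 2 of the 7 good parts
total = comb(10, 2)          # choose any 2 of the 10 parts
print(favorable / total)     # ≈ 0.467
```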
6.   Random Variables and Probability Functions Abstract:    Random variables and probability functions are two important ingredients of statistics as well as of many machine learning methods. In this chapter, we will see how these concepts relate to the sample space on the one hand, and how we can use them for making complex calculations about “distributions” on the other hand.
6.1
Random Variables
6.2
Introduction of Probability Functions
6.3
Discrete Probability Distributions
6.4
Continuous Probability Distributions
6.5
Multivariate Discrete and Continuous Distributions
6.6
Bivariate Distributions as a Special Case
6.7
Summary
7.   Expectation, Variance, and Moments Abstract:    Writing down all values of a random variable is, even if possible, not always useful because these numbers might still need to be reduced to “something simpler” that can be directly understood and interpreted. In this chapter, we introduce expectation values, variances, and various types of moments of random variables as measures that condense a whole dataset into a single value.
7.1
Expected Values of Discrete Random Variables
7.2
Variance and Standard Deviation
7.3
Raw Moments
7.4
Central Moments
7.5
Standardized Moments
7.6
Exercises
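The following minimal sketch illustrates how the condensed measures named in Chapter 7 (expectation value, variance, and higher moments) can be estimated from data with NumPy and SciPy; the simulated hardness values are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
# Hypothetical measurements, e.g. repeated hardness values of one specimen:
x = rng.normal(loc=150.0, scale=5.0, size=1000)

print(np.mean(x))            # sample mean (estimate of the expectation value)
print(np.var(x, ddof=1))     # sample variance (second central moment, unbiased)
print(np.std(x, ddof=1))     # standard deviation
print(stats.skew(x))         # standardized third central moment (skewness)
print(stats.kurtosis(x))     # standardized fourth central moment (excess kurtosis)
```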
8.   Introduction to Statistics Abstract:    Through the previous chapters we have obtained a solid foundation concerning probabilities and distributions. Now the question might arise how these are related to statistics, what statistics actually is, and why yet another, separate scientific discipline is needed. In this chapter, we explain how different “types” of statistics are related. We then cover a number of sampling strategies and discuss important concepts such as the law of large numbers and the central limit theorem, before we derive and explain how the relation between variables can be quantified using the covariance and the correlation.
8.1
And What Now Is Statistics?
8.2
The Sample and the Population
8.3
Two Flavors: Descriptive and Inferential Statistics
8.4
Sampling Strategies
8.5
The Law of Large and Truly Large Numbers
8.6
Central Limit Theorem
8.7
Relations between Multivariate Variables: Covariance and Correlation
8.8
Exercises
8.9
References
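As a small illustration of two concepts from Chapter 8, the sketch below shows the law of large numbers for a simulated coin flip and the covariance and correlation between two hypothetical variables; all data are synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Law of large numbers: the sample mean approaches the expectation value
# (here 0.5 for a fair coin) as the sample size grows.
for n in (10, 1_000, 100_000):
    print(n, rng.integers(0, 2, size=n).mean())

# Covariance and correlation between two hypothetical variables,
# e.g. applied load x and measured elongation y:
x = rng.normal(10.0, 2.0, size=500)
y = 0.8 * x + rng.normal(0.0, 1.0, size=500)
print(np.cov(x, y))          # 2x2 covariance matrix
print(np.corrcoef(x, y))     # 2x2 correlation matrix (Pearson r off-diagonal)
```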
9.   Exploratory Data Analysis Abstract:    Exploratory Data Analysis (EDA) is an important item in the toolbox of statistics that is often used at the beginning of any data analysis workflow. Its purpose is to ensure that the data is "in good shape", to obtain an overview of the content of the dataset, and to characterize the data using simple mathematical measures as well as data visualization. This chapter starts by explaining the goals and the corresponding methods. It then introduces two preliminary steps that should precede any "proper" EDA: an initial exploration of the data files and a quick first visualization. Subsequently, we introduce descriptive statistics and exploratory data visualization as the two main constituents of EDA.
9.1
The Why, the When, and the How
9.2
Two Preliminary Steps
9.3
Descriptive Statistics
9.4
Data Visualization
9.5
Exercises
9.6
References
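A typical EDA session as outlined in Chapter 9 might start along the following lines with pandas; the file name tensile_tests.csv and its columns are placeholders, not one of the book's datasets.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a hypothetical CSV file of tensile test results.
df = pd.read_csv("tensile_tests.csv")

print(df.shape)          # number of rows (samples) and columns (variables)
print(df.dtypes)         # quick check of the data types
print(df.isna().sum())   # missing values per column
print(df.describe())     # descriptive statistics: mean, std, quartiles, ...

# Quick initial visualization: histograms and pairwise scatter plots.
df.hist(bins=30)
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()
```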
10. Commonly Encountered Distributions in Materials Science and Engineering Abstract:    Discrete and continuous distribution functions are ubiquitous in materials science, engineering, and the natural sciences. In this chapter, an overview of commonly encountered distribution types is given, along with their mathematical forms and statistical characteristics such as expectation values and moments. Python examples show how to use the probability and cumulative distribution functions, and how to sample random values from such distributions. Furthermore, examples of typical applications in materials science, engineering, and physics are given.
10.1
About the Following Discrete and Continuous Distributions
10.2
Discrete Uniform Distribution
10.3
Bernoulli Distribution
10.4
Binomial Distribution
10.5
Geometric Distribution
10.6
Poisson Distribution
10.7
Normal Distribution
10.8
Bivariate Normal Distribution
10.9
Multivariate Normal Distribution
10.10
The Relation between Covariance Matrix and Multivariate Normal Distribution
10.11
Lognormal Distribution
10.12
Exponential Distribution
10.13
Logistic Distribution
10.14
References
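Chapter 10's Python examples for probability functions, cumulative functions, and random sampling typically follow the pattern sketched below; this sketch assumes scipy.stats and arbitrary parameter values and is not taken verbatim from the book.

```python
import numpy as np
from scipy import stats

# The pattern for the normal distribution; parameter values are arbitrary.
mu, sigma = 400.0, 25.0                  # e.g. a fracture strength in MPa
dist = stats.norm(loc=mu, scale=sigma)

print(dist.pdf(400.0))                   # probability density at x = 400
print(dist.cdf(450.0))                   # P(X <= 450)
print(dist.ppf(0.5))                     # inverse CDF (median) = 400
print(dist.mean(), dist.var())           # expectation value and variance

samples = dist.rvs(size=10_000, random_state=0)   # draw random values
print(samples.mean(), samples.std())     # should be close to mu and sigma

# The same interface works for the other distributions covered here, e.g.
# stats.binom(n=20, p=0.1), stats.poisson(mu=3), stats.lognorm(s=0.5), ...
```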
Part III:
Classical Machine Learning
11.   Introduction and General Concepts of Machine Learning and Data Science Abstract:    As a first step towards understanding how and why different machine learning algorithms work, we begin this chapter with a discussion and definition of what machine learning is. This is followed by a bird’s-eye overview of different commonly used machine learning methods and their application ranges, with hardly any equations. Subsequently, a number of basic techniques, important concepts, accompanying tools, and terminologies that are specific to machine learning are introduced, providing the foundation for all subsequent chapters.
11.1
The Definition(s) of Machine Learning
11.2
How and what do machines learn?
11.3
Introduction of the General Machine Learning Workflow
11.4
Data Collection
11.5
Data Preprocessing
11.6
A Taxonomy of Machine Learning Models
11.7
Error Measures for Numerical Data
11.8
Similarity Measures for Classification Problems
11.9
Exercises
11.10
References
12.   A First Approach to Machine Learning With Linear Regression Abstract:    Linear regression is one of the most accessible machine learning methods and has strong roots in the field of statistics. Problems of interest concern the numerical relationship between the input variables (or features) and the output variables (or target variables). In this chapter, we introduce machine learning regression analysis with the goal of inferring the functional relation between features and target variables, which helps us to understand important aspects of the data. As a prerequisite, it is explained how trained models are used to make predictions. Furthermore, a number of more general concepts and notations are introduced which are also of importance for later chapters.
12.1
The Roots of Regression Analysis
12.2
General Concepts and Important Terminology
12.3
Simple Linear Regression
12.4
Computational Aspects of Vectorization
12.5
A Worked Example of Simple Linear Regression
12.6
Multiple Linear Regression Models
12.7
Exercises
12.8
References
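A minimal sketch of the fit-and-predict workflow for simple linear regression described in Chapter 12, here using scikit-learn on synthetic data; the "true" relation y = 2x + 1 plus noise is invented for illustration, and the book may instead derive the estimator by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=0)
X = rng.uniform(0.0, 10.0, size=(100, 1))           # single feature, 100 samples
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.5, 100) # noisy linear relation

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # estimated slope and intercept (≈ 2 and 1)
print(model.predict([[5.0]]))           # prediction for a new input
print(model.score(X, y))                # coefficient of determination R²
```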
13.   Advanced Methods and Topics of Regression Abstract:    The previous chapter introduced all conceptual and numerical foundations for solving linear regression problems in the context of machine learning. There, the emphasis was on simple formulations that are easy to understand. However, machine learning regression offers many more methods and tools than those introduced so far. To this end, we will introduce model formulations that can easily be generalized and, additionally, be efficiently implemented as vectorized Python code. Furthermore, the concept of basis functions opens up a multitude of more advanced regression models such as piecewise formulations or non-parametric kernel regression.
13.1
Non-linear Model Behavior with Linear Regression
13.2
Generalized Formulations and Vectorization for Multiple Linear Regression
13.3
Generalized Formulation of Linear Regression With Basis Functions
13.4
Formulation of Chosen Cases in Terms of Basis Functions
13.5
Semi- and Non-Parametric Regression
13.6
Further Nonlinear Regression Models
13.7
Summary and Conclusion
13.8
Exercises
13.9
References
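To illustrate the basis-function idea of Chapter 13, the following sketch fits a cubic polynomial with a model that remains linear in its coefficients; the scikit-learn pipeline and the synthetic data are illustrative assumptions, not the book's own implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Polynomial basis functions expand x into (1, x, x², x³); the regression
# model is still linear in its coefficients (weights of the basis functions).
rng = np.random.default_rng(seed=0)
x = np.linspace(-3.0, 3.0, 80).reshape(-1, 1)
y = 0.5 * x[:, 0]**3 - x[:, 0] + rng.normal(0.0, 0.5, 80)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))                          # prediction at a new point
print(model.named_steps["linearregression"].coef_)     # basis-function weights
```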
14.   Supervised Classification Abstract:    Classification problems are, besides regression problems, another large class of machine learning problems. They belong to the category of supervised learning, and the goal is to learn how to sort data into different categories. While the machine learning (ML) methods are quite different from those used for regression, the fundamental aspects of learning are similar. In this chapter, we start with a detailed introduction to classification based on intuitive, rule-based classification methods. We then cover a number of the most commonly used machine learning classification methods: the class of nearest neighbor classifiers, the Gaussian naive Bayes model, and support vector machines.
14.1
Introduction to Supervised Classification
14.2
Rule-Based Classification Methods and Decision Trees
14.3
Notions and Concepts for Classification Problems
14.4
Nearest Neighbor Classifier
14.5
Gaussian Naive Bayes Model
14.6
Support Vector Machines
14.7
Exercises
14.8
References
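The three classifier families named in Chapter 14 can be compared in a few lines; the sketch below uses scikit-learn and dataset DS-1 (the iris flower dataset), with default hyperparameters chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit each classifier and report its accuracy on the held-out test data.
for clf in (KNeighborsClassifier(n_neighbors=5), GaussianNB(), SVC(kernel="rbf")):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```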
15.   Unsupervised Learning Abstract:    Unsupervised learning methods do not require labeled data for training a model, making predictions, or inference. Therefore, datasets for unsupervised learning consist only of numerical features and do not contain any target values. In this chapter, we discuss two large classes of methods: clustering methods, which try to find data points that “have something in common”, and dimensionality reduction methods, which transform the data into another representation that might be significantly “simpler” or that reveals certain features of the dataset.
15.1
Introduction to Dimensionality Reduction
15.2
Principal Component Analysis: Theoretical Background and Derivations
15.3
Application Aspects and Examples of PCA
15.4
Further Methods for Dimensionality Reduction
15.5
Clustering
15.6
Materials Science Examples
15.7
Exercises
15.8
References
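A minimal sketch of the two unsupervised method classes of Chapter 15, dimensionality reduction (PCA) and clustering (k-means), again on the iris data and using scikit-learn; the preprocessing and parameter choices are illustrative assumptions. Note that neither method uses the labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)      # standardize features first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)                # project 4 features onto 2 components
print(pca.explained_variance_ratio_)              # variance captured per component

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
print(kmeans.labels_[:10])                        # cluster assignment of the first samples
print(kmeans.cluster_centers_)                    # cluster centers in the PCA plane
```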
16.   Machine Learning Techniques Abstract:    The success of machine learning benefits from the availability of various numerical techniques and specialized technical concepts that help to make the training process efficient and effective. Often it is these very techniques and technical aspects that make a problem solvable in the first place. This chapter introduces some of these techniques and covers topics ranging from feature engineering and feature importance, over training, testing, and cross validation, to the concept of baseline models.
16.1
Feature Engineering and Feature Importance
16.2
Data Splitting, Cross Validation, and Statistical Resampling
16.3
Baseline Models
16.4
References
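The data splitting, cross validation, and baseline-model ideas of Chapter 16 can be sketched as follows with scikit-learn; the choice of a majority-class dummy baseline and a support vector classifier is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

# A trained model is only useful if it clearly beats a trivial baseline.
X, y = load_iris(return_X_y=True)

baseline = DummyClassifier(strategy="most_frequent")   # predicts the majority class
model = SVC()

print(cross_val_score(baseline, X, y, cv=5).mean())    # baseline accuracy (≈ 0.33 here)
print(cross_val_score(model, X, y, cv=5).mean())       # model accuracy over 5 folds
```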
Part IV:
Artificial Neural Networks and Deep Learning
17.   From the Perceptron to Artificial Neural Networks Abstract:    The spectacular success of today’s artificial neural network applications has its roots in many small advances combined with a number of pivotal ideas. This chapter starts by introducing the biological neuron, which has been the motivation for creating computational models of increasing complexity. By following the historical development of early models, we explain the steps leading towards modern artificial neural networks and deep learning techniques.
17.1
A First Model of a Neuron
17.2
The Rosenblatt Perceptron
17.3
The ADALINE model
17.4
Increasing the Complexity: Assemblies of Neurons
17.5
Summary and Historical Remarks
17.6
Exercises
17.7
References
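As a small illustration of the perceptron idea discussed in Chapter 17, the sketch below implements the classical Rosenblatt learning rule for the logical AND problem; the data, learning rate, and number of epochs are illustrative choices, not the book's worked example.

```python
import numpy as np

# Linearly separable toy problem: the logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)      # weights
b = 0.0              # bias
eta = 1.0            # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        prediction = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        error = target - prediction
        w += eta * error * xi                # update only when misclassified
        b += eta * error

print(w, b)                                  # learned weights and bias
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # [0, 0, 0, 1]
```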
18.   A Gentle Introduction to Deep Learning Abstract:    The so-called “first AI winter” lasted from the early 1970s until 1980, during which only very limited research activities took place. The main reason was the drastic reduction of research funding, which resulted from the general disappointment about the severe limitations of the early perceptron and simple neural network approaches described in Chapter 17. This chapter introduces the relevant concepts and derives the important method formulations that were required to accelerate the developments that led to today’s deep learning. Among those are new types of activation functions, the concept of backpropagation, and techniques such as training with mini-batches. These are then implemented as Python code and used for machine learning experiments that help to understand the introduced theoretical concepts and methods.
18.1
Overview of the Historical Developments
18.2
Activation Functions
18.3
Backpropagation – Introduction and Example
18.4
General Formulation of Backpropagation
18.5
Python Implementation and Example for the Fully Connected Network
18.6
Further Concepts and Techniques
18.7
Less Is More: The Concept of Dropout
18.8
Example: Microstructure Classification and Property Prediction
18.9
Exercises
18.10
References
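The following sketch condenses the main ingredients of Chapter 18 (an activation function, the forward pass, backpropagation, and a mini-batch gradient-descent update) into one NumPy training step for a tiny fully connected network; all sizes and data are arbitrary, and the book's own implementation will differ in structure and detail.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Mini-batch of 8 samples with 3 features each, scalar targets (synthetic).
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)   # input -> hidden layer
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)   # hidden -> output layer
eta = 0.1                                             # learning rate

# Forward pass.
z1 = X @ W1 + b1
a1 = sigmoid(z1)
y_hat = a1 @ W2 + b2                      # linear output layer
loss = np.mean((y_hat - y) ** 2)          # squared-error loss

# Backward pass (backpropagation of the loss gradient).
grad_yhat = 2.0 * (y_hat - y) / len(X)
grad_W2 = a1.T @ grad_yhat
grad_b2 = grad_yhat.sum(axis=0)
grad_a1 = grad_yhat @ W2.T
grad_z1 = grad_a1 * a1 * (1.0 - a1)       # sigmoid'(z) = a (1 - a)
grad_W1 = X.T @ grad_z1
grad_b1 = grad_z1.sum(axis=0)

# Mini-batch gradient-descent update of all parameters.
W1 -= eta * grad_W1; b1 -= eta * grad_b1
W2 -= eta * grad_W2; b2 -= eta * grad_b2
print(loss)
```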
19.   Advanced Deep Learning Architectures and Techniques Abstract:    In the previous two chapters, we developed a variety of techniques and concepts that led to the first successful deep learning models. This was only the starting point for a whole avalanche of new network architectures and sophisticated training techniques. In this chapter, we explore a number of state-of-the-art deep learning methods of relevance to materials science and physics. These range from convolutional neural networks and autoencoders to generative adversarial networks and physics-informed neural networks. A number of concrete examples help to understand the power and the limitations of such networks. Finally, a brief summary of ongoing developments is given.
19.1
Convolutional Neural Networks
19.2
Deep Learning Techniques
19.3
Two Examples for Deep Learning in Microscopy
19.4
Autoencoder – or: How to Learn With Networks Without Supervision
19.5
Generative Adversarial Networks – or: How to Create Data?
19.6
Physics Informed Machine Learning and Beyond
19.7
Summary and Conclusion
19.8
Exercises
19.9
References
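A convolutional neural network of the kind discussed in Chapter 19 could, for example, be defined as sketched below using TensorFlow/Keras; the input size, layer widths, number of classes, and the framework itself are assumptions for illustration and are not prescribed by the book.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Small CNN for classifying grayscale micrographs into 4 hypothetical
# microstructure classes.
model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 1)),          # 64x64 grayscale image
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.25),                     # the dropout idea from Chapter 18
    layers.Dense(4, activation="softmax"),    # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then follow the usual pattern, e.g.:
# model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```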
Part V:
Supplementary Material and Appendix
A.   Linear Algebra for Machine Learning Abstract:    In the following, we give an overview of the most important definitions and concepts of linear algebra. This serves as a refresher for those readers who have already taken a class in linear algebra. Even though a big part of this book has been written such that readers without a deep background in vector, matrix, and tensor calculus and analysis should still be able to follow, there are a number of derivations where vector calculus is indispensable. Nonetheless, this appendix is not a rigorous mathematical introduction; it only aims to introduce all relevant notions and definitions.
A.1
Vector Calculus
A.2
Matrices and Matrix Operations
A.3
Derived Properties, Further Theorems, and Advanced Definitions
A.4
Matrix-related Operations that only exist in NumPy
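The linear algebra operations reviewed in Appendix A map directly onto NumPy, as the following minimal sketch shows; the matrix entries are arbitrary.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

print(A.T)                      # transpose
print(A @ v)                    # matrix-vector product
print(A @ A)                    # matrix-matrix product
print(np.linalg.inv(A))         # inverse (A is non-singular)
print(np.linalg.det(A))         # determinant = 5
print(np.trace(A))              # trace = 5
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)              # eigenvalues of A
print(np.linalg.solve(A, v))    # solve the linear system A x = v
```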
B.   Proofs and Derivations Abstract:    A number of selected proofs and derivations are given that are not required for understanding the main text but which can nonetheless be useful for understanding some relations and theorems in more detail.
B.1
Proofs of the Theorems about Expectation Values
B.2
Proofs of Some Theorems about Variances
B.3
Simplification of Pearson Moment Coefficient of Skewness
B.4
Proofs and Additional Derivations for Distributions
B.5
References