Introduction to Data Mining, Machine Learning, and Data-Driven Predictions for Materials Science and Engineering
The Datasets Used Throughout the Book
The MDS book contains a large number of examples, most of which require data in one way or another. Interpreting results and debugging problems is easier if we have an intuitive understanding of the context of the dataset; e.g., with context we can judge whether 0.001 is a small number that can be neglected or a significant value.
Therefore, a number of datasets with a materials science or physics background
are used. Chapter 4 of the book explains the materials science or physics
background of how the datasets were created and gives details of the datasets.
The datasets can be openly obtained in the form of a Python package as described below.
Obtaining the Datasets
The datasets can be obtained in the form of the Python package MDSdata,
either using pip
or directly from the GitHub repository.
It is recommended to install the package into a virtual Python environment.
A brief overview of how this can be done is given here:
A virtual environment makes it easier for Python (i.e., for the
package manager "pip") to install all required dependencies
automatically in the correct versions and to avoid conflicts with
other packages.
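Here is a sketch of the most important steps for creating a virtual
environment in Linux using pyenv (no root permissions are required):
1. Install pyenv: e.g., for Ubuntu/Debian-based systems this can be done in a terminal by running: curl https://pyenv.run | bash
2. List the Python versions that can be installed: pyenv install --list
3. Install one of these versions, e.g.: pyenv install 3.9.7
4. Create a virtual environment named MDS that uses this version: pyenv virtualenv 3.9.7 MDS
5. Activate the environment: pyenv activate MDS
6. Leave the environment again when you are done: pyenv deactivate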
You can also simply close the terminal.
Note that there are a number of different approaches to creating and managing virtual environments, and the above is only one possibility.
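Whichever approach is used, it can be helpful to confirm from within Python that the interpreter actually belongs to a virtual environment before installing packages. The following is a minimal sketch using only the standard library; the function name is made up for this illustration, and the check assumes an environment created with venv or virtualenv (other tools, e.g. conda, may behave differently).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the running interpreter belongs to a venv/virtualenv."""
    # Inside such an environment, sys.prefix points to the environment,
    # whereas sys.base_prefix still points to the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Python executable:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

If the second line prints False although an environment was meant to be active, the environment was most likely not activated in the current terminal.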
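With the virtual environment activated, install the package with pip using
pip install MDSdata
or, from within a local copy of the GitHub repository, using
pip install .
Afterwards, MDSdata is available and can be used as detailed below.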
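To verify that the installation worked, one can ask Python whether the package can be found. The sketch below uses only the standard library (Python 3.8 or newer); the two helper functions are hypothetical names introduced for this illustration and are not part of the package.

```python
import importlib.util
from importlib import metadata


def is_importable(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(distribution_name: str):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# The import name is lowercase, the distribution name on pip is MDSdata.
print("mdsdata importable:", is_importable("mdsdata"))
print("MDSdata version:", installed_version("MDSdata"))
```

If the first line prints False, the package was probably installed into a different Python environment than the one currently running.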
A convenient way of working with Python code is a
Jupyter notebook. You can install Jupyter locally
into your virtual environment using
pip install jupyter
Afterwards, you can launch the Jupyter environment by running
jupyter-lab
from your terminal.
Alternatively, you can also write your Python code with your preferred text editor
in a .py file and run it from the terminal with, e.g., python myfile.py.
For further information and instructions concerning the package, see the
How to Use the Datasets
With the above installation steps successfully finished, you are ready to either jump directly to one of the concrete examples for the datasets MDS-1...5 and DS-1 & 2 below or continue reading for some general information about the Python package.
The name of the package is mdsdata (all lowercase!),
and you can import it with import mdsdata. The
datasets are then directly contained in this namespace, e.g.,
mdsdata.MDS1 refers to the class MDS1 that contains
the data and some additional functionality for dataset MDS-1 (tensile test).
There are two ways of importing a specific dataset. For the most
flexibility, use the load_data function of the dataset class, e.g., MDS5.load_data().
These functions have been designed to be fully compatible
with the scikit-learn interface.
The function parameters (see the GitHub repository MDS data for more details)
decide in which format the dataset is returned: return_X_y=True
returns the features and the target as two numpy arrays,
as_frame=True returns a pandas DataFrame,
and if no parameters are given, a dictionary-like "Bunch" (in analogy to scikit-learn)
is returned that has the form
{data, target, target_names, DESCR, feature_names}
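The return-format convention described above can be illustrated without the package itself. The following sketch defines a toy loader that mimics the pattern; the class Bunch and the function load_toy_data are hypothetical stand-ins written for this illustration (they are not the mdsdata implementation), and the as_frame option is left out to avoid a pandas dependency.

```python
import numpy as np


class Bunch(dict):
    """Dictionary whose keys can also be read as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as exc:
            raise AttributeError(key) from exc


def load_toy_data(return_X_y=False):
    """Hypothetical stand-in loader; NOT part of mdsdata."""
    data = np.array([[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]])  # 3 records, 2 features
    target = np.array([0, 1, 1])                           # one label per record
    if return_X_y:
        return data, target
    return Bunch(
        data=data,
        target=target,
        target_names=["class A", "class B"],
        feature_names=["feature 1", "feature 2"],
        DESCR="Toy dataset for illustrating the return formats.",
    )


X, y = load_toy_data(return_X_y=True)   # two plain numpy arrays
bunch = load_toy_data()                 # dictionary-like object with metadata
print(X.shape, y.shape)
print(bunch.feature_names, bunch["target_names"])
```

The tuple form is convenient for passing data straight to an estimator, while the Bunch form keeps the names and the description together with the arrays.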
A shorter alternative that chooses reasonable default values for the parameters (and that therefore has less flexibility) is to use the "named functions":
As a shortcut, here is an overview of how to use the convenience wrappers around the classes for the MDS and DS datasets. In the majority of cases, this is the easiest approach to use:
Dataset | Python Code for Importing the Dataset |
---|---|
MDS-1 tensile test | |
MDS-2 Ising microstructure | |
MDS-3 Cahn-Hilliard microstructure | |
MDS-4 Chemical elements | |
MDS-5 Nanoindentation | |
DS-1 Iris flowers | |
DS-2 MNIST digits | |
DS-2 (light) Alpaydin digits | |
Materials Science Datasets MDS-1…5
Summary: The dataset MDS-1 (tensile test) is obtained from simulating a linear elastic material model with non-linear hardening and temperature dependency.
Each data record was obtained using material properties that contain random contributions, mimicking uncertainties in the measurements. The solid lines show the average material response obtained for a set of mean material parameter values.
The dataset consists of the strain (the input or feature) and the stress (the output or target) for three different temperatures. It consists of altogether 350 data records for the strain covering 3 different temperature values (T=0°C, 400°C, and 600°C). For further details on the equations and (material) parameters used, see Section 4.2 of the book. If data for temperatures other than those used in MDS-1 are needed, please take a look at the GitHub repository MDS-data and at how MDS-1 is implemented.
Example: For how to obtain strain and stress data using the simple functions please see the above cheat sheet. The following lines of Python code import the dataset MDS1, store the data in six variables for stress and strain and then plot the data points:
Here, strain_... and stress_... are 1D numpy arrays
that contain 350 elements each;
the plot looks approximately like the figure on the right.
If you prefer to use pandas DataFrame objects, you can obtain them
in the same way as in scikit-learn by setting the
parameter as_frame=True:
Here, load_data(...) returns a dictionary-like data "bunch" that
also contains a pandas DataFrame object.
The last command only works in a Jupyter notebook and shows a formatted table with the output shown in the screenshot.
This is a good dataset for experiments with linear regression: you can either use only the elastic regime to fit a straight line, or you can use piecewise linear regression and treat the yield point as a hyperparameter. In the MDS book, this dataset is also used as an example for outlier detection.
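The idea of treating the yield point as a hyperparameter of a piecewise linear regression can be sketched as follows. The example uses synthetic data with made-up parameter values in arbitrary units (it does not load MDS-1), a simple bilinear curve instead of the non-linear hardening of the dataset, and a plain grid search over candidate breakpoints; it is one possible approach, not necessarily the one used in the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stress-strain curve (NOT the MDS-1 data): elastic slope up to
# the yield strain, then a smaller hardening slope, plus measurement noise.
E, H, eps_yield = 200.0, 20.0, 0.002        # assumed values, arbitrary units
strain = np.linspace(0.0, 0.01, 200)
stress = np.where(strain < eps_yield,
                  E * strain,
                  E * eps_yield + H * (strain - eps_yield))
stress = stress + rng.normal(scale=0.005, size=strain.size)


def fit_piecewise(x, y, breakpoint):
    """Least-squares fit of a continuous two-segment line with a fixed breakpoint."""
    # Basis: intercept, slope, and a "hinge" that changes the slope after the breakpoint.
    A = np.column_stack([np.ones_like(x), x, np.maximum(x - breakpoint, 0.0)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((A @ coef - y) ** 2))
    return coef, sse


# Treat the breakpoint (yield strain) as a hyperparameter and scan candidates.
candidates = strain[5:-5]
errors = [fit_piecewise(strain, stress, b)[1] for b in candidates]
best_breakpoint = candidates[int(np.argmin(errors))]
coef, _ = fit_piecewise(strain, stress, best_breakpoint)

print("estimated yield strain:", best_breakpoint)
print("estimated elastic slope:", coef[1])
print("estimated hardening slope:", coef[1] + coef[2])
```

Because the breakpoint enters the model non-linearly, it is scanned from outside while the remaining coefficients are obtained by ordinary least squares for each candidate.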
Summary: The Ising model, on which the dataset MDS-2 is based, is a statistical physics model used to represent, e.g., the evolution of ferromagnetic domain walls. Ernst Ising, to whom the model owes its name, was the first to solve it mathematically for a one-dimensional situation. The two-dimensional case is much more complex and was solved numerically here. The simulation method is based on the Metropolis Monte Carlo method. After randomly initializing the system of, in our case, 64 x 64 lattice sites (the pixels), its microstructure evolves. The figure shows some example microstructures. Each of them is characterized by a temperature which, in dimensionless scaling, is given as a multiple of the so-called Curie temperature TC. The system undergoes a phase transformation at TC, which shows in the transition from microstructures with strong fluctuations to microstructures with very distinct patterns.
In the MDS book, we use the Ising dataset, e.g., to explore the structure-property relationship: using a neural network we will predict the temperature (i.e., the property) based on the microstructural images. Furthermore, we investigate the data with principal component analysis (PCA) and with more advanced embedding methods.
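For readers unfamiliar with the Metropolis Monte Carlo method mentioned above, here is a minimal sketch of single-spin-flip updates on a small lattice. It is not the code that generated MDS-2; the lattice size, the number of sweeps, and the conventions (coupling constant J = 1, Boltzmann constant 1, no external field, periodic boundaries) are assumptions made for this illustration. In these units the transition of the two-dimensional square lattice lies near T = 2.27.

```python
import numpy as np


def metropolis_ising(n=16, temperature=2.0, sweeps=50, seed=0):
    """Evolve a randomly initialized n x n Ising lattice with single-spin flips.

    Assumptions: coupling constant J = 1, no external field, k_B = 1,
    periodic boundary conditions.
    """
    rng = np.random.default_rng(seed)
    spins = rng.choice([-1, 1], size=(n, n))
    for _ in range(sweeps * n * n):
        i, j = rng.integers(0, n, size=2)
        # Sum over the four nearest neighbours (periodic boundaries).
        neighbours = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
                      + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        delta_e = 2.0 * spins[i, j] * neighbours   # energy change if spin (i, j) flips
        # Metropolis rule: always accept moves that do not raise the energy,
        # accept the others with probability exp(-dE / T).
        if delta_e <= 0.0 or rng.random() < np.exp(-delta_e / temperature):
            spins[i, j] = -spins[i, j]
    return spins


cold = metropolis_ising(temperature=1.0)   # below the transition
hot = metropolis_ising(temperature=5.0)    # above the transition
print("|magnetization| at T=1.0:", abs(cold.mean()))
print("|magnetization| at T=5.0:", abs(hot.mean()))
```

Displaying the two returned arrays as images should give an impression of the qualitative difference between low-temperature and high-temperature microstructures.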
Example: The following lines of Python code import the dataset MDS2 using the simplified function and show four example microstructures out of the altogether 5000 images:
images is a 3D numpy array where the first index is the number of the image, e.g., in the above
example images[10] is the image number 10; temperature[10]
is the temperature of that image, and label[10]
is an integer value of 0 or 1, indicating if the temperature is below or above the Curie temperature.
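The array layout described above can be tried out on a synthetic stand-in. In this sketch, the number of images, the pixel values of -1 and +1, the temperature range, and the convention "0 = below, 1 = above TC" are assumptions for illustration; only the shapes and the indexing pattern follow the description.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in with the layout described above (NOT the real data):
# 20 images of 64 x 64 pixels, one temperature and one label per image.
n_images = 20
images = rng.choice([-1, 1], size=(n_images, 64, 64))
temperature = rng.uniform(0.0, 2.0, size=n_images)   # in multiples of T_C
label = (temperature > 1.0).astype(int)              # assumed: 0 below, 1 above T_C

print(images.shape)          # the first index counts the images
print(images[10].shape)      # a single 64 x 64 image
print(temperature[10], label[10])

# Flatten each image into one row of a feature matrix, as many
# machine learning methods expect.
X = images.reshape(n_images, -1)
print(X.shape)
```

Reshaping with -1 lets numpy infer the number of columns (here 64 * 64 = 4096), so the same line works for images of other sizes.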
There is also a smaller Ising dataset that still contains 5000 images
which, however, are only 16x16 pixels in size.
It can be imported by from mdsdata import MDS2_light
or, for the simplified interface, by from mdsdata import load_Ising_light.
Everything else stays the same as in MDS-2. Computations using this dataset are
considerably faster but lack some finer details.
Summary: Binary mixtures, such as alloys of two metals, may coarsen into two phases, a mechanism which is described by spinodal decomposition and on which the dataset MDS-3 is based. This phenomenon may occur if an initially homogeneous phase becomes thermodynamically unstable. In this case, small fluctuations quickly grow.
In contrast to the Monte Carlo method used for creating dataset MDS-2 (Ising model), there is no randomness in the evolution of this system; for given initial values, it is completely governed by a continuum equation, i.e., the Cahn-Hilliard equation, which was solved here by means of a finite element method.
The main "ingredients" for this model are the concentration of a phase and a free energy consisting of the potential, gradient, and elastic energy density. We use these datasets again for investigating the structure-property (=energy) relation, but also in the context of unsupervised learning methods and dimensionality reduction as well as for cross-validation purposes.
The dataset MDS-3 only contains data for a reduced energy range, as compared to the figure.
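To give an impression of how a Cahn-Hilliard system evolves deterministically from given initial values, the following sketch integrates a simplified version of the equation. It deliberately differs from the simulations behind MDS-3: it uses explicit finite differences on a periodic grid instead of finite elements, omits the elastic energy, and assumes a double-well potential f(c) = (c^2 - 1)^2 / 4 with made-up values for the time step, gradient coefficient, and mobility.

```python
import numpy as np


def laplacian(field, dx=1.0):
    """Five-point finite-difference Laplacian with periodic boundaries."""
    return (np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0)
            + np.roll(field, 1, axis=1) + np.roll(field, -1, axis=1)
            - 4.0 * field) / dx**2


def cahn_hilliard_step(c, dt=0.01, kappa=1.0, mobility=1.0):
    """One explicit Euler step of dc/dt = M * lap(c**3 - c - kappa * lap(c))."""
    mu = c**3 - c - kappa * laplacian(c)      # chemical potential
    return c + dt * mobility * laplacian(mu)


rng = np.random.default_rng(0)
c = 0.05 * rng.standard_normal((64, 64))      # small fluctuations around c = 0
mean_before = c.mean()
for _ in range(2000):
    c = cahn_hilliard_step(c)

print("mean concentration before/after:", mean_before, c.mean())
print("concentration range:", c.min(), c.max())
```

The only random ingredient is the initial condition; the evolution itself is deterministic, and the mean concentration is conserved up to rounding errors because the update is the Laplacian of a field on a periodic grid.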
Example: The following lines of Python code import the dataset MDS3 consisting of data from altogether 17 simulations. The code then shows some information, e.g., that the dataset consists of 17866 images of 64 x 64 pixels.
giving the following output:
The line containing MDS3.load_data(...) takes a few seconds
to run because the image data is contained
in a zip archive from which the images are extracted "on the fly".
A convenient way of using the dataset is in a Jupyter
notebook where the line that reads the data is executed only once at
the beginning of an experiment.
It is also possible to read the data of a single simulation only, which speeds
things up considerably. This is achieved by using, e.g., the parameter
simulation_number=3, which imports simulation number 3.
Summary: MDS-4 is a small dataset that contains four periodic properties: atomic radius, electron affinity, ionization energy, and electronegativity. These properties were collected for a total of 38 chemical elements (22 metals and 16 non-metals), originally taken from a number of different publicly available sources.
The dataset can be used in the context of, e.g., supervised classification or unsupervised learning with methods such as clustering or principal component analysis. A simple task is to decide, based on the chemical properties, whether an element is a metal or not, or to find out which periodic properties are the most important.
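As an illustration of the principal component analysis mentioned above, the following sketch computes the principal components of a small feature matrix via the singular value decomposition. The data are random stand-ins with the same shape as MDS-4 (38 records, 4 features), not the actual element properties, and the feature scales are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a 38 x 4 feature matrix (NOT the MDS-4 values).
# The columns have very different scales, as physical properties often do.
X = rng.normal(size=(38, 4)) * np.array([100.0, 1.0, 10.0, 0.5])
X[:, 1] += 0.01 * X[:, 0]          # make two of the features correlated

# Standardize each feature, then obtain the principal axes from the SVD.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

explained_variance_ratio = S**2 / np.sum(S**2)
scores = X_std @ Vt.T               # data expressed in principal components

print("explained variance ratio:", np.round(explained_variance_ratio, 3))
print("loadings of the first component:", np.round(Vt[0], 3))
```

Standardizing first matters when the features have different units; otherwise the feature with the largest numerical range would dominate the first component. The loadings indicate how strongly each original feature contributes to a component.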
Example: The following lines of Python code import the dataset MDS4 and create a scatter plot with two features, similar to the one shown above. The first code snippet uses the simplified function interface:
The second code snippet shows how to use the more general interface in analogy to scikit-learn and additionally shows a legend in the plot:
To see all feature names you can use print(features_names).
If you prefer to use the Python package pandas then the following code
creates a DataFrame
and additionally adds the text-label
for each value of the target variable:
If run in a jupyter notebook the last command outputs the table from row 18 to row 24 where the newly created column showing the text string label can be seen:
Further details and references are given in section 4.6 of the MDS book.
Summary: The dataset MDS-5 was obtained from nanoindentation of two different Cu-Cr composites containing 25 wt% Cr and 60 wt% Cr, corresponding to 29.95 at% and 64.40 at% Cr, respectively. Additionally, datasets for pure Cu and pure Cr are included as reference data, covering the two extremes.
The dataset consists of two features: the hardness H and the Young's modulus E obtained for altogether 378 indents. Both are given in GPa. The four different material types are the class labels 0, ..., 3.
Example:
Calling the method MDS5.load_data()
returns a dictionary-like object that contains the feature matrix data
consisting of modulus and hardness as the two features in columns.
The output "matrix" Y is a 1D array and contains for each
row an integer (0, 1, 2, or 3) that corresponds to the class_name
('0% Cr', '25% Cr', '60% Cr', or '100% Cr').
The following lines produce a plot similar to the figure shown above.
Setting c=material uses a different color for each of
the four different materials. The output of the print statements is
A second way of using the dataset is to only obtain the feature matrix and
the target data. This is achieved by specifying return_X_y=True
as in the scikit-learn library (see the docstring of the function for more details):
Setting outlier=True has the effect that outliers are not
removed, which is useful for experimenting with methods for outlier
detection and removal. The default is that the postprocessed dataset with
outliers removed is returned.
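For experimenting with outlier detection, a simple starting point is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below applies it to synthetic modulus and hardness values with three artificial outliers; it does not load MDS-5, and it is not claimed to be the method used to postprocess that dataset. The threshold of 3.5 is a commonly used rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in (NOT the MDS-5 measurements): modulus and hardness in GPa.
modulus = rng.normal(loc=130.0, scale=5.0, size=100)
hardness = rng.normal(loc=2.0, scale=0.1, size=100)
modulus[:3] = [20.0, 300.0, 250.0]           # three artificial outliers
X = np.column_stack([modulus, hardness])


def robust_outlier_mask(X, threshold=3.5):
    """Flag rows where any feature has a large median/MAD-based z-score."""
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    # 0.6745 rescales the MAD so that it is comparable to a standard
    # deviation for normally distributed data.
    z = 0.6745 * (X - median) / mad
    return np.any(np.abs(z) > threshold, axis=1)


is_outlier = robust_outlier_mask(X)
X_clean = X[~is_outlier]
print("flagged records:", np.flatnonzero(is_outlier))
print("records kept:", X_clean.shape[0])
```

Median and MAD are used instead of mean and standard deviation because the latter are themselves distorted by the outliers that are to be detected.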
“Classical” Datasets DS-1 and DS-2
Summary: The Iris Flower dataset (DS-1) is one of the two most famous datasets used as a "toy dataset" for first experiments with data science and machine learning.
The dataset contains the measurement of the length and width of two different types of leaves: the petals and the sepals, resulting in altogether 4 features. Additionally, the type of flower is contained: there are altogether 3 different Iris types; hence, the target variable can take the values 0, 1, or 2.
Example:
The following Python script shows how to load the dataset. Here we
use the more general DS1.load_data() function (instead of load_iris())
as this gives us access to the names of the 3 class members.
For plotting the data of each class in a different color, we create an
array idx that acts as a "filter" (note that y == 2 returns an
array of boolean values, similar to np.where(y == 2, True, False)):
The parameters vmin=0, vmax=2 are used to ensure that for each value
of y a unique color is chosen.
Clearly, the code could be written more compactly using a for loop.
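A possible loop-based variant of the filtering is sketched below. It works on a small synthetic array with three classes (not the Iris measurements), the class names are placeholders, and the plotting call is replaced by a print statement so that the example runs without matplotlib.

```python
import numpy as np

# Synthetic stand-in (NOT the Iris measurements): two features, three classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 5)
X = rng.normal(size=(y.size, 2)) + y[:, None]

# A comparison with an array yields a boolean array that can be used as a filter.
idx = (y == 2)
same_as_where = np.array_equal(idx, np.where(y == 2, True, False))
print("y == 2 equals np.where(y == 2, True, False):", same_as_where)

# Loop over the classes instead of repeating the filtering code three times.
class_names = ["class 0", "class 1", "class 2"]    # placeholder names
for value, name in enumerate(class_names):
    idx = (y == value)
    print(name, "->", idx.sum(), "records, mean of first feature:",
          round(float(X[idx, 0].mean()), 2))
```

Inside the loop, the print statement can be replaced by a plotting call that receives X[idx, 0] and X[idx, 1] together with the class name as label.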
Summary:
This is the famous MNIST dataset of handwritten digits (DS-2), a curated subset of a larger dataset collected by NIST. It contains 60,000 training images and 10,000 testing images of handwritten digits. The images are 28x28 pixels in size. The value range is 0 (background) to 255 (digit). Further information can be found at Yann LeCun's web page from which the dataset was obtained.
The dataset can be used for classification (the output of, e.g., a network is the integer value of the digit) but also for various "computer vision" problems, e.g., for denoising (when the digits are superimposed with noise) or inpainting (i.e., "guessing" missing parts of such images).
Example:
Importing the MNIST dataset can be done in analogy to the above datasets. The following example shows how to do this and additionally creates a plot consisting of some randomly chosen images:
As always, the first index of the images array is the number of
the image such that images[12] gives the 13th image.
If the testing dataset should be used, then the parameter train=False
needs to be set in DS2().load_data.
Showing a few stats of the dataset can be done as follows:
Here, we used numpy's histogram function
to count the number of images
for each of the 10 classes (0..9). The [0] makes sure that only
the frequencies are printed and not the bin edges as well (the function returns
two arrays). The resulting output is:
There are altogether 60000 images and the classes are roughly balanced
with approximately 6000 examples per class.
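The counting step can be reproduced on synthetic labels as follows. The labels are random integers rather than the MNIST targets, and the explicit bin edges are one possible choice that places exactly one bin on each digit; the call used in the original example may have been set up differently.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)      # synthetic digit labels 0..9

# Eleven bin edges -0.5, 0.5, ..., 9.5 give one bin centred on each digit.
edges = np.arange(11) - 0.5
counts = np.histogram(labels, bins=edges)[0]   # [0]: frequencies, [1]: bin edges

print("images per class:", counts)
print("same result with bincount:", np.bincount(labels, minlength=10))
```

For non-negative integer labels, np.bincount is a shorter alternative that returns the same frequencies.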
Summary:
This is a smaller version of the famous MNIST dataset of handwritten digits (DS-2 light), prepared by E. Alpaydin. It contains 5,620 images of handwritten digits. The images are 8x8 pixels in size and were downsampled from the NIST dataset. The value range is 0 (background) to 255 (digit). Further information can be found in the UCI dataset archive from which the dataset was obtained.
The dataset can be used for classification (the output of, e.g., a network is the integer value of the digit) but also for various "computer vision" problems, e.g., for denoising (when the digits are superimposed with noise) or inpainting (i.e., "guessing" missing parts of such images).
Example:
Importing the dataset can be done in analogy to the MNIST dataset. The following example shows how to do this and additionally creates a plot consisting of some randomly chosen images:
As always, the first index of the images array is the number of
the image such that images[12] gives the 13th image.