**by****Maria Troppenz****,****Martin Kuban****,****&****Santiago Rigamonti****for****exciting****nitrogen**

` Purpose:` In this tutorial, you will learn the basics of the cluster expansion package

`. You will start by creating parent lattices of bulk and surface systems, next you will learn how to set up supercells and structures sets used for training cluster expansion models. For these systems, you will create pools of clusters and study the cluster correlations for particular cases. Finally you will learn different methods to find the optimal model out of a large pool of clusters.`

**CELL**Before starting, make sure you are in the correct folder: `/home/tutorials/CELL_tutorial`.

### 1. Tutorial 1

#### 1.1 Building parent lattices

Here you will learn how to set up and visualize a parent lattice, which is the most basic object in ` CELL`. We will consider two examples: a bulk fcc crystal, and a surface system with adsorbed atoms and surface alloying.

We start with the example of a **bulk binary fcc** metal alloy:

```
from ase.build import bulk
from clusterx.parent_lattice import ParentLattice
pri1 = bulk('Cu', 'fcc')
sub1 = bulk('Al', 'fcc')
platt1 = ParentLattice(pri1, substitutions=[sub1])
```

In the first line above, we import the `bulk` module of the Atomic Simulation Environment (ASE). In the second line, the `ParentLattice` class of ` CELL` is loaded. In the next two lines, using the

`bulk`function, we define the structures for the pristine Cu non-substituted lattice

`pri1`and the fully substituted Al (fcc) lattice

`sub1`. These two structures are then used to initialize the

`ParentLattice`object (which we call

`platt1`) in the last line.

Next, we would like to visualize the just created parent lattice. To this end, we use the `juview` function of the visualization module of ` CELL`:

```
from clusterx.visualization import juview
juview(platt1)
```

The figure on the left, corresponds to the pristine non-substituted Cu fcc crystal, while the figure on the right represents the fully Al-substituted crystal. In general, the line of code `juview(parent_lattice)`, will generate as many additional figures as possible substituents are present in the parent lattice, as you will see for the next example of a surface system.

Now, we will setup the parent lattice for a **surface system**. It consists on a fcc(111) Al surface, with possible Na substitution on the uppermost Al layer and adsorption of oxygen atoms in "on-top" configuration. In order to build the parent lattice for such a system, we execute the following code:

```
from ase.build import fcc111, add_adsorbate
pri2 = fcc111('Al', size=(1,1,3))
add_adsorbate(pri2,'X',1.7,'ontop')
pri2.center(vacuum=10.0, axis=2)
```

In the code above, first we load some builder utilities from ASE (`fcc111` and `add_adsorbate`). In the next three lines, we

i) create an (111)-terminated fcc Al slab with three atomic layers

ii) add a vacancy (symbol *X*) site with "on top" configuration, and

iii) add vaccuum on the sides of the slab along the $z$-direction.

In this way we have defined the pristine structure `pri2`. Now we would like to set up the substitutions: Na on the top-most Al layer and oxygen on the "on-top" vacancy sites. To proceed, we first need some information from the pristine structure, as shown below:

```
for i, (symbol, z_coord) in enumerate(zip(pri2.get_chemical_symbols(),pri2.get_positions()[:,2])):
print("atom index: ",i,"| Chemical symbol: ",symbol,"| z coordinate: ",z_coord)
```

From this output, we see that the uppermost "Al" layer has atom index 2, and that the adsorbate layer has atom index 3. With this information, we initialize the parent lattice object in an alternative way, by telling which species can occupy every atom index: This is done with the `site_symbols` argument, which allows us to tell ` CELL` which atomic species can occupy every atomic site:

```
platt2 = ParentLattice(pri2, site_symbols=[['Al'],['Al'],['Al','Na'],['X','O']])
juview(platt2)
```

In this way, we see that for atom indices 1 and 2 only `['Al']` is allowed, while atom index 2 admits the species in the array `['Al','Na']` and atom index 3 admits species `['X','O']`, where `'X'` denotes a vacancy. From left to right, the figures above denote: pristine non-substituted lattice (vacancy sites indicated with white color), on-top vacancy site substituted by oxygen (red), and top-most Al layer substituted by Na (purple).

#### 1.2 Building sturcture sets

In order to generate *ab initio* data to be used as input to train a cluster expansion model, we need to perform calculations on super cells of the parent lattice. In ` CELL`, super cells are represented by objects of the class

`SuperCell`. We will take the example of the surface system above, and create a $4\times4\times1$ super cell object:

```
from clusterx.super_cell import SuperCell
import numpy as np
scell2 = SuperCell(platt2,np.diag([4,4,1]))
juview(scell2)
```

As you can see, a super cell looks very much like an enlarged parent lattice, indeed, objects of the `SuperCell` class inherit from the `ParentLattice` class and share many properties.

Now, using the created super cell, we will generate a few random decorations of it at different concentrations. The generated structures will be collected in a `StructuresSet` object, that will be later used as training set for a cluster expansion.

Before doing so, however, we need some information from the just created `SuperCell` object that we will need to define the concentrations of Na substituents and vacancies in the generation of random structures. This information is contained in the `sites` dictionary of the super cell, which we can access with:

`print(scell2.get_idx_subs())`

This tells us that the supercell has three types of sites, or sublattices, with indexes $0$, $1$ and $2$. Sites of type $0$, contain species number 0 (vacancy) and can be substituted by species number 8 (oxygen) , sites of type $1$ contain species number 13 (Al) and can be substituted with species number 11 (Na); while sites of type $2$ contain species number 13 (Al) and can not be substituted.

In the code shown below, we first load the `StructuresSet` class and then create a structures-set object that we call `sset2`. Next, in three different `for` loops, by using the `scell2`'s `gen_random()` method, we create

i) two random structures with 4 on-top oxygen atoms,

ii) two random structures with 4 Al$\rightarrow$Na substitutions, and

iii) 2 structures with 4 oxygen atoms and 4 Al$\rightarrow$Na substitutions:

```
from clusterx.structures_set import StructuresSet
sset2 = StructuresSet(platt2)
nstruc = 2
# i) Random structures with 4 on-top oxygen atoms
for i in range(nstruc):
sset2.add_structure(scell2.gen_random({0:[4]}))
# ii) Random structures with 4 substituent Na atoms
for i in range(nstruc):
sset2.add_structure(scell2.gen_random({1:[4]}))
# iii) Random structures with 4 on-top oxygen and 4 substituent Na atoms
for i in range(nstruc):
sset2.add_structure(scell2.gen_random({0:[2],1:[4]}))
juview(sset2)
```

In the figure above, red spheres represent oxygen atoms and purple spheres represent Na substitutions.

**Exercise 1**

Build the parent lattice for a two-dimensional square lattice of a binary (e.g. SiGe) material and create (and visualize) 6 random structures on a $5\times5$ super cell.

As help, you can use the following Atoms object to initialze the `ParentLattice` object:

```
from ase import Atoms
a=3.0
pri4 = Atoms(positions=[[0,0,0]],symbols=['Si'],cell=[[a,0,0],[0,a,0],[0,0,2*a]],pbc=(1,1,0))
```

**Excercise 2**

Generate and visualize a few random structures for the fcc CuAl alloy of the first example on this tutorial. Do it so in a $3\times3\times3$ super cell.

### 2. Tutorial 2

#### 2.1 Building a pool of clusters

Now that you know how to build parent lattices and structures sets, the next task is to create a pool of clusters. The clusters pool defines the basis set for the expansion of configuration-dependent properties in terms of cluster functions. Although this basis is infinite (consisting of all symetrically distinct atomic configurations of the substituent species), in practical applications one must cut off the basis. In order to fix ideas, we will start with a very simple model (corresponding to the solution of **Excercise 1** of **Tutorial 1**) of a binary two-dimensional lattice. Afterwards, you will tackle the more complicated surface system shown in **Tutorial 1** by solving **Excercise 4** of this tutorial.

To start, we must define the parent lattice:

```
from ase import Atoms
from clusterx.parent_lattice import ParentLattice
from clusterx.visualization import juview
a=4.0
pri = Atoms(positions=[[0,0,0]],symbols=['Si'],cell=[[a,0,0],[0,a,0],[0,0,2*a]],pbc=(1,1,0))
plat = ParentLattice(pri,site_symbols=[['Si','Ge']])
juview(plat)
```

Now, we will create all possible clusters of up to four points and radius up to $\sqrt{2}\,a$. This is done with the following piece of code:

```
from clusterx.clusters.clusters_pool import ClustersPool
r = 1.5
cpool = ClustersPool(plat, npoints=[0,1,2,3,4], radii=[0,0,a*r,a*r,a*r])
```

In the first line, we load the `ClustersPool` class. Then, we create an object of this class (`cpool`). To initialize it, we use the parameters `npoints` and `radii`. `npoints` indicates the number of points of the clusters in the pool, and the parameter `radii` indicates the maximum radius corresponding to each of the number of points indicated in `npoints`. In this way, we have created a clusters pool containing the empty cluster, all the 1-point clusters (in this case there will be only one of this kind), and all the 2-, 3- and 4-point clusters up to radius $1.5\times a$.

The number of clusters just created is:

`print("Number of clusters: ", len(cpool))`

These can be visualized by first creating a clusters database:

`cpool.write_clusters_db()`

Now, we have at least two ways to visualize the pool of clusters:

i) we can plot a few of them (e.g. *n=6*) on this notebook with `juview` by invoking the `get_cpool_atoms()` method of cpool:

`juview(cpool.get_cpool_atoms(),n=6)`

ii) by using the graphical user interface (gui) of ASE on a terminal. To use this second option, which is the recommended way to proceed when the clusters pool is large, first note that when creating the clusters database above with `cpool.write_clusters_db()`, a file called `cpool.json` was created in the same folder (let's call it `$CWD`) where this tutorial is located. This file has the proper format to be visualized with ASE's gui. Now, open a terminal and move to this folder with `$>cd $CWD` (the `$>` denotes the command prompt) and execute `$>ase gui cpool.json`. A number of windows will open. The relevant ones for the visualization of the clusters are shown in the screen capture below:

You can visualize all the clusters by clicking on the *Back* and *Forward* buttons of the *Movie* window. Let's see how this works for a larger pool:

```
r = 2.5
cpool = ClustersPool(plat, npoints=[0,1,2,3,4,5], radii=[0,0,a*r,a*r,a*r,a*r])
cpool.write_clusters_db()
print(len(cpool), " clusters were created.")
```

Now, visualize these 34 clusters with ASE's gui as explained above.

#### 2.2 Cluster orbits

The clusters obtained above are all symmetrically inequivalent. Moreover, each of them is a representative of an infinite set of symmetrically equivalent clusters. Every infinite set is called a cluster orbit. We can calculate a cluster orbit for a supercell and visualize it with ASE's gui. Let's do so for one cluster of the clusters pool `cpool`:

```
cluster_index = 6 # Select a cluster from the pool
# Obtain the orbit for this cluster
cluster_orbit, cluster_multiplicity = cpool.get_cluster_orbit(cluster_index = cluster_index)
print("There are ",len(cluster_orbit),"symmetrically equivalent representations of cluster ", cluster_index," in the supercell.")
print("The cluster multiplicity is ",cluster_multiplicity,".")
```

The cluster multiplicity is the number of symetrically equivalent realizations of the cluster, under the symmetry operations of the parent cell, without counting the internal translations of the parent cell inside the supercell.

`cpool.write_clusters_db(orbit=cluster_orbit,db_name="cluster_orbit.json")`

Now, you may visualize the cluster orbit by executing the command `$>ase gui cluster_orbit.json` in a terminal.

#### 2.3 Cluster correlations

In this section we will illustrate the calculation of the cluster correlations for the simple binary case studied here. To simplify the analysis, we will use a basis in which the site occupations are represented by a variable $\sigma_{s,i}$, which is equal to $1$ if the crystal site $i$ of structure $s$ is ocupied with the substitutional species (Ge in this example), or zero otherwise (Si in this example).

The cluster correlations $X_{s\alpha}$ for a binary compound are given by the equation:

$X_{s\alpha}=\frac{1}{M_\alpha}\sum_{\beta\equiv\alpha}{\prod_{i\in\beta}\sigma_{s,i}}$

Here, the index $\alpha$ denotes a cluster and $m_\alpha$ the multiplicity of cluster $\alpha$. The sum goes through the clusters $\beta$ in the orbit of cluster $\alpha$, and $i$ denote the crystal sites (points) of cluster $\beta$. In short, the cluster correlations are the orbit-averaged *cluster functions* $\prod_{i\in\beta}\sigma_{s,i}$.

Let's now calculate the cluster correlations for a the same small pool of clusters obtained at the start of this tutorial:

```
from clusterx.clusters.clusters_pool import ClustersPool
r = 1.5
cpool = ClustersPool(plat, npoints=[0,1,2,3,4], radii=[0,0,a*r,a*r,a*r])
cpool.write_clusters_db()
print(len(cpool), " clusters were created.")
juview(cpool.get_cpool_atoms(),n=6)
```

To accomplish this task, we must first create a `CorrelationsCalculator` object:

```
from clusterx.correlations import CorrelationsCalculator
corrcal = CorrelationsCalculator("binary-linear", plat, cpool)
```

Let's now create a random structure and then obtain the cluster correlations for this structure:

```
structure = cpool.get_cpool_scell().gen_random({0:[3]})
juview(structure.get_atoms())
```

```
mult = cpool.get_multiplicities()
corrs = corrcal.get_cluster_correlations(structure, multiplicities=mult)
for i in range(len(cpool)):
print("Cluster: ",i,"| Correlation: ", corrs[i], "| Multiplicity: ", mult[i])
```

**Excercise 3**

With the figures of the clusters, the figure of the structure and the obtained multiplicities, calculate *by hand* the correlations for the random structure and verify that they are equal to what is returned by the `get_cluster_correlations` method.

**Excercise 4**

Create a pool of clusters for the surface system of Tutorial 1 and visulaize the generated clusters with ASE's gui

### 3. Tutorial 3

#### 3.1 The optimal model

Once a training data set and a pool of clusters is available, the next question is how to find the optimal cluster expansion model to make predictions. This task requires finding the set of clusters that are most relevant to describe the interactions present in the system. A number of optimality criteria can be used. Here we will focus on obtaining models which are optimal in the sense of providing the best possible predictions for new data, i.e., data not included in the training set. For this purpose, the quality of the predictions can be quantified by the cross validation score (CVS), which you will learn to calculate and interpret in this section.

Here, we will use the surface system with oxygen adsorption and Na substitution, which was used in **Tutorial 1** and **Tutorial 2**, to find the optimal cluster expansion model. We start by generating again the needed elements for the task, namely a training data set and a pool of clusters.

```
from ase.build import fcc111, add_adsorbate
from clusterx.parent_lattice import ParentLattice
from clusterx.structures_set import StructuresSet
from clusterx.visualization import juview
from clusterx.super_cell import SuperCell
import numpy as np
from random import randint
nsc = 4
pri2 = fcc111('Al', size=(1,1,3))
add_adsorbate(pri2,'X',1.7,'ontop')
pri2.center(vacuum=10.0, axis=2)
platt2 = ParentLattice(pri2, site_symbols=[['Al'],['Al'],['Al','Na'],['X','O']])
scell2 = SuperCell(platt2,np.diag([nsc,nsc,1]))
sset2 = StructuresSet(platt2)
nstruc = 60
for i in range(nstruc):
concentration = {0:[randint(0,nsc*nsc)],1:[randint(0,nsc*nsc)]}
sset2.add_structure(scell2.gen_random(concentration))
juview(sset2,n=3) # Plot the first 3 created random structrues
```

```
from clusterx.calculators.emt import EMT2
sset2.set_calculator(EMT2())
energies = sset2.calculate_property()
print(energies)
```

```
r=3.5
from clusterx.clusters.clusters_pool import ClustersPool
cpool = ClustersPool(platt2, npoints=[0,1,2,3,4], radii=[0,0,r,r,r])
cpool.write_clusters_db()
print(len(cpool)," clusters were generated.")
```

`juview(cpool.get_cpool_atoms(),n=6)`

Now that we have the basic elements (i.e. training set and pool of clusters), we can proceed to calculate the matrix of cluster correlations. We do so with a `CorrelationsCalculator` object:

```
from clusterx.correlations import CorrelationsCalculator
corrcal = CorrelationsCalculator("trigonometric", platt2, cpool)
comat = corrcal.get_correlation_matrix(sset2)
print(comat)
```

Now we are ready to perform the selection of the optimal cluster expansion model. To this end, we load the `ClustersSelector` class of clusterx and create and instance of it (that we call `clsel`):

```
from clusterx.clusters_selector import ClustersSelector
from clusterx.visualization import plot_optimization_vs_number_of_clusters
from clusterx.visualization import plot_predictions
clsel = ClustersSelector('linreg', cpool, clusters_sets = "size")
```

the `clsel` object defined above, will select clusters using linear regression for the fitting of cluster interaction (`'linreg'` parameter), it will select the optimal model out of the pool of clusters contained in the `cpool` object, by building clusters pools of increasing cluster size (as indicated by the option `clusters_sets = "size"`) .

Now, we are prepared to call the `select_clusters` method of the `clsel` object. This will perform the actual clusters selection with an optimallity criterium based on cross-validation. The optimal clusters pool will be contained on the `optimal_clusters` member of `clsel`, which we assign to the `cp2` variable. Finally, the cluster optimisation process may be visualised with the function `plot_optimization_vs_number_of_clusters`:

```
clsel.select_clusters(comat,energies)
print("Optimal set of clusters:")
print("No. points radius")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
radius=[]
for i,c in enumerate(cp2):
print(i, c.npoints, c.radius)
print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))
from clusterx.visualization import plot_optimization_vs_number_of_clusters
plot_optimization_vs_number_of_clusters(clsel)
```

It is always useful to inspect how the predictions correlate to the data, this can be done with the visualisation function `plot_predictions`:

```
from clusterx.visualization import plot_predictions
plot_predictions(clsel,energies)
```

Now, we will demonstrate a different strategy for cluster selection, which is based on making a (partial) exhaustive list of clusters subsets (i.e. all possible combinations of clusters for certain clusters subset from the pool) and then select from that list according to smallest CV score:

```
clsel = ClustersSelector('linreg', cpool, clusters_sets = "size+combinations", nclmax = 1, set0 = [1,0])
clsel.select_clusters(comat,energies)
print("Optimal set of clusters:")
print("No. points radius")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
radius=[]
for i,c in enumerate(cp2):
print(i, c.npoints, c.radius)
print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))
plot_optimization_vs_number_of_clusters(clsel)
plot_predictions(clsel,energies)
```

Now, we try with a different `nclmax` and `set0`:

```
clsel = ClustersSelector('linreg', cpool, clusters_sets = "size+combinations", nclmax = 3, set0 = [2,3.5])
clsel.select_clusters(comat,energies)
print("Optimal set of clusters:")
print("No. points radius")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
radius=[]
for i,c in enumerate(cp2):
print(i, c.npoints, c.radius)
print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))
plot_optimization_vs_number_of_clusters(clsel)
plot_predictions(clsel,energies)
```

Another cluster optimisation strategy is based on LASSO (least absolute shrinkage and selection operator), which approximates the full combinatorial optimisation in an efficient way, by solving a convex optimisation problem:

```
clsel = ClustersSelector('lasso', cpool, sparsity_max=0.10, sparsity_min=0.001)
clsel.select_clusters(comat,energies)
print("Optimal set of clusters:")
print("No. points radius")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
radius=[]
for i,c in enumerate(cp2):
print(i, c.npoints, c.radius)
print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))
plot_optimization_vs_number_of_clusters(clsel)
from clusterx.visualization import plot_optimization_vs_sparsity
plot_optimization_vs_sparsity(clsel)
plot_predictions(clsel,energies)
```

[[/collapsible]]