
Commit c002499

committed: update doc and e3 example
1 parent 2f117f7 commit c002499

24 files changed: +657 -200 lines changed

README.md

Lines changed: 10 additions & 2 deletions

## About DeePTB

**DeePTB** is an innovative Python package that employs deep learning to construct electronic Hamiltonians, using either a minimal-basis Slater-Koster tight-binding model (**SKTB**) or a full LCAO basis with E3-equivariant neural networks (**E3TB**). It is designed to:

- Efficiently predict TB/LCAO Hamiltonians for large, unseen structures after training on smaller ones.
- Enable simulations of large systems under structural perturbation, as well as finite-temperature simulations integrating molecular dynamics (MD) for comprehensive atomic and electronic behavior.

For **SKTB**:

- Support customizable Slater-Koster parameterization with neural-network corrections for the local environment.
- Operate independently of the choice of basis and exchange-correlation functional, offering flexibility and adaptability.
- Handle systems with strong spin-orbit coupling (SOC) effects.

For **E3TB**:

- Support constructing DFT Hamiltonian/density and overlap matrices under a full LCAO basis.
- Utilize strictly local and semi-local E3-equivariant neural networks to achieve high data efficiency and accuracy.
- Speed up via SO(2) convolution to support LCAO bases containing f and g orbitals.

**DeePTB** is a versatile tool adaptable to a wide range of materials and phenomena, providing accurate and efficient simulations. See more details in our DeePTB papers: [sktb: arXiv:2307.04638](http://arxiv.org/abs/2307.04638), [e3tb: arXiv:2407.06053](https://arxiv.org/pdf/2407.06053)

## Installation

docs/advanced/e3module.md renamed to docs/advanced/e3tb/advanced_input.md

Lines changed: 11 additions & 43 deletions
# More on Input Parameters

In `common_options`, the user should define some global parameters like:
```JSON
"common_options": {
    ...
    "seed": 42
}
```
- `basis` should align with the basis used to perform the LCAO DFT calculations. The `"2s2p1d"` here indicates two `s` orbitals, two `p` orbitals, and one `d` orbital.
- `seed` controls the global random seed of all related packages.
- `dtype` can be chosen between `float32` and `float64`; the former is accurate enough in most cases.
- `device` can be set to `cuda:0`, `cuda:1`, and so on if multiple GPUs are available, where the number is the device id.
- `overlap` controls the fitting of the overlap matrix. If `overlap` is set to `true`, the user should provide the overlap in the dataset when configuring `data_options`.

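Putting the options above together, a minimal `common_options` sketch might look like the following; the values are illustrative (reusing the `"2s2p1d"` basis example above) and should be adjusted to your own system:

```JSON
"common_options": {
    "basis": {"Si": "2s2p1d"},
    "device": "cuda:0",
    "dtype": "float32",
    "overlap": true,
    "seed": 42
}
```
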
In `train_options`, a common setting looks like this:
```JSON
"train_options": {
    "num_epoch": 500,
    ...
    "lr_scheduler": {
        ...
        "min_lr": 1e-6
    },
    "loss_options": {
        "train": {"method": "hamil_abs", "onsite_shift": false},
        "validation": {"method": "hamil_abs", "onsite_shift": false}
    },
    "save_freq": 10,
    "validation_freq": 10,
    ...
}
```
For `lr_scheduler`, please ensure that `patience` x `num_samples` / `batch_size` lies between 2000 and 6000. For example, with 500 training samples and a batch size of 1, a `patience` between 4 and 12 satisfies this range.

When the dataset contains multiple elements and you are fitting the Hamiltonian, it is suggested to enable a tag in `loss_options` for better performance. Most DFT software allows a uniform shift when computing the electrostatic potential, which brings in an extra degree of freedom. The `onsite_shift` tag accommodates this freedom and makes the model generalizable to all sorts of element combinations:
```JSON
"loss_options": {
    "train": {"method": "hamil_abs", "onsite_shift": true},
    "validation": {"method": "hamil_abs", "onsite_shift": true}
}
```
In `model_options`, we support two types of E3 group-equivariant embedding methods: Strictly Localized Equivariant Message-passing (`slem`) and Localized Equivariant Message-passing (`lem`). The former ensures strict localization by truncating the propagation of distant neighbours' information, and is therefore suitable for bulk systems where electron localization is enhanced by scattering effects. The `lem` method, on the other hand, builds in such localization inherently by incorporating learnable decay functions that describe the dependency across distance.

The model options for `slem` and `lem` are the same; here is a short example:

...

`latent_dim`: The dimension of the scalar channels of the system. 32/64/128 is good enough.

For the prediction parameters, there is not much to change; the default settings work well.
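
Training is then started from the command line with `dptb train`, as noted elsewhere in the docs; a minimal invocation (the file names here are placeholders) looks like:

```bash
# start training with the prepared input file; write checkpoints to ./output
dptb train input.json -o ./output
```
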
docs/advanced/e3tb/data_preparation.md

Lines changed: 76 additions & 0 deletions
# Data Preparation
We suggest that users use the data-parsing tool [dftio](https://github.com/floatingCatty/dftio) to convert output data from DFT calculations directly into readable datasets. Our implementation supports the parsed dataset format of `dftio`. Users can simply clone the `dftio` repository and run `pip install .` in its root directory. Then the following command performs parallel data processing directly from the DFT output:
```bash
usage: dftio parse [-h] [-ll {DEBUG,3,INFO,2,WARNING,1,ERROR,0}] [-lp LOG_PATH] [-m MODE] [-n NUM_WORKERS] [-r ROOT] [-p PREFIX] [-o OUTROOT] [-f FORMAT] [-ham] [-ovp] [-dm] [-eig]

optional arguments:
  -h, --help            show this help message and exit
  -ll {DEBUG,3,INFO,2,WARNING,1,ERROR,0}, --log-level {DEBUG,3,INFO,2,WARNING,1,ERROR,0}
                        set verbosity level by string or number, 0=ERROR, 1=WARNING, 2=INFO and 3=DEBUG (default: INFO)
  -lp LOG_PATH, --log-path LOG_PATH
                        set log file to log messages to disk, if not specified, the logs will only be output to console (default: None)
  -m MODE, --mode MODE  The name of the DFT software. (default: abacus)
  -n NUM_WORKERS, --num_workers NUM_WORKERS
                        The number of workers used to parse the dataset. (For n>1, we use multiprocessing to accelerate io.) (default: 1)
  -r ROOT, --root ROOT  The root directory of the DFT files. (default: ./)
  -p PREFIX, --prefix PREFIX
                        The prefix of the DFT files under root. (default: frame)
  -o OUTROOT, --outroot OUTROOT
                        The output root directory. (default: ./)
  -f FORMAT, --format FORMAT
                        The output data format. (default: dat)
  -ham, --hamiltonian   Whether to parse the Hamiltonian matrix. (default: False)
  -ovp, --overlap       Whether to parse the Overlap matrix (default: False)
  -dm, --density_matrix
                        Whether to parse the Density matrix (default: False)
  -eig, --eigenvalue    Whether to parse the kpoints and eigenvalues (default: False)
```
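
For example, assuming ABACUS outputs stored under `./dft_data` in folders prefixed with `frame`, a hypothetical invocation that parses both the Hamiltonian and the overlap matrices into `./data` (all paths here are illustrative) would be:

```bash
# parse ABACUS outputs under ./dft_data with 4 workers,
# extracting Hamiltonian and overlap matrices into ./data
dftio parse -m abacus -n 4 -r ./dft_data -p frame -o ./data -f dat -ham -ovp
```
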
After parsing, the user needs to write an `info.json` file into the dataset. For the default dataset type, the `info.json` looks like:
```JSON
{
    "nframes": 1,
    "pos_type": "cart",
    "AtomicData_options": {
        "r_max": 7.0,
        "pbc": true
    }
}
```
Here `pos_type` can be `cart`, `dirc`, or `ase`. For `dftio`-output datasets, we use `cart` by default. The `r_max`, in principle, should align with the orbital cutoff used in the DFT calculation. For a single element, `r_max` should be a float, indicating the largest bond distance included. When the system has multiple atomic species, `r_max` can also be a dict of species-specific values like `{A: 7.0, B: 8.0}`; the largest `A-A` bond would then be 7.0, `A-B` would be (7.0+8.0)/2 = 7.5, and `B-B` would be 8.0. `pbc` can be a single bool, switching the periodic boundary conditions of the model on or off, or a list of three bools like `[true, true, false]`, which sets the periodicity of each direction independently.
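For instance, a hypothetical two-species `info.json` combining a species-specific `r_max` with direction-dependent periodicity (the element names are purely illustrative) could read:

```JSON
{
    "nframes": 1,
    "pos_type": "cart",
    "AtomicData_options": {
        "r_max": {"Si": 7.0, "O": 8.0},
        "pbc": [true, true, false]
    }
}
```
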
For the LMDB-type dataset, the `info.json` is much simpler; it looks like:
```JSON
{
    "r_max": 7.0
}
```
All other information is stored in the dataset itself. The LMDB dataset is designed for handling very large data that cannot fit into memory directly.

Then you can set the `data_options` in the input parameters to point directly to the prepared dataset, like:
```JSON
"data_options": {
    "train": {
        "root": "./data",
        "prefix": "Si64",
        "get_Hamiltonian": true,
        "get_overlap": true
    }
}
```

If you are using a Python script, the dataset can be built with the same parameters using `build_dataset`:
```Python
from dptb.data import build_dataset

dataset = build_dataset(
    root="your dataset root",
    type="DefaultDataset",
    prefix="frame",
    get_overlap=True,
    get_Hamiltonian=True,
    basis={"Si": "2s2p1d"}
)
```

docs/advanced/e3tb/index.rst

Lines changed: 11 additions & 0 deletions
=================================================
E3TB Advanced
=================================================

.. toctree::
   :maxdepth: 1
   :caption: Examples

   advanced_input
   data_preparation
   loss_analysis

docs/advanced/e3tb/loss_analysis.md

Lines changed: 125 additions & 0 deletions
# Loss Analysis
## function
**DeePTB** contains a module to help the user better understand the details of the error of the **E3TB** model.
We decompose the error of an **E3TB** model into several parts:
- onsite blocks: the diagonal blocks of the predicted quantum tensors; the onsite-block errors are further arranged according to the atom species.
- hopping blocks: the off-diagonal blocks; the hopping-block errors are further arranged according to the atom-pair types.

## usage
To use this function, we need a dataset and a model; just build them up in advance.
```Python
from dptb.data import build_dataset
from dptb.nn import build_model

dataset = build_dataset(
    root="your dataset root",
    type="DefaultDataset",
    prefix="frame",
    get_overlap=True,
    get_Hamiltonian=True,
    basis={"Si": "2s2p1d"}
)

# load the trained E3TB checkpoint
model = build_model("./ovp/checkpoint/nnenv.best.pth", common_options={"device": "cuda"})
model.eval()
```

Then the user should sample over the dataset using the dataloader and perform the analysis with a running average; the code looks like:
```Python
import torch
from dptb.nnops.loss import HamilLossAnalysis
from dptb.data.dataloader import DataLoader
from tqdm import tqdm
from dptb.data import AtomicData

ana = HamilLossAnalysis(idp=model.idp, device=model.device, decompose=True, overlap=True)

loader = DataLoader(dataset, batch_size=10, shuffle=False, num_workers=0)

for data in tqdm(loader, desc="doing error analysis"):
    with torch.no_grad():
        # convert the batch to an AtomicDataDict and move it to the GPU
        ref_data = AtomicData.to_AtomicDataDict(data.to("cuda"))
        # predict the quantum tensors and accumulate the error statistics
        data = model(ref_data)
        ana(data, ref_data, running_avg=True)
```
The analysis results are stored in `ana.stats`, which is a dictionary of statistics. The user can inspect the values directly, or display the results with:

```Python
ana.report()
```
Here is an example of the output:
```
TOTAL:
MAE: 0.00012021172733511776
RMSE: 0.00034208124270662665


Onsite:
Si:
MAE: 0.0012505357153713703
RMSE: 0.0023699181620031595
```
![MAE onsite](../../img/MAE_onsite.png)
![RMSE onsite](../../img/RMSE_onsite.png)

```
Hopping:
Si-Si:
MAE: 0.00016888207755982876
RMSE: 0.0003886453341692686
```
![MAE hopping](../../img/MAE_hopping.png)
![RMSE hopping](../../img/RMSE_hopping.png)

If the user wants to see the loss in a decomposed irreps format, set `decompose=True` when constructing the `HamilLossAnalysis` class (as was done above) and rerun the analysis. The decomposed irreps results can then be displayed with the following code:
```Python
import matplotlib.pyplot as plt
import torch

ana_result = ana.stats

for bt, err in ana_result["hopping"].items():
    print("rmse err for bond {bt}: {rmserr} \t mae err for bond {bt}: {maerr}".format(bt=bt, rmserr=err["rmse"], maerr=err["mae"]))

for at, err in ana_result["onsite"].items():
    print("rmse err for atom {at}: {rmserr} \t mae err for atom {at}: {maerr}".format(at=at, rmserr=err["rmse"], maerr=err["mae"]))

for bt, err in ana_result["hopping"].items():
    x = list(range(model.idp.orbpair_irreps.num_irreps))
    rmserr = err["rmse_per_irreps"]
    maerr = err["mae_per_irreps"]
    sort_index = torch.LongTensor(model.idp.orbpair_irreps.sort().inv)

    # rmserr = rmserr[sort_index]
    # maerr = maerr[sort_index]

    plt.figure(figsize=(20, 3))
    plt.bar(x, rmserr.cpu().detach(), label="RMSE per rme")
    plt.bar(x, maerr.cpu().detach(), alpha=0.6, label="MAE per rme")
    plt.legend()
    # plt.yscale("log")
    # plt.ylim([1e-5, 5e-4])
    plt.title("rme specific error of bond type: {bt}".format(bt=bt))
    plt.show()

for at, err in ana_result["onsite"].items():
    x = list(range(model.idp.orbpair_irreps.num_irreps))
    rmserr = err["rmse_per_irreps"]
    maerr = err["mae_per_irreps"]
    sort_index = torch.LongTensor(model.idp.orbpair_irreps.sort().inv)

    # reorder the per-irreps errors into the sorted-irreps order
    rmserr = rmserr[sort_index]
    maerr = maerr[sort_index]

    plt.figure(figsize=(20, 3))
    plt.bar(x, rmserr.cpu().detach(), label="RMSE per rme")
    plt.bar(x, maerr.cpu().detach(), alpha=0.6, label="MAE per rme")
    plt.legend()
    # plt.yscale("log")
    # plt.ylim([1e-5, 2.e-2])
    plt.title("rme specific error of atom type: {at}".format(at=at))
    plt.show()
```

docs/img/MAE_hopping.png (24.6 KB)

docs/img/MAE_onsite.png (26 KB)

docs/img/RMSE_hopping.png (25.7 KB)

docs/img/RMSE_onsite.png (23.3 KB)

docs/img/silicon_e3_band.png (538 KB)
