
Commit c002499

committed: update doc and e3 example
1 parent 2f117f7 commit c002499

24 files changed: +657 -200 lines changed

README.md

Lines changed: 10 additions & 2 deletions

## About DeePTB

**DeePTB** is an innovative Python package that employs deep learning to construct electronic Hamiltonians, using either a minimal-basis Slater-Koster tight-binding model (**SKTB**) or a full LCAO basis with E3-equivariant neural networks (**E3TB**). It is designed to:

- Efficiently predict TB/LCAO Hamiltonians for large, unseen structures after training on smaller ones.
- Enable simulations of large systems under structural perturbation, as well as finite-temperature simulations integrating molecular dynamics (MD) for comprehensive atomic and electronic behavior.

For **SKTB**:

- Support customizable Slater-Koster parameterization with neural-network corrections for the local environment.
- Operate independently of the choice of basis and exchange-correlation functional, offering flexibility and adaptability.
- Handle systems with strong spin-orbit coupling (SOC) effects.

For **E3TB**:

- Support constructing DFT Hamiltonian/density and overlap matrices under a full LCAO basis.
- Utilize strictly local and semi-local E3-equivariant neural networks to achieve high data efficiency and accuracy.
- Speed up via SO(2) convolution to support LCAO bases containing f and g orbitals.

**DeePTB** is a versatile tool adaptable to a wide range of materials and phenomena, providing accurate and efficient simulations. See more details in our DeePTB papers: [sktb: arXiv:2307.04638](http://arxiv.org/abs/2307.04638), [e3tb: arXiv:2407.06053](https://arxiv.org/pdf/2407.06053)

## Installation

docs/advanced/e3module.md renamed to docs/advanced/e3tb/advanced_input.md

Lines changed: 11 additions & 43 deletions
# More on Input Parameters

In `common_options`, the user should define some global parameters like:
```JSON
"common_options": {
    ...
    "seed": 42
}
```
- `basis` should align with the basis used to perform the LCAO DFT calculations. The `"2s2p1d"` here indicates two `s` orbitals, two `p` orbitals, and one `d` orbital.
- `seed` controls the global random seed of all related packages.
- `dtype` can be chosen between `float32` and `float64`; the former is accurate enough in most cases.
- `device` can be set to `cuda:0`, `cuda:1`, and so on if multiple GPUs are available, where the number is the device id.
- `overlap` controls the fitting of the overlap matrix. If `overlap` is set to `true`, the user should provide the overlap in the dataset when configuring `data_options`.

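Putting the options above together, a minimal `common_options` sketch might look like the following; the values are illustrative (reusing the `"2s2p1d"` basis example above) and should be adjusted to your own system:

```JSON
"common_options": {
    "basis": {"Si": "2s2p1d"},
    "device": "cuda:0",
    "dtype": "float32",
    "overlap": true,
    "seed": 42
}
```
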
In `train_options`, a common setting looks like this:
```JSON
"train_options": {
    "num_epoch": 500,
    ...
    "lr_scheduler": {
        ...
        "min_lr": 1e-6
    },
    "loss_options": {
        "train": {"method": "hamil_abs", "onsite_shift": false},
        "validation": {"method": "hamil_abs", "onsite_shift": false}
    },
    "save_freq": 10,
    "validation_freq": 10,
    ...
}
```
For `lr_scheduler`, please ensure that `patience` x `num_samples` / `batch_size` lies between 2000 and 6000. For example, with 500 training samples and a batch size of 1, a `patience` between 4 and 12 satisfies this range.

When the dataset contains multiple elements and you are fitting the Hamiltonian, it is suggested to enable a tag in `loss_options` for better performance. Most DFT software allows a uniform shift when computing the electrostatic potential, which brings in an extra degree of freedom. The `onsite_shift` tag accommodates this freedom and makes the model generalizable to all sorts of element combinations:
```JSON
"loss_options": {
    "train": {"method": "hamil_abs", "onsite_shift": true},
    "validation": {"method": "hamil_abs", "onsite_shift": true}
}
```
In `model_options`, we support two types of E3 group-equivariant embedding methods: Strictly Localized Equivariant Message-passing (`slem`) and Localized Equivariant Message-passing (`lem`). The former ensures strict localization by truncating the propagation of distant neighbours' information, and is therefore suitable for bulk systems where electron localization is enhanced by scattering effects. The `lem` method, on the other hand, builds in such localization inherently by incorporating learnable decay functions that describe the dependency across distance.

The model options for `slem` and `lem` are the same; here is a short example:

...

`latent_dim`: The dimension of the scalar channels of the system. 32/64/128 is good enough.

For the prediction parameters, there is not much to change; the default settings work well.
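
Training is then started from the command line with `dptb train`, as noted elsewhere in the docs; a minimal invocation (the file names here are placeholders) looks like:

```bash
# start training with the prepared input file; write checkpoints to ./output
dptb train input.json -o ./output
```
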
docs/advanced/e3tb/data_preparation.md

Lines changed: 76 additions & 0 deletions
# Data Preparation
We suggest that users use the data-parsing tool [dftio](https://github.com/floatingCatty/dftio) to convert output data from DFT calculations directly into readable datasets. Our implementation supports the parsed dataset format of `dftio`. Users can simply clone the `dftio` repository and run `pip install .` in its root directory. Then the following command performs parallel data processing directly from the DFT output:
```bash
usage: dftio parse [-h] [-ll {DEBUG,3,INFO,2,WARNING,1,ERROR,0}] [-lp LOG_PATH] [-m MODE] [-n NUM_WORKERS] [-r ROOT] [-p PREFIX] [-o OUTROOT] [-f FORMAT] [-ham] [-ovp] [-dm] [-eig]

optional arguments:
  -h, --help            show this help message and exit
  -ll {DEBUG,3,INFO,2,WARNING,1,ERROR,0}, --log-level {DEBUG,3,INFO,2,WARNING,1,ERROR,0}
                        set verbosity level by string or number, 0=ERROR, 1=WARNING, 2=INFO and 3=DEBUG (default: INFO)
  -lp LOG_PATH, --log-path LOG_PATH
                        set log file to log messages to disk, if not specified, the logs will only be output to console (default: None)
  -m MODE, --mode MODE  The name of the DFT software. (default: abacus)
  -n NUM_WORKERS, --num_workers NUM_WORKERS
                        The number of workers used to parse the dataset. (For n>1, we use multiprocessing to accelerate io.) (default: 1)
  -r ROOT, --root ROOT  The root directory of the DFT files. (default: ./)
  -p PREFIX, --prefix PREFIX
                        The prefix of the DFT files under root. (default: frame)
  -o OUTROOT, --outroot OUTROOT
                        The output root directory. (default: ./)
  -f FORMAT, --format FORMAT
                        The output data format. (default: dat)
  -ham, --hamiltonian   Whether to parse the Hamiltonian matrix. (default: False)
  -ovp, --overlap       Whether to parse the Overlap matrix (default: False)
  -dm, --density_matrix
                        Whether to parse the Density matrix (default: False)
  -eig, --eigenvalue    Whether to parse the kpoints and eigenvalues (default: False)
```
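
For example, assuming ABACUS outputs stored under `./dft_data` in folders prefixed with `frame`, a hypothetical invocation that parses both the Hamiltonian and the overlap matrices into `./data` (all paths here are illustrative) would be:

```bash
# parse ABACUS outputs under ./dft_data with 4 workers,
# extracting Hamiltonian and overlap matrices into ./data
dftio parse -m abacus -n 4 -r ./dft_data -p frame -o ./data -f dat -ham -ovp
```
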
After parsing, the user needs to write an `info.json` file into the dataset. For the default dataset type, the `info.json` looks like:
```JSON
{
    "nframes": 1,
    "pos_type": "cart",
    "AtomicData_options": {
        "r_max": 7.0,
        "pbc": true
    }
}
```
Here `pos_type` can be `cart`, `dirc`, or `ase`. For `dftio`-output datasets, we use `cart` by default. The `r_max`, in principle, should align with the orbital cutoff used in the DFT calculation. For a single element, `r_max` should be a float, indicating the largest bond distance included. When the system has multiple atomic species, `r_max` can also be a dict of species-specific values like `{A: 7.0, B: 8.0}`; the largest `A-A` bond would then be 7.0, `A-B` would be (7.0+8.0)/2 = 7.5, and `B-B` would be 8.0. `pbc` can be a single bool, switching the periodic boundary conditions of the model on or off, or a list of three bools like `[true, true, false]`, which sets the periodicity of each direction independently.
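For instance, a hypothetical two-species `info.json` combining a species-specific `r_max` with direction-dependent periodicity (the element names are purely illustrative) could read:

```JSON
{
    "nframes": 1,
    "pos_type": "cart",
    "AtomicData_options": {
        "r_max": {"Si": 7.0, "O": 8.0},
        "pbc": [true, true, false]
    }
}
```
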
For the LMDB-type dataset, the `info.json` is much simpler; it looks like:
```JSON
{
    "r_max": 7.0
}
```
All other information is stored in the dataset itself. The LMDB dataset is designed for handling very large data that cannot fit into memory directly.

Then you can set the `data_options` in the input parameters to point directly to the prepared dataset, like:
```JSON
"data_options": {
    "train": {
        "root": "./data",
        "prefix": "Si64",
        "get_Hamiltonian": true,
        "get_overlap": true
    }
}
```

If you are using a Python script, the dataset can be built with the same parameters using `build_dataset`:
```Python
from dptb.data import build_dataset

dataset = build_dataset(
    root="your dataset root",
    type="DefaultDataset",
    prefix="frame",
    get_overlap=True,
    get_Hamiltonian=True,
    basis={"Si": "2s2p1d"}
)
```

docs/advanced/e3tb/index.rst

Lines changed: 11 additions & 0 deletions
=================================================
E3TB Advanced
=================================================

.. toctree::
   :maxdepth: 1
   :caption: Examples

   advanced_input
   data_preparation
   loss_analysis

docs/advanced/e3tb/loss_analysis.md

Lines changed: 125 additions & 0 deletions
# Loss Analysis
## function
**DeePTB** contains a module to help the user better understand the details of the error of the **E3TB** model.
We decompose the error of an **E3TB** model into several parts:
- onsite blocks: the diagonal blocks of the predicted quantum tensors; the onsite-block errors are further arranged according to the atom species.
- hopping blocks: the off-diagonal blocks; the hopping-block errors are further arranged according to the atom-pair types.

## usage
To use this function, we need a dataset and a model; just build them up in advance.
```Python
from dptb.data import build_dataset
from dptb.nn import build_model

dataset = build_dataset(
    root="your dataset root",
    type="DefaultDataset",
    prefix="frame",
    get_overlap=True,
    get_Hamiltonian=True,
    basis={"Si": "2s2p1d"}
)

# load the trained E3TB checkpoint
model = build_model("./ovp/checkpoint/nnenv.best.pth", common_options={"device": "cuda"})
model.eval()
```

Then the user should sample over the dataset using the dataloader and perform the analysis with a running average; the code looks like:
```Python
import torch
from dptb.nnops.loss import HamilLossAnalysis
from dptb.data.dataloader import DataLoader
from tqdm import tqdm
from dptb.data import AtomicData

ana = HamilLossAnalysis(idp=model.idp, device=model.device, decompose=True, overlap=True)

loader = DataLoader(dataset, batch_size=10, shuffle=False, num_workers=0)

for data in tqdm(loader, desc="doing error analysis"):
    with torch.no_grad():
        # convert the batch to an AtomicDataDict and move it to the GPU
        ref_data = AtomicData.to_AtomicDataDict(data.to("cuda"))
        # predict the quantum tensors and accumulate the error statistics
        data = model(ref_data)
        ana(data, ref_data, running_avg=True)
```
The analysis results are stored in `ana.stats`, which is a dictionary of statistics. The user can inspect the values directly, or display the results with:

```Python
ana.report()
```
Here is an example of the output:
```
TOTAL:
MAE: 0.00012021172733511776
RMSE: 0.00034208124270662665


Onsite:
Si:
MAE: 0.0012505357153713703
RMSE: 0.0023699181620031595
```
![MAE onsite](../../img/MAE_onsite.png)
![RMSE onsite](../../img/RMSE_onsite.png)

```
Hopping:
Si-Si:
MAE: 0.00016888207755982876
RMSE: 0.0003886453341692686
```
![MAE hopping](../../img/MAE_hopping.png)
![RMSE hopping](../../img/RMSE_hopping.png)

If the user wants to see the loss in a decomposed irreps format, set `decompose=True` when constructing the `HamilLossAnalysis` class (as was done above) and rerun the analysis. The decomposed irreps results can then be displayed with the following code:
```Python
import matplotlib.pyplot as plt
import torch

ana_result = ana.stats

for bt, err in ana_result["hopping"].items():
    print("rmse err for bond {bt}: {rmserr} \t mae err for bond {bt}: {maerr}".format(bt=bt, rmserr=err["rmse"], maerr=err["mae"]))

for at, err in ana_result["onsite"].items():
    print("rmse err for atom {at}: {rmserr} \t mae err for atom {at}: {maerr}".format(at=at, rmserr=err["rmse"], maerr=err["mae"]))

for bt, err in ana_result["hopping"].items():
    x = list(range(model.idp.orbpair_irreps.num_irreps))
    rmserr = err["rmse_per_irreps"]
    maerr = err["mae_per_irreps"]
    sort_index = torch.LongTensor(model.idp.orbpair_irreps.sort().inv)

    # rmserr = rmserr[sort_index]
    # maerr = maerr[sort_index]

    plt.figure(figsize=(20, 3))
    plt.bar(x, rmserr.cpu().detach(), label="RMSE per rme")
    plt.bar(x, maerr.cpu().detach(), alpha=0.6, label="MAE per rme")
    plt.legend()
    # plt.yscale("log")
    # plt.ylim([1e-5, 5e-4])
    plt.title("rme specific error of bond type: {bt}".format(bt=bt))
    plt.show()

for at, err in ana_result["onsite"].items():
    x = list(range(model.idp.orbpair_irreps.num_irreps))
    rmserr = err["rmse_per_irreps"]
    maerr = err["mae_per_irreps"]
    sort_index = torch.LongTensor(model.idp.orbpair_irreps.sort().inv)

    # reorder the per-irreps errors into the sorted-irreps order
    rmserr = rmserr[sort_index]
    maerr = maerr[sort_index]

    plt.figure(figsize=(20, 3))
    plt.bar(x, rmserr.cpu().detach(), label="RMSE per rme")
    plt.bar(x, maerr.cpu().detach(), alpha=0.6, label="MAE per rme")
    plt.legend()
    # plt.yscale("log")
    # plt.ylim([1e-5, 2.e-2])
    plt.title("rme specific error of atom type: {at}".format(at=at))
    plt.show()
```

docs/img/MAE_hopping.png (24.6 KB)

docs/img/MAE_onsite.png (26 KB)

docs/img/RMSE_hopping.png (25.7 KB)

docs/img/RMSE_onsite.png (23.3 KB)

docs/img/silicon_e3_band.png (538 KB)
