update input and e3tb hands on doc (#240)

floatingCatty · QG-phy · web-flow · commit 75e3b0662f25 · 2025-04-26T13:09:04.000+08:00
* update input and e3tb hands on doc

* Update argcheck.py

---------

Co-authored-by: QG-phy <guqq_phy@qq.com>
Co-authored-by: Qiangqiang Gu <98570179+QG-phy@users.noreply.github.com>
diff --git a/docs/advanced/e3tb/advanced_input.md b/docs/advanced/e3tb/advanced_input.md
@@ -86,6 +86,9 @@ In [7]: idp.get_irreps_ess()
 Out[7]: 7x0e+6x1o+6x2e+2x3o+1x4e
 ```
 
+Rules for the irreps setting:
+First, check the largest angular momentum defined in the DFT LCAO basis, then double it to get the highest order of the irreps (this follows from the addition rule of angular momentum). For example, for a `1s1p` basis, the irreps should contain features with angular momentum from 0 to 2, which is 2 times 1, the angular momentum of the `p` orbital. If the basis contains a `d` orbital, the irreps should contain angular momentum up to 4. `f`, `g`, and even higher orbitals are also supported.
+
 `n_layers`: indicates the number of layers of the networks.
 
 `env_embed_multiplicity`: decide the irreps number when initializing the edge and node features.
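
The doubling rule added in this hunk can be sketched in a few lines of Python. This is only an illustration: `max_irreps_l` and `ORBITAL_L` are hypothetical names introduced here, not part of `dptb`, and the parser assumes basis strings of the form `1s1p` or `2s2p1d`.

```python
import re

# Hypothetical helper (not a dptb API): map orbital letters to angular momentum.
ORBITAL_L = {"s": 0, "p": 1, "d": 2, "f": 3, "g": 4}

def max_irreps_l(basis: str) -> int:
    """Return twice the largest orbital angular momentum found in `basis`."""
    # Assumes each shell is written as <count><letter>, e.g. "1s1p".
    letters = re.findall(r"\d+([spdfg])", basis)
    if not letters:
        raise ValueError(f"no orbitals found in basis {basis!r}")
    return 2 * max(ORBITAL_L[ch] for ch in letters)

print(max_irreps_l("1s1p"))    # 2
print(max_irreps_l("2s2p1d"))  # 4
```

For the `1s1p` silicon basis this gives 2, matching the example in the text; a basis with a `d` orbital gives 4.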
diff --git a/docs/quick_start/hands_on/e3tb_hands_on.md b/docs/quick_start/hands_on/e3tb_hands_on.md
@@ -2,7 +2,7 @@
 
 DeePTB supports training an E3-equivariant model to predict the DFT Hamiltonian, density, and overlap matrices under an LCAO basis. Here, cubic-phase bulk silicon has been chosen as a quick-start example.
 
-Silicon is a chemical element; it has the symbol Si and atomic number 14. It is a hard, brittle crystalline solid with a blue-grey metallic lustre, and is a tetravalent metalloid and semiconductor. The prepared files are located in:
+Silicon is a chemical element; it has the symbol Si and atomic number 14. It is a hard, brittle crystalline solid with a blue-grey metallic lustre, and is a tetravalent metalloid and semiconductor (Shut up). The prepared files are located in:
 
 ```
 deeptb/examples/e3/
@@ -19,16 +19,13 @@ deeptb/examples/e3/
 |   `-- info.json
 `-- input.json
 ```
-We prepared one frame of silicon cubic bulk structure as an example. The data was computed using DFT software ABACUS, with an LCAO basis set containing 1 `s` and 1 `p` orbital. The cutoff radius for the orbital is 7au, which means the largest bond would be less than 14 au. Therefore, the r_max should be set as 7.4. So we have an info.json file like:
+We prepared one frame of silicon cubic bulk structure as an example. The data was computed using DFT software ABACUS, with an LCAO basis set containing 1 `s` and 1 `p` orbital. We now have an info.json file like:
 
-```json
+```JSON
 {
         "nframes": 1,
         "pos_type": "cart",
-        "AtomicData_options": {
-                "r_max": 7.4,
-                "pbc": true
-        }
+        "pbc": true, # same as [true, true, true]
 }
 ```
 
@@ -42,7 +39,7 @@ The `input_short.json` file contains the least number of parameters that are req
     "overlap": true
 }
 ```
-In `common_options`, here are the essential parameters. The `basis` should align with the DFT calculation, so 1 `s` and 1 `p` orbital would result in a `1s1p` basis. The `device` can either be `cpu` or `cuda`, but we highly recommend using `cuda` if GPU is available. The `overlap` tag controls whether to fit the overlap matrix together. Benefitting from our parameterization, the fitting overlap only brings negelectable costs, but would boost the convenience when using the model.
+In `common_options`, here are the essential parameters. The `basis` should align with the DFT calculation, so 1 `s` and 1 `p` orbital result in a `1s1p` basis. The cutoff radius of the orbitals is 7 au, which means the longest bond is shorter than 14 au (about 7.4 Å). Therefore `r_max`, which equals the maximum bond length, should be set to 7.4. The `device` can be either `cpu` or `cuda`, and we highly recommend `cuda` if a GPU is available. The `overlap` tag controls whether the overlap matrix is fitted as well. Thanks to our parameterization, fitting the overlap adds only negligible cost and makes the model much more convenient to use.
 
 Here comes the `model_options`:
 ```json
@@ -67,16 +64,26 @@ The `model_options` contains `embedding` and `prediction` parts, denoting the co
 
 In `embedding`, the `method` supports `slem` and `lem` for now, where `slem` has a strictly localized dependency, which has better transferability and data efficiency, while `lem` has an adjustable semi-local dependency, which has better representation capacity, but would require a little more data. `r_max` should align with the one defined in `info.json`.
 
-For `irreps_hidden`, this parameter defines the size of the hidden equivariant irreducible representation, which is highly related to the power of the model. There are certain rules to define this param. First, we should check the largest angular momentum defined in the DFT LCAO basis, the irreps's highest angular momentum should always be double. For example, for `1s1p` basis, the irreps should contain features with angular momentum from 0 to 2, which is 2 times 1, the angular momentum of `p` orbital. If the basis contains `d` orbital, then the irreps should contain angular momentum up to 4. `f` and `g` or even higher orbitals are also supported.
+The `irreps_hidden` parameter defines the size of the hidden equivariant irreducible representation, which largely determines the expressive power of the model. There are certain rules for defining this parameter, but for quick usage we provide a tool that analyzes the basis and extracts the essential irreps.
+
+```IPYTHON
+In [1]: from dptb.data import OrbitalMapper
+
+In [2]: idp = OrbitalMapper(basis={"Si": "1s1p"})
+
+In [3]: idp.get_irreps_ess()
+Out[3]: 2x0e+1x1o+1x2e
+```
 
-In `prediction`, we should use the `e3tb` method to let the model know the output features are arranged in **DeePTB-E3** format. The neurons are defined for a simple MLP to predict the slater-koster-like parameters for predicting the overlap matrix, for which [64,64] is usually fine.
+These are the independent irreps contained in the basis. The configured irreps should be a multiple of these essential irreps. The multiplicities can vary quite freely, but every type (here "0e", "1o", and "2e") must be included. We usually use descending multiplicities, starting from 32, 64, or 128 for the first "0e" and halving for each subsequent higher-order irreps. For the general rules on irreps, see the advanced topics in the docs; for now, you can safely ignore them.
 
+In `prediction`, we should use the `e3tb` method so that the model outputs features in the **DeePTB-E3** format. The `neurons` define a simple MLP that predicts the Slater-Koster-like parameters used for the overlap matrix, for which [64,64] is usually fine.
 
 Now everything is prepared! We can train the first model with the following command:
 
 ```bash
 cd deeptb/examples/e3
-dptb train ./input/input_short.json -o ./e3_silicon
+dptb train ./input_short.json -o ./e3_silicon
 ```
 
 Here `-o` indicates the output directory. During the fitting procedure, we can see the loss curve of silicon decreasing consistently. When finished, the fitting results are in the folder `e3_silicon`.
@@ -87,9 +94,9 @@ python plot_band.py
 ```
 or just using the command line 
 ```bash
-dptb run ./run/band.json -i ./e3_silicon/checkpoint/nnenv.best.pth -o ./band_plot
+dptb run ./band.json -i ./e3_silicon/checkpoint/nnenv.best.pth -o ./band_plot
 ```
 
 ![band_e3_Si](https://raw.githubusercontent.com/deepmodeling/DeePTB/main/docs/img/silicon_e3_band.png)
 
-Now you know how to train a **DeePTB-E3** model for Hamiltonian and overlap matrix. For better usage, we encourage the user to read the full input parameters for the **DeePTB-E3** model. Also, the **DeePTB** model supports several post-process tools, and the user can directly extract any predicted properties just using a few lines of code. Please see the basis_api for details.
+Now you know how to train a **DeePTB-E3** model for Hamiltonian and overlap matrix. For better usage, we encourage the user to read the full input parameters for the **DeePTB-E3** model. Also, the **DeePTB** model supports several post-process tools, and the user can directly extract any predicted properties just using a few lines of code. Please see the basis_api for details.
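
Two rules of thumb from this file's changes can be checked with a short Python sketch: the `r_max` value derived from the 7 au orbital cutoff, and the "start at 32 and halve" convention for `irreps_hidden`. Both function names are hypothetical and are not `dptb` APIs; the resulting irreps string is just one choice consistent with the rule described in the text, not necessarily the one used in the shipped example input.

```python
BOHR_TO_ANGSTROM = 0.529177210903  # Bohr radius in Angstrom (CODATA 2018)

def r_max_from_orbital_cutoff(cutoff_au: float) -> float:
    """Two orbitals of radius `cutoff_au` can overlap up to twice that distance."""
    return 2.0 * cutoff_au * BOHR_TO_ANGSTROM

def build_irreps(types, first_mult=32):
    """Assign halving multiplicities to each irreps type, never going below 1."""
    parts = []
    mult = first_mult
    for irrep_type in types:
        parts.append(f"{mult}x{irrep_type}")
        mult = max(1, mult // 2)
    return "+".join(parts)

# 14 au expressed in Angstrom; the text rounds this to 7.4.
print(round(r_max_from_orbital_cutoff(7.0), 2))  # 7.41
# Types taken from the essential irreps 2x0e+1x1o+1x2e shown above.
print(build_irreps(["0e", "1o", "2e"]))          # 32x0e+16x1o+8x2e
```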
diff --git a/docs/quick_start/input.md b/docs/quick_start/input.md
@@ -4,52 +4,56 @@ The following files are the central input files for DeePTB. Before executing the
 
 ## Inputs
 ### Data
-The dataset files contrains both the **atomic structure** and the **training label** information. 
+The dataset files contain both the **atomic structure** and the **training label** information.
 
-The atomic structure should be prepared as a ASE trajectory binary file, where each structure is stored using an **Atom** class defined in ASE package. The provided trajectory file must have suffix `.traj` and the length of the trajectory is `nframes`. For labels, we currently support `eigenvalues`, `Hamiltonian`, `density matrix` and `overlap matrix`. 
+The **atomic structure** contains the atomic positions, unit-cell vectors, and atomic numbers. They **must** be included in your data files for every task.
+The **training labels** depend on the task. In `DeePTB-SK` mode, the eigenvalues and k-points are needed. In `DeePTB-E3` mode, the Hamiltonian/density matrix under the LCAO basis must be provided, while the overlap matrix is optional (but we suggest providing it for convenience).
 
+The atomic structure should be prepared either as an ASE trajectory binary file or in plain text format. We highly recommend using the tool `dftio` for data preparation; it can convert DFT output to the target format automatically. Here, for completeness, we introduce each format.
 
-For training a **DeePTB-SK** model, we need to prepare the `eigenvalues` label, which contrains the `eigenvalues.npy` and `kpoints.npy`. A typical dataset of **DeePTB-SK** task looks like:
+- For the ASE trajectory binary file, each structure is stored as an **Atoms** object defined in the ASE package. The provided trajectory file must have the suffix `.traj`, and the length of the trajectory is `nframes`.
 
-```
-data/
--- set.x
--- -- eigenvalues.npy  # numpy array of fixed shape [nframes, nkpoints, nbands]
--- -- kpoints.npy      # numpy array of fixed shape [nkpoints, 3]
--- -- xdat.traj        # ase trajectory file with nframes
--- -- info.json        # defining the parameters used in building AtomicData graph data
-```
+- For the plain text format, three separate text files for the **atomic structures** need to be provided: `atomic_numbers.dat`, `cell.dat` and `positions.dat`. The length unit used in `cell.dat` and `positions.dat` (if Cartesian coordinates) is Angstrom.
 
-> We also support another format to provide structure information, instead of loading structures from a single binary `.traj` file. In this way, three seperate textfiles for **atomic structures** need to be provided: `atomic_numbers.dat`, `pbc.dat`, `cell.dat` and `positions.dat`. The length unit used in `cell.dat` and `positions.dat` (if cartesian coordinates) is Angstrom.
 
-The **band structures** data includes the kpoints list and eigenvalues in the binary format of `.npy`. The shape of kpoints data is fixed as **[nkpoints,3]** and eigenvalues is fixed as **[nframes,nkpoints,nbands]**. The `nframes` here must be the same as in **atomic structures** files.
+- For training a **DeePTB-SK** model, we need to prepare the `eigenvalues` label, which consists of `eigenvalues.npy` and `kpoints.npy`. A typical dataset for a **DeePTB-SK** task looks like:
 
-> **Important:** The eigenvalues.npy should not contain bands that contributed by the core electrons, which is not setted as the TB orbitals in model setting.
+    ```
+    data/
+    -- set.x
+    -- -- eigenvalues.npy  # numpy array of fixed shape [nframes, nkpoints, nbands]
+    -- -- kpoints.npy      # numpy array of fixed shape [nkpoints, 3]
+    -- -- xdat.traj        # ase trajectory file with nframes
+    -- -- info.json        # defining the parameters used in building AtomicData graph data
+    ```
 
-For typical **DeePTB-E3** task, we need to prepare the Hamiltonian/density matrix along with overlap matrix as labels. They are arranged as hdf5 binary format, and named as `hamiltonians.h5`/`density_matrices.h5` and `overlaps.h5` respectively. A typical dataset of **DeePTB-E3** looks like:
+    The **band structures** data includes the kpoints list and eigenvalues in the binary format of `.npy`. The shape of kpoints data is fixed as **[nkpoints,3]** and eigenvalues is fixed as **[nframes,nkpoints,nbands]**. The `nframes` here must be the same as in **atomic structures** files.
 
-```
-data/
--- set.x
--- -- positions.dat     # a text file with nframe x natom row and 3 col
--- -- pbc.dat           # a text file of three bool variables
--- -- cell.dat          # a text file with nframe x 3 row and 3 col, or 3 rol and 3 col.
--- -- atomic_numbers.dat    # a text file with nframe x natom row and 1 col
--- -- hamiltonian.h5    # a hdf5 dataset file with group named "0", "1", ..., "nframe". Each group contains a dict of {"i_j_Rx_Ry_Rz": numpy.ndarray} 
--- -- overlaps.h5       # a hdf5 dataset file with group named "0", "1", ..., "nframe". Each group contains a dict of {"i_j_Rx_Ry_Rz": numpy.ndarray} 
--- -- info.json
-```
+    > **Important:** `eigenvalues.npy` should not contain bands contributed by core electrons, which are not included among the TB orbitals in the model settings.
+
+- For a typical **DeePTB-E3** task, we need to prepare the Hamiltonian/density matrix along with the overlap matrix as labels. They are stored in HDF5 binary format and named `hamiltonians.h5`/`density_matrices.h5` and `overlaps.h5` respectively. A typical **DeePTB-E3** dataset looks like:
+    ```
+    data/
+    -- set.x
+    -- -- positions.dat     # a text file with nframe x natom row and 3 col
+    -- -- cell.dat          # a text file with nframe x 3 row and 3 col, or 3 row and 3 col.
+    -- -- atomic_numbers.dat    # a text file with nframe x natom row and 1 col
+    -- -- hamiltonian.h5    # a hdf5 dataset file with group named "0", "1", ..., "nframe". Each group contains a dict of {"i_j_Rx_Ry_Rz": numpy.ndarray} 
+    -- -- overlaps.h5       # a hdf5 dataset file with group named "0", "1", ..., "nframe". Each group contains a dict of {"i_j_Rx_Ry_Rz": numpy.ndarray} 
+    -- -- info.json
+    ```
 
 ### Data settings: info.json
 
 In **DeePTB**, the **atomic structures** and **band structures** data are stored in AtomicData graph structure. `info.json` defines the key parameters used in building AtomicData graph dataset, which looks like:
-```bash
+```JSON
 {
     "nframes": 1,
-    "pos_type": "ase/cart/frac"
+    "pos_type": "ase/cart/frac",
+    "pbc": [true, true, true]
 }
 ```
-`nframes` is the length of the trajectory, as we defined in the previous section. `pos_type` defines the input format of the **atomic structures**, which is set to `ase` if  ASE `.traj` file is provided, and `cart` or `frac` if cartesian / fractional coordinate in `positions.dat` file provided.
+`nframes` is the length of the trajectory, as defined in the previous section. `pos_type` defines the input format of the **atomic structures**: it is set to `ase` if an ASE `.traj` file is provided, and to `cart` or `frac` if Cartesian or fractional coordinates are provided in the `positions.dat` file. `pbc` specifies the periodic boundary conditions of the system. The three values correspond to the three lattice vectors given in the unit-cell information of the atomic data file.
 
 <!--In the `AtomicData_options` section, the key arguments in defining graph structure is provided. `r_max` is the maximum cutoff in building neighbour list for each atom. `er_max` and `oer_max` are optional value for additional environmental dependence TB parameterization in **DeePTB-SK** mode, such as strain correction and `nnenv`. All cutoff variables have the unit of Angstrom.
 For **DeePTB-SK**, We can get the recommended `r_max` value by `DeePTB`'s bond analysis function, using:
@@ -59,11 +63,12 @@ dptb bond <structure path> [[-c] <cutoff>] [[-acc] <accuracy>]
 
 For **DeePTB-E3**, we suggest the user align the `r_max` value to the LCAO basis's cutoff radius used in DFT calculation.
 -->
-For **DeePTB-SK** model, we should also specify the parameters in `info.json` that controls the fitting eigenvalues:
+For **DeePTB-SK** mode, we should also specify the parameters in `info.json` that control the eigenvalue fitting:
 ```JSON
 {
     "nframes": 1,
     "pos_type": "ase/cart/frac",
+    "pbc": [true, true, true],
     "bandinfo": {
         "band_min": 0,
         "band_max": 6,
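
As a rough illustration of the `info.json` layout described in this file, the following Python sketch builds and serializes the three top-level keys. The validation rules are inferred from the documentation text only and are an assumption; `make_info` is a hypothetical helper and does not reflect how `dptb` itself checks the file.

```python
import json

ALLOWED_POS_TYPES = {"ase", "cart", "frac"}

def make_info(nframes, pos_type, pbc):
    """Build an info.json-style dict after basic sanity checks."""
    if isinstance(nframes, bool) or not isinstance(nframes, int) or nframes < 1:
        raise ValueError("nframes must be a positive integer")
    if pos_type not in ALLOWED_POS_TYPES:
        raise ValueError(f"pos_type must be one of {sorted(ALLOWED_POS_TYPES)}")
    if isinstance(pbc, bool):
        # The hands-on doc notes that `true` is the same as [true, true, true].
        pbc = [pbc] * 3
    if len(pbc) != 3 or not all(isinstance(flag, bool) for flag in pbc):
        raise ValueError("pbc must be a bool or a list of three bools")
    return {"nframes": nframes, "pos_type": pos_type, "pbc": list(pbc)}

info = make_info(1, "cart", True)
print(json.dumps(info, indent=4))
```

Expanding a scalar `pbc` to a list keeps the written file in the explicit three-value form used in the examples of this document.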
diff --git a/dptb/entrypoints/train.py b/dptb/entrypoints/train.py
@@ -137,13 +137,13 @@ def train(
                             log.warning(f"{obj} in config file is not consistent with the checkpoint, using the one in checkpoint")
                             jdata["train_options"][obj] = f["config"]["train_options"][obj]
                 else:
-                    jdata["train_options"] = f["config"]["train_options"]
+                    jdata["train_options"] = f["config"]["train_options"] # restart can proceed without train_options
     
                 if jdata.get("model_options", None) is None or jdata["model_options"] != f["config"]["model_options"]:
                     log.warning("model_options in config file is not consistent with the checkpoint, using the one in checkpoint")
                     jdata["model_options"] = f["config"]["model_options"] # restart does not allow to change model options
             else:
-                # init model mode, allow model_options change
+                # init model mode, allow model_options change (Would it cause some error later if the param mismatch?)
                 if jdata.get("train_options", None) is None:
                     jdata["train_options"] = f["config"]["train_options"]
                 if jdata.get("model_options") is None:
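
The option-resolution logic touched by this hunk can be summarized with a simplified, standalone Python sketch. It is not the code in `train.py`: `resolve_options` is a hypothetical function, the per-key consistency check on `train_options` during restart is omitted, and the body of the final `if` (cut off in the hunk above) is assumed to fall back to the checkpoint in the same way as `train_options`.

```python
import logging

log = logging.getLogger(__name__)

def resolve_options(jdata, ckpt_config, restart):
    """Return a copy of `jdata` reconciled against the checkpoint config."""
    out = dict(jdata)
    if restart:
        # Restart can proceed without train_options: fall back to the checkpoint.
        if out.get("train_options") is None:
            out["train_options"] = ckpt_config["train_options"]
        # Restart does not allow changing model_options.
        if out.get("model_options") != ckpt_config["model_options"]:
            log.warning("model_options differ from the checkpoint, using the checkpoint's")
            out["model_options"] = ckpt_config["model_options"]
    else:
        # Init-model mode: user-supplied model_options are kept.
        if out.get("train_options") is None:
            out["train_options"] = ckpt_config["train_options"]
        if out.get("model_options") is None:
            # ASSUMPTION: this branch body is not visible in the diff hunk.
            out["model_options"] = ckpt_config["model_options"]
    return out

ckpt = {"train_options": {"num_epoch": 10}, "model_options": {"a": 1}}
print(resolve_options({"model_options": {"a": 2}}, ckpt, restart=True)["model_options"])
print(resolve_options({"model_options": {"a": 2}}, ckpt, restart=False)["model_options"])
```

The two calls show the difference between the modes: restart overrides a mismatched `model_options` with the checkpoint's, while init-model mode keeps the user's value.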
diff --git a/dptb/utils/argcheck.py b/dptb/utils/argcheck.py
@@ -1623,7 +1623,7 @@ def get_cutoffs_from_model_options(model_options):
                 oer_max = format_cuts(model_options["nnsk"]["onsite"]["rs"], model_options["nnsk"]["onsite"]["w"], 3)
 
     elif model_options.get("dftbsk", None) is not None:
-        assert r_max is None, "r_max should not be provided in outside the dftbsk for training dftbsk model."
+        assert r_max is None, "r_max should not be provided other than the dftbsk param section for training dftbsk model."
         r_max = model_options["dftbsk"].get("r_max")
 
     else:
diff --git a/examples/e3/data/info.json b/examples/e3/data/info.json
@@ -1,5 +1,5 @@
 {
 	"nframes": 1,
 	"pos_type": "cart",
-  "pbc": true
+  	"pbc": true
 }
