Implementing EPW file management (renaming/moving) #2597

hironori-kondo · 2025-01-01T13:49:03Z

Happy New Year! I'm returning to the task of implementing EPW in quacc, and I was hoping to consult you on how to implement the file management, @Andrew-S-Rosen & @tomdemeyere.

The EPW code interfaces with Espresso outputs, but the files are expected to follow a slightly different organization/naming scheme. Two previous jobs are required: an nscf job and a phonon job. After running the phonon job, we run a post-processing Python script (which ships with EPW). This script performs a couple of checks on how the phonon job was run, then copies/renames the necessary files into a new directory that the EPW job reads.

I've read through some of the quacc code, and my thought is to modify quacc.utils.files.copy_decompress_files(). The current implementation accepts a single destination directory (fed in as the tmp folder), to which the filenames are appended. I propose the following modifications to quacc's file management behavior:

1. Add an optional argument like output_filenames to copy_decompress_files(). The default behavior would be unchanged, using the existing filenames argument for the output. Should output_filenames be specified, however, this behavior would be overridden, resulting in changed file paths.
2. Add output_filenames to quacc.runners.prep.calc_setup()
3. Add output_filenames to quacc.runners._base.BaseRunner.setup()
4. Add output_filenames to quacc.runners.ase.Runner.__init__()
5. Add output_filenames to quacc.recipes.espresso._base.run_and_summarize()
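
To make the first item concrete, here is a minimal standalone sketch of a copy helper with optional destination names. It is not quacc's actual copy_decompress_files(): the function name copy_files_with_outputs and its exact parameters are hypothetical, and decompression is left out entirely. It only illustrates the "source directory, source filenames, destination directory, destination filenames" shape.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Union


def copy_files_with_outputs(
    source_directory: Union[str, Path],
    filenames: List[str],
    destination_directory: Union[str, Path],
    output_filenames: Optional[List[str]] = None,
) -> List[Path]:
    """Copy files, optionally giving each one a new relative path.

    Hypothetical helper for illustration only; not part of quacc.
    """
    source_directory = Path(source_directory)
    destination_directory = Path(destination_directory)

    # Default behavior: keep the original relative paths.
    if output_filenames is None:
        output_filenames = filenames
    if len(output_filenames) != len(filenames):
        raise ValueError("filenames and output_filenames must have equal length")

    copied = []
    for src_name, dst_name in zip(filenames, output_filenames):
        src = source_directory / src_name
        dst = destination_directory / dst_name
        # Create any intermediate directories implied by the new name.
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Because output_filenames defaults to None, existing callers would see no change; only a job that passes it gets renamed or relocated files.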

If I haven't missed anything, the above should enable output_filenames to be optionally specified for any given Espresso job, enabling file movement/renaming during the copying step. The EPW file management would then look like the following:

1. Take the phonon and nscf output directories as arguments, e.g. in a list to the standard copy_files argument.
2. Determine which phonon and nscf files need to be copied over via a modified quacc.calculators.espresso.utils.prepare_copy_files(). Let's call this output updated_copy_files.
3. Create a dict of destination filenames for the above files. Let's call this output output_filenames.
4. Call run_and_summarize() with updated_copy_files and output_filenames as arguments.
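
The destination-name step might look something like the sketch below. The helper epw_output_filenames is hypothetical, and the naming pattern (dyn files to dyn_q, dvscf files to dvscf_q, everything under a save directory) is an assumption about what the EPW post-processing script does; it should be checked against the pp.py that ships with the EPW version in use.

```python
from typing import Dict


def epw_output_filenames(
    prefix: str, n_qpoints: int, save_dir: str = "save"
) -> Dict[str, str]:
    """Map phonon-job relative paths to the names an EPW job would read.

    Hypothetical helper; the naming pattern is assumed for illustration.
    """
    mapping = {}
    for iq in range(1, n_qpoints + 1):
        # Dynamical matrix for q-point iq.
        mapping[f"{prefix}.dyn{iq}"] = f"{save_dir}/{prefix}.dyn_q{iq}"
        # First-order variation of the potential: assumed to sit directly
        # in _ph0 for the first q-point and in a per-q subfolder otherwise.
        if iq == 1:
            dvscf_src = f"_ph0/{prefix}.dvscf1"
        else:
            dvscf_src = f"_ph0/{prefix}.q_{iq}/{prefix}.dvscf1"
        mapping[dvscf_src] = f"{save_dir}/{prefix}.dvscf_q{iq}"
    # The phsave directory keeps its name but moves under save_dir.
    mapping[f"_ph0/{prefix}.phsave"] = f"{save_dir}/{prefix}.phsave"
    return mapping


# Example: split the mapping into the two outputs named in the steps above.
mapping = epw_output_filenames("si", n_qpoints=2)
updated_copy_files = list(mapping)           # source paths in the phonon dir
output_filenames = list(mapping.values())    # matching destination paths
```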

This approach is a tad convoluted because the destination directory used by copy_decompress_files() (i.e., the tmp directory for the job) is not created until quacc.runners.prep.calc_setup() is called during runner initialization. As such, the desired behavior has to be inserted more deeply, somewhere between the directory's creation and the calculator call. My breakdown of some pros/cons of this approach:

Pros

Existing code for other jobs does not have to change
The renaming/moving does not have to get wrapped up in a separate function (the desired changes happen from the get-go during the existing copying step)
The phonon and nscf copying are handled together (which seems elegant)
The new copy_decompress_files() has a neat argument structure: source directory, source filenames, destination directory, destination filenames.

Cons

Messes with some fairly deep behavior that affects more than EPW/Espresso (and all for only one job)
Can be tedious if dealing with many files and only some need to be renamed (but this happens under the hood anyway; users don't have to deal with this)
Probably many more that don't immediately come to mind

How does the above sound? Do you have any wisdom/suggestions/preferences?

Andrew-S-Rosen · 2025-01-01T14:11:53Z
Maintainer

Thanks for your post. I'm out of town through Jan 3rd but will aim to reply shortly after I return. File handling is tricky... but looks like you have given this quite a bit of thought!

tomdemeyere · 2025-01-01T14:20:20Z

@hironori-kondo Just to make sure, is this the Python script in question?

6 replies

tomdemeyere Jan 1, 2025

The way I understand it, the script not only moves the files to another directory but also renames them, a process that is absolutely needed for EPW to work?

hironori-kondo Jan 2, 2025
Author

That is my understanding as well, yes. The idea is to take care of the moving and the renaming in one fell swoop.

N.B. The EPW script and examples all use the directory ./save for most of the renamed files, but that particular name is not required. The directory name can be changed in the EPW input file.

tomdemeyere Jan 2, 2025

I suggest implementing the same script (more or less) in espresso.utils (typically inside prepare_copy_files or similar). You can then create a new epw_flow that calls the nscf_job and phonon_job (or lets you pass in your own custom jobs) and then calls this newly created subroutine to move the files. Wouldn't that work?

hironori-kondo Jan 3, 2025
Author

I believe I've considered two flavors of this. Please correct me if I've misinterpreted.

Flavor 1: The utility moves and renames files within the phonon folder. Then prepare_copy_files returns the paths we need, and copy_files handles the copying. The last two steps are the same as any other job. The benefit of this approach is that nothing really deep has to change. The drawback is that the phonon folder has been altered, and any future job that wants to use it would have to change it back. This feels like a major drawback to me.

Flavor 2: The utility copies and renames the files into the epw folder. This is ideal. But the issue (in my mind) is that the epw folder does not exist until the runner is initialized, so this doesn't seem to work? I took some inspiration from your file management for grid_phonon_flow(); you prepare the filepaths via a custom utility and feed them into copy_files, and the actual copying action occurs after run_and_summarize is called, and after the runner is initialized. It seemed less sensible to me to modify the runner behavior than to just generalize copy_decompress_files() to handle different destinations.

Andrew-S-Rosen Jan 3, 2025
Maintainer

Chiming in very briefly while away from work:

> The drawback is that the phonon folder has been altered, and any future job that wants to use it would have to change it back. This feels like a major drawback to me.

Another subtle challenge with this has to do with databases. If enabled by the end-user, the output dictionary from each job can be stored in a database upon completion (when the summarizer is called). One of the keys in the JSON entry is the directory to the files. If this directory is ever altered, then the corresponding directory in the database entry will no longer exist, causing the end-user to have trouble linking their results to their raw data.

Anyway, I'll look at this closely when I return and can hopefully help with some suggestions! :)

Andrew-S-Rosen · 2025-01-04T21:48:44Z
Maintainer

@hironori-kondo: Thank you for the detailed writeup. I have a question that may help me provide better feedback. Is there an issue with having the post-processing step be its own job? This job would be very short and simple since it is just calling pp.py (or equivalent), and it will take as input the path of the directory to post process. Once this post-processing job is done, it sounds like everything should be ready for the next step in the workflow since files will be copied and renamed automatically as appropriate. This follow-up EPW job would then take as one of its required inputs the directory of the post-processed data. With this kind of setup, I don't see any need for messing around with file management or new functionality. Am I misunderstanding the procedure here?

I am imagining something like:

1. Run two parallel jobs: an NSCF job and a phonon job.
2. Once both of the above jobs have completed successfully, run a post-processing job via the as-shipped pp.py script.
3. When the post-processing job is done, run the EPW job.
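
The ordering of these steps can be sketched with the standard library alone. In quacc the functions would be @job-decorated and a workflow engine would handle the scheduling; here the job bodies are placeholders, and all function names, paths, and dictionary keys are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def nscf_job() -> dict:
    # Placeholder: a real job would run the NSCF calculation.
    return {"name": "nscf", "dir_name": "/path/to/nscf"}


def phonon_job() -> dict:
    # Placeholder: a real job would run the phonon calculation.
    return {"name": "phonon", "dir_name": "/path/to/phonon"}


def postprocess_job(phonon_dir: str) -> dict:
    # Placeholder for calling the as-shipped pp.py on phonon_dir.
    return {"name": "pp", "dir_name": phonon_dir}


def epw_job(nscf_dir: str, pp_dir: str) -> dict:
    # Placeholder: a real job would run EPW on the staged files.
    return {"name": "epw", "inputs": [nscf_dir, pp_dir]}


def epw_flow() -> dict:
    # Step 1: the two upstream jobs are independent, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        nscf_future = pool.submit(nscf_job)
        phonon_future = pool.submit(phonon_job)
        # .result() re-raises any exception from the job, so the later
        # steps never start if an upstream job failed.
        nscf = nscf_future.result()
        phonon = phonon_future.result()
    # Step 2: post-process only after both upstream jobs have finished.
    pp = postprocess_job(phonon["dir_name"])
    # Step 3: the EPW job takes the post-processed directory as input.
    return epw_job(nscf["dir_name"], pp["dir_name"])
```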

7 replies

Andrew-S-Rosen Jan 5, 2025
Maintainer

Thanks! That is helpful.

Option 1 was what I had in mind, but I see that the challenge there is you now have a lot of duplicated files eating up space on the disk. If it were just input files and log files, it'd probably not be a big deal and it'd be easier to just do this anyway. But if you are copying large volumetric objects like wavefunctions or charge densities, that's going to be unsustainable (?).

Option 2 does seem prone to user error...

@hironori-kondo: Your suggestion seems to make sense to me. You will have duplicate data, but perhaps that's okay. I leave that up to you. The post-processing script would still have to be its own @job though so that any workflow engine knows not to run it if the upstream phonon/NSCF jobs fail or are still running. Feel free to give a PR a go if you think this is suitable.

From a software design standpoint, it would probably be better for quacc to generate the temporary working directory the moment the @job-decorated function is called (rather than when the Runner is called), such that this metadata is available to access at any point of the task. That would avoid some of this headache since you could have an EPW job that starts off by copying files to that directory, runs pp.py, and then does whatever QE things you need to do there without needing a separate file staging folder. Alas, you live and learn. I think this is a limitation of the functional programming approach I took.

hironori-kondo Jan 5, 2025
Author

Indeed, the worry is that having a lot of duplicated files will be unsustainable. The files being copied are dynamical matrix data & the first-order variations of the potential.

I'm just slightly failing to follow which approach you're referring to here:

> Your suggestion seems to make sense to me. You will have duplicate data, but perhaps that's okay. I leave that up to you. The post-processing script would still have to be its own @job though so that any workflow engine knows not to run it if the upstream phonon/NSCF jobs fail or are still running. Feel free to give a PR a go if you think this is suitable.

You mention duplicate data and needing a separate @job, which makes me think you're referring to this:

> I can imagine a post-processing script that creates an epw directory within the completed phonon directory, which contains all the copied/renamed files. This would not interfere with future jobs that want to use the phonon directory, as the "standard" outputs are exactly as they should be. From there, we would require a second copying step via copy_files, as @tomdemeyere mentioned. The EPW job would read the epw sub-directory and copy its files from there instead of the main phonon directory. The drawback is having to copy a large number of files twice.

If I understood @tomdemeyere correctly, however, this is just

> The file copying must be done twice (once during the pp.py and once again during the quacc copy_files operation to the new quacc-tmpdir created during the EPW job).

which is what I've interpreted your "Option 1" to mean. This seems to run into the duplicate-file problem, unless I'm missing some understanding here.

If you instead mean my original thought of modifying copy_decompress_files(), I'm not sure I understand why there would be duplicate data and the need for a separate @job for post-processing. It seems that the EPW job can directly take the (unmodified) phonon and nscf directories as inputs, thereby keeping the workflow engine side of the street clean. The copying can be handled by intermediate functions like the existing calculators.espresso.utils.prepare_copy_files().

Andrew-S-Rosen Jan 5, 2025
Maintainer

@hironori-kondo: Apologies for the lack of clarity on my part. I was indeed referring to "I can imagine a post-processing..." that you posted earlier even though this basically runs into the same issue brought up in Tom's Option 1. I wasn't sure how problematic the file situation would be, but I see that is indeed impractical.

Personally and somewhat selfishly, I am rather indifferent provided that the maintenance burden is not too high. So, if it is thoroughly tested and doesn't involve undue copy/paste from elsewhere (like pp.py), I'm likely content with the approach. I am not opposed to making it possible to change the destination filenames with copy_decompress_files if you think that is the ideal approach here. I am having trouble thinking of a different mechanism for handling the I/O here that doesn't involve a staging ground where a duplicated set of files will live. I'm very much open to suggestions!

> Can be tedious if dealing with many files and only some need to be renamed

To solve this problem, you can add a keyword argument rename_files: dict | None = None. Passing something like {"MYFILE": "YOURFILE"} would then rename MYFILE to YOURFILE without requiring a full list with an entry for every file.
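
A minimal sketch of how such a rename_files mapping could behave is shown below. The function copy_and_rename is a hypothetical stand-in rather than quacc's copy_decompress_files(), and it handles plain files only. Any file without an entry in the dict keeps its original name.

```python
import shutil
from pathlib import Path
from typing import Dict, List, Optional, Union


def copy_and_rename(
    source_directory: Union[str, Path],
    filenames: List[str],
    destination_directory: Union[str, Path],
    rename_files: Optional[Dict[str, str]] = None,
) -> List[Path]:
    """Copy files, renaming only those listed in rename_files.

    Hypothetical helper for illustration only; not part of quacc.
    """
    source = Path(source_directory)
    destination = Path(destination_directory)
    rename_files = rename_files or {}

    copied = []
    for name in filenames:
        # Fall back to the original name when no rename is requested.
        new_name = rename_files.get(name, name)
        dst = destination / new_name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source / name, dst)
        copied.append(dst)
    return copied
```

With this shape, a call such as copy_and_rename(src, ["MYFILE", "OTHER"], dst, rename_files={"MYFILE": "YOURFILE"}) would write YOURFILE and OTHER into dst, so only the exceptions need to be spelled out.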

hironori-kondo Jan 5, 2025
Author

@Andrew-S-Rosen Thank you for the clarification. Understood re: maintenance burden. I'll keep thinking about a different mechanism and get started on the copy_decompress_files PR in the meantime. The dict suggestion is also very helpful.

As always, I appreciate your prompt and thorough replies (even through the holidays!)

Andrew-S-Rosen Jan 5, 2025
Maintainer

Of course! Thank you for your detailed comments and questions. I wish I had a simpler answer for you, but you have the right idea!
