1. Requester Information:
- PI's Full Name: Professor Dr. Dan Ames
- PI's Affiliated Institute: Brigham Young University
- PI's Affiliated Email Address: dan.ames@byu.edu
- Graduate Research Assistant: Sujan Chandra Mondol, mondolsc@byu.edu
2. Project Information:
The project aims to migrate and transform the existing National Water Model (NWM) dataset and to create NWM-derived data products, widening data coverage and improving the user experience of the CIROH NWM API. It is part of a larger effort to redesign the NWM API for analysis-ready, need-specific data retrieval and LLM integration.
3. Project Description:
The project intends to migrate the NWM retrospective dataset (v3.0), currently hosted on Amazon AWS S3, to Google Cloud BigQuery to provide a faster and more convenient data service through the API. Because the existing data in AWS is stored in Zarr format while the data loaded into BigQuery must be tabular (e.g., Parquet), the workflow will be executed as an Extract, Transform, and Load pipeline (Apache Beam on Google Cloud Dataflow). The project also involves creating a new NWM-derived dataset (Streamflow Indices) from analysis performed on the newly loaded BigQuery table. Since this analysis scans the whole BigQuery table, it will be performed either as a single BigQuery SQL run or as a Vertex AI Workbench notebook script run.
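For reference, below is a minimal sketch of one way such an ETL pipeline could be structured with the Apache Beam Python SDK. The chunk layout, S3 path, destination table, and schema are illustrative assumptions rather than the final implementation, and reading Zarr directly from S3 assumes fsspec/s3fs is available on the workers.

```python
# Illustrative sketch only: chunk indices, paths, schema, and table names are assumptions.
import apache_beam as beam
import xarray as xr

ZARR_URI = "s3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr/chrtout.zarr"  # example source path
BQ_TABLE = "my-gcp-project:nwm.retrospective_v3"                          # hypothetical destination

def chunk_to_rows(chunk_index):
    """Open one (time, feature_id) chunk lazily and flatten it into BigQuery rows."""
    t0, f0 = chunk_index
    ds = xr.open_zarr(ZARR_URI, consolidated=True)            # requires s3fs on the workers
    sub = ds.isel(time=slice(t0, t0 + 672),                   # 672-step time chunk
                  feature_id=slice(f0, f0 + 30000))           # 30,000-reach feature chunk
    df = sub["streamflow"].to_dataframe().reset_index()
    for rec in df.itertuples(index=False):
        yield {
            "feature_id": int(rec.feature_id),
            "time": rec.time.isoformat(),
            "streamflow": float(rec.streamflow),
        }

with beam.Pipeline() as p:
    (p
     | "ChunkIndices" >> beam.Create([(t, f)
                                      for t in range(0, 385_000, 672)        # approx. 44 years hourly
                                      for f in range(0, 2_700_000, 30_000)]) # approx. reach count
     | "ChunksToRows" >> beam.FlatMap(chunk_to_rows)
     | "LoadToBQ" >> beam.io.WriteToBigQuery(
           BQ_TABLE,
           schema="feature_id:INTEGER,time:TIMESTAMP,streamflow:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```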
4. Resource Requirements:
- Dataflow: To run the Apache Beam ETL pipeline that transforms and migrates the NWM Retrospective 3.0 data (51+ TB in Zarr format). A configuration sketch is given after this list.
  - Batch job running for 24 hours with 50 workers of the n2-highmem-4 machine type (4 vCPUs + 32 GB), 2 TB persistent storage, and the region set to Iowa (us-central1).
- Google BigQuery: Two new BigQuery tables will be created and maintained throughout the API operation period. The second table requires a full-table, query-based analysis.
  - First table storage: 250 TB
  - Second table storage: 100 GB
- Google Cloud Storage:
  - Required to store temporary and staging files generated during the Dataflow run.
    - 250 TB during the Dataflow job run period
  - The S3-to-Google-Cloud and Parquet-to-BigQuery transfer steps may require a temporary data sink. I need your advice on whether it would be cost-effective and efficient to have Dataflow read the data directly from AWS S3 and sink it to BigQuery, considering the latency introduced by crossing storage services and the sequential loading behavior caused by parallelization in Beam.
    - (optional) 60 TB for three days for GCS Zarr storage
    - (optional) 10 TB for 10 days for GCS Parquet storage
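A minimal sketch of the corresponding Dataflow job configuration (Apache Beam Python SDK) is shown below. The project ID, bucket names, and job name are placeholders; the flags mirror the resources requested above.

```python
# Sketch of a Dataflow worker configuration matching the requested resources.
# Project ID, bucket names, and job name are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",              # placeholder project ID
    "--region=us-central1",                  # Iowa
    "--worker_machine_type=n2-highmem-4",    # 4 vCPUs, 32 GB RAM
    "--num_workers=50",
    "--max_num_workers=50",                  # cap autoscaling at the requested size
    "--disk_size_gb=2000",                   # ~2 TB persistent disk (this flag is per worker)
    "--temp_location=gs://my-nwm-etl-temp/tmp",        # temporary GCS files
    "--staging_location=gs://my-nwm-etl-temp/staging", # staging GCS files
    "--job_name=nwm-retro-v3-zarr-to-bigquery",
])

# The options object would then be passed to the pipeline, e.g.:
# with beam.Pipeline(options=options) as p:
#     ...
```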
5. Working Group Detail:
Hydroinformatics
6. Timeline:
The resources are requested for the whole month of July 2025.
7. Security and Compliance Requirements:
- The Dataflow job should be allowed to run to completion unless it consumes resources well beyond the estimate; otherwise the desired product will not be produced and the execution will have wasted resources.
- Please provide technical support to review the code and scripts for any anomalies that could cause unwanted costs.
- The created BigQuery tables could be nominated for adoption as public datasets to eliminate operational storage costs.
8. Estimation:
Note: The estimates are rough approximations and may not reflect the final implementation. They require your further guidance and suggestions for change, both in the technical details and in how they are presented.
- Dataflow: Batch job running for 24 hours with 50 workers of the n2-highmem-4 machine type (4 vCPUs + 32 GB), 2 TB persistent storage, and the region set to Iowa (us-central1): approximately $530.
  This is estimated to transform the 51+ TB Zarr store to Parquet or BigQuery. Local Beam execution took about 6-10 seconds (assume 15 seconds for cloud execution) and at most 15 GB of memory to process a single locally hosted chunk (time 672 × feature_id 30,000) with a single worker. An n2-highmem-4 machine provides the needed 15 GB of memory but limits the usable Beam workers to one or two of its four vCPUs. Choosing 50 VM workers therefore makes 50 to 100 Beam workers available, and processing the roughly 52,000 chunks should take about (52,000 × 15) / (3,600 × 50) ≈ 4.3 hours. Considering the latency from S3 to Google Cloud, we conservatively estimate a 24-hour run. Additionally, the roughly 10 GB of temporary storage per chunk may use about 300 GB per batch of runs, so the persistent disk is rounded up to 2,000 GB.
- BigQuery Storage: Since the NWM Retrospective 3.0 dataset (44 years, hourly) should have about six times the rows and a similar number of columns as the NWM Analysis-and-Assimilation data table (7 years, hourly; 40 TB), the first table is estimated at around 250 TB of storage, costing approximately $5,887.77/month. This is why publishing it as a public dataset would be beneficial.
  The second table will hold feature-based records rather than time series; compared to the return-period dataset, it should occupy storage on the order of GBs to tens of GBs if yearly indices are computed, costing only a few dollars per month.
- BigQuery Query Run: A comparable query on the Analysis-and-Assimilation data scans about one-fourth of the table (10 TB out of 40 TB), so the analysis on the R3.0 table is expected to scan roughly 250/4 ≈ 60 TB of data, costing about $400 one time (see the dry-run sketch after this list).
- GCS Storage, Temporary Caching: about $170 (if dictionaries for BigQuery loading) or $10 (if Parquet) for one day.
- GCS Storage (optional): 60 TB for three days of GCS Zarr storage costs about $125, and 10 TB for 10 days of GCS Parquet storage costs about $70.
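As a cost-control aid, a dry-run query could confirm the scanned-byte estimate before the full-table analysis is executed. Below is a minimal sketch using the google-cloud-bigquery Python client; the table and column names are placeholders, and the ~$6.25 per TiB figure is an assumption based on published BigQuery on-demand pricing.

```python
# Dry-run sketch: estimate bytes scanned (and thus on-demand cost) before the real run.
# Table and column names are placeholders; the $/TiB rate is an assumption.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT feature_id,
       EXTRACT(YEAR FROM time) AS year,
       AVG(streamflow) AS mean_flow,
       MAX(streamflow) AS peak_flow
FROM `my-gcp-project.nwm.retrospective_v3`
GROUP BY feature_id, year
"""

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)  # dry run: validates and prices, processes nothing

tib_scanned = job.total_bytes_processed / 1024**4
print(f"Estimated scan: {tib_scanned:.1f} TiB "
      f"(~${tib_scanned * 6.25:.0f} at the assumed on-demand rate)")
```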
9. Approval:
- Necessary IAM access should be provided for the required resources at a level sufficient for creating, managing, and viewing the components.
- The execution of this project may exceed the regular computation cost allowance, so approval to exceed that allowance is requested.