Conversation
@pkalita-lbl pkalita-lbl commented Jul 22, 2025

Fixes #1612

Summary

These changes set up API endpoints for managing various types of submission-related images. The images themselves are ultimately stored in a Google Cloud Storage (GCS) bucket; we only store information about object names in Postgres. These changes do not introduce any usage of these endpoints; that will be done as part of other work.

The new endpoints are:

  • POST /api/metadata_submission/{id}/image/signed_upload_url

    This endpoint is used to request a time-limited signed upload URL. Once the client receives this URL, it can use it to upload a file directly to GCS.

  • POST /api/metadata_submission/{id}/image/{image_type}

    This endpoint is expected to be called by the client after successfully uploading a file to GCS via a signed URL. There are three ways that an image can be associated with a submission: a single-valued PI headshot image, a single-valued primary study image, and a multivalued collection of other study images. The image_type parameter determines how the new storage object's name is associated with the submission.

  • DELETE /api/metadata_submission/{id}/image/{image_type}

    This endpoint is used to remove an image object association from a submission. The image_type parameter functions similarly to the above POST endpoint. When removing an image from the multivalued collection of study images, the particular image to remove must be indicated by also providing the image_name query parameter.
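Taken together, the expected client flow can be sketched as follows. This is a minimal sketch: the request/response field names (`url`, `object_name`) and the injected HTTP helpers are illustrative assumptions, not the actual API contract.

```python
# Sketch of the intended client-side flow across the three endpoints.
# The payload field names here are illustrative assumptions.

def upload_submission_image(submission_id, image_type, file_bytes, http_post, http_put):
    """Upload an image and associate it with a submission.

    `http_post` and `http_put` are injected HTTP helpers so the flow can
    be exercised without a live server.
    """
    # 1. Ask the API for a time-limited signed upload URL.
    signed = http_post(f"/api/metadata_submission/{submission_id}/image/signed_upload_url")

    # 2. Upload the file bytes directly to GCS using the signed URL.
    http_put(signed["url"], file_bytes)

    # 3. Tell the API the upload succeeded so it records the object name.
    return http_post(
        f"/api/metadata_submission/{submission_id}/image/{image_type}",
        json={"object_name": signed["object_name"]},
    )
```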

Details

Database changes

References to all images in GCS associated with submissions are stored in a new table named submission_images_object. The slightly awkward name reflects the fact that it holds information about objects stored in the nmdc-submission-images bucket.

Two new foreign key columns on the submission_metadata table, pi_image_name and primary_study_image_name, enable the two single-valued associations. A new association table (submission_study_image_association) enables the multivalued "other study images" association.
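The resulting schema can be approximated with the DDL below, run here against an in-memory SQLite database for illustration. The association table's column names and all of the types are assumptions; the authoritative schema lives in the Alembic migrations.

```python
# Illustrative DDL for the tables described above, executed against an
# in-memory SQLite database. Types, constraints, and the association
# table's column names are assumptions approximating the real migration.
import sqlite3

DDL = """
CREATE TABLE submission_images_object (
    name TEXT PRIMARY KEY  -- object name in the nmdc-submission-images bucket
);

CREATE TABLE submission_metadata (
    id TEXT PRIMARY KEY,
    pi_image_name TEXT REFERENCES submission_images_object (name),
    primary_study_image_name TEXT REFERENCES submission_images_object (name)
);

CREATE TABLE submission_study_image_association (
    submission_id TEXT REFERENCES submission_metadata (id),
    image_name TEXT REFERENCES submission_images_object (name),
    PRIMARY KEY (submission_id, image_name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```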

[Database schema diagram]

Why am I not just stuffing all of these image references into the venerable metadata_submission JSONB column?

  1. These fields will be updated by clients using dedicated API endpoints (by necessity because of the interaction with the GCS server). So having the information mixed into the JSONB object didn't seem advantageous in any way.
  2. More broadly speaking, I think the big JSONB object wasn't the best design decision from the get-go. There's nothing about the submission data that requires that much flexibility. So the first step towards moving away from that model is not adding to it.

Fake GCS server

In order to not force local development systems to talk to the real GCS server and incur the (admittedly small) cost of storing and transferring data, a GCS emulator is also included. The Docker image is called fake-gcs-server, and the code generally uses the terminology "fake" when referring to it as well.

The application will use the fake GCS server by default, but you can switch to the real one with an environment variable (NMDC_GCS_USE_FAKE). Even with the fake GCS server, you must provide Google Cloud credentials in the same way that you'd provide them to the real GCS server. I can explain why, but it gets quite in-the-weeds. Regardless, there are instructions in development.md for how to get that set up.

The Storage class

The new Storage class (nmdc_server/storage.py) provides the main interface to GCS operations. This class is responsible for setting up a Client and using it to perform operations on buckets and objects (referred to as "blobs" by the google-cloud-storage Python library methods; I tried to stick with the "objects" nomenclature but they're synonymous).

The nmdc_server/storage.py module also exports an instance (storage) of the Storage class which is intended to be used as a singleton elsewhere in the code.
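A minimal sketch of that pattern, with the google-cloud-storage client stubbed out so it runs without credentials. The endpoint URL and the environment-variable parsing shown here are assumptions, not the actual implementation in nmdc_server/storage.py.

```python
# Minimal sketch of the Storage pattern described above. A real
# implementation would construct a google.cloud.storage.Client here;
# the fake endpoint URL and env-var parsing are illustrative assumptions.
import os

class Storage:
    def __init__(self, use_fake=None):
        if use_fake is None:
            # Default to the fake GCS server unless explicitly disabled.
            use_fake = os.environ.get("NMDC_GCS_USE_FAKE", "true").lower() != "false"
        self.use_fake = use_fake
        # When using the fake server, point the client at the emulator;
        # otherwise let the client use the real GCS endpoint.
        self.endpoint = "http://fake-gcs-server:4443" if use_fake else None

# Module-level singleton, mirroring the `storage` instance the module exports.
storage = Storage()
```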

CLI commands

Two new nmdc-server CLI commands have been added:

  • nmdc-server storage init -- this command is responsible for ensuring that all the buckets (currently only one) that the code knows about actually exist in GCS. If a bucket does not exist and you're using the fake GCS server, the bucket will automatically be created. If a bucket does not exist and you're using the real GCS server, an exception will be raised.

    Why are we not automatically creating buckets on the real GCS server? Creating a bucket and configuring its CORS settings automatically on the real GCS server would involve giving more permissions to the service account used to access GCS. Since creating new buckets will be relatively rare, it seemed like a better tradeoff to keep the permissions limited and create buckets manually when needed.

    This command is called during application startup (start.sh) and prior to testing with tox.

  • nmdc-server storage vacuum -- this command will delete any object from GCS that is not referenced in the submission_images_object table. This command is not currently automatically called anywhere. Under normal operation the new endpoints will remove de-referenced GCS objects automatically. But I did find it useful during debugging and development. At some point we could consider running this command periodically in production just in case.
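The vacuum step reduces to a set difference, sketched below with the bucket modeled as a dict and the table as a set of referenced names. The real command uses the GCS client and a Postgres query instead.

```python
# Sketch of the vacuum logic: delete any stored object that is not
# referenced in the submission_images_object table.

def vacuum(bucket_objects, referenced_names):
    """Remove unreferenced objects in place; return the deleted names."""
    orphans = set(bucket_objects) - set(referenced_names)
    for name in orphans:
        del bucket_objects[name]
    return orphans
```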

Pydantic models

Because all of the content of the nmdc-submission-images bucket is private, a user needs a time-limited signed URL in order to download the image. This generation is managed in the SubmissionMetadataSchema Pydantic model so that it happens as one of the last steps of the object going "out the door" (i.e. being serialized to JSON). At that point, upstream code (e.g. in API request handlers) should have already determined that the requesting user is allowed to see images associated with the given submission.
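A sketch of that serialization-time signing, with the GCS signed-URL call replaced by a stand-in. The URL shape and field names here are hypothetical; the real code asks the GCS client for a signed URL inside the Pydantic model's serialization.

```python
# Illustration of generating time-limited signed URLs as the last step
# before a submission goes "out the door". The signing is a stand-in
# for the real GCS signed-URL generation.
import time

def sign_url(object_name, expires_in=3600, now=time.time):
    # Hypothetical URL shape; real signed URLs come from the GCS client.
    return f"https://storage.example/{object_name}?expires={int(now()) + expires_in}"

def serialize_submission(record):
    """Convert a stored record to its outgoing JSON-ready form,
    replacing stored object names with signed download URLs."""
    out = dict(record)
    if out.get("pi_image_name"):
        out["pi_image_url"] = sign_url(out.pop("pi_image_name"))
    return out
```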

I also took this opportunity to create a new Pydantic model (SubmissionMetadataSchemaListItem) which represents a "slim" view of a submission. It contains only the fields needed to render a submission in the SubmissionList.vue component. In particular, none of the information from the metadata_submission field is needed. This makes the response to GET /api/metadata_submission drastically smaller. And, conveniently, signed URL generation can be skipped because those fields are also not included in SubmissionMetadataSchemaListItem.

Finally, I fixed a handful of places where we were using the old-style way of converting a SQLAlchemy model to a Pydantic model:

some_pydantic_instance = SomePydanticModel(**some_sqla_instance.__dict__)

By replacing it with the newer style:

some_pydantic_instance = SomePydanticModel.model_validate(some_sqla_instance)
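One detail worth noting: for model_validate to accept an arbitrary object such as a SQLAlchemy instance (rather than a dict), the Pydantic v2 model must enable from_attributes. A minimal self-contained illustration, using a plain class as a stand-in for a SQLAlchemy instance:

```python
# model_validate() reads attributes off an object only when the model
# enables `from_attributes`; without it, passing a non-dict raises an error.
from pydantic import BaseModel, ConfigDict

class SomePydanticModel(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    id: int
    name: str

class FakeSqlaInstance:  # stand-in for a SQLAlchemy model instance
    id = 1
    name = "example"

instance = SomePydanticModel.model_validate(FakeSqlaInstance())
```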

@pkalita-lbl pkalita-lbl requested a review from Copilot July 22, 2025 22:02
@pkalita-lbl pkalita-lbl marked this pull request as ready for review July 30, 2025 20:10
@eecavanna

I'm planning on reviewing this more this evening (hopefully, to completion). Sorry about the delay.


@eecavanna eecavanna left a comment


LGTM! Thanks for waiting for me to finish reviewing it, setting the stage so thoroughly in the PR description, demonstrating so many scenarios in the automated tests, and... for working through all the GCS setup/ramp up.


@naglepuff naglepuff left a comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting up with all of my feedback! I say let's get this into dev.

@pkalita-lbl (Author)

Haha no I appreciate the feedback!

I'm going to merge main into this branch, re-linearize the Alembic migrations, update the dev Rancher configuration, and then merge.

@pkalita-lbl pkalita-lbl merged commit 29698c9 into main Aug 7, 2025
2 checks passed
@pkalita-lbl pkalita-lbl deleted the issue-1612-gcs-images branch August 7, 2025 21:10
Successfully merging this pull request may close these issues.

Set up infrastructure for storing submission media in object store