Conversation
@pkalita-lbl pkalita-lbl commented Jul 22, 2025

Fixes #1612

Summary

These changes set up API endpoints for managing various types of submission-related images. The images themselves are ultimately stored in a Google Cloud Storage (GCS) bucket; we only store information about object names in Postgres. These changes do not introduce any usage of these endpoints; that will be done as part of other work.

The new endpoints are:

  • POST /api/metadata_submission/{id}/image/signed_upload_url

    This endpoint is used to request a time-limited signed upload URL. Once the client receives this URL, it can use it to upload a file directly to GCS.

  • POST /api/metadata_submission/{id}/image/{image_type}

    This endpoint is expected to be called by the client after successfully uploading a file to GCS via a signed URL. There are three ways that an image can be associated with a submission: a single-valued PI headshot image, a single-valued primary study image, and a multivalued collection of other study images. The image_type parameter determines how the new storage object's name is associated with the submission.

  • DELETE /api/metadata_submission/{id}/image/{image_type}

    This endpoint is used to remove an image object association from a submission. The image_type parameter functions similarly to the above POST endpoint. When removing an image from the multivalued collection of study images, the particular image to remove must be indicated by also providing the image_name query parameter.
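Taken together, the expected client flow can be sketched as follows. This is a minimal sketch: the request/response field names (`url`, `object_name`) and the injected HTTP helpers are illustrative assumptions, not the actual API contract.

```python
# Sketch of the intended client-side flow across the three endpoints.
# The payload field names here are illustrative assumptions.

def upload_submission_image(submission_id, image_type, file_bytes, http_post, http_put):
    """Upload an image and associate it with a submission.

    `http_post` and `http_put` are injected HTTP helpers so the flow can
    be exercised without a live server.
    """
    # 1. Ask the API for a time-limited signed upload URL.
    signed = http_post(f"/api/metadata_submission/{submission_id}/image/signed_upload_url")

    # 2. Upload the file bytes directly to GCS using the signed URL.
    http_put(signed["url"], file_bytes)

    # 3. Tell the API the upload succeeded so it records the object name.
    return http_post(
        f"/api/metadata_submission/{submission_id}/image/{image_type}",
        json={"object_name": signed["object_name"]},
    )
```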

Details

Database changes

References to all images in GCS associated with submissions are stored in a new table named submission_images_object. The slightly awkward name reflects the fact that it holds information about objects stored in the nmdc-submission-images bucket.

Two new foreign key columns on the submission_metadata table, pi_image_name and primary_study_image_name, enable the two single-valued associations. A new association table (submission_study_image_association) enables the multivalued "other study images" association.
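The resulting schema can be approximated with the DDL below, run here against an in-memory SQLite database for illustration. The association table's column names and all of the types are assumptions; the authoritative schema lives in the Alembic migrations.

```python
# Illustrative DDL for the tables described above, executed against an
# in-memory SQLite database. Types, constraints, and the association
# table's column names are assumptions approximating the real migration.
import sqlite3

DDL = """
CREATE TABLE submission_images_object (
    name TEXT PRIMARY KEY  -- object name in the nmdc-submission-images bucket
);

CREATE TABLE submission_metadata (
    id TEXT PRIMARY KEY,
    pi_image_name TEXT REFERENCES submission_images_object (name),
    primary_study_image_name TEXT REFERENCES submission_images_object (name)
);

CREATE TABLE submission_study_image_association (
    submission_id TEXT REFERENCES submission_metadata (id),
    image_name TEXT REFERENCES submission_images_object (name),
    PRIMARY KEY (submission_id, image_name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```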

[Database schema diagram]

Why am I not just stuffing all of these image references into the venerable metadata_submission JSONB column?

  1. These fields will be updated by clients using dedicated API endpoints (by necessity because of the interaction with the GCS server). So having the information mixed into the JSONB object didn't seem advantageous in any way.
  2. More broadly speaking, I think the big JSONB object wasn't the best design decision from the get-go. There's nothing about the submission data that requires that much flexibility. So the first step towards moving away from that model is not adding to it.

Fake GCS server

In order to not force local development systems to talk to the real GCS server and incur the (admittedly small) cost of storing and transferring data, a GCS emulator is also included. The Docker image is called fake-gcs-server, and the code generally uses the terminology "fake" when referring to it as well.

The application will use the fake GCS server by default, but you can switch to the real one with an environment variable (NMDC_GCS_USE_FAKE). Even with the fake GCS server, you must provide Google Cloud credentials in the same way that you'd provide them to the real GCS server. I can explain why, but it gets quite in-the-weeds. Regardless, there are instructions in development.md for how to get that set up.

The Storage class

The new Storage class (nmdc_server/storage.py) provides the main interface to GCS operations. This class is responsible for setting up a Client and using it to perform operations on buckets and objects (referred to as "blobs" by the google-cloud-storage Python library methods; I tried to stick with the "objects" nomenclature but they're synonymous).

The nmdc_server/storage.py module also exports an instance (storage) of the Storage class which is intended to be used as a singleton elsewhere in the code.
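A minimal sketch of that pattern, with the google-cloud-storage client stubbed out so it runs without credentials. The endpoint URL and the environment-variable parsing shown here are assumptions, not the actual implementation in nmdc_server/storage.py.

```python
# Minimal sketch of the Storage pattern described above. A real
# implementation would construct a google.cloud.storage.Client here;
# the fake endpoint URL and env-var parsing are illustrative assumptions.
import os

class Storage:
    def __init__(self, use_fake=None):
        if use_fake is None:
            # Default to the fake GCS server unless explicitly disabled.
            use_fake = os.environ.get("NMDC_GCS_USE_FAKE", "true").lower() != "false"
        self.use_fake = use_fake
        # When using the fake server, point the client at the emulator;
        # otherwise let the client use the real GCS endpoint.
        self.endpoint = "http://fake-gcs-server:4443" if use_fake else None

# Module-level singleton, mirroring the `storage` instance the module exports.
storage = Storage()
```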

CLI commands

Two new nmdc-server CLI commands have been added:

  • nmdc-server storage init -- this command is responsible for ensuring that all the buckets (currently only one) that the code knows about actually exist in GCS. If a bucket does not exist and you're using the fake GCS server, the bucket will automatically be created. If a bucket does not exist and you're using the real GCS server, an exception will be raised.

    Why are we not automatically creating buckets on the real GCS server? Creating a bucket and configuring its CORS settings automatically on the real GCS server would involve giving more permissions to the service account used to access GCS. Since creating new buckets will be relatively rare, it seemed like a better tradeoff to keep the permissions limited and create buckets manually when needed.

    This command is called during application startup (start.sh) and prior to testing with tox.

  • nmdc-server storage vacuum -- this command will delete any object from GCS that is not referenced in the submission_images_object table. This command is not currently automatically called anywhere. Under normal operation the new endpoints will remove de-referenced GCS objects automatically. But I did find it useful during debugging and development. At some point we could consider running this command periodically in production just in case.
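The vacuum step reduces to a set difference, sketched below with the bucket modeled as a dict and the table as a set of referenced names. The real command uses the GCS client and a Postgres query instead.

```python
# Sketch of the vacuum logic: delete any stored object that is not
# referenced in the submission_images_object table.

def vacuum(bucket_objects, referenced_names):
    """Remove unreferenced objects in place; return the deleted names."""
    orphans = set(bucket_objects) - set(referenced_names)
    for name in orphans:
        del bucket_objects[name]
    return orphans
```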

Pydantic models

Because all of the content of the nmdc-submission-images bucket is private, a user needs a time-limited signed URL in order to download the image. This generation is managed in the SubmissionMetadataSchema Pydantic model so that it happens as one of the last steps of the object going "out the door" (i.e. being serialized to JSON). At that point, upstream code (e.g. in API request handlers) should have already determined that the requesting user is allowed to see images associated with the given submission.
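A sketch of that serialization-time signing, with the GCS signed-URL call replaced by a stand-in. The URL shape and field names here are hypothetical; the real code asks the GCS client for a signed URL inside the Pydantic model's serialization.

```python
# Illustration of generating time-limited signed URLs as the last step
# before a submission goes "out the door". The signing is a stand-in
# for the real GCS signed-URL generation.
import time

def sign_url(object_name, expires_in=3600, now=time.time):
    # Hypothetical URL shape; real signed URLs come from the GCS client.
    return f"https://storage.example/{object_name}?expires={int(now()) + expires_in}"

def serialize_submission(record):
    """Convert a stored record to its outgoing JSON-ready form,
    replacing stored object names with signed download URLs."""
    out = dict(record)
    if out.get("pi_image_name"):
        out["pi_image_url"] = sign_url(out.pop("pi_image_name"))
    return out
```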

I also took this opportunity to create a new Pydantic model (SubmissionMetadataSchemaListItem) which represents a "slim" view of a submission. It contains only the fields needed to render a submission in the SubmissionList.vue component. In particular, none of the information from the metadata_submission field is needed. This makes the response to GET /api/metadata_submission drastically smaller. And, conveniently, signed URL generation can be skipped because those fields are also not included in SubmissionMetadataSchemaListItem.

Finally, I fixed a handful of places where we were using the old-style way of converting a SQLAlchemy model to a Pydantic model:

some_pydantic_instance = SomePydanticModel(**some_sqla_instance.__dict__)

By replacing it with the newer style:

some_pydantic_instance = SomePydanticModel.model_validate(some_sqla_instance)
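One detail worth noting: for model_validate to accept an arbitrary object such as a SQLAlchemy instance (rather than a dict), the Pydantic v2 model must enable from_attributes. A minimal self-contained illustration, using a plain class as a stand-in for a SQLAlchemy instance:

```python
# model_validate() reads attributes off an object only when the model
# enables `from_attributes`; without it, passing a non-dict raises an error.
from pydantic import BaseModel, ConfigDict

class SomePydanticModel(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    id: int
    name: str

class FakeSqlaInstance:  # stand-in for a SQLAlchemy model instance
    id = 1
    name = "example"

instance = SomePydanticModel.model_validate(FakeSqlaInstance())
```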

@pkalita-lbl pkalita-lbl requested a review from Copilot July 22, 2025 22:02
@pkalita-lbl pkalita-lbl marked this pull request as ready for review July 30, 2025 20:10
@eecavanna

I'm planning on reviewing this more this evening (hopefully, to completion). Sorry about the delay.


@eecavanna eecavanna left a comment


LGTM! Thanks for waiting for me to finish reviewing it, setting the stage so thoroughly in the PR description, demonstrating so many scenarios in the automated tests, and... for working through all the GCS setup/ramp up.


@naglepuff naglepuff left a comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting up with all of my feedback! I say let's get this into dev.

@pkalita-lbl (Author)

Haha no I appreciate the feedback!

I'm going to merge main into this branch, re-linearize the Alembic migrations, update the dev Rancher configuration, and then merge.

@pkalita-lbl pkalita-lbl merged commit 29698c9 into main Aug 7, 2025
2 checks passed
@pkalita-lbl pkalita-lbl deleted the issue-1612-gcs-images branch August 7, 2025 21:10
Successfully merging this pull request may close these issues.

Set up infrastructure for storing submission media in object store