# Add API endpoints for handling submission-related image uploads (#1706)
## Conversation
Co-authored-by: eecavanna <134325062+eecavanna@users.noreply.github.com>
I'm planning on reviewing this more this evening (hopefully, to completion). Sorry about the delay.
LGTM! Thanks for waiting for me to finish reviewing it, setting the stage so thoroughly in the PR description, demonstrating so many scenarios in the automated tests, and... for working through all the GCS setup/ramp up.
Thanks for putting up with all of my feedback! I say let's get this into dev
Haha no I appreciate the feedback! I'm going to merge.
Fixes #1612
## Summary
These changes set up API endpoints for managing various types of submission-related images. The images themselves are ultimately stored in a Google Cloud Storage (GCS) bucket, and we only store information about object names in Postgres. These changes do not introduce any usage of these endpoints; that will be done as part of other work.
The new endpoints are:
`POST /api/metadata_submission/{id}/image/signed_upload_url`

This endpoint is used to request a time-limited signed upload URL. Once the client receives this URL, they can use it to upload a file directly to GCS.
`POST /api/metadata_submission/{id}/image/{image_type}`

This endpoint is expected to be called by the client after successfully uploading a file to GCS via a signed URL. There are three ways that an image can be associated with a submission: a single-valued PI headshot image, a single-valued primary study image, and a multivalued collection of other study images. The `image_type` parameter determines how the new storage object's name is associated with the submission.

`DELETE /api/metadata_submission/{id}/image/{image_type}`
This endpoint is used to remove an image object association from a submission. The `image_type` parameter functions similarly to the above `POST` endpoint. When removing an image from the multivalued collection of study images, the particular image to remove must be indicated by also providing the `image_name` query parameter.
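Put together, a client round-trip through these endpoints looks roughly like the sketch below. This is a hedged illustration, not code from this PR: the request and response field names (`file_name`, `url`, `object_name`) and the concrete `image_type` values (`pi_image`, `study_image`) are assumptions.

```python
# Hypothetical client-side flow (field names and image_type values assumed).
import requests

API = "https://data.microbiomedata.org/api"  # assumed base URL
submission_id = "0000-0000"  # placeholder submission ID
session = requests.Session()  # assume authentication is already configured

# 1. Request a time-limited signed upload URL from the server.
resp = session.post(
    f"{API}/metadata_submission/{submission_id}/image/signed_upload_url",
    json={"file_name": "headshot.png"},
)
resp.raise_for_status()
signed = resp.json()

# 2. Upload the file bytes directly to GCS via the signed URL.
with open("headshot.png", "rb") as f:
    upload_resp = requests.put(
        signed["url"], data=f, headers={"Content-Type": "image/png"}
    )
upload_resp.raise_for_status()

# 3. Associate the uploaded object with the submission as its PI headshot.
session.post(
    f"{API}/metadata_submission/{submission_id}/image/pi_image",
    json={"object_name": signed["object_name"]},
).raise_for_status()

# Removing one of the multivalued study images requires naming it explicitly.
session.delete(
    f"{API}/metadata_submission/{submission_id}/image/study_image",
    params={"image_name": signed["object_name"]},
).raise_for_status()
```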
## Details

### Database changes
References to all images in GCS associated with submissions are stored in a new table named `submission_images_object`. The slightly awkward name reflects the fact that it holds information about objects stored in the `nmdc-submission-images` bucket.

Two new foreign key columns (`pi_image_name` and `primary_study_image_name`) on the `submission_metadata` table enable the two single-valued associations. A new association table (`submission_study_image_association`) enables the multivalued "other study images" association.

Why am I not just stuffing all of these image references into the venerable `metadata_submission` JSONB column?
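For orientation, this schema maps onto roughly the following SQLAlchemy sketch. Only the table and column names called out above come from this PR; the types, primary keys, and relationship wiring are assumptions.

```python
# A rough, assumed sketch of the new tables/columns in declarative style.
from sqlalchemy import Column, ForeignKey, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class SubmissionImagesObject(Base):
    # One row per object stored in the nmdc-submission-images bucket.
    __tablename__ = "submission_images_object"
    name = Column(String, primary_key=True)  # the GCS object name


# Association table backing the multivalued "other study images" collection.
submission_study_image_association = Table(
    "submission_study_image_association",
    Base.metadata,
    Column("submission_id", ForeignKey("submission_metadata.id"), primary_key=True),
    Column("image_name", ForeignKey("submission_images_object.name"), primary_key=True),
)


class SubmissionMetadata(Base):
    __tablename__ = "submission_metadata"
    id = Column(String, primary_key=True)
    # The two single-valued associations.
    pi_image_name = Column(ForeignKey("submission_images_object.name"))
    primary_study_image_name = Column(ForeignKey("submission_images_object.name"))
    # The multivalued association via the secondary table.
    study_images = relationship(
        "SubmissionImagesObject", secondary=submission_study_image_association
    )
```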
### Fake GCS server

In order to not force local development systems to talk to the real GCS server and incur the (admittedly small) cost of storing and transferring data, a GCS emulator is also included. The Docker image is called `fake-gcs-server`, and the code generally uses the terminology "fake" when referring to it as well.

The application will use the fake GCS server by default, but you can switch to the real one with an environment variable (`NMDC_GCS_USE_FAKE`). Even with the fake GCS server, you must provide Google Cloud credentials in the same way that you'd provide them to the real GCS server. I can explain why, but it gets quite in-the-weeds. Regardless, there are instructions in `development.md` for how to get that set up.
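For a concrete sense of how the emulator is reached: the `google-cloud-storage` Python client honors the `STORAGE_EMULATOR_HOST` environment variable, so pointing it at the container looks roughly like the sketch below. The port and project name are assumptions, and the credential setup described above is glossed over here.

```python
# Sketch: talk to a local fake-gcs-server container instead of real GCS.
import os

# The official client library sends all requests to this host when it is set.
os.environ["STORAGE_EMULATOR_HOST"] = "http://localhost:4443"  # assumed port

from google.cloud import storage

client = storage.Client(project="local-dev")  # project name is arbitrary here
bucket = client.bucket("nmdc-submission-images")
if not bucket.exists():
    # The fake server happily creates buckets; the real one needs permissions.
    client.create_bucket(bucket)
print([blob.name for blob in client.list_blobs(bucket)])
```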
### The `Storage` class

The new `Storage` class (`nmdc_server/storage.py`) provides the main interface to GCS operations. This class is responsible for setting up a `Client` and using it to perform operations on buckets and objects (referred to as "blobs" by the `google-cloud-storage` Python library methods; I tried to stick with the "objects" nomenclature, but they're synonymous).

The `nmdc_server/storage.py` module also exports an instance (`storage`) of the `Storage` class which is intended to be used as a singleton elsewhere in the code.
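A condensed sketch of what such a wrapper might look like follows. Only the class name, module path, and singleton pattern come from this description; the method names, lazy-client property, and expiration window are assumptions.

```python
# Assumed sketch of a Storage wrapper around google-cloud-storage.
from datetime import timedelta
from typing import Optional

from google.cloud import storage as gcs


class Storage:
    def __init__(self) -> None:
        self._client: Optional[gcs.Client] = None

    @property
    def client(self) -> gcs.Client:
        # Create the Client lazily so that importing the module (and creating
        # the singleton below) has no side effects.
        if self._client is None:
            self._client = gcs.Client()
        return self._client

    def get_signed_upload_url(self, bucket: str, object_name: str) -> str:
        blob = self.client.bucket(bucket).blob(object_name)
        return blob.generate_signed_url(
            version="v4", expiration=timedelta(minutes=15), method="PUT"
        )

    def get_signed_download_url(self, bucket: str, object_name: str) -> str:
        blob = self.client.bucket(bucket).blob(object_name)
        return blob.generate_signed_url(
            version="v4", expiration=timedelta(minutes=15), method="GET"
        )

    def delete_object(self, bucket: str, object_name: str) -> None:
        self.client.bucket(bucket).blob(object_name).delete()


# Module-level singleton, imported by API handlers and CLI commands.
storage = Storage()
```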
### CLI commands

Two new `nmdc-server` CLI commands have been added:

`nmdc-server storage init`

This command is responsible for ensuring that all the buckets (currently only one) that the code knows about actually exist in GCS. If a bucket does not exist and you're using the fake GCS server, the bucket will automatically be created. If a bucket does not exist and you're using the real GCS server, an exception will be raised.

Why are we not automatically creating buckets on the real GCS server? Creating a bucket and configuring its CORS settings automatically on the real GCS server would involve giving more permissions to the service account used to access GCS. Since creating new buckets will be relatively rare, it seemed like a better tradeoff to keep the permissions limited and create buckets manually when needed.

This command is called during application startup (`start.sh`) and prior to testing with `tox`.

`nmdc-server storage vacuum`

This command will delete any object from GCS that is not referenced in the `submission_images_object` table. It is not currently called automatically anywhere; under normal operation, the new endpoints remove de-referenced GCS objects themselves. But I did find it useful during debugging and development, and at some point we could consider running it periodically in production just in case.
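Conceptually, the vacuum boils down to a set difference between the objects in the bucket and the names in the table, as in this sketch. The function and its arguments are invented for illustration; the real command presumably queries `submission_images_object` via SQLAlchemy.

```python
# Sketch of the vacuum idea: delete blobs not referenced in Postgres.
from google.cloud import storage as gcs


def vacuum_bucket(client: gcs.Client, bucket_name: str, referenced: set[str]) -> int:
    """Delete every blob whose name is not in `referenced`; return the count."""
    deleted = 0
    for blob in client.list_blobs(bucket_name):
        if blob.name not in referenced:
            blob.delete()
            deleted += 1
    return deleted
```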
### Pydantic models

Because all of the content of the `nmdc-submission-images` bucket is private, a user needs a time-limited signed URL in order to download an image. This URL generation is managed in the `SubmissionMetadataSchema` Pydantic model so that it happens as one of the last steps of the object going "out the door" (i.e., being serialized to JSON). At that point, upstream code (e.g., in API request handlers) should have already determined that the requesting user is allowed to see images associated with the given submission.
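To illustrate serialization-time signing, here is a minimal sketch using a Pydantic v2 field serializer. The schema name, field name, and `get_signed_download_url` helper are assumptions (the helper matches the hypothetical `Storage` sketch above); the real logic lives in `SubmissionMetadataSchema`.

```python
# Minimal assumed sketch of signing object names at serialization time.
from pydantic import BaseModel, ConfigDict, field_serializer

from nmdc_server.storage import storage  # the module-level singleton


class SubmissionImageSchema(BaseModel):  # hypothetical schema
    model_config = ConfigDict(from_attributes=True)

    name: str  # the GCS object name as stored in Postgres

    @field_serializer("name")
    def _sign_name(self, name: str) -> str:
        # Swap the raw object name for a time-limited signed download URL as
        # one of the last steps before the JSON goes "out the door".
        return storage.get_signed_download_url("nmdc-submission-images", name)
```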
I also took this opportunity to create a new Pydantic model (`SubmissionMetadataSchemaListItem`) which represents a "slim" view of a submission. It contains only the fields needed to render a submission in the `SubmissionList.vue` component. In particular, none of the information from the `metadata_submission` field is needed. This makes the response to `GET /api/metadata_submission` drastically smaller. And, coincidentally, signed URL generation can be skipped because those fields are also not included in `SubmissionMetadataSchemaListItem`.
Finally, I fixed a handful of places where we were using the old-style way of converting a SQLAlchemy model to a Pydantic model:
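Assuming the Pydantic v1-era API, the old style would be something like:

```python
# Old style (Pydantic v1-era from_orm) -- an assumed reconstruction.
schema = SubmissionMetadataSchema.from_orm(db_submission)
```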
By replacing it with the newer style:
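Again assuming a Pydantic v2 migration, something like:

```python
# Newer style (Pydantic v2 model_validate) -- an assumed reconstruction.
schema = SubmissionMetadataSchema.model_validate(db_submission)
```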