
Commit 33d473f

Merge branch 'main' into lambda
2 parents ba7227e + 798dd21

File tree

7 files changed: +213 −101 lines changed

.github/workflows/docker-publish.yml

Lines changed: 0 additions & 3 deletions

@@ -42,11 +42,8 @@ jobs:
        with:
          images: ${{ env.REGISTRY }}/${{ secrets.DOCKER_USERNAME }}/${{ env.IMAGE_NAME }}
          tags: |
-           type=ref,event=branch
-           type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
-           type=semver,pattern={{major}}
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
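The `tags:` block above uses docker/metadata-action-style configuration, so after this change only semver release tags plus a default-branch `latest` tag are published. A hedged sketch of the resulting pulls for a hypothetical `v1.2.3` release (registry, username, and image name are placeholders for the workflow's `env` and `secrets` values):

```bash
# Hypothetical tags published for a v1.2.3 release after this change.
docker pull <REGISTRY>/<DOCKER_USERNAME>/<IMAGE_NAME>:1.2.3   # type=semver,pattern={{version}}
docker pull <REGISTRY>/<DOCKER_USERNAME>/<IMAGE_NAME>:1.2     # type=semver,pattern={{major}}.{{minor}}
docker pull <REGISTRY>/<DOCKER_USERNAME>/<IMAGE_NAME>:latest  # type=raw, default branch only
# No longer published: branch-ref tags, PR-ref tags, and the bare {{major}} tag.
```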

.github/workflows/docs.yml

Lines changed: 0 additions & 42 deletions
This file was deleted.

Lines changed: 45 additions & 0 deletions

# Simple workflow for deploying static content to GitHub Pages
name: Deploy static content to Pages

on:
  # Runs on pushes targeting the lambda branch
  push:
    branches: ["lambda"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Single deploy job since we're just deploying
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          ref: 'lambda'
      - name: Setup Pages
        uses: actions/configure-pages@v5
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          # Upload only the www directory
          path: './www'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
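Since this workflow uploads `./www` as the Pages artifact verbatim, the deployed site can be previewed locally with any static file server; a minimal sketch, assuming Python 3 is available:

```bash
# Serve the www/ directory locally to preview what Pages will publish.
cd www
python3 -m http.server 8000   # then open http://localhost:8000
```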

README.md

Lines changed: 59 additions & 56 deletions

@@ -1,4 +1,4 @@
-# peakScout <a href="https://github.com/vandydata/peakScout"><img src="assets/logo.svg" align="right" height="350" alt="peakScout website"/></a>
+# peakScout <a href="https://github.com/vandydata/peakScout"><img src="assets/logo.svg" align="right" height="150" alt="peakScout website"/></a>
 
 <!-- badges: start -->
 [![peakScout](https://github.com/vandydata/peakScout/actions/workflows/python.yaml/badge.svg)](https://github.com/vandydata/peakScout/actions/workflows/python.yaml)
@@ -11,50 +11,68 @@ peakScout is a user-friendly and reversible peak-to-gene translator for genomic
 
 ## Overview
 
-PeakScout is a bioinformatics tool designed to bridge the gap between genomic peak data and gene annotations, enabling researchers to understand the relationship between regulatory elements and their target genes. At its core, peakScout processes genomic peak files from common peak callers like MACS2 and SEACR and maps them to nearby genes using reference genome annotations. The workflow begins with input processing, where peak files are standardized and reference GTF files are decomposed into chromosome-specific feature collections. The core analysis modules then perform bidirectional mapping: peak-to-gene identifies which genes are potentially regulated by specific genomic regions, while gene-to-peak reveals which regulatory elements might influence particular genes. Throughout this process, nearest-feature detection algorithms handle the complex spatial relationships between genomic elements, considering factors like distance constraints and feature overlaps. Finally, the results are formatted into researcher-friendly CSV and Excel outputs, providing a comprehensive view of the genomic landscape that connects regulatory elements to their potential gene targets.
+PeakScout is a bioinformatics tool designed to bridge the gap between genomic peak data and gene annotations, enabling researchers to understand the relationship between regulatory elements and their target genes. At its core, peakScout processes genomic peak files generated by popular peak callers like MACS2 and SEACR and maps them to nearby genes using reference genome annotations.
 
-## Installation
+peakScout performs:
+- **Peak-to-Gene Mapping**: This function identifies the nearest genes to each peak, allowing researchers to infer which genes might be regulated by specific genomic regions. Users can specify how many nearest genes (k) they want to retrieve for each peak.
+- **Gene-to-Peak Mapping**: Conversely, this function finds the nearest peaks to a list of genes, helping researchers identify potential regulatory elements that may influence gene expression.
 
-These instructions should generally work without modification in linux-based environments. If you are using Windows, we strongly recommend you use WSL2 to have a Linux environment within Windows.
+peakScout expects two inputs:
+1. A **peak file** (in BED6 format or as output from MACS2 or SEACR)
+2. A **reference GTF file** containing gene annotations. The tool can decompose the GTF file into chromosome-specific collections of genomic features, which are then used to perform bidirectional mapping between peaks and genes.
 
-### 1. Clone the Repository
+peakScout can be run via:
+- **Command line**: peakScout is designed to be run from the command line, making it accessible for users comfortable with terminal operations.
+- **Cloud computing**: for instant web access, we have set up peakScout in the cloud - https://vandydata.github.io/peakScout.
 
-```bash
-git clone https://github.com/vandydata/peakScout.git
-cd peakScout
-```
 
-### 2. Make the Script Executable
-```bash
-chmod +x src/peakScout
-```
 
-### 3. Add to Path
-```bash
-export PATH="$PATH:$(pwd)/src"
-```
+## Installation
 
-Alternatively, edit your `~/.bashrc` to make this change permanent, but be sure to include the complete path in the file itself, not the `$(pwd)`.
+### From source
 
-### 4. Set Up Virtual Environment
+These instructions should generally work without modification in Linux-based environments. If you are using Windows, we strongly recommend you use WSL2 to have a Linux environment within Windows.
+```bash
+# 1. Clone the repository
+git clone https://github.com/vandydata/peakScout.git
 
-Using `venv`
+# 2. Make the script executable
+cd peakScout; chmod +x src/peakScout
 
-```bash
-# in peakScount main directory
+# 3. Add to path
+# Alternatively, edit your `~/.bashrc` to make this change permanent, but be sure
+# to include the complete path in the file itself, not the `$(pwd)`.
+export PATH="$PATH:$(pwd)/src"
+
+# 4. Set up virtual environment and install dependencies
+# with venv
 python3 -m venv peakscout
 source peakscout/bin/activate
 pip3 install -r requirements.txt
-```
 
-Or using `uv`
-```bash
-# in peakScount main directory
+# OR with uv
 uv venv peakscout
 source peakscout/bin/activate
 uv pip install -r requirements.txt
 ```
 
+### Docker & Singularity containers
+
+We have made available a Docker image for peakScout, which can be run as follows:
+
+```bash
+docker run -it --rm jpcartailler/peakscout:latest peakScout --help
+```
+
+For Singularity, you can convert the Docker image to a Singularity image and run it as follows:
+
+```bash
+singularity pull docker://jpcartailler/peakscout:latest
+singularity exec peakscout_latest.sif peakScout --help
+```
+
+
 ## Usage
 
 ### Decomposing Reference GTF
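As a quick sanity check after the source install steps above, a minimal sketch (it assumes the `PATH` export and virtual environment steps ran in the current shell):

```bash
# Confirm the peakScout script resolves and responds.
which peakScout
peakScout --help
```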
@@ -125,27 +143,23 @@ Specific example:
 peakScout gene2peak --peak_file /path/to/peak/file --peak_type MACS2/SEACR/BED6 --gene_file /path/to/gene/file --species species of gtf --k number of nearest peaks --ref_dir /path/to/reference/directory --output_name name of output file --o /path/to/save/output --output_type csv/xlsx
 ```
 
-## Decomposed references for common organisms
-
-For your convenience, we have prepared decomposed GTF files for common organisms, generated by `src/utils/decompose-common-organisms.sh` from the following GTF files:
-
-* `arabidopsis_TAIR10` - https://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gtf.gz
-* `fly_BDGP6.54` - https://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.114.gtf.gz
-* `frog_v10.1` - https://ftp.ensembl.org/pub/current_gtf/xenopus_tropicalis/Xenopus_tropicalis.UCB_Xtro_10.0.114.gtf.gz
-* `human_hg38` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.basic.annotation.gtf.gz
-* `human_hg19` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_10/gencode.v10.annotation.gtf.gz
-* `mouse_mm39` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M37/gencode.vM37.basic.annotation.gtf.gz
-* `mouse_mm10` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.basic.annotation.gtf.gz
-* `pig_Sscrofa11.1` - https://ftp.ensembl.org/pub/release-114/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.114.gtf.gz
-* `worm_WBcel235` - https://ftp.ebi.ac.uk/ensemblgenomes/pub/metazoa/current/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.61.gtf.gz
-* `yeast_R64-1-1` - https://ftp.ebi.ac.uk/ensemblgenomes/pub/fungi/current/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.61.gtf.gz
-* `zebrafish_GRCz11` - https://ftp.ensembl.org/pub/current_gtf/danio_rerio/Danio_rerio.GRCz11.114.gtf.gz
+## peakScout ready-made references for common organisms
 
-You can download these from a public AWS S3 bucket as ZSTD-compressed files, which can be decompressed with `zstd -d`:
+For your convenience, we have prepared reference files for common organisms, generated by `src/utils/decompose-common-organisms.sh`. Source files are the GTFs below, and downloadable peakScout reference files are the S3 links.
 
-* `s3://peakscout/common-organisms/arabidopsis_TAIR10.zst`
-* `s3://peakscout/common-organisms/worm_WBcel235.zst`
-* etc...
+| Species | GTF | S3 |
+|---------|-----|-----|
+| arabidopsis_TAIR10 | [GTF](https://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/arabidopsis_TAIR10.tar.zst) |
+| fly_BDGP6.54 | [GTF](https://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/fly_BDGP6.54.tar.zst) |
+| frog_v10.1 | [GTF](https://ftp.ensembl.org/pub/current_gtf/xenopus_tropicalis/Xenopus_tropicalis.UCB_Xtro_10.0.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/frog_v10.1.tar.zst) |
+| human_hg19 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_10/gencode.v10.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/human_hg19.tar.zst) |
+| human_hg38 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.basic.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/human_hg38.tar.zst) |
+| mouse_mm10 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.basic.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/mouse_mm10.tar.zst) |
+| mouse_mm39 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M37/gencode.vM37.basic.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/mouse_mm39.tar.zst) |
+| pig_Sscrofa11.1 | [GTF](https://ftp.ensembl.org/pub/release-114/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/pig_Sscrofa11.1.tar.zst) |
+| worm_WBcel235 | [GTF](https://ftp.ebi.ac.uk/ensemblgenomes/pub/metazoa/current/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.61.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/worm_WBcel235.tar.zst) |
+| yeast_R64-1-1 | [GTF](https://ftp.ebi.ac.uk/ensemblgenomes/pub/fungi/current/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.61.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/yeast_R64-1-1.tar.zst) |
+| zebrafish_GRCz11 | [GTF](https://ftp.ensembl.org/pub/current_gtf/danio_rerio/Danio_rerio.GRCz11.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/zebrafish_GRCz11.tar.zst) |
 
 
 ## FAQ
@@ -164,14 +178,3 @@ wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M37/genco
 gunzip gencode.vM37.basic.annotation.gtf.gz
 ```
 
-
-## Notes
-
-Genomes to target: mm10, mm39, hg19, hg38
-
-Get annotations from Gencode, i.e. M25 for mm10 (which is aka GRCm38)
-
-From gencode files, we can get chr, start, end, feature name, feature type (`gene`, `transcript`, etc), and biotype (eg. `gene_type "TEC";`)
-
-Write a parser for the GTF data and at least get it into the lowest common denominator currently need, optimizations can come later
-
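Tying the README changes together, here is a hedged end-to-end sketch combining the new S3 reference table with the `gene2peak` usage line above. The S3 URL and `zstd -d` decompression come from the README; the input file names, `k`, and output names are hypothetical placeholders.

```bash
# Fetch and unpack the ready-made mouse_mm39 reference.
wget https://cds-peakscout-public.s3.us-east-1.amazonaws.com/mouse_mm39.tar.zst
zstd -d mouse_mm39.tar.zst   # -> mouse_mm39.tar
tar -xf mouse_mm39.tar       # -> mouse_mm39/ reference directory

# Hypothetical gene2peak run against the extracted reference directory.
peakScout gene2peak \
  --peak_file peaks.narrowPeak --peak_type MACS2 \
  --gene_file my_genes.txt --species mouse_mm39 \
  --k 3 --ref_dir mouse_mm39 \
  --output_name nearest_peaks --o results/ --output_type csv
```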

aws/Dockerfile.txt

Lines changed: 24 additions & 0 deletions

# Use AWS Lambda Python base image (optimized for Lambda)
FROM public.ecr.aws/lambda/python:3.11

# Lambda-specific dependencies
COPY aws/requirements.txt ${LAMBDA_TASK_ROOT}/requirements-lambda.txt
RUN pip install -r requirements-lambda.txt

# peakScout dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt

# Copy the peakScout source code
COPY src/ ${LAMBDA_TASK_ROOT}/

COPY aws/lambda_handler.py ${LAMBDA_TASK_ROOT}/

# Copy test files (useful for testing and validation)
COPY test/ ${LAMBDA_TASK_ROOT}/test/

# Make peakScout executable
RUN chmod +x ${LAMBDA_TASK_ROOT}/peakScout

# Set the handler
CMD ["lambda_handler.handler"]

aws/dockerignore.txt

Lines changed: 7 additions & 0 deletions

.git
.gitignore
README.md
tests/
__pycache__/
*.pyc
.DS_Store

aws/lambda_function.py

Lines changed: 78 additions & 0 deletions

import json
import boto3
from pathlib import Path
import tempfile
import tarfile
import os

# zstandard is a third-party dependency; import it defensively so the check
# inside extract_zst can raise a helpful error instead of failing at import time.
try:
    import zstandard
except ImportError:
    zstandard = None


def extract_zst(archive: Path, out_path: Path):
    """Extract a .zst file.

    Works on Windows, Linux, macOS, etc.

    Parameters
    ----------
    archive: pathlib.Path or str
        .zst file to extract

    out_path: pathlib.Path or str
        directory to extract files and directories to
    """

    if zstandard is None:
        raise ImportError("pip install zstandard")

    archive = Path(archive).expanduser()
    out_path = Path(out_path).expanduser().resolve()
    # need .resolve() in case intermediate relative dir doesn't exist

    dctx = zstandard.ZstdDecompressor()

    # Decompress the .zst stream into a temporary tar file, then unpack it.
    with tempfile.TemporaryFile(suffix=".tar") as ofh:
        with archive.open("rb") as ifh:
            dctx.copy_stream(ifh, ofh)
        ofh.seek(0)
        with tarfile.open(fileobj=ofh) as z:
            z.extractall(out_path)


def lambda_handler(event, context):
    bucket_name = 'cds-peakscout-public'
    file_name = 'mouse_mm39.tar.zst'
    local_file_path = '/tmp/mouse_mm39.tar.zst'
    new_file_path = '/tmp/mouse_mm39'

    # Download file from S3 to /tmp
    s3_client = boto3.client('s3')
    try:
        s3_client.download_file(bucket_name, file_name, local_file_path)
        print(f"File '{file_name}' downloaded to '{local_file_path}'")
    except Exception as e:
        print(f"Error downloading file: {e}")
        return {"statusCode": 500, "body": "Failed to download file"}

    # Extract file
    extract_zst(local_file_path, new_file_path)

    # Open and read the downloaded file (currently disabled)
    '''
    try:
        with open(new_file_path, 'rb') as file:
            file_content = file.read()
            print(f"File content: {file_content[:100]}")  # Print first 100 bytes
    except FileNotFoundError:
        print(f"File not found at '{new_file_path}'.")
        return {"statusCode": 500, "body": "File not found"}
    except Exception as e:
        print(f"Error reading file: {e}")
        return {"statusCode": 500, "body": "Failed to read file"}
    '''

    # List files in the extracted directory
    extracted_files = os.listdir(new_file_path)
    print(f"Files in extracted directory: {extracted_files}")

    return {
        'statusCode': 200,
        'body': "File extracted successfully to /tmp"
    }
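Once the image is pushed and wired to a Lambda function, the handler can be smoke-tested from the AWS CLI; a hedged sketch (the function name is a hypothetical placeholder):

```bash
# Invoke the function with an empty event and inspect the JSON response.
aws lambda invoke \
  --function-name peakscout-lambda \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' response.json
cat response.json   # expect {"statusCode": 200, ...}
```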
