
Commit 33d473f

Merge branch 'main' into lambda
2 parents ba7227e + 798dd21

File tree

7 files changed: +213 −101 lines changed

.github/workflows/docker-publish.yml

Lines changed: 0 additions & 3 deletions

@@ -42,11 +42,8 @@ jobs:
        with:
          images: ${{ env.REGISTRY }}/${{ secrets.DOCKER_USERNAME }}/${{ env.IMAGE_NAME }}
          tags: |
-           type=ref,event=branch
-           type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
-           type=semver,pattern={{major}}
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
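The `tags:` block above uses docker/metadata-action-style configuration, so after this change only semver release tags plus a default-branch `latest` tag are published. A hedged sketch of the resulting pulls for a hypothetical `v1.2.3` release (registry, username, and image name are placeholders for the workflow's `env` and `secrets` values):

```bash
# Hypothetical tags published for a v1.2.3 release after this change.
docker pull <REGISTRY>/<DOCKER_USERNAME>/<IMAGE_NAME>:1.2.3   # type=semver,pattern={{version}}
docker pull <REGISTRY>/<DOCKER_USERNAME>/<IMAGE_NAME>:1.2     # type=semver,pattern={{major}}.{{minor}}
docker pull <REGISTRY>/<DOCKER_USERNAME>/<IMAGE_NAME>:latest  # type=raw, default branch only
# No longer published: branch-ref tags, PR-ref tags, and the bare {{major}} tag.
```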

.github/workflows/docs.yml

Lines changed: 0 additions & 42 deletions
This file was deleted.

Lines changed: 45 additions & 0 deletions

# Simple workflow for deploying static content to GitHub Pages
name: Deploy static content to Pages

on:
  # Runs on pushes targeting the lambda branch
  push:
    branches: ["lambda"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Single deploy job since we're just deploying
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          ref: 'lambda'
      - name: Setup Pages
        uses: actions/configure-pages@v5
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          # Upload only the www directory
          path: './www'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
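Since this workflow uploads `./www` as the Pages artifact verbatim, the deployed site can be previewed locally with any static file server; a minimal sketch, assuming Python 3 is available:

```bash
# Serve the www/ directory locally to preview what Pages will publish.
cd www
python3 -m http.server 8000   # then open http://localhost:8000
```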

README.md

Lines changed: 59 additions & 56 deletions

@@ -1,4 +1,4 @@
-# peakScout <a href="https://github.com/vandydata/peakScout"><img src="assets/logo.svg" align="right" height="350" alt="peakScout website"/></a>
+# peakScout <a href="https://github.com/vandydata/peakScout"><img src="assets/logo.svg" align="right" height="150" alt="peakScout website"/></a>
 
 <!-- badges: start -->
 [![peakScout](https://github.com/vandydata/peakScout/actions/workflows/python.yaml/badge.svg)](https://github.com/vandydata/peakScout/actions/workflows/python.yaml)
@@ -11,50 +11,68 @@ peakScout is a user-friendly and reversible peak-to-gene translator for genomic
 
 ## Overview
 
-PeakScout is a bioinformatics tool designed to bridge the gap between genomic peak data and gene annotations, enabling researchers to understand the relationship between regulatory elements and their target genes. At its core, peakScout processes genomic peak files from common peak callers like MACS2 and SEACR and maps them to nearby genes using reference genome annotations. The workflow begins with input processing, where peak files are standardized and reference GTF files are decomposed into chromosome-specific feature collections. The core analysis modules then perform bidirectional mapping: peak-to-gene identifies which genes are potentially regulated by specific genomic regions, while gene-to-peak reveals which regulatory elements might influence particular genes. Throughout this process, nearest-feature detection algorithms handle the complex spatial relationships between genomic elements, considering factors like distance constraints and feature overlaps. Finally, the results are formatted into researcher-friendly CSV and Excel outputs, providing a comprehensive view of the genomic landscape that connects regulatory elements to their potential gene targets.
+PeakScout is a bioinformatics tool designed to bridge the gap between genomic peak data and gene annotations, enabling researchers to understand the relationship between regulatory elements and their target genes. At its core, peakScout processes genomic peak files generated by popular peak callers like MACS2 and SEACR and maps them to nearby genes using reference genome annotations.
 
-## Installation
+peakScout performs:
+- **Peak-to-Gene Mapping**: This function identifies the nearest genes to each peak, allowing researchers to infer which genes might be regulated by specific genomic regions. Users can specify how many nearest genes (k) they want to retrieve for each peak.
+- **Gene-to-Peak Mapping**: Conversely, this function finds the nearest peaks to a list of genes, helping researchers identify potential regulatory elements that may influence gene expression.
 
-These instructions should generally work without modification in linux-based environments. If you are using Windows, we strongly recommend you use WSL2 to have a Linux environment within Windows.
+peakScout expects two inputs:
+1. A **peak file** (in BED6 format or as output from MACS2 or SEACR)
+2. A **reference GTF file** containing gene annotations. The tool can decompose the GTF file into chromosome-specific collections of genomic features, which are then used to perform bidirectional mapping between peaks and genes.
 
-### 1. Clone the Repository
+peakScout can be run via:
+- **Command line**: peakScout is designed to be run from the command line, making it accessible for users comfortable with terminal operations.
+- **Cloud computing**: for instant web access, we have set up peakScout in the cloud - https://vandydata.github.io/peakScout.
 
-```bash
-git clone https://github.com/vandydata/peakScout.git
-cd peakScout
-```
 
-### 2. Make the Script Executable
-```bash
-chmod +x src/peakScout
-```
 
-### 3. Add to Path
-```bash
-export PATH="$PATH:$(pwd)/src"
-```
+## Installation
 
-Alternatively, edit your `~/.bashrc` to make this change permanent, but be sure to include the complete path in the file itself, not the `$(pwd)`.
+### From source
 
-### 4. Set Up Virtual Environment
+These instructions should generally work without modification in Linux-based environments. If you are using Windows, we strongly recommend you use WSL2 to have a Linux environment within Windows.
+```bash
+# 1. Clone the repository
+git clone https://github.com/vandydata/peakScout.git
 
-Using `venv`
+# 2. Make the script executable
+cd peakScout; chmod +x src/peakScout
 
-```bash
-# in peakScount main directory
+# 3. Add to path
+# Alternatively, edit your `~/.bashrc` to make this change permanent, but be sure
+# to include the complete path in the file itself, not the `$(pwd)`.
+export PATH="$PATH:$(pwd)/src"
+
+# 4. Set up virtual environment and install dependencies
+# with venv
 python3 -m venv peakscout
 source peakscout/bin/activate
 pip3 install -r requirements.txt
-```
 
-Or using `uv`
-```bash
-# in peakScount main directory
+# OR with uv
 uv venv peakscout
 source peakscout/bin/activate
 uv pip install -r requirements.txt
 ```
 
+### Docker & Singularity containers
+
+We have made available a Docker image for peakScout, which can be run as follows:
+
+```bash
+docker run -it --rm jpcartailler/peakscout:latest peakScout --help
+```
+
+For Singularity, you can convert the Docker image to a Singularity image and run it as follows:
+
+```bash
+singularity pull docker://jpcartailler/peakscout:latest
+singularity exec peakscout_latest.sif peakScout --help
+```
+
+
 ## Usage
 
 ### Decomposing Reference GTF
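As a quick sanity check after the source install steps above, a minimal sketch (it assumes the `PATH` export and virtual environment steps ran in the current shell):

```bash
# Confirm the peakScout script resolves and responds.
which peakScout
peakScout --help
```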
@@ -125,27 +143,23 @@ Specific example:
 peakScout gene2peak --peak_file /path/to/peak/file --peak_type MACS2/SEACR/BED6 --gene_file /path/to/gene/file --species species of gtf --k number of nearest peaks --ref_dir /path/to/reference/directory --output_name name of output file --o /path/to/save/output --output_type csv/xlsx
 ```
 
-## Decomposed references for common organisms
-
-For your convenience, we have prepared decomposed GTF files for common organisms, generated by `src/utils/decompose-common-organisms.sh` from the following GTF files:
-
-* `arabidopsis_TAIR10` - https://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gtf.gz
-* `fly_BDGP6.54` - https://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.114.gtf.gz
-* `frog_v10.1` - https://ftp.ensembl.org/pub/current_gtf/xenopus_tropicalis/Xenopus_tropicalis.UCB_Xtro_10.0.114.gtf.gz
-* `human_hg38` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.basic.annotation.gtf.gz
-* `human_hg19` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_10/gencode.v10.annotation.gtf.gz
-* `mouse_mm39` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M37/gencode.vM37.basic.annotation.gtf.gz
-* `mouse_mm10` - https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.basic.annotation.gtf.gz
-* `pig_Sscrofa11.1` - https://ftp.ensembl.org/pub/release-114/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.114.gtf.gz
-* `worm_WBcel235` - https://ftp.ebi.ac.uk/ensemblgenomes/pub/metazoa/current/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.61.gtf.gz
-* `yeast_R64-1-1` - https://ftp.ebi.ac.uk/ensemblgenomes/pub/fungi/current/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.61.gtf.gz
-* `zebrafish_GRCz11` - https://ftp.ensembl.org/pub/current_gtf/danio_rerio/Danio_rerio.GRCz11.114.gtf.gz
+## peakScout ready-made references for common organisms
 
-You can download these from a public AWS S3 bucket as ZSTD-compressed files, which can be decompressed with `zstd -d`:
+For your convenience, we have prepared reference files for common organisms, generated by `src/utils/decompose-common-organisms.sh`. Source files are the GTFs below, and downloadable peakScout reference files are the S3 links.
 
-* `s3://peakscout/common-organisms/arabidopsis_TAIR10.zst`
-* `s3://peakscout/common-organisms/worm_WBcel235.zst`
-* etc...
+| Species | GTF | S3 |
+|---------|-----|-----|
+| arabidopsis_TAIR10 | [GTF](https://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/current/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/arabidopsis_TAIR10.tar.zst) |
+| fly_BDGP6.54 | [GTF](https://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/fly_BDGP6.54.tar.zst) |
+| frog_v10.1 | [GTF](https://ftp.ensembl.org/pub/current_gtf/xenopus_tropicalis/Xenopus_tropicalis.UCB_Xtro_10.0.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/frog_v10.1.tar.zst) |
+| human_hg19 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_10/gencode.v10.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/human_hg19.tar.zst) |
+| human_hg38 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.basic.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/human_hg38.tar.zst) |
+| mouse_mm10 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.basic.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/mouse_mm10.tar.zst) |
+| mouse_mm39 | [GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M37/gencode.vM37.basic.annotation.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/mouse_mm39.tar.zst) |
+| pig_Sscrofa11.1 | [GTF](https://ftp.ensembl.org/pub/release-114/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/pig_Sscrofa11.1.tar.zst) |
+| worm_WBcel235 | [GTF](https://ftp.ebi.ac.uk/ensemblgenomes/pub/metazoa/current/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.61.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/worm_WBcel235.tar.zst) |
+| yeast_R64-1-1 | [GTF](https://ftp.ebi.ac.uk/ensemblgenomes/pub/fungi/current/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.61.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/yeast_R64-1-1.tar.zst) |
+| zebrafish_GRCz11 | [GTF](https://ftp.ensembl.org/pub/current_gtf/danio_rerio/Danio_rerio.GRCz11.114.gtf.gz) | [S3](https://cds-peakscout-public.s3.us-east-1.amazonaws.com/zebrafish_GRCz11.tar.zst) |
 
 
 ## FAQ
@@ -164,14 +178,3 @@ wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M37/genco
 gunzip gencode.vM37.basic.annotation.gtf.gz
 ```
 
-
-## Notes
-
-Genomes to target: mm10, mm39, hg19, hg38
-
-Get annotations from Gencode, i.e. M25 for mm10 (which is aka GRCm38)
-
-From gencode files, we can get chr, start, end, feature name, feature type (`gene`, `transcript`, etc), and biotype (eg. `gene_type "TEC";`)
-
-Write a parser for the GTF data and at least get it into the lowest common denominator currently need, optimizations can come later
-
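Tying the README changes together, here is a hedged end-to-end sketch combining the new S3 reference table with the `gene2peak` usage line above. The S3 URL and `zstd -d` decompression come from the README; the input file names, `k`, and output names are hypothetical placeholders.

```bash
# Fetch and unpack the ready-made mouse_mm39 reference.
wget https://cds-peakscout-public.s3.us-east-1.amazonaws.com/mouse_mm39.tar.zst
zstd -d mouse_mm39.tar.zst   # -> mouse_mm39.tar
tar -xf mouse_mm39.tar       # -> mouse_mm39/ reference directory

# Hypothetical gene2peak run against the extracted reference directory.
peakScout gene2peak \
  --peak_file peaks.narrowPeak --peak_type MACS2 \
  --gene_file my_genes.txt --species mouse_mm39 \
  --k 3 --ref_dir mouse_mm39 \
  --output_name nearest_peaks --o results/ --output_type csv
```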

aws/Dockerfile.txt

Lines changed: 24 additions & 0 deletions

# Use AWS Lambda Python base image (optimized for Lambda)
FROM public.ecr.aws/lambda/python:3.11

# Lambda-specific dependencies
COPY aws/requirements.txt ${LAMBDA_TASK_ROOT}/requirements-lambda.txt
RUN pip install -r requirements-lambda.txt

# peakScout dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt

# Copy the peakScout source code
COPY src/ ${LAMBDA_TASK_ROOT}/

COPY aws/lambda_handler.py ${LAMBDA_TASK_ROOT}/

# Copy test files (useful for testing and validation)
COPY test/ ${LAMBDA_TASK_ROOT}/test/

# Make peakScout executable
RUN chmod +x ${LAMBDA_TASK_ROOT}/peakScout

# Set the handler
CMD ["lambda_handler.handler"]

aws/dockerignore.txt

Lines changed: 7 additions & 0 deletions

.git
.gitignore
README.md
tests/
__pycache__/
*.pyc
.DS_Store

aws/lambda_function.py

Lines changed: 78 additions & 0 deletions

import json
import boto3
from pathlib import Path
import tempfile
import tarfile
import os

# zstandard is a third-party dependency; import it defensively so the check
# inside extract_zst can raise a helpful error instead of failing at import time.
try:
    import zstandard
except ImportError:
    zstandard = None


def extract_zst(archive: Path, out_path: Path):
    """Extract a .zst file.

    Works on Windows, Linux, macOS, etc.

    Parameters
    ----------
    archive: pathlib.Path or str
        .zst file to extract

    out_path: pathlib.Path or str
        directory to extract files and directories to
    """

    if zstandard is None:
        raise ImportError("pip install zstandard")

    archive = Path(archive).expanduser()
    out_path = Path(out_path).expanduser().resolve()
    # need .resolve() in case intermediate relative dir doesn't exist

    dctx = zstandard.ZstdDecompressor()

    # Decompress the .zst stream into a temporary tar file, then unpack it.
    with tempfile.TemporaryFile(suffix=".tar") as ofh:
        with archive.open("rb") as ifh:
            dctx.copy_stream(ifh, ofh)
        ofh.seek(0)
        with tarfile.open(fileobj=ofh) as z:
            z.extractall(out_path)


def lambda_handler(event, context):
    bucket_name = 'cds-peakscout-public'
    file_name = 'mouse_mm39.tar.zst'
    local_file_path = '/tmp/mouse_mm39.tar.zst'
    new_file_path = '/tmp/mouse_mm39'

    # Download file from S3 to /tmp
    s3_client = boto3.client('s3')
    try:
        s3_client.download_file(bucket_name, file_name, local_file_path)
        print(f"File '{file_name}' downloaded to '{local_file_path}'")
    except Exception as e:
        print(f"Error downloading file: {e}")
        return {"statusCode": 500, "body": "Failed to download file"}

    # Extract file
    extract_zst(local_file_path, new_file_path)

    # Open and read the downloaded file (currently disabled)
    '''
    try:
        with open(new_file_path, 'rb') as file:
            file_content = file.read()
            print(f"File content: {file_content[:100]}")  # Print first 100 bytes
    except FileNotFoundError:
        print(f"File not found at '{new_file_path}'.")
        return {"statusCode": 500, "body": "File not found"}
    except Exception as e:
        print(f"Error reading file: {e}")
        return {"statusCode": 500, "body": "Failed to read file"}
    '''

    # List files in the extracted directory
    extracted_files = os.listdir(new_file_path)
    print(f"Files in extracted directory: {extracted_files}")

    return {
        'statusCode': 200,
        'body': "File extracted successfully to /tmp"
    }
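Once the image is pushed and wired to a Lambda function, the handler can be smoke-tested from the AWS CLI; a hedged sketch (the function name is a hypothetical placeholder):

```bash
# Invoke the function with an empty event and inspect the JSON response.
aws lambda invoke \
  --function-name peakscout-lambda \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' response.json
cat response.json   # expect {"statusCode": 200, ...}
```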
