Commit df7b61c

update v4 (#24)
* update v4
---------
Co-authored-by: Kalin Nonchev
1 parent afdfef2 commit df7b61c

17 files changed: +313 -354 lines changed

README.md

Lines changed: 18 additions & 10 deletions
@@ -2,33 +2,41 @@
 
 # gnomAD_DB
 
-### Changelog
+#### Changelog
 
-#### NEW version (July 2022)
+#### NEW version (November 2023)
+- release gnomAD WGS v4.0 and WES v4.0
+- `gnomad_version`=["v2"|"v3"|"v4"] argument has to be specified when initializing the database
+- minor fixes
+
+#### version (July 2022)
 - release gnomAD WGS v3.1.2
 - minor bug fixes
 
 #### version (December 2021)
 - more available variant features present, check [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml)
 - `get_maf_from_df` renamed to `get_info_from_df`
 - `get_maf_from_str` renamed to `get_info_from_str`
-- `genome`=["Grch37"|"Grch38"] argument have to be specified, when initializing the database
+- [DEPRECATED 11.2023]`genome`=["Grch37"|"Grch38"] argument has to be specified when initializing the database
 
+## Why and What
 
 [The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
 
 This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 34G for WGS v2.1.1 (261.942.336 variants) and 98G for WGS v3.1.2 (about 759.302.267 variants), and allows scientists to look for various variant annotations present in gnomAD (i.e. Allele Count, Depth, Minor Allele Frequency, etc. - [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml) you can find all selected features given the genome version). (A query containing 300.000 variants takes ~40s.)
 
-It extracts from a gnomAD vcf about 23 variant annotations. You can find further infromation about the exact fields [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml).
+It extracts from a gnomAD vcf about 23 variant annotations. You can find further information about the exact fields [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml).
 
 ###### The package works for all currently available gnomAD releases.(July 2022)
 
 ## 1. Download SQLite preprocessed files
 
-I have preprocessed and created sqlite3 files for gnomAD v2.1.1 and 3.1.2 for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.
+I have preprocessed and created sqlite3 files for gnomAD for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.
 
-gnomAD v3.1.2 (hg38, **759'302'267** variants) 46.2G zipped, 98G in total - https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1 \
-gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1
+- WGS gnomAD v4.0 (hg38, **759'302'267** variants) 36.1G zipped, 74G in total - https://zenodo.org/records/10066323/files/gnomad_db_wgs_v4.0.sqlite3.gz?download=1
+- WES gnomAD v4.0 (hg38, **161'417'006** variants) 7.3G zipped, 17G in total - https://zenodo.org/records/10066310/files/gnomad_db_wes_v4.0.sqlite3.gz?download=1
+- WGS gnomAD v3.1.2 (hg38, **759'302'267** variants) 46.2G zipped, 98G in total - https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1
+- WGS gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1
 
 You can download it as:
 
@@ -41,7 +49,7 @@ gnomAD_DB.download_and_unzip(download_link, output_dir)
 #### NB this would take ~30min (network speed 10mb/s)
 
 
-or you can create the database by yourself. **However, I recommend to use the preprocessed files to save ressources and time**. If you do so, you can go to **2. API usage** and explore the package and its great features!
+or you can create the database by yourself. **However, I recommend using the preprocessed files to save resources and time**. If you do so, you can go to **2. API usage** and explore the package and its great features!
 
 
 ## 2. API usage
@@ -62,11 +70,11 @@ from gnomad_db.database import gnomAD_DB
 ```
 
 2. Initialize database connection \
-**Make sure to have the correct genome version!**
+**Make sure to have the correct gnomad version!**
 ```python
 # pass dir
 database_location = "test_dir"
-db = gnomAD_DB(database_location, genome="Grch38")
+db = gnomAD_DB(database_location, gnomad_version="v3")
 ```
 
 3. Insert some test variants to run the examples below \
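
For callers moving from the removed `genome` argument to `gnomad_version`, the key renames in this commit (`Grch37` to `v2`, `Grch38` to `v3` in `gnomad_columns.yaml`) suggest a simple translation table. The following is a minimal sketch; `legacy_genome_to_version` is a hypothetical helper and not part of the package.

```python
# Hypothetical helper, not part of gnomad_db: translate an old `genome`
# value into a `gnomad_version` value, following the key renames made in
# gnomad_columns.yaml by this commit (Grch37 -> v2, Grch38 -> v3).
LEGACY_GENOME_TO_VERSION = {"Grch37": "v2", "Grch38": "v3"}


def legacy_genome_to_version(genome: str) -> str:
    """Return the gnomad_version string matching a legacy genome value."""
    try:
        return LEGACY_GENOME_TO_VERSION[genome]
    except KeyError:
        raise ValueError(f"Unknown legacy genome value: {genome!r}") from None


# Old call: gnomAD_DB(database_location, genome="Grch38")
# New call: gnomAD_DB(database_location, gnomad_version="v3")
print(legacy_genome_to_version("Grch38"))
```

The v4.0 databases listed in the README are also hg38, so a mapping from the old genome value can never select them; `gnomad_version="v4"` has to be passed explicitly.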

Snakefile

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ database_location = config['database_location']
 gnomad_vcf_location = config['gnomad_vcf_location']
 tables_location = config['tables_location']
 script_locations = config['script_locations']
-genome = config['genome']
+gnomad_version = config['gnomad_version']
 KERNEL = config['KERNEL']
 
 
@@ -32,7 +32,7 @@ rule extract_tables:
     message:
         "Running createTSVtables notebook..."
     shell:
-        "papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"
+        "papermill {input.notebook} {output.notebook} -p gnomad_vcf_location {gnomad_vcf_location} -p tables_location {tables_location} -p gnomad_version {gnomad_version} -k {KERNEL}"
 
 
 # -------------------------- INSSERT VARIANTS WITH MAF TO DATABASE ------------------------------
@@ -45,7 +45,7 @@ rule insert_variants:
     message:
         "Running insertVariants notebook..."
     shell:
-        "papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -p genome {genome} -k {KERNEL}"
+        "papermill {input.notebook} {output.notebook} -p database_location {database_location} -p tables_location {tables_location} -p gnomad_version {gnomad_version} -k {KERNEL}"
 
 # -------------------------- INSSERT VARIANTS WITH MAF TO DATABASE ------------------------------
 #rule create_GettingStartedNB:
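
The placeholder substitution in the `shell:` strings above can be approximated with plain `str.format`, which makes it easy to see the command that results from the renamed parameter. This is a sketch rather than Snakemake itself: the notebook paths are hypothetical (the rule's `input` and `output` are not shown in the diff), and the config values are the ones from `script_config.yaml` in this commit.

```python
from types import SimpleNamespace

# Values as they appear in script_config.yaml after this commit.
config = {
    "gnomad_vcf_location": "data",
    "tables_location": "test_out",
    "gnomad_version": "v2",
    "KERNEL": "gnomad_db",
}

# Hypothetical notebook paths; the real rule input/output are not in the diff.
rule_input = SimpleNamespace(notebook="in.ipynb")
rule_output = SimpleNamespace(notebook="out.ipynb")

# Same template as the new shell line of rule extract_tables.
template = (
    "papermill {input.notebook} {output.notebook} "
    "-p gnomad_vcf_location {gnomad_vcf_location} "
    "-p tables_location {tables_location} "
    "-p gnomad_version {gnomad_version} -k {KERNEL}"
)

command = template.format(input=rule_input, output=rule_output, **config)
print(command)
```

Because the notebooks receive the value through `-p gnomad_version`, the notebook parameter cell presumably has to use the same name; the notebooks themselves are not part of the lines shown here.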

gnomad_db/database.py

Lines changed: 13 additions & 4 deletions
@@ -8,13 +8,13 @@
 import yaml
 import pkg_resources
 
+
 class gnomAD_DB:
 
-    def __init__(self, genodb_path, genome="Grch38", parallel=False, cpu_count=None):
+    def __init__(self, genodb_path, gnomad_version, parallel=False, cpu_count=None):
 
 
         self.parallel = parallel
-        self.genome = genome
 
         if self.parallel:
             self.cpu_count = cpu_count if isinstance(cpu_count, int) else int(multiprocessing.cpu_count())
@@ -26,7 +26,10 @@ def __init__(self, genodb_path, genome="Grch38", parallel=False, cpu_count=None)
         with open(columns_path) as f:
             columns = yaml.load(f, Loader=yaml.FullLoader)
 
-        self.columns = list(map(lambda x: x.lower(), columns["base_columns"])) + columns[self.genome]
+
+        self.gnomad_version = self._parse_gnomad_version(gnomad_version, list(columns.keys())[1:])
+
+        self.columns = list(map(lambda x: x.lower(), columns["base_columns"])) + columns[self.gnomad_version]
         self.dict_columns = columns
 
         if not os.path.exists(self.db_file):
@@ -41,7 +44,7 @@ def open_dbconn(self):
 
 
     def create_table(self):
-        value_columns = ",".join([f"{col} REAL" for col in self.dict_columns[self.genome]])
+        value_columns = ",".join([f"{col} REAL" for col in self.dict_columns[self.gnomad_version]])
         sql_create = f"""
         CREATE TABLE gnomad_db (
             chrom TEXT,
@@ -171,6 +174,12 @@ def _pack_from_str(self, var: str) -> str:
         ref = var[2].split(">")[0]
         alt = var[2].split(">")[1]
         return chrom, pos, ref, alt
+
+    def _parse_gnomad_version(self, gnomad_version: str, supported_gnomad_versions: list) -> str:
+        gnomad_version = str(gnomad_version)
+        gnomad_version = gnomad_version.split(".")[-1]
+        assert gnomad_version in supported_gnomad_versions, f"We don't support this version: {gnomad_version}. Please select one fo the following ones: {supported_gnomad_versions}"
+        return gnomad_version
 
 
     def query_direct(self, sql_query: str):
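
To see which inputs the new `_parse_gnomad_version` accepts, its three steps can be reproduced outside the class. The sketch below copies the logic from the diff into a free function (`parse_gnomad_version` is a name chosen here for illustration) and assumes the supported list is `["v2", "v3", "v4"]`, i.e. the keys of `gnomad_columns.yaml` after `base_columns`.

```python
def parse_gnomad_version(gnomad_version, supported_gnomad_versions):
    """Standalone copy of the steps in _parse_gnomad_version from the diff."""
    gnomad_version = str(gnomad_version)
    # Keeps only the text after the last dot, e.g. "v3.1.2" becomes "2".
    gnomad_version = gnomad_version.split(".")[-1]
    assert gnomad_version in supported_gnomad_versions, (
        f"Unsupported version: {gnomad_version}. "
        f"Supported: {supported_gnomad_versions}"
    )
    return gnomad_version


# Assumed to match list(columns.keys())[1:] for the YAML in this commit.
supported = ["v2", "v3", "v4"]

accepted = parse_gnomad_version("v4", supported)
print("accepted:", accepted)

rejected = []
for candidate in ("v3.1.2", 4, "Grch38"):
    try:
        parse_gnomad_version(candidate, supported)
    except AssertionError:
        rejected.append(candidate)
print("rejected:", rejected)
```

As written, the helper keeps the last dot-separated component, so a full release string such as `"v3.1.2"` is reduced to `"2"` and rejected; only the short forms `"v2"`, `"v3"`, `"v4"` pass. The check also relies on `assert`, which Python skips when run with `-O`.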

gnomad_db/pkgdata/gnomad_columns.yaml

Lines changed: 20 additions & 5 deletions
@@ -4,7 +4,7 @@ base_columns:
 - REF
 - ALT
 - FILTER
-Grch37:
+v2:
 - AC # Alternate allele count for samples
 - AN # Total number of alleles in samples
 - AF # Alternate allele frequency in samples
@@ -23,23 +23,38 @@ Grch37:
 - AF_fin # Alternate allele frequency in XX samples of Finnish ancestry
 - AF_afr # Alternate allele frequency in samples of African/African-American ancestry
 - AF_asj # Alternate allele frequency in samples of Ashkenazi Jewish ancestry
-Grch38:
+v3:
 - AC # Alternate allele count for samples
 - AN # Total number of alleles in samples
 - AF # Alternate allele frequency in samples
 - InbreedingCoeff # Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation
 - MQ # Root mean square of the mapping quality of reads across all samples
 - QD # Variant call confidence normalized by depth of sample reads supporting a variant
 - ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
-# - DP # Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered
 - VarDP
 - AS_VQSLOD
-# - VQSLOD # Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model
 - AC_popmax # Allele count in the population with the maximum AF
 - AN_popmax # Total number of alleles in the population with the maximum AF
 - AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi
 - AF_eas # Alternate allele frequency in samples of East Asian ancestry
-# - AF_oth # Alternate allele frequency in XY samples of Other ancestry # not supported anymore 9.07.22
+- AF_nfe # Alternate allele frequency in XY samples of Non-Finnish European ancestry
+- AF_fin # Alternate allele frequency in XX samples of Finnish ancestry
+- AF_afr # Alternate allele frequency in samples of African/African-American ancestry
+- AF_asj # Alternate allele frequency in samples of Ashkenazi Jewish ancestry
+
+v4:
+- AC # Alternate allele count for samples
+- AN # Total number of alleles in samples
+- AF # Alternate allele frequency in samples
+- MQ # Root mean square of the mapping quality of reads across all samples
+- QD # Variant call confidence normalized by depth of sample reads supporting a variant
+- ReadPosRankSum # Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias
+- VarDP
+- AS_VQSLOD
+- AC_grpmax # Allele count in the population with the maximum AF
+- AN_grpmax # Total number of alleles in the population with the maximum AF
+- AF_grpmax # Maximum allele frequency across populations (excluding samples of Ashkenazi
+- AF_eas # Alternate allele frequency in samples of East Asian ancestry
 - AF_nfe # Alternate allele frequency in XY samples of Non-Finnish European ancestry
 - AF_fin # Alternate allele frequency in XX samples of Finnish ancestry
 - AF_afr # Alternate allele frequency in samples of African/African-American ancestry
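
The version keys in this YAML drive both the column list and the `CREATE TABLE` statement in `database.py`. The sketch below mirrors those two expressions on a small excerpt of the file. The dictionary holds only a few of the entries visible in the diff (the full lists are longer), and `table_columns` and `value_column_sql` are illustrative names, not package functions.

```python
# Excerpt of gnomad_columns.yaml as a plain dict. Only a few entries that
# are visible in the diff are included; the real lists are longer.
columns = {
    "base_columns": ["REF", "ALT", "FILTER"],
    "v3": ["AC", "AN", "AF", "AF_popmax"],
    "v4": ["AC", "AN", "AF", "AF_grpmax"],
}


def table_columns(gnomad_version):
    """Mirror of self.columns: lowercased base columns + version columns."""
    return [c.lower() for c in columns["base_columns"]] + columns[gnomad_version]


def value_column_sql(gnomad_version):
    """Mirror of value_columns in create_table: every annotation is REAL."""
    return ",".join(f"{col} REAL" for col in columns[gnomad_version])


v4_columns = table_columns("v4")
v4_sql = value_column_sql("v4")
print(v4_columns)
print(v4_sql)
```

One consequence visible in the YAML: the `*_popmax` columns of `v3` are named `*_grpmax` under `v4`, so a query that selects `AF_popmax` would need `AF_grpmax` when run against a v4 database.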

script_config.yaml

Lines changed: 1 addition & 1 deletion
@@ -2,5 +2,5 @@ database_location: "test_out" # where to create the database, make sure you have
 gnomad_vcf_location: "data" # where are your *.vcf.bgz located
 tables_location: "test_out" # where to store the preprocessed intermediate files, you can leave it like this
 script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
-genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
+gnomad_version: "v2" # main gnomad_version version of the gnomAD vcf file (e.g., v2, v3, v4)
 KERNEL: "gnomad_db"
