Commit a39f546

Merge branch 'release/v1.4.0'
2 parents efa1b67 + 60a1c8f commit a39f546

File tree

101 files changed

+902
-349
lines changed

conda/env_travis.yaml

Lines changed: 2 additions & 1 deletion
@@ -33,4 +33,5 @@ dependencies:
3333
- scikit-learn >=0.24
3434
- python-graphviz
3535
- seaborn
36-
- setuptools
36+
- setuptools
37+
- billiard >=3.6.4.0

docs/_images/example_01.png

-701 Bytes

docs/_images/example_02.png

-2.78 KB

docs/_images/example_06.png

-5.76 KB

docs/_images/example_06b.png

6.41 KB

docs/_images/example_08.png

-107 Bytes

docs/_images/example_13.png

-1.13 KB

docs/_sources/ologram.rst.txt

Lines changed: 82 additions & 5 deletions
@@ -31,7 +31,7 @@ These contain example Snakemake workflows, that can be reused or from which comm
3131

3232
Most of the commands presented in this section are demonstrated in the *ologram-modl_supp_mat* Git, along with certain perspectives.
3333

34-
**Note for contributors** : To contribute to OLOGRAM, begin at *pygtftk/plugins/ologram.py* and unwrap function calls from there, to get a sense of how they interact. We have detailed comments to explain the role of every function.
34+
**Note for contributors** : To contribute to OLOGRAM, begin at *pygtftk/plugins/ologram.py* and unwrap function calls from there, to get a sense of how they interact. We have detailed comments to explain the role of every function. *A detailed table with the role of each file is presented at the end of this document.*
3535

3636

3737

@@ -171,6 +171,9 @@ ologram (multiple overlaps)
171171

172172
It is also possible to use the **OLOGRAM-MODL** (Multiple Overlap Dictionary Learning) plugin to find multiple overlaps (ie. between n>=2 sets) enrichment (ie. Query+A+B, Query+A+C, ...) in order to highlight combinations of genomic regions, such as Transcriptional Regulator complexes.
173173

174+
.. note:: The null hypothesis of the statistical test is:
175+
- H0: Considering the genomic regions in the query set (--peak-file) and in the reference sets (--more-bed), the regions in one set are located independently of the regions in any other set. The regions are not assumed to be uniformly distributed: inter-region lengths are preserved.
176+
174177
This is done only on custom regions supplied as BED files with the `--more-bed` argument. In most cases you may use the --no-gtf argument and only pass the regions of interest.
175178

176179
For statistical reasons, we recommend shuffling across a relevant subsection of the genome only (ie. enhancers only) using --bed-excl or --bed-incl. This ensures the longer combinations have a reasonable chance of being randomly encountered in the shuffles. Conversely, if you do not filter the combinations, keep in mind that the longer ones may be enriched even though they are present only on a few base pairs, because at random they would be even rarer. As such, we recommend focusing comparisons on combinations of similar order (number of sets).
@@ -232,7 +235,7 @@ As the computation of multiple overlaps can be RAM-intensive, if you have a very
232235

233236

234237
Itemset mining details
235-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
238+
======================
236239

237240
In broad strokes, the custom itemset algorithm MODL (Multiple Overlap Dictionary Learning) will perform many matrix factorizations on the matrix of true overlaps to identify relevant correlation groups of genomic regions. Then a greedy algorithm based on how much these words improve the reconstruction will select the best words. MODL is only used to filter the output of OLOGRAM : once it returns a list of interesting combinations, OLOGRAM will compute their enrichment as usual, but for them only. Each combination is of the form [Query + A + B + C] where A, B and C are BED files given as --more-bed. You can also manually specify the combinations to be studied with the format defined in OLOGRAM notes (below).
238241

@@ -416,9 +419,9 @@ ologram_merge_runs
416419

417420
**Description:** Merge several runs of OLOGRAM into a single run, by treating each as a "superbatch" of shuffles.
418421

419-
OLOGRAM remembers all intersections occurring inside all minibatches, so as to calculate statistics. If you are using a large number of shuffles and/or very large files, this may cost a lot of RAM. In practice, you will seldom need more than 100 shuffles. But optionally, if you require increased precision, you can run OLOGRAM several times, treat each run as a "batch of batches" and merge and recalculate stats on the merged superbatch automatically using this command.
422+
OLOGRAM remembers all intersections occurring inside all minibatches, so as to calculate statistics. If you are using a large number of shuffles and/or very large files, this may cost a lot of RAM. In practice, you will seldom need more than 100-200 shuffles. But optionally, if you require increased precision, you can run OLOGRAM several times, treat each run as a "batch of batches" and merge and recalculate stats on the merged superbatch automatically using this command.
420423

421-
Around 100 shuffles is usually enough to robustly fit a Negative Binomial distribution. In terms of precision, a Negative Binomial mean under 1/100 (meaning this combination was not seen at least once in 100 shuffles) would not mean much anyways.
424+
Around 100-200 shuffles is usually enough to robustly fit a Negative Binomial distribution. In terms of precision, a Negative Binomial mean under 1/100 (meaning this combination was not seen at least once in 100 shuffles) would not mean much anyways.
422425

423426
.. code-block:: bash
424427
@@ -455,13 +458,87 @@ Notes
455458
-- For statistical reasons, with multiple sets the expected overlaps for the longer combinations (A+B+C+D+... when they are all independent) can be very low. As a result, longer combinations tend to be more enriched: this should be kept in mind when comparing enrichment values between combinations of a different order.
456459

457460
This is especially true when the total genomic coverage of the sets is low. We recommend instead shuffling only across a biologically relevant subsection of the genome (with -\-bed-incl) : for example, if studying Transcriptional Regulators, shuffling only on inferred Cis Regulatory Modules or promoters.
461+
If the shuffling is restricted to a sub-genome, features outside it are discarded. In essence, this mostly means switching to a smaller genome. Of course, since the shuffling is done only there, (H_0) becomes ‘... the features are independent and can only be located on the sub-genome’. This bears mentioning. In practice, this means shuffling only across shortened chromosomes.
458462

459463
Our Negative Binomial model helps alleviate this problem. Even so, if a combination is so rare that it is not encountered even once in the shuffles, it will have a p-value of NaN. Furthermore, if C is depleted with query but always present with A and B, and A and B are enriched themselves, A+B+C will be enriched.
460464

461465
-- BETA - When using --more-bed (and only that), you can give a list of bed files that should be kept fixated during the shuffles using the --keep-intact-in-shuffling argument.
462466

463-
-- RAM will be the biggest limiting factor. While 100 total shuffles should be enough to fit a Negative Binomial distribution in most cases, if needed try running more batches of fewer shuffles instead of the other way around.
467+
-- RAM will be the biggest limiting factor. While 100 total shuffles should be enough to fit a Negative Binomial distribution in most cases, if needed try running more batches of fewer shuffles instead of the other way around. The alternative is running them independently and merging them afterwards with *ologram_merge_runs*.
464468

465469
-- If you have many (30+) BED files in --more-bed, consider running a pairwise analysis first to divide them in groups of 10-20, and study the multiple overlaps within those groups. This is also more likely to be biologically significant, as for example Transcription Factor complexes usually have 2-8 members.
466470

467471
-- We recommend running the ologram_modl_treeify plugin on the resulting tsv file if you use multiple overlaps.
472+
473+
474+
-- Our Negative Binomial model is only an approximation for the underlying true distribution, which is likely close to a Beta Binomial. For instance, the Neg. Binom. approximation fails when the sets contain too few regions (at least 1K are needed), and will likely slightly overestimate the p-values in other cases. However, precision is usually good even for significant p-values, dropping only at the most significant levels (<1E-5), hence there is only a very small risk of false positives. Also, even if they are overestimated, the order of p-values is unchanged (as a Neg. Binom. is a special case of Beta) meaning if a combination 1 has a higher Neg. Binom. p-value than combination 2, its true p-value is also likely higher than the p-value of 2.
475+
476+
The Neg. Binom. is still the better option, as fitting the proper distribution (approximated as Beta) is more difficult. As such, an ad-hoc p-value based on the Beta distribution is given, but it will only be better than the Neg. Binom. on massive numbers of shuffles (thousands). We also added the empirical p-value as a new column (ie. number of shuffles in which a value as extreme is observed) if you believe the model to be inadequate.
477+
478+
-- Our model rests upon certain assumptions (ie. exchangeable variables, sufficient nb. of regions, etc.). The null hypothesis can be rejected if any assumption is rejected, or merely because the approximation holds only asymptotically. The fitting test is the key for that: if, when performing the shuffles, it is found that the distribution of S under our shuffling model does not follow a Neg. Binom., this will be reported. Then if the hypothesis is rejected (low p-val) but the fitting was good, it is reasonable to assume the combination is enriched. Admittedly, the fitting test does not test the tails of the distribution, but it shows if the general shape is close enough.
479+
480+
481+
------------------------------------------------------------------------------------------------------------------
482+
483+
484+
OLOGRAM file structure
485+
~~~~~~~~~~~~~~~~~~~~~~
486+
487+
Below is a detailed list of the source code files of OLOGRAM-MODL, with their roles. All paths are relative to the root of the *pygtftk* Git repository. The "Plugin" group designates plugins that can be called directly from the command line. A file extension of "pyx" designates a Cython file, "py" a Python file, and "cpp" a C++ file.
488+
489+
490+
.. list-table:: OLOGRAM-MODL files.
491+
:widths: 10 40 50
492+
:header-rows: 1
493+
494+
* - Group
495+
- File path
496+
- Description
497+
* - Plugin
498+
- pygtftk/plugins/ologram.py
499+
- *Root file.* Parses the arguments and calls the other functions.
500+
* - Utility
501+
- docs/source/ologram.rst
502+
- Documentation source.
503+
* - Root
504+
- pygtftk/stats/intersect/overlap_stats_shuffling.py
505+
- *Main function*. Called directly by the *ologram.py* plugin. All other function calls descend from this one.
506+
* - Root
507+
- pygtftk/stats/intersect/overlap_stats_compute.py
508+
- Functions to compute overlap statistics on (shuffled) region sets.
509+
* - Algorithm
510+
- pygtftk/stats/intersect/create_shuffles.pyx
511+
- Shuffle BED files and generate new "fake" BED files.
512+
* - Algorithm
513+
- pygtftk/stats/intersect/overlap/overlap_regions.pyx
514+
- Compute the overlaps between two sets of genomic regions, supporting multiple overlaps.
515+
* - Utility
516+
- pygtftk/stats/intersect/read_bed/read_bed_as_list.pyx
517+
- Turn a BED file into a list of intervals.
518+
* - Utility
519+
- pygtftk/stats/intersect/read_bed/exclude.cpp
520+
- Exclude certain regions from a set to create concatenated sub-chromosomes.
521+
* - Utility
522+
- pygtftk/stats/multiprocessing/multiproc.pyx
523+
- Helper functions and structures for multiprocessing.
524+
* - Statistics
525+
- pygtftk/stats/negbin_fit.py
526+
- Utility functions relative to the negative binomial distribution, including verifying its good fit.
527+
* - MODL
528+
- pygtftk/stats/intersect/modl/dict_learning.py
529+
- Contains the MODL algorithm, an itemset mining algorithm described in this paper.
530+
* - MODL
531+
- pygtftk/stats/intersect/modl/subroutines.py
532+
- Subroutines of the MODL algorithm. Those are pure functions and can be used independently.
533+
* - Utility
534+
- pygtftk/stats/intersect/modl/tree.py
535+
- A graph-based representation of combinations of elements.
536+
* - Plugin
537+
- pygtftk/plugins/ologram_merge_runs.py
538+
- Merge a set of OLOGRAM runs into a single run and recalculate statistics based on it.
539+
* - Plugin
540+
- pygtftk/plugins/ologram_merge_stats.py
541+
- Merge a set of OLOGRAM outputs calculated on different queries into a single output, preserving labels. Build a heatmap from the results.
542+
* - Plugin
543+
- pygtftk/plugins/ologram_modl_treeify.py
544+
- Turns a result of OLOGRAM-MODL multiple overlap (tsv file) into a tree for easier visualisation.
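
The Notes added to docs/_sources/ologram.rst.txt in this commit contrast a Negative Binomial p-value with an empirical p-value computed from the shuffles. The difference can be illustrated with a minimal Python sketch. This is not pygtftk's code: the function names are hypothetical, the method-of-moments fit is an assumption chosen for brevity, and the empirical p-value is taken here as the fraction of shuffles at least as large as the observed value.

```python
import math
from statistics import mean, variance


def empirical_p_value(shuffled, observed):
    """Fraction of shuffles whose statistic is at least as large as the observed one."""
    return sum(1 for s in shuffled if s >= observed) / len(shuffled)


def negbin_params(shuffled):
    """Method-of-moments estimate of (r, p); an assumed, simplified fitting procedure."""
    m = mean(shuffled)
    v = variance(shuffled)
    if v <= m:
        # A Negative Binomial requires variance > mean (overdispersion).
        raise ValueError("variance must exceed the mean to fit a Negative Binomial")
    p = m / v
    r = m * m / (v - m)
    return r, p


def negbin_pmf(k, r, p):
    """P(X = k) for a Negative Binomial with real-valued r, computed in log space."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def negbin_enrichment_p_value(shuffled, observed):
    """P(X >= observed) under the Negative Binomial fitted on the shuffles."""
    r, p = negbin_params(shuffled)
    cdf_below = sum(negbin_pmf(k, r, p) for k in range(observed))
    return max(0.0, 1.0 - cdf_below)


# Hypothetical overlap counts obtained from 10 shuffles (illustrative values only).
shuffled_counts = [2, 5, 3, 8, 4, 6, 1, 7, 3, 9]
print(empirical_p_value(shuffled_counts, 20))          # no shuffle reaches 20
print(negbin_enrichment_p_value(shuffled_counts, 20))  # small but non-zero tail probability
```

With only a few shuffles, the empirical p-value cannot go below 1/(number of shuffles) without dropping to zero, whereas the fitted model still yields a tail probability, which is consistent with the documentation's preference for the model-based value.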

docs/_static/documentation_options.js

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
var DOCUMENTATION_OPTIONS = {
22
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
3-
VERSION: '1.3.0',
3+
VERSION: '1.4.0',
44
LANGUAGE: 'en',
55
COLLAPSE_INDEX: false,
66
BUILDER: 'html',

docs/_static/example_01.png

-3.45 KB
