Hi all,
this issue is intended to keep the community up to date on the current state of the conda solver, how you can improve things, and what we are working on to make it better.
What is the problem?
Conda currently uses an SAT (boolean satisfiability) solver to figure out the correct, and hopefully working, set of packages required to construct a functional environment. This means downloading the package index, cutting down the search-space, iterating the graph, inspecting the pinnings and so on.
Conda/Bioconda is special in that we have 1000s of Python and R packages. Recently, we’ve begun adding entire Bioconductor releases, with thousands of packages. Conda supports mixed environments, like Python+R+Perl, and does not remove old packages from the index. On the one hand, this enables reproducibility in the future (Need an old version of an R package or deepTools? No problem.), on the other hand it results in an incredibly large search space for the dependency solver to traverse. So in contrast to other package managers, Conda is constantly growing and we are currently not cutting out dead wood.
So we do face a special situation in Conda. Please take this into account when considering Conda’s performance. Yes, Conda is slow and will probably never be as fast as other package managers because Conda is vastly larger and supports scientific use-cases that others do not support.
However, we are aware of this and multiple people are working on it. See our tips below.
How to improve solver performance
Conda is especially slow if R is involved. This has historical reasons, as most of the packages are in all 3 supported channels (anaconda, conda-forge, bioconda). This was our fault. However, things should improve dramatically if you install the latest version available, e.g. bioconductor-deseq2=1.22.1. We’ve learned from past issues and now pin to one particular R version. However, old packages are still around for the sake of reproducibility.
Use pins, install packages with versions. Even conda create -n foo python=3 deeptools will help. You will magically solve all your R envs by simply adding r-base=3.5.1 to your package install list.
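As a rough sketch of the pinning advice above: the versions are the ones mentioned in this post, while the environment name r-pinned and the variable names are placeholders chosen for illustration. The --dry-run flag asks conda to report what it would install without creating anything, which makes it a cheap way to check that a pinned set solves.

```shell
# Hypothetical environment name; pick whatever suits your project.
env_name="r-pinned"

# Explicit versions (taken from the text above) shrink the solver's search space.
specs="r-base=3.5.1 bioconductor-deseq2=1.22.1"

# Print the exact command so it can be pasted into a bug report or script.
echo "conda create -n $env_name --dry-run $specs"

# Only attempt the solve where conda is actually installed.
if command -v conda >/dev/null 2>&1; then
    # $specs is left unquoted on purpose so each pin becomes its own argument.
    conda create -n "$env_name" --dry-run $specs || echo "solve failed" >&2
fi
```

Dropping --dry-run from the printed command creates the environment for real.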
Recommendations
A few recommendations, especially for environments with R inside:
- use conda >=4.6.x
- For bioconda packages, use the recommended channel order
- try the new experimental pycryptosat solver (https://www.anaconda.com/conda-4-6-release …):
  conda install pycryptosat
  conda config --set sat_solver pycryptosat
- use --strict-channel-priority:
  conda config --set channel_priority strict
- Do not use conda install; use conda create
- Use environment.yaml files wherever you can. These include exact package versions, removing much of the solver’s workload and drastically speeding things up.
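To make the last recommendation concrete, here is a minimal sketch that writes a pinned environment.yaml from the shell. The environment name deseq2-env is a placeholder, and the channel order shown (conda-forge, bioconda, defaults) is an assumption about the recommended Bioconda order that should be checked against the current Bioconda documentation. The package versions are the ones mentioned earlier in this post.

```shell
# Write a pinned environment file; "deseq2-env" is a hypothetical name.
cat > environment.yaml <<'EOF'
name: deseq2-env
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - r-base=3.5.1
  - bioconductor-deseq2=1.22.1
EOF

# Show the result so it can be reviewed and committed next to the analysis code.
cat environment.yaml
```

Running conda env create -f environment.yaml afterwards builds the environment from this file.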
Different people from the community are trying to improve the solver or are using different strategies to improve the situation. This is, and probably always will be, a work in progress. Conda will grow, and Anaconda and the community will improve things as we go.
Cutting down the search space
Please have a look at https://github.com/regro/conda-metachannel. Conda Metachannels are a work in progress but will allow users to specify upfront the portion of the graph they care about. It is very rare that users actually need ALL of the packages in bioconda/conda-forge. Think of it as a constrained channel: only a specific set of packages appears in this special channel. All others are not available, so you cannot recreate a 3-year-old environment with this channel. However, if you have this use case, you can just switch back to the normal channels.
Maybe we should have this at some point for our community. The idea could be to have all recent (~2 years) packages in this space while all others remain available to reproduce old envs. Start a discussion!
Bioconda is prepared
Very early on we recognised the special challenges that Conda is trying to face, and we are prepared for the special use-case of long-term reproducibility with BioContainers. The containers are frozen sets of conda environments. A BioContainer is created for every Bioconda package, but you can also create your own. https://usegalaxy.eu currently maintains 1034 environments using BioContainers, and it works well in that demanding environment.
Read more about this in our manuscript.
I recommend BioContainers for static/reproducible environments. For flexible environments we could use a metachannel in the future if we want to maintain this.
That said, I use conda on a daily basis, and with the above recommendations I do not need a metachannel, as the normal conda solver is fast enough for me. However, I believe the conda community is prepared for the future.
Feedback
We would like to get feedback; benchmarks and examples do help us. What does slow mean? Considering what Conda is doing for you behind the scenes, is 30s or a minute really slow? Please provide numbers and the exact installation command.
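One possible way to collect such numbers is a small timing wrapper like the sketch below. The function name time_cmd and the environment name solver-test are made up for illustration, and the resolution is whole seconds only. The --dry-run flag keeps the measurement to solving without installing anything.

```shell
# Run a command, then print it verbatim together with its wall-clock time.
time_cmd() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "command: $*"
    echo "elapsed: $((end - start))s (exit status $status)"
    return "$status"
}

# Example: time a dry-run solve, but only where conda is installed.
if command -v conda >/dev/null 2>&1; then
    time_cmd conda create -n solver-test --dry-run r-base=3.5.1 bioconductor-deseq2=1.22.1
    # The conda version matters for solver behaviour, so report it as well.
    conda --version
fi
```

Pasting the printed command and elapsed lines, along with the conda version and your channel configuration, gives maintainers something they can reproduce.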
Last but not least, I would like to thank the conda-forge team, Anaconda and the @bioconda/core team that are constantly working on all the packages and trying to keep things fast and reliable even with 100k packages.