All About Installations
Let's start with CORRECT installation
If you just want to quickly create the same environment that I use in this book, read the code snippet at the end.
Installation is usually a pain-in-the-ass step. It is especially frustrating for beginners to get stuck at the very first step of an analysis after spending a long time finishing an introductory course; I was one of them. So on this page, I want to explain in detail how I install and manage my packages.
What does installation mean?
If this paragraph is hard to understand, read the following sections first and then go back to read this summary.
In Python, installation means you use an installation command to install specific package(s) from a package repository into a package environment. The default environment and any other named package environments are controlled by the package manager.
Python 3, not Python 2
ONLY USE PYTHON 3. If you are new to Python, don't bother trying to understand what Python 2 is. It's an old version that reached its end of life in January 2020 and is no longer maintained.
Package manager
Conda is the most widely used package manager for Python. You can install conda in two ways:
Install Miniconda: a minimal installation of conda.
Install Anaconda: a comprehensive installation of conda plus many basic data-science packages.
I prefer Miniconda because it's small, but Anaconda also works fine.
More about conda
conda manages not only Python packages but also packages and software written in R or other programming languages. Conda has multiple subcommands; the most used ones are:
conda install: install packages
conda create: create conda environments
conda env: handle conda environments
conda config: set up things like conda channels
Installation command
There are two shell commands for installing packages:
pip install PACKAGE_NAME: pip is the default installer that comes with Python.
conda install PACKAGE_NAME: a more sophisticated installer that comes with conda.
How to choose between these two?
When a package is installable via conda, choose conda over pip, because conda checks the package dependency graph more thoroughly, which means less pain.
However, many immature or still-developing packages (e.g., a new package from a paper) only provide a pip install option; in that case, use pip.
Dependency Graph: when you install a Python package, the installer often installs dozens of other packages as well. Most Python packages are NOT built from scratch; they import many other packages for existing functionality. These dependencies are described in a dependency graph, and the installer needs to parse the graph to install all the necessary packages.
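To get a feel for what reading a dependency graph involves, here is a toy sketch using the standard `tsort` utility. The dependency pairs are a hand-written, simplified subset of what pandas needs, and this is not how pip or conda are actually implemented; real installers also have to reconcile version constraints.

```shell
# Each line reads "DEPENDENCY PACKAGE": the left item must be installed first.
# Hand-written toy graph (a simplified subset), not real installer metadata.
INSTALL_ORDER=$(tsort <<'EOF'
numpy pandas
python-dateutil pandas
pytz pandas
six python-dateutil
EOF
)

# Print one valid installation order; pandas depends on everything else,
# so it always comes last.
echo "$INSTALL_ORDER"
```

The exact order of the first few lines can differ between systems, since several orders satisfy the same graph.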
Specific packages
The word "specific" means both the name and the version of a package. Because different versions of the same package can be incompatible, it is good to install a specific version. For example, when you install a package called pandas, a command that pins the exact version (command 1) is better for production than a command that leaves the version open (command 2).
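As a sketch, the two commands might look like the following. The version number `1.0.3` is only a placeholder chosen for illustration, not necessarily the version used in this book. The commands are printed rather than executed, so you can copy the one you want into your terminal.

```shell
PANDAS_VERSION="1.0.3"   # placeholder version, for illustration only

# Command 1: pin the exact version (better for production).
CMD_PINNED="conda install pandas=${PANDAS_VERSION}"
# Command 2: leave the version open and let conda pick one.
CMD_LATEST="conda install pandas"

# Show both commands without running them.
printf '%s\n' "$CMD_PINNED" "$CMD_LATEST"
```

With pip, the pinned form uses a double equals sign instead: `pip install pandas==1.0.3`.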
How to determine which version to install?
If there are no specific conflicts, install the newest stable version. You may use command 2 first; at the end of this page, I give the exact version numbers I used for this book.
Package repository
Python also has two package repositories, one for pip, one for conda:
PyPI: this is where pip downloads package files when you execute pip install PACKAGE_NAME.
conda: yes, the word conda has multiple meanings... But basically, this is where conda downloads package files when you execute conda install PACKAGE_NAME. Importantly, conda has different channels hosting different packages, and you should set them up first (see below).
Conda Channels
Conda has multiple channels, which serve as the base for hosting and managing different packages. Conda has a default channel, but there are two other important channels for genome science:
conda-forge: where many important analysis packages are hosted.
bioconda: where most bioinformatics packages are installed from.
Bioconda is a channel for the conda package manager specializing in bioinformatics software. It hosts not only Python packages but also the most common bioinformatics tools: bwa, bowtie, samtools, bedtools, vcftools, deeptools, STAR, salmon, cutadapt, and so on.
Almost everything is installable via conda.
How to set up your conda channels?
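A minimal sketch of one common setup, assuming the channel order that the Bioconda documentation has recommended (this is an assumption, not necessarily the exact setup used for this book). The block does nothing if conda is not installed yet.

```shell
# Channels listed from lowest to highest priority.
CHANNELS="defaults bioconda conda-forge"

if command -v conda >/dev/null 2>&1; then
    # --add puts a channel at the top of the list, so the last one added
    # (conda-forge) ends up with the highest priority.
    for channel in $CHANNELS; do
        conda config --add channels "$channel"
    done
    # Show the resulting channel list.
    conda config --show channels
else
    echo "conda not found; install miniconda or anaconda first"
fi
```

These settings are saved in your conda configuration, so you only need to do this once.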
Package environment
My Practice
A conda environment is a directory that contains a specific collection of conda packages that you have installed.
In my practice, I create a separate conda environment for each of my projects. The environment isolates everything I install for one project from the others, so they don't intermingle, which gives me great consistency.
If you don't do that, every new package is installed into the default environment, and a new version will overwrite the old one; the new version might contain incompatible changes that break your existing code.
How to create an environment?
Let's check which Python is currently used by default. The which command tells you where a command's executable file is actually located.
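A small sketch of that check. It uses `command -v`, the shell's built-in lookup, in place of `which` because `which` is not installed on every system; on most machines `which python` prints the same path.

```shell
# Find the full path of the python executable the shell would run.
# Falls back to python3, then to a message if neither is on the PATH.
PYTHON_PATH=$(command -v python || command -v python3 || echo "python not found on PATH")
echo "$PYTHON_PATH"
```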
Now we create an environment called "genome_book" for this book, using Python 3.7. Then we enter this environment and check the Python location again.
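These steps could be sketched as follows. The environment name and Python version come from the text above; the `-y` flag and the line that loads `conda.sh` are my additions so the block can run as a non-interactive script, and the block does nothing if conda is not installed.

```shell
ENV_NAME="genome_book"

if command -v conda >/dev/null 2>&1; then
    # Create the environment with Python 3.7 (-y skips the confirmation prompt).
    conda create -y -n "$ENV_NAME" python=3.7
    # Make `conda activate` usable inside a script.
    . "$(conda info --base)/etc/profile.d/conda.sh"
    # Enter the environment and check which python is used now.
    conda activate "$ENV_NAME"
    command -v python
else
    echo "conda not found; install miniconda or anaconda first"
fi
```

In an interactive terminal where conda is already initialized, `conda create` and `conda activate` are all you need to type. The printed path should now point inside the environment's own directory.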
Next, let's install an example package into this environment. You will see that tools installed inside an environment are only available after you enter that environment. In this way, the environment isolates things!
If you are curious, how does the environment change work?
Here is how the above magic works; the key is the "PATH" environment variable.
The PATH variable is a list of directory paths, separated by the ":" character, that the shell searches to find a command's actual location. Directories that appear first are searched first, and the search stops as soon as the command is found. By putting its own directory at the front of PATH, the conda environment controls which version of a command takes priority.
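Here is a minimal sketch of that mechanism without conda. The directory names `env_a` and `env_b` and the `hello` command are made up for the demo; it only imitates the PATH part of what activating an environment does (real activation sets other variables too).

```shell
# Build two throwaway "environments", each with its own `hello` command.
DEMO_DIR=$(mktemp -d)
mkdir -p "$DEMO_DIR/env_a/bin" "$DEMO_DIR/env_b/bin"
printf '#!/bin/sh\necho "hello from env_a"\n' > "$DEMO_DIR/env_a/bin/hello"
printf '#!/bin/sh\necho "hello from env_b"\n' > "$DEMO_DIR/env_b/bin/hello"
chmod +x "$DEMO_DIR/env_a/bin/hello" "$DEMO_DIR/env_b/bin/hello"

ORIGINAL_PATH=$PATH

# "Activate" env_a by putting its bin directory first in PATH.
PATH="$DEMO_DIR/env_a/bin:$ORIGINAL_PATH"
FIRST=$(hello)

# "Activate" env_b instead: the same command name now resolves elsewhere.
PATH="$DEMO_DIR/env_b/bin:$ORIGINAL_PATH"
SECOND=$(hello)

echo "$FIRST"    # hello from env_a
echo "$SECOND"   # hello from env_b

# Restore PATH and clean up.
PATH=$ORIGINAL_PATH
rm -rf "$DEMO_DIR"
```

The command name never changes; only the directory at the front of PATH does, and that alone decides which `hello` runs.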
The environment is a great idea; we should all use it!
Summary: creating the environment for this book
This page is a very wordy introduction to installation... But I did waste tons of time installing things before I understood the knowledge above. Here is a summary code snippet that creates the same environment I use in this book:
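The book's full package list and exact version numbers are not reproduced here, so treat the following as a sketch under assumptions: it uses only the names mentioned on this page (the genome_book environment, Python 3.7, the conda-forge and bioconda channels, and pandas as the one example package, left unpinned). The block does nothing if conda is not installed.

```shell
ENV_NAME="genome_book"

if command -v conda >/dev/null 2>&1; then
    # Create the environment in one step; channels listed first take priority.
    # pandas is unpinned here; add =VERSION once you know the version you need.
    conda create -y -n "$ENV_NAME" -c conda-forge -c bioconda python=3.7 pandas
    # Make `conda activate` usable inside a script, then enter the environment.
    . "$(conda info --base)/etc/profile.d/conda.sh"
    conda activate "$ENV_NAME"
    # List what was installed, with exact version numbers.
    conda list
else
    echo "conda not found; install miniconda or anaconda first"
fi
```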
Consistency 🍻!