📗
Essential Python For Genome Science
  • Before Start
  • Chapter Contents
  • Prerequisites
    • About the UNIX system
    • About python
  • UNDERSTAND RAW DATA
    • Stages of Genome Data Generation
    • From Bulk To Single Cell
    • Introduction To the Datasets
      • bulk RNA-seq
      • single-cell data
  • Work Environment
    • Chapter Ensemble
    • All About Installations
    • Keep Running
    • Coding Environment
    • Git and Github
    • Other Tips
  • Python and UNIX System
    • Run Python
    • File I/O
    • Run Shell Command In Python - I
    • 🎉Case Study: Mapping bulk RNA-seq reads with salmon
  • Data Cleaning
    • 🎉Key Concept of Pandas
    • 🎉Case Study: Aggregate Salmon Quant
    • Case Study: Exploring The Dataset 🚩
    • The "copy" and "inplace" Parameter 🚩
    • Case Study: Extract and Reformat GTF file 🚩
    • the correct vs. the wrong way of using pandas 🚩
    • Case Study: Bulk Sample PCA 🚩
  • PYTHON BASICS
    • Python can be lightning-fast ⚡️ 🚩
    • Run Shell Command In Python - II 🚩
    • Pointers In Python 🚩
    • Everything is an object 🚩
    • Thread and Process 🚩
    • Resource For Intermediate Python Knowledge 🚩
    • Python magic method 🚩
  • Genome Science Data
    • NGS Data Formats and Tools 🚩
      • SAM/BAM 🚩
      • BED 🚩
      • GTF 🚩
      • Bigwig / Bigbed 🚩
      • VCF / BCF 🚩
    • The Python Packages 🚩
  • Data visualization
    • Matplotlib Basics 🚩
    • Seaborn Basics 🚩
    • Interactive Data Visualization 🚩
  • Use R in Python
    • Why? 🚩
    • rpy2 🚩
  • Gotchas
    • Check whether package X is installed
    • BAM to FASTQ
    • Genomic Websites
Powered by GitBook
On this page
  • What does installation mean?
  • Python 3, not Python 2
  • Package manager
  • More about conda
  • Installation command
  • Specific packages
  • Package repository
  • Conda Channels
  • Bioconda
  • How to set up your conda channels?
  • Package environment
  • My Practice
  • How to create an environment?
  • If you are curious, how does the environment change work?
  • Summary: creating the environment for this book

Was this helpful?

  1. Work Environment

All About Installations

Let's start with CORRECT installation

PreviousChapter EnsembleNextKeep Running

Last updated 4 years ago

Was this helpful?

If you just want to quickly create the same environment as I will use in this book, read the code snippet in the end.

Installation is usually a pain-in-the-ass step. It is especially frustrated for beginners when being stopped at the first step of the analysis after spending a long time finishing an introduction course; I was one of them. So on this page, I want to detail explain how I install and manage my packages.

What does installation mean?

If this paragraph is hard to understand, read the following sections first and then go back to read this summary.

In python, installation means you use an to install (s) from a into a . The default and other named were controlled by the .

Python 3, not Python 2

ONLY USE PYTHON 3. If you are just new to python, don't bother to understand what is python2. It's an old version that's been deprecated by the python world now.

Package manager

is the most wildly used package manager for python. You can install conda in two ways:

  1. Install with the . A minimal installation of conda

  2. Install with the . Comprehensive installation of conda and many basic data-science packages.

  3. I prefer miniconda cause its small. But anaconda also works fine.

# after install either miniconda or anaconda, 
# in your terminal, check if conda correctly installed:
$ conda -V

More about conda

conda manages not only python package, but also other packages/softwares writing in R or any other programming languages. Conda contains multiple subcommands, the most used ones are:

  • conda install install packages

  • conda create create conda environments

  • conda env handle conda environments

  • conda config set up things like conda channel

Installation command

there are two shell commands for installing packages:

  • pip install PACKAGE_NAME: pip is the default installer comes with python

  • conda install PACKAGE_NAME: A more sophisticated installer comes with conda.

  • How to choose these two?

    • When the package is installable via conda, choose conda over pip. Because conda checks the package dependency graph more thoroughly, which means less pain.

    • However, many immature/developing package (e.g., a new package from a paper) only provide pip install options, then use pip.

Specific packages

The word "specific" means both the name and the version of a package. Because different versions of the same package can be incompatible. It is good to install a specific version. For example, when you install a package called pandas, command 1 is better than 2 for production:

# command 1:
conda install pandas=1.X.X  
# the "=1.X.X" means the version, 
# should be specific number depending on the available pandas version

# command 2: 
conda install pandas  # this, by default, install the newest version available

How to determine which version to install?

  • If no specific conflicts, install the newest stable version, you may use command 2 first, but in the end, I gave you the exact version number I used for this book.

Package repository

Python also has two package repositories, one for pip, one for conda:

Conda has multiple channels, which serve as the base for hosting and managing different packages. Conda has a default channel, but there are two other important channels for genome science:

  • conda-forge: where many important analysis packages are

  • bioconda: most bioinformatic packages installed from here.

Bioconda is a channel for the conda package manager specializing in bioinformatics software. It not only host python packages but also host the most common bioinformatic softwares: bwa, bowtie, samtools, bedtools, vcftools, deeptools, STAR, salmon, cutadapt, so on so forth.

Almost everything is installable via conda.

How to set up your conda channels?

# In your terminal, execute these commands in the same order:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# this means when you run "conda install", 
# conda first check conda-forge, then bioconda, then defaults

Package environment

My Practice

In my practice, I create separate conda environments for each of my projects. The environment isolates everything I installed for one project from the other, so they don't intermingle and give me great consistency.

If you don't do that, every time you install a new package, it will be installed into the default environment, and a new version will overwrite old one, which might contain incompatible changes that break your existing code.

How to create an environment?

Let's check the default python used currently, the which command tells you where an executable command file actually located.

$ which python
/Users/hq/miniconda3/bin/python

# I use miniconda, and the default python is in the miniconda3 dir

Now we create an environment called "genome_book" for this book and use the python 3.7. Then we enter this environment, check out python location again.

$ conda create -n genome_book python=3.7
(skip the long stdout text)

$ conda activate genome_book
# I entered the env now

$ which python
/Users/hq/miniconda3/envs/genome_book/bin/python
# The python location changed! 
# Everything will refer to the environment's version of python


# Let's go to look at the environment directory
$ cd /Users/hq/miniconda3/envs/genome_book/

$ ls -hl
total 0
drwxr-xr-x  109 hq  staff   3.4K Apr  9 11:06 bin
drwxr-xr-x   21 hq  staff   672B Apr  9 11:06 conda-meta
drwxr-xr-x  112 hq  staff   3.5K Apr  9 11:06 include
drwxr-xr-x   92 hq  staff   2.9K Apr  9 11:06 lib
drwxr-xr-x    3 hq  staff    96B Apr  9 11:06 man
drwxr-xr-x    8 hq  staff   256B Apr  9 11:04 share
drwxr-xr-x    9 hq  staff   288B Apr  9 11:04 ssl
# you can see the environment is just an standalone direcotry 
# that contains everything.

Next, let's install an example package into this environment. You will see tools installed inside an environment, are only available when you enter this environment. In this way, the environment isolates things!

# installation
$ conda install -n genome_book bedtools
(skip the long stdout text)
# IMPORTANT: the -n genome_book means only install the bedtools into this env


# make sure you still inside your environment, 
# if not, run the above conda activate again
$ which bedtools
/Users/hq/miniconda3/envs/genome_book/bin/bedtools
# bedtools installed!

# now let's go out of this environment
$ conda deactivate
# I am out of the environment now

$ which bedtools
bedtools not found
# bedtools not found!

$ which python
/Users/hq/miniconda3/bin/python
# python location also change back to my conda default

If you are curious, how does the environment change work?

The PATH variable is a list of directory paths, where the shell uses to find a command's actual location. It is separated by the ":" character. Directory appears first, will be searched first, and once found, the searching will stop. By changing the PATH variable, the conda environment controls your command priority.

$ conda deactivate
# make sure you are out of the environment

$ echo $PATH
/Users/hq/miniconda3/condabin:/Users/hq/bin:/usr/local/bin:/Users/hq/lib/hadoop-3.0.0/bin:/Users/hq/lib/spark-2.2.1-bin-hadoop2.7/bin:/Users/hq/lib/scala-2.12.4/bin:/usr/local/mysql/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Applications/Server.app/Contents/ServerRoot/usr/bin:/Applications/Server.app/Contents/ServerRoot/usr/sbin
# My default PATH variable
# You can see the "/Users/hq/miniconda3/condabin" is the first path.
# This means the shell will first search in my miniconda defalut directory


# we then go into the env, check PATH again
$ conda activate genome_book
# I entered the env now

$ echo $PATH
/Users/hq/miniconda3/envs/genome_book/bin:/Users/hq/miniconda3/condabin:/Users/hq/bin:/usr/local/bin:/Users/hq/lib/hadoop-3.0.0/bin:/Users/hq/lib/spark-2.2.1-bin-hadoop2.7/bin:/Users/hq/lib/scala-2.12.4/bin:/usr/local/mysql/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Applications/Server.app/Contents/ServerRoot/usr/bin:/Applications/Server.app/Contents/ServerRoot/usr/sbin
# Now the env dir "/Users/hq/miniconda3/envs/genome_book/bin" is the first path. 
# Shell will search the env dir first!

The environment is a great idea; we should all use it!

Summary: creating the environment for this book

This page is a very wordy introduction to the installation... But I do waste tons of time on installing things before I understand the above knowledge. Here is a summary code snippet to entirely create a same environment as I will use in this book:

# Make sure you install either miniconda or anaconda

# First, setup conda channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# Second, remove the incomplete environment created above if you did
conda env remove -n genome_book

# Third, create the environment and install everything in one command!
# These packages will be used in this book
conda create -n genome_book \
    python=3.7 \
    pandas==1.0.3 \
    seaborn=0.10.0 \
    jupyter==1.0.0 \
    jupyter_contrib_nbextensions==0.5.1 \
    scikit-learn==0.22.2.post1 \
    pysam==0.15.4 \
    deeptools==3.4.2 \
    pybedtools==0.8.1 \
    scanpy=1.4 \
    salmon==1.2.1

# Forth, enter the env
conda activate genome_book

# Fifth, check python
which python

# All done!

Consistency 🍻!

Dependency Graph: when you install a python package, the installer actually installs dozens of other packages. Because most packages in python are NOT built from scratch, it import many other packages for existing functionalities. Their dependencies are described in a , and the installer needs to phrase the graph to install all necessary packages.

: this is where pip download package file when you execute pip install PACKAGE_NAME

: yes, the word conda has multiple meanings... But basically, this is where conda download package file conda install PACKAGE_NAME, importantly, conda has different channels hosting different packages. You should set it up first (see below).

Conda

A is a directory that contains a specific collection of conda packages that you have installed.

Here is how the above magic works; the key is the.

dependency graph
PyPI
conda
Channels
Bioconda
conda environment
"PATH" environmental variable
Conda
miniconda
anaconda
installation command
specific package
package repository
package environment
package environments
package manager