Essential Python For Genome Science

About python


Last updated 5 years ago


Take An Introductory Course

Here I list the prerequisites for reading on. Don't be overwhelmed by the long list: a full understanding of everything below is not required, and I will explain these topics throughout the book. In the beginning, you only need to know these names and understand their primary usage and purpose. The best way to do so is to take a well-organized course or read a book for beginners. I don't want to repeat their content here, because they do a much better job than I could.

Here are my recommendations (choose any one of them; they cover the same ground):

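  1. Class 1, 2, 3 of this Coursera specialization: https://www.coursera.org/specializations/python

  2. Introduction to Python 3 by RealPython: https://realpython.com/learning-paths/python3-introduction/

  3. Beginning Python: https://www.amazon.com/Beginning-Python-Professional-Magnus-Hetland-ebook/dp/B06XGVVVMG (if you prefer a book; this was my first introduction to Python, before Coursera became popular. The book also has a Chinese version.)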

It might take several days to finish the introductory material, but it helps you build a reliable knowledge map for future study.

The Python Language

  • Basic data types: int, float, str

  • Basic data structures: list, dict, set

  • Control flow: if, for, break, continue, try...except...

  • Python built-in functions: range(), any(), all(), enumerate(), dir(), help(), isinstance(), open(), print() and others.

  • Python built-in modules (you only need to know what they are and their general usage; Google and YouTube can be helpful):

    • System and file related: pathlib, subprocess, json, gzip, multiprocessing, concurrent.futures

    • Regular expressions: re

    • Other enhancements to control flow or data structures: collections, random, itertools
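As a rough illustration of how several of the items above fit together, here is a minimal sketch that uses only the standard library. The toy records and the helper name `gc_fraction` are made up for this example; they are not part of any dataset or code used later in the book.

```python
import gzip
import re
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical toy records (name -> sequence); not from any real dataset.
records = {"gene_a": "ATGCGC", "gene_b": "ATGNNA", "gene_c": "TTGACA"}


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases among all bases in the sequence."""
    counts = Counter(seq)  # collections.Counter maps each base to its count
    return (counts["G"] + counts["C"]) / len(seq)


# Control flow with isinstance(), continue, and try...except.
gc_by_gene = {}
for name, seq in records.items():
    if not isinstance(seq, str) or "N" in seq:
        continue  # skip sequences that contain ambiguous bases
    try:
        gc_by_gene[name] = gc_fraction(seq)
    except ZeroDivisionError:
        gc_by_gene[name] = 0.0  # an empty sequence has no bases to count

# all() on a generator expression.
all_start_with_atg = all(seq.startswith("ATG") for seq in records.values())

# pathlib + gzip: write the records as gzipped FASTA text, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = Path(tmp_dir) / "toy.fa.gz"
    with gzip.open(fasta_path, "wt") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")
    with gzip.open(fasta_path, "rt") as handle:
        text = handle.read()

# re: pull the record names back out of the FASTA header lines.
names = re.findall(r"^>(\w+)$", text, flags=re.MULTILINE)

# enumerate() gives a running index alongside each item.
for index, name in enumerate(names, start=1):
    print(index, name, gc_by_gene.get(name, "skipped"))
print("all start with ATG:", all_start_with_atg)
```

If every line here looks familiar, you probably know enough of the language to continue; if not, dir() and help() on any of the objects above are a good place to start.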

Python built-in modules: The built-in modules come with the Python installation and can be considered part of the Python language. Python is a versatile language for almost any programming application, and so are its built-in modules. Not all modules are necessary for daily genome science.

Third-Party Packages

  • Numpy: The "N-dimensional array" data structure in Python. Everything related to linear algebra is based on it. In other words, everything is based on it.

  • Pandas: The "Excel" of Python; it handles your tables. A must-learn for genome science.

  • scipy, scikit-learn, and statsmodels: The statistics and basic machine learning packages for Python (they also cover many other applications beyond my knowledge). They all contain tons of functions, but here are simple examples of each:

    • scipy: some basic statistical tests (t, Wilcoxon, fisher_exact), building dendrograms, sparse matrix formats.

    • scikit-learn: PCA, KMeans, RandomForest, and many other models/algorithms

    • statsmodels: building linear models, multiple-testing correction, ANOVA

  • matplotlib and seaborn: The must-learn Python visualization packages. For publication purposes, these two are enough for any figure; your imagination is the limit.

Other third-party packages: The packages listed here are the primary packages for data science, but there are many others for more specific applications. With them, you can accomplish most tasks in a few lines of code. If you feel that some task needs more than 100 lines of code, you are likely reinventing the wheel. Try searching Google to see whether someone has already solved it for you in a beautiful package (this has happened to me many times).

Some great introductory learning materials:

  • 10 minutes to pandas: explains pandas usage in 10 minutes (or maybe an hour)...

  • The python ecosystem for Data Science: This YouTube video explains all the packages listed here.

  • Seaborn tutorial: The Seaborn documentation is a beautiful tutorial, not only for the package but also for data visualization principles.
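To give a feel for how the third-party packages work together, here is a minimal sketch that assumes numpy, pandas, and scipy are installed. The expression table, the sample names, and the column names are invented for illustration; with two replicates per group the t-test is only a demonstration of the call, not a sound analysis.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up expression table (genes x samples); values are illustrative only.
expression = pd.DataFrame(
    {
        "ctrl_1": [10.0, 200.0, 5.0],
        "ctrl_2": [12.0, 180.0, 6.0],
        "treat_1": [30.0, 190.0, 5.0],
        "treat_2": [28.0, 210.0, 4.0],
    },
    index=["gene_a", "gene_b", "gene_c"],
)

# numpy: element-wise math applied to the whole table at once.
log_expression = np.log2(expression + 1)

# pandas: label-based column selection and row-wise aggregation.
ctrl = log_expression[["ctrl_1", "ctrl_2"]]
treat = log_expression[["treat_1", "treat_2"]]
log_fold_change = treat.mean(axis=1) - ctrl.mean(axis=1)

# scipy: independent two-sample t-test, run per gene (axis=1 tests each row).
t_stat, p_value = stats.ttest_ind(treat.to_numpy(), ctrl.to_numpy(), axis=1)

# Collect the results in one table, indexed by gene.
summary = pd.DataFrame(
    {"log2_fc": log_fold_change, "p_value": p_value},
    index=expression.index,
)
print(summary.sort_values("p_value"))
```

The whole analysis is about ten lines of real work, which is the point made above about not reinventing the wheel.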
