About Python
Take An Introductory Course
This page lists the prerequisites for reading the rest of the book. Don't be overwhelmed by the long list: a full understanding of everything below is not required, and I will also explain these topics along the way. At the beginning, you only need to know the names and understand their primary usage and purpose. The best way to get there is to take a well-organized course or read a book for beginners. I don't want to repeat their content here, because they do a much better job of it than I could.
Here are my recommendations (choose any one of them; they cover the same material):
Class 1, 2, 3 of this Coursera specialization: https://www.coursera.org/specializations/python
Introduction to Python 3 by RealPython: https://realpython.com/learning-paths/python3-introduction/
Beginning Python (if you prefer a book; this was my first introduction to Python, before Coursera became popular, and the book also has a Chinese edition): https://www.amazon.com/Beginning-Python-Professional-Magnus-Hetland-ebook/dp/B06XGVVVMG
It might take several days to finish the introductory material, but it helps you build a reliable mental map for future study.
The Python Language
Basic data types: int, float, str
Basic data structures: list, dict, set
Control flow: if, for, break, continue, try...except...
Python built-in functions: range(), any(), all(), enumerate(), dir(), help(), isinstance(), open(), print() and others.
Python built-in modules (just know what they are and their general usage; Google and YouTube can be helpful):
System and file related: pathlib, subprocess, json, gzip, multiprocessing, concurrent.futures
Other enhancements to control flow or data structures: collections, random, itertools
Python built-in modules: The built-in modules (the standard library) come with the Python installation and can be considered part of the Python language. Python is a versatile language for almost all programming applications, and so are its built-in modules. Not all modules are necessary for daily genome science.
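To make the list above a little more concrete, here is a minimal sketch that touches the basic types, data structures, control flow, a few built-in functions, and three of the built-in modules (pathlib, json, collections). The gene names, lengths, and file name are made up for illustration, and tempfile is used only so the example cleans up after itself; it is not one of the modules listed above.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Basic data types: int, float, str
n_reads = 3            # int
gc_fraction = 0.5      # float
sequence = "ATGCGC"    # str

# Basic data structures: list, dict, set (all values are hypothetical)
genes = ["geneA", "geneB", "geneA", "geneC"]   # list: ordered, allows duplicates
lengths = {"geneA": 1200, "geneB": 800}        # dict: key -> value lookup
unique_genes = set(genes)                      # set: unique items only

# Control flow: for, if, continue, try...except, plus enumerate() and isinstance()
long_hits = []
for index, gene in enumerate(genes):
    try:
        length = lengths[gene]
    except KeyError:
        continue  # skip genes without a known length (geneC here)
    if isinstance(length, int) and length > 1000:
        long_hits.append((index, gene))

# Built-in functions: any(), all(), range()
has_long_gene = any(value > 1000 for value in lengths.values())
all_positive = all(value > 0 for value in lengths.values())
even_positions = list(range(0, len(sequence), 2))

# collections.Counter: count how often each base occurs
base_counts = Counter(sequence)

# pathlib + json + open(): write a dict to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "lengths.json"
    with open(out_path, "w") as handle:
        json.dump(lengths, handle)
    with open(out_path) as handle:
        reloaded = json.load(handle)

print(long_hits, dict(base_counts), reloaded)
```

If every line here looks familiar after your introductory course, you know enough of the language to continue; dir() and help() are useful for exploring anything that does not.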
Third-Party Packages
NumPy: The "N-dimensional array" data structure in Python. Everything related to linear algebra is based on it; in other words, almost everything is built on it.
pandas: The "Excel" of Python; it handles your tables. A must-learn for genome science.
scipy, scikit-learn, and statsmodels: The statistics and basic machine learning packages for Python (with many other applications beyond my knowledge). Each contains tons of functions, but here are simple examples for each:
scipy: some basic statistical tests (t-test, Wilcoxon, Fisher's exact test via fisher_exact), dendrogram building, sparse matrix formats.
scikit-learn: PCA, K-means, RandomForest, and many other models/algorithms
statsmodels: linear models, multiple testing correction, ANOVA
matplotlib and seaborn: The must-learn Python visualization packages. For publication purposes, these two are enough for any figure; your imagination is the limit.
Other third-party packages: The packages listed here are the primary packages for data science, but there are many others for more specific applications. With them, you can achieve most goals in a few lines of code. If a task seems to need more than 100 lines of code, you are likely reinventing the wheel. Try searching Google to see whether someone has already solved it for you in a beautiful package (this has happened to me many times).
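As a small taste of the first two packages in this list, the sketch below builds a NumPy array and wraps it in a pandas table. It assumes numpy and pandas are installed, and the count matrix, gene names, and sample names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# NumPy: an N-dimensional array with fast element-wise operations.
# Rows are 2 hypothetical genes, columns are 3 hypothetical samples.
counts = np.array([[10, 0, 5],
                   [2, 8, 4]])
sample_totals = counts.sum(axis=0)   # sum down each column: one total per sample
fractions = counts / sample_totals   # broadcasting divides each column by its total

# pandas: the same numbers as a labeled table
table = pd.DataFrame(counts, index=["geneA", "geneB"], columns=["s1", "s2", "s3"])
gene_means = table.mean(axis=1)      # mean count per gene across samples
high_in_s1 = table[table["s1"] > 5]  # keep only rows where sample s1 exceeds 5

print(table)
print(gene_means)
```

The labels are the main difference: the array is addressed by position, while the table is addressed by gene and sample names, which is why pandas is usually the more comfortable tool for tabular data.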
Some great learning materials for introduction:
10 minutes to pandas: explains pandas usage in 10 minutes (maybe an hour...).
The Python ecosystem for Data Science: This YouTube video explains all the packages listed here.
Seaborn tutorial: The seaborn documentation is a beautiful tutorial not only for the package but also for the principles of data visualization.