Key Concept of Pandas
Last updated
I think in general, a complete genomic data analysis can be divided into three parts:
Data cleaning: reshape your data into a format that can be fed directly into your analysis steps
Data analysis: all the fancy computations~
Data visualization: visualize your computed results, to better tell a story
In practice, these three parts are usually intermingled, but the coding and computational knowledge needed for each part is quite different. This chapter covers the data cleaning part. Data cleaning is probably the most time-consuming and tedious step in genomic data analysis. Luckily, in Python we have the pandas package to help with that. Pandas is like the "Excel" of Python: a versatile package that can handle almost anything organized in a tabular format.
I think the best way to learn pandas is through practice. However, let me give you an overview and some learning resources before diving into real coding.
An introductory page from the official pandas documentation. It gives you a minimal introduction to pandas, although it takes more than 10 minutes 🤪.
Python for Data Analysis. This was my introductory book for pandas; its author is also the original author of the pandas package. If you fully understand this book, you are ready for any data cleaning work in Python.
The official pandas documentation. Once you have finished the first two resources, reading the tedious official documentation will further teach you how to write concise and fast code to get the job done. Because you will already have the basic knowledge of pandas, the documentation might feel a bit less tedious by then.
Numpy is the basic package for almost all data science applications in Python. It provides the most fundamental data structure, numpy.ndarray, an N-dimensional array type. Pandas, in contrast, is a package for higher-level manipulation of tabular data, but both Series and DataFrame use numpy arrays internally to store their data. To some extent, pandas is a higher-level wrapper around numpy.
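As a minimal sketch of this relationship (the values are made up for illustration), the data held by a Series can be pulled out as a plain numpy array:

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
s = pd.Series([1.5, 2.0, 3.5])

arr = s.to_numpy()   # the values of the Series as a numpy array
print(type(arr))     # <class 'numpy.ndarray'>
print(arr.dtype)     # float64
```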
You don't need to know everything about numpy before you start to learn pandas or do further analysis, but keeping this dependency relationship in mind is important. In addition, there are several important notes based on this relationship:
Data in a pandas Series or DataFrame always belongs to a certain data type (int64, float64, object, etc.). Most of these types are inherited from numpy, since pandas uses it to store data. The numpy documentation describes them in detail.
Many pandas operations are also inherited from numpy, including several techniques that make your code run drastically faster. It's OK if you don't know what these are right now. They are all original numpy features, but you can also learn them through pandas later. And once you become familiar with pandas, you will also understand most of numpy's syntax!
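To make these two notes a bit more concrete, here is a small sketch with made-up read counts. It assumes nothing beyond pandas and numpy being installed, and shows a numpy function applied to a whole Series at once, without a Python loop:

```python
import numpy as np
import pandas as pd

# Made-up read counts, for illustration only
counts = pd.Series([10, 200, 3000])
print(counts.dtype)             # an integer dtype defined by numpy

# A numpy function works directly on a pandas Series
log_counts = np.log10(counts)
print(type(log_counts))         # the result is still a pandas Series
print(log_counts.dtype)         # float64
```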
Read a very brief numpy introduction, like Chapter 4 of the Python for Data Analysis book above.
Learn pandas first. While learning pandas, if you run into an unknown numpy concept, google it to get a basic idea.
A Series is a 1-D vector of data, the basic building block of pandas.
Important facts about Series:
A Series has a dtype attribute indicating its data type. Usually, a Series contains data of a single type (int64, float32, etc.), but it also supports mixed data types through the "object" dtype. Most data types are defined by numpy.
A Series has an index, which stores a "name" for every value in the Series.
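The two facts above can be sketched as follows. The gene names and values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical gene names and counts
s = pd.Series([5, 0, 12], index=["geneA", "geneB", "geneC"])
print(s.dtype)          # an integer dtype
print(list(s.index))    # ['geneA', 'geneB', 'geneC']
print(s["geneB"])       # 0, selected by its name

# Mixing different kinds of values falls back to the "object" dtype
mixed = pd.Series([1, "a", 2.5])
print(mixed.dtype)      # object
```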
A DataFrame is a 2-D table. It is the main entry point of pandas. If you select one row or one column from a DataFrame, you get a Series.
Important facts about dataframe:
A DataFrame has two indexes: one for the row names and one for the column names. They can be accessed through the DataFrame.index and DataFrame.columns attributes.
DataFrame.dtypes refers to the data type of each column. This is because we usually (but not always) use rows to represent observations and columns to represent features. An observation (like one sample in your experiment) can have multiple features with different dtypes, but a single feature usually has a consistent dtype across all observations.
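Here is a minimal sketch of these facts, using a small made-up table where rows are samples and columns are features (all names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical sample table: rows are observations, columns are features
df = pd.DataFrame(
    {
        "cell_type": ["neuron", "glia", "neuron"],
        "n_reads": [1200, 950, 1830],
        "mapping_rate": [0.81, 0.77, 0.85],
    },
    index=["sample1", "sample2", "sample3"],
)

print(df.index)      # the row names
print(df.columns)    # the column names
print(df.dtypes)     # one dtype per column

col = df["n_reads"]        # selecting one column gives a Series
row = df.loc["sample1"]    # selecting one row also gives a Series
print(type(col), type(row))
```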
As said above, both Series and DataFrame must have an index. It can hold meaningful names, or default numbers if no names are provided. The index is a very important concept in pandas, and it is the biggest difference between pandas data structures and a Python list or numpy array.
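A short sketch of why the index matters, using made-up values. Without names, pandas falls back to default numbers; with names, operations between two Series are matched by name rather than by position:

```python
import pandas as pd

# No names provided: pandas creates default integer labels
no_names = pd.Series([0.1, 0.2, 0.3])
print(no_names.index)    # RangeIndex(start=0, stop=3, step=1)

# Hypothetical names "a", "b", "c"
named = pd.Series([0.1, 0.2, 0.3], index=["a", "b", "c"])
other = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Values are aligned by name, not by position
total = named + other
print(total["a"])        # 0.1 + 30
```

A Python list or numpy array would add the values position by position instead.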
Pandas supports list-like slicing to select data by position (integer number) through the iloc indexer.
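A minimal sketch of position-based selection, using a small made-up table:

```python
import pandas as pd

# Made-up table, for illustration only
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

first_two = df.iloc[:2]     # rows at positions 0 and 1, like list slicing
cell = df.iloc[1, 1]        # row position 1, column position 1
last_col = df.iloc[:, -1]   # all rows, last column

print(first_two)
print(cell)                 # 20
```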
Because all pandas data structures have indexes, they also support selecting data by name through the loc indexer.
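A minimal sketch of name-based selection, again with a made-up table. Note that, unlike position slices, slices by name include both ends:

```python
import pandas as pd

# Made-up table, for illustration only
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

row_b = df.loc["b"]             # the row named "b", returned as a Series
cell = df.loc["c", "y"]         # row "c", column "y"
sub = df.loc["b":"c", ["x"]]    # rows "b" through "c" (inclusive), column "x"

print(cell)                     # 30
print(sub)
```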
Most importantly, pandas supports selecting data with a boolean vector, which lets you construct complex filters and apply them easily to your data.
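As a sketch of such a filter (the sample table and the thresholds are hypothetical), two conditions can be combined into one boolean vector and used to select rows:

```python
import pandas as pd

# Hypothetical sample table
df = pd.DataFrame(
    {
        "n_reads": [1200, 950, 1830],
        "mapping_rate": [0.81, 0.77, 0.85],
    },
    index=["sample1", "sample2", "sample3"],
)

# Each comparison gives a boolean Series; & combines them element-wise
mask = (df["n_reads"] > 1000) & (df["mapping_rate"] >= 0.8)
passed = df[mask]    # keep only the rows where mask is True

print(mask)
print(passed)
```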
There are also several other ways to select data. Combined, they allow you to do whatever subsetting you want on your data.
Remembering all these different methods will take some time. It might be frustrating to try and learn at the beginning, which is why we say data cleaning is the most tedious part, but it is also essential, so keep practicing ;)
Once you are familiar with pandas (e.g., after finishing the book above), congratulations, you are also comfortable with most numpy operations. All that remains is to go through numpy itself to learn it systematically.
A simple index contains only one name per item. However, you can also combine several names to create a MultiIndex. MultiIndex is a bit more complicated, so let's not discuss it here. The basic idea is that all items in pandas have name(s), and the name(s) are stored in the index object.
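Without going into details, here is a small sketch of what combined names look like. The chromosome and gene names are hypothetical:

```python
import pandas as pd

# Each item is named by a (chromosome, gene) pair; names are made up
idx = pd.MultiIndex.from_tuples(
    [("chr1", "geneA"), ("chr1", "geneB"), ("chr2", "geneC")],
    names=["chrom", "gene"],
)
s = pd.Series([5, 7, 9], index=idx)

chr1 = s.loc["chr1"]               # all items whose first name is "chr1"
one = s.loc[("chr2", "geneC")]     # a single item, selected by its full name

print(chr1)
print(one)                         # 9
```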
Selecting data from a whole DataFrame is a major topic of data cleaning. The pandas documentation and Chapter 4 of the book above have more detailed discussion and examples of this topic. Here I want to emphasize some key points: