Coding Environment
Python Environments
There are multiple ways to write and run Python. Some people use only a text editor and the standard Python interpreter; some use an IDE; for data-analysis-oriented coding, I highly suggest the Jupyter notebook.
Python interpreter
This is the basic Python interpreter. I do not use it directly for production work, but every other environment below is built on it and simply makes it more convenient to use. I will explain what a Python interpreter is on another page.
ipython
IPython is a kernel built on top of the Python interpreter that provides a more convenient way to run Python. It is the default Python kernel used by Jupyter notebook. I do not use it directly from the shell; I use it through its high-level GUI, the Jupyter Notebook.
Jupyter notebook or lab
Jupyter notebook is my only working environment for daily analysis. Here are the main reasons I like it:
Like a lab notebook, it integrates all your code, annotations, and figures in one file. Everything is self-explanatory.
It executes not only Python code but also shell commands. With different kernels, it can also execute many other languages such as R, or even run Python and R in the same notebook. For Python, the IPython kernel is used.
Abundant notebook extensions make my life much easier.
Jupyter Lab is another GUI from Jupyter, billed as the "next-generation" interface. Check it out if you like; as of 2020 I haven't fully switched to it, but I may once it is more mature.
For analysis, I use Jupyter notebook; Jupyter notebook uses the IPython kernel; the IPython kernel uses the Python interpreter.
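To see this layering from inside a session, a minimal sketch like the one below reports which Python interpreter is running and whether it is wrapped by IPython. The function name describe_environment is made up for illustration, and the shell name mentioned in the comment is what I would expect inside a notebook kernel rather than something taken from this page.

```python
import sys


def describe_environment():
    """Report the interpreter in use and whether IPython is wrapping it."""
    try:
        # get_ipython() is only defined when the code runs under IPython
        # (for example inside a Jupyter notebook kernel, where the class is
        # expected to be ZMQInteractiveShell).
        shell = get_ipython().__class__.__name__
    except NameError:
        # Plain Python interpreter: there is no IPython layer on top.
        shell = None
    return {
        "interpreter": sys.executable,
        "python_version": sys.version.split()[0],
        "ipython_shell": shell,
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running the same cell in a notebook and in a plain python session should show the same kind of interpreter path, with only the ipython_shell entry differing.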
IDE
An IDE is a more comprehensive development environment. I use PyCharm when I develop Python packages for the basic infrastructure of my project, but I do not use it for daily analysis or for this book. As you become more familiar with programming and start to build your own tools, you will need to know more about PyCharm.
For package development, I use PyCharm; PyCharm uses the Python interpreter.
Start Jupyter Notebook
Jupyter Notebook runs as a web server. You start the server process, keep it running, and use it through a web browser:
By default, http://localhost:8888/ is the URL that opens your Jupyter notebook navigator. The token is just a temporary password. Setting up a real password for Jupyter is more convenient; it's explained here:
I suggest running the jupyter command inside a named screen session, because the server must stay alive for as long as you are doing your analysis.
ipynb file
Here is a simple example of a Jupyter notebook. When you save it from the Jupyter notebook page, it becomes a .ipynb file, a special JSON format that contains all your code, markdown notes, and other metadata.
It can be viewed in:
The Jupyter notebook web server
nbviewer, a website that renders notebooks from public GitHub repositories
nteract, a GUI app for Jupyter
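Because an .ipynb file is plain JSON, it can also be inspected with nothing more than the standard json module. The sketch below builds a tiny two-cell notebook in memory (its contents are made up for illustration), serializes it the way a saved file would look, and lists the cells after loading it back. The helper name summarize_cells is hypothetical.

```python
import json

# A minimal notebook in the version-4 format: one markdown cell, one code cell.
# The cell contents are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My analysis"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# This string is what would be written to a file such as example.ipynb.
ipynb_text = json.dumps(notebook, indent=1)

# Reading it back is ordinary JSON parsing.
loaded = json.loads(ipynb_text)


def summarize_cells(nb):
    """Return (cell type, source text) for every cell in a notebook dict."""
    return [(cell["cell_type"], "".join(cell["source"])) for cell in nb["cells"]]


for cell_type, source in summarize_cells(loaded):
    print(f"{cell_type:>8}: {source}")
```

For a real file, replacing the json.loads call with json.load on an opened .ipynb file should give the same kind of dictionary.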
Notebook extensions
Jupyter notebook has a rich set of third-party extensions that make it highly efficient. In the installation step, I added the jupyter_contrib_nbextensions package, which is a collection of most Jupyter notebook extensions. Because it's installed, you can see an additional panel called Nbextensions in your Jupyter navigator. You can activate any extension from there. Not all of them are useful, but here are the ones I use:
Spend a few minutes reading the documentation that comes with them and customize your Jupyter notebook environment!
Often, after finishing an analysis in a Jupyter notebook, we want to rerun it with different input files or test different parameters. Copying notebook files and changing all the values by hand is tedious and error-prone. Papermill is a great tool that solves this problem! With simple modifications, it turns any notebook into a parameterized pipeline:
Step 1: write a notebook template with a parameter cell
Step 2: write an execution notebook or use the command line interface to automatically execute the template notebook with different parameters!
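As a rough sketch of Step 2, the snippet below builds one papermill command per parameter set, using the command line form papermill INPUT OUTPUT -p NAME VALUE. The file names, parameter names, and the helper build_papermill_commands are all hypothetical; the commands are only printed here, and the subprocess call is left commented out so the sketch runs without papermill or a template notebook present.

```python
import subprocess  # only needed if the run line below is uncommented
from pathlib import Path


def build_papermill_commands(template, output_dir, parameter_sets):
    """Return one papermill CLI command (as an argument list) per parameter set."""
    commands = []
    for i, params in enumerate(parameter_sets):
        # Each run writes its own executed copy of the template notebook.
        output = Path(output_dir) / f"{Path(template).stem}_run{i}.ipynb"
        cmd = ["papermill", str(template), str(output)]
        for name, value in params.items():
            # -p NAME VALUE overrides a variable defined in the parameter cell.
            cmd += ["-p", name, str(value)]
        commands.append(cmd)
    return commands


# Hypothetical template and parameter values, for illustration only.
commands = build_papermill_commands(
    "template.ipynb",
    "runs",
    [
        {"input_path": "sample_a.csv", "cutoff": 0.05},
        {"input_path": "sample_b.csv", "cutoff": 0.1},
    ],
)

for cmd in commands:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute the notebook
```

From an execution notebook, the same runs can be launched with papermill's Python function execute_notebook, passing the template path, the output path, and a parameters dictionary. In either case, papermill finds the cell to override by its "parameters" tag in the template.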
Papermill is not very good at taking complex input (such as a dict with special objects in it), so it is better to use simple data types (numbers, strings, simple lists, etc.) in the parameter cell. If you really have complex needs, I usually solve that in two ways:
Write additional logic in another cell that derives the complex input from simple parameters
Use a more sophisticated config file, such as an .ini or .yaml file, and provide only its path as a parameter to papermill. I rarely find this necessary.
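The first approach might look like the sketch below: the parameter cell holds only simple values, and the following cell builds the richer objects from them. The variable names, file layout, and values are assumptions made up for illustration, and the two notebook cells are shown as one script separated by comments.

```python
from pathlib import Path

# --- Parameter cell (the one papermill overrides): simple types only ---
sample_names = ["a", "b"]   # hypothetical sample identifiers
data_dir = "data"           # hypothetical input directory
cutoff = 0.05               # hypothetical numeric threshold

# --- Next cell: derive the complex input from the simple parameters ---
# A dict of Path objects would be awkward to pass through papermill,
# so it is rebuilt here from strings and numbers.
sample_tables = {name: Path(data_dir) / f"{name}.csv" for name in sample_names}
settings = {"cutoff": cutoff, "tables": sample_tables}

for name, table in settings["tables"].items():
    print(f"{name}: {table} (cutoff={settings['cutoff']})")
```

Keeping the derivation in its own cell means each executed notebook still records both the simple values that were passed in and the objects built from them.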