Essential Python For Genome Science

File I/O

Read and write your data tables, or any Python object, from and to files

Last updated 5 years ago

All data analysis starts with reading files into memory and ends with saving results into new files. Here I summarize everything you need to know about files, and how to read and write files in Python. For file formats specific to genome science, I will talk about how to deal with them in a separate chapter ("Genome Science Data").

Text file & binary file

What is the difference?

Text files are human-readable, while binary files are meant to be read by programs.

For example, a SAM file (a basic genomic file format that contains reads and alignment information) is a text file, so you can read it directly with less. A BAM file is the binary version of a SAM file: you cannot get the same information using cat or less, and you need dedicated software called samtools to parse it back into SAM format.

Their Relationship

Text can be encoded into binary data, and binary data can be decoded back into text, as long as both steps follow the same standard.

If you have taken an introductory programming course, you have probably heard of the ASCII standard. It is one of the most common standards for turning a character into a number, which is called encoding. The number can be converted back into the character using the same standard, which is called decoding.

Similarly, another widely used encoding standard in the Python world is called utf8. You don't need to understand its details, but you will see this name a lot when you need to convert bytes (contents read from a binary source) into a string (human-readable).
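To make encoding and decoding concrete, here is a minimal sketch using only built-in Python. The example strings are made up for illustration.

```python
# Encoding turns characters (str) into numbers/bytes; decoding reverses it.
text = "ACGT"

# Each ASCII character maps to one number.
codes = [ord(char) for char in text]

# str.encode() produces a bytes object using the named standard.
ascii_bytes = text.encode("ascii")

# Non-ASCII characters need more than one byte in utf8.
accented = "caf\u00e9"
utf8_bytes = accented.encode("utf8")

# bytes.decode() converts bytes back into a human-readable string.
decoded = ascii_bytes.decode("ascii")
roundtrip = utf8_bytes.decode("utf8")

print(codes, ascii_bytes, decoded)
print(len(accented), len(utf8_bytes), roundtrip)
```

Note that the 4-character string with an accent becomes 5 bytes in utf8, which is why the number of characters and the number of bytes are not always the same.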

Are binary file formats all the same?

No, they are not. A binary file is essentially a long sequence of numbers. But when we define a binary file format, we can give specific positions a special meaning. For example, we can use the first 100 numbers to store some metadata (usually called a header), with the real content starting from the 101st number. In this case, the first 100 numbers need to be skipped when decoding the content; otherwise, they will turn into meaningless characters after decoding.

The BAM format, for example, contains a header section that stores metadata (e.g., how and when the file was generated, and which genome assembly was used). When you use samtools to read a BAM file, it knows how to correctly parse or skip the header, whereas if you parse the data directly from the beginning using Python, everything will be messed up.
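The header idea can be sketched with a toy binary format. Everything here (the 100-byte header size, the metadata string, the file name) is a hypothetical example and has nothing to do with the real BAM layout.

```python
import os
import tempfile

HEADER_SIZE = 100  # hypothetical format: the first 100 bytes are metadata

# Build the header: metadata text padded with zero bytes up to 100 bytes.
metadata = b"created_by=toy_example;assembly=hypothetical"
header = metadata.ljust(HEADER_SIZE, b"\x00")
content = "read1\tACGT\n".encode("utf8")

path = os.path.join(tempfile.mkdtemp(), "toy.bin")
with open(path, "wb") as f:
    f.write(header + content)

# A reader that knows the format handles the header separately.
with open(path, "rb") as f:
    raw_header = f.read(HEADER_SIZE)
    body = f.read().decode("utf8")  # the real content starts at byte 101

parsed_metadata = raw_header.rstrip(b"\x00").decode("utf8")

# A naive reader decodes from the very beginning and mixes the
# header and its padding into the content.
with open(path, "rb") as f:
    naive = f.read().decode("utf8")

print(repr(body))
print(repr(naive[:60]))
```

The naive result starts with the metadata and padding characters instead of the first record, which is the kind of mess a format-aware tool avoids.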

File Compression

You can compress both text and binary files to reduce their sizes, but compression does not change the type of the underlying content. By the way, a compressed file is itself usually a binary file, but many programs will recognize it and decompress it automatically to get the correct content. For example, the salmon quant command can take gzipped FASTQ files directly because it decompresses them into FASTQ internally during mapping.

Also, a BAM file is much smaller than the corresponding SAM file because of its special binary encoding and compression. But when you use samtools to view a BAM file, all the decoding and decompression happens inside samtools automatically.
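As a small illustration of transparent decompression, the sketch below uses Python's built-in gzip module. The FASTQ-like record and file name are made up.

```python
import gzip
import os
import tempfile

# A tiny FASTQ-like record, used only for illustration.
record = "@read1\nACGT\n+\nIIII\n"

path = os.path.join(tempfile.mkdtemp(), "toy.fastq.gz")

# "wt" / "rt" open the gzipped file in text mode, so we read and write str.
with gzip.open(path, "wt", encoding="utf8") as f:
    f.write(record)

with gzip.open(path, "rt", encoding="utf8") as f:
    restored = f.read()

# Opening the same file with plain open() gives the raw compressed bytes.
# Gzip files start with the two bytes 0x1f 0x8b.
with open(path, "rb") as f:
    magic = f.read(2)

print(restored == record, magic)
```

The content that comes out of gzip.open is still plain text; only the bytes stored on disk are binary.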

Python basic file I/O

I suggest you try the code in the Jupyter Notebook by yourself. Once done, you should be able to explain:

  • Read/Write text or binary files

  • Read/Write gzipped files

  • string and bytes

  • encode and decode

  • File handle and the cursor location

  • Read line by line, read lines, read certain position
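The points above can be tried quickly with a throwaway file. This is a minimal sketch using only the standard library; the file name and contents are made up, and it is not a replacement for the notebook.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "toy.txt")

# Write a small text file; newline="\n" keeps line endings as single bytes.
with open(path, "w", newline="\n") as f:
    f.write("line1\nline2\nline3\n")

# The file handle keeps a cursor that moves forward as you read.
with open(path) as f:
    first = f.readline()   # reads one line and moves the cursor past it
    rest = f.readlines()   # reads all remaining lines into a list
    at_end = f.read()      # the cursor is at the end, nothing is left

# Iterating over the handle reads line by line without loading everything.
with open(path) as f:
    stripped = [line.rstrip("\n") for line in f]

# In binary mode, tell() and seek() work on byte positions.
with open(path, "rb") as f:
    start = f.tell()       # cursor starts at byte 0
    f.seek(6)              # jump to the start of the second line
    chunk = f.read(5)      # read 5 bytes from there
    position = f.tell()    # cursor is now at byte 11

print(first, rest, stripped, chunk, position)
```

Reading in text mode returns str, while reading in binary mode returns bytes, so chunk needs a decode call before it can be used as a string.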

Pandas file I/O

Pandas is the Excel of Python. It can handle all kinds of tabular data, and it is super flexible and highly efficient, if you use it the right way.

After reading the notebook, you should be able to explain:

  • Read/Write CSV/TSV files, compressed or uncompressed

  • Use HDF to speed up large files' I/O
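Reading and writing TSV files with pandas can be sketched as follows, assuming pandas is installed. The table contents and file names are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# A toy count table: genes as rows, samples as columns.
df = pd.DataFrame(
    {"sample_a": [10, 0, 5], "sample_b": [7, 3, 0]},
    index=["gene1", "gene2", "gene3"],
)

out_dir = tempfile.mkdtemp()
tsv_path = os.path.join(out_dir, "counts.tsv")
gz_path = os.path.join(out_dir, "counts.tsv.gz")

# sep="\t" writes a TSV; the ".gz" suffix makes pandas gzip the output.
df.to_csv(tsv_path, sep="\t")
df.to_csv(gz_path, sep="\t")

# index_col=0 restores the gene names as the row index.
plain = pd.read_csv(tsv_path, sep="\t", index_col=0)
zipped = pd.read_csv(gz_path, sep="\t", index_col=0)

print(plain.equals(zipped))
print(zipped)
```

For HDF, pandas provides DataFrame.to_hdf and pd.read_hdf, which depend on the optional PyTables package. They are left out of this sketch so that it runs with pandas alone.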

Serialization of Python Objects

Not only your data, but most Python objects can be serialized onto disk. This is done with dedicated packages, and there are several to choose from:

  • pickle: Python's built-in package for serializing objects

  • joblib: a third-party serialization package widely used with sklearn

  • json: serializes basic Python data structures such as dict and list into a text format

For simplicity, I suggest using joblib.

Using joblib, you can serialize any Python built-in data structure, a pre-trained machine learning model, or most complex Python objects. Loading them back into memory is also just one line of code. Serialization is not designed for really large data (GBs of data), but it is fine for most processed data or objects that contain data.

More details about the pandas dataframe will be covered in a later page.

HDF5 format is a special data format that's designed to store large and heterogeneous data. It's a widely used data format in single-cell sequencing. The main reason to use it is its fast I/O speed. For a large data table (GBs), using HDF5 can be 10 times faster than using csv.gz or tsv.gz.
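The dump-and-load pattern looks roughly like this. The sketch uses the standard-library pickle and json packages so that it runs without extra installs; joblib follows the same pattern with joblib.dump(obj, path) and joblib.load(path). The result dictionary is a made-up example.

```python
import json
import os
import pickle
import tempfile

# A made-up analysis result holding nested built-in data structures.
result = {"sample": "toy", "counts": [10, 0, 5], "params": {"threads": 4}}

out_dir = tempfile.mkdtemp()
pkl_path = os.path.join(out_dir, "result.pkl")
json_path = os.path.join(out_dir, "result.json")

# pickle writes a binary file, so the handle is opened in "wb" / "rb" mode.
with open(pkl_path, "wb") as f:
    pickle.dump(result, f)
with open(pkl_path, "rb") as f:
    from_pickle = pickle.load(f)

# json writes human-readable text, but only supports basic types.
with open(json_path, "w") as f:
    json.dump(result, f)
with open(json_path) as f:
    from_json = json.load(f)

print(from_pickle == result, from_json == result)
```

One caution: only load pickle (or joblib) files from sources you trust, because loading them can execute arbitrary code.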
