Why Apache Arrow?
The core uchimata library accepts only data in the Apache Arrow IPC format.
Why?
We’ve encountered several file formats used to store 3D coordinates of chromatin bins. Some come from proteomics tools (such as PDB or CIF), others are very general data file formats (e.g., CSV, TSV, XYZ). A few tools define their own formats: nucle3d, 3dg, or g3d. At their core, however, all these formats store three floating-point numbers (the xyz coordinates) per bin, plus some other columns (e.g., chromosome or genomic coordinate). Storing such information in delimiter-separated text files is error-prone and leads to incompatibility issues down the line.
Apache Arrow is a standard for storing columnar data in memory and on disk. It is widely adopted beyond computational biology, which allows us to leverage libraries and tools developed elsewhere. As a standard data format, Arrow integrates seamlessly with other data structures commonly used in data science: for example, it is easy to convert to and from numpy arrays or pandas DataFrames using the Python pyarrow library.
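As a minimal sketch of such round-tripping (the column names and random data below are purely illustrative, not a schema uchimata requires):

```python
import numpy as np
import pyarrow as pa

# Illustrative 3D bin coordinates as a numpy array (one row per bin).
xyz = np.random.rand(100, 3).astype(np.float32)

# numpy -> Arrow: each column becomes an Arrow array.
table = pa.table({
    "x": xyz[:, 0],
    "y": xyz[:, 1],
    "z": xyz[:, 2],
})

# Arrow -> pandas and back.
df = table.to_pandas()
table_again = pa.Table.from_pandas(df)

# Arrow -> numpy, one column at a time.
x = table.column("x").to_numpy()
```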
Converting to Arrow
To produce Arrow files from Python, we recommend consulting the Python Arrow Cookbook example on writing Arrow to disk.
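Following that cookbook pattern, here is a sketch of writing chromatin-bin coordinates to an Arrow IPC file; the column names (x, y, z, chr, coord) and values are illustrative assumptions, so check the uchimata documentation for the exact schema it expects:

```python
import pyarrow as pa

# Illustrative table: xyz coordinates plus chromosome and genomic coordinate.
table = pa.table({
    "x": pa.array([0.0, 1.2, 2.4], type=pa.float32()),
    "y": pa.array([0.5, 1.0, 1.5], type=pa.float32()),
    "z": pa.array([1.0, 0.8, 0.6], type=pa.float32()),
    "chr": pa.array(["chr1", "chr1", "chr1"]),
    "coord": pa.array([0, 100_000, 200_000], type=pa.int64()),
})

# Write the table as an Arrow IPC file.
with pa.OSFile("model.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back to verify the round trip.
with pa.OSFile("model.arrow", "rb") as source:
    loaded = pa.ipc.open_file(source).read_all()
```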
We also provide a script (written in JS, using deno) for converting some of the above-mentioned file formats to .arrow files.