Why Apache Arrow?
The core uchimata library accepts only data in the Apache Arrow IPC format.
Why?
We’ve encountered several file formats used to store 3D coordinates of chromatin bins. Some come from proteomics tools (such as PDB or CIF), others are very general data file formats (e.g., CSV, TSV, XYZ). A few tools define their own formats: nucle3d, 3dg, or g3d. At their core, however, all these formats store three floating-point numbers (the xyz coordinates) per bin, plus some other columns (e.g., chromosome or genomic coordinate). Storing such information in delimiter-separated text files is error-prone and leads to incompatibility issues down the line.
Apache Arrow is a standard for storing columnar data in memory and on disk. It is widely adopted beyond computational biology, which allows us to leverage libraries and tools developed elsewhere. As a standard data format, Arrow integrates seamlessly with other data structures commonly used in data science: for example, it is easy to convert to and from numpy arrays or pandas DataFrames using the Python pyarrow library.
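As a minimal sketch of such round-tripping (the column names and random data below are purely illustrative, not a schema uchimata requires):

```python
import numpy as np
import pyarrow as pa

# Illustrative 3D bin coordinates as a numpy array (one row per bin).
xyz = np.random.rand(100, 3).astype(np.float32)

# numpy -> Arrow: each column becomes an Arrow array.
table = pa.table({
    "x": xyz[:, 0],
    "y": xyz[:, 1],
    "z": xyz[:, 2],
})

# Arrow -> pandas and back.
df = table.to_pandas()
table_again = pa.Table.from_pandas(df)

# Arrow -> numpy, one column at a time.
x = table.column("x").to_numpy()
```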
Converting to Arrow
To produce Arrow files from Python, we recommend consulting the Python Arrow Cookbook example on writing Arrow to disk.
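Following that cookbook pattern, here is a sketch of writing chromatin-bin coordinates to an Arrow IPC file; the column names (x, y, z, chr, coord) and values are illustrative assumptions, so check the uchimata documentation for the exact schema it expects:

```python
import pyarrow as pa

# Illustrative table: xyz coordinates plus chromosome and genomic coordinate.
table = pa.table({
    "x": pa.array([0.0, 1.2, 2.4], type=pa.float32()),
    "y": pa.array([0.5, 1.0, 1.5], type=pa.float32()),
    "z": pa.array([1.0, 0.8, 0.6], type=pa.float32()),
    "chr": pa.array(["chr1", "chr1", "chr1"]),
    "coord": pa.array([0, 100_000, 200_000], type=pa.int64()),
})

# Write the table as an Arrow IPC file.
with pa.OSFile("model.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back to verify the round trip.
with pa.OSFile("model.arrow", "rb") as source:
    loaded = pa.ipc.open_file(source).read_all()
```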
We also provide a script (written in JS, using deno) for converting some of the above-mentioned file formats to .arrow files.