Comparing columnar data formats: Arrow, Vaex and kdb+

Matt Doherty data, kdb+ Leave a Comment

Here at AquaQ we have years of experience using kdb+ in the financial sector, helping our clients capture, analyse and derive value from their data. We always take a keen interest in technologies that will help us solve new problems, or provide better solutions for existing problems. To this end Apache Arrow is very interesting. Arrow describes itself as a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. It’s essentially a standardized format for columnar data, similar to that used by kdb+. This means it leverages the same columnar advantages kdb+ enjoys, such as:

  • Data adjacency for sequential access (scans)
  • O(1) (constant-time) random access
  • SIMD optimization and vectorization-friendly

A full description of the arrow data format can be found here. Arrow alone is simply a specification and a set of building blocks to interact with different technologies. It doesn’t come with a query language of it’s own, or a set of tools for analysing the data: this is where vaex comes in. Vaex is a python library for lazy “out-of-core” dataframes. It essentially allows you to explore and visualize datasets stored as arrow (or other formats) that are too big to fit into memory. Users can specify queries in a syntax similar to that used by pandas. Vaex will figure out which sections of data need pulled from disk and read only these. So by combining arrow and vaex we get a solution that closely matches the feature set of kdb+ on-disk databases. So far vaex has largely been demoed on datasets like the NYC taxi dataset, and astronomical datasets, and compared against tools like dask and spark. Here we’re going to show it could potentially be very useful in the world of finance as an alternative to kdb+ for timeseries datasets.

Data loading

To demo and discuss these two technologies side by side we’re going to use a single day of NYSE TAQ, which is a very common dataset in finance containing all the trades and quotes for securities listed on US regulated exchanges. For reference we’re using kdb+ 4.0, and vaex 3.0.0. We parsed this data into arrow – via vaex – with something like:

And when saved on disk as a single arrow file we have:

$ ls arrowdata/
220191007_trade.arrow

We do the same with the much larger quote data we’re left with a data folder coming in at 123GB:

$ du -sh arrowdata/
123G    arrowdata/

To parse this same data into kdb we use something like:

q)trade:("TCS*JFCJJJC*TTC";enlist"|")0: -1_system"zcat EQY_US_ALL_TRADE_20191007.gz"

For more detail on parsing this data in kdb+ see our previous article on the topic. After saving to disk we end up with our standard kdb splayed partition structure:

$ tree kdbdata/
kdbdata/
├── sym
└── 2019.10.07
    └── trade
        ├── Exchange
        ├── Participant_Timestamp
        ├── Sale_Condition
        ├── Sale_Condition#
        ├── Sequence_Number
        ├── Source_of_Trade
...

Which comes in at a total size of 74GB:

$ du -sh kdbdata/
74G     kdbdata/

The size difference is likely due to the use of enumerations: our arrow file stores all string columns as simple strings, whereas kdb uses enumerations (symbols). Enumerations are much more space efficient as they don’t store repeating values. Arrow does in fact support enumerations, but there are currently some issues with vaex preventing us using them here (we’ll discuss this more later). We can see that arrow and kdb+ files look somewhat different at this level, with kdb columns exposed as individual files and arrow appearing as a single unified file. Despite the superficial difference the structure is actually very similar, arrow files simply have their columnar structure hidden within a single file. It’s worth noting here that both sets of files are uncompressed. Both kdb and arrow do support compression, so it would be an interesting follow up to look at the performance impacts in each case.

So, now that we have some data let’s try running some queries and actually looking at our data! Loading or mounting this data is very similar in vaex and kdb+. In both technologies we load very little into memory: only meta data about the tables (column names, types etc.) is required, the actual data is memory-mapped. As such it is a very fast operation. First, in kdb to load our data we simply load the directory:

q)\l kdbdata
q)tables[]
`s#`quote`trade

And in vaex we map each table separately:

When we simply type the name of a table in vaex it knows not to fetch the entire table into memory, but only a small sample to display, so the query above runs almost instantly. Even though the trade data is over 3GB it only needs to read a very small subset.

Querying data

We’re going to show three simple typical queries comparing the syntax and performance of kdb+ and vaex/arrow:

  • a simple filter query
  • a simple group by on the trade table
  • a larger group by on the quote table

These benchmarks/comparisons were all run on a linux server with an Intel Xeon Gold 6128 CPU @ 3.40GHz and 128GB of RAM. In each case the page cache has been cleared beforehand, so the comparison is as fair as possible. The aim here is not to run a rigorous benchmark suite, but simply illustrate the similarities and differences, with a rough performance comparison.

Filter query

So first a simple window filter: we’re going to select out the AAPL opening auction trade. First in kdb:

q)\ts select from trade where date=2019.10.07,Symbol=`AAPL, Sale_Condition like "*O*", Time within 2019.10.07D09:30 2019.10.07D16:00
417 15037536

And then in vaex:

So in both cases this filter is quite fast – 417ms for kdb and 486ms for vaex – as we’re able to scan only one column initially (Symbol), and then run subsequent filters on reduced subsets of columns, loading only what we need from each. In addition kdb can take advantage of the p attribute on the Symbol column to speed up the filter, but vaex gains a speed advantage from multithreading (note the difference between the wall and CPU times). If secondary threads are assigned kdb+ will utilize parallelism across multiple date partitions, but in this case we have only one partition.

Aggregation query

Next let’s try a simple aggregation on the trade table: we’re going to sum the traded volume by symbol and minute window during the day. This is sometimes called a bin or bar query, and is very common with timeseries datasets. First in kdb:

q)\ts select sum Trade_Volume by Symbol,0D00:01 xbar Time from trade
 where date=2019.10.07
6003 3758228400

And the same in vaex:

So again the performance this time is very close: 6s in kdb, and 8.85s in vaex. One of the nice things about working in vaex is that we have easy access to the rest of the python data science toolkit. For example we can run a similar query grouping AAPL trading volume into buckets, and immediately create a plot to see when during the day trading is concentrated:

This is also serves as a nice illustration of the expression system. Although it looks as though we’re pulling data from disk into an intermediate variable filtered_trade, that’s not what’s actually happening. In fact filtered_trade stores an expression – which is kind of like an instruction set – and vaex won’t go to disk for data until it has to as part of the aggregation statement. These expressions can be passed around and build on top of each other, in a sense it’s analogous to the functional programming concept of projections. This same feature can be very convenient for creating virtual columns without having to actually materialize them.

Large aggregation query

Finally we’re going to run an aggregation on the much larger quote table, finding the average mid price by time window. First in kdb:

q)\ts select mid:avg .5*Offer_Price+Bid_Price by Symbol,0D01 xbar Time from quote
 where date=2019.10.07
166711 60129674448

And then in vaex:

In this case performance is again relatively similar: the kdb query runs in 2 mins 46s, and the vaex query in 4min 28s. It’s interesting to note that in both this and the smaller previous aggregation queries vaex does not benefit as much from parallelism as in the filter query. This query has to touch a large quantity of data – a number of full unfiltered columns on the quote table – so memory usage is also interesting to look at. Using top to measure max memory consumption during the execution of the queries above the kdb query hits a max memory usage of 55.3GB, whereas vaex uses a max of 30GB. So although kdb executes this query faster, vaex is more memory efficient (this is even more impressive when you consider that the arrow dataset is larger on disk due to the lack of enumerations)

Discussion

Rather than dwelling on the minutia of these comparisons, the general point we’re trying to convey here is that there are now high quality open source tools which reproduce some of the kdb+ feature set, and are comparable in terms of performance. However, these tools are not necessarily a drop-in replacement and there are some important differences that deserve discussion.

It’s worth coming back to one difference between the arrow and kdb+ data that I mentioned earlier: the use of enumerations. The kdb+ tables used store a number of columns with low cardinality as symbols (enumerations). While arrow does support enumerations, there are still some issues to be ironed out on the vaex side, so all the string columns are simply stored as strings or arrays in this demo. As well as improving storage efficiency, some simple tests linked on the github issue indicate that using enumerations might improve vaex performance on aggregations by as much as a factor of 5. A performance improvement of this size would have vaex winning all the comparisons above. Another important note is support for partitioning of data. It is typical to partition kdb+ databases by date, creating virtual columns and allowing for efficient querying of subsets of very large timeseries datasets. Vaex does not currently have a directly comparable feature.

All the comparisons above apply to the “on-disk”, or hdb component of kdb+. Vaex is not a drop in replacement for real time kdb+ architectures using tickerplants, rdbs etc. This is not to say we couldn’t build something like this by combining other tools, but it certainly does not exist out of the box. I think it’s also fair to say that kdb+ is a much more mature tool: it’s been used in production systems for 20 years now, whereas the arrow format and vaex are only a few years old at this point (although the list of users is already quite impressive, including Google BigQuery, TensorFlow and AWS Athena)

Having said all this, vaex, arrow and the python data science ecosystem around them come with some significant advantages over kdb+. First and foremost these are fully free and open source tools, with a large community, good development momentum, and regular releases and updates. The syntax vaex uses, while it might look strange and clunky to veteran kdb+ developers, is very close to pandas query syntax and as such has a very easy learning curve for anyone coming from the python data science ecosystem. It comes with an expression system and virtual columns, it’s cloud friendly out of the box, and integrates with a range of data formats (parquet and hdf5 as well as arrow).

Perhaps the most important advantage over kdb+ is interoperability. One of the key design decisions in both the arrow format, and kdb+, is that the columnar serialization format should be the same as the format used in memory i.e. no serialization is required when going from disk to memory. This is what enables memory-mapping, and the query speed we saw in the queries above. kdb+ stores columns in it’s own proprietary columnar format, while arrow is an open standard that can be read and used by many different technologies. This opens up new and interesting options for passing data around with no serialization overhead. The arrow files used in the demo above could be read easily and directly by a wide range of other technologies, which is not true for serialized kdb database files.

Here at AquaQ we will continue to experiment with new technologies, adding our extensive experience with data and in the finance sector to build cutting edge solutions. For questions or to find out more about these technologies or any of our other projects, feel free to reach out and contact us at info@aquaq.co.uk.

Matt DohertyComparing columnar data formats: Arrow, Vaex and kdb+

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax