Comparing Arrow, Parquet and kdb+ File Formats

Michael Turkington

We have been investigating the relative performance and compression of the kdb+ binary format, Parquet and Arrow. Parquet is a columnar file format widely used in data warehousing, while Arrow is a columnar in-memory format with an on-disk serialisation (the Arrow IPC, or "Feather", format) that is supported by some of the most popular data science tools. Some benchmarking has already been performed by the developers of Apache Arrow.

We opted to use Python for the investigation, due to the wide variety of data manipulation libraries available for it, and specifically the Vaex package. Vaex is a library with a Pandas-like API, used to visualise and explore large tabular datasets. The benefit of Vaex is that it does not necessarily have to read an entire file into memory to perform some action on it, an obvious boon to any sort of high-volume data platform.
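
As a quick illustration of this lazy behaviour, a minimal sketch that opens a file without reading it fully into memory (the file path is hypothetical):

import vaex

# vaex.open memory-maps supported formats (e.g. HDF5, Arrow) rather than
# reading the whole file into RAM
trade = vaex.open('trade.arrow')
print(trade.head(5))   # only the rows needed are materialised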

Compression

For testing purposes, we used a sample NYSE Trade file with approximately 34 million rows.  We investigated the sizes of the various formats, and how amenable they were to various compression algorithms. Sizes are summarised in the table below, with the compression algorithm noted in brackets.

File                Size (MB)
Feather             3508
kdb                 2905
CSV                 2876
Feather (lz4)       1667
Feather (zstd)      1002
CSV (gzip)          887
kdb (lz4)           847
Parquet             795
Parquet (gzip)      526
kdb (gzip)          521
Parquet (brotli)    444
Comparison of compression on various file types. CSV refers to the raw trade data; Feather is the Arrow IPC format; Parquet is the Apache Parquet format; kdb is a kdb+ binary file; lz4, zstd, gzip and brotli are the compression algorithms applied.

The "Feather" files seen here are used for storing Arrow tables, utilising the Arrow IPC format. This format is designed for transporting large quantities of data in chunks, and specifically for moving it from one process (or server) to another. Some benchmarking of the latest iteration of Feather can be found here. We tested various compression algorithms, including gzip, LZ4, zstd and Brotli. As expected, the uncompressed formats account for the largest on-disk objects. The next largest are generally the Arrow/Feather files under the various compression schemes, while Parquet, gzipped kdb+ and Brotli-compressed Parquet provide the best compression. These results aren't all that surprising, since Parquet is widely employed in data warehousing.
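
For reference, writing a Feather file with a chosen codec is a one-liner in pyarrow; a minimal sketch using a tiny stand-in table (in our tests this was the NYSE trade data):

import pyarrow as pa
import pyarrow.feather as feather

# a tiny stand-in table for illustration
table = pa.table({'Symbol': ['AAPL', 'MSFT'], 'TradeVolume': [100, 200]})

# compression may be 'lz4', 'zstd' or 'uncompressed'
feather.write_feather(table, 'trade.feather', compression='zstd')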

In a previous overview of Vaex conducted by AquaQ, compressed and uncompressed Arrow files were inspected. Using reduced NYSE trade files, a 1.2 GB Arrow file was produced (using the standard write method in Vaex) and subsequently compressed with the zstd compression algorithm. Query performance was checked for the uncompressed file:

Uncompressed Arrow file query performance.

and for the compressed file:

Compressed Arrow file query performance.

There appears to be some optimisation when reading from the compressed file. Further, it should be noted that Vaex makes significant use of multi-threading.
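
For illustration, producing and then compressing such a file might look like the sketch below (file paths are hypothetical, and we use the zstandard Python package for the compression step):

import vaex
import zstandard as zstd

df = vaex.open('trades_reduced.hdf5')   # hypothetical reduced NYSE trade file
df.export('trade.arrow')                # standard Vaex write method

# compress the exported Arrow file with zstd
cctx = zstd.ZstdCompressor()
with open('trade.arrow', 'rb') as src, open('trade.arrow.zst', 'wb') as dst:
    cctx.copy_stream(src, dst)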

Performance

The next stage of the investigation was to consider the practical aspects of these file types. Here we expand upon AquaQ's previous overview of Vaex and Arrow to include comparisons with Parquet files, and with kdb+ tables that have the grouped attribute applied to the ticker Symbol column. A different sample NYSE trade file was used (~40 million rows), along with a truncated quote file (~45 million rows), so that we could include an example with joins. The kdb+ (version 4.0) queries were run with 8 slave threads. All queries were run on a standard Intel Xeon host with 125 GB of memory, not tuned for any particular technology.

We started with a simple filter by a single Symbol value and a single SaleCondition comparison. In kdb this is a simple qSQL select: 

select from trade where Symbol=`AAPL, SaleCondition like "*O*"

Loading and filtering the files with Vaex is also a simple procedure, as the sketch below illustrates.
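
A minimal sketch of the equivalent Vaex filter (column names as in the NYSE data; the file path is hypothetical):

import vaex

trade = vaex.open('trade.arrow')
# keep AAPL trades whose SaleCondition contains 'O'
filtered = trade[(trade.Symbol == 'AAPL') &
                 (trade.SaleCondition.str.contains('O'))]

The following timings were recorded.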

File format              Time (seconds)
kdb+                     0.112
kdb+ with g-attribute    0.038
Arrow (Vaex)             0.748
Parquet (Vaex)           39
Comparison of times to simply query various file formats.

Unsurprisingly, kdb+ excels at this task both in the standard implementation and when the grouped attribute is applied to the data, using the lookup table to quickly identify the filtered Symbol. Arrow returns decent timings, but Parquet does very poorly.

The next test was to apply a simple sum aggregation to the TradeVolume column in the data, and group by the ticker Symbol. 

select sum TradeVolume by Symbol from trade
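
In Vaex the same aggregation is a single groupby call; a minimal sketch, assuming the trade file opened as before:

import vaex

trade = vaex.open('trade.arrow')   # hypothetical path
# sum TradeVolume for each Symbol
vol_by_sym = trade.groupby(by = 'Symbol',
    agg = {'TradeVolume': vaex.agg.sum('TradeVolume')})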

The results:

File format              Time (seconds)
kdb+                     0.119
kdb+ with g-attribute    0.103
Arrow (Vaex)             0.416
Parquet (Vaex)           12.7
Comparison of times to query various file formats with an aggregation.

Once again, the kdb+ file with the grouped attribute comes out ahead, but by a much smaller margin than in the previous example. The standard kdb+ binary file does better than Arrow, though both return decent timings; a previous comparison run without slave threads actually returned almost identical times for standard kdb+ and Arrow. Parquet again takes much longer, but considerably less time than in the previous example.

As an extension to this example, the data was additionally grouped by Exchange:

select sum TradeVolume by Symbol, Exchange from trade
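
In Vaex this simply means passing both columns to groupby; a sketch along the same lines:

import vaex

trade = vaex.open('trade.arrow')   # hypothetical path
# sum TradeVolume for each Symbol/Exchange pair
vol_by_sym_ex = trade.groupby(by = ['Symbol', 'Exchange'],
    agg = {'TradeVolume': vaex.agg.sum('TradeVolume')})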

resulting in

File format              Time (seconds)
kdb+                     0.389
kdb+ with g-attribute    0.338
Arrow (Vaex)             0.737
Parquet (Vaex)           18.8
Comparison of times to query various file formats with multiple groupings.

Interestingly, while the kdb+ timings more than triple, Vaex acting on the Arrow file seems to scale better at this additional level of grouping, not even doubling its original time. Parquet, although once again the slowest, increased by a similar margin to Arrow.

For the final example, we implemented a much more complex aggregation of both the trade and quote data and used an outer join (a union join, uj, in kdb+ nomenclature) to merge the results. This was a relatively simple procedure from a kdb+ perspective:

a:select avgspread:avg(OfferPrice-BidPrice), avgsize:0.5*avg(OfferSize+BidSize) by Symbol,Exchange from quote;
b:select maxprice:max TradePrice, minprice:min TradePrice, vwap:TradeVolume wavg TradePrice by Symbol,Exchange from trade;
a uj b

The same query was more difficult to implement in Vaex, since it has no native weighted-average function and cannot join tables on multiple columns. The first issue was resolved by calculating grouped sums and weighted sums in a first pass, then deriving a new 'vwap' column as the ratio of the two. For the second, the Vaex tables were converted to Pandas DataFrames and the Pandas merge() method was employed.

import vaex
import pandas as pd

# 'quote' and 'trade' are Vaex DataFrames opened earlier (e.g. with vaex.open)
# simple query to get average spread and size from quote table per sym/exchange pair
a = quote.groupby(by = ['Symbol', 'Exchange'],
    agg = {'avgspread':vaex.agg.mean('OfferPrice-BidPrice'),
           'avgsize':vaex.agg.mean('(OfferSize+BidSize)/2')})
# derive max and min price, total trade volume and volume-weighted
# price sum from trade table per sym/exchange pair
b = trade.groupby(by = ['Symbol', 'Exchange'],
    agg = {'maxprice':vaex.agg.max('TradePrice'),
           'minprice':vaex.agg.min('TradePrice'),
           'TradeVol':vaex.agg.sum('TradeVolume'),
           'WeightSum':vaex.agg.sum('TradeVolume*TradePrice')})
# add vwap column to table b as the ratio of weighted sum to total volume
b['vwap'] = b.WeightSum/b.TradeVol
# union-join the results (Vaex DataFrames converted to Pandas DataFrames)
result = pd.merge(a.to_pandas_df(), b.to_pandas_df(),
    how = 'outer', on = ['Symbol','Exchange'])

The results of these queries were:

File format              Time (seconds)
kdb+                     1.538
kdb+ with g-attribute    1.449
Arrow (Vaex)             2.149
Parquet (Vaex)           38.4
Comparison of times to query various file formats with complex aggregations and a join.

Both kdb+ implementations returned the best timings, with Vaex acting on the Arrow format taking only around half a second longer. Parquet was the slowest, but given the relative complexity of this query that is perhaps understandable.

Conclusion

Since we are primarily concerned with the comparison between the kdb+, Parquet and Arrow binary files, at a high level we can say that Parquet offers excellent compression but poor query performance, while Arrow and kdb+ both offer a good balance between the two. One caveat, not demonstrated in this investigation, is that kdb+ can use multiple threads to read partitioned tables in parallel, whereas Vaex has no equivalent support for partitions.

kdb+ is a proprietary, licensed product, whilst Arrow and Vaex are open source. The kdb+ file format is backed by continuous development of the q/kdb+ platform, whose developers regularly provide performance and quality-of-life updates (recent examples include support for additional compression algorithms and for Data-At-Rest Encryption). Arrow is supported by a wide array of open-source data science tools (of which Vaex is one), which makes a single copy of the data available to data engineers working across a variety of platforms and languages.

Examples of data science tools supported by Apache Arrow