AquaQ Parquet kdb+ converter

William Lowe

Following on from our columnar data formats blog post, we decided to investigate creating a custom API for Parquet. Traditionally we would do this through Python; this time we were interested to see what the performance benefits of a native approach would be.

Kdb-Parquet is a library that can convert kdb+ tables to and from the Apache Parquet table format. It provides a translation of the most common kdb+ primitive data types to their Apache Parquet equivalents and vice versa. The codebase offers similar functionality to the library here, but it does not use the embedPy interface, avoiding an extra translation step when encoding and decoding datasets.
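As a quick orientation, loading the library and round-tripping a small table looks roughly like the following sketch (the load path q/parquet.q and the file name are assumptions; the .pq functions are the ones exercised in the example scripts below):

\l q/parquet.q                        / load the Kdb-Parquet bindings (path is an assumption)

tab:([]j:1 2 3;f:3 4 5f;d:3#.z.d;s:enlist each "abc")
file:`:sample.parquet                 / target file (name is an assumption)

.pq.settabletofile[file;tab]          / write the kdb+ table out as Parquet
tab2:.pq.getfile file                 / read it straight back
tab~tab2                              / round trip should preserve supported types
.pq.getschema file                    / inspect the resulting Parquet schema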

Examples

Simple examples of reading, writing, and inspecting Parquet files are available in the supplied test file, as demonstrated below.

(base) kdb@linux:~/parquet/Kdb-Parquet$ q q/examples.q
KDB+ 4.0 2020.06.18 Copyright (C) 1993-2020 Kx Systems
******************************************************

"Parquet file reading examples"
============================================
Saving sample table: .pq.settabletofile[file;tab]
0i
Reading sample table: .pq.getfile[file]
j f d          s
-------------------
1 3 2020.12.15 ,"a"
2 4 2020.12.15 ,"b"
3 5 2020.12.15 ,"c"
Inspecting sample table: .pq.getschema[file]
name type
------------------
,"j" "int64"
,"f" "double"
,"d" "date32[day]"
,"s" "string"
Reading subset of columns from file: .pq.getfilebycols[file;`j`f`d]
j f d
--------------
1 3 2020.12.15
2 4 2020.12.15
3 5 2020.12.15
============================================
 Good bye

Streaming functionality

An example of the streaming functionality is also included in the supplied test file; a distilled sketch of the pattern follows the session output below.

(base) kdb@linux:~/parquet/Kdb-Parquet$ q q/streamexample.q
KDB+ 4.0 2020.06.18 Copyright (C) 1993-2020 Kx Systems
******************************************************

"Parquet file streaming example"
============================================
Saving sample table: .pq.settabletofile[file;tab]
0i
Creating blank list for each element to be inserted into: t:()
Creating function for streamread : f:{[x] t::t,x}
Streaming sample table: .pq.streamread[file;`f]
Reformatting sample table: flip(col)!flip(count col:`$exec name from .pq.getschema(file)) cut t
j f d          s
-------------------
1 3 2051.01.11 ,"a"
2 4 2051.01.11 ,"b"
3 5 2051.01.11 ,"c"
============================================
 Good bye
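Distilling the pattern from the script: streamread takes a file and the name of a callback, invokes the callback for each value read, and the caller rebuilds the table from the accumulated cells. A minimal sketch, using the same callback shape as the script above:

t:()                                  / accumulator for the streamed cells
f:{[x] t::t,x}                        / callback invoked with each value read
.pq.streamread[file;`f]               / stream the file through the callback

/ cells arrive row by row, so cut the flat list into rows and flip into a table
c:`$exec name from .pq.getschema file
tab:flip c!flip (count c) cut t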

Data type mappings

The Parquet intrinsic types are mapped according to the table below when reading from and writing to kdb+. In some cases assumptions have been made, particularly for symbols and certain temporal data types. Where possible, getfile and settabletofile encode and decode fields so that the two functions are essentially inverses of one another; a small round-trip check follows the table.

Parquet Type    kdb Type             Example
------------------------------------------------------------
Timestamp(ms)   Timestamp            2001.01.01D12:01:01.000000000
Date32(day)     Date                 2001.01.01
Time32          Time                 12:01:01.000
Time64          nyi
int64           Long                 12
string          array of characters
float32         Float                1.0
float64         Float                1.0
bool            Boolean              0b
uint16          Int                  12i
uint32          Long                 12
uint64          Float                12.3
decimal128      nyi
binary          nyi
Null            0h list              ()
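One way to sanity-check these mappings is to write out a table covering several of the supported kdb+ types and inspect the schema that comes back (a minimal sketch; the file name is an assumption, and given the Timestamp(ms) mapping, sub-millisecond precision on timestamps is presumably truncated on the way through):

types:([]p:3#.z.p;d:3#.z.d;t:3#.z.t;j:1 2 3j;f:1 2 3f;b:101b;s:("ab";"cd";"ef"))
.pq.settabletofile[`:types.parquet;types]
.pq.getschema`:types.parquet          / expect timestamp[ms], date32[day], time32, int64, double, bool, string
.pq.getfile`:types.parquet            / reading back should recover the kdb+ types above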

Comparison to embedPy interface

The embedPy interface is a flexible API that allows Python and kdb+ to share memory and interact with each other; in theory, it opens the entire universe of Python functionality up to kdb+. However, this flexibility comes at a cost in performance. In the example below, we create a simple Parquet file with one million rows and a small number of columns, then import it into kdb+ via the embedPy interface and compare this against importing it directly with the functionality in this repository. The example clearly shows the large overhead of translating data into Python and then on to kdb+: the embedPy import is roughly an order of magnitude slower. When working interactively with kdb+ this may not matter, but when speed is critical, for example when loading EOD exports from an external system, it can be an important factor. The number of temporal types supported natively, rather than needing special transformations through embedPy, may also be significant. That said, embedPy has many other features that make it generally the more versatile tool; this example simply highlights the improvement available from a purpose-built application in this specific instance.
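For reference, the shape of such a comparison might be as follows (a minimal sketch: the file name and the pandas-based reader are assumptions, the latter mirroring the to_dict pattern visible in the session output below; the repository's own comparison.q is the authoritative version):

/ build and save a 1m-row sample with the native library
tab:([]time:.z.p+til 1000000;a:1000000?100;b:1000000?100)
file:`:sample.parquet
.pq.settabletofile[file;tab]

\l p.q                                / embedPy, needed for the Python route only
pd:.p.import`pandas
pyread:{flip pd[`:read_parquet;1_string x][`:to_dict;`list]`}

\t r1:pyread file                     / embedPy route: Parquet -> pandas -> kdb+
\t r2:.pq.getfile file                / native route
r1~r2                                 / the two imports should agree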

(kdb) kdb@linux:~/parquet/Kdb-Parquet$ q comparison.q
KDB+ 4.0 2020.07.15 Copyright (C) 1993-2020 Kx Systems
******************************************************

`.pq
"Generating table and saving"
time                          a  b
-----------------------------------
2020.12.14D13:01:06.648155000 81 12
2020.12.14D13:01:06.648155001 2  10
2020.12.14D13:01:06.648155002 80 1
2020.12.14D13:01:06.648155003 96 90
2020.12.14D13:01:06.648155004 95 73
2020.12.14D13:01:06.648155005 94 90
2020.12.14D13:01:06.648155006 72 43
2020.12.14D13:01:06.648155007 87 90
2020.12.14D13:01:06.648155008 89 84
2020.12.14D13:01:06.648155009 58 63
2020.12.14D13:01:06.648155010 79 93
2020.12.14D13:01:06.648155011 64 54
2020.12.14D13:01:06.648155012 21 38
2020.12.14D13:01:06.648155013 31 97
2020.12.14D13:01:06.648155014 75 88
2020.12.14D13:01:06.648155015 78 58
2020.12.14D13:01:06.648155016 17 68
2020.12.14D13:01:06.648155017 44 45
2020.12.14D13:01:06.648155018 62 2
2020.12.14D13:01:06.648155019 1  39
..
0i
"initialising embedPy functionality"
{[f;x]embedPy[f;x]}[foreign]enlist
{[x]tab:.qparquet.py.lib[`:getTable][string x]`;flip .p.wrap[tab][`:to_dict;`list]`}
"Time to read in using embedPy:"
10981
"Time to read in using native functionality"
883
"Done"

New York Taxi Data

One benefit of the Parquet format is that files are considerably smaller than their CSV equivalents. This can be demonstrated with Kdb-Parquet using NYC taxi data from the NYC Taxi and Limousine Commission, as in the session below; a sketch of the comparison follows the output.

(kdb) kdb@linux:~/parquet/Kdb-Parquet$ q nyccomparison.q -c 25 100
KDB+ 4.0 2020.06.18 Copyright (C) 1993-2020 Kx Systems
******************************************************

VendorID tpep_pickup_datetime          tpep_dropoff_datetime         
---------------------------------------------------------------------
1        2020.01.01D00:28:15.000000000 2020.01.01D00:33:03.000000000 
1        2020.01.01D00:35:39.000000000 2020.01.01D00:43:04.000000000 
1        2020.01.01D00:47:41.000000000 2020.01.01D00:53:52.000000000 
1        2020.01.01D00:55:23.000000000 2020.01.01D01:00:14.000000000 
2        2020.01.01D00:01:58.000000000 2020.01.01D00:04:16.000000000 
2        2020.01.01D00:09:44.000000000 2020.01.01D00:10:37.000000000 
2        2020.01.01D00:39:25.000000000 2020.01.01D00:39:29.000000000 
2        2019.12.18D15:27:49.000000000 2019.12.18D15:28:59.000000000 
2        2019.12.18D15:30:35.000000000 2019.12.18D15:31:35.000000000 
1        2020.01.01D00:29:01.000000000 2020.01.01D00:40:28.000000000 
1        2020.01.01D00:55:11.000000000 2020.01.01D01:12:03.000000000 
1        2020.01.01D00:37:15.000000000 2020.01.01D00:51:41.000000000 
1        2020.01.01D00:56:27.000000000 2020.01.01D01:21:44.000000000 
2        2020.01.01D00:21:54.000000000 2020.01.01D00:27:31.000000000 
2        2020.01.01D00:38:01.000000000 2020.01.01D01:15:21.000000000 
1        2020.01.01D00:15:35.000000000 2020.01.01D00:27:06.000000000 
1        2020.01.01D00:41:20.000000000 2020.01.01D00:44:22.000000000 
1        2020.01.01D00:56:38.000000000 2020.01.01D01:13:34.000000000 
2        2020.01.01D00:08:21.000000000 2020.01.01D00:25:29.000000000 
1        2020.01.01D00:25:39.000000000 2020.01.01D00:27:05.000000000 
..
"Parquet loading times:"
9781
"CSV loading times:"
9582
"Loading in parquet, using reduced columns"
".pq.getfilebycols[file;`tpep_pickup_datetime`tpep_dropoff_datetime`passenger_count`trip_distance]"
1938
"Getting size of both files in bytes"
"Size of parquet file"
143737855
"Size of csv file"
593610736
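The shape of the comparison is straightforward: time a full native Parquet load against a CSV load of the same data, time a reduced-column Parquet load, then compare on-disk sizes. A minimal sketch (file names are assumptions, and the 18-character CSV type string is an assumed schema for the January 2020 yellow taxi data):

pqfile:`:yellow_tripdata_2020-01.parquet
csvfile:`:yellow_tripdata_2020-01.csv

\t pq:.pq.getfile pqfile              / full Parquet load
\t cs:("JPPJFJCJJJFFFFFFFF";enlist csv) 0: csvfile   / CSV load with parsed types
\t sub:.pq.getfilebycols[pqfile;`tpep_pickup_datetime`tpep_dropoff_datetime`passenger_count`trip_distance]

hcount each (pqfile;csvfile)          / on-disk sizes in bytes

Here the Parquet file comes in at roughly a quarter of the size of the CSV, and while the full loads take comparable time, reading only the four columns of interest is around five times faster, something the row-oriented CSV format cannot offer.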

Future Work

The next stage for this interface may be to explore allowing multiple kdb+ sessions to share data via the in-memory Arrow format and a shared memory segment. In effect, large tables would be loaded into a single shared memory segment and made accessible to multiple applications, with the Arrow table potentially being appended to by a master process. For certain applications this could remove the need for IPC when operating on shared data sets, and it could reduce the overall memory usage of the system as a whole. The practicalities of this design have not yet been worked through.

A link to Kdb-Parquet can be found here.

