kdb+ 3.6: anymap

Blog | Data Analytics & Engineering | 28 Aug 2018

Data Intellect

The recent release of kdb+ 3.6 brought a new data structure: anymap. As database nerds, we find this very exciting news. We thought we’d have a play with it to see how it could be used.

New Nested Lists

The anymap feature replaces all the previous structures used to create nested lists. It also supports storing non-uniform values within a column of a table; a table can therefore contain a column of complex data structures such as lists, matrices, dictionaries and tables.
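
As a flavour of what that means, here is a minimal sketch (the table and file names are purely illustrative and assume a 3.6 session):

// one column holding a list, a dictionary and a table in different rows
t:([]id:1 2 3;payload:(til 4;`a`b!1.5 2.5;([]x:1 2)))
select payload from t where id=2
// splaying such a table stores the payload column in the new mappable anymap format
`:flex/ set t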

This could be done in previous versions of kdb+, but with restrictions. For example, the following code can be used to create and store tables containing complex columns in both 3.5 and 3.6:

n:500000
// dicttab: a variable-size dictionary column (dict) plus its serialized form (dictserialized)
dicttab:update `p#sym,dictserialized:-8!'dict from `sym xasc ([]time:asc 2019.01.01+n?1D;sym:n?`$'.Q.a;dict:{(`$'n?.Q.a)!(n:1+rand 26)?100f} each til n)
.Q.dpft[hdb:`$":hdb",ssr[string .z.K;".";""]; 2019.01.01; `sym; `dicttab]
// bigrow: a single row whose val column holds the whole of dicttab
bigrow:update valserialized:-8!'val from ([]time:enlist 2019.01.01D0;sym:enlist`xx;val:enlist dicttab)
.Q.dpft[hdb; 2019.01.01; `sym; `bigrow]
// don't run the line below in 3.5: rewrite the on-disk val column with 1: so it is stored as mappable anymap
br 1: get br:.Q.dd[.Q.par[hdb;2019.01.01;`bigrow];`val]

dicttab has a column called dict, which is a dictionary of variable size, and a column called dictserialized, which is the same structure converted to a byte array. A use case for a field like this is as a “catch all” column: somewhere to store data that may or may not be populated on each update, or whose structure is not known in advance. The columns that are fixed and known upfront become named columns within the table; everything else drops into the catch all column. An example of this is algorithmic order entry, where data to support or validate the order entry decision is captured alongside the order itself.

q)first dicttab
time          | 2019.01.01D00:00:00.156366080
sym           | `b
dict          | `g`b`y`f`e`n`j`c`h`a`r`o`j!51.70911 51.59796 40..
dictserialized| 0x0100000097000000630b000d0000006700620079006..

In 3.5, performance when accessing the dict column is poor. kdb+ cannot memory-map the structure, so it has to read all of it into memory regardless of the operation. The workaround is to serialise the data into a byte array, which is mappable, giving much lower access overhead. However, this adds complexity as the data has to be deserialised on access to convert it back into something usable.

// 3.5
// accessing a value in an unmappable column has huge overhead
q)\ts select dict from dicttab where date=2019.01.01,i=10000
498 153921904
// serialized version is faster, but more complex
q)\ts select dictserialized from dicttab where date=2019.01.01,i=10000
9 2336
q)\ts select -9!'dictserialized from dicttab where date=2019.01.01,i=10000
9 3968

Compare this to 3.6:

// 3.6 2018.08.24 m32
// column is mappable, giving much better performance
q)\ts select dict from dicttab where date=2019.01.01,i=10000
12 4196208
// though still not as efficient as the serialized version for this use case
q)\ts select -9!'dictserialized from dicttab where date=2019.01.01,i=10000
10 3776

With a serialised structure the problem becomes search. It was fine if the column was only used for reference, i.e. searching was always done on other fields in the table, but to search on it every element has to be deserialised. Let’s compare a fairly simple search, returning only those values where the dictionary has more than 20 entries.

q)\ts select dict from dicttab where date=2019.01.01,20<count each dict
364 49438272
q)\ts select -9!'dictserialized from dicttab where date=2019.01.01,20<count each -9!'dictserialized
901 164019616

Or something a little more complicated: only those records where the value of c in the dictionary is greater than 20.

q)\ts t1:select dict from dicttab where date=2019.01.01,dict[;`c]>20
594 159103088
q)\ts t2:select -9!'dictserialized from dicttab where date=2019.01.01,(-9!'dictserialized)[;`c]>20
935 159103440
q)t1~t2
1b

The new structure improves both processing time and memory usage. The other benefit is a reduction in code complexity, as the same code can now be used for both the in-memory and on-disk versions of the table.
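
For example, a single query function can now serve both forms of the table, with no -9! step; a minimal sketch using the tables above:

// equally usable against the mapped historical partition or an in-memory
// (e.g. intraday) version of dicttab
f:{[t] select dict from t where 20<count each dict}
f select from dicttab where date=2019.01.01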

Big Rows

In the code in the previous section we also generated a table called bigrow. bigrow contains a single row, with a column called val which holds the whole of the dicttab table. The val column had to be created explicitly with the 1: operator. The anymap structure allows the table stored in the val column to be accessed with performance equivalent to a splayed table within the database:

q)\ts t1:select time,dict from dicttab where sym=`b,time within 09:00 10:00
14 4196320
q)\ts t2:select time,dict from (exec first val from select first val from bigrow where date=2019.01.01) where sym=`b,time within 09:00 10:00
11 4721904
q)t1~t2
1b
// compare this to the serialized version
q)\ts t2:select time,dict from (-9!exec first valserialized from select first valserialized from bigrow where date=2019.01.01) where sym=`b,time within 09:00 10:00
812 554471072

Schema Flexibility

kdb+ allows a database to be partitioned in one dimension (usually date) and with a further attribute applied on another dimension (usually sym, the primary identifier). Queries which respect these are fast.

The anymap structure would allow partition schemes to be mixed. For example, imagine we have a date-partitioned database and we would like to add a dataset which would be accessed most efficiently as month partitions. Previously we would have to make it either date partitioned or splayed; if splayed, we would likely only have an attribute on the month field. Using anymap, for each new month of data we can add a row to a splayed table, with the full month of data stored as the value in the field. We can have attributes on it, and access it as efficiently as if it were a month-partitioned database.
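
A rough sketch of that idea follows; the table name, column names and layout here are hypothetical rather than a prescribed scheme:

// one row per month, with the whole month of data as the value of the monthdata column;
// monthtab stands in for a month of data built elsewhere
monthtab:([]time:asc 2019.01.01+1000?31D;sym:1000?`a`b`c;price:1000?100f)
`:hdb/monthly/ set ([]month:enlist 2019.01m;monthdata:enlist monthtab)
// subsequent months are appended as new rows, e.g. with upsert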

A similar use case would be to extend the number of dimensions that we can quickly access. An example of this would be if we received data for the same instruments from different sources and we would like to apply an attribute on source as well as instrument.

syms:`$'.Q.a
n:2000000
hdb:`:hdb

path:{` sv .Q.par[hdb;2019.01.01;x],`}
// create empty table
path[`quotes] set .Q.en[hdb;([]time:`timestamp$();sym:`symbol$();lp:`symbol$();price:`float$();size:`int$())];
// create data, sym at a time
{[n;s] 0N!s;path[`quotes] upsert .Q.en[hdb;([]time:asc 2019.01.01+n?1D;sym:n#s;lp:n?upper `$'10#.Q.a;price:n?100f;size:n?1000i)]}[n] each syms
// set attribute
@[path[`quotes];`sym;`p#]

// create empty nested structure
path[`quotesnested] set .Q.en[hdb;([]sym:`symbol$();prices:())]
// build the nested table, one row per sym
{0N!x; path[`quotesnested] upsert .Q.en[hdb;([]sym:enlist x;prices:enlist update `p#lp from `lp xasc select time,lp,price,size from path[`quotes] where sym=x)]} each syms
// rewrite the prices column with 1: so it is stored in the mappable anymap format
p 1: get p:`:hdb/2019.01.01/quotesnested/prices

We can compare performance to extract the data for a single sym, and for a single sym and source.

q)\ts t1:select time,price,size from quotes where date=2019.01.01,sym=`c,lp=`W
107 27265136
q)\ts t2:select time,price,size from (exec first prices from select prices from quotesnested where date=2019.01.01,sym=`c) where lp=`W
47 4197120
q)t1~t2
1b
q)\ts t3:select lp,time,price,size from quotes where date=2019.01.01,sym=`c
208 58723152
q)\ts t4:select lp,time,price,size from exec first prices from select prices from quotesnested where date=2019.01.01,sym=`c
47 4196832
q)t4~`lp xasc t3
1b

The downsides for both these use cases are:

  • extra access complexity (especially if trying to use slaves)
  • the loss of the built-in map-reduce operations in select statements (see the sketch after this list)
  • building a “Big Row” style database may not be practical in a large scale real time capture (likely easier to implement in a historic load)
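
To illustrate the map-reduce point with the tables above (the hand-rolled form is illustrative only):

// on the standard schema kdb+ applies its built-in map-reduce to the aggregation
select avg price by sym from quotes where date=2019.01.01
// with the nested layout the per-row reduction has to be written by hand
select sym,avgprice:{exec avg price from x} each prices from quotesnested where date=2019.01.01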

Compression

This is the bit where I wave my hands around vaguely and say things like “memory mapping magic” without really understanding what is happening; it’s why the core tech team at Kx get paid the big bucks. The crux is that if we take the example above and add compression (.z.zd:16 2 4), the anymap structure seems to get even better.

q)\ts select lp,time,price,size from quotes where date=2019.01.01,sym=`c
392 59912848
q)\ts select lp,time,price,size from exec first prices from select prices from quotesnested where date=2019.01.01,sym=`c
11 5181600
q)\ts select time,price,size from quotes where date=2019.01.01,sym=`c,lp=`W
82 28290912
q)\ts select time,price,size from (exec first prices from select prices from quotesnested where date=2019.01.01,sym=`c) where lp=`W
12 5181808

Summary

The anymap datatype brings a lot of flexibility to schema design. It allows variable schema or unstructured data to be accessed natively and efficiently from kdb+. It allows greater flexibility with partitioning and attribute choice. Compression seems to speed it up. Some example use cases:

  1. Storage and easier search of variable key data structures without having to serialize or store in a key-value pair format (essentially the dicttab example in this blog)
  2. Easier storage of unstructured data or “document” style datasets, in the mould of MongoDB
  3. To facilitate multiple partition schemes and/or greater control over query dimensions.

The examples above are just that: examples. There will be use cases where it makes more sense to stick to more standard storage structures.
