Temporal and Bitemporal Data in kdb+: 2

The previous section of this blog post explored using kdb+ to manage unitemporal data. We will continue the discussion into bitemporal data, which provides an additional timeline to the database and allows different historical views of the database to be defined.

Bitemporal Data

Bitemporal data consists of two temporal fields and is an extension of unitemporal data. Unitemporal adds Valid Time, which is defined as the time at which a referenced object existed. Bitemporal data extends this model with the addition of a second temporal field, known as the assertion or transaction time- the time that the database knew about that data.

General Theory

The assertion time is the time at which a record is created in the database. Simply put the assertion time represents the timeline of the database, whereas the valid time represents the timeline of the referenced data. Take a look at the table below:

cltid adate      vdate      name      dlt_flg
---------------------------------------------
C089  2015.03.10 2015.03.10 "Johnson" 0
C112  2015.05.04 2015.05.04 "Leonard" 1
C113  2015.08.19 2015.08.18 "Coyle"   0
C113  2016.02.01 2015.12.01 "Davies"  0

Here we have a number of client records with the addition of the assertion time field. For the most part this is similar to the previous examples, but by comparing the adates with the vdates you’ll notice slight differences between the two values. Take the last record of client C113 for example, the vdate has been recorded as 2015.12.01 whereas the adate is 2016.02.01- two months later.

The vdate in this scenario indicates the date C113’s name was modified, but outside the context of the database. The adate is the date this modification was actually inserted into the database, therefore it is asserting that on this date this is how the database viewed this client. Since this record was created at this time, it can be inferred this is what we knew at this time.

So the assertion time allows us to query our knowledge of a client, but what if our perceived knowledge was actually incorrect? With unitemporal data, when a new record entered the database it was by default assumed to be correct. If further down the line it was found to be incorrect, that record would simply be updated, removing all traces of the incorrectness. This is to be expected as with a single temporal value, only one version of that record may be held. With bitemporal data, the additional temporal value allows for multiple versions of a record to be held with the same valid time. Thus we can now acknowledge that a record was incorrect at some point in time, and view times that the incorrect data occurred.

`clientTab insert (`C118;.z.d;2016.02.04;"Linch";0b)

cltid adate      vdate      name      dlt_flg
---------------------------------------------
C089  2015.03.10 2015.03.10 "Johnson" 0
C112  2015.05.04 2015.05.04 "Leonard" 1
C113  2015.08.19 2015.08.18 "Coyle"   0
C113  2016.02.01 2015.12.01 "Davies"  0
C118  2016.02.05 2016.02.04 "Linch"   0

In this example we are inserting a newly contracted client into our database. The adate is taken as the current date, while the vdate is the date at which the contract was created. As it stands all we can determine is that the client entered the database the day after the contract was created. A few days later, the company realises that the client’s name was entered into the database incorrectly.

With a single temporal value this record would simply be updated, with the offending record swept under the carpet. With the assertion date this record can now be retained. To update the incorrect record, we insert a new record with a matching client id & vdate, the adate is then given the date the correction occurred.

Below is the result of the operation, and as you can see there are two versions of the record. Based on the assertion date, we can infer that 2016.02.10 is our current knowledge of the record.

`clientTab insert (`C118;.z.d;2016.02.04;"Lynch";0b)

cltid adate      vdate      name     dlt_flg
---------------------------------------------
C089  2015.03.10 2015.03.10 "Johnson" 0
C112  2015.05.04 2015.05.04 "Leonard" 1
C113  2015.08.19 2015.08.18 "Coyle"   0
C113  2016.02.01 2015.12.01 "Davies"  0
C118  2016.02.05 2016.02.04 "Linch"   0
C118  2016.02.10 2016.02.04 "Lynch"   0

Querying Bitemporal Data

When it comes to querying a bitemporal table, many of the basic querying patterns discussed in the previous section on unitemporal still apply, but may require some slight changes. This is primarily down to the possibility that each vdate may include multiple adates and therefore records which are now deemed to be incorrect.

To show the current “correct” view of a database, we must firstly group each record by its respective vdate & cltid. For each grouping we then find the assertion date with the highest value (as this is the latest record), and apply this value to all members of the group. As we have previously seen when filtering records from a group, this can be obtained using the fby function. This result is then used in conjunction with an equality condition of the adate column, which essentially returns all the records with the most recent assertion dates.

The below example shows the respective query in use. It should be pointed out that although the example only presents a single incorrect record being ignored, each vdate could include any number of incorrect versions.

select from clientTab where adate = (max;adate) fby ([]vdate;cltid)

cltid adate      vdate      surname   dlt_flag 
---------------------------------------------- 
C089  2015.03.10 2015.03.10 "Johnson" 0       
C112  2015.05.04 2015.05.04 "Leonard" 1       
C113  2015.08.19 2015.08.18 "Coyle"   0       
C113  2016.02.01 2015.12.01 "Davies"  0       
C118  2016.02.10 2016.02.04 "Lynch"   0

Current Knowledge

Let us now say we want to know the latest correct state of each client. We must filter by the valid date, in order to discard older client states. Then we must filter by adate to ensure we only get the latest value for the maximum vdate per cltid.

select by cltid from clientTab where vdate = (max;vdate) fby cltid, adate = (max;adate) fby cltid

cltid| adate      vdate      surname   dlt_flag 
-----| ---------------------------------------- 
C089 | 2015.03.10 2015.03.10 "Johnson" 0       
C112 | 2015.05.04 2015.05.04 "Leonard" 1       
C113 | 2016.02.01 2015.12.01 "Davies"  0       
C118 | 2016.02.10 2016.02.04 "Lynch"   0

Point in Time

Another important aspect of bitemporality is the ability to view our knowledge of the database at a specific point in time. This allows us to expose the state of the data at that point. To create this view we can extend the query of the last example, with the addition of a time condition test on the adate column. By checking whether the adate is less than or equal to the time we wish to view, we are limiting the result set to records that occurred either before or on that date.

Below is the result when performed on the example dataset:

select from clientTab where adate <= 2016.02.07, vdate = (max;vdate) fby cltid, adate = (max;adate) fby cltid

cltid adate      vdate      name     dlt_flg
---------------------------------------------
C089  2015.03.10 2015.03.10 "Johnson" 0
C112  2015.05.04 2015.05.04 "Leonard" 1
C113  2016.02.01 2015.12.01 "Davies"  0
C118  2016.02.05 2016.02.04 "Linch"   0

In the example, we wish to view our current knowledge of the clients as-of the 7th Feb 2016. As you can see the returned result set is pretty similar to the previous example, but with one key difference. The client C118 is labelled as having the surname “Linch” on the 4th Feb which was inserted into the table the next day. On the 7th this is what we believed to be the current correct name of the client, but in reality the name was misspelt when the record was inserted.

By being able to view our knowledge at this particular date, we can expose inaccuracies in the data that may have affected business decisions in the past. In terms of the provision of auditing capabilities and increased accountability, adding bitemporarility to a database can provide long-term advantages.

The Future

So far we have assumed that the adate and vdate values are always in the past, but bitemporal tables can also handle the future- this allows tables to be seeded with future values in advance of them occurring, without them affecting current datasets.

`clientTab insert (`C089;2017.01.01;2017.01.01;"Harper";0b)

cltid adate      vdate      name      dlt_flg
---------------------------------------------
C089  2015.03.10 2015.03.10 "Johnson" 0      
C112  2015.05.04 2015.05.04 "Leonard" 1      
C113  2015.08.19 2015.08.18 "Coyle"   0      
C113  2016.02.01 2015.12.01 "Davies"  0      
C118  2016.02.05 2016.02.04 "Linch"   0      
C118  2016.07.19 2016.02.04 "Lynch"   0      
C089  2017.01.01 2017.01.01 "Harper"  0

The change required now is to pre-filter the table based on the adate before doing any other operations. So if today is 2016.07.01, the current state of the table is:

select from clientTab where adate <= .z.d, vdate = (max;vdate) fby cltid, adate = (max;adate) fby cltid

cltid adate      vdate      name      dlt_flg
---------------------------------------------
C089  2015.03.10 2015.03.10 "Johnson" 0      
C112  2015.05.04 2015.05.04 "Leonard" 1      
C113  2016.02.01 2015.12.01 "Davies"  0      
C118  2016.02.05 2016.02.04 "Linch"   0

When the date becomes 2017.01.01, the above query will return cltid C089 with a name of “Harper”.

Summary

Both unitemporality and bitemporality increase the capabilities of a database greatly, although the query patterns do become a little more complicated. Bitemporality allows different views of the data at different points in time to be tracked.

AquaQ Analytics have built several bitemporal databases using kdb+. If you would like to learn more please contact us.