No, it’s not a typo. This isn’t a blog about how to keep data clean; rather, it explores the concept of a data mesh and, more specifically, our thoughts on it in relation to kdb+.
In the world of data we are often presented with new trends in the storage, distribution and access of data. Previous iterations include data lakes, data fabric and data warehouses. As we enter 2022, we are seeing the rise of the Data Mesh.
AquaQ became aware of this concept in 2021 and have held some “thought leadership” discussions both internally and externally. This blog is the output of these thoughts with regard to kdb+ in a mesh.
What is a data mesh?
The data mesh is a concept defined by Zhamak Dehghani (Thoughtworks) and in fact dates back to the late 2010s. A quick Google search for “data mesh” will, however, show that the amount of content on the subject has increased, with many contributors in 2021. Listed below are a number of the resources that we used when discussing the topic; this blog references material found on these sites.
- Introduction to Data Mesh, a YouTube video by Thoughtworks
- What is a Data Mesh?, content by Starburst
- What is a Data Mesh – and How not to mesh it up, an article by Barr Moses
- Is Data Mesh right for your organisation?
- Data Mesh 101, an article by Barr Moses
- Data Mesh Principles and Logical Architecture, a Martin Fowler article by Zhamak Dehghani
- Design a data mesh architecture using AWS Lake Formation and AWS Glue, an AWS post by Nivas Shankar, Ian Meyers, Zach Mitchell, and Roy Hasson
- How to move beyond a monolithic data lake to a distributed data mesh, a Martin Fowler article by Zhamak Dehghani
Data mesh is about moving to a “distributed architecture for analytical data management” [Starburst.io]. It is about decentralizing data and moving the ownership and support of it to subject matter experts (SMEs). This doesn’t mean we just move the data storage and maintenance back to these SMEs; it means everything from ingestion, through transformation, to data availability. Effectively, a data owner in the data mesh is responsible for the entire data pipeline, based on their expertise.
Developers should be well versed in applications moving to microservice architectures, a concept popularized with cloud. In any cloud provider today you will find microservices, even at the lambda level, each focusing on a very specific piece of functionality. Developers subdividing monolithic architectures often find that other teams (internal or external) have already created microservices for some of the functionality and are the SMEs for it, so rather than reinvent the wheel they simply add a dependency.
I digress in the above paragraph to draw parallels to the world of data. The data lake concept is our monolith, the mesh our microservices. In that regard, keeping the aqua element alive, we are no longer looking at data pools, lakes and oceans but thinking about rivers. I like this analogy as lakes are somewhat stagnant and stale, while rivers are about flow, volume and activity. Putting everything together in a single place can make it easy to access, but what about understanding and interpreting the datasets, and what about the usage of that data? Data that is not being used costs money to store and maintain; perhaps pushing it back to the SMEs allows us to truly uncover the value of our datasets.
The drive to the data lake was to centralize data, so that all the data was present on one platform for fast access, effectively a single location to go to. However, this can leave a single technology supporting large, mixed-structure, unwieldy datasets.
If we do move to the mesh, it will present its own issues, likely focused on data access and interoperability. These would need to be solved with a well-defined API strategy, in addition to some form of data cataloguing or data directory.
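To make the cataloguing idea a little more concrete, here is a minimal sketch in Python of what a data directory could look like. Everything in it is hypothetical and for illustration only: the `DataProduct` and `DataCatalogue` names, the fields, the example products and the endpoints are our assumptions, not part of any data mesh standard or product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """One entry in a hypothetical mesh-wide data directory."""
    name: str         # unique product name consumers search for
    domain: str       # business domain that owns the data
    owner: str        # SME team accountable for the whole pipeline
    technology: str   # engine behind the product (kdb+, Parquet, ...)
    endpoint: str     # where the published API can be reached
    schema: dict = field(default_factory=dict)  # column name -> type


class DataCatalogue:
    """In-memory directory; a real one would be a shared service."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Names must be unique so consumers get exactly one answer.
        if product.name in self._products:
            raise ValueError(f"{product.name} is already registered")
        self._products[product.name] = product

    def lookup(self, name):
        return self._products[name]

    def find_by_column(self, column):
        # Discovery: which products expose a given column?
        return sorted(p.name for p in self._products.values()
                      if column in p.schema)


# Illustrative entries only; names and URLs are made up.
catalogue = DataCatalogue()
catalogue.register(DataProduct(
    name="fx_trades", domain="FX", owner="fx-kdb-team",
    technology="kdb+", endpoint="https://mesh.example.com/fx/trades",
    schema={"time": "timestamp", "sym": "symbol", "price": "float"},
))
catalogue.register(DataProduct(
    name="client_reference", domain="Onboarding", owner="ref-data-team",
    technology="Parquet", endpoint="https://mesh.example.com/ref/clients",
    schema={"client_id": "string", "sym": "symbol"},
))

print(catalogue.find_by_column("sym"))      # ['client_reference', 'fx_trades']
print(catalogue.lookup("fx_trades").owner)  # fx-kdb-team
```

The point of the sketch is that a consumer never needs to know which technology sits behind a product; they discover it by name or column and are pointed at the owning team and its endpoint.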
kdb+ in a Data Mesh Architecture
kdb+ is a technology that made its name by being, and continuing to be, the best at timeseries datasets. Over the years many data technologies have come and gone but kdb+ has prevailed.
It is this persistence that may in fact embed it nicely into the Data Mesh concept.
In the eras of the data lake and warehouse, applications and datasets involving kdb+ typically stayed outside. End users wanted what kdb+ provided: quick and relatively seamless access to time series data and analytics. This has meant that when you enter a financial institution where kdb+ is used you will typically encounter pods of “kdb+ teams”, perhaps siloed into the asset class, or function, associated with their data. In addition to being good at time series data, kdb+ requires those who enhance and maintain the system to have knowledge of a niche programming language. We therefore often find a devops-style culture already embedded in kdb+ plants, given the knowledge and experience needed. The developers provide support, understand the data and work the release pipelines.
In effect, by having kdb+ developers working with the data, we created SMEs who understand the data: how to work with, maintain and interpret it. They are typically involved from ingestion to availability, which is the kdb+ way; just looking at the TorQ framework we see feedhandlers, tickerplants, real-time databases (RDBs), event processing, historical databases (HDBs), gateways and APIs.
Relating this back, the data mesh concept looks at delegating ownership away from centralized repositories of data toward individual teams. I would consider this to already be in place for most kdb+ implementations.
The potential issue with the kdb+ silos, however, is often the data access layer. kdb+ end users often love to free-form query data, and we don’t always have well-defined APIs, never mind APIs exposed through the consistent, cross-technology layer that would be needed in the mesh. I would anticipate this is where the focus of any work would be in moving kdb+ applications to the data mesh world.
We should also consider datasets that were never housed in kdb+, or that were migrated away from it when businesses created data lakes. Under a data mesh strategy there may now be additional use cases that benefit from a move to the technology.
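As a rough illustration of the difference between free-form querying and a defined API, the Python sketch below puts a small set of named, parameterised functions in front of a dataset and rejects anything else. It is a sketch under assumptions rather than a kdb+ implementation: the in-memory `TRADES` list stands in for a kdb+ table, and the names `vwap`, `last_price`, `PUBLISHED_API` and `call` are hypothetical.

```python
from datetime import date

# Stand-in for a kdb+ trade table (assumption for illustration).
TRADES = [
    {"date": date(2022, 1, 4), "sym": "EURUSD", "price": 1.0, "size": 100},
    {"date": date(2022, 1, 5), "sym": "EURUSD", "price": 2.0, "size": 300},
    {"date": date(2022, 1, 5), "sym": "GBPUSD", "price": 4.0, "size": 50},
]


def _select(sym, start, end):
    # Shared filter: one symbol within an inclusive date range.
    return [t for t in TRADES
            if t["sym"] == sym and start <= t["date"] <= end]


def vwap(sym, start, end):
    """Volume-weighted average price for one symbol, or None."""
    rows = _select(sym, start, end)
    volume = sum(t["size"] for t in rows)
    if volume == 0:
        return None  # no trades in range
    return sum(t["price"] * t["size"] for t in rows) / volume


def last_price(sym, start, end):
    """Price of the most recent trade in the range, or None."""
    rows = _select(sym, start, end)
    return max(rows, key=lambda t: t["date"])["price"] if rows else None


# Only these names are exposed to the rest of the mesh.
PUBLISHED_API = {"vwap": vwap, "last_price": last_price}


def call(name, **params):
    # Reject anything that is not a published function, including
    # attempts to send a free-form query string.
    if name not in PUBLISHED_API:
        raise PermissionError(f"{name!r} is not a published API function")
    return PUBLISHED_API[name](**params)


print(call("vwap", sym="EURUSD",
           start=date(2022, 1, 4), end=date(2022, 1, 5)))  # 1.75
```

Consumers working in other technologies then only need the function names and their parameters, while the kdb+ team remains free to change how each function is implemented behind the API.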
The data mesh concept is interesting, especially when we draw parallels to the software engineering microservices approach. It presents opportunities for leveraging more technologies, specifically the right technologies, for the data while having SMEs who can realize and expose the true value of it. It also presents challenges: as with microservices, it will create more complex data architectures, which will require advanced monitoring and documentation. Like any strategy it will need to be well defined, with the focus on how data is exposed and accessed.
From the kdb+ perspective we would expect this to involve less work to implement, given the silos we often find ourselves in. Often we are the subject matter experts and have the tools to manage our data independently in an almost devops-style role. We will, however, need to consider how our data is exposed and accessed in relation to the API strategy.
In summary, kdb+ and our SMEs have been siloed for some time due to the nature of the technology, which means many systems are already data mesh compatible; they just need the right front-end strategy in place. The mesh may also create new opportunities to leverage the power of kdb+ and the kdb+ developer.
If the above has piqued your interest with regard to a particular project or initiative, you can contact us at email@example.com