History of Data Storage

The holy grail of data is that anyone in the organization can use data to make better decisions. Today we are far from it: getting an answer to most questions requires a large team of analysts and engineers, and weeks or months of waiting.

In the coming years, I believe we will see an evolution from data stores to knowledge stores that will finally make this vision possible.

Data stores (like a database) are designed for fixed schemas and optimized for storage costs.

Knowledge stores, on the other hand, store knowledge (facts, hypotheses, relationships, etc.) in a schema-less text format. They are designed for querying in English (or other “natural” languages).

Technologies like Vector Stores, LLMs and Generative AI now allow us to move to knowledge stores. This can finally get us to the vision where ALL data in our organization is available to EVERYONE in the organization to make BETTER decisions, while drastically REDUCING what we spend on data teams.
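
To make the idea concrete, here is a minimal sketch in Python. It is not a real implementation: a list of plain-text facts stands in for the knowledge store, a naive keyword overlap stands in for vector similarity search, and answer_with_llm is just a placeholder for a call to a generative model. All names and data below are made up for illustration.

    # A toy sketch of the knowledge-store idea: keep knowledge as text,
    # retrieve the relevant pieces, then hand them to a generative model.
    knowledge = [
        "Acme Corp signed 42 new customers in Q3.",
        "Churn in the EU region doubled after the June price change.",
        "The data platform team spends most of its time on report requests.",
    ]

    def retrieve(question: str, store: list[str], top_k: int = 2) -> list[str]:
        """Return the facts sharing the most words with the question
        (a stand-in for vector similarity search)."""
        q_words = set(question.lower().split())
        scored = sorted(store, key=lambda fact: -len(q_words & set(fact.lower().split())))
        return scored[:top_k]

    def answer_with_llm(question: str, facts: list[str]) -> str:
        # Placeholder: a real system would send the question plus facts to an LLM.
        return f"Q: {question}\nRelevant knowledge: {facts}"

    question = "What happened to churn in the EU?"
    print(answer_with_llm(question, retrieve(question, knowledge)))

The point is the shape of the system: knowledge is kept as text, retrieval finds the relevant pieces, and a language model turns them into an answer to a question asked in plain English.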

To understand why this change will happen, we have to understand the evolution of data stores so far: what changed, what drove the evolution, and what issues we still see.

We started with no data storage at all, moved to application persistent memory, then to relational data stores and analytical data stores, and finally ended up with document databases.

Phase 1: No Data Storage

Photo by Mateusz Dach on Pexels.com

In the early days of computers there was no persistent storage. When you turned off your computer, it forgot everything you had done in that session. Every time you turned it on, it felt like the first time you had ever interacted with that program.

The main problem was that every session started with time spent getting back to the state where the last session left off. Some programs generated number codes you could enter to restore the previous state, but these did not work well outside of video games.

Phase 2: Application Persistent Memory

Photo by Simon Gough on Pexels.com

In the next phase, software applications started to write data to disk (floppy disks or hard drives). There was no uniform format, since the only consumer of the data was the same application that wrote it.

An application would take the data in its internal memory format, write it to a file as binary, and read it back later. No one cared about the schema of the data.

The only purpose of data storage was memory: how the application remembered what happened last time.

On top of that, disk storage was very limited, so a compact binary format made sense.
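
As a rough sketch of what this pattern looked like (written in Python purely for illustration; the record layout and field names are made up), the code below dumps an in-memory record to disk as raw bytes and reads it back. Note that the reader has to know the exact byte layout the writer used.

    import struct

    # Hypothetical in-memory record: player id, score, level.
    RECORD_FORMAT = "<IIH"   # little-endian: two unsigned ints, one unsigned short

    def save(path: str, player_id: int, score: int, level: int) -> None:
        # Dump the in-memory representation straight to disk as raw bytes.
        with open(path, "wb") as f:
            f.write(struct.pack(RECORD_FORMAT, player_id, score, level))

    def load(path: str) -> tuple[int, int, int]:
        # The reader must use the exact layout the writer used; change
        # RECORD_FORMAT in a newer version and old files become unreadable.
        with open(path, "rb") as f:
            return struct.unpack(RECORD_FORMAT, f.read(struct.calcsize(RECORD_FORMAT)))

    save("save.dat", 7, 12500, 3)
    print(load("save.dat"))   # (7, 12500, 3)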

The main problem was that as applications gained capabilities, it became very hard to keep reading files written by older versions.

Secondly, this data could not be used for any analysis.

Phase 3: Relational data stores

The need to analyze this data, and to handle upgrades, drove the development of the relational database.

The relational database brought in a well-defined table-and-column format. This could be used to store data with a strict schema.

In addition, normalization (storing data once and linking to it rather than duplicating it) was important since disk storage was very expensive at the time.
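
Here is a small, deliberately simplified sketch of a normalized person schema, using Python's built-in sqlite3 module. The table and column names and the data are invented for illustration.

    import sqlite3

    # Normalized: the person is stored once, and names live in a separate
    # table that links back to it.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Person (
            PersonId  INTEGER PRIMARY KEY,
            BirthDate TEXT NOT NULL
        );
        CREATE TABLE PersonName (
            PersonId  INTEGER NOT NULL REFERENCES Person(PersonId),
            FirstName TEXT NOT NULL,
            LastName  TEXT NOT NULL
        );
    """)
    conn.execute("INSERT INTO Person VALUES (1, '1815-12-10')")
    conn.execute("INSERT INTO PersonName VALUES (1, 'Ada', 'Lovelace')")

    # Even a simple question already needs a join across the normalized tables.
    print(conn.execute("""
        SELECT p.BirthDate, n.FirstName, n.LastName
        FROM Person p JOIN PersonName n ON n.PersonId = p.PersonId
    """).fetchall())   # [('1815-12-10', 'Ada', 'Lovelace')]

Notice that even a trivial question already requires a join across two tables.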

SQL (Structured Query Language)

To enable querying the data, the SQL language was developed. SQL was designed for querying table-and-column schemas.

The issue was that it took a lot of work to convert the data into this table-and-column form, to maintain it, and to write voluminous SQL code to query it. As querying became more complex, the limitations of SQL became a hindrance. SQL was designed for quick queries, never for the complex queries we started using it for. For example, why do you specify the columns to query first and THEN the table where those columns live? Incidentally, this is why autocomplete for SQL code is so hard to build.

Unlike a programming language (C, C#, Java or Python), SQL is hard to write as units of functionality that can be composed into larger programs. Yes, stored procedures and functions exist, but they are much harder to use than classes and functions in a programming language. Hence it is pretty common to see SQL queries that span multiple pages, with a high amount of duplication across queries. This, of course, means it is really hard for one engineer to debug another engineer’s SQL, and it is a mission to track down why two queries return different results.

Strict Schemas

Since relational databases have very strict schemas, it was hard to evolve your schema over time. As an example, think of a schema that stores the last name of a person. In the real world, a person can change last names over their lifetime. Evolving the schema to store multiple last names for a person requires significant changes, including adding new tables, migrating all the old data and changing all the SQL code that queries that information.
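
As a sketch of what that migration involves (again using sqlite3, with invented names and data): a new table, a copy of all the existing data, and a rewrite of every query that used the old column.

    import sqlite3

    # Starting point: a strict schema with a single LastName column per person.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Person (PersonId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)")
    conn.execute("INSERT INTO Person VALUES (1, 'Grace', 'Murray')")

    # 1. Add a new table that can hold many last names per person.
    conn.execute("""
        CREATE TABLE PersonLastName (
            PersonId  INTEGER REFERENCES Person(PersonId),
            LastName  TEXT NOT NULL,
            ValidFrom TEXT NOT NULL
        )
    """)
    # 2. Migrate all existing data into the new table.
    conn.execute("INSERT INTO PersonLastName SELECT PersonId, LastName, '1900-01-01' FROM Person")
    # 3. Record a name change. The old LastName column is now stale, and every
    #    query that used Person.LastName has to be rewritten against the new table.
    conn.execute("INSERT INTO PersonLastName VALUES (1, 'Hopper', '1930-01-01')")
    print(conn.execute("""
        SELECT LastName FROM PersonLastName
        WHERE PersonId = 1 ORDER BY ValidFrom DESC LIMIT 1
    """).fetchall())   # [('Hopper',)]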

Phase 4: Analytical Stores & Reporting Caches

Since relational databases are hard to use for analytics, we created a new thing: analytical stores. The idea was to copy the data from the relational schemas into another data store, changing the format along the way to make it easier to analyze and aggregate.

In some cases, we would create analytical cubes (fact tables and dimension tables to allow slicing the facts by dimensions).
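
Here is a tiny, invented star schema in sqlite3 to show what “slicing the facts by dimensions” means: one fact table, two dimension tables, and an aggregation query that groups the facts by dimension attributes.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, Category TEXT);
        CREATE TABLE DimDate    (DateKey INTEGER PRIMARY KEY, Year INTEGER);
        CREATE TABLE FactSales  (ProductKey INTEGER, DateKey INTEGER, Amount REAL);
    """)
    conn.executemany("INSERT INTO DimProduct VALUES (?, ?)", [(1, "Books"), (2, "Games")])
    conn.executemany("INSERT INTO DimDate VALUES (?, ?)", [(10, 2023), (11, 2024)])
    conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?)",
                     [(1, 10, 100.0), (1, 11, 150.0), (2, 11, 80.0)])

    # "Slice the facts by dimensions": total sales by category and year.
    print(conn.execute("""
        SELECT p.Category, d.Year, SUM(f.Amount)
        FROM FactSales f
        JOIN DimProduct p ON p.ProductKey = f.ProductKey
        JOIN DimDate d    ON d.DateKey = f.DateKey
        GROUP BY p.Category, d.Year
    """).fetchall())   # e.g., [('Books', 2023, 100.0), ('Books', 2024, 150.0), ('Games', 2024, 80.0)]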

In other cases, we just stuffed the data into caches in our reporting tools (like Qlik or Tableau) and hoped they would optimize it.

SQL to MDX

We switched from SQL to OLAP query languages (like MDX), but frankly these turned out to be much harder to learn. So we ended up with a very small number of analytics engineers who could use them.

Self Service, not really

The analytical stores added more self-service capabilities for analysts, but maintaining these stores turned out to be very expensive. In most companies there are dedicated teams whose only job is to maintain them.

If a new dimension needs to be added, it is typically a very resource- and time-intensive task. As a result, analytical stores are not enhanced very frequently.

The reporting caches also ended up being problematic. They are frequently slow and hard to keep up to date with the transactional data.

Phase 5: Flexible schema document databases

To address the issues with relational data stores, we started to move to document stores based on JSON (e.g., MongoDB, Redis). These allowed us to evolve the schema without having to update all the existing data and code.
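
A small sketch of why flexible schemas help, using plain Python and JSON (a document store such as MongoDB would hold documents like these; the fields and data are invented): an old document with a single last name and a newer one with a list of last names can sit side by side, and the reading code simply tolerates both shapes instead of forcing a migration.

    import json

    # Two person documents written at different times by different versions
    # of the application. Both live side by side; no migration is required.
    old_doc = json.loads('{"personId": 1, "firstName": "Ada", "lastName": "Lovelace"}')
    new_doc = json.loads('{"personId": 2, "firstName": "Grace", "lastNames": ["Murray", "Hopper"]}')

    def last_names(person: dict) -> list[str]:
        # The reader handles both shapes instead of forcing a schema migration.
        if "lastNames" in person:
            return person["lastNames"]
        return [person["lastName"]]

    print(last_names(old_doc))   # ['Lovelace']
    print(last_names(new_doc))   # ['Murray', 'Hopper']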

Death of Normalization

It also allowed us to escape the pain of normalization. As the price of storage plummeted, the cost advantage of normalization went away. Duplicating data was not that much more expensive, and it could result in simpler schemas and better performance.

We no longer needed tables like “PersonIdentifiers”, “PersonNames”, “PersonAddresses”, “PersonPhones” and 10+ more tables just to capture information about a person.
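
For comparison, here is what a single denormalized person document might look like (all fields invented for illustration): everything about the person is nested inside one document, so reading it back needs no joins at all.

    # One denormalized document replacing the PersonNames / PersonAddresses /
    # PersonPhones style of tables.
    person = {
        "personId": 1,
        "identifiers": [{"type": "ssn", "value": "xxx-xx-1234"}],
        "names": [{"first": "Ada", "last": "Lovelace"}],
        "addresses": [{"city": "London", "country": "UK"}],
        "phones": [{"type": "mobile", "number": "+44 20 7946 0000"}],
    }
    # No joins needed: the whole person comes back in one read.
    print(person["names"][0]["last"], person["addresses"][0]["city"])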

From SQL to Programming Languages

We also switched our querying from SQL to normal programming languages like C#, Java, Python and JavaScript. This allowed us to save costs by having a single team of engineers write both the applications and the queries.
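
A tiny example of what querying in a programming language looks like in practice, with invented data: the filter and the aggregation are just ordinary Python, written by the same engineers who build the application.

    # "Query" the documents with plain application code instead of SQL.
    people = [
        {"name": "Ada",   "country": "UK", "age": 36},
        {"name": "Grace", "country": "US", "age": 85},
        {"name": "Alan",  "country": "UK", "age": 41},
    ]
    uk_people = [p for p in people if p["country"] == "UK"]
    print(sum(p["age"] for p in uk_people) / len(uk_people))   # 38.5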

Hybrid Data Strategy

The problem was that this worked well for transactional data but was hard to use for analytical workloads, where we typically scan across many records and aggregate the data.

Many organizations ended up with a hybrid model where they store their transactional data in document databases but create relational databases for analytics. Of course, this means you need people and time to map all your transactional data into relational form.
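
Here is a sketch of that mapping step (names and data invented): nested transactional documents are flattened into relational rows so the analytics side can aggregate them with SQL.

    import sqlite3

    # Transactional side: nested documents.
    orders = [
        {"orderId": 1, "customer": {"id": 7, "country": "UK"},
         "lines": [{"sku": "A", "amount": 10.0}, {"sku": "B", "amount": 5.0}]},
    ]

    # Analytical side: flatten each document into relational rows.
    analytics = sqlite3.connect(":memory:")
    analytics.execute("CREATE TABLE OrderLines (OrderId INT, CustomerId INT, Country TEXT, Sku TEXT, Amount REAL)")
    for o in orders:
        for line in o["lines"]:
            analytics.execute("INSERT INTO OrderLines VALUES (?, ?, ?, ?, ?)",
                              (o["orderId"], o["customer"]["id"], o["customer"]["country"],
                               line["sku"], line["amount"]))

    print(analytics.execute("SELECT Country, SUM(Amount) FROM OrderLines GROUP BY Country").fetchall())
    # [('UK', 15.0)]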

Recap of the Issues

Photo by Timur Weber on Pexels.com

To recap, the issues we have today:

  1. Using relational models for transactional data is hard due to schemas changing over time as the applications evolve.
  2. Joins across transactional data are frequently slow.
  3. Flexible schemas (e.g., JSON) in document stores (e.g., MongoDB) work better for transactional data; however, they are worse for analytical needs.
  4. A business’ data is typically stored in multiple systems, so you have to query each system differently based on its schema, type, etc.

Average Person Still Cannot Use the Data

Most importantly, the person who wants to ask a question of the data has to understand the schema of the data and have technical skills. For example, if I want to know someone’s birthdate, I need to know which record it is stored in, the name of the field and the format (string, datetime, date). On top of that, I need to learn the query language (SQL, Python, JavaScript, etc.).

As a result, most people are not able to directly ask questions of the data. They have to put in a request with a technical team and wait for weeks or months to get an answer.

Tribal Knowledge

The technical teams have tribal knowledge of what data is where. Even if there is good documentation (which is rare), there is still knowledge passed around about the caveats of each field (“we don’t populate birth date for people over 80”, “we don’t get birth date from this data source”, etc.).

When the technical team provides a self-service reporting portal (e.g., analytical store) it is constrained by the types of questions the technical team designed it to answer. The user can’t just ask ANY question; only questions that were expected by the team building the analytical store.

What is the Solution?

So clearly it is still very hard for decision makers to get answers from all the data. In today’s age, where data is the key to success for almost every company, this is a huge problem.

To learn how knowledge stores will address these issues, continue to Evolution from data stores to knowledge stores.

