Introducing the Knowledge Store


The holy grail of data is that anyone in the organization can use data to make better decisions. Today we are far from that: answering most questions requires a large team of analysts and engineers and a wait of weeks or months.

In the coming years, I believe we will see an evolution from data stores to knowledge stores that will finally enable our vision.

Data stores (like a database) are designed for fixed schemas and optimized for storage costs.

Knowledge stores, on the other hand, store knowledge (facts, hypotheses, relationships, etc.) in a schema-less text format. They are designed to be queried in English (or other natural languages).

Technologies like vector stores, LLMs, and Generative AI now allow us to move to knowledge stores. This can finally get us to the vision where ALL data in our organization is available to EVERYONE in the organization to make BETTER decisions, while drastically REDUCING the cost of our data teams.

I recommend starting by reading History of Data Storage to understand how we got to the current state of data storage and why. The content below will make more sense.

Note: I use the word “English” below, but knowledge stores can answer in any language, like Spanish, Korean, etc.

Current issues with Data Storage


As explained in History of Data Storage, the following are some of the issues we have today with data stores:

  1. Using relational models for transactional data is hard because schemas change over time as applications evolve.
  2. Joins on transactional data are frequently slow.
  3. Flexible schemas (e.g., JSON) in document stores (e.g., Mongo) improve things for transactional data, but they are worse for analytical needs.
  4. A business’ data is typically stored in multiple systems, so you have to query each system differently based on its schema, type, etc.

The Average Person Still Cannot Use the Data

Most importantly, the person who wants to ask a question of the data has to understand the schema of the data and have technical skills. For example, if I want to know someone’s birth date, I need to know which record it is stored in, the name of the field, and its format (string, datetime, date).

And I also need to learn the query language (SQL, Python, JavaScript, etc.).

As a result, most people are not able to directly ask questions of the data. They have to put in a request with a technical team and wait for weeks or months to get an answer.

Tribal Knowledge

The technical teams have tribal knowledge of what data is where. Even if there is good documentation (which is rare), there is still knowledge passed around about the caveats for each field (“we don’t populate birth date for people over 80”, “we don’t get birth date from this data source”, etc.).

When the technical team provides a self-service reporting portal (e.g., analytical store) it is constrained by the types of questions the technical team designed it to answer. The user can’t just ask ANY question; only questions that were expected by the team building the analytical store.

Making quick data-driven decisions is essential for every organization

So clearly it is still very hard for decision makers to get answers from all the data. In today’s age, where data is the key to success for almost every company, this is a huge problem.

Knowledge Stores


Knowledge stores store knowledge (facts, hypotheses, relationships, etc.) in a schema-less text format. They are designed to be queried in English (or other natural languages).

Vector stores can store data as plain text and query it, Large Language Models can understand text, and Generative AI can reason over text and answer in plain English. With these capabilities combined, I believe we’re going to see a move to knowledge stores.

If you don’t know what a vector store is, What is a Vector Store? can help.

How are knowledge stores different from data stores?

|                              | Data Stores                                                                  | Knowledge Stores                   |
|------------------------------|------------------------------------------------------------------------------|------------------------------------|
| Schema                       | Rigid schemas in relational databases or flexible schemas in document databases | No schema needed                   |
| Heterogeneous data           | All data must be in the same schema                                          | Data can be in different “schemas” |
| Maintenance and enhancements | Requires a large team of people to maintain and enhance                      | Very little maintenance needed     |
| Query language               | SQL (for relational databases) or a programming language (for document databases) | English                            |
| Target user                  | A technical analyst or engineer                                              | The average person                 |
| Time to answer               | Slow (frequently requires schema changes)                                    | Instant (no schema change needed)  |

How does a Knowledge Store work?

Knowledge stores typically require two workflows:

  1. Loading knowledge into a knowledge store
  2. Answering questions based on the knowledge store

Workflow to Load Knowledge into a Knowledge Store

The workflow to load knowledge into a knowledge store looks like this:

  1. Raw data (in relational databases, document databases, or other data stores) is exported to text (e.g., comma-separated values or pipe-delimited files). There is no need to unify the schemas.
  2. The text is stored in a Vector Store.

A knowledge store keeps all the data in text format. Note that in the examples below there is no schema; facts are just represented as plain text.

  1. If the source data is in relational format, it can be stored as tables and columns in the knowledge store. Just export it as comma-separated or pipe-delimited text.
  2. If the source data is in JSON format, it can be stored as JSON in the knowledge store; no conversion is needed.
  3. Any other data can be stored in text format.

The simplest path to get started is to store the whole raw data as text. As you get more advanced, you can generate summaries of your data and store those instead. Summaries are more information-rich and use fewer tokens because they get rid of all the schema overhead like field names.
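As a minimal sketch of this loading workflow in Python: each source record, whatever its “schema”, is turned into a plain-text snippet. The function names and sample data are hypothetical; in practice, each snippet would then be embedded and written to a real vector store such as Pinecone or Chroma rather than collected in a list.

```python
import csv
import io
import json

def rows_to_snippets(csv_text):
    """Turn relational rows (a CSV export) into plain-text fact snippets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [", ".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

def docs_to_snippets(json_docs):
    """Document-store records can be kept as-is; JSON is already text."""
    return [json.dumps(doc) for doc in json_docs]

# Two sources with different "schemas" -- no unification step required.
patients_csv = "name,birth_date\nAda Lovelace,1815-12-10\n"
history_docs = [{"person": "Ada Lovelace", "born": "December 10, 1815"}]

knowledge = rows_to_snippets(patients_csv) + docs_to_snippets(history_docs)
for snippet in knowledge:
    print(snippet)  # each snippet would be embedded and stored in the vector store
```

Notice that the two sources describe the same fact in different shapes and date formats, and nothing in the loading step has to reconcile them.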

Workflow to Answer Questions

  1. The user asks a question in English in your application.
  2. The question is converted to an embedding (a vector representation of the tokens in the question).
  3. The embedding is matched with text snippets in the vector store.
  4. The question and the matching text snippets are sent to an LLM to understand.
  5. Generative AI is used to generate the answer (or to ask the system for more data before the question can be answered).
  6. The application renders the answer in an easy-to-understand form (table, chart, trend line, summary, etc.).

If some of the terms above are new to you, reading What is a Vector Store? can help.
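The question-answering steps can be sketched end to end in Python. This is a toy: the bag-of-words `embed` function stands in for a real embedding model, and the returned prompt string stands in for the call to a Generative AI model; all function names here are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer(question, snippets, top_k=1):
    """Steps 2-4: embed the question, match snippets, build the LLM prompt."""
    q = embed(question)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    context = "\n".join(ranked[:top_k])
    # Step 5 would send this prompt to a Generative AI model.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

snippets = [
    "Ada Lovelace was born on December 10, 1815.",
    "The cafeteria opens at 9am.",
]
print(answer("When was Ada Lovelace born?", snippets))
```

Even with this crude embedding, the birth-date snippet outranks the unrelated one, so only relevant text reaches the model.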

Asking Questions in English

In a knowledge store, the user asks their question in plain language: “What is this person’s date of birth?”, “How old is this person?”, “Is this person old enough to vote?”, etc.

Notice how the birth date in the knowledge store can be called “Birth Date”, “Date of Birth”, or “When I was born”. It does not matter.

The birth date can be “December 1, 1983” or “12/1/83” or “1983-12-01”. It does not matter.

The birth date can be stored in the patient document or it can be stored in the “Personal History” document or any other document. It does not matter.

All of these things we care about and spend time on today are no longer needed.

Vector Stores, LLM and Generative AI

This data in text format is stored in a vector store (e.g., Pinecone, Weaviate, Chroma, ElasticSearch, etc.). A vector store is a specialized store for text data that uses embeddings (representations of text as vectors) and similarity queries (cosine distance, etc.) to match a user query to the text that relates to that question.
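Cosine similarity, the most common of these similarity measures, fits in a few lines of Python. The three-dimensional vectors below are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and real vector stores use approximate-nearest-neighbor indexes to run this comparison at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: the query and the birth-date fact point in
# nearly the same direction; the unrelated fact points elsewhere.
query      = [0.9, 0.1, 0.0]
birth_fact = [0.8, 0.2, 0.1]
menu_fact  = [0.0, 0.1, 0.9]

print(cosine_similarity(query, birth_fact))  # close to 1: a good match
print(cosine_similarity(query, menu_fact))   # close to 0: unrelated
```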

An LLM (Large Language Model) like ChatGPT is used to take the text snippets relating to the question and understand both the question being asked and the snippets.

A Generative AI engine like GPT-4 is then used to generate the answer and provide it to the user.

Learning to get smarter

Finally, feedback from users (typically gathered initially from some test users) is fed into RLHF (Reinforcement Learning from Human Feedback) to tune the Generative AI engine to choose better answers in the future.

Benefits of the Knowledge Store

The Knowledge Store can enable us to:

  1. Get all our data in one place without spending time unifying the schemas. All that is needed is an export to text format, which all databases already support.
  2. Enable the average person to ask questions in plain English so they can get most answers without needing to go through a technical analyst or engineer.
  3. Provide answers in a way that makes them easier to understand (as a table, as a graph, as a trend).
  4. Provide answers in language personalized for each person (CEO vs. CFO vs. CMO vs. VP Marketing).

The Knowledge store can finally get us to the vision of anyone in the organization being able to derive value from the data.

Risks/Pitfalls

  1. People will be your biggest challenge (as always). Have a plan.
    • You will face resistance from some technical folks who worry that their jobs are at stake. You will need to clarify that your goal is to accelerate their productivity so they can work on harder problems.
  2. The quality of the knowledge in your knowledge store will only be as good as the quality of the data to populate the knowledge store.
    • You avoid the step of unifying schemas, but the quality of your knowledge won’t suddenly get better.
    • The good news is that you can now take your analysts and engineers off of just answering questions and have them focus on harder problems like this one.
  3. Don’t send all your data to LLMs.
    • First, you will hit the token limit (4,000 tokens for GPT-3 and 8,000 for GPT-4).
    • Even if your data is less than this, the pricing is by tokens so you will pay for unnecessary data.
    • Not to mention you increase the risk by sending extra data outside your company boundary.
  4. Implement a local vector store to store your knowledge.
    • Then use similarity searches and context compression to send only the minimum data to LLMs to process.
  5. Training your own private GPT models is problematic and unnecessary if you implement a vector store.
    • While there are many claims of smaller models performing as well as larger models, it will likely cost you a lot of time and people to make this approach work.
    • If you implement a vector store, proper prompts, and RLHF, the need to train your own model from scratch is greatly reduced. Read Hallucinations in ChatGPT for more detail.
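Point 4 above (send only the minimum data to the LLM) can be sketched as a simple token-budget filter. The function name is illustrative, word count stands in for real tokenization (libraries such as tiktoken count actual tokens), and the snippets are assumed to arrive already ranked by a vector-store similarity search.

```python
def compress_context(ranked_snippets, token_budget):
    """Keep the most relevant snippets that fit within the LLM token budget.

    `ranked_snippets` must already be sorted most-relevant first (e.g., by a
    vector-store similarity search). Word count approximates token count here.
    """
    kept, used = [], 0
    for snippet in ranked_snippets:
        cost = len(snippet.split())
        if used + cost > token_budget:
            break  # stop before exceeding the budget
        kept.append(snippet)
        used += cost
    return kept

ranked = [
    "Ada Lovelace was born on December 10, 1815.",   # most relevant
    "Ada Lovelace worked with Charles Babbage.",
    "The cafeteria opens at 9am.",                   # least relevant
]
print(compress_context(ranked, token_budget=14))  # drops the unrelated snippet
```

Besides staying under the model’s hard limit, trimming context like this cuts per-token cost and limits how much data leaves your company boundary.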

What’s the Right Strategy?

Here’s what I would recommend:

  1. Start with a people plan. How will you communicate your plan so people feel excited and not scared? Reading Will ChatGPT Replace Software Engineers? may help you create your communication plan.
  2. Focus on analytical needs first. The transactional needs are less of an obstacle in today’s systems.
  3. Start with a proof of concept of taking one data source and storing it in a vector store. Then add in a managed GenAI service such as OpenAI or Azure OpenAI.
  4. Most current databases will soon offer vector store capabilities, so you don’t need to change your database unless you want very advanced capabilities. For example, ElasticSearch, Redis, Postgres, and Mongo all offer vector search, and I would expect other popular databases to follow suit soon. We know changing database technologies is a huge lift for most organizations, so it is best to avoid that.
  5. Expect resistance from your technical teams. They are so used to thinking in the data store mindset that the change will be hard for them. Acknowledge their concerns, define your goals clearly, and then choose a few early adopters to work on the proof of concept.
  6. Enable users to ask questions in English as soon as you can, even before you have all the knowledge in your knowledge store. This will create the organizational momentum to push through the resistance.
  7. Get started! Querying data in English is so compelling once users see it that you don’t want to fall behind your competitors.

Feel free to share your thoughts on knowledge stores and if you’re using them.

If you don’t have a good understanding of ChatGPT yet, you can read A Simple Explanation of ChatGPT, How To Use ChatGPT, Incorporating ChatGPT into existing applications or Hallucinations in ChatGPT.
