Needles and haystacks or…

Locating critical information in a corporate knowledge repository

If you’re reading this document, there’s a pretty high chance you’ve had the painful experience of urgently trying to find something in the organisational knowledge repository (or hub, etc.) when it seems to defy all logic of searching by keywords or by taxonomy (the way things are organised, classified or categorised).

“What are the main considerations when evaluating a potential contract with an organisation outside of Europe?” shouldn’t be too hard a question for a sales focused organisation but sometimes we can be surprised at how tricky it is to chase these snippets down.

Until recently, our systems had two broad ways of throwing light on questions such as this:

  • Provide instinctively obvious hierarchical routes that guide us to the correct documents that contain the answers
  • Provide a search facility based on content keywords or tags (human or automatically generated) that allows us to get a shortlist of documents that contain the answers

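Both routes struggle with fluid knowledge repositories that have many owners and large numbers of documents (thousands to tens of millions): hierarchies drift as different owners file things in different ways, and keyword searches miss documents that express the same idea in different words.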

So what can we do? 

I’ll get to that shortly, but first it will really help if I explain some fundamentals of natural language processing (NLP).

If you’re not really interested in the underlying technology of how Natural Language Processing works, just skip ahead to the large red “Phew!”

Since the earliest days of computing, computers have been able to understand numbers and the inherent intent of the number. When we say “17” to a computer, it understands that it’s 1 more than 16 and 1 less than 18. It understands that we can add other numbers to it and get the answer. 

But when we give the computer a word like “shirt” or “queen” or “man”, it has no sense of what that word means; to the computer it is just a collection of letters.

As computer systems became more sophisticated, eventually we got things like Wikipedia that would allow us to look up a definition of a shirt and find that it’s an item of clothing for the upper body. But in truth, the computer still has no notion of what a shirt is. The definition is just another collection of words that have no underlying intent or meaning. There’s no understanding of the word or collection of words that it can do anything meaningful with.

If the computer had an understanding of the underlying intent of a word or collection of words, what might it be able to do?

Well if we take the words queen, woman and man….

  • “Queen” conveys the notions of sovereignty, femininity, leadership, power, nationhood and many other subtly nuanced characteristics
  • Without entering the debate of binary gender….
    • “Woman” conveys the notion of being a feminine human and…
    • “Man” conveys the notion of being a masculine human

If the computer truly had an understanding of these notions or intents of the words, then it would be able to deduce that…

Queen – Woman + Man ….

Means removing all the human femininity from the sovereignty, leadership, power, nationhood and femininity of the word queen and then adding back the notion of human masculinity.

We don’t often think of language in these hard formulaic terms but if we bend our minds to the task then we can deduce that….

Queen – Woman + Man = King 

And we can only perform that feat of linguistic acrobatics because we have a deep understanding of the underlying intent of the words. 

Can computers learn that same deep understanding of the intent of words?

Yes.
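As a toy illustration of the arithmetic above, here is a minimal Python sketch. The five "traits" and every number in it are made up by hand purely for this example; real systems learn hundreds or thousands of numbers per word from large amounts of text, and the names TOY_VECTORS and nearest_word are hypothetical.

```python
import math

# Hand-made, purely illustrative "meaning" vectors (NOT real embeddings).
# Each position is a made-up trait:
# [royalty, femininity, masculinity, human, clothing]
TOY_VECTORS = {
    "queen": [1.0, 1.0, 0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0, 1.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 1.0, 0.0],
    "shirt": [0.0, 0.0, 0.0, 0.0, 1.0],
}


def cosine_similarity(a, b):
    """Return how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector, vocabulary):
    """Return the word whose vector is most similar to the given vector."""
    return max(vocabulary, key=lambda word: cosine_similarity(vector, vocabulary[word]))


# Queen - Woman + Man, computed one trait at a time.
result = [
    q - w + m
    for q, w, m in zip(TOY_VECTORS["queen"], TOY_VECTORS["woman"], TOY_VECTORS["man"])
]
answer = nearest_word(result, TOY_VECTORS)
print(answer)  # prints: king
```

With these hand-picked numbers the sum lands exactly on the vector for "king". With learned vectors the result is only approximately right, which is why the sketch looks for the nearest word rather than an exact match.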

And whilst having a deep understanding of the intent behind a single word is pretty impressive, we actually speak in sentences and paragraphs. 

If we go further and express the queen, woman, man formula with more complexity… 

What is queen minus woman minus blood pumping organ minus red plus man plus soil digging implement plus black?

We might have to think about it for a few moments and if we’re not used to tricky constructions of this kind, we might need someone to lead us through it but eventually we would understand that we are transforming the “Queen of Hearts” into the “King of Spades”. 

Recent developments in AI mean that this deep understanding of sentences is now possible. In December 2023 I posed this question to ChatGPT 4, and this is the answer it gave:

Interpreting the phrase “queen minus woman minus blood pumping organ minus red plus man plus soil digging implement plus black” within the context of playing cards could lead to a clever transformation:

  • “queen”: Start with the queen card.
  • “woman”: Remove the female aspect.
  • “blood pumping organ” (heart): Removing the heart implies taking away the heart suit.
  • “- red”: Removing the color red further emphasizes moving away from the hearts (and diamonds) suit, which are red.
  • “+ man”: Adding the male aspect suggests switching to a king.
  • “+ soil digging implement (spade)”: Adding a spade implies the spade suit.
  • “+ black”: Adding black aligns with the color of the spade suit.

Putting it all together, this phrase seems to be describing the transformation of a queen of hearts card into a king of spades card in a deck of playing cards. The queen (a female figure in the red heart suit) is transformed into a king (a male figure in the black spade suit) through this series of symbolic subtractions and additions.

An exploration of how ChatGPT 4 has managed to achieve this parlour trick is definitely beyond the scope of this blog.

And… I chose to demean the demo by calling it a “parlour trick” to underline the fact that it has no simple link to our problem of finding needles in haystacks.

However, it does demonstrate that ChatGPT 4 has an excellent understanding of the underlying intent of words and that further it can understand the intent of a collection of words when put together as a question.

Behind the curtain, systems like ChatGPT reduce each word to a list of about 1500 numbers (known as an “embedding”). Each number is a small decimal, which can be positive or negative, and together these c1500 numbers capture the essence of the meaning of the word in a way that only the model itself can interpret.

Furthermore, it can use the same techniques to summarise a collection of words as another set of c1500 numbers. The collections of words can be quite long (something like 1000 words).

Phew!

So, developments in the last few years mean that it’s now possible for AI systems to understand the intent behind words, sentences, paragraphs and collections of paragraphs (up to a maximum length of about 1000 words).

To find the needle in our haystack, we can break our problem down into a few parts:

  • Work out the intent of the question by turning it into c1500 numbers
  • Break down the documents in our repository into 1000 word chunks
  • Understand the intent of each of the chunks by turning each one into a set of c1500 numbers
  • Work out which chunks have an intent that is closest to the intent of the question
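
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: toy_embed is a hypothetical stand-in that simply counts words, whereas a real solution would call an embedding model that returns the c1500 numbers and captures meaning rather than word overlap. The function names and sample documents are invented for the example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, max_words=1000):
    """Break a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model. It only counts words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Similarity between two sparse vectors held as word -> count mappings."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(question, chunks, embed=toy_embed, top_k=3):
    """Return the top_k (score, chunk) pairs closest in 'intent' to the question."""
    question_vector = embed(question)
    scored = [(cosine_similarity(question_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented sample documents, for illustration only.
documents = [
    "Contracts with organisations outside Europe need checks on currency, tax and export rules.",
    "The office canteen opens at eight and serves lunch until two.",
]
chunks = [chunk for document in documents for chunk in split_into_chunks(document)]
best = rank_chunks("What should we consider for a contract outside of Europe?", chunks, top_k=1)
print(best[0][1])
```

Because the embedding function is passed in as a parameter, the word-counting stand-in could be swapped for a real embedding model without changing the ranking logic.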

Once we’ve done that, we have some choices on how we could help answer the question:

  • Just offer the most relevant documents back to the questioner and say we believe the needle can be found within
  • Package up the document chunks and pass them, along with the question, to an external solution such as ChatGPT, Bard or Gemini, with an overarching instruction something like:
    • Using the attached documents as your source of knowledge, please answer the question and explain what elements of the documents led you to your answer
  • A hybrid such as offering the documents first with the option of going to the external AI if the documents are too complex to interpret quickly
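
For the second option, packaging the chunks and the question together might look something like the sketch below. The function name build_prompt and the "[Source n]" labels are assumptions made for illustration. The sketch only builds the text; how it is sent to an external service depends on that service and is left out.

```python
def build_prompt(question, chunks):
    """Combine the retrieved chunks and the question into one piece of text."""
    # Label each chunk so the answer can point back to where it came from.
    sources = "\n\n".join(
        f"[Source {number}]\n{chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Using the sources below as your source of knowledge, please answer "
        "the question and explain what elements of the sources led you to "
        "your answer.\n\n"
        f"{sources}\n\n"
        f"Question: {question}"
    )


# Invented example chunks, for illustration only.
prompt = build_prompt(
    "What are the main considerations for a contract outside of Europe?",
    ["Check currency and tax rules.", "Check export controls."],
)
print(prompt)
```

Labelling each chunk makes it easier for the answer to say which source it relied on, which supports the "explain what led you to your answer" part of the instruction.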

Of course there are more options that we could explore but mostly they would be variations on the themes here. 

If our knowledge repository is rather trivial (say a departmental policies and procedures folder with 20 or so documents in it) then there are ways of achieving a great solution using low-code / no-code deployments within the ChatGPT SaaS environment.

But in the real world, the real problem is extracting knowledge and insight from vast knowledge repositories containing hundreds of thousands of documents. The technical solution required to give all our people easy and efficient access to the information they require is much more significant.

If you can relate to the problems I’ve described in the blog above and would like to talk about routes to a solution, please give us a call and ask for a chat or, even better, a “show and tell” with one of our compelling demos.

Epilogue

Once more, I’ve used generative AI to come up with the main image for this blog. My prompt was along the lines of:

“Please give me an image of a queen of hearts searching for a needle in a haystack in the style of an oil painting by Gustav Klimt with flashes of golden light” 

I chose Gustav Klimt because I like the style and it creates a consistency across my blogs which makes me smile.

I thought I’d also share one of the other images that I rejected… 

I asked it for something “a bit more High Tech” and it gave me this unnerving image!!