Ask HN: Best way to create a searchable knowledge base?
Have you had the experience of using/developing knowledge bases? Here is my scenario:
My team is dealing with a lot of information: wikis, code repos, monitoring dashboards, internal chat messages, emails, task tickets, related systems, etc.
There are many cases when we need to do ad-hoc searches for anything related to a concept. For instance, imagine if someone makes a change to a metric, there is a need to find all dashboards that might be using this metric to make sure they are still valid after the change.
I don't want to just fix this problem, but create the ability to find related information in ad-hoc cases.
The ramp-up time is not important, as long as some positive value can be created with a small initial effort.
Any existing products (Paid/Free/Open Source, etc) and any references to existing knowledge (designs, discussions) about this would be really appreciated.
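To make the metric example concrete, here is a rough sketch of the narrow version of the problem. It assumes dashboards can be exported as JSON files into one directory; the directory layout and the names `find_strings` and `dashboards_using` are made up for illustration and not tied to any particular dashboard product.

```python
import json
from pathlib import Path


def find_strings(node):
    """Yield every string value nested anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from find_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_strings(item)


def dashboards_using(metric, dashboard_dir):
    """Return sorted file names of dashboards that mention `metric` in any string field."""
    hits = []
    for path in sorted(Path(dashboard_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed exports
        if any(metric in text for text in find_strings(doc)):
            hits.append(path.name)
    return hits
```

This is a plain substring match, so a short metric name will also match longer names that contain it. It only covers one source, which is exactly why I'm asking about something more general.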
I've _never_ seen a wiki or internal documentation stay up to date.
My experience has been to encourage public blogging / speaking about technical information. If it's public, there are several benefits. First, you need to explain things to people with little context from the company. You also feel scrutiny to make it accurate and not embarrass yourself. And readers will see the date of authorship, and have a sense of when this information was true. And of course, Google is a better search engine than anything you'll have internally!
For example, when I worked on search at Reddit, I didn't point people at anything internal (that stuff rots) but instead I would point people at places like:
https://www.reddit.com/r/RedditEng/comments/1985mnj/bringing...
https://www.youtube.com/watch?v=gUtF1gyHsSM
The downside to this approach is that companies are too precious about IP, so they don't want you to be specific (despite it almost certainly not being special). Also, company blogs can get over-edited to the point where they lose authenticity in favor of SEO spam.
This isn't the tool to use for things like runbooks, etc. It's a more useful thing for broader context.
I wish more companies just gave their developers their own personal blogs, and were less precious about preventing speaking.
This is a very interesting take. At this moment, I have little hope that I can encourage others to do this in the short term, but it's a helpful reminder for myself to finish and post some of the drafts I've kept for too long.
At Reddit, I tried to be the instigator for other people to do this :) It helps to have someone encouraging blogging, speaking. Almost a team evangelist.
Where I work we purchased Coveo. We purchased the base version which includes connectors to files, databases, and REST APIs. We then query it using a Coveo API. We stuck with the basic search which is smart enough and did not purchase their AI addon. So far we are happy with it. https://www.coveo.com/en/integrations#filter=universal%20con...
I did that for a very large multi-continent automotive company. There had been several unsuccessful attempts before. I did it with my own phpwiki, plus xapian integration for searching over all the other existing documentation and tickets, SAP and CQ. It has been working well for the last 15 years.
It's markdown, with several plug-ins. Easier than mediawiki, which relies on manual human effort. We preferred automation, and none of the insanities of other wikis, like WYSIWYG editing and similar management nonsense.
This is very interesting. If it's convenient, can you please share a few reasons why previous attempts failed?
They tried the usual wikis, like mediawiki, confluence,... But had no search integration of their other documentation.
And I had the advantage of being the phpwiki maintainer at the time, so I could easily extend it to our needs. I also wrote some custom plugins for them, with ajax tricks. It helped that phpwiki is not such a mess as mediawiki, which was also entirely insecure.
All these plugins were then upstreamed. Later Alcatel took over maintenance, and they run a similar knowledge base.
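For anyone wondering what "search integration over the other documentation" boils down to, here is a toy illustration. To be clear, this is not Xapian and not the phpwiki plugin code; it's a pure-Python in-memory inverted index with made-up names (`TinyIndex`, the `source` labels), just to show the idea of one index fed from several systems.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index; a stand-in for a real engine such as Xapian."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> (source, title)

    @staticmethod
    def tokenize(text):
        """Lowercase the text and split it into alphanumeric terms."""
        return re.findall(r"[a-z0-9_]+", text.lower())

    def add(self, doc_id, source, title, body):
        """Index one document from any source (wiki, tickets, ...)."""
        self.docs[doc_id] = (source, title)
        for term in self.tokenize(title + " " + body):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return sorted (source, title) pairs for docs containing every query term."""
        terms = self.tokenize(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(self.docs[i] for i in ids)
```

Usage would be something like `idx.add("T-1", "tickets", "Brake sensor fault", "...")` for each source and then one `idx.search("brake sensor")` across all of them. A real engine adds ranking, stemming and persistence on top.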
You might want to look into knowledge graphs (KGs), graph databases, ontologies, and similar.
I have used these personally and professionally to do some cool things, like run audits across different systems simultaneously. A common stack would include Protege for creating the ontologies (i.e., a schema of how the things you're interested in link to each other), Ontotext Refine or py scripts to populate the graphs, and Ontotext GraphDB or Neo4j AuraDB for storing them.
It's relatively easy to then connect this knowledge base to an LLM, and get more flexibility out of it.
That said, there aren't that many user-friendly tools that get the most out of KGs. Most people I worked with weren't interested in KGs or knowledge bases themselves; they just wanted their particular problem solved. And often, it was easier to justify purchasing a subscription to managed tools that (claim to) solve the problem.
So, unless you're OK with building some middleware to combine user apps with KGs, it won't stick with others, in my experience.
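If the KG idea feels abstract, here is a minimal sketch of the core of it in plain Python, with no Protege, GraphDB or Neo4j involved. The `TripleStore` class, the `uses` / `references` predicates and the node names are all hypothetical; a real setup would define these in an ontology and query them with SPARQL or Cypher.

```python
class TripleStore:
    """Toy knowledge graph: a set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

    def impacted_by(self, node):
        """Return everything that points at `node`, directly or transitively."""
        seen, frontier = set(), [node]
        while frontier:
            current = frontier.pop()
            for subject, _, _ in self.match(obj=current):
                if subject not in seen:
                    seen.add(subject)
                    frontier.append(subject)
        return sorted(seen)


kg = TripleStore()
kg.add("dashboard:latency", "uses", "metric:p99_latency")
kg.add("runbook:latency-alert", "references", "dashboard:latency")
print(kg.impacted_by("metric:p99_latency"))
```

That one traversal is the "what breaks if I change this metric" audit from the original question. The hard part is not the query, it's populating the triples from each system and keeping them fresh.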
I made something similar for a poc for an idea I was exploring.
Neo4j, OpenSearch, and vector embeddings. I would use OpenAI API calls to generate the OpenSearch query based on user text input.
For example, a user could search “what tasks are assigned to Jake that are at least 50% complete and due in the next 2 weeks” and it would be able to return relevant results.
Obviously it's only as good as the user's search query. I spent close to 100 hours writing tests to get it working close to 100% of the time. Eventually I dropped the embeddings because I could generate the OpenSearch query on the fly. So it was pretty lean and easy.
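To give a feel for the target of that generation step, here is roughly the kind of query body the Jake example maps to. This is a sketch, not the code from the poc: the field names (`assignee`, `percent_complete`, `due_date`) and the helper `build_task_query` are assumptions for illustration, and the LLM call that would extract the three values from the user's text is left out.

```python
import json


def build_task_query(assignee, min_percent_complete, due_within_days):
    """Build an OpenSearch query body from filters already extracted from user text."""
    if not 0 <= min_percent_complete <= 100:
        raise ValueError("min_percent_complete must be between 0 and 100")
    if due_within_days < 0:
        raise ValueError("due_within_days must not be negative")
    return {
        "query": {
            "bool": {
                "filter": [
                    # assumes `assignee` is mapped as a keyword field
                    {"term": {"assignee": assignee}},
                    {"range": {"percent_complete": {"gte": min_percent_complete}}},
                    # date math: from today up to N days ahead, rounded to the day
                    {"range": {"due_date": {"gte": "now/d", "lte": f"now+{due_within_days}d/d"}}},
                ]
            }
        }
    }


print(json.dumps(build_task_query("jake", 50, 14), indent=2))
```

Having the model fill in a few validated parameters like this, rather than emit raw query JSON, is one way to make the tests easier to write, though it does trade away some flexibility.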
I think you need a Zettelkasten, but collecting data into it from all your sources may be challenging, if not overkill... However, a Zettelkasten stores metadata, which can include tags, subject headings, and unique identifiers that help link and organize notes. This metadata enhances the ability to retrieve and connect related information within the system; not sure how to do so externally.
The key is to store both data and metadata... OpenMetadata may be what you need: https://open-metadata.org/ but I couldn't spot wiki, chat, GitHub or Jira connectors :shrug:
Good luck, keep us posted.
I've never dealt with this or tried anything like what you need, but isn't it a good use case for an AI with Retrieval Augmented Generation?
Agree. Still, if the data is stored in a usable format not just by AI, it can enable other use cases. We do have some AI indexing of our sources and it creates some value.
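For readers who haven't seen RAG spelled out, the retrieval half is small enough to sketch. This is not our internal setup; it swaps real embeddings for a bag-of-words cosine similarity so it runs with the standard library only, and the function names (`vectorize`, `retrieve`, `build_prompt`) are made up. The resulting prompt would then be sent to whatever model you use.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The point about usable formats still applies: the `chunks` here are just text, so the same store can feed plain search or the graph approaches mentioned elsewhere in the thread.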
If it really matters it’s an FTE.
It probably doesn’t really matter.
Good luck.
cd ~/Documents
vim knowledge.md