Ask HN: Best way to create a searchable knowledge base?

22 points by aljgz 4 months ago

Have you had the experience of using/developing knowledge bases? Here is my scenario:

My team is dealing with a lot of information: Wikis, Code repos, Monitoring dashboards, internal chat messages, emails, Task tickets, related systems, etc.

There are many cases when we need to do ad-hoc searches for anything related to a concept. For instance, imagine if someone makes a change to a metric, there is a need to find all dashboards that might be using this metric to make sure they are still valid after the change.

I don't want to just fix this problem, but create the ability to find related information in ad-hoc cases.

The ramp-up time is not important, as long as some positive value can be created with a small initial effort.

Any existing products (Paid/Free/Open Source, etc) and any references to existing knowledge (designs, discussions) about this would be really appreciated.

softwaredoug 4 months ago

I've _never_ seen a wiki or internal documentation stay up to date.

My experience has been to encourage public blogging / speaking of technical information. If it's public, there are several benefits. First, you need to explain things to people with little context from the company. You also feel scrutiny to make it accurate and not embarrass yourself. And readers will see the date of authorship, and have a sense of when this information was true. And of course, Google is a better search engine than anything you'll have internally!

For example, when I worked on search at Reddit, I didn't point people at anything internal (that stuff rots) but instead I would point people at places like:

https://www.reddit.com/r/RedditEng/comments/1985mnj/bringing...

https://www.youtube.com/watch?v=gUtF1gyHsSM

The downside to this approach is that companies are too precious about IP, so they don't want you to be specific (despite it almost certainly not being special). Also, company blogs can get over-edited to the point where they lose authenticity in favor of SEO spam.

This isn't the tool to use for things like runbooks, etc. It's a more useful thing for broader context.

I wish more companies just gave their developers their own personal blogs, and were less precious about preventing speaking.

aljgz 4 months ago

This is a very interesting take. At this moment, I have little hope that I can encourage others to do this in the short term, but a helpful reminder for myself to finish and post some of the drafts I've kept for too long.
- softwaredoug 4 months ago
  
  At Reddit, I tried to be the instigator for other people to do this :) It helps to have someone encouraging blogging, speaking. Almost a team evangelist.
uaas 4 months ago

Public speaking is great, but not sure if it’s easier to keep a recorded talk (or even a company blog post) up to date than anything else you have full control over.
- softwaredoug 4 months ago
  
  I think "keeping up to date" is a fool's errand, because you end up with "half-up-to-date" documentation where someone thought to update some part of it, but not another. And it gets incoherent.
  So my preference is a coherent story at a point in time.

slightwinder 4 months ago

The solution is text files, automation and version control. Write scripts that regularly and automatically export all data, configs, communication, etc. into centralized storage that is under version control, with an automatic commit every day that is also exported as a report of that day's changes. Then use grep or whatever to search it.

People will not maintain knowledge bases unless you force them. So remove as much friction as possible and make it as accessible as possible. Hence the automation and text files. It doesn't need to be plaintext, just something universal and human-readable. It could be formatted in markdown, yaml, json, or be single email files: anything you can search with simple tools and make connections with. The version control and its report will then allow you to follow the trail of work, to discover what was discussed and changed around the same time, and to find connections. And it's never wrong to have a reversible history of your stuff.

And maybe along the way you can motivate people to also write some proper documentation here and there, and add some more fancy tools on-top.
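
A minimal sketch of the "then grep it" step, assuming the export scripts have already written everything as text files under one directory (the directory layout and the metric name in the usage note are made up for illustration). It only covers searching; the daily commit would be a separate scheduled `git add` and `git commit` in that directory.

```python
from pathlib import Path


def search(root: Path, term: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each line containing `term`.

    The match is case-insensitive. Files that are not valid UTF-8 are skipped.
    """
    needle = term.lower()
    hits = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Ignore directories and anything inside the version-control folder.
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((relative.as_posix(), lineno, line.strip()))
    return hits
```

For example, `search(Path("exports"), "http_request_seconds")` would list every exported dashboard, wiki page, or chat log line that mentions that (hypothetical) metric.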

badmonster 4 months ago

https://github.com/cocoindex-io/cocoindex is built for keeping a knowledge base fresh. It detects deltas in the data sources, applies any transformations, and refreshes the knowledge, at any scale.

It is open source.

Examples: https://cocoindex.io/docs/examples

aljgz 4 months ago

Cocoindex looks amazing. Thanks

rawgabbit 4 months ago

Where I work we purchased Coveo. We purchased the base version which includes connectors to files, databases, and REST APIs. We then query it using a Coveo API. We stuck with the basic search which is smart enough and did not purchase their AI addon. So far we are happy with it. https://www.coveo.com/en/integrations#filter=universal%20con...

rurban 4 months ago

I did that for a very large multi-continent automotive company. There were several unsuccessful attempts before. I did it with my own phpwiki, plus xapian integration for searching over all the other existing documentation and tickets, SAP and CQ. It has been running well for the last 15 years.

It's markdown, with several plug-ins. Easier than mediawiki, which relies on manual effort. We preferred automation, and none of the insanities found on other wikis, like WYSIWYG editing and similar management nonsense.

aljgz 4 months ago

This is very interesting. If it's convenient, can you please share a few reasons why previous attempts failed?
- rurban 4 months ago
  
  They tried the usual wikis, like mediawiki, confluence,... But had no search integration of their other documentation.
  And I had the advantage of being the phpwiki maintainer at the time, so I could easily extend it to our needs. And I wrote some custom plugins for them, with ajax tricks. It helped that phpwiki is not such a mess as mediawiki, which was also entirely insecure.
  All these plugins were then upstreamed. Then Alcatel took over maintenance, and they run a similar knowledge base.

TomasBM 4 months ago

You might want to look into knowledge graphs (KGs), graph databases, ontologies, and similar.

I personally and professionally used these to do some cool things, like run audits across different systems simultaneously. Common stack would include Protege for creating the ontologies (i.e., a schema of how the things you're interested in link to each other), Ontotext Refine or py scripts to populate the graphs, and Ontotext GraphDB or Neo4j AuraDB for storing them.

It's relatively easy to then connect this knowledge base to an LLM, and get more flexibility out of it.

That said, there aren't that many user-friendly tools that get the most out of KGs. Most people I worked with weren't interested in KGs or knowledge bases themselves, they just wanted their particular problem solved. And often, it was easier to justify purchasing a subscription to managed tools that (claim to) solve the problem.

So, unless you're OK with building some middleware to combine user apps with KGs, it won't stick with others, in my experience.
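
To make the idea concrete for the metric example from the original question, here is a toy sketch that keeps the graph as plain (subject, predicate, object) triples in memory. Every node name is hypothetical, and a real setup would store the triples in a graph database such as the ones named above rather than a Python list.

```python
from collections import defaultdict, deque

# Hypothetical triples: which artifact points at which other artifact.
TRIPLES = [
    ("dashboard:checkout-latency", "uses", "metric:http_request_seconds"),
    ("alert:slow-checkout", "uses", "metric:http_request_seconds"),
    ("runbook:checkout-oncall", "references", "alert:slow-checkout"),
    ("dashboard:signup-funnel", "uses", "metric:signups_total"),
]


def dependents(triples, target):
    """Return, sorted, every node that points at `target` directly or indirectly."""
    # Reverse index: object -> set of subjects that point at it.
    incoming = defaultdict(set)
    for subject, _predicate, obj in triples:
        incoming[obj].add(subject)

    # Breadth-first walk backwards along the edges.
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in incoming[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)
```

Calling `dependents(TRIPLES, "metric:http_request_seconds")` returns the dashboard and alert that use the metric plus the runbook that references the alert, which is the "what do I need to re-check after this change" list.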

comprev 4 months ago

I'm a firm believer in having a single source of truth and that's often a Jira ticket ID in the corporate world. On Jira you can either directly link to another item, link to a Confluence page or leave a comment with the unique string (if the boards are not connected).

Every git branch and commit contains a Jira ID string, even down to "description" fields in resource properties or meta tags.

The idea is a future engineer who I will never meet has a starting point to understand the context of the change.

Projects on GitLab get a "release" too which contains the Jira ID, or possibly multiple in the auto-generated CHANGELOG.

It's not perfect and does require good discipline however I feel a professional responsibility to make the extra effort.
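
As a rough illustration of why the consistent ID string pays off, the sketch below builds a ticket-to-commits index from commit messages. The regular expression is an assumption about the key format (uppercase project key, hyphen, number), not Jira's official definition, and it will also match look-alikes such as "UTF-8". The sample commits are invented.

```python
import re
from collections import defaultdict

# Assumed key shape: uppercase project key, hyphen, issue number (e.g. OPS-12).
JIRA_ID = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def index_by_ticket(commits):
    """Map each ticket ID to the commit hashes whose message mentions it.

    `commits` is an iterable of (sha, message) pairs.
    """
    index = defaultdict(list)
    for sha, message in commits:
        # set() so a ticket mentioned twice in one message is counted once.
        for ticket in sorted(set(JIRA_ID.findall(message))):
            index[ticket].append(sha)
    return dict(index)
```

The (sha, message) pairs could come from something like `git log --format="%h %s"`; the same pattern can be run over branch names, changelogs, or description fields.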

moomoo11 4 months ago

I made something similar as a PoC for an idea I was exploring.

Neo4j, OpenSearch, and vector embeddings. I would use OpenAI API calls to generate the OpenSearch query based on user text input.

For example, a user could search “what tasks are assigned to Jake that are at least 50% complete and due in the next 2 weeks” and it would be able to return relevant results.

Obviously it's only as good as the user's search query. I spent close to 100 hours writing tests to get it working close to 100% of the time. Eventually I dropped the embeddings because I could generate the OpenSearch query on the fly. So it was pretty lean and easy.
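
For readers who have not seen the target format, this is roughly the kind of OpenSearch query body the example sentence above could translate to. It is a hand-written sketch, not the actual generated output from that project, and the field names (`assignee`, `percent_complete`, `due_date`) are hypothetical; they depend entirely on the index mapping.

```python
def build_task_query(assignee: str, min_percent: int, within_days: int) -> dict:
    """Build an OpenSearch query DSL body for tasks matching all three conditions.

    Field names are hypothetical. `assignee` is assumed to be a keyword field
    and `due_date` a date field, so date math like "now+14d" applies.
    """
    return {
        "query": {
            "bool": {
                # `filter` clauses must all match and do not affect scoring.
                "filter": [
                    {"term": {"assignee": assignee}},
                    {"range": {"percent_complete": {"gte": min_percent}}},
                    {"range": {"due_date": {"gte": "now", "lte": f"now+{within_days}d"}}},
                ]
            }
        }
    }
```

`build_task_query("Jake", 50, 14)` gives the body for the example search; the LLM's job in this design is to produce a structure like this from free text.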

asim 4 months ago

So I just started doing this using LLM embeddings for semantic search. It actually works quite well. You index every piece of data with its metadata and its content. Then you choose specific metadata fields you might want to correlate on, e.g., knowing two pieces of data are of type "product" or "design", and then the query will return the related items. OpenAI gets used for turning your query into something that can be run against your index, which is basically a vector DB. If you are using Go, then chromem-go does this quite easily and has examples.
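
A toy sketch of the index-then-filter-then-rank flow described here. To stay self-contained it swaps the LLM embedding for a bag-of-words count vector, so it only matches shared words rather than meaning; with a real embedding model only `embed` would change. All document IDs and metadata values are made up.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_search(index, query, where=None, top_k=3):
    """Rank documents by similarity to `query`, keeping only those whose
    metadata matches every key/value pair in `where`.

    `index` maps doc_id -> (vector, metadata dict).
    """
    where = where or {}
    q = embed(query)
    scored = [
        (cosine(q, vec), doc_id)
        for doc_id, (vec, meta) in index.items()
        if all(meta.get(k) == v for k, v in where.items())
    ]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]
```

With an index built as `{"d1": (embed("checkout latency dashboard"), {"type": "design"}), ...}`, a call like `semantic_search(index, "checkout latency", where={"type": "design"})` returns only the matching items of that type.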

abramN 4 months ago

Atlassian has Jira, Confluence, and Bitbucket - and their cross-platform search is getting better all the time. Confluence has AI search as well, so you can ask questions in natural language. I believe the underlying AI product is called Rovo.

2rsf 4 months ago

Add an MCP server to each and ask your favorite AI assistant? As an example, if you are in the Microsoft business environment, then Copilot (the MS one, not GitHub) can find information for you across sources.

carlos_rpn 4 months ago

I never dealt with it or tried anything like you need, but isn't it a good use case for an AI with Retrieval Augmented Generation?

aljgz 4 months ago

Agree. Still, if the data is stored in a usable format not just by AI, it can enable other use cases. We do have some AI indexing of our sources and it creates some value.

bitbasher 4 months ago

cd ~/Documents

vim knowledge.md

vladsanchez 4 months ago

I think you need a Zettelkasten, but its data collection from all your sources may be challenging if not an overkill... However, a Zettelkasten stores metadata, which can include tags, subject headings, and unique identifiers that help link and organize notes. This metadata enhances the ability to retrieve and connect related information within the system; not sure how to do so externally.

The key is to store both data and metadata... OpenMetadata may be what you need: https://open-metadata.org/ but I couldn't spot wiki, chat, GitHub, or Jira connectors :shrug:

Good luck, keep us posted.

brudgers 4 months ago

If it really matters it’s an FTE.

It probably doesn’t really matter.

Good luck.

scoco2121 4 months ago

[dead]