kenforthewin 5 hours ago

I keep harping on this, but the question is not "can you use your filesystem as a graph database" - of course you can - but whether this performs better or worse than a vector database approach, especially at scale.

The premise of Atomic, the knowledge base project I'm currently working on, is that there is still significant value in vectors, even in an agentic context. https://github.com/kenforthewin/atomic

  • iwontberude 3 hours ago

    Oh neat! Whenever I was working on RAG proofs of concept, vector databases seemed to generate the noisiest outputs: they happened to include information from my chunks but couldn't draw reasonable contextual associations. When I swapped RAG out for a web search tool, all of a sudden the quality went way up. Is RAG ever going to get easier to work with, or should lay people like me just move on?

    • kenforthewin 3 hours ago

      I think agentic RAG still has its place. A hybrid semantic/keyword search tool, in addition to other research tools, outperforms the baseline in my experience.

embedding-shape 8 hours ago

I've been playing around with the same, but trying to use local models, since my Obsidian vault obviously contains a bunch of private things I'm not willing to share with for-profit companies. But I have yet to find any small local model that comes close to working as well as just codex or cc, even with 96GB of VRAM to play around with.

I've started to think that maybe a fine-tuned model is needed, specifically for "journal data retrieval" or something like that. Is anyone aware of any existing models for things like this? I'd do it myself, but since I'm unwilling to send larger parts of my data to 3rd parties, I'm struggling to collect actual data I could use for fine-tuning myself, ending up in a bit of a catch-22.

For some client projects I've experimented with the same idea too, with fewer restrictions, and I guess one valuable lesson is that letting LLMs write docs and add them to a "knowledge repository" tends to end up with a mess. The best success we've had is limiting the LLMs' job to organizing and moving things around, never actually adding their own written text; quality seems to slowly degrade as their context fills up with their own text, compared to when they only rely on human-written notes.

  • weitendorf 5 hours ago

    This is exactly what we're working on, is there any application in particular you're interested in the most?

    > I'm struggling collecting actual data I could use for fine-tuning myself,

    Journalling or otherwise writing is by far the best way to do this IMO, but it doesn't take very much audio to accurately do a voice clone. The hard thing about journalling is that it can actually be really biased away from the actual "distribution" of you, whether because it's more aspirational or emotional, or less rigorous/precise with language.

    What I'm starting to do is save as many of my prompts as possible, because I realized a lot of my professional writing was there and it was actually pretty valuable data (especially paired with outputs and knowledge of what went well and what didn't) for finetuning on my own workloads. Secondly, I'm assembling/curating a collection of tools and products that I can drop into each new context with LLMs and also use for finetuning them on my own needs. Unlike "knowledge repositories", these both accurately model my actual needs and work, and don't really require me to do anything unnatural.

    The other thing I'm about to start doing is "natural" in a certain sense but kinda weird: basically recording myself talking to my computer (verbalizing my thoughts more, so they can be embedded alongside my actions, which may be much sparser from the computer's perspective), plus screen recordings of my session as I work. This is something I've had to look into building more specialized tools for, because it creates too much data to save all of it. But there are small models, transcoding libraries, and pipelines you can use for audio/temporal/visual segmentation and transcription to compress the data back down into tokens and normal-sized images.

    This is basically creating a semantic search engine of yourself as you work, kinda weird, but IMO it's just much weirder that your computer can actually talk back and learn about you now. With 96GB you can definitely do it BTW. I successfully finetuned an audio workload on gemma 4 2b yesterday on a 16GB mac mini. With 96GB you could do a lot.

    > letting LLMs write docs and add them to a "knowledge repository"

    I think what you actually want is to send them to go looking for stuff for you, or actively seeking out "learning" about something for their own role/purposes, so they can embed the useful information and better retrieve it when they need it, or produce traces grounded in positive signals (e.g. having access to this piece of information or tool, or applying this technique or pattern, measurably improves performance at something in-distribution to whatever you have them working on) that they can use in fine-tuning themselves.

    • embedding-shape 5 hours ago

      I think maybe you're misunderstanding the issue here. I have loads of data, but I'm unwilling to send it to 3rd parties, so that leaves me with gathering/generating the training data locally, but none of the models are good/strong enough for that today.

      I'd love to "send them to go looking for stuff for you", but local models aren't great at this today, even with beefy hardware, and since that's about my only option, that leaves me unable to get sessions to use for the fine-tuning in the first place.

      • weitendorf 4 hours ago

        Right, that's exactly the situation I'm in too and "send them to go looking for stuff for you" without it going off the rails is the problem we've been working on.

        Basically you need a squad of specialized models to do this in a mostly-structured way that ends up looking kind of like a crawling or scraping/search operation. I can share a stack of about 5-6 models that are working for us directly if you want; I want to keep the exact stack on the DL for now, but you can check my company's recent GitHub activity to get an idea of it. It's basically a "browser agent" where gemma or qwen guide the general navigation/summarization but mostly focus on information extraction and normalization.

        The other thing I've done, which obviously not everybody is going to want to do, is create emails and browser profiles for the browser agents (since they basically work when I'm not on the computer, but need identity to navigate the web) and run them on devices that don't have the keys to the kingdom. I also give them my phone number and their own (via an endpoint they can only call me from). That way, if they run into something, they have a way to escalate it, and I can do limited out-of-the-loop steering. Obviously this is way more work than is reasonable for most people right now, so I'm hoping to show people a proper batteries-included setup for it soon.

        Edit: Based on your other comment, I think maybe what you're really looking for most is "personal traces". Right now that's something we're working on with https://github.com/accretional/chromerpc (which uses the lower-level Chrome DevTools Protocol rather than Puppeteer to basically fully automate web navigation, either through an LLM or prescriptive workflows). It would be very simple to set up automation to take a screenshot and save it locally every Xm or in response to certain events, and generate traces for yourself that way, if you want. That alone provides a pretty strong base for a personal dataset.

        • embedding-shape 4 hours ago

          > that ends up looking kind of like a crawling or scraping/search operation

          Sure, but what I'm talking about is that the current SOTA local models are terrible even for specialized small use cases like what you describe, so you can't just throw a local model at that task and get useful sessions out of it that you can use for fine-tuning. If you want distilled data or similar, you (obviously) need to use a better model, but currently there is none that provides the privacy guarantees I need, as described earlier.

          All of those things come once you have something suitable for the individual pieces, but I'm trying to say that none of the current local models come close to solving the individual pieces, so all that other stuff is just distraction before you have that in place.

          • weitendorf 4 hours ago

            Understood. I guess I'm saying "soon", but definitely agreed it's not "now" yet. I will say though, with 96GB, in a couple months you're going to be able to hold tons of Gemma 4 LoRA "specialists" in memory at the same time, and I really think it will feel like a whole new world once these are all getting trained and shared and adapted en masse. And also, you could set up personal traces now if you want. Nobody can make you, but in its laziest form it can be literally just taking screenshots of your screen periodically as you work, and that'll have applications soon.

            • embedding-shape 3 hours ago

              > And also, you could set up personal traces now if you want. Nobody can make you, but in its laziest form it can be literally just

              But again, you're missing my point :) I cannot, since the models I could generate useful traces from are run by platforms I'm not willing to hand very private data over to, and the local models I could use can't produce useful traces.

              And I'm not holding out hope for agent orchestration; people haven't even figured out how to reliably get high-quality results from a single agent yet, even less so from a fleet of them. Better to realistically temper your expectations a bit :)

  • SeanLang 5 hours ago

    Couldn't you create synthetic data based on your entries using local models? Or would that defeat the purpose of fine tuning it?

    • embedding-shape 5 hours ago

      Yeah, I suppose, but how do I get sufficiently high-quality synthetic data without sending the original data to OpenAI/Anthropic, or by using local models when none of them seem strong enough to generate that "sufficiently high-quality synthetic data" in the first place?

      • mswphd 59 minutes ago

        You could do something like rent GPU time yourself and use it to run a higher-quality open model (e.g. one of the Chinese "close to frontier" ones). Not guaranteed to preserve privacy of course, but it at least avoids directly sending the data to OpenAI/Anthropic.

  • terminalkeys 5 hours ago

    You can fine-tune local models using your own data. Unsloth has a guide at https://unsloth.ai/docs/get-started/fine-tuning-llms-guide.

    I'm currently experimenting with Tobi's QMD (https://github.com/tobi/qmd) to see how it performs with local models only on my Obsidian vault.

    • embedding-shape 5 hours ago

      Right, the technical know-how about fine-tuning isn't the problem here; getting sufficiently high-quality session logs without basically giving away my private data for free is the issue.

      Today, I can use even the small models of OpenAI and Anthropic to get valuable sessions, but if I wanted to actually use those for fine-tuning a local model, I'd need to actually start sending the data I want to use for fine-tuning to OpenAI and Anthropic, and considering it's private data I'm not willing to share, that's a hard-no.

      So then my option is basically using stronger local models to get valuable sessions I can use for fine-tuning a smaller model. But if those "stronger local models" actually worked in practice and gave me those good sessions, I'd just use them directly; as it stands, I'm unable to get anything good enough to serve as a basis for fine-tuning, even from the biggest ones I can run.

  • gchamonlive 4 hours ago

    Models are lossy, so fine-tuning can only take you so far with small models. What we need is reasonably capable local models with a huge context window, and a method to make efficient use of tokens and cram as much info as possible into the context before the output quality degrades.

SoftTalker 23 minutes ago

I will always be in awe of people who can remain diligent doing this level of journaling/personal information management.

I've got scraps of paper and legal pads and post-it notes and just throw them away after they've been sitting around for a while and I forget what they are about.

stingraycharles 7 hours ago

Using the same logic, a key/value database is also a graph database?

Isn’t the biggest benefit of graph databases the indexing and additional query constructs they support, like shortest path finding and whatnot?

  • sorokod 7 hours ago

    Yes, the author is likely unaware of this. They see markdown files with links, so it's a "graph", and a set of those files, so it's a "database".

    https://neo4j.com/docs/graph-data-science/current/algorithms...

    • esafak 6 hours ago

      His argument is that the LLM is the query engine. By that logic you can approximate anything since LLMs can.

      • sorokod 6 hours ago

        Indeed, what is the point of links/edges when the llm can figure out the relations by itself?

    • lamasery 6 hours ago

      Neo4j looooooves the "if you think about it, everything is graphs!" marketing maneuver. They (their marketing department) were the very first thing I thought of when I read this headline.

      • zadikian 4 hours ago

        "Everything is graphs, so let's use a graph DBMS for anything" is a classic blunder

        • lamasery 3 hours ago

          I've seen it work to sell their product to managers who definitely should have gone with something else, so I get why they do it. It works.

  • volemo 6 hours ago

    I think the confusion stems from the fact that we call a database what is really a database management system.

    • altmanaltman 4 hours ago

      You confuse the raw fist with the master who calculates the shortest path to your destruction.

    • rzzzt 2 hours ago

      I'd just like to interject for a moment. What you're referring to as a database is in fact a database management system, or as I've recently taken to calling it, database plus management system.

kesor 2 hours ago

So you have some folders with markdown files ... which are insanely hard to query without a tool ... impossible to traverse via their relationships ... and you call that a graph database? WHAT?!

Clicked the link expecting to see some tool or method that actually allows graph-like queries and traversals on files in a file system; all I found was some rant about someone on the internet being wrong.

Waste of time.

Jayakumark 2 hours ago

Interesting approach, but how do you download Google Docs, XLS files, Slack threads, etc., and how are they saved in Obsidian? Are they all converted to markdown before saving, or summarized to extract key topics and then saved? What about images?

alxndr 2 days ago

> […] the knowledge base isn’t just for research. It’s a context engineering system. You’re building the exact input your LLM needs to do useful work.

> […] there’s a real difference between prompting “help me write a design doc for a rate limiting service” and prompting an LLM that has access to your project folder with six months of meeting notes, three prior design docs, the Slack thread where the team debated the approach, and your notes on the existing architecture.

bullen 7 hours ago

Yep, my distributed JSON over HTTP database uses the ext4 binary tree for indexing: http://root.rupy.se

It can only handle 3-way multiple cross-references by using 2 folders and a file (meta) for now, and it's very verbose on the disk (needs type=small, otherwise inodes run out before disk space)... but it's incredibly fast and practically unstoppable in read uptime!

Also, the simplicity of using text and the file system sort of guarantees longevity and stability, even if most people prefer the monolithic garbled mess that is relational databases' binary table formats...

itmitica 7 hours ago

I can spot over-engineering when I see it. And premature optimization.

Anyway, why care how the data is stored? You need a catalog. You need an index. You need automation. Helps keeping order and helps with inevitable changes and flips and pivots and whims and trends and moods and backups and restoration and snapshots and history and versioning and moon travels and collaboration and compatibility and long summer evening walks and portability.

aleksiy123 3 hours ago

I’ve been thinking about this in a couple of contexts, and this is pretty much how I’ve come to think about it.

Folders give you hierarchical categories.

You still want tags for horizontal grouping. And links and references for precise edges.

But that gives you a really nice foundation that should get you pretty damn far.

I also now tell the LLM to add a summary as the first section of the file if it's longer.

estetlinus 2 hours ago

I am more curious about the note-taking. How do you ingest data here? Export from Slack via LLMs? Store it in GitHub?

My “knowledge” is spread out on various SaaS (Google, slack, linear, notion, etc). I don’t see how I can centralize my “knowledge” without a lot of manual labour.

  • LocalPCGuy 2 hours ago

    Unless you're forced into using a certain tool (work, etc.), start by standardizing on a single tool. That's one reason a lot of people like Obsidian, but there are plenty of similar tools, or you can just write markdown in your editor of choice. Then set up some sort of sync so you have it everywhere you are (mobile can be a bit tricky for some setups) and commit to using that method as much as possible for your notes.

    You may want to do as described and link to Slack messages (etc), but just remember any external link should be treated as ephemeral. You may not have access to the Slack anymore, for example. That may mean you don't need that note either, or it may mean you lost access to a node on your knowledge graph, you have to determine whether that matters.

    By starting now, at least everything going forward is captured in a way you can both own and utilize it. Then it may be a bit of a pain and some manual work to get existing notes into your tool of choice, but you can determine what needs to be in there from other tools as you go forward.

zadikian 4 hours ago

On the other hand, I get why cloud drive users completely disregard file structure and search everything. Two files usually don't have the same name unless you're laying it out programmatically like this. I use dir trees for code ofc, but everything else is flat in my ~/Documents.

Deep inside a project dir, it feels like some of the ease of LLMs is just not having to cd into the correct directory, but you shouldn't need an LLM to do that. I'm gonna try setting up some aliases like "auto cd to wherever foo/main.py is" and see how that goes.

  • embedding-shape 4 hours ago

    > I use dir trees for code ofc, but everything else is flat in my ~/Documents.

    Which is great, but on all major OSes you'll eventually hit performance issues with flat directories like this. Might not be an issue in month one, or even year one, but after 10 years of note-taking/journaling that approach will show the problems with large flat directories.

    So eventually you'd need to shard it somehow, so you might as well start categorizing/sorting things from the get-go, at least into some broad major categories, because doing it once you already have 10K entries in a directory sucks big time.
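    To illustrate the kind of sharding this eventually forces, here's a minimal sketch (my own hypothetical helper, not from any tool mentioned in the thread) that splits a flat notes directory into per-year subdirectories based on each file's modification time:

```python
import shutil
import time
from pathlib import Path

def shard_by_year(directory: str) -> None:
    """Move files in a flat directory into YYYY/ subdirectories
    keyed on each file's modification time."""
    root = Path(directory)
    # Materialize the listing first so we don't iterate while moving
    for path in list(root.iterdir()):
        if path.is_file():
            year = time.strftime("%Y", time.localtime(path.stat().st_mtime))
            dest = root / year
            dest.mkdir(exist_ok=True)
            shutil.move(str(path), str(dest / path.name))
```

    Doing this retroactively is a one-off script; doing it from day one is just a habit.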

    • zadikian 1 hour ago

      If it's just performance, cd ~/Documents && mkdir old && mv ./* old/ (or today's date instead of old). I actually have that layout on one PC.

      If real organization is needed, it seems like that'd be easier in hindsight than with foresight.

      • embedding-shape 5 minutes ago

        So then you have one intentionally slow directory ("old/" in this case) and one fast directory?

        Personally I'd categorize stuff, but you do you, there really isn't any wrong way to do it, if it works it works :)

appsoftware 7 hours ago

I created AS Notes (https://www.asnotes.io) (an extension for VS Code, Antigravity, etc.) partly because of this use case. It works like Obsidian: markdown-based, with wikilinks, mermaid rendering, and task management. In VS Code, we have access to really good agent harnesses and can navigate our notes and documents in a file-system-like manner. Further, using AGENTS.md, idea files, etc., we can instruct the agent how to interact, add to our notes, and so on. I've found working with my notes like this really useful; provided I trim anything AI-generated that isn't going to be useful, it's an investment in the information I've gathered, since the information is retained in markdown rather than getting lost in multiple chatbot UIs.

stared 7 hours ago

A filesystem is a tree: a particular, constrained graph. Advanced topics usually require a lot of interconnections.

Maybe that is why mind maps never spoke to me. I felt that a tree structure (or even planar graphs) was not enough to cover any sufficiently complex topic.

  • nutjob2 7 hours ago

    If it has hard or soft links, it's a proper graph.

    • calgoo 7 hours ago

      That's what I was thinking! Instead of wiki links, use symlinks (I guess Windows would not like it?)

    • zahlman 7 hours ago

      On Linux at least, hard links can't be made to directories, except for the magic . and .. links. So this only allows for a DAG.

      Symbolic links can form a graph, and you can process them as needed using readlink etc. to traverse the graph, but they'll still be considered broken if they form a cycle.

      • Retr0id 6 hours ago

        Considered broken by what?

        • rleigh 5 hours ago

          Historically, it made deletion rather difficult, with some problematic edge cases. You could unlink a directory and create an orphan cycle that would never be deleted. Combine that with race conditions on multi-user systems, plus the indeterminate cost of cycle detection, and it turns out to be a rather complex problem to solve properly. Banning hard links to directories is a very simple way to keep the problem tractable, and it results in fast, robust, and reliable filesystem operations.

          • Retr0id 5 hours ago

            GP was talking about symlink cycles though, which can't produce orphans during deletion.

            • rleigh 5 hours ago

              True, I missed that. I suppose with symlinks you have the reverse problem: you can point to deleted filenames and then have broken links. Cycle detection is still an issue though: it has indeterminate complexity, and the graph can be modified as you are traversing it!

              • Retr0id 4 hours ago

                This is true, but just about everyone has a symlink cycle on their system at `/proc/self/root`, and for the most part nobody notices. Having a max recursion depth is usually more useful than actively trying to detect cycles.
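                A depth limit is easy to bolt onto a plain traversal. As a hedged sketch (a hypothetical helper, not how any particular tool does it), in Python:

```python
import os

def walk_following_symlinks(root: str, max_depth: int = 8):
    """Yield file paths under root, following directory symlinks,
    bounded by a depth limit instead of explicit cycle detection."""
    def _walk(path: str, depth: int):
        if depth > max_depth:
            return  # cycles (and very deep trees) simply bottom out here
        try:
            entries = os.scandir(path)
        except OSError:
            return  # unreadable or vanished directory
        with entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=True):
                    yield from _walk(entry.path, depth + 1)
                elif entry.is_file(follow_symlinks=True):
                    yield entry.path
    yield from _walk(root, 0)
```

                A symlink cycle just re-yields the same subtree a few times before the depth bound cuts it off, which is usually harmless for search-style workloads.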

      • PunchyHamster 29 minutes ago

        I guess technically you could do bind mounts but that's messy

  • vinaigrette 3 hours ago

    Isn't text a basic linear structure that can cover sufficiently complex topics ?

    • stared 3 hours ago

      Yes. And precisely for this reason reading a dictionary is not a way of learning a language.

itake 7 hours ago

I'm wondering though:

1. Why does AI need that folder structure? Why not a flat list of files and let the AI agent explore with BM25 / grep, etc.

2. pre-compute compression vs compute at query time.

Karpathy (and you) are recommending pre-compressing and sorting the data into human-friendly buckets and language, based on hard-coded human opinions about abstraction that may or may not match how the data will actually be queried.

Why not just let the AI calculate this at run time? Many of these use cases have very few files, and for a low-traffic knowledge store it probably costs fewer tokens if you only tokenize the files you need.
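For what it's worth, BM25 itself is cheap enough to compute at query time over a flat list of files. A self-contained sketch (hypothetical helper with simplified scoring, not from any tool in this thread):

```python
import math
import re
from collections import Counter

def bm25_rank(query: str, docs: dict, k1: float = 1.5, b: float = 0.75):
    """Rank documents against a query with BM25.
    docs maps filename -> file contents; returns (name, score) pairs,
    best match first."""
    tokenize = lambda text: re.findall(r"\w+", text.lower())
    tokened = {name: tokenize(text) for name, text in docs.items()}
    n = len(docs)
    avgdl = sum(len(t) for t in tokened.values()) / max(n, 1)
    df = Counter()  # document frequency per term
    for toks in tokened.values():
        df.update(set(toks))
    scores = {}
    for name, toks in tokened.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A few hundred lines of markdown per file and a few thousand files is well within what this runs over instantly, which is the core of the "compute at query time" argument.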

  • laurowyn 7 hours ago

    > Why does AI need that folder structure? Why not a flat list of files and let the AI agent explore with BM25 / grep, etc.

    It doesn't. The human creating the files needs it, to make it easier to traverse in the future as the file count grows. At 52k files, that's a horrendous list to scroll through to find the thing you're looking for. Meanwhile, an AI can just `find . -type f -exec whatever {} \;` and process it however it needs. The human doesn't need to change the way they work to appease the magic rock in the box under the desk.

    • itake 7 hours ago

      > The human creating the files needs it

      why? The human would just talk to the AI agent. Why would they need to scroll through that many files?

      I made a similar system with 232k files (one file might be a Slack message, a GitLab comment, etc.). It does a decent job at answering questions with only keyword search, but I think I can get better results with RAG+BM25.

      • laurowyn 7 hours ago

        And when the system fails for whatever reason?

        Just because AI exists doesn't mean we can neglect basic design principles.

        If we throw everything out the window, why don't we just name every file as a hash of its content? Why bother with ASCII names at all?

        Fundamentally, it's the human that needs to maintain the system and fix it when it breaks, and that becomes significantly easier if it's designed in a way a human would interact with it. Take the AI away, and you still have a perfectly reasonable data store that a human can continue using.

  • dgb23 5 hours ago

    > 1. Why does AI need that folder structure? Why not a flat list of files and let the AI agent explore with BM25 / grep, etc.

    Two reasons I think:

    Coding agents simulate similar things to what they have been trained on. Familiarity matters.

    And they tend to do much better the more obvious and clear a task is. The more they have to use tools or "thinking", the less reliable they get.

  • weitendorf 5 hours ago

    > Why does AI need that folder structure? Why not a flat list of files and let the AI agent explore with BM25 / grep, etc.

    Progressive disclosure: the same reason you don't get assaulted with all the information a website has to offer at once, or get handed a SQL console and told to figure it out, and instead see a portion of the information in a way that naturally leads you to the next bits of information you're looking for.

    > use cases

    This is essentially just where you're moving the hierarchy/compression, but at least for me these are not very disjoint and separable. I think what I actually want is adaptable LoRAs that loosely correspond to these use cases, with a dense discriminator or other system able to adapt and stay in sync with them. Also, tool-calling + SQL/vector embeddings, so that you can actually get good filesystem search without it feeling like work, and let the model filter out the junk.

    > let the AI calculate this at run time?

    You still do want to let it do agentic RAG but I think more tools are better. We're using sqlite-vec, generating multimodal and single-mode embeddings, and trying to make everything typed into a walkable graph of entity types, because that makes it much easier to efficiently walk/retrieve the "semantic space" in a way that generalizes. A small local model needs at least enough structure to know these are the X ways available to look for something and they are organized in Y ways, oriented towards Z and A things.

    Especially on-device, telling them to "just figure it out" is like dropping a toddler or autonomous vehicle into a dark room and telling them to build you a search engine lol. They need some help and also quite literally to be taught what a search engine means for these purposes. Also, if you just let them explore or write things without any kind of grounding in what you need/any kind of positive signals, they're just going to be making a mess on your computer.

WillAdams 9 hours ago

I've found a similar structure, along with a naming convention, useful at my day job. The big thing is that the names are such that when the text is copied as a filepath, the path and extension deleted, and underscores replaced by tabs, it can be pasted into a spreadsheet and summed up or otherwise manipulated.

In somewhat of an inversion, I've been getting the initial naming done by an LLM (well, I was, until Copilot imposed file upload limits and the new VPN blocked access to it). For want of that, I just name each scan by invoice ID, then use a .bat file made by concatenating columns in a spreadsheet to rename them to the initial state ready for entry.
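The underscore-to-tab trick above can be sketched in a few lines. This is my own illustration of the idea (the example filename is hypothetical, not WillAdams's actual tooling):

```python
from pathlib import Path

def filename_to_row(filepath: str) -> str:
    """Turn an underscore-delimited filename into a tab-separated
    row ready to paste into a spreadsheet: drop the directory and
    extension, then replace underscores with tabs."""
    return Path(filepath).stem.replace("_", "\t")
```

The payoff is that the filesystem doubles as a flat table: each filename is a record, and a column of such names pastes straight into spreadsheet cells.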

game_the0ry 3 hours ago

There is for sure a "second brain" product hiding in plain sight for one of the frontier AI companies. Google/Gemini should be all over this right now.

mzelling 6 hours ago

If I understand this right, the difference between the author's suggested approach and simply chatting with an AI agent over your files is hyperlinks: if your files contain links to other relevant files, the agent has an easier time identifying relevant material.
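As a concrete illustration of those links as graph edges: if the notes use Obsidian-style wikilinks, extracting a note's outgoing edges is nearly a one-liner. This is a hedged sketch assuming `[[target|alias]]` / `[[target#heading]]` syntax, not code from the article:

```python
import re

# Capture the link target, stopping at ']', '|' (alias), or '#' (heading)
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def outgoing_links(markdown: str) -> list:
    """Extract [[wikilink]] targets from a note, ignoring any
    |alias or #heading suffix inside the link."""
    return [m.strip() for m in WIKILINK.findall(markdown)]
```

Running this over every file gives the adjacency list of the knowledge graph, which is exactly what the agent walks when it follows a link.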

themafia 54 minutes ago

Sure. It just fails to be atomic. Which is a property I really like.

exossho 8 hours ago

I can't remember how many file structures I've already tried... LLMs seem to be a great help here. Also used CC to organize my messy harddrive.

Now just need to find a good way to maintain the order...

  • freedomben 8 hours ago

    > Also used CC to organize my messy harddrive.

    Do you still have your prompt by chance, and would you be willing to share it? I took a stab at this and it didn't want to make many changes. I think I need to be more specific, but am not sure how to do that in a general way.

    • exossho 6 hours ago

      I don't have the exact prompt anymore, but it was very lean. I first asked it to do an assessment: "Review the content of the whole folder structure. I want you to assess it, and suggest a better setup and structure based on its content. Don't change anything yet, just assess"

      and then worked from there, giving feedback on the proposed folder structure, until I was happy

bhewes 4 hours ago

Wow, just strings in files. Are you jumping node to node via pointers, index-free?