I can’t believe no one’s mentioned the Harry Potter fanfic written by a Markov Chain. If you’re familiar with HP, I highly recommend reading Harry Potter and the Portrait of What Looked Like a Large Pile of Ash.
Here’s a link: https://botnik.org/content/harry-potter.html
Cool article, it got me to play around with Markov models, too! I first did a Markov model over plain characters.
> Itheve whe oiv v f vidleared ods alat akn atr. s m w bl po ar 20
Using pairs of consecutive characters (order-2 Markov model) helps, but not much:
> I hateregratics.pyth fwd-i-sed wor is wors.py < smach. I worgene arkov ment by compt the fecompultiny of 5, ithe dons
Triplets (order 3) are a bit better:
> I Fed tooks of the say, I just train. All can beconsist answer efferessiblementate
> how examples, on 13 Debian is the more M-x: Execute testeration
LLMs usually do some sort of tokenization step prior to learning parameters. So I decided to try out order-1 Markov models over text tokenized with byte pair encoding (BPE).
Trained on TFA I got this:
> I Fed by the used few 200,000 words. All comments were executabove. This value large portive comment then onstring takended to enciece of base for the see marked fewer words in the...
Then I bumped up the order to 2:
> I Fed 24 Years of My Blog Posts to a Markov Model
> By Susam Pal on 13 Dec 2025
>
> Yesterday I shared a little program calle...
It just reproduced the entire article verbatim. This makes sense: run to completion on a single document, BPE keeps merging until no pair of adjacent tokens repeats, so every order-2 context occurs at most once and the transitions become fully deterministic.
I've heard that in NLP applications, it's very common to run BPE only up to a certain number of different tokens, so I tried that out next.
Before limiting, BPE produced a vocabulary of 894 distinct tokens. Even a slight limit (800) stops the output from being deterministic.
> I Fed 24 years of My Blog Postly coherent. We need to be careful about not increasing the order too much. In fact, if we increase the order of the model to 5, the generated text becomes very dry and factual
It's hard to judge how coherent the text is vs the author's trigram approach because the text I'm using to initialize my model has incoherent phrases in it anyways.
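For anyone who wants to play along, the order-k chain itself is only a few lines. A minimal sketch (assuming "tokens" is the BPE-tokenized corpus, e.g. from a tokenizer capped at ~800 merges as above):

    import random
    from collections import defaultdict

    def build_chain(tokens, order):
        # map each length-`order` context to every token observed right after it
        chain = defaultdict(list)
        for i in range(len(tokens) - order):
            chain[tuple(tokens[i:i + order])].append(tokens[i + order])
        return chain

    def generate(chain, order, length=200):
        out = list(random.choice(list(chain.keys())))  # random starting context
        for _ in range(length):
            followers = chain.get(tuple(out[-order:]))
            if not followers:
                break  # this context only appeared at the very end of the corpus
            out.append(random.choice(followers))
        return out

Everything BPE-specific happens outside this: training the tokenizer and joining the generated tokens back into text.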
Anyways, Markov models are a lot of fun!
Nice :) I did something similar a few days ago. What I ended up with was a 50/50 blend of hilarious nonsense, and verbatim snippets. There seemed to be a lot of chains where there was only one possible next token.
I'm considering just deleting all tokens that have only one possible descendant, from the db. I think that would solve that problem. Could increase that threshold to, e.g. a token needs to have at least 3 possible outputs.
However that's too heavy handed: there's a lot of phrases or grammatical structures that would get deleted by that. What I'm actually trying to avoid is long chains where there's only one next token. I haven't figured out how to solve that though.
That's where a dynamic n-gram comes into play. Train the Markov model on n-grams of order 1 to 5, and then scale the order according to the number of potential paths available.
You'll also need a "sort of traversal stack" so you can rewind if you get stuck several plies in.
The trick to prevent 'dry' output that quotes verbatim is to make the 5-word limit flexible: if there is only one path, reduce it to 4.
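A minimal sketch of that backoff, assuming chains[k] maps each k-token context to the list of next tokens observed after it:

    import random

    def next_token(chains, history, max_order=5, min_branches=2):
        # start at the highest order and back off while the context is (near-)deterministic
        for order in range(max_order, 0, -1):
            followers = chains[order].get(tuple(history[-order:]))
            if followers and (len(set(followers)) >= min_branches or order == 1):
                return random.choice(followers)
        return None  # no context matched at any order

The min_branches knob is the same idea as the "at least 3 possible outputs" threshold mentioned above, except applied per lookup instead of by deleting entries from the db.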
I have a pet tool I use for conlang work for writing/worldbuilding that is built on Markov chains and I am smacking my forehead right now at how obvious this seems in hindsight. This is great advice, thank you.
I did something similar many years ago. I fed about half a million words (two decades of mostly fantasy and science fiction writing) into a Markov model that could generate text using a “gram slider” ranging from 2-grams to 5-grams.
I used it as a kind of “dream well” whenever I wanted to draw some muse from the same deep spring. It felt like a spiritual successor to what I used to do as a kid: flipping to a random page in an old 1950s Funk & Wagnalls dictionary and using whatever I found there as a writing seed.
Curious if you've heard of or participated in NaNoGenMo[0] before. With such a corpus at your fingertips, it could be a fun little project; obviously, pure Markov generation wouldn't be quite sufficient, but it's a good starting point maybe.
[0]: https://nanogenmo.github.io/
Hey that's neat! I hadn't heard of it. It says you need to publish the novel and the source at the end - so I guess as part of the submission you'd include the RNG seed.
The only thing I'm a bit wary of is the submission size - a minimum of 50,000 words. At that length, it'd be really difficult to maintain a cohesive story without manual oversight.
I gave a talk in 2015 that did the same thing with my tweet history (about 20K at the time) and how I used it as source material for a Twitter bot that could reply to users. [1]
It was pretty fun!
[1] https://youtu.be/rMmXdiUGsr4
What a fantastic idea. I have about 30 years of writing, mostly chapters and plots for novels that did not coalesce. I'd love to know how it turns out too.
Terry Davis, pbuh, did something very similar!
What would the equivalent be with LLMs?
I spend all of my time with image and video models and have very thin knowledge when it comes to running, fine tuning, etc. with language models.
How would one start with training an LLM on the entire corpus of one's writings? What model would you use? What scripts and tools?
Has anyone had good results with this?
Do you need to subsequently add system prompts, or does it just write like you out of the box?
How could you make it answer your phone, for instance? Or discord messages? Would that sound natural, or is that too far out of domain?
Simplest way: pack all the text into a prompt.
You could use a vector database.
You could train a model from scratch.
Probably easiest to use OpenAI tools. Upload documents. Make custom model.
How do you make it answer your phone? You could use the Twilio API + a script + an LLM + a voice model. If you want it to sound natural, use a service.
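For the prompt-packing route, a rough sketch with the OpenAI Python client (the model name and file are placeholders, and with years of writing you'll hit the context limit quickly):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    corpus = open("all_my_posts.txt").read()  # hypothetical dump of your writing

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you have access to
        messages=[
            {"role": "system",
             "content": "Write replies in the style and voice of the author of these posts:\n\n"
                        + corpus[:100_000]},  # crude truncation to stay under the limit
            {"role": "user", "content": "Reply to this message: ..."},
        ],
    )
    print(resp.choices[0].message.content)

The vector-database route is the same idea, except you retrieve only the most relevant posts per query instead of stuffing everything in.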
I think you're absolutely right about the easiest approach. I hope you don't mind me asking for a bit more difficulty.
Wouldn't fine tuning produce better results so long as you don't catastrophically forget? You'd preserve more context window space, too, right? Especially if you wanted it to memorize years of facts?
Are LoRAs a thing with LLMs?
Could you train certain layers of the model?
A good place to start with your journey is this guide from Unsloth:
https://docs.unsloth.ai/get-started/fine-tuning-llms-guide
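To the LoRA questions: yes, LoRA is the standard way to fine-tune an LLM on a personal corpus without touching most of the weights, and it effectively is "training certain layers" (small adapters on the attention projections while the base stays frozen). A minimal sketch with the peft library; the base model and target modules here are just common placeholder choices, not a recommendation:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-3.2-1B"  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora = LoraConfig(
        r=16,                                 # adapter rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # which layers get adapters
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the weights

    # from here it's an ordinary causal-LM training loop over your tokenized
    # writing, e.g. with the transformers Trainer or TRL's SFTTrainer

Because the base weights stay frozen, catastrophic forgetting tends to be less of a problem than with full fine-tuning, though the Unsloth guide above covers the practical details better than a comment can.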
Did it work?
So that's the key difference. A lot of people train these Markov models with the expectation that they're going to be able to use the generated output in isolation.
The problem with that is either your n-gram level is too low in which case it can't maintain any kind of cohesion, or your n-gram level is too high and it's basically just spitting out your existing corpus verbatim.
For me, I was more interested in something that could potentially combine two or three highly disparate concepts found in my previous works into a single outputted sentence - and then I would ideate upon it.
I haven't opened the program in a long time, so I just spun it up and generated a few outputs:
> A giant baby is navel corked which if removed causes a vacuum.
I'm not sure what the original pieces of text were based on that particular sentence, but it starts making me think about a kind of strange void Harkonnen with heart plugs that lead to weird negatively pressurized areas. That's the idea behind the dream well.
The one author that I think we have a good chance of recreating would be Barbara Cartland. She wrote 700+ romance novels, all pretty much the same. It should be possible to generate another of her novels given that large a corpus.
I'm not sure how we'd know. My wife sometimes buys and rereads a novel she's already finished.
I recall a Markov chain bot on IRC in the mid 2000s. I didn't see anything better until gpt came along!
Yes, I made one using bitlbee back in the 2000s, good times!
I made one for Hipchat at a company. I can't remember if it could emulate specific users, or just channels, but both were definitely on my roadmap at the time.
I'm hoping someone can find it so I can bookmark it, but I once read a story about a company that let multiple Markov chain bots loose in a Slack channel. A few days later, production went down because one of them ran a Slack command that deployed or destroyed their infrastructure.
First of all: Thank you for giving.
Giving 24 years of your experience, thoughts, and lifetime to us.
This is special in these times of wondering, baiting and consuming only.
Really fascinating how you can get such intriguing output from such a simple system. Prompted me to give it a whirl with the content on my own site.
https://vale.rocks/micros/20251214-0503
In 2020, a friend and I did this with our mutual WhatsApp chat history.
Except we fine-tuned GPT-2 instead. (As was the fashion at the time!)
We used this one, I think https://github.com/minimaxir/gpt-2-simple
I think it took 2-3 hours on my friend's Nvidia something.
The result was absolutely hilarious. It was halfway between a markov chain and what you'd expect from a very small LLM these days. Completely absurd nonsense, yet eerily coherent.
Also, it picked up enough of our personality and speech patterns to shine a very low resolution mirror on our souls...
###
Andy: So here's how you get a girlfriend:
1. Start making silly faces
2. Hold out your hand for guys to swipe
3. Walk past them
4. Ask them if they can take their shirt off
5. Get them to take their shirt off
6. Keep walking until they drop their shirt
Andy: Can I state explicitly this is the optimal strategy
That's funny! Now imagine you're using Signal on iOS instead of WhatsApp. You cannot do this with your chat history because Signal won't let you access your own data outside of their app.
Here's a quick custom Markov page you can have fun with (all client-side): https://aperocky.com/markov/
npm package of the markov model if you just want to play with it on localhost/somewhere else: https://github.com/Aperocky/weighted-markov-generator
I think this is more correctly described as a trigram model than a general Markov model. If it would naturally expand to 4-grams when they were available, etc., the text would look more coherent.
IIRC there was some research on "infini-gram", a very large n-gram model, that allegedly got performance close to LLMs in some domains a couple of years back.
When the order argument is cranked up to 4, it looks to the average LLMvangelist like it is thinking.
I just realized one of the things that people might start doing is making a gamma model of their personality. It won't even approach who they were as a person, but it will give their descendants (or bored researchers) a 60% approximation of who they were and their views. (The 60% is pulled from nowhere to justify my gamma designation, since there isn't a good scale for personality-mirror quality for LLMs as far as I'm aware.)
"Dixie can't meaningfully grow as a person. All that he ever will be is burned onto that cart;"
"Do me a favor, boy. This scam of yours, when it's over, you erase this god-damned thing."
Damn interesting!
Now I wonder how it would compare vs feeding it into a GPT-style transformer of a similar order of magnitude in param count.
I thought for a moment your comment was the output of a Markov chain trained on HN
No mention of Rust or gut bacteria. Definitely not.
That's the question today. Turns out transformers really are a leap forward in terms of AI, whereas Markov chains, scaled up to today's level of resources and capacity, will still output gibberish.
When I was in college my friends and I did something similar with all of Donald Trump’s tweets as a funny hackathon project for PennApps. The site isn’t up anymore (RIP free heroku hosting) but the code is still up on GitHub: https://github.com/ikhatri/trumpitter
Megahal/Hailo (cpanm -n hailo for Perl users) can still be fun too.
Usage:
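(From memory, so double-check against perldoc hailo, but the invocation is roughly:)

    # train a brain from the corpus, then chat with it interactively
    hailo --train corpus.txt --brain corpus.sqlite
    hailo --brain corpus.sqlite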
Where "corpus.txt" should be a file with one sentence per line. Easy to do under sed/awk/perl. This spawns the chatbot with your trained brain.By default Hailo chooses the easy engine. If you want something more "realistic", pick the advanced one mentioned at 'perldoc hailo' with the -e flag.
I usually have these technical hypothetical discussions with ChatGPT, I can share if you like, me asking him this: aren't LLMs just huge Markov Chains?! And now I see your project... Funny
> I can share if you like
Respectfully, absolutely nobody wants to read a copy-and-paste of a chat session with ChatGPT.
When you say nobody you mean you, right? You can't possibly be answering for every single person in the world.
I was having a discussion about similarities between Markov Chains and LLMs, and shortly after, I found this topic on HN. When I wrote "I can share if you like" it was as proof of the coincidence.
Don't know what happened. I stumbled onto a funny coincidence - me talking to an LLM about its similarities with MC - and decided to share on a post about using MC to generate text. Got some nasty comments and a lot of downvotes, even though my comment sparked a pretty interesting discussion.
Hate to be that guy, but I remember this place being nicer.
Ever since LLMs became popular, there's been an epidemic of people pasting ChatGPT output onto forums (or in your case, offering to). These posts are always received similarly to yours, so I'm skeptical that you're genuinely surprised by the reaction.
Everyone has access to ChatGPT. If we wanted its "opinion" we could ask it ourselves. Your offer is akin to "Hey everyone, want me to Google this and paste the results page here?". You would never offer to do that. Ask yourself why.
These posts are low-effort and add nothing to the conversation, yet the people who write them seem to expect everyone to be impressed by their contribution. If you can't understand why people find this irritating, I'm not sure what to tell you.
LLMs are indeed Markov chains. The breakthrough is that we are able to efficiently compute well performing probabilities for many states using ML.
LLMs are not Markov Chains unless you contort the meaning of a Markov Model State so much you could even include the human brain.
Not sure why that's contorting. A Markov model is anything where you know the probability of going from state A to state B, and the state can be anything. When it's text generation, the transition is from the previous text to that text with an extra character appended, which is true for both LLMs and old-school n-gram Markov models.
Yes, technically you can frame an LLM as a Markov chain by defining the "state" as the entire sequence of previous tokens. But this is a vacuous observation under that definition, literally any deterministic or stochastic process becomes a Markov chain if you make the state space flexible enough. A chess game is a "Markov chain" if the state includes the full board position and move history. The weather is a "Markov chain" if the state includes all relevant atmospheric variables.
The problem is that this definition strips away what makes Markov models useful and interesting as a modeling framework. A “Markov text model” is a low-order Markov model (e.g., n-grams) with a fixed, tractable state and transitions based only on the last k tokens. LLMs aren’t that: they model using un-fixed long-range context (up to the window). For Markov chains, k is non-negotiable. It's a constant, not a variable. Once you make it a variable, near any process can be described as markovian, and the word is useless.
Sure many things can be modelled as Markov chains, which is why they're useful. But it's a mathematical model so there's no bound on how big the state is allowed to be. The only requirement is that all you need is the current state to determine the probabilities of the next state, which is exactly how LLMs work. They don't remember anything beyond the last thing they generated. They just have big context windows.
The whole point of the "Markov property" is that the next state depends only on the current state, not on the history.
And in classes, the very first trick you learn to skirt around history is to add Boolean variables to your "memory state". Your system now models "did it rain on each of the previous N days?" The issue, obviously, is that this is exponential if you're not careful. Maybe you can get clever by just making your state a "sliding window history", then it's linear in the number of days you remember. Maybe mix the two. Maybe add even more information. Tradeoffs, tradeoffs.
I don't think LLMs embody the markov property at all, even if you can make everything eventually follow the markov property by just "considering every single possible state". Of which there are (size of token set)^(length) states at minimum because of the KV cache.
The KV cache doesn't affect it because it's just an optimization. LLMs are stateless and don't take any other input than a fixed block of text. They don't have memory, which is the requirement for a Markov chain.
Have you ever actually worked with a basic markov problem?
The Markov property states that the transition probabilities to the next state depend entirely on the current state.
These states inhabit a state space. The way you encode "memory" if you need it, e.g. say you need to remember whether it rained on each of the last 3 days, is by expanding said state space: you'd go from tracking 1 day to tracking 3, i.e. 2^3 states if you need the precise binary information for each day. Being "clever", maybe you assume only the number of days it rained in the past 3 days matters, and you can get away with a 'linear' amount of memory.
Sure, a LLM is a "markov chain" of state space size (# tokens)^(context length), at minimum. That's not a helpful abstraction and defeats the original purpose of the Markov observation. The entire point of the Markov observation is that you can represent a seemingly huge predictive model with just a couple of variables in a discrete state space, and ideally you're the clever programmer/researcher who can significantly collapse said space by being, well, clever.
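To make the rain example concrete, here's the textbook construction in a few lines (toy probabilities, just to show where the memory goes and how the state space blows up):

    import itertools
    import random

    # State = outcomes of the last 3 days, so 2**3 = 8 states;
    # k days of history means 2**k states, which is the blowup in question.
    STATES = list(itertools.product([0, 1], repeat=3))

    def step(state):
        # Today's rain probability depends on how wet the last 3 days were.
        # That "memory" lives entirely inside the state, so the chain is still
        # Markov: the next state depends only on the current one.
        p_rain = 0.2 + 0.2 * sum(state)
        rained_today = int(random.random() < p_rain)
        return state[1:] + (rained_today,)

    state = random.choice(STATES)
    for _ in range(10):
        state = step(state)

Framing an LLM the same way needs a state for every possible context window, which is the (# tokens)^(context length) figure above.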
Are you deliberately missing the point or what?
> Sure, a LLM is a "markov chain" of state space size (# tokens)^(context length), at minimum.
Okay, so we're agreed.
>Sure many things can be modelled as Markov chains
Again, no they can't, unless you break the definition. K is not a variable. It's as simple as that. The state cannot be flexible.
1. The markov text model uses k tokens, not k tokens sometimes, n tokens other times and whatever you want it to be the rest of the time.
2. A Markov model is explicitly described as 'assuming that future states depend only on the current state, not on the events that occurred before it'. Defining your 'state' such that every event imaginable can be captured inside it is a 'clever' workaround, but it is ultimately describing something that is decidedly not a Markov model.
It's not n sometimes, k tokens some other times. LLMs have fixed context windows; you just sometimes have less text, so the window isn't full. They're pure functions from a fixed-size block of text to a probability distribution over the next token, same as the classic lookup-table n-gram Markov chain model.
1. A context limit is not a Markov order. An n-gram model’s defining constraint is: there exists a small constant k such that the next-token distribution depends only on the last k tokens, full stop. You can't use a k-trained markov model on anything but k tokens, and each token has the same relationship with each other regardless. An LLM’s defining behavior is the opposite: within its window it can condition on any earlier token, and which tokens matter can change drastically with the prompt (attention is content-dependent). “Window size = 8k/128k” is not “order k” in the Markov sense; it’s just a hard truncation boundary.
2. “Fixed-size block” is a padding detail, not a modeling assumption. Yes, implementations batch/pad to a maximum length. But the model is fundamentally conditioned on a variable-length prefix (up to the cap), and it treats position 37 differently from position 3,700 because the computation explicitly uses positional information. That means the conditional distribution is not a simple stationary “transition table” the way the n-gram picture suggests.
3. “Same as a lookup table” is exactly the part that breaks. A classic n-gram Markov model is literally a table (or smoothed table) from discrete contexts to next-token probabilities. A transformer is a learned function that computes a representation of the entire prefix and uses that to produce a distribution. Two contexts that were never seen verbatim in training can still yield sensible outputs because the model generalizes via shared parameters; that is categorically unlike n-gram lookup behavior.
I don't know how many times I have to spell this out for you. Calling LLMs markov chains is less than useless. They don't resemble them in any way unless you understand neither.
I think you're confusing Markov chains and "Markov chain text generators". A Markov chain is a mathematical structure where the probabilities of going to the next state only depend on the current state and not the previous path taken. That's it. It doesn't say anything about whether the probabilities are computed by a transformer or stored in a lookup table, it just exists. How the probabilities are determined in a program doesn't matter mathematically.
Just a heads-up: this is not the first time somebody has to explain Markov chains to famouswaffles on HN, and I'm pretty sure it won't be the last. Engaging further might not be worth it.
A GPT model would be modelled as an n-gram Markov model where n is the size of the context window. This is slightly useful for getting some crude bounds on the behaviour of GPT models in general, but is not a very efficient way to store a GPT model.
I'm not saying it's an n-gram Markov model or that you should store them as a lookup table. Markov models are just a mathematical concept that don't say anything about storage, just that the state change probabilities are a pure function of the current state.
Well LLMs aren't human brains, unless you contort the definition of matrix algebra so much you could even include them.
Markov models with more than 3 words of "context window" produce very unoriginal text in my experience (the corpus I used had almost 200k sentences, almost 3 million words), matching the OP's experience. These are by no means large corpora, but I know the problem isn't going away with a larger corpus.[1] The Markov chain will wander into "valleys" where it reproduces paragraphs of its corpus word for word, because it keeps stumbling onto 4-word sequences it has only seen once. This is because the 4 words form a token, not a context window; Markov chains don't have what LLMs have.
If you use syllable-level tokens in a Markov model, the model can't form real words much beyond the second syllable, and you have no way of making it make more sense other than increasing the token size, which exponentially decreases originality. This is the simplest way I can explain it, though I had to address why scaling doesn't work.
[1] There are roughly 400,000^4 possible 4-word sequences in English (barring grammar), meaning only a corpus with 8 times that many words, and with no repetition, could offer two ways to continue each possible 4-word sequence.
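For what it's worth, this is easy to measure on your own corpus: count how many distinct continuations each 4-word context actually has (quick sketch, with "words" being the tokenized corpus):

    from collections import defaultdict

    def single_continuation_fraction(words, order=4):
        followers = defaultdict(set)
        for i in range(len(words) - order):
            followers[tuple(words[i:i + order])].add(words[i + order])
        # fraction of contexts with exactly one observed continuation
        single = sum(1 for nxt in followers.values() if len(nxt) == 1)
        return single / len(followers)

I'd expect this to come out very close to 1 at order 4 on a corpus of a few million words, which is exactly the "valley" behaviour described above.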
Yeah, there's only two differences between using Markov chains to predict words and LLMs:
* LLMs don't use Markov chains
* LLMs don't predict words
* Markov chains have been used to predict syllables or letters since the beginning, and an LLM's tokenizer could be used for Markov chains
* The R package markovchain[1] may look like it's using Markov chains, but it's actually using the R programming language, zeros and ones.
[1] https://cran.r-project.org/web/packages/markovchain/index.ht...
...are you under the impression that you have an exclusive relationship with "him"? Everyone else has access to ChatGPT too.
Yes. Yes I was. Thank you for the wake up call. I was under the impression that he was talking only to me.