Generating pseudo random text with Markov chains using Python

by shabda on June 16, 2009

First the definition from Wolfram

A Markov chain is collection of random variables {X_t} (where the index t runs through 0, 1, …) having the property that, given the present, the future is conditionally independent of the past.

Wikipedia is a little clearer

…Markov chain is a stochastic process with markov property … [Which means] state changes are probabilistic, and future state depend on current state only.

Markov chains have various uses, but now let’s see how it can be used to generate gibberish, which might look legit.

The algorithm is,

  1. Have a text which will serve as the corpus from which we choose the next transitions.
  2. Start with two consecutive words from the text. The last two words constitute the present state.
  3. Generating next word is the markov transition. To generate the next word, look in the corpus, and find which words are present after the given two words. Choose one of them randomly.
  4. Repeat 2, until text of required size is generated.

The code for this is

To see a sample output, we take the text of My man jeeves by Wodehouse from Project Gutenberg, and see a sample output.

In [1]: file_ = open('/home/shabda/jeeves.txt')

In [2]: import markovgen

In [3]: markov = markovgen.Markov(file_)

In [4]: markov.generate_markov_text()
Out[4]: 'Can you put a few years of your twin-brother Alfred,
who was apt to rally round a bit. I should strongly advocate
the blue with milk'

[The files you need to run this are jeeves.txt and markovgen.py]

How is this a markov algorithm?

  • The last two words are the current state.
  • Next word depends on last two words only, or on present state only.
  • The next word is randomly chosen from a statistical model of the corpus.

Here is some sample text.

“The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead.”

This gives us a corpus like,

{('The', 'quick'): ['brown'],
 ('brown', 'fox'): ['jumps', 'who', 'who'],
 ('fox', 'jumps'): ['over'],
 ('fox', 'who'): ['is', 'is'],
 ('is', 'slow'): ['jumps'],
 ('jumps', 'over'): ['the', 'the'],
 ('over', 'the'): ['brown', 'brown'],
 ('quick', 'brown'): ['fox'],
 ('slow', 'jumps'): ['over'],
 ('the', 'brown'): ['fox', 'fox'],
 ('who', 'is'): ['slow', 'dead.']}

Now if we start with “brown fox”, the next word can be “jumps” or “who”. If we choose “jumps”, then the current state is “fox jumps” and next word is over, and so on.

Hints

  • The larger text we choose, the more choices are there at each transition, and a better looking text is generated.
  • The state can be set to depend on one word, two words, or any number of words. As the number of words in each state increases, the generated text becomes less random.
  • Don’t strip the punctuations etc. They lead to a more representative corpus, and a better random text.

Resources


Did you know that markov chains have many other uses in finance, statistics and mathematics? In fact Google’s page rank algorithm is essentially a markov chain of the web! We publish articles on Algorithms, Python, Django and related technologies every week. Why not Subscribe to us or follow us on twitter?

Related posts:

  1. Finding keywords using Python
  2. Generating PDFs with Django
  3. Constraint programming in Python
  4. Beginning python
  5. Python metaclasses and how Django uses them

1 Comment 4 Comments 6 Comments

{ 6 trackbacks }

Achtung: Geekiges Zeuch – Der Schockwellenreiter
June 17, 2009 at 1:28 am
Oliver Gassner
June 17, 2009 at 4:06 am
links for 2009-06-18 « Where is my towel?
June 18, 2009 at 2:03 am
A Drop In The Stream › links for 2009-06-23
June 23, 2009 at 11:05 pm
Generating pseudo random text with Markov chains using Python — The Uswaretech Blog – Django Web Development « Foo = Bar
July 26, 2009 at 2:52 pm
Python Markov Chains and how to use them | Evan Fosmark
November 4, 2009 at 9:34 pm

{ 16 comments… read them below or add one }

1 Paddy3118 June 23, 2009 at 4:17 pm

You don’t seem to use the relative frequenccy of next words to weight the random choice?

  • Paddy.
2 shabda June 24, 2009 at 8:01 am

Paddy: The words are added to the list the number of times they appear. For example, a key might be

(‘milk’, ‘honey’):['story', 'book', 'book'], ……..

When you do a random.choice, book has twice the probability of story, hence it is frequency weighted.

3 Paddy3118 June 24, 2009 at 2:11 pm

Thanks, I missed that!

4 sonu July 22, 2009 at 7:59 am

Markov chain has been the collection of random variables ,it is also used to generate gibberish ,actually markov chain depends on the present and future.It has the only one who can make a possible word of future.

5 brian metz August 18, 2009 at 4:43 pm

do you remember a program from waaaaay back called “text mangler”? it used markov chains (i guess), and it was almost creepy how the sentences came out. it was hours of pointless and amazing fun… i haven’t been able to get the same results with all the other newer generators… thanks for this – i’m going to get all texty with this tonight!!!!!

6 aduric June 16, 2009 at 7:22 am

NLTK has a pretty good text generation library. It generates text based on any corpus you give it.

This comment was originally posted on Reddit

7 jcsalterego June 16, 2009 at 10:21 am

Not ground-breaking by any means, but a nice, concrete application of Markov chains.

This comment was originally posted on Hacker News

8 davidw June 16, 2009 at 10:26 am

This is relevant, if you want to do something fancier:http://megahal.alioth.debian.org/How.html

BTW, has anything better than Megahal turned up in the meantime, that is also open source?

This comment was originally posted on Hacker News

9 davidw June 16, 2009 at 10:26 am

This is relevant, if you wan to do something fancier:http://megahal.alioth.debian.org/How.html

BTW, has anything better than Megahal turned up in the meantime, that is also open source?

This comment was originally posted on Hacker News

10 Dylan Richard June 16, 2009 at 10:39 am

Markov chains can be hilarious.

This comment was originally posted on FriendFeed

11 Caligula June 16, 2009 at 2:02 pm

This is a big reason why speech recognition works. You have a list of probabilities of sequence of words occuring(language model) and it decides on the next word based on the last few previous words.So if you have the word Super, odds are the next word is Man and not Microphone.

This comment was originally posted on Hacker News

12 alxv June 16, 2009 at 2:19 pm

In "The Practice of Programming", Brian Kernighan and Rob Pike use the Markov Chain Algorithm to illustrate a number of lessons about software design and implementation. I used it as my "Hello world" program when learning a new programming language.In Python, the implementation is surprisingly succinct:

import sys import random import collections MAXGEN = 10000 NONWORD = "" suffixchain = collections.defaultdict(list) w1 = w2 = NONWORD for line in sys.stdin: for word in line.split(): suffixchain[(w1, w2)].append(word) w1, w2 = w2, word suffixchain[(w1, w2)].append(NONWORD) w1 = w2 = NONWORD for i in range(MAXGEN): suffixes = suffixchain[(w1, w2)] next = random.choice(suffixes) if next is NONWORD: break print next, w1, w2 = w2, nextTo use it, just run: $ python markov.py < english_corpus.txt

This comment was originally posted on Hacker News

13 SuperMicrophone June 16, 2009 at 2:36 pm

I resent that.

This comment was originally posted on Hacker News

14 DarkXanthos June 16, 2009 at 2:48 pm

This should be called how to use a library in Python. Pretty weak IMO.

This comment was originally posted on Reddit

15 DarkXanthos June 16, 2009 at 3:31 pm

Here’s one I wrote that actually shows how they’re coded: http://justinbozonier.posterous.com/markov-chains-in-python

This comment was originally posted on Reddit

16 fas2 June 27, 2009 at 6:48 am

I once was given homework by the information theory lecture to implement such a random text generator (k-th order markov chain, k configurable). As I just started with python I thought it would be fun to do that in a one liner. It happened, though it was a really long line and a total lambda mess.

This comment was originally posted on Reddit

Leave a Comment

Additional comments powered by BackType

Previous post:

Next post: