December 5, 2012

Exploit Information Leaks in Random Numbers from Python, Ruby and PHP

by with 13 comments

The Mersenne Twister (MT 19937) is a pseudorandom number generator, used by Python and many other languages like Ruby, and PHP. It is known to pass many statistical randomness tests, but it’s also known to be not cryptographically secure. The Python documentation is clear on this point, describing it as “completely unsuitable for cryptographic purposes.” Here we will show why.

When you are able to predict pseudorandom numbers, you can predict session ids, randomly generated passwords or encryption keys and know all the cards in online poker games, or play “Asteroids” better than legally possible.

Many sources already showed that it’s easy to rebuild the internal state of the MT by using 624 consecutive outputs. But this alone isn’t a practical attack, it’s unlikely that you have access to the whole output. In this post I’ll demonstrate how to restore its internal state by using only parts of its output. This will allow us to know all previous and future random number generation.

With every 32bit output the MT directly exposes 32 bit of it’s internal state (only slightly and reversibly modified by the tempering function). After each round of 624 outputs, the internal state of the Mersenne Twister is “twisted” itself: All bits are XOR’d with several other bits. In fact the Mersenne Twister is just a big XOR machine: All its output can be expressed by an sequence of XORs of the initial state bits.

Python always combines two outputs into a 64bit integer before returning them as random integers. So each call of random.randint(0,255) gives you only 8 bits out of two 32 bit Mersenne Twister outputs. Since the tempering function already mixed the 32 bits outputs, it’s not possible anymore to directly recover internal state bits out of only the 8 bits.

I was curious if it’s hard to recover the internal MT state by using only the output of a function like this:

def random_string(length):
    return "".join(chr(random.randint(0, 255)) for i in xrange(length))

Since the internal state of the Mersenne Twister consists out of 19968 bits we will need at least ~2.5KB of output to recover the internal state. In fact I needed ~3.3kb, probably because of redundant output information. Also possible is a bug in my POC implementation :)

You can find the result on github.

How does it work?

First I named the initial state with variables s0…s19967. The initial state looks like this:

Internal state bit Value
0 s0
1 s1
19967 s19967

Now the first output of the Mersenne Twister is a combination of the first 32 bits (combined by the tempering function):

Output-Bit Value
o0 s0 xor s4 xor s7 xor s15
o1 s1 xor s5 xor s16
o2 s2 xor s6 xor s13 xor s17 xor s24
o31 s2 xor s9 xor s13 xor s17 xor s28 xor s31

same for the second output:

Output-Bit Value
o32 s32 xor s36 xor s39 xor s47,

But we can only observe eight of these bits, because random.randint(0,255) exposes only this portion of the output.

After 624 outputs, the internal state of the Mersenne Twister is “twisted” around. We update our internal state as an xor-combination of our old indices.

Internal state bit Value
0 s63 xor s12704
1 s0 xor s12705
19967 s61 xor s62 xor s5470 xor s5471 xor s18143

The outputs look now more complicated now, because the state bits are an xor-combination of the initial state:

Output-Bit Value
o19968 s35 xor s38 xor s46 xor s63 xor s12704 xor s12708 xor s12711 xor s12719

After 3.3 kb this list contains about 40 variables.

Now we have a big list of output-bits and how they are made out of an xor-combination of the original state. A big system of equations that we can to solve! This is done as you learned it at school: Here’s a simple example for 3 bits.

Given this equations system:

o1 = s0 xor s1 xor s2
o2 = s1 xor s2
o3 = s0 xor s1
 
First we solve s0:
 
o1 = s0 xor s1 xor s2
o2 = s1 xor s2
=>
o1 xor o2 = s0
 
With this solution it’s easy to find solution for s1.
 
o3 = s0 xor s1
o1 xor o2 = s0
=>
o1 xor o2 xor o3 = s1
 
And finally for s2.
 
o2 = s1 xor s2
o1 xor o2 xor o3 = s1
=>
o1 xor o3 = s2
 
Result:
 
o1 xor o2 = s0
o1 xor o2 xor o3 = s1
o1 xor o3 = s2

Now we know how to recover the 3-bit state out of our 3 output-bits:
s0 = o1 xor o2
s1 = o1 xor o2 xor o3
s2 = o1 xor o3

However, in reality we have about 26,000 equations with 20,000 variables.

If you want to try it yourself, you can download the the result of the solved equation together with a test-program on github.

Further notes

Since the Mersenne Twister is highly symmetric, it’s probably possible to find some shortcuts or a fully mathematical solution for this problem. However, I implemented the straight-forward solution since it’s easy and reusable.

Python seeds the Twister with only 128 bits of “real” randomness. So theoretically it’s enough to know a few output bytes to restore the whole state, but you would need an efficient attack on the seeding algorithm since 128 bit is too much for a brute-force attack.

However, other implementations use much less randomness to seed their random number generators. PHP seems to use only 32 bits for seeding mt_random, Perl also uses only 32 bit (but another PRNG). In these cases it’s probably easier to use a brute-force attack on the seed.

Comments
  1. The best solution would be to avoid using built-in solutions and instead generate a custom "random" number generator.

    I was thinking about this today and came up with the following concept:

    Imagine a 1m3 room where a ball is placed inside. The ball is placed at X,Y,Z, travelling at angle A and speed V. All these variables take the following values: X=Pseudorandom number, Y=Current CPU Usage, Z=Current RAM usage, A=Number of active threads, V=Complex of all previous numbers. You let it "travel" for N, where N=Pseudo-random number, and take its new values. Transpose them on a 1d plane, take the X position, aaaand…. You've got a virtually unpredictable random number, at a slightly higher computation cost though.

    Feasible? Yes. Functional? Probably. Reasonable? Probably not. Practical? No.

  2. Why not just use a cryptographically secure pseudorandom number generator? Your physics equation is still solvable with enough output, and X, Y, Z, and A could be guessed, coerced, or even just brute forced. Cryptographic functions should still be unbreakable (except for brute force on large keys) even when the details of an implementation are public. See Kerckhoffs's principle.

  3. We presented a paper in this year's blackhat and Usenix Security conferences on exploiting PRNGs in PHP where we applied the same technique in order to recover the internal state of MT from truncated outputs. We also have some more extensive experiments on the number of equations and outputs needed in order to be able to recover the internal state.
    You can check the paper at
    https://crypto.di.uoa.gr/CRYPTO.SEC/Randomness_Attacks.html

  4. Oh, hush. SpiderOak requires users to register with their full legal name. Leaks are part of SpiderOak's design.

  5. True randomness is available, and very good randomness in Python is arrived at by importing system.random instead of math.random. The Mersene Twister is used for quick and dirty well distributed numbers. For real randomness, follow the work of Landon Curt Noll. His original LavaRand (SGI camera at a lava lamp) and later using cheap camera chips in darkness of 100Mb/sec of real randomness.

  6. This is pretty cool stuff, and a great demonstration of why it's better to just always use a cryptographically strong random number generator.

    I can't speak for Ruby and PHP, but on the Python side, most of these sorts of issues can be solved by using PyCrypto and replacing:
    import random
    with
    from Crypto.Random import random

    The only things that don't work with the PyCrypto drop-in are floating-point support (it's really hard to preserve randomness using floating-point numbers, so PyCrypto doesn't try), and random.shuffle (due to a stupid typo that will be fixed and regression-tested in PyCrypto 2.7).

    [/shamelessplug]

  7. @Dwayne Litzenberger

    Thanks for your comment and your great work! Please also accept the pull-request on object serialization for PyCrypto 2.7 :) [/anothershamelessplug]

  8. I'm amazed that anyone uses the Twister without hashing the output – Jenkins warns against this on his website and has for years.

  9. Terry: hashing output does not make something more unpredictable. The problem is the lack of entropy. If I took the numbers 1, 2, 3, 4, 5…, 1000 in order and ran them through SHA1, and then I take that output and run it through the statistical analyzers, it will come back as statistically uniform. It will look good. But it is totally predictable. If someone knew that's what I was doing, they could completely predict what's coming. Randomness != unpredictability.

    Lastly, Cryptographically random != suitable for all purposes. Java's SecureRandom is cryptographically secure. But you can't shuffle a deck of cards with it. It uses a 20 byte state. That means it can only produce 2^160 possible streams of random numbers. The number of possible ways to shuffle a deck of cards is 52!, roughly 2^226. So Java's SecureRandom cannot produce every possible shuffle of a deck of cards, even though it is cryptographically secure (e.g., suitable for session IDs and random AES keys and such)