Hi Lorraine,
Thank you for allowing me to explain myself further.
The information emerges from acyclical data. If the data is not acyclical, then there is no information.
Let's consider a 64-bit string of data: "1010010101010110101010101010110101100010100111010001001101011010".
One can compute the Shannon information S of such a string at many different orders L.
To obtain the first order L = 1 information, we first break the string down into "unigrams" (packets of length one): '1', '0', '1', '0', '0', ... etc.
Unigram '0' occurred with frequency 32/64 = 0.5.
Unigram '1' occurred with frequency 32/64 = 0.5.
There are 2^L = 2 possible unigrams, and both occurred (the same number of times). It is useful to think of the string of unigrams as a kind of path that steps from vertex to vertex in a space that consists of 0 and 1: 1 -> 0 -> 1 -> 0 -> 0 -> ... etc.
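Here is a minimal Python sketch of the counting (the function name lgram_counts is just my own, for illustration); it works for any order L, not only unigrams:

    from collections import Counter

    def lgram_counts(s, L):
        # Slide a window of length L along the string and tally each
        # overlapping L-gram ("packet of length L") that occurs.
        return Counter(s[i:i+L] for i in range(len(s) - L + 1))

    s = "1010010101010110101010101010110101100010100111010001001101011010"
    print(lgram_counts(s, 1))  # Counter({'1': 32, '0': 32})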
To obtain the second order L = 2 information, we first break the string down into "digrams": '10', '01', '10', '00', '01', ... etc.
Digram '00' occurred with frequency 7/63 ≈ 0.111111.
Digram '01' occurred with frequency 24/63 ≈ 0.380952.
Digram '10' occurred with frequency 25/63 ≈ 0.396825.
Digram '11' occurred with frequency 7/63 ≈ 0.111111.
There are 2^L = 4 possible digrams, and all 4 occurred (not the same number of times, but they still all occurred at least once). The path is in a space that consists of 00, 01, 10, 11: 10 -> 01 -> 10 -> 00 -> 01 -> ... etc.
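For what it's worth, the lgram_counts sketch from above reproduces these tallies directly:

    print(lgram_counts(s, 2))  # Counter({'10': 25, '01': 24, '00': 7, '11': 7})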
To obtain the third order L = 3 information, we first break the string down into "trigrams": '101', '010', '100', '001', '010', ... etc.
Trigram '000' occurred with frequency 2/62 ≈ 0.0322581.
Trigram '001' occurred with frequency 5/62 ≈ 0.0806452.
Trigram '010' occurred with frequency 18/62 ≈ 0.290323.
Trigram '011' occurred with frequency 6/62 ≈ 0.0967742.
Trigram '100' occurred with frequency 5/62 ≈ 0.0806452.
Trigram '101' occurred with frequency 19/62 ≈ 0.306452.
Trigram '110' occurred with frequency 6/62 ≈ 0.0967742.
Trigram '111' occurred with frequency 1/62 ≈ 0.016129.
There are 2^L = 8 possible trigrams, and all 8 occurred (not the same number of times, but they still all occurred at least once). The path is 101 -> 010 -> 100 -> 001 -> 010 -> ... etc.
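Putting this together, here is a small Python sketch of the order-L Shannon information itself, computed from the L-gram frequencies (again, shannon_entropy is my own illustrative name):

    import math
    from collections import Counter

    def shannon_entropy(s, L):
        # Shannon information, in bits per L-gram, of the empirical
        # L-gram frequency distribution of the string s.
        counts = Counter(s[i:i+L] for i in range(len(s) - L + 1))
        total = sum(counts.values())
        return sum((c / total) * -math.log2(c / total) for c in counts.values())

    s = "1010010101010110101010101010110101100010100111010001001101011010"
    for L in (1, 2, 3):
        print(L, shannon_entropy(s, L))
    # 1 1.0     (the maximum possible at order L is L bits)
    # 2 ~1.764  (below the maximum of 2 bits)
    # 3 ~2.535  (below the maximum of 3 bits)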
If at any order L there is a cycle, then the Shannon information at all orders equal to or greater than L is non-maximal (and possibly zero).
Let's consider the string: "00". The Shannon information for order L = 1 and greater is zero because of this first order cycle of period one: 0 -> 0 -> 0 -> 0 -> 0 ... etc.
Let's consider the string: "010101010101". The Shannon information for orders L = 2 and greater is not maximal because of this second order cycle of period two: 01 -> 10 -> 01 -> 10 -> 01 ... etc (ie. only two digrams '01' and '10' occur). The Shannon information for order L = 1 is maximal because both unigrams occurred the same amount (6 each).
Let's consider the special string "0". The Shannon information for orders L = 1 and greater is zero. Why? Because to decide between a cyclical and an acyclical path, you first need two vertices, and this path so far has only one vertex.
If the path is acyclical at every order, so that all 2^L possible L-grams occur roughly the same number of times, then the Shannon information at every order will be roughly maximal. Of course, the order is limited by the length of the string -- a string of length N has a maximum order of L = N, and the Shannon information at this highest order is always zero, because only one L-gram (the whole string) can have occurred by then. I'm not entirely certain whether maximal entropy at all orders automatically implies randomness, but randomness certainly requires acyclical paths at all orders. This is why pseudo-random number generators with longer cycle periods are usually preferred over generators with shorter periods.
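The same sketch shows both limiting cases:

    print(shannon_entropy(s, len(s)))  # 0.0: only one 64-gram ever occurs
    print(shannon_entropy("0", 1))     # 0.0: a single vertex, not yet a path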
Altogether, the string "0" has one bit of data content, but zero bits of information content, because it is not an acyclical string.
Let's consider a Huffman encoding of a string of three symbols "abc", where the Huffman codes are a = '0', b = '10', c = '11', so that the string is rewritten as "01011". The Shannon information content per symbol is S = ln(3)/ln(2) ≈ 1.58. The data content per symbol is 5/3 ≈ 1.67. It's not a matter of rounding the number of bits up to whole units to get the information content per symbol -- that's the data content, and it's not the same as the information content. I take this distinction between the information and the data literally, because of the deep relationship between acyclical paths and the work of Shannon (which is about leveraging choice / uncertainty).
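And a quick Python check of that arithmetic (assuming, as the example implies, that the three symbols are equally likely):

    import math

    symbols = "abc"
    encoded = "01011"  # Huffman codes: a = '0', b = '10', c = '11'

    info_per_symbol = math.log2(len(set(symbols)))  # ln(3)/ln(2) ≈ 1.585 bits
    data_per_symbol = len(encoded) / len(symbols)   # 5/3 ≈ 1.667 bits
    print(info_per_symbol, data_per_symbol)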
It's not magic, just graph theory and a bit of stuff about (pseudo-)randomness.
- Shawn