Research at Google has a fun riddle here: https://plus.google.com/117790530324740296539/posts/BX4bBjnDi4h
They've put up a post containing the following characters:
♆❡❡⋚ϟ▲◓⊪╩ ☴⋚ ∭∰☴⋂ ⌥⌥⋚⊯▲, ☴⋂∰ ⋚➇∭∰ϟ∜♆➇⌥∰ ⋘⊪◓∜∰ϟ∭∰ ⋂⋚⌥▲∭ ☴ʊ⋚ ☴⋚ ☴⋂∰ ☴⋂ϟ∰∰ ⋂⋘⊪▲ϟ∰▲ ♆⊪▲ ∞◓∜∰ ➇◓☴∭ ⋚∞ ◓⊪∞⋚ϟ◉♆☴◓⋚⊪. ☴⋂◓∭ ❡ϟ⊯ℜ☴⋚╩ϟ♆◉ ❡⋚⊪☴♆◓⊪∭ ♆➇⋚⋘☴ ☴⋂◓ϟ☴⊯ ⋚∞ ☴⋂∰◉.
We'll first note two things:
- These are unicode characters
- We've got some punctuation characters: a comma, two periods, and spaces
It's difficult for the human eye to make sense of the seemingly random characters presented here, so we're going to write a small program that will output the numerical value of the characters. We'll see if we can make sense of that:
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))

unsigned char t[]="♆❡❡⋚ϟ▲◓⊪╩ ☴⋚ ∭∰☴⋂ ⌥⌥⋚⊯▲, ☴⋂∰ ⋚➇∭∰ϟ∜♆➇⌥∰ ⋘⊪◓∜∰ϟ∭∰ ⋂⋚⌥▲∭ ☴ʊ⋚ ☴⋚ ☴⋂∰ ☴⋂ϟ∰∰ ⋂⋘⊪▲ϟ∰▲ ♆⊪▲ ∞◓∜∰ ➇◓☴∭ ⋚∞ ◓⊪∞⋚ϟ◉♆☴◓⋚⊪. ☴⋂◓∭ ❡ϟ⊯ℜ☴⋚╩ϟ♆◉ ❡⋚⊪☴♆◓⊪∭ ♆➇⋚⋘☴ ☴⋂◓ϟ☴⊯ ⋚∞ ☴⋂∰◉.";

int main(int argc, char **argv)
{
    int i;

    for (i = 0; i < ARRAY_SIZE(t) - 1;) {
        int inc;
        int res;

        if (t[i] <= 0x7f) {                        /* ASCII: one byte */
            res = t[i];
            inc = 1;
        } else if (t[i] >= 0x80 && t[i] <= 0xbf) { /* stray continuation byte */
            abort();
        } else if (t[i] >= 0xc2 && t[i] <= 0xdf) { /* two-byte sequence */
            res = (t[i] & 0x1f) << 6 | (t[i + 1] & 0x3f);
            inc = 2;
        } else if (t[i] >= 0xe0 && t[i] <= 0xef) { /* three-byte sequence */
            res = (t[i] & 0xf) << 12 | (t[i + 1] & 0x3f) << 6 | (t[i + 2] & 0x3f);
            inc = 3;
        } else {
            abort();
        }
        i += inc;
        printf("%06x\n", res);
    }
    return 0;
}
This decodes the UTF-8 encoded characters (see the Wikipedia page for details: http://en.wikipedia.org/wiki/UTF-8#Description) and outputs said values:
002646
002761
002761
0022da
0003df
....
In the hope that these characters actually replace letters, let's sort them and look at their distribution:
25 000020
16 002634
14 0022da
12 002230
9 0025d3
9 0022c2
9 0003df
8 0022aa
7 002646
7 00222d
6 0025b2
4 002787
4 002761
4 002325
4 00221e
3 0025c9
3 0022d8
3 0022af
3 00221c
2 002569
2 00002e
1 00211c
1 00028a
1 00002c
Now, this is encouraging: we see 0x20, the space character, at the top of the list, and the distribution of the remaining characters nicely corresponds to the frequency of characters in real text. Coincidentally, the same 'Research at Google' account had posted a link to this study on 'English Letter Frequency Counts': http://norvig.com/mayzner.html
Now, let's try that! The study lists the most frequent letters: E, T, A, O, I, N, .... Let's replace our most frequent Unicode character with 'E', the second most frequent with 'T', and so on. Our program now prints:
rcctnlosy et haei uutgl, eia tdhanwrdua psowanha itulh evt et eia einaa ipslnal rsl mowa doeh tm osmtnfreots. eioh cngbetynrf ctserosh rdtpe eioneg tm eiaf.
Could be better, couldn't it? There's a couple of interesting things we can notice though:
- The flow of the text closely resembles real text; this is not a truly random set of letters
- We've got a couple of short words that appear twice: 'et' and 'eia'
The sample of text to decode is very small, so it's very likely that the frequency will differ a bit from the canonical frequency we've used.
Again according to 'English Letter Frequency Counts', the most common words in English are: the, of, and, to, in....
If 'eia' were 'the' and 'et' were 'to', that would fit nicely with the final word 'eiaf' being 'them'... Let's swap the letters in our program to see what happens:
rccanlosy ta ieth uuagl, the adienwrdue psowenie hauli tva ta the thnee hpslnel rsl fowe doti af osfanmrtoas. thoi cngbtaynrm castrosi rdapt thontg af them.
As we swap the letters ('e' and 't', 'i' and 'h', 'a' and 'e', 'm' and 'f'), we notice that each swapped pair is close in frequency rank, which means we're not far from the canonical statistical distribution described in the study. That's a good sign!
The 'thnee' would nicely make a 'three' and the two 'ta' and the 'af' would make nice 'to' and 'of', right? Let's do some more swapping:
nccorlasy to ieth uuogl, the odierwndue psawerie houli tvo to the three hpslrel nsl fawe dati of asformntaos. thai crgbtoyrnm costnasi ndopt thartg of them.
Nice! I'll skip over the trial-and-error process, but basically the next step was to figure out that 'thai' was probably 'this', as we already knew that 'th' was in the right place; after a couple more inversions we've got:
according to seth lloyd, the observable universe holds two to the three hundred and five bits of information. this cryptogram contains about thirty of them.
Et voilà! Seth Lloyd