
Entertaining letter frequency riddle

Research at Google has posted a fun riddle here: https://plus.google.com/117790530324740296539/posts/BX4bBjnDi4h

They've put up a post containing the following characters:
♆❡❡⋚ϟ▲◓⊪╩ ☴⋚ ∭∰☴⋂ ⌥⌥⋚⊯▲, ☴⋂∰ ⋚➇∭∰ϟ∜♆➇⌥∰ ⋘⊪◓∜∰ϟ∭∰ ⋂⋚⌥▲∭ ☴ʊ⋚ ☴⋚ ☴⋂∰ ☴⋂ϟ∰∰ ⋂⋘⊪▲ϟ∰▲ ♆⊪▲ ∞◓∜∰ ➇◓☴∭ ⋚∞ ◓⊪∞⋚ϟ◉♆☴◓⋚⊪.  ☴⋂◓∭ ❡ϟ⊯ℜ☴⋚╩ϟ♆◉ ❡⋚⊪☴♆◓⊪∭ ♆➇⋚⋘☴ ☴⋂◓ϟ☴⊯ ⋚∞ ☴⋂∰◉. 

We'll first note two things:
  • These are Unicode characters
  • We've got some punctuation characters: a comma, two periods and spaces
It's difficult for the human eye to make sense of the seemingly random characters presented here, so we're going to write a small program that will output the numerical value of the characters. We'll see if we can make sense of that:
#include <stdio.h>
#include <stdlib.h>
#define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))
unsigned char t[]="♆❡❡⋚ϟ▲◓⊪╩ ☴⋚ ∭∰☴⋂ ⌥⌥⋚⊯▲, ☴⋂∰ ⋚➇∭∰ϟ∜♆➇⌥∰ ⋘⊪◓∜∰ϟ∭∰ ⋂⋚⌥▲∭ ☴ʊ⋚ ☴⋚ ☴⋂∰ ☴⋂ϟ∰∰ ⋂⋘⊪▲ϟ∰▲ ♆⊪▲ ∞◓∜∰ ➇◓☴∭ ⋚∞ ◓⊪∞⋚ϟ◉♆☴◓⋚⊪.  ☴⋂◓∭ ❡ϟ⊯ℜ☴⋚╩ϟ♆◉ ❡⋚⊪☴♆◓⊪∭ ♆➇⋚⋘☴ ☴⋂◓ϟ☴⊯ ⋚∞ ☴⋂∰◉.";
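int main(void)
{
        size_t i;
        /* ARRAY_SIZE(t) - 1 skips the terminating NUL of the string literal. */
        for (i = 0; i < ARRAY_SIZE(t) - 1;) {
                size_t inc;       /* number of bytes in the current sequence */
                unsigned int res; /* decoded code point */
                if (t[i] <= 0x7f) {
                        /* 1 byte: plain ASCII */
                        res = t[i];
                        inc = 1;
                } else if (t[i] <= 0xbf) {
                        /* a continuation byte cannot start a sequence */
                        abort();
                } else if (t[i] >= 0xc2 && t[i] <= 0xdf) {
                        /* 2 bytes: 110xxxxx 10xxxxxx */
                        res = (t[i] & 0x1f) << 6 | (t[i + 1] & 0x3f);
                        inc = 2;
                } else if (t[i] >= 0xe0 && t[i] <= 0xef) {
                        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
                        res = (t[i] & 0xf) << 12 | (t[i + 1] & 0x3f) << 6 | (t[i + 2] & 0x3f);
                        inc = 3;
                } else {
                        /* 0xc0, 0xc1 and 4-byte sequences are not handled */
                        abort();
                }
                i += inc;
                printf("%06x\n", res);
        }
        return 0;
}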

This decodes the UTF-8 encoded characters (see Wikipedia for details: http://en.wikipedia.org/wiki/UTF-8#Description) and outputs their code point values:


002646 
002761 
002761 
0022da 
0003df 
....
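To make the bit masking concrete, here is a small sketch of the 3-byte case, worked through for the first character of the riddle, ♆, which is stored as the bytes e2 99 86. The helper name decode3 is made up for this illustration, and it assumes its input is a well-formed 3-byte sequence (it does no validation):

```c
/* Hypothetical helper: decode one 3-byte UTF-8 sequence
 * (1110xxxx 10xxxxxx 10xxxxxx) into a code point.
 * Assumes the three bytes are well formed; nothing is validated. */
unsigned int decode3(unsigned char b0, unsigned char b1, unsigned char b2)
{
        /* Worked example for the bytes e2 99 86:
         *   (0xe2 & 0x0f) << 12 = 0x02 << 12 = 0x2000
         *   (0x99 & 0x3f) <<  6 = 0x19 <<  6 = 0x0640
         *   (0x86 & 0x3f)       =              0x0006
         *   OR-ing the three parts together gives 0x2646 */
        return (b0 & 0x0fu) << 12 | (b1 & 0x3fu) << 6 | (b2 & 0x3fu);
}
```

Calling decode3(0xe2, 0x99, 0x86) yields 0x2646, which is the first value in the output above.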
In the hope that these characters actually replace letters, let's sort them and look at their distribution:


  25 000020
  16 002634
  14 0022da
  12 002230
   9 0025d3
   9 0022c2
   9 0003df
   8 0022aa
   7 002646
   7 00222d
   6 0025b2
   4 002787
   4 002761
   4 002325
   4 00221e
   3 0025c9
   3 0022d8
   3 0022af
   3 00221c
   2 002569
   2 00002e
   1 00211c
   1 00028a
   1 00002c
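The post doesn't show how these counts were produced, so here is one possible way to do it in C, as a rough sketch. The names struct freq, by_count_desc and count_freq are invented for this example, and the input is assumed to be an array of already decoded code points:

```c
#include <stdlib.h>

/* One row of the frequency table: a code point and its number of occurrences.
 * (Illustrative type, not from the original program.) */
struct freq {
        unsigned int cp;
        int count;
};

/* qsort() comparator: most frequent entries first. */
int by_count_desc(const void *a, const void *b)
{
        const struct freq *fa = a, *fb = b;
        return fb->count - fa->count;
}

/* Count how often each code point occurs in cps[0..n-1].
 * Results go into table (room for cap entries), sorted by descending count.
 * Returns the number of distinct code points, or -1 if table is too small.
 * Note: qsort() is not stable, so the order of ties is unspecified. */
int count_freq(const unsigned int *cps, int n, struct freq *table, int cap)
{
        int used = 0;
        for (int i = 0; i < n; i++) {
                int j;
                /* linear search is fine for a couple of dozen symbols */
                for (j = 0; j < used; j++)
                        if (table[j].cp == cps[i])
                                break;
                if (j == used) {
                        if (used == cap)
                                return -1;
                        table[used].cp = cps[i];
                        table[used].count = 0;
                        used++;
                }
                table[j].count++;
        }
        qsort(table, (size_t)used, sizeof table[0], by_count_desc);
        return used;
}
```

Printing each entry with printf("%4d %06x\n", ...) would give a listing in the same shape as the one above, although entries with equal counts may come out in a different order.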

Now, this is encouraging: we see 0x20, the space character, at the top of the list. The distribution of the remaining characters nicely corresponds to the frequency of characters in real text. The same 'Research at Google' account coincidentally posted a link to this study on 'English Letter Frequency Counts': http://norvig.com/mayzner.html  


Now, let's try that! The study lists the most frequent letters: E, T, A, O, I, N, .... Let's replace our most frequent unicode character with 'E', the second most frequent with 'T' and so on. Our program now prints:

rcctnlosy et haei uutgl, eia tdhanwrdua psowanha itulh evt et eia einaa ipslnal rsl mowa doeh tm osmtnfreots.  eioh cngbetynrf ctserosh rdtpe eioneg tm eiaf.
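The replacement step can be sketched roughly as follows. This is not the program used for the post, only an illustration: substitute and english_order are made-up names, the letter order is the one that matches the output above, and ranked is assumed to hold the cipher symbols sorted from most to least frequent (spaces and punctuation left out):

```c
/* English letters from most to least frequent (assumed order, see lead-in). */
const char english_order[] = "etaoinsrhldcumfpgwybvkxjqz";

/* Translate one code point of the riddle.
 * - ASCII (space, comma, period) is passed through unchanged.
 * - A symbol found at position i of ranked[] becomes english_order[i].
 * - Anything else becomes '?'. */
char substitute(unsigned int cp, const unsigned int *ranked, int nranked)
{
        if (cp <= 0x7f)
                return (char)cp;
        for (int i = 0; i < nranked && english_order[i] != '\0'; i++)
                if (ranked[i] == cp)
                        return english_order[i];
        return '?';
}
```

With the ranking taken from the table above (0x2634 first, then 0x22da, 0x2230 and so on), the second word of the riddle, made of the code points 0x222d 0x2230 0x2634 0x22c2, comes out as 'haei'.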

Could be better, couldn't it? There are a couple of interesting things we can notice though:

  • The flow of the text closely resembles a real text; this is not a truly random set of letters
  • We've got a couple of short words that appear twice: 'et' and 'eia'
The sample of text to decode is very small, so it's very likely that the frequency will differ a bit from the canonical frequency we've used.

Again according to 'English Letter Frequency Counts', the most common words in English are: the, of, and, to, in....

If 'eia' were 'the' and 'et' were 'to', that would fit nicely with the final word 'eiaf' being 'them'... Let's swap the letters in our program to see what happens:
rccanlosy ta ieth uuagl, the adienwrdue psowenie hauli tva ta the thnee hpslnel rsl fowe doti af osfanmrtoas.  thoi cngbtaynrm castrosi rdapt thontg af them.

As we swap the letters ('e' and 't', 'i' and 'h', 'a' and 'e', 'm' and 'f'), we notice that the swapped letters sit close to each other in the frequency ranking, which means we're not far from the canonical statistical distribution described in the study. That's a good sign!
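For illustration, a swap like this can be done directly on the decoded text. The function below is only a sketch with an invented name, swap_letters, and is not taken from the original program. The swaps are applied one after the other, so their order matters:

```c
/* Exchange every occurrence of the letters a and b in the string s, in place. */
void swap_letters(char *s, char a, char b)
{
        for (; *s != '\0'; s++) {
                if (*s == a)
                        *s = b;
                else if (*s == b)
                        *s = a;
        }
}
```

Applying swap_letters with ('e', 't'), then ('i', 'h'), then ('a', 'e'), then ('m', 'f') to the string "et haei eia eiaf" turns it into "ta ieth the them", which matches the corresponding words in the output above.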

The 'thnee' would nicely make a 'three' and the two 'ta' and the 'af' would make nice 'to' and 'of', right? Let's do some more swapping:
nccorlasy to ieth uuogl, the odierwndue psawerie houli tvo to the three hpslrel nsl fawe dati of asformntaos.  thai crgbtoyrnm costnasi ndopt thartg of them.  

Nice! I'll skip the trial and error process, but the next step was to figure out that 'thai' was probably 'this', as we already knew that 'th' was in the right place. A couple more swaps, and we've got:

according to seth lloyd, the observable universe holds two to the three hundred and five bits of information.  this cryptogram contains about thirty of them.

Et voila! Seth Lloyd



This post first appeared on Tail -f /dev/random.
