SICP Exercise 2.70

Question

The following eight-symbol alphabet with associated relative frequencies was designed to efficiently encode the lyrics of 1950s rock songs. (Note that the “symbols” of an “alphabet” need not be individual letters.)

A 2 NA 16
BOOM 1 SHA 3
GET 2 YIP 9
JOB 2 WAH 1

Use generate-huffman-tree (Exercise 2.69) to generate a corresponding Huffman tree, and use encode (Exercise 2.68) to encode the following message:

Get a job
Sha na na na na na na na na

Get a job
Sha na na na na na na na na

Wah yip yip yip yip
yip yip yip yip yip
Sha boom

How many bits are required for the encoding? What is the smallest number of bits that would be needed to encode this song if we used a fixed-length code for the eight-symbol alphabet?

Answer

The encoded string is

111111100111101110000000001111111001111011
100000000011010101010101010101010111011011

which is 84 characters long. Since our Huffman set does not have leaves that represent line breaks and is also case-sensitive, the message had to be brought into the right format:

GET A JOB SHA NA NA NA NA NA NA NA NA GET A JOB SHA NA NA NA NA
NA NA NA NA WAH YIP YIP YIP YIP YIP YIP YIP YIP YIP SHA BOOM

If we were to change this to a fixed-length alphabet, we would require 3 bits per word (because \(2^3=8\)). The original text has 36 words, so the encoded message would require \(36*3=108\) bits in total, which is an increase of about 29%.