4.2.1 Reducing and processing text
You are familiar with the idea of a word processor. Although I grew up long before the era of word processing, it's now difficult for me to imagine how I ever lived without one. Word processors enable us to enter text into the computer, edit and fiddle about with it, store it and then print it out when we are satisfied with the result. That's exactly what's happening as I write this course. But, if the text spends time inside the computer before being returned to print, that must mean it exists there in the form of numbers. It's inside the boundary. How can text be made into numbers?
Let's use the following famous line from Shakespeare as an example:
Rough winds do shake the darling buds of May
(Sonnet 18)
This presents no problem to the human eye. We read it straight off. Actually, the process by which we read, recognise, combine and understand textual symbols is complex and not fully understood – but that's another course.
Exercise 8
How do you think this line could be transformed into numbers?
Discussion
You may have been thinking along the following lines. Pick one number to represent each letter – 1 for ‘a’, 2 for ‘b’, and so on – and then simply substitute the number for that letter in the line.
I did say earlier that the computer world is a simple world, and transforming text into numbers is as straightforward as that. First, we assign a unique number to each letter in the alphabet. Each letter in the text now becomes a number inside the computer. I'm going to make the following choices:
letter | a | b | c | d | e | f | g | h | i | j | k | l | m |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
number | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |
letter | n | o | p | q | r | s | t | u | v | w | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
number | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 |
These choices probably seem fairly arbitrary, but let's stick with them for the moment. Now if I simply substitute each letter with the number I've chosen for it, our line will look like this inside the computer (the breaks to a new row have no significance):
114 | 111 | 117 | 103 | 104 | 119 | 105 | 110 | 100 | 115 | 100 | 111 | 115 |
104 | 97 | 107 | 101 | 116 | 104 | 101 | 100 | 97 | 114 | 108 | 105 | 110 |
103 | 98 | 117 | 100 | 115 | 111 | 102 | 109 | 97 | 121 |
It looks as if the problem of converting text into numbers has been solved.
SAQ 6
Before going on, do you think the above table is a complete representation of the line of poetry?
Answer
Not quite, unfortunately. If I instruct the computer to translate what I've given it back into text, I'll see
roughwindsdoshakethedarlingbudsofmay
I forgot that there are spaces between the words, probably because I didn't even notice them. Moreover, the first letter of the line should be a capital and so should the first letter of the proper name ‘May’.
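The loss described in this answer can be reproduced with a minimal sketch in Python. The language, the function names `encode_letters_only` and `decode_letters_only`, and the decision to fold capitals to lower case before looking them up are all assumptions made for illustration; the course itself only supplies the table of 26 numbers.

```python
import string

# The table above: 'a' -> 97, 'b' -> 98, ..., 'z' -> 122.
LETTER_TO_NUMBER = {letter: 97 + i for i, letter in enumerate(string.ascii_lowercase)}
NUMBER_TO_LETTER = {number: letter for letter, number in LETTER_TO_NUMBER.items()}


def encode_letters_only(text):
    """Replace each letter with its number; anything not in the table is dropped."""
    return [LETTER_TO_NUMBER[ch] for ch in text.lower() if ch in LETTER_TO_NUMBER]


def decode_letters_only(numbers):
    """Turn a list of numbers back into text using the same table."""
    return "".join(NUMBER_TO_LETTER[n] for n in numbers)


line = "Rough winds do shake the darling buds of May"
numbers = encode_letters_only(line)
print(numbers[:5])                   # [114, 111, 117, 103, 104]
print(decode_letters_only(numbers))  # roughwindsdoshakethedarlingbudsofmay
```

Because the table has no entry for a space and no separate entries for capitals, that information is gone by the time the numbers are translated back.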
But a computer doesn't know anything about words or the spaces between them, still less about the months of the year. We need more numbers to solve this problem. Let's allocate a new number, 32, to represent a space. However, the problem of capital letters is more serious. There is no easy way of instructing the machine that ‘r’ and ‘R’ are different forms of the same letter. Nor could we possibly tell it anything about the first letters of poetic lines. Our only option is to allocate a whole set of new numbers to the upper-case (capital) versions of every letter. Let's set aside 82 to represent a capital ‘R’ and 77 for a capital ‘M’. Now, if I use this enhanced way of representing characters as numbers and peer into the memory of the computer, our line of poetry becomes:
82 | 111 | 117 | 103 | 104 | 32 | 119 | 105 | 110 | 100 | 115 | 32 | 100 | 111 | 32 |
115 | 104 | 97 | 107 | 101 | 32 | 116 | 104 | 101 | 32 | 100 | 97 | 114 | 108 | 105 |
110 | 103 | 32 | 98 | 117 | 100 | 115 | 32 | 111 | 102 | 32 | 77 | 97 | 121 |
This is now a better representation of the text. The example illustrates that unique numbers are needed, not simply for all the upper- and lower-case letters and for spaces, but also for characters that we might not think of straight away. These include mathematical symbols (e.g. > (greater than), < (less than) and ≠ (not equal to)) and accented letters found in foreign words (e.g. é, è, ç and ö). This is why computer scientists usually refer to characters, rather than letters, when discussing text. All in all, then, a great many numbers will have to be assigned to representing text.
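The enhanced scheme can be tried out with a short Python sketch. It relies on Python's built-in `ord` and `chr` functions, whose numbering happens to agree with the choices made above (97 for ‘a’, 32 for a space, 82 for ‘R’, 77 for ‘M’); that agreement is a fact about Python rather than something stated in this section, and the names used in the dictionary are purely illustrative.

```python
line = "Rough winds do shake the darling buds of May"

# ord() gives the number Python uses for a character; chr() goes the other way.
numbers = [ord(ch) for ch in line]
print(numbers[:6])   # [82, 111, 117, 103, 104, 32]
print(numbers[-3:])  # [77, 97, 121]

# Spaces and capitals now have their own numbers, so nothing is lost.
restored = "".join(chr(n) for n in numbers)
print(restored == line)  # True

# Characters beyond the plain alphabet need numbers too.
# They are written as escapes so the example does not depend on file encoding.
symbols = {
    "greater than": ">",
    "less than": "<",
    "not equal to": "\u2260",
    "e acute": "\u00e9",
    "o umlaut": "\u00f6",
}
codes = {name: ord(ch) for name, ch in symbols.items()}
print(codes)
# {'greater than': 62, 'less than': 60, 'not equal to': 8800, 'e acute': 233, 'o umlaut': 246}
```

Note how quickly the numbers grow once symbols and accented letters are included, which is one sign of how many numbers a full scheme has to set aside.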