Crossing the boundary: analogue universe, digital worlds


# 4.2.1 Reducing and processing text

You are familiar with the idea of a word processor. Although I grew up long before the era of word processing, it's now difficult for me to imagine how I ever lived without one. Word processors enable us to enter text into the computer, edit and fiddle about with it, store it and then print it out when we are satisfied with the result. That's exactly what's happening as I write this course. But, if the text spends time inside the computer before being returned to print, that must mean it exists there in the form of numbers. It's inside the boundary. How can text be made into numbers?

Let's use the following famous line from Shakespeare as an example:

Rough winds do shake the darling buds of May

(Sonnet 18)

This presents no problem to the human eye. We read it straight off. Actually, the process by which we read, recognise, combine and understand textual symbols is complex and not fully understood – but that's another course.

## Exercise 8

How do you think this line could be transformed into numbers?

### Discussion

You may have been thinking along the following lines. Pick one number to represent each letter (1 for ‘a’, 2 for ‘b’, and so on) and then simply substitute the number for that letter in the line.

I did say earlier that the computer world is a simple world, and transforming text into numbers is as straightforward as that. First, we assign a unique number to each letter in the alphabet. Each letter in the text now becomes a number inside the computer. I'm going to make the following choices:

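| letter | a | b | c | d | e | f | g | h | i | j | k | l | m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| number | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |

| letter | n | o | p | q | r | s | t | u | v | w | x | y | z |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| number | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 |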

These choices probably seem fairly arbitrary, but let's stick with them for the moment. Now if I simply substitute each letter with the number I've chosen for it, our line will look like this inside the computer (the breaks to a new row have no significance):

 114 111 117 103 104 119 105 110 100 115 100 111 115 104 97 107 101 116 104 101 100 97 114 108 105 110 103 98 117 100 115 111 102 109 97 121

It looks as if the problem of converting text into numbers has been solved.
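The substitution just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `LETTER_TO_NUMBER` and `encode_letters` are made up for this example, and the mapping is simply the table above (97 for ‘a’ up to 122 for ‘z’). Anything that does not appear in the table is ignored.

```python
import string

# Assumed mapping, taken from the table above: 'a' -> 97, 'b' -> 98, ..., 'z' -> 122.
LETTER_TO_NUMBER = {
    letter: 97 + position
    for position, letter in enumerate(string.ascii_lowercase)
}


def encode_letters(text):
    """Substitute each letter with its number; ignore anything not in the table."""
    return [
        LETTER_TO_NUMBER[ch]
        for ch in text.lower()          # the table only has one form of each letter
        if ch in LETTER_TO_NUMBER
    ]


line = "Rough winds do shake the darling buds of May"
numbers = encode_letters(line)

print(numbers[:5])   # [114, 111, 117, 103, 104]  (r, o, u, g, h)
print(len(numbers))  # 36
```

The 36 numbers produced correspond one for one to the 36 letters in the line.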

## SAQ 6

Before going on, do you think the above list of numbers is a complete representation of the line of poetry?

Not quite, unfortunately. If I instruct the computer to translate what I've given it back into text, I'll see

roughwindsdoshakethedarlingbudsofmay

I forgot that there are spaces between the words, probably because I didn't even notice them. Moreover, the first letter of the line should be a capital and so should the first letter of the proper name ‘May’.
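To see why the translation back comes out this way, here is a minimal Python sketch of the reverse step. The names `NUMBER_TO_LETTER` and `decode_letters` are hypothetical, chosen for this example; the mapping is the letters-only table (97 for ‘a’ up to 122 for ‘z’), and `stored` holds the numbers that were given to the computer.

```python
import string

# Assumed reverse of the letters-only table: 97 -> 'a', 98 -> 'b', ..., 122 -> 'z'.
NUMBER_TO_LETTER = {
    97 + position: letter
    for position, letter in enumerate(string.ascii_lowercase)
}

# The numbers as stored for the line of poetry.
stored = [
    114, 111, 117, 103, 104, 119, 105, 110, 100, 115, 100, 111,
    115, 104, 97, 107, 101, 116, 104, 101, 100, 97, 114, 108,
    105, 110, 103, 98, 117, 100, 115, 111, 102, 109, 97, 121,
]


def decode_letters(numbers):
    """Turn each stored number back into the letter it stands for."""
    return "".join(NUMBER_TO_LETTER[n] for n in numbers)


print(decode_letters(stored))  # roughwindsdoshakethedarlingbudsofmay
```

Nothing in the stored numbers records where one word ends and the next begins, or which letters were capitals, so the computer has no way to put that information back.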

But a computer doesn't know anything about words or the spaces between them, still less about the months of the year. We need more numbers to solve this problem. Let's allocate a new number, 32, to represent a space. However, the problem of capital letters is more serious. There is no easy way of instructing the machine that ‘r’ and ‘R’ are different forms of the same letter. Nor could we possibly tell it anything about the first letters of poetic lines. Our only option is to allocate a whole set of new numbers to the upper-case (capital) versions of every letter. Let's set aside 82 to represent a capital ‘R’ and 77 for a capital ‘M’. Now, if I use this enhanced way of representing characters as numbers and peer into the memory of the computer, our line of poetry becomes:

 82 111 117 103 104 32 119 105 110 100 115 32 100 111 32 115 104 97 107 101 32 116 104 101 32 100 97 114 108 105 110 103 32 98 117 100 115 32 111 102 32 77 97 121
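The numbers chosen so far (97 to 122 for the lower-case letters, 32 for a space, 82 for ‘R’ and 77 for ‘M’) happen to coincide with the values returned by Python's built-in `ord` function, so the enhanced representation can be tried out directly. This is a small sketch, not a description of how any particular word processor works; the variable names are just for illustration.

```python
line = "Rough winds do shake the darling buds of May"

# ord() gives the number for a single character; chr() goes the other way.
codes = [ord(ch) for ch in line]

print(codes[:6])    # [82, 111, 117, 103, 104, 32]  ('R', 'o', 'u', 'g', 'h', space)
print(codes[-3:])   # [77, 97, 121]                 ('M', 'a', 'y')

# Translating the numbers back now recovers the spaces and the capitals.
restored = "".join(chr(n) for n in codes)
print(restored == line)  # True
```

Because spaces and capital letters now have numbers of their own, the round trip from text to numbers and back loses nothing.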

This is now a better representation of the text. The example illustrates that unique numbers are needed, not simply for all the upper- and lower-case letters and for spaces, but also for characters that we might not think of straight away. These include mathematical symbols (e.g. > (greater than), < (less than) and ≠ (not equal to)) and accented letters found in foreign words (e.g. é, è, ç and ö). This is why computer scientists usually refer to characters, rather than letters, when discussing text. All in all, then, a great many numbers will have to be assigned to represent text.
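As a rough illustration of how far the numbering has to stretch, the following Python sketch prints the numbers that Python itself uses for a few such characters. These particular values are Python's, not choices made in this course, and the variable names are made up for the example. The characters are written as escape codes so that each one is unambiguously a single character.

```python
# A few characters beyond the plain letters and the space.
samples = [
    ">",        # greater than
    "<",        # less than
    "\u2260",   # not equal to
    "\u00e9",   # e with acute accent
    "\u00e8",   # e with grave accent
    "\u00f6",   # o with diaeresis
]

# Map each character to the number Python assigns to it.
numbers_for_samples = {ch: ord(ch) for ch in samples}

for ch, number in numbers_for_samples.items():
    print(ch, number)
```

The symbols > and < receive small numbers (62 and 60), while ≠ is given 8800, which shows how large the numbers can become once many kinds of character have to be covered.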