4.2 Representing text
Study note: You will need to refer to the Reference Manual while you are working through this section.
Text can be represented in a computer by a succession of binary codes, with each code representing a letter from the alphabet or a punctuation mark. Numerals can also be represented this way, if desired. This can be useful in, say, a word-processing application where no calculations are to be performed and it is convenient to encode a digit in a phrase such as ‘we agreed to meet at 7 o'clock’ in the same way that all the other characters in the sentence are encoded.
Of course, the binary codes that will be used need to be agreed upon in advance. PCs, in common with many other computers, use a code based on ASCII (pronounced ‘askey’; it stands for American Standard Code for Information Interchange) to represent letters and numerals, that is, alphanumeric characters, together with certain other symbols found on computer keyboards. The ASCII code, which dates back to the early days of computing, allocates seven bits for each symbol. Because computers nowadays work with 8-bit groups of 1s and 0s (that is, bytes) rather than with 7-bit groups, ASCII codes are often extended by one bit to 8 bits. There is no single standard way of doing this, but the one used in PCs is simply to prefix a 0 to each 7-bit code.
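The step from a 7-bit ASCII code to an 8-bit byte can be sketched in a few lines of Python. This is only an illustration: the function names ascii_7bit and extend_to_8bit are invented for this example, and the sample characters are arbitrary.

```python
def ascii_7bit(ch: str) -> str:
    """Return the 7-bit ASCII code of one character as a string of 0s and 1s."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")


def extend_to_8bit(code7: str) -> str:
    """Extend a 7-bit code to a full byte by prefixing a 0, as PCs do."""
    return "0" + code7


# Show the 7-bit and 8-bit forms for a letter, a digit and a space.
for ch in "A7 ":
    code7 = ascii_7bit(ch)
    print(repr(ch), code7, extend_to_8bit(code7))
```

Note that the digit 7 is treated here as a character with its own code, not as the number seven.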
The set of 7-bit ASCII codes is shown in the appendix of the Reference Manual, which you should look at now.
Notice that some of the ASCII codes do not represent a displayable character but instead represent an action (e.g. line feed, tab). These codes are said to represent ‘control characters’. Those characters in the range 0000 0000 to 0001 1111 which are not shown in the appendix are all control characters.
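As a rough check on the range just given, the following Python sketch tests whether a character's code falls between 0000 0000 and 0001 1111. The names control_codes and is_control are made up for this illustration, and the sketch looks only at the range quoted in the text.

```python
# The range 0000 0000 to 0001 1111 is decimal 0 to 31.
control_codes = range(0b00000000, 0b00011111 + 1)


def is_control(ch: str) -> bool:
    """True if the character's code lies in the range 0 to 31 quoted in the text."""
    return ord(ch) in control_codes


print(len(control_codes))  # how many codes the range contains
print(is_control("\n"))    # line feed
print(is_control("\t"))    # tab
print(is_control("A"))     # an ordinary displayable letter
```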
Activity 19 (Self assessment)
Write a sequence of binary codes which forms an answer to the following question, using the appendix of your Reference Manual.
The question is: Is 2 = 3?
So the answer is: No. In 7-bit ASCII this is coded as follows:
N   100 1110
o   110 1111
.   010 1110
You may have omitted the full stop, which is fine. If you had ‘no’ instead of ‘No’ then you will have:
n   110 1110
o   110 1111
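If you would like to check your answer to this activity by machine, a short Python sketch such as the following can help. The function name encode_ascii is hypothetical, and the choice of 7 bits by default simply follows the 7-bit codes in the appendix; pass bits=8 to see the byte form with a 0 prefixed.

```python
def encode_ascii(text: str, bits: int = 7) -> list[str]:
    """Return the binary code of each character in text, padded to the given width."""
    return [format(ord(ch), f"0{bits}b") for ch in text]


print(encode_ascii("No."))          # capital N, lower-case o, full stop
print(encode_ascii("no"))           # the all-lower-case alternative
print(encode_ascii("No.", bits=8))  # the same answer as 8-bit bytes
```

Comparing the first two results shows that the upper-case and lower-case forms of a letter have different codes.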
A significant problem with ASCII is that it cannot cope with languages that use characters outside the basic, unaccented Latin alphabet, for example the ß character used in German, or the Cyrillic characters used in Russian. One solution has been to create national variants of ASCII, but of course this causes problems when files are transferred between different language areas.
A longer-term solution is Unicode, which assigns a unique, standard code to every character in use in the world's major written languages. It also has codes for punctuation marks, diacritical marks (such as the tilde ~ used over some characters in, for example, Spanish), mathematical symbols, and so on. In its 16-bit form, Unicode uses 16 bits per character, permitting over 65 000 characters to be coded. It also allows for an extension mechanism, in which a character is coded as a pair of 16-bit codes, to enable about an additional 1 million characters to be coded.
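The 16-bit coding and its extension mechanism can be seen in a short Python sketch. It assumes that the 16-bit form meant here corresponds to the encoding Python calls utf-16-be. The function name utf16_bits and the sample characters are chosen only for illustration.

```python
def utf16_bits(ch: str) -> int:
    """Number of bits needed to code one character in the UTF-16 encoding."""
    return len(ch.encode("utf-16-be")) * 8


print(2 ** 16)  # number of distinct 16-bit codes

# A Latin letter, the German sharp s, a Cyrillic letter, and a character
# from beyond the first 65 536 codes (written as an escape for portability).
for ch in ["A", "\u00df", "\u042f", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf16_bits(ch), "bits")
```

The last character needs two 16-bit codes, which is the extension mechanism at work.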
As far as the Latin alphabet is concerned, there are similarities between ASCII and Unicode. For example, the upper-case letter A in Unicode is represented by
0000 0000 0100 0001
The last 7 bits of this are identical to the 7-bit ASCII code for the same letter.
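The relationship between the two codes for the letter A can be checked with a small Python sketch. The variable names are invented for this example, and the final check over all 128 ASCII codes goes slightly beyond the single letter discussed above.

```python
unicode_16 = format(ord("A"), "016b")  # 16-bit code for upper-case A
ascii_7 = format(ord("A"), "07b")      # 7-bit ASCII code for upper-case A

print(unicode_16)
print(ascii_7)
print(unicode_16[-7:] == ascii_7)      # do the last 7 bits match?

# The same comparison for every 7-bit ASCII code.
all_match = all(format(c, "016b")[-7:] == format(c, "07b") for c in range(128))
print(all_match)
```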
The software run on PCs is slowly changing over to Unicode instead of 8-bit ASCII.
Box 9: Sizes of text files
You may want to send a text file as an attachment to your email. But how large will the file be?
Suppose you type about a hundred words of plain text (just one font, no bold, no underlining, no paragraph formatting, etc.) into your word processor and save the resulting document as a file. In English, words average some five or six letters, and there is a space between each word. So you will be saving about seven hundred characters in all, including spaces and punctuation marks. In ASCII, which uses one byte per character, you might expect a resulting file size of around 700 bytes, but as your computer probably rounds up to the next whole kilobyte (a kilobyte is 1024 bytes) you might expect it to record a file size of 1 kilobyte. Even in a language whose average number of letters per word is greater than in English, you would hardly expect a file size over a kilobyte.
Yet I just tried this with my word processor, and the resulting file size was 20 kilobytes!
Let me hasten to add that when I saved my hundred words as plain text (one of the options offered by my word processor) the text file was indeed 1 kilobyte. So there is nothing wrong with my arithmetic. The difference between the expected and actual file sizes when my word processor saves in its own native format lies in the way the word processor saves files. For instance, it adds a great deal of information of its own (who created the file, what it is called, when it was created, how big it is, what font and type size is being used, etc.). It also puts the user's text and its own additional information into chunks whose (pre-defined) size is quite large. These and other similar aspects of what the word processor saves add a considerable overhead to the file size.
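The arithmetic behind the plain-text estimate in this box can be written out as a short Python sketch. The constant names are invented for the example, and the figure of seven characters per word (about six letters plus a space) is the rough assumption used above, not a measured value.

```python
import math

WORDS = 100
CHARS_PER_WORD = 7   # assumed: about six letters plus one space
BYTES_PER_CHAR = 1   # one byte per character in 8-bit ASCII
KILOBYTE = 1024      # bytes in a kilobyte

estimated_bytes = WORDS * CHARS_PER_WORD * BYTES_PER_CHAR
reported_kb = math.ceil(estimated_bytes / KILOBYTE)  # round up to a whole kilobyte

print(estimated_bytes, "bytes")
print(reported_kb, "kilobyte(s) as reported")
```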
So if formatting doesn't matter you might want to consider sending your text file unformatted. If formatting does matter, you have the option of using a lossless compression technique (such as ‘Zip’) to reduce your file size by a factor of perhaps 3 or 4.
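To get a feel for lossless compression, you could try a sketch like the following, which uses Python's zlib module (it implements the DEFLATE method that is also commonly used inside Zip files). The sample text is made up, and because it repeats one sentence ten times it compresses far better than ordinary prose would, so the ratio printed should not be taken as typical.

```python
import zlib

# Hypothetical sample: one sentence repeated, encoded as one byte per character.
original = ("Text can be represented in a computer by a succession "
            "of binary codes. " * 10).encode("ascii")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before compression")
print(len(compressed), "bytes after compression")
print(restored == original)  # lossless: the original comes back exactly
```

The final line is the important one: whatever the compression ratio, decompressing must return exactly the bytes that went in.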