Representing and manipulating data in computers

4.2 Representing text

Study note: You will need to refer to the Reference Manual while you are working through this section.

Text can be represented in a computer by a succession of binary codes, with each code representing a letter from the alphabet or a punctuation mark. Numerals can also be represented this way, if desired. This can be useful in, say, a word-processing application where no calculations are to be performed and it is convenient to encode a digit in a phrase such as ‘we agreed to meet at 7 o'clock’ in the same way that all the other characters in the sentence are encoded.

Of course, the binary codes that will be used need to be agreed upon in advance. PCs, in common with many other computers, use a code based on the ASCII code to represent letters and numerals, that is alphanumeric characters, together with certain other symbols found on computer keyboards (ASCII is pronounced ‘askey’ and stands for American Standard Code for Information Interchange). The ASCII code, which dates back to the early days of computing, allocates seven bits for each symbol. Because nowadays computers work with 8-bit groups of 1s and 0s (that is, bytes), rather than with 7-bit groups, ASCII codes are often extended by one bit to 8 bits. There is no one standard way of doing this, but the one used in PCs is simply to prefix a 0 to each 7-bit code.
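The extension from 7 to 8 bits described above can be sketched in a few lines of Python. This is only an illustration: the function name to_ascii_bytes is made up for this example, and the grouping into two blocks of four bits simply mirrors the way codes are written in this section.

```python
def to_ascii_bytes(text):
    """Return each character as an 8-bit code: a 0 prefixed to its 7-bit ASCII code."""
    codes = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            # Outside the 7-bit ASCII range, so it has no ASCII code
            raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
        seven_bits = format(code, "07b")   # the 7-bit ASCII code
        eight_bits = "0" + seven_bits      # prefix a 0 to make one byte
        codes.append(eight_bits[:4] + " " + eight_bits[4:])
    return codes


# The digit 7 is encoded in the same way as the letters and the space
print(to_ascii_bytes("at 7"))
# ['0110 0001', '0111 0100', '0010 0000', '0011 0111']
```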

The set of 7-bit ASCII codes is shown in the appendix of the Reference Manual, which you should look at now.

Notice that some of the ASCII codes do not represent a displayable character but instead represent an action (e.g. line feed, tab). These codes are said to represent ‘control characters’. Those characters in the range 0000 0000 to 0001 1111, which are not shown in the appendix, are all control characters.
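As a rough illustration of the control-character range, the following Python sketch tests whether a code falls between 0000 0000 and 0001 1111. The names LAST_CONTROL_CODE and is_control_code are invented for this example, and only the range quoted in the text is checked.

```python
# Upper end of the control-code range quoted in the text: 0001 1111
LAST_CONTROL_CODE = 0b0001_1111


def is_control_code(code):
    """True if the code lies in the range 0000 0000 to 0001 1111."""
    return 0 <= code <= LAST_CONTROL_CODE


# Tab and line feed are actions; space and the letter A are displayable
for name, ch in [("tab", "\t"), ("line feed", "\n"), ("space", " "), ("letter A", "A")]:
    code = ord(ch)
    print(f"{name:10} {code:08b} control={is_control_code(code)}")
```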

Activity 19 (Self assessment)

Write a sequence of binary codes which forms an answer to the following question, using the appendix of your Reference Manual.

  • 0100 1001

  • 0111 0011

  • 0010 0000

  • 0011 0010

  • 0011 1101

  • 0011 0011

  • 0011 1111

Answer

The question is: Is 2=3? So the answer is: No. This is coded as follows:

  • 0100 1110

  • 0110 1111

  • 0010 1110

You may have omitted the full stop, which is fine. If you had ‘no’ instead of ‘No’ then you will have:

  • 0110 1110

  • 0110 1111
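If you would like to check your working for Activity 19, the decoding can be sketched in Python as below. The helper name decode_ascii is hypothetical; it treats each 8-bit group as a binary number and looks up the corresponding character.

```python
def decode_ascii(codes):
    """Turn a list of 8-bit codes such as '0100 1001' into a string."""
    # Remove the space inside each code, read it as a base-2 number,
    # then convert that number to its character
    return "".join(chr(int(code.replace(" ", ""), 2)) for code in codes)


question = ["0100 1001", "0111 0011", "0010 0000", "0011 0010",
            "0011 1101", "0011 0011", "0011 1111"]
answer = ["0100 1110", "0110 1111", "0010 1110"]

print(decode_ascii(question))  # Is 2=3?
print(decode_ascii(answer))    # No.
```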

A significant problem with ASCII is that it cannot cope with languages that use characters outside the basic Latin alphabet, for example the ß character used in German, or several of the Cyrillic characters used in Russian. One solution has been to create national variants of ASCII, but of course this causes problems when files are transferred between different language areas.

A longer-term solution is Unicode, which assigns a unique, standard code for every character in use in the world's major written languages. It also has codes for punctuation marks, diacriticals (such as the tilde ~ used over some characters in, for example, Spanish), mathematical symbols, and so on. As originally designed, Unicode uses 16 bits per character, permitting over 65 000 characters (65 536, to be exact) to be coded. It also allows for an extension mechanism, which pairs two 16-bit codes, to enable about an additional 1 million characters to be coded.

As far as the Latin alphabet is concerned, there are similarities between ASCII and Unicode. For example, the upper-case letter A in Unicode is represented by

0000 0000 0100 0001

The last 7 bits of this are identical to the 7-bit ASCII code for the same letter.
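The relationship between the two codes can be checked with a short Python sketch. The function names group_in_fours and sixteen_bit_code are made up for this example, and the sketch only handles characters whose code fits in 16 bits.

```python
def group_in_fours(bits):
    """Insert a space after every four bits, as in the codes shown in this section."""
    return " ".join(bits[i:i + 4] for i in range(0, len(bits), 4))


def sixteen_bit_code(ch):
    """Return the Unicode code of a character as 16 binary digits."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError(f"{ch!r} needs the extension mechanism, not a single 16-bit code")
    return format(code, "016b")


unicode_a = sixteen_bit_code("A")
ascii_a = format("A".encode("ascii")[0], "07b")   # the 7-bit ASCII code for A

print(group_in_fours(unicode_a))   # 0000 0000 0100 0001
print(unicode_a[-7:] == ascii_a)   # True

# ß has no ASCII code, but it does have a 16-bit Unicode code
print(group_in_fours(sixteen_bit_code("ß")))   # 0000 0000 1101 1111
```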

The software run on PCs is slowly changing over to Unicode instead of 8-bit ASCII.

Box 9: Sizes of text files

You may want to send a text file as an attachment to your email. But how large will the file be?

Suppose you type about a hundred words of plain text (just one font, no bold, no underlining, no paragraph formatting, etc.) into your word processor and save the resulting document as a file. In English, words average some five or six letters, and there is a space between each word. So you will be saving about seven hundred characters in all, including spaces and punctuation marks. In ASCII, which uses one byte per character, you might expect a resulting file size of around 700 bytes, but as your computer probably rounds up to the nearest kilobyte (a kilobyte is 1024 bytes) you might expect it to record a file size of 1 kilobyte. Even in a language whose average number of letters per word is greater than in English, you would hardly expect a file size over a kilobyte.

Yet I just tried this with my word processor, and the resulting file size was 20 kilobytes!

Let me hasten to add that when I saved my hundred words as plain text (one of the options offered by my word processor) the text file was indeed 1 kilobyte. So there is nothing wrong with my arithmetic. The difference between the expected and actual file sizes when my word processor saves in its own native format lies in the way the word processor saves files. For instance, it adds a great deal of information of its own (who created the file, what it is called, when it was created, how big it is, what font and type size is being used, etc.). It also puts the user's text and its own additional information into chunks whose (pre-defined) size is quite large. These and other similar aspects of what the word processor saves add a considerable overhead to the file size.

So if formatting doesn't matter you might want to consider sending your text file unformatted. If formatting does matter, you have the option of using a lossless compression technique (such as ‘Zip’) to reduce your file size by a factor of perhaps 3 or 4.
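The plain-text estimate in Box 9 can be reproduced with the Python sketch below. The function names and the figure of six letters per word are assumptions chosen to match the rough numbers in the box; the two-bytes-per-character case is added only for comparison with a 16-bit code.

```python
import math

BYTES_PER_KILOBYTE = 1024


def estimate_plain_text_size(words, letters_per_word, bytes_per_char=1):
    """Estimate the size in bytes of plain text, allowing one space per word."""
    characters = words * (letters_per_word + 1)
    return characters * bytes_per_char


def round_up_to_kilobytes(size_in_bytes):
    """Round a size in bytes up to a whole number of kilobytes."""
    return math.ceil(size_in_bytes / BYTES_PER_KILOBYTE)


ascii_size = estimate_plain_text_size(100, 6)
print(ascii_size, round_up_to_kilobytes(ascii_size))           # 700 1

# The same text stored with two bytes per character
two_byte_size = estimate_plain_text_size(100, 6, bytes_per_char=2)
print(two_byte_size, round_up_to_kilobytes(two_byte_size))     # 1400 2
```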
