Study Sessions 10 and 11 have given you some background information about the community survey which you will be undertaking in your kebele. In this study session you will learn techniques of data collection and how to manage and analyse data.
You need to approach your community survey in a systematic and organised way. If data are collected haphazardly, they will be of little value to you or the community. The first step, before you start collecting data, is to plan your survey and prepare resources such as data collection forms. The forms and other records need to be standardised so that you collect information uniformly from all the respondents. This is particularly important if some of the data is being collected by volunteers in your community; you need to ensure they all follow the same procedures. The need for good organisation continues after the initial data collection stage; for example, the completed forms will need to be stored in an organised way (Figure 12.1).
When you have studied this session, you should be able to:
12.1 Define and use correctly all of the key words printed in bold. (SAQs 12.1 and 12.2)
12.2 Describe various techniques for collecting data and state their uses and limitations. (SAQ 12.3)
12.3 Explain how bias can occur in data collection and how it can be avoided. (SAQs 12.1 and 12.4)
12.4 Describe basic concepts and procedures required for data analysis and interpretation. (SAQs 12.2 and 12.5)
12.5 Identify ethical issues involved in data collection as part of a community survey. (SAQ 12.6)
Data collection methods may vary according to whether you adopt a quantitative or qualitative approach. A quantitative approach to data collection usually uses structured questionnaires, while a qualitative approach uses unstructured interviews or discussions (see Section 12.1.2). If the purpose of the data collection is to assess how widespread a problem is, or how many people are affected by a disease, or if you want to use the data to describe a particular group of people, then you will need quantitative data. On the other hand, qualitative data may be more appropriate if your plan is to:
You will also need to consider how the data will be processed, analysed and interpreted, otherwise collecting it will serve no purpose. Thinking about what you are going to do with the collected data before you start will help to ensure that nothing important is missed out. Other aspects to consider are how to fit the data collection into your work plan, whether there are cost implications and whether you have sufficient budget, and whether there might be any ethical considerations to address.
When you are planning your community survey, the first decision will involve the method of data collection to be used. Methods of collecting community survey data include:
This study session will introduce you to these methods of data collection.
Observation of human behaviour is a commonly used data collection technique; however, it is time consuming, so it is most often used in small-scale surveys. The observation method of data collection simply means gathering information through your own direct observation, without asking the respondent any questions. It is important to record your observations carefully using a checklist. The purpose of using a checklist is to make your observation as objective as possible, so that you note down what you see in a consistent way when you are observing different people.
Interviewing involves oral questioning of respondents, either individually or as a group. This is a face-to-face or personal interview method in which one person, the interviewer, asks questions of another person, the respondent. The questions are usually initiated by the interviewer, who then records the responses, as shown in Figure 12.2.
The collection of information through personal interviews is usually carried out in a structured way. Structured interviews involve the use of a set of predetermined questions in an interview schedule (list of questions) and use standard techniques of recording the respondent’s answers. The answers are usually written in a notebook (ideally a tape recorder would be used, but this is not always available). The interviewer asks the questions in a prescribed order and the respondent gives answers in their own words. The interviewer is allowed to ask ‘follow-up’ questions only if something the respondent says is not clear, or if the question wasn’t understood, but otherwise keeps to the questions on the interview schedule. An example of a possible structured interview question, with a follow-up question, is given below:
‘If you or a female relative is expecting a baby, would you prefer the labour and delivery to be at home or in the Health Post? Can you say why?’
Note that when presenting questions like the one above, it is important not to ‘prompt’ the respondent (i.e. suggest or hint at a possible answer) because this might influence their response. They may try to give you the answer they think you want to hear.
In contrast, unstructured interviews are characterised by a flexibility of approach to questioning. An example of an unstructured interview question is given below:
‘Please tell me about giving birth to your first child.’
In unstructured interviews you do not follow a system of predetermined questions, but simply begin a conversation with the respondent on a particular topic. The respondent is free to explore the topic in their own words and in their own way, without being restricted by specific questions that must be answered. The interviewer can prompt the respondent to say more with phrases such as ‘Tell me more about that’ or ‘This is interesting – please go on’, but does not ask specific questions about the topic.
A written questionnaire is a data collection tool in which written questions are presented to be answered by the respondents in written form. The questions are directed towards collecting simple factual information, which can be answered either by writing a few words on the questionnaire, or ticking a box next to the chosen answer from a list of options. You can use this form of data collection in many different ways, for example:
As with questions presented in interviews, the questions on a written questionnaire can be either structured or unstructured, but they are always simple to answer directly. Questionnaires do not usually seek complex information about people’s attitudes, beliefs or preferences, or explanations about why they behave in a certain way. (Complex information is best collected through interviews or focus groups.)
In a written questionnaire, the following question was asked:
From which of the following sources do you get your water? Tick all options that apply to you.
E Another source
Is this a structured or unstructured question?
It is a structured question because a rigid choice of answers is presented and the respondent must choose from them.
How would you ask this same question in an unstructured way? How are the answers recorded, and what further questions might this enable you to ask?
You could ask ‘Where do you get your water from?’ This is an unstructured question because there are no prepared responses already written down. The respondent either writes their answer in their own words on the questionnaire, or the interviewer writes it for them on the questionnaire. Further questions you may have thought of might include:
A ‘How far do you have to go to collect your water?’
B ‘How often do you collect water?’
C ‘How long does it take you to collect water?’
The unstructured question therefore enables you to explore the respondent’s answer further. Note that all the questions require very simple factual answers, e.g. (in the example above) the answers might be:
A ‘Two kilometres’,
B ‘Once a day’,
C ‘Two hours’.
Table 12.1 summarises the advantages and disadvantages of the methods of collecting data that you have learned about so far.
A focus group discussion is a loosely structured interview conducted by an experienced moderator with a small number of people who all sit together at the same time in the same place. For a focus group discussion the participants will be guided through an unstructured, spontaneous discussion on a particular topic. The information obtained is qualitative data.
The ideal characteristics for a focus group are as follows:
Focus group discussions can offer an effective qualitative data collection method for a number of reasons. They are good for generating ideas; for example, they may act as a starting point for introducing a new product (e.g. condoms) or for discussing ideas, uses or improvements. Focus group discussions can reveal community needs, perceptions and attitudes to health services that are currently provided. They can therefore be used to assess needs and gaps, and enable the service-provider team to rethink the way they operate in order to improve the service. The discussions can also be useful for evaluating programmes and guiding programme development.
The qualitative information obtained from focus group discussions is likely to be in the form of written or spoken text. The best way to analyse such information is generally to try to identify central concepts or themes which came out of the discussions. The qualitative information obtained from such discussions may complement data collected by quantitative methods.
If you ‘hand pick’ your study subjects when you are collecting data, then it is likely that you are introducing bias in your study. Bias in data collection is a distortion which results in the information not being truly representative of the situation you are trying to investigate. Sources of bias can be prevented by carefully planning the data collection process.
Can you think of a way that bias might be accidentally introduced into a survey?
In interviews, when you are asking questions, it is important not to prompt respondents into giving particular answers because this could introduce a source of bias.
To avoid bias you need to collect data as objectively as possible, for example, by using well-prepared questions that do not lead respondents into giving a particular answer. If you are selecting a sample of people for your research (i.e. not including everyone) then you must ensure the sample is representative of the population or group you are studying. If you are using volunteers to help in collecting data, you should ensure that everyone is collecting and recording data in the same way and that they all understand the need to avoid prompting respondents towards particular answers.
Once you have collected your data, you are ready to start processing and analysing it.
Data processing refers to recording or entering your data (e.g. on to a master sheet or computer), and data checking and correcting. You may be concerned about the quality of some of the data which has been collected. For example, some of your data will probably have been collected by the volunteers who are helping you and it is possible that some may not clearly understand the objective of the data collection, and may be recording it in different ways. It is important to check your data for consistency and missing values as you collect it, and once collected, check again for errors.
No matter how carefully the data have been collected, some errors are inevitable. Errors (mistakes) can result from incorrect reading of the data, incorrect reporting, incorrect filing or incorrect typing. In addition, the data entered may be incomplete (some of the data was never collected, or has been lost). The aim of the checking process is therefore to produce a reliable set of data that you can be confident is accurate for the purposes of your analysis.
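If the data have been entered on a computer, checks of this kind can be partly automated. The following is a minimal sketch in Python, assuming each completed form has been entered as one record. The field names, the example records, the gender codes and the plausible age range of 0 to 120 years are all assumptions made for this illustration, not part of any particular survey design.

```python
# Hypothetical records entered from completed forms (invented for illustration).
records = [
    {"respondent": 1, "gender": "F", "age": "34", "water_source": "Protected well"},
    {"respondent": 2, "gender": "M", "age": "", "water_source": "River"},
    {"respondent": 3, "gender": "X", "age": "250", "water_source": "Protected spring"},
]

REQUIRED_FIELDS = ["gender", "age", "water_source"]
ALLOWED_GENDERS = {"M", "F"}  # assumed coding agreed with the data collectors


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Missing values: a required field is absent or left blank.
    for field in REQUIRED_FIELDS:
        if str(record.get(field, "")).strip() == "":
            problems.append(f"missing {field}")
    # Consistency: gender must use one of the agreed codes.
    gender = record.get("gender", "")
    if gender and gender not in ALLOWED_GENDERS:
        problems.append(f"unexpected gender code {gender!r}")
    # Plausibility: age must be a whole number in the assumed range 0 to 120.
    age = str(record.get("age", "")).strip()
    if age:
        if not age.isdigit() or not 0 <= int(age) <= 120:
            problems.append(f"implausible age {age!r}")
    return problems


for record in records:
    for problem in check_record(record):
        print(f"Respondent {record['respondent']}: {problem}")
```

A check like this only flags records for attention. Deciding whether a flagged value is a genuine error still means going back to the original form or to the person who collected it.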
Once the data has been checked for errors and completeness, all the answers of individual respondents are entered on a data master sheet. An example is shown in Table 12.2.
|Individual respondent no.||Q 1 Gender||Q 2 Ethnicity||Q 3 Age||Q 4 Education||Q 5 Marital status||Q 6 Occupation||Q 7 House type||Q 8 Water source|
|2||M||Tigre||67||7th grade||Married||Merchant||Tukul||Protected well|
|7||F||Hadiya||56||6th grade||Widowed||Housewife||Corrugated iron||Protected spring|
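If a computer is available, a master sheet like this can be kept as a simple file with one row per respondent. The sketch below, in Python, uses the two respondents shown above. The column names are shortened versions of the table headings chosen for this illustration, and the sheet is written to memory rather than to a real file.

```python
import csv
import io

# Column names are shortened forms of the Table 12.2 headings (an assumption).
COLUMNS = ["respondent", "gender", "ethnicity", "age", "education",
           "marital_status", "occupation", "house_type", "water_source"]

# The two respondents shown in Table 12.2 above.
rows = [
    [2, "M", "Tigre", 67, "7th grade", "Married", "Merchant",
     "Tukul", "Protected well"],
    [7, "F", "Hadiya", 56, "6th grade", "Widowed", "Housewife",
     "Corrugated iron", "Protected spring"],
]

# Write the master sheet as comma-separated text, one row per respondent.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerows(rows)
master_sheet_text = buffer.getvalue()

# Read it back as dictionaries, so each answer can be looked up by column name.
master_sheet = list(csv.DictReader(io.StringIO(master_sheet_text)))
print(master_sheet[0]["water_source"])
```

Note that values read back from the file are text, so a number such as age has to be converted before it can be used in calculations.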
The data in Table 12.2 is for only seven people. Imagine how large the table would need to be for a whole community! Analysing data enables you to present information in a clearer and more useful way. Data analysis means describing and summarising your findings in an unbiased way. The results obtained from the analysis will not only help you to meet your community survey objectives, they will also enable you to:
To analyse your data, you first need to identify the type of data you have. You may have collected quantitative or qualitative data. Qualitative data use names or descriptions to describe variables, while quantitative data usually use numbers. A variable is any measured characteristic or attribute that differs between different people, households, etc.
Give an example from Table 12.2 of quantitative data and an example of qualitative data.
An example of quantitative data would be the column listing the respondents’ age. An example of qualitative data would be ethnicity, occupation, house type or water source.
Several terms are used to describe types of variable. For some variables, called categorical variables, there are a limited number of possible responses that can be given, in other words, a limited number of categories. For example, ‘gender’ is a categorical variable because it has two categories: ‘male’ and ‘female’. Other variables, known as continuous variables, have lots of different possible responses, though usually within a certain range. For example, age is a continuous variable, within the range of a normal human lifespan.
Variables that are described by a number are, unsurprisingly, also known as numerical variables. For example, the number of new AIDS cases reported during a one-year period, the number of beds available in a particular hospital, or a person’s weight or temperature are all numerical variables.
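As a rough illustration of this distinction, the Python sketch below treats a variable as numerical if every recorded value can be read as a number, and as categorical otherwise. This rule of thumb is an assumption made for the example (a categorical variable that happened to be coded as 1, 2, 3 would be misclassified), and the records are two respondents from Table 12.2 with shortened column names.

```python
# Two respondents from Table 12.2; keys are shortened column names (an assumption).
records = [
    {"gender": "M", "age": "67", "marital_status": "Married",
     "water_source": "Protected well"},
    {"gender": "F", "age": "56", "marital_status": "Widowed",
     "water_source": "Protected spring"},
]


def is_number(value):
    """Return True if the text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def variable_type(records, name):
    """Classify a variable using the simple rule of thumb described above."""
    values = [record[name] for record in records]
    return "numerical" if all(is_number(v) for v in values) else "categorical"


for name in records[0]:
    print(name, "->", variable_type(records, name))
```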
Of the variables given in Table 12.2, gender is one categorical variable. Can you find another?
Another categorical variable would be marital status, because everyone can be categorised as single, married, divorced, widowed or cohabiting (living together without being married).
‘Blood group’ is a variable. People may have one of four blood groups and these are A, B, AB and O. Is blood group a categorical or a continuous variable?
Blood group is a categorical variable because it has four categories. Each person has one of the four blood groups – A, B, AB or O.
At times, you may find it useful to transform numerical data into categorical data. You can do this by dividing the range of values of the variable into intervals, i.e. by grouping the data. For example, the numerical variable ‘age’ might be transformed into a categorical variable ‘age group’, which consists of categories such as under 30 years, 30–44, 45–59 and 60 years and over. This transformation is useful if the researcher is interested in the number of people falling into each of these four categories (Figure 12.3).
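The grouping step can be sketched in a few lines of Python. This is only an illustration: the list of ages is invented, and the sketch assumes that a person aged exactly 60 belongs in the oldest group, so that every age falls into exactly one category.

```python
from collections import Counter


def age_group(age):
    """Return the age group label for an age in whole years."""
    if age < 30:
        return "under 30"
    if age <= 44:
        return "30-44"
    if age <= 59:
        return "45-59"
    return "60 and over"  # assumption: age 60 itself belongs to the oldest group


# Hypothetical ages, invented for illustration.
ages = [19, 25, 30, 44, 45, 52, 59, 60, 73]

# Count how many people fall into each category.
counts = Counter(age_group(age) for age in ages)
for label in ["under 30", "30-44", "45-59", "60 and over"]:
    print(label, counts[label])
```

Whatever boundaries you choose, the important point is that the categories do not overlap and do not leave gaps.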
Suppose you find that the ages of a group of people you interviewed about tuberculosis in your kebele are as shown in Table 12.3. How many of these people would be in each of the age groups under 21, 21–30, 31–40, 41–50, 51–59 and over 60? Put your answers in Table 12.4a. Which age category has the most people in it?
|Age (years)||Number of people|
|Age group (years)||Number of people|
Your completed table should look like Table 12.4b below. The age group with the most people in it is the 21–30 years category, with 24 people.
|Age group||Number of people|
We mentioned above that a complete set of raw (unanalysed) data from a whole community survey would be large and unmanageable. You need to summarise the findings so that they are useful to you and others. In this section, we will describe some of the most common methods for summarising quantitative data.
Frequency means the number of times an event occurs or the number of responses in a particular category. In other words, a frequency is a count of events in a given time frame. For example, if you report ‘Our Health Post sees 130 patients each month’, the frequency of patients seen is 130 per month.
Frequency data is often presented in tables, graphs or pie charts.
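Counting frequencies by hand is straightforward for a small survey, but the same tally can be made by computer. The sketch below, in Python, counts how often each answer occurs in a list of responses; the responses themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical answers to a question about water source (invented data).
responses = ["Protected well", "River", "Protected well", "Protected spring",
             "River", "Protected well"]

# Count the number of responses in each category.
frequencies = Counter(responses)

# Print a simple frequency table, most common answer first.
for source, count in frequencies.most_common():
    print(f"{source:<20}{count}")
```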
Suppose you find that in a particular area, 14 out of 25 adults aged under 30 years have had malaria, whereas 19 out of 25 adults between the ages of 30 and 50 years, and 20 out of 25 adults over 50 years, have had malaria. Present these data in the form of a table.
Your table should look something like Table 12.5.
|Age category||Number of people sampled||Number who have had malaria|
|over 50 years||25||20|
|30 to 50 years||25||19|
|under 30 years||25||14|
To summarise numerical variables, there are three measures that are commonly used: mean, median and mode. To explain how to proceed with these measures, let’s look at some examples.
The mean is the average of a series of measurements or scores. To calculate the mean, you add up all the individual measurements or scores and then divide this total by how many scores there are (it is the sum divided by the number of individual values). Although the mean is the most commonly used of the measures mentioned here, the median or the mode may sometimes be more appropriate. The median is a measure of central location, where half of the measures are below and the other half are above this value. The mode is the most common result (the most frequent value) of a test, survey or experiment.
For example, imagine a school exam taken by 10 students with possible scores from 0 to 100. Nine students score 95 but one person scores 5. The mean is calculated by adding up the total scores (9 × 95 + 5 = 860) and dividing by the number of scores (10), which gives a mean of 86. That one person with the low score really throws off the final statistic! The median, however, is 95 and in this case is a better description of how most people did in the exam. The mode would be the most common score which would also be 95 in this example. In this case the median or mode might be more useful than the mean.
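If you have access to a computer, the three measures for this exam example can be checked with Python's standard `statistics` module. This is a minimal sketch using the scores described above.

```python
from statistics import mean, median, mode

# Nine students score 95 and one scores 5, as in the example above.
scores = [95] * 9 + [5]

print(mean(scores))    # the sum (860) divided by the number of scores (10)
print(median(scores))  # the middle of the scores once they are put in order
print(mode(scores))    # the most frequent score
```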
Seven farmers in your kebele keep goats (Figure 12.4). You record how many goats each farmer has and the results are 8, 1, 3, 7, 1, 6 and 9. What is the mean number of goats owned by these farmers and what is the median number?
The mean is 5. It is the sum of the scores (35) divided by the number of farmers (7). If you put the numbers in order they are 1, 1, 3, 6, 7, 8, and 9. The middle value is 6 and therefore the median is 6.
Note the difference in the values between the mean and median. The mean or average can be influenced by extreme or outlying values at either end of the scale, but the median is not. If the number of values is even, there isn’t a middle value, so to calculate the median you take the mean of the two middle numbers.
For example, if there were only six farmers and the numbers of goats they owned were 10, 12, 14, 16, 18 and 20, the two middle numbers are 14 and 16, so the median is (14 + 16) divided by 2, which equals 15.
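The steps for finding the median, for both an odd and an even number of values, can be written out as a short Python function. This is a sketch of the procedure described above, applied to the two goat examples; it assumes the list contains at least one value.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # put the values in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: take the middle value.
        return ordered[middle]
    # Even number of values: take the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([8, 1, 3, 7, 1, 6, 9]))     # seven farmers
print(median([10, 12, 14, 16, 18, 20]))  # six farmers
```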
Supposing the numbers of goats owned by the seven farmers are 3, 4, 7, 7, 7, 9 and 10. What is the mode of the numbers of goats?
Looking at these scores, you can see that 7 is the most common number of goats because three farmers have 7 goats. The mode of the numbers (also referred to as the modal number) of goats is therefore 7.
Sometimes it is more appropriate to think about the modal number since this represents the most common situation.
A proportion, sometimes called relative frequency, is simply the number of times the observation occurs in the data, divided by the total number of responses. Proportions are very often converted to percentage values because this makes comparison easier between different sets of data. Percentage means the number of occurrences or responses, as a proportion of the whole, multiplied by 100. For example, if 30 people respond to a survey out of a total of 100, the frequency of respondents is 30, the proportion is 30/100 or 0.3, and the percentage of respondents is 30%. If the total number of people surveyed was only 60 and there were 30 respondents, the proportion is 30/60 or 0.5, and the percentage of respondents is 50%.
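The two calculations can be sketched as small Python functions, using the figures from the example above. The function names are chosen for this illustration only.

```python
def proportion(count, total):
    """Number of occurrences divided by the total number of responses."""
    return count / total


def percentage(count, total):
    """The proportion expressed out of 100.

    Multiplying by 100 before dividing gives the same result as
    proportion * 100, and avoids small rounding artefacts.
    """
    return 100 * count / total


print(proportion(30, 100), percentage(30, 100))  # 30 respondents out of 100
print(proportion(30, 60), percentage(30, 60))    # 30 respondents out of 60
```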
Approximately what percentage of the seven people whose answers are summarised in Table 12.2 are illiterate?
Three of the seven people are illiterate. So the percentage of people who are illiterate is 3/7 × 100% which is approximately 43%.
Go back to Table 12.5, which showed the number of people who have had malaria in different age categories, and add a column to show the percentage of each age category that have had malaria. Which age category has the highest percentage of people who have had malaria?
Your table should look something like Table 12.5 below.
|Age group (years)||Number of people sampled||Number who have had malaria||Percentage who have had malaria|
|over 50||25||20||80%|
|30 to 50||25||19||76%|
|under 30||25||14||56%|
To calculate the percentage, you take the number who have had malaria and divide it by the total number sampled, then multiply your answer by 100. For example, in Table 12.5, for those aged over 50 years, the calculation is 20 (who have had malaria), divided by 25 (people sampled) × 100%, which is 80%. This is the age group with the highest percentage of people who have had malaria.
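The same calculation can be repeated for every age category in a short Python sketch. The figures are those given for the malaria example above; the variable names are chosen for this illustration. The number of people sampled is printed next to each percentage so that the reader can see how many observations it is based on.

```python
# Malaria example: (age category, number sampled, number who have had malaria).
malaria_data = [
    ("over 50 years", 25, 20),
    ("30 to 50 years", 25, 19),
    ("under 30 years", 25, 14),
]

percentages = {}
for category, sampled, had_malaria in malaria_data:
    percentages[category] = 100 * had_malaria / sampled
    # Report the number sampled alongside each percentage.
    print(f"{category}: {percentages[category]:.0f}% (n = {sampled})")

# Find the category with the highest percentage.
highest = max(percentages, key=percentages.get)
print("Highest percentage:", highest)
```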
Table 12.6 shows the percentage of women in each of four age groups in a certain population. It shows that more women fall in the age group 30–40 years than in any other category.
|Age group||Number of women||Percentage of total|
In the example in Table 12.5, only 25 people in each age group were sampled. When reporting percentages, you should also always report how many observations there were. For example, if you say that 50% of women seen by the clinic this month had diabetes, it is important to know how many women were seen. If it is 50% of 500 women, this means that 250 women with diabetes were seen, but if it is 50% of two women, then only one woman with diabetes was seen!
The cumulative percentage for a given category means the percentage of people who fall into that category, or a lower category. To work out the cumulative percentage for each category, you just have to add the percentage for that category to all of the percentages for the categories which are lower. Table 12.7 shows an example of cumulative percentages using the same data as in Table 12.6. It is a way of presenting the same data in a more descriptive way.
|Age group||Number of women||Percentage of the total||Cumulative percentage|
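The running total used for cumulative percentages can be sketched in Python as follows. The age groups and counts here are invented for illustration and are not the figures from Table 12.6; the groups must be listed from lowest to highest for the running total to make sense.

```python
from itertools import accumulate

# Hypothetical counts of women per age group, lowest group first
# (invented for illustration; these are not the figures from Table 12.6).
age_groups = ["15-19", "20-29", "30-40", "over 40"]
counts = [10, 30, 40, 20]

total = sum(counts)
percentages = [100 * count / total for count in counts]

# Each cumulative percentage is the percentage for that group
# plus the percentages for all lower groups.
cumulative = list(accumulate(percentages))

for group, pct, cum in zip(age_groups, percentages, cumulative):
    print(f"{group:<10}{pct:>6.1f}%{cum:>8.1f}%")
```

The cumulative percentage for the highest category should always come to 100%, which is a useful check on your arithmetic.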
You have learned about the ethical issues that you need to be aware of in your role as a Health Extension Practitioner in Study Sessions 7, 8 and 9. These issues must also be considered in the context of research. There are many established codes of practice that cover the ethics of research. These are codes that protect the rights of respondents either in research or in a community survey. Some of the widely accepted ethical principles include:
Furthermore, as we develop our data collection techniques, we need to consider whether our data collection procedures are likely to cause any physical or emotional harm. Harm may be caused, for example, by:
You will need to be aware of these ethical considerations when you collect data for your community survey or in other research. For example, in questionnaires, it may be advisable to omit names and addresses if sensitive questions are asked about such things as family planning or sexual practices, or about opinions of patients on the health services provided. Some other suggestions for dealing with difficult ethical considerations are:
In Study Session 12, you have learned that:
Now that you have completed this study session, you can assess how well you have achieved its Learning Outcomes by answering the following questions. Write your answers in your Study Diary and discuss them with your Tutor at the next Study Support Meeting. You can check your answers with the Notes on the Self-Assessment Questions at the end of this Module.
Explain what is meant by bias in the collection of data, and why it should be avoided.
Bias is a distortion of information during data collection. Biased data collection does not show the true situation that you are trying to investigate so should be avoided if possible.
In a survey of ten households, the numbers of children in each family were found to be:
3, 1, 6, 4, 0, 3, 3, 5, 8, 4.
a. The mean number of children per household is 3.7. To calculate the mean you add up all the numbers of children, which comes to 37, and divide by the number of households, which is 10.
b. The median number is 3.5. To calculate the median you rearrange the data in order: 0, 1, 3, 3, 3, 4, 4, 5, 6, 8. In this case, because there are an even number of records, there is no middle number so you have to take a mean of the two middle numbers, which are 3 and 4.
c. The modal number is 3. This occurs three times whereas other numbers occur no more than twice.
d. The proportion of families with more than three children is 5 out of 10. You could simplify this to say half the families have more than three children.
e. Three families have more than four children so the percentage is 3 divided by 10, multiplied by 100, which equals 30%.
Now read Case Study 12.1 and then answer the questions that follow it.
You suspect that a large proportion of women and children in your kebele are malnourished, in particular women of childbearing age. You would like to determine the extent of this problem, and whether women perceive it as a problem. Furthermore you would like to know whether the women themselves could contribute to improving their nutritional status and how they might do this.
What data collection methods might be appropriate to collect data for this investigation?
The data required is qualitative because it includes the women’s perceptions and opinions. Interviews and focus group discussions with women could be used to collect this data. Written questionnaires can also be used; however, this will only be suitable if all of the women are literate.
Describe some biases that could occur during collection of data on nutritional problems of women and children in a situation like the one described in Case Study 12.1. How could these biases be avoided?
If data are collected using interviews, then the questions would need to be well prepared and devised so they did not lead to particular answers. All interviewers would need to receive appropriate training to ensure that they record the answers in the same way. Bias could also occur if respondents are prompted when answering questions. Respondents should not be handpicked, but selected according to consistent criteria.
What sort of checks should be done on the data which has been collected before it is analysed and interpreted?
It is important to check data for consistency and missing values. You should check for errors in order to ensure that the data are reliable before you start to analyse and interpret the data.
What ethical issues might you encounter while collecting data on the nutritional problems of women and children in Case Study 12.1?
It would be important to establish a relationship with, and to obtain informed consent from, each mother before you start to ask a lot of questions. You would have to be aware that nutritional status might be a sensitive issue.