Chapter 2.2: Finding the right data
This chapter explains what to do if, after defining a problem, you discover that data needed to solve it is not readily available.
We previously discussed how, in order to answer a policy question, you may need to transform it into one or more research questions. The transformation process does not end there: to proceed, you also need to reformulate the research questions as data questions. This step comes with its own challenges, because finding the right datasets can be difficult and time-consuming. In the worst case, you will discover that the required data does not exist or, if it does, cannot be used because of compatibility issues.
Embarking on a data discovery journey
The data discovery process should begin with a simple question: “What data do I need?” A good starting point is an extended baseline analysis of the datasets that your organisation currently has access to. During the stock-take, make sure to list all datasets, including the idle ones, e.g. a dataset that is too big to process on a personal computer.
In parallel, try to organise one or more brainstorming sessions with your colleagues. Write down every single idea that emerges from the creative process and don’t worry too much about whether a suggested dataset exists or not. This can be verified at the next stage.
It is also a good idea to consult external stakeholders through round tables, surveys and/or focus groups. Once your longlist of datasets is ready, it’s time to check which ones are available.
The data does not exist, yet
It is probably unavoidable that some datasets on your longlist will not exist. What we learned in PoliVisu is that it is difficult to start collecting new data on your own from scratch. As most projects tend to have relatively short time horizons, such an initiative may be neither feasible nor practical, and so it is advisable to consider existing sources even if they come from someone else.
The data exists, but...
The fact that data exists does not automatically mean you can use it. For one, the data owner could be another public body, in which case a data sharing agreement is needed. For another, if the data is owned by a private company, it will most likely not want to give it away for free. Regardless of who you want to get data from (public or private), it is always a good idea to seek advice from your legal department and the Data Protection Officer before proceeding with an agreement/deal.
Even if the data is owned by your organisation, harnessing its full potential can be difficult because of, for example:
Substandard quality as evidenced by faults, errors, missing values etc.
High level of complexity that can only be tackled by experienced data scientists (which raises a data literacy issue)
Outdated records and missing metadata
It is only when all these issues are addressed that you can start using existing data for your policy/research questions.
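As an illustration only, some of the quality issues listed above (missing values and outdated records) can be measured with a few lines of Python before deciding whether a dataset is usable. This is a minimal sketch, not a PoliVisu tool: the function name audit_dataset, the field names, the one-year freshness threshold and the sample records are all hypothetical and should be replaced with whatever fits your own data.

```python
from datetime import date


def audit_dataset(rows, required_fields, date_field, today, max_age_days=365):
    """Return simple quality indicators for a list of record dicts.

    All names and the default threshold are illustrative assumptions.
    """
    total = len(rows)
    missing = {field: 0 for field in required_fields}
    outdated = 0
    for row in rows:
        for field in required_fields:
            # Treat absent keys, None and empty strings as missing values.
            if row.get(field) in (None, ""):
                missing[field] += 1
        recorded = row.get(date_field)
        # A record counts as outdated when it is older than the threshold.
        if recorded is not None and (today - recorded).days > max_age_days:
            outdated += 1
    return {
        "records": total,
        "missing_share": {
            field: (count / total if total else 0.0)
            for field, count in missing.items()
        },
        "outdated_share": outdated / total if total else 0.0,
    }


# Hypothetical sample: traffic counts from four sensors.
sample = [
    {"sensor_id": "A1", "count": 120, "measured_on": date(2020, 5, 1)},
    {"sensor_id": "A2", "count": None, "measured_on": date(2018, 1, 15)},
    {"sensor_id": "", "count": 87, "measured_on": date(2020, 4, 20)},
    {"sensor_id": "A4", "count": 45, "measured_on": date(2017, 9, 3)},
]

report = audit_dataset(
    sample, ["sensor_id", "count"], "measured_on", today=date(2020, 6, 1)
)
print(report)
```

A report like this does not fix anything by itself, but it gives a quick, shareable picture of how much cleaning a dataset needs, which can help when deciding whether to involve a data scientist.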
Figure 8. Data discovery process