Learning Outcome 3: Clean and Verify Data


Information Sheet 1.3-1:
Ensuring Data Quality

Cleaning and verifying data is the quality control stage that follows accessing and collating the data. In TESDA's National Competency Assessment, this step ensures that the information used for certification is accurate, consistent, and reliable.

"Cleaning" involves identifying and correcting errors and inconsistencies in the data, while "Verifying" is the act of confirming that the data accurately reflects the original source and the candidate's actual performance. This process is fundamental to maintaining the integrity and credibility of every National Certificate (NC) or Certificate of Competency (CoC) issued.

In practice, data is rarely perfect. "Data errors", also known as data quality issues or dirty data, are inaccuracies and inconsistencies within a dataset that can undermine the reliability of any analysis or decision based on it. Detecting and correcting them is the final preparation step, and understanding the common types of data errors is the first step toward identifying, correcting, and ultimately preventing them.

| Data Quality Issue | Definition | Example from TESDA Forms |
|---|---|---|
| 1. Typographical Errors | Basic and common errors. | Misspellings or typos in the candidate's Name of School/Training Center/Company, or in the Title of Qualification (including Full Qualification, COC, or Renewal). |
| 2. Missing Data | Required fields are left blank. | Candidate's signature missing on the Application Form; candidate's critical information left blank (Personal Email Address, Client Type, Mobile Number, Educational Attainment, Employment Status). |
| 3. Inconsistency | Data conflicts between different sources or entries. | Name spelled "Jhon Santos" on the Application Form but "John Santos" on other forms or documents; a candidate marked "Competent" in all units but given an overall "Not Yet Competent." |
| 4. Invalid Entry | Data is in an incorrect or non-permissible format. | Letters entered in a numeric field such as the Reference Number; an unofficial abbreviation used for a qualification. |
| 5. Accuracy Error | Out-of-range values: data that does not make sense. | A birth year of 1850. |
| 6. Duplication | The same candidate or assessment event is recorded more than once. | A candidate's results are entered twice into the system due to a clerical error. |

Data cleaning is a critical step in the data analysis pipeline: it ensures that datasets are accurate, consistent, and reliable for informed decision-making. It involves identifying errors, duplicates, and inconsistencies, then correcting or removing them.
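Several of the error types described earlier can be checked automatically once the form data is in electronic form. The following is a minimal Python sketch, not an official TESDA procedure. The field names (`email`, `reference_number`, `unit_results`, and so on) and the plausible birth-year range are assumptions made for illustration and should be adapted to the actual forms.

```python
from datetime import date

# Hypothetical field names; the real forms may label these differently.
REQUIRED_FIELDS = ["name", "email", "client_type", "mobile_number",
                   "educational_attainment", "employment_status"]


def find_issues(record, current_year=None):
    """Return a list of data quality issues found in one candidate record."""
    current_year = current_year or date.today().year
    issues = []

    # Missing data: required fields that are absent or blank.
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            issues.append(f"missing: {field}")

    # Invalid entry: the reference number should contain digits only.
    ref = str(record.get("reference_number") or "")
    if ref and not ref.isdigit():
        issues.append("invalid: reference_number")

    # Accuracy error: birth year outside an assumed plausible range
    # (candidate between 10 and 100 years old).
    birth_year = record.get("birth_year")
    if birth_year is not None:
        if not (current_year - 100 <= birth_year <= current_year - 10):
            issues.append("out_of_range: birth_year")

    # Inconsistency: every unit is "Competent" but the overall result is not.
    units = record.get("unit_results", [])
    overall = record.get("overall_result")
    if units and all(u == "Competent" for u in units) and overall == "Not Yet Competent":
        issues.append("inconsistent: overall_result")

    return issues


# Made-up sample record for illustration only.
sample = {
    "name": "John Santos",
    "email": "",
    "client_type": "Industry Worker",
    "mobile_number": "09170000001",
    "educational_attainment": "College Graduate",
    "employment_status": "Employed",
    "reference_number": "12A4",
    "birth_year": 1850,
    "unit_results": ["Competent", "Competent"],
    "overall_result": "Not Yet Competent",
}

print(find_issues(sample, current_year=2024))
```

Checks like these only flag records for attention. Deciding the correct value still requires going back to the source document.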

 

Cleaning Techniques:

Reviewing: This is the initial exploratory phase where you visually inspect the dataset row by row or column by column. It helps detect outliers, formatting issues (e.g., mixed date formats), or logical errors (e.g., negative ages). While time-intensive for large datasets, it's invaluable for small-scale or qualitative data.
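For larger sheets, a simple tally of the values in each column can support the manual review by making unusual spellings and blanks stand out. The sketch below assumes the records are available as Python dictionaries. The function name `profile_column` and the sample data are hypothetical.

```python
from collections import Counter


def profile_column(records, field):
    """Count how often each value appears in one column, blanks included."""
    return Counter(str(r.get(field) or "").strip() or "<blank>" for r in records)


# Made-up sample rows for illustration only.
sample_rows = [
    {"name": "John Santos", "employment_status": "Employed"},
    {"name": "Jhon Santos", "employment_status": "employed"},
    {"name": "Maria Cruz", "employment_status": ""},
]

# Print each distinct value with its count, most frequent first.
for value, count in profile_column(sample_rows, "employment_status").most_common():
    print(f"{value!r}: {count}")
```

In this sample, the tally would list "Employed" and "employed" as separate values, which is a hint that the column needs consistent capitalization.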

Sorting & Filtering: Sorting arranges data alphabetically or numerically (e.g., by name or date), grouping anomalies for easy identification. Filtering then isolates subsets (e.g., rows with specific keywords) to drill down on problems like duplicates or outliers.
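Sorting and filtering as described above can be sketched in plain Python. The records, field names, and values below are made up for illustration. In a spreadsheet, the built-in sort and filter tools achieve the same result.

```python
# Made-up sample records for illustration only.
records = [
    {"reference_number": "1002", "name": "Maria Cruz", "mobile_number": "09170000002"},
    {"reference_number": "1001", "name": "John Santos", "mobile_number": ""},
    {"reference_number": "1002", "name": "Maria Cruz", "mobile_number": "09170000002"},
]

# Sorting: order by name so identical or near-identical entries sit together.
by_name = sorted(records, key=lambda r: r["name"].lower())

# Filtering: isolate rows where a required field is blank.
missing_mobile = [r for r in records if not r["mobile_number"].strip()]

# Filtering for duplicates: collect rows whose reference number was already seen.
seen = set()
duplicates = []
for r in by_name:
    if r["reference_number"] in seen:
        duplicates.append(r)
    else:
        seen.add(r["reference_number"])

print("Missing mobile number:", [r["name"] for r in missing_mobile])
print("Possible duplicates:", [r["reference_number"] for r in duplicates])
```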

 

Cross-Validation: Compare the collated data against the original source documents to verify accuracy. This verification step involves reconciling the cleaned data with primary sources (e.g., scanned original documents or databases) to confirm each entry. It is especially useful for catching transcription errors made during data entry.
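When both the collated data and the source records are available electronically, the comparison can be sketched as a field-by-field check. This is an illustrative sketch only. The function name `compare_with_source`, the matching key `reference_number`, and the sample rows are assumptions. Paper sources still need to be checked by eye.

```python
def compare_with_source(collated, source, key="reference_number"):
    """List differences between collated rows and their source rows.

    Each result is a tuple: (key value, field, source value, collated value).
    """
    source_by_key = {row[key]: row for row in source}
    mismatches = []
    for row in collated:
        original = source_by_key.get(row[key])
        if original is None:
            # The collated row has no matching record in the source.
            mismatches.append((row[key], "<missing in source>", None, None))
            continue
        for field, value in row.items():
            if original.get(field) != value:
                mismatches.append((row[key], field, original.get(field), value))
    return mismatches


# Made-up sample data for illustration only.
source_rows = [{"reference_number": "1001", "name": "John Santos"}]
collated_rows = [{"reference_number": "1001", "name": "Jhon Santos"}]

for ref, field, expected, found in compare_with_source(collated_rows, source_rows):
    print(f"{ref}: {field} is {found!r} but the source says {expected!r}")
```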

 

Secure Storage and Sign-Off: Once the data is cleaned and verified, it must be stored securely. This means saving it in the correct, access-controlled location. The final step is to obtain a sign-off/clearance from your supervisor, which confirms that the data has been prepared in accordance with the required standards.