Data Quality Guideline
Last updated
Discovery Loft applies the seven data quality dimensions specified in Figure #1 (section 5.1) across all PAVE operations. Together, these dimensions cover the measurable quality aspects used to define critical and shared data assets.
When assessing and improving data quality in each of the following dimensions, quality issues are often related to people, processes, technology, or a combination of the three.
Figure #1 - Data Quality Dimensions
Each dimension is divided into properties (sub-sections). These properties define the specific characteristics considered when evaluating each dimension; each property represents a single, testable measure of quality. The seven dimensions and their associated properties are described in this section. Use the appropriate asset to populate the assessment template found in this Data Quality Management Plan. These dimensions should be applied consistently across all of Discovery Loft's operations to allow for an accurate consensus between our teams, our users, our partners and our customers.
Is the data accurate and valid, and to what level?
Accuracy refers to how correctly the data portrays the real-world situation it was originally designed to measure. Data must be meaningful and well defined so it can be interpreted and analyzed correctly. Data must also be valid to be considered accurate: it must conform to a defined format and adhere to specific business rules.
Data collection: Errors can occur at multiple stages of the collection process, usually introduced by data providers or during data entry. Various factors may cause these errors, including poor form design, providers not having a clear understanding of the concept or form, and human entry errors. As a result, poor data collection often misrepresents findings and leads to inaccurate conclusions or gaps in a data asset.
Data is commonly understood: When everyone collecting and using the data understands how the information will be used and what the data represents, the risk of entry errors and misinterpretation is reduced. In addition, documenting formats, standards and the intent of each defined data field being collected will ensure the data collected is well understood and accurate.
Process: This relates to any method of collecting or manipulating data, such as changing the data's format or moving its location, whether done automatically by a system or manually by a person. Business processes to monitor and improve data quality should be created; these procedures should be documented, discoverable and, where possible, automated to reduce human error.
People: All staff involved in data collection and processing must be aware of the importance of data quality and of their role in maintaining or improving it. For example, if a user repeatedly has to correct a data asset manually, this feedback should be sent to the Product Team or the data collection provider so any issues in the collection mechanism or data asset can be fixed. This property also relates to the Collection dimension.
Validation: Validations verify that data fits the specific criteria and standards defined for that record type. Validation rules should be documented and accessible to all staff. Automated validations should be implemented in data collection and processing systems to help reduce errors, for example, form or field validation at the collection point to ensure that only allowable values are entered. Rules should be reviewed throughout the lifecycle of all data assets to ensure they remain accurate.
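A minimal sketch of automated field validation is shown below. The field names and rules are illustrative assumptions, not Discovery Loft's actual schema; the point is that each rule is a single, testable check applied at the collection point.

```python
import re

# Hypothetical validation rules for a vehicle record; field names and
# allowable values are assumptions for illustration only.
RULES = {
    "year": lambda v: v.isdigit() and 1900 <= int(v) <= 2100,
    "zip_code": lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
    "mileage": lambda v: v.isdigit(),
}

def validate_record(record: dict) -> list:
    """Return a list of (field, value) pairs that fail their rule."""
    errors = []
    for field, rule in RULES.items():
        value = record.get(field, "")
        if not rule(value):
            errors.append((field, value))
    return errors

clean = {"year": "2018", "zip_code": "90210", "mileage": "42000"}
dirty = {"year": "20x8", "zip_code": "902", "mileage": "42000"}
print(validate_record(clean))  # []
print(validate_record(dirty))  # [('year', '20x8'), ('zip_code', '902')]
```

Because rules live in one documented table, staff can review them alongside the data dictionary rather than hunting for checks scattered through code.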
Format: Format relates to the structure of data. Generalized formats such as large open text fields often increase the risk of errors and decrease the validity of the data. Structured data should be entered in a way that makes it easier to interpret and use once collected. For example, it is easier to process an address structured as separate parts (street number, street name, city, state, zip code) than as a single text field. Where possible, data should also be validated against format rules (e.g. specific data must not contain numbers or special characters) and against known standards (e.g. the ISO standard for VIN data).
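As a sketch of a standards-based format rule, the check below tests the basic structure of a VIN per ISO 3779: 17 characters, letters and digits only, excluding I, O and Q (reserved to avoid confusion with 1 and 0). This is a structural check only; a full implementation would also verify the position-9 check digit used in North America.

```python
import re

# Structural VIN check per ISO 3779: exactly 17 characters drawn from
# A-Z and 0-9, with I, O and Q excluded from the allowed letters.
VIN_PATTERN = re.compile(r"[A-HJ-NPR-Z0-9]{17}")

def is_valid_vin_format(vin: str) -> bool:
    """Return True if the string has the structure of a VIN."""
    return VIN_PATTERN.fullmatch(vin.upper()) is not None

print(is_valid_vin_format("1HGCM82633A004352"))  # True
print(is_valid_vin_format("1HGCM82633A00435O"))  # False (contains O)
```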
How complete is the data? Are there apparent gaps?
Completeness relates to the extent to which the data is complete. The Completeness dimension also reflects the ability to determine which data elements are missing and whether those omissions are acceptable (optional data). Departments must evaluate and understand whether a data asset contains unacceptable gaps. These may limit the data, leading to an increased reliance on assumptions and estimations, or preclude the asset from use altogether. It is also helpful to note the level of completeness, which is relevant where 100% completeness is not required to meet the original purpose of the dataset, and whether the dataset is considered complete at a particular point in time, e.g. the beginning or end of a month.
Process: As with the Accuracy dimension, procedures must be in place to ensure that the data entered is as complete as possible. A review and feedback process, performed automatically by a system or manually by a person, should be implemented to help discover and resolve gaps or coverage issues. These procedures should be documented, discoverable and, where possible, automated to reduce human error.
Gaps: This property relates to determining whether there are any known gaps in the data. These gaps often result from a breakdown in a system or process, or arise when data is collected only for a defined period. For some purposes, 100% completeness may not be required, and this should be noted. However, any known gaps in the data should be documented so that users can compensate for them.
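One simple way to note the level of completeness is a per-field profile, as in the sketch below. The records and field names are illustrative assumptions; the output is the percentage of records with a non-missing value for each field.

```python
# Completeness profile: the share of non-missing values per field,
# using only the standard library. Field names are illustrative.
def completeness(records: list, fields: list) -> dict:
    """Percent of records with a non-empty value for each field."""
    total = len(records)
    return {
        f: round(100 * sum(1 for r in records if r.get(f) not in (None, "")) / total, 1)
        for f in fields
    }

rows = [
    {"vin": "ABC", "color": "red"},
    {"vin": "DEF", "color": ""},
    {"vin": "GHI"},
    {"vin": "", "color": "blue"},
]
print(completeness(rows, ["vin", "color"]))  # {'vin': 75.0, 'color': 50.0}
```

Recording these percentages alongside the data asset lets users judge whether a known gap is acceptable for their purpose.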
Is the dataset representative of the situation or conditions that it refers to?
Representativeness relates to the relevance of the data to the purpose originally defined for its collection or creation. For data to be representative, it must reflect the environment in which it was collected or created and the situation it is attempting to describe. Raw data may not always be representative; however, representativeness can also be achieved using analytical techniques such as weighting.
Coverage: Coverage relates to the proportion of the sample that has been incorrectly included or excluded. To discover this, compare the final results with the expected results, response rates and totals. The coverage should always represent a sufficient sample size and a sufficient breadth of the conditions the data is needed to describe. For example, a small sample of a large overall population may not yield a valid result.
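The comparison of final results against expected totals can be sketched as a per-group check, as below. The groups, counts and the 80% threshold are illustrative assumptions, not values from this guideline.

```python
# Rough coverage check: flag groups whose achieved sample falls below
# a threshold share of the expected count. All values are illustrative.
def coverage_gaps(expected: dict, actual: dict, min_share: float = 0.8) -> list:
    """Return groups whose actual count is under min_share of expected."""
    return [
        g for g, exp in expected.items()
        if actual.get(g, 0) / exp < min_share
    ]

expected = {"sedan": 500, "suv": 400, "truck": 100}
actual = {"sedan": 480, "suv": 250, "truck": 95}
print(coverage_gaps(expected, actual))  # ['suv']
```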
Relevance: This is how well the data reflects the real-world situation and meets users' needs. Working with highly relevant data tells you what you want to know without sorting through irrelevant data and information. The data should be relevant to the situation or environment it is intended to measure or analyze. Customer feedback should be collected and measured to determine whether the data is relevant enough to meet their needs.
Is the data timely, and does it have the appropriate currency?
Timeliness refers to the speed at which the required data is made available, and to the delay between the reference period and when the information is released. Factors that often affect timeliness are the collection method and how the data is processed. Data should always be discoverable, available and accessible throughout the data asset lifecycle for greater internal and external use. Currency and reliability may be affected when delays occur in the provision of data.
Availability: Data must be made available as soon as possible, both internally to PAVE and externally (where applicable). Most delays occur between the reference period the data spans and its release date. Data must be available quickly and frequently to support information needs, meet contract terms and reporting deadlines, and support product and management decisions. In addition, when findings that rely on the data are released, the underlying data should be made available as well.
Currency: Based on users' needs, data that is no longer current may no longer be fit for its intended use, particularly data describing activities that occur within a set period. When data references specific periods, it should be current and released before it is superseded; this is a particular risk for data assets collected on a one-off basis or infrequently. Metadata, including collection dates, collection periods, coverage and the expiry date of the data, should be provided with the data asset.
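A currency check against that metadata can be as simple as the sketch below. The metadata field names (`collected_on`, `coverage_period`, `expiry_date`) are illustrative assumptions about how such metadata might be stored.

```python
from datetime import date

# Flag a data asset whose expiry date is missing or has passed.
# Metadata keys are illustrative, not a prescribed schema.
def is_current(metadata: dict, today: date) -> bool:
    """Return True if the asset has an expiry date that has not passed."""
    expiry = metadata.get("expiry_date")
    return expiry is not None and today <= expiry

asset = {
    "collected_on": date(2023, 1, 15),
    "coverage_period": ("2022-12-01", "2022-12-31"),
    "expiry_date": date(2023, 6, 30),
}
print(is_current(asset, today=date(2023, 5, 1)))  # True
print(is_current(asset, today=date(2023, 8, 1)))  # False
```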
What was the collection method used, and was it consistent?
Data collection methods should always be appropriate for the collected data type. For example, a survey may be a more appropriate collection method than using data entry software to collect specific data. In addition, collection must be consistent, especially if the same data is continuously collected or is to be compared to other data assets.
Data is commonly understood: When those collecting and using data understand its use and meaning, the risk of collection errors is minimized. Definitions of the data itself and of the collection and processing methods used should be made available organization-wide to provide a consistent understanding. This definition should include what data was collected, how it was collected, who collected it, how it was processed (by system or person) and details of any editing or manipulation. When these details are documented over time, data collection remains consistent regardless of the person or system collecting it.
Appropriateness: The most appropriate method possible should be used for recording the data. Possible data collection methods include third-party service providers, forms, and data feeds. The collection method should be chosen based on the level of data quality required for analysis and the level of risk involved. For example, a survey may be suitable for collecting general information from users, but collecting system data may be more appropriate for analytical use. In addition, it is essential to consider the associated risk, e.g. the implications of collecting sensitive or personal data. Finally, the method selected should be documented.
Duplication: Data should not be duplicated across data assets. Where duplication does exist, it should be identified and managed. Maintaining a documented glossary of terms and context helps reduce duplication and ensures data comparisons are appropriate.
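Identifying duplicates across assets can be sketched as grouping records by a normalized key, as below. The choice of VIN as the key and the record shape are assumptions for illustration.

```python
# Group records by a normalized key and report keys that occur more
# than once. The key field (VIN) is an illustrative assumption.
def find_duplicates(records: list, key: str) -> dict:
    """Map each duplicated key value to the records sharing it."""
    seen = {}
    for r in records:
        k = str(r.get(key, "")).strip().upper()
        seen.setdefault(k, []).append(r)
    return {k: v for k, v in seen.items() if len(v) > 1}

rows = [
    {"vin": "1HGCM82633A004352", "source": "feed_a"},
    {"vin": "1hgcm82633a004352 ", "source": "feed_b"},
    {"vin": "5YJ3E1EA7KF317000", "source": "feed_a"},
]
dupes = find_duplicates(rows, "vin")
print(list(dupes))  # ['1HGCM82633A004352']
```

Normalizing case and whitespace before comparison catches duplicates that a naive string match would miss.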
Is the data consistent with other related datasets, standards and formats?
Consistency of data means the data was collected, grouped, structured and stored in a consistent and standardized way. This requires implementing standard concepts, definitions, and classifications across departments with agreed-upon meanings and interpretations. Data must also be consistent for all its potential uses. For example, some data may appear similar but have very different meanings or uses in various departments. Duplication, or different meanings for similar data, often results in confusion or misinterpretation of data and makes it unsuitable for comparison with related assets. It may also be challenging to determine if trends are due to an actual effect or problems with inconsistent data collection.
Comparison with similar data assets: It is essential to have the capacity to make comparisons across multiple data assets. Common data definitions and standards help achieve this. These definitions should be agreed upon and shared across PAVE and our external partners, and any inconsistencies should be managed.
Consistency over time: This refers to tracking a data asset over time. Data must remain comparable with previous data assets even when changes are made to its scope, definition or collection. For example, if data is collected over a period of time, any changes should be tracked. To determine whether comparisons are appropriate, any changes and the frequency or timing of updates must be documented.
Documentation: Any changes to data and the related processes should be documented so they are traceable. Data dictionaries specifying all business rules, specifications and validations, as well as a glossary of terms, should be available and regularly maintained, including any changes made to the data. Changes can include the definition, naming conventions or scope of the data collected, and these may evolve over time. Documentation of each change should record the time and date, the reason for the change, and the name of the person who made it. This allows variations to be mapped between versions, providing the traceability needed for retrospective analysis. This documentation becomes a Data Quality Statement: a summary of known characteristics that may affect a data asset's quality, which informs users of the data asset and enables them to make decisions about its use.
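A change-log entry capturing the fields named above (time and date, reason, and the person making the change) might look like the sketch below. The function name and record layout are illustrative assumptions.

```python
from datetime import datetime
from typing import Optional

# Append a change-log entry recording when a change was made, why, and
# by whom, per the Documentation property. Layout is illustrative.
def log_change(changelog: list, field: str, reason: str, author: str,
               when: Optional[datetime] = None) -> None:
    changelog.append({
        "field": field,
        "reason": reason,
        "author": author,
        "timestamp": (when or datetime.now()).isoformat(timespec="seconds"),
    })

changelog = []
log_change(changelog, "odometer_unit", "switched from miles to km",
           "J. Smith", when=datetime(2023, 3, 1, 9, 30))
print(changelog[0]["timestamp"])  # 2023-03-01T09:30:00
```

Accumulated entries like these form the raw material for the Data Quality Statement that accompanies the asset.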
Data is commonly understood: Documentation of data definitions, collection methods, groupings and terminology helps form an understanding of the data and maintains continuity and consistency. Wherever possible, synonyms should be documented where more than one term may have the same meaning in one context (e.g. “client” and “customer”) but different meanings in another. Technical definitions may also be included where they are important.
Is the data fit for the purpose for which it was originally intended?
Data is considered fit for purpose when it is appropriate for its intended use, which could include decision making, policy development, service delivery, product functionality, reporting or administration. The purpose the data is measured against is always the original intended purpose; however, some future uses of the data may not have been apparent when the data was originally collected. Users' expectations of data quality must always align with the original business intent of the data asset, so it is important to know who those users will be. Consulting potential users during a data asset's development or planning phase can ensure that the data collected meets their expectations of quality and relevance. Fitness for purpose is often subjective and difficult to measure, and no specific properties are suggested for the Fitness for purpose dimension. However, a statement may be made against the overall data asset's Fitness for purpose, as it is a critical part of data quality across all dimensions.