We discussed the three V’s of Big Data in some detail in my previous blogs. Now comes the fourth V – Veracity. It translates to “Can the data be trusted?”. The fourth V matters in the Big Data world because what is the use of Big Data, and the analytics built on it, if we cannot trust the data? Let’s look at some real examples…
- A man who died in 1979 was listed as alive by a website that advertises itself as a “People Search Engine”. Using consolidated public data, the site determined that the man is 92 years old and living in Florida.
- Gartner Group estimates that by 2014, false reviews will constitute 10 – 15% of all reviews. Case in point: a travel website with over 200 million unique visitors per month recently removed over 100 reviews that had been created by a hotel chain executive. The executive wrote positive reviews of his own properties and negative reviews of his rivals’.
- Veracity depends on the data, the algorithms, and the assumptions behind them. For example, in 2008, Google began tracking flu infection rates. Google Flu Trends derives its estimates of flu infection rates from tweets and flu-related search terms. When compared with the CDC’s manual reporting, the estimates proved accurate and arrived several days earlier, providing the lead time needed to fight a potential epidemic. During the 2013 season, however, those estimates spiked to nearly double the CDC’s reports. Experts asserted that increased news coverage drove searches by individuals who were not actually ill.
So to check the Veracity of data, questions like the following can be raised:
- Where did the data come from?
- Did the data originate internally within the organization or externally?
- Is the data from a transaction that can be audited and proven?
- Is the data truth or opinion?
- Is the data an intentional fabrication?
- Is it publicly available data, such as phone numbers, or is it behavioral data from a data aggregator?
- Is the raw data usable as is, such as in fraud detection, where identifying the aberrations is the focus, or does it require standardization?
- What governance methods does the organization use to gauge and rank veracity and classify its dimensions? As more data sources shift from internal to externally acquired, this issue has become more pressing.
- How do you classify the trust factor? Organizations must seriously consider classification as part of the governance process.
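The questions above can be operationalized as simple checks against a dataset’s metadata. The sketch below is a minimal, hypothetical illustration – the `DataRecord` fields and flag messages are assumptions for demonstration, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class DataRecord:
    source: str          # e.g. "internal_crm" or "external_aggregator"
    auditable: bool      # backed by a transaction that can be audited and proven?
    is_opinion: bool     # opinion (e.g. a review) rather than a factual record?
    standardized: bool   # already cleaned, or still raw?

def veracity_flags(record: DataRecord) -> list[str]:
    """Return the veracity concerns raised by the checklist questions."""
    flags = []
    if not record.source.startswith("internal"):
        flags.append("external source: verify provenance")
    if not record.auditable:
        flags.append("no audit trail: origin cannot be proven")
    if record.is_opinion:
        flags.append("opinion data: may be biased or fabricated")
    if not record.standardized:
        flags.append("raw data: may need standardization")
    return flags
```

For instance, a batch of reviews bought from a data aggregator (`DataRecord("external_aggregator", False, True, False)`) would raise all four flags, while an audited internal transaction would raise none.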
So, based on the above considerations, three dimensions emerge that characterize data from the perspective of its veracity:
- The quality or cleanliness/consistency/accuracy of the data
- The provenance or source of the data, along with its lineage over time
- The intended usage because the usage can dramatically affect what is considered an acceptable level of trust or quality
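The third dimension is the interesting one: the same quality and provenance scores can be acceptable for one use and unacceptable for another. A minimal sketch of that idea, assuming hypothetical scores in [0, 1] and made-up per-use thresholds:

```python
def acceptable_trust(quality: float, provenance: float, intended_use: str) -> bool:
    """Decide whether data meets the trust bar for its intended use.

    quality and provenance are assumed scores in [0, 1]; the bar that the
    weaker of the two must clear depends on what the data will be used for.
    """
    # Hypothetical thresholds: critical uses demand more trustworthy data,
    # while fraud detection tolerates raw data because the aberrations
    # themselves are the signal.
    thresholds = {
        "fraud_detection": 0.4,
        "reporting": 0.7,
        "regulatory": 0.9,
    }
    bar = thresholds.get(intended_use, 0.7)
    return min(quality, provenance) >= bar
```

The same mediocre data (say, quality 0.5 and provenance 0.6) passes for fraud detection but fails for regulatory use – which is exactly the point: trust is not an absolute property of the data, but of the data in context.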