As we know, a Data Reservoir contains data from many different sources to ease data discovery, data analytics, and ad hoc investigations. Let’s look a little more closely at why a Data Reservoir solution needs Information Governance.
Let’s start with the use case of a bank that plans to implement a Data Reservoir. It has the traditional sources of data, which include how much a person earns, what they spend their money on, where they live, even where they travel or eat. The customer may also share similar information on a social media site. However, people who willingly share information on a social media site know that it will become more or less public. When people share their data with their bank, they trust that the bank will use it responsibly, for the purposes for which it was shared, and this responsibility goes further than just abiding by the law.
Take a customer’s payment transactions as an example. Many customers would be unhappy if they felt the bank was monitoring how they spent their money. However, they would probably also expect the bank to detect fraudulent use of their debit card. Both of these use cases involve the same data, but the first seems to pry into a person’s privacy while the second is an aspect of fraud prevention. The difference between the cases is in the purpose of the analytics. So as the bank makes data more widely available to its employees for analytics, it must monitor both access to the data and the types of analytics the data is used for. It does that through information governance and security capabilities. Let’s delve into it more.
No data can enter the data reservoir without first being described in the data reservoir’s catalog. The data owner classifies each information source that will feed the data reservoir, and that classification determines how the data reservoir should manage the data, including access control, quality control, masking of sensitive data, and data retention periods.
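The catalog-first rule can be sketched in a few lines of Python. This is only an illustration, not how any particular data reservoir product works: the names `Classification`, `CatalogEntry`, `Catalog`, and `Reservoir`, the three classification levels, and the choice to raise `PermissionError` are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    # Hypothetical classification levels; a real catalog would define its own.
    HIGHLY_SENSITIVE = "highly_sensitive"
    PERSONAL = "personal"
    INTERNAL = "internal"


@dataclass(frozen=True)
class CatalogEntry:
    # What the data owner declares about a source before it can feed the reservoir.
    source_name: str
    owner: str
    classification: Classification
    retention_days: int


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.source_name] = entry

    def lookup(self, source_name):
        # Returns None when the source has never been described.
        return self._entries.get(source_name)


class Reservoir:
    def __init__(self, catalog):
        self._catalog = catalog
        self.stored = []

    def ingest(self, source_name, records):
        entry = self._catalog.lookup(source_name)
        if entry is None:
            # Uncataloged data is rejected outright.
            raise PermissionError(
                f"source {source_name!r} is not described in the catalog"
            )
        # Keep the catalog entry next to each record so later steps
        # (masking, access control, retention) can consult the classification.
        for record in records:
            self.stored.append((entry, record))
        return entry
```

In this sketch, a source such as a hypothetical `card_transactions` feed must be registered with an owner, a classification, and a retention period before `ingest` will accept it. Any attempt to load data from an unregistered source fails before a single record is stored.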
The classification assigned to data leads to different management actions on that data in the data reservoir. For example, when data is classified as highly sensitive, the data reservoir can enforce masking of the data on ingestion. Less sensitive data that is nevertheless personal to the bank’s customers may be stored in secured repositories in the data reservoir, so it can be used for production analytics. However, when it is copied into sandboxes for analytical discovery, it will be masked to prevent data scientists from seeing the values, without losing the referential integrity of the data. Behind the scenes, the data reservoir audits access to data to detect whether employees are accessing more data than is reasonable for their role. Thus the data reservoir opens access to the bank’s data, but only for legitimate and approved purposes.
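To make "masking without losing referential integrity" concrete, here is a minimal sketch of one common technique: deterministic keyed hashing. The article does not say which masking method a data reservoir uses, so treat this as an assumption. The function names `mask_value` and `mask_rows`, the `tok_` prefix, and the sample column names are hypothetical.

```python
import hashlib
import hmac


def mask_value(value: str, key: bytes) -> str:
    # The same input and key always produce the same token, so masked
    # identifiers still match across tables. Without the key, the original
    # value cannot be recomputed from the token by simple hashing.
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def mask_rows(rows, sensitive_columns, key):
    # Return copies of the rows with the sensitive columns replaced by tokens;
    # all other columns are passed through unchanged.
    return [
        {
            col: mask_value(str(val), key) if col in sensitive_columns else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Hypothetical sample data: two tables that share a customer identifier.
customers = [{"customer_id": "C-1001", "city": "Leeds"}]
transactions = [
    {"customer_id": "C-1001", "amount": 42.50},
    {"customer_id": "C-1001", "amount": 13.20},
]

sandbox_key = b"example-key-kept-outside-the-sandbox"
masked_customers = mask_rows(customers, {"customer_id"}, sandbox_key)
masked_transactions = mask_rows(transactions, {"customer_id"}, sandbox_key)
```

Because both tables are masked with the same key, a data scientist in the sandbox can still join `masked_transactions` to `masked_customers` on `customer_id`, but sees only tokens instead of the real identifiers. The key would need to stay outside the sandbox for the masking to mean anything.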