What is Big Data? (Part 3) – Variety

The volume associated with the Big Data phenomenon brings along a new challenge for data centers trying to deal with it: its variety. With the explosion of sensors and smart devices, as well as social collaboration technologies, data in an enterprise has become complex, because it includes not only traditional relational data, but also raw, semistructured, and unstructured data from webpages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data from active and passive systems, and so on. What’s more, traditional systems can struggle to store this data and perform the analytics required to gain understanding from its contents, because much of the information being generated doesn’t lend itself to traditional database technologies.

Quite simply, variety represents all types of data—a fundamental shift in analysis requirements from traditional structured data to include raw, semistructured, and unstructured data as part of the decision making and insight process. Traditional analytic platforms can’t handle variety. However, an organization’s success will rely on its ability to draw insights from the various kinds of data available to it, which includes both traditional and nontraditional.

Just 20 percent of the data is of the relational kind that’s neatly formatted and fits ever so nicely into strict schemas. But something like 80 percent of the world’s data (and more and more of this data is responsible for setting new velocity and volume records) is unstructured, or semistructured at best. If we look at a Twitter feed, we’ll see structure in its JSON format, but the actual text is not structured, and understanding that text can be rewarding. Video and picture images aren’t easily or efficiently stored in a relational database; certain event information can change dynamically (such as weather patterns), which isn’t well suited to strict schemas; and there is more. To capitalize on the Big Data opportunity, enterprises must be able to analyze all types of data, both relational and nonrelational: text, sensor data, audio, video, transactional, and more.
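To make the Twitter example concrete, here is a minimal Python sketch. The record is a made-up, simplified tweet-like JSON document; the field names are my own illustrative assumptions and not Twitter’s actual schema. It shows that the envelope can be read by field name, while the text itself is just free-form words that need further analysis.

```python
import json

# A hypothetical, simplified tweet-like record; field names are illustrative only.
raw = '''
{
  "id": 1001,
  "created_at": "2013-05-01T10:15:00Z",
  "user": {"screen_name": "example_user", "followers": 250},
  "text": "Loving the battery life on my new phone, but the camera is meh :("
}
'''

record = json.loads(raw)

# Structured part: fields can be addressed by name and have predictable types.
structured = {
    "id": record["id"],
    "created_at": record["created_at"],
    "screen_name": record["user"]["screen_name"],
    "followers": record["user"]["followers"],
}

# Unstructured part: free text, here only split into lowercase tokens.
words = record["text"].lower().split()

print(structured)
print(words)
```

The structured fields would map easily onto table columns; the token list would not, which is the gap that text analytics has to fill.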

What is Business Intelligence?

In my last set of posts, I described the journey of data to a data warehouse. It starts with an ETL job that can extract data from different types of sources (z/OS, SAP, or custom); along the way the data is transformed and cleansed, and it finally lands in a target as “Trusted Information” (which by definition means accurate, complete, insightful, and real-time). So why was this effort made after all? One answer that I already provided is that this much effort is needed for compliance purposes, and is more or less sufficient (at least from an IT perspective) to ensure that our records are proper. But beyond that, such data can often be used further to give valuable insights. This is where BI (pronounced “bee eye”), or Business Intelligence, comes into the picture.

What is BI?
Business intelligence (BI) is defined as the ability of an organization to take all its capabilities and convert them into knowledge, ultimately getting the right information to the right people at the right time to make the right decisions. These decisions drive organizations. Making a good decision at a critical moment may lead to a more efficient operation, a more profitable enterprise, or perhaps a more satisfied customer. BI tools and processes working on trusted data provide a safer way to make a decision than relying on a “gut feeling”.

Where does BI Apply?

  • BI can be used to segment customers and identify the best ones. We can analyze data to understand customer behaviors, predict their wants and needs, and offer fitting products and promotions. Finally, we can identify the customers at greatest risk of attrition, and intervene to try to keep them.
  • The human resources department can learn which individuals in an organization are high performers and then hire, train, and reward other employees to become similar high performers.
  • Inventory managers can segment their inventory items by cost and velocity, build key facilities in the best locations, and ensure that the right products are available in the right quantities.
  • Production can minimize its costs by setting up activity-based costing programs.

Can BI make every business decision accurate?
BI only assists in making a sound decision; in places, intuition may still be required. What if we do not have sufficient time to run our tools and get a report before making a decision? What if we have no historical data on which to base the decision, or that history is misleading?

So does BI just munch the trusted data and give you some gyan (a Sanskrit word for knowledge or insight)? Not really. There are two additional things that it should do: measure the results against predetermined metrics, and feed the lessons from one decision into the next.

These are the titbits I have gathered from reading about BI in various sources. I welcome readers to share their understanding or point to more interesting reads in this upcoming area.

InfoSphere DataStage – VII (Data Cleansing)

In this blog, I would like to discuss an important feature of any ETL tool: support for data cleansing, which is part of a bigger umbrella called data quality. Some debate whether data cleansing should happen during the ETL phase or later, when the data marts are being created. We will not go into that debate now; we will focus on the need for data cleansing.

Most organizations today depend on the use of data in two general ways. Standard business processes use data for executing transactions, as well as supporting operational activities. Business analysts review data captured as a result of day-to-day operations through reports as a way of identifying new opportunities for efficiency or growth. So data is used to both run and improve the ways that organizations achieve their business objectives. If that is true, then there must be processes in place to ensure that data is of sufficient quality to meet the business needs.

The price of poor data is illustrated by these examples:

  • A data error in a bank causes 300 credit-worthy customers to receive mortgage default notices. The error costs the bank time, effort, and customer goodwill.
  • A marketing organization sends duplicate direct mail pieces. A six percent redundancy in each mailing costs hundreds of thousands of dollars a year.
  • A managed-care agency cannot relate prescription drug usage to patients and prescribing doctors. The agency’s OLAP application fails to identify areas to improve efficiency and inventory management and new selling opportunities.

We should therefore have the ability to check, filter, and correct mistakes or corruption found in the source data. A typical case in which a cleansing process is mandatory is address (location) processing: “street,” “st,” “st.,” and so forth all indicate the word “street,” and an automatic process must be able to recognize this. For this particular purpose, there are specialized tools that apply ad-hoc algorithms and have their own topographic database in which to store and retrieve the correct name of every street, and so forth.
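As a rough illustration of the “street” case, here is a minimal Python sketch of rule-based standardization. The lookup table and the `standardize_address` function are hypothetical names of my own; real cleansing tools rely on far larger rule sets and reference databases, so treat this only as a sketch of the idea.

```python
import re

# Illustrative lookup of street-type variants (an assumption for this sketch).
STREET_TYPES = {
    "st": "Street",
    "str": "Street",
    "street": "Street",
    "ave": "Avenue",
    "avenue": "Avenue",
    "rd": "Road",
    "road": "Road",
}

def standardize_address(address: str) -> str:
    """Normalize whitespace and expand a trailing street-type abbreviation."""
    tokens = re.sub(r"\s+", " ", address.strip()).split(" ")
    if tokens == [""]:
        return ""
    # Only the last token is treated as the street type, so a leading
    # "St" (as in "St James St.") is left untouched.
    key = tokens[-1].rstrip(".").lower()
    if key in STREET_TYPES:
        tokens[-1] = STREET_TYPES[key]
    return " ".join(tokens)

for raw in ["12 Main st", "12 Main St.", "12  Main street"]:
    print(standardize_address(raw))
```

All three inputs come out in the same form, so later matching and de-duplication steps can compare like with like.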

IBM InfoSphere QualityStage™, part of the InfoSphere Information Server suite, comprises a set of stages, a Match Designer, and related capabilities that provide a development environment for building data-cleansing tasks called jobs. It enables enterprises to create and maintain an accurate view of master data entities, such as customers, vendors, locations and products. Core capabilities in it include data investigation, standardization, address validation, probabilistic matching and survivorship. InfoSphere QualityStage may be deployed in transactional, operational, or analytic applications, and in batch and real-time environments.

A cool thing about QualityStage is that it shares the canvas with the data integration stages.

You can watch a brief video of QualityStage here.

InfoSphere DataStage – III (3 Strengths of DataStage)

There are quite a good number of strengths in using DataStage. In this blog, I will describe the top three (my personal choice) that I find really cool. First, we need not bother about the underlying structure of the system while designing a job (remember my last post?). Second, thanks to the functions available, many of the transformations can happen without the need for a staging database. Finally, there is the way it scales. Here is a little more detail on each.

  • One of the great strengths of InfoSphere DataStage is that designing jobs requires very little consideration of the underlying structure of the system. If the system changes, is upgraded or improved, or if a job is developed on one platform and implemented on another, the job design does not necessarily have to change. InfoSphere DataStage has the capability to learn about the shape and size of the system from the InfoSphere DataStage configuration file. Further, it has the capability to organize the resources needed for a job according to what is defined in the configuration file. When a system changes, the file is changed, not the jobs. A configuration file defines one or more processing nodes on which the job will run. The processing nodes are logical rather than physical. The number of processing nodes does not necessarily correspond to the number of cores in the system.
  • Another great strength of InfoSphere DataStage is that it does not rely on the functions and processes of a database to perform transformations: while InfoSphere DataStage can generate complex SQL and leverages databases, InfoSphere DataStage is designed from the ground up as a multipath data integration engine equally at home with files, streams, databases, and internal caching in single-machine, cluster, and grid implementations. As a result, users in many circumstances find they do not also need to invest in staging databases to support InfoSphere DataStage.
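The idea of “change the configuration, not the job” can be sketched with a toy Python analogy. This is not the DataStage configuration file format or its engine; the `config` dictionary, the node names, and `run_job` are all hypothetical. The job logic is written once, the number of logical nodes comes only from the configuration, and the result is the same whether one node or four are declared.

```python
import zlib
from collections import Counter

def run_job(rows, config):
    """Toy job: total amount per customer, partitioned across logical nodes.

    The job never hard-codes a node count; it reads it from the config.
    """
    node_names = config["nodes"]
    n = len(node_names)

    # Partition rows by a stable hash of the key, one bucket per logical node.
    partitions = {name: [] for name in node_names}
    for customer, amount in rows:
        idx = zlib.crc32(customer.encode("utf-8")) % n
        partitions[node_names[idx]].append((customer, amount))

    # Each logical node aggregates its own partition; results are then merged.
    totals = Counter()
    for name in node_names:
        partial = Counter()
        for customer, amount in partitions[name]:
            partial[customer] += amount
        totals.update(partial)
    return dict(totals)

rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3)]
one_node = {"nodes": ["node1"]}
four_nodes = {"nodes": ["node1", "node2", "node3", "node4"]}

print(run_job(rows, one_node))
print(run_job(rows, four_nodes))
```

Because the nodes are logical, the four-node configuration could run on a single machine or be spread over several; the job definition stays the same either way.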

DataStage Scalability

  • Linear scalability and very high data processing rates were obtained for a typical information integration benchmark using data volumes that scaled to one terabyte. Scalability and performance such as this is increasingly sought by customers investing in a solution that can grow with their data needs and deliver trusted information throughout their organization.

Providing relevant Social Media Analytics

Social media is steadily becoming a very prevalent place where individuals or groups express their opinions on everything from tools, services, the latest gizmos, and ad campaigns to an organization as a whole. An organization should be aware of what consumers, customers, and competitors are saying about it, so that it can decide its future course and remain competitive in the marketplace. For example, negative chatter may start on the launch of a new product. It could be based on misinformation, but this chatter has the potential to damage sales of the product, because many consumers may form their opinions from it. So it would be nice for an organization to scan social media and get answers to questions like the following:

  • How do consumers feel about our new product or ad campaign?
  • What are consumers hearing about our brand [brand reputation]?
  • What are the most talked-about product attributes in my product category [for a smartphone, are people talking about the screen, battery life, or camera]? Is the talk good or bad?
  • What is my competitor doing to excite the market [competitive analysis]?
  • Are my business partners helping or hurting my reputation?
  • Is there a negative chatter that my PR team should respond to?

Cognos Consumer Insight (CCI) does exactly this. Typically it does not crawl the data itself and instead uses the services of known social media crawlers (like Board Reader) to get social media content. This text data comes in JSON format, and CCI processes it using Hadoop from BigInsights. It applies sentiment analysis (using SystemT) and does the following:

  • Perform pattern matching, based on the input keywords, and then look for sentiments
  • Check for positive, negative, neutral words / phrases (grammar, slang, typos, synonyms etc)
  • Save results for further processing by visualization / search engine.
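The three steps above can be sketched in a few lines of Python. To be clear, this is not how CCI or SystemT work internally; it is a simple lexicon-based stand-in, and the word lists, the `score_posts` function, and the post format are all assumptions made for illustration.

```python
import json
import re

# Tiny illustrative lexicons; a real system would use far richer rules.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def score_posts(json_lines, keywords):
    """Keep posts that mention a keyword and label each with a sentiment."""
    results = []
    for line in json_lines:
        post = json.loads(line)
        words = re.findall(r"[a-z']+", post["text"].lower())
        matched = sorted(set(words) & keywords)
        if not matched:
            continue  # step 1: keyword matching, skip irrelevant posts
        # Step 2: count positive and negative words to decide the label.
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        results.append({"id": post["id"], "keywords": matched, "sentiment": label})
    return results

posts = [
    '{"id": 1, "text": "I love the battery life, great phone"}',
    '{"id": 2, "text": "The camera is terrible"}',
    '{"id": 3, "text": "Bought a new phone today"}',
    '{"id": 4, "text": "Nice weather"}',
]

results = score_posts(posts, keywords={"phone", "camera", "battery"})

# Step 3: serialize the results for a downstream visualization or search step.
print(json.dumps(results))
```

A word-counting approach like this misses slang, typos, negation, and sarcasm, which is exactly why dedicated text analytics engines are used for the real job.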

Finally, it presents the sentiment analysis through the visualization engine. The job of CCI is done at this point, but the story does not end here. We now know the customers’ current sentiment. Organizations may also need the ability to understand the key factors driving that sentiment, and IBM SPSS is a tool for that. Once there is a fair prediction, we need to act on it and monitor the results; IBM Coremetrics and Unica are the tools used for this.

Data Governance VI (Protecting Database)

Data governance encompasses preventing issues with data, so that the enterprise can become more efficient. Most organizations have formal policies that govern how and when privileged users such as database administrators, help desk members, and outsourced personnel can access database systems. But (there is always a but!), organizations do not always have effective mechanisms to monitor, control, and audit the actions of these privileged users. To make matters worse, accountability is difficult to achieve because privileged users often share the credentials used to access database systems.

Monitoring privileged users helps ensure Data Governance in the following ways:

  • Data privacy—Monitoring ensures that only authorized applications and users are viewing sensitive data.
  • Database change control—Monitoring ensures that critical database structures and values are not being changed outside of corporate change control procedures.
  • Protection against external attacks—A successful, targeted attack frequently results in the attacker gaining privileged user access. For example, an outsider in Timbuktu might look like an insider because he has authenticated access, until you look at other identifying information such as the user’s location.

An organization will want to track all database changes to the following:

  • Database structures such as tables, triggers, and stored procedures. For example, the organization will want to detect accidental deletions or insertions in critical tables that affect the quality of business decisions.
  • Critical data values such as data that affects the integrity of financial transactions.
  • Security and access control objects such as users, roles, and permissions. For example, an outsourced contractor might create a new user account with unfettered access to critical databases and then delete the entire account, eliminating all traces of her activity.
  • Database configuration files and other external objects, such as environment/registry variables, configuration files (e.g., NAMES.ORA), shell scripts, OS files, and executables such as Java programs.

IBM InfoSphere Guardium Database Activity Monitor offers a solution that creates a continuous, fine-grained audit trail of all database activities, including the “who,” “what,” “when,” “where,” and “how” of each transaction. This audit trail is continuously analyzed and filtered in real time to identify unauthorized or suspicious activities. To enforce separation of duties, all audit data is stored in a secure, tamper-proof repository external to monitored databases.

IBM InfoSphere Guardium Database Activity Monitor’s solution has a minimal impact on database performance and does not require any changes to databases or applications. IBM InfoSphere Guardium Database Activity Monitor also enables an organization to automate the time-consuming process of tracking all observed database changes and reconciling them with authorized work orders within existing change-ticketing systems, such as BMC Remedy and custom change management applications. For example, a large financial institution set up an automated change-reconciliation process with IBM InfoSphere Guardium Database Activity Monitor.
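To give a feel for what change reconciliation means, here is a minimal Python sketch. It does not use any Guardium or BMC Remedy API; the record layouts, ticket numbers, user names, and the `reconcile` function are all hypothetical. It matches each observed change against authorized work orders by user, object, and time window, and flags whatever cannot be matched.

```python
from datetime import datetime

# Hypothetical authorized work orders: who may change what, and when.
work_orders = [
    {"ticket": "CHG-100", "user": "dba_ann", "object": "ORDERS",
     "start": datetime(2024, 1, 10, 22, 0), "end": datetime(2024, 1, 11, 2, 0)},
]

# Hypothetical audit-trail entries: the who, what, and when of each change.
observed_changes = [
    {"user": "dba_ann", "object": "ORDERS", "action": "ALTER TABLE",
     "when": datetime(2024, 1, 10, 23, 30)},
    {"user": "contractor1", "object": "USERS", "action": "CREATE USER",
     "when": datetime(2024, 1, 12, 3, 15)},
]

def reconcile(changes, orders):
    """Split observed changes into (authorized, unauthorized) lists."""
    authorized, unauthorized = [], []
    for change in changes:
        # Find the first work order covering this user, object, and time.
        ticket = next(
            (o["ticket"] for o in orders
             if o["user"] == change["user"]
             and o["object"] == change["object"]
             and o["start"] <= change["when"] <= o["end"]),
            None,
        )
        if ticket:
            authorized.append({**change, "ticket": ticket})
        else:
            unauthorized.append(change)
    return authorized, unauthorized

authorized, unauthorized = reconcile(observed_changes, work_orders)
for change in unauthorized:
    print("UNRECONCILED:", change["user"], change["action"], change["object"])
```

In this made-up data, the contractor’s CREATE USER has no matching work order, so it is the one change that would be escalated for review.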

Data Governance – I (Basics)

Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise. Data governance guarantees that data can be trusted and that people can be made accountable for any adverse event that happens because of poor data quality. So Data Governance is about putting people in charge of fixing and preventing issues with data, so that the enterprise can become more efficient.

Data governance encompasses the people, processes, and information technology required to create a consistent and proper handling of an organization’s data across the business enterprise.

  • People – Effective enterprise data governance requires executive sponsorship as well as a firm commitment from both business and IT staffs.
  • Policies – A data governance program must create – and enforce – what is considered “acceptable” data through the use of business policies that guide the collection and management of data.
  • Technology – Beyond data quality and data integration functionality, an effective data governance program uses data synchronization technology, data models, collaboration tools and other components that help create a coherent enterprise view.

The benefits of a holistic approach are obvious: better data drives more effective decisions across every level of the organization. With a more unified view of the enterprise, managers and executives can create strategies that make the company more profitable.