What is Big Data? (Part 3) – Variety

The volume associated with the Big Data phenomenon brings along a new challenge for data centers trying to deal with it: its Variety. With the explosion of sensors and smart devices, as well as social collaboration technologies, data in an enterprise has become complex. It includes not only traditional relational data, but also raw, semistructured, and unstructured data from web pages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data from active and passive systems, and so on. What’s more, traditional systems can struggle to store this data and to perform the analytics needed to understand its contents, because much of the information being generated doesn’t lend itself to traditional database technologies.

Quite simply, variety represents all types of data. It marks a fundamental shift in analysis requirements, from traditional structured data alone to raw, semistructured, and unstructured data as part of the decision-making and insight process. Traditional analytic platforms struggle to handle this variety. However, an organization’s success will rely on its ability to draw insights from all the kinds of data available to it, both traditional and nontraditional.

By commonly cited estimates, just 20 percent of the world’s data is of the relational kind that’s neatly formatted and fits nicely into strict schemas. Something like 80 percent (and more and more of this data is responsible for setting new velocity and volume records) is unstructured, or semistructured at best. If we look at a Twitter feed, we’ll see structure in its JSON format, but the actual text of each tweet is not structured, and understanding that text can be rewarding. Video and images aren’t easily or efficiently stored in a relational database, and event information that changes dynamically (such as weather patterns) isn’t well suited to strict schemas. To capitalize on the Big Data opportunity, enterprises must be able to analyze all types of data, both relational and nonrelational: text, sensor data, audio, video, transactional, and more.
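To make the Twitter example concrete, here is a minimal Python sketch. The record and its field names are made up for illustration and are not the actual Twitter schema, and the keyword check is only a crude stand-in for real text analytics. It shows that the JSON wrapper can be read by field name, much like columns in a table, while the text inside needs interpretation.

```python
import json

# Hypothetical tweet-like record; the field names are illustrative
# assumptions, not the real Twitter API schema.
raw = """
{
  "id": 1,
  "user": {"name": "example_user", "followers": 42},
  "created_at": "2012-01-01T12:00:00Z",
  "text": "Stuck on the bridge again, traffic is terrible today!"
}
"""

record = json.loads(raw)

# The structured part: fields are addressed by name, much like columns.
author = record["user"]["name"]
followers = record["user"]["followers"]

# The unstructured part: free text has no schema, so meaning must be inferred.
# A naive keyword lookup stands in here for real text analytics.
NEGATIVE_WORDS = {"terrible", "stuck", "awful"}
words = {w.strip(".,!?").lower() for w in record["text"].split()}
negative_hits = sorted(words & NEGATIVE_WORDS)

print(author, followers, negative_hits)
```

The author and follower count come straight out of the structure, but deciding whether the tweet is a complaint takes extra work on the text itself, and that is where the variety challenge lies.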

What is Big Data? (Part 2) – Volume

In Part 1 of this series on Big Data, I got the following comment from Venkataramana:

‘Big’ data that is being buzzed all over seems to mostly refer to unstructured information like Tweets, Facebook updates, Google searches etc., Do you think even the transactional data is being considered ‘Big’ ? My personal perception is that transactional data being structured in nature can always be subject to better filtering and can always be limited to ‘practical’ sizes in the current computing limits.

I am happy that someone is actually reading the blog and stirring up more thinking with insightful comments. In this post I wish to share some more insights on the Volume aspect of Big Data in response to the above comment.

Let’s define Big Data first…
A reasonable definition of what people refer to as ‘Big Data’ is information that can’t be processed or analyzed using traditional processes or tools. This is due to three reasons – Volume, Variety (which was mentioned in the comment) and Velocity.

Now talking about Volume…
My perception is that the Volume of even the so-called structured data (residing in nice, well-defined schemas in some database) is also growing beyond practical limits. What pushes it beyond those limits is not the storage capacity of the database, but rather how poorly the traditional tools are equipped to sift through the data and deliver meaningful insights. The following examples will help to show how quickly the amount of data being generated can grow:

Taking a smartphone out of its holster generates an event; when a commuter train’s door opens, an event is generated; checking in for a plane, swiping a badge at work, buying a song on iTunes, changing the TV channel, taking an electronic toll route – every one of these generates data. Need more? The St. Anthony Falls Bridge in Minneapolis has more than 200 embedded sensors positioned at strategic points to provide a comprehensive monitoring system. All sorts of detailed data is collected, and even a shift in temperature and the reaction of the bridge’s concrete to that change is available for analysis.

As the amount of data available to the enterprise is on the rise, the percentage of that data it can process, understand, and analyze is on the decline, thereby creating a blind zone. What’s in the blind zone? You don’t know: it might be something great, or maybe nothing at all, but the “don’t know” is the problem (or the opportunity, depending on how you look at it).

To make the problem and the opportunity look “real”, I plan to discuss a business case in my next blog post, where a “Big Data” solution solved some “Volume” issues for a client. Stay tuned!