IBM Bluemix Data Connect

I have been tracking the development of IBM Bluemix Data Connect quite closely. One of the reasons is that I was a key developer on one of the first few services that it launched almost two years back under the name DataWorks. Two weeks back I attended a session on Data Connect by the architect and saw a demo. I am impressed by the way it has evolved since then. Therefore I am planning to revisit DataWorks, now as IBM Bluemix Data Connect. In this blog I will look at the role that IBM Bluemix Data Connect plays in the era of cloud computing, big data and the Internet of Things.

Research from Forrester found that 68 percent of simple BI requests take weeks, months or longer for IT to fulfill due to a lack of technical resources. This means that enterprises must find ways to transform line-of-business professionals into skilled data workers, taking some of the burden off IT. Business users should be empowered to work with data from many sources, both on premises and in the cloud, without requiring the deep technical expertise of a database administrator or data scientist.

This is where cloud services like IBM Bluemix Data Connect come into the picture. It enables both technical and non-technical business users to derive useful insights from data with point-and-click access, whether it’s a few Excel sheets stored locally or a massive database hosted in the cloud.

Data Connect is a fully managed data preparation and movement service that enables users to put data to work through a simple yet powerful cloud-based interface. The design team has taken great pains to keep the solution as simple as possible, so that a basic user can quickly get started with it. Data Connect empowers the business analyst to discover, cleanse, standardize, transform and move data in support of application development and analytics use cases.

Through its integration with cloud data services like IBM Watson Analytics, Data Connect is a seamless tool for preparing and moving data from on premises and off premises to an analytics cloud ecosystem where it can be quickly analyzed and visualized. Furthermore, Data Connect is backed by continuous delivery, which adds robust new features and functionality on a regular basis. Its processing engine is built on Apache Spark, the leading open source analytics project, with a large and continuously growing development community. The result is a best-of-breed solution that can keep up with the rapid pace of innovation in big data and cloud computing.

Here are the highlights of IBM Bluemix Data Connect:

  • Allow technical and non-technical users to draw value from data quickly and easily.
  • Ensure data quality with simple data preparation and movement services in the cloud.
  • Integrate with leading cloud data services to create a seamless data management platform.
  • Receive a continuous inflow of new and robust features.
  • Use a best-of-breed ETL solution available on Bluemix, IBM’s next-generation cloud app development platform.

IBM InfoSphere Streams

Stream Computing and the need for it:
Stream computing refers to the analysis of data in motion (before it is stored).
Stream computing becomes essential with certain kinds of data when there is no time to store it before acting on it. This data is often (but not always) generated by a sensor or other instrument. Speed matters for use cases such as fraud detection, emergent healthcare, and public safety where insight into data must occur in real time and decisions to act can be life-saving.

Streaming data might arrive linked to a master data identifier (a phone number, for example). In a neonatal ICU, sensor data is clearly matched to a particular infant. Financial transactions are unambiguously matched to a credit card or Social Security number. However, not every piece of data that streams in is valuable, which is why the data should be analyzed by a tool such as IBM InfoSphere Streams before being joined with the master data.

IBM InfoSphere Streams
IBM InfoSphere Streams analyzes these large data volumes with Micro-Latency. Rather than accumulating and storing data first, the software analyzes data as it flows in and identifies conditions that trigger alerts (such as outlier transactions that a bank flags as potentially fraudulent during the credit card authorization process). When this situation occurs, the data is passed out of the stream and matched with the master data for better business outcomes. InfoSphere Streams generates a summary of the insights that are derived from the stream analysis and matches it with trusted information, such as customer records, to augment the master data.
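
To make the idea of analyzing data as it flows in a little more concrete, here is a minimal Python sketch. It is not InfoSphere Streams code and does not use any Streams API. The record shape `(card_id, amount)`, the function names `flag_outliers` and `enrich`, and the thresholds are my own assumptions for illustration. It keeps only a running mean and variance in memory (Welford's online algorithm), flags transactions that deviate strongly, and only then matches the flagged ones against master data.

```python
import math
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape for this sketch: (card_id, amount).
Transaction = Tuple[str, float]


def flag_outliers(stream: Iterable[Transaction],
                  threshold: float = 3.0,
                  warmup: int = 5) -> Iterator[Transaction]:
    """Yield transactions whose amount is far from the running mean.

    Only a count, a mean and a sum of squared deviations are kept,
    so the stream itself is never stored.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for card_id, amount in stream:
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(amount - mean) > threshold * std:
                yield card_id, amount
                continue  # keep the outlier out of the running statistics
        # Welford's online update of mean and squared deviations.
        count += 1
        delta = amount - mean
        mean += delta / count
        m2 += delta * (amount - mean)


def enrich(alerts: Iterable[Transaction],
           master: Dict[str, str]) -> Iterator[dict]:
    """Join only the flagged records with master data (assumed: card_id -> customer)."""
    for card_id, amount in alerts:
        yield {"card_id": card_id,
               "amount": amount,
               "customer": master.get(card_id, "unknown")}


if __name__ == "__main__":
    stream = [("A", 10.0), ("B", 12.0), ("A", 11.0), ("C", 9.0),
              ("B", 10.0), ("A", 11.0), ("C", 500.0), ("B", 12.0)]
    master = {"A": "Alice", "B": "Bob", "C": "Carol"}
    for alert in enrich(flag_outliers(stream), master):
        print(alert)
```

The point of the sketch is the order of operations: the bulk of the stream is examined and discarded, and the comparatively expensive lookup against master data happens only for the few records that trigger an alert.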

IBM InfoSphere Streams is based on nearly 6 years of effort by IBM Research to extend computing technology to rapidly do advanced analysis of high volumes of data, such as:

  •  Analysis of video camera data for specific faces in law enforcement applications
  •  Processing of 5 million stock market messages per second, including trade execution with an average latency of 150 microseconds
  •  Analysis of test results from chip manufacturing wafer testers to determine in real time if there are failing chips and if there are patterns to the failures

InfoSphere Streams provides a development platform and runtime environment where you can develop applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams that are based on defined, proven, and analytical rules that alert you to take appropriate action, all within an appropriate time frame for your organization.

The data streams that are consumable by InfoSphere Streams can originate from sensors, cameras, news feeds, stock tickers, or various other sources, including traditional databases. The streams of input sources are defined and can be numeric, text, or non-relational types of information, such as video, audio, sonar, or radar inputs. Analytic operators are specified to perform their actions on the streams.

What is InfoSphere BigInsights?

I spent some time reading about IBM InfoSphere BigInsights. In this blog, I wish to share the summary of what I read.

Need for a solution like BigInsights
Imagine if you were able to:

  •  Build sophisticated predictive models from the combination of existing information and big data information flows, providing a level of depth that only analytics applied at a large scale can offer.
  •  Broadly and automatically perform consumer sentiment and brand perception analysis on data gathered from across the Internet, at a scale previously impossible using partially or fully manual methods.
  • Analyze system logs from a variety of disparate systems to lower operational risk.
  •  Leverage existing systems and customer knowledge in new ways that were previously ruled out as infeasible due to cost or scale.

Highlights of InfoSphere BigInsights

  • BigInsights allows organizations to cost-effectively analyze a wide variety and large volume of data to gain insights that were not previously possible.
  • BigInsights is focused on providing enterprises with the capabilities they need to meet critical business requirements while maintaining compatibility with the Hadoop project.
  • BigInsights includes a variety of IBM technologies that enhance and extend the value of open-source Hadoop software to facilitate faster time-to-value, including application accelerators, analytical facilities, development tools, platform improvements and enterprise software integration.
  • While BigInsights offers a wide range of capabilities that extend beyond the Hadoop functionality, IBM has taken an opt-in approach: you can use the IBM extensions to Hadoop based on your needs rather than being forced to use the extensions that come with InfoSphere BigInsights.
  • In addition to core capabilities for installation, configuration and management, InfoSphere BigInsights includes advanced analytics and user interfaces for the non-developer business analyst.
  • It is flexible enough to be used with unstructured or semi-structured information; the solution does not require schema definitions or data preprocessing and allows for structure and associations to be added on the fly across information types.
  • The platform runs on commonly available, low-cost hardware in parallel, supporting linear scalability; as information grows, we simply add more commodity hardware.

InfoSphere BigInsights provides a unique set of capabilities that combine the innovation from the Apache Hadoop ecosystem with robust support for traditional skill sets and already installed tools. The ability to leverage existing skills and tools through open-source capabilities helps drive lower total cost of ownership and faster time-to-value.  Thus InfoSphere BigInsights enables new solutions for problems that were previously too large and complex to solve cost-effectively.

Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions

Challenges of Data Lake paving way for Data Reservoir

In my previous blogs I discussed the Data Lake. Imagine you have pooled all the data of your enterprise into a data lake. There will be challenges. All this raw data will be overwhelming and unsafe to use because no one is sure where the data came from, how reliable it is, and how it should be protected. Without proper management and governance, such a data lake can quickly become a data swamp. This data swamp can cause frustration for business users, application developers, IT and even customers.

So there is a need for a facility for transforming raw data into information that is clean, timely, useful and relevant. Hence an enhanced data lake solution was built with management, affordability, and governance at its core. This solution is known as a data reservoir. Probably in one of the subsequent blogs we will take a dip into the data reservoir! Stay tuned.

Data Lake Vs Data Warehouse

In my last blog, I wrote about the Data Lake. The first comment on the blog asked about the difference between a Data Lake and a Data Warehouse. So in this blog, I will try to share some of my understanding of their differences:

Schema: In a Data Warehouse (DW), the schema is defined before data is stored; the required data is identified and modeled in advance. This is called “Schema on WRITE”. In a Data Lake, the schema is defined after the data is stored. This is called “Schema on READ”. So the structure of the data must be captured in code by each program that accesses it.
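
Here is a small Python sketch of the two approaches as I understand them. It is a toy illustration, not how any particular warehouse or data lake product works. The schema, field names and function names (`WAREHOUSE_SCHEMA`, `write_to_warehouse`, `write_to_lake`, `read_temperatures`) are all made up for this example.

```python
import json

# Assumed, illustrative schema: field name -> required Python type.
WAREHOUSE_SCHEMA = {"patient_id": int, "temperature_c": float}


def write_to_warehouse(table: list, record: dict) -> None:
    """Schema on write: a record that does not fit the model is rejected at load time."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def write_to_lake(lake: list, raw_line: str) -> None:
    """The lake stores whatever arrives, in its original form."""
    lake.append(raw_line)


def read_temperatures(lake: list):
    """Schema on read: this program decides which fields matter and how to parse them."""
    for raw in lake:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON (for example a free-text note); this reader skips it
        if isinstance(doc, dict) and "temperature_c" in doc:
            yield float(doc["temperature_c"])


if __name__ == "__main__":
    lake = []
    write_to_lake(lake, '{"patient_id": 1, "temperature_c": "37.2"}')
    write_to_lake(lake, "free-text physician note")
    write_to_lake(lake, '{"lab": "CBC"}')
    print(list(read_temperatures(lake)))
```

Notice that the lake accepted all three lines without complaint, while the same first record would be rejected by `write_to_warehouse` because its temperature is a string. The price is that every reader has to carry its own parsing logic.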

Cost (Storage and Processing): A Data Lake provides cheaper storage of large volumes of data and has the potential to reduce processing cost by bringing analytics closer to the data.

Data Access: The data lake gives business users immediate access to all data. They don’t have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery which is required in a dynamic market economy.

Flexibility: Data Lakes offer unparalleled flexibility since nothing and nobody stands between business users and the data.

Data Quality: Data in a traditional Data Warehouse is cleansed, whereas data in a Data Lake is typically raw.

Relevance in Big Data world: The traditional approach of manually curated data warehouses provides a limited window onto the data, and such warehouses are designed to answer only the specific questions identified at design time. This may not be adequate for data discovery in today’s big data world. Moreover, traditional data warehouses are limited to structured data, whereas a data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. For example, data lakes are an ideal way to manage the millions of patient records for a hospital. These patient records can range from physicians’ notes to lab results. With a data lake, the hospital stores all of that disparate data in its original format, calling upon specific types of record when needed and converting the data into uniform structures only when the situation calls for it.

A Data Lake does provide some advantages to enterprises that require quick access to data. But Data Lakes bring their own set of challenges. I will explore these in my subsequent blogs.

Information Server and Big Data Integration – IV (Strengths of Hadoop)

In my earlier blog, I described what Hadoop is. In the last two parts of this series, I mentioned some of the limitations of Hadoop. That should not discount the potential that Hadoop has. So in this blog I will share in simple terms what Hadoop is and what role it can play for enterprises that are testing the waters in the Big Data world.

Example to illustrate the potential of Hadoop:

Imagine you have a jar of multicolored candies, and you need the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes.
Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Everybody sifts through their set and arrives at an answer that they share with the others to arrive at a total. Isn’t it much faster?
That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power – the ability to handle virtually limitless concurrent tasks and jobs – making it a remarkably low-cost complement to a traditional enterprise data infrastructure.
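
The candy example can be sketched in a few lines of Python. This is plain Python, not Hadoop or MapReduce code; the function names (`split`, `count_plate`, `count_jar`) are my own, and a thread pool simply stands in for the four helpers. On a real cluster the "plates" would be blocks of data on different machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(jar: list, plates: int) -> list:
    """Deal the candies onto `plates` roughly equal piles."""
    return [jar[i::plates] for i in range(plates)]


def count_plate(plate: list) -> Counter:
    """Map step: one helper tallies the colors on one plate."""
    return Counter(plate)


def count_jar(jar: list, helpers: int = 4) -> Counter:
    """Reduce step: add up the helpers' partial tallies into one total."""
    with ThreadPoolExecutor(max_workers=helpers) as pool:
        partial_counts = list(pool.map(count_plate, split(jar, helpers)))
    return sum(partial_counts, Counter())


if __name__ == "__main__":
    jar = ["blue"] * 5 + ["red"] * 3 + ["yellow"] * 2
    print(count_jar(jar))
```

Each helper only ever sees its own plate, and the only thing shared at the end is a small tally per helper. That is why the approach keeps working when the jar no longer fits on a single plate.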

Enterprises are using Hadoop for several notable merits:
• Hadoop is distributed. Bringing a high-tech twist to the adage, “Many hands make light work,” data is stored on local disks of a distributed cluster of servers.
• Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity of a prepackaged system, Hadoop is easily 10 times cheaper than higher-cost specialized hardware of comparable computing capacity.
• Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.
• Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to “schematize” them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.
• Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data. LinkedIn manages over 1 billion personalized recommendations every week by using Hadoop and its MapReduce and HDFS features! Facebook keeps track of 1 billion user profiles, along with related data such as posts, comments, images, videos, and so on, using Hadoop.
• Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds. To put it in context, 10 terabytes could store the entire US Library of Congress print collection.
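
The fault-tolerance point in the list above, running multiple copies of the same task and accepting whichever finishes first, can be illustrated with a small Python sketch. This is only a toy model of the idea and not Hadoop's scheduler; the function name `run_speculatively` and the simulated delays are assumptions made for the example.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_speculatively(task, arg, delays):
    """Run one copy of `task(arg)` per entry in `delays`; return the first result.

    Each delay (in seconds) simulates how slow the node running that copy is.
    """
    def attempt(delay):
        time.sleep(delay)  # pretend this node is slow
        return task(arg)

    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(attempt, delay) for delay in delays]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not wait for the slower copies
    return next(iter(done)).result()


if __name__ == "__main__":
    # One slow copy and one fast copy of the same task.
    print(run_speculatively(sum, [1, 2, 3], delays=[0.2, 0.01]))
```

Because every copy computes the same thing, it does not matter which one wins. A slow or failing node only costs some spare capacity, not the job.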


Hadoop handles big data. It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text). In my upcoming blogs on IBM InfoSphere BigInsights, I will try to share how tools built on Hadoop help enterprise users uncover its potential faster.

Information Server and Big Data Integration – III (Limitations of Hadoop contd.)

In Part 1 of this series I mentioned the role an ETL tool can play in the world of Hadoop. In Part 2, we discussed some of the technical limitations of Hadoop. In Part 3, based on my recent readings, we will discuss further why Hadoop cannot serve as a data integration solution on its own. This may come as a surprise to some Hadoop proponents, as they see Hadoop projects performing extract, transform and load workstreams. Although these serve a purpose, the technology lacks key features and functions of commercially supported data integration tools. Here are a few…

  • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
  • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
  • Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.
  • A purely Hadoop-based approach to data integration requires custom code, which demands specialized skills and ongoing effort to maintain and change, and so drives up costs.
  • Data integration projects require good governance principles and select technologies that support the application of the required policies and procedures. This is not addressed in Hadoop-based projects as of now.

Concluding Remarks:
Not only are many key data integration capabilities immature or missing from the Hadoop stack, but many have not been addressed in current projects.

Disclaimer: “The postings on this site are my own (based on my readings and interpretations) and don’t necessarily represent IBM’s positions, strategies or opinions.”