IBM InfoSphere Streams

Stream Computing and the need for it:
Stream computing refers to the analysis of data in motion, that is, before it is stored.
Stream computing becomes essential with certain kinds of data when there is no time to store it before acting on it. This data is often (but not always) generated by a sensor or other instrument. Speed matters for use cases such as fraud detection, emergency healthcare, and public safety, where insight must occur in real time and the decision to act can be life-saving.

Streaming data might arrive linked to a master data identifier (a phone number, for example). In a neonatal ICU, sensor data is clearly matched to a particular infant. Financial transactions are unambiguously matched to a credit card or Social Security number. However, not every piece of data that streams in is valuable, which is why the data should be analyzed by a tool such as IBM InfoSphere Streams before being joined with the master data.

IBM InfoSphere Streams
IBM InfoSphere Streams analyzes these large data volumes with Micro-Latency. Rather than accumulating and storing data first, the software analyzes data as it flows in and identifies conditions that trigger alerts (such as outlier transactions that a bank flags as potentially fraudulent during the credit card authorization process). When this situation occurs, the data is passed out of the stream and matched with the master data for better business outcomes. InfoSphere Streams generates a summary of the insights that are derived from the stream analysis and matches it with trusted information, such as customer records, to augment the master data.

IBM InfoSphere Streams is based on nearly six years of effort by IBM Research to extend computing technology so that it can rapidly perform advanced analysis of high volumes of data, such as:

  •  Analysis of video camera data for specific faces in law enforcement applications
  •  Processing of 5 million stock market messages per second, including trade execution with an average latency of 150 microseconds
  •  Analysis of test results from chip manufacturing wafer testers to determine in real time if there are failing chips and if there are patterns to the failures

InfoSphere Streams provides a development platform and runtime environment where you can develop applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams. Applications apply defined, proven analytical rules that alert you to take appropriate action, all within a time frame appropriate for your organization.

The data streams consumable by InfoSphere Streams can originate from sensors, cameras, news feeds, stock tickers, or various other sources, including traditional databases. Input streams are defined and can carry numeric, text, or non-relational types of information, such as video, audio, sonar, or radar. Analytic operators are specified to perform their actions on the streams.
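As an illustration of the idea only (not the actual Streams/SPL programming model), a stream of tuples flowing through filter and analytic operators can be sketched with Python generators; the transaction feed and thresholds below are invented for the example:

```python
# Toy sketch of stream-style processing: tuples flow through "operators"
# and are acted on as they arrive, without being stored first.
# The feed, field names, and thresholds are illustrative assumptions.

def source():
    """Simulated transaction feed: (card, amount) tuples."""
    for txn in [("card1", 40.0), ("card1", 55.0), ("card2", 9000.0), ("card1", 35.0)]:
        yield txn

def filter_op(stream, min_amount=10.0):
    """Drop low-value tuples that are not worth analyzing further."""
    for card, amount in stream:
        if amount >= min_amount:
            yield card, amount

def analyze_op(stream, threshold=1000.0):
    """Flag outlier transactions (e.g. potential fraud) as they arrive."""
    for card, amount in stream:
        yield card, amount, amount > threshold

# Wire the operators together and collect only the flagged tuples.
alerts = [t for t in analyze_op(filter_op(source())) if t[2]]
print(alerts)  # [('card2', 9000.0, True)]
```

In a real Streams application the operators would be distributed across a cluster and written in SPL; the point here is only the pipeline shape: data is analyzed in flight and only alerting tuples leave the stream.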

What is InfoSphere BigInsights?

I spent some time reading about IBM InfoSphere BigInsights. In this blog, I wish to share a summary of what I read.

Need for a solution like BigInsights
Imagine if you were able to:

  •  Build sophisticated predictive models from the combination of existing information and big data information flows, providing a level of depth that only analytics applied at a large scale can offer.
  •  Broadly and automatically perform consumer sentiment and brand perception analysis on data gathered from across the Internet, at a scale previously impossible using partially or fully manual methods.
  • Analyze system logs from a variety of disparate systems to lower operational risk.
  •  Leverage existing systems and customer knowledge in new ways that were previously ruled out as infeasible due to cost or scale.

Highlights of InfoSphere BigInsights

  • BigInsights allows organizations to cost-effectively analyze a wide variety and large volume of data to gain insights that were not previously possible.
  • BigInsights is focused on providing enterprises with the capabilities they need to meet critical business requirements while maintaining compatibility with the Hadoop project.
  • BigInsights includes a variety of IBM technologies that enhance and extend the value of open-source Hadoop software to facilitate faster time-to-value, including application accelerators, analytical facilities, development tools, platform improvements and enterprise software integration.
  • While BigInsights offers a wide range of capabilities that extend beyond the Hadoop functionality, IBM has taken an opt-in approach: you can use the IBM extensions to Hadoop based on your needs rather than being forced to use the extensions that come with InfoSphere BigInsights.
  • In addition to core capabilities for installation, configuration and management, InfoSphere BigInsights includes advanced analytics and user interfaces for the non-developer business analyst.
  • It is flexible enough to be used for unstructured or semi-structured information; the solution does not require schema definitions or data preprocessing and allows structure and associations to be added on the fly across information types.
  • The platform runs on commonly available, low-cost hardware in parallel, supporting linear scalability; as information grows, we simply add more commodity hardware.

InfoSphere BigInsights provides a unique set of capabilities that combine the innovation from the Apache Hadoop ecosystem with robust support for traditional skill sets and already installed tools. The ability to leverage existing skills and tools through open-source capabilities helps drive lower total cost of ownership and faster time-to-value. Thus, InfoSphere BigInsights enables new solutions for problems that were previously too large and complex to solve cost-effectively.

Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

Need for Master Data Management

Incidentally, this is my 100th blog on this site. For some time I have been reading about Master Data Management, and in this blog I wish to explain the need for Master Data Management (MDM).

Imagine you are a bank with a huge clientele. You want to know how many clients you have. Is it enough to count the number of credit card customers, savings account customers, and loan account customers? What if there is an overlap? What if information in one account would help you serve the customer better in another account? For example, a customer with a gold credit card account calls, but this time as a loan account customer, and we fail to recognize this and treat him as an ordinary customer. So wouldn’t it be nice for the bank to have a consolidated view of each customer? That’s where master data management (MDM) comes in. MDM makes it possible to distill a single view of the client, or of the patient, supplier, partner, account or other critical ‘entity’, from the incomplete or inconsistent bits of data that are scattered across the enterprise. The resulting view, now unified across disparate silos, provides the insight you need to make better decisions and create superior outcomes.

Master data is the information about customers, products, materials, accounts and other entities that is critical to the operation of the business. But companies hold pieces of master data in many different applications, such as enterprise resource planning (ERP) and customer relationship management (CRM) systems. Each of those source systems creates and holds the data in its own unique way. As a result, information does not match from one system to the next. Critical data elements may be missing, duplicated or inconsistent. Further, each department can only operate from within its own compartmentalized view.

MDM software manages the creation, maintenance, delivery and use of master data, both to ensure that it is consistent and trustworthy, and to make it possible to see the data in an organization-wide context. Consider an insurance company with multiple divisions. Without MDM, an agent in the auto division will offer rates to a prospective client based on the home address and ages of drivers in the household. These standard rates might be higher than those offered by a competitor. What if the company had an MDM “hub” pulling together customer data from across divisions? Then the agent could see that the customer already owns a homeowner’s policy, and could offer discounted, more competitive auto rates. In this simple example, MDM provides a single view of the client that empowers the agent to secure a better business outcome.
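The “single view” idea can be sketched in a few lines. The record layouts and the matching key (SSN) below are invented for illustration; real MDM matching is far more sophisticated, handling fuzzy names, addresses, and survivorship rules:

```python
# Hypothetical sketch of an MDM hub's core job: records for the same
# person, scattered across source systems, are matched on a shared
# identifier and merged into one consolidated profile.

credit_card = [{"ssn": "123-45-6789", "name": "A. Shah",    "tier": "gold"}]
loans       = [{"ssn": "123-45-6789", "name": "Anil Shah",  "loan": "auto"}]

def build_single_view(*systems):
    """Merge records from each source system, keyed on the matching attribute."""
    hub = {}
    for system in systems:
        for record in system:
            profile = hub.setdefault(record["ssn"], {})
            profile.update(record)  # later sources fill in missing fields
    return hub

view = build_single_view(credit_card, loans)
print(view["123-45-6789"])  # gold tier AND auto loan now visible together
```

With this consolidated profile, the loan desk can see at a glance that the caller is also a gold credit card customer, which is exactly the recognition the bank example above was missing.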

So to summarize, MDM delivers a single, unified, trusted version of truth about an organization’s critical entities—customer, supplier, product and more. Armed with this single, trusted view, organizations can make better decisions and improve business outcomes— which can lead to higher revenue, better customer satisfaction, lower cost and lower risk.


Data Provisioning for enabling Data Virtualization

In my previous blog [Need for Data Virtualization], we discussed how data virtualization is becoming an important component of a comprehensive data integration strategy. Now how does Data Virtualization happen behind the scenes? For the user to get the data from a wide range of sources, the data needs to be provisioned.

Information provisioning comprises the mechanisms behind the scenes that supply the data when it is requested. In InfoSphere, it is achieved in the following four ways:

Information Provisioning styles provided by InfoSphere

Federation: Provide correlated information from multiple sources on demand.

Replication: Maintain a copy of an information source. A real-time data replication solution can replicate data changes to the locally stored data to ensure that it reflects the most current state. You can find more about it here.

Consolidation: Copy and integrate information from multiple sources into a single location. In some cases, data must be transformed (to ensure consistency with other data) before it can be virtualized. The transformation capabilities within an extract, transform and load (ETL) engine can perform this function.

Caching: Provide localized read only copies of information. The localization of frequently accessed remote data enables queries to be executed locally and quickly, without the need for access to remote data sources.
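A minimal sketch of the caching style, assuming a `remote_fetch` function and a fixed time-to-live (both invented here): frequently read values are answered from the local copy until the entry goes stale, so repeated queries never touch the remote source.

```python
import time

# Illustrative read-only cache in front of a remote information source.
# The fetch function and TTL policy are assumptions for the example.

class ReadCache:
    def __init__(self, remote_fetch, ttl_seconds=60):
        self.remote_fetch = remote_fetch
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, fetched_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                # served locally, no remote access
        value = self.remote_fetch(key)     # cache miss or stale entry: go remote
        self._store[key] = (value, time.time())
        return value

calls = []
def remote_fetch(key):
    """Stand-in for an expensive query against a remote source."""
    calls.append(key)
    return key.upper()

cache = ReadCache(remote_fetch)
cache.get("rate")
cache.get("rate")
print(len(calls))  # 1 -- the second read was answered from the local copy
```

The same shape, scaled up, is what lets frequently accessed remote data be queried locally and quickly, as described above.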


Data Delivery using CDC

In my previous blog, I mentioned how Data Virtualization can be achieved through Federation, Consolidation, Replication, and Caching. Consolidation is a traditional ETL batch process, typically run at the end of each business day; I have discussed it in some detail in an earlier blog. In this blog, I wish to spend some more time on Replication in general, focusing on the specifics of InfoSphere CDC towards the end.

When would we require data to be Replicated / have Incremental data delivery?
When businesses require up-to-the-minute or near real-time information, they opt for this method of data delivery. This includes both replication and change data capture. Replication moves data from database to database to provide solutions for (a) continuous business availability, (b) live reporting, and (c) database or platform migrations. When using change data capture, the target is not necessarily a database. In addition to the solutions included in replication, this approach can also feed changes to an ETL process or deliver data changes to a downstream application by using a message queue.

Some examples of how Replication is used include the following:

  •  Providing feeds of changed data for Data Warehouse or Master Data Management (MDM) projects, enabling users to make operational and tactical business decisions using the latest information.
  •  Dynamically routing data based on content to message queues to be consumed by one or more applications, ensuring consistent, accurate, and reliable data across the enterprise.
  • Populating real-time dashboards for on-demand analytics, continuous business monitoring, and business process management, integrating information between mission-critical applications and web applications so that customers and employees have access to real-time data.
  •  Consolidating financial data across systems in different regions, departments, and business units.
  •  Improving the operational performance of systems that are adversely affected by shrinking nightly batch windows or expensive queries and reporting functions.

So what is InfoSphere CDC?
Change data capture uses log-based technology to integrate data in near real time. InfoSphere CDC detects changes by monitoring, or scraping, database logs. The capture engine (the log scraper) is a lightweight, small-footprint, low-impact process running on the source server where the database changes are detected. After the log scraper finds new changed data on the source, that data is pushed from the source agent to the target apply engine through a standard Internet Protocol network socket. In a typical continuous mirroring scenario, the change data is applied to the target database through standard SQL statements. Because the solution interacts only with the database logs, no additional load is put on the source database and no changes are required to the source application.
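The replay of captured changes can be sketched as follows. The log entry format and the in-memory "table" are assumptions for illustration, not the actual CDC internals; in practice the apply engine issues SQL against the target database:

```python
# Illustrative sketch of change data capture: changes read from a database
# log (here a simple list of entries) are replayed against a target table
# in log order, without ever querying the source tables themselves.

change_log = [
    ("INSERT", 1, {"name": "Ada", "balance": 100}),
    ("INSERT", 2, {"name": "Bob", "balance": 50}),
    ("UPDATE", 1, {"balance": 175}),
    ("DELETE", 2, None),
]

def apply_changes(target, log):
    """Replay captured changes against the target, preserving log order."""
    for op, key, values in log:
        if op == "INSERT":
            target[key] = dict(values)
        elif op == "UPDATE":
            target[key].update(values)
        elif op == "DELETE":
            target.pop(key, None)
    return target

target_table = apply_changes({}, change_log)
print(target_table)  # {1: {'name': 'Ada', 'balance': 175}}
```

Because only the log is read, the source tables see no extra query load, which is the property the paragraph above highlights.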


Need For Data Virtualization

We are seeing massive upswings in the volume, variety, and velocity of data. Businesses now want to use data that was difficult to process in the past for decision making, exploratory analysis, advanced analytics, and new applications (including mobile). Adding to that, the information landscape has become more complex, so the ability to get simple data access is more relevant than ever. Data virtualization technologies can help create a single access point to the pool of data you need to support your business.

Here are some of the reasons why organizations may benefit from data virtualization:

Simplified Access to Data:

  • Data virtualization focuses on simplifying access to data by isolating the details of storage and retrieval and making the process transparent to data consumers.
  • The end user will never know—and doesn’t need to know—the details about the data sources. This means changes to source systems will not affect downstream applications. The burden on IT is reduced since the solution does not have to be reconfigured each time a change is made.
  • An organization can create a virtual view across data in the warehouse and data that exists within Hadoop to create a new logical data warehouse. Because end users interact with only one endpoint and use the same access methods they are accustomed to, they can easily take advantage of these new insights from the big data platform without learning new languages and without administrators having to maintain user privileges in multiple locations.
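A toy sketch of the single-endpoint idea described above, with made-up source names and schemas: one function answers the query, and callers never see which back end held each field.

```python
# Hypothetical virtual view over two back ends: a warehouse and a Hadoop
# store. All names and schemas here are invented for illustration; a real
# data virtualization layer would push queries down to each source.

warehouse    = [{"id": 1, "spend": 250}, {"id": 2, "spend": 90}]
hadoop_store = [{"id": 1, "clicks": 40}, {"id": 3, "clicks": 12}]

def virtual_view(customer_id):
    """Single access point: assemble one row from both sources.
    Callers never know (or care) which system held each field."""
    row = {"id": customer_id}
    for source in (warehouse, hadoop_store):
        for rec in source:
            if rec["id"] == customer_id:
                row.update(rec)
    return row

print(virtual_view(1))  # {'id': 1, 'spend': 250, 'clicks': 40}
```

If the Hadoop store is later replaced or re-partitioned, only `virtual_view` changes; every downstream consumer keeps working, which is the insulation benefit the bullets above describe.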

Reduced Costs:

  • Multiple information access points also increase the total cost of ownership since each access point requires administration, security and management resources.
  • Reliance on older technologies to get to data eventually requires organizations to purchase more hardware, more databases and more software with more capabilities—all of which require budget.
  • Because data virtualization involves virtual consolidation of data rather than physical movement, there is no need to create new databases or purchase additional hardware to store the consolidated data.

Faster Time to Value:

  • Building virtual data stores is much quicker than creating physical ones since data does not have to be physically moved.
  • Data virtualization reduces the time required to take advantage of disparate data, which makes it easier for users and processes to get the information they need in a timely manner.