Lift your Data to Cloud

To stay competitive and reduce costs, many enterprises are realizing the merits of moving their data to the cloud. Thanks to their economies of scale, cloud storage vendors can achieve lower costs. Enterprises also escape the drudgery of capacity planning, buying, commissioning, provisioning and maintaining storage systems. Data is even protected by replication across multiple data centers, which cloud vendors provide by default. You can read this blog for a list of the advantages of moving data to the cloud.

But now the BIG challenge is to securely migrate terabytes of enterprise data to the cloud. Months can be spent coming up with an airtight migration plan that does not disrupt your business. And the final migration may itself take a long time, adversely impacting the users, applications and customers that rely on the source database.

Innovative data migration

In short, database migration can end up being a miserable experience. IBM Bluemix Lift is a self-service, ground-to-cloud database migration offering from IBM that takes care of these needs. With Bluemix Lift, database migration becomes fast, reliable and secure. Here’s what it offers:

  • Blazing-fast speed: Bluemix Lift accelerates data transfer by embedding IBM Aspera technology. Aspera’s patented, highly efficient bulk data transport protocol lets Bluemix Lift achieve transfer speeds much faster than FTP and HTTP. Moving 10 TB of data can take a little over a day, depending on your network connection.
  • Zero downtime: Bluemix Lift can eliminate the downtime associated with database migrations. An efficient change capture technology tracks incremental changes to your source database and replays them to your target database. As a result, any applications using the source database can keep running uninterrupted while the database migration is in progress.
  • Secure: Any data movement across the Internet requires strong encryption so that the data is never compromised. Bluemix Lift encrypts data as it travels across the web on its way to an IBM cloud data property.
  • Easy to use: Set up the source data connection, provide credentials to the target database, verify schema compatibility with the target database engine and hit run. That’s all it takes to kick off a database migration with Bluemix Lift.
  • Reliable: The Bluemix Lift service automatically recovers from problems encountered during data extract, transport and load. If your migration is interrupted because of a drop in network connectivity, Bluemix Lift automatically resumes once connectivity returns. In other words, you can kick off a large database migration and walk away knowing that Bluemix Lift is on the job.
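The resume-after-interruption behavior described in the last bullet can be illustrated with a small sketch. To be clear, this is not Lift's actual implementation; the checkpoint file, chunk size and function name are assumptions made up for illustration. The idea is simply to persist a byte offset after each chunk, so a restarted transfer picks up where it left off instead of starting over:

```python
import os

CHUNK_SIZE = 4  # bytes per chunk; tiny here purely for demonstration

def resumable_copy(src_path, dst_path, state_path):
    """Copy src to dst in chunks, persisting progress in a state file
    so that an interrupted transfer resumes rather than restarting."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = int(f.read() or 0)  # resume from the last checkpoint
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        dst.truncate(offset)  # discard any partial data past the checkpoint
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            dst.write(chunk)
            dst.flush()
            offset += len(chunk)
            with open(state_path, "w") as f:
                f.write(str(offset))  # checkpoint after every chunk
    os.remove(state_path)  # transfer complete; no resume point needed
```

A real service would checkpoint on the receiving side and verify chunks with checksums, but the walk-away-and-let-it-finish property rests on exactly this kind of durable progress record.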

Speed, zero downtime, security, ease of use and reliability—these are the hallmarks of a great database migration service, and Bluemix Lift delivers on all of them. Bluemix Lift gets data into a cloud database as easily as selecting Save As –> Cloud. It also provides a jumping-off point for capabilities planned for the future, such as new source and target databases, enhanced automation and additional use cases. Take a look at IBM Bluemix Lift and give it a go.

IBM Bluemix Data Connect

I have been tracking the development of IBM Bluemix Data Connect quite closely. One of the reasons is that I was a key developer on one of the first services it launched, almost two years ago, under the name DataWorks. Two weeks ago I attended a session on Data Connect by its architect and saw a demo. I am impressed by how it has evolved since then. So I am planning to revisit DataWorks, now as IBM Bluemix Data Connect. In this blog I will look at the role IBM Bluemix Data Connect plays in the era of cloud computing, big data and the Internet of Things.

Research from Forrester found that 68 percent of simple BI requests take weeks, months or longer for IT to fulfill due to a lack of technical resources. This means enterprises must find ways to transform line-of-business professionals into skilled data workers, taking some of the burden off of IT. Business users should be empowered to work with data from many sources—both on premises and in the cloud—without requiring the deep technical expertise of a database administrator or data scientist.

This is where cloud services like IBM Bluemix Data Connect come into the picture. Data Connect enables both technical and non-technical business users to derive useful insights from data with point-and-click access—whether it’s a few Excel sheets stored locally or a massive database hosted in the cloud.

Data Connect is a fully managed data preparation and movement service that enables users to put data to work through a simple yet powerful cloud-based interface. The design team has taken great pains to keep the solution simple, so that a novice user can get started quickly. Data Connect empowers the business analyst to discover, cleanse, standardize, transform and move data in support of application development and analytics use cases.
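To make "cleanse, standardize, transform" concrete, here is a minimal sketch in plain Python of the kind of record cleanup such a service automates. The field names and rules are invented for illustration, and Data Connect itself exposes these operations through its point-and-click interface rather than code:

```python
def cleanse(rows):
    """Cleanse and standardize raw records: trim stray whitespace,
    normalize casing, and drop rows missing required fields."""
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()    # standardize casing
        email = (row.get("email") or "").strip().lower()  # emails lowercased
        if not name or not email:
            continue  # drop incomplete records instead of passing them on
        cleaned.append({"name": name, "email": email})
    return cleaned
```

Each of these steps (trimming, case normalization, rejecting incomplete rows) corresponds to an operation a business analyst applies interactively, with the service handling it at scale.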

Through its integration with cloud data services like IBM Watson Analytics, Data Connect is a seamless tool for preparing and moving data from on premises and off premises to an analytics cloud ecosystem where it can be quickly analyzed and visualized. Furthermore, Data Connect is backed by continuous delivery, which adds robust new features and functionality on a regular basis. Its processing engine is built on Apache Spark, the leading open source analytics project, with a large and continuously growing development community. The result is a best-of-breed solution that can keep up with the rapid pace of innovation in big data and cloud computing.

So here are highlights of IBM Bluemix Data Connect:

  • Allow technical and non-technical users to draw value from data quickly and easily.
  • Ensure data quality with simple data preparation and movement services in the cloud.
  • Integrate with leading cloud data services to create a seamless data management platform.
  • Continuous inflow of new and robust features.
  • Best-of-breed ETL solution available on Bluemix, IBM’s next-generation cloud app development platform.

InfoSphere Information Server 11.5 Tames Big Data

In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offer unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will cover some of the enhancements in IBM’s InfoSphere Information Server 11.5 that help you leverage the scale and promise of Hadoop.

Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement and data delivery features that thousands of enterprises have relied on for years are immediately available to run within the Hadoop platform! Information Server is the market-leading product in terms of its data integration and governance capabilities. Now the same product can be used to solve some of the industry’s most complex data challenges directly inside a Hadoop cluster. Imagine the time saved by not moving data back and forth from HDFS!

Even better, these new features for Hadoop use the same simple graphical design environment that IBM clients are already accustomed to building integration applications with. In other words, organizations can build new Hadoop-based, information-intensive applications without retraining their development team on newly emerging languages that require manual hand coding and lack governance support.

How is this accomplished? YARN!
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server can communicate with YARN to run a job on the data nodes of a Hadoop cluster using the following steps.

Step 1
The conductor process manages the section leader and player processes that run on the InfoSphere Information Server engine. The conductor process on the engine tier receives a job run request for an InfoSphere DataStage or InfoSphere QualityStage job. This job might be generated from an InfoSphere Information Analyzer analysis.
Step 2
The conductor connects to the YARN client, which assigns an Application Master to the job from the pool of Application Masters it maintains. If no Application Master is available in the pool, the client starts a new one for the job. The conductor connects to the Application Master and sends the details of the resources required to run the job.
Step 3
The Application Master requests resources from the YARN resource manager. The job’s processes run in YARN containers, with each container running a section leader and players. The YARN container designates resource requirements such as CPU and memory. When the resources are allocated, the conductor sends the process commands for the section leader to the Application Master, which starts those commands on the allocated resources.
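The three steps can be summarized as a toy simulation of who talks to whom. None of the class or method names below are the real Information Server or YARN APIs; this is purely a sketch of the control flow:

```python
class ResourceManager:
    """Stands in for the YARN resource manager (step 3)."""
    def allocate(self, n, cpu, mem_mb):
        # Hand back n container descriptions meeting the resource request.
        return [f"container-{i} ({cpu} vcores, {mem_mb} MB)" for i in range(n)]

class ApplicationMaster:
    """Stands in for the per-job Application Master (steps 2-3)."""
    def __init__(self, resource_manager):
        self.rm = resource_manager
    def run_job(self, commands, cpu, mem_mb):
        containers = self.rm.allocate(len(commands), cpu, mem_mb)
        # Start each process command on its allocated container.
        return [f"{cmd} on {c}" for cmd, c in zip(commands, containers)]

class Conductor:
    """Stands in for the Information Server conductor (step 1)."""
    def __init__(self, am_pool):
        self.am_pool = am_pool
    def submit(self, job_commands):
        # Step 2: take an Application Master from the pool, or start a new one.
        am = self.am_pool.pop() if self.am_pool else ApplicationMaster(ResourceManager())
        return am.run_job(job_commands, cpu=2, mem_mb=4096)
```

The key structural point survives the simplification: the conductor never launches the section leader and players itself; it delegates placement to the Application Master, which negotiates containers with the resource manager.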

More details can be found here.

A few other capabilities offered by InfoSphere Information Server on Hadoop include:

  • InfoSphere Information Analyzer features are now supported, executing directly inside a Hadoop cluster.
  • Hadoop metadata management is made easy
  • HDFS file connectivity including new data formats, additional character sets and additional data types
  • Support for Kerberos-enabled clusters
  • YARN job browser

Spark – Sparkling framework for big data management and analytics

There has been a lot of buzz around Apache Spark over the last several months, and I have been following it to some extent and comparing it with Hadoop. In this blog, I will share some of what I have read about it.

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Well, wasn’t that Hadoop’s claim to fame? Well yes, but Spark was developed as a way to speed up processing jobs in Hadoop systems. Spark advocates claim that with its in-memory computing layer, Spark can run batch-processing programs up to 100 times faster than MapReduce can. When data is processed from disk, Spark can run batch jobs up to 10 times faster, they say.

While MapReduce is limited to batch processing, the Apache Spark architecture includes a stream processing module, a machine learning library and a graph processing API with related algorithms, making it a more general-purpose platform. Spark Streaming in particular has found its way into deployments at early adopters, for uses such as analyzing online advertising data and processing satellite images and geo-tagged tweets. Does handling these additional processing workloads require companies to expand their Hadoop clusters? The answer, obviously, is yes.

Unlike Hadoop, Spark doesn’t include its own file system. It can run in standalone mode and access a variety of data sources, but most often it is used to process and analyze data stored in the Hadoop Distributed File System (HDFS). So it should not be surprising that Spark has been incorporated into every major Hadoop distribution, including the ones from Cloudera, Hortonworks, IBM, MapR and Pivotal. In such installations, one can still use MapReduce for its reliability, but Spark may require less development expertise than MapReduce does because of its high-level APIs and support for writing applications in Java, Scala or Python.
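The classic word count illustrates the programming model that makes Spark approachable. The version below is plain Python that mirrors the shape of a PySpark job (flatMap, map to pairs, reduceByKey); in real Spark each stage would be an RDD transformation distributed across the cluster and, for iterative workloads, cached in memory between stages, which is where the speedups cited above come from:

```python
from collections import defaultdict

def word_count(lines):
    """Word count in the map/reduce style Spark popularized:
    flatMap lines into words, map each word to a (word, 1) pair,
    then reduceByKey with addition."""
    pairs = ((word, 1) for line in lines for word in line.split())  # flatMap + map
    counts = defaultdict(int)
    for word, n in pairs:  # reduceByKey(add), done locally here
        counts[word] += n
    return dict(counts)
```

In PySpark the same logic is three chained calls on an RDD of lines; the high-level API is exactly why the section above notes that Spark may demand less development expertise than hand-written MapReduce.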

IBM InfoSphere Streams

Stream computing and the need for it:
Stream computing refers to the analysis of data in motion (before it is stored).
Stream computing becomes essential with certain kinds of data when there is no time to store it before acting on it. This data is often (but not always) generated by a sensor or other instrument. Speed matters for use cases such as fraud detection, emergent healthcare and public safety, where insight into data must occur in real time and the decisions taken can be life-saving.

Streaming data might arrive linked to a master data identifier (a phone number, for example). In a neonatal ICU, sensor data is clearly matched to a particular infant. Financial transactions are unambiguously matched to a credit card or Social Security number. However, not every piece of data that streams in is valuable, which is why the data should be analyzed by a tool such as IBM InfoSphere Streams before being joined with the master data.

IBM InfoSphere Streams
IBM InfoSphere Streams analyzes these large data volumes with micro-latency. Rather than accumulating and storing data first, the software analyzes data as it flows in and identifies conditions that trigger alerts (such as outlier transactions that a bank flags as potentially fraudulent during the credit card authorization process). When this situation occurs, the data is passed out of the stream and matched with the master data for better business outcomes. InfoSphere Streams generates a summary of the insights derived from the stream analysis and matches it with trusted information, such as customer records, to augment the master data.
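The outlier-flagging idea can be sketched with a plain-Python generator. Real InfoSphere Streams applications are written in SPL (Streams Processing Language), so this is only a conceptual illustration with an invented threshold rule: each event is examined as it arrives, against a small sliding window of recent values, without ever storing the full stream:

```python
def flag_outliers(transactions, window=5, factor=3.0):
    """Yield transaction amounts that exceed `factor` times the mean
    of the previous `window` amounts. Only the window is kept in
    memory; analysis happens as each event flows in."""
    recent = []
    for tx in transactions:
        if len(recent) == window:
            mean = sum(recent) / window
            if tx > factor * mean:
                yield tx  # pass this event out of the stream for follow-up
        recent.append(tx)
        if len(recent) > window:
            recent.pop(0)  # slide the window forward
```

A flagged amount is what would then be matched against master data (the cardholder's record, say) downstream, while the unremarkable bulk of the stream is never persisted at all.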

IBM InfoSphere Streams is based on nearly 6 years of effort by IBM Research to extend computing technology to rapidly do advanced analysis of high volumes of data, such as:

  •  Analysis of video camera data for specific faces in law enforcement applications
  •  Processing of 5 million stock market messages per second, including trade execution with an average latency of 150 microseconds
  •  Analysis of test results from chip manufacturing wafer testers to determine in real time if there are failing chips and if there are patterns to the failures

InfoSphere Streams provides a development platform and runtime environment where you can develop applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams that are based on defined, proven, and analytical rules that alert you to take appropriate action, all within an appropriate time frame for your organization.

The data streams that are consumable by InfoSphere Streams can originate from sensors, cameras, news feeds, stock tickers, or various other sources, including traditional databases. The streams of input sources are defined and can be numeric, text, or non-relational types of information, such as video, audio, sonar, or radar inputs. Analytic operators are specified to perform their actions on the streams.

What is InfoSphere BigInsights?

I spent some time reading about IBM InfoSphere BigInsights. In this blog, I wish to share the summary of what I read.

Need for a solution like BigInsights
Imagine if you were able to:

  •  Build sophisticated predictive models from the combination of existing information and big data information flows, providing a level of depth that only analytics applied at a large scale can offer.
  •  Broadly and automatically perform consumer sentiment and brand perception analysis on data gathered from across the Internet, at a scale previously impossible using partially or fully manual methods.
  • Analyze system logs from a variety of disparate systems to lower operational risk.
  •  Leverage existing systems and customer knowledge in new ways that were previously ruled out as infeasible due to cost or scale.

Highlights of InfoSphere BigInsights

  • BigInsights allows organizations to cost-effectively analyze a wide variety and large volume of data to gain insights that were not previously possible.
  • BigInsights is focused on providing enterprises with the capabilities they need to meet critical business requirements while maintaining compatibility with the Hadoop project.
  • BigInsights includes a variety of IBM technologies that enhance and extend the value of open-source Hadoop software to facilitate faster time-to-value, including application accelerators, analytical facilities, development tools, platform improvements and enterprise software integration.
  • While BigInsights offers a wide range of capabilities that extend beyond the Hadoop functionality, IBM has taken an opt-in approach: you can use the IBM extensions to Hadoop based on your needs rather than being forced to use the extensions that come with InfoSphere BigInsights.
  • In addition to core capabilities for installation, configuration and management, InfoSphere BigInsights includes advanced analytics and user interfaces for the non-developer business analyst.
  • It is flexible enough to be used for unstructured or semi-structured information; the solution does not require schema definitions or data preprocessing, and allows structure and associations to be added on the fly across information types.
  • The platform runs on commonly available, low-cost hardware in parallel, supporting linear scalability; as information grows, we simply add more commodity hardware.

InfoSphere BigInsights provides a unique set of capabilities that combine the innovation of the Apache Hadoop ecosystem with robust support for traditional skill sets and already-installed tools. The ability to leverage existing skills and tools through open-source capabilities helps drive a lower total cost of ownership and faster time-to-value. Thus InfoSphere BigInsights enables new solutions for problems that were previously too large and complex to solve cost-effectively.

Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions

Challenges of Data Lake paving way for Data Reservoir

In my previous blogs I discussed the Data Lake. Imagine you have pooled your enterprise’s entire data into a data lake; there will still be challenges. All this raw data can be overwhelming and unsafe to use, because no one is sure where the data came from, how reliable it is, or how it should be protected. Without proper management and governance, such a data lake can quickly become a data swamp. This data swamp frustrates business users, application developers, IT and even customers.

So there is a need for a facility that transforms raw data into information that is clean, timely, useful and relevant. Hence an enhanced data lake solution was built with management, affordability and governance at its core. This solution is known as a data reservoir. In one of the subsequent blogs we will probably take a dip into the data reservoir! Stay tuned.