DataStage now available on Cloud

For data integration projects, DataStage has been the work horse for many years. It is used by Data Engineers to extract data from many different sources, transform and combine the data, and then populate them for applications and end users. DataStage has many distinct advantages over other popular ETL tools.

ETL on CloudUntil recently, these capabilities were only available with the on-premises offering. Now DataStage is available on the Cloud as a hosted cloud offering. Customers can take advantage of the full capabilities of DataStage and without the burden and time consumption of standing up the infrastructure and installing the software themselves. Customers can quickly deploy a DataStage environment (from ordering to provisioning it on the cloud) and be up and running in a day or less. There is no up-front capital expenditure as customers only pay a monthly subscription based on the capacity they purchase. Licensing is also greatly simplified.

Using DatasStage on Cloud, existing DataStage customers can start new projects quickly. Since it is hosted in the IBM cloud, the machine and operating system are managed by IBM. The customer will not have to spend time to either increase the current environment or create a new one. In other words, Cloud elasticity makes them ready to scale and handle any workload. DataStage ETL job developers can immediately be productive and the data integration activities can span both on-premises and cloud data if necessary, as the DataStage jobs can be exported from the cloud and brought back to an on-premises DataStage environment.

datastage-on-cloud2As an example; A customer has data sources such as Teradata, DB2, etc. in their data center as well as SalesForce, MongoDB and other data residing in the Cloud. They need access to their existing data sources and their cloud data sources for a new customer retention project . This project requires some sophisticated data integration to bring it all together but they don’t have the IT resources or budget to stand up a new data integration environment in their own data center for this project. So, an instance of DataStage on the Cloud can be deployed for their use. The customer can access the DataStage client programs on the Cloud to work with DataStage. The access would be either through the public Internet or a private connection via the SoftLayer VPN. DataStage ETL jobs running in the Cloud can access the customer’s on-premise data sources and targets using secured protocols and encryption methods. In addition, these DataStage jobs can also access cloud data sources like dashDB as well as data sources on other cloud platforms using the appropriate secured protocols.

So with DataStage hosted on the Cloud you can:

  1. Extend your ETL infrastructure: Expand your InfoSphere DataStage environment or begin transitioning into a private or public cloud with flexible deployment options and subscription pricing.
  2. Establish ad hoc environments: Extend your on-premises capacity to quickly create new environments for ad hoc development and testing or for limited duration projects.
  3. Start new projects in the cloud: Move straight to the cloud without establishing an on-premises environment. Realize faster time-to-value, reduce administration burden and use low-risk subscription pricing.

Lift your Data to Cloud

database_migrationTo stay competitive and reduce cost, several Enterprises are realizing the merits of moving their data to Cloud. Due to their economies of scale cloud storage vendors can achieve lesser cost. Also Enterprises escape the drudgery of [capacity] planning, buying, commissioning, provisioning and maintaining storage systems. Data is even protected by replication to multiple data centers which Cloud vendors provide by default. You can read this blog listing the various advantages to move data to cloud.

But now the BIG challenge is to securely migrate the terabytes of Enterprise data to Cloud. Months can be spent coming up with airtight migration plan which does not disrupt your business. And the final migration may also take a long time impacting adversely the users, applications or customers using the source database.

Innovative data migration

In short, database migration can end up being a miserable experience. IBM Bluemix Lift is a self-service, ground-to-cloud database migration offering from IBM to take care of the above listed needs. Using Bluemix Lift, database migration becomes fast, reliable and secure. Here’s what it offers:

  • Blazing fast Speed: Bluemix Lift helps accelerate data transfer by embedding the IBM Aspera technology. Aspera’s patented and highly efficient bulk data transport protocol allows Bluemix Lift to achieve transport speeds much faster than FTP and HTTP. Moving 10 TB of data can take a little over a day, depending on your network connection.
  • Zero downtime: Bluemix Lift can eliminate the downtime associated with database migrations. An efficient change capture technology tracks incremental changes to your source database and replays them to your target database. As a result, any applications using the source database can keep running uninterrupted while the database migration is in progress.
  • Secure: Any data movement across the Internet requires strong encryption so that the data is never compromised. Bluemix Lift encrypts data as it travels across the web on its way to an IBM cloud data property.
  • Easy to use: Set up the source data connection, provide credentials to the target database, verify schema compatibility with the target database engine and hit run. That’s all it takes to kick off a database migration with Bluemix Lift.
  • Reliable: The Bluemix Lift service automatically recovers from problems encountered during data extract, transport and load. If your migration is interrupted because of a drop in network connectivity, Bluemix Lift automatically resumes once connectivity returns. In other words, you can kick off a large database migration and walk away knowing that Bluemix Lift is on the job.

Speed, zero downtime, security, ease of use and reliability—these are the hallmarks of a great database migration service, and Bluemix Lift can deliver on all these benefits. Bluemix Lift gets data into a cloud database as easy as selecting Save As –> Cloud. Bluemix Lift also provides an amazing jumping-off point for new capabilities that are planned to be added in the future such as new source and target databases, enhanced automation and additional use cases. Take a look at IBM Bluemix Lift and give it a go.

ETL Services now available on Cloud

In my last blog, I started with two questions.

1. What are somethings to be considered from the perspective of deployment ETL service on Cloud?
2. Whether few ETL Services are already available for Enterprises or is this just a theory?

My last blog covered the part 1 and in this blog, I want to dwell more on part 2.

Need for ETL Services on Cloud:
With this explosion of data, the opportunities available to enterprises are booming. But as the enterprises are getting flooded with increasingly more data – data that’s unknown and unproven. Hence enterprises are taking on a whole new scope of risks and complexities. So it’s not enough to just capture the data. There is a need to Control it, Clean it and make refined data readily available to the people driving the business. So can we afford  to create these refinery service from scratch? Lets see with an analogy of getting clean water for Homebuilders.

Homebuilders don’t build water infrastructure, right? They build the home and the pipes underneath, then join it all together with existing pipework for immediate access to clean water. It’s the same way for app developers. They don’t want to govern, clean and monitor data. They just want to bring clean data directly into their applications. That’s the idea behind continuous data refinement – allowing app developers the tools – and the pipes – to build their house.DataWorks

IBM DataWorks™ refining your data
IBM DataWorks™ is a data refinery (on Cloud) to speed application development by getting the data you need, when you need it, and then ensuring it is fit for purpose. It exposes a set of APIs that implement a standard REST model. These APIs allow you to interoperate with feature-rich refinery capabilities. The performance and scalability of the IBM DataWorks engine will ensure that your application runs efficiently. IBM DataWorks includes APIs to identify relevant data, transform the data to suit your needs, and load it to a system for use.

In IBM DataWorks, you begin by finding the data that you want to work with from data sources like SQL Database and dashDB™. You use metrics to better understand your data quality and identify areas to improve.

To improve the data quality, you work with a sample of the data and apply shaping actions such as sorting, filtering, and joining. You can apply the actions to the full data set and load the data to destinations such as Cloudant® NoSQL DB.

For more information visit Data Works

Information Server and Big Data Integration – III (Limitations of Hadoop contd.)

In Part1 of this series I mentioned the role an ETL tool can play in the world of Hadoop. In Part2, we discussed some of the technical limitations of Hadoop. In Part3, based on my recent readings, we will discuss more on how Hadoop cannot play a part of a Data Integration Solution independently. This may come as a surprise to some of the Hadoop proponents as they see Hadoop projects performing extract, transform and load workstreams. Although these serve a purpose, the technology lacks the necessary key features and functions of commercially-supported data integration tools. Here are a few…

  • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
  • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
  • Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.
  • Purely Hadoop-based approach to data integration will require custom code and higher costs, which demands specialized skills and ongoing effort to maintain and change.
  • Data integration projects requires good governance principles, and select technologies that support the application of the required policies and procedures. This is not addressed in Hadoop based projects as of now.

Concluding Remarks:
Not only are many key data integration capabilities immature or missing from the Hadoop stack, but many have not been addressed in current projects.

Disclaimer: “The postings on this site are my own (based on my readings and interpretations) and don’t necessarily represent IBM’s positions, strategies or opinions.”

InfoSphere Quality Stage – XIII (Creating a good match specification)

In one of my earlier blog I was talking about the Matching in Quality Stage. For me this is one of the most powerful part of Quality Stage where we can find duplicates in huge data in matter of minutes. For example one of our customer wanted to find duplicates in 5 million of his records and we were able to do that on my server in minutes. Not just that we found out several cases where it appeared that the data had been intentionally modified so that duplicates could not be identified, still the tool was able to catch that. But the heart of all this depends on how well we create our match specification. And unfortunately it is an art to create a good match specification that many practitioners are not aware of. I sit with many practitioners and spend time with them to educate how to use this tool to create a good specification. So in this blog I wish to write some of my tips for creating a good match specification. This is not exhaustive but a good starting point that I got from a redbook.

Generally, raw information in the source systems is not acceptable to match on because it is either stored in a free form format or, even if it is parsed into individual columns, content does not always correspond to the metadata. It is very difficult to achieve high accuracy match results when using data in these
formats. So, as a rule, business intelligence fields and match fields that are generated by Domain Specific Standardization should be used for match specifications.

In addition, single domain fields, such as SSN or TIN (that might require some validation) should be reviewed as possible additional fields to help in determining the match. Initial investigation results could be used to determine the accuracy and field population.

Initial blocking strategy
Within a file with many records, it is very difficult to find matches when trying to compare every single pair in the file, although that seems a logical approach, except for the large files job, which takes a very long time to run because of the number of comparisons that need to be made. (n*(n-1)/2).

Blocking is a scheme that reduces the number of pairs of records that needs to be examined. In blocking, files are partitioned into mutually exclusive and
exhaustive blocks designed to increase the number of matches while decreasing the number of pairs to compare. Consequently, blocking is generally
implemented by partitioning files based on the values of one or more fields.
The following guidelines should be used when selecting fields for blocking:

  •  Select fields that are reliable, well populated, and have a good frequency distribution of values
  •  Use multiple fields in a block
  •  Go from most specific in the first pass – to more lenient in future passes. Your first pass should result in the most obvious and “best” matches
  •  Multiple Pass Strategy: Use multiple passes for variation and to “widen the net”
  •  Use match fields generated from standardization (NYSIIS, Soundex) rather than the business intelligence fields when applicable (First Name, Primary Name, Street Name, City Name)

Of course, if a pair of records blocking fields are not the same in any of the passes of the match specification, that pair does not have a chance to be linked.
So defining blocking fields and multiple passes is a balance between scope and accuracy to compare a reasonable amount of “like” records.

Select fields for match comparison

These are guidelines on the match field selection:

  •  Do not use Match Fields generated from Standardization. Use Business Intelligence Fields (such as First Name, Primary Name, Street Name, and City Name) and apply a fuzzy comparison:
    – UNCERT (Street Name, City Name)
    – Name UNCERT (Middle Name, First Name Maybe)
    – MultUNCERT (Individual Name)
  • Use Exact Comparison methods on shorter fields (as examples, House Number, Codes, Gender, and Floor Value) as well as on fields well standardized (such as Street Type, Directions, State Code, Country Code, and Floor Type)
  •  Use default match parameters in the initial run
  •  Use default match cutoff values of 0 in the initial run


Data Provisioning for enabling Data Virtualization

In my previous blog [Need for Data Virtualization], we discussed how data virtualization is becoming an important component of a comprehensive data integration strategy. Now how does Data Virtualization happen behind the scenes? For the user to get the data from a wide range of sources, the data needs to be provisioned.

Information Provisioning is the mechanisms behind the scenes that supply the data when requested. In InfoSphere, it is achieved using the following four ways…

Information Provisioning styles provided by InfoSpere
Information Provisioning styles provided by InfoSpere

Information Provisioning styles provided by InfoSphere

Federation: Provide correlated information from multiple sources on demand.

Replication: Maintain a copy of an Information Source. A real-time data replication solution can replicate data changes to the locally stored data to ensure it reflects the most current state.You can find more about it here.

Consolidation: Copy and integrate information from multiple sources into a single location. In some cases, data must be transformed (to ensure consistency with other data) before it can be virtualized. The transformation capabilities within an extract, transform and load (ETL) engine
can perform this function.

Caching: Provide localized read only copies of information. The localization of frequently accessed remote data enables queries to be executed locally and quickly, without the need for access to remote data sources.

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Data Provisioning Style for Data Integration Tools

This is the last day of 2013, and probably the last blog for this year from me. In this blog I wish to discuss the various ways the Data Integration tool delivers the data for Enterprises.

Traditionally most data integration tool vendors offered bulk-batch data movement via extraction, transformation and loading (ETL) functions. But Enterprises increasingly expect to deploy data integration tools for a broad range of use cases like the following:

  • BI, Analytics and (Logical) Data Warehousing
  • Data Consistency Between Operational Applications
  • Data or System Migrations and Consolidations
  • Master Data Management
  • Inter-enterprise Data Acquisition or Sharing

This expectation is driving demand for comprehensive data delivery capabilities. The following are the 4 critical data delivery capability that Enterprises are expecting any Data Integration tool to support.

  • Bulk-batch data movement involves bulk and/or batch data extraction and delivery approaches (such as support for ETL processes) to consolidate data from primary databases and formats. This capability draws on data from across systems and organizational boundaries.
  • Data federation/virtualization executes queries against multiple data sources to create virtual integrated views of data in memory (rather than physically moving the data). Federated views require adapters to various data sources, an active metadata repository and a distributed query engine that can provide results in various ways (for example, as an SQL row set, XML or a Web services interface).
  • Message-oriented movement encapsulates data in messages that various applications can read so that they can exchange data in real time. I have covered some aspects of this in a previous blog.
  • Data replication and synchronization synchronizes data, such as enabling change data capture (CDC), between two or more database management systems (DBMSs), schemas and other data structures, whether of the same type or different types. This capability supports high-volume and mission-critical scenarios in keeping operational data current in multiple systems.

So where does IBM Stand amongst the various data integration tool vendors in this critical tool capability?