IBM Vs Informatica (1 of 2)

Some of you may have noticed on the “About Me” page that one of my posts that gets a lot of hits is IBM and Informatica Leads Gartner Magic Quadrant for Data Integration Tools 2013. I realize that many visitors would want a comparison between IBM Information Server and Informatica. I have been in the ETL domain for the last 13 years and have several publications and patents in this area, so I thought of venturing into comparing these two solutions. One may ask: Gartner already compares these solutions, so why do I need to do the same? The answer is that Gartner takes many different factors into account, whereas I am basing my comparison purely on the technical capabilities in a few key areas. I am also opening a dialog where other practitioners who have worked on both can provide inputs, so that all readers (including me) can benefit.

In this blog, I will focus on the scalability aspect of these two ETL solutions.

Scalability and Parallel Processing

Big data integration requires massive data scalability: the ability to process more data simply by adding more hardware.

  • IBM’s Information Server is built on a shared-nothing, massively parallel processing (MPP) architecture, so throughput and performance are not capped by a single machine. If you want to process more data, you just add hardware; you don’t change your application. You can refer to my earlier blog, which describes Information Server parallel processing and why it can be much faster (10X to 15X) than processing by Hadoop.

    Information Server Architecture Supporting Data Partitioning
  • Informatica’s PowerCenter and Blaze cannot partition a large data set across the nodes of a cluster, grid, or MPP system. This is one of the fundamental architectural requirements for processing large data volumes. It means there is no support for running data integration logic in parallel across computing nodes, with the same logic running against separate data partitions. Because of this architectural limitation, the amount of data that you can sort, aggregate, transform, join, etc. is limited to what you can process on one node. So what does this mean?
    • First, you can’t exploit commodity grid hardware and storage for processing Big Data. You have to buy expensive SMP servers and storage, because the amount of data that you can sort, transform, or aggregate is limited to what you can process on one server.
    • Second, you are forced to push big ETL workloads into parallel databases such as Netezza, DB2, Oracle, and Teradata.
    • Finally, because you can’t run all complex data transformations in the parallel database, you have to live with dirty data that has not been cleansed.

In Summary:
Processing large data volumes in a scalable manner requires the data to be partitioned across separate nodes so that a single job executes the same application logic against every partition. This is not possible with Informatica’s PowerCenter. So, for processing large datasets, Informatica customers have to depend on pushing the processing into the database (too expensive) or offloading some of the work to Hadoop (too slow).
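
To make the principle concrete, here is a minimal Python sketch of hash partitioning with the same aggregation logic running against every partition. This is an illustration of the concept only, not Information Server code; the column names (account_id, amount) and the use of local processes instead of cluster nodes are my own simplifications.

```python
# Illustration only: hash-partition rows and run the same logic on every
# partition in parallel. A real MPP engine does this across machines,
# not just across local processes.
from multiprocessing import Pool
from collections import defaultdict

NUM_PARTITIONS = 4  # in an MPP system this would be the number of nodes

def partition_key(row):
    # Hash-partition on the (hypothetical) 'account_id' column.
    return hash(row["account_id"]) % NUM_PARTITIONS

def aggregate(partition):
    # The same aggregation logic runs unchanged on every partition.
    totals = defaultdict(float)
    for row in partition:
        totals[row["account_id"]] += row["amount"]
    return dict(totals)

def run(rows):
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[partition_key(row)].append(row)
    with Pool(NUM_PARTITIONS) as pool:
        results = pool.map(aggregate, partitions)
    # Merge per-partition results; keys never overlap because we
    # partitioned on the aggregation key.
    merged = {}
    for result in results:
        merged.update(result)
    return merged

if __name__ == "__main__":
    sample = [
        {"account_id": "A1", "amount": 100.0},
        {"account_id": "A2", "amount": 50.0},
        {"account_id": "A1", "amount": 25.0},
    ]
    print(run(sample))
```

Scaling out is then a matter of raising the partition count and adding workers (nodes), while the job logic itself stays the same.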

 

 


What’s new in IBM InfoSphere Information Server 11.7 – Part 3

In my last blog, we discussed the Information Governance Catalog (IGC). In this blog I wish to touch upon some new features of Information Governance that were introduced along with the new look and feel in IBM InfoSphere Information Server version 11.7.

Enterprise Search

Social Collaboration
InfoSphere Information Server also brings social collaboration to the domain of Information Governance. When you browse your data, sometimes you want to know what other experts think about critical assets such as reports, source files, and more. Now it is possible: you can rate an asset on a scale of one to five stars, and you can leave a short comment. This enables all members of your organization to collaborate and share their expertise right where it’s needed. Also remember that the more popular an asset is, the higher its position on the search results list.

Searching for assets
With 11.7, searching for assets has become very easy. You don’t need to know anything about the data in your enterprise to explore it. Let’s assume that you want to find information about bank accounts: simply type ‘bank account’ in the search field in enterprise search, and that’s it. The search engine looks for the information across all asset types. It takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage. And if you are already familiar with your organization and are looking for something more specific, you just open the catalog with your data and select the asset types that you want to browse. To narrow down search results, apply advanced filters like creation and modification dates, stewards, labels, or custom attributes.
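
As a rough illustration of how such factors can be combined, here is a small Python sketch of a weighted relevance score. The weights and the formula are my own assumptions for illustration; they are not the actual enterprise search implementation.

```python
# Hypothetical relevance scoring combining the factors mentioned above.
# The weights are made up for illustration only.
from datetime import datetime, timezone

def relevance_score(asset, query_terms):
    name = asset["name"].lower()
    text_match = sum(term.lower() in name for term in query_terms)
    days_old = (datetime.now(timezone.utc) - asset["modified"]).days
    recency = 1.0 / (1 + days_old)           # newer assets score higher
    return (3.0 * text_match
            + 1.0 * asset.get("rating", 0)           # 1-5 stars
            + 0.5 * asset.get("comment_count", 0)    # collaboration signal
            + 2.0 * asset.get("quality_score", 0)    # 0..1
            + 1.0 * asset.get("usage_count", 0) / 100.0
            + 2.0 * recency)

assets = [
    {"name": "Bank Account Master", "rating": 4, "comment_count": 3,
     "quality_score": 0.9, "usage_count": 250,
     "modified": datetime(2017, 11, 1, tzinfo=timezone.utc)},
    {"name": "Customer Addresses", "rating": 5, "comment_count": 1,
     "quality_score": 0.7, "usage_count": 40,
     "modified": datetime(2017, 6, 1, tzinfo=timezone.utc)},
]
query = ["bank", "account"]
for asset in sorted(assets, key=lambda a: relevance_score(a, query), reverse=True):
    print(asset["name"], round(relevance_score(asset, query), 2))
```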

Unstructured data sources
The data in your enterprise consists of databases, tables, columns, and other sources of structured data. What about email messages, word-processing documents, audio or video files, collaboration software, or instant messages? They are also a very valuable source of information. To support a unified approach to enterprise information management, IBM StoredIQ can now be set up to synchronize data with IBM Information Governance Catalog. So now you can classify such information in IGC too.

Exploring Relationships
Data in large organizations can be very complex, and assets can be related to one another in multiple ways. To understand these complex relations better, explore them in a graphical form by using the graph explorer. By default, this view displays all relationships of the asset that you select. But this is just the starting point, as you can further expand the relationships of this asset’s relationships in the same view. Having all this information in one place in a graphical format makes it a lot easier to dig into the structure of your data. Each relationship has a direction and a name. You’ll be surprised when you discover how assets are connected!
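
Conceptually, what the graph explorer walks is a directed, labelled graph of assets. Here is a minimal Python sketch of that idea (an illustration only, not the IGC implementation; the asset names and relationship labels are made up):

```python
# Illustrative asset-relationship graph: each edge has a direction and a
# name, and "expanding" a node reveals its neighbours, which can be
# expanded in turn.
relationships = {
    # source asset      -> list of (relationship name, target asset)
    "Customer Report":   [("reads from", "ACCOUNTS table")],
    "ACCOUNTS table":    [("implements", "Account business term"),
                          ("loaded by", "Load_Accounts job")],
    "Load_Accounts job": [("reads from", "accounts.csv")],
}

def expand(asset, depth=1, indent=""):
    """Print the relationships of an asset, then expand each target in turn."""
    for name, target in relationships.get(asset, []):
        print(f"{indent}{asset} --{name}--> {target}")
        if depth > 1:
            expand(target, depth - 1, indent + "  ")

expand("Customer Report", depth=3)
```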

To have a look at the new Information Governance Catalog, view this video.

 

What’s new in IBM InfoSphere Information Server 11.7 – Part 2

DataStage Flow Designer

As promised in the last blog, here are a few more changes that came with InfoSphere Information Server 11.7. DataStage Flow Designer is the new web-based user interface for IBM’s flagship data integration component, IBM DataStage. It can be used to create, edit, load, and run DataStage jobs. But unlike the current DataStage Designer, it does not require any installation on a Microsoft Windows client environment and is therefore immediately available and easily accessible once DataStage is installed. Moreover, you do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface. Any existing DataStage job can be rendered in IBM DataStage Flow Designer, avoiding complex, error-prone migrations that could lead to costly outages.

DataStage Flow Designer

Here are a few of its features.

  • Search and Quick Tours: Quickly find any job using the built-in search feature. For example, you can search by job name, description, or timestamp to find what you are looking for very quickly. You can also familiarize yourself with the product by taking the built-in quick tour, or by watching the “Create your first job” video on the welcome page.
  • Automatic metadata propagation: Making changes to a stage in a DataStage job can be time consuming, because you have to go to each subsequent stage and redo the change. DataStage Flow Designer automatically propagates the metadata to subsequent stages in the flow, increasing productivity (see the sketch after this list).
  • Highlighting of all compilation errors: Today, the DataStage thick client identifies compilation errors one at a time. Big jobs with upwards of 30 or 50 stages are painful to compile, because errors are highlighted one stage at a time. DataStage Flow Designer highlights all errors and lets you see each problem with a quick hover over the stage, so you can fix multiple problems at once before re-compiling.
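
To picture what automatic metadata propagation saves you from doing by hand, here is a small conceptual Python sketch (my own illustration, not Flow Designer internals): a column definition changed at one stage flows to every downstream stage.

```python
# Conceptual sketch: a linear flow of stages, each holding column metadata.
# Changing a column at one stage pushes the new definition downstream.
stages = [
    {"name": "Read_Source",  "columns": {"cust_id": "Integer", "balance": "Decimal(10,2)"}},
    {"name": "Transform",    "columns": {"cust_id": "Integer", "balance": "Decimal(10,2)"}},
    {"name": "Write_Target", "columns": {"cust_id": "Integer", "balance": "Decimal(10,2)"}},
]

def change_column(stages, stage_name, column, new_type):
    """Apply the change at one stage and propagate it to all later stages."""
    start = next(i for i, s in enumerate(stages) if s["name"] == stage_name)
    for stage in stages[start:]:
        stage["columns"][column] = new_type

change_column(stages, "Read_Source", "balance", "Decimal(12,2)")
for stage in stages:
    print(stage["name"], stage["columns"])
```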

In summary, the new browser-based DataStage® Flow Designer is geared for data engineers, but is versatile and accessible to all business users. This cognitive designer features an intuitive, modern, and security-rich browser-based interface. Users can access the DataStage Flow Designer and quickly address their data transformation or preparation needs, without having to rely on a Windows™ desktop environment. Do watch the following video on IBM DataStage Flow Designer.

To know more, please visit the IBM Knowledge Center.
There is a lot more in IBM InfoSphere Information Server 11.7. So stay tuned.

DataStage now available on Cloud

For data integration projects, DataStage has been the workhorse for many years. It is used by data engineers to extract data from many different sources, transform and combine the data, and then deliver it to applications and end users. DataStage has many distinct advantages over other popular ETL tools.

Until recently, these capabilities were only available with the on-premises offering. Now DataStage is available on the Cloud as a hosted offering. Customers can take advantage of the full capabilities of DataStage without the burden and time consumption of standing up the infrastructure and installing the software themselves. Customers can quickly deploy a DataStage environment (from ordering to provisioning it on the cloud) and be up and running in a day or less. There is no up-front capital expenditure, as customers only pay a monthly subscription based on the capacity they purchase. Licensing is also greatly simplified.

Using DataStage on Cloud, existing DataStage customers can start new projects quickly. Since it is hosted in the IBM cloud, the machine and operating system are managed by IBM; the customer does not have to spend time expanding the current environment or creating a new one. In other words, cloud elasticity makes them ready to scale and handle any workload. DataStage ETL job developers can be productive immediately, and data integration activities can span both on-premises and cloud data if necessary, as DataStage jobs can be exported from the cloud and brought back to an on-premises DataStage environment.

As an example: a customer has data sources such as Teradata, DB2, etc. in their data center, as well as SalesForce, MongoDB, and other data residing in the Cloud. They need access to their existing data sources and their cloud data sources for a new customer retention project. This project requires some sophisticated data integration to bring it all together, but they don’t have the IT resources or budget to stand up a new data integration environment in their own data center for this project. So, an instance of DataStage on the Cloud can be deployed for their use. The customer can access the DataStage client programs on the Cloud to work with DataStage, either through the public Internet or a private connection via the SoftLayer VPN. DataStage ETL jobs running in the Cloud can access the customer’s on-premises data sources and targets using secured protocols and encryption methods. In addition, these DataStage jobs can also access cloud data sources like dashDB, as well as data sources on other cloud platforms, using the appropriate secured protocols.

So with DataStage hosted on the Cloud you can:

  1. Extend your ETL infrastructure: Expand your InfoSphere DataStage environment or begin transitioning into a private or public cloud with flexible deployment options and subscription pricing.
  2. Establish ad hoc environments: Extend your on-premises capacity to quickly create new environments for ad hoc development and testing or for limited duration projects.
  3. Start new projects in the cloud: Move straight to the cloud without establishing an on-premises environment. Realize faster time-to-value, reduce administration burden and use low-risk subscription pricing.

Go here for more information: https://developer.ibm.com/clouddataservices/docs/information-server/

DataStage job run-time architecture on Hadoop

In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offer unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will mention some of the enhancements in IBM‘s InfoSphere Information Server 11.5 that help leverage the scale and promise of Hadoop.

Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement, and data delivery features that thousands of enterprises have relied on for years are immediately available to run within the Hadoop platform! Information Server is the market-leading product in terms of its data integration and governance capabilities. Now the same product can be used to solve some of the industry’s most complex data challenges directly inside a Hadoop cluster. Imagine the time saved in moving the data back and forth from HDFS!

Even better, these new features for Hadoop use the same simple graphical design environment that IBM clients have been accustomed to building integration applications with. In other words, organizations can build new Hadoop-based, information-intensive applications without the need to retrain their development team on newly emerging languages that require manual hand coding and lack governance support.

How is this accomplished? YARN! 
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server can communicate with YARN to run a job on the data nodes of a Hadoop cluster using the following steps.

Here is more detail on how Information Server uses YARN:

DataStage on Hadoop run-time architecture

  1. A job is submitted to run in the Information Server engine.
  2. The ‘Conductor’ (the process responsible for coordinating the job) asks YARN to instantiate the YARN version of the Conductor: the Application Master.
  3. The YARN Client is responsible for starting and stopping Application Masters.
  4. Once the Application Master is ready, ‘Section Leaders’ (responsible for the work on a data node) are prepared.
  5. Section Leaders are created and managed by YARN Node Managers. This is the point where the BigIntegrate/BigQuality binaries are copied to the Hadoop DataNode if they do not already exist there.
  6. Now the real work can begin: the ‘players’ (the processes that actually do the work) are started.

All of this is automatic and behind the scenes. The actual user interface looks and feels identical to when a job is run on Windows, AIX, or Linux.
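
As a conceptual picture of that hierarchy, the following Python sketch mimics a conductor starting one section leader per data node, each of which starts its players. It is an illustration of the roles only; the real engine runs these as YARN-managed containers, not local processes.

```python
# Conceptual sketch of the run-time hierarchy described above.
# Plain local processes stand in for YARN containers here.
from multiprocessing import Process

def player(node, player_id):
    # Players do the actual record processing on their partition of the data.
    print(f"player {player_id} processing partition on {node}")

def section_leader(node, players_per_node):
    # One section leader per data node coordinates the players on that node.
    players = [Process(target=player, args=(node, i)) for i in range(players_per_node)]
    for p in players:
        p.start()
    for p in players:
        p.join()

def conductor(data_nodes, players_per_node=2):
    # The conductor (the YARN Application Master in the Hadoop case)
    # starts one section leader per data node and waits for them to finish.
    leaders = [Process(target=section_leader, args=(node, players_per_node))
               for node in data_nodes]
    for leader in leaders:
        leader.start()
    for leader in leaders:
        leader.join()

if __name__ == "__main__":
    conductor(["datanode1", "datanode2"])
```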

 

IBM Stewardship Center

Need for IBM Stewardship Center in Data Curation: 

Managing data quality requires the joint effort of business and IT. The business defines the information policies that govern data quality for an organization. Based on these policies, the IT team implements rules so that any deviations in data quality can be reported for the business to review. For example, suppose a bank has a policy that an account holder’s age must be greater than 18. During data load, an ETL tool can run profiling on the data to check how many records violate this rule. These records then need to be shared with the business (non-technical domain experts called data stewards), who can take appropriate action to fix the issue (a minimal sketch of such a rule check follows the list below). As data stewards become increasingly responsible for improving the value of their data assets, they need capabilities to help them manage new requirements like:

  • Collaborating across multiple lines of business to build information policies that support regulatory requirements
  • Assessing the cost of poor data quality and managing such data quality issues to closure
  • Engaging subject matter experts through business processes to review and approve corporate glossary changes
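
Returning to the bank example above, a minimal sketch of such a rule check might look like the following. The record structure, column names, and exception format are assumptions for illustration; an actual implementation would use the product’s rule and exception facilities.

```python
# Minimal sketch of the age rule from the bank example: flag the records
# that violate the policy so they can be routed to a data steward.
records = [
    {"account_id": "A100", "holder_name": "Asha",  "age": 34},
    {"account_id": "A101", "holder_name": "Ravi",  "age": 16},
    {"account_id": "A102", "holder_name": "Meena", "age": None},
]

def check_minimum_age(record, minimum_age=18):
    """Return a description of the violation, or None if the record passes."""
    age = record.get("age")
    if age is None:
        return "age is missing"
    if age <= minimum_age:
        return f"age {age} is not greater than {minimum_age}"
    return None

exceptions = []
for record in records:
    problem = check_minimum_age(record)
    if problem:
        # In practice these exception records would be published to the
        # stewardship tool for a data steward to review and resolve.
        exceptions.append({"account_id": record["account_id"], "issue": problem})

print(exceptions)
```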


IBM Stewardship Center is a powerful browser-based interface that helps to bridge the gap between business and IT, providing a central location for users to collaborate on data governance and manage data quality issues. Stewardship Center is built on an open event management infrastructure, which makes it possible to integrate Information Server-based stewardship seamlessly into your existing stewardship solutions and collaboration environments.

IBM Stewardship Center leverages the strengths of IBM® Business Process Manager to offer solutions to these challenges that can be used immediately or can be customized or extended to suit the specific needs of your organization. The capabilities that are provided with IBM Stewardship Center are divided into three categories: data quality exception management, governance and workflow notification, and performance metrics.
IBM Stewardship Center includes these components:

  • The Data Quality Exception sample process application, a sample workflow for resolving data quality issues that can be customized or extended.
  • The Stewardship Center Application Toolkit, which can be used to extend the capabilities of the Data Quality Exception sample process application or to create your own custom workflows.
  • The Email Notification process application, which can be used to notify users by email when governance events are generated in Information Governance Catalog.
  • The Governance Rule Approval process application, which can be used to manage approvals for information governance rules from Information Governance Catalog.

For more information, see Overview of IBM Stewardship Center.
For a video see Tech Talk: Stewardship Center.

DataStage Best Practices – 3

1. Avoid unnecessary type conversions:
Set the OSH_PRINT_SCHEMAS environment variable to verify that run time schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.
2. Use Transformer stages sparingly and wisely.
Transformer stages can slow down your job. Do not use multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.

3. Increase Sort performance where possible.
Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.

4. Remove Unneeded Columns.
Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

5. Avoid reading from sequential files using the ‘Same’ partitioning method.
Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly re-partition.

6. It is important to note that the individual SQL statements required by a Sparse Lookup are expensive from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and the DB2 reference data than it is to perform a Sparse Lookup.
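
To see why per-row SQL is costly, compare the two access patterns in this rough Python sketch (an illustration of the trade-off, not DataStage internals):

```python
# Rough illustration of the trade-off: a sparse lookup issues one query per
# input row, while a join-style (normal) lookup reads the reference data once
# and probes it in memory.
def sparse_lookup(input_rows, query_reference):
    # query_reference stands in for a per-row SQL round trip to the database.
    return [(row, query_reference(row["key"])) for row in input_rows]

def join_lookup(input_rows, reference_rows):
    # Read the reference data once, build a hash table, then probe locally.
    by_key = {ref["key"]: ref for ref in reference_rows}
    return [(row, by_key.get(row["key"])) for row in input_rows]

# With millions of input rows, sparse_lookup pays millions of round trips;
# join_lookup pays one scan of the reference table plus cheap hash probes.
# Sparse lookups win only when input rows are few relative to reference rows
# (the 1:100 or more ratio mentioned in the next point).
if __name__ == "__main__":
    reference = [{"key": k, "desc": f"ref-{k}"} for k in range(5)]
    inputs = [{"key": 1}, {"key": 3}, {"key": 4}]
    # Simulate the per-row round trip with a lambda over the same data.
    lookup_by_key = {ref["key"]: ref for ref in reference}
    print(sparse_lookup(inputs, lambda key: lookup_by_key.get(key)))
    print(join_lookup(inputs, reference))
```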

7. For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU intensive applications, which typically perform multiple CPU demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.

8. Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.

9. Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.

10. Turn off Runtime Column Propagation wherever it’s not required.