IBM Vs Informatica (2 of 2)

In my last blog, we compared IBM’s Information Server and Informatica’s PowerCenter based on their scalability. Here is the summary: Big Data and enterprise-class data environments need unlimited data scalability to keep pace with data volume growth. Informatica’s PowerCenter is NOT designed to provide unlimited data scalability, which may lead to investment in expensive workarounds.

In this blog we will touch upon two other important aspects of ETL tools.

Data Governance and Metadata Management

Governance

  • IBM provides a data governance solution (Information Governance Catalog) designed for business users.
    • Information Governance Catalog has deep support for a REST API interface. This makes Information Server more open and ensures compatibility with other enterprise systems. Users can create custom enhancements and loaders, and can also create unique user interfaces for a consistent look and feel.
    • There is superior event-based notification that takes advantage of open source Apache Kafka messaging. For example, an import of metadata is an “event” that can be monitored for workflow and approval purposes, or simply for notification.
    • There is graphical reporting to illustrate relationships, data design origins, and data flow lineage to help answer “what does this mean” and “where did this data come from?”
    • There is advanced search and navigation, which gives a “shopping” experience for the data.
    • Metadata Asset Manager controls what data goes into the repository. “Import Areas” govern what is imported into the repository (or not), and who is able to import. These imports are initiated via a browser interface. No local Windows installation is required for the metadata administrator.
  • Informatica lacks these capabilities and provides a data governance solution designed for technical users. Its platform lacks openness, and you get locked into an “Informatica Only” architecture.

Data Quality

  • IBM provides an integrated data integration platform with one processing engine, one user design experience for data integration and data quality, and one shared metadata repository. Information Server gives you the ability to write a DataStage job once and run it anywhere (transactional database, Hadoop or eventually Spark).
  • Informatica provides a collection of multiple and incompatible processing engines, user design experiences, and metadata repositories. Informatica Data Quality and Informatica PowerCenter are two different products that have different user interfaces. In fact, PowerCenter needs two interfaces to design jobs and manage workflows. It also uses two engines. This means that Data Quality processes have to be ‘pushed’ or ‘exported’ to PowerCenter to run.
Summary:
In summary, Information Server is the better solution to go with if we want scalable workflows, openness in architecture, and better productivity in designing and running the workflows. Information Server supports the power of 1.
  • 1 Engine: The same engine runs stand-alone, in a grid, or natively in Hadoop/YARN. Jobs can remain unchanged regardless of deployment model.
  • 1 Design Experience: Single design experience for Data Integration and Data Quality that increases productivity and reduces error.
  • 1 Repository: A single active metadata repository across the entire portfolio, so design and execution metadata are instantly shared among team members.
Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

IBM Vs Informatica (1 of 2)

Some of you may have noticed on the “about me” page that one of my posts that gets a lot of hits is IBM and Informatica Leads Gartner Magic Quadrant for Data Integration Tools 2013. I realize that many visitors would like a comparison between IBM Information Server and Informatica. I have been in the ETL domain for the last 13 years, and have several publications and patents in this domain, so I thought of venturing into comparing these two solutions. One may ask why I need to do this when Gartner already compares these solutions. The answer is that Gartner takes into account many different factors, whereas I am basing my comparison only on the technical capabilities in the key areas. I am also opening a dialog where other practitioners who have worked on these two can provide inputs so that all the readers (including me) can benefit.

In this blog, I will focus on the scalability aspect of these two ETL solutions.

Scalability and Parallel Processing

Big Data Integration requires something called Massive Data Scalability. Massive Data Scalability requires the ability to process more data by simply adding more hardware.

  • IBM’s Information Server is built on a shared-nothing, massively parallel processing architecture. There is no limitation on throughput and performance. If you want to process more data, you just add hardware. You don’t change your application. You can refer to my earlier blog, which describes Information Server parallel processing and how it is much faster (10X to 15X) than processing by Hadoop.

    Information Server Architecture Supporting Data Partitioning
  • Informatica’s PowerCenter and Blaze can’t support partitioning a large data set across the nodes of a cluster, grid or MPP system. This is one of the fundamental architectural requirements for processing large data volumes. It means there is no support for running data integration logic in parallel across computing nodes, with the same logic running against separate data partitions. Because of this architectural limitation, the amount of data that you can sort, aggregate, transform, join, etc. is limited to what you can process on one node. So what does this mean?
    • First, you can’t exploit commodity grid hardware and storage for processing Big Data. You have to buy expensive SMP servers and storage because the amount of data that you can sort, transform and aggregate is what you can process on one server.
    • A second implication is that users are forced to push big ETL workloads into parallel databases such as Netezza, DB2, Oracle, and Teradata.
    • Finally, because you can’t run all complex data transformations in the parallel database, you have to live with dirty data that has not been cleansed.

In Summary:
Processing large data in a scalable manner requires the data to be partitioned across separate nodes so that a single job executes the same application logic against all partitioned data. This is not possible with Informatica’s PowerCenter. So, to process large datasets, an Informatica customer has to depend on pushing the processing into the database (too expensive) or offloading some of the work to Hadoop (too slow).
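To make the idea of partitioned processing concrete, here is a minimal Python sketch. It is not Information Server code: the function names and the sample rows are made up for illustration, and local threads stand in for the separate nodes of a real grid. It only shows the principle that the same logic runs unchanged against each partition, whatever the number of partitions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, key, num_partitions):
    """Hash-partition rows so that all rows with the same key land together."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions


def total_by_customer(partition):
    """The application logic: it is identical for every partition."""
    totals = {}
    for row in partition:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def run_partitioned(rows, num_partitions):
    """Run the same logic on every partition, then combine the results."""
    partitions = partition_by_key(rows, "customer", num_partitions)
    # Threads are only a stand-in for nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(total_by_customer, partitions))
    merged = {}
    for partial in partial_results:
        merged.update(partial)  # keys never overlap between partitions
    return merged


# Hypothetical sample data
rows = [
    {"customer": "acme", "amount": 10},
    {"customer": "globex", "amount": 5},
    {"customer": "acme", "amount": 7},
    {"customer": "initech", "amount": 3},
]
```

Calling run_partitioned(rows, 1) and run_partitioned(rows, 4) gives the same totals. Only the degree of parallelism changes, not the application logic.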

 

 

25 Years of IBM Patent Leadership

IBM inventors received a record-breaking 9,043 U.S. patents during 2017, the 25th consecutive year that IBM has topped the annual list of patent recipients. IBM’s patents in 2017 included more than 1,900 cloud patents, 1,400 in artificial intelligence (AI) and over 1,200 in the area of cybersecurity. IBM also received patents in the areas of blockchain and quantum computing. IBM’s India inventors contributed over 800 patents to this record tally, making us the second highest contributor after the US region.

It is a winning streak that began just after the advent of the PC and when the World Wide Web debuted to the public. In the history of technology there is virtually no one else that has led in any area for 25 years. IBM has gone from 20,000 patents in 1992 to producing more than 100,000 patents over a 25-year period, so I thought of sharing some insight into it.

The more than 105,000 inventions from IBM during its 100-plus-year history—from FORTRAN to relational databases to the Universal Product Code—are transforming the world. Also you can see smart glasses for the visually-impaired; a technology for securing credit card transactions; a carbon nanotube that’s 50,000 times thinner than human hair; and systems for predicting car traffic before it starts and earthquakes before they strike – Breakthrough after breakthrough.

Sample Patents:
Here are some of the interesting Patents from IBM in 2017:

  • Cloud resources : A system that uses unstructured data about world or local events to forecast cloud resource needs.
  • Self-driving vehicles: A machine-learning system that can shift control between a human driver and autonomous vehicle when there’s an emergency.
  • Blockchain: A method that uses blockchain technology to reduce the number of steps involved in settling transactions between multiple business parties.
  • AI speech: A system that can help artificial intelligence analyze and mirror a user’s speech patterns to improve communication between AI and humans.
  • Cybersecurity: Technology that enables AI systems to turn the table on hackers by baiting them into email exchanges and websites that expend their resources and frustrate their attacks.

What does it take to innovate?
We tend to think of innovation as a single event—a flash of genius followed by a revolutionary product or service. But the truth is that the road to any significant discovery is a long and twisted path. The inventors take some of the already established ideas and come up with something novel. Do watch the following video that shows the process and importance of Patenting.

What’s new in IBM InfoSphere Information Server 11.7 – Part 3

In my last blog, we discussed Information Governance Catalog (IGC). In this blog I wish to touch upon some new features of Information Governance that were introduced, along with the new look and feel, in IBM InfoSphere Information Server version 11.7.

Enterprise Search

Social Collaboration
InfoSphere Information Server has also brought social collaboration to the domain of Information Governance. When you browse your data, you sometimes want to know what other experts think about critical assets such as reports, source files, and more. Now that is possible: you can rate an asset on a scale of one to five stars, and you can leave a short comment. This enables all members of your organization to collaborate and share their expertise right where it’s needed. Also remember that the more popular an asset is, the higher its position in the search results list.

Searching for assets
With 11.7, searching for assets has become very easy. You don’t need to know anything about the data in your enterprise to explore it. Let’s assume that you want to find information about bank accounts. Simply type ‘bank account’ in the search field in enterprise search, and that’s it. The search engine looks for the information in all asset types. It takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage. And if you are already familiar with your organization and are looking for something more specific, you just open the catalog with your data and select the asset types that you want to browse. To narrow down search results, apply advanced filters like creation and modification dates, stewards, labels, or custom attributes.
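To give a feel for how several factors can be combined into one ranking, here is a toy Python sketch. It is not the algorithm that enterprise search actually uses. The weights, the Asset fields and the sample assets are all assumptions made up for illustration, and it covers only a few of the factors listed above.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    description: str
    rating: float         # average star rating, 1 to 5
    quality_score: float  # 0.0 to 1.0
    usage_count: int


def text_match(query, asset):
    """Fraction of the query terms that appear in the asset's name or description."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    text = f"{asset.name} {asset.description}".lower()
    return sum(1 for term in terms if term in text) / len(terms)


def score(query, asset):
    """Weighted blend of the factors; the weights are hypothetical."""
    return (
        0.6 * text_match(query, asset)
        + 0.2 * asset.rating / 5
        + 0.1 * asset.quality_score
        + 0.1 * min(asset.usage_count, 100) / 100
    )


def search(query, assets):
    """Return the assets that match at least one term, best score first."""
    matches = [a for a in assets if text_match(query, a) > 0]
    return sorted(matches, key=lambda a: score(query, a), reverse=True)


# Hypothetical sample catalog
catalog = [
    Asset("Bank Account Report", "monthly bank account summary", 2, 0.5, 10),
    Asset("Employee Table", "HR data", 5, 1.0, 100),
    Asset("Bank Account Table", "customer bank account records", 5, 0.9, 80),
]
```

With this sample catalog, search("bank account", catalog) returns the two bank account assets, and the better rated, more heavily used table comes first even though both match the text equally well.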

Unstructured data sources
The data in your enterprise consists of databases, tables, columns, and other sources of structured data. What about email messages, word-processing documents, audio or video files, collaboration software, or instant messages? They are also a very valuable source of information. To support a unified approach to enterprise information management, IBM StoredIQ can now be set up to synchronize data with IBM Information Governance Catalog. So now you can classify such information in IGC too.

Exploring Relationships
Data in large organizations can be very complex, and assets can be related to one another in multiple ways. To understand these complex relations better, explore them in a graphical form by using graph explorer. This view by default displays all relationships of one asset that you select. But this is just the starting point, as you can further expand relationships of this asset’s relationships in the same view. Having all this information in one place in a graphical format makes it a lot easier to dig into the structure of your data. Each relationship has direction and name. You’ll be surprised when you discover how assets are connected!
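The expand-as-you-go behaviour can be pictured as a simple traversal over named, directed relationships. The Python sketch below is only an illustration of that idea, not how graph explorer is implemented. The asset names and relationship names are hypothetical.

```python
# Each relationship is (source asset, relationship name, target asset).
relationships = [
    ("Customer Report", "uses", "ACCOUNTS table"),
    ("ACCOUNTS table", "contains", "ACCOUNT_ID column"),
    ("Bank Account term", "is assigned to", "ACCOUNT_ID column"),
    ("ACCOUNTS table", "is loaded by", "load_accounts job"),
]


def direct_relationships(asset, edges):
    """All relationships that start or end at the given asset."""
    return [edge for edge in edges if asset in (edge[0], edge[2])]


def expand(asset, edges, depth):
    """Collect the relationships reachable within `depth` expansions of an asset."""
    seen_assets = {asset}
    seen_edges = []
    frontier = [asset]
    for _ in range(depth):
        next_frontier = []
        for current in frontier:
            for edge in direct_relationships(current, edges):
                if edge not in seen_edges:
                    seen_edges.append(edge)
                other = edge[2] if edge[0] == current else edge[0]
                if other not in seen_assets:
                    seen_assets.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return seen_edges
```

Here expand("Customer Report", relationships, 1) corresponds to the default view of one asset, and each increase in depth corresponds to expanding the relationships of the assets already on screen.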

To have a look at the new Information Governance Catalog, view this video.

 

What’s new in IBM InfoSphere Information Server 11.7 – Part 2

DataStage Flow Designer

As promised in the last blog, here are a few more changes that came with InfoSphere Information Server 11.7. DataStage Flow Designer is the new web-based user interface for IBM’s flagship data integration component, IBM DataStage. It can be used to create, edit, load, and run DataStage jobs. But unlike the current DataStage Designer, it does not require any installation on a Microsoft Windows client environment and is therefore immediately available and easily accessible once DataStage is installed. Moreover, you do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface. Any existing DataStage job can be rendered in IBM DataStage Flow Designer, avoiding complex, error-prone migrations that could lead to costly outages.

DataStage Flow Designer

Here are a few of its features.

  • Search and Quick Tours: Quickly find any job using the built-in search feature.  For example, you can search for job name, description or timestamp to find what you are looking for very quickly. Also you can familiarize yourself with the product by taking the built in quick tour. You can also watch the “Create your first job” video on the welcome page.
  • Automatic metadata propagation: Making changes to a stage in a DataStage job can be time consuming because you have to go to each subsequent stage and redo the change. DataStage Flow Designer automatically propagates the metadata to subsequent stages in that flow, increasing productivity.
  • Highlighting of all compilation errors: Today, the DataStage thick client identifies compilation errors one at a time. Big jobs with upwards of 30 or 50 stages have a difficult time on compile, because errors are highlighted one stage at a time. DataStage Flow Designer highlights all errors and gives you a way to see the problem with a quick hover over each stage, so you can fix multiple problems at once before re-compiling.
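The last two features can be illustrated with a small Python sketch. This is not DataStage code. The flow is modelled as a simple list of stages, and the stage names, column names and function names are hypothetical. The sketch shows a column rename being propagated to every later stage, and a check that reports every problem in one pass instead of stopping at the first one.

```python
def propagate_rename(flow, start_index, old_name, new_name):
    """Rename a column where it is produced and in every subsequent stage."""
    def rename(columns):
        return [new_name if column == old_name else column for column in columns]

    flow[start_index]["writes"] = rename(flow[start_index]["writes"])
    for stage in flow[start_index + 1:]:
        stage["reads"] = rename(stage["reads"])
        stage["writes"] = rename(stage["writes"])
    return flow


def find_all_errors(flow):
    """Report every column a stage reads that the previous stage does not produce."""
    errors = []
    for previous, stage in zip(flow, flow[1:]):
        for column in stage["reads"]:
            if column not in previous["writes"]:
                errors.append((stage["name"], column))
    return errors


def make_flow():
    """A hypothetical three-stage job."""
    return [
        {"name": "read_accounts", "reads": [], "writes": ["acct_id", "balance"]},
        {"name": "filter_active", "reads": ["acct_id"], "writes": ["acct_id", "balance"]},
        {"name": "write_target", "reads": ["acct_id", "balance"], "writes": []},
    ]
```

If you rename acct_id only in the first stage by hand, find_all_errors lists both later stages at once. If you use propagate_rename instead, the flow stays consistent and the list is empty.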

In summary, the new browser-based DataStage® Flow Designer is geared for data engineers, but is versatile and accessible to all business users. This cognitive designer features an intuitive, modern, and security-rich browser-based interface. Users can access the DataStage Flow Designer and quickly address their data transformation or preparation needs, without having to rely on a Windows™ desktop environment. Do watch the following video on IBM DataStage Flow Designer.

To know more, please visit the IBM Knowledge Center.
There is a lot more in IBM InfoSphere Information Server 11.7. So stay tuned.

What’s new in IBM InfoSphere Information Server 11.7 – Part 1

IBM® InfoSphere® Information Server V11.7 was released last week. In the next couple of blogs, I wish to share how 11.7 is a major milestone for Governance functionality. First let’s look at the changes from a very high level before going closer.

At a Glance
IBM® InfoSphere® Information Server V11.7 accelerates the delivery of trusted and meaningful information to your business with enhanced automation and new design and usage experiences:

Enterprise Search
  • New Enterprise smart search to discover and view enterprise information
  • Automated data discovery and classification powered by machine learning
  • Policy-and-business-classification-driven data quality evaluation
  • New browser-based cognitive design experience for data engineers
  • New and expanded Hadoop data lake and cloud capabilities and connectivity
  • Single and holistic catalog view of information across the information landscape, enabling users to derive insight through a knowledge graph

Unified Governance

Now let’s get into some details. InfoSphere® Information Server V11.7 introduces the unified governance platform, a fabric that supports data governance objectives throughout the analytics lifecycle. Unified Governance focuses on the following themes and capabilities to construct a data foundation for the enterprise.

  • Auto Discovery and Classification: For data in traditional repositories or in modern Hadoop data lakes, the ability to catalog data accurately and quickly with minimal user intervention is a key requirement for all modern enterprises. Auto Discovery lets the user point to a data source and ingest metadata from it. Auto Classification is an optional feature, used after discovery, which conducts data profiling and quality analysis.
  • Auto Quality Management: Data quality is a key component of data governance. Automation rules provide a way to associate the evaluation of data quality with the business classification of data, and also a way to automate data quality evaluation. This will help lower the cost of quality evaluation significantly.
  • Enterprise Search: An enterprise wants to leverage its data, but a lot of data does not get used simply because there is no good way to find it. The Knowledge Graph is a self-service user experience which provides information with insight to the business user. This allows a CDO to improve the use of data in business decisions with a high level of confidence that it is governed data. Starting with a simple keyword search, a user can leverage the context of the data and use social collaboration to narrow down the data to be used for analytics or business decision making.
  • Customizable User Experience: This release introduces the ability for an enterprise to customize its users’ experience based on roles, and allows a user to customize the experience to suit personal preference.
  • Metadata Integration from StoredIQ into IGC: This enables organizations to govern all their information assets (structured and unstructured) in a centralized repository. This is critical to support customers’ needs for GDPR.
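To show what tying quality evaluation to business classification can look like, here is a minimal Python sketch. It is not the Information Server automation rule syntax. The classifications, the checks and the column names are hypothetical and chosen only for illustration.

```python
import re


def is_email(value):
    """A deliberately simple email shape check (an assumption of this sketch)."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


def is_ten_digit_number(value):
    return value.isdigit() and len(value) == 10


# Each business classification is associated with one quality check.
AUTOMATION_RULES = {
    "Email Address": is_email,
    "Account Number": is_ten_digit_number,
}


def evaluate_quality(columns):
    """Apply the rule that matches each column's classification, if there is one."""
    scores = {}
    for name, column in columns.items():
        rule = AUTOMATION_RULES.get(column["classification"])
        if rule is None or not column["values"]:
            continue  # no rule for this classification, or nothing to check
        passed = sum(1 for value in column["values"] if rule(value))
        scores[name] = passed / len(column["values"])
    return scores


# Hypothetical discovered columns with their classifications
columns = {
    "CONTACT": {"classification": "Email Address",
                "values": ["a@example.com", "not-an-email"]},
    "ACCT_NO": {"classification": "Account Number",
                "values": ["0123456789", "9876543210"]},
    "NOTES": {"classification": "Free Text", "values": ["call back"]},
}
```

Once a column is classified, the matching check is picked up automatically. Nobody has to attach a rule to each column by hand, which is where the cost saving comes from.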

Watch the following 2 minute video on Unified Governance:

This release introduces key technological innovations as well as open source technology. There has also been a tremendous change in the DataStage Designer. I will share that in the upcoming blog. So stay tuned.