Whats new in IBM InfoSphere Information Server 11.7 – Part 3

In my last blog, we discussed about Information Governance Catalog (IGC). In this blog I wish to touch upon some new features of Information Governance that were introduced along with the new look and feel with  IBM InfoSphere Information Server version 11.7.

Enterprise Search

Social Collaboration
InfoSphere Information Server also brought social collaboration to the domain of Information Governance. When you browse your data, sometimes you would want to know what other experts think about critical assets such as reports, source files, and more.. Now it is possible, as you can rate an asset on a scale of one to five stars, and you can leave a comment with a couple of words. This enables all members of your organization to collaborate and share their expertise right where it’s needed. Also remember that the more popular the asset is, the higher is its position on the search results list.

Searching for assets
With 11.7, Searching for assets has become very easy. You don’t need to know anything about the data in your enterprise to explore it. Let’s assume that you want to find information about bank accounts, simply type ‘bank account’ in the search field in enterprise search, and that’s it. The search engine looks for the information in all asset types. It takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage. And if  you already familiar with your organization and looking for something more specific, then you just open the catalog with your data, and select asset types that you want to browse. To narrow down search results, apply advanced filters like creation and modification dates, stewards, labels, or custom attributes.

Unstructured data sources
The data in your enterprise consists of databases, tables, columns, and other sources of structured data. What about email messages, word-processing documents, audio or video files, collaboration software, or instant messages? They are also a very valuable source of information. To support a unified approach to enterprise information management, IBM StoredIQ can now be set up to synchronize data with IBM Information Governance Catalog. So now you can classify such information in IGC too.

Exploring Relationships
Data in large organizations can be very complex, and assets can be related to one another in multiple ways. To understand these complex relations better, explore them in a graphical form by using graph explorer. This view by default displays all relationships of one asset that you select. But this is just the starting point, as you can further expand relationships of this asset’s relationships in the same view. Having all this information in one place in a graphical format makes it a lot easier to dig into the structure of your data. Each relationship has direction and name. You’ll be surprised when you discover how assets are connected!

To have a look at the new Information Governance Catalog, view this video.

 

Advertisements

Whats new in IBM InfoSphere Information Server 11.7 – Part 1

IBM® InfoSphere® Information Server V11.7 was released last week. And in next couple of blogs, I wish to share how 11.7 is a major milestone for Governance functionality. First let’s look at the changes from a very high level before going closer.

At a Glance
IBM® InfoSphere® Information Server V11.7 accelerates the delivery of trusted and meaningful information to your business with enhanced automation and new design and usage experiences:

Enterprise Search
Enterprise Search
  • New Enterprise smart search to discover and view enterprise information
  • Automated data discovery and classification powered by machine learning
  • Policy-and-business-classification-driven data quality evaluation
  • New browser-based cognitive design experience for data engineers
  • New and expanded Hadoop data lake and cloud capabilities and connectivity
  • Single and holistic catalog view of information across the information landscape, enabling users to derive insight through a knowledge graph

Unified Governance

Now let’s get into some details. InfoSphere® Information Server V11.7 introduces the unified governance platform, a fabric that supports Data Governance Objectives throughout analytics lifecycle. Unified Governance focuses on the following themes and capabilities to construct a data foundation for the enterprise.

  • Auto Discovery and Classification: For data in traditional repositories or in the modern Hadoop data lakes the ability to catalog data accurately and quickly with minimal user intervention is a key requirement all modern enterprises have. Auto Discovery provides the user with the ability to point to a data source and ingest metadata from that data source and Auto Classification is an optional feature used after discovery which conducts data profiling and quality analysis.
  • Auto Quality Management : Data Quality is a key component of Data Governance. Automation rules provide a way to associate evaluation of data quality with business classification of data and also provide a way to automate data quality evaluation. It will help lower the cost of quality evaluation significantly.
  • Enterprise Search – An enterprise wants to leverage data. A lot of data does not get used simply because there is no good way to find it. The Knowledge Graph is a self-service user experience which provides information with insight to the business user. This allows a CDO to improve the use of data in business decisions with a high level of confidence that it is governed data. Starting with a simple keyword search, a user can leverage context of the data and use social collaboration to narrow down data to be used for analytics or business decision making.
  • Customizable User Experience : This release introduces the ability for an enterprise to customize their users experience based on roles and allows a user to customize their experience suited to their personal preference.
  • Metadata Integration from StoredIQ into IGC enabling organizations to govern all their information assets (structured and unstructured) in a centralized repository. This is critical to support customers needs for GDPR.

Watch the following 2 minute video on Unified Governance:

This release introduces key technological innovations as well open source technology. Also there has been a tremendous change in the DataStage Designer. I will share that in the upcoming blog. So stay tuned.

DataStage job run time architecture on Hadoop

hadoop-logo In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offers unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will mention some of the enhancements in  IBM‘s InfoSphere Informaiton Server 11.5 that helps leverage the scale and promise of Hadoop.

Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement, and data delivery features that thousands of enterprises have relied on for years, can be immediately available to run within the Hadoop platform! Information Server is market leading product  in terms of it’s data integration and governance capability. Now the same product can be used to solve some of the industry’s most complex data challenges inside a Hadoop cluster directly. Imagine the time saved in moving the data back and forth from HDFS!

Even more, these new features for Hadoop use the same simple graphical design environment that IBM clients have previously been accustomed to build integration applications with. In other words, organizations can build new Hadoop-based information intensive applications without the need to retrain their development team on newly emerging languages that require manual hand coding and lack governance support.

How is this accomplished? YARN! 
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server  can communicate with YARN to run a job on the data nodes on a Hadoop cluster using following steps.

Here is more detail on how Information Server uses YARN

DataStageOnHadoopArchitecture

  1. A job is submitted to run in the Information Server engine.
  2. The ‘Conductor’ (the process responsible for coordinating the job) asks YARN to instantiate the YARN version of the Conductor: The Application Master.
  3. The YARN Client is responsible for starting and stopping Application Masters
  4. Now that the Application Master is ready, ‘Section Leaders’ (responsible for work on a datanode) are prepared
  5. Section Leaders are created and managed by YARN Node Managers.  This is the point where the BigIntegrate/BigQuality binaries will be copied to the Hadoop DataNode if they do not already exist there.
  6. Now the real work can begin – the ‘players’ (that actually run the process) are started.

All of this is automatic and behind the scenes.  The actual user interface will look and feel identical to when a job is run on Windows, AIX, or Linux.

 

IBM Stewardship Center

Need for IBM Stewardship Center in Data Curation: 

Managing Data Quality requires the joint effort of business and IT. Business defines the information policies that govern the data quality for an organization. Based on these policies, IT team implement rules so that any deviations in the data quality can be reported for business to review. For example, if the policy for a Bank is that the account holder’s age should be greater than 18. During data load, an ETL tool can run some profiling on the data to check how many records are violating this rule. Now these records needs to be shared with the business (non technical domain experts called Data Stewards) who can take appropriate action to fix the issue. As many data stewards become increasingly responsible for improving the value of their data assets, they need capabilities to help them manage these new requirements like:

  • Collaborating across multiple lines of business to build information policies that support regulatory requirements
  • Assessing the cost of poor data quality and managing such data quality issues to closure
  • Engaging subject matter experts through business processes to review and approve corporate glossary changes

StewardshipCenter

IBM Stewardship Center is a powerful browser-based interface that helps to bridge the gap between business and IT, providing a central location for users to collaborate on data governance and manage data quality issues. Stewardship Center is built on an open event management infrastructure, which makes it possible to integrate information server based stewardship seamlessly into your existing stewardship solutions and collaboration environments.

IBM Stewardship Center leverages the strengths of IBM® Business Process Manager to offer solutions to these challenges that can be used immediately or can be customized or extended to suit the specific needs of your organization. The capabilities that are provided with IBM Stewardship Center are divided into three categories: data quality exception management, governance and workflow notification, and performance metrics.
IBM Stewardship Center includes these components:

  • The Data Quality Exception sample process application, which is a sample workflow for resolving data quality issues, which can be customized or extended.
  • The Stewardship Center Application Toolkit, which can be used to extend the capabilities of the Data Quality Exception sample process application or to create your own custom workflows.
  • The Email Notification process application, which can be used to notify users by email when governance events are generated in Information Governance Catalog.
  • The Governance Rule Approval process application, which can be used to manage approvals for information governance rules from Information Governance Catalog.

For more information, see Overview of IBM Stewardship Center.
For a video see Tech Talk: Stewardship Center.

DataStage Best Practices – 3

DataStage1. Avoid unnecessary type conversions:
Set the OSH_PRINT_SCHEMAS environment variable to verify that run time schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.
2. Use Transformer stages sparingly and wisely.
Transformer stages can slow down your job. Do not have multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations

3. Increase Sort performance where possible.
Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.

4. Remove Unneeded Columns.
Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

5. Avoid reading from sequential files using the Same partitioning method.
Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly re-partition.

6. It is important to note that the individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup.

7. For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU intensive applications, which typically perform multiple CPU demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.

8. Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.

9. Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMS’s, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.

10. Turn off Runtime Column propagation wherever it’s not required.

 

DataStage Best Practices – 1

BestPracticesIn this series, I wish to share some of the best practices that I have come across or learnt from my peers in using DataStage. I hope  this will be helpful for a DataStage practitioner. Here are the links to Best Practice 2 and Best Practice 3 blog which is a continuation to this one.

The following are best practices:

a. There should be no Network bottleneck between Source -> ETL -> Target.
– Typically this means Private Network connection using dedicated switches.
– It also means proper capacity planning in term of network bandwidth for the Network Card as well as Switch Capacity.

b. There should be no bottleneck within the Source/Target System/Application to provide/consume data. ETL server can only process as fast as what the source can provide or the target can consume.
How fast can the ETL process the data in this example?
– Source can provide data at 010K rows / sec
– ETL can handle data at 100K rows / sec
– Target can consume data at 025K rows / sec

c. There should be no I/O bottleneck within the ETL Server.

d. There should be proper Capacity planning to cater for growth.

e. There should proper Job design to ensure Job scalability as the hardware scale. You can get some information on job design here.

f. Always use dedicated server or at least “dedicated CPU” if virtualization is required.

g. When there is a bottleneck in Source/Target, we can use less nodes in configuration file. It helps to improve performance and reduce resources usage.

h. You should be running the “right” number of jobs to ensure there is no system / process overload and wastage of resources to manage those process.

Here are some of the good reads:

  1. Architecture and Deployments
  2. Redbook on Deployment Architecture
  3. IBM InfoSphere Information Server Installation and Configuration Guide
  4. Managing and Deleting Persistent Data Sets within IBM InfoSphere Datastage

Determistic Vs Probabilistic Match

As explained in an earlier blog , Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key. There are two common approaches to decide a match in data while comparing two similar records. They are deterministic match and probabilistic match.

DetermisticVsProbabilistic

Deterministic matching typically searches for a pool of candidate duplicates and then compares values found in specified attributes between all pairs of possible duplicates. It makes allowances for missing data. The results are given a score, and the scores are used to decide if the records should be considered the same or different. There is a gray area where the scores indicate uncertainty, and such duplicates are usually referred to a data steward for investigation and decision.

Probabilistic matching looks at specified attributes and checks the frequency that these attributes occur in the dataset before assigning scores. The scores are influenced by the frequencies of existing values found. A threshold can be assigned to decide whether it is a definite match or a clerical intervention of data steward is required to determine a match.

In Summary
Deterministic decisions tables:

  • Fields are compared
  • Letter grades are assigned
  • Combined letter grades are compared to a vendor-delivered file
  • Result: Match; Fail; Suspect

Probabilistic record linkage:

  • Fields are evaluated for degree of match
  • Weight is assigned and represents the information content by value.
  • Weights are summed to derive a total score.
  • Result: Statistical probability of a match

InfoSphere QualityStage can perform both deterministic matching and probabilistic record linkage, but uses probabilistic record linkage by default. The above example highlights the advantage of probabilistic matching.