Reaching the Summit: The World’s Smartest Supercomputer

IBM and the U.S. government unveiled the world’s smartest and most powerful AI supercomputer, capable of performing 200 quadrillion calculations per second, making it the fastest in the world. IBM scientists from the U.S., Germany, Switzerland, France, Ireland, Brazil, Israel and Canada worked together over four years to build a system that is optimized from the ground up for AI.
But what do 200 quadrillion calculations per second really mean? Think of it this way: if every person on Earth completed one calculation per second, it would take 305 days to do what Summit can do in 1 second. The Summit supercomputer is also about 200,000 times more powerful than the average laptop. Here are a few things to know about Summit.
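
The 305-day comparison can be checked with a few lines of arithmetic. This is only a rough sketch: the world population of about 7.6 billion is my assumption (the figure is not stated above), and the laptop rate is simply what the 200,000x comparison implies.

```python
SUMMIT_OPS_PER_SEC = 200e15   # 200 quadrillion calculations per second (from the post)
WORLD_POPULATION = 7.6e9      # assumption: approximate world population in 2018
SECONDS_PER_DAY = 86_400


def days_for_humanity(ops: float, people: float) -> float:
    """Days needed if every person performs one calculation per second."""
    return ops / people / SECONDS_PER_DAY


days = days_for_humanity(SUMMIT_OPS_PER_SEC, WORLD_POPULATION)

# Rate of the "average laptop" implied by the 200,000x comparison
laptop_ops_per_sec = SUMMIT_OPS_PER_SEC / 200_000

print(f"{days:.0f} days for everyone on Earth to match one Summit second")
print(f"{laptop_ops_per_sec:.0e} calculations per second per laptop")
```

Under that population assumption the result rounds to 305 days, and the implied laptop works out to about one trillion calculations per second.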

Summit is designed for AI

Four years ago, when the U.S. Department of Energy brought in IBM to build a supercomputer for use at the Oak Ridge National Laboratory in Tennessee, it wasn’t just about building a computer that was fast.
The IBM team had to ensure that the system could quickly sift through all types of data, from multiple sources, to help find answers to the world’s most complex problems—from cancer to the opioid crisis to energy efficiency.
Artificial intelligence (AI) workloads are among the most complex and challenging computational problems today, and conventional systems tasked with them are typically power-hungry, inefficient and unable to make sense of all the different kinds of information.
What the team came up with was new server and processor technology, called POWER9, built specifically for compute-intensive AI workloads. It is faster and more powerful than existing systems, which were not designed to handle the new types of data that now flow, in different formats, from a vast range of systems and sensors.

This technology is not just for government

With deep learning and AI moving well beyond science fiction into the cutting edge of internet and enterprise computing, the impact of Summit is apparent in a variety of industries:

  • The National Cancer Institute will use Summit to find previously hidden relationships between disease factors.
  • The Department of Veterans Affairs will use it to combine clinical and genomic data with machine learning to understand the genetic factors contributing to opioid addiction.
  • And the DOE will use Summit to develop new materials that can transmit electricity without energy loss.

Businesses can have their own “mini Summit” today

There was a time when it would have taken years for technology built for government labs to be brought to market. But IBM has taken the same technology it used for Summit and put it into its newest commercial offerings, the AC922 and POWER9 systems. This means businesses can buy IBM’s new AI-optimized architecture and apply it to solve their toughest business challenges. In other words, this isn’t just about research. This is about giving organizations the technology they need to work smarter in new ways, from banks identifying fraud in real time to global businesses preventing supply chain breakdowns before they happen. And since IBM Power Systems are based on open-source technology, customers have more choice in which hardware and software to use, as well as more flexibility in the components used inside systems.

Four of the top six banks in North America are already using these systems for things like real-time fraud identification.


What’s inside the box

The Summit supercomputer consists of about 4600 “nodes”, which are basically rack-mounted servers. Although Summit will be 5-10 times more powerful than its predecessor, it will have only a quarter of the nodes and use substantially less power. It’s what’s inside these nodes that makes them so special. Each node consists of a specialized HPC server designed by IBM. The node contains two IBM Power9 processors and six Nvidia Tesla V100 SXM2 GPU accelerators. And to keep these systems cool, there’s a swimming pool’s worth of water flowing overhead.


IBM Vs Informatica (2 of 2)

In my last blog, we compared IBM’s Information Server and Informatica’s PowerCenter based on their scalability. Here is the summary: Big Data and enterprise-class data environments need unlimited data scalability to keep pace with data volume growth. Informatica’s PowerCenter is NOT designed to provide unlimited data scalability, which may lead to investment in expensive workarounds.

In this blog we will touch upon two other important aspects of ETL tools.

Data Governance and Metadata Management


  • IBM provides a data governance solution (Information Governance Catalog) designed for business users.
    • Information Governance Catalog has deep support for REST APIs. This makes Information Server more open and ensures compatibility with other enterprise systems. Users can create custom enhancements and loaders, as well as unique user interfaces with a consistent look and feel.
    • There is superior event-based notification that takes advantage of open-source Kafka messaging. For example, an import of metadata is an “event” that can be monitored for workflow and approval purposes, or simply for notification.
    • There is graphical reporting to illustrate relationships, data design origins, and data flow lineage to help answer “what does this mean” and “where did this data come from?”
    • There is advanced search and navigation, a “shopping” experience for the data.
    • Metadata Asset Manager controls what data goes into the repository. “Import Areas” govern what is imported into the repository (or not), and who is able to import. These imports are initiated via a browser interface. No local Windows installation is required for the metadata administrator.
  • Informatica lacks these capabilities and provides a data governance solution designed for technical users. Its platform lacks openness, and you get locked into an “Informatica Only” architecture.

Data Quality

  • IBM provides an integrated data integration platform with one processing engine, one user design experience for data integration and data quality, and one shared metadata repository. Information Server gives you the ability to write a DataStage job once and run it anywhere (transactional database, Hadoop or eventually Spark).
  • Informatica provides a collection of multiple, incompatible processing engines, user design experiences, and metadata repositories. Informatica Data Quality and Informatica PowerCenter are two different products that have different user interfaces. In fact, PowerCenter (PC) needs two interfaces to design jobs and manage workflows. It also uses two engines. This means that Data Quality processes have to be ‘pushed’ or ‘exported’ to PC to run.
In summary, Information Server is the better choice if we want scalable workflows, openness in architecture, and better productivity in designing and running workflows. Information Server supports the power of 1.
  • 1 Engine: The same engine runs stand-alone, in a grid, or natively in Hadoop/YARN. Jobs can remain unchanged regardless of deployment model.
  • 1 Design Experience: Single design experience for Data Integration and Data Quality that increases productivity and reduces error.
  • 1 Repository: A single active metadata repository across the entire portfolio, so design and execution metadata are instantly shared among team members.
Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

IBM Vs Informatica (1 of 2)

Some of you may have noticed on my “about me” page that one of my posts that gets a lot of hits is IBM and Informatica Leads Gartner Magic Quadrant for Data Integration Tools 2013. I realize that many visitors would like a comparison between IBM Information Server and Informatica. I have been in the ETL domain for the last 13 years and have several publications and patents in it, so I thought of venturing into comparing these two solutions. One may ask why I need to do this when Gartner already compares these solutions. The answer is that Gartner takes many different factors into account, whereas I am basing my comparison purely on the technical capabilities in the key areas. I am also opening a dialog in which other practitioners who have worked on these two products can provide inputs, so that all readers (including me) can benefit.

In this blog, I will focus on the scalability aspect of these two ETL solutions.

Scalability and Parallel Processing

Big Data Integration requires something called Massive Data Scalability. Massive Data Scalability requires the ability to process more data by simply adding more hardware.

  • IBM’s Information Server is built on a shared-nothing, massively parallel processing architecture. There is no limitation on throughput and performance. If you want to process more data, you just add hardware. You don’t change your application. You can refer to my earlier blog describing Information Server parallel processing, which is much faster (10X to 15X) than processing with Hadoop.

    Information Server Architecture Supporting Data Partitioning
  • Informatica’s PowerCenter and Blaze can’t support partitioning a large data set across the nodes of a cluster, grid or MPP system. This is one of the fundamental architectural requirements for processing large data volumes. It means there is no support for running data integration logic in parallel across computing nodes, with the same logic running against separate data partitions. Because of this architectural limitation, the amount of data that you can sort, aggregate, transform, join, etc. is limited to what you can process on one node. So what does this mean?
    • First, you can’t exploit commodity grid hardware and storage for processing Big Data. You have to buy expensive SMP servers and storage because the amount of data that you can sort, transform and aggregate is limited to what you can process on one server.
    • Second, you are forced to push big ETL workloads into parallel databases such as Netezza, DB2, Oracle, and Teradata.
    • Finally, because you can’t run all complex data transformations in the parallel database, you have to live with dirty data that has not been cleansed.

In Summary:
Processing large data in a scalable manner requires the data to be partitioned across separate nodes so that a single job executes the same application logic against all partitioned data. This is not possible with Informatica’s PowerCenter. So to process large datasets, an Informatica customer has to either push the processing into the database (too expensive) or offload some of the work to Hadoop (too slow).
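
To make the idea of partitioned processing concrete, here is a minimal sketch. It is not how Information Server is implemented: the function names (`partition`, `aggregate`, `run_job`) and the sample records are hypothetical, and worker threads merely stand in for the separate nodes of a cluster. The point it illustrates is that the application logic stays the same while only the number of partitions changes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Hash-partition (key, value) records so each key lands in exactly one partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def aggregate(part):
    """The application logic: sum values per key. Identical for every partition."""
    totals = defaultdict(int)
    for key, value in part:
        totals[key] += value
    return dict(totals)


def run_job(records, num_partitions):
    """Run the same logic against every partition in parallel, then merge."""
    parts = partition(records, num_partitions)
    # Assumption: worker threads stand in for cluster nodes in this sketch
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(aggregate, parts))
    merged = {}
    for partial in partials:
        merged.update(partial)  # keys never overlap across partitions
    return merged


records = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
result = run_job(records, num_partitions=3)
print(result)
```

Calling `run_job(records, 1)` and `run_job(records, 4)` gives the same totals, which is the property described above: you add partitions (hardware) without changing the application.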



25 Years of IBM Patent Leadership

IBM inventors received a record-breaking 9,043 U.S. patents during 2017, the 25th consecutive year that IBM has topped the annual list of patent recipients. IBM’s patents in 2017 included more than 1,900 cloud patents, 1,400 in artificial intelligence (AI) and over 1,200 in the area of cybersecurity. IBM also received patents in the areas of blockchain and quantum computing. IBM’s India inventors contributed over 800 patents to this record tally, making us the second-highest contributor after the US region.

It is a winning streak that began just after the advent of the PC, around when the World Wide Web debuted to the public. In the history of technology, virtually no one else has led in any area for 25 years. IBM has gone from 20,000 patents in 1992 to producing more than 100,000 patents over this 25-year period, so I thought of sharing some insight into it.

The more than 105,000 inventions from IBM during its 100-plus-year history (from FORTRAN to relational databases to the Universal Product Code) are transforming the world. You can also see smart glasses for the visually impaired; a technology for securing credit card transactions; a carbon nanotube that’s 50,000 times thinner than human hair; and systems for predicting car traffic before it starts and earthquakes before they strike: breakthrough after breakthrough.

Sample Patents:
Here are some of the interesting patents from IBM in 2017:

  • Cloud resources: A system that uses unstructured data about world or local events to forecast cloud resource needs.
  • Self-driving vehicles: A machine-learning system that can shift control between a human driver and autonomous vehicle when there’s an emergency.
  • Blockchain: A method that uses blockchain technology to reduce the number of steps involved in settling transactions between multiple business parties.
  • AI speech: A system that can help artificial intelligence analyze and mirror a user’s speech patterns to improve communication between AI and humans.
  • Cybersecurity: Technology that enables AI systems to turn the tables on hackers by baiting them into email exchanges and websites that expend their resources and frustrate their attacks.

What does it take to innovate?
We tend to think of innovation as a single event: a flash of genius followed by a revolutionary product or service. But the truth is that the road to any significant discovery is a long and twisted path. Inventors take some already established ideas and come up with something novel. Do watch the following video that shows the process and importance of patenting.

What’s new in IBM InfoSphere Information Server 11.7 – Part 3

In my last blog, we discussed Information Governance Catalog (IGC). In this blog I wish to touch upon some new Information Governance features that were introduced, along with the new look and feel, in IBM InfoSphere Information Server version 11.7.

Enterprise Search

Social Collaboration
InfoSphere Information Server also brought social collaboration to the domain of Information Governance. When you browse your data, you sometimes want to know what other experts think about critical assets such as reports, source files, and more. Now this is possible: you can rate an asset on a scale of one to five stars, and you can leave a short comment. This enables all members of your organization to collaborate and share their expertise right where it’s needed. Also remember that the more popular an asset is, the higher its position on the search results list.

Searching for assets
With 11.7, searching for assets has become very easy. You don’t need to know anything about the data in your enterprise to explore it. Let’s assume that you want to find information about bank accounts: simply type ‘bank account’ in the search field in enterprise search, and that’s it. The search engine looks for the information in all asset types. It takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage. And if you are already familiar with your organization and are looking for something more specific, you just open the catalog with your data and select the asset types that you want to browse. To narrow down search results, apply advanced filters like creation and modification dates, stewards, labels, or custom attributes.
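
To give a feel for how several factors can be combined into one ranking, here is a purely illustrative sketch. It is not the actual enterprise search algorithm: the weights, the `Asset` fields and the sample assets are all hypothetical, and only four of the factors mentioned above are modelled.

```python
from dataclasses import dataclass

# Hypothetical weights; how the product really combines its factors is not documented here.
WEIGHTS = {"text_match": 0.5, "rating": 0.2, "quality": 0.2, "usage": 0.1}


@dataclass
class Asset:
    name: str
    text_match: float  # 0..1, how well the asset matches the search text
    rating: float      # 1..5 stars given by users
    quality: float     # 0..1 quality score
    usage: float       # 0..1 relative usage


def score(asset: Asset) -> float:
    """Weighted sum of the factors, with the star rating normalised to 0..1."""
    return (WEIGHTS["text_match"] * asset.text_match
            + WEIGHTS["rating"] * asset.rating / 5
            + WEIGHTS["quality"] * asset.quality
            + WEIGHTS["usage"] * asset.usage)


def rank(assets):
    """Highest score first."""
    return sorted(assets, key=score, reverse=True)


assets = [
    Asset("Account report", 0.9, 2.0, 0.5, 0.2),
    Asset("Customer file", 0.3, 5.0, 0.9, 0.9),
    Asset("BANK_ACCOUNT table", 0.9, 4.0, 0.8, 0.6),
]
ranked = [a.name for a in rank(assets)]
print(ranked)
```

With these made-up numbers, two assets match the text equally well, and the better-rated, more widely used one comes out on top.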

Unstructured data sources
The data in your enterprise consists of databases, tables, columns, and other sources of structured data. What about email messages, word-processing documents, audio or video files, collaboration software, or instant messages? They are also a very valuable source of information. To support a unified approach to enterprise information management, IBM StoredIQ can now be set up to synchronize data with IBM Information Governance Catalog. So now you can classify such information in IGC too.

Exploring Relationships
Data in large organizations can be very complex, and assets can be related to one another in multiple ways. To understand these complex relations better, explore them in a graphical form by using graph explorer. This view by default displays all relationships of one asset that you select. But this is just the starting point, as you can further expand relationships of this asset’s relationships in the same view. Having all this information in one place in a graphical format makes it a lot easier to dig into the structure of your data. Each relationship has direction and name. You’ll be surprised when you discover how assets are connected!
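
The idea of starting from one asset and then expanding the relationships of its relationships can be sketched as a small graph traversal. This is only an illustration under my own assumptions, not the graph explorer’s implementation: the asset names and relationship names are hypothetical, and for simplicity the sketch follows outgoing relationships only.

```python
from collections import deque

# Hypothetical assets and relationship names: (source, relationship name, target)
RELATIONSHIPS = [
    ("Bank Account (term)", "is assigned to", "ACCOUNT table"),
    ("ACCOUNT table", "contains", "BALANCE column"),
    ("ACCOUNT table", "is read by", "Load Accounts job"),
    ("Load Accounts job", "writes to", "Accounts report"),
]


def neighbours(asset):
    """Relationships that start at the given asset, as (name, target) pairs."""
    return [(name, dst) for src, name, dst in RELATIONSHIPS if src == asset]


def explore(start, depth):
    """Breadth-first expansion up to `depth` hops; returns edges as (src, name, dst)."""
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        asset, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for name, dst in neighbours(asset):
            edges.append((asset, name, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, hops + 1))
    return edges


for src, name, dst in explore("Bank Account (term)", depth=2):
    print(f"{src} --{name}--> {dst}")
```

A depth of 1 corresponds to the default view of one asset’s direct relationships; increasing the depth corresponds to expanding further in the same view.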

To have a look at the new Information Governance Catalog, view this video.


What’s new in IBM InfoSphere Information Server 11.7 – Part 2

DataStage Flow Designer

As promised in the last blog, here are a few more changes that came with InfoSphere Information Server 11.7. DataStage Flow Designer is the new web-based user interface for IBM’s flagship data integration component, IBM DataStage. It can be used to create, edit, load, and run DataStage jobs. But unlike the current DataStage Designer, it does not require any installation on a Microsoft Windows client environment and is therefore immediately available and easily accessible once DataStage is installed. Moreover, you do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface. Any existing DataStage jobs can be rendered in IBM DataStage Flow Designer, avoiding complex, error-prone migrations that could lead to costly outages.

Here are a few of its features.

  • Search and Quick Tours: Quickly find any job using the built-in search feature. For example, you can search by job name, description or timestamp to find what you are looking for very quickly. You can also familiarize yourself with the product by taking the built-in quick tour, or watch the “Create your first job” video on the welcome page.
  • Automatic metadata propagation: Making changes to a stage in a DataStage job can be time consuming because you have to go to each subsequent stage and redo the change. DataStage Flow Designer automatically propagates the metadata to subsequent stages in that flow, increasing productivity.
  • Highlighting of all compilation errors: Today, the DataStage thick client identifies compilation errors one at a time. Big jobs with upwards of 30 or 50 stages have a difficult time on compile, because errors are highlighted one stage at a time. DataStage Flow Designer highlights all errors and gives you a way to see the problem with a quick hover over each stage, so you can fix multiple problems at once before re-compiling.
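
The difference between reporting errors one at a time and reporting them all at once can be sketched as follows. This is a toy illustration, not DataStage code: the stage dictionaries, the validation rules and the function names are hypothetical.

```python
def validate_stage(stage):
    """Return a list of problems for one stage (hypothetical rules)."""
    problems = []
    if not stage.get("name"):
        problems.append("stage has no name")
    if not stage.get("columns"):
        problems.append("stage defines no columns")
    return problems


def compile_first_error(stages):
    """One-at-a-time style: stop at the first stage that has a problem."""
    for index, stage in enumerate(stages):
        problems = validate_stage(stage)
        if problems:
            return {index: problems}
    return {}


def compile_all_errors(stages):
    """All-at-once style: check every stage and report everything together."""
    report = {}
    for index, stage in enumerate(stages):
        problems = validate_stage(stage)
        if problems:
            report[index] = problems
    return report


job = [
    {"name": "read_accounts", "columns": ["id", "balance"]},
    {"name": "", "columns": ["id"]},
    {"name": "write_report", "columns": []},
]

print(compile_first_error(job))
print(compile_all_errors(job))
```

With the first approach, the problem in the last stage only surfaces after the earlier one is fixed and the job is compiled again; with the second, both show up in a single pass.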

In summary, the new browser-based DataStage® Flow Designer is geared for data engineers, but is versatile and accessible to all business users. This cognitive designer features an intuitive, modern, and security-rich browser-based interface. Users can access the DataStage Flow Designer and quickly address their data transformation or preparation needs, without having to rely on a Windows™ desktop environment. Do watch the following video on IBM DataStage Flow Designer.

To know more, please visit the IBM Knowledge Center.
There is a lot more in IBM InfoSphere Information Server 11.7. So stay tuned.