Quality Stage – III (Investigation)

In the last blog, I mentioned that we need to investigate the data. Understanding the data is a necessary precursor to cleansing and consolidation. If we know the actual condition of data in existing sources,we can correct data problems before they corrupt new systems. So let’s begin with the issues in data, which necessitates us to have an investigation done prior to handling the data. An investigate stage helps us answer questions like the following:

– Does the data column has some out of range, default (such as all nines in the telephone number) or unique values?
– Does the data column has Free-form text that requires parsing to improve “addressability” to key components for optimal matching?
– Does the data column has values that do not match metadata labels and require new field definitions for matching and load format (such as DBA, C/O, Attention text, drivers license numbers, or other comment information in a name or address field)?
– Does the data column has values that overlap adjacent fields and thus require a re-alignment of field content. For example, name information or city, state, and postal code that extend into street and mailing address fields.
– Does the data column has invalid formats that can cause conversion problems such as alphanumeric in character or numeric only fields.
– Does the data column has blank or missing data or requires some special character and punctuation handling?

Now a brief of  Investigate Stage in action in pictures…

1. How does Investigate Stage job looks like on the Data Stage Parallel Canvas?

2. How does Investigate Stage investigates free-form fields? This is done to create patterns to reveal field format.

3. What kind of report can we see from the investigate stage?

Finally, We can also use a data profiling product such as IBM InfoSphere Information Analyzer to perform a comprehensive analysis across many columns and tables to create a direct input into the cleansing and matching process by using shared metadata.


6 thoughts on “Quality Stage – III (Investigation)

  1. Hi Namit,

    Nice series, I was at a recent IBM bootcamp of IIS and your blog jst summarizes some of the products nicely. Recently i was doing a character discrete investigation on one of my data sources and output file after investigation had had less number of columns that the source file. This puzzled me. Can you help understand why? Because as i see the output is the result on each coulmn and hence there should be more rows in the output report. I know this is limited information, but still if you can provide a little insight?

    • Hi Vishu,

      Thank you for your comment. The output columns just describe the patterns of the data (remember we are investigating the data). In other words we are just seeing the investigation report here and not the actual transformation of data. Do you have a usecase to share with me with sample data to understand what you are trying to do?

  2. Hi Namit,

    I’m doing POC on indian data for seperating north indian names and south indian names using qualitystage v8.5 .
    I have developed Classification table and dictionary table now i have to develop “Pattern action file” could you please help me how to develop pattern action file and seperate north indian name bucket and south indian name bucket based on lastnames.

    Example :-
    In Classification File :-

    ;Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending

    Akula Akula B
    Alluri Alluri B
    Alva Alva B
    Baireddy Baireddy B
    Agarkar Agarkar N
    Agarwal Agarwal N
    bhat bhat N
    ………….. so on
    vikram bhat ( as per the lastname “bhat” we can tell that the person belongs to north )
    subba bairreddy ( as per the lastname “baireddy” we can tell that the person belongs to south)

    In Dictionary file : –

    SouthIndianName C 20 S SouthIndianName ;0607-0626
    NorthIndianName C 20 S NorthIndianName ;0627-0646

    Q 1 ) could u explain what is “Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending”???

    after investigation stage the output pattern report is

    ???N Mr Gulam Nabi Bhat
    ??? Mr Sanjay Jain
    ??B Mr sanjay kumar

    in Pattern Action File ( by using above pattern report developed the PAT)

    ; Parser Rules

    SEPLIST ” ~`!@#$%^&*()_-+={}[]|\\:;\”,.?/'”
    STRIPLIST ” ~`!@$^*_+={}[]|\\:;?”

    ? | ? | ? | N | [ {NamePrefix} = “” = “” = “” {NorthIndianName} = “” ]
    copy [1] {NamePrefix}
    copy [2] {FirstName}
    copy [3] {PrimaryName}
    copy [4] {NorthIndianName}

    ? | ? | ? | [ {NamePrefix} = “” = “” = “” ]
    copy [1] {NamePrefix}
    copy [2] {FirstName}
    copy [3] {PrimaryName}

    ? | ? | ? | B | [ {NamePrefix} = “” = “” = “” = “” ]
    copy [1] {NamePrefix}
    copy [2] {FirstName}
    copy [3] {PrimaryName}
    copy [4] {SouthIndianName}

    Q 2) but i’m not getting the output in standardize stage, could u explain me where i’m doing mistake???

    I have went through some basics in pattern action reference redbook

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s