Friday, October 30, 2015

5 Tips for better DataStage Design


01.  Always try to complete the requirement first. Don't waste your time developing an optimized job that is practically useless until the requirement is completed.

02.  To re-arrange an existing job design, or to insert new stage types into an existing job flow, first disconnect the links from the stage to be changed; the links will then retain any metadata associated with them.




03.  A Lookup stage can only have one input stream, one output stream, and, optionally, one reject stream. Depending on the type of lookup, it can have several reference links. To change the use of particular Lookup   links in an existing job flow, disconnect the links from the Lookup stage and then right-click to change the link type, for example, Stream to Reference.

04.  The Copy stage is a good place-holder between stages if you anticipate that new stages or logic will be needed in the future without damaging existing properties and derivations. When inserting a new stage, simply drag the input and output links from the Copy place-holder to the new stage. Unless the Force property is set in the Copy stage, InfoSphere DataStage optimizes the actual copy out at runtime.

05.  DataStage takes longer to start up a job because it validates the entire environment, the nodes and the database connectivity before starting the ETL job. This adds overhead upfront, but it ensures robust and reliable data loading and integration. Parallel jobs are not recommended for small volumes of data or a serial style of ETL design, as there is an overhead in starting the PX processes.

Thursday, October 29, 2015

Difference between Datastage 7.5X2, Datastage 8.0.1 and Datastage 8.5 Versions

Different Versions

Datastage 8.5 Enhancements


It is faster than previous versions.
All tasks, such as save, rename and compile, are faster than in the previous version, 8.1.

The run time of jobs has also been improved.

The Transformer stage is one of the most important stages in DataStage, and in DataStage 8.5 it got better than in the previous version.

Transformer looping has been included in the Transformer stage by IBM.

There are two types of looping available here: output looping and input looping.

With output looping we can produce multiple output (O/P) records for a single input (I/P) record.

For Example:

Input:

Marketing_person    City-1         City-2       City-3

James Bond          Washington     New York     New Jersey




Output will be

Marketing_person     City

James Bond         Washington
James Bond          New York
James Bond           New Jersey


This is achieved with the new system variable @ITERATION.


System Variables:

@ITERATION: used in the looping mechanism.

LastRow(): indicates the last row in the job.

LastRowInGroup(): returns the last row in the group, based on the key.
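
To see what output looping does outside DataStage, here is a minimal Python sketch (not DataStage syntax) that expands the single input row from the example above into one output row per city, the way @ITERATION walks through the loop:

    # One input row with three city columns (data from the example above)
    row = {"Marketing_person": "James Bond",
           "City-1": "Washington", "City-2": "New York", "City-3": "New Jersey"}

    # Emulate output looping: @ITERATION takes the values 1, 2, 3
    for iteration in (1, 2, 3):
        print(row["Marketing_person"], row["City-" + str(iteration)])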

WHAT ARE THE CLIENT COMPONENTS IN DATASTAGE 7.5X2 VERSION
In Datastage 7.5X2, there are 4 client components. They are:

1) Datastage Designer
2) Datastage Director
3) Datastage Manager
4) Datastage Admin

In Datastage Designer, we can:

Create the Jobs
Compile the Jobs
Run the Jobs

In Director, we can:
View the Jobs
View the Logs
Batch Jobs
Unlock Jobs
Scheduling Jobs
Monitor the JOBS
Message Handling


In Manager, we can:
Import & Export the Jobs
Node Configuration

And by using Admin, we can:
Create the Projects
Organize the Projects
Delete the Projects


DataStage 8.1 to DataStage 8.5
1. DataStage Designer performance improvements: by changing the metadata algorithm, copy/delete/save of jobs got about 30-40% faster.
2. Parallel Engine performance and resource improvements: resource usage is about 5% lower than in 8.1 for T-Sort, and the Windows desktop heap size has been decreased by 94%.
3. Transformer enhancements: key break support.
LastRowInGroup() function is added. This will return true for the last record of the group.
LastRow() will return the last record of the input.
Output looping: allows multiple output records to be created per single input record.
Input looping: allows aggregation of input records so that aggregated data can be included with the original input data (for example, adding an average column to the original input is now possible; a 2-pass calculation).
New null handling: this is fairly complicated and needs more verification on my part to explain clearly, but this is the description I got.
Null values can now be included in any expression.
        -> Null values no longer need to be explicitly handled.
            A null value in an expression will return a null value result. As long as the target column is nullable, records will not be dropped. Stage variables are now always nullable.

APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING is prepared to support backward compatibility.

New Transformer Functions
Create/offset a time, date or timestamp from component arguments
DateFromComponents(int32 years, int32 months, int32 dayofmonth)
DateOffsetByComponents(date basedate, int32 yearoffset, int32 monthoffset, int32 dayoffset)
DateOffsetByDays(date basedate, int32 offset)
TimeFromComponents(int32 hours, int32 minutes, int32 seconds, int32 microseconds)
TimeOffsetByComponents(time basetime, int32 houroffset, int32 minuteoffset, dfloat secondoffset)
TimeOffsetBySeconds(time basetime, dfloat secondoffset)
TimestampOffsetByComponents(timestamp basetimestamp, int32 yearoffset, int32 monthoffset, int32 dayoffset, int32 houroffset, int32 minuteoffset, dfloat secondoffset)
TimestampOffsetBySeconds(timestamp basetimestamp, dfloat secondoffset)

Various packed decimal conversions
DecimalToDate(decimal basedecimal [,string format] )
DecimalToTime(decimal basedecimal [,string format] )
DecimalToTimestamp(decimal basedecimal [,string format] )
DateToDecimal(date basedate [,string format] )
TimeToDecimal(time basetime [,string format] )
TimestampToDecimal(timestamp basetimestamp [,string format] )
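
For readers who want to see the idea behind the component/offset functions outside DataStage, here is a small hedged Python sketch using the standard datetime module; the lower-case function names are purely illustrative, they are not the DataStage functions themselves:

    from datetime import date, timedelta

    def date_from_components(years, months, dayofmonth):
        # Build a date from its components, similar in spirit to DateFromComponents
        return date(years, months, dayofmonth)

    def date_offset_by_days(basedate, offset):
        # Shift a date by whole days, similar in spirit to DateOffsetByDays
        return basedate + timedelta(days=offset)

    d = date_from_components(2015, 10, 29)
    print(d)                          # 2015-10-29
    print(date_offset_by_days(d, 7))  # 2015-11-05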

4. DataStage Function enhancements: new client/domain compatibility check; before/after routines now mask encrypted parameters; copy project permissions from an existing project when creating a new project; environment variable enhancements (creation during import); PX stage reset support; enhancements to the Parallel Data Set stage; multiple null field values on import; enhancements to improve Multi-Client Manager support.
5. DataStage Serviceability enhancements: new audit tracing; enhanced exception dialog; ISA Lite enhancements for DataStage; enhanced project creation failure details.
6. Parallel Pivot - adding vertical pivoting.
7. CVS (code version control integration): Information Server Manager has been built on Eclipse since 8.1; now the CVS or Subversion plug-ins for Eclipse are available for DataStage components.

Difference between Datastage 7.5X2 and Datastage 8.0.1 Versions

1) In Datastage 7.5X2, there are 4 client components. They are:
a) Datastage Designer
b) Datastage Director
c) Datastage Manager
d) Datastage Admin

   In Datastage 8.0.1, there are 5 client components. They are:
a) Datastage Designer
b) Datastage Director
c) Datastage Admin
d) Web Console
e) Information Analyzer

Here Datastage Manager is integrated into Datastage Designer.

2) Datastage 7.5X2 is OS dependent; that is, OS users are Datastage users.
   Datastage 8.0.1 is OS independent; that is, users can be created within Datastage, but it is one-time dependent.

3) Datastage 7.5X2 uses a file-based repository (folders).
   Datastage 8.0.1 uses a Datastage (database) repository.

4) Datastage 7.5X2 has no web-based administration.
   Datastage 8.0.1 has web-based administration.

5) Datastage 7.5X2 has 2 architecture components:
a) Server
b) Client
   Datastage 8.0.1 has 5 architecture components:
a) Common User Interface
b) Common Repository
c) Common Engine
d) Common Connectivity
e) Common Shared Services

6) In Datastage 7.5X2, P-3 and P-4 can be performed:
P-3 is Data Transformation
P-4 is Metadata Management
   In Datastage 8.0.1, P-1, P-2, P-3 and P-4 can be performed:
P-1 is Data Profiling
P-2 is Data Quality
P-3 is Data Transformation
P-4 is Metadata Management

7) In Datastage 7.5X2, the server is IIS.
   In Datastage 8.0.1, the server is WebSphere.





Sunday, October 25, 2015

OLTP and OLAP difference


OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.



OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).


The following table summarizes the major differences between OLTP and OLAP system design.

OLTP System - Online Transaction Processing (Operational System)
OLAP System - Online Analytical Processing (Data Warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger due to the existence of aggregation structures and historical data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized with many tables.
  OLAP: Typically de-normalized with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
  OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Star schema vs. snowflake schema: Which is better?


What are the key differences between the snowflake and star schemas? Where should they be applied?
The Star schema vs Snowflake schema comparison brings four fundamental differences to the fore:
1. Data optimization:
The snowflake model uses normalized data, i.e. the data is organized inside the database in order to eliminate redundancy, which helps to reduce the amount of data. The hierarchy of the business and its dimensions is preserved in the data model through referential integrity.
Figure 1 – Snowflake model
The star model, on the other hand, uses de-normalized data. In the star model, dimensions refer directly to the fact table, and the business hierarchy is not implemented via referential integrity between dimensions.
Figure 2 – Star model
2. Business model:
A primary key is a single unique key (data attribute) that is selected for a particular data entity. In the previous 'advertiser' example, the Advertiser_ID will be the primary key (business key) of a dimension table. A foreign key (referential attribute) is just a field in one table that matches the primary key of another dimension table. In our example, the Advertiser_ID could be a foreign key in Account_dimension.
In the snowflake model, the business hierarchy of the data model is represented in a primary key – foreign key relationship between the various dimension tables.
In the star model, all required dimension tables have only foreign keys in the fact table.
3. Performance:
The third differentiator in this Star schema vs Snowflake schema face-off is the performance of these models. The snowflake model has a higher number of joins between the dimension tables and then the fact table, and hence performance is slower. For instance, if you want to know the advertiser details, this model will ask for a lot of information such as the advertiser name, ID and address, for which the advertiser and account tables need to be joined with each other and then joined with the fact table.
The star model, on the other hand, has fewer joins between the dimension tables and the fact table. In this model, if you need information on the advertiser you just have to join the advertiser dimension table with the fact table.
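
To make the join-count difference concrete, here is a hedged pandas sketch (the table and column names such as fact, account_dim and advertiser_dim are invented for illustration): the snowflake path needs an extra join hop before reaching the fact table, while the star path joins a denormalized dimension directly.

    import pandas as pd

    # Toy fact and dimension tables (hypothetical data)
    fact = pd.DataFrame({"Account_ID": [1, 2], "Clicks": [100, 250]})
    account_dim = pd.DataFrame({"Account_ID": [1, 2], "Advertiser_ID": [10, 20]})
    advertiser_dim = pd.DataFrame({"Advertiser_ID": [10, 20],
                                   "Advertiser_Name": ["Acme", "Globex"]})

    # Snowflake: advertiser -> account -> fact (two joins)
    snowflake = (advertiser_dim
                 .merge(account_dim, on="Advertiser_ID")
                 .merge(fact, on="Account_ID"))

    # Star: a denormalized dimension already carries the advertiser attributes,
    # so only one join to the fact table is needed
    star_dim = account_dim.merge(advertiser_dim, on="Advertiser_ID")
    star = star_dim.merge(fact, on="Account_ID")

    print(snowflake[["Advertiser_Name", "Clicks"]])
    print(star[["Advertiser_Name", "Clicks"]])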

4. ETL:
The snowflake model loads the data marts, and hence the ETL job is more complex in design and cannot be parallelized, as the dependency model restricts it.
The star model loads the dimension tables without dependencies between dimensions, and hence the ETL job is simpler and can achieve higher parallelism.
This brings us to the end of the Star schema vs Snowflake schema debate. But where exactly do these approaches make sense?

Where do the two methods fit in?
With the snowflake model, dimension analysis is easier. For example, ‘how many accounts or campaigns are online for a given Advertiser?’

The star schema model is useful for Metrics analysis, such as – ‘What is the revenue for a given customer?’

Datastage Scenarios and solutions



Field mapping using Transformer stage:

Requirement:
The field must be right justified and zero filled; take the last 18 characters.

Solution:
Right("0000000000":Trim(Lnk_Xfm_Trans.link),18)
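
As a quick cross-check of the idea (not DataStage syntax), the same right-justify and zero-fill logic can be sketched in Python; here the padding uses 18 zeros so the result is always 18 characters long, whatever the input length:

    value = " 12345 "                            # hypothetical input field
    result = ("0" * 18 + value.strip())[-18:]    # zero fill, then take the last 18 characters
    print(result)                                # 000000000000012345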



Scenario 1:

We have two datasets with 4 columns each, with different names. We should create a dataset with 4 columns: 3 from one dataset and one column with the record count of the other dataset.

We can use an Aggregator with a dummy column to get the count from one dataset, then do a lookup from the other dataset and map it to the third dataset.
Something similar to the design below:
Scenario 2:
The following is the existing job design, but the requirement changed: the header and trailer datasets should be populated even if detail records are not present in the source file. The job below doesn't do that.

Hence the above job was changed to meet the new requirement:
A Row Generator was used with a Copy stage, giving a default value (zero) for the count column coming from the Row Generator. If there are no detail records, it will pick the record count from the Row Generator.


We have a source sequential file with a header and footer. How do we remove the header and footer while reading this file using the Sequential File stage of Datastage?
Sol: Run this command (for example in putty) in the before-job subroutine, then use the new file in the Sequential File stage: sed '1d;$d' file_name > new_file_name

If I have a source like:
COL1
A
A
B

and a target like:
COL1  COL2
A     1
A     2
B     1

how do I achieve this output using a stage variable in the Transformer stage?

If keyChange = 1 Then 1 Else stageVariable + 1
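
A minimal Python sketch of the same occurrence-numbering idea (assuming the input is already sorted on COL1, which the keyChange logic requires):

    rows = ["A", "A", "B"]                            # key column values, sorted
    prev, counter = None, 0
    for col1 in rows:
        counter = counter + 1 if col1 == prev else 1  # mimics the stage variable
        prev = col1
        print(col1, counter)                          # A 1, A 2, B 1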



Suppose a sequencer controls 4 jobs (job 1, job 2, job 3, job 4). Job 1 has 10,000 rows, but after the run only 5,000 rows were loaded into the target table, the rest were not loaded, and the job aborted. How can you sort out the problem? When the job sequencer controls the 4 jobs and job 1 has a problem, go to Director and check what type of problem is showing: a data type problem, a warning message, a job failure or a job abort. If the job fails, it usually means a data type problem or a missing column action. Then go to the Run window -> click -> Tracing -> Performance, or in your target table -> General -> Action, where you have two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue.
First check how much data has already been loaded, then select the On Skip option and Continue; for the remaining data that was not loaded, select On Fail, Continue, and run the job again. You will definitely get a success message.
Question: I want to process 3 files sequentially, one by one. While processing, the job should fetch the files automatically. How can I do that?
Ans: If the metadata for all the files is the same, create a job with the file name as a parameter, then use the same job in a routine and call it with a different file name each time, or create a sequencer to reuse the job:

Parameterize the file name.
Build the job using that parameter
Build job sequencer which will call this job and will accept the parameter for file name.
Write a UNIX shell script which will call the job sequencer three times by passing different file each time.
RE: What happens if RCP is disabled?

In such a case OSH has to perform the import and export every time the job runs, and the job's processing time also increases.
Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to a shared container input, then metadata will be propagated at run time, so there is no need to map it at design time.
If RCP is disabled for the job, OSH has to perform the import and export every time the job runs, and the processing time increases.
You then have to manually enter all the column descriptions in each stage. RCP = Runtime Column Propagation.

Question:

Source:
Eno   Ename
1     a,b
2     c,d
3     e,f

Target:
Eno   Ename
1     a
2     b
3     c


source has 2 fields like

COMPANY             LOCATION
IBM                  HYD
TCS                  BAN
IBM                  CHE
HCL                  HYD
TCS                  CHE
IBM                  BAN
HCL                  BAN
HCL                  CHE

LIKE THIS.......

THEN THE OUTPUT LOOKS LIKE THIS....

Company loc count

TCS  HYD  3
         BAN
         CHE
IBM  HYD  3
         BAN
         CHE
HCL  HYD  3
         BAN
         CHE
2)input is like this:
no,char
1,a
2,b
3,a
4,b
5,a
6,a
7,b
8,a



But the output should be in this form, with row numbering of duplicate occurrences:


output:

no,char,Count
"1","a","1"
"6","a","2"
"5","a","3"
"8","a","4"
"3","a","5"
"2","b","1"
"7","b","2"
"4","b","3"
3)Input is like this:
file1
10
20
10
10
20
30

Output is like:
file2               file3(duplicates)
10                  10
20                  10
30                  20

4)Input is like:
file1
10
20
10
10
20
30

Output is like Multiple occurrences in one file and single occurrences in one file:
file2      file3
10         30
10
10
20
20
5)Input is like this:
file1
10
20
10
10
20
30

Output is like:
file2               file3
10                  30
20
6)Input is like this:
file1
1
2
3
4
5
6
7
8
9
10

Output is like:
file2(odd)         file3(even)
1                      2
3                      4
5                      6
7                      8
9                      10
7) How to calculate Sum(sal), Avg(sal), Min(sal), Max(sal) without using the Aggregator stage?

8) How to find the first sal and last sal in each dept without using the Aggregator stage?

9) How many ways are there to perform the remove-duplicates function without using the Remove Duplicates stage?

Scenario: 

source has 2 fields like

COMPANY             LOCATION
IBM                  HYD
TCS                  BAN
IBM                  CHE
HCL                  HYD
TCS                  CHE
IBM                  BAN
HCL                  BAN
HCL                  CHE

LIKE THIS.......

THEN THE OUTPUT LOOKS LIKE THIS....

Company loc count

TCS  HYD  3
         BAN
         CHE
IBM  HYD  3
         BAN
         CHE
HCL  HYD  3
         BAN
         CHE

Solution:

Seqfile ------> Sort ------> Trans ------> RemoveDuplicates ------> Dataset

Sort:
Key = Company
Sort order = Asc
Create key change column = True

Trans:
Create a stage variable, Company1:
Company1 = If (in.keychange = 1) Then in.Location Else Company1 : ',' : in.Location
Drag and drop in the derivations:
Company  .................... Company
Company1 .................... Location

RemoveDup:
Key = Company
Duplicates To Retain = Last
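
Purely for intuition, here is a hedged pandas sketch of the same grouping idea (column names are taken from the scenario; this is not DataStage code): it concatenates the locations per company and counts them.

    import pandas as pd

    df = pd.DataFrame({"COMPANY": ["IBM", "TCS", "IBM", "HCL", "TCS", "IBM", "HCL", "HCL"],
                       "LOCATION": ["HYD", "BAN", "CHE", "HYD", "CHE", "BAN", "BAN", "CHE"]})

    out = (df.groupby("COMPANY")
             .agg(loc=("LOCATION", ",".join), count=("LOCATION", "size"))
             .reset_index())
    print(out)   # one row per company with its locations and their count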


11)The input is 
Shirt|red|blue|green 
Pant|pink|red|blue 


Output should be, 

Shirt:red 
Shirt:blue 
Shirt:green 
pant:pink 
pant:red 
pant:blue 

Solution:
It is the reverse of the Pivot stage.
Use: seq ------ sort ------ tr ---- rd ----- tr ---- tgt
In the Sort stage set Create Key Change Column = True.
In the first Transformer create a stage variable: if the key change column = 1 then the column value, else stagevar : ',' : the column value.
In the Remove Duplicates stage set Duplicates To Retain = Last.
In the second Transformer use the Field function to separate the columns.

Similar scenario:
source 
col1 col3 
1 samsung 
1 nokia 
1 ercisson 
2 iphone 
2 motrolla 
3 lava 
3 blackberry 
3 reliance 

Expected Output 
col 1 col2       col3   col4 
1     samsung nokia ercission 
2      iphone   motrolla 
3       lava     blackberry reliance

You can get it by using Sort stage --- Transformer stage --- RemoveDuplicates --- Transformer --- tgt

OK.

First read and load the data from your source file (for example, a Sequential File).

In the Sort stage select Create Key Change Column = True (to generate group ids).

Go to the Transformer stage and create one stage variable.

You can do this by right-clicking in the stage variables area, going to Properties, and naming it as you wish (for example, temp),

and in the expression write:

if the key change column = 1 then the column name else temp : ',' : the column name

This column name is the one you want in the required column, delimited with commas.

In the Remove Duplicates stage the key is col1; set Duplicates To Retain = Last.
In the second Transformer drop col3 and define 3 columns: col2, col3 and col4.
In the col2 derivation give Field(InputColumn, ",", 1),
in the col3 derivation give Field(InputColumn, ",", 2), and
in the col4 derivation give Field(InputColumn, ",", 3).

Scenario: 
12)Consider the following employees data as source? 
employee_id, salary 
------------------- 
10,          1000 
20,          2000 
30,          3000 
40,          5000 

 Create a job to find the sum of salaries of all employees and this sum should repeat for all the rows. 

The output should look like as 

employee_id, salary, salary_sum 
------------------------------- 
10,          1000,   11000 
20,          2000,   11000 
30,          3000,   11000 
40,          5000,   11000

Scenario: 

I have two source tables/files, numbered 1 and 2.
In the target, there are three output tables/files, numbered 3, 4 and 5.

The scenario is that:

to output 4 -> the records which are common to both 1 and 2 should go,

to output 3 -> the records which are only in 1 but not in 2 should go,

to output 5 -> the records which are only in 2 but not in 1 should go.
sltn:
src1 -----> copy1 ------> output_3 (records only in the left table)
                  \
                   Join (inner type) ----> output_4 (common records)
                  /
src2 -----> copy2 ------> output_5 (records only in the right table)


Consider the following employees data as source? 
employee_id, salary 
------------------- 
10,          1000 
20,          2000 
30,          3000 
40,          5000 

Scenario: 

 Create a job to find the sum of salaries of all employees and this sum should repeat for all the rows. 

The output should look like as 

employee_id, salary, salary_sum 
------------------------------- 
10,          1000,   11000 
20,          2000,   11000 
30,          3000,   11000 
40,          5000,   11000 
sltn:

Take the source ---> Transformer (add a new column on both output links and assign it the value 1), then:
1) Aggregator (group by that new column to get the sum)
2) Lookup/Join (join on that new column) --------> tgt
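
A hedged pandas sketch of the same idea (add a constant column, aggregate the sum on it, and join the total back so it repeats on every row); the names are illustrative only:

    import pandas as pd

    emp = pd.DataFrame({"employee_id": [10, 20, 30, 40],
                        "salary": [1000, 2000, 3000, 5000]})
    emp["grp"] = 1                                                   # the new constant column
    totals = (emp.groupby("grp", as_index=False)["salary"].sum()
                 .rename(columns={"salary": "salary_sum"}))          # the aggregator step
    result = emp.merge(totals, on="grp").drop(columns="grp")         # the lookup/join step
    print(result)                                                    # salary_sum = 11000 on every row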

Scenario: 
sno,sname,mark1,mark2,mark3 
1,rajesh,70,68,79 
2,mamatha,39,45,78 
3,anjali,67,39,78 
4,pavani,89,56,45 
5,indu,56,67,78 

The output is:
sno,sname,mark1,mark2,mark3,delimitercount
1,rajesh,70,68,79,4 
2,mamatha,39,45,78,4 
3,anjali,67,39,78,4 
4,pavani,89,56,45,4 
5,indu,56,67,78,4 

seq ---> trans ---> seq

Create one stage variable, for example delimiter,
and put this derivation on the stage variable: DSLink4.sno : "," : DSLink4.sname : "," : DSLink4.mark1 : "," : DSLink4.mark2 : "," : DSLink4.mark3


Then do the mapping and create one more column, count, as integer type,

and put this derivation on the count column: Count(delimiter, ",")
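
The same delimiter count can be sketched in Python for clarity (hypothetical row values):

    row = ["1", "rajesh", "70", "68", "79"]
    joined = ",".join(row)        # the concatenated stage-variable value
    print(joined.count(","))      # 4, the delimiter count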


scenario: 
       sname         total_vowels_count 
       Allen                          2 
       Scott                          1 
       Ward                         1 
Under Transformer Stage Description: 

total_Vowels_Count=Count(DSLink3.last_name,"a")+Count(DSLink3.last_name,"e")+Count(DSLink3.last_name,"i")+Count(DSLink3.last_name,"o")+Count(DSLink3.last_name,"u").
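
The equivalent check in Python, purely for illustration (not DataStage syntax):

    name = "Allen"
    total_vowels_count = sum(name.lower().count(v) for v in "aeiou")
    print(total_vowels_count)     # 2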

Scenario: 

1) Daily we are getting some huge files; the metadata of all the files is the same. How can we load them into the target table?
Use File Pattern in the Sequential File stage.

2) One column has 10 records; at run time we have to send the 5th and 6th records to the target. How can we send them?
This can be done by using a UNIX command in the Sequential File stage's filter option.

How can we get 18 months of date data in the Transformer stage?
Use a Transformer stage after the input Sequential File and try this as a constraint in the Transformer stage:

DaysSinceFromDate(CurrentDate(), DSLink3.date_18) <= 548 OR DaysSinceFromDate(CurrentDate(), DSLink3.date_18) <= 546

where date_18 is the column holding the date which needs to be less than or equal to 18 months old, and 548 is the number of days in 18 months (546 for a leap year; you need to check these numbers).
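
For a rough cross-check outside DataStage, a hedged Python sketch of the 18-month (about 548 days) filter; the variable names are made up:

    from datetime import date

    date_18 = date(2015, 5, 1)                     # hypothetical input date
    keep = (date.today() - date_18).days <= 548    # roughly 18 months
    print(keep)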

What is the difference between Force Compile and Compile?


Difference between Compile and Validate?
Compile only checks all mandatory requirements such as link requirements and stage options; it does not check whether the database connections are valid.
Validate is equivalent to running the job except for the extraction/loading of data; that is, the Validate option tests database connectivity by making connections to the databases.



How to FInd Out Duplicate Values Using Transformer?


You can capture the duplicate records based on keys using Transformer stage variables.

1. Sort and partition the input data of the transformer on the key(s) which defines the duplicate.
2. Define two stage variables, let's say StgVarPrevKeyCol(data type same as KeyCol) and StgVarCntr as Integer with default value 0
where KeyCol is your input column which defines the duplicate.

Expression for StgVarCntr(1st stg var-- maintain order):

If DSLinknn.KeyCol = StgVarPrevKeyCol Then StgVarCntr + 1 Else 1

Expression for StgVarPrevKeyCol(2nd stg var):

DSLinknn.KeyCol

3. Now in the constraint, filtering rows where StgVarCntr = 1 will give you the unique records, and filtering StgVarCntr > 1 will give you the duplicate records.
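
A minimal Python sketch of the two stage variables (the input is assumed to be sorted and partitioned on the key, as step 1 requires):

    rows = [10, 10, 20, 30, 30, 40]                  # key column values, already sorted
    prev_key, cntr = None, 0
    for key in rows:
        cntr = cntr + 1 if key == prev_key else 1    # StgVarCntr
        prev_key = key                               # StgVarPrevKeyCol
        print(key, cntr)                             # cntr = 1 -> first occurrence, > 1 -> duplicate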

My source is Like
Sr_no, Name
10,a
10,b
20,c
30,d
30,e
40,f

My target should look like:

Target 1 (only unique records, i.e. those which occur exactly once):
20,c
40,f

Target 2 (records which occur more than once):
10,a
10,b
30,d
30,e

How to do this in DataStage....

**************

Use Aggregator and Transformer stages:

source --> aggregator --> transformer --> target

Perform a count in the Aggregator, then take two output links in the Transformer: filter count > 1 for one link and count = 1 for the second link.
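
A hedged pandas sketch of the count-and-split logic (column names taken from the scenario above):

    import pandas as pd

    src = pd.DataFrame({"Sr_no": [10, 10, 20, 30, 30, 40],
                        "Name": ["a", "b", "c", "d", "e", "f"]})
    counts = src.groupby("Sr_no")["Sr_no"].transform("size")   # the aggregator count
    target1 = src[counts == 1]                                 # keys occurring exactly once
    target2 = src[counts > 1]                                  # keys occurring more than once
    print(target1)
    print(target2)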
Scenario:
In my input source I have N number of records.
In the output I have 3 targets.
I want the output such that the 1st record goes to the 1st target,
the 2nd record goes to the 2nd target,
the 3rd record goes to the 3rd target,
and the 4th record goes to the 1st target again ... and so on.
Do this "without using partitioning techniques" - remember it.

*****************
source ---> trans ---> target
In the Transformer use these conditions in the constraints:
mod(empno,3)=1
mod(empno,3)=2
mod(empno,3)=0
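
A small Python sketch of the mod-based routing (it assumes empno is a continuous sequence, which is the implicit assumption behind the constraints above):

    targets = {0: [], 1: [], 2: []}
    for empno in range(1, 10):             # hypothetical record numbers
        targets[empno % 3].append(empno)   # mod(empno,3) picks the output link
    print(targets[1])                      # 1st, 4th, 7th record ...
    print(targets[2])                      # 2nd, 5th, 8th record ...
    print(targets[0])                      # 3rd, 6th, 9th record ...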
Scenario:
I have input like:
colA
a_b_c
x_F_I
DE_GH_IF

We have to make it:

col1  col2  col3
a     b     c
x     f     i
de    gh    if


*********************

Transformer
create 3 columns with derivation
col1 Field(colA,'_',1)
col2 Field(colA,'_',2)
col3 Field(colA,'_',3)

**************
Field function divides the column based on the delimiter,
if the data in the col is like A,B,C
then
Field(col,',',1) gives A
Field(col,',',2) gives B
Field(col,',',3) gives C


Friday, October 23, 2015

Compile a Datastage job/multiple jobs through Command Line

Question

How do you compile all jobs in a project via command line?

Answer

Perform the following:
  1. Open a DOS command prompt on the client
  2. Run the following from C:\IBM\InformationServer\Clients\Classic:

    dscc.exe /H hostname /U username /P password project_name /J jobname

    Note: The job name can be * to compile all the jobs in the Project.

Note: to compile a single job, specify its name in place of *.
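
If you want to script this, here is a hedged Python sketch that shells out to dscc.exe; the host, credentials, project name and install path are placeholders you would adjust:

    import subprocess

    dscc = r"C:\IBM\InformationServer\Clients\Classic\dscc.exe"
    cmd = [dscc, "/H", "hostname", "/U", "username", "/P", "password",
           "project_name", "/J", "*"]    # * compiles all jobs in the project
    subprocess.run(cmd, check=True)      # raises if dscc returns a non-zero exit code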

Thursday, October 22, 2015

Datastage All Stages

Datastage parallel stages groups

DataStage and QualityStage stages are grouped into the following logical sections: 
  • General objects
  • Data Quality Stages
  • Database connectors
  • Development and Debug stages
  • File stages
  • Processing stages
  • Real Time stages
  • Restructure Stages
  • Sequence activities
Please refer to the list below for a description of the stages used in DataStage and QualityStage. 
We classified all stages in order of importance and frequency of use in real-life deployments (and also on certification exams). The most widely used stages are marked bold, or there is a link available to a subpage with a detailed description and examples.

DataStage and QualityStage parallel stages and activities

General stages

General elements





  • Link indicates a flow of the data. There are three main types of links in Datastage: stream, reference and lookup.

    • Container (can be private or shared) - the main outcome of having containers is to simplify visually a complex datastage job design and keep the design easy to understand.
    • Annotation is used for adding floating datastage job notes and descriptions on a job canvas. Annotations provide a great way to document the ETL process and help understand what a given job does.
    • Description Annotation shows the contents of a job description field. One description annotation is allowed in a datastage job.


    Debug and development stages

    Debug and development stages

    • Row generator produces a set of test data which fits the specified metadata (can be random or cycled through a specified list of values). Useful for testing and development. Click here for more..
    • Column generator adds one or more column to the incoming flow and generates test data for this column.
    • Peek stage prints record column values to the job log, which can be viewed in Director. It can have a single input link and multiple output links. Click here for more..
    • Sample stage samples an input data set. Operates in two modes: percent mode and period mode.
    • Head selects the first N rows from each partition of an input data set and copies them to an output data set.
    • Tail is similar to the Head stage. It selects the last N rows from each partition.
    • Write Range Map writes a data set in a form usable by the range partitioning method.
    Processing stages

    Processing stages

    • Aggregator joins data vertically by grouping the incoming data stream and calculating summaries (sum, count, min, max, variance, etc.) for each group. The data can be grouped using two methods: hash table or pre-sort. Click here for more..
    • Copy - copies input data (a single stream) to one or more output data flows
    • FTP stage uses the FTP protocol to transfer data to a remote machine
    • Filter filters out records that do not meet specified requirements. Click here for more..
    • Funnel combines multiple streams into one. Click here for more..
    • Join combines two or more inputs according to values of a key column(s). Similar concept to a relational DBMS SQL join (ability to perform inner, left, right and full outer joins). Can have 1 left and multiple right inputs (all need to be sorted) and produces a single output stream (no reject link). Click here for more..
    • Lookup combines two or more inputs according to values of a key column(s). The Lookup stage can have 1 source and multiple lookup tables. Records don't need to be sorted; it produces a single output stream and a reject link. Click here for more..
    • Merge combines one master input with multiple update inputs according to values of a key column(s). All inputs need to be sorted and unmatched secondary entries can be captured in multiple reject links. Click here for more..
    • Modify stage alters the record schema of its input dataset. Useful for renaming columns, non-default data type conversions and null handling
    • Remove duplicates stage needs a single sorted data set as input. It removes all duplicate records according to a specification and writes to a single output
    • Slowly Changing Dimension automates the process of updating dimension tables where the data changes over time. It supports SCD type 1 and SCD type 2. Click here for more..
    • Sort sorts input columns. Click here for more..
    • Transformer stage handles extracted data, performs data validation, conversions and lookups. Click here for more..
    • Change Capture - captures the before and after state of two input data sets and outputs a single data set whose records represent the changes made.
    • Change Apply - applies the change operations to a before data set to compute an after data set. It gets data from a Change Capture stage
    • Difference stage performs a record-by-record comparison of two input data sets and outputs a single data set whose records represent the difference between them. Similar to the Change Capture stage.
    • Checksum - generates a checksum from the specified columns in a row and adds it to the stream. Used to determine if there are differences between records.
    • Compare performs a column-by-column comparison of records in two presorted input data sets. It can have two input links and one output link.
    • Encode encodes data with an encoding command, such as gzip.
    • Decode decodes a data set previously encoded with the Encode Stage.
    • External Filter permits specifying an operating system command that acts as a filter on the processed data
    • Generic stage allows users to call an OSH operator from within DataStage stage with options as required.
    • Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row to a single column in multiple output rows. Pivoting data results in obtaining a dataset with fewer number of columns but more rows.
    • Surrogate Key Generator generates surrogate key for a column and manages the key source.
    • Switch stage assigns each input row to an output link based on the value of a selector field. Provides a similar concept to the switch statement in most programming languages.
    • Compress - packs a data set using a GZIP utility (or compress command on LINUX/UNIX)
    • Expand extracts a previously compressed data set back into raw binary data.
    File stages

    File stage types

    • Sequential file is used to read data from or write data to one or more flat (sequential) files. Click here for more..
    • Data Set stage allows users to read data from or write data to a dataset. Datasets are operating system files, each of which has a control file (.ds extension by default) and one or more data files (unreadable by other applications). Click here for more info
    • File Set stage allows users to read data from or write data to a fileset. Filesets are operating system files, each of which has a control file (.fs extension) and data files. Unlike datasets, filesets preserve formatting and are readable by other applications.
    • Complex flat file allows reading from complex file structures on a mainframe machine, such as MVS data sets, header and trailer structured files, files that contain multiple record types, QSAM and VSAM files. Click here for more info.
    • External Source - permits reading data that is output from multiple source programs.
    • External Target - permits writing data to one or more programs.
    • Lookup File Set is similar to the File Set stage. It is a partitioned hashed file which can be used for lookups.
    Database stages

    Database stages


    • Oracle Enterprise allows reading data from and writing data to an Oracle database (database version from 9.x to 10g are supported).
    • ODBC Enterprise permits reading data from and writing data to a database defined as an ODBC source. In most cases it is used for processing data from or to Microsoft Access databases and Microsoft Excel spreadsheets.
    • DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.
    • Teradata permits reading data from and writing data to a Teradata data warehouse. Three Teradata stages are available: Teradata connector, Teradata Enterprise and Teradata Multiload
    • SQLServer Enterprise permits reading data from and writing data to Microsoft SQL Server 2005 and 2008 databases.
    • Sybase permits reading data from and writing data to Sybase databases.
    • Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target (pass a row to a stored procedure to write) or a transform (to invoke procedure processing within the database).
    • MS OLEDB helps retrieve information from any type of information repository, such as a relational source, an ISAM file, a personal database, or a spreadsheet.
    • Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or writing to a number of different supported relational DB engines using native interfaces, such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.
    • Informix (CLI or Load)
    • DB2 UDB (API or Load)
    • Classic federation
    • RedBrick Load
    • Netezza Enterprise
    • iWay Enterprise
    Real time

    Real Time stages

    • XML Input stage makes it possible to transform hierarchical XML data to flat relational data sets
    • XML Output writes tabular data (relational tables, sequential files or any datastage data streams) to XML structures
    • XML Transformer converts XML documents using an XSLT stylesheet
    • Websphere MQ stages provide a collection of connectivity options to access IBM WebSphere MQ enterprise messaging systems. There are two MQ stage types available in DataStage and QualityStage: WebSphere MQ connector and WebSphere MQ plug-in stage.
    • Web services client
    • Web services transformer
    • Java client stage can be used as a source stage, as a target and as a lookup. The java package consists of three public classes: com.ascentialsoftware.jds.Column, com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage
    • Java transformer stage supports three links: input, output and reject.
    • WISD Input - Information Services Input stage
    • WISD Output - Information Services Output stage
    Restructure

    Restructure stages

    • Column export stage exports data from a number of columns of different data types into a single column of data type ustring, string, or binary. It can have one input link, one output link and a rejects link. Click here for more..
    • Column import complementary to the Column Export stage. Typically used to divide data arriving in a single column into multiple columns.
    • Combine records stage combines rows which have identical keys, into vectors of subrecords.
    • Make subrecord combines specified input vectors into a vector of subrecords whose columns have the same names and data types as the original vectors.
    • Make vector joins specified input columns into a vector of columns
    • Promote subrecord - promotes input subrecord columns to top-level columns
    • Split subrecord - separates an input subrecord field into a set of top-level vector columns
    • Split vector promotes the elements of a fixed-length vector to a set of top-level columns
    Data quality QualityStage

    Data quality QualityStage stages

    • Investigate stage analyzes data content of specified columns of each record from the source file. Provides character and word investigation methods.
    • Match frequency stage takes input from a file, database or processing stages and generates a frequency distribution report.
    • MNS - multinational address standardization.
    • QualityStage Legacy
    • Reference Match
    • Standardize
    • Survive
    • Unduplicate Match
    • WAVES - worldwide address verification and enhancement system.


    Sequence activities

    Sequence activity stage types

    • Job Activity specifies a Datastage server or parallel job to execute.
    • Notification Activity - used for sending emails to user defined recipients from within Datastage
    • Sequencer used for synchronization of a control flow of multiple activities in a job sequence.
    • Terminator Activity permits shutting down the whole sequence once a certain situation occurs.
    • Wait for file Activity - waits for a specific file to appear or disappear and launches the processing.
    • EndLoop Activity
    • Exception Handler
    • Execute Command
    • Nested Condition
    • Routine Activity
    • StartLoop Activity
    • UserVariables Activity    
