Data cleaning and data loading in IBM style
The DataStage & QualityStage components are part of the IBM InfoSphere Information Server platform, so they are available as a standard product.
Although they form one product with one graphical interface, DataStage and QualityStage serve two different purposes.
DataStage is a data integration tool that plays a key role in the processing of data. It can be used to develop jobs that support data movement and data transformation. Data movement can be implemented similarly to the Data Replication tool between source and target databases, but the heart of the product is the data transformation that can be associated with the data movement.
The jobs created in the graphical interface can support both ETL (extract-transform-load) and ELT (extract-load-transform) processes. During transformation processing, the system can transfer data from different sources to a shared database, assemble related data based on primary and foreign keys, and make the necessary changes to the data. On both the source and the target side, the tool can work with a variety of systems; for example, it can connect directly to the applications the company uses, even if they are live systems.
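To make the idea of assembling related data on primary and foreign keys concrete, here is a minimal sketch in plain Python. It is not DataStage code and does not use any DataStage API; the sample records, the table name order_facts, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the shared target database.

```python
import sqlite3

# Hypothetical source data: two separate "systems" holding related records.
customers = [  # primary key: customer_id
    {"customer_id": 1, "name": " alice smith "},
    {"customer_id": 2, "name": "BOB JONES"},
]
orders = [  # foreign key: customer_id
    {"order_id": 10, "customer_id": 1, "amount": "19.90"},
    {"order_id": 11, "customer_id": 2, "amount": "5.00"},
    {"order_id": 12, "customer_id": 1, "amount": "7.10"},
]


def transform(customers, orders):
    """Join orders to customers on customer_id and normalize the fields."""
    by_id = {c["customer_id"]: c for c in customers}
    rows = []
    for o in orders:
        customer = by_id.get(o["customer_id"])
        if customer is None:
            continue  # skip orders whose foreign key has no matching customer
        rows.append((
            o["order_id"],
            customer["name"].strip().title(),  # apply a naming standard
            float(o["amount"]),                # text to numeric
        ))
    return rows


def load(rows, connection):
    """Write the transformed rows to the shared target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS order_facts "
        "(order_id INTEGER PRIMARY KEY, customer_name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO order_facts VALUES (?, ?, ?)", rows)
    connection.commit()


target = sqlite3.connect(":memory:")  # stand-in for the shared database
load(transform(customers, orders), target)
loaded = target.execute(
    "SELECT order_id, customer_name, amount FROM order_facts ORDER BY order_id"
).fetchall()
print(loaded)
```

In this sketch the transformation happens before the load, which corresponds to the ETL ordering; an ELT variant would load the raw rows first and run the join inside the target database.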
Transformation jobs consist of processing steps and the links between them. The processing steps may define data sources, transformation steps or target systems. The transformation steps specify the mandatory data modifications that bring the data into the format defined by the business.
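The structure of processing steps connected by links can be pictured with a small Python sketch. This is only an analogy, not how DataStage jobs are actually defined: the three functions and the sample records are hypothetical, and the "links" are simply the order in which one step feeds the next.

```python
def source_stage():
    """Source step: emits raw records (hard-coded here for illustration)."""
    yield {"id": "1", "country": "hu", "joined": "2019/03/07"}
    yield {"id": "2", "country": "De", "joined": "2020/11/23"}


def transform_stage(records):
    """Transformation step: enforces the format the business asked for."""
    for record in records:
        yield {
            "id": int(record["id"]),                       # text to integer
            "country": record["country"].upper(),          # upper-case codes
            "joined": record["joined"].replace("/", "-"),  # dashes in dates
        }


def target_stage(records, sink):
    """Target step: appends records to the sink and reports how many."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


# Each call passes its output to the next step, like a link between steps.
sink = []
written = target_stage(transform_stage(source_stage()), sink)
print(written, sink)
```

Because the steps are generators, each record flows through the whole chain before the next one is read, which loosely mirrors how data streams along the links of a job.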
One of the biggest advantages of DataStage is that it can transform data originating from different sources according to corporate standards, making it available to users in a standardized format.
QualityStage is a data cleansing tool that helps you achieve pre-defined data quality goals. It can be used to develop jobs that eliminate redundancies in data and detect obsolete or inaccurate data formats, helping users obtain reliable, high-quality data.
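As a rough illustration of what eliminating redundancies and detecting inaccurate formats can mean in practice, the following Python sketch standardizes a few records, rejects those with a malformed e-mail address, and keeps one record per address. It does not reproduce QualityStage's own rule sets or matching logic; the field names, the validation pattern, and the sample data are assumptions made for this example.

```python
import re


def standardize(record):
    """Return a copy of the record with name, email, and phone normalized."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "phone": re.sub(r"\D", "", record["phone"]),  # keep digits only
    }


def is_valid(record):
    """Very rough format check; real rules would be far stricter."""
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]))


def cleanse(records):
    """Standardize, set aside invalid rows, keep the first record per email."""
    seen = set()
    clean, rejected = [], []
    for raw in records:
        record = standardize(raw)
        if not is_valid(record):
            rejected.append(raw)
        elif record["email"] not in seen:
            seen.add(record["email"])
            clean.append(record)
        # otherwise: a duplicate of an already kept record, so it is dropped
    return clean, rejected


raw_records = [
    {"name": "anna  kovacs", "email": "Anna.Kovacs@example.com ", "phone": "+36 1 555-0100"},
    {"name": "Anna Kovacs", "email": "anna.kovacs@example.com", "phone": "3615550100"},
    {"name": "Peter Nagy", "email": "peter.nagy(at)example.com", "phone": "n/a"},
]
clean, rejected = cleanse(raw_records)
print(clean)
print(rejected)
```

The first two records only become recognizable as duplicates after standardization, which is why the cleanup step runs before the comparison.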
QualityStage is also accessible through a graphical interface and, like DataStage, can be configured to run data cleansing jobs on parallel processing threads, which helps jobs execute in a time-efficient manner. All these resources are operated by the IBM InfoSphere Information Server Engine.
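The benefit of running a cleansing job on parallel threads can be sketched in Python as follows. This is a simplified illustration rather than a description of how the Information Server engine schedules work: the round-robin split, the worker count, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, parts):
    """Split rows round-robin into `parts` lists."""
    return [rows[i::parts] for i in range(parts)]


def cleanse_partition(rows):
    """Per-partition work: trim and upper-case each value."""
    return [row.strip().upper() for row in rows]


def run_parallel(rows, parts=3):
    """Cleanse each partition on its own worker thread, then collect."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = pool.map(cleanse_partition, partition(rows, parts))
    # Flatten the per-partition results back into a single list.
    return [row for part in results for row in part]


rows = [" budapest", "vienna ", " prague ", "berlin", "warsaw "]
result = run_parallel(rows)
print(sorted(result))
```

Each partition is processed independently, so the same cleansing logic can be applied to several slices of the data at once. In CPython, threads mainly show the structure here; CPU-heavy work would more likely use separate processes.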
For this to work, the appropriate security rules must be applied and the source and target systems must be discoverable to IBM InfoSphere Information Server, so that the defined processes can run without problems and ultimately make changes to the designated system or database.
The jobs created by users can be saved in the Director interface of the product, where their schedule settings can be configured. With scheduled runs, users can always work with up-to-date data. The job processing logs can be found in the Log Manager and are available after each scheduled run.