Notes on 70-463 Implementing a Data Warehouse with Microsoft SQL Server 2012 | Joe Hunter
back to joeblogs.ie
Section One: The Logical Data Warehouse
Section Two: The Physical Data Warehouse
Section Three: Connection managers, Control flow, Containers and precedence Constraints
Section Four: All About Data Flow: Data Sources, Transformations and Destinations
Section Five: Variables, Expressions and Master Packages
Section Six: Enhancing the Data Flow
Section Seven: Robust & Restartable
Section Eight: Dynamism
Section Nine: Logging, Auditing & Package Templates
Section Ten: SSIS Installation & SSIS Deployment
Section Eleven: Executing & Securing Packages
Section Twelve: Troubleshooting & Performance Tuning
Section Thirteen: Data Quality Services (DQS)
Section Fourteen: Master Data Implementation
Section Fifteen: Master Data Management
Section Sixteen: Cleaning Data in a Data Quality Project
Section Seventeen: How To Train Your Data - Data Mining
Section Eighteen: Scripting
Section Nineteen: Cleaning Data by Identity Mapping and De-duplicating
I would recommend watching the video below before you begin 70-463.
Section One: The Logical Data Warehouse
1.1) The Star & Snowflake Schema
This section pertains to the logical requirements when designing a data warehouse.
- Star what...? The star schema is in many ways the de facto standard for logical data warehouse design. Its name is drawn from the formation of the tables included in its schema: a single fact table at the centre and several dimension tables surrounding it as illustrated below.
- What is a fact table? A fact table is connected to all dimensions via foreign keys. Occasionally, all foreign keys in the fact table, when combined, uniquely identify a row in the table. But this is not a requirement, as a simpler surrogate key can be used to serve the same purpose. More Details here on Fact Tables
- But why call it a fact table? Consider a single row in a fact table: it is possible to form a proposition from this row. Example: patient A had operation B performed on date D for a duration of E. Such a proposition is a fact.
- So in a data warehouse there is only one star schema? No, there are multiple star schemas in a data warehouse, and the power of the star schema comes from the interconnectedness of the dimensions across the star schemas. This interconnectedness is more formally defined as shared or conformed dimensions. An analogy for this is the cogset on a bicycle with multiple gears: the gears are all interconnected and give the bicycle much more flexibility by allowing it to switch between them.
- Does the number of dimensions in a fact table matter? Yes, the number of dimensions is a measure of the granularity of analysis available. This is also called the granularity of a star schema.
- What is a snowflake schema then? The snowflake schema is similar to the star schema in that it has a central fact table. Where it differs is in the way the dimension tables are normalized. In the snowflake schema the dimension tables are closer to a normalized database, in that some dimensions have a foreign key that references a lookup table. This lookup table may in turn have a foreign key that references a second-level lookup table.
- When should I use a star schema or a snowflake schema? The snowflake schema is more suitable for small proof of concept projects where you can transition from an OLTP normalized schema to data warehouse snowflake schema quickly. For a long term data warehouse solution, a star schema is more suitable. It is standardized and narrative; you can find the information you need quickly as it is simpler and faster to query by requiring fewer joins.
1.2) Dimension Tables
- What is a dimension table? A dimension table stores the attributes belonging to a foreign key in the
fact table. For example if a foreign key in the fact table refers to a patient's ID and in a dimension table that
stores patient diagnoses called dimDiagnoses, by following this patient ID from the fact table to dimDiagnoses
we can identify all diagnoses for this patient. In this way dimensions give context to measures.
Reference: More Info Here
- So does a dimension table contain only attribute columns?
No a dimension table can contain different column types. As mentioned in the example above they must contain a key to
identify each row. But also may contain the following column types:
|Dimension Column Type||Details|
|Keys||Identify each row; as mentioned above, a surrogate key is typically used.|
|Attributes||The most important column type, used for pivoting.|
|Member Properties||For labelling on reports.|
|Name Columns||For naming entities.|
|Lineage Columns||For auditing, never revealed to end users.|
- Attributes can form hierarchies? Hierarchies are important in that they provide a drill-down feature for users to navigate through the data. In star schemas, natural hierarchies can be extracted from the names and contents of columns. Snowflake schemas make it easier to identify hierarchies by following the lookup tables.
- Making History, the Slowly Changing Dimensions (SCD) Problem
Maintaining history is one of the requirements of a data warehouse. If a customer moves address, should all of their existing
records be updated to reflect the new address (SCD Type 1)? Or should the existing records be kept intact with the old address and
a new row added to handle the new address (SCD Type 2)? Or finally should the existing rows be expanded by adding new columns to show when
the patient started living in the old address and moved to the new address (SCD Type 3)? Which SCD type is used depends on the requirements for the data warehouse.
1.3) Fact Tables
- What are the column types in a Fact Table? There are fewer column types in a fact table than in a dimension table. The most
important fact column type is a Measure.
|Fact Column Type||Details|
|Keys||Mainly foreign keys from the dimension tables; the business key from the primary source table can also be included.|
|Lineage Columns||As per dimension tables, for lineage and never exposed to end users.|
|Measures||The most important column type, usually numeric so they can be aggregated.|
- Additivity of Measures
At first this is a tricky concept to grasp. But through examples it becomes much clearer. There are
three types of additivity: Additive, Semi-Additive and non-Additive.
Additive Measures: These measures can be summarised across all dimensions e.g. Product sales
Semi-Additive Measures: Measures that can be aggregated over all dimensions except the time dimension are semi-additive e.g. a bank account balance.
Non-Additive Measures: Prices and percentages are examples of non-additive measures. They are not additive over any dimension.
Reference: More Info Here On Additive Measures
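As a hedged T-SQL illustration of the difference, assuming hypothetical FactSales and FactAccountBalance tables:

```sql
-- Additive: sales amounts can be summed over every dimension, including date.
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM   FactSales
GROUP BY ProductKey;

-- Semi-additive: balances can be summed across accounts, but summing across
-- dates is meaningless -- instead pick a single snapshot date.
SELECT AccountKey, SUM(Balance) AS TotalBalance
FROM   FactAccountBalance
WHERE  SnapshotDateKey = 20120630   -- one point in time
GROUP BY AccountKey;
```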
The embedded video below from Microsoft Virtual Academy reiterates a lot of the content in this section and is very useful.
Section Two: The Physical Data Warehouse
Section one encompassed the logical elements of implementing a data warehouse. This section, section two, looks at the physical elements of implementing a data warehouse.
2.1) Physically Implementing Dimension & Fact Tables
First things first the database to house the data warehouse has to be created. In doing so the following has to be considered:
- What recovery model should be used?
The DW contains read-only data and there is no requirement to back up the transaction log, so the most appropriate recovery model is the Simple recovery model.
- Autoshrink and autogrowth...do I leave these on?
Both of these should be switched off and the database should be sized appropriately to allow for future growth.
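A minimal T-SQL sketch of these settings; the database name, file paths and sizes are illustrative only:

```sql
-- Create the DW database pre-sized to allow for future growth,
-- so autogrowth events are avoided in normal operation.
CREATE DATABASE DWTest
ON PRIMARY
    (NAME = N'DWTest_Data', FILENAME = N'D:\Data\DWTest.mdf', SIZE = 10GB)
LOG ON
    (NAME = N'DWTest_Log',  FILENAME = N'L:\Log\DWTest.ldf',  SIZE = 1GB);

-- Simple recovery: the log is truncated at checkpoint; appropriate because
-- the DW can be reloaded from source systems rather than log backups.
ALTER DATABASE DWTest SET RECOVERY SIMPLE;

-- Autoshrink off; rely on appropriate sizing instead.
ALTER DATABASE DWTest SET AUTO_SHRINK OFF;
```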
Implementing the Dimension Table
If you are supporting Type 2 SCD in your dimension table, this will impose additional requirements to facilitate this, namely:
- Surrogate Keys (use identity or sequences to generate these)
- A current row flag column or two date validity columns are also required
In addition computed columns can be used to discretize continuous values in source columns.
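Putting these requirements together, a Type 2 dimension table might be sketched as follows (all table and column names are assumptions):

```sql
-- Illustrative Type 2 customer dimension.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerBK   INT            NOT NULL,                -- business key from source
    FullName     NVARCHAR(100)  NOT NULL,
    Age          TINYINT        NULL,
    -- computed column discretizing the continuous Age value
    AgeGroup AS (CASE WHEN Age < 30 THEN N'Young'
                      WHEN Age < 60 THEN N'Middle'
                      ELSE N'Senior' END),
    -- SCD Type 2 support: a current-row flag...
    IsCurrent    BIT  NOT NULL DEFAULT (1),
    -- ...and/or a pair of date validity columns
    ValidFrom    DATE NOT NULL,
    ValidTo      DATE NULL
);
```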
Implementing the Fact Table
As Fact tables are on the many side of the fact to dimension relationship, they should be implemented after dimensions. This facilitates the creation of foreign key constraints. Columns in a fact table include foreign keys from the dimension tables and measures. It is recommended that you partition a large fact table to improve maintenance and performance.
2.2) Data Warehouse (DW) Performance
Invariably Data Warehouses are large and contain a lot of data. This volume of data can present performance problems if not properly indexed and compressed. This section details some of the indexing techniques to adopt in the DW and how to avail of compression features. Before embarking on indexing and compression though, it is worthwhile reviewing all DW queries. There is usually scope for improvement here by using appropriate query techniques.
Indexing DW Tables
- Every table should have a clustered index.
- DW surrogate keys are very suitable candidates for a clustered index.
- An integer autonumbering surrogate key is best if possible. If unsure which column to choose, use the SSIS Data Profiling Task
- Give careful consideration to non-clustered indexes, DW queries are not very selective so may not benefit from them
Indexed views can also improve DW query performance, but before using them check:
1) Are they supported in the version of SQL Server being used?
2) Does the query you want to wrap in an indexed view meet all pre-requisites (there are a lot)?
Columnstore indexes can provide a dramatic improvement in DW queries. At first they can be difficult to understand coming from a row-based index. There is an excellent primer on columnstore indexes here. There are some restrictions with columnstore indexes:
- Tables containing them become read-only, so not recommended for tables with frequent updates.
- Only one is permitted per table
- When partitioned, columnstore indexes must be partition-aligned
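A sketch of creating a columnstore index on a hypothetical fact table, along with the disable/rebuild pattern needed because the table becomes read-only:

```sql
-- Nonclustered columnstore index over the fact columns (SQL Server 2012 syntax).
-- Table and column names are illustrative.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (DateKey, ProductKey, CustomerKey, SalesAmount, Quantity);

-- The table is now read-only. To load new data: disable, insert, rebuild.
ALTER INDEX NCCI_FactSales ON dbo.FactSales DISABLE;
-- ... load the data here ...
ALTER INDEX NCCI_FactSales ON dbo.FactSales REBUILD;
```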
As per indexed views, compression support depends on the SQL Server edition and in SQL 2012 it is
largely confined to Enterprise edition. But if Enterprise edition is available to you, the following compression features can be availed of. Due to the decompress and re-compress overhead when updating data, compression is not optimal for OLTP databases. However, DWs are mostly read, so a lot can be gained from activating compression in a data warehouse.
- Row Compression: Stores fixed length data type columns in a variable length format.
- Page Compression: Includes the row compression just mentioned, but adds prefix and dictionary compression.
- Unicode Compression: Unicode uses 2-bytes per character, compression halves this by using 1-byte per character.
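As a sketch, page compression could be enabled on a hypothetical fact table like so; sp_estimate_data_compression_savings lets you gauge the benefit before committing:

```sql
-- Estimate the benefit first with the built-in procedure.
EXEC sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'FactSales',   -- hypothetical fact table
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'PAGE';

-- Enable page compression (Enterprise edition); this rebuilds the heap or
-- clustered index with the new compression setting.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);
```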
Partitions help to mitigate the problems involved in inserting, updating or deleting data in large fact tables. By partitioning the table, separate sections can be operated on at a time without impacting the entire table. As per the recommendation for adding columnstore indexes on large fact tables, these tables should also be partitioned.
Partition concepts are explained below:
- Partition Function: Logical operation, which partition should a row be mapped to. The partitioning column is usually used to determine this.
- Partition Scheme: Physical operation, which filegroup should a partition be saved onto.
- Partition Switching: Switches a block of data from one table or partition to another table or partition. More Details Here
- Partition Elimination: This is a query optimizer technique where partitions can be eliminated from a query, which helps avoid lengthy table scans.
- Index Alignment: This is where indexes are partitioned on the same partition scheme as their base tables. Columnstore indexes must be aligned with their base tables.
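The partition function, scheme and switching concepts above can be sketched as follows (boundary values, filegroups and table names are illustrative):

```sql
-- Logical: which partition should a row map to? Boundaries on the
-- partitioning column (an integer date key here).
CREATE PARTITION FUNCTION pfDate (INT)
AS RANGE LEFT FOR VALUES (20101231, 20111231, 20121231);

-- Physical: which filegroup should each partition be stored on?
CREATE PARTITION SCHEME psDate
AS PARTITION pfDate ALL TO ([PRIMARY]);

-- The fact table is created on the partition scheme.
CREATE TABLE dbo.FactSalesPart
(
    DateKey     INT   NOT NULL,
    ProductKey  INT   NOT NULL,
    SalesAmount MONEY NOT NULL
) ON psDate (DateKey);   -- DateKey is the partitioning column

-- Partition switching: move a staged block of data in as a
-- metadata-only operation (the stage table must match exactly).
-- ALTER TABLE dbo.StageSales SWITCH TO dbo.FactSalesPart PARTITION 3;
```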
Section Three: Connection managers, Control flow, Containers and precedence Constraints.
3.1) Connection Managers
|Connection Manager||Details|
|ADO.NET||Connects to data stores using a .NET provider. Use this connection manager rather than OLE DB when connecting to SQL Server stored procedures, as it allows you to use parameter names in queries and has support for more data types e.g. XML, Binary etc.|
|Flat File||Connects to flat files with delimited (CSV) and fixed width supported.|
|ODBC||Supports the ODBC specification. This is important as Microsoft has announced it is replacing support for OLE DB with ODBC.|
Connection Manager Scope There are two options: package-scoped and project-scoped connection managers. Much like variable scope, the former are only available in the context of the SSIS package in which they are created and cannot be used outside of this.
The latter, Project-scoped connection managers, are available to all packages within the project.
32-bit versus 64-bit Data Providers In a production environment, the version of data provider used is determined by the underlying operating system. However the development environment is a 32-bit environment. It is also important to note that some data providers do not have both a 32-bit and 64-bit version.
Parameterization Parameterize, parameterize, parameterize! Any setting that is environment-dependent should be parameterized. In the current context, connection strings should be parameterized, including the InitialCatalog and instance/server name.
3.2) Control Flow Tasks & Containers
Control flow tasks represent the operations performed in a control flow. Each control flow task represents a single logical operation. Tasks can be grouped based on the functionality they provide:
|WorkFlow Tasks||As the name implies these tasks enable workflow for example the execute package task, as explained below|
|Analysis Services Tasks||If you have SSAS databases, these tasks help with the administration of SSAS objects. An example task here is the Data Mining Query Task.|
|Data Movement Tasks||The most important task in this group bar none is the Data Flow task which is discussed thoroughly in the next section.|
|Data Preparation Tasks||In some scenarios data has to be prepared prior to processing, for example a file has to be copied into a loading area using the File System Task.|
|SQL Server Maintenance Tasks||Backing up databases, Shrink Database Task (avoid!), Rebuild Index Task: all of these maintenance plans can be implemented in SSIS packages.|
|SQL Server Administration Tasks||Transfers are the order of the day for this group. For example Transfer Database Task, Transfer Logins Task etc|
Containers are a lot like classes in object oriented (OO) programming and a lot of the same concepts from OO apply to containers.
For instance the concept of encapsulation, all tasks and variables in a container serve a single purpose and are executed in accordance
with that purpose without any dependencies on other objects, everything is encapsulated within the container.
There are three types of containers in SSIS.
1) ForEach Loop: Enumerate! This container is used for operating over an enumerable collection of items e.g. files in a folder.
2) For Loop: Similar to a for loop in most programming languages, this container executes tasks repeatedly while the loop condition is satisfied.
3) Sequence: A bit like a database schema used to group tables and views together the sequence container has no programmatic logic other than to group tasks together that encapsulate a logical unit.
3.3) Precedence Constraints
Which task gets executed first, and if it succeeds which gets executed next, or if it fails what happens then? To this end an object called a precedence constraint can be used. A precedence constraint is effectively a connector between two distinct tasks that is used to determine what happens next once the first task has completed.
There are three types of precedence constraint:
- Success: If the preceding task completed WITHOUT error then the following task will be executed.
- Failure: If the preceding task completed WITH error then the following task will be executed.
- Completion: The following task is always executed irrespective of the status of the preceding task.
Section Four: All About Data Flow: Data Sources, Transformations and Destinations.
The data flow is essentially the engine that provides the Extract, Transform and Load functionality of SSIS. It utilises an in-memory, buffer-oriented
architecture to move data from a source and then, depending on requirements, through transformations to a destination.
This hands on tutorial is very useful as it goes through all of the major parts of this section.
4.1) Data Flow Source Adapters: The E in ETL
In any ETL process data has to be extracted from various sources. These sources can be in multiple formats so to extract data from these disparate formats there are different flavours of data source adapters as shown in the table below.
|CDC||New in SQL 2012, it only retrieves changed data from an insert/update/delete operation. It uses ADO.NET connection provider.|
|ODBC||New in SQL 2012, it uses native ODBC connections. (As mentioned previously, Microsoft is moving towards ODBC for relational database access over OLE DB.)|
|Raw File*||Very fast for reading data. Raw File is the SSIS native data format requiring no translation and minimal parsing.|
|XML*||Extract from an XML file, using a schema to define the data involved.|
|ADO.NET||Source from tables or queries using ADO.NET.|
|Flat File Source||Extract data from delimited or fixed width files using this data source adapter. There is an option available with this provider to speed up flat file data extraction. This option is called Fast Parse, it is available on the column level only and it can be used when there is no requirement to parse locale sensitive data.|
* These adapters do not use package or project connections.
4.2) Data Flow Transformations: The T in ETL
As defined here: SQL Server Integration Services transformations are the components in the data flow of a package that aggregate, merge, distribute, and modify data. Transformations can also perform lookup operations and generate sample datasets.
|Logical Row-Level||Derived Column transformations are used quite frequently as they facilitate the creation of new columns from other columns, variables or parameters.|
|Multi-Input/Multi-output||Merge Join: Two sets are combined together with this transformation based on defined joined column(s). Both sets have to be sorted in advance.|
|Multi-Row||Aggregate As the name implies groups rows together and provides an aggregate function in the process e.g. MAX, MIN|
|Advanced Data Preparation||DQS Cleansing: This transformation is used to cleanse data by referencing it against a knowledge base.|
4.3) Data Flow Destination Adapters: The L in ETL
The data has been sourced, transformed and now has to be loaded into another location. For this purpose a data flow destination adapter can be used.
|Data Destination Adapter||Details|
|Raw File||As this is the native SSIS format, this destination is used mainly as a staging area.|
|ODBC Destination||New in 2012, two facilities for loading available: row-by-row or batch. The latter is very useful in data warehousing environments where large volumes of data are loaded, the only caveat being that the destination has to be compatible.|
|ADO.NET Destination||Load data via ADO.NET provider.|
|Data Mining Model Training||If there is a requirement to load data into a data mining model in SSAS this provider can be used.|
4.4) Lookup Transformations
The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. It is similar to a join in that it matches two sets based on one or more columns. There are some differences, however, in that the lookup returns only one matching row in cases where there are multiple matching rows. The main use case for a Lookup transformation is acquiring a dimension table's primary key and other columns when populating a fact table. In this scenario the dimension table is the reference dataset.
The Lookup transformation usually loads all of the required rows from the reference dataset into memory/cache. How it does this depends on the cache mode setting that has been selected (this is the most important setting of the Lookup transformation):
- No Cache: Avoid! Only the last matched row is stored in cache, so each row imposes a read from the reference dataset.
- Partial Cache: Works like a web browser. Any rows that are not cached are added to the cache once read from the reference dataset.
- Full Cache: This is the default mode and should be used when possible. In the pre-execute phase the entire reference dataset is loaded into memory. The only caveat is to ensure that there is enough memory to hold the entire reference dataset: the Lookup does not swap to disk if memory is exceeded but instead simply fails.
The Cache Connection Manager (listed under Connection Type in the Lookup editor) is very useful with the Lookup transformation as it allows lookups against more than just OLE DB sources. Populating a Cache Connection Manager requires a Cache Transform transformation connected to a data source. Another benefit is that the Cache Connection Manager can be declared at the project level, which enables re-use and sharing amongst lookups in a project.
Recap! Lesson 1: Create a Project and Basic Package with SSIS
This step-by-step tutorial goes through most of the content from the previous sections
Section Five: Variables, Expressions and Master Packages
5.1) Using SSIS Variables
As in all programming languages, variables store values, and in SSIS it is no different: they are used to store values derived at runtime and help to promote re-use. Variables can be system variables or user variables. System variables are read-only as they are set by SSIS. From an SSIS developer's perspective, user-defined variables are more interesting as they can be set by the developer.
Properties Some of the following properties can be set per variable.
- Name Required - variable name.
- Data Type Expanded below, the type of the variable.
- Scope Expanded below, scope is important in determining where the variable will be available within a package.
- RaiseChangeEvent Used extensively in logging and troubleshooting: a Boolean value determining whether an event is raised when the value of the variable changes.
Data Types Some of the data types available to SSIS variables.
|Object||This is an important type. It is the base type from which all other types inherit. As a result it can be used to store otherwise unsupported types in an SSIS variable.|
|String||Unicode character strings.|
Variable Scope The accessibility and visibility of an SSIS variable is based on its scope, which can be one of three settings:
- 1) Package Scoped: The default scope for all newly declared variables. Variables declared at this level can be considered global and are therefore accessible to all objects within the package.
- 2) Container Scoped: Accessible to objects within the container only.
- 3) Task Scoped: Only accessible to the particular task the variable is declared in.
5.2) SSIS Expressions
An expression is a combination of constants, variables, parameters, column references, expression functions and/or expression operators, allowing you to prescribe at design time how a value will be determined at runtime. In SSIS, expressions are written in a special proprietary language, similar to C++, that provides a set of operators and functions to implement most expression requirements.
- The expression operators consist of the usual operators required in most programming languages e.g. +, -, !=.
- However there is one important operator that is used quite frequently in ETL: converting from one data type to another (the data type cast).
- This is done using the expression syntax (type_spec) value, where type_spec is a Data Flow Buffer data type and what follows it is the value that needs to be converted.
- This example converts 3.57 from numeric to a 4-byte signed integer: (DT_I4) 3.57 [Reference here]
Below are some of the function groups and their examples in SSIS expressions:
|SSIS Expression Function Group||Example||Details|
|Maths||SQRT||Retrieves the square root of a value|
|String||CODEPOINT||Unicode value of leftmost character in a string (use case?)|
|Date||Year||Returns year of specified date|
|NULL||NULL||Returns a NULL value of a requested data type*|
*Note this function operates in a similar way to the cast operator mentioned above, e.g. NULL(DT_I4) will store a NULL inside a 4-byte signed integer column.
5.3) Property Parameterization
5.4) Master Package Concept
As noted, the master package is a concept or methodology for developing SSIS projects. It is facilitated by the workflow task Execute Package, which can be used to execute other SSIS packages. Arguably the same can be achieved by using the Execute SQL Server Agent Job Task or Execute Process Task; however, these provide a lot less functionality and error handling. The master package concept promotes the software engineering principle of functional decomposition: breaking an SSIS project into smaller atomic packages and calling and controlling these from a single master package.
The package containing the Execute Package Task is labelled the parent package and it calls child packages. The child packages can be parameterized in two ways:
- Package Configurations: Variables in the child packages can be referenced by the parent package. Variable names must match.
- Parameters: Project Deployment model only, this involves mapping parent parameters to child parameters.
Section 6: Enhancing the Data Flow
6.1) The Slowly Changing Dimensions Requirement
Inferred Dimension Members
A bit like the question of which came first, the chicken or the egg, records can arrive in a data warehouse fact table with no matching record in any of the dimension tables. Specifically, the source business key for such a record has not yet been loaded into one of the dimension tables.
This problem is known as the late-arriving-dimensions or early-arriving-facts problem and has to be handled. This can be done via inferred dimension members in the following way:
1) When this record arrives in the fact table, insert a dummy or inferred record in the dimension table with this source business key, surrogate key and a flag field to denote that this is an inferred member.
2) Using the surrogate key assign it to the record in the fact table
3) Then when loading the dimension table that this record belongs to check for inferred members, update all records accordingly and clear the inferred member field.
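Steps 1 and 3 above might look like this in T-SQL; all table, column and variable names here are hypothetical:

```sql
-- Step 1: insert an inferred member for an unknown business key,
-- flagged so it can be completed later.
INSERT INTO dbo.DimCustomer (CustomerBK, FullName, IsInferred, ValidFrom)
VALUES (@UnknownCustomerBK, N'Unknown', 1, GETDATE());

-- Step 3, during the next dimension load: overwrite inferred rows with
-- the real attribute values and clear the inferred-member flag.
UPDATE d
SET    d.FullName   = s.FullName,
       d.IsInferred = 0
FROM   dbo.DimCustomer AS d
JOIN   Source.Customer AS s ON s.CustomerID = d.CustomerBK
WHERE  d.IsInferred = 1;
```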
Slowly Changing Dimensions
Slowly Changing Dimensions were mentioned above and, like inferred members, they have to be handled in the data warehouse as well. In SSIS there is a Slowly Changing Dimension transformation available, the only caveat being that it is only useful for small dimensions, preferably fewer than 10,000 rows.
Bearing this in mind, a set-based solution is preferable and the following approach could be used to maintain SCDs:
1) Check whether the attributes from the source row have changed via a computed column that maintains an aggregate value of the entire row, e.g. a hashbytes column containing the hash value of the row.
2) Denote which rows are Type 1 SCD and which are Type 2 SCD.
3) In the case of Type 1 SCD apply the update; for Type 2, set the Valid To field to the current date time and then insert a new row for the current values.
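A sketch of the hash comparison and the Type 2 expiry step; the table, column and variable names are illustrative only:

```sql
-- Step 1: detect changed rows by comparing a hash of the attribute columns
-- against a hash persisted when the dimension row was loaded.
SELECT s.CustomerID
FROM   Source.Customer AS s
JOIN   dbo.DimCustomer AS d
       ON d.CustomerBK = s.CustomerID AND d.IsCurrent = 1
WHERE  HASHBYTES('SHA1', CONCAT(s.FullName, N'|', s.City)) <> d.RowHash;

-- Step 3 for a Type 2 change: expire the current row; a new row with the
-- current attribute values is then inserted separately.
UPDATE dbo.DimCustomer
SET    ValidTo = SYSDATETIME(), IsCurrent = 0
WHERE  CustomerBK = @ChangedBK AND IsCurrent = 1;
```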
6.2) How To Incrementally Load A Fact Table
Hopefully when loading a fact table in your data warehouse, you are lucky enough to have a column available to select on a specific date range, or a means of doing so on a sliding, overlapping window.
If neither of these options is available, then an alternative approach is to use the Change Data Capture (CDC) functionality introduced in SQL Server 2008.
Unfortunately at present CDC is only available in Enterprise editions of SQL Server, so an absolute final alternative may be to read the whole dataset each time.
Sample is provided here on how to do this
- How Do I Enable CDC? CDC has to be enabled on the database. Then it can be enabled per table in the same database for each table you want to track changes for. Before enabling CDC, ensure that the SQL Server Agent is available and running on the database instance. This is because CDC creates two jobs inside the SQL Server Agent.
- What Is Generated in the Database When I Enable CDC? After CDC has been enabled in the database, as well as the SQL Server Agent jobs being created, a cdc schema is created along with the system tables listed next:
- cdc.change_tables: Lists the tables in the database that have been enabled for CDC.
- cdc.captured_columns: Which columns are being captured?
- cdc.ddl_history: History of all of the data definition (DDL) changes since CDC enablement.
- cdc.index_columns: Indexes associated with CDC tables.
- cdc.lsn_time_mapping: Used to map between log sequence number (LSN) commit values and the time the transaction was committed.
- cdc.stg_CDCSalesOrderHeader_CT: Each table that is enabled for CDC has a “copy” where all the changes are stored with additional columns describing, for example, the type of change.
- What CDC Components Are Available in SSIS? There are three new SSIS components in SQL Server 2012.
- How Do I Retrieve Change Data Using the CDC Source Adapter?
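Enabling CDC as described above comes down to two system stored procedures; a minimal sketch (the source schema and table names follow the example table above):

```sql
-- Requires a running SQL Server Agent, since the capture and cleanup
-- jobs are created at this point.
USE SourceDB;   -- hypothetical source database
EXEC sys.sp_cdc_enable_db;        -- creates the cdc schema and system tables

EXEC sys.sp_cdc_enable_table
     @source_schema = N'stg',
     @source_name   = N'CDCSalesOrderHeader',
     @role_name     = NULL;       -- NULL = no gating role for change data access
```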
CDC Source Adapter: This is a data flow source adapter that retrieves changed data from CDC change tables. It has 5 different means of retrieving changed data: All, All with old values, Net, Net with update mask and Net with merge. (These are explained in detail below.)
CDC Control Task: Part of the control flow in SSIS, it controls the life cycle of change data capture (CDC) packages. It can also deal with error scenarios and recovery.
CDC Splitter: This is a Conditional Split transformation re-purposed to route rows based on the value in the __$operation column, so that the rows can be divided into different data flows for insert, update and delete operations.
As mentioned above there are 5 different process modes.
|All||A single row for each change is returned. This is useful if you require comprehensive auditing, but it does require that all changes be merged to get the latest values for a specific record.|
|All With Old Values||Before and after: same as All above except one row for pre-change and one row post change.|
|Net||In most ETL processes, this is the mode you will want. One row is returned per unique record containing an aggregate of all changes.|
|Net With Update Mask||Same as Net but a boolean column is added per source column indicating if the source column has changed.|
|Net With Merge||This mode is designed for use with the T-SQL merge statement. Both update and insert statements are grouped together with the same operation value.|
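Outside SSIS, the same net changes can be queried directly in T-SQL; a sketch assuming a capture instance named stg_CDCSalesOrderHeader (the CDC source adapter wraps similar calls):

```sql
-- Establish the LSN range to read; here the full available range.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'stg_CDCSalesOrderHeader');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- One row per net change; __$operation: 1 = delete, 2 = insert, 4 = update.
SELECT __$operation, *
FROM   cdc.fn_cdc_get_net_changes_stg_CDCSalesOrderHeader(@from_lsn, @to_lsn, N'all');
```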
6.3) Go With The Error Flow
Error paths are like the bouncers for bad rows: they can route bad rows away from the main data flow when problems with the data are detected, without affecting the good rows. These error paths are represented by red connectors between data flow components.
There are 3 options for configuring error output:
- 1) Fail Transformation ceases the data flow if an error is detected.
- 2) Ignore Failure allows the row to continue out the normal green data path with error values nullified in the process.
- 3) Redirect Rows sends error rows out the error path to be handled in separate components.
The embedded video below from Microsoft Virtual Academy reiterates a lot of the content in this section and is very useful.
Section 7: Robust & Restartable
7.1) Transactions: Maintaining ACID Properties
The ACID (Atomicity, Consistency, Isolation and Durability) properties of database transactions are important to ensure that a database is maintained in a consistent state. Should a database transaction fail, the database should be returned to its state prior to the transaction. To achieve this, SSIS has full support for transactions and isolation levels as discussed next.
- In SSIS transactions can be set at three levels: package, control flow container or task level.
- A pre-requisite is to ensure that the Microsoft Distributed Transaction Coordinator (MSDTC) service is running. Also ensure that any tasks that are going to be within a transaction are compatible with MSDTC.
- To enable a transaction for any supported component, set the TransactionOption property to Required.
- This property can also be set to Supported (the default, meaning that if a transaction is active the component will join it) or NotSupported (it will not join a transaction even if one is active).
- The NotSupported option is useful for any auditing or logging component, to make sure it will not roll back in the event of a failure.
The I in ACID: what degree of locking do you want to deploy when selecting records?
- There are several isolation levels available in SQL Server, but the default in SSIS is Serializable, which locks the entire dataset whilst it is being read until the transaction completes.
- You set the required transaction isolation level in SSIS by specifying the IsolationLevel property of a task or container.
- It is important to note that the IsolationLevel property of SSIS objects is moot unless you explicitly open a transaction inside SSIS by setting the TransactionOption property to Required.
7.2) Checkpoints: Restartability
What can you do when a very long-running ETL process fails 70% of the way through and you do not want to start from the beginning again? This is where checkpoints come in to help restartability.
- Checkpoints are set at the package level and there are 3 settings that must be enabled and set for them to work:
1) SaveCheckpoints: Set to True to save checkpoints during execution.
2) CheckpointFileName: Path and name of an XML file used to maintain state.
3) CheckpointUsage: Set to either Never, IfExists or Always. If it is set to Always, the checkpoint file must exist or the package will not start.
- After you enable checkpoints in a package, the final step is to set checkpoints at the various tasks within your package. To do this, set the FailPackageOnFailure property of each task or container to True.
7.3) Event Handling
Event handling listens for specific events to occur and executes a separate task in response to each event. An interesting use case is the OnPreExecute event handler, which could be set up to ensure there are sufficient server resources before running.
Event handlers are set up per package and follow the control flow paradigm:
As is evident from the above image there are multiple event handlers, for example:
- OnError: Listens and responds to an executable component reporting an error
- OnVariableValueChanged: Runs when the value changes in a variable for which the RaiseChangedEvent property is set to True
- OnPreExecute: Runs before an executable component is executed
Section Eight: Dynamism
8.1) Connection Managers & Parameters
The project deployment model introduced in SQL Server 2012 makes it a lot easier to move from development to testing to production, thanks to the project connection manager, parameters and build configurations. This is a welcome departure from using package configurations.
- Project Connection Manager: Adhering to the software engineering maxim of re-use, a project connection manager can be declared and initialised once and used throughout each package in the project. A package connection manager can be converted to a project connection manager if required.
- Parameters: Allow you to assign values to properties within packages at the time of package execution. There are two types: project parameters and package parameters. Parameters have many of the same properties as variables, including Name, Data type and Value.
- Build Configurations: Multiple versions of solution and project properties can be stored using build configurations. There are two levels that can be defined: solution configurations and project configurations.
8.2) Package Configurations
Prior to 2012, package configurations were the main method of moving from development to test to production. As noted in the training kit, if you are using SQL Server 2012 or higher, use the project deployment model with parameters.
Section Nine: Logging, Auditing & Package Templates
This section provides details of the logging and auditing capabilities in SSIS.
- Log Providers: In SSIS there are 5 distinct log providers, namely text (CSV) files, XML files, a SQL Server table, SQL Server Profiler and the Windows Event log.
- Log Configuration: When configuring logging, consider which events need to be captured and, of those events, which properties need to be captured. The events that can be captured are the same set of events that can be bound to event handlers as mentioned here. Some of the properties include Computer, Operator, SourceName and SourceID. Also included are StartTime, EndTime and DataCode; these final three are always captured.
- Log Settings Inheritance: Logging is configured at the package, container or task level. Each level can have its own specific logging settings or inherit them from its parent. To inherit from its parent, the object must have its LoggingMode property set to UseParentSetting.
- Log Configuration Templates: A useful feature is the ability to save a configuration as a log configuration template in XML format and import it in other objects or projects.
9.2) Auditing & Lineage
Audits describe the individual states of data whereas logs describe the transitions between states. To put it more elegantly: procedural information about the events captured in the log is given a material context by the corresponding audits.
- Elementary Auditing: Provides basic information about the state of an individual unit of information. Typically it captures changes in data, the type of each change and the time the change was made.
- Complete Auditing: Comprehensive information is collected, sometimes every atomic change to the source data. Due to the amount of data collected, it is usually implemented in a dedicated table.
How is audit data stored in both methods?
- Storage of elementary audit data: The relationship between primary data and audit data in the elementary method is 1:1. Therefore audit data in this method is usually stored in dedicated fields in the same row as the primary data.
- Storage of complete audit data: As this method is more comprehensive, a separate table is used to store the data state changes. The relationship between the primary data table and the audit table is 1:N.
Which SSIS Components Support Auditing?
The following components support the audit process in various ways:
- Row Count Data Flow Transformation Component: Keeps track of the number of rows that have passed through the data path the component is connected to.
- Aggregate Data Flow Transformation Component: Aggregates data in the data flow using various standard aggregate functions such as Count, Group By, Sum and Average.
- Audit Data Flow Transformation Component: Supplements the data flow with additional information from a range of system variables shown in the table below. The Audit component supports both elementary and complete auditing.
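The two storage patterns can be sketched in DDL. This is a hypothetical example (all table and column names are assumptions): elementary audit data sits in extra columns on the primary row (1:1), while complete audit data goes to a dedicated table (1:N).

```sql
-- Elementary auditing: 1:1, audit columns live in the primary row itself
CREATE TABLE dbo.Patient
(
    PatientKey     INT IDENTITY PRIMARY KEY,
    PatientName    NVARCHAR(100) NOT NULL,
    -- elementary audit columns
    LastModifiedBy NVARCHAR(128) NOT NULL DEFAULT SUSER_SNAME(),
    LastModifiedOn DATETIME2     NOT NULL DEFAULT SYSDATETIME()
);

-- Complete auditing: 1:N, every state change goes to a dedicated table
CREATE TABLE dbo.PatientAudit
(
    AuditKey   INT IDENTITY PRIMARY KEY,
    PatientKey INT NOT NULL REFERENCES dbo.Patient (PatientKey),
    ChangeType CHAR(1)       NOT NULL,  -- I(nsert), U(pdate), D(elete)
    ChangedBy  NVARCHAR(128) NOT NULL,
    ChangedOn  DATETIME2     NOT NULL
);
```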
Correlating Audit Data With Log Data?
Audit data and log data are better together. To link both of these at least two properties would be required. From the table of system variables listed above, TaskID and ExecutionID could be used to uniquely identify audit rows and their corresponding log entry.
How Long Should Audit and Log Data Be Retained?
If audit and log data is allowed to accumulate unchecked, it could consume all the disk space on the drives it is written to. Audit and log data have a limited shelf life: they are usually only useful for the last set of actions they were involved in. Usually audit and log data is purged based on an aging process with a defined time-based retention period. It is recommended that a two-phase approach is employed, whereby the records are first marked for purging and then, after a second time period, the flagged records are purged.
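The two-phase purge could be sketched like this, assuming a hypothetical audit table with a `MarkedForPurge` flag and a six-month retention period (all names and periods are assumptions):

```sql
-- Phase 1: flag audit records older than the retention period
UPDATE dbo.PatientAudit
SET    MarkedForPurge = 1
WHERE  ChangedOn < DATEADD(MONTH, -6, SYSDATETIME())
  AND  MarkedForPurge = 0;

-- Phase 2 (run later, e.g. a month on): delete only records that
-- were flagged in phase 1 and have remained flagged
DELETE dbo.PatientAudit
WHERE  MarkedForPurge = 1
  AND  ChangedOn < DATEADD(MONTH, -7, SYSDATETIME());
```

The gap between the two phases gives administrators a window in which flagged records can still be rescued before they are physically deleted.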
9.3) Package Templates
In most software development there are certain repetitive tasks that have to be completed each time a new software project is started. To avoid having to do this each time, package templates can be used.
Some notes on package templates:
- They are implemented as regular package files
- Consider what items should be added to a package template, for example connection managers, event handlers and log providers
- The template can then be used in any project via the Add New Item option
- Once the template is added, remember to change its ID property
Section Ten: SSIS Installation & SSIS Deployment
10.1) SSIS Installation
Prior to installation of SSIS on any server, some pertinent questions are:
Has SQL Server been installed already on the target server?
SSIS solutions can only be deployed to an environment built on the SQL Server platform. There is no requirement to use SQL Server for storage; the file system can be used for this, even though SQL Server can be used to store solutions too. The requirement is for SQL Server to provide the execution engine, known as the SSIS runtime, to execute solutions on.
Is the installation for development or production?
This question is important for determining what software tools are required as part of the installation. On the production server if remote access via SSMS or the SSIS deployment/maintenance tools is available then there is no requirement to install these on it. So all that is really required to deploy/maintain/execute SSIS projects is Integration services. On the development server SQL Server Data Tools (SSDT) would be the main software requirement.
What account will be used for the service account?
In security, the minimal rights principle is based on the concept that principals are only granted the rights they need to perform their work. This principle is applicable when creating accounts to host SSIS. The following accounts are recommended based on operating system (OS) and server role:
| OS | Access to External Resources (Yes/No)? | Account Type |
|---|---|---|
| Windows 2008 R2 / Windows 7 | Yes | Managed Service Account (MSA) |
| Windows 2008 R2 / Windows 7 | No | Virtual Account |
| Windows 2008 / Vista | N/A | Dedicated Domain Account |
Is it a 32-bit or 64-bit environment?
Some SSIS features are only available in 32-bit or 64-bit. Some are available on both. The important thing to consider is to ensure data providers that use 32-bit at design time are able to use the 64-bit version of the data provider at run time. The reason that these data providers use 32-bit at design time is that SSDT is a 32-bit application only.
Tools Bundled With SSIS
The following tools are bundled with SSIS; some may be useful whilst others are there for legacy purposes.
| Tool | Command line? (Yes/No) | Details |
|---|---|---|
| Package Utility (DTUtil) | Yes | Used to move, copy, delete and encrypt packages |
| Execute Package Utility (DTExec) | Yes | Need to know this too; DTExec can execute packages from the command line |
| Deployment Wizard | No | This tool can be used to deploy packages to the SSISDB. Discussed in more detail here |
| Package Conversion Wizard | No | Converts a set of packages and their associated configurations to a project deployment file |
| Package Upgrade Wizard | No | Upgrades packages from versions prior to SQL Server 2012 to 2012 |
| Package Installation Utility | No | Using the deployment manifest, it can be used to deploy packages to an instance of SQL Server |
10.2) SSIS Deployment
The SSISDB Catalog
SSISDB is a special database provided in SQL Server to be used as the principal SSIS solution repository. It replaces the usage of MSDB as the SSIS project and package repository. As illustrated below SSISDB can be viewed in Object Explorer and under Integration Services Catalog.
SSIS server is a name used to refer to an instance of SQL Server hosting the SSISDB catalog. Any instance of SQL Server 2012 can be used as the SSIS server, except SQL Server Express.
The SSISDB catalog can be configured via the following settings.
| Property | Details |
|---|---|
| Encryption Algorithm Name | Which encryption algorithm should be used to encrypt sensitive data (AES_256 by default)? |
| Clean Logs Periodically | If True, operation details older than the retention period (next) are deleted. When False they are retained. |
| Retention Period (in days) | How long should operation details be retained before being purged? |
| Server-wide Default Logging Level | Default logging level for the Integration Services server. |
| Max Number of Versions per Project | How many versions should be kept for a single project? |
| Periodically Remove Old Versions | When True, purge project versions beyond the max number of versions per project (above). When False, do not purge at all. |
| Validation Timeout | If validations do not complete in the number of seconds specified here, they will cease. |
These properties can be listed via the catalog.catalog_properties view, as shown in the image below.
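The same settings can also be read and changed in T-SQL. A minimal sketch (the property chosen and the new value are arbitrary examples; the internal property names differ slightly from the display names above, e.g. the retention period is `RETENTION_WINDOW`):

```sql
-- List the current SSISDB catalog settings
SELECT property_name, property_value
FROM   SSISDB.catalog.catalog_properties;

-- Change one of them, e.g. set the retention period to 60 days
EXEC SSISDB.catalog.configure_catalog
     @property_name  = N'RETENTION_WINDOW',
     @property_value = 60;
```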
As shown below, SSISDB objects represent SSIS projects and packages organized in folders that are deployed to the SSIS server.
| Object | Details |
|---|---|
| Folders | On deployment, SSIS projects are placed in folders in the SSISDB catalog. |
| Projects & Packages | Projects are treated as a single unit in the project deployment model. Each project can contain one or more packages. |
| Parameters | Parameters are exposed to the environment. This means they can be configured dynamically at run time. |
| Server Environments, Server Variables and Server Environment References | Environment references represent the association of server variables with project and package parameters. |
| Operations | Operations are actions performed against the SSISDB catalog, projects and packages. |
Under the project deployment model, SSIS project deployment is integrated into SQL Server Data Tools (SSDT), as well as SQL Server Management Studio (SSMS). Both SSMS and SSDT call the Integration Services Deployment Wizard to perform the deployment. Incidentally the Wizard can be called explicitly. Project deployment files are named with the .ispac extension.
Section Eleven: Executing & Securing Packages
11.1) Package Execution
Executing Packages On-Demand: Via UI and Programmatically
Usually packages are executed automatically on a scheduled basis or in response to an event. That said they can still be executed manually. To do this through a UI it is possible to use SSMS or for older packages deployed against the MSDB the DTExecUI (shown below) can be used. Please note DTExecUI cannot be used for packages stored in SSISDB.
Executing on demand programmatically presents several more options, including T-SQL, PowerShell, the SSIS managed API and the DTExec command-line utility.
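The T-SQL route goes through the SSISDB catalog stored procedures. A sketch, assuming a hypothetical folder `ETL`, project `DWLoad` and package `Master.dtsx` (those names are illustrative only):

```sql
DECLARE @execution_id BIGINT;

-- Create an execution instance for the deployed package
EXEC SSISDB.catalog.create_execution
     @folder_name  = N'ETL',
     @project_name = N'DWLoad',
     @package_name = N'Master.dtsx',
     @execution_id = @execution_id OUTPUT;

-- Optional: set the logging level for this execution (1 = Basic)
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id,
     @object_type     = 50,
     @parameter_name  = N'LOGGING_LEVEL',
     @parameter_value = 1;

-- Start the execution (asynchronous by default)
EXEC SSISDB.catalog.start_execution @execution_id;
```

Note that start_execution returns immediately; the package runs asynchronously and its progress can be followed in the catalog views discussed below.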
Executing Packages Automatically
SQL Server Agent can be used to schedule the execution of SSIS packages. These packages can be scheduled to run in response to an event, for example when CPU utilization drops below a specified level. Alternatively they can be run on a schedule, either daily, weekly or monthly.
As mentioned in Section 5 the master package concept can be used to emulate what the SQL Server Agent job does with a lot more flexibility, error handling and of course parallel execution.
As defined in the previous section, operations are actions performed against the SSISDB catalog, projects and packages. While an operation is in progress, its progress is charted in the catalog.operations catalog view. In this view the status column denotes how the operation has progressed, with the following values: created (1), running (2), canceled (3), failed (4), pending (5), ended unexpectedly (6), succeeded (7), stopping (8) and completed (9).
An alternative means of monitoring operations is via the Active Operations dialog in SSMS. This is available by right-clicking on the required node (folder, project or package) and choosing Reports, as shown in the image below:
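A simple query against the view decodes the status values listed above (the CASE decoding is just a convenience; the numeric codes are what the view stores):

```sql
-- Most recent operations first, with the status column decoded
SELECT operation_id, operation_type, status,
       CASE status
            WHEN 1 THEN 'created'   WHEN 2 THEN 'running'
            WHEN 3 THEN 'canceled'  WHEN 4 THEN 'failed'
            WHEN 5 THEN 'pending'   WHEN 6 THEN 'ended unexpectedly'
            WHEN 7 THEN 'succeeded' WHEN 8 THEN 'stopping'
            WHEN 9 THEN 'completed'
       END AS status_description,
       start_time, end_time
FROM   SSISDB.catalog.operations
ORDER BY operation_id DESC;
```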
Validations are used to check projects or packages for any problems. In order to prevent problems during execution, it is recommended to explicitly validate projects and packages before their first execution or after they have been reconfigured. As mentioned in 10.2, if validations are taking too long they can be configured to time out after a specified number of seconds using the Validation Timeout property.
Executions: Abstract Versus Concrete
As shown above, there are several execution logging levels available in SSISDB (None, Basic, Performance and Verbose). Depending on individual requirements these can be changed; for instance, to enhance performance it is preferable to avoid choosing Verbose.
11.2) Securing Packages
This definition of SSISDB security is important as it identifies all the major players: SSISDB permissions define which SSISDB principals can perform which SSISDB operations on which SSISDB securables. This statement is unpacked in the following notes:
- SSISDB principals are SSISDB catalog users
- One particular role called ssis_admin can perform administrative tasks on the SSIS server
- Users must be granted explicit access to the SSISDB catalog
- With the exception of the user that created the SSISDB catalog, who is a default member of the ssis_admin role
- SSISDB securables consist of folders, projects and operations
- Folders are topmost securable, containing all other objects
- Projects are child securables of folders containing packages and environment references
- Environments contain environment variables and are child securables of the folders they belong to
The permissions illustrated below can be granted, denied or revoked for SSISDB users.
- On folders, projects and environments permissions can be controlled explicitly
- Packages, environment references and environment variables cannot have permissions assigned directly; they must be inherited from their containing object.
- Packages and environment variables inherit from their containing project.
- Environment variables inherit from the containing environment.
Default Permissions: By default any SSISDB user is permitted to deploy projects to the SSISDB server.
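Permissions can be granted in T-SQL via the catalog.grant_permission stored procedure (there are matching deny_permission and revoke_permission procedures). A hedged sketch only — the IDs below are hypothetical, and the numeric type codes should be checked against Books Online before use:

```sql
-- Grant READ on a project to an SSISDB catalog user.
-- @object_type: 1 = folder, 2 = project, 3 = environment, 4 = operation
-- @permission_type: 1 = READ (other codes cover MODIFY, EXECUTE etc.)
EXEC SSISDB.catalog.grant_permission
     @object_type     = 2,   -- project
     @object_id       = 10,  -- e.g. project_id from catalog.projects
     @principal_type  = 1,   -- see Books Online for the principal type codes
     @principal_id    = 5,   -- principal_id from sys.database_principals
     @permission_type = 1;   -- READ
```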
Package Protection Levels
Although this is not discussed in the training kit book, 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012, it is important to understand package protection levels. These values appear in the Properties window used to configure the properties of the package when working with packages in SQL Server Data Tools (SSDT). A full listing and description of each protection level is discussed here.
Section Twelve: Troubleshooting & Performance Tuning
Control Flow Breakpoints:
- Breakpoints can be set in the control flow
- To set one, right-click on the task or container and choose Edit Breakpoints
- The conditions that they can be set on are the same as the ones available to event handlers
- Hit count allows you to specify the number of times the break condition must occur before suspension
- Hit Count Type works in tandem with Hit Count by checking the Hit Count value has met the specified condition to suspend execution
Data Flow Data Viewers: Data viewers are a very powerful feature in the data flow that allow you to view the rows of data as they flow through the pipeline. This is available in SSDT during development. The data is displayed in a grid. When the rows have been reviewed, the data can be moved on by clicking the Play button on the data viewer; alternatively, if you are finished with the data viewer, you can click Detach.
Other Methods of Debugging
Some other methods of debugging include: error outputs, sampling using the Percentage Sampling or Row Sampling transformation, or capturing a count of the rows processed using the Row Count transformation. This is discussed in more detail in the section on Auditing and Logging.
To a certain extent, data flow taps are the production equivalent of data viewers. Data flow taps are new in SQL Server 2012 and allow you to capture the rows from a data path in a data flow task and export the tapped data to CSV format. It is recommended that data taps are enabled only for the purposes of troubleshooting, as they add considerable I/O overhead.
To add a data tap, three stored procedures must be run in the order shown: catalog.create_execution to create the execution, catalog.add_data_tap (whose parameters are listed below) and catalog.start_execution to start it.
| Parameter Name | Parameter Description |
|---|---|
| @execution_id | The execution ID for the execution that contains the package. |
| @task_package_path | The PackagePath property in SSDT for the data flow task. |
| @dataflow_path_id_string | The identification string for the data flow path. |
| @data_filename | The name of the file the tapped data will be written to. |
| @max_rows | Optional: how many rows to capture; when not specified, it defaults to all. |
Please note that there is an equivalent version of the add_data_tap stored procedure that uses a GUID to identify the data flow task: it is called catalog.add_data_tap_by_guid. More details are available here.
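Putting the three procedures together, a sketch of tapping a data path (the folder, project, package, task path and path identification string are all hypothetical; yours will differ):

```sql
DECLARE @execution_id BIGINT;

-- 1) Create the execution
EXEC SSISDB.catalog.create_execution
     @folder_name  = N'ETL',
     @project_name = N'DWLoad',
     @package_name = N'LoadFact.dtsx',
     @execution_id = @execution_id OUTPUT;

-- 2) Add the tap BEFORE starting the execution
EXEC SSISDB.catalog.add_data_tap
     @execution_id            = @execution_id,
     @task_package_path       = N'\Package\Load Fact Table',
     @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
     @data_filename           = N'TappedRows.csv',
     @max_rows                = 1000;   -- optional cap on captured rows

-- 3) Start the execution; the tapped CSV is written to the
--    DataDumps folder of the SQL Server installation
EXEC SSISDB.catalog.start_execution @execution_id;
```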
12.2) Performance Tuning
The Data Engine
What is a buffer? A buffer is an area of memory where rows of data are stored to be operated on by data transformations. The number of rows contained in a buffer is determined based on the size of each row. The size of each row in turn is determined by the number of columns in the row and the size of each column. The data flow manages data in buffers.
Downloaded from http://joe.greenrivertech.net/memory/memoryHierarchy.html
Transformation types are significant when it comes to performance. As discussed in Section 4, transformations are either non-blocking, partially blocking or blocking. Non-blocking transformations are synchronous; partially and fully blocking transformations are asynchronous. A synchronous component is generally fastest as it re-uses buffers. An asynchronous transformation requires a new buffer, performs the most work and consumes the most resources. A case in point is the Sort transformation, which must receive all rows before it can process the data, blocking all data buffers from being passed down the pipeline until it outputs.
Excellent description of transformation types here
Execution trees... To understand buffer usage, understand execution trees. Unfortunately the definition of execution trees is not very helpful. An execution tree is defined as a logical group of transformations that starts at either a source adapter or an asynchronous transformation and ends at the first asynchronous transformation or destination adapter, whichever comes first. Is that clear? No, me neither. The gist of it is that asynchronous transformations are essentially execution tree delimiters.
Backpressure mechanism... So what happens if the source adapter can read rows faster than the destination adapter can consume them? How, in this scenario, do we prevent memory from being overwhelmed? A very useful feature called the backpressure mechanism is used to prevent this from happening. When the source adapter is reading rows faster than the components downstream can process them, this mechanism kicks in and prevents the source adapter from loading more rows.
In order to maximize memory usage, SSIS dynamically sizes buffers based on the settings below.
| Buffer Input Parameter | Configurable? | Details |
|---|---|---|
| Estimated Row Size | No | SSIS calculates this at design time based on metadata collected about the source data. |
| MinBufferSize | No | Not configurable, but used by SSIS to determine whether DefaultMaxBufferSize is set too low. |
| DefaultMaxBufferRows | Yes | How many rows to include in a buffer. Estimated Row Size multiplied by DefaultMaxBufferRows gives an estimate of the buffer size. |
| DefaultMaxBufferSize | Yes | Set to 10 MB by default; limited to a maximum of 100 MB. |
If multiple physical or logical processors are available then performance in SSIS can be enhanced by executing tasks in parallel.
In the control flow there is the MaxConcurrentExecutables property, which defines how many executables (tasks) can run concurrently within a package. It defaults to -1, which equates to the number of logical processors + 2. Of course, tasks must be arranged without precedence constraints between them in order to execute in parallel. If you need to determine how many logical processors are on your machine, this T-SQL script can determine this for you.
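One way to count the logical processors (and therefore the effective default for MaxConcurrentExecutables) is via the sys.dm_os_sys_info dynamic management view:

```sql
-- cpu_count = number of logical processors visible to the instance;
-- the default MaxConcurrentExecutables of -1 equates to cpu_count + 2
SELECT cpu_count,
       cpu_count + 2 AS default_max_concurrent_executables
FROM   sys.dm_os_sys_info;
```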
In the data flow there is the EngineThreads property, which determines how many worker threads can be run in parallel by the scheduler. This defaults to 10 in SQL Server 2012 and applies separately to source threads and worker threads (transformation and destination components), so up to 10 of each.
When packages are deployed to the Integration Services server, it is possible to use the benchmark reports available there to monitor performance. First configure the server-wide logging level to Performance in SSMS as shown below.
Using PerfMon to Monitor SSIS Performance
Performance Monitor, or PerfMon as it is colloquially known, is a Windows tool used to monitor performance. It has some metrics, or counters, added to it when SSIS is installed on a server. Some of these counters are buffer related, and one of the most pertinent is the Buffers Spooled counter. This is useful because it reveals whether your data flow has started swapping from memory to disk storage.
Section Thirteen: Data Quality Services (DQS)
Data can comply with all the business rules that are implemented in a database via integrity constraints, but is the data accurate? For instance, both of these are valid street addresses: 54 Baggot Street and 45 Baggot Street. As a result, both addresses will pass an integrity constraint that checks for valid addresses. However, if a customer who lives at 54 Baggot Street is incorrectly entered as living at 45 Baggot Street, this will not be flagged by any constraint even though it is inaccurate. It is problems like these that data quality seeks to resolve.
13.1) Data Quality Dimensions
Hard data quality dimensions are measurable.
| Hard Data Quality Dimension | Details |
|---|---|
| Completeness | How many NULLs are there? Completeness can be measured at attribute, tuple or relation level. It is possible to write a T-SQL query to inspect this and show the percentage of NULLs in a column. |
| Accuracy | As mentioned in the intro above, data can satisfy constraints, but is it accurate? This is difficult to determine programmatically, but through the use of frequency distributions or descriptive statistics it can be done. |
| Information | Entropy is the quantification of information in a system. A means of measuring this is to look at the spread of values in a column. |
| Consistency | This measure is performed by comparing information stored in various databases. |
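A completeness check of the kind described above can be written as a short T-SQL query. This is a hypothetical example — the table and column names are assumptions:

```sql
-- Percentage of NULLs in one column: attribute-level completeness.
-- COUNT(expr) ignores NULLs, so counting a CASE that yields 1 only
-- for NULL rows gives the NULL count.
SELECT 100.0 * COUNT(CASE WHEN EmailAddress IS NULL THEN 1 END)
             / COUNT(*) AS pct_null
FROM   dbo.Customer;
```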
Soft data quality dimensions are not directly measurable and are based mainly on data users' perceptions, gathered through interviews, surveys and questionnaires.
| Soft Data Quality Dimension | Details |
|---|---|
| Presentation Quality | Based on the user interface used to enter data and, in analytical systems, to present data. |
| Ease of Use | Closely related to Presentation Quality: how easy is it for end users to interact with the data? |
| Timeliness | How current or stale is the data? |
| Intention | Is the data the right data for its intended use? |
| Trust | A very important dimension: do users trust the data? If the hard dimensions score well for a dataset but the soft dimensions do not, then users do not trust their data. |
13.2) Data Quality Services (DQS)
Illustrated below is the architecture for Data Quality Services (DQS). The three DQS databases (DQS_MAIN, DQS_PROJECTS, DQS_STAGING_DATA) are important as they house a lot of the functionality and data used in DQS. In fact DQS_MAIN is really the engine of DQS, as it contains the DQS stored procedures and published knowledge bases (KBs). The Data Quality Client is essentially a standalone application that allows end users to manage KBs; execute cleansing, profiling and matching projects; and administer DQS.
Installing DQS
Both DQS Server and Client are installed via the SQL Server installation media. The SQL Server engine must be installed in advance or as part of the install. To install the DQS Client, the PC must have IE 6 SP1 or higher and .NET Framework 4 installed.
Once they are installed, there is some post-install configuration via the DQSInstaller.exe application. It does the following:
- Databases: Adds the three DQS databases DQS_MAIN, DQS_PROJECTS and DQS_STAGING_DATA
- Roles: Creates three roles (dqs_administrator, dqs_kb_editor and dqs_kb_operator) in the DQS_MAIN database
- Integrates: MDS, if installed on the same server, is integrated with DQS
- Logins: Two logins required by DQS are added to the installation instance, called ##MS_dqs_db_owner_login## and ##MS_dqs_service_login##
- Logs: Logs installation details in a log file
- Stored Procedure: A stored procedure called DQInitDQS_MAIN is added to the master database
13.3) Data Quality Services Administration
Using the Data Quality Client, DQS administration involves:
- Setting up reference data services
- Configuring threshold values for cleansing and matching
- Configuring log settings
- Monitoring DQS activity
- Enabling or disabling notifications
Using T-SQL or SQL Server Management Studio, DQS can be partially administered by:
- Backing up and restoring the DQS databases. Note that the backups of DQS_MAIN and DQS_PROJECTS must be synchronized.
- Managing DQS security by adding users to the appropriate DQS roles
Section Fourteen: Master Data Implementation
There are many different types of data in the enterprise: Semi-Structured, Transactional (business transactions not database), Unstructured, Hierarchical, Metadata and the basis for this section, Master Data.
14.1) What is Master Data?
There are a myriad of definitions available for master data. Put simply any data in an organisation that requires comprehensive management can be deemed master data. This comprehensive data management can take the form of data logging, auditing and versioning. If you have a dataset that has no requirement for any or all of these processes chances are these are not master data. That said there are other metrics that can be applied to a dataset to determine its master data status...
The following table describes some of these metrics and uses the example of a hospital patient database to determine if the data held within it is master data.
| Master Data Metric | Description | Example |
|---|---|---|
| Cardinality | How many entities are involved? If there is a small number, for example five or fewer, these will be easy to maintain. | Hospitals, depending on their capacity and type, could have anything from thousands to millions of patients per year. |
| Complexity | How many attributes are collected per entity? If for a product it is only two, ID and description, then it is not going to be complex to maintain. | For a single patient: name, address, age, sex, medical history, diagnoses, procedures. |
| Reusage | Re-use increases the value of data. As it is used in multiple business areas, the data must be correct and accurate to prevent any issue that could cause a problem enterprise-wide, wherever the data is used. | A patient's data could be used by billing to determine cost per patient, by doctors to check for health issues in a population, or by governments to calculate national health resource allocation. |
| Auditing | Data may be audited, for instance in a warehouse against the OLTP source system. | The clinical information for a patient (diagnoses, for example) may be audited against the patient's electronic health record to ensure accuracy. |
| Versioning | Where data has an audit trail or a history of changes, that history has to be maintained, and this is indicative of master data. | As mentioned above under auditing, if the patient data is audited and is then subject to change, it will have to be versioned to ensure that the correct version of the data is used. |
| Volatility | Does the data change frequently? If it does, care is required to ensure that volatility does not break any previous data dependencies. | A patient's health record may change with ongoing care, gaining additional diagnoses and procedures after each admission to hospital. |
14.2) Master Data Management
Gartner provides the following definition of master data management (MDM): it is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.
Depending on the organisation's size, MDM can be quite challenging. A small organisation with a single operational database can more than likely do MDM in that same operational database. But scale up to a larger enterprise, and providing all of these qualities in master data proves very challenging. The following issues can contribute to problems in MDM:
- Domain/Subject Matter Experts: If there are no domain experts involved in the MDM process, this can have an impact on data stewardship
- Documentation: Lack of documentation seriously hinders MDM. For instance, a large normalized database with no ERD makes it very difficult to parse and understand the relationships involved.
- Data Conflict: Importing the same data from multiple sources proves challenging when trying to determine which data is most up to date and which is stale
- Definitional Differences: Two sets of country codes both referring to the same countries but completely different
- Data Quality: Without proper data quality, MDM is doomed to failure
- Data Stewardship: Who is responsible for the data?
To meet these challenges and ensure that MDM remains accurate, SQL Server Master Data Services can be deployed in the enterprise.
As an additional reference, there is an excellent article on Simple Talk by Hari Yadav on the challenges of MDM in the enterprise and how SQL Server MDM services can be leveraged to address these challenges.
14.3) Master Data Services Installation On Windows 7
Small caveat before embarking on a Master Data Services installation: ensure the OS you are installing on is 64-bit, that the version of SQL Server hosting MDS is by extension also 64-bit and is one of Developer, Enterprise or Business Intelligence edition, and that the environment is Windows Server or a Professional edition of Windows 7/10. Unfortunately I didn't read the small print and wasted a lot of time attempting to get Master Data Services working on Windows 10 Home.
Once the pre-requisites have been met, you begin installation by configuring the OS first... The instructions and screenshots below were completed on Windows 7 Professional.
|Open Programs and Features from Control Panel|
|Under IIS and World Wide Web Services Choose each of the sections as shown opposite|
|Under Application Development choose all options as shown|
|Next is Health and Diagnostics, ensure that HTTP Logging and Request Monitor is selected here...|
|Security requires Windows Authentication and Request Filtering|
|Performance or performance features depending on your OS requires Static Content Compression to be selected|
|Next to Management Tools or Web Management Tools, ensure that IIS Management Console has been selected|
|Once you have clicked OK, a progress bar is launched to chart the details of all changes being applied.|
|There are some more settings under .NET Framework 3.5.1 that have to be enabled: HTTP Activation and Non-HTTP Activation. Under Windows 7, WCF Activation could not be found.|
|Finally prior to MDS Installation, Windows Process Activation Service has to be enabled by selecting all options under it|
|Skipping past MDS installation as part of SQL Server straight to Post-Installation Tasks, launch the MDS Configuration Manager as shown and click on Create Database in top right-hand corner|
|Click Next past this screen|
|Provide the SQL Server instance details next|
|Specify the SQL Server instance with an account that has permission to create databases on that instance|
|Follow the instructions on the next screen to create the database, ensure Default collation is selected|
|Next specify the Windows Account with full access to all models|
|The database is now created; next the website for the Master Data Manager web app must be created as shown opposite. Ensure port 8080 is specified.|
|The database and the web app must now be associated with each other. The screen opposite facilitates this|
|Click Ok to complete the previous step and ok again. Internet Explorer should now launch...?|
|Unfortunately during my installation it didn't! To fix this error I had to run ASPNET_REGIIS -i for the version of the framework I was using in this case v4.0 as confirmed in IIS Manager|
|How to confirm which version of the .NET Framework MDS is using in IIS Manager|
|As mentioned next run ASPNET_REGIIS -i|
|And try MDS Manager again in Internet Explorer? Success!|
|The final step for me was to install Microsoft Silverlight....|
14.4) Master Data Services Models
A Master Data Services Model defines the structure of the data in your master data management solution. Typically each model covers one business area.
As mentioned in detail here a model contains:
- Entities, which in turn contain attributes
- Hierarchies (both derived and explicit)
- Collections
- Business rules
Each of these is explained in more detail below.
Entities
These are the basic MDS objects. They represent a real world object such as a person, a place or a thing. A base entity is important as it is the entity that is central to the model. Also the base entity is the starting point in a user interface for Master Data Manager.
Attributes
As mentioned above, entities contain attributes. Each entity must have a Code and a Name attribute. The Code can be regarded as the key of an entity. There are three types of attributes:
1) Free-form attribute
2) Domain based attribute
3) File Attribute
Hierarchies
These are tree structures and can be either derived or explicit. A domain-based attribute forms a derived hierarchy. The derived hierarchy can also be recursive, for example an Employee entity with a Manager attribute.
Collections
These are like arrays in that they contain a flat list of members that are not in a hierarchy. The members must be from the same entity.
Business Rules
Finally, there are MDS business rules, which are a lot like data integrity constraints in a database in that they ensure data integrity.
They are applied in a defined order:
1) Default Value
2) Change Value
3) Validation
4) External Action
5) User Defined Action Script
Section 15: Master Data Management
As mentioned here, one of the main goals of master data is to promote data re-use. Master data, being the authoritative source of data in the enterprise, imposes the requirement to be able to import, export and secure these data across the enterprise. This section explains how these requirements can be met.
15.1) Import/Export Master Data
An MDS model deployment package is an XML file with a .pkg extension. They always contain metadata about the MDS model and depending on the method used to create them may contain data as well. They include all model objects explained above as well as Version Flags and subscription views.
There are two tools used to create packages:
1) Model Deployment Wizard (Metadata only)
2) MDSModelDeploy command prompt utility (Metadata and data)
Data import is performed in two steps:
Step 1) Load the data into staging tables. For each entity there can be three staging tables, named in accordance with the entity name; for example, for an entity called Patient these would be stg.Patient_Leaf, stg.Patient_Consolidated and stg.Patient_Relationship.
During the staging process, leaf and consolidated members can be created, updated, deactivated or deleted. Attributes can be updated and relationships designated in explicit hierarchies.
Step 2) Move the data from the staging tables to MDS models. This can be done via the staging stored procedures or via the MDM application itself. If possible use the staging stored procedures, as they log transactions. Once the data is imported it must be validated against a set of business rules. Again this can be done using the MDM application or a stored procedure called mdm.udpValidateModel.
Returning to the point made earlier about data re-use, data can be exported via subscription views or via web methods exposed by the MDS web service. Subscription views are regular T-SQL views and it is possible to create your own in the MDM web app. Importantly, any schema changes are not propagated through to subscription views; if this happens they must be refreshed manually in the MDM web app.
15.2) Securing Master Data
MDS security is administered via Windows Active Directory users and groups. As per Windows security best practice, assign permissions to groups and then add or remove users from a group as required. In order to use MDS a user must have, at a minimum, Functional Area Access and Model Object Permissions. Hierarchy Member Permissions are also available but are a more advanced option.
There are Read-Only, Update and Deny permissions. Importantly, the Deny permission over-rides all other permissions. In addition there is the Navigational Access Permission, which enables a user to navigate to the level where they have been assigned permissions.
There are two levels of administrators: the MDS System Administrator and Model Administrators. There can only be one System Administrator, and if there is a requirement to re-assign this user it has to be done programmatically via T-SQL. Model Administrators have Update permissions on the complete model.
Functional Area Access
In the MDM web application there are five functional areas:
1) Explorer
2) Integration Management
3) Version Management
4) System Administration
5) User and Group Permissions
There are two scenarios where permissions can overlap and conflict:
1) Overlapping model and member permissions: To resolve this the most restrictive permission is applied, namely:
- Deny over-rides all other permissions
- Read-only over-rides Update
2) Overlapping user and group permissions:
- Deny over-rides all other permissions
- Update over-rides Read-only
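The conflict-resolution rules above can be sketched as two small Python functions. This is a minimal illustration of the rules as stated in these notes, not of any MDS API; the function names are hypothetical.

```python
def resolve_user_group(perms):
    """Overlapping user and group permissions:
    Deny over-rides everything, then Update over-rides Read-only."""
    if "Deny" in perms:
        return "Deny"
    if "Update" in perms:
        return "Update"
    return "Read-only"

def resolve_model_member(model_perm, member_perm):
    """Overlapping model and member permissions: the most restrictive wins.
    Lower rank means more restrictive."""
    restrictiveness = {"Deny": 0, "Read-only": 1, "Update": 2}
    return min((model_perm, member_perm), key=restrictiveness.get)
```

So a user who is granted Read-only directly but Update via a group ends up with Update, while a Deny anywhere always wins.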
15.3) Excel Master Data Services Add-In
Once MDS is installed and operational, the MDS Add-In for Excel can be installed. One pre-requisite is that it requires the Visual Studio Tools for Office to be installed #ToDo: link here. In addition, when choosing the 32-bit or 64-bit version of the MDS Add-In, the choice is based on the version of Office you currently have installed and not just your processor. Once installed, open Excel and a new ribbon tab should appear called Master Data; within it the MDS functionality is available.
Edit MDS Data in Excel
- 1) First of all, connect to the required MDS instance via the MDM URL
- 2) Once connected, click on the Show Explorer button and the Model Data Explorer will be displayed
- 3) Choose any entity to load its members, but click on Filter first, else all members will be loaded into Excel
- 4) Once editing is complete, click on Publish and Validate to submit all changes
Creating MDS objects in Excel
This is for advanced users only. Within the MDS Excel Add-in entities can be created and attributes changed.
Section 16: Cleaning Data in a Data Quality Project
This section extends Data Quality Services discussed earlier.
16.1) Knowledge Bases
Build the DQS Knowledge Base (KB)
Broadly speaking, building a DQS knowledge base (KB) follows these processes:
- Knowledge Discovery
- Provide a data sample; heuristics are deployed against the data and any inconsistencies or errors are returned.
- Domain Management
- This is quite involved and is explained in detail below. If you do choose to edit a knowledge base, ensure it is published afterwards, otherwise it will remain locked, and you can only unlock KBs you have locked yourself.
- Reference Data Services
- Data in your KB can be validated against external (hopefully clean) data provided by a third party.
- Matching Policy
- This policy allows you to enter matching rules that DQS can then use to identify duplicate rows in an entity. It essentially provides a means of stating which columns in a row should be used to identify duplicates.
A domain contains a semantic representation of a specific column in your data source. There are several domain concepts to be aware of:
- Linked Domain: These are used to map two data source fields to the same domain.
- Domain Rules: A condition that DQS uses to validate, standardize and correct domain values.
- Term-Based Relation: Use this to correct a term wherever it is encountered within a domain value, e.g. Inc. should always be expanded to Incorporated.
- Composite Domain: This consists of two or more single domains. An example is an address: it can contain House Number, Street, Area and Postcode as single domains, which can then be rolled up into a composite domain.
Each domain also has the following properties:
|Domain Property||Details of Property|
|Data Type||Domain Data Type e.g. String|
|Normalize||Remove special characters from strings to increase likelihood of matches|
|Format Output||Strings again, uppercase or lowercase etc|
|Use Leading Values||For Synonyms, should they be replaced with the leading values you define. Example Close is a synonym of Shut so if Close is defined as the leading value should every instance of Shut be replaced with Close?|
|Speller||Turn spell checker on|
|Syntax Algorithms||It is possible for certain strings that syntax checking should be disabled e.g. addresses|
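The Normalize and Use Leading Values properties above can be illustrated with a short Python sketch. The helper name and the leading-value map are hypothetical; this only mimics the behaviour described in the table, not the DQS engine itself.

```python
import re

# Hypothetical leading-value map: each synonym points at its leading value,
# e.g. every occurrence of "shut" should be replaced with "close".
LEADING_VALUES = {"shut": "close", "st.": "street"}

def normalize_value(value):
    """Sketch of Normalize plus Use Leading Values: strip special
    characters, lowercase, then swap synonyms for their leading values."""
    cleaned = re.sub(r"[^\w\s.]", "", value).lower()
    return " ".join(LEADING_VALUES.get(tok, tok) for tok in cleaned.split())
```

With this in place, "Shut the door!" normalizes to "close the door", increasing the likelihood of matches between values that differ only in synonyms, case or punctuation.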
16.2) Creating a DQS Cleansing Project
DQS uses knowledge bases for automatic computer-assisted cleansing. Now that a knowledge base has been built it is time to clean some data against it. To this end a DQS cleansing project can be undertaken. A cleansing project typically follows these stages:
Stage 1: Sourcing the Data
- Data can be provided from SQL Server or from Excel. If it is the latter, then Excel must be installed on the same machine as the DQS client.
Stage 2: Mapping
- Which columns in the data source are mapped to which KB domains?
Stage 3: Computer-Assisted Cleansing
- This is the main part of a DQS cleansing project; heuristics are deployed to clean the source data against the KB used.
Stage 4: Interactive Cleansing
- During this stage the user intervenes to determine which changes proposed by DQS should be accepted or rejected. It is possible to see and modify the status of the operation performed on the data during the previous phase. The status can be one of the following: Invalid, Corrected, Suggested, Correct, New.
Stage 5: Export Cleansed Data
- In the final stage, data can be exported to SQL Server or Excel. The export can include data only, or data and cleansing information.
16.3) Data Profiling
The SSIS Data Profiling Task enables you to gain an insight into a dataset, but it cannot clean that dataset. It is still very useful, however, as some of the information garnered from the Profiling Task can prove very valuable. The metrics returned by the Data Profiling Task include:
|Column Length Distribution||Look for unexpected string lengths|
|Column Null Ratio||What percentage of NULLs are there in a column|
|Column Pattern||Uses RegEx to calculate distribution of regular expressions|
|Column Statistics||Min, Max for numeric and datetime columns. As well as average and standard deviation for numeric too|
|Column Value Distribution||For discrete columns only|
|Candidate Key||Useful for determining a primary key, this metric gives the percentage of unique values in a column|
|Functional Dependency||What is the interdependency between columns when determining their values?|
|Value Inclusion||Useful for exposing foreign keys, this metric gives the extent to which column values from one table have corresponding values in a set of column values of another table|
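A few of the simpler metrics in the table above can be sketched in Python to make their definitions concrete. The function name is hypothetical and this is only an illustration of what the Data Profiling Task computes, not how it computes it.

```python
from collections import Counter

def profile_column(values):
    """Compute a few Data Profiling Task style metrics for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        # Column Null Ratio: what percentage of the column is NULL
        "null_ratio": (total - len(non_null)) / total,
        # Candidate Key: percentage of unique values in the column
        "candidate_key_strength": len(set(non_null)) / total,
        # Column Length Distribution: look for unexpected string lengths
        "length_distribution": Counter(len(str(v)) for v in non_null),
        # Column Value Distribution: for discrete columns
        "value_distribution": Counter(non_null),
    }
```

A column whose candidate_key_strength is 1.0 (every value unique) is a good primary key candidate, which is exactly what the Candidate Key profile reports.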
Section 17: How To Train Your Data - Data Mining
17.1) Introduction to Data Mining
Data Mining is a means of deriving hidden knowledge and patterns from a dataset by examining or training the dataset via data mining algorithms. Typically a data mining project involves the following steps:
- 1) Identify: What are the business problems?
- 2) Transform: This is one of the main steps; it includes the data preparation and model training stages.
- 3) Act: The patterns and rules that have been learned can now be put into action in production.
- 4) Measure: As the old adage goes, "what cannot be measured cannot be controlled". So in this stage a measurement of any improvements to the business should be taken.
17.2) Text Mining
SSIS performs text mining using two text mining transformations: the Term Extraction transformation and the Term Lookup transformation. Caveat emptor: both of these transformations are only fully compatible with English-language text and it is not advised to use them with other languages.
Term Extraction Transformation
If you need to extract the key terms from a Unicode string or text column, the Term Extraction transformation can be used. It does this by finding sentence boundaries, identifying words, tagging words in their different forms, stemming words and finally normalizing words by considering their case, depending on requirements.
Additionally, any words that should not be extracted can be added to an exclusion list. Depending on the content being analyzed, the transformation is configurable and the following properties can be adjusted:
|Use Case-sensitive Term Extraction||False by default.|
|Maximum Length of term||How many words in a phrase maximum?|
|Frequency Threshold||How many times must a word appear before being extracted.|
Term Lookup Transformation
This transformation counts the number of times that a term appears in a document. A use case for this transformation is to classify documents. It does this by taking a set of dictionary terms stored in a SQL Server table and applying it to an input data set. Any table that has a column containing a set of key terms can be used as the reference dictionary, and it is possible to use the Term Extraction transformation to derive this column. By default the transformation returns two columns in the transformation output: Term and Frequency.
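The core idea of the Term Lookup transformation can be sketched in a few lines of Python. The function name is hypothetical, the dictionary comes from a plain list rather than a SQL Server table, and matching is naive whole-word counting rather than the transformation's linguistic processing.

```python
def term_lookup(document, dictionary_terms):
    """Count how often each dictionary term appears in a document,
    mimicking the (Term, Frequency) output of Term Lookup."""
    words = document.lower().split()
    counts = {}
    for term in dictionary_terms:
        freq = words.count(term.lower())
        if freq > 0:
            counts[term] = freq
    return counts
```

Running this over a set of documents with a dictionary of category keywords gives per-document term frequencies, which is the raw material for the document-classification use case mentioned above.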
17.3) Data Preparation for Data Mining
- Prior to being mined data must be prepared. Part of this preparation is to gain an understanding of it for example how are data values measured in different variables? This data overview can be gained using OLAP cubes, reports, graphs etc.
- Data mining algorithms are complex and perform several passes through the data, so they can take some time to train on a large dataset. With this in mind it is preferable to choose a sample of the data; a rule of thumb for an indicative sample size is 20,000 cases. This sample should be statistically random, but true randomness is difficult to achieve. To help with this there are two transformations in SSIS: the Percentage Sampling and Row Sampling transformations. Both of these select the rows for their output in a statistically random manner.
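The behaviour of the two sampling transformations can be sketched in Python. The function names are hypothetical; the point is the difference in contract — Row Sampling returns an exact sample size, while Percentage Sampling keeps each row independently with a given probability, so its output size varies.

```python
import random

def row_sampling(rows, sample_size, seed=None):
    """Fixed-size statistically random sample, like the Row Sampling
    transformation: the output always has exactly sample_size rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_size, len(rows)))

def percentage_sampling(rows, percentage, seed=None):
    """Each row is kept independently with the given probability, like
    the Percentage Sampling transformation (output size is approximate)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < percentage]
```

For an indicative 20,000-case training set from a large table, row_sampling gives the exact count, while percentage_sampling is cheaper on streams whose total size is unknown in advance.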
Section 18: Scripting
18.1) Script Task in the Control Flow
As mentioned in the title, script tasks reside in the control flow. A common use case for a script task is to create one to consume the output from the SSIS Data Profiler. Generally speaking developing a script task follows these steps:
- 1) Before coding you need to configure the script task by specifying the properties in the Script Task Editor. These include name, description and which language to use: Visual Basic or Visual C#.
- 2) Entrypoint is the first method that is going to be executed in the Script Task. This is Main by default and it is recommended that this remains unaltered.
- 3) Still in the Script Task Editor, what variables are going to be used? Here you specify which Read-Only and Read-Write variables will be required as shown next.
- 4) Property Expressions can be specified next.
- 5) With properties defined it is now time to start writing the code in the language selected. All code is written in the Visual Studio Tools for Applications IDE and this can be launched by clicking the Edit Script button.
- 6) In coding one of the most utilised features is to instantiate and use the Dts object. It exposes variables, connection managers, Events and logging.
Finally, when finished, it is important for the script task to let the parent package know its execution status. This is communicated via the TaskResult property, which can indicate success or failure.
18.2) Script Component in the Data Flow
Script components are used in the data flow so when added to a data flow the first thing to consider is its context: Data source, data destination or data transformation? As shown next...
As per the Script Task, the metadata required is the same, including name, description and language. But there the similarities end, because the metadata for a script component is a little more complex. The columns have to be detailed for input, output or both:
|Data Source||None||Multiple outputs|
|Data transformation||One Input||Multiple outputs|
|Data Destination||One Input||No outputs|
The component output can be synchronous or asynchronous. The impact of this is in determining whether the component is blocking or non-blocking. If the component's SynchronousInputID property is set to its input's ID then it is synchronous, otherwise it is asynchronous.
Script Component Coding
When writing the script component in VSTA, there are three observable items:
- Main which contains the ScriptMain class, where the code is written.
- ComponentWrapper which contains the UserComponent class.
- BufferWrapper which contains the classes for input and output.
18.3) Custom Components
Much like classes in object oriented programming, custom components are package independent script components that can be incorporated and used in multiple SSIS packages. They follow their own separate software development life cycle and need to be robust enough so they can be deployed to any package they may be required in. This requirement for robustness makes them more complex to develop than the script component mentioned previously. The main reason for pursuing custom components is to promote re-use in an enterprise.
Planning the Custom Component
Some pertinent questions prior to starting development include the following:
|Role?||Data Source, data destination or data transformation?|
|Usage?||When a source will it use multiple outputs? When a destination/transformation multiple inputs?|
|Access to external data?||Where will the data come from: externally or internal data flow buffers?|
|Behaviour?||Blocking: fully, partial or non-blocking?|
|Configuration?||How will the component be configured?|
Development is done via Visual Studio. As mentioned above, custom components are similar to classes in object-oriented programming, and in Visual Studio the Class Library template can be used to start developing the custom component. A fundamental part of development is to include references to these three SSIS libraries:
Design-Time & Run-Time Support
Unlike the script task and component mentioned above, a custom component must provide both design-time and run-time programmatic logic. Why this duality? Before a custom component can be deployed to production it must first be placed in the data flow by the developer, configured and validated.
Section Nineteen: Cleaning Data by Identity Mapping and De-duplicating
19.1) What is Identity Mapping & De-Duplicating?
- What Is the Problem? In most enterprises, the same set of entities can reside in independent databases where they are used for different purposes. This becomes a problem when these disparate datasets have to be merged to form a single unified dataset. An example is a hospital that has the same set of patients stored in different databases, each providing a different context. For instance, a blood bank database may store the patients' details, while the same set of patients may be stored in an emergency database in a slightly different format with a completely different set of primary keys used to identify them. How can these databases be merged when there is no referential integrity on keys (string matching)? When merging, how do I avoid huge cross joins from overwhelming the server (performance problems)? How do I know that an entity is definitely that entity (no authoritative source)?
- So How Do We Match Entities? In order to undertake matching, a string matching algorithm can be utilised. In DQS the nGrams algorithm is used. This algorithm does the following to compare the similarity of two strings:
- Tokenize both strings (break them up into substrings of length n; these are the nGrams)
- Determine how many nGrams are the same between both strings
- Derive a coefficient by dividing the number of nGrams that are the same by the total number of possible nGrams
- This coefficient is between 0 and 1; the closer it is to 1, the more similar the strings are
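The steps above can be sketched in Python. The function names are hypothetical, and dividing by the union of both strings' nGrams is one reasonable reading of "total number of possible nGrams" — DQS's internal implementation may differ.

```python
def ngrams(value, n=3):
    """Tokenize a string into the set of its substrings of length n."""
    value = value.lower()
    return {value[i:i + n] for i in range(len(value) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Coefficient between 0 and 1: shared nGrams divided by all
    distinct nGrams of both strings. Closer to 1 means more similar."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Note how sensitive the coefficient is to n: "Smith" and "Smyth" share no trigrams at all, but with n=2 they share "sm" and "th" and score 1/3.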
By way of demonstration and to see nGrams in action, check out the chart below from Google Books that uses nGrams to show when SQL came to prominence and started to be discussed significantly:
Master Data Services has other string similarity algorithms available including:
Although these techniques are effective, it is recommended to clean the data as much as possible first using the DQS cleansing transformation (discussed next). This helps to reduce the number of rows that require string matching. After cleansing, a T-SQL INNER JOIN can be used to eliminate exact matches from the search space.
19.2) DQS Cleansing & Matching
As mentioned in the previous section, cleansing the data first using the DQS Cleansing transformation helps to reduce the number of rows that require approximate matching.
- The DQS Cleansing transformation uses a DQS knowledge base created in advance (as discussed here on how to create a KB) to cleanse data. The cleansing transformation has some advanced configuration options including Confidence level, Appended Data, Reason, Appended Data Schema and Standardize Output.
For DQS matching, a matching policy Knowledge Base (KB) has to be prepared. The matching policy contains one or more matching rules which define which domains will be used for the matching process. A very useful feature is the ability to weight domains in accordance with their importance in the matching process. For instance, some domains may have to be an exact match and not just similar. Alternatively, some domains may be a pre-requisite for matching.
19.3) SSIS Fuzzy Transformations
- The Science Bit... Fuzzy Transformations Algorithm: Briefly, fuzzy transformations combine public algorithms and internal components when matching. The following are involved in the process:
- Tokenizer: Tokens are substrings of the original strings; a tokenizer parses a string into a set of tokens.
- Jaccard Index (Similarity Coefficient): As mentioned under Tokenizer above, tokenization creates sets, and these are important because the Jaccard index compares sets.
- Weighted Jaccard Index: The fuzzy transformations add weights to each item in the set. Frequent tokens get lower weights and, conversely, less frequent tokens get higher weights.
- Edit (Levenshtein) Distance: This algorithm is also used by the fuzzy components. It measures the total number of character insertions, deletions or substitutions it takes to convert one string into another.
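The public algorithms named above can be sketched in Python. These are textbook implementations for illustration, with hypothetical function names; the internal token weighting the fuzzy transformations actually use is not documented here, so the weights dictionary below is an assumption.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def weighted_jaccard(tokens_a, tokens_b, weights):
    """Weighted variant: rare tokens carry higher weights, so they
    influence the score more than frequent tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = sum(weights.get(t, 1.0) for t in set_a | set_b)
    inter = sum(weights.get(t, 1.0) for t in set_a & set_b)
    return inter / union if union else 1.0

def levenshtein(s, t):
    """Edit distance: minimum number of character insertions, deletions
    and substitutions needed to turn one string into the other."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]
```

Down-weighting a very frequent token such as "main" means that two addresses sharing only "main" score much lower than the plain Jaccard index would suggest, which is exactly the intent of the weighted variant.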
The fuzzy transformations only operate on character data, so if number or datetime comparison is required it may be preferable to use DQS matching. That said, where there is a large volume of records to compare, the fuzzy transformations perform much better than DQS.
SSIS Fuzzy Lookup Transformation
The Fuzzy Lookup transformation performs identity mapping. Matching is controlled via three parameters:
- Similarity threshold: A lower similarity threshold will try to match more rows. It is recommended to start with a higher threshold and gradually reduce it on each iteration.
- Token Delimiters: How are tokens delimited? This is useful for different languages.
- Max number of matches to return per input row
Output from the fuzzy lookup includes two columns: _Confidence and _Similarity. It is important to note that the fuzzy lookup transformation builds an index called a match index on the reference table it is using. Depending on the circumstances you may want to control the creation of this index, especially if the same reference table is being used multiple times; in this scenario reusing the match index avoids a potentially lengthy index build each time it is required.
SSIS Fuzzy Grouping Transformation
The Fuzzy Grouping transformation helps with de-duplicating. The transformation uses a canonical row to represent the standard, or most plausibly correct, row to compare other rows against. Fuzzy Grouping adds the following columns to the output:
- _key_in: This is effectively the primary key for the transformation, identifying each row.
- _key_out: Tags duplicate rows with the same ID for identification. In the canonical data row, _key_out = _key_in. Similarly, the _key_out value for a group is the same as the _key_in value of the group's canonical row.
- _score: A value between 0 and 1 denoting how similar the input row is to the canonical row. Naturally the canonical row has a score of 1.
Excel Fuzzy Lookup Add-in
This is available as a free plugin for Excel and, once installed, adds an extra tab as shown below. It enables you to perform identity mapping with rows of Excel data. The algorithm is the same as the one used in the Fuzzy Lookup transformation.
[Added on 11.10.2016 | joeblogs.ie]