The Jedox Roadshow comes to Brisbane

Jedox is one of the best unknown data platforms I know of.  Calling them unknown may sound a little harsh because they are a major product; however, I say unknown because I don’t see them being considered for purchase by a large number of companies when they are clearly a fit-for-purpose product – […]
Schema Definitions from an ODBC Text Source

Recently, I’ve worked on a disjointed migration project where things just didn’t work the way you expect.  For example, you usually like to access data direct from its source (direct from a relational database) but in this situation, this was just not allowed and we were given a load of text files (hundreds) to load.  Perhaps […]
BI Success – Does the tail wag the dog?

I am watching a recorded event this morning about the keys to Business Intelligence success. The top 4 points that were provided are: finding the right (executive) sponsor; don’t build what they ask for, build what they need; offer up a buffet (deliver multiple options); don’t try to boil the ocean. Let’s be honest, there’s nothing […]
Data Vault (CDVDM) Training Down Under

If you’re looking to get some great Data Vault training in Australia in March, the Genesee Academy is running the CDVDM course in Brisbane, Melbourne and Sydney.  You can get the schedule dates at the Genesee site.  It’s a great course and well worth the investment for anyone looking to get into Data Vault […]
Automation of Excel Macros

There is still huge value in reporting and using Excel for end users.  Excel is simple to work with, allows data access and allows end users to program with minimal learning.  As users develop their skills they inevitably look at automation through macros.  So if we want to turn this into something a little […]
Great Courses Down Under

For all Australian readers, there are a couple of great courses coming to Australia.  Firstly, Barry Devlin is running a 2 day course from 24th August (see here).  For those that don’t know Barry, he is the unofficial father of data warehousing.  He was there at the start and never got the media attention of others.  What […]
Chernoff Faces, R and Power BI

Chernoff faces are one of my favorite visualizations because they allow you to incorporate many attributes into a single visualization.  Compared to a bar chart (which only displays one attribute), you can get a ‘full picture’ of the situation in a single iconic image (pardon the pun).  Further, faces are a naturally recognizable characteristic (for […]
Getting Started with Biml Video

I am a huge fan of Biml and think that the ability to conceptually model data movement produces huge benefits when compared to the more traditional methods. It must also strike a chord with others because my first post on biml has been made into a video by the folks at webucator.  You can check […]
Temporal Tables != Data Warehouse

With the release of SQL2016 just around the corner, there are an increasing number of posts appearing about new features that mean I don’t have to work.  One post (which motivated me to write this) claimed that this breakthrough would mean we no longer needed to store historical values – an interesting interpretation of a […]
MySql Import Issues with Tabular and OldGuids

In my last post, I looked at a MySql connection property that causes loads to fail in Power BI and how you can work around that issue (see here).  In this post, I’ll look at those implications in SSAS Tabular.  In this example, I’ve downloaded the MySql ODBC drivers (both x32 and x64) from here […]
MySql Import Issues with Power BI and OldGuids

One of the great things about self service BI is the ability to model data from different sources.  Recently, I have been using MySql sources against different versions of Tabular Models and have run into some issues.  This post shows those issues and how to get around them in Power BI Desktop designer (well at […]
Australia November Data Vault Training

For those Aussie readers that would like to do some Data Vault training (and training I can thoroughly recommend), Hans will be running Data Vault training in November. The sessions (and registration links) are listed below; Sydney : Nov 11-13 Brisbane : Nov 16-17 (2 day course – first day is 12 hrs) Melbourne : […]
Deconstructing XML Context from SQL Server

Recently I have had a requirement to deconstruct XML in SQL Server and present fragments of that XML to users as field values.  This can be achieved relatively easily using a combination of XQuery/XPath against XML data that is stored in fields within a SQL Server table.  We’ll start by looking at some simple cases and […]
BIML Annotations

BIML offers a great method to classify and manage code through annotations.  Basically, an annotation is just a tag that provides the ability to include text into our BIML code.  No big deal right?  Well perhaps not, however, if we think about how we can use this feature a little deeper, we’ll see there’s some […]
Building the Master Child Execution with BIML

In a previous post we looked at using BIML to execute a stored proc in order to assign a return value to a variable in BIDS.  The use case for this is pretty simple, when we execute a package, we want to get an ID for the execution and then record execution information against that […]
BIML, Stored Procedures and Return Parameters

Despite the improved logging features that have been incorporated into SQL2012+, I am a great fan of defining your own logging methods.  The most common method for this is having a package recognised by an ‘audit key’ and then using that key as a base point to record information that you want throughout the package.  […]
BIML – Importing Text Files

Here’s a BIML snippet for reading data from a text file.  If you’ve read the prior post about pushing data (from a table) to a text file, you’ll see a lot of the same constructs.  There is: a connection object (ie the FlatFileConnection) that specifies a path to a file.  The connection requires a FileFormat. […]
SSRS – Add a little Pizzazz to SSRS Reports with Row Numbers and Pop-Out Windows

Canned reports are the bread and butter of information delivery – I have read that up to 70% of users can be classified as consumers, or the type of person that gets a report and is happy with that. There is no desire to analyse additional datasets, examine Twitter sentiment or speculate on correlations.  Consumers just […]
Cleaning Your Database with BIML

Blow it away and start again :) When we are working in a dynamic environment, we need to push through changes quickly, and a prerequisite for this is starting from scratch: starting with a blank database and deploying the build from a clean state.  If you can’t drop and recreate the database, then the next […]
Barry Devlin & the Business un-Intelligence Course

I really enjoy Barry Devlin’s approach to Business Intelligence.  He is an under-appreciated name in the industry (he was one of the originators of ‘Data Warehouse’ way back in the mid-’80s), so when he has something to say, it’s worth a listen – especially when he challenges ‘modern’ perceptions of Data Warehousing and the BI […]
SSIS – Creating an OLEDB Connection to an ODBC Source and SSIS parameters

This post was originally going to be about using the advanced features of a SSIS OLEDB command to set parameters.  I’ve been using Oracle sources lately and the OLEDB command did not recognise the parameters for my query.  For example, in my work environment, the statement (below) just did not recognise the key_field parameter […]
Data Vault Training – Australia in 2015

The Genesee Academy’s Data Vault training has been announced for 2015 in the Australian region.  Hans Hultgren will be running classes in Brisbane, Melbourne, Sydney and Auckland in July and August this year (the dates are below with links to the courses and registration).  It’s a great class (I’ve heard multiple recommendations about this one) […]
Forcing Values into SSRS Matrix Columns and Rows

The concept of the trusty SSRS matrix control is great for dynamically expanding columns and rows based on the object’s source data.  Its behaviour operates in a very similar way to a Pivot table, that is, expanding and contracting based on underlying data. However, when you want to force data to be shown in rows […]
SSRS Report Parameters from Unfriendly Sources

We have recently developed some SSRS reports against non SQL Server sources and have run into a few issues regarding the use of multi-value parameters.  Since we are sure to meet this problem again (and variations of it), this post is just as much a bookmark for ways around the problems that we’ve encountered. […]
Oracle Batches & Variables

More of a prompt for me than an informative post :) Oracle params/variables can be defined in a batch statement through a variable keyword and then assigned a value using the exec (execute) command.  For example, we can define 2 variables and give them a value with the statements below (note the use of := […]
Tableau 9 – Features for User Experience

There’s no doubt about it – Tableau is synonymous with best practice data visualization.  Explain a problem to any consultant about visualization and Tableau’s name will enter the conversation (well, in most independent cases).  Tableau’s approach is targeted at visual investigation – allowing the user to work with data in real time and draw conclusions from it.  An approach […]
A Brief Look at Qlik

I was fortunate enough to attend a Qlik briefing last week and this post is targeted at some of my observations about that product and the offering (or approach) that Qlik takes to the market.  To be explicit, these observations are my own thoughts and were in no way dependent on the briefing. The perception […]
Biml XIII – The Biml Script for the Entire SSAS Database

This final post on biml SSAS processing provides the full script to produce two packages.  It really ties in the first 2 posts to provide a complete solution for our processing scenario.  Remember, our goal is to point our biml at a SSAS database and have a processing package built that only processes the […]
Biml XXII – A Framework for Partition Processing

Following on from the last post on dimension processing, the next operation in our SSAS processing strategy is to process measure group partitions.  If we remember our scenario, our SSAS database takes too long to process (when we try a full process), so we are forced to process dimensions using a ProcessUpdate and then specific […]
Biml XXI – A Framework for Dimension Processing

This post looks at building an SSIS package for dimension processing – we assume that you want fine control over the processing of your SSAS cube and so use different processing activities for dimensions, facts and database (more on facts and databases later).    The reason why you would want to use a different processing option […]
Obtaining Cube Dimension Data

This post started on a technique that could be used to build an automated cube processing package. The idea was that we could simply point our code at a cube, press play and then create a highly defined package in SSIS. Some may suggest that you could simply process the entire database and cut down […]
Biml XX – Automating Lookups (a fundamental dimensional process)

Perhaps the most fundamental activity in the creation of a dimension is the denormalisation of data.  That is, we take the OLTP’s source data and consecutively build a wide (denormalised) table describing the entity or subject area.  This process has a distinct build process whereby we start with the lowest level of granularity and incrementally […]
A Look at BIReady – Data Warehouse Automation

I had previously discussed some of the considerations, implications and issues associated with the automation of data warehousing. That article addressed some common discussion points that I have witnessed on resistance from IT/BI professionals who refuse to concede that their expertise could be replaced by an automation tool. Of course, there are always extremes of […]
BIML XIX – Creating a Staging Area using Biml

Our prior post (Part I and Part II) looked at using biml to extract data from a source system into staging tables. In that post, we focused on using defined content to create the extracts and override the default column mapping that is available in biml. That post relied on the existence of the target […]
Biml XIII – Building Extraction ETL from Metadata (Database Definitions) (Part 2)

This post is the second post (of 2) on using biml to create a content driven extraction layer. The first post can be found at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-building-extraction-etl-from-metadata-database-definitions/

Introduction

Our previous post on extraction (see here and here) looked at using biml (and biml script) to create a package driven by metadata which was defined in biml. We introduced the concepts of modularisation and showed how a group of biml files could be compiled sequentially to separate out the logic of solution components.

Including asset definitions in biml sounds like a good idea: you can have objects defined as structures in code so the entire solution is defined and managed within a single set of project files. However, one issue with this approach (and the problem relates to the SSDT or BIDS representation of biml) is that object definitions are only available for package generation and you can’t visualise and edit the table as you can in MIST (a biml tool from Varigence).

As an alternative to this approach, we can define our ETL as content in a relational model. Then, we can use biml script to build a package based on that content. Note that we are not talking about a data driven modelling approach here. A data driven approach builds ETL and target schema based on the design of a source system. This approach simply uses database content to define how data is moved. The benefits are that there can be any number of interfaces to the model and packages are built with consistency in a very fast manner. We have already defined the target environment and now want some managed way of creating a package(s) to populate it.

Package Biml

Now that we’ve got an idea about how to use iteration in C# code and embed that with biml, we can look at how to define the package that moves the data. We’ll look at this as a two-step process: firstly, create the biml (and script) to iterate over each extraction (the output from the stored proc staging_extractions) and create a container with a truncate task and an empty data flow (as below). Then, in the second step, we can populate the data flow with transforms.

Our proc (staging_extractions) has a list of extraction objects. All we have to do is iterate over those and create the containers and their tasks. This is not dissimilar to the way that we created connections – it’s just that the biml contained in the foreach operator is a little longer. We use fully qualified names with the extraction_id in order to uniquely identify each task.
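For reference, the result set returned by staging_extractions is only sketched here (its physical definition isn’t covered in this post); the columns the script below reads are:

-- one row per source-to-target extraction; the biml script below reads these columns
exec dbo.staging_extractions;
-- extraction_id, connection_source, entity_schema_source, entity_definition_source,
-- connection_tgt, entity_schema_tgt, entity_definition_tgt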

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _extract_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter(" exec dbo.staging_extractions", _con);
    _con_da.Fill(_extract_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Packages>
		<Package Name="002_Extractions" ConstraintMode="Parallel">
			<Tasks>
				<# foreach (DataRow _extraction in _extract_table.Rows){ #>
				
				<Container Name="<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConstraintMode="Linear">
					
					<Tasks>
						<ExecuteSQL Name="truncate_<#=  _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConnectionName="<#=  _extraction["connection_tgt"] #>" >
							<DirectInput>Select 1</DirectInput>
						</ExecuteSQL>
						<Dataflow Name="extract_<#=  _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>">
						</Dataflow>
                    </Tasks>
					
                </Container>
				<#} #>
				
            </Tasks>
        </Package>
    </Packages>
    
</Biml>

I think there is also an interesting use of constraint declaration that’s worth discussing. Notice that the package definition specifies parallel execution? This ensures that each container can execute in parallel. However, for each container I’ve specified its ConstraintMode as Linear, which means that its tasks will have dependencies in the order that they are created. This is a neat trick: we enjoy the benefits of parallel execution across extractions but enforce constraints within each extraction without any added coding burden.

Now that we’ve populated the control flow, we just need to populate the data flow with transformations (get source data, append the date, define the destination). This is shown in the image below. We will do this in two steps: here we just add the tasks and look at the mapping later.

All we have to do is replace the data flow task (above) with the following snippet. You’ll notice that all the script uses data from the _extraction table. The framework simply adds the data source, then the derived column and destination, with names (and tables) that relate to the (current) extraction row in the iteration.


<Dataflow Name="extract_<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>">

	<Transformations>

		<OleDbSource 
		Name="src_<#= _extraction["extraction_id"] + "_" + _extraction["entity_schema_source"] + "_" + _extraction["entity_definition_source"] #>" 
		ConnectionName="<#= _extraction["connection_source"] #>">
			<DirectInput>
				Select * from <#= _extraction["entity_schema_source"] #>.<#= _extraction["entity_definition_source"] #>;
			</DirectInput>
		</OleDbSource>

		<DerivedColumns Name="append_dt">
				<Columns>
					<Column Name="append_dt" DataType="DateTime">@[System::StartTime]</Column>
				</Columns>
		</DerivedColumns>
		
		<OleDbDestination 
			Name="tgt_<#= _extraction["extraction_id"]+ "_" + _extraction["entity_schema_tgt"] + "_" + _extraction["entity_definition_tgt"] #>" 
			ConnectionName="<#= _extraction["connection_tgt"] #>">
				<ExternalTableOutput Table="<#= _extraction["entity_schema_tgt"] #>.<#= _extraction["entity_definition_tgt"] #>"></ExternalTableOutput>
		</OleDbDestination>
	
	</Transformations>

</Dataflow>


Now we turn our attention to column mapping in the destination. If we generate the package and look at the mappings for a task, you’ll note that the columns are not mapped. We can recall from our post on destinations (here) that we can specify source-to-target column mappings with the following sample.


<OleDbDestination Name="some_name" ConnectionName="some_connection">
	<ExternalTableOutput Table="target_table"></ExternalTableOutput>
	<Columns>
		<Column SourceColumn="ProductAlternateKey" TargetColumn="product_id" />
	</Columns>
</OleDbDestination>

Since we are already familiar with the idea of iteration, all we have to do is get a table of mappings for the current extraction and populate the column references. Remember that if we execute staging_column_mappings 1 (where 1 refers to the current extraction_id) we would get a set of results as below.
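The results themselves aren’t important; what matters is that the snippet below expects two columns, column_source and column_tgt. A call for extraction 1 might return something like this (the values shown are indicative only):

exec [dbo].[staging_column_mappings] 1;
-- column_source          column_tgt
-- ProductAlternateKey    product_id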

Now, all we have to do is use script to add a Columns tag to our OleDbDestination. This can be done by adding the following snippet.


<Columns>
	<#
		DataTable _col_maps = new DataTable();
		string _col_sql = "exec [dbo].[staging_column_mappings] " + _extraction["extraction_id"];
		SqlDataAdapter _col_da = new SqlDataAdapter(_col_sql, _con);
		_col_da.Fill(_col_maps);
		foreach (DataRow _map in _col_maps.Rows){ #>
		<Column SourceColumn="<#= _map["column_source"] #>" TargetColumn="<#= _map["column_tgt"] #>" />	
    <# }  #>	
</Columns>

Perhaps what’s not so apparent (it may be more so when you examine the entire script below) is that the generation of the column list is dependent on the extraction_id in the extraction iterator. This is really just a nested loop where the second loop (the inner loop) is dependent on the outer loop. The pseudo code for this is shown as;


Foreach (record _r in [dbo].[staging_extractions])
{
	.. do some stuff
	Foreach (record _map in [dbo].[staging_column_mappings](_r.extraction_id))
	{
		Define the mappings
	}
}

For completeness, here is the entire file for package creation.


<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _extract_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter(" exec dbo.staging_extractions", _con);
    _con_da.Fill(_extract_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Packages>
		<Package Name="002_Extractions" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">
			<Tasks>
				<# foreach (DataRow _extraction in _extract_table.Rows){ #>
				
				<Container Name="<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConstraintMode="Linear">
					
					<Tasks>
						<ExecuteSQL Name="truncate_<#=  _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConnectionName="<#=  _extraction["connection_tgt"] #>" >
							<DirectInput>truncate table <#= _extraction["entity_schema_tgt"] #>.<#= _extraction["entity_definition_tgt"] #>;</DirectInput>
						</ExecuteSQL>
						<Dataflow Name="extract_<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>">
							<Transformations>
								<OleDbSource Name="src_<#= _extraction["extraction_id"] + "_" + _extraction["entity_schema_source"] + "_" + _extraction["entity_definition_source"] #>" 
								ConnectionName="<#= _extraction["connection_source"] #>">
									<DirectInput>
										Select * from <#= _extraction["entity_schema_source"] #>.<#= _extraction["entity_definition_source"] #>;
                                    </DirectInput>
                                </OleDbSource>
								<DerivedColumns Name="append_dt">
									<Columns>
										<Column Name="append_dt" DataType="DateTime">@[System::StartTime]</Column>
                                    </Columns>
                                </DerivedColumns>
								<OleDbDestination Name="tgt_<#= _extraction["extraction_id"]+ "_" + _extraction["entity_schema_tgt"] + "_" + _extraction["entity_definition_tgt"] #>" 
								ConnectionName="<#= _extraction["connection_tgt"] #>">
									<ExternalTableOutput Table="<#= _extraction["entity_schema_tgt"] #>.<#= _extraction["entity_definition_tgt"] #>"></ExternalTableOutput>
									<Columns>
										<#
											DataTable _col_maps = new DataTable();
											string _col_sql = "exec [dbo].[staging_column_mappings] " + _extraction["extraction_id"];
											SqlDataAdapter _col_da = new SqlDataAdapter(_col_sql, _con);
											_col_da.Fill(_col_maps);
											foreach (DataRow _map in _col_maps.Rows){ #>
											<Column SourceColumn="<#= _map["column_source"] #>" TargetColumn="<#= _map["column_tgt"] #>" />	
                                        <# }  #>	
                                    </Columns>
								</OleDbDestination>
                            </Transformations>
						</Dataflow>
                    </Tasks>
					
                </Container>
				<#} #>
				
            </Tasks>
        </Package>
    </Packages>
    
</Biml>

The entire code may be a little much to take in all at once and you could suggest that such an approach is quite easy to create by hand. True, for three ‘tasks’, dragging and dropping may be easier than setting up this solution. Remember, though, that the goal of the post(s) is to build packages based on content. Adding tasks would be as simple as adding records to our extract table and creating additional mappings (although the mappings are only needed to the extent that columns must be mapped).


Biml XIII – Building Extraction ETL from Metadata (Database Definitions) (Part I)

This post is the first post (of 2) on using biml to create a content driven extraction layer. The second post can be found at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-buil…nitions-part-2/

Our previous post on extraction (see here and here) looked at using biml (and biml script) to create a package driven by metadata which was defined in biml. We introduced the concepts of modularisation and showed how a group of biml files could be compiled sequentially to separate out the logic of solution components.

Including asset definitions in biml sounds like a good idea: you can have objects defined as structures in code so the entire solution is defined and managed within a single set of project files. However, one issue with this approach (and the problem relates to the SSDT or BIDS representation of biml) is that object definitions are only available for package generation and you can’t visualise and edit the table as you can in MIST (a biml tool from Varigence).

As an alternative to this approach, we can define our ETL as content in a relational model. Then, we can use biml script to build a package based on that content. Note that we are not talking about a data driven modelling approach here. A data driven approach builds ETL and target schema based on the design of a source system. This approach simply uses database content to define how data is moved. The benefits are that there can be any number of interfaces to the model and packages are built with consistency in a very fast manner. We have already defined the target environment and now want some managed way of creating a package(s) to populate it.

The Data Definition

We will continue with the same theme from our previous example (plus a few twists) where we want to extract Adventure Works data into our ‘staging’ area. It’s very simple and demonstrates our intent and something that we can build on later. So to revisit, we have a source database (AdventureWorksDW2012) and we want to transfer data to our staging database (stage). Unlike the prior example, our source and target field names do not match, so we cannot rely on auto mapping in package generation. We can also assume that (unlike the prior example) the target and source table names do not follow a convenient naming convention, and so we must explicitly specify the source and target tables. For example, we may want the stage.dbo.import_product table to be called p_im_01, and from that name we could not derive that it relates to DimProduct in the source. Additionally, we have included an append_dt field in the target which should be populated by the ETL (ie not using a table default). We can see the required mapping in the image below.
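As a rough sketch only (product_name is an illustrative column; the real column list is whatever the mappings define), the target table might look like this:

CREATE TABLE dbo.import_product(
	product_id   int          NOT NULL, -- mapped from a differently named source column (eg ProductAlternateKey)
	product_name nvarchar(50) NULL,     -- illustrative column only
	append_dt    datetime     NOT NULL  -- populated by the package, not by a table default
);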

The Proforma Package

If we consider how a single table would be populated (say, for example, import_product), we might propose a template package structure (as below). Our process is similar to the prior post except that this time our tasks are bounded by a container, we append the field append_dt, and we explicitly map columns in the OLE DB destination.

There is one other thing to remember – the creation of this package should be driven by the content stored in a database.

Definition of a Metadata Environment (database content)

I will not go into how the metadata repository is physically defined. For our purposes, all we need to be concerned about is what we can get out of it. For that, I have a few procedures that produce some output.

[dbo].[staging_connections] which lists connection information (below)

[dbo].[staging_extractions] which lists source and target objects (including their schema and connection)

[dbo].[staging_column_mappings] which lists any column mappings that should be applied (as identified by an extraction_id). For example, extraction 1 (as above) applies the following mapping
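The physical design of these procedures is up to you; as a minimal sketch, the columns that the biml script consumes from each are:

-- [dbo].[staging_connections]     : connection_name, connection_definition (a full OLE DB connection string)
-- [dbo].[staging_extractions]     : extraction_id, connection_source, entity_schema_source, entity_definition_source,
--                                   connection_tgt, entity_schema_tgt, entity_definition_tgt
-- [dbo].[staging_column_mappings] : column_source, column_tgt (the proc takes an extraction_id parameter)
exec [dbo].[staging_connections];
exec [dbo].[staging_extractions];
exec [dbo].[staging_column_mappings] 1;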

Incremental Build Methodology

Just as in the prior post, we’ll break our biml files out into 2 distinct files. The first defines connections and the second defines the extraction package.

Connections Biml

Our connections biml is shown below.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="1" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _con_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter("exec [dbo].[staging_connections];", _con);
    _con_da.Fill(_con_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

    <Connections>
			
        <#
		foreach (DataRow _r in _con_table.Rows){ #>
			<Connection Name="<#= _r["connection_name"].ToString() #>" ConnectionString="<#= _r["connection_definition"] #>"   />
		<# } #>

	</Connections>
    
</Biml>

To understand how the inclusion of biml script interacts with (static) biml, it may be a good idea to look at what actually happens when the above code is run. This can be considered a two-step process whereby, firstly, the script is executed to produce an ‘all biml’ version of the file and, secondly, that ‘all biml’ file is compiled to produce our packages. The output of the first step is shown below.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Connections>
		<Connection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage"   />
		<Connection Name="adventure_works_2012" ConnectionString="Provider=SQLNCLI11;Data Source=.SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012"   />
	</Connections>

</Biml>

We can see that when the script executes, it defines our connections. Now let’s make some observations about how it operates. I will not go into the declarations (ie the stuff between the <#@ … #> tags) and will focus only on the code.

The first block (reproduced below) fills a data table with the results of my connection stored proc. The code block is defined as the space between the opening and closing tags (<# and #>). Anything defined in this block is available (since it’s already been defined) throughout the entire file and is not restricted to the space between the script tags. For example, I could access the data table _con_table in any other code block and that’s exactly what we do later in the file.

<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _con_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter("exec [dbo].[staging_connections];", _con);
    _con_da.Fill(_con_table);	
#>

We also need to recognise that variables may have a limited scope. As stated above, they are still available regardless of how the script block is broken up. Consider how we iterate over rows using foreach (shown in the connections biml above). We declare _r as each row in the table (and then loop over each row); _r’s scope is defined (in C# code) between the two braces ({ }). However, note that we close off the scripting block after the first { and start using static biml again. Since we have defined _r in code between the braces, its value can be used in any code snippet that lies between those braces. That’s how we can use the connection_name and connection_definition fields from _r to mesh our static biml with script values.

As the saying goes, there are many ways to skin a cat, and we have options for producing biml (remember that we are considering execution as a two-step process that first produces an all-biml file and then compiles it). How you output that biml is up to you and you could (if you like) output it directly from code. For example, consider how we define the connection in the row iteration (above).

Notice that the actual connection definition (the <Connection ….. tag) mixes script tags and biml? We could (if we like) output that string entirely from C# code and not worry about mixing the script with the biml. I guess this raises the question: why not generate the entire biml from C#? You could, and the choice is yours :)

<#
foreach (DataRow _r in _con_table.Rows){
		WriteLine("<Connection Name="" + _r["connection_name"] + "" ConnectionString="" + _r["connection_definition"] +"" /> ")	;
}#>

Part 2 continues at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-buil…nitions-part-2/


BIML XVI – Script and Reuse (Part II)

This post continues a prior post about scripting and asset definition in biml. You can find the first part here.

For clarification, we are using biml to define assets (objects in our BI environment) and then use that definition in our biml code. We also introduce biml script as a method of reusing biml code templates.

Biml – It’s More than Just a Package Generator

The complete Biml for the file 01_Env.biml is shown below. In summary, it specifies the objects in our environment (connections, databases, schemas and tables). Although SSIS offers us no ability to define and see tables (perhaps you could argue that a database project may), Biml does, and we can define them within the XML.

It is also interesting to note the relationship between tables and their containing database. That is, a table belongs to a schema which (of course) belongs to a database which exists in a connection. The definition of those properties is required if we are to use ‘lower’ level objects. That is, the database requires a connection (and so on).

It would seem intuitive that a table definition requires columns. After all, could you have a table without them? However, there is no requirement to specify columns within a biml table and this is demonstrated in the tables below.

Finally, the astute reader will notice that I’ve referred to a tier in my code (note the first line). This has not been done before; the purpose of a tier is to guarantee the build order used by the biml compiler. Up until now, we have only generated biml from one file because all requirements have existed in that file. When we submit multiple files (we can build more than one biml file at a time) we want to guarantee the execution order and the tier allows this. We couldn’t (after all) use a table before it’s defined.

 
<#@ template tier="1" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
    <OleDbConnection Name="stage" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2012;Integrated Security=SSPI;Initial Catalog=stage" />
    <OleDbConnection Name="adventure_works" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012" />
  </Connections>
  
  <Databases>
    <Database ConnectionName="stage" Name="stage"  />
  </Databases>
  
  <Schemas>
    <Schema DatabaseName="stage" Name="dbo"  />
  </Schemas>
  
  <Tables>
    
    <!-- ah yes .... of course i should specify columns :) -->  
    <Table Name="Product" SchemaName="stage.dbo" />
    <Table Name="ProductSubCategory" SchemaName="stage.dbo" />
    
    <Table Name="ProductCategory" SchemaName="stage.dbo">
      <Columns>
        <Column Name="ProductCategoryID" DataType="Int32" />
        <Column Name="Name" DataType="String" Length="50" />
      </Columns>
    </Table>
  
  </Tables>
  
</Biml>

Biml Script, Package Creation and the ASP Comparison

The second file in our biml example defines a package. To reiterate how it works, it uses a generic template (described above) and iterates over each table defined in the first biml file, using the name of the table in the template. For this reason, biml script is often compared to ASP – the early scripting approach for generating HTML. In fact, biml is often described as the HTML, with biml script being the ASP code.

Of course this is only possible in this instance because we have employed a consistent design.

<#@ template tier="2" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
      <Packages>
        <Package Name="build_stage" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">
            <Tasks>
                <# foreach( var _table in RootNode.Tables ) { #>
                <ExecuteSQL Name="TRUNC_<#= _table #>" ConnectionName="stage">
                    <DirectInput>Truncate Table dbo.import_<#= _table #>;</DirectInput>
                </ExecuteSQL>
				
                <Dataflow Name="LOAD_<#= _table #>" >
					 
                    <Transformations>
           
                      <OleDbSource Name="SRC_<#= _table #>" ConnectionName="adventure_works">
                            <DirectInput>Select * from Production.<#= _table #>;
                            </DirectInput>
                        </OleDbSource>  
			
 			  <OleDbDestination Name="TGT_<#= _table #>" ConnectionName="stage">
			  	<ExternalTableOutput Table="dbo.import_<#= _table #>"></ExternalTableOutput>
                        </OleDbDestination> 
                    </Transformations>
			 
                    <PrecedenceConstraints>
                        <Inputs>
                            <Input OutputPathName="TRUNC_<#= _table #>.Output">
                            </Input>
                        </Inputs>
                    </PrecedenceConstraints>

                </Dataflow>

                <#  }#>
            </Tasks>
        </Package>
    </Packages>
</Biml>

In examining the script, we’ll look at 2 items. Firstly, the loop identifier/creator: this occurs as the first child of the Tasks node (after all, we want to generate the code for each table). It is defined with the statement below and is similar to C# code.

foreach( var _table in RootNode.Tables ) {}

If you are not familiar with code, we are just creating a loop over the tables (node) in the biml definition (that is, the environment file). The use of {} means that anything between those braces will get repeated for each table. We may remember back to the first post, where we said there are many root child node types.

We have to have some way of telling the compiler (which actually builds the packages) that this is code (and not biml), so any code needs to be enclosed in its special tags (ie <# #>).

Secondly, we need to refer to the table name within each iteration. This is done with an equals sign following the opening tag (that is, <#= #>) and is just the same as saying: print the value of what’s between the opening (<#=) and closing (#>) tags. Since we have defined the _table variable (in the loop), we can use it here to specify the name.

If you’re having trouble understanding what’s going on, the biml script is just re-running the template code (the one with Product) and replacing Product with the table name (which is defined in the first biml file).

Also note that this file is a second tier which means that it will execute after the first.

Execution

We can execute more than one biml file at a time. Simply highlight both and select ‘Generate SSIS Packages’. The definition of tiers will resolve an otherwise random execution order and make sure that the second file can use components that are created in the first.

Conclusion

I think there’s a lot in this post. We’ve looked into using biml to specify our BI assets and then used that definition (or its master data) to create packages. We have also compared Biml to asp and demonstrated how to use script to iterate over template code.


BIML XVI – Script and Reuse (Part I)

All our posts on Biml so far have focused on how to use Biml to mimic SSIS tasks. That is, they’ve looked at how to use Biml to generate a package and focused on task configuration. One post looked at how to generate XML automatically; however, it’s not quite the same as using script to generate code. Of course, the purpose of using script is to reuse a template and generate multiple objects based on master data. If the master data changes, we could automatically regenerate those objects.

The purpose of this post is to introduce Biml script as a way of automation (iterating over a template) and examine the other benefits that Biml can bring to a project. Remember that Biml is an acronym for Business Intelligence Markup Language and hence we would expect it to provide an environment for business intelligence asset definition. In that way we can use Biml to define our environment, rather than just generating SSIS packages from an XML dialect. There are several benefits in this approach;

  1. Multiple people can work on a solution at once.
  2. The solution/project can fully encompass an environment.
  3. The objects defined within the environment can be referenced from it as master data.

To illustrate this process we will look at an example that’s common in building data warehouses/marts – the extraction of staging data. To be fair, the solution presented is by no means “production ready”. We are only interested in demonstrating some concepts, introducing scripting and trying to keep the post concise. So let’s get started.

What Are We Trying To Do Here?

The scenario that we are demonstrating is quite simple. We have three tables in our staging environment that we wish to populate from our production database. You can see an image of the production database (AdventureWorks2012) below on the left hand side and the mapping of tables to the staging database. The fields in the staging database are named in the same way, however the tables are in a different schema (dbo) and are prefixed with import_. Also (for those not too familiar with Adventure Works), we’ll call out that not all fields are duplicated in the staging environment. Note that the table import_Product only has 3 fields whereas there are 25 fields in Production.Product.

If you want to create the staging tables in your own environment, you can use the following snippet.

CREATE TABLE dbo.import_ProductCategory(
	[ProductCategoryID] [int]  NOT NULL,
	[Name] nvarchar(50) NOT NULL
); 

CREATE TABLE dbo.import_ProductSubcategory(
	[ProductSubcategoryID] [int] NOT NULL,
	[ProductCategoryID] [int] NOT NULL,
	[Name] nvarchar(50) NOT NULL
);

CREATE TABLE dbo.import_Product(
	[ProductID] [int] NOT NULL,
	[Name] nvarchar(50) NOT NULL,
	[ProductSubcategoryID] [int] NULL
);

Package Template/Output

What I’d like to achieve with my Biml is to generate a package that moves data from the production table to the target. Each movement would naturally require a truncation of the existing (staging) table and then a dataflow to reload it from production data. An overview of the package is shown below. Note that each movement exists within its own constraints and, in that way, the movements can execute in parallel while each dataflow is dependent on its truncation step.

If we look at the first load (ie Product), we could use the Biml snippet below.

Since all objects (source and destination tables) have a consistent naming convention, we know the schema and table name variations that would be required to produce this code for any table. That is, we would only have to replace Product with ProductCategory (or ProductSubCategory) and we could extract each of those tables. We are relying on the auto mapping feature that was discussed here, which relies on consistent column names, but baby steps first :)

 
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    
    <Package Name="build_stage" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">

      <Tasks>
        
        <ExecuteSQL Name="TRUNC_Product" ConnectionName="stage">
          <DirectInput>Truncate Table dbo.import_Product;</DirectInput>
        </ExecuteSQL>
        
        <Dataflow Name="LOAD_Product">

          <Transformations>
            <OleDbSource Name="SRC_Product" ConnectionName="adventure_works">
              <DirectInput>Select * from Production.Product;</DirectInput>
            </OleDbSource>
            
            <OleDbDestination Name="TGT_Product" ConnectionName="stage">
              <ExternalTableOutput Table="dbo.import_Product">
              </ExternalTableOutput>
            </OleDbDestination>
            
          </Transformations>
          
          <!-- this constraint is for the data flow -->
          <PrecedenceConstraints>
            <Inputs>
              <Input OutputPathName="TRUNC_Product.Output"></Input>
            </Inputs>
          </PrecedenceConstraints>
        
        </Dataflow>
        
      </Tasks>
    </Package>
  </Packages>
</Biml> 

It may be worth pointing out two conditions that may not be that apparent in the snippet (they become obvious once pointed out and the package is generated). Firstly, the package assumes a parallel constraint mode. This ensures that each task is placed on the control flow without any precedence constraints between them. Secondly, the data flow must execute after the truncation, so we need to specify a precedence constraint between it and the (prior) truncation. You can see its position based on the comment in the code – its use may not be obvious straight away, but just keep in mind that this relates to the dataflow task.

Project & Build Logic

It makes a lot of sense to separate out objects into modular components. If we think about it, there is an inherent relationship between objects that dictates a natural order in the task we are trying to achieve. For example, we have an environment with connection (definitions) that specifies target tables and, finally, we have a work specification for our ETL (which we have said will rely on the prior environment’s metadata to generate code).

Aside from that logical sequence for building something, modularisation also provides the ability to separate work among different people – we could have someone responsible for connections, someone different responsible for table definitions and finally a developer working on package snippets.

In our example, we’ll demonstrate this by breaking our ‘project’ into 2 biml files. The first (01_Env.biml) defines the environment and the second (02_Biml) will be used to build our staging package. We can see this in the image below.

What’s Next ?

Continue reading the next instalment at http://paultebraak.wordpress.com/2015/01/05/biml-xvi-script-and-reuse-part-ii/


Biml XV – ForEach Record in the DataSet (Control Flow)

Our prior posts showed techniques using the OLEDB Command to iterate over each row in the dataflow (see Biml XII and Biml XIII). Using that task we fire an SQL statement for each row in the dataflow. Column (field) values can be passed to (and returned from) the command so the technique may be handy when a single (and arguably simple) statement is required.

If we wish to embed more logic, tasks and more precise management into the iterative process, a different approach may be required. We could use a foreach loop container as an iterator. This operates in the control flow (as opposed to the OLEDB Command, which is a data flow task) and embeds iteration in its own object container. Of course, we can still access field values from the dataset and obtain return values; however, the use of a container may improve usability (for SSIS control) because it can hold multiple tasks, implement precedence constraints and execute in its entirety for each row in a dataset.

This post looks at how to implement this solution in Biml.

The Package

The basic implementation of this technique of iteration is shown in the package below.

We don’t go deeply into the package design (since we’ll be creating it in Biml); however, there are a few key takeaways that we should note about the package and the variables used in the technique;

The first Execute SQL task (read_dim) assigns the results of a query (the recordset) into a variable called sub_cat_results. The type of the variable is a System.Object and that variable is in the scope of the package.

The recordset is simple (based on adventure works) and just returns ID’s and Names for product subcategories.

select ProductSubcategoryID  as sub_cat_id
, Name as sub_cat_name
from Production.ProductSubcategory;

The object of the next task (a foreach loop container titled for_each_record) is to iterate over each record in the dataset. As each ‘new’ row is encountered, field values for the row are assigned to the variables sub_cat_id and sub_cat_name. This assignment occurs in the outer part of the task so that those values can be accessed by other tasks within the bounds of the loop.

Additionally the variables (sub_cat_id and sub_cat_name) are scoped so that they only exist within the loop container. There is really no requirement to do this (and they could exist in the package scope) however such a design policy may make the package more concise and modular. You can read more about variable scoping here.

The loop container contains its own tasks which (of course) can access variables in the package and the loop container. We’ve used Execute SQL tasks in this example however, we are not limited to only these tasks.

In our simple example, the query for our select_id task is shown as;

select *
from [Production].[ProductSubcategory]
where ProductSubcategoryID = ?;

Naturally the parameter value passed is held in the sub_cat_id variable.

We have not implemented any precedence constraints in the foreach loop container – it is kept simple for the purposes of demonstration.

The biml

The entire biml for this package is shown below. The key notes on the biml implementation are;

The variable sub_cat_results is defined as a child of the package. Since we want to limit the scope of the sub_cat_id and sub_cat_name variables to the loop container, they are defined as children of that task (ForEachAdoLoop).

The first SQL task (read_dim) needs to return a recordset and assign it to the variable sub_cat_results. In order to do this, we need to set the ResultSet property of the task (to Full) and configure the Results tag of the task.

Our iterator (see the tag ForEachAdoLoop) has properties which specify the variable that holds the recordset (sub_cat_results) and the precedence constraint mode that applies to the tasks within it. Of course, that constraint can be overridden on a task by task basis.

As stated above, the variables that are to be used within the loop container are defined with the container (see the Variables child tag). Had these variables been defined as children of the package, they would not need to be defined at the task (loop task) level.

However, note the naming convention where the variables are used within the package (ie where they are called). The variable mapping within the loop refers to the sub_cat_id variable in a fully qualified manner (iteration_01.for_each_record.User.sub_cat_id). We could also create the package without full qualification (ie User.sub_cat_id) if we wish.

The tasks within the loop container are held within a Tasks tag. This is exactly the same way that tasks are defined within the package, and so we may liken the container’s definition to that of the package.

The Biml

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    
	<Connections>
        <OleDbConnection Name="adventure_works" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorks2014" />
    </Connections>
	
    <Packages>
        <Package Name="iteration_01" ConstraintMode="Linear">
			
			<Variables>
				<Variable Name="sub_cat_results" DataType="Object" />
            </Variables>
            
			<Tasks>
				
				<!-- Get the DataSet Object -->
                <ExecuteSQL Name="read_dim" ConnectionName="adventure_works" ResultSet="Full">
					<DirectInput>select ProductSubcategoryID  as sub_cat_id, Name as sub_cat_name from Production.ProductSubcategory; </DirectInput>
					    <Results>
					    	<Result Name="0" VariableName="User.sub_cat_results" />
					    </Results>
                </ExecuteSQL>
				
				<!-- iterate over each row in the data set -->
                <ForEachAdoLoop Name="for_each_record" SourceVariableName="User.sub_cat_results" ConstraintMode="Parallel">
					<Variables>
						<Variable Name="sub_cat_id" DataType="Int32">0</Variable>
						<Variable Name="sub_cat_name" DataType="String" />
            		</Variables>
                    <VariableMappings>
							<VariableMapping Name="0" VariableName="iteration_01.for_each_record.User.sub_cat_id" />
							<VariableMapping Name="1" VariableName="iteration_01.for_each_record.User.sub_cat_name" />
                    </VariableMappings>
					
					<!-- execute some tasks using the variables passed from the dataset -->
					<Tasks>

						<ExecuteSQL Name="select_id" ConnectionName="adventure_works">
							<DirectInput>select * from [Production].[ProductSubcategory] where ProductSubcategoryID = ?;</DirectInput>
							<Parameters>
								<Parameter Name="sub_cat_id" DataType="Int32" Direction="Input" VariableName="iteration_01.for_each_record.User.sub_cat_id" />
                            </Parameters>
                        </ExecuteSQL>

						<ExecuteSQL Name="select_name" ConnectionName="adventure_works">
							<DirectInput>select * from [Production].[ProductSubcategory] where [Name] = ?;</DirectInput>
							<Parameters>
								<Parameter Name="sub_cat_name" DataType="String" Length="50" Direction="Input" VariableName="iteration_01.for_each_record.User.sub_cat_name" />
                            </Parameters>
                        </ExecuteSQL>
						
						</Tasks>
						
                </ForEachAdoLoop>
            
			</Tasks>
        
			</Package>
    </Packages>
</Biml>

Biml XIV – Variables & Scope

There are probably a few light-bulb moments for SSIS developers that signify an improvement in learning and development style. One of those has to be the use of variables and, with the use of variables, comes the inevitable learning of scope (or the level at which the variable is valid). In Biml, variables are an interesting tag because they can be defined at various places (or levels) within a document, and that level determines the tasks of the package that the variable is valid in.

New developers should also be aware that variable definition is a lot more user-friendly in recent versions of SSDT because a variable’s scope is introduced at the package level and the variable can then be redirected to a task should you wish. Prior versions did not accommodate this and the variable’s scope was locked to the task it was defined in. For example, if you had a sequence container selected and defined a variable, it would only be available in the sequence container (hence its scope was the container).

Overview

To demonstrate variables in biml, we will create a simple package that inserts the value of a variable into a table. That variable will be defined at different levels and therefore we can see how the definition impacts the value that is used. It may be easier to start with the completed package (as below). Notice that there are 2 variables called location, and one is only applicable in the container (some_task_container).

In this situation, the variable’s value is inserted into a table, so we’d expect the first insert to use the value “package” (from the variable variables.location) and the second to use the value “task_container” (from the variable some_task_container.location). Finally, when the last insert fires, the value of the location variable returns to the package scope and the value “package” is used. This is exactly what’s shown in the table’s output (note the insert time stamps).

The Biml

The full biml definition is shown below. There are a few points to note and some features that have not been called out in other posts. These are listed as they relate to each tag;

Variables

Variables can be defined for any tag or child tag of the package. The node at which they are defined determines their scope.

Naturally, they require a name and datatype. The value is defined as the node’s text.

A task referencing a variable is restricted to the closest scope of definition. For example, a task inside the container cannot refer to the package’s location variable while the container’s location variable exists.

Container

The container task includes a required constraint mode (BIDS will error if it is not present).

Tasks within the container need to be nested within a Tasks tag.

Execute SQL

In using a variable for a parameter (as we have done for the query’s parameter), we only need to specify its namespace (User or System) and its name (for example User.location).

Parameters must also be specified by Name and DataType.

 

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  
	<Connections>
		<OleDbConnection Name="dest_stage" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=stage" />
	</Connections>

	<Packages>
		<Package Name="variables" ConstraintMode="Linear">

			<Variables>
				<Variable Name="location" DataType="String">package</Variable>
			</Variables> 

			<Tasks>

			<ExecuteSQL Name="insert_n1" ConnectionName="dest_stage">
				<DirectInput>insert into dbo.variables(value) values(?)</DirectInput>
				<Parameters>
					<Parameter Name="location" DataType="String" VariableName="User.location" /> 
				</Parameters>
			</ExecuteSQL>

			<Container Name="some_task_container" ConstraintMode="Linear"  >

				<Variables>
					<Variable Name="location" DataType="String">task_container</Variable>
				</Variables>					

				<Tasks>
					<ExecuteSQL Name="insert_n2" ConnectionName="dest_stage">
						<DirectInput>insert into dbo.variables(value) values(?)</DirectInput>
						<Parameters>
							<Parameter Name="location" DataType="String" VariableName="User.location" /> 
						</Parameters>
					</ExecuteSQL>				
				</Tasks>

			</Container>

			<ExecuteSQL Name="insert_n3" ConnectionName="dest_stage">
				<DirectInput>insert into dbo.variables(value) values(?)</DirectInput>
				<Parameters>
					<Parameter Name="location" DataType="String" VariableName="User.location" /> 
				</Parameters>
			</ExecuteSQL>

			</Tasks>
		</Package>
	</Packages>

</Biml>


Biml XIII – Oledb Command Part II – Return Variables

Biml XIII – Oledb Command Part II – Return Variables

In our prior post, we looked at the Oledb Command, a data flow transform that fires a statement for each row in the pipeline’s dataset. In that post, we used generic parameters that were defined positionally in our query. For example, our query was;

insert into [dbo].[customers]
(customer_code, first_name, last_name)
 values
  (?, ?, ?);

 

The transform allows us more flexibility and we will look at some of that now. We will look at the execution of a stored procedure (and the variation in parameter specification) and the usage of return parameters (so that values are appended to the dataset as it flows through the buffer).

This post is broken into two components. Firstly, we quickly examine the biml just for the transform – I’m assuming that you can follow based on the prior post. Secondly, we present a more complete example so a full solution can be worked through.

The Task

Often I can delegate logic to a stored procedure, for example, the generation of audit keys. Very simply, we could take the example of inserting a record (see this post), use a stored procedure to insert the data and then return information about that process. Since we are working with customers, let’s just use a procedure to insert a record and return the first letter of the customer’s last name. A procedure would look like this;

Create proc [dbo].[customer_udpate]
(
	@customer_code nvarchar(15)
	, @first_name nvarchar(50)
	, @last_name nvarchar(50)
	, @last_name_first_char nvarchar(1) output
)
as 
begin

	set nocount on;

	insert into [dbo].[customers](customer_code, first_name, last_name)
	values(@customer_code, @first_name, @last_name);

	set @last_name_first_char = LEFT(@last_name,1); 

end 

If we were to execute that through an SQL script, we would use code like the following (with the output below);

declare @customer_code nvarchar(15)='AW00011000';
declare @first_name nvarchar(50)='Jon';
declare @last_name nvarchar(50)='Yang';
declare @last_name_first_char nvarchar(1);

exec [dbo].[customer_udpate]
	@customer_code = @customer_code
	, @first_name = @first_name 
	, @last_name = @last_name
	, @last_name_first_char = @last_name_first_char output ;

select @last_name_first_char as last_name_first_char ;

This is simple for a single SQL query, but what about in SSIS? Remember that, for our situation, the oledb command runs for each row that enters the transform (each row in the buffer) and we want to append the output value to our dataset (so that it can be consumed later in the dataflow). So how does our biml for the oledb command look?

<OleDbCommand Name="insert_customer_dynamic" ConnectionName="dest_stage">
	<DirectInput>
		exec dbo.customer_udpate @customer_code=?, @first_name=?, @last_name=?, @last_name_first_char=? output;
	</DirectInput>
	<Parameters>
		<Parameter SourceColumn="customer_code" TargetColumn="@customer_code" DataType="String" Length="15" />
		<Parameter SourceColumn="first_name" TargetColumn="@first_name" DataType="String" Length="50" />
		<Parameter SourceColumn="last_name" TargetColumn="@last_name" DataType="String" Length="50" />
		<Parameter SourceColumn="last_name_first_char" TargetColumn="@last_name_first_char" DataType="String" Length="1" Direction="Output" />
	</Parameters>
</OleDbCommand>
 

Here we can see a few differences from our last implementation.

  • It is a given, but the query has changed and now matches the SQL command above (note the output clause for the return parameter)
  • TargetColumns (for parameters) are matched by name rather than by the generic Param_n used in the prior example
  • The output parameter includes an additional Direction attribute (specifying Output rather than the default Input)

It should also be apparent that the dataset entering the transform needs a column to hold the return value. That is, we need to append a column to the dataset to store it. This brings us to our ‘full’ implementation.
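Before we get there, a minimal sketch of such a placeholder (it is exactly the derived column used in the full listing that follows) simply adds an empty single-character column ahead of the command;

<DerivedColumns Name="add_first_name_place_holder">
	<Columns>
		<!-- placeholder column that the command's output parameter will populate -->
		<Column Name="last_name_first_char" DataType="String" Length="1">NULL(DT_WSTR, 1)</Column>
	</Columns>
</DerivedColumns>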

The Full Implementation

If we continue with the theme of the prior post, and the stored procedure shown in this post, we can derive a business case in which we want to achieve the following outcomes;

  1. source the customer records
  2. fire the procedure (to insert them into the staging table)
  3. record the procedure’s return result (even if it is just a simple value)
  4. record the outcome of the entire load (that is, the results for each record loaded)

The finished package (dataflow) is shown in the image below. Note that we include a derived column that stores the return value from our procedure in the dataset as it flows to the destination.

The complete biml for this is below. There are no extraordinary considerations that need further explanation other than perhaps the choice of posting the record into the same destination table twice (ie [dbo].[customers]). The only reason this was done was for simplicity and demonstration purposes. If we queried a single customer we could see their entry and audit record in the same table. Certainly not a production solution though.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<Connections>
		<OleDbConnection Name="dest_stage" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=stage" />
		<OleDbConnection Name="src_adventure_works" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2008R2" />
  </Connections>
	
	<Packages>
    <Package Name="oledb_cmd_02" ConstraintMode="Linear">
			<Tasks>
			
				<Dataflow Name="oledb_cmd">
					<Transformations>
						<OleDbSource Name="src_customer" ConnectionName="src_adventure_works">
							<DirectInput>
								select 
								  CustomerAlternateKey as customer_code
								  , FirstName as first_name
								  , LastName as last_name
								from dbo.DimCustomer; 
                            </DirectInput>
                        </OleDbSource>
						
						
						<DerivedColumns Name="add_first_name_place_holder">
							<Columns>
								<Column Name="last_name_first_char" DataType="String" Length="1">NULL(DT_WSTR, 1)</Column>
              </Columns>
            </DerivedColumns>
				
						
						<OleDbCommand Name="insert_customer_dynamic" ConnectionName="dest_stage">
							<DirectInput>
								exec dbo.customer_udpate @customer_code=?, @first_name=?, @last_name=?, @last_name_first_char=? output;
                            </DirectInput>
							<Parameters>
								<Parameter SourceColumn="customer_code" TargetColumn="@customer_code" DataType="String" Length="15" />
								<Parameter SourceColumn="first_name" TargetColumn="@first_name" DataType="String" Length="50" />
								<Parameter SourceColumn="last_name" TargetColumn="@last_name" DataType="String" Length="50" />
								<Parameter SourceColumn="last_name_first_char" TargetColumn="@last_name_first_char" DataType="String" Length="1" Direction="Output" />
              </Parameters>
            </OleDbCommand>

						<OleDbDestination Name="dest_customer" ConnectionName="dest_stage">
							<ExternalTableOutput Table="[dbo].[customers]" />
            </OleDbDestination>
						
          </Transformations>
        </Dataflow>
      </Tasks>
		 </Package>
    </Packages>
</Biml>

Biml XII – Oledb Command (Dataflow transformation)

Biml XII – Oledb Command (Dataflow transformation)

The data flow’s OLEDB Command executes an SQL statement for each row of the data flow. A good example of its use is updating expired records’ end dates for type 2 changed dimensions. I’ve briefly discussed the logic for roll-your-own type 2 (changing) attributes here, with the need to update expired records at point 3.3. In that post, we looked at the conditional split (the biml for its basic functionality) in a very simple example. The main point to reiterate for the OLEDB command is that a command executes for each row of the transform’s input and (naturally) we have access to each column’s value in that row.

As a sidebar to the use of the OLEDB Command (and the overhead of executing a statement for every row in the transform’s input), an alternative method of update is to dump the input into a temporary table and then execute a single update command against the dimension table using the temporary table.
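As a rough sketch of that alternative (the table and column names here are purely illustrative), the dataflow would land the changed rows in a work table and a single Execute SQL task would then expire the matching dimension rows in one statement;

<!-- illustrative only: stage.customer_changes and dbo.dim_customer are assumed names -->
<ExecuteSQL Name="expire_changed_records" ConnectionName="dest_stage">
	<DirectInput>
		update d
		set d.end_date = getdate()
		from dbo.dim_customer d
		inner join stage.customer_changes c
			on c.customer_code = d.customer_code
		where d.end_date is null;
	</DirectInput>
</ExecuteSQL>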

In this post we look at the biml configuration for using the oledb command to insert rows into a table. We’ll keep the focus on customers (code, first name and last name). The completed dataflow will look like the following. The update syntax can easily be based on this example.

The Biml

The package only contains two transforms – the source and the oledb command (acting as a destination). As we would expect, the oledb command requires three basic settings. These are;

Values for Name and Connection – All tasks and transforms require a name and, since we’re executing an SQL statement, we need an OLEDB connection. Because they apply to the entire transformation, these are attributes of the task’s root definition (ie attributes of the OleDbCommand node).

The statement to execute – This is managed by the <DirectInput> tag’s text. Note that parameters are introduced to the statement using the ? symbol.

Parameter specifications that map the input columns to command parameters – These are managed by the transform’s Parameters node (which groups each parameter). Each parameter is defined by its source column, its target parameter position and a data type. Note that the parameter position (the order in which it appears in the SQL statement) is a zero based value (its index) suffixed to the Param_ target column. In defining the data type, only the type is required. For example, we could define a parameter as simply being a String. Where the precision (that is, the length) is not specified, the package will still build and execute, however it will warn that columns are out of synchronisation.

 

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
		<OleDbConnection Name="src_adventure_works" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2008R2" />
		<OleDbConnection Name="dest_stage" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=stage" />
    </Connections>
	
    <Packages>
      <Package Name="oledb_cmd" ConstraintMode="Linear">
			<Tasks>
				<Dataflow Name="oledb_cmd">
					<Transformations>
						<OleDbSource Name="src_customer" ConnectionName="src_adventure_works">
							<DirectInput>
								select 
								  CustomerAlternateKey as customer_code
								  , FirstName as first_name
								  , LastName as last_name
								from dbo.DimCustomer; 
              </DirectInput>
            </OleDbSource>
						
						<OleDbCommand Name="insert_customer_dynamic" ConnectionName="dest_stage">
							<DirectInput>
								insert into [dbo].[customers]
								(customer_code, first_name, last_name)
								values
								(?, ?, ?);
              </DirectInput>
							<Parameters>
								<!--
									note that you do not have to specify the string length;
									if you do not, the package will warn about column synchronisation
								-->							
								<Parameter SourceColumn="customer_code" TargetColumn="Param_0" DataType="String" Length="15" />
								<Parameter SourceColumn="first_name" TargetColumn="Param_1" DataType="String" Length="50" />
								<Parameter SourceColumn="last_name" TargetColumn="Param_2" DataType="String" Length="50" />
               </Parameters>
              </OleDbCommand>
						
             </Transformations>
            </Dataflow>
      </Tasks>
		</Package>
		 
    </Packages>
</Biml>



 


Biml XI – Conditional Splits in Data Flows

Biml XI – Conditional Splits in Data Flows

At first thought, I don’t see a huge need for conditional splits in SSIS; separating the stream (as it were) is something that I usually try to do at the source. That is, we should only bring in as much data as is required by the transform. However, in thinking about it, the transform is an absolute necessity when you create your own SCD (Slowly Changing Dimension) logic. Rather than the native SSIS SCD component, which really doesn’t offer too much flexibility, building your own can offer huge performance improvements and complete control over the package execution.

The basics of a roll your own SCD component work like this;

  1. the dimension table includes a hash column which stores the hash of the changing attributes
  2. the ETL package builds a view of the dimension and in doing so calculates the hash for the changing attributes
  3. the package looks up the hash from the dimension table and then splits the data into three conditional streams;
    1. If there is no hash, the record doesn’t exist and must be added.
    2. If there is a hash and it is not the same as the ETL hash, the changing attributes have changed (from the dimension table) and the new record must be inserted (with the previous being expired).
    3. Finally, if the two hash values match, the changing attributes have not changed and no changes are required to the dimension data.

It’s easy to see here why the split component is needed.
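To make the idea concrete, a sketch of that three-way split in biml might look like the following; the hash column names (dim_hash returned by the lookup against the dimension and etl_hash calculated in the ETL) are assumptions for illustration, and the unchanged rows simply fall through to the default output;

<ConditionalSplit Name="split_scd">
	<OutputPaths>
		<!-- no hash returned from the dimension: a new record to insert -->
		<OutputPath Name="new_records" IsErrorOutput="false">
			<Expression>ISNULL(dim_hash)</Expression>
		</OutputPath>
		<!-- hash exists but differs: insert the new version and expire the old -->
		<OutputPath Name="changed_records" IsErrorOutput="false">
			<Expression>!ISNULL(dim_hash) &amp;&amp; dim_hash != etl_hash</Expression>
		</OutputPath>
		<!-- matching hashes (no change) flow to the default output and are ignored -->
	</OutputPaths>
</ConditionalSplit>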

The Split in Biml

To demonstrate how the split works in Biml, we’ll continue with using customer names from adventure works. Our purpose is simply to create different outputs based on the customer’s last name. That is, if the customer’s last name begins with A they should be sent to one output, B to another, with the remainder going to the default (or catch-all) output. The executed package would appear as below. Note the counts coming into the buffer (18,484) are split between A records (842), B records (1,061) and everything else (16,581).

The Biml for this package is shown below. Note that the outputs for the split (see the ConditionalSplit tag) are defined by an OutputPath (which is of course contained in the OutputPaths root node of the transform). The condition for the split is defined in the Expression node’s text. There is no requirement to define the default output.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="src_adventure_works" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2008R2"/>
    <OleDbConnection Name="dest_stage" ConnectionString="Provider=SQLOLEDB;Data Source=.SQL2008R2;Integrated Security=SSPI;Initial Catalog=stage"/>
  </Connections>
  <Packages>
    <Package Name="split_transform" ConstraintMode="Linear">
      <Tasks>
        <Dataflow Name="split_transform_01">
          <Transformations>
            <OleDbSource Name="src_names" ConnectionName="src_adventure_works">
              <DirectInput>
                select
                [CustomerAlternateKey]  as customer_code
                , [FirstName] as first_name
                , [LastName] as last_name
                from dbo.DimCustomer
              </DirectInput>
            </OleDbSource>

            <ConditionalSplit Name="split_name">
              <OutputPaths>
                <OutputPath Name="Name_A" IsErrorOutput="false">
                  <Expression>LEFT(last_name,1)=="A"</Expression>
                </OutputPath>
                <OutputPath Name="Name_B" IsErrorOutput="false">
                  <Expression>LEFT(last_name,1)=="B"</Expression>
                </OutputPath>
              </OutputPaths>
            </ConditionalSplit>

            <OleDbDestination Name="Customers_Default" ConnectionName="dest_stage">
              <InputPath OutputPathName="split_name.Default" />
              <ExternalTableOutput Table="[dbo].[customers]" />
            </OleDbDestination>

            <OleDbDestination Name="Customers_A" ConnectionName="dest_stage">
              <InputPath OutputPathName="split_name.Name_A" />
              <ExternalTableOutput Table="[dbo].[customers_A]" />
            </OleDbDestination>

            <OleDbDestination Name="Customers_B" ConnectionName="dest_stage">
              <InputPath OutputPathName="split_name.Name_B" />
              <ExternalTableOutput Table="[dbo].[customers_B]" />
            </OleDbDestination>
            
            
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

The XML Gotcha

If you’ve used SSIS for a while, you’ll probably have forgotten the pain you had learning that expressions are C#-based rather than formula-based. For example, to test equality in a formula you’d use == rather than =. When we work with XML (html or any markup-type language), the nature of the syntax may cause some issues when special characters are required. For example, XML is heavily reliant on the characters <> for tags, so how can we use those characters in a formula without the engine’s interpreter mistaking the symbol for a tag?

For example, what would happen if we wanted an output for last names from A up to (but not including) M? In this case we have to substitute the special characters with codes (exactly as happens in HTML). The snippet below shows how we would achieve this by using the character codes for the symbols >, & and <. Our goal here is to split names A-L and M-Z.

<ConditionalSplit Name="split_name">
  <OutputPaths>
	<OutputPath Name="Name_A" IsErrorOutput="false">
	  <Expression>LEFT(last_name,1)&gt;="A" &amp;&amp; LEFT(last_name,1)&lt;"M" </Expression>
	</OutputPath>
	<OutputPath Name="Name_M" IsErrorOutput="false">
	  <Expression>LEFT(last_name,1)&gt;="M"</Expression>
	</OutputPath>
  </OutputPaths>
</ConditionalSplit>


 


Biml X – Derived Columns

Biml X – Derived Columns

This post looks at the biml requirements and functionality of derived columns – how to apply logic to a column (replace a value) and add new columns. The scenario is pretty simple and builds on concepts from other posts, but now includes a little more ‘business logic’ (albeit simple) so we can look at a more typical package.

In this scenario, we start with some territory keys. Of course we could include more data but we want to keep it simple. Based on those keys, we have to look up the territory name. As luck would have it, not all territories have names and if there’s no valid name (that is, no lookup value found), we would like to assign a default value (“Unknown”). Also, we would like to keep a copy of the territory’s original name so we will store that as well. We will be using AdventureWorks2014 as our source database and staging as the destination (called stage).

Setup

The only setup requirement is the destination table (dbo.Territory). This can be added to the staging database with the following snippet;

create table dbo.Territory
(
	TerritoryID Int not null 
	, TerritoryName nvarchar(50) not null 
	, TerritoryNameUnadjusted nvarchar(50)
)

 

Also, we will show the completed package first – this gives an indication of the logic that we are trying to achieve with biml.

The Biml

The full biml is given in the snippet below. Since we have covered the general structure, sources, destinations and lookups in other posts, we will concentrate only on the biml for the derived columns transform.

There are a few things about the tag to note. Firstly, a column change (to the buffer) is managed through the Column tag. This is a child of a Columns parent, which is a child of the DerivedColumns node (sounds obvious but it’s worth a call out). Perhaps this is best explained by examining the configuration for a derived column as in the image below. As we add (or change) more columns in the transform, we add more rows in the grid and (therefore) additional Column tags would be added to our biml. Since XML requires containers, the Columns tag is used to group these in biml.

Secondly, derivation as a new column or replacement column is handled by the ReplaceExisting property of the column tag. By default the column is assumed to be a new column. Additionally (and related to column replacement), the DataType and Length of the column are required attributes but have no effect when the column is being used to replace an existing column. Here, the original type is maintained.

Finally, the formula is applied as the text between the column’s opening and closing tags.

The Package Biml

The Biml for this package is;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="src_oledb" ConnectionString="Provider=SQLOLEDB.1;Data Source=.SQL2012;Initial Catalog=AdventureWorks2014;Integrated Security=SSPI;"/>
    <OleDbConnection Name="dest_oledb" ConnectionString="Provider=SQLOLEDB.1;Data Source=.SQL2012;Initial Catalog=stage;Integrated Security=SSPI;"/>
  </Connections>

    
  <Packages>
    <Package Name ="data_conversion" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
        <Dataflow Name="data_conversion_01">
          <Transformations>
             
            <OleDbSource Name="src_region" ConnectionName="src_oledb" >
              <DirectInput>
                SELECT [TerritoryID] FROM [Sales].[SalesTerritory]
              </DirectInput>
            </OleDbSource>

            <Lookup Name="lookup_region_name" OleDbConnectionName="src_oledb" NoMatchBehavior="IgnoreFailure">
              <DirectInput>SELECT top 5 TerritoryID, Name  FROM [Sales].[SalesTerritory] Order By TerritoryId</DirectInput>
              <Inputs>
                <Column SourceColumn="TerritoryID"/>
              </Inputs>
              <Outputs>
                <Column SourceColumn="Name" TargetColumn="TerritoryNameUnadjusted"/>
              </Outputs>
            </Lookup>

            <DerivedColumns Name="duplicate_territory_name">
              <Columns>
                <Column Name="TerritoryName" DataType="String" Length="50">
                  TerritoryNameUnadjusted
                </Column>
              </Columns>
            </DerivedColumns>
            
            
            <DerivedColumns Name="validate_territory_name">
              <Columns>
                <!-- the DataType and Length are required however they have no effect for ReplaceExisting=true -->
                <Column ReplaceExisting="true" Name="TerritoryName" DataType="String" Length="100"  >
                  (ISNULL(TerritoryName)) ? "Unknown" : TerritoryName
                </Column>
              </Columns>
            </DerivedColumns>
   
            <OleDbDestination Name="dest_territory" ConnectionName ="dest_oledb">
            <ExternalTableOutput Table="dbo.Territory" />
            </OleDbDestination>
             
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>  
  </Packages>
</Biml>

The output (below) shows the derived column has replaced the TerritoryName with Unknown where it was previously null.


Biml IX – The Lookup

Biml IX – The Lookup

Our previous posts’ biml data flows have only included extractions and destinations. We have looked at moving data (the relational way) here and error configuration here. These posts set the framework for discussing other commonly used dataflow transformations. Perhaps the most common of these is the lookup, which is the focus of this post. We look at the biml requirements for three configurations;

  • a bare bones implementation
  • customisation for differing input column names (ie looking up based on different columns) and
  • finally handling lookup errors

The Environment

As with the previous example, we’ll focus on something simple and continue with the customer theme. Our goal, therefore, is this: provided with a list of customer codes (the field CustomerAlternateKey in DimCustomer), we will use biml to create the transformations that look up their first and last names and then populate our staging table (dbo.customer). That staging table can be created with the following sql and, unlike our prior posts, we will not enforce any constraints on it.

create table dbo.customer
(
    CustomerAlternateKey nvarchar(25) NOT NULL,
    FirstName nvarchar(50),
    LastName nvarchar(50)
);

Our proforma script looks like the one below. This includes connections, the package (container), dataflow, and source and destination components within the package and dataflow. You’ll also note that we have left a placeholder for the lookup.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
   <OleDbConnection Name="src_adventure_works_oledb" ConnectionString="…" />
    <OleDbConnection Name="dest_stage" ConnectionString="…" />
  </Connections>

  <Packages>
    <Package Name="lookup" ConstraintMode="Linear">
      <Tasks>

        <Dataflow Name="dft_customer_01">
          <Transformations>

            <OdbcSource Name="src_customer" Connection="src_adventure_works_odbc">
              <DirectInput>select CustomerAlternateKey from dbo.DimCustomer;</DirectInput>
            </OdbcSource>
			
           .... some lookup ....
            
        <OleDbDestination Name="dest_customer" ConnectionName="dest_stage">
	  <ExternalTableOutput Table="[dbo].[customer]" />
	</OleDbDestination>

          
</Transformations>
        </Dataflow>
	   
      </Tasks>
    </Package>
  </Packages>
</Biml>

Part 1 – Bare Bones

If we were to think about how the lookup transformation works, we can determine the minimum requirements for the lookup. There is an input to the component as a base, then we would specify the

  • source for our lookup (by data source and query)
  • how the two datasets relate to each other (matching field)
  • the columns (from the lookup data set) that are to be appended to the original dataset.

This is not complex and we would also expect the column matching (and outputs) to look like the image below. Note the match is on CustomerAlternateKey and the FirstName and LastName fields are returned.

The SQL query to look up values (that is, to provide the lookup table) is defined as;

Select
CustomerAlternateKey
, FirstName
, LastName
from dbo.DimCustomer;

When we look at the biml required for this, we can see this minimal set of options defined as child nodes of the lookup task. That is;

  • The Lookup tag specifies the data source to use as an attribute and initialises the transform.
  • The source query is specified via the DirectInput tag
  • The Inputs tag specifies what fields to match on.
  • The Outputs tag specifies what columns are returned from the lookup table (or lookup query)

<Lookup Name="lookup_names" OleDbConnectionName="src_adventure_works_oledb" >
  <DirectInput>
	Select 	CustomerAlternateKey , FirstName, LastName	from dbo.DimCustomer;
  </DirectInput>
  <Inputs>
	<Column SourceColumn="CustomerAlternateKey" />
  </Inputs>
  <Outputs>
	<Column SourceColumn="FirstName" />
	<Column SourceColumn="LastName" />
  </Outputs>
</Lookup>

Now may also be a good segue to remember how the lookup works and note that it is not a join. The data buffer flows into the transform and columns are appended based on the lookup. By default, the transform will fail if no match is found; however, the component is not a join.

Part 2 – Matching Column Names

In Part 1 (above), data is looked up with an auto match based on column names (note the Column nodes of the Inputs tag). However, commonly, the base column names and lookup column names are not the same. For example, consider an input to the transform that specifies CustomerAlternateKey as code_lhs and a lookup query that defines the code as code_rhs.

If this is the case, we must explicitly specify the Source and Target column names for the Inputs tag. The OledbSource and Lookup for this situation would be defined as in the biml below;

<OleDbSource  Name="src_customer" ConnectionName="src_adventure_works_oledb">
	<DirectInput>select CustomerAlternateKey as code_lhs from dbo.DimCustomer;</DirectInput>
</OleDbSource>            

<Lookup Name="lookup_names" OleDbConnectionName="src_adventure_works_oledb">
    <DirectInput>
		select
			CustomerAlternateKey as code_rhs
            , FirstName
            , LastName
        from dbo.DimCustomer;
    </DirectInput>
    <!-- input columns are always included in the output, if the match does not occur on name, we specify the columns -->
    <Inputs>
		<Column SourceColumn="code_lhs" TargetColumn="code_rhs" />
	</Inputs>
		<Outputs>
			<Column SourceColumn="FirstName" />
            		<Column SourceColumn="LastName" />
		</Outputs>
</Lookup>

<OleDbDestination Name="dest_customer" ConnectionName="dest_stage">
	<ExternalTableOutput Table="[dbo].[customer]" />
    <Columns>
		<Column SourceColumn="code_lhs" TargetColumn="CustomerAlternateKey" />
    </Columns>
</OleDbDestination>

Note also that, since the buffer (or base table) defines the CustomerAlternateKey as code_lhs, there is no auto mapping for the data destination and therefore we must explicitly specify the mapping.

It should also be apparent that, if the lookup is based on a composite key (or more than one column), each matching column should be expressed as its own Column tag.
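For example, a lookup matching on a hypothetical composite key of customer code and country code would specify a Column tag for each pair;

<Inputs>
	<!-- one Column tag per key column; the names here are illustrative only -->
	<Column SourceColumn="customer_code" TargetColumn="code_rhs" />
	<Column SourceColumn="country_code" TargetColumn="country_rhs" />
</Inputs>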

Part 3 – Configuring Match Behaviour

The most common configuration for the lookup component is what to do in the case of lookup failures (that is, what happens to the component’s output and operation when a match is not found for an input row). The default behaviour is to fail the task; however, there are several other options, which include;

  • Redirect the unmatched records to a new output (NoMatch)
  • Ignore the issue and pass null values as the values for lookup fields (of course these only apply to the unmatched rows)
  • Redirect the unmatched records to the Error output

These configurations are handled in a very similar manner and rely on the NoMatchBehavior property of the Lookup tag. What is important is how the output buffers are used. That is, if you specify that match failures be sent to the Error output, you should specify a destination that consumes it (note the InputPath specification for each destination). I’ve used the word should because such an oversight will not throw an error in the package (creation or execution), but rows sent to an unconsumed output will simply be lost.

The Biml for this data flow (redirecting unmatched rows to the NoMatch output) is shown below. Also note that we will force rows to that output since our lookup table is restricted to 100 rows.

<Dataflow Name="dft_customer_03">
  <Transformations>
	<OleDbSource  Name="src_customer" ConnectionName="src_adventure_works_oledb">
	  <DirectInput>select CustomerAlternateKey from dbo.DimCustomer;</DirectInput>
	</OleDbSource>
	<Lookup Name="lookup_names" OleDbConnectionName="src_adventure_works_oledb" NoMatchBehavior="RedirectRowsToNoMatchOutput">
	  <!-- other NoMatch Options {IgnoreFailure, RedirectRowsToErrorOutput, FailComponent, RedirectRowsToNoMatchOutput} -->
	  <DirectInput>
		select top 100 
		CustomerAlternateKey
		, FirstName
		, LastName
		from dbo.DimCustomer;
	  </DirectInput>
	  <Inputs>
		<Column SourceColumn="CustomerAlternateKey" />
	  </Inputs>
	  <Outputs>
		<Column SourceColumn="FirstName" />
		<Column SourceColumn="LastName" />
	  </Outputs>
	</Lookup>
	
<OleDbDestination Name="dest_customer" ConnectionName="dest_stage">
	  <InputPath OutputPathName="lookup_names.Match" />
	  <ExternalTableOutput Table="[dbo].[customer]" />
	</OleDbDestination>

	<OleDbDestination Name="dest_customer_errors" ConnectionName="dest_stage">
	  <InputPath OutputPathName="lookup_names.NoMatch" />
	  <ExternalTableOutput Table="[dbo].[customer]" />
	</OleDbDestination>

  </Transformations>
</Dataflow>

You’ll also note that the no-match destination is physically the same table as the match (success) destination. This is for illustration purposes only (and some succinct database design). The execution of this is shown below;


Conclusion

This post has looked at various biml snippets for the lookup task.

 


BIML VIII – Managing Errors in Data Destinations (ADO Destination)

BIML VIII – Managing Errors in Data Destinations (ADO Destination)

In a previous post we looked at the biml required to send data to an ODBC destination. Relational destinations are the bread and butter of SSIS (well, the work I do anyway) and one of the most common ways to manage insert errors is to redirect the dataflow to a more forgiving destination. This post will look at how to achieve that.

Environment

The example that we’re going to use is pretty contrived – we will send some sample data through to a table with a primary key. Initially the table is empty and we expect all the data to load. A second data flow will then try to add all the records again, and we expect the records originally inserted to be rejected into an errors table.

It’s a pretty simple example where I am relying on the primary key (and duplicates) to raise the error on the second load. Since my tables are initially empty, I know the number of errors that I am expecting (ie the number of records loaded in the first load). The SQL to create the two tables is;

create table dbo.customer 
(
	CustomerAlternateKey nvarchar(25) not null primary key 
	, FirstName nvarchar(50)
	, LastName nvarchar(50)

);
create table dbo.customer_errors 
(
	CustomerAlternateKey nvarchar(25)   
	, FirstName nvarchar(50)
	, LastName nvarchar(50)

);

The Biml

We’ll start off with the biml we left off with in the previous post. There are a few small changes. You’ll notice that we’re relying on auto-mapping of column names, are using an ado.net destination, are selecting 100 customer records (source) and have changed a few names.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};Server=.;Database=AdventureWorksDW2012;Trusted_Connection=yes;" />
    <AdoNetConnection Name="dest_stage_ado" ConnectionString="Data Source=.;Initial Catalog=stage;Integrated Security=true;" Provider="System.Data.SqlClient" />
  </Connections>
    <Packages>
      <Package Name="copy_customer_with_error" ConstraintMode="Linear">
        <Tasks>
          <Dataflow Name="dft_customer_01">
            <Transformations>
              <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
                <DirectInput>
                  select top 100 CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
                </DirectInput>
              </OdbcSource>
              <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer" >
                <ExternalTableOutput Table="[dbo].[customer]" />
              </AdoNetDestination>
            </Transformations>
          </Dataflow>
        </Tasks>
      </Package>
    </Packages>
</Biml>

Next, we’ll add a second dataflow (called dft_customer_02). The purpose of this flow is to reinsert (all) records into the customer table. At the moment, this dataflow is exactly the same as the first except that our source doesn’t restrict the number of records.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};Server=.;Database=AdventureWorksDW2012;Trusted_Connection=yes;" />
    <AdoNetConnection Name="dest_stage_ado" ConnectionString="Data Source=.;Initial Catalog=stage;Integrated Security=true;" Provider="System.Data.SqlClient" />
  </Connections>
    <Packages>
      <Package Name="copy_customer_with_error" ConstraintMode="Linear">
        <Tasks>
          
          <Dataflow Name="dft_customer_01">
            <Transformations>
              <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
                <DirectInput>
                  select top 100 CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
                </DirectInput>
              </OdbcSource>
              <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer" >
                <ExternalTableOutput Table="[dbo].[customer]" />
              </AdoNetDestination>
            </Transformations>
          </Dataflow>

          <Dataflow Name="dft_customer_02">
            <Transformations>
              <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
                <DirectInput>
                  select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
                </DirectInput>
              </OdbcSource>
              <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer" >
                <ExternalTableOutput Table="[dbo].[customer]" />
              </AdoNetDestination>
            </Transformations>
          </Dataflow>          
          
        </Tasks>
      </Package>
    </Packages>
</Biml>

Now, we’ll add the error configuration. If we think about what we are trying to achieve, we need to alter our dft_customer_02 dataflow to add another destination (to capture the errors) and configure the first destination, dest_customer, to redirect errors to it. The snippet for dft_customer_02 becomes;

<Dataflow Name="dft_customer_02">
<Transformations>
  <OdbcSource Name="src_adventure_works" Connection="src_adventure_works" >
	<DirectInput>
	  select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
	</DirectInput>
  </OdbcSource>
  
  <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer" >
	<ExternalTableOutput Table="[dbo].[customer]" />
	<ErrorHandling ErrorRowDisposition="RedirectRow" TruncationRowDisposition ="RedirectRow" />
  </AdoNetDestination>
  
  <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer_errors"  >
	<InputPath OutputPathName="dest_customer.Error" />
	<ExternalTableOutput Table="[dbo].[customer_errors]" />
  </AdoNetDestination>
</Transformations>

</Dataflow> 

The key components to include for error configuration are the definition of error handling (in dest_customer) and the InputPath for the errors destination. For ErrorHandling, both attributes (ErrorRowDisposition and TruncationRowDisposition) are required. Also note that the input path for the error table is the output path of the prior component (the component name suffixed with “.Error”).

The Result Package

When we execute this biml and run the package, our output resembles the following. Note the 100 rows are correctly inserted in dft_customer_01 as we expect. Unfortunately, the second data flow (dft_customer_02) pushes all records to the error output. Since we only have 100 records in our customer table, we would expect only 100 rows to be sent to the error output because these are the records that violate the primary key (constraint).

The reason for this is the default configuration of our dest_customer destination (in dft_customer_02) – by default it tries a bulk load.

When a data destination that supports bulk inserts is used and any insert fails, the entire buffer is rejected. We can see this setting in the SSIS configuration for bulk inserts in the screenshot below. The same applies to fast loads on ODBC destinations, which can use a batch processing (bulk) mode or a row-by-row insert. Naturally, the row-by-row methods will only reject the rows that fail.

This is a property of the destination, and our completed code (below) disables the fast load (via UseFastLoadIfAvailable) for the dest_customer destination in dft_customer_02. When we run our package, the second data flow then rejects only the 100 records that were previously inserted (remember to initialise the environment by deleting the records from the tables in our staging database).
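Pulled out of the complete code below, the relevant fragment is just the destination itself – note the UseFastLoadIfAvailable attribute sitting alongside the error handling;

<AdoNetDestination ConnectionName="dest_stage_ado" Name="dest_customer" UseFastLoadIfAvailable="false">
	<ExternalTableOutput Table="[dbo].[customer]" />
	<ErrorHandling ErrorRowDisposition="RedirectRow" TruncationRowDisposition="RedirectRow" />
</AdoNetDestination>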

Complete Code

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};Server=.;Database=AdventureWorksDW2012;Trusted_Connection=yes;" />
    <AdoNetConnection Name="dest_stage_ado" ConnectionString="Data Source.;Initial Catalog=stage;Integrated Security=true;" Provider="System.Data.SqlClient" />
  </Connections>
    <Packages>
      <Package Name="copy_customer_with_error" ConstraintMode="Linear">
        <Tasks>
          
          <Dataflow Name="dft_customer_01">
            <Transformations>
              <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
                <DirectInput>
                  select top 100 CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
                </DirectInput>
              </OdbcSource>
              <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer" >
                <ExternalTableOutput Table="[dbo].[customer]" />
              </AdoNetDestination>
            </Transformations>
          </Dataflow>

          <Dataflow Name="dft_customer_02">
            <Transformations>
              <OdbcSource Name="src_adventure_works" Connection="src_adventure_works" >
                <DirectInput>
                  select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
                </DirectInput>
              </OdbcSource>
              
              <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer" UseFastLoadIfAvailable="false">
                <ExternalTableOutput Table="[dbo].[customer]" />
                <ErrorHandling ErrorRowDisposition="RedirectRow" TruncationRowDisposition ="RedirectRow" />
              </AdoNetDestination>
              
              <AdoNetDestination ConnectionName ="dest_stage_ado" Name="dest_customer_errors"  >
                <InputPath OutputPathName="dest_customer.Error" />
                <ExternalTableOutput Table="[dbo].[customer_errors]" />
              </AdoNetDestination>
            </Transformations>
            
          </Dataflow>          
          
        </Tasks>
      </Package>
    </Packages>
</Biml>

 


BIML VII – Extracting Data Using ODBC Destinations

BIML VII – Extracting Data Using ODBC Destinations

In previous posts we looked at using Biml to extract data from our database tables to flat files and raw files. While these destinations can be handy, the majority of extracts that I use are to staging tables in a relational engine. This post looks at the Biml required to send our data to ODBC destinations.

The Environment

For the purpose of demonstration, I will use the DimCustomer table from adventure works. There’s nothing too complicated about that table; all I want to do is extract some data from it, so we’ll just use a simple query (with code, first name and last name);

select 
CustomerAlternateKey
, FirstName
, LastName 
from dbo.DimCustomer;

For the destination, we’ll look at two options. Firstly, we will rely on the auto-mapping feature (so that we do not specify column mappings between the source and destination). Secondly, we will override that feature and specify column-to-column mappings. For this I’ll create two tables in a staging database called customer_automapping (with the same field names as the source) and customer_manualmapping (with different names). The SQL for these tables is given below;

create table dbo.customer_automapping
(
	CustomerAlternateKey nvarchar(25)
	, FirstName nvarchar(50)
	, LastName nvarchar(50)
);

create table dbo.customer_manualmapping
(
	customer_cd nvarchar(25)
	, first_name nvarchar(50)
	, last_name nvarchar(50)
);

Proforma Template

We’ll use Biml to create two parallel dataflows in a package (the package will be called copy_customer) and the data flows will be called customer_automapping and customer_manualmapping. Our template package looks like the one below;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="dest_stage" ConnectionString="Driver={SQL Server Native Client 11.0};...." />
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};..." />
  </Connections>
  <Packages>
    <Package Name="copy_customer" ConstraintMode="Parallel">
      <Tasks>
        <Dataflow Name="customer_manualmapping">
			<Transformations>
			</Transformations>
        </Dataflow>
        <Dataflow Name="customer_automapping">
			<Transformations>
			</Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Note we have already defined placeholders for each dataflow; all we have to do is populate those (transformation) nodes. We also define our source and destination connections.

The ODBC Source

We have previously addressed an ODBC source, so for simplicity we’ll just add the code. This needs to be added to each dataflow’s transformation node;

<OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
  <DirectInput>
	select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
  </DirectInput>
</OdbcSource>

The ODBC Destination

We’d expect the ODBC destination to require a name (since every item requires one), a connection and a table (name). These are the only requirements for the automapping feature (where the source fields are automatically mapped to the destination fields on the basis of name). The snippet for the automapping dataflow is therefore;

<Dataflow Name="customer_automapping">
  <Transformations>
	<OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
	  <DirectInput>
		select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
	  </DirectInput>
	</OdbcSource>
	<OdbcDestination Name="dest_customer" Connection="dest_stage" >
	  <ExternalTableOutput Table="[dbo].[customer_automapping]" />
	</OdbcDestination>
  </Transformations>
</Dataflow>

Note that the table is defined by the ExternalTableOutput tag. There is no need to fully qualify the table as I have done (by inclusion of the schema) and <ExternalTableOutput Table="customer_automapping" /> would suffice provided the default schema for the connection was the same as the table’s. As expected, our dataflow creates the source and destination components and automatically maps fields from the source to the destination based on name.

To explicitly set the source-to-target mappings, we need to include a Columns tag (containing Column tags) as a child node of the destination. The Biml for this is below, but it is worth noting that a mapping will only be overridden if it exists in this tag. For example, if two columns have the same name and there is no column mapping specification, the two fields will still be auto-matched. Therefore, to fully specify the mapping in the ODBC destination we would use the following snippet;

<OdbcDestination Name="dest_customer" Connection="dest_stage" >
  <ExternalTableOutput Table="[dbo].[customer]" />
  <Columns>
	<Column SourceColumn="CustomerAlternateKey" TargetColumn="customer_cd"  />
	<Column SourceColumn="FirstName" TargetColumn="first_name"  />
	<Column SourceColumn="LastName" TargetColumn="last_name"  />
  </Columns>
</OdbcDestination>

It’s clear to see that the mapping occurs between the SourceColumn and TargetColumn attributes of each column node.

The Complete Biml File

For completeness, we’ll add the complete Biml Code;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="dest_stage" ConnectionString="Driver={SQL Server Native Client 11.0};…." />
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};…;" />
  </Connections>
  <Packages>
    <Package Name="copy_customer" ConstraintMode="Parallel">
      <Tasks>
        <Dataflow Name="customer_manualmapping">
          <Transformations>
            <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
              <DirectInput>
                select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
              </DirectInput>
            </OdbcSource>
            <OdbcDestination Name="dest_customer" Connection="dest_stage" >
              <ExternalTableOutput Table="[dbo].[customer]" />
              <Columns>
                <Column SourceColumn="CustomerAlternateKey" TargetColumn="customer_cd"  />
                <Column SourceColumn="FirstName" TargetColumn="first_name"  />
                <Column SourceColumn="LastName"   />
              </Columns>
            </OdbcDestination>
          </Transformations>
        </Dataflow>
        <Dataflow Name="customer_automapping">
          <Transformations>
            <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
              <DirectInput>
                select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
              </DirectInput>
            </OdbcSource>
            <OdbcDestination Name="dest_customer" Connection="dest_stage" >
              <ExternalTableOutput Table="[dbo].[customer_automapping]" />
            </OdbcDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Overwriting the Automap

There may be a situation where you want to Automap some fields but not others (say for example we did not want to populate the CustomerAlternateKey field). In this situation we can rely on auto-mapping and specify the column as not being used through the IsUsed attribute. For example, we could use;

<OdbcDestination Name="dest_customer" Connection="dest_stage" >
  <ExternalTableOutput Table="[dbo].[customer_automapping]" />
  <Columns>
	<Column SourceColumn="CustomerAlternateKey"  IsUsed="false" />
  </Columns>
</OdbcDestination>

 


BIML VI – Using Include to Separate BIML Components into Files

BIML VI – Using Include to Separate BIML Components into Files

In a previous post, we looked at using the XML clause in SQL Server to generate Biml’s XML. The basis of that post was that our package had a pro forma template and we could just substitute code into placeholders. For example, the foundation snippet (below) could just be used and we could replace the … my snippet … components with automatically generated code (which was of course derived from the FOR XML clause in SQL Server).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhostSQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    … my raw file connection snippet …
  </Connections>
    
  <FileFormats>
    ... my format snippet ...
  </FileFormats>
  
  <Packages>
    <Package Name="02_Copy_Table_To Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
	  ... my task snippet ...
	  </Tasks>
    </Package>
  </Packages>
</Biml>

In that example I pasted the code directly into the placeholder. Notwithstanding the generation of XML (other systems may not include the XML clause or we may wish to refine the query to generate the code), placing the full XML into the snippet placeholders creates a rather large file. This may make it unfriendly, hard to work with and confusing to someone expected to pick up the code.

One method around this is to use the Biml include directive to separate the Biml code into component files. In order to do this we simply place the snippet into a file (the file’s extension is irrelevant) and then reference that file in the Biml code.

For example, if we had a file (F:\temp\biml\connections.txt) that defines our connections, we could simply include the following directive in our Biml script.

<#@ include file="F:\temp\biml\connections.txt" #>

Of course, the connections.txt file contains the XML that we would expect to be placed in the … snippet… section. When the Biml is run, the contents of the file are read and are placed where the directive is.
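For context, a minimal sketch of the calling file might look like the one below (the connections file path is the same hypothetical one used above and the packages section is left as a further placeholder);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<Connections>
		<!-- the contents of connections.txt (our OleDbConnection tags) are placed here when the biml is expanded -->
		<#@ include file="F:\temp\biml\connections.txt" #>
	</Connections>
	<Packages>
		... my package snippet ...
	</Packages>
</Biml>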

 


BIML V – Precedence Constraints

BIML V – Precedence Constraints

In prior posts, precedence constraints have been managed through the package setting for ConstraintMode. This assumes either a parallel execution (ie ConstraintMode=Parallel) or sequential execution (ie ConstraintMode=Linear). The ConstraintMode is a mandatory property for the package and considers the order by which tasks are defined in code.

To illustrate this we’ll add three SQL tasks and observe the generated package when the constraint mode is set.

Linear Constraint

A Linear constraint creates a precedence constraint between tasks in the order in which those tasks are added in script (as below)

<Package Name="Precidence_Constraints" ConstraintMode="Linear">
  <Tasks>
	<ExecuteSQL Name="Select 1" ConnectionName="DB_Src">
	  <DirectInput>Select 1</DirectInput>
	</ExecuteSQL>
	<ExecuteSQL Name="Select 2" ConnectionName="DB_Src">
	  <DirectInput>Select 2</DirectInput>
	</ExecuteSQL>
	<ExecuteSQL Name="Select 3" ConnectionName="DB_Src">
	  <DirectInput>Select 3</DirectInput>
	</ExecuteSQL>
  </Tasks>
</Package>

 

Parallel Constraint

 

A parallel constraint simply adds the tasks to the control flow (as below). Note that since all tasks are defined in parallel, there is no execution order (as defined by a constraint).

 

<Package Name="Precidence_Constraints" ConstraintMode="Parallel">
  <Tasks>
	<ExecuteSQL Name="Select 1" ConnectionName="DB_Src">
	  <DirectInput>Select 1</DirectInput>
	</ExecuteSQL>
	<ExecuteSQL Name="Select 2" ConnectionName="DB_Src">
	  <DirectInput>Select 2</DirectInput>
	</ExecuteSQL>
	<ExecuteSQL Name="Select 3" ConnectionName="DB_Src">
	  <DirectInput>Select 3</DirectInput>
	</ExecuteSQL>
  </Tasks>
</Package>

 

Naturally, we may want more control over the order in which tasks are executed and we can do this by overwriting a task’s default constraints (which are set by the package). That is, overwriting the ConstraintMode that we set. For example, assume that we want to fire [Select 1] and then [Select 2] in parallel with [Select 3], as in the following image.

 

 

In order to do this, we need to set the precedence constraint for [Select 3] (of course I’m assuming that we have a linear constraint mode to create the constraint between [Select 1] and [Select 2]). This is a property of the task and is set on the task the constraint flows into. For example, [Select 3] is defined by the following snippet.

<ExecuteSQL Name="Select 3" ConnectionName="DB_Src">
  <DirectInput>Select 3</DirectInput>
  <PrecedenceConstraints>
	<Inputs>
	  <Input OutputPathName="Select 1.Output" />
	</Inputs>
  </PrecedenceConstraints>
</ExecuteSQL>

 

Note that the precedence constraint is a node of the task and the input of the constraint (the task to be executed prior to this one) is defined by name and suffixed with “.Output” (ie OutputPathName=”Select 1.Output”).

 

Yes We Overwrite the Constraint

 

It is important to remember that we are overwriting the default constraint for the package. While this may not be an issue with parallel execution (since we need to define each constraint manually), it may be an issue for linear constraints.

 

For example, suppose that we add [Select 4] to the package and we want it to fire in parallel with [Select 1] and be an input constraint for [Select 3] (as below).

 

 

We may assume that we can just append [Select 4] as the final task in the package and specify the precedence constraints for [Select 3] as [Select 1] and [Select 4]. The pseudo xml (yes I made that up) is;

 

<ExecuteSQL Name="Select 1" />
<ExecuteSQL Name="Select 2" />
<ExecuteSQL Name="Select 3" >
  <PrecedenceConstraints>
	<Inputs>
	  <Input OutputPathName="Select 1.Output" />
	  <Input OutputPathName="Select 4.Output" />
	</Inputs>
  </PrecedenceConstraints>
</ExecuteSQL>
<ExecuteSQL Name="Select 4" />

 

Unfortunately, the linear mode for the package is enforced whenever the precedence constraint is not set for a task. The above snippet would produce the following package.

 

Notice that the precedence constraints between [Select 3] and [Select 4] form a circular dependency? That is, [Select 3] can only execute after [Select 4], and [Select 4] depends on [Select 3].

 

Clearly this is not the outcome that we anticipated (and in my testing it crashes Visual Studio).

 

If we wish to overwrite the package's default constraint (it was Linear), we must explicitly set the input constraint to be nothing. This can be done with the following snippet;

 

<ExecuteSQL Name="Select 4" ConnectionName="DB_Src">
	<DirectInput>Select 4</DirectInput>
	<PrecedenceConstraints>
	  <Inputs />
	</PrecedenceConstraints>
</ExecuteSQL>  

 

We can define tasks in any order, but we need to be aware that a linear constraint will assume a sequential execution order.
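Pulling these pieces together, a sketch of the complete package for the scenario above (untested here, but assembled from the snippets already shown) would be;

<Package Name="Precedence_Constraints" ConstraintMode="Linear">
  <Tasks>
    <ExecuteSQL Name="Select 1" ConnectionName="DB_Src">
      <DirectInput>Select 1</DirectInput>
    </ExecuteSQL>
    <ExecuteSQL Name="Select 2" ConnectionName="DB_Src">
      <DirectInput>Select 2</DirectInput>
    </ExecuteSQL>
    <ExecuteSQL Name="Select 3" ConnectionName="DB_Src">
      <DirectInput>Select 3</DirectInput>
      <PrecedenceConstraints>
        <Inputs>
          <Input OutputPathName="Select 1.Output" />
          <Input OutputPathName="Select 4.Output" />
        </Inputs>
      </PrecedenceConstraints>
    </ExecuteSQL>
    <ExecuteSQL Name="Select 4" ConnectionName="DB_Src">
      <DirectInput>Select 4</DirectInput>
      <PrecedenceConstraints>
        <Inputs />
      </PrecedenceConstraints>
    </ExecuteSQL>
  </Tasks>
</Package>

Here [Select 2] picks up its constraint from the package's Linear mode, [Select 3] takes its two explicit inputs, and the empty Inputs node leaves [Select 4] free to start alongside [Select 1].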

Other Properties of the Constraint

Perhaps the most important property (when dealing with multiple input constraints) is the logical condition. That is, whether all inputs, or only one, need to complete before the task runs, and whether the prior task(s) complete with success or failure.

 

We can set the logical condition through the task's <PrecedenceConstraints> tag by using the LogicalType property. This property has two allowable values (And, Or) and specifies either that all input constraints must execute prior to this task (if And is specified) or that only one prior task must complete before this one (if Or is specified). We can see the outcome of the Or condition below. Of course there is no need to specify this explicitly; by default, And is used.

 

<ExecuteSQL Name="Select 3" ConnectionName="DB_Src">
  <DirectInput>Select 3</DirectInput>
  <PrecedenceConstraints LogicalType="Or">
	<Inputs>
	  <Input OutputPathName="Select 1.Output"  />
	  <Input OutputPathName="Select 4.Output" />
	</Inputs>
  </PrecedenceConstraints>
</ExecuteSQL>

 

Finally, we may want to specify a failure condition, for example, [Select 3] should only be fired if [Select 4] fails and [Select 1] is successful. This type of control is usually used for error handling. We can do this by specifying the EvaluationValue property on the constraint's input. Note that this snippet assumes And (default) logic for the constraints.

 

<ExecuteSQL Name="Select 3" ConnectionName="DB_Src">
  <DirectInput>Select 3</DirectInput>
  <PrecedenceConstraints>
	<Inputs>
	  <Input OutputPathName="Select 1.Output"   />
	  <Input OutputPathName="Select 4.Output" EvaluationValue="Failure"/>
	</Inputs>
  </PrecedenceConstraints>
</ExecuteSQL>

 

Conclusion

 

There is a high degree of control available for managing the execution order of tasks. This can be handled with minimal effort by utilising default values for package properties (that is, the package's ConstraintMode). Additionally, this default behaviour can be overwritten by explicitly specifying the constraints for each task.

 


BIML IV– Extracting All Tables to Raw Files

BIML IV– Extracting All Tables to Raw Files

Perhaps the first task of any (DW/DM) project is to extract data – after all, it's the E in ETL isn't it? I know there's a lot more that goes on here but it's the most boring job I can think of doing and one which inevitably has to be done. In a previous post (Biml III), I looked at using Biml to extract a table to a raw file. This post will extend that so that we can automatically generate the code required to do the task – that is, we want to take a source database and automatically create a package to extract all the tables from that database.

To be honest, the Biml framework was put in place in the prior post. What I like about what we are going to do here is the quick generation of a package to achieve this. We will also look at some XML SQL and, while you could argue that the process is a bit of a hack, it gets a result very fast. Also, we could extend this practice to other database platforms – all we want to do is generate Biml code. So let's begin.

Overview

I've already stated that our objective is to generate Biml so that we can create a package which will extract all tables from our database. In order to do this, we are going to create the XML specification through a set of queries that populate the proforma for a single table extract. Here's the proforma;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhostSQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <FileConnection Name="Target" FilePath="F:\Output.raw"  />
  </Connections>
   
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
	…. (column list)
      </Columns>
    </RawFileFormat>
  </FileFormats>
 
  <Packages>
    <Package Name="02_Copy_Table_To Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
              <FileOutput ConnectionName ="Target" />
            </RawFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

 

There are 3 major things we'll need to define. These are;

  1. The output file connection which is held in the Connections node. Let's assume that its name should include the schema and table_name. For example, the extract for Person.Address should output a file called Person_Address.
  2. The file format for a raw file which is simply the list of columns that we want to export (held in the FileFormats node). There is a parent node of the column list which is defined by name and, (in our work), we should specify this by the schema and table.
  3. The data flow transform with an Oledb source and raw file destination (mapping will be managed by name matching). The target uses the file connection (defined in step 1) and the file format (step 2) which is defined by schema and table_name. Since we want to give our data flow components a meaningful name we’ll also prefix them with src (for source) and dest (for destination).

Now, let's look at how we can generate the XML for each of our 3 components.

Target Files

The generation of target files is pretty straightforward. All we need is a query that produces a file connection node (the file output), one for each table in the database. The query below does this and produces the output. There's nothing too complicated here, but if you've never heard of the information schema on SQL Server, you may want to read a little more about the views here. Basically, they define the database (and what we are interested in is tables and columns).

The query,

/*	PTB, 10-Nov-2014
	SQL to Generate Biml FileConnections (all tables in a DB)
 */

declare @output_directory nvarchar(50) = 'F:\DB_DUMP'

select 
'<FileConnection Name="' 
	+ TABLE_SCHEMA + '_' 
	+ TABLE_NAME + '" FilePath="' 
	+ @output_directory + '\' 
	+ TABLE_SCHEMA + '_' + TABLE_NAME + '.raw" />'
from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE'

Produces the following output.
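The rows below are purely illustrative (they assume AdventureWorks tables – Person.Address is the one used later in the post, the others are just for the example); each row is a ready-made FileConnection element;

<FileConnection Name="Person_Address" FilePath="F:\DB_DUMP\Person_Address.raw" />
<FileConnection Name="Person_Person" FilePath="F:\DB_DUMP\Person_Person.raw" />
<FileConnection Name="HumanResources_Employee" FilePath="F:\DB_DUMP\HumanResources_Employee.raw" />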


All we need to do is take the output from the query and paste it into our Connections node.



File Formats

All the file format needs to do is specify a list of columns under a RawFileFormat node. Of course the format must have a name but, in essence, we need to produce some XML that looks like the following snippet. And it needs to be done for every table that we are going to extract.

    <RawFileFormat Name="Person_Address">
      <Columns>
        <Column Name="AddressID" />
        <Column Name="AddressLine1" />
        <Column Name="AddressLine2" />
        <Column Name="City" />
        <Column Name="StateProvinceID" />
        <Column Name="PostalCode" />
        <Column Name="rowguid" />
        <Column Name="ModifiedDate" />
      </Columns>

We are also a little restricted by the capabilities of the raw file output. This destination does not accept certain data types (eg Geography, XML or images) and so these need to be excluded from the definition. If they are not, our package will try and map source columns to these fields and the transform will fail (actually, it won’t validate). This is not an issue with Biml, rather a limitation of the destination.

Again, if you've not used the INFORMATION_SCHEMA before, you may want to have a quick look at the INFORMATION_SCHEMA.COLUMNS view. This view lists all the columns (and tables) of the database. Since I want XML returned, I can use the FOR XML clause to have the query generate an XML node for me (rather than a standard row set). There is a lot in the FOR XML clause (if you want more information, you can read about it here). My query;

/*	PTB, 10-Nov-2014
	SQL to Generate Biml FileFormats for all tables in a DB
 */

select 

  TABLE_SCHEMA + '_' + TABLE_NAME "@Name"
,  (select COLUMN_NAME "@Name" 
	from INFORMATION_SCHEMA.COLUMNS 
	where TABLE_NAME = t.TABLE_NAME and TABLE_SCHEMA=t.TABLE_SCHEMA
		and  
		( 
			DATA_TYPE <> 'geography'
			and not ( DATA_TYPE='nvarchar' and CHARACTER_MAXIMUM_LENGTH = -1 ) 
			and not ( DATA_TYPE='varbinary' and CHARACTER_MAXIMUM_LENGTH = -1 ) 
			and DATA_TYPE <> 'Xml'
		)
	for xml path('Column'), type, root('Columns')
	)
from INFORMATION_SCHEMA.TABLES t 
WHERE TABLE_NAME in
					(
						SELECT TABLE_NAME 
						from INFORMATION_SCHEMA.TABLES 
						where TABLE_TYPE = 'BASE TABLE'
					)

for XML PATH('RawFileFormat'), type , ROOT('FileFormats')  

Returns an xml node which is shown in the results view like this.


Now, when we click the XML (note that it looks like a URL), a new output window opens up with the following code. This is exactly what we need for the file formats and we can paste it over the proforma FileFormats node.

The Xml output from the File Formats query
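As an indication of its shape only (again assuming the Person.Address table, and trimmed to a few columns and tables), the generated node looks something like;

<FileFormats>
  <RawFileFormat Name="Person_Address">
    <Columns>
      <Column Name="AddressID" />
      <Column Name="AddressLine1" />
      <Column Name="City" />
      <Column Name="PostalCode" />
    </Columns>
  </RawFileFormat>
  <RawFileFormat Name="Person_Person">
    <Columns>
      <Column Name="BusinessEntityID" />
      <Column Name="FirstName" />
      <Column Name="LastName" />
    </Columns>
  </RawFileFormat>
</FileFormats>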

 

 

 

Data Flows

Since we want to include all our data flows in a single package (with one dataflow for each table), we need to generate the xml for the data flow in a packages node. We know that the package has a tasks child which specifies the control flow and each data flow should reside within it. For example, a single data flow would look like this.

      <Tasks>
 
        <Dataflow Name="EXTRACT_Person_Address">
    
          <Transformations>
          <OleDbSource Name="src_Person_Address" ConnectionName="Source">
            <DirectInput>Select * from Person.Address</DirectInput>
          </OleDbSource>
          <RawFileDestination Name="dest_Person_Address" RawFileFormatName="Person_Address" WriteOption="CreateAlways">
            <FileOutput ConnectionName="Person_Address" />
          </RawFileDestination>
        </Transformations>
       
        </Dataflow>
     
      </Tasks>

Now, we need a query to generate the XML for all tables… (and here it is). It gives us the full Tasks node for our package.

/*	PTB, 10-Nov-2014
	SQL to Generate Dataflow's for Biml output (all tables in a DB)
 */

select 

'EXTRACT_' + TABLE_SCHEMA + '_' + TABLE_NAME as "@Name"
, 'src_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/OleDbSource/@Name"
, 'Source' as "Transformations/OleDbSource/@ConnectionName"
, 'Select * from ' + TABLE_SCHEMA + '.' + TABLE_NAME  as "Transformations/OleDbSource/DirectInput"
, 'dest_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@Name"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@RawFileFormatName"
, 'CreateAlways' as "Transformations/RawFileDestination/@WriteOption"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/FileOutput/@ConnectionName"

from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE' 
order by TABLE_SCHEMA, TABLE_NAME
for XML path('Dataflow'), root('Tasks')

Putting it All Together

Because the XML generated is so big, I will not include the full output. However, I think it's good to have an indication of what the output should look like, so a subset of the output will look like this.
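To show how the pieces come together, here is a minimal sketch of the assembled script (the package name Extract_All_Tables and the comment placeholders are mine; the connection is the one from the proforma);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <!-- paste the FileConnection rows from the first query here -->
  </Connections>
  <!-- paste the FileFormats node from the second query here -->
  <Packages>
    <Package Name="Extract_All_Tables" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive">
      <!-- paste the Tasks node from the third query here -->
    </Package>
  </Packages>
</Biml>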

Conclusion

When we build the Biml, we will get a single SSIS package with a data flow for each table (as shown below), with all the effort of generating the transforms handled by Biml. I'd like to go into the FOR XML clause a little further but I think it would complicate what we are trying to achieve (after all, this post is pretty long as it is).

Now … I have to ask … Is this the most elegant solution and could it be done better?

 

Perhaps not but I think it’s a pretty cheeky (and extremely fast) way to generate a package to do an awful lot of copying.

 


BIML III – Extracting Database Tables to Raw Files

BIML III – Extracting Database Tables to Raw Files

In a previous post, we looked at extracting a table to a flat file. Personally, I am not a huge fan of flat file output because you lose so much metadata when a table with all its data types is dumped into text. To preserve this information, I would much rather use a raw file output. This post looks at the Biml required to output to a raw file.

Overview

If you haven't read the post about output to text files, it may be worth a quick read. Basically, we discussed the process that we use to extract a table as the following list of activities.

  1. Create a connection to SQL Server
  2. Add a dataflow task
  3. Specify a source (let’s just use an OLEDB source) and provide the query.
  4. Add a flat file as a destination output then join the source to destination and then configure the output by creating a new flat file connection manager.

Of course, we could replace flat file here with raw file and we would pretty much have the same method. That’s exactly what is required – the only real difference is that the raw file format does not require delimiter specifications. Our proforma template remains the same;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <!-- Placeholder for connection -->
  </Connections>
  <FileFormats>
    <!-- Placeholder for structures-->
  </FileFormats>
  <Packages>
    <Package Name="CopyAccount" ConstraintMode="Linear">
    <!-- Place Holder for a Package -->
    </Package>
  </Packages>
</Biml>

Connections

Our database connection remains the same. The output (raw file) is specified as a file connection.

<Connections>
  <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
  <FileConnection Name="Target" FilePath="F:\Output.raw"  />
</Connections>

 

FileFormat

The file format (remembering that it’s a child of the FileFormats and defines a table structure) only needs to specify the format name and column output.

  
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
        <Column Name="AccountKey"   />
        <Column Name="AccountCodeAlternateKey"   />
        <Column Name="AccountDescription"   />
        <Column Name="AccountType"    />
        <Column Name="Operator"   />
      </Columns>
    </RawFileFormat>
  </FileFormats> 

 

As with text formats, the order is not important, only the Name. We will utilise auto mapping between the source and target.

The Package/ Data Flow

For the text file extract, our package contained a dataflow which simply had a source and destination. The source specified an SQL statement and the destination specified the type (FlatFile) which had an embedded file format. This is an important point… the flat file destination requires a file connection and the connection requires a format.

This is not the same for the raw file output. Here the raw file destination requires both a connection and the file format. If we have a quick look at the dataflow snippet, you’ll see what I mean.

<Dataflow Name="Copy_Dim_Account">
  <Transformations>
	<OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
	  <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
	</OleDbSource>
	<RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
	  <FileOutput ConnectionName ="Target" />
	</RawFileDestination>
  </Transformations>
</Dataflow> 

Now compare this to the flat file and notice that the file format is not required (for text).

<Dataflow Name="Copy_Dim_Account">
  <Transformations>
	<OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
	  <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
	</OleDbSource>
	<FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
	</FlatFileDestination>
  </Transformations>
</Dataflow> 

 

Complete Code

For completeness, the full code to extract a table (or more precisely a SQL query) to a raw file is;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhostSQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <FileConnection Name="Target" FilePath="F:\Output.raw"  />
  </Connections>
  
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
        <Column Name="AccountKey"   />
        <Column Name="AccountCodeAlternateKey"   />
        <Column Name="AccountDescription"   />
        <Column Name="AccountType"    />
        <Column Name="Operator"   />
      </Columns>
    </RawFileFormat>
  </FileFormats>

  <Packages>
    <Package Name="02_Copy_Table_To Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
              <FileOutput ConnectionName ="Target" />
            </RawFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

 

Conclusion

There is not a great deal of variation here between a raw file output and a text file output, so you may ask why include it? In a future post we'll look at a simple solution to output an entire database to raw files. Our goal in that post is to mimic the staging steps of data warehouse creation. However, in order to do that, we need the basics of file output.

 


To Automate or Not? – The Pros and Cons of Data Warehouse Automation

To Automate or Not? – The Pros and Cons of Data Warehouse Automation

I have been thinking about data warehouse automation for at least 2 years now. That’s just a long enough time for me to see the pros and cons, examine the state of affairs, consider the consequences of both camps and lastly consider my reaction to those consequences. I’ve come to realise that the last point (my reaction) plays on my mind because the subject is filled more with emotion than rational decisions. What I hope to achieve with this post is to materialise some of my thoughts on the matter and present a few ideas that may be considered by others (if they were considering utilising this technology).

Before we get into the grit of the discussion, it is probably worth noting that I consider the data warehouse in the more traditional sense. That is, a relational engine that stores and delivers agreed data. Recently, the additional possibilities arising from the big data ecosystem, the concepts of a data lake (or swamp) and the possibilities of schema on read (where data doesn't necessarily have to conform to a model on load) add a layer of complexity to this discussion which we just don't need. We could also add data virtualisation, but that would also cloud the water in what can be simplified to a decision to automate or not.

In my traditional view of this world, the data warehouse needs to deliver two things. Firstly, it needs to store an historical record of data (as it existed) and secondly, it needs to present information to the consumer in the manner that they require. Those two points might seem like real no-brainers but I thought it's better to call them out now because they create some flow-on implications, which are;

  • the historical record of data may not be what the consumer currently wants
  • the ‘delivery’ requirements of the system will change
  • the system must be capable of accommodating changing requirements

Data Warehouse Development

In traditional development, the system is designed at a conceptual level, a data modeller designs and builds the relational framework (the schema) and then ETL developers populate it. The last stage (ETL development) requires the largest effort on a project and has been documented to require more than 60% of total effort.

Now, let’s examine the first two steps. I suggest that the conceptual design is driven by understood practices –to the extent of a cookie cutter template. Regardless of the methodology used, most employ a series of processes and silos that do not change from implementation to implementation. Data is extracted into a staging area, transformed into a warehouse, deployed to a presentation server/layer and delivered via a user interface (reporting/analytic layer). We expect the modeller to use a tool to manage and maintain the modelling process (eg ERwin, DeZign) and would think it unprofessional and prone to error if components were hand coded manually.

In contrast, when it comes to ETL development, most processes are manual. Sure, there's a tool that is used (eg SSIS, Kettle, Talend) which acts as an IDE; however, the development process is still labour intensive. The developer links together components to perform some action or task and each task/package must be manually developed and configured (for example, you would write 10 tasks to extract 10 tables into a staging area).

Is the process of ETL development using an IDE similar to that of the modeller (and can it therefore be argued to be automated)? I think yes and no. Yes, because there is a layer of abstraction between the development process and the output (today we would think it insane to manually write the code in a text editor), and no, because the abstraction layer is not high enough (meaning that it is still very task orientated).

Automation

Enter the realm of automation technologies. To summarise these tools, they;

  1. Generally use a model (and data) driven approach to define the conceptual layer required. For example, you point the tool at the source system to reverse engineer the underlying model, which can later be adjusted (for example to add other entities, relationships etc).
  2. Define a schema based on the prior step (and generate the DDL to define an alternate schema).
  3. Map source systems to the target schema and generate the code and processes to do this.
  4. Manage the execution of the code/processes derived in step 3.
  5. Provide some presentation layer.

The Arguments

So why are automation products considered by purists and coders as second-rate citizens in the world of data warehouse development? There are a few key arguments that I’ve come across;

  • The tool does not produce the schema that we want or we can’t change it the way we want. This contention presents itself in several ways.
    • The modeller suggests that there is a correct way to build the schema and the tool should conform to that (or be capable of alteration to the correct schema). I’ve italicised the word correct here because that can really be a matter of conjecture. Consider storing related data in a different table which is directly related to the first (ie 1:1 relationship). You can convincingly argue both sides (there should be 1 table or 2). I think the key point of this argument is that the tool is not flexible for schema alteration. Sometimes this is true, sometimes it’s just nonsense and based on hearsay or single product evaluation.
  • Another point is that the data is not retained correctly (for example, history is not recorded and is incapable of change). Most people that present this have not completed an in depth analysis of the products available. Just like other process based software, each product is not created equally and it would be foolish to judge the class of products on a limited scope (I am sure there's a product that suits your needs). What is interesting about this point is how much bias is targeted at automation software. The same scenarios can be readily found in traditional implementations where changed requirements (or dare I suggest an oversight in design) render the existing schema redundant. Change and rework is inevitable in this scenario and it's surprising how that circumstance is eliminated from our evaluation process.
  • The tool is a black-box and I can write better transformations. This is perhaps the most common discussion point I see amongst developers. They argue that the tool produces generic code which could be improved had it been created manually (by them). Perhaps this is true and there are some instances where code could be written that outperforms and is more elegant than generated code. However the converse may also be true – that is, manual development may be more prone to poor performance. How can this be? Well, for a start you assume that the developer can deliver the results they say (and that they actually know what they're talking about). But this is not always the case; some developers obtain their positions through bluffs on their resume which are all too often found out too late.
    • When presented with this argument there is a natural assumption that the solution requires elegant and well performing code. Sometimes this is true and sometimes it's not. Performance improvements that look good on paper may have absolutely no benefit to the overall solution even if they deliver a tangible result. For example, a generated process taking 25 minutes compared to a manually created process of 20 minutes may show an improvement; however, is there really an improvement if the entire process completes within the processing window?
  • It's too expensive. When an automation tool is evaluated, its price is naturally considered as part of the evaluation equation. Let's say the typical range of prices is between $50K and $80K. That is evaluated against a consultant for six to ten weeks or a full-time employee for nine months. More than enough time to build it manually, right? – I'm not so sure. ETL projects are notorious for overruns and while we expect our projects to go smoothly, the chances are that there will be overruns and these are not factored into the cost equation.
    • Also, I'd like to consider the situation where the developer(s) leave. In this situation, there is often more unplanned rework through either a lack of knowledge or a different consultant's approach.
  • Further considering the "it's too expensive" line, I would also like to introduce the concept of the time value of data (whoops, information). I have never seen this addressed in a cost benefit analysis, but to my mind it must play a role. The time value of data relates to the increase in productivity, revenue or output gained by having the solution faster (let's assume that an automated tool can do this). What does it mean in monetary terms (to the business) to have data now (and not in six/nine months)? Can they utilise the data warehouse for competitive advantage to increase profitability?

The Plusses – Viva Automation

What can automation deliver that has not been addressed above? On the pro side of automation software we could suggest;

  • A consistent approach. Since the automation software develops everything for us, its output is delivered in a consistent manner. Of course this has good and bad points, but the key is that there is a consistent approach. Developer peculiarities should not exist since the tools operate through meta-data (and generation). Theoretically one developer is equivalent to another so there may be no need for sliding pay scales based on experience or special skills.
  • Since the operator manages the software, one might suggest that developers are not needed (and neither are their specialist skills). This may make resourcing a lot simpler and remove the barriers between IT and the business (provided we accept those barriers exist). There is the potential for a business user with training in the software to build and manage the data warehouse rather than a specialist IT resource. This type of transformation has occurred with accounting software that does not require an accountant to operate it (perhaps only a data entry clerk).

The Emotion

Every discussion I have had around the use of automation tools involves some emotion (in some cases I'd go so far as to say that the conversation is dictated by emotion). To get straight to the point, the idea that specialist developers can be replaced by software is so confrontational that we (myself included) simply refuse to accept it and so live in a state of denial. The denial says "I've been doing this for XX years, have specialist skills and no machine can take my place".

This has happened in so many industries and is such a predictable response, I think all we need do is identify its (potential) presence.

Finally

There have been lots of points in this post. If you think I’ve missed something, want to add your own point or expand on something I’ve mentioned, then just comment or drop me a line. As usual comments are open and welcome.

I also say that all comments will be published (provided they are not spam, contribute to the discussion and are vendor agnostic). If you would like to insert the comment in a section so that the flow of the article remains intact, just let me know and I’ll add it.

The floor is now open :)

****

As a side note, it's time to revise these practices – perhaps more in a later post.

BIML II – Extracting Database Tables to Text Files

BIML II – Extracting Database Tables to Text Files

In this post we look at the Biml required to generate an SSIS package that copies data from a table to a flat file (that is, a text file output). It's a pretty simple example but one that builds on our prior knowledge. We have already gone over the basics of Biml in the first post so we won't worry about an overview of Biml, how to install it or how to generate packages. What's worth remembering though, is that Biml is an XML specification where packages and connections are child nodes of the XML root. We'll also deviate a little bit in this post by using an OLEDB connection to SQL Server and specifying a file format.

Overview

If we think about the way our package would be built in SSIS, we would probably follow a simple set of steps;

  1. Create a connection to SQL Server
  2. Add a dataflow task
  3. Specify a source (let’s just use an OLEDB source) and provide the query.
  4. Add a flat file as a destination output then join the source to destination and then configure the output by creating a new flat file connection manager.

What’s somewhat hidden in this process is that the output file structure is defined automatically when we create a new connection – that is, in the destination wizard (it is handled by the wizard when you create a new connection manager). Keep in mind that the Biml has no idea about this and so we need to specify the file structure somewhere in the Biml code.

So if we are going to achieve this in Biml, we would need to;

  1. Add the connections
  2. Specify the file format
  3. Create a package that has a data flow task
    1. Uses the source connection (with a query) to produce a dataset … and
    2. Places that dataset into an output connection.

So we would expect our structure to be something like this (I’m using a little prior knowledge and know that the FileFormats tag is a root element);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
    <!-- Placeholder for connection -->
  </Connections>

  <FileFormats>
    <!-- Placeholder for structures-->
  </FileFormats>

  <Packages>
    <Package Name="CopyAccount" ConstraintMode="Linear">
    <!-- Place Holder for a Package -->
    </Package>
  </Packages>

</Biml>

Connections

If we follow our previous example, we can simply specify our connections under the Connections node as in the following code (again Intellisense can work wonders here). The OLEDB connection should be pretty straightforward, but what is interesting is the simplicity of the FlatFile connection. As you would expect, it's got a Name and a Path (or destination), but what's interesting is the FileFormat specification. This is a required property for a flat file and specifies the structure of the stream.

<Connections>
  <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
  <FlatFileConnection Name="Target" FilePath="F:\Output.txt" FileFormat="AccountTable" />
</Connections>

FileFormat

We have briefly discussed the creation of a format. For our purposes we could summarise it as the definition of a dataflow (somewhat akin to the definition of a table). As you would expect (and is logical for an XML table definition), the structure defines a header (that is the table container) and the columns within that table.

<FileFormats>
    <FlatFileFormat Name="AccountTable" ColumnNamesInFirstDataRow="true" >
      <Columns>
        <Column Name="AccountKey" Delimiter="Comma" />
        <Column Name="AccountCodeAlternateKey" Delimiter="Comma" />
        <Column Name="AccountDescription" Delimiter="Comma" />
        <Column Name="AccountType" Delimiter="Comma" />
        <Column Name="Operator" Delimiter="LF" />
      </Columns>
    </FlatFileFormat>
</FileFormats>

Naturally each node has to have a Name, and columns also require a Delimiter. You'll notice that the last column's delimiter is a Line Feed (LF or new line). This just ensures that each row appears on its own line.

We could also specify data types for columns and other properties however, these are not mandatory for our example. What is required is that the field names for the FileFormat match those from our table. When the SSIS package is generated, the mappings from the source (OLEDB data source) to the destination (the FlatFile) will be based on name. The order is not important, only the Name.

The Package

If we consider that we have the framework for moving data, we would expect that our package require the following tasks with a precedence constraint between them;

  1. A Data Flow Task
    1. OLEDB Source (which specifies the connection and the SQL Command)
    2. A flat file destination (specifying the file and format).

This is exactly what we need to specify in our package. The Biml for this is;

    <Package Name="Copy_Dim_Account" ConstraintMode="Linear" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
            </FlatFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>

We can easily see how these tasks relate to both the source and destination of the dataflow and how they reference the objects that we previously defined (ie the connection and FileFormat).

Previously, we also examined the use of the package's ConstraintMode as a property which dictates how tasks on the Control Flow relate to each other. That is, a Linear mode creates constraints between tasks whereas a Parallel mode does not.
This property is not materialised in the DataFlow. For example, even if the package has a Parallel constraint mode, the tasks (or transformations) within the data flow will automatically be connected (in the order that they appear in code).

Also note that the source uses a direct input tag as a child of the OLEDB Source which is the same as the syntax for the ExecuteSql tag.

The Complete Solution

For completeness, the full code is ;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
    <FlatFileConnection Name="Target" FilePath="F:\Output.txt" FileFormat="AccountTable" />
  </Connections>
  <FileFormats>
    <FlatFileFormat Name="AccountTable" ColumnNamesInFirstDataRow="true" RowDelimiter="LF" >
      <Columns>
        <Column Name="AccountKey" Delimiter="Comma" />
        <Column Name="AccountCodeAlternateKey" Delimiter="Comma" />   
        <Column Name="AccountDescription" Delimiter="Comma" />
        <Column Name="AccountType" Delimiter="Comma"  />
        <Column Name="Operator" Delimiter="LF" />
      </Columns>
    </FlatFileFormat>
  </FileFormats>

  <Packages>
    <Package Name="Copy_Dim_Account" ConstraintMode="Linear" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
            </FlatFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
  </Biml>

Conclusion

In this post, we have looked at the generation of packages that move data from a table to a text file. This is a simple scenario that demonstrates the minimum requirements for tag specification and utilises automatic mapping of source fields to target.


Data Vault and Data Mining (AusDM)

Data Vault and Data Mining (AusDM)

What does the data vault data warehousing methodology have in common with data mining?

There are two conferences / courses being run in Australia in November, 2014.

Data Vault

The Data Vault data warehousing methodology has gained a lot of traction over the past few years. If you’ve never heard of it, I would suggest that it is somewhat of a cross between normalisation and star schema design (of course I am leaving myself open to a bit of criticism with that definition but it is only half a sentence).

Apart from the mandatory recognition to the inventor (Dan Linstedt), the best book that I’ve read on the methodology was written by Hans Hultgren and it is simply a must have if you are interested in learning the method (you can check it out on amazon here ). I cannot recommend this book highly enough. Anyway, I digress.

Hans will be in Sydney conducting Data Vault training on 12th – 14th November (that is, training to become a certified Data Vault Modeller). He has partnered with Analytics8 and MIP (don't ask me why the prices are different) and while I would never usually promote vendors' training solutions, I can make an exception for training delivered by Hans. If you are interested, you can find out more here.

AusDM

AusDM, or the Australasian Data Mining Conference, is being run in Brisbane at Queensland University of Technology (Gardens Point) on 27th-28th November. The first time I went to this conference, it was purely an academic conference on data mining. If you've never been to an academic conference, they are a lot like other conferences, except that research papers are presented, so the presentations are focused and specific (you'll also need a bit of background in the subject area). To be honest most industry participants find this pretty dry and boring. Research findings lead technical implementation by a good number of years and the work seems mostly theoretical.

What I like about AusDM is that they have an industry focus in addition to the academic presentations. For example, there is a workshop on R which (IMO) makes the price of admission inconsequential (when compared to other R training).

If you would like to know more about AusDM please check out the site http://ausdm14.ausdm.org/home


BIML I – Getting Started with BIML

BIML I – Getting Started with BIML

Traditional SSIS development requires the ETL developer to create a package and then manually add and configure tasks. While this method of visual development is fine, it suffers from poor reproducibility and a maintenance overhead. For example, imagine (a very contrived scenario) where you create a package with a single 'Execute SQL Task' and configure it to fire a procedure. Simple enough, but now suppose you want to create a new package and fire a different proc – here you have to manually go in and follow all the same steps again. Perhaps not so bad, but what if you create 10 packages and then decide that the Execute SQL Task should be prefixed EXSQL? You would have to manually go into each package and rename the task. This could be very cumbersome and it (by nature) carries a maintenance burden (or maintenance overhead).

BIML (Business Intelligence Markup Language) is an XML specification that (amongst other things) allows you to specify the creation of SSIS packages. One of the great things about BIML is that it can also include code snippets, which amplifies the ability to generate code. This first post will form part of a series that shows how to use BIML to create packages and what type of outcomes we can achieve. But (of course) first things first, so let's start with a simple goal and look at using BIML to generate a package that executes an SQL command.

Before we continue, let's quickly revisit my above comments about maintainability for a moment.

An SSIS package is really just an XML file right, so why can’t I just edit that?

If you’ve ever played around with an SSIS package, you’ll probably know that it is an XML file. You can see this by viewing the code of your packages as shown below (right click on the package and view code). So you may think that you can generate or alter this pretty easily.

However, I have found this XML to be very finicky. Suppose you wanted to redefine a connection by editing the script (something which you'd think should be simple, and which I've discussed here). You would expect that you could just examine the code, find the connection string, alter it and save it. But I've never had any success with that approach. When you try to open the file again (using the visual designer) it is invalid, so you're pretty much helpless with the changes you tried to make (sorry but there's no comeback). That's only to change an existing package, so the creation of a package this way is (IMO of course) just not possible. You could use an API but that just seems out of reach for most.

Of course, newer versions of SSIS have project-wide objects (like connections). Using these would naturally require less maintenance; however, we are still faced with task management (alteration and creation) and package development. Further, if we want to define a group of packages by meta-data (say, for example, a list of tables to extract), there is no current way to do it in SSIS. A much simpler approach is BIML, which gives us a generic specification for development.

What do I Need to get running with BIML?

BIML is available when BIDS Helper is installed. BIDS Helper is a community add-in for SQL Server Data Tools (SSDT and BIDS) and is available on CodePlex. If you haven't heard of BIDS Helper before, take some time to look around its features. It really is a must have for business intelligence development in the Microsoft space.

Once BIDS helper is installed, you can add a new BIML file by right clicking on the Project (or SSIS Packages) node and selecting ‘Add New Biml File‘. The file is created in the Miscellaneous Folder (as shown below).

 

A single BIML file can contain the code that creates many packages (as will become apparent) but for now would we will just work with the basics and use it to create a package that fires an SQL statement.

Adding Code

Opening the BimlScript file shows the XML in the file (we should probably rename it to give it a decent name but we will not worry about it this time). When you start, the file just contains the root node (as below).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

What BIML now allows us to do is create objects within the root (and later other nodes). For example, since we want to create a package, we naturally expect that we need to add a package child. Naming is pretty intuitive since it follows the SSIS object names. I’ve already hinted that we can use the same file to create multiple packages so it seems natural that a package tag should be embedded within a packages tag. Something like the following.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package></Package>
  </Packages>
</Biml>

Actually, I did cheat a little (and so can you). The intellisense functionality lets you know what tags are allowed at the position you are in. Remember in XML that a tag is just an identifier for an object, so since I want to do something within my BIML tag (where else would I put it), I can just open a tag and see what options are available. I want to create a package (so a search for package shows I can add packages and of course, within that, I can add a package).

Validation and Creation

Back to our package snippet. We've got the packages declaration (as above) to specify a package within a packages node. You'd expect there is a way to validate the XML to be sure it has the required information to create packages? Well, just right click on the Biml file (in solution explorer) and select 'Check Biml for Errors'. A dialog opens with a list of errors.

 

If we think about it, it makes sense doesn't it? Of course we need to give our packages a name! In contrast, the error for ConstraintMode is something that we take for granted when we generate packages visually. We define the order of tasks in a package through constraints (or connectors). However, Biml has no indication of the layout of tasks other than the order in which they appear in the XML. How tasks relate to each other must be defined in the package declaration and this is the purpose of the ConstraintMode property. There are two allowable values for this property (either Parallel or Linear) which (as the name suggests) specify that tasks will either not be connected or will be connected sequentially.

So in order to create a package, we update the Biml code as (note that since we are only having one task in this package the constraint mode is irrelevant);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01"></Package>
  </Packages>
</Biml>

We now have everything that we need to generate packages from the script. Simply right click on the Biml file and select Generate SSIS Packages (as above). You'll see a new package created under the SSIS Packages node in solution explorer. Of course, there are no tasks in our package yet.

Adding the Execute SQL task to the Package.

We have already seen that the XML nodes and naming conventions are very similar to those of SSIS (both in structure and definition). If we think about adding an Execute SQL task to the package, then there would be a few points we could make (or at least assume);

  1. a task would be embedded within a tasks node
  2. the task would probably be defined by name (or a task type)
  3. an execute SQL task requires a connection
  4. an execute SQL task requires a command
  5. the connection would be required before the command.

It turns out that the Biml to add this task to our package is pretty straightforward. Of course you can use the IntelliSense to suggest what nodes are available and, when you look at the code and think about what you're trying to do, it seems to make sense (that is, create a connection, create a package, add a task to the package that uses the connection and has some syntax).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;Data Source=.;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW" />
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="fire_proc_01" ConnectionName="SQL01">
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Perhaps the most interesting thing about this snippet is the connection object. We'd expect it to be embedded within the Connections node; what is interesting is that this node sits at the root, which gives us the advantage of reuse, so adding another task to the package, or additional packages and tasks, can reuse that connection. For example, to create 2 packages, each with 2 Execute SQL commands (where the first must complete before the next), we could write;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;… (as above)" />
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01">
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01">
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
    <Package ConstraintMode="Parallel" Name="package_02">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01">
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01">
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>

 

This produces package_01 (shown below) and we can see how the specification of ConstraintMode has caused SQL_CMD_1 to link to SQL_CMD_2.

 

Note that the output from the second package definition (where ConstraintMode is Parallel) does not introduce precedence constraints between the 2 tasks.

 

Conclusion

This post has looked at using Biml to create packages in SSIS. It's a pretty simple example, but we can see how the Biml XML can be used to automatically create (and recreate) packages. If we wanted to change a property (say, for example, the SQL syntax), we could just change it in the code and regenerate the packages.

In later posts, we’ll extend this functional knowledge for more complicated scenarios.


Changing Excel Table Connections in VBA

Changing Excel Table Connections in VBA

In previous posts, I’ve discussed using VBA to extract connection information from Pivot Tables and hacking a tables command so that you can programmatically change the results of a query. But what if you have a table and you want to change its connection and its command?

This can be achieved pretty easily using the table's .QueryTable object. In fact, we can change the entire table using this class.

So how does it work? Suppose I've inserted some data into a worksheet from SQL Server. The menu path is DATA → From Other Sources → From SQL Server. Next I get the familiar connection windows to connect to a server and then select a table (as below).

 

I'll give the table a friendly name (adventure_works) and return the data to a table. I get a connected table which is sourced from a query and can be updated (either right click somewhere in the table and select Refresh from the popup menu, or select the Refresh button from the TABLE TOOLS → DESIGN (ribbon) menu).

I can also use the DESIGN of the table tools to give my table a decent name by just replacing its name (as in the following image).

Now I want to change the database server (and potentially the query) that my table uses. There are 2 things to keep in mind here. Firstly, we need to change the query’s connection string (no brainer eh) and secondly, the table (in Excel) was added as a table.

As in the previous post, I can refer to the table as a list object in VBA. This is a property of the sheet so, in order to reference the table, I'll have to use some code like this (note the table was on Sheet2).

Dim li As ListObject
Set li = Sheet2.ListObjects("aw_employees")

For the connection string, I can refer to it through the ListObject's QueryTable class. So I could determine the current connection string, or set it, by using the following code;

Debug.Print li.QueryTable.Connection 'retrieve the connection string
li.QueryTable.Connection = "OLEDB;Provider=SQLOLEDB.1;…(this is my connection string)…"

If I then refresh the table (using li.Refresh), one may think that's the end of the story and our work is complete – but that's not quite true just yet.

Remember that we added the table as an SQL table? Because of this, the 'query' for the .QueryTable actually refers to a table (including the SQL database) and its .CommandType is a table. If we debug the code to determine the command (as below) we can see this.

After we change the connection, we could set the new table to use DimEmployee (without specifying the database) with something simple like;

li.QueryTable.CommandText = "dbo.DimEmployee"
li.Refresh

Or, we could change the command type to an SQL command and provide its text. This would suffice for that;

li.QueryTable.CommandType = xlCmdSql
li.QueryTable.CommandText = "select * from dbo.DimEmployee"
li.Refresh

That's about it, we can change the table's connection and its query, and have these changes reflected in our table.
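Putting those steps together, a sketch of the whole change as a single routine might look like the following (the server name MyNewServer and the exact connection string are placeholders of mine; the sheet, table and command are the ones used above);

Sub RepointTable()
    ' A sketch only - assumes the aw_employees table on Sheet2 created above
    Dim li As ListObject
    Set li = Sheet2.ListObjects("aw_employees")

    With li.QueryTable
        ' point the query at the new server (the connection string is a placeholder)
        .Connection = "OLEDB;Provider=SQLOLEDB.1;Data Source=MyNewServer;" & _
                      "Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012"
        ' switch from a table reference to an SQL command
        .CommandType = xlCmdSql
        .CommandText = "select * from dbo.DimEmployee"
    End With

    li.Refresh
End Sub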

 


The Wisdom of the (Tableau) Crowd?

The Wisdom of the (Tableau) Crowd?

Ok … I’ll admit it … Tableau has some great features.

One of the standout features of the new release (8.2) is the ability to drop sheets from a file and manipulate them visually as a data source – there's no need to write SQL joins (using the Jet engine) when using Excel sources. The benefit of this is that it removes the manual connection work that is typically associated with user end reporting tools and combining data.  Another instance of this type of approach was developed by Targit (they used a folder share as a drop directory for files, which could then be consumed by the application automatically – a very nice feature). There's an obvious benefit to the tight integration of data and its downstream analysis – especially for business users that are not so concerned with technical model design but with the analysis of data and using it in the tool. Removing layers of integration delivers on the promise of traditional pitches that use catch cries of

spend less time working on data and more time understanding it

To be fair though, we assume that the data being used is pretty clean to begin with and that we are really only defining aggregation paths.  A common negation of this type of activity (where users manage and consume their own data) is that the production of data is not verified by IT or officially sanctioned (most noticeably it is raised from an IT centric standpoint). There could be lots written about that discussion point but for now we will just accept that it is going to happen anyway and leave it.

Another great feature is the idea of story boards – where you can trace (for want of a better word) the thought patterns used in an investigation and present a complete view of the analysis – I've heard it dubbed a PowerPoint killer (and there may be some truth in that where PowerPoint is used with a heavy reliance on data integration). Again, the concept is not new (Yellowfin has had it enabled for at least one version now).

But I digress and should move onto crowds … or more importantly the vendor conference. I was staggered to discover that the Tableau conference in Seattle sold out early with 5,200 attendees (yes, that's right, 5,200). That's a remarkable number when you consider that the PASS Summit on SQL Server (incorporating the full MS stack) gets an advertised 'more than 5,000' attendees (I am assuming that it's just over the 5,000, so somewhere in the range of 5,000 – 5,100). Since this includes both the database administration and the BI arms, those dedicated to BI would fall far short of the 5,000. In contrast, the PASS BA conference which specifically targets Business Analytics gets 750+ attendees (again, I'll assume we're talking 750 – 760 attendees). Given the ubiquity of the MS platform as a tool, the similar number of attendees at the Tableau conference is astonishing.

 


Brisbane SQL Saturday

Brisbane SQL Saturday

SQL Saturday in Brisbane is Back!!

I'm happy to announce that we are concreting arrangements for a Brisbane SQL Saturday … it's simply the best way to get SQL Server and BI training from some of Australia's (and perhaps some international) best trainers and consultants.  Of course, the great thing is, IT'S FREE and run by the SQL community members.  So tell your boss, family and friends.

We've scoped out a great venue at Queensland University of Technology and are in the process of finalising some paperwork now – so leave the 20th September open for some great SQL and BI goodness.

Like to know more?  Keep an eye out on the SQL Saturday site for the event or ping me and I'll add you to the distribution list.

Update : 21-07-2013

Ok …. we could not host the event on the SQL Saturday site – will that stop us?

NO we’ve created our own web site to manage the event.  Check out www.sqlgroup.com.au/home.aspx to find out more.

Same great content, same great day, same venue – just a different name.


Flawless Excel in Business Intelligence

Flawless Excel in Business Intelligence

Somewhat prompted by some recent posts about the inherent error associated with Excel as a BI tool (and more importantly how some magical tool solves this problem), I thought I'd discuss Excel as a BI tool. In fact, it's one of my favourite pondering points: "what role should Excel play in business intelligence?"

As I've stated above, it's a discussion that's often prompted by claims that a product reduces the inherent risk associated with Excel (and to a lesser extent Access). Vendors often suggest that their Excel support mitigates the inherent risk associated with Excel because it adds corporate security and authorised data capabilities to the user's trusted tool (often it is argued that Excel is the only tool business users want to use, or perhaps can use).

There are several flavours presented in the pitch (often several of them are presented in the same discussion) so let’s look at a few of the key selling (or scaring) points;

  1. Excel has inherent risk and is error prone. For this, I'll take an extreme case since it suits my discussion. The nature of these stories is that Excel caused some monumental loss because of an error in a formula or there was some other error in the spreadsheet. There's nothing like a bit of fear to make you sit up and pay attention, is there? If you have not had this error yet don't worry because you will … it's only a matter of time (in fact we'll even throw in some statistics to show just how common this can be). It's no big deal if you haven't found it yet because you just haven't looked hard enough. In fact some vendors even go so far as to claim that users are arrogant (now that's going to mend the bridge between IT and Business). But then again, there's nothing like creating a bit of division is there – 'oh yeah, now I remember, divide and conquer'.
  2. The Excel (or Access) system is undocumented and the removal of the key person that developed the tool would cause monumental loss to the company. How could any responsible manager allow a system or some reporting tool to operate if it is reliant on a key person … this should never happen, should it? If you want to be a good manager and ensure business continuity (and there's the no-back-out argument), you need a system that's IT sanctioned with proper documentation – so who wouldn't want or need that?
  3. Oh yes – did I mention that IT hasn't sanctioned your Excel (or Access) work? They don't support it and you can't go to them with problems. This really reinforces the risk associated with points 1 & 2. If there's an issue then you are on your own. This is a very compelling argument that enforces a CYA mentality – perhaps we should rewrite the argument as 'if you can't transfer blame, you better not take responsibility for it'.
  4. End users should focus on data analysis and not data preparation. Here we are presented with the classic position that too much time is spent in data preparation and not analysis. Could this possibly be a hint that an unproductive employee could get so much more done if only they used the tool?
  5. IT takes too long or does not do it right. This is such a common selling point that it is almost taken for granted (especially in the realm of reporting). A pitch that focuses on the ‘corporate lifecycle of reporting’ may go something like this;
    1. The business user describes the report to the IT analyst
    2. Sometime later the IT guys get around to writing the spec for the report and pass it to the dev queue where the report writer creates it. This doesn't happen straight away though because the report writer is so busy, so say 6 months later the user gets their report.
    3. The report is inevitably wrong and the cycle begins again.
    4. By now the business has moved on and the business user needs a new report.
  6. Users need Excel interaction. If there's been no justification presented above we need one resounding point – the final nail in the coffin (as it were). The pitch here focuses on user needs. We often hear very compelling statements along the lines that users just want to use Excel because that's what they know or it's just the way they work. Picture the preparation of a presentation where the analyst grabs some information from one report, creates a LOOKUP from data in another and then creates a chart to paste into PowerPoint. Based on this story it's easy to see why users are always going to use Excel (well, so the pitch goes).

I have listed more than a few points here but I think they can be summarised into a few categories.

  1. Relying on Excel for the manipulation of data is a risky proposition (so you need some involvement of IT to provide corporate data to manipulate) and
  2. Users need Excel.

Now the vendor presents their tool – something to solve all the problems listed above, but how appropriate are such tools in reality? Aside from the idea that they present a very narrow perspective of what BI can do, we will now focus on how the vendor's tool satisfies the gaping hole in the end user's tool set.

Now the vendor presents the age of enlightenment in their product … our (their) tool solves all these problems because it;

  1. Is managed by IT – hosted so that there will not be any issues associated with lost or deleted files.
  2. Has gone through a rigorous testing process.
  3. Can be used by anyone – the software is so intuitive that any user can pick it up and use it.
  4. Delivers data to users in the format they want, that is, it's accessible in Excel.

As convincing as these arguments might appear on the surface, there's a real problem with this line of thinking and we can sum it up in two ways;

Firstly, to paraphrase Lyndsay Wise (W|T) – if users need Excel then the BI tool is not doing what it should. While I could go into a long-winded discussion, I think the statement is succinct. Where users have to bring data out into Excel to manipulate it (you don't tend to hear the buzz word Mash Up anymore, do you?), there's a pretty convincing argument that the sanctioned solution just doesn't cut it (sanctioned as in corporate structure). If we think about it for a minute longer and push the case for Excel, we'd realise that what we are really saying is that our business model is so inflexible it needs an additional layer of manipulation in order to provide any real value. This may be the case but if so, shouldn't the problem be attacked at the source and not downstream, that is, when it enters the model and not when it's spat out?

Now for the second point – one where the tool somehow removes error where there is some form of manipulation required. I am very interested to know exactly how any tool achieves this magical proposition (and have never seen a convincing argument by a vendor). At best, you've got a cell location (or multiple) that links back to corporate (and supposedly agreed) data. However, if we look at the process of what Excel is being used for, we are faced with exactly the same issue that would be present had they not had a tool. That is, they are free to use the data how they wish. After all, that's the reason why Excel is so popular – it allows the user (of any skill level) a programmatic interface for working with data. The tool adds no more comfort than would otherwise be available from Excel.

If we would like to suggest that the availability of data in Excel is the key to Excel's woes as a corporate BI tool (and those nasty errors that are so well publicised) then we are simply adding another layer for users to get wrong. After all, the placement of data relies on the end user finding the right information to begin with!

Not surprisingly, the argument for safety around Excel as a BI tool resolves to the user and not the tool. Once it's placed in Excel, data is at the direction of the user and it is their ability that we seek mitigation against.

 


Packt’s 10 Year Celebration

Packt’s 10 Year Celebration

Usually, the cost of IT & reference books is … well pricey to say the least.  Unfortunately, if you wait for a sale, you save some cash but end up with old tech books.

Well luckily Packt Publishing is having a 10 year celebration.  No e-book over $10.  If you are in the market, it might be worthwhile checking it out.  But here's the kicker: it's only lasting 10 days.

 


Extracting Connection Information from Pivot Tables

Extracting Connection Information from Pivot Tables

It’s pretty easy to add pivot tables to worksheets – even easier to add them when the source is a cube. However there are a couple of ways to go about adding them which could make your life easier – that’s one of the things we look at in this post.

When you first create a connection to an OLAP source (and I am including tabular in that), a connection is created in the workbook. So just for fun, let's go through the process and connect to the cube. On the Data tab, choose From Analysis Services from the Other Sources dropdown.

We're asked for a server (not shown) so let's put in the correct details and then click Next. Then we get a list of cubes and perspectives to connect to (I'm using Adventure Works here so the names appearing in the image below would be familiar to most).

We then have the option to specify a few additional details. Perhaps the most important of these are the filename that stores the connection and the friendly name. Most of the time though, we probably don't put any thought into this and just click Next. However, in the image below, I've specified the filename and the friendly name. Finally we can click Finish and get to work on the pivot table.

The Next Pivot Table

Now when it comes to the next pivot table we can run through the process again – if we choose the same names we'll get a warning, but let's not go into that. Alternatively, we can just insert a new pivot table and select an existing connection.

You can see this in the screen flow below. When we create a new pivot, we can choose the external source and then the connection.

There’s one big reason why you’d want to insert a new pivot like this – it’s a lot easier to maintain the workbook when you need to understand where data is coming from (or you need to change data sources).
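As an aside, this step can also be scripted. A minimal sketch might look like the following (the connection name "MyCubeConnection" and the destination are purely illustrative – use whatever your workbook connection is actually called):

Sub AddPivotFromExistingConnection()
    Dim wb As Workbook
    Dim pc As PivotCache
    Set wb = ActiveWorkbook
    ' reuse the existing workbook connection rather than creating a new one
    Set pc = wb.PivotCaches.Create(SourceType:=xlExternal, SourceData:=wb.Connections("MyCubeConnection"))
    ' drop a new pivot table based on that cache
    pc.CreatePivotTable TableDestination:=Sheet1.Range("A1"), TableName:="ptFromExistingConnection"
End Sub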

The Detail – Getting Connection and Pivot Info

It would be a pretty ordinary post if that's all I had to say (no code after all) – but now the fun starts. Workbooks never finish up as simple and elegant as we'd like – NEVER. It's easy to add new data sources, new pivots and even pivots of data sourced from other pivots. Throw 20-30 of them in a workbook and you're left scratching your head as to where the information is coming from and what needs to be updated (say when the data source changes). If you only had one connection it would be easy, but life's not like that and there are connections flying around everywhere. So I needed to look at each pivot and determine its source and connection string.

Here’s the code:

Sub GetPivotData()
    Dim wks As Worksheet
    Dim pt As PivotTable
    ' loop through every pivot table on every sheet and report its cache details
    For Each wks In Worksheets
        For Each pt In wks.PivotTables
            Debug.Print wks.Name & ":" & pt.Name
            Debug.Print "Cache Index : " & pt.CacheIndex
            GetCache (pt.CacheIndex)
            Debug.Print "--- break ---"
        Next pt
    Next wks
End Sub

Function GetCache(Index As Long)
    Dim wb As Workbook
    Set wb = ActiveWorkbook
    Dim pc As PivotCache
    Set pc = wb.PivotCaches(Index)
    ' OLAP caches expose a workbook connection; others are based on a worksheet range
    If pc.OLAP Then
        Debug.Print "Is OLAP : " & pc.OLAP
        Debug.Print "Connection Name : " & pc.WorkbookConnection
        Debug.Print "Connection String : " & pc.Connection
    Else
        Debug.Print "Is OLAP : " & pc.OLAP
        Debug.Print "Data Range : " & pc.SourceData
    End If
End Function

 

My Sub (GetPivotData) loops through each sheet in the workbook, getting each pivot on the sheet, and then calls a function (GetCache) to get the cache (or connection) information for that pivot. While I could get connection information by just adapting the function to iterate over the slicer caches (as below), I'd like to associate each pivot table with its connection.
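For reference, the adapted version over the slicer caches could look something like this (a sketch only – it lists each slicer cache and its connection but, as noted, it loses the pivot-to-connection association):

Sub GetSlicerCacheConnections()
    Dim sc As SlicerCache
    ' each slicer cache knows whether it is OLAP based and which workbook connection it uses
    For Each sc In ActiveWorkbook.SlicerCaches
        Debug.Print sc.Name & " : OLAP = " & sc.OLAP
        If sc.OLAP Then Debug.Print "Connection : " & sc.WorkbookConnection.Name
    Next sc
End Sub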

Additionally, I want to identify any pivots that are not based on an external (OLAP) source. To do this, I can use the PivotCache.OLAP property and retrieve the appropriate connection info.

 

 


SQL Saturday Lisbon Portugal

SQL Saturday Lisbon Portugal

Paul te Braak:

I can only express my appreciation also – thanks guys!

Originally posted on Paul Turley's SQL Server BI Blog:

We’re on our way home from Lisbon, Portugal after speaking at SQL Saturday #267. Having had the pleasure to attend and speak at quite a few SQL Saturday events over the years, I’ve never seen a group of event organizers work so tirelessly with as much passion and dedication. Thanks to Niko Neugebauer, Vitor Pombeiro, Paulo Matos, Quilson Antunes, Murilo Miranda, André Batista and Paulo Borges for the late nights and long hours you all invested to make this a very special SQL Saturday. The event was a great success; as well as a special day of sightseeing the day afterward for all the speakers. After recruiting an impressive list of speakers from all over the globe, these volunteers went well beyond the call of duty to chauffer speakers from the airport to hotels, the event venues and around the city. It was quite a treat. Thank you!

View original


Creating Tables in Excel with VBA and External Data – Part II

Creating Tables in Excel with VBA and External Data – Part II

In Part I, we looked at adding a data table to a Workbook which uses an external data source (we used OLAP but the process can be applied to any source). This post looks at manipulating an existing table.

Without VBA, we can manually manage the table – both its properties and the underlying query. Simply right click on the table and select Edit Query… or External Data Properties from the popup menu. Changes made this way are applied to the table automatically.

If we choose to edit the query, we can simply overwrite the Command Text (as seen in the image below). These changes will be applied automatically (including resizing the table by rows or columns for a different sized data set) once the OK button is clicked.

For External Data Properties, we can configure how the table reacts to new data. For example, you may notice that the table accommodates additional rows and columns; however, when the query returns a table with fewer rows and columns, the table retains its old sizing (number of columns) and includes blank columns (for data that previously existed). You can manually resize this (dragging the bounds of the table's border) or set the properties of the table to overwrite existing data. If you want to ensure that this option exists and that new sizes are automatically incorporated into the table – make sure that the checkbox for Overwrite is marked in External Data Properties.


VBA

Now to the VBA – as Commandant Lassard would say "There are many, many, many, many fine reasons to use VBA". We have so much flexibility but let's keep it simple; here's what I've set up.

Cell B1 is data validated based on the cell range D1:D2 – nice and simple. When we change that cell, the table updates for the new country.
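If you wanted to script that validation step as well, a minimal sketch (assuming the same B1 cell and D1:D2 list used here) might be:

Sub AddCountryValidation()
    ' validate B1 against the list of countries in D1:D2
    With Sheet1.Range("B1").Validation
        .Delete
        .Add Type:=xlValidateList, AlertStyle:=xlValidAlertStop, Formula1:="=$D$1:$D$2"
    End With
End Sub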

In order to determine if there is a change in our data (the Country selected) we have to create a worksheet event to capture and test the change. I have gone into this in some detail here and the code is below. Note that this needs to be added to the sheet code (not in a separate .bas module). All we do is check that only cell B1 is updated and then call the refresh subroutine.

Private Sub Worksheet_Change(ByVal Target As Range)
    ' update table if cell 1,B is changed
    If Target.Cells.Count = 1 And Target.Cells.Row = 1 And Target.Cells.Column = 2 Then UpdateMyTable
End Sub

Now for the updating component – the bit that's called when cell B1 is changed. I think this is pretty straightforward but I'll walk through it anyway. First, the code;

Public Sub UpdateMyTable()
    ' ensure that any new changes are reflected in the table dimensions
    Sheet1.ListObjects("Table_abax_sql3").QueryTable.RefreshStyle = xlOverwriteCells
    ' set the command text
    Sheet1.ListObjects("Table_abax_sql3").QueryTable.CommandText = NewQuery(Sheet1.Cells(1, 2))
    Sheet1.ListObjects("Table_abax_sql3").Refresh
End Sub

Private Function NewQuery(Country As String) As String
    NewQuery = "select {[Measures].[Reseller Sales Amount]} on 0, " & _
               "[Product].[Category].[Category] on 1 " & _
               "from [Adventure Works] " & _
               "where [Geography].[Geography].[Country].&[" & Country & "]"
End Function

I've kept the same format as in the original post. The function NewQuery determines what the MDX should be – based on the provided country. All we do is set the table's command to the new MDX (in QueryTable.CommandText) and refresh it.

I’ve also set the refresh style so that any changes in the command (grid size) are automatically reflected in the worksheet table.

That’s about the size of it! – I hope you find it useful.


Creating Tables in Excel with VBA and External Data – Part I

Creating Tables in Excel with VBA and External Data – Part I

This post looks at how we can add a table to an Excel sheet which uses an MDX query as its source. This is a very handy feature to use for a couple of reasons;

    1. The table retains the connection to the OLAP source (hence can be updated by a user at will)
    2. We can use it to extract data from MOLAP or tabular sources (i.e. run MDX or DAX)
    3. We can define complex queries to return a result set that cannot be obtained with a pivot table
    4. We can use the table!! – (This is really important for certain activities like data mining table analysis)

Note that most workarounds for creating a table from OLAP sources rely on creating a pivot table, formatting it as a tabular layout and copying and pasting the values. Hardly an attractive option!

How to Do It

We'll look at a simple query against Adventure Works;

select [Measures].[Reseller Sales Amount] on 0,
[Product].[Category].[Category] on 1
from [Adventure Works]
where [Geography].[Geography].[Country].&[Australia]

and an OLEDB connection string (note the OLEDB specification at the start of the string)

OLEDB;Provider=MSOLAP;Data Source=@server;Initial Catalog=Adventure Works DW 2008R2;

I have incorporated those two strings into two functions (MyQuery and MyConnectionString) – this just removes some of the clutter from the code.

Now we just need to use the ListObjects.Add method. The code (now with all Subs and Functions) is pretty much the bare bones you need to add the table. In other posts, I'll look into a higher level of control over the output.

The CODE

The complete code is shown below. I've included everything so it can simply be pasted into a new VB module.

Sub CreateTable()
    With Sheet1.ListObjects.Add(SourceType:=0 _
                              , Source:=MyConnectionString() _
                              , Destination:=Range("$A$1") _
                              ).QueryTable
        .CommandType = xlCmdDefault
        .CommandText = MyQuery()
        .ListObject.DisplayName = "MyMDXQueryTable"
        .Refresh BackgroundQuery:=False
        .PreserveColumnInfo = False
    End With
End Sub

Private Function MyQuery() As String
    MyQuery = "select [Measures].[Reseller Sales Amount] on 0, " & _
              "[Product].[Category].[Category] on 1 " & _
              "from [Adventure Works] " & _
              "where [Geography].[Geography].[Country].&[Australia]"
End Function

Private Function MyConnectionString() As String
    MyConnectionString = "OLEDB;Provider=MSOLAP;Data Source=@server;Initial Catalog=Adventure Works DW 2008R2;"
End Function

Walk through

This is pretty much the bare bones approach. As a code walk-through (see Sub CreateTable), we add the list object specifying its connection string and destination, then set the command and refresh info. The only statement that is not entirely necessary is naming the table (see .ListObject.DisplayName) but I tend to think it is a good idea because we will want to refer to it by name at a later stage.

Outcome

The code will add a table like the one in the following image. The field names are fully qualified, which is not that nice, and we will look at how this can be changed in another post. For now, our purpose is to get a table into the workbook (the purpose of this post) so that it can be used as a table and refreshed.


PS – the code above adds the ListObject by reference to the sheet within VBA (see Sheet1.ListObjects). It's probably worthwhile to point out that this is the sheet's code name in the VBA project and not the (tab) name of the sheet.
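For example, the two references below only point at the same table when the sheet's code name and its tab name both happen to be 'Sheet1' (the table name is the one used in the code above):

' by code name (what the code above uses)
Sheet1.ListObjects("MyMDXQueryTable").Refresh
' by tab name, via the Worksheets collection
Worksheets("Sheet1").ListObjects("MyMDXQueryTable").Refresh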

One more thing – when the query refers to attributes in a hierarchy, the OLEDB result set (table) will include the parent attributes of the hierarchy as additional columns. This is nothing to worry about for the moment!

Next – changing the table's query.


Microsoft Tabular Modeling Cookbook

Microsoft Tabular Modeling Cookbook

I am pleased to announce that my tabular modelling book is finished.  The title is succinct – 'Microsoft Tabular Modeling Cookbook' – and it's available from Packt (and others) at http://www.packtpub.com/microsoft-tabular-modeling-cookbook/book

There is (naturally) a lot in the book – but it is designed to get you up and running fast. We look at all flavours of modelling – both Power Pivot and SSAS and pretty much everything in between.

Of course any comments and feedback are welcome!


Who Stole My Windows Start Menu – FIXED

Who Stole My Windows Start Menu – FIXED

I could be at risk of becoming stuck in my ways but I really find the Metro style of Windows 8(s) unproductive.  Ah yes, you can search for an app – provided you know its name (and therein lies the problem).  8.1 was going to improve on 8 by having a start menu (make a mental note to fully read release notes); however, this did not revert to my beloved Win 7 start menu (with a program menu) –> that's right, there is no program folder!

Perhaps the best thing about 8.1 was the ability to start Windows directly to the desktop – but where's my real (fully expandable) start menu? … this is what I want.

So, if you want to run Win 8.x, try Classic Shell (http://www.classicshell.net/).  It is great – no two ways about it.  It gives the power of Win 8 – with the ease of Win 7 (well, easy for me anyway)!


Excel New Feature – INQUIRE

Excel New Feature – INQUIRE

Excel 2013 included some great new features for BI. However, one really cool feature that flew under the radar was Inquire. For all intents and purposes, this is formula tracing at the gym, on steroids and the only runner in the race. It's a great inclusion to understand the structure of the workbook including dependencies between books (including formula references), record the formulas and document just about anything in the book.

So let's start with a quick recap of formula auditing. This is still available and shows how a cell is derived. You typically activate the cell and then press Trace Precedents (if you want to see what cells are used in the cell's formula) or Trace Dependents (if you want to see what other cells have a formula which is dependent on the selected cell). These are in the Formulas ribbon. We can see how this works in the following image where the dependencies of cell B7 are shown as B5 and B6 (note the blue line connecting the cells).

When the formula is a linked value from another book, a grid icon is shown to indicate an external reference. In the prior image, the cells A1 and A2 refer to cells in another workbook (called Book 2) and so tracing the precedents would show the following.

Now let’s compare this to Inquire

Firstly, Inquire must be activated as an Excel add-in. Go to File → Options, then choose Add-Ins from the options pane, then click Go next to the Manage: Excel Add-ins dropdown.


Ensure that the Inquire add-in is checked. Then an Inquire ribbon tab should be present.

The ribbon gives us many options, but let's focus on the structure of the workbook. If we click the Workbook Analysis button, a new window will open which allows us to choose what we want to look at. For example, we could list formulas by selecting All formulas from the Formulas node in the tree (as shown). Note that all formulas are listed, which includes a reference to an external book.

Don’t like the window? We can even export the results for any number of selected nodes (to a new workbook) by hitting the ‘Excel Export’ button (in this window).

We can investigate the relationships (both precedents and dependents) of a particular cell (just click the Cell Relationship button in the ribbon). Here a new window opens with our choices (we can choose the direction and the number of expansion levels for investigation).

This post has just scratched the surface – there are a few other interesting options to explore (e.g. file comparisons and passwords); however, this feature should be very useful for tracing (and perhaps more importantly documenting) formulas in books.


Understanding the LIFT CHART

Understanding the LIFT CHART

The lift chart is synonymous with evaluating data mining model performance and the predictive power of one model against another. Often, in presentations and training sessions it is suggested that the chart is indicative of the model's ability to accurately predict within a training population. For example, the following explanation is provided;

“the lift chart shows that this model is good because it only needs to evaluate 30% of data in order to correctly predict all the target outcomes”

This type of statement is simply not true – it is INCORRECT, WRONG, MISLEADING and shows a lack of understanding about what the chart represents. This post looks at explaining the chart by examining how it is created – seeking to remove some of the misconceptions about the use of the chart.

Consider the following example. In this chart it would be argued that an ideal model (red line) would only need ~55% of the population in order to predict all the target states. Without a model, we would need to use the entire population, and so our model (being somewhat useful) lies between the best model and a random guess. These values can be determined by the intercepts of each model with the X axis (note that at 55% of the population, the best model achieves 100% accuracy).

Another common question arising from the interpretation of this chart occurs when we know that the target (predicted) state is found in only 55% of the population. The question is “why do we show 100% accuracy when only 55% of the population can exist in the predicted state and therefore the Y axis should have a maximum of 55%”.

For my own analysis, I shall ask a question of the reader so that the construction of the chart can better be understood. The question is simple.

If my model does not predict the correct result 100% of the time, how could my accuracy ever achieve 100%? Let's be realistic, it would have to be a pretty impressive model to never be wrong – and yet this is what the chart always shows – 100% accuracy!

Now let’s look at construction

In order to create a lift chart (also referred to as a cumulative gains chart) the data mining model needs to be able to predict the probability of its prediction. For example, we predict a state and the probability of that state. Within SSAS, this is achieved with the PredictProbability function.

Now, since we can include the probability of our predictions, we would naturally order the training data by the predicted probability when suggesting the likelihood of a test case being the predicted target state. Or perhaps put another way, if I only had 10 choices for choosing which test cases (that is, observations from the testing data) would be the predicted value, I would choose the top 10 testing cases based on their predicted probability – after all, the model is suggesting that these cases have the highest probability of being the predicted state.

As we move through the testing data (and the predicted probability decreases), it is natural to expect the model to become less accurate – it will make more false predictions. So let's summarise this (training) data. For convenience, I have grouped my training data into 10 bins and each bin has ~320 cases (the red line below). Working with the assumption that the predictive power of my model decreases with probability, the number of correct predictions also decreases as we move through more of the training data. This is clearly visible in the chart and data below – the first bin has a high predictive power (275 correct predictions) while the last bin has only 96 correct predictions.

If I focus on the model’s ability to correctly predict values, I will notice that it can predict 1,764 correct results – but now let’s turn to the cumulative power of the model. If, from my sample data, I could only choose 322 cases (coincidentally the number of cases in bin 1), I would choose all cases from bin 1 and get 275 correct (or 16% of the possible correct values). If I had to choose 645 cases, I would choose the cases from bins 1 & 2 and get 531 correct (30% of the correct possibilities). This continues as I make more predictions and is summarised in the following table.
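Continuing that hypothetical sketch (and only as a simplified stand-in for the post's SSAS mining structure), the binning and running totals can be reproduced as follows. The post's own figures – 322 cases and 275 correct in bin 1, 531 correct through bin 2, 1,764 correct overall – come from exactly this kind of summary.

import numpy as np
import pandas as pd

# Continuing the hypothetical scikit-learn sketch above: order the test cases
# by predicted probability, split them into 10 equal bins and accumulate the
# correct predictions per bin, as the post describes.
df = pd.DataFrame({
    "actual": y_test,
    "predicted": predicted_state,
    "probability": predicted_prob,
}).sort_values("probability", ascending=False).reset_index(drop=True)

df["bin"] = pd.qcut(np.arange(len(df)), 10, labels=list(range(1, 11)))
df["correct"] = (df["actual"] == df["predicted"]).astype(int)

summary = df.groupby("bin", observed=True).agg(
    cases=("correct", "size"),
    correct=("correct", "sum"),
)
summary["running_correct"] = summary["correct"].cumsum()
summary["pct_running_correct"] = 100 * summary["running_correct"] / summary["correct"].sum()
print(summary)   # the Y axis of the lift chart is the pct_running_correct column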

This data is transposed onto the lift chart – the bin on the X axis (representing the % of the population) and the Percent Running Correct on the Y axis (representing the cumulative percentage of correct predictions). As we can see, the chart is indicative of the model’s ability to quickly make accurate predictions rather than its overall predictive ability.
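Plotting the running percent against the bin number gives the model's line on the chart – a quick matplotlib sketch, continuing the assumptions above.

import matplotlib.pyplot as plt

# Plot the cumulative (running) percent of correct predictions per bin;
# this is the model's curve on the lift / cumulative gains chart.
bins = summary.index.astype(int)
plt.plot(bins * 10, summary["pct_running_correct"], marker="o", label="Model")
plt.xlabel("% of population (bin)")
plt.ylabel("% running correct")
plt.legend()
plt.show()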

Best and Random Cases

The chart also includes best and random cases as guidelines for prediction – let’s focus on these. These lines are theoretical – really ‘what if’ scenarios.

Suppose that I had an ideal model. If this were the case, my model would predict 322 correct in bin 1, 323 in bin 2 and so on – it must, because we have ordered the data by PredictProbability and in a perfect world we would get them all correct! However, the model can only predict 1,764 correct values – we know this from the actual results. Because of this, we would only need up to bin 6 to capture all our correct values (see the column ‘Running Correct Best Case’ in the following table). Just as we did for the model’s predictions, we can convert this to a percent of the total correct values and chart it with the model.

Now for the random guess – again this is theoretical. I know that I can only predict 1,764 correct values so, if these were evenly distributed amongst my bins, I would have ~176 correct predictions in each bin. This is then added to the chart.
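Both baselines can be appended to the same hypothetical summary table – again, an illustration of the construction just described rather than the post's actual numbers.

# Theoretical baselines, continuing from the summary table above.
total_correct = summary["correct"].sum()     # 1,764 in the post's example
bin_sizes = summary["cases"].to_numpy()

# Best case: every case in a bin is correct until the model's total number of
# correct predictions is exhausted (around bin 6 in the post's example).
best_running = np.minimum(bin_sizes.cumsum(), total_correct)
summary["pct_best_case"] = 100 * best_running / total_correct

# Random guess: the correct predictions are spread evenly across the ten bins
# (~176 per bin when the total is 1,764), giving a straight diagonal line.
random_running = np.cumsum(np.full(len(bin_sizes), total_correct / len(bin_sizes)))
summary["pct_random_guess"] = 100 * random_running / total_correct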

What is the Problem?

Now we can see that the chart is essentially just a view of how quickly the model makes accurate predictions. Perhaps there is nothing wrong with that, but what happens when we compare models? Well, in this case, the comparison is relative. Those steps are reproduced for each model, and what you essentially see is relative comparative performance. Thus, the comparison of two models in the chart gives NO indication of predictive accuracy – after all, how could it, since each model plots percent accuracy relative to its own correct predictions.
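A purely hypothetical example makes the point: two models with very different totals of correct predictions both finish at 100% on their own curves, because each curve is scaled by that model's own total.

import numpy as np

# Hypothetical per-bin correct predictions for two models of very different quality.
model_a = np.array([300, 280, 250, 220, 190, 160, 130, 100, 70, 50])  # 1,750 correct in total
model_b = np.array([120, 115, 110, 105, 100, 95, 90, 85, 80, 75])     #   975 correct in total

for name, correct_per_bin in (("Model A", model_a), ("Model B", model_b)):
    pct = 100 * correct_per_bin.cumsum() / correct_per_bin.sum()
    print(name, np.round(pct, 1))  # both curves end at 100.0

Nothing on either curve reveals that one model makes almost twice as many correct predictions as the other.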

For this reason, relying on this chart as the sole measure of accuracy is dangerous and really shows very little about the overall accuracy of the model.

Conclusion

Understanding how the lift chart is constructed can help in understanding how to interpret it. We can see that it indicates the cumulative power of the model to make predictions – or, perhaps more correctly, how quickly the model accumulates its correct predictions.


Presenting Sessions on Data Mining

Presenting Sessions on Data Mining

I am currently preparing a few presentations on using Data Mining in business intelligence.  These will be at the Brisbane SQL Server user group (this month) and SQL Rally Amsterdam (in Nov).  I am especially looking forward to Amsterdam because this will be my first trip to the Netherlands.

The application of data mining within an organisation is an interesting topic for me, which I liken to a milestone in organisational maturity.  When most people discuss business intelligence (and analytics), they are talking about the production of a system so that the end user can either get canned reports quickly or (more importantly) interrogate data in a real-time manner.  After all, this was the intent of OLAP! – the idea that the end user can drive the data in real time so that once they think of a problem and a potential solution, they can verify their thoughts against the data.

However, as good as this type of system is (compared to the starting base), stopping the BI journey here can be one of its shortcomings.  That is, the user needs to think of a solution to a problem and then verify it against the data.  But what do we verify against? – and when do we stop investigating?  This is one of the major benefits of data mining.  It allows a scientific analysis of a targeted problem without the limitations of our dimensional thought (after all, we can only think in a relatively small number of dimensions & attributes at a time).


SQL Stream @ BBBT

SQL Stream @ BBBT

Most presentations to the BBBT focus on delivering better and more scalable reporting & analytical solutions.  While this may be a gross over-simplification of the vendors that present, a general summary is that most focus on the delivery of a full platform (cloud or on-prem) or an improved method of analytics and reporting under an existing architecture (the more traditional BI implementation of stage, transform and store, which includes those dreaded latency issues).

These are very valid applications for BI (and, to be truthful, account for 99.9% of requirements); however, sometimes a technology is presented that is so completely different that it is worthy of a note – enter SQL Stream.  This offering seems truly remarkable: a technology that streams big data so that it can be queried using SQL.  The outcome – real-time reporting and alerting over blended big data (yes, combining different sources) – sounds cool, eh?  Well, it is!

Finally, I must also give my apologies to all the other vendors that have presented to the BBBT with great products and have not been mentioned (on this blog).  As stated, those products cover the traditional implementations of BI that are encountered on a day-to-day basis.  Check out the BBBT website if you are interested in what the BBBT is, its meetings, podcasts and membership.


thenakedleaf blog

thenakedleaf blog

I have long been a fan of the Jedox product for its write-back and text capabilities in OLAP.  Its ability to publish Excel pages to the web (server) allows reports (and input forms) to be created quickly.  There are some very impressive methods for write-back and it can be a great piece of technology (in the right environment, of course)!

Now, thanks to Chris Mentor, there is a new practical blog about using Jedox with tips, tricks and explanations.  If you’re interested in the product or want to extend your understanding of the toolset, then the site should be on your reading list.

