Handling Late-Arriving Members in SSIS

I was chatting with some other SQL Community members recently and the topic of handling late-arriving members for a warehouse load came up. I figured it was worth taking a couple of minutes to share the general concept I’ve been using to handle late-arriving members in SSIS. The steps are relatively straightforward if you’ve done much SSIS, but sometimes the little things can trip you up. The example below is based loosely on the WideWorldImporters database samples from MS.

The general steps are

  1. Set up your source query.
  2. Pass the data through a Lookup for your Dimension with the missing results routed to a “No Match” output.
  3. Insert those “No Match” rows into your Dimension using a SQL task – checking to make sure that this particular row hasn’t already been inserted (this is important).
  4. Do another lookup using a “Partial Cache” to catch these newly-inserted members.
  5. Use a UNION ALL transform to bring the existing and late-arriving members together.

Inferred Member Data Flow

Sample code for the Inferred Insert:


SELECT @CustomerID = ?


               FROM   WideWorldImportsDW.Dimension.Customer

               WHERE  [WWI Customer ID] = @CustomerID)

  INSERT WideWorldImportersDW.Dimension.Customer

         ([WWI Customer ID],


          [Bill To Customer],


          [Buying Group],

          [Primary Contact],

          [Postal Code],

          [Valid From],

          [Valid To],

          [Lineage Key])

  VALUES ( @CustomerID,— WWI Customer ID – int


           + Cast(@CustomerID AS NVARCHAR(10)),— Customer – nvarchar(100)

           N”,— Bill To Customer – nvarchar(100)

           N”,— Category – nvarchar(50)

           N”,— Buying Group – nvarchar(50)

           N”,— Primary Contact – nvarchar(50)

           N”,— Postal Code – nvarchar(10)

           Sysdatetime(),— Valid From – datetime2(7)

           ‘9999-12-31 23:59:59.9999999’,— Valid To – datetime2(7)

           2 — Lineage Key – int



SSIS Catalog (SSISDB) Cleanup

I really appreciate what MS has done w/ the SSIS Catalog. We have built-in logging at a level that wasn’t possible in prior releases, but that comes at a cost. The default retention is pretty high (365 days) and MS chose to handle cleanup using cascading deletes. This sort of coding makes life easier for the developers, but almost always performs poorly. That’s especially the case when you have 1 parent row with 100’s of thousands or child rows related to it.  The problem is compounded because the first time people realize they need to lower the retention limit is about the same time that the database is starting to fill up or has filled up. At that point, it’s too late to lower the retention by a large number because the corresponding delete will still cause issues.

I knew someone had written some code for this and a quick search on StackOverflow found this question and the answer from Bill Fellows ( b | t ) :

I’m including the code here for a quick reference, though I recommend reading his answer, up-voting it, and checking out his other resources. The only change I’d consider making is perhaps clearing out the child tables in batches of 1000-5000 rows at a time instead of all rows. I found that I had to work in increments of no more than 10 days at a time to delete history. Any more than that and the time to execute increased quite a bit. In this case, look for your max days currently retained, and drop the @DaysRetention variable to that less a couple of days to see what is removed. Keep dropping that until you reach the point at which you want to stop, set your SSIS Catalog retention days to that value, and you should be good from that point on.

It’s probably worth noting that in SQL Server 2012, MS didn’t include proper indexing by default. You may need to add some indexes in order to have the delete perform in a timely manner and avoid excessive table scans. The index issue was addressed in SQL Server 2014 and higher (and I think in SQL Server 2012 SP2 as well).

Exploring BIML

Getting Started

I attended Varigence’s BIML Workshop last year prior to the PASS conference and got to learn a little BIML while seeing what it can do. Recently, I realized that I had a job to do that required applying the same SSIS ETL to several different sets of tables in our system. I figured this would be a great time to put some of that BIML into use.

To get started, you’ll need either the BIDS Helper add-on or a copy of Varigence’s Mist. While Mist is an excellent tool, it’s also priced towards the people who are really working with BIML and not targeted at those who aren’t quite ready to justify the cost. ($249/month or $3999 at the time I’m writing this – worth it if you use a lot of BIML, but not for everyone)

I made sure my BIDS Helper bits were up to date, started a new SSIS Project, and added a new BIML file by right-clicking the project and selecting “Add New BIML File”. After that I followed a couple of excellent walkthroughs and tutorials to get me started and refresh my memory. To name a couple of sources:



1. Actually getting my connection string correct. I don’t often type in connection strings so hit the ConnectionStrings.com website for SQL OLEDB connections. That told me to use this format for my connect strings:


Regardless of what I tried, this refused to connect as expected. I eventually changed back to the older syntax and that worked:

Provider=SQLNCLI11;Data Source=myServerAddress;Initial Catalog=myDataBase;Integrated Security=True;

Not a huge deal, but this was giving me fits with my connections with either security issues or having to specify DB Names in my code.


2. Next up – not knowing C# or BIML very well. Nothing for this except time and practice. 🙂  I got better along the way, slowly catching things as I typed them in and having some feel for what would happen.


3. For some reason, my Intellisense completely stopped working at one point. It worked for everything prior to when I added my “import” lines for various namespaces, but anything after that wouldn’t work at all. I got around this a little by using the BimlScript Editor online to check some code, but obviously this doesn’t work when you’re importing local columns. I never did get this working but also haven’t tried to reproduce it with a new project.

Update: It seems that if I move my @ Import statements before I open the first BIML tag in the XML, my intellisense works again. Putting the statements within the Biml section is perfectly valid, but putting them at the top/beginning of the file before I start writing Biml makes a lot more sense and results in Intellisense working again. This would have made my life easier.


4. The last major challenge this time through was trying to get BIML to generate a logical “AND” in the SSIS expression I wanted to use. I typed it in, used “&&” for my AND statement, and checked syntax. I repeatedly got errors about syntax. The first time around, I figured my expression might be too long. It was and I broke it apart only to get the same error. I looked at the string generated using some MessageBox.Show() hacks and it looked okay. I finally came across a tip indicating that BIML doesn’t work well with the “&” character. The suggestion was to replace that with “&” whenever you wanted to output an ampersand as part of the expression. This worked well and my package was generated.



I still need to tweak my package. I have more tables to import from a different source on the same server, but targeting a single database on my destination server. I need to figure out tiers and included files to better reuse and organize code. I’d love to get the correct connections going so BIML and SSIS read the correct column data types, especially when it comes to BLOB data columns. I need to rework some of the package components to better adhere to best practices. Overall it’s been an interesting ride trying to figure these things out and while it may not have saved time with this particular project, I’ll be able to use the lessons learned for my next project.

SSIS, MySQL, and invalid Dates–Take #2

A while back I posted an article on handling dates in select statements from MySQL. The idea was that I wanted a quick way to generate select statements for MySQL that would automatically set any invalid date to NULL. (an invalid date being one such as “0000-00-00” or “1973-02-00”)  After some more trial and error, I realized that this script was incomplete. I’ve created an updated script that should better handle those edge cases.

The script is largely the same as the former, but I’ve added handling for invalid years or months or days. We don’t want to redirect the rows for further error handling. We want to treat the dates that won’t successfully import as NULL. This could be done several ways within SSIS, but handling the transform within the select statement reduces our need to write many transforms to handle each date time for each table.  (I suspect that BIML could be my friend for this task, but I’m not quite ready to tackle this with BIML.)

  WHEN ordinal_position = 1 THEN ‘SELECT ‘ 
  ELSE ‘, ‘ 
, CASE data_type WHEN ‘date’
THEN CONCAT(‘CASE WHEN ‘, column_name, ‘= ”0000-00-00” THEN NULL ‘
, ‘WHEN ‘, column_name, ‘ LIKE ”%0000%” THEN NULL ‘
, ‘WHEN ‘, column_name, ‘ LIKE ”%-00%” THEN NULL ‘
, ‘else ‘, column_name, ‘ END as ‘, column_name)
WHEN ‘timestamp’
THEN CONCAT(‘CASE WHEN ‘, column_name, ‘= ”0000-00-00” THEN NULL ‘
, ‘WHEN ‘, column_name, ‘ LIKE ”%0000%” THEN NULL ‘
, ‘WHEN ‘, column_name, ‘ LIKE ”%-00%” THEN NULL ‘
, ‘else ‘, column_name, ‘ END as ‘, column_name)
WHEN ‘time’
THEN CONCAT(‘CAST(‘, column_name, ‘ AS CHAR(8) ) as ‘, column_name ) 
ELSE column_name
, CASE WHEN ordinal_position  maxord THEN  
  ELSE CONCAT(‘     FROM MySchema.’, c.table_name, ‘;’) 
  END) AS Select_Column_Name
FROM   information_schema.columns AS c
       JOIN (SELECT table_name,
                    Max(ordinal_position) AS MaxOrd
             FROM   information_schema.columns
             WHERE  table_schema = ‘MySchema’
             GROUP  BY table_name) AS t1
         ON c.table_name = t1.table_name
WHERE  table_schema = ‘MySchema’
ORDER  BY c.table_name, ordinal_position
LIMIT  0, 50000;

SSDT: Publishing Your Project

Build the project

In order to successfully publish your project, it must first be able to build successfully.

Start by building your project. Right-click the project and select “Build”.


If the build is successful, the project can be published.


You may want to create a folder within your project to store saved Publish Profiles. These can be used later to easily publish the project to your servers.



Creating Publish Profiles

Right-click the project and select Publish. This will bring up the Publish Database dialog.

  • Choosing to publish or opening a saved publish profile will initiate a build of the project.


Choose your target database, set your advanced options (similar to Schema Compare options), and choose the “Save Profile As” option to save this to a location within your project. Selecting the “Add profile to project” option will create the publish profile in the root of the project. You may wish to either move the file to a folder storing all of your publish profiles or, if you saved it without adding to the project, show all files of the project so you can include the file in the project.



Some options you may want to consider:

  • “Always re-create database” – this will re-create the database. Any data in the database will be lost.
  • “Block incremental deployment if data loss might occur” – If there are any changes that could result in the publish action failing because of data loss, this option will stop the script from running.
  • “DROP objects in target but not in project” – This will remove anything in the database that doesn’t exist in the project. Useful if you want consistency, but you may want to ensure this isn’t checked if there could be objects in the database that were created, but didn’t make it to the project.

Under the “Advanced Deployment Options”

  • Allow Incompatible Platform – Useful if you may publish to a different version of SQL Server than the one specified in the project
  • Include transactional scripts – Will run the entire update operation as a transaction. If any one part fails, the transaction will roll back. If you have cross-database dependencies, selecting this option could result in no changes being published if you’re publishing to a new server. For a new publication, you may want to de-select this option to ensure a successful deploy of what can be published.
  • Script state checks – This option will ensure that the publish action will only work on the specified server and database.
  • Verify deployment – Checks the database and project before publishing to try to ensure there are no changes that will cause problems with the publication such as missing data for a foreign key.


Using Publish Profiles

Once you’ve set up your publish profiles, you can easily use these to push changes to that server and database without needing to specify additional parameters. The easiest way to use them is to double-click the Publish Profile within the project and choose to either “Generate Script” or “Publish”.

Generate Script will generate a script for you to use to update the target at a later time (run in SQLCMD mode).

Publish will immediately attempt to push the changes to the target.

You can also use these at a later point to push changes through the SQLPackage.exe command line.



To publish your package through a command line we use something like the following:

Code Snippet
  1. sqlpackage /a:publish /sf:.\sql\Local\Adventureworks2008.dacpac /pr:.\Publish\Local.publish.xml

The above will:

  • Use the “Publish” Action
  • Use the Source File named Adventureworks2008.dacpac, built in the sql\Local folder
  • Use the publish profile named “Local.publish.xml” (defined to push to the local SQL Server)

You may want to add SQLPackage.exe to your path. By default it is installed in:

C:\Program Files (x86)\Microsoft SQL Server\110\DAC\bin

You can override quite a few of the default settings through various command line arguments. This includes source, target, and variables. You can find a full list of the command line arguments at the SQL Package reference online.


Jenkins Automation for CI

We use Jenkins at my current workplace and set up a Jenkins job to do the following (With thanks to Matthew Sneeden for the assistance.):

  • Get the latest from our mainline repository
  • Build each SQLProj file.
    • Building the SLN file will result in also attempting to publish the database
    • msbuild .\Adventureworks.sqlproj /t:build /p:Configuration=”Local”
      • This assumes that msbuild.exe is in your path.
    • Configuration is mostly to control the location of the dacpac file generated.
  • Run SQLPackage w/ a specified Publish Profile for the appropriate environment and using the newly built dacpac as the source.

We are currently investigating how we can use Snapshot files to better control releases to our UAT and Production environments. This series will be updated when that information is available.

Tales of a Lazy DBA–MySQL, SSIS, and “0” Dates

We’ve recently been tasked with converting a lot of MySQL Data into our system from a multi-tenant MySQL Database. We don’t have a fixed schedule to actually execute the imports because they’ll be on a “per customer” basis.  Well, that sounded like a great task for SSIS.  I set up the ODBC driver, connected, set the initial “sql_mode” options for our connections to (‘mssql, allow_invalid_dates’) and started to work.
First problem we ran into with an ADO.NET connection to MySQL and writing a SELECT * from schema.table was when we hit a MySQL “Date” column containing a value of ‘0000-00-00’.  SSIS threw an error, not sure what to do.  Thanks to some others who have solved this problem, I realized that within the MySQL Select statement, we could do something like:

CASE date_created WHEN ‘0000-00-00’ THEN NULL else date_created END as date_created

to pass those dates over as NULL. That solves the implicit conversion to datetime (SQL 2005) and avoids the invalid dates.  I ran something similar for a “Time” column to:

CAST(Time_Created as char(8) ) as Time_Created

So that solved one particular table export with about 20 or so CAST and CASE statements.  Needless to say, I wasn’t looking forward to doing this for another 500 tables with a total of almost 6000 columns.
I finally set up a really basic query to generate most of the SELECT statements we would need to pull our MySQL data across without too much pain. Admittedly, a small step and I’d still need to copy/paste when I set up each new ADO.NET source, but it worked reasonably well. I’m adding the code snippet for MySQL here in case anyone else has a similar problem.

CONCAT(CASE WHEN ordinal_position = 1 THEN 'SELECT ' ELSE ', ' END,
 CASE Data_Type WHEN 'date'
 THEN CONCAT('CASE ',column_name,' WHEN ''0000-00-00'' THEN NULL else '
    ,column_name,' END as ',column_name)
 WHEN 'time' THEN CONCAT('CAST(',column_name,' AS CHAR(8) ) as ',column_name)
 ELSE column_name END, CASE WHEN Ordinal_Position  MaxOrd THEN ''
    FROM MySchema.', c.table_name) END
    ) as Select_Column_Name
from information_schema.columns as c
JOIN (select table_name, MAX(ordinal_position) as MaxOrd
        from information_schema.columns
        WHERE Table_Schema = 'MySchema'
        GROUP BY Table_Name) as t1
    ON c.Table_Name = t1.Table_Name
where table_schema = 'MySchema'
order by c.table_name, ordinal_position LIMIT 0, 50000;

I’d love to hear other ideas if anyone has encountered this before and come up with a more elegant solution for translating “0” Date or Time data from MySQL into SQL Server.

SQL 2008 R2 – SSIS Data Export Oddities

I don’t know if anyone else has had these sorts of issues, but we work quite a bit with conversions of various source data into our SQL Server database. I recently was handed a backup from SQL 2008 R2. I figured that wasn’t a huge problem. Restore onto my local test box, use SSIS to push over to a SQL 2005 instance so it can work its way to Production. No big deal, right?

The first time I tried this using the SQL Native Client 10.0 on both sides, I ran into an errors with the mappings. I can’t quite figure that one out because from what I can see, there’s nothing at all in the source database using a feature that would not be available in SQL 2005.

I tried exporting to an MS Access MDB file. No luck from SQL 2008 R2 with the default settings because of an invalid size error on a varchar to longtext conversion.

I was able to successfully export using SNAC from SQL 2008 R2 to a SQL 2008 instance. So from there I thought I could upload directly to the SQL 2005 instance. No such luck using SNAC – again. I was able to export from there to an Access MDB file and pull that into SQL 2005.  I don’t quite get why that worked, but figure I’ve got some odd mapping in the XML files defining the defaults that I’m missing.


I was recently asked to repeat this task and figured there had to be a better way. This time I restored the DB to my R2 instance – no issues. I then used the SNAC client to access my R2 instance, but set up an OLEDB connection to my SQL Server 2005 target. For some reason, the mappings are just different enough that this worked with no issues. I was able to transfer directly. I now wish I’d tried that the first time, but I’d already blown a couple of hours on it.

If anyone else has encountered that and knows why SNAC from 2008 R2 doesn’t seem to work directly to SNAC on 2005, I’d love to know the reasons as well. If not and you encounter something similar, maybe trying the OLEDB connections will work for you.

SSIS, OLEDB Data Sources, and Table Variables

I ran across an interesting problem when trying to tune some SSIS data source performance issues the other day. I had noticed that we were doing a pretty large select out of a table filtered on a handful of date-related columns. We then pulled in most of the data from the table. Well, SQL Server saw this query and apparently figured that since we needed most of the columns, a table scan would be the most efficient way to get that data.  Since in real life, we only need a very small percentage of those rows, I decided to rewrite the select to pull in the PK values first, then pull in the data based on that subset.

My first attempt was to use a Temp Table. Sadly, using a temp table does not work in SSIS Data Sources, even if you put that inside a stored procedure. Disappointed, but not deterred, I decided to use a Table Variable. I quickly made the changes and tested them with some values I knew would returned quite a few results. Great – no worries.  I was still using a stored procedure at the time and used some of our standard code to start it off so had no issues.

Ultimately, I didn’t want to write a bunch of stored procedures that I would need to maintain outside of SSIS just for this particular load.  We had the code already inside of the OLE DB Sources, just not optimized as much as I’d like. I started pulling out the old queries, adding a table variable, populating it with the necessary PK values, the using that to filter out the main data set.  Save, deploy, run, see major speed increase.

Then, I looked at the actual insert/update counts. Yes, it ran much faster, but in the process, it didn’t pick up any results.  I was pretty confused because the step was successful, the code ran as expected through SSMS, there were no errors or warnings. I re-verified that running that same code got results in SSMS; it did.  I started doing some searching to see what might be happening and found this post by Dustin Ryan.

In short, it seems that if you have row counts being returned (default behavior), you return a number of rows affected when you insert the table variable, which results in a “Done in Proc” type message, which in turn tells SSIS that the data flow is done and moves on to the next step.  The workaround?  Add SET NOCOUNT ON; to the beginning of your OLEDB Source query (or make sure this is in your stored proc – we tend to put that in our procs by default).

Once again, thanks to Dustin Ryan for his post on this topic. There’s a good possibility I’d still be struggling with the root cause of this without that assistance.

SSIS and XML Output

I was pretty surprised by this and am in pretty good company.  SSIS 2005 and 2008 do not have an XML Destination adapter. We have another “opportunity” from Microsoft to work out a way to export XML data.  I’ll disagree that this is a good idea, despite those saying that it’s easy to work around in a script component.  I’m not a great SSIS Script writer.

My problem – I want to take a simple SQL query that uses FOR XML AUTO to generate XML and put that into a file.  If you set this as the only part of your source query, you end up with an output of DT_Image.  Converting that to text results in a long string of numbers. While I was amused at the result, it didn’t help me generate the necessary XML for our partners.

I came up with a relatively simple workaround that works well for me because I’m only dealing with one XML Data Set at a time.  I put something like the following in my OLEDB Source:



I was then able to pipe that into a Delimited Text File destination with a CRLF Delimiter and no column headers. That generated results I could pass on to our partners. While it’s not the most elegant solution, it was much easier to me than trying to write and maintain a script component just to handle XML.

Data Loads

One of the things I really like about SQL Server Integration Services is that it can load large data sets quickly, especially if you need to look up values along the way.  However, there are times that things behave in strange manners that seem hard to explain.  We were recently trying to load about 64 million rows into a warehouse-type database and kept hitting a bottleneck once we got past around 30-35 million rows.  No matter what I tried, the load would run at about 2 million rows per minute, gradually slowing down until it reached the 10,000 rows per second load speed.  I found several helpful posts, including some pointers to check the MaxRowBufferSize and similar settings.  Tweaking these helped someone who was on a memory-bound machine with a very similar problem.  I tried that and saw my initial load times improve, but still slowed down to a crawl at around the same point.

I learned more about looping through (aka “shredding”) a recordset of CustomerID values to try to go through that set.  I set up a FOR Loop to loop through 5 million records at a time, I checked for processes running against the table. Nothing seemed to help.  I had posted information on Twitter, had a co-worker looking at the package with me, and even got some replies from Paul Randal.  (Thank you for the assist, Paul – sorry I wasn’t checking my tweets more regularly that time.)

In the process above, I was challenged trying to set the CustomerID in some dynamic fashion on my source query and eventually resorted to treating the SQL Command as an expression and hacking in my variables into the WHERE clause.  It worked, but definitely felt awkward. I am more than open to some suggestions about how to use SSIS variables inside of a FOR loop as part of the OLEDB Source Command.  I also learned a little more about watching variable values – like needing to have a breakpoint set before they’ll be available.  And my co-worker pointed me to a wonderful discussion showing that DtExecUI runs only in 32-bit mode. I’d been using that as a quick/easy way to get my parameters plugged in and to keep an eye on the general progress of the load.

About the same time that Paul suggested checking my index fragmentation levels, I remember seeing a very similar behavior pattern several years ago when a co-worker was trying to update a column in the middle of a somewhat wide clustered index on a multi-million row table. The server was thrashing all over the place, trying to re-order millions of rows as this command took place.  I checked the table and sure enough, there was a wide clustered index on the table that would cause exactly that behavior.  I don’t know why I didn’t check it before other than thinking that I’d just created a test table and no indexes on it.  In retrospect, that was poor planning on my part.  For the future, I’ll remember to check clustered indexes before I try to load millions of (unsorted) rows into a new table.

It was a somewhat humbling experience, mostly because one of the basic things I know about loading lots of data was also something that I just completely ignored when doing all of my troubleshooting. Still, a little humility is a good thing and I’ve now re-learned several programming techniques in the process. The best part about this was seeing a 6 hour load process reduced down to 30 minutes, including dropping and re-creating the clustered index. That’s going to make a noticeable difference in our nightly processes.