PDW Proof-of-Concept (POC) Lessons
I recently had the opportunity to perform the very first Microsoft SQL Server 2012 Parallel Data Warehouse (PDW V2) POC in South Africa for a very large federated enterprise. I want to share the key lessons I learned from this experience with the rest of the Microsoft (MS) community, without giving away too much…
Note: This is a technical post.
Background: This is massively parallel processing (MPP)
There is a very good reason this is number 1.
Ensure you have S.M.A.R.T. goals for the POC.
P.S. Dear Mr. Customer, remember: it is a proof of concept, not a proof of solution.
S – Specific:
“Improve performance” is not specific; anyone can improve performance on a customer’s badly written code.
“Improve performance for Query 1 by 10 times” is a better example.
M – Measurable:
Following on from the above goal.
An example would be: Query 1 currently runs for 6000 seconds loading 10 million records into the fact table.
A – Achievable:
Yes or no. The first two criteria determine whether the goal is achievable. If the customer expects a 1000-times improvement, that is not realistic.
Ten times is achievable, and we have the potential to improve even on that number.
It is important here to have a copy of the DDL and DML so you can assess the complexity involved and whether the goal is achievable.
R – Realistic:
Same as above: if the goal is not realistic, it won’t be achievable either.
A realistic scope is no more than 3-5 fact tables for the POC. Once the scope grows larger than this, the time-frame becomes too long for a POC, and the engagement starts moving towards a pilot or a full solution.
T – Time based:
“The POC needs to be complete in 6 weeks, ending on 10 October 2013” is a good example.
If there is no time box then it is not a POC.
Ensure you have dedicated resources on the POC. A POC requires a huge amount of effort in a very short time-frame, with excellent results. Ensure that the team has a strong project manager (PM) to keep everyone honest and the progress ball rolling. The PM can also greatly assist with logistics.
Be aware that changes will be required to the ETL, including SSIS and T-SQL.
Here you will need the PDW Destination adapter in SSIS, and you will need to create “PDW loading” packages to move the data from SMP to PDW. For large-volume tables, dwloader should provide even better loading speeds. There are slight differences in performance between FASTAppend and Append modes.
FASTAppend is slightly faster. Both are very fast: I achieved at least 1 million rows inserted per minute on a very wide table with mostly nvarchar data types.
Append mode is transaction-safe: it loads into a “temporary” table first and then moves the data into the destination table.
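As a rough sketch, a dwloader invocation for a pipe-delimited extract might look like the example below. The server address, credentials, file path, and table name are placeholders; verify the flags against the dwloader documentation for your appliance version.

```
dwloader.exe -S 10.0.0.1 -U loaduser -P ********
    -i C:\extracts\FactSales.txt
    -T MyDW.dbo.FactSales
    -t "|" -r \r\n
    -M fastappend -m
```

Note that fastappend mode skips the intermediate table, which is why it requires multi-transaction mode (-m) and offers no rollback on failure.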
If you have a long stored procedure with several hundred lines of code, you might need to break it into smaller CTAS (CREATE TABLE AS SELECT) pieces to gain faster load times.
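To illustrate the CTAS pattern (the table, column, and distribution key names below are invented for the sketch), a single INSERT…SELECT step inside such a procedure could be rewritten as a minimally logged CTAS into a distributed table:

```sql
-- Instead of: INSERT INTO dbo.StageSales SELECT ... FROM dbo.RawSales ...;
-- create the intermediate result in one CTAS step:
CREATE TABLE dbo.StageSales
WITH (DISTRIBUTION = HASH(CustomerKey))
AS
SELECT  s.CustomerKey,
        s.ProductKey,
        SUM(s.Amount) AS TotalAmount
FROM    dbo.RawSales AS s
GROUP BY s.CustomerKey, s.ProductKey;
```

Each CTAS materialises one intermediate result as a real distributed table, which the next CTAS in the chain can then read from.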
Refrain from #temp tables; CTAS into regular distributed tables instead, since the JBOD has many more disks to spread the IO across.
I have found under my test conditions that doing the CTAS directly into a clustered columnstore index (CCSI) is slower than a double step: CTAS into a heap first, then CTAS into the CCSI. In one scenario the timing was 26 minutes versus 10 minutes (6 + 4).
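A minimal sketch of that two-step pattern, with hypothetical table names, looks like this:

```sql
-- Step 1: CTAS into a heap (fast, minimally logged)
CREATE TABLE dbo.FactSales_Heap
WITH (DISTRIBUTION = HASH(CustomerKey), HEAP)
AS
SELECT * FROM dbo.StageSales;

-- Step 2: CTAS from the heap into the clustered columnstore table
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales_Heap;

-- Clean up the intermediate heap
DROP TABLE dbo.FactSales_Heap;
```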
Don’t ask me why, it just is faster.
Provide regular feedback to the client. This not only shows progress but also raises the customer’s excitement about the PDW.
Without an on-site appliance to work on, make sure you have a fast, reliable VPN (internet) connection to the data center hosting the appliance.
You should see load time gains of at least 10x.
Query time improvements should be much greater, since the data is stored in a CCSI structure distributed across several nodes.