Friday, September 14, 2007

SQL Server 2005 Analysis Services – Column Truncation "Inconvenience"

I've been using SQL Server Analysis Services recently to analyze large amounts of data (texts, to be precise), and I've encountered an intriguing problem, the investigation of which has led me to some unexpected discoveries (and a workaround).

To reproduce the problem start SQL Server Business Intelligence Studio, then follow these 6 simple steps:

  1. Create a new SQL Server Analysis project;
  2. Add a Data Source object, pointing to the AdventureWorks database on your SQL Server 2005;
  3. Add a Data Source View using the Data Source added in the previous step and create a Named Query using the following statement:
    select sys.objects.[name] as ObjectName
     ,sys.objects.[object_id] as ObjectId
     ,object_definition(sys.objects.[object_id]) as ObjectDefinition
     from sys.objects
     where (sys.objects.[type] not in
       ('F', 'IT', 'PK', 'S', 'SQ', 'U', 'UQ'))
    This query returns the definitions of certain database objects; its purpose in this case is to provide an OLAP dimension using the nvarchar(max) data type for its members.

    If the Logical Primary Key for the named query is not selected automatically, set the ObjectId column as the Logical Primary Key;
  4. Add a Cube using the Data Source View created in the previous step. Use all the default settings;
  5. Note that a single Dimension has been created;
  6. Deploy the project.

I should have said *attempt* to deploy the project. ;) Yes, the deployment fails with the following error message:

Errors in the back-end database access module. The size specified for a binding 
was too small, resulting in one or more column values being truncated.

Why?

Upon further investigation (i.e. by processing each object individually) we can see that our »Object Definitions« dimension is the source of the problem. It would be quite reasonable to expect this error if we had altered the data source view or the underlying objects after the dimension had been created, and if this modification had resulted in an increase in the size of any of the data source view columns, as is the case in an issue reported on Microsoft Connect by Pieter M.

That problem does not apply to our case, though, as we've only just created the objects. Let's look at the dimension's definition (right-click the »Object Definitions« dimension in Solution Explorer and select »View Code«) to continue our investigation. Look for the Attribute named »Object Definition« in the XML (Dimension/Attributes/Attribute/Name). Under the KeyColumns/KeyColumn elements you can see the properties for the Object Definition column.

<Attribute dwd:design-time-name="28921594-bbde-434a-b855-8b76cccf1e8a">
 <ID>Object Definition</ID>
 <Name>Object Definition</Name>
 <EstimatedCount>191</EstimatedCount>
 <KeyColumns>
  <KeyColumn dwd:design-time-name="0377ce84-9c52-40ac-8149-64f2cf826009">
   <DataType>WChar</DataType>
   <Source xsi:type="ColumnBinding"
    dwd:design-time-name="2509b813-0e0f-40e5-aa48-317a60cfe881">
    <TableID>ObjectDefinitions</TableID>
    <ColumnID>ObjectDefinition</ColumnID>
   </Source>
  </KeyColumn>
 </KeyColumns>
 <OrderBy>Key</OrderBy>
 <InstanceSelection>List</InstanceSelection>
</Attribute>

In the DataType element we can see that the data type is correctly mapped (i.e. nvarchar(max) to WChar); what we cannot see is the DataSize element. But, hold on, do we really need one? After all, nvarchar(max) is supposed to be unlimited, isn't it? Well, it is limited to 2 GBytes (2,147,483,647 Bytes, to be exact).

How large, in fact, is the data in the Object Definition column? A quick check in SQL Server Management Studio...

select max(datalength(object_definition(sys.objects.[object_id]))) as MaxSize
 from sys.objects
 where (sys.objects.[type] not in
   ('F', 'IT', 'PK', 'S', 'SQ', 'U', 'UQ'))

...reveals that the largest value is "only" 12,844 Bytes. Ok, let's take on the challenge, and add the following element to the dimension's definition...

<DataSize>12844</DataSize>

...as a sibling following the DataType element (as documented in Books Online). If we now process the dimension, the processing finishes successfully, and we can deploy the project. There it is!
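
For clarity, this is what the KeyColumn element looks like with the DataSize in place (everything else is unchanged from the listing above):

<KeyColumn dwd:design-time-name="0377ce84-9c52-40ac-8149-64f2cf826009">
 <DataType>WChar</DataType>
 <DataSize>12844</DataSize>
 <Source xsi:type="ColumnBinding"
  dwd:design-time-name="2509b813-0e0f-40e5-aa48-317a60cfe881">
  <TableID>ObjectDefinitions</TableID>
  <ColumnID>ObjectDefinition</ColumnID>
 </Source>
</KeyColumn>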

So, if we can predict the size of our [n]varchar data, we're all set. But what if we can't?

BTW, if you're thinking of explicitly setting the size of [n]varchar data source view columns, bear in mind that even this won't work every time. For instance: casting the values to nvarchar(4000) results in the value for the DataSize element being set to 1,024 Bytes. How weird is that!?
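
For reference, the cast I'm referring to would look something like this in the named query (bear in mind that the cast itself silently truncates any definition longer than 4,000 characters, which is a problem of its own):

select sys.objects.[name] as ObjectName
 ,sys.objects.[object_id] as ObjectId
 ,cast(object_definition(sys.objects.[object_id]) as nvarchar(4000))
   as ObjectDefinition
 from sys.objects
 where (sys.objects.[type] not in
   ('F', 'IT', 'PK', 'S', 'SQ', 'U', 'UQ'))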


Those Magic Numbers...

I must admit I'm somewhat baffled by how the mapping between Database Engine and Analysis Services data types is implemented. After all, both the nvarchar(max) and the WChar data types are based on the String data type (System.String) of the .NET Framework, which means that the upper limit of their sizes is supposed to be 2 GBytes. Why, then, would the actual size still need to be specified? And why isn't it specified automatically or – dare I say it – simply assumed? And how can 1024 be equal to 4000?
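
Part of the answer to the second question might be this (just a guess on my part): for the (max) types the Database Engine doesn't report any fixed length at all, so there is no concrete value for the designer to copy into a DataSize element. A quick, hypothetical check (the temporary table is mine, and only there for the sake of the demonstration):

-- materialize the named query's expression into a temporary table, then ask
-- the engine what length it reports for the resulting column; a max_length
-- of -1 stands for "max", i.e. there is no fixed size to map
select object_definition(sys.objects.[object_id]) as ObjectDefinition
 into #defs
 from sys.objects
 where (sys.objects.[type] not in
   ('F', 'IT', 'PK', 'S', 'SQ', 'U', 'UQ'))

select [name], max_length
 from tempdb.sys.columns
 where [object_id] = object_id('tempdb..#defs')

drop table #defs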

I have tried specifying the maximum value of 2147483647 – after all, one can't always predict the size of nvarchar(max) data, don't you agree? Isn't that what the *MAX* is for? Anyway, I was greeted with this wonderful error message:

Memory error: Allocation failure : Not enough storage is available to process 
this command.

To respond to this particular exception, the documentation suggests closing programs, reducing paging, adding more RAM, etc., all of which seems pretty absurd given the situation, but – as you probably know – it's just a generic operating system exception.

So, if we can't specify the said maximum value, what can we do? The most logical solution would be to analyze existing data and perhaps come up with an approximation of how large the data is going to grow in the future. Which is a fine strategy, until it breaks. Well, I've been playing around a bit with this option and have found what seems to be the actual upper limit for the WChar data type. And it's a weird one, to say the least. Does the size of 163,315,555 Bytes mean anything to you?

Yes, here it is (for you to copy and paste to your project)...

<DataSize>163315555</DataSize>

...the actual limit that still allows the deployment (at least the deployment of this particular SSAS project).
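
Just to put the magic number into perspective (plain arithmetic on my part, nothing SSAS-specific here):

select 163315555 / 1048576.0 as SizeInMB  -- roughly 155.7 MBytes
 ,163315555 / 2 as UnicodeCharacters      -- roughly 81.6 million characters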

WTF? If you have any clue, please, leave a comment. And if you can deploy the project using a larger value, please, provide a repro. I really would like to know.


...

So, why all the fuss? Who needs to analyze large pieces of text? Well, I do. :) Honestly, I've come across this little column truncation "inconvenience" merely by chance. I don't think the texts that I need to run analyses against would ever exceed a thousand Bytes in size, but one never knows... And now I know they should never exceed 156 MBytes. :)


ML


p.s. I've reported this issue on Connect as well, so, please, vote on it if you think it should be resolved.
SSAS: varchar/nvarchar column size incorrectly mapped between data source views and cube dimensions (#298085)

Friday, September 07, 2007

An Unusual Data Rescue

Take a look at the picture below.

The art of shredding...

Somewhere in that lovely mess, on one shred of paper – or more? – there is an eleven digit number that I need.

How lovely!

Luckily for me, I've managed to find all the pieces. And it only took about 20 minutes. :) Which is still a long time compared to about three seconds that it originally took to cause the problem...

Lessons learned:

  1. Don't shred anything unless you're absolutely, positively, beyond a shadow of a doubt certain that complete and utter destruction is what you're after.
  2. Shredders are not perfect.

Phew!


ML