Case Sensitive collation sort order

November 23rd, 2012

A recent inquiry from one of our front-line CSS engineers had me look into how case-sensitive collations decide the sort order. Consider a simple question: how should the values 'a 1', 'a 2', 'A 1' and 'A 2' sort?


create table [test] (
	[col] varchar(10)
		collate Latin1_General_CS_AS);
go

insert into [test] ([col]) values
	('a 1'),
	('a 2'),
	('A 1'),
	('A 2');
go

select [col]
from [test]
order by [col];

Here are two possible outputs:

	(1) 'a 1', 'a 2', 'A 1', 'A 2'
	(2) 'a 1', 'A 1', 'a 2', 'A 2'

Which one is correct? A programmer would choose the first order: 'a 1', 'a 2', 'A 1', 'A 2', because a hand-rolled string comparison routine would compare character by character until a difference is encountered, and 'a' sorts ahead of 'A'. But this answer is wrong. The correct sort order is 'a 1', 'A 1', 'a 2', 'A 2'! And if you run the query in SQL Server you will certainly get the second output. But look again at the sort order and focus on just the first character:

(query output: 'a 1', 'A 1', 'a 2', 'A 2')

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to: alphabetic ordering, diacritic ordering, case ordering.

So, in a case-sensitive collation, does 'a' sort ahead of or after 'A'? The output shows them actually interleaved: 'a', 'A', 'a', 'A'. What’s going on? The answer is that collation sort order is a little more nuanced than just comparing characters until a difference is encountered. It is described in the Unicode Technical Standard #10: UNICODE COLLATION ALGORITHM. And yes, the same algorithm applies to non-Unicode types (VARCHAR) too. The algorithm gives different weight to character differences and case differences: a difference in alphabetic order is more important than one in case order. Comparing the sort order of two strings works more like the following:

  • Compare every character in a case-insensitive, accent-insensitive manner. If a difference is found, it decides the sort order; otherwise continue.
  • Compare every character in a case-insensitive, accent-sensitive manner. If a difference is found, it decides the sort order; otherwise continue.
  • Compare every character in a case-sensitive manner (the step above already established there is no accent difference). If a difference is found, it decides the sort order; otherwise the strings are equal.
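The three passes above can be sketched in Python. This is a simplification, not the real UCA implementation: the primary weight is modeled as the accent-stripped, lowercased character, the secondary as the diacritics, the tertiary as the case, and the helper name sort_key is my own:

```python
import unicodedata

def sort_key(s, width):
    """Build a crude three-level collation key: primary = base letter,
    secondary = diacritics, tertiary = case. The string is space-padded
    to the comparison width, as SQL Server does for unequal lengths."""
    primary, secondary, tertiary = [], [], []
    for ch in s.ljust(width):
        decomposed = unicodedata.normalize('NFD', ch)
        base, accents = decomposed[0], decomposed[1:]
        primary.append(base.lower())     # level 1: alphabetic order
        secondary.append(accents)        # level 2: accents
        tertiary.append(base.isupper())  # level 3: lowercase sorts first
    return (primary, secondary, tertiary)

values = ['a 1', 'a 2', 'A 1', 'A 2']
width = max(map(len, values))
print(sorted(values, key=lambda s: sort_key(s, width)))
# → ['a 1', 'A 1', 'a 2', 'A 2']
```

Sorting by this key reproduces the interleaved order SQL Server returns, because the alphabetic difference between '1' and '2' outranks the case difference between 'a' and 'A'.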

Needless to say, the real algorithm does not need to traverse the strings three times, but the logic is equivalent to the above. And remember that when the strings have different lengths, the comparison pads the shorter string with spaces and compares up to the length of the longer string. Combined with the case sensitivity rules this leads to a somewhat surprising result when using an inequality in a WHERE clause:


select [col]
from [test]
where [col] > 'A'
order by [col];

col
----------
a 1
A 1
a 2
A 2

That’s right, we get back all 4 rows, including those that start with 'a'. This surprises some, but it is the correct result: 'a 1' should be in the result even though 'a' < 'A'. Follow the algorithm above: first we pad the shorter string with spaces, so the comparison is between 'a 1' and 'A  '. Then we do the first pass, which considers only alphabetic order, case-insensitive and accent-insensitive, character by character: 'a' and 'A' are equal, ' ' and ' ' are equal, but '1' > ' '. The comparison stops; we found an alphabetic order difference, so 'a 1' > 'A  ', the row qualifies and is included in the result. Ditto for 'a 2'.
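The padded first pass can be checked mechanically. A minimal sketch of just that pass, with the case folded via str.lower() and the space padding applied (primary_compare is a name of my own):

```python
def primary_compare(a, b):
    """First (alphabetic, case- and accent-insensitive) pass with
    SQL Server's space padding for unequal lengths."""
    width = max(len(a), len(b))
    a, b = a.lower().ljust(width), b.lower().ljust(width)
    for ca, cb in zip(a, b):
        if ca != cb:
            return -1 if ca < cb else 1
    return 0

# 'a 1' vs 'A' pads 'A' to 'A  ': 'a' == 'a', ' ' == ' ', then '1' > ' '
print(primary_compare('a 1', 'A'))  # → 1, so 'a 1' > 'A'
```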

Posted in Tutorials | No Comments »

Handling exceptions that occur during the RECEIVE statement in activated procedures

October 15th, 2012

The typical SQL Server activation procedure contains a WHILE (1=1) loop with exit conditions based on checking @@ROWCOUNT. Error handling is done via a BEGIN TRY ... BEGIN CATCH block. This pattern is present in many Service Broker articles on the web, including this web site, in books and in Microsoft samples:

create procedure [<procedure name>]
as
declare @dialog_handle uniqueidentifier
 , @message_type_name sysname
 , @message_body varbinary(max);
set nocount on;

while(1=1)
begin
 begin transaction;
 begin try
  receive top(1)
   @dialog_handle = conversation_handle
   , @message_type_name = message_type_name
   , @message_body = message_body
  from [<queue name>];
  if @@rowcount = 0
  begin
   rollback;
   break;
  end
  if @message_type_name = N'<my message type>'
  begin
   -- process the message here
   ...
  end
  else if @message_type_name = N'http://schemas.microsoft.com/SQL/ServiceBroker/Error'
     or @message_type_name = N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog'
  begin
   end conversation @dialog_handle;
  end
  commit transaction;
 end try
 begin catch
  declare @error_number int = ERROR_NUMBER()
   , @error_message nvarchar(4000) = ERROR_MESSAGE()
   , @xact_state int = XACT_STATE();
  if @xact_state = -1 or @xact_state = 1
  begin
   rollback;
  end
  -- log the error here
  ...
 end catch
end
go

This pattern, though, contains a problem: it handles a disabled queue very poorly, and hence it handles poison messages very poorly.

Error 9617 The service queue “…” is currently disabled

Read the rest of this entry »

Posted in Tutorials | Comments Off

Inside the SQL Server 2012 Columnstore Indexes

May 29th, 2012

Columnar storage has established itself as the de-facto option for Business Intelligence (BI) storage. The traditional row-oriented storage of RDBMS was designed for fast single-row oriented OLTP workloads and it has problems handling the large volume range oriented analytical processing that characterizes BI workloads. But what is columnar storage and, more specifically, how does SQL Server 2012 implement columnar storage with the new COLUMNSTORE indexes?

Read the rest of this entry »

Posted in Columnstore, SQL 2012, Tutorials | 3 Comments »

What is an LSN: Log Sequence Number

January 17th, 2012

LSNs, or Log Sequence Numbers, are explained on MSDN at Introduction to Log Sequence Numbers:

Every record in the SQL Server transaction log is uniquely identified by a log sequence number (LSN). LSNs are ordered such that if LSN2 is greater than LSN1, the change described by the log record referred to by LSN2 occurred after the change described by the log record LSN1.

There are several places where LSNs are exposed. For example, sys.database_recovery_status has the columns last_log_backup_lsn and fork_point_lsn, sys.database_mirroring has the mirroring_failover_lsn column, and the msdb table backupset contains first_lsn, last_lsn, checkpoint_lsn, database_backup_lsn, fork_point_lsn and differential_base_lsn. Not surprisingly, all these places where LSNs are exposed are related to backup and recovery (mirroring is a form of recovery). The LSN is exposed as a numeric(25,0) value that can be compared: a bigger LSN value means a later log sequence number, and can therefore indicate whether more log needs to be backed up or recovered.

Yet we can dig deeper. Look at the LSN as exposed by the fn_dblog function:

select * from fn_dblog(null, null);

Current LSN Operation Context Transaction ID ...
00000014:00000061:0001 LOP_BEGIN_XACT LCX_NULL 0000:000001e4
00000014:00000061:0002 LOP_INSERT_ROWS LCX_NULL 0000:000001e4
...
00000014:00000061:02ea LOP_INSERT_ROWS LCX_NULL 0000:000001e4
00000014:000000d9:0001 LOP_INSERT_ROWS LCX_NULL 0000:000001e4
00000014:000000d9:0002 LOP_INSERT_ROWS LCX_NULL 0000:000001e4
...
00000014:00000153:021e LOP_INSERT_ROWS LCX_NULL 0000:000001e4
00000014:00000153:021f LOP_COMMIT_XACT LCX_NULL 0000:000001e4
00000014:000001ab:0001 LOP_BEGIN_XACT LCX_NULL 0000:000001ea
...
00000014:000001ab:01ac LOP_INSERT_ROWS LCX_NULL 0000:000001ea
00000015:00000010:0001 LOP_INSERT_ROWS LCX_NULL 0000:000001ea

The LSN is shown as a three-part structure. The first part seems to stay the same (the 14), the middle part has some apparently erratic increases, and the last part increases monotonically but resets when the middle part changes. It is much easier to understand what’s happening when you know what the three parts are:

  • the first part is the VLF sequence number
  • the middle part is the offset of the log block inside the VLF
  • the last part is the slot number inside the log block
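The three-part Current LSN string can be split apart mechanically; each part is hexadecimal (parse_lsn is a helper name of my own):

```python
def parse_lsn(lsn):
    """Split an fn_dblog-style LSN 'VVVVVVVV:BBBBBBBB:SSSS' into its
    (vlf_seq, block_offset, slot) parts; each part is hexadecimal."""
    vlf, block, slot = lsn.split(':')
    return int(vlf, 16), int(block, 16), int(slot, 16)

vlf, block, slot = parse_lsn('00000014:00000061:0001')
print(vlf, block, slot)  # → 20 97 1
```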

Virtual Log files

SQL Server manages the log by splitting it into regions called VLFs, Virtual Log Files, as explained in the Transaction Log Physical Architecture. We can inspect the VLFs of a database using DBCC LOGINFO:

FileId FileSize    StartOffset  FSeqNo  Status  Parity  CreateLSN
------ ----------- ------------ ------- ------- ------- -----------------
2      253952      8192         20      2           64     0
2      253952      262144       21      2           64     0
2      270336      516096       22      2           64     20000000021700747
2      262144      786432       23      2           64     21000000013600747
2      262144      1048576      24      2           64     22000000024900748
...
2      393216      16973824     75      2           64     74000000013600748
2      393216      17367040     0       0           0      74000000013600748
2      393216      17760256     0       0           0      74000000013600748
2      524288      18153472     0       0           0      74000000013600748

(59 row(s) affected)

So back to our fn_dblog result: the Current LSN value of 00000014:00000061:0001 means the following log record:

  • the VLF sequence number 0x14 (20 decimal)
  • in the log block starting at offset 0x61 in the VLF (offsets are measured in units of 512 bytes)
  • the slot number 1

From the DBCC LOGINFO output we know that the VLF with sequence number 0x14 starts at offset 8192 in the LDF file and has a size of 253952; it is the first VLF in the log file. The next LSNs increase the slot number until LSN 00000014:00000061:02ea, and then the LSN changes to 00000014:000000d9:0001. This means that the block starting at offset 0x61 has filled up and the next block, at offset 0xd9, started to be used. If we do the math, 0xd9-0x61 is 0x78 (120 decimal) units of 512 bytes, which is exactly 60KB. You may remember that the maximum SQL Server log block size is 60KB.
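The same arithmetic works for any pair of consecutive block offsets, since offsets are in 512-byte units (block_size_kb is a helper of my own):

```python
BLOCK_UNIT = 512  # log block offsets are measured in 512-byte units

def block_size_kb(offset, next_offset):
    """Size, in KB, of the log block spanning [offset, next_offset)."""
    return (next_offset - offset) * BLOCK_UNIT // 1024

print(block_size_kb(0x61, 0xd9))    # 0x78 units = 120 * 512 bytes → 60
print(block_size_kb(0x153, 0x1ab))  # 0x58 units = 88 * 512 bytes → 44
```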

Our transaction continued to insert records until LSN 00000014:00000153:021e, and at the next LSN we have a commit operation (LOP_COMMIT_XACT). The LSN after that, which starts the next transaction, is 00000014:000001ab:0001. What does this tell us? The transaction’s last INSERT is at LSN 00000014:00000153:021e, and then it issued a COMMIT. The commit generated one more log record, the one at LSN 00000014:00000153:021f, and then asked for the commit LSN to be hardened. This causes the log block containing this LSN (the block starting at offset 0x153 in VLF 0x14) to be closed and written to disk (along with any not-yet-written log block ahead of it). The next block can start immediately after this block, so the next operation has LSN 00000014:000001ab:0001. If we do the math, 0x1ab-0x153 is 0x58 (88 decimal) units of 512 bytes, so this last log block had 44KB.

I hope by now it is clear what happened next in the fn_dblog output I showed: the transaction continues to insert records, generating more LSNs. The entry at 00000014:000001ab:01ac is not only the last one in the current log block, it is also the last one in the current VLF. The next LSN is 00000015:00000010:0001: slot 1 in the first log block (offset 0x10) of the VLF with sequence number 0x15.

Decimal LSNs

What about those LSN numbers like the 20000000021700747 CreateLSN value shown in the DBCC LOGINFO output? They are still three-part LSNs, but the values are represented in decimal. 20000000021700747 is the LSN 00000014:000000d9:02eb.
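The decimal form packs the three parts with fixed widths: the last 5 digits are the slot, the preceding 10 digits the block offset, and the rest the VLF sequence number. A sketch of the conversion (the helper names are my own):

```python
def decimal_lsn_to_parts(n):
    """Split a decimal LSN (as in backupset or DBCC LOGINFO's CreateLSN)
    into (vlf_seq, block, slot): last 5 digits = slot, previous
    10 digits = block offset, remainder = VLF sequence number."""
    n, slot = divmod(n, 100000)
    vlf, block = divmod(n, 10**10)
    return vlf, block, slot

def format_hex_lsn(vlf, block, slot):
    """Render the parts in the fn_dblog 'Current LSN' hex format."""
    return f'{vlf:08x}:{block:08x}:{slot:04x}'

print(format_hex_lsn(*decimal_lsn_to_parts(20000000021700747)))
# → 00000014:000000d9:02eb
```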

Log flush waits

Look at the following fn_dblog output:

...
00000016:0000003c:0001  LOP_BEGIN_XACT      LCX_NULL    0000:0000057f
00000016:0000003c:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:0000057f
00000016:0000003c:0003  LOP_COMMIT_XACT     LCX_NULL    0000:0000057f
00000016:0000003d:0001  LOP_BEGIN_XACT      LCX_NULL    0000:00000580
00000016:0000003d:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:00000580
00000016:0000003d:0003  LOP_COMMIT_XACT     LCX_NULL    0000:00000580
00000016:0000003e:0001  LOP_BEGIN_XACT      LCX_NULL    0000:00000581
00000016:0000003e:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:00000581
00000016:0000003e:0003  LOP_COMMIT_XACT     LCX_NULL    0000:00000581
00000016:0000003f:0001  LOP_BEGIN_XACT      LCX_NULL    0000:00000582
00000016:0000003f:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:00000582
00000016:0000003f:0003  LOP_COMMIT_XACT     LCX_NULL    0000:00000582
...

Can you spot the problem? Notice how the LSN slot number stays at 1, 2, 3 while the log block offset grows very frequently: 3c, 3d, 3e, 3f… This is the log signature of a sequence of single-row inserts, each committed individually. After each INSERT the application had to wait for the log block to be flushed to disk, and the log blocks were all of minimal size, 512 bytes.

Here is how I generated this log:

set nocount on;
declare @i int = 0;
while @i < 1000
begin
	insert into t (filler) values ('A');
	set @i += 1;
end
go

Had the application used batch commits it would have solved several issues:

  • wait for only one or a few log flushes instead of one per row.
  • write less log, since every transaction generates an LOP_BEGIN_XACT and an LOP_COMMIT_XACT record (they add up!).
  • write the operations in a few large IO requests as opposed to many small ones.
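A back-of-envelope count shows how much the batching saves, under the simplifying assumption that each transaction logs exactly one LOP_BEGIN_XACT and one LOP_COMMIT_XACT besides its inserts, and waits for one flush per commit (the helper names are my own):

```python
def log_records(rows, batch_size):
    """Insert records plus the LOP_BEGIN_XACT/LOP_COMMIT_XACT pair per
    transaction (simplified: ignores other bookkeeping records)."""
    transactions = -(-rows // batch_size)  # ceiling division
    return rows + 2 * transactions

def log_flushes(rows, batch_size):
    """One log-flush wait per commit."""
    return -(-rows // batch_size)

print(log_records(1000, 1), log_flushes(1000, 1))      # → 3000 1000
print(log_records(1000, 100), log_flushes(1000, 100))  # → 1020 10
```

For the 1000-row loop, committing every row costs 3000 log records and 1000 flush waits, while committing every 100 rows costs 1020 records and only 10 waits.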

So let's add batch commits to our test script:

set nocount on;
declare @i int = 0;
begin transaction
while @i < 1000
begin
	insert into t (filler) values ('A');
	set @i += 1;
	if @i % 100 = 0
	begin
		commit;
		begin transaction;
	end
end
commit
go

and then let's look at the log:

00000016:0000011d:0001  LOP_BEGIN_XACT     LCX_NULL   0000:000005d9
00000016:0000011d:0002  LOP_INSERT_ROWS    LCX_HEAP   0000:000005d9
00000016:0000011d:0003  LOP_INSERT_ROWS    LCX_HEAP   0000:000005d9
...
00000016:0000011d:006f  LOP_INSERT_ROWS    LCX_HEAP   0000:000005d9
00000016:0000011d:0070  LOP_COMMIT_XACT    LCX_NULL   0000:000005d9
00000016:00000136:0001  LOP_BEGIN_XACT     LCX_NULL   0000:000005db
00000016:00000136:0002  LOP_INSERT_ROWS    LCX_HEAP   0000:000005db
...
00000016:00000136:0064  LOP_INSERT_ROWS    LCX_HEAP   0000:000005db
00000016:00000136:0065  LOP_INSERT_ROWS    LCX_HEAP   0000:000005db
00000016:00000136:0066  LOP_COMMIT_XACT    LCX_NULL   0000:000005db
00000016:0000014d:0001  LOP_BEGIN_XACT     LCX_NULL   0000:000005dc
00000016:0000014d:0002  LOP_INSERT_ROWS    LCX_HEAP   0000:000005dc
00000016:0000014d:0003  LOP_INSERT_ROWS    LCX_HEAP   0000:000005dc
...
00000016:0000014d:0066  LOP_INSERT_ROWS    LCX_HEAP   0000:000005dc
00000016:0000014d:0067  LOP_COMMIT_XACT    LCX_NULL   0000:000005dc

We can see how the batch commit has created fewer, larger log blocks. The application had to wait fewer times for the log to harden, and each wait issued a larger IO request. The log blocks are also more densely filled with LOP_INSERT_ROWS operations and do not carry the overhead of logging LOP_BEGIN_XACT/LOP_COMMIT_XACT for every row inserted.

Conclusion

This little example shows not only how to understand the LSN structure but also how to read the fn_dblog output. I also wanted to show the typical log signature of a commit operation: the LOP_COMMIT_XACT operation is recorded in the log, and the log block is closed and flushed to disk; the next LSN will usually be at slot 1 in the next block. The last example even shows how reading the log can reveal potential application performance problems.

Posted in Tutorials | Comments Off

SQL Server table columns under the hood

October 20th, 2011

You can probably easily answer a question like ‘What columns does this table have?’ Whether you use the SSMS object explorer, or sp_help, or query sys.columns, the answer is fairly easy to find. But what if I ask ‘What are the physical columns of this table?’ Huh? Is there any difference? Let's see.

At the logical layer tables have exactly the structure you declared in your CREATE TABLE statement, perhaps with modifications from subsequent ALTER TABLE statements. This is the layer at which you can look into sys.columns and see the table structure, look at the table in the SSMS object explorer, and so on and so forth. But there is also a lower layer, the physical layer of the storage engine, where the table might have a surprisingly different structure from what you expect.

Inspecting the physical table structure

To view the physical table structure you must use the undocumented system internals views: sys.system_internals_partitions and sys.system_internals_partition_columns:

select p.index_id, p.partition_number,
	pc.leaf_null_bit,
	coalesce(cx.name, c.name) as column_name,
	pc.partition_column_id,
	pc.max_inrow_length,
	pc.max_length,
	pc.key_ordinal,
	pc.leaf_offset,
	pc.is_nullable,
	pc.is_dropped,
	pc.is_uniqueifier,
	pc.is_sparse,
	pc.is_anti_matter
from sys.system_internals_partitions p
join sys.system_internals_partition_columns pc
	on p.partition_id = pc.partition_id
left join sys.index_columns ic
	on p.object_id = ic.object_id
	and ic.index_id = p.index_id
	and ic.index_column_id = pc.partition_column_id
left join sys.columns c
	on p.object_id = c.object_id
	and ic.column_id = c.column_id
left join sys.columns cx
	on p.object_id = cx.object_id
	and p.index_id in (0,1)
	and pc.partition_column_id = cx.column_id
where p.object_id = object_id('...')
order by index_id, partition_number;

Let's inspect some simple table structures:

create table users (
	user_id int not null identity(1,1),
	first_name varchar(100) null,
	last_name varchar(100) null,
	birth_date datetime null);

Running our query after, of course, substituting object_id('users'):

(screenshot: physical column layout of the users heap)

We can see that the physical structure is very much as expected: the physical rowset has 4 physical columns, of the expected types and sizes. One thing to notice is that, although the column order is the one we specified, the columns are laid out on disk in a different order: user_id is stored in the row at offset 4, followed by birth_date at offset 8, and then by the two variable-length columns (first_name and last_name) that have negative offsets, an indication that they reside in the variable-size portion of the row. This should be no surprise, as we know that the row format places all fixed-length columns first in the row, ahead of the variable-length columns.
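The fixed-before-variable layout rule can be sketched as follows. This is a simplification of the real record format: fixed-length columns get positive offsets starting after the 4-byte record header, while variable-length columns get negative placeholders (the real leaf_offset values for variable columns differ, but the sign convention matches; physical_offsets is a name of my own):

```python
def physical_offsets(columns):
    """columns: list of (name, fixed_size_or_None). Returns name -> offset,
    mimicking leaf_offset: fixed columns are laid out first, starting at
    offset 4 (after the record status bits and fixed-length indicator),
    variable columns get negative ordinals."""
    offsets, fixed_at, var_ord = {}, 4, -1
    for name, size in columns:
        if size is not None:   # fixed-length column
            offsets[name] = fixed_at
            fixed_at += size
    for name, size in columns:
        if size is None:       # variable-length column
            offsets[name] = var_ord
            var_ord -= 1
    return offsets

# the users table: int (4 bytes), two varchar (variable), datetime (8 bytes)
print(physical_offsets([('user_id', 4), ('first_name', None),
                        ('last_name', None), ('birth_date', 8)]))
# → {'user_id': 4, 'birth_date': 8, 'first_name': -1, 'last_name': -2}
```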

Adding a clustered index

Our table right now is a heap. We should make user_id a primary key, so let's do that and check the table structure afterward:

alter table users add constraint pk_users_user_id primary key (user_id);


(screenshot: physical columns after adding the clustered primary key)

As we can see, the table has changed from a heap into a clustered index (index_id changed from 0 to 1) and the user_id column has become part of the key.

Non-Unique clustered indexes

Now let's say we want a different clustered index, perhaps by birth_date (maybe our application requires frequent range scans on this field). We would change the primary key into a non-clustered primary key (remember, the primary key and the clustered index do not have to be the same key) and add a clustered index on the birth_date column:

alter table users drop constraint pk_users_user_id;
create clustered index cdx_users on users (birth_date);
alter table users add constraint
	pk_users_user_id primary key nonclustered  (user_id);


(screenshot: physical columns after adding the non-unique clustered index)

Several changes right there:

  • A new index has appeared (index_id 2), which was expected; this is the non-clustered index that now enforces the primary key constraint on column user_id.
  • The table has a new physical column, with partition_column_id 0. This new column has no name, because it is not visible in the logical table representation (you cannot select it, nor update it). It is a uniqueifier column (is_uniqueifier is 1), because we did not specify that our clustered index is unique:

    If the clustered index is not created with the UNIQUE property, the Database Engine automatically adds a 4-byte uniqueifier column to the table. When it is required, the Database Engine automatically adds a uniqueifier value to a row to make each key unique. This column and its values are used internally and cannot be seen or accessed by users.



    Klaus Aschenbrenner has a good series of articles that explain uniqueifier columns in more detail.

  • The non-clustered index that enforces the primary key constraint has 3 physical columns: user_id and two unnamed ones. This is because each row in the non-clustered index contains the corresponding clustered index row key values, in our case the birth_date column and the uniqueifier column.

Online operations and anti-matter columns

Is the uniqueifier column the only hidden column in a table? No. Let's rebuild our non-clustered index, making sure we specify it as an online operation:

alter index pk_users_user_id on users rebuild with (online=on);


(screenshot: the non-clustered index now includes an anti-matter column)

Notice how the non-clustered index now has one more column, a new column with is_antimatter set to 1. As you probably guessed, this is an anti-matter column. For an in-depth explanation of the purpose of the anti-matter column I recommend reading Online Indexing Operations in SQL Server 2005:

During the build phase, rows in the new “in-build” index may be in an intermediate state called antimatter. This mechanism allows concurrent DELETE statements to leave a trace for the index builder transaction to avoid inserting deleted rows. At the end of the index build operation all antimatter rows should be cleared.

Note that even after the online operation finishes and the antimatter rows are removed, the antimatter column will be part of the physical rowset structure.

There can be more hidden columns in a table; for example, if we enable change tracking:

alter table users ENABLE CHANGE_TRACKING;


(screenshot: physical columns after enabling change tracking)

Enabling change tracking on our table has added one more hidden column.

Table structure changes

Next let's look at some operations that modify the table structure: adding, dropping and modifying columns. We're going to start anew with a fresh table for our examples:

create table users  (
	user_id int not null identity(1,1) primary key,
	first_name char(100) null,
	last_name char(100) null,
	birth_date datetime null);
go


(screenshot: physical column layout of the new users table)

Next let's modify some columns: we want to make birth_date not nullable and reduce the length of first_name to 75:

alter table users alter column birth_date datetime not null;
alter table users alter column first_name char(75);
go


(screenshot: physical columns after the two ALTER COLUMN operations)

Notice how the two alter operations each ended up adding a new column and dropping the old one. But also note how the null bit and the leaf_offset values have stayed the same. This means that the column was added ‘in place’, replacing the dropped column. It is a metadata-only operation that did not modify any record in the table; it simply changed how the data in the existing records is interpreted.

But now we figure out the 75-character length is wrong and we want to change it back to 100. Also, the birth_date column probably doesn’t need hours, so we can change it to a date type:

alter table users alter column birth_date date not null;
alter table users alter column first_name char(100);
go


(screenshot: physical columns after changing birth_date to date and first_name back to char(100))

The birth_date column has changed type and now requires only 3 bytes, but the change occurred in place, just like the nullability change we did before: it remains at the same offset in the record and keeps the same null bit. However, the first_name column was moved from offset 8 to offset 211, and its null bit changed from 4 to 5. Because the first_name column has increased in size, it cannot occupy the same space as before, and the record has effectively grown to accommodate the new first_name column. This happened despite the fact that first_name was originally of size 100, so in theory it could have reclaimed the old space it used in the record, but this is simply too much of a corner case for the storage engine to consider.

By now we figure that the fixed length of 100 for the first_name and last_name columns was a poor choice, so we would like to change them to more appropriate variable-length columns:

alter table users alter column first_name varchar(100);
alter table users alter column last_name varchar(100);
go


(screenshot: physical columns after the change to varchar)

The type change from fixed-length to variable-length column cannot be done in place, so the first_name and last_name columns get new null bit values. They also have negative leaf_offset values, which is typical for variable-length columns as they don’t occupy fixed positions in the record.

Next let's change the length of the variable columns:

alter table users alter column first_name varchar(250);
alter table users alter column last_name varchar(250);
go

(screenshot: physical columns after increasing the varchar lengths)

For this change the column length was modified without dropping the column and adding a new one. An increase in size of a variable-length column is one of the few operations that really alters the physical column rather than dropping it and adding a replacement. However, a decrease in size, or a nullability change, does again drop and add the column, as we can quickly check:

alter table users alter column first_name varchar(100);
alter table users alter column last_name varchar(250) not null;
go


(screenshot: physical columns after the size decrease and nullability change)

Finally, let's say we tried to add two more fixed-length columns, but we were undecided on the name and length, so we added a couple of columns, then dropped them and added a new one:

alter table users add mid_name char(50);
alter table users add surname char(25);
alter table users drop column mid_name;
alter table users drop column surname;

alter table users add middle_name char(100);
go


(screenshot: physical columns after the add/drop sequence)

This case is interesting because it shows how adding a new fixed column can reuse the space left in the row by dropped fixed columns, but only if the dropped columns are the last fixed columns. In our table the columns mid_name and surname were originally added at offsets 211 and 261 respectively, and their lengths added up to 75 bytes. After we dropped them, the middle_name column we added is placed at offset 211, thus reusing the space formerly occupied by the dropped columns. This happens even though the length of the newly added column is 100 bytes, bigger than the 75 bytes occupied by the dropped columns before.

By now our 5-column table actually has 15 columns in the physical storage format: 10 dropped columns and 5 usable ones. The table uses 412 bytes of fixed space for the 3 fixed-length columns that have a total length of only 112 bytes. The variable-length columns that now store first_name and last_name are stored in the record after these 412 bytes reserved for fixed columns that are now mostly dropped. Since the records always consume the entire reserved fixed size, this is quite wasteful. How do we reclaim it? Rebuild the table:

alter table users rebuild;
go


(screenshot: physical columns after the rebuild)

As you can see, the rebuild operation got rid of the dropped columns, and now the physical storage layout is compact, aligned with the logical layout. So when making changes to a table structure, remember that at the storage layer most changes are cumulative; they are most often implemented by dropping a column and adding a new column back with the new type/length/nullability. Whenever possible a modified column reuses the space in the record, and newly added columns may reuse space previously used by dropped columns.

Partitioned Tables

When partitioning is taken into account, another dimension of the physical table structure is revealed. To start, let's consider a typical partitioned table:

create partition function pf (date) as range for values ('20111001');
go

create partition scheme ps as partition pf all to ([PRIMARY]);
go

create table sales (
	sale_date date not null,
	item_id int not null,
	store_id int not null,
	price money not null)
	on ps(sale_date);
go


(screenshot: physical columns of the two sales partitions)

As we know, partitioned tables are, from a physical point of view, a collection of rowsets that belong to the same logical object (the 'table'). Looking under the hood shows that our table has two rowsets, and they have identical column structure. Typically new partitions are added by an ETL process that uploads data into a staging table, which then gets switched in using a fast partition switch operation:

create table sales_staging (
	sale_date date not null,
	item_id int not null,
	store_id int not null,
	price numeric(10,2) not null,
	constraint check_partition_range
		check (sale_date = '20111002'))
	on [PRIMARY];

alter partition scheme ps next used [PRIMARY];
alter partition function pf() split range ('20111002');
go

alter table sales_staging switch to sales partition $PARTITION.PF('20111002');
go
Msg 4944, Level 16, State 1, Line 2

ALTER TABLE SWITCH statement failed because column 'price' has data type numeric(10,2) in source table 'test.dbo.sales_staging' which is different from its type money in target table 'test.dbo.sales'.

OK, so we made a mistake in the staging table. Let's correct it, then try to switch it in again:

alter table sales_staging alter column price money not null;
alter table sales_staging switch to sales partition $PARTITION.PF('20111002');
go


(screenshot: physical columns after the partition switch)

We see that the partition switch operation has switched the rowset of the staging table into the second partition of the sales table. But since the staging table underwent an operation that modified a column type, which in effect dropped the numeric price column and added a new price column of the appropriate money type, the second rowset of the partitioned table now has 5 columns, including a dropped one. The partition switch operation brings the staging table into the partitioned table as is, dropped columns and all. This means that partitions of a partitioned table can have, under the hood, a completely different physical structure from one another. Normally this should be of little concern, just a curiosity to wow the newbies at DBA cocktail parties, but the problem has a darker side, known as KB2504090:

This issue occurs because the accessor that SQL Server uses to insert data into different partitions recognizes the metadata changes incorrectly. When data is inserted into the new partition that is created after a column is dropped, the number of the maximum nullable columns in the new partition may be one less than the number of the maximum nullable columns in the old partition.

If you ever find that your partitioned tables do not have a uniform internal structure across all partitions, I recommend you rebuild the partition that is different. This simple query can be used to look at the number of physical columns in each partition of a table:

select count(*) as count_columns,
	index_id,
	partition_number
from sys.system_internals_partitions p
join sys.system_internals_partition_columns pc
	on p.partition_id = pc.partition_id
where p.object_id = object_id('sales')
group by index_id, partition_number;
go


(screenshot: physical column counts per partition)

This query shows that partitions 1 and 3 have 4 physical columns, while partition 2 has 5. We can rebuild partition 2:

alter table sales rebuild partition = 2;
go

With this operation we have rebuilt partition number 2 from scratch and removed all dropped columns in the process.

Note that my query above only detects differences in the number of physical columns; two partitions can still have a very different physical layout even if they have the same number of physical columns, e.g. one can have physical column number 9 dropped and physical column number 10 in use, while the other can have the opposite. Use your judgement when looking at the internal physical column layout to decide whether the partitions are truly the same.

Posted in CodeProject, Samples,