Tuesday, August 08, 2006

Full synchronisation

In contrast to Unidirectional synchronisation (discussed in a previous post) also better described as source-to-target data propagation, the purpose of Full synchronisation is to consolidate data at both locations – following the meaning of the word 'synchronisation' more closely: resulting in current historical events (e.g. business transactions) at both locations.

From Greek: syn- (together [with]), chronos (time) – synchronos (at the same time, in the same period).

The rules sound fairly simple:

  • Insert all rows from one location to the other if they don't already exist at the destination;
  • Update the rows at one location with the values from the other location that were changed at a later time.

What about deleted rows? Deleting rows from one location that do not also exist at the other might sound right but would quite obviously be in conflict with the insert rule. Each synchronisation would simply result in rows being un-deleted. One quite simple alternative is to use a Deleted flag – e.g. a bit column for which the value 1 designates a 'deleted' row and the default value of 0 a 'regular' row. In addition the delete operation could be prevented altogether through the use of an INSTEAD OF DELETE trigger.

As far as the synchronisation is concerned the Deleted flag allows the update operation to handle 'deleted' rows: rows flagged as deleted at one location are in turn flagged as deleted at the other.

These two tables represent the two locations:

create table dbo.RemoteSales
 (
 SaleId  int  identity (1, 1) not null
 ,DateChanged datetime   not null
 ,ProductName varchar(128)   not null
 ,CustomerName varchar(128)   not null
 ,SaleDate datetime   not null
 ,Quantity decimal(18,2)   not null
 ,Deleted bit    not null
 )
go

alter table dbo.RemoteSales
 add constraint pk_RemoteSales_SaleDate_ProductName_CustomerName
   primary key nonclustered
   (
   SaleDate
   ,ProductName
   ,CustomerName
   )
go

create unique clustered index ux_RemoteSales_SaleId
 on dbo.RemoteSales
  (
  SaleId
  )
go

alter table dbo.RemoteSales
 add constraint df_RemoteSales_SaleDate
   default (getdate())
   for SaleDate
   with values
go

alter table dbo.RemoteSales
 add constraint df_RemoteSales_DateChanged
   default (getdate())
   for DateChanged
   with values
go 

alter table dbo.RemoteSales
 add constraint df_RemoteSales_Deleted
   default (0)
   for Deleted
   with values
go

create table dbo.LocalSales
 (
 SaleId  int  identity (1, 1) not null
 ,DateChanged datetime   not null
 ,ProductName varchar(128)   not null
 ,CustomerName varchar(128)   not null
 ,SaleDate datetime   not null
 ,Quantity decimal(18,2)   not null
 ,Deleted bit    not null
 )
go

alter table dbo.LocalSales
 add constraint pk_LocalSales_SaleDate_ProductName_CustomerName
   primary key nonclustered
   (
   SaleDate
   ,ProductName
   ,CustomerName
   )
go

create unique clustered index ux_LocalSales_SaleId
 on dbo.LocalSales
  (
  SaleId
  )
go

alter table dbo.LocalSales
 add constraint df_LocalSales_SaleDate
   default (getdate())
   for SaleDate
   with values
go

alter table dbo.LocalSales
 add constraint df_LocalSales_DateChanged
   default (getdate())
   for DateChanged
   with values
go 

alter table dbo.LocalSales
 add constraint df_LocalSales_Deleted
   default (0)
   for Deleted
   with values
go

Notice the primary key, it's vital to the operation. It's a composite key, spanning three columns (SaleDate, ProductName and CustomerName) that uniquely identify each sale transaction. Of course in a real-life situation the customer name and the product name would not be used here, but rather be represented by foreign keys referencing a table of customers and a table of products respectively.

Now the values:

-- Some initial transactions
insert dbo.RemoteSales
 (
 DateChanged
 ,ProductName
 ,CustomerName
 ,Quantity
 )
 select dateadd(w, -5, getdate()) as DateChanged
  ,'Granny Smith' as ProductName
  ,'Jim' as CustomerName
  ,27 as Quantity
 union all
 select dateadd(w, -4, getdate())
  ,'Granny Smith'
  ,'Bob'
  ,62
 union all
 select dateadd(w, -4, getdate())
  ,'Golden delicious'
  ,'Bob'
  ,12
go

insert dbo.LocalSales
 (
 DateChanged
 ,ProductName
 ,CustomerName
 ,Quantity
 )
 select dateadd(w, -7, getdate()) as DateChanged
  ,'Granny Smith' as ProductName
  ,'Sam' as CustomerName
  ,28 as Quantity
 union all
 select dateadd(w, -12, getdate())
  ,'Granny Smith'
  ,'Lana'
  ,6
 union all
 select dateadd(w, -14, getdate())
  ,'Golden delicious'
  ,'Rob'
  ,72
go

-- A couple of conflicting rows
insert dbo.LocalSales
 (
 DateChanged
 ,ProductName
 ,CustomerName
 ,SaleDate
 ,Quantity
 )
 select top 1
  getdate() as DateChanged
  ,dbo.RemoteSales.ProductName as ProductName
  ,dbo.RemoteSales.CustomerName as CustomerName
  ,dbo.RemoteSales.SaleDate as SaleDate
  ,120 as Quantity
  from dbo.RemoteSales
  order by DateChanged
go

insert dbo.RemoteSales
 (
 DateChanged
 ,ProductName
 ,CustomerName
 ,SaleDate
 ,Quantity
 ,Deleted
 )
 select getdate() as DateChanged
  ,dbo.LocalSales.ProductName as ProductName
  ,dbo.LocalSales.CustomerName as CustomerName
  ,dbo.LocalSales.SaleDate as SaleDate
  ,dbo.LocalSales.Quantity as Quantity
  ,1
  from dbo.LocalSales
  where (dbo.LocalSales.CustomerName = 'Lana')
go

Before designing the synchronisation script let's take a closer look at the rules:

The insert rule is as simple as possible: all transactions must be stored at both locations.

The update would be difficult without the DateChanged column. With it the job is also quite simple – 'the latest state prevails'. All latest changes to the rows must exist at both locations – including the Deleted column values to accurately represent cancelled or annulled transactions.

By using a FULL OUTER JOIN query we can observe the differences between the remote and the local table:

select *
 from dbo.RemoteSales
  full join dbo.LocalSales
    on (dbo.LocalSales.ProductName = dbo.RemoteSales.ProductName)
     and (dbo.LocalSales.CustomerName = dbo.RemoteSales.CustomerName)
     and (dbo.LocalSales.SaleDate = dbo.RemoteSales.SaleDate)
 order by dbo.RemoteSales.DateChanged desc

Some rows exist at one location but not at the other while some rows have been changed at one location but the values not yet propagated to the other.

For the purposes of this rather simplified example we'll leave out any concurrency issues, and perhaps deal with them in another post.

As mentioned the insert part is simple:

  • The propagation of values from the Local table to the Remote:
    insert dbo.LocalSales
     (
     DateChanged
     ,ProductName
     ,CustomerName
     ,SaleDate
     ,Quantity
     ,Deleted
     )
     select dbo.RemoteSales.DateChanged as DateChanged
      ,dbo.RemoteSales.ProductName as ProductName
      ,dbo.RemoteSales.CustomerName as CustomerName
      ,dbo.RemoteSales.SaleDate as SaleDate
      ,dbo.RemoteSales.Quantity as Quantity
      ,dbo.RemoteSales.Deleted as Deleted
      from dbo.RemoteSales
      where (not exists(
        select *
         from dbo.LocalSales
         where (dbo.LocalSales.ProductName = dbo.RemoteSales.ProductName)
          and (dbo.LocalSales.CustomerName = dbo.RemoteSales.CustomerName)
          and (dbo.LocalSales.SaleDate = dbo.RemoteSales.SaleDate)
        ))
  • And the opposite direction:
    insert dbo.RemoteSales
     (
     DateChanged
     ,ProductName
     ,CustomerName
     ,SaleDate
     ,Quantity
     ,Deleted
     )
     select dbo.LocalSales.DateChanged as DateChanged
      ,dbo.LocalSales.ProductName as ProductName
      ,dbo.LocalSales.CustomerName as CustomerName
      ,dbo.LocalSales.SaleDate as SaleDate
      ,dbo.LocalSales.Quantity as Quantity
      ,dbo.LocalSales.Deleted as Deleted
      from dbo.LocalSales
      where (not exists(
        select *
         from dbo.RemoteSales
         where (dbo.RemoteSales.ProductName = dbo.LocalSales.ProductName)
          and (dbo.RemoteSales.CustomerName = dbo.LocalSales.CustomerName)
          and (dbo.RemoteSales.SaleDate = dbo.LocalSales.SaleDate)
        ))

By storing the time of data modifications in the DateChanged column it's also really easy to know which rows need to be updated. Of course the default constraint will only work when inserting data as long as the value is not set by the insert statement, and the value would have to be set explicitly when updating data.

  • The propagation of changes from the Local table to the Remote:
    update dbo.LocalSales
     set dbo.LocalSales.DateChanged = dbo.RemoteSales.DateChanged
      ,dbo.LocalSales.ProductName = dbo.RemoteSales.ProductName
      ,dbo.LocalSales.CustomerName = dbo.RemoteSales.CustomerName
      ,dbo.LocalSales.SaleDate = dbo.RemoteSales.SaleDate
      ,dbo.LocalSales.Quantity = dbo.RemoteSales.Quantity
      ,dbo.LocalSales.Deleted = dbo.RemoteSales.Deleted
     from dbo.RemoteSales
      inner join dbo.LocalSales
        on (dbo.LocalSales.ProductName = dbo.RemoteSales.ProductName)
         and (dbo.LocalSales.CustomerName = dbo.RemoteSales.CustomerName)
         and (dbo.LocalSales.SaleDate = dbo.RemoteSales.SaleDate)
     where (dbo.RemoteSales.DateChanged > dbo.LocalSales.DateChanged)
  • And the opposite direction:
    update dbo.RemoteSales
     set dbo.RemoteSales.DateChanged = dbo.LocalSales.DateChanged
      ,dbo.RemoteSales.ProductName = dbo.LocalSales.ProductName
      ,dbo.RemoteSales.CustomerName = dbo.LocalSales.CustomerName
      ,dbo.RemoteSales.SaleDate = dbo.LocalSales.SaleDate
      ,dbo.RemoteSales.Quantity = dbo.LocalSales.Quantity
      ,dbo.RemoteSales.Deleted = dbo.LocalSales.Deleted
     from dbo.LocalSales
      inner join dbo.RemoteSales
        on (dbo.RemoteSales.ProductName = dbo.LocalSales.ProductName)
         and (dbo.RemoteSales.CustomerName = dbo.LocalSales.CustomerName)
         and (dbo.RemoteSales.SaleDate = dbo.LocalSales.SaleDate)
     where (dbo.LocalSales.DateChanged > dbo.RemoteSales.DateChanged)

If we now execute the FULL OUTER JOIN query mentioned above we can now see that the data has been synchronised appropriately between the two locations.

What are the typical use cases that could benefit from this example? As mentioned before most real-life scenarios can be quite easily handled by Replication – which, on the other hand, is not available in all editions of SQL Server. Wherever replication cannot be used to synchronise datasources other methods will have to be used, and once again a set-based solution is the simplest to design and will generally outperform any other method.


ML

No comments: