Database cleanup guide

After my previous post on “Primary key recommendations”, I was asked to write a general guide on how to approach a cleanup project like the one I’m doing at the moment. There is no definitive answer for how to do this; you have to use all your database skills and simply go through everything.

The tools for doing the work depend on what you need to do. So far in my project, I have used “Microsoft SQL Server Management Studio” (download Express here) and the UltraEdit editor, which can edit large files and has the nice features most programmers like. Another tool that is good for this kind of work is Redgate SQL Toolbelt. In addition, I use a “Visio” database diagram to verify that the data model is correct, using its reverse engineering feature.

My recipe for the cleanup project:

  • Script the database schema in SSMS
    • Schema only
    • No indexes, since these will be recreated later
  • Alter the script with the most obvious and simple changes
  • Import the script into a new working database
  • Change and add constraints
  • Verify the data model in Visio
  • Add the original or cleaned data to the working database with the new schema
  • Go through the code to verify indexes manually, or use “SQL Profiler” with the “Database Engine Tuning Advisor” to get a quick start on indexes and statistics for the new database.
  • Deploy the database to the test environment and run tests against the application itself if you don’t have access to testing facilities at the database level.

Script cleanup

The current database uses GUIDs for the PK columns, stored in char(38) columns. This is a major performance issue, so I changed every PK column from char(38) to the uniqueidentifier datatype.
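
A minimal sketch of the conversion for a single table is shown below. The table and constraint names are just examples, any foreign keys referencing the column have to be dropped and recreated as well, and the char(38) values must be valid GUID strings for the conversion to succeed.

-- Drop the existing primary key so the column can be altered
ALTER TABLE Customer DROP CONSTRAINT PK_Customer;

-- char-to-uniqueidentifier conversion works when the strings are valid GUIDs
ALTER TABLE Customer ALTER COLUMN Id uniqueidentifier NOT NULL;

-- Recreate the primary key on the converted column
ALTER TABLE Customer ADD CONSTRAINT PK_Customer PRIMARY KEY (Id);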

Determine the columns that can be set to “NOT NULL” in each table. This can be done with the following statements for the tables and columns you need to investigate:

SELECT COUNT(*) FROM <table> WHERE <column> IS NULL;
SELECT COUNT(*) FROM <table> WHERE LEN(LTRIM(RTRIM(<column>))) = 0;
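
If both counts return zero and the application never writes empty values, the column can be tightened in the new schema. Note that ALTER COLUMN requires repeating the column’s datatype:

-- Make the column mandatory once no NULL or empty values remain
ALTER TABLE <table> ALTER COLUMN <column> <datatype> NOT NULL;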

You also want to check the column datatypes in the script before importing it into the new working database. If a column contains a number of predefined values, I often run a “SELECT DISTINCT” statement on the column to check whether I can choose a better datatype, or simply introduce a “code table” and a foreign key in the current table. I also run “SELECT MAX” and “SELECT MIN” statements to check the boundaries of the values, so I can make an assumption about what datatype to use for the column.

SELECT DISTINCT <column> FROM <table>;
SELECT MAX(<column>) FROM <table>;
SELECT MIN(<column>) FROM <table>;
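
As a sketch of the “code table” approach, assume a hypothetical Order table with a Status column that only ever contains a handful of values (the names are mine, not from the project):

-- New code table holding the predefined values
CREATE TABLE OrderStatus
(
    Code tinyint NOT NULL PRIMARY KEY,
    Description varchar(50) NOT NULL
);

-- Reference it from the existing table (assumes Status has been converted to tinyint)
ALTER TABLE [Order] ADD CONSTRAINT FK_Order_OrderStatus
    FOREIGN KEY (Status) REFERENCES OrderStatus (Code);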

See the “Size matters” section of this post for more datatype related issues.

PK/FK Constraints

Make sure the constraints are in place for every table, with corresponding indexes. If these are not in place, every query will probably use a full table scan. If you have large tables, this will kill your application.
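
Adding the missing constraints is plain DDL. The names below are placeholders in the same style as the other snippets in this post:

-- Primary key (creates a clustered index by default)
ALTER TABLE <table> ADD CONSTRAINT PK_<table> PRIMARY KEY (<pk_column>);

-- Foreign key to the parent table
ALTER TABLE <child_table> ADD CONSTRAINT FK_<child_table>_<parent_table>
    FOREIGN KEY (<fk_column>) REFERENCES <parent_table> (<pk_column>);

-- Index the FK column separately so joins on it don't scan the child table
CREATE INDEX IX_<child_table>_<fk_column> ON <child_table> (<fk_column>);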

The process of finding primary and foreign keys in the database can be time consuming if they don’t follow a naming standard. The following script lists all constraints in a SQL Server database. It works on SQL Server 2008 R2; I’m not sure about backward compatibility.

SELECT
	OBJECT_NAME(OBJECT_ID) AS NameofConstraint,
	SCHEMA_NAME(schema_id) AS SchemaName,
	OBJECT_NAME(parent_object_id) AS TableName,
	type_desc AS ConstraintType
FROM sys.objects
WHERE type_desc LIKE '%CONSTRAINT'
ORDER BY OBJECT_NAME(parent_object_id);

The result of the above SQL might help you spot a pattern in the constraint naming standard, and it gives you a list of the existing constraints.

Another approach is to go table by table and column by column. Use the following script to list tables and columns. The column_id column can often be used to spot the PK column as well, since the primary key is usually the first column added to a table.

SELECT
    a.object_id, a.name AS TableName, b.column_id,
    b.name AS ColumnName, c.name AS Datatype,
    b.max_length, b.is_nullable
FROM sys.all_objects a
	INNER JOIN sys.all_columns b ON b.object_id = a.object_id
	INNER JOIN sys.types c ON c.user_type_id = b.user_type_id -- avoids duplicate rows that sys.systypes gives for aliased types
WHERE a.type_desc = 'USER_TABLE'
  AND a.name LIKE '%'
  AND b.name LIKE '%'
  AND c.name LIKE '%'
ORDER BY a.name, b.column_id;
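
The LIKE '%' predicates are only there as placeholders for filtering. Replacing b.name LIKE '%' with, for instance, b.name LIKE '%Id' narrows the output to columns that look like key candidates.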

Size matters

Size matters both for the number of rows in tables and for the datatypes of their columns. The following script gives you the number of rows in the user tables of the current database on SQL Server. Use the result to determine which tables to start with: begin with the tables with the most rows and optimize those first.

SELECT
   substring(obj.name, 1, 50) AS Table_Name,
   ind.rows AS Number_of_Rows
FROM sysobjects AS obj
	INNER JOIN sysindexes AS ind ON obj.id = ind.id
WHERE obj.xtype = 'u' AND ind.indid < 2
  AND obj.name LIKE '%'
ORDER BY obj.name
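
sysobjects and sysindexes are old compatibility views. On SQL Server 2005 and later the same row counts can be read from the newer catalog views; this is an alternative I would consider, not a required part of the recipe:

SELECT
   t.name AS Table_Name,
   SUM(p.rows) AS Number_of_Rows
FROM sys.tables AS t
	INNER JOIN sys.partitions AS p ON p.object_id = t.object_id
WHERE p.index_id IN (0, 1)   -- heap or clustered index, so each row is counted once
GROUP BY t.name
ORDER BY t.name;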

Make sure the datatype and length fit the application’s needs. For instance, if you have a column Type that contains integer values between 1 and 10, the datatype can be set to tinyint (SQL Server). This column will then use 1 byte per row, as opposed to int (4 bytes) or bigint (8 bytes). These considerations must be made for every datatype used throughout the database. See the table below for more information on some important datatypes in SQL Server.

  • bigint: -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807). Storage: 8 bytes.
  • int: -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647). Storage: 4 bytes.
  • smallint: -2^15 (-32,768) to 2^15-1 (32,767). Storage: 2 bytes.
  • tinyint: 0 to 255. Storage: 1 byte.
  • bit: An integer data type that can take a value of 1, 0, or NULL. The SQL Server Database Engine optimizes storage of bit columns: if there are 8 or fewer bit columns in a table, the columns are stored as 1 byte; if there are from 9 up to 16 bit columns, they are stored as 2 bytes, and so on.
  • char [ ( n ) ]: Fixed-length, non-Unicode character data with a length of n bytes. n must be a value from 1 through 8,000.
  • varchar [ ( n | max ) ]: Variable-length, non-Unicode character data. n can be a value from 1 through 8,000; max indicates that the maximum storage size is 2^31-1 bytes.
  • nchar [ ( n ) ]: Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. Storage: two times n bytes.
  • nvarchar [ ( n | max ) ]: Variable-length Unicode character data. n can be a value from 1 through 4,000; max indicates that the maximum storage size is 2^31-1 bytes. Storage: two times the number of characters entered + 2 bytes.
  • binary [ ( n ) ]: Fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000.
  • varbinary [ ( n | max ) ]: Variable-length binary data. n can be a value from 1 through 8,000; max indicates that the maximum storage size is 2^31-1 bytes. Storage: the actual length of the data entered + 2 bytes.
  • date: Default format YYYY-MM-DD. Range ‘0001-01-01’ through ‘9999-12-31’. Storage: 3 bytes.
  • datetime: Date range ‘January 1, 1753’ through ‘December 31, 9999’. Storage: 8 bytes.
  • smalldatetime: Date range ‘1900-01-01’ through ‘2079-06-06’. Storage: 4 bytes.
  • decimal [ ( p [ , s ] ) ]: Fixed precision and scale numbers. When maximum precision is used, valid values are from -10^38 + 1 through 10^38 - 1. Storage depends on precision: 1-9 = 5 bytes, 10-19 = 9 bytes, 20-28 = 13 bytes, 29-38 = 17 bytes.
  • float [ ( n ) ]: n is the number of bits used to store the mantissa of the float number in scientific notation and therefore dictates the precision and storage size. If n is specified, it must be a value between 1 and 53; the default is 53. Storage: n of 1-24 = 4 bytes, n of 25-53 = 8 bytes.
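
As a concrete sketch of such a narrowing, once the MAX/MIN checks from the “Script cleanup” section confirm that the values fit the smaller type:

-- Shrink an int column whose values never fall outside 0-255
-- (keep the column's existing NULL / NOT NULL setting when altering)
ALTER TABLE <table> ALTER COLUMN <column> tinyint NOT NULL;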

Primary key recommendations

I’m doing a project for a customer at the moment where database cleanup is a large part of the work. There are many issues in the project database, and primary key construction is just one of them.

Definition:

A primary key is used to uniquely identify each row in a table. A primary key can consist of one or more fields on a table. When multiple fields are used as a primary key, they are called a composite key.

My guess is that 80-90% of all SQL tables world-wide use a single column primary key with some kind of key generation. The primary key should in most cases get an auto-incremented value when a row is created. This means that we don’t have to supply the value for the primary key column when inserting rows. In SQL Server the auto-increment function is provided by the IDENTITY(n,m) keyword in the CREATE TABLE statement.

CREATE TABLE Customer
(
    Id integer IDENTITY(1,1) PRIMARY KEY,
    Name varchar(50) NOT NULL
);

The datatype of the Id column is set to integer. Unless you have more than 2 billion customers, the integer datatype is sufficient. But if your table will most likely grow past 2 billion rows, the bigint datatype should be considered for the PK column. The DDL for this is shown below.

CREATE TABLE Person
(
     Id bigint IDENTITY(1,1) PRIMARY KEY,
     FirstName varchar(50) NOT NULL,
     LastName varchar(50) NOT NULL
);

The uniqueidentifier datatype can also be used for a PK column. This datatype is commonly used when synchronization between databases, servers and clusters is important in daily DBA operations. The column can be populated automatically, just like with IDENTITY(n,m): to populate a uniqueidentifier column on INSERT, set the default value to newid().

CREATE TABLE CurrencyRate
(
      Id uniqueidentifier NOT NULL PRIMARY KEY
                    DEFAULT newid(),
      RateDate datetime NOT NULL,
      FromCode varchar(10) NOT NULL,
      ToCode varchar(10) NOT NULL,
      RateValue [float] NOT NULL
);

For tables with large amounts of data, the uniqueidentifier datatype should be reconsidered in favour of integer or bigint for the PK column. The reason is that an integer key is much smaller (4 or 8 bytes versus 16) and, with IDENTITY, increases sequentially, which keeps the indexes smaller and gives better performance.
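
If the uniqueidentifier key has to stay, for example because of replication, one mitigation worth considering (my addition, not part of the original design) is NEWSEQUENTIALID() instead of newid() as the default, since sequential values cause far fewer page splits in the clustered index:

CREATE TABLE CurrencyRate
(
      Id uniqueidentifier NOT NULL PRIMARY KEY
                    DEFAULT NEWSEQUENTIALID(),  -- only allowed as a column default
      RateDate datetime NOT NULL
      -- remaining columns as in the example above
);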

Hope this helps someone.