Database cleanup guide


After my previous post on “Primary key recommendations”, I was asked to write a general guide on how to approach a cleanup project like the one I’m doing at the moment. There is no definitive answer for how to do this; you have to use all your database skills and simply go through everything.

The tools for doing the work depend on what you need to do. So far in my project, I have used “Microsoft SQL Server Management Studio” (download the Express edition here) and the UltraEdit editor, which can edit large files and has the nice features most programmers like. Another tool that is good for this kind of work is Red Gate’s SQL Toolbelt. In addition, I use the “Visio” database diagram reverse engineering feature to verify that the data model is correct.

My recipe for the cleanup project:

  • Script the database schema in SSMS
    • Schema only
    • No indexes, since these will be recreated later
  • Alter the script with the most obvious and simple changes
  • Import the script into a new working database
  • Change and add constraints
  • Verify the data model in Visio
  • Add the original or cleaned data to the working database with the new schema
  • Go through the code to verify indexes manually, or use “SQL Profiler” with the “Database Engine Tuning Advisor” to get a quick start on what to do with indexes and statistics in the new database.
  • Deploy the database to the test environment and run tests against the application itself if you don’t have access to testing facilities on the database level.

Script cleanup

The current database uses GUIDs for PK columns, stored in char(38) columns. This is a major performance issue. Therefore I changed all PK columns from char(38) to the uniqueidentifier datatype in every table.
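As a sketch, the conversion for one table looks roughly like this. The table, column, and constraint names here are hypothetical; the primary key constraint (and any foreign keys referencing it) must be dropped before the column can be altered:

```sql
-- Hypothetical names (Orders, OrderId, PK_Orders); adapt to your schema.
-- Drop the old primary key so the column can be altered.
ALTER TABLE Orders DROP CONSTRAINT PK_Orders;

-- char(38) GUID strings convert directly to uniqueidentifier.
ALTER TABLE Orders ALTER COLUMN OrderId uniqueidentifier NOT NULL;

-- Recreate the primary key on the new datatype.
ALTER TABLE Orders ADD CONSTRAINT PK_Orders PRIMARY KEY (OrderId);
```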

Determine the columns that can be set to “NOT NULL” in each table. This can be done with the following statements for the tables and columns you need to investigate:

SELECT COUNT(*) FROM <table> WHERE <column> IS NULL ;
SELECT COUNT(*) FROM <table>
WHERE LEN(LTRIM(RTRIM(<column>))) = 0;
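If both counts come back as zero, the column can be tightened. A sketch with hypothetical names:

```sql
-- Hypothetical table and column (Customer, Email): repeat the column's
-- existing datatype exactly, adding NOT NULL.
ALTER TABLE Customer ALTER COLUMN Email varchar(100) NOT NULL;
```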

You also want to check the column datatypes in the script before importing it into the new working database. If a column contains a number of predefined values, I often run a “SELECT DISTINCT” statement for that column to check whether I can choose a better datatype, or just introduce a “code table” and a foreign key in the current table. I also run “SELECT MAX” and “SELECT MIN” statements to check the boundaries of the column, to be able to make an assumption about what datatype to use.

SELECT DISTINCT <column> FROM <table>;
SELECT MAX(<column>) FROM <table>;
SELECT MIN(<column>) FROM <table>;
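If the DISTINCT list turns out to be a small, fixed set, the code-table variant could look like this (all names hypothetical):

```sql
-- A small lookup ("code") table for the predefined values.
CREATE TABLE OrderStatus (
    StatusCode tinyint NOT NULL PRIMARY KEY,
    Description varchar(50) NOT NULL
);

-- Constrain the existing column to the lookup table.
ALTER TABLE Orders ADD CONSTRAINT FK_Orders_OrderStatus
    FOREIGN KEY (StatusCode) REFERENCES OrderStatus (StatusCode);
```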

See the “Size matters” section of this post for more datatype-related issues.

PK/FK Constraints

Make sure the constraints are in place for every table, with corresponding indexes. If they are not, nearly every query will use a full table scan. If you have large tables, this will kill your application.
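Adding a missing foreign key with a supporting index could look like this, with hypothetical names throughout:

```sql
-- Hypothetical names (OrderLine, Orders, OrderId).
ALTER TABLE OrderLine ADD CONSTRAINT FK_OrderLine_Orders
    FOREIGN KEY (OrderId) REFERENCES Orders (OrderId);

-- SQL Server creates an index for the PK, but not for FK columns,
-- so add one to support joins and FK checks.
CREATE INDEX IX_OrderLine_OrderId ON OrderLine (OrderId);
```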

The process of finding primary and foreign keys in the database can be time consuming if they don’t follow some kind of naming standard. The following script lists all constraints in a SQL Server database. It works on SQL Server 2008 R2; I’m not sure about backward compatibility.

SELECT
	OBJECT_NAME(OBJECT_ID) AS NameofConstraint,
	SCHEMA_NAME(schema_id) AS SchemaName,
	OBJECT_NAME(parent_object_id) AS TableName,
	type_desc AS ConstraintType
FROM sys.objects
WHERE type_desc LIKE '%CONSTRAINT'
ORDER BY OBJECT_NAME(parent_object_id);

The result of the above SQL might help you to see a pattern in the constraint naming standard and gives you a list of the existing constraints.

Another approach is to go table by table and column by column. Use the following script to list tables and columns. The column_id column can often be used to spot the PK column as well, since the primary key is often the first column added to a table.

SELECT
    a.object_id, a.name as TableName, b.column_id,
    b.name as ColumnName, c.name as Datatype,
    b.max_length, b.is_nullable
FROM sys.all_objects a
	INNER JOIN sys.all_columns b ON b.object_id = a.object_id
	INNER JOIN sys.systypes c ON c.xtype = b.system_type_id
WHERE a.type_desc = 'USER_TABLE'
  AND a.name LIKE '%'   -- narrow down by table name here
  AND b.name LIKE '%'   -- narrow down by column name here
  AND c.name LIKE '%'   -- narrow down by datatype name here
ORDER BY a.name, b.column_id;

Size matters

Size matters both in the number of rows in tables and in the datatypes of their columns. The following script gives you the number of rows in the user tables of the current database on SQL Server. Use the result to determine which tables to start with: begin with the tables with the most rows and optimize those first.

SELECT
   substring(obj.name, 1, 50) AS Table_Name,
   ind.rows AS Number_of_Rows
FROM sysobjects AS obj
	INNER JOIN sysindexes AS ind ON obj.id = ind.id
WHERE obj.xtype = 'u' AND ind.indid < 2
  AND obj.name LIKE '%'
ORDER BY obj.name

Make sure the datatype and length fit the application’s needs. For instance, if you have a column Type that contains integer values between 1 and 10, the datatype can be set to tinyint (in SQL Server). This column will then use 1 byte per row, as opposed to int (4 bytes) or bigint (8 bytes). These considerations must be made for every datatype used throughout the database. See the table below for more information on some important datatypes in SQL Server.

Data type: Range. Storage.

bigint: -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807). Storage: 8 bytes.

int: -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647). Storage: 4 bytes.

smallint: -2^15 (-32,768) to 2^15-1 (32,767). Storage: 2 bytes.

tinyint: 0 to 255. Storage: 1 byte.

bit: An integer datatype that can take a value of 1, 0, or NULL. The SQL Server Database Engine optimizes storage of bit columns: if there are 8 or fewer bit columns in a table, the columns are stored as 1 byte; if there are from 9 up to 16 bit columns, the columns are stored as 2 bytes, and so on.

char [ ( n ) ]: Fixed-length, non-Unicode character data with a length of n bytes. n must be a value from 1 through 8,000.

varchar [ ( n | max ) ]: Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.

nchar [ ( n ) ]: Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes.

nvarchar [ ( n | max ) ]: Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes.

binary [ ( n ) ]: Fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000.

varbinary [ ( n | max ) ]: Variable-length binary data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size is the actual length of the data entered + 2 bytes.

date: Default format YYYY-MM-DD. Range ‘0001-01-01’ through ‘9999-12-31’. Storage: 3 bytes.

datetime: Date range ‘January 1, 1753’ through ‘December 31, 9999’. Storage: 8 bytes.

smalldatetime: Date range ‘1900-01-01’ through ‘2079-06-06’. Storage: 4 bytes.

decimal [ ( p [ , s ] ) ]: Fixed precision and scale numbers. When maximum precision is used, valid values are from -10^38 + 1 through 10^38 - 1. Storage by precision p: 1-9 = 5 bytes, 10-19 = 9 bytes, 20-28 = 13 bytes, 29-38 = 17 bytes.

float [ ( n ) ]: n is the number of bits used to store the mantissa of the float number in scientific notation, and therefore dictates the precision and storage size. If specified, n must be a value between 1 and 53; the default is 53. Storage: n = 1-24 = 4 bytes, n = 25-53 = 8 bytes.
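Assuming the MIN/MAX checks above have confirmed that all values fit, shrinking the hypothetical Type column is a single statement:

```sql
-- Only safe after verifying all existing values are within tinyint's
-- 0-255 range; Type and Orders are hypothetical names.
ALTER TABLE Orders ALTER COLUMN Type tinyint NOT NULL;
```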
