After my previous post on “Primary key recommendations”, I was asked to write a general guide to the kind of database cleanup project I’m doing at the moment. There is no definitive answer for how to do this; you have to use all your database skills and just go through everything.
The tools for the job depend on what you need to do. So far in my project, I have used “Microsoft SQL Server Management Studio” (download the Express edition here) and the “UltraEdit” editor, which can edit very large files in addition to having the nice features most programmers like. Another tool that is good for this kind of work is Red Gate’s SQL Toolbelt. In addition, I use Visio’s reverse-engineering feature to generate a database diagram and verify that the data model is correct.
My recipe for the cleanup project:
- Script the database schema in SSMS
- Schema only
- No indexes, since these will be recreated later
- Alter the script with the most obvious and simple changes
- Import the script into a new working database
- Change and add constraints
- Verify datamodel in Visio
- Load the original or cleaned data into the working database with the new schema
- Go through the code to verify indexes manually, or use “SQL Profiler” with the “Database Engine Tuning Advisor” to get a quick start on indexes and statistics for the new database
- Deploy the database to a test environment and run tests against the application itself if you don’t have access to testing facilities at the database level
The current database uses GUIDs for PK columns, stored in char(38) columns. This is a major performance issue. Therefore I changed all PK columns from char(38) to the uniqueidentifier datatype in every table.
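A minimal sketch of the conversion for one table (table, column, and constraint names here are made up for illustration). Since the column is part of a primary key, the key has to be dropped and recreated around the datatype change:

```sql
-- Hypothetical table dbo.Customer with PK column CustomerId stored as char(38).
-- char(38) values on the form '{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}'
-- convert implicitly to uniqueidentifier.
ALTER TABLE dbo.Customer DROP CONSTRAINT PK_Customer;

ALTER TABLE dbo.Customer
    ALTER COLUMN CustomerId uniqueidentifier NOT NULL;

ALTER TABLE dbo.Customer
    ADD CONSTRAINT PK_Customer PRIMARY KEY (CustomerId);
```

Any foreign keys referencing the column must be dropped and recreated the same way, with their columns converted to the same datatype.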
Determine the columns that can be set to “NOT NULL” in each table. This can be done with the following statements for the tables and columns you need to investigate:
SELECT COUNT(*) FROM <table> WHERE <column> IS NULL ;
SELECT COUNT(*) FROM <table> WHERE LEN(LTRIM(RTRIM(<column>))) = 0;
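If both counts come back as zero, the column can be tightened. A sketch, assuming a hypothetical Customer table with an Email column:

```sql
-- Hypothetical example: both checks above returned 0 for Customer.Email,
-- so the column can safely be declared NOT NULL.
ALTER TABLE dbo.Customer
    ALTER COLUMN Email varchar(100) NOT NULL;
```

Note that ALTER COLUMN repeats the full column definition, so keep the datatype and length unchanged unless you intend to change them at the same time.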
You also want to check the column datatypes in the script before importing it into the new working database. If a column contains a number of predefined values, I often run a “SELECT DISTINCT” statement on the column to check whether I can choose a better datatype, or just introduce a “code table” and a foreign key in the current table. I also run “SELECT MAX” and “SELECT MIN” statements to check the boundaries of the column, to be able to make an assumption about what datatype to use.
SELECT DISTINCT <column> FROM <table>;
SELECT MAX(<column>) FROM <table>;
SELECT MIN(<column>) FROM <table>;
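For instance, if a DISTINCT over a hypothetical Status column returns only the values 1 through 10, the column can be narrowed, or the values can be moved into a code table (all names below are made up for illustration):

```sql
-- Hypothetical: Status only ever holds 1-10, so tinyint is sufficient.
ALTER TABLE dbo.Orders ALTER COLUMN Status tinyint NOT NULL;

-- Alternatively, introduce a code table and a foreign key.
CREATE TABLE dbo.OrderStatus (
    StatusCode  tinyint     NOT NULL PRIMARY KEY,
    Description varchar(50) NOT NULL
);

ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_OrderStatus
    FOREIGN KEY (Status) REFERENCES dbo.OrderStatus (StatusCode);
```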
See the “Size matters” section of this post for more datatype-related issues.
Make sure the constraints are in place for every table, with corresponding indexes. If they are not, almost every query will end up doing a full table scan. If you have large tables, this will kill your application.
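A sketch with hypothetical names: SQL Server creates an index automatically behind a primary key constraint, but not behind a foreign key, so the supporting index has to be added explicitly:

```sql
-- Hypothetical tables dbo.Orders and dbo.Customer.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customer
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId);

-- SQL Server does not index FK columns automatically; create one yourself
-- so joins and cascading operations avoid full table scans.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);
```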
The process of finding primary and foreign keys in the database can be time-consuming if they don’t follow some kind of naming standard. The following script lists all constraints in a SQL Server database. It works on SQL Server 2008 R2; I’m not sure about backward compatibility.
SELECT OBJECT_NAME(object_id) AS NameOfConstraint,
       SCHEMA_NAME(schema_id) AS SchemaName,
       OBJECT_NAME(parent_object_id) AS TableName,
       type_desc AS ConstraintType
FROM sys.objects
WHERE type_desc LIKE '%CONSTRAINT'
ORDER BY OBJECT_NAME(parent_object_id);
The result of the SQL above might help you spot a pattern in the constraint naming standard, and gives you a list of the existing ones.
Another approach is to go table by table and column by column. Use the following script to list tables and columns. The column column_id can often be used to find the PK column as well, since the primary key is often the first column added to a table.
SELECT a.object_id, a.name AS TableName, b.column_id, b.name AS ColumnName,
       c.name AS Datatype, b.max_length, b.is_nullable
FROM sys.all_objects a
INNER JOIN sys.all_columns b ON b.object_id = a.object_id
INNER JOIN sys.systypes c ON c.xtype = b.system_type_id
WHERE a.type_desc = 'USER_TABLE'
  AND a.name LIKE '%' AND b.name LIKE '%' AND c.name LIKE '%'
ORDER BY a.name, b.column_id;
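Rather than guessing from column_id, the actual primary key columns can be read directly from the catalog views. The following should work on SQL Server 2005 and later:

```sql
-- List every PK and its columns, in key order.
SELECT t.name AS TableName, c.name AS ColumnName, ic.key_ordinal
FROM sys.key_constraints kc
INNER JOIN sys.tables t ON t.object_id = kc.parent_object_id
INNER JOIN sys.index_columns ic ON ic.object_id = kc.parent_object_id
    AND ic.index_id = kc.unique_index_id
INNER JOIN sys.columns c ON c.object_id = ic.object_id
    AND c.column_id = ic.column_id
WHERE kc.type = 'PK'
ORDER BY t.name, ic.key_ordinal;
```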
Size matters, both in the number of rows in tables and in the datatypes of their columns. The following script gives you the number of rows in each user table in the current database on SQL Server. Use the result to determine which tables to begin with: start with the tables with the most rows and optimize these first.
SELECT SUBSTRING(obj.name, 1, 50) AS Table_Name, ind.rows AS Number_of_Rows
FROM sysobjects AS obj
INNER JOIN sysindexes AS ind ON obj.id = ind.id
WHERE obj.xtype = 'u' AND ind.indid < 2 AND obj.name LIKE '%'
ORDER BY obj.name;
Make sure the datatype and length fit the application’s needs. For instance, if you have a column Type that contains integer values between 1 and 10, the datatype can be set to tinyint (SQL Server). This column will use 1 byte per row, as opposed to int (4 bytes) or bigint (8 bytes). These considerations must be made for every datatype used throughout the database. See the table below for more information on some important datatypes in SQL Server.
|bigint||-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)||8 Bytes|
|int||-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)||4 Bytes|
|smallint||-2^15 (-32,768) to 2^15-1 (32,767)||2 Bytes|
|tinyint||0 to 255||1 Byte|
|bit||An integer data type that can take a value of 1, 0, or NULL. The SQL Server Database Engine optimizes storage of bit columns: if there are 8 or fewer bit columns in a table, the columns are stored as 1 byte; from 9 up to 16 bit columns, as 2 bytes, and so on.||1 byte per 8 bit columns|
|char [ ( n ) ]||Fixed-length, non-Unicode character data with a length of n bytes. n must be a value from 1 through 8,000.||n bytes|
|varchar [ ( n | max ) ]||Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.||Actual length + 2 bytes|
|nchar [ ( n ) ]||Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000.||2 × n bytes|
|nvarchar [ ( n | max ) ]||Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes.||2 × number of characters + 2 bytes|
|binary [ ( n ) ]||Fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000.||n bytes|
|varbinary [ ( n | max ) ]||Variable-length binary data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.||Actual length + 2 bytes|
|date||Default format: YYYY-MM-DD. Range ‘0001-01-01’ through ‘9999-12-31’||3 bytes|
|datetime||Date range ‘January 1, 1753’ through ‘December 31, 9999’||8 bytes|
|smalldatetime||Date range ‘1900-01-01’ through ‘2079-06-06’||4 bytes|
|decimal [ ( p [ , s ] ) ]||Fixed precision and scale numbers. When maximum precision is used, valid values are from -10^38 + 1 through 10^38 - 1.||Precision 1-9: 5 bytes; 10-19: 9 bytes; 20-28: 13 bytes; 29-38: 17 bytes|
|float [ (n) ]||Where n is the number of bits that are used to store the mantissa of the float number in scientific notation and, therefore, dictates the precision and storage size. If n is specified, it must be a value between 1 and 53. The default value of n is 53.
1-24 = 4 bytes