Sunday, 7 April 2013

SQL Server Performance Tips

The following Tips and Guidelines are very important from the performance perspective of your SQL Server development or production database. The best way to achieve optimum performance benefits is to experiment with the following guidelines while using/viewing a query execution plan, both with and without implementing the respective tip/guideline.

Tips and Guidelines

• As a common practice, every table should have a clustered index. Generally, but not always, the clustered index should be on a column that monotonically increases, such as an identity column or some other column where the value is unique. In many cases, the primary key is the ideal column for a clustered index.
• Indexes should be measured on all columns that are frequently used in WHEREORDER BYGROUP BYTOP andDISTINCT clauses.
• Do not automatically add indexes on a table because it seems like the right thing to do. Only add indexes if you know that they will be used by the queries run against the table.
• For historical (static) tables, create the indexes with a FILLFACTOR and a PAD_INDEX of 100 to ensure there is no wasted space. This reduces disk I/O, helping to boost overall performance.
• Queries that return a single row are just as fast using a non-clustered index as a clustered index.
• Queries that return a range of rows are just as fast using a clustered index as a non-clustered index.
• Do not add more indexes on your OLTP tables to minimize the overhead that occurs with indexes during data modifications.
• Do not add the same index more than once on a table with different names.
• Drop all those indexes that are not used by the Query Optimizer, generally. You probably won't want to add an index to a table under the following conditions:
  • If the index is not used by the query optimizer. Use the Query Analyzer's "Show Execution Plan" option to see if your queries against a particular table use an index or not.
  • If the table is small, most likely indexes will not be used.
  • If the column(s) to be indexed are very wide.
  • If the column(s) are defined as TEXTNTEXT or IMAGE data types.
  • If the table is rarely queried but insertion, updating is frequent.
• To provide up-to-date statistics, the query optimizer needs to make smart query optimization decisions. You will generally want to leave the "Auto Update Statistics" database option on. This helps to ensure that the optimizer statistics are valid, ensuring that queries are properly optimized when they are run.
• Keep the "width" of your indexes as narrow as possible. This reduces the size of the index and reduces the number of disk I/O reads required to read the index.
• If possible, try to create indexes on columns that have integer values instead of characters. Integer values use less overhead than character values.
• If you have two or more tables that are frequently joined together, then the columns used for the joins should have an appropriate index. If the columns used for the joins are not naturally compact, then consider adding surrogate keys to the tables that are compact in order to reduce the size of the keys. This will decrease I/O during the join process, which increases overall performance.
• When creating indexes, try to make them unique indexes if at all possible. SQL Server can often search through a unique index faster than a non-unique index. This is because, in a unique index, each row is unique and once the needed record is found, SQL Server doesn't have to look any further.
• If a particular query against a table is run infrequently and the addition of an index greatly speeds the performance of the query, but the performance of INSERTSUPDATES and DELETES is negatively affected by the addition of the index, consider creating the index for the table for the duration of when the query is run and then dropping the index. An example of this is when monthly reports are run at the end of the month on an OLTP application.
• Avoid using FLOAT or REAL data types as primary keys, as they add unnecessary overhead that can hurt performance.
• If you want to boost the performance of a query that includes an AND operator in the WHERE clause, consider the following:
  • Of the search criteria in the WHERE clause, at least one of them should be based on a highly selective column that has an index.
  • If at least one of the search criteria in the WHERE clause is not highly selective, consider adding indexes to all of the columns referenced in the WHERE clause.
  • If none of the columns in the WHERE clause are selective enough to use an index on their own, consider creating a covering index for this query.
• The Query Optimizer will always perform a table scan or a clustered index scan on a table if the WHERE clause in the query contains an OR operator and if any of the referenced columns in the OR clause are not indexed (or do not have a useful index). Because of this, if you use many queries with OR clauses, you will want to ensure that each referenced column in the WHERE clause has an index.
• If you have a query that uses ORs and it is not making the best use of indexes, consider rewriting it as a UNION and then testing performance. Only through testing can you be sure that one version of your query will be faster than another.
• If you use the SOUNDEX function against a table column in a WHERE clause, the Query Optimizer will ignore any available indexes and perform a table scan.
• Queries that include either the DISTINCT or the GROUP BY clauses can be optimized by including appropriate indexes. Any of the following indexing strategies can be used:
  • Include a covering, non-clustered index (covering the appropriate columns) of the DISTINCT or the GROUP BYclauses.
  • Include a clustered index on the columns in the GROUP BY clause.
  • Include a clustered index on the columns found in the SELECT clause.
  • Adding appropriate indexes to queries that include DISTINCT or GROUP BY is most important for those queries that run often.
• Avoid clustered indexes on columns that are already "covered" by non-clustered indexes. A clustered index on a column that is already "covered" is redundant. Use the clustered index for columns that can better make use of it.
• Ideally a clustered index should be based on a single column (not multiple columns) that are as narrow as possible. This not only reduces the clustered index's physical size, it also reduces the physical size of non-clustered indexes and boosts SQL Server's overall performance.
• When you create a clustered index, try to create it as a unique clustered index, not a non-unique clustered index.
• SET NOCOUNT ON at the beginning of each stored procedure you write. This statement should be included in every stored procedure, trigger, etc. that you write.
• Keep Transact-SQL transactions as short as possible within a stored procedure. This helps to reduce the number of locks, helping to speed up the overall performance of your SQL Server application.
• If you are creating a stored procedure to run in a database other than the Master database, don't use the prefix sp_in its name. This special prefix is reserved for system stored procedures. Although using this prefix will not prevent a user defined stored procedure from working, what it can do is to slow down its execution ever so slightly.
• Before you are done with your stored procedure code, review it for any unused code, parameters or variables that you may have forgotten to remove while you were making changes and remove them. Unused code just adds unnecessary bloat to your stored procedures, although it will not necessarily negatively affect performance of the stored procedure.
• For best performance, all objects that are called within the same stored procedure should be owned by the same object owner or schema, preferably dbo, and should also be referred to in the format ofobject_owner.object_name or schema_owner.object_name.
• When you need to execute a string of Transact-SQL, you should use the sp_executesql stored procedure instead of the EXECUTE statement.
• If you use input parameters in your stored procedures, you should validate all of them at the beginning of your stored procedure. This way, if there is a validation problem and the client application needs to be notified of the problem, it happens before any stored procedure processing takes place, preventing wasted effort and boosting performance.
• When calling a stored procedure from your application, it is important that you call it using its qualified name, for example:
exec dbo.myProc
...instead of:
exec myProc
• If you think a stored procedure will return only a single value and not a record set, consider returning the single value as an output parameter.
• Use stored procedures instead of views. They offer better performance.
• Don't include code, variable or parameters that don't do anything.
• Don't be afraid to make broad-minded use of in-line and block comments in your Transact-SQL code. They will not affect the performance of your application and they will enhance your productivity when you have to come back to the code and try to modify it.
• If possible, avoid using SQL Server cursors. They generally use a lot of SQL Server resources and reduce the performance and scalability of your applications.
• If you have the choice of using a join or a sub-query to perform the same task within a query, generally the join is faster. This is not always the case, however, and you may want to test the query using both methods to determine which is faster for your particular application.
• If your application requires you to create temporary tables for use on a global or per connection use, consider the possibility of creating indexes for these temporary tables. While most temporary tables probably won't need -- or even use -- an index, some larger temporary tables can benefit from them. A properly designed index on a temporary table can be as great a benefit as a properly designed index on a standard database table.
• Instead of using temporary tables, consider using a derived table instead. A derived table is the result of using aSELECT statement in the FROM clause of an existing SELECT statement. By using derived tables instead of temporary tables, you can reduce I/O and often boost your application's performance.
• For better performance, if you need a temporary table in your Transact-SQL code, consider using a table variable instead of creating a conventional temporary table.
• Don't repeatedly reuse the same function to calculate the same result over and over within your Transact-SQL code.
• If you use BULK INSERT to import data into SQL Server, then use the TABLOCK hint along with it. This will prevent SQL Server from running out of locks during very large imports and will also boost performance due to the reduction of lock contention.
• Always specify the narrowest columns you can. The narrower the column, the less amount of data SQL Server has to store and the faster SQL Server is able to read and write data. In addition, if any sorts need to be performed on the column, the narrower the column, the faster the sort will be.
• If you need to store large strings of data and they are less than 8000 characters, use a VARCHAR data type instead of a TEXT data type. TEXT data types have extra overhead that drag down performance.
• Don't use the NVARCHAR or NCHAR data types unless you need to store 16-bit character (Unicode) data. They take up twice as much space as VARCHAR or CHAR data types, increasing server I/O and wasting unnecessary space in your buffer cache.
• If the text data in a column varies greatly in length, use a VARCHAR data type instead of a CHAR data type. The amount of space saved by using VARCHAR over CHAR on variable length columns can greatly reduce the I/O reads that the cache memory uses to hold data, improving overall SQL Server performance.
• If a column's data does not vary widely in length, consider using a fixed-length CHAR field instead of a VARCHAR. While it may take up a little more space to store the data, processing fixed-length columns is faster in SQL Server than processing variable-length columns.
• If you have a column that is designed to hold only numbers, use a numeric data type such as INTEGER instead of aVARCHAR or CHAR data type. Numeric data types generally require less space to hold the same numeric value than does a character data type. This helps to reduce the size of the columns and can boost performance when the columns are searched (WHERE clause), joined to another column or sorted.
• If you use the CONVERT function to convert a value to a variable length data type such as VARCHAR, always specify the length of the variable data type. If you do not, SQL Server assumes a default length of 30. Ideally, you should specify the shortest length to accomplish the required task. This helps to reduce memory use and SQL Server resources.
• Avoid using the new BIGINT data type unless you really need its additional storage capacity. The BIGINT data type uses 8 bytes of memory, versus 4 bytes for the INT data type.
• Don't use the DATETIME data type as a primary key. From a performance perspective, it is more efficient to use a data type that uses less space. For example, the DATETIME data type uses 8 bytes of space, while the INT data type only takes up 4 bytes. The less space used, the smaller the table and index, and the less I/O overhead that is required to access the primary key.
• If you are creating a column that you know will be subject to many sorts, consider making the column integer-based and not character-based. This is because SQL Server can sort integer data much faster than character data.
• Carefully evaluate whether your SELECT query needs the DISTINCT clause or not. Some developers automatically add this clause to every one of their SELECT statements, even when it is not necessary. This is a bad habit that should be stopped.
• When you need to use SELECT INTO option, keep in mind that it can lock system tables, preventing other users from accessing the data they need while the data is being inserted. In order to prevent or minimize the problems caused by locked tables, try to schedule the use of SELECT INTO when your SQL Server is less busy. In addition, try to keep the amount of data inserted to a minimum. In some cases, it may be better to perform several smaller SELECT INTOs instead of performing one large SELECT INTO.
• If you need to verify the existence of a record in a table, don't use SELECT COUNT (*) in your Transact-SQL code to identify it. This is very inefficient and wastes server resources. Instead, use the Transact-SQL IF EXISTS to determine if the record in question exists, which is much more efficient.
• By default, some developers -- especially those who have not worked with SQL Server before -- routinely include code similar to this in their WHERE clauses when they make string comparisons:
SELECT column_name FROM table_name
WHERE LOWER (column_name) = 'name'
In other words, these developers are making the assumption that the data in SQL Server is case-sensitive, which it generally is not. If your SQL Server database is not configured to be case sensitive, you don't need to use LOWER orUPPER to force the case of text to be equal for a comparison to be performed. Just leave these functions out of your code. This will speed up the performance of your query, as any use of text functions in a WHERE clause hurts performance.
However, what if your database has been configured to be case-sensitive? Should you then use the LOWER and UPPERfunctions to ensure that comparisons are properly compared? No. The above example is still poor coding. If you have to deal with ensuring case is consistent for proper comparisons, use the technique described below, along with appropriate indexes on the column in question:
SELECT column_name FROM table_name 
WHERE column_name = 'NAME' or column_name = 'name'
This code will run much faster than the first example.
• If you currently have a query that uses NOT IN, which offers poor performance because the SQL Server optimizer has to use a nested table scan to perform this activity, instead try to use one of the following options, all of which offer better performance:
  • Use IN
  • Perform a LEFT OUTER JOIN and check for a NULL condition
• When you have a choice of using the IN or the EXISTS clause in your Transact-SQL, you will generally want to use the EXISTS clause, as it is usually more efficient and performs faster.
• If you find that SQL Server uses a TABLE SCAN instead of an INDEX SEEK when you use an IN/OR clause as part of your WHERE clause, even when those columns are covered by an index, consider using an index hint to force the Query Optimizer to use the index.
• If you use LIKE in your WHERE clause, try to use one or more leading characters in the clause, if possible. For example, use:
LIKE 'm%' instead of LIKE ‘%m’
• If your application needs to retrieve summary data often, but you don't want to have the overhead of calculating it on the fly every time it is needed, consider using a trigger that updates summary values after each transaction into a summary table.
• When you have a choice of using the IN or the BETWEEN clauses in your Transact-SQL, you will generally want to use the BETWEEN clause, as it is much more efficient. For example...
SELECT task_id, task_name 
FROM tasks 
WHERE task_id in (1000, 1001, 1002, 1003, 1004) much less efficient than this:
SELECT task_id, task_name
FROM tasks 
WHERE task_id BETWEEN 1000 and 1004
• If possible, try to avoid using the SUBSTRING function in your WHERE clauses. Depending on how it is constructed, using the SUBSTRING function can force a table scan instead of allowing the optimizer to use an index (assuming there is one). If the substring you are searching for does not include the first character of the column you are searching for, then a table scan is performed.
• If possible, you should avoid using the SUBSTRING function and use the LIKE condition instead for better performance. Instead of doing this:
WHERE SUBSTRING(task_name,1,1) = 'b'
Try using this instead:
WHERE task_name LIKE 'b%'
• Avoid using optimizer hints in your WHERE clauses. This is because it is generally very hard to out-guess the Query Optimizer. Optimizer hints are special keywords that you include with your query to force how the Query Optimizer runs. If you decide to include a hint in a query, this forces the Query Optimizer to become static, preventing the Query Optimizer from dynamically adapting to the current environment for the given query. More often than not, this hurts -- not helps -- performance.
• If you have a WHERE clause that includes expressions connected by two or more AND operators, SQL Server will evaluate them from left to right in the order they are written. This assumes that no parentheses have been used to change the order of execution. Because of this, you may want to consider one of the following when using AND:
  • Locate the least likely true AND expression first.
  • If both parts of an AND expression are equally likely of being false, put the least complex AND expression first.
  • You may want to consider using Query Analyzer or Management Studio to look at the execution plans of your queries to see which is best for your situation
• Don't use ORDER BY in your SELECT statements unless you really need to, as it adds a lot of extra overhead. For example, perhaps it may be more efficient to sort the data at the client than at the server.
• Whenever SQL Server has to perform a sorting operation, additional resources have to be used to perform this task. Sorting often occurs when any of the following Transact-SQL statements are executed:
• If you have to sort by a particular column often, consider making that column a clustered index. This is because the data is already presorted for you and SQL Server is smart enough not to resort the data.
• If your WHERE clause includes an IN operator along with a list of values to be tested in the query, order the list of values so that the most frequently found ones are placed at the start of the list and the less frequently found ones are placed at the end of the list. This can speed up performance because the IN option returns true as soon as any of the values in the list produce a match. The sooner the match is made, the faster the query completes.
• If your application performs many wildcard (LIKE %) text searches on CHAR or VARCHAR columns, consider using SQL Server's full-text search option. The Search Service can significantly speed up wildcard searches of text stored in a database.
• The GROUP BY clause can be used with or without an aggregate function. However, if you want optimum performance, don't use the GROUP BY clause without an aggregate function. This is because you can accomplish the same end result by using the DISTINCT option instead, and it is faster. For example, you could write your query two different ways:
SELECT task_id
FROM tasks
WHERE task_id BETWEEN 10 AND 20
FROM tasks
WHERE task_id BETWEEN 10 AND 20
• It is important to design applications that keep transactions as short as possible. This reduces locking and increases application concurrently, which helps to boost performance.
• In order to reduce network traffic between the client or middle-tier and SQL Server -- and also to boost your SQL Server-based application's performance -- only the data needed by the client or middle-tier should be returned by SQL Server. In other words, don't return more data (both rows and columns) from SQL Server than you need to the client or middle-tier and then further reduce the data to the data you really need at the client or middle-tier. This wastes SQL Server resources and network bandwidth.
• To make complex queries easier to analyze, consider breaking them down into their smaller constituent parts. One way to do this is to simply create lists of the key components of the query, such as:
  • List all of the columns that are to be returned
  • List all of the columns that are used in the WHERE clause
  • List all of the columns used in the JOINs (if applicable)
  • List all the tables used in JOINs (if applicable)
  • Once you have the above information organized into this easy-to-comprehend form, it is much easier to identify those columns that could potentially make use of indexes when executed.

important database designing rules

Before you start reading this article let me confirm to you I am not a guru in database designing. The below 11 points are what I have learnt via projects, my own experiences, and my own reading. I personally think it has helped me a lot when it comes to DB designing. Any criticism is welcome.
The reason I am writing a full blown article is, when developers design a database they tend to follow the three normal forms like a silver bullet. They tend to think normalization is the only way of designing. Due this mind set they sometimes hit road blocks as the project moves ahead.
If you are new to normalization, then click and see 3 normal forms in action which explains all the three normal forms step by step.
Said and done normalization rules are important guidelines but taking them as a mark on stone is calling for trouble. Below are my own 11 rules which I remember on the top of my head while doing DB design.

Rule 1: What is the nature of the application (OLTP or OLAP)?

When you start your database design the first thing to analyze is the nature of the application you are designing for, is it Transactional or Analytical. You will find many developers by default applying normalization rules without thinking about the nature of the application and then later getting into performance and customization issues. As said, there are two kinds of applications: transaction based and analytical based, let’s understand what these types are.
Transactional: In this kind of application, your end user is more interested in CRUD, i.e., creating, reading, updating, and deleting records. The official name for such a kind of database is OLTP.
Analytical: In these kinds of applications your end user is more interested in analysis, reporting, forecasting, etc. These kinds of databases have a less number of inserts and updates. The main intention here is to fetch and analyze data as fast as possible. The official name for such a kind of database is OLAP.
In other words if you think inserts, updates, and deletes are more prominent then go for a normalized table design, else create a flat denormalized database structure.
Below is a simple diagram which shows how the names and address in the left hand side are a simple normalized table and by applying a denormalized structure how we have created a flat table structure.

Rule 2: Break your data into logical pieces, make life simpler

This rule is actually the first rule from 1st normal form. One of the signs of violation of this rule is if your queries are using too many string parsing functions like substring, charindex, etc., then probably this rule needs to be applied.
For instance you can see the below table which has student names; if you ever want to query student names having “Koirala” and not “Harisingh”, you can imagine what kind of a query you will end up with.
So the better approach would be to break this field into further logical pieces so that we can write clean and optimal queries.

Rule 3: Do not get overdosed with rule 2

Developers are cute creatures. If you tell them this is the way, they keep doing it; well, they overdo it leading to unwanted consequences. This also applies to rule 2 which we just talked above. When you think about decomposing, give a pause and ask yourself, is it needed? As said, the decomposition should be logical.
For instance, you can see the phone number field; it’s rare that you will operate on ISD codes of phone numbers separately (until your application demands it). So it would be a wise decision to just leave it as it can lead to more complications.

Rule 4: Treat duplicate non-uniform data as your biggest enemy

Focus and refactor duplicate data. My personal worry about duplicate data is not that it takes hard disk space, but the confusion it creates.
For instance, in the below diagram, you can see “5th Standard” and “Fifth standard” means the same. Now you can say the data has come into your system due to bad data entry or poor validation. If you ever want to derive a report, they would show them as different entities, which is very confusing from the end user point of view.
One of the solutions would be to move the data into a different master table altogether and refer them via foreign keys. You can see in the below figure how we have created a new master table called “Standards” and linked the same using a simple foreign key.

Rule 5: Watch for data separated by separators

The second rule of 1st normal form says avoid repeating groups. One of the examples of repeating groups is explained in the below diagram. If you see the syllabus field closely, in one field we have too much data stuffed. These kinds of fields are termed as “Repeating groups”. If we have to manipulate this data, the query would be complex and also I doubt about the performance of the queries.
These kinds of columns which have data stuffed with separators need special attention and a better approach would be to move those fields to a different table and link them with keys for better management.
So now let’s apply the second rule of 1st normal form: “Avoid repeating groups”. You can see in the above figure I have created a separate syllabus table and then made a many-to-many relationship with the subject table.
With this approach the syllabus field in the main table is no more repeating and has data separators.

Rule 6: Watch for partial dependencies

Watch for fields which depend partially on primary keys. For instance in the above table we can see the primary key is created on roll number and standard. Now watch the syllabus field closely. The syllabus field is associated with a standard and not with a student directly (roll number).
The syllabus is associated with the standard in which the student is studying and not directly with the student. So if tomorrow we want to update the syllabus we have to update it for each student, which is painstaking and not logical. It makes more sense to move these fields out and associate them with the Standard table.
You can see how we have moved the syllabus field and attached it to the Standards table.
This rule is nothing but the 2nd normal form: “All keys should depend on the full primary key and not partially”.

Rule 7: Choose derived columns preciously

If you are working on OLTP applications, getting rid of derived columns would be a good thought, unless there is some pressing reason for performance. In case of OLAP where we do a lot of summations, calculations, these kinds of fields are necessary to gain performance.
In the above figure you can see how the average field is dependent on the marks and subject. This is also one form of redundancy. So for such kinds of fields which are derived from other fields, give a thought: are they really necessary?
This rule is also termed as the 3rd normal form: “No column should depend on other non-primary key columns”. My personal thought is do not apply this rule blindly, see the situation; it’s not that redundant data is always bad. If the redundant data is calculative data, see the situation and then decide if you want to implement the 3rd normal form.

Rule 8: Do not be hard on avoiding redundancy, if performance is the key

Do not make it a strict rule that you will always avoid redundancy. If there is a pressing need for performance think about de-normalization. In normalization, you need to make joins with many tables and in denormalization, the joins reduce and thus increase performance.

Rule 9: Multidimensional data is a different beast altogether

OLAP projects mostly deal with multidimensional data. For instance you can see the below figure, you would like to get sales per country, customer, and date. In simple words you are looking at sales figures which have three intersections of dimension data.
For such kinds of situations a dimension and fact design is a better approach. In simple words you can create a simple central sales fact table which has the sales amount field and it makes a connection with all dimension tables using a foreign key relationship.

Rule 10: Centralize name value table design

Many times I have come across name value tables. Name and value tables means it has key and some data associated with the key. For instance in the below figure you can see we have a currency table and a country table. If you watch the data closely they actually only have a key and value.
For such kinds of tables, creating a central table and differentiating the data by using a type field makes more sense.

Rule 11: For unlimited hierarchical data self-reference PK and FK

Many times we come across data with unlimited parent child hierarchy. For instance consider a multi-level marketing scenario where a sales person can have multiple sales people below them. For such scenarios, using a self-referencing primary key and foreign key will help to achieve the same.
This article is not meant to say that do not follow normal forms, instead do not follow them blindly, look at your project's nature and the type of data you are dealing with first.