Troubleshooting SQL Server I/O bottlenecks
One
of the most common reason for server performance issues with respect to
SQL Server is the presence of an I/O bottleneck on the system. When I
say I/O bottleneck, it can mean issues like slow disks, other processes
hogging I/O, out-dated drivers, etc. In this blog, I will seek to
outline the approach for identifying and troubleshooting I/O bottlenecks
on SQL Server.
The Symptoms
The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:
-
You see a lot of threads waiting on one or more of the following waits:
- PAGEIOLATCH_*
- WRITELOG
- TRACEWRITE
- SQLTRACE_FILE_WRITE_IO_COMPLETION
- ASYNC_IO_COMPLETION
- IO_COMPLETION
- LOGBUFFER
- You see the famous "I/O taking longer than 15 seconds" messages in the SQL Server errorlogs:
2012-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is:0x00000000022000”.
Troubleshooting
Data Collection:
If
you see the symptoms outlined above quite frequently on your SQL Server
installation, then it will be safe to draw the conclusion that your
instance is suffering from a disk subsystem or I/O bottleneck. Let's
look at the data collection and troubleshooting approach pertaining to
the same:
-
Enable a custom Performance Monitor collector to capture all disk related counters. Just go to start->run, type perfmon, and hit ok. Next, go to Data Collector sets->User Defined, right click on User Defined, and click New-> Data Collector set.
Note: The best thing about perfmon(apart from the fact that it is built into windows) is that it's a very lightweight diagnostic, and has negligible performance overhead/impact.
- Give
the data collector set a name, and select Create manually. Under type
of data, select the "Create data logs" option, and check the Performance
Counter checkbox under it.
- Next,
click on add performance counters, and select the "LogicalDisk",
"Process" and "PhysicalDisk" groups, and select "All instances" for both
before adding them.
- After
you have added the counters, you can also modify the sample interval.
You might want to do this if you see spikes lasting less than 15
seconds, which is the default sample interval. I sometimes use an
interval of 5 seconds when I want to closely monitor an environment .
- Click on Finish and you will now see the new Data Collector set created under User Defined.
- Next, right click on the Data Collector set you just created, and click start.
I
normally recommend my clients to run the perfmon collector set for at
least one business day, so that it has captured data for the load
exerted by at least one standard business cycle.
Analysis:
Now
that we have the data, we can start the analysis. After stopping the
collector set, you can open the blg file generated (the path is
displayed under the output column, on the right hand side in perfmon)
using perfmon (a simple double click works, as the file type is
associated with perfmon by default). Once open, is should have
automatically loaded all the counters. Analysing with all the counters
can be a bit cumbersome, so I would suggest that you first delete all
the counters and then add specific counters one by one.
I will list out the important counters here, along with their expected values:
-
Process->IO Data Bytes/sec: This counter represents the average amount of IO Data bytes/sec spawned by each process. In conjunction with IO Other Bytes/sec, this counter can be used to determine the average IO per second as well as the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
- Process-> IO Other Bytes/sec: This counter represents the non-data IO spawned by each process during the capture. Usually, the amount of non-data IO is very low as compared to data IO. Use the total of both IO Data Bytes and IO other bytes to determine the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
- Physical Disk/Logical Disk->Avg. Disk Sec/Read: This counter signifies the average amount of time, in ms, that it takes for a read I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
- Physical Disk/Logical Disk->Avg. Disk Sec/Write:
This counter signifies the average amount of time, in ms, that it takes
for a write I/O request to be serviced for each physical/logical disk.
An average of less than 10 ms (0.010) is good, and between 10-15 ms
(0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a
cause for concern.
- Physical Disk/Logical Disk->Disk Bytes/Sec:
This counter represents, in bytes, the throughput of your I/O subsystem
for each physical/logical disk. Look for the max value for each disk,
and divide it by 1024 twice to get the max throughput in MB for the
disk. SAN's generally start from 200-250 MB/s these days. If you see
that the throughput is lower than the specifications for the disk, it's
not necessarily a cause for concern. Check this counter in conjunction
with the Avg Disk Sec/Read or Avg Disk Sec/Write counters (depending on
the wait/symptom you see in SQL), and see the latency at the time of the
maximum throughput. If the latency is green, then it just means that
SQL spawned I/O that was less the disk throughput capacity, and was
easily handled by the disk.
- Physical Disk/Logical Disk->Avg. Disk Queue Length:
This counter represents the average number of I/O's pending in the I/O
queue for each physical/logical disk. Generally, if the average is
greater than 2 per spindle, it's a cause for concern. Please note that I
mentioned the acceptable threshold as 2 per spindle. Most SAN’s these
days have multiple spindles. So, for example, if your SAN has 4
spindles, the acceptable threshold for Avg Disk Queue Length would be 8.
Check the other counters to confirm.
- Physical Disk/Logical Disk->Split IO/Sec:
This counter indicates the I/O's for which the Operating System had to
make more than one command call, grouped by physical/logical disk. This
happens if the IO request touches data on non-contiguous file segments.
It's a good indicator of file/volume fragmentation.
- Physical Disk/Logical Disk->%Disk Time: This counter is a general mark of how busy the physical/logical disk is. Actually, it is nothing more than the “Avg. Disk Queue Length” counter multiplied by 100. It is the same value displayed in a different scale. This is the reason you can see the %Disk Time going greater than 100, as explained in the KB http://support.microsoft.com/kb/310067. It basically means that the Avg. Disk Queue Length was greater than 1 during that time. If you've captured the perfmon for a long period (a few hours or a complete business day), and you see the %Disk Time to be greater than 80%, it's generally indicative of a disk bottleneck, and you should take a closer look at the other counters to arrive at a logical conclusion.
It's
important to keep 2 things in mind. One, make sure your data capture is
not skewed or biased in any way (for example, do not run a capture at
the time of a monthly data load or something). Second, make sure you
correlate the numbers reflected across the various counters to arrive at
the overall picture of how your disks are doing.
Most
of the time, I see that people are surprised when they are told that
there are I/O issues on the system. Their typical response is "But, it's
been working just fine for x years, how can it create a bottleneck
now?". The answer lies within the question itself. When the server was
initially configured, the disk resources were sufficient for the load on
the server. However, with time, it's inevitable that the business grows
as a whole, and so do the number of transactions, as well as the
overall load. As a result, there comes a day when the load breaches that
threshold, and the disk resources on the server are no longer
sufficient to handle it. If you come to office one fine day, see high
latency on the disks during normal working hours, and are sure that
-
No special/additional workloads are running on SQL
-
No other process on the server is spawning excessive I/O,
-
Nothing changed on the server in the past 24 hours (like a software installation, patching, reboot, etc.)
-
All the BIOS and disk drivers on the server are up to date,
Then it's highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause that can cause high latency is that your disk drivers and/or BIOS are out of date. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.
Hope this helps. As always, comments, feedbacks and suggestions are welcome.
Source : http://blogs.msdn.com/b/sqljourney/archive/2013/06/04/how-to-troubleshooting-sql-server-i-o-bottlenecks.aspx
Troubleshooting Slow Disk I/O in SQL Server
Below is a list of counters and thresholds to watch for when monitoring IO on your SQL Server:
Edit
7/19/2011 - Please monitor the below counters and compare average
values over time to the thresholds provided. The counters are intended
for OLTP workloads and larger workloads such as OLAP and batch activity
may not fit within these thresholds.
Object
|
Counter
|
Instance
|
Threshold
|
Comments
|
Physical Disk / Logical Disk
|
Avg. Disk sec/Read
|
*
|
( < .005 excellent; .005 - .010 Good; .010 - .015 Fair; > .015 investigate)
|
Measures read latency on the disks
|
Physical Disk / Logical Disk
|
Avg. Disk sec/Write
|
*
|
( < .005 excellent; .005 - .010 Good; .010 - .015 Fair; > .015 investigate)
|
Measures write latency on the disks
|
Physical Disk / Logical Disk
|
Avg. Disk sec/Transfer
|
*
|
( < .005 excellent; .005 - .010 Good; .010 - .015 Fair; > .015 investigate)
|
Measures average latency for read or write operations
|
Physical Disk / Logical Disk
|
% Idle Time
|
*
|
> 50%
|
The disk should not be working constantly
|
Physical Disk / Logical Disk
|
Avg. Disk Queue Length
|
*
|
None
|
Be
careful when using old thresholds of 2 with SQL Server and SANs. This
value can be much higher than 2. Do not rely solely on this counter to
diagnose an IO problem.
|
Physical Disk / Logical Disk
|
Disk reads/sec
|
*
|
None
|
Number of read IOPs
|
Physical Disk / Logical Disk
|
Disk writes/sec
|
*
|
None
|
Number of write IOPs
|
Physical Disk / Logical Disk
|
Disk Transfers/sec
|
*
|
None
|
Number of read and write IOPs (used to compare against capacity of storage subsystem)
|
Physical Disk / Logical Disk
|
Disk Read Bytes/sec
|
*
|
None
|
Used for determine bandwidth utilization for Read Ops
|
Physical Disk / Logical Disk
|
Disk Write Bytes/sec
|
*
|
None
|
Used to determine bandwidth utilization for Write Ops
|
Physical Disk / Logical Disk
|
Disk Bytes/sec
|
*
|
None
|
Used to determine bandwidth total bandwidth utilization
|
Processor
|
% Processor Time
|
_Total
|
< 80%
|
Amount of total CPU usage across all processors
|
Processor
|
% User Time
|
_Total
|
< 80%
|
Amount of total CPU usage in user mode across all processors
|
Processor
|
% Privileged Time
|
_Total
|
< 30%
|
Amount of total CPU usage in kernel mode across all processors
|
Memory
|
Available Mbytes
|
n/a
|
> 100MB
|
Available Physical RAM on the box
|
SQL Server:Buffer Manager
|
Buffer Cache Hit Ratio
|
n/a
|
> 98%
|
Indication of how often SQL Server is finding data pages in the buffer pool
|
SQL Server:Buffer Manager
|
Page Life Expectancy
|
n/a
|
> 300
|
Indicates pressure on memory as data pages are flushed to disk – inducing IO
|
SQL Server:Buffer Manager
|
Lazy Writes/sec
|
n/a
|
< 20
|
Measure of when the lazy writer thread must clear up buffers to free space for pages – inducing IO
|
SQL Server:Buffer Manager
|
Free Pages
|
n/a
|
> 640
|
Measure of how many free pages are in the buffer pool for use
|
SQL Server:Wait Statistics
|
Page IO latch waits
|
Average wait time (ms)
|
None
|
Average wait time spent waiting on Page IO between disk and memory
|
No comments:
Post a Comment