Server and Network Monitoring
Activities: 11-1, 11-2, 11-3, 11-4, 11-5, 11-6, 11-7, 11-8, 11-9, 11-11, 11-12
Monitoring Resources
System monitoring
enables you to be proactive in maintaining fast server response, and it gives
you the tools to quickly find and resolve problems after they strike. System
monitoring accomplishes several purposes. One reason to monitor is to become
familiar with your server’s performance so you know how to interpret a
problem. It may be difficult to diagnose a problem or determine if there’s a
resource shortage unless you first know what performance is typical for your
system. Other reasons to monitor are to prevent problems before they occur and
to diagnose existing problems to resolve them.
In order to monitor your system,
troubleshoot and detect problems proactively, and plan for future growth, it's
important to create a baseline. A baseline is a snapshot of the normal state of
the system or network. It's important to gather data that is representative
of the whole system: collect data at regular intervals over time, during busy
and non-busy cycles. Baselines, or benchmarks, establish normal system
performance characteristics and provide a basis for comparing data collected
during problem situations with the data showing normal performance conditions.
This creates a way to diagnose problems and identify components that need to be
upgraded. Benchmarks are acquired in the following ways:
- By generating statistics about CPU, disk, memory, and I/O with no users on the
system, to establish a baseline for comparing to more active periods.
- By using performance monitoring to establish slow, average, and peak periods.
Keep records on these periods.
- By establishing benchmarks to track growth in the use of servers, such as
increases in users, increases in software, and increases in the average amount
of time users are on the system.
The best way to get a feel for a server’s performance is to gather benchmarks,
and then to frequently monitor server performance after you have the benchmark
data. Performance indicators can be confusing at first, so the more time you
spend observing them, the better you’ll understand them. For example, viewing
the CPU utilization on a server the first few times doesn’t tell you much, but
viewing it over a period of two or three months, noting slow and peak periods,
helps you develop knowledge about how CPU demand varies for that server.
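As a rough illustration of turning collected readings into a baseline, the
sketch below groups CPU samples by hour of day and reports the overall average
plus the slowest and busiest hours. The function name, the sample format, and
the readings themselves are hypothetical and are not taken from any monitoring
tool.

```python
from statistics import mean

def summarize_baseline(samples):
    """Summarize (hour, cpu_percent) samples into average, slow, and peak figures."""
    by_hour = {}
    for hour, cpu in samples:
        by_hour.setdefault(hour, []).append(cpu)
    # Average load for each hour of the day that has at least one sample.
    hourly_avg = {hour: mean(values) for hour, values in by_hour.items()}
    return {
        "average": mean(cpu for _, cpu in samples),
        "slow_hour": min(hourly_avg, key=hourly_avg.get),
        "peak_hour": max(hourly_avg, key=hourly_avg.get),
    }

# Hypothetical readings: (hour of day, % Processor Time).
samples = [(3, 5.0), (3, 7.0), (10, 40.0), (10, 50.0), (14, 30.0), (14, 20.0)]
baseline = summarize_baseline(samples)
```

Readings taken later during a problem can then be compared against these
figures to judge whether the load is unusual for that time of day.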
Performance Monitor
is a tool for data collection and monitoring. System Monitor is the most
powerful monitoring tool, which you can
use in a multitude of ways for tracking system performance and determining how
to optimize server functions. System Monitor is like a window into the inner
workings of just about every aspect of the system, such as hard disks, memory,
the processor, and the page file. For example, you might monitor memory and the
page file to determine if you have fully tuned the page file for satisfactory
performance and to determine if you have adequate RAM for the server load. The
system components that Performance Monitor monitors are called objects (e.g.
Processor, Memory), and there can be multiple instances of the same object. As
you add new services, new objects are added. For example, when you install IIS,
more objects are added to monitor IIService, HTTP Service, and FTP Server
activity. Different measurements of
objects are called counters (e.g. Pages/sec, % Processor Time). A counter is an
indicator of a quantity of the object that can be measured in some unit, such as
percent, rate per second, or peak value, depending on what is appropriate to the
object.
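The object, instance, and counter relationship can be seen in the way Windows
writes a counter path, for example \Processor(_Total)\% Processor Time. That
notation is not covered in these notes, so treat the sketch below as an
illustration under that assumption; the function name is hypothetical and the
parser ignores the optional \\computer prefix.

```python
import re

# \Object(Instance)\Counter, where the (Instance) part is optional.
COUNTER_PATH = re.compile(
    r"^\\(?P<object>[^\\(]+)(?:\((?P<instance>[^)]*)\))?\\(?P<counter>.+)$"
)

def parse_counter_path(path):
    """Split a counter path into (object, instance, counter)."""
    match = COUNTER_PATH.match(path)
    if match is None:
        raise ValueError(f"not a counter path: {path!r}")
    return match.group("object"), match.group("instance"), match.group("counter")

cpu = parse_counter_path(r"\Processor(_Total)\% Processor Time")
mem = parse_counter_path(r"\Memory\Pages/sec")
```

An object such as Memory has no instances, so the instance comes back as None.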
System Monitor offers three modes of monitoring: chart, histogram, or report. A
chart is a running line graph of the object that shows distinct peaks and
valleys. A histogram is a running bar chart that shows each object as a bar in
a different color. The report mode simply provides numbers on a screen, which
you can capture to put in a report. System Monitor can be used to monitor not
only the local computer, but also other computers on the network. This is a
powerful option for a server administrator, because it means you can monitor
other network servers or workstations from one place, such as your Win2K
Professional workstation.
These are some
critical objects and counters for monitoring system performance:
Memory: Use these counters to
monitor the server’s page file performance (the page file is disk space
reserved for use when
memory requirements exceed the available RAM).
Memory Pages/sec – indicates the number of 4K pages transferred in or out of
the paging file in one second. A value consistently higher than 2-3 indicates a
lack of physical RAM.
Paging File % Usage and % Usage Peak – show how much of the page file is
occupied, currently and at its peak. Neither should frequently exceed 99%, but
look at this information in relation to Memory Pages/sec. If the values are
frequently over 99%, increase the page file size.
Page Faults/sec – a hard page fault occurs when the data a program needs is not
in physical memory and must be read from disk. If there are frequently more
than 5 hard page faults/sec, this is another strong indication of a memory
bottleneck. Increase RAM.
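The memory guidelines above can be applied to a set of collected samples as
sketched below. The thresholds (3 pages/sec, 99% usage, 5 faults/sec) come from
these notes; the function names and the rule that "frequently" means at least
half of the samples are assumptions made only for illustration.

```python
def fraction_above(samples, threshold):
    """Return the fraction of samples that exceed the threshold."""
    return sum(1 for value in samples if value > threshold) / len(samples)

def memory_findings(pages_per_sec, page_file_usage_pct, page_faults_per_sec,
                    frequent=0.5):
    """Apply the notes' memory guidelines; `frequent` is an assumed cutoff."""
    findings = []
    if fraction_above(pages_per_sec, 3) >= frequent:
        findings.append("increase RAM: Pages/sec consistently above 2-3")
    if fraction_above(page_file_usage_pct, 99) >= frequent:
        findings.append("increase page file size: % Usage frequently over 99%")
    if fraction_above(page_faults_per_sec, 5) >= frequent:
        findings.append("increase RAM: frequently over 5 hard page faults/sec")
    return findings

# Hypothetical samples collected at regular intervals.
findings = memory_findings(
    pages_per_sec=[1, 6, 8, 7],
    page_file_usage_pct=[60, 70, 65, 72],
    page_faults_per_sec=[2, 9, 12, 10],
)
```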
Sometimes software applications use the system’s RAM very inefficiently,
causing performance problems. Inefficient use of memory occurs for at least two
reasons: poor program design and failure to return memory to the server after a
process is complete (leaking memory). Leaking memory is a very common problem
that has a cumulative impact, because the program may go through several cycles
in which it repeatedly claims blocks of memory that are never released. The
result is that the page file continually grows, resulting in slower and slower
performance. Adding RAM or increasing page file size in that case is not likely
to address the performance problem. A better solution is to identify the program
and redesign it or purchase one that is more efficient. In System Monitor, track
the Process object and the counters Page File Bytes and Page Faults/sec for
each process that you suspect is causing a problem. A high rate of page faults
for one process in relation to the total number of page faults is a strong
indicator that there’s a problem with that process.
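To compare one process against the total, each process's Page Faults/sec
reading can be expressed as a share of all page faults, as in this minimal
sketch. The process names and rates are invented for the example, and the
function name is hypothetical.

```python
def page_fault_shares(faults_by_process):
    """Return each process's share of the total page fault rate."""
    total = sum(faults_by_process.values())
    if total == 0:
        return {}
    return {name: rate / total for name, rate in faults_by_process.items()}

# Hypothetical Page Faults/sec readings for the Process object.
rates = {"app.exe": 80.0, "svc.exe": 15.0, "other.exe": 5.0}
shares = page_fault_shares(rates)
suspect = max(shares, key=shares.get)  # process with the largest share
```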
Processor: There are three important components to studying the processor load:
- The percent of time the processor is in use
- The length of the queue containing processes waiting to run
- The frequency of interrupt requests from hardware
Processor % Processor Time – indicates the percentage of time the processor is
busy (not idle). It’s normal for the processor to fluctuate between 10% and
50%. A value consistently higher than 60% (especially in the 80-100% range)
indicates a potential problem. It’s then time to
collect additional data by monitoring the number of processes waiting in line
for their turn on the processor. The Processor Queue Length counter for the
System object shows whether there’s a queue of waiting processes. If the
processor is often at 100% but there are no processes waiting in the queue, the
processor is handling the load; if 4-5 processes are always in line, this
suggests that it’s time to consider a faster processor. Before deciding that
you need to purchase a new processor, make sure the processor load is not due
to a malfunctioning hardware component, such as a NIC or disk adapter. When you
monitor the processor load, also check % Interrupt Time and Interrupts/sec for
the Processor object. A high frequency of interrupts, such as over 1000
Interrupts/sec, is an indication that there’s a problem.
Frequently high % Interrupt Time (over 80%) is another indication of a hardware
problem. These counters don’t locate the component, but they do show that the
overload problem is unlikely to be solved by a new processor. Check the system
log for information about hardware problems.
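The processor checks described above can be read as a small decision procedure.
The sketch below uses the thresholds from these notes (60% processor time, 4 or
more waiting processes, 1000 interrupts/sec, 80% interrupt time); the function
name is hypothetical, and it assumes each argument is a typical value observed
over a monitoring period rather than a single reading.

```python
def processor_diagnosis(cpu_pct, queue_length, interrupts_per_sec,
                        interrupt_time_pct):
    """Suggest a next step from typical processor counter values."""
    # Rule out a malfunctioning hardware component before blaming the CPU.
    if interrupts_per_sec > 1000 or interrupt_time_pct > 80:
        return "check hardware first"
    if cpu_pct > 60 and queue_length >= 4:
        return "consider a faster processor"
    if cpu_pct > 60:
        return "processor is handling the load"
    return "normal"

busy_with_queue = processor_diagnosis(95, 5, 300, 10)
busy_no_queue = processor_diagnosis(100, 0, 300, 10)
```

The hardware check comes first because a faster processor is unlikely to help
when the load is caused by excessive interrupts.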
Disk: LogicalDisk and PhysicalDisk.
Use LogicalDisk to observe activity on a set of disks, such as a striped
volume. Use PhysicalDisk to monitor a specific disk, such as disk 0 in a set of
five disks. PhysicalDisk % Disk Time shows the amount of activity on a
disk, and PhysicalDisk Current Disk Queue Length shows the number of
waiting requests to access the disk. If one disk is frequently busy at nearly
100%, information on the number of waiting requests helps to diagnose the
problem. If there are normally 0-1 requests
in the queue, the disk load is acceptable. If the queue generally has 2
or more requests, it’s time to move some files from the overloaded disk to one
that is less busy. The best way to determine which files to move is to
understand what applications and data are on the server and how they are used.
If all of the server disks are constantly busy, it may be necessary to add more
disks to distribute the data or to invest in a RAID array. Another source of
disk activity is the
page file. Monitor the Memory counter Pages/sec and PhysicalDisk % Disk Time
simultaneously. This shows the paging activity in relation to the activity
on the disk. Sometimes the disk data transfer rate, which is measured by
PhysicalDisk Disk Bytes/sec, is also a problem. Use all three counters to track
page file activity and how fast the page file is written to disk. This gives
you a good idea of the page file activity and the disk speed at the same time.
If page file activity is a problem, consider increasing RAM or implementing a
page file on more than one disk:
- Place the paging file on a striped volume to optimize performance (although a
memory dump then cannot be performed), or on any partition other than the boot
partition.
- Create multiple smaller paging files on different physical disks for faster
access.
If paging activity
is low, but the transfer rate is slow for large files, such as the page file or
a database file, consider upgrading to faster disks.
A visible indication that a disk may be a bottleneck is that its LED is
lit constantly and you can hear the disk busily reading and writing data.
There are three general reasons why a disk is busy. One reason is simply that
it’s experiencing heavy sustained use. That’s not a problem if the disk can
handle the load. If the Current Disk Queue Length and Avg. Disk Queue Length
generally stay in the 1-2 range, the disk is handling the load, even though you
may see its lights on frequently. If the queue length is often 3 or more, you
need to explore the problem further, which leads to the other reasons
why a disk is busy. Another reason is a memory
shortage, which causes disk activity through heavy use of the page file.
Determine whether there’s a memory shortage and, if so, upgrade the memory.
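The reasoning in this paragraph can be sketched as follows. It uses the queue
length guideline from this paragraph (3 or more needs investigation) and reuses
the Pages/sec guideline from the Memory section as the sign of heavy paging;
the function name and the idea of passing in typical values are assumptions for
illustration only.

```python
def disk_diagnosis(avg_queue_length, pages_per_sec, paging_threshold=3):
    """Classify a busy disk from its typical queue length and paging rate."""
    if avg_queue_length < 3:
        return "disk is handling the load"
    # A long queue combined with heavy paging points at memory, not the disk.
    if pages_per_sec > paging_threshold:
        return "memory shortage is driving disk activity"
    return "investigate fragmentation, file placement, or disk speed"

light = disk_diagnosis(avg_queue_length=1.5, pages_per_sec=1)
paging = disk_diagnosis(avg_queue_length=4, pages_per_sec=12)
```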
Some additional reasons why a disk is busy or a source of bottlenecks are
fragmentation (run Disk Defragmenter) and the location of disk files
(distribute files more evenly among different disk sets). In some situations, a
hard disk simply may have a slow transfer rate and may need to be upgraded,
particularly if it’s an older disk. Set up a test by transferring large files
to that disk and measure PhysicalDisk Disk Bytes/sec along with % Disk Time. If
the transfer rate is poor, consider upgrading to high-speed SCSI adapters.
LogicalDisk % Free Space – indicates what percentage of space on the
logical drive is still available. A low or rapidly decreasing number indicates
a problem: more disk space needs to be added.
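One simple way to judge whether % Free Space is "rapidly decreasing" is to
estimate how long the remaining space will last at the recent rate of decline.
The sketch below assumes one reading per day and a steady, linear decline; both
are simplifying assumptions, and the function name and readings are
hypothetical.

```python
def days_until_full(free_pct_readings):
    """Estimate days until 0% free from daily % Free Space readings (oldest first)."""
    if len(free_pct_readings) < 2:
        return None
    days_elapsed = len(free_pct_readings) - 1
    drop_per_day = (free_pct_readings[0] - free_pct_readings[-1]) / days_elapsed
    if drop_per_day <= 0:
        return None  # free space is steady or growing
    return free_pct_readings[-1] / drop_per_day

# Hypothetical daily readings of LogicalDisk % Free Space.
remaining = days_until_full([40, 38, 36, 34])
```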