Server and Network Monitoring

 

Slides

Activities:

11-1
11-2
11-3
11-4
11-5
11-6
11-7
11-8
11-9
11-11
11-12

Monitoring Resources
System monitoring enables you to be proactive in maintaining fast server response, and it gives you the tools to quickly find and resolve problems after they strike. System monitoring accomplishes several purposes. One reason to monitor is to become familiar with your server’s performance so you know how to interpret a problem. It may be difficult to diagnose a problem or determine if there’s a resource shortage unless you first know what performance is typical for your system. Other reasons to monitor are to prevent problems before they occur and to diagnose existing problems to resolve them.
To monitor your system, detect and troubleshoot problems proactively, and plan for future growth, it's important to create a baseline. A baseline is a snapshot of the system or network in its normal state. Gather data that is representative of the whole system: collect it at regular intervals over time, during both busy and quiet cycles. Baselines, or benchmarks, establish normal system performance characteristics and provide a basis for comparing data collected during problem situations with data showing normal performance. This creates a way to diagnose problems and identify components that need to be upgraded. Benchmarks are acquired in the following ways:
- By generating statistics about CPU, disk, memory, and I/O with no users on the system, to establish a baseline for comparing to more active periods.
- By using performance monitoring to establish slow, average, and peak periods. Keep records on these periods.
- By establishing benchmarks to track growth in the use of servers, such as increases in users, increases in software, and increases in the average amount of time users are on the system.
The best way to get a feel for a server’s performance is to gather benchmarks, and then to frequently monitor server performance after you have the benchmark data. Performance indicators can be confusing at first, so the more time you spend observing them, the better you’ll understand them. For example, viewing the CPU utilization on a server the first few times doesn’t tell you much, but viewing it over a period of two or three months, noting slow and peak periods, helps you develop knowledge about how CPU demand varies for that server.
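The comparison of problem-period data against a baseline can be sketched in a few lines of Python. This is only an illustration: the function names, the sample readings, and the 25% tolerance are assumptions made for the example, not values from System Monitor.

```python
from statistics import mean

def summarize(samples):
    """Return the low, average, and peak values of a list of readings."""
    return {"low": min(samples), "average": mean(samples), "peak": max(samples)}

def exceeds_baseline(baseline, current, tolerance=1.25):
    """True when the current average is more than `tolerance` times the baseline average.

    The 1.25 default is an arbitrary illustration, not a recommended limit.
    """
    return current["average"] > baseline["average"] * tolerance

# Hypothetical % Processor Time readings collected at regular intervals.
baseline = summarize([12, 18, 25, 40, 22, 15])   # normal period
problem = summarize([55, 62, 70, 48, 66, 59])    # period under investigation

print(baseline)
print(exceeds_baseline(baseline, problem))
```

Keeping the low, average, and peak figures for each period gives you the records of slow, average, and peak activity described above.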
Performance Monitor is a tool for data collection and monitoring. System Monitor is the most powerful monitoring tool, and you can use it in a multitude of ways for tracking system performance and determining how to optimize server functions. System Monitor is like a window into the inner workings of just about every aspect of the system, such as hard disks, memory, the processor, and the page file. For example, you might monitor memory and paging to determine whether you have fully tuned the page file for satisfactory performance and whether you have adequate RAM for the server load. The system components that Performance Monitor tracks are called objects (e.g. Processor, Memory), and there can be multiple instances of the same object. As you add new services, new objects are added. For example, when you install IIS, more objects are added to monitor IIS service, HTTP Service, and FTP Server activity. Different measurements of an object are called counters (e.g. Pages/sec, % Processor Time). A counter is an indicator of a quantity of the object that can be measured in some unit, such as percent, rate per second, or peak value, depending on what is appropriate to the object.
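One way to picture the relationship between objects, instances, and counters is as a small data structure. The sketch below is an illustration only: the class name `PerfCounter` is hypothetical, and the path layout it prints, `\Object(Instance)\Counter`, is the form Windows commonly uses to identify a counter, which these notes do not otherwise cover.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PerfCounter:
    """One measurement (counter) of one object, optionally for one instance."""
    obj: str                        # e.g. "Processor" or "Memory"
    name: str                       # e.g. "% Processor Time" or "Pages/sec"
    instance: Optional[str] = None  # e.g. "0" for the first processor

    def path(self):
        """Build a path of the form \\Object(Instance)\\Counter."""
        inst = f"({self.instance})" if self.instance else ""
        return f"\\{self.obj}{inst}\\{self.name}"

counters = [
    PerfCounter("Processor", "% Processor Time", "_Total"),
    PerfCounter("Memory", "Pages/sec"),
]
for counter in counters:
    print(counter.path())
```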
System Monitor offers three modes of monitoring: chart, histogram, or report. A chart is a running line graph of the object that shows distinct peaks and valleys. A histogram is a running bar chart that shows each object as a bar in a different color. The report mode simply provides numbers on a screen, which you can capture to put in a report. System Monitor can be used to monitor not only the local computer but also other computers on the network. This is a powerful option for a server administrator, because it means you can monitor other network servers or workstations from one place, such as your Windows 2000 Professional workstation.

These are some critical objects/counters to monitor system performance:
Memory: To monitor the server’s page file performance (page file is disk space reserved for use when memory requirements exceed the available RAM).
Memory Pages/sec: indicates the number of 4 KB pages transferred in or out of the paging file in one second. A value consistently higher than 2-3 indicates a lack of physical RAM.
Paging File % Usage and % Usage Peak: show how much of the page file is currently occupied and the highest level it has reached. Neither should frequently exceed 99%, but look at this information in relation to Memory Pages/sec. If the values are frequently over 99%, increase the page file size.
Page Faults/sec: a hard page fault occurs when a program doesn't have enough physical memory to execute a given function. If there are frequently more than 5 hard page faults per second, this is another strong indication of a memory bottleneck. Increase RAM.
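The three memory rules of thumb above can be combined into one check over a set of collected readings. The following Python sketch assumes the readings have already been exported from System Monitor into lists. The function names are hypothetical, and treating "frequently" as "more than half of the samples" is an assumption made for the example.

```python
def frequently_above(samples, limit, share=0.5):
    """True when more than `share` of the samples exceed `limit`."""
    if not samples:
        return False
    over = sum(1 for value in samples if value > limit)
    return over / len(samples) > share

def memory_findings(pages_per_sec, paging_file_usage_pct, hard_faults_per_sec):
    """Apply the thresholds from the notes: 3 pages/sec, 99% usage, 5 hard faults/sec."""
    findings = []
    if frequently_above(pages_per_sec, 3):
        findings.append("Pages/sec is high: possible lack of physical RAM")
    if frequently_above(paging_file_usage_pct, 99):
        findings.append("Paging file nearly full: increase the page file size")
    if frequently_above(hard_faults_per_sec, 5):
        findings.append("Hard page faults are high: memory bottleneck, add RAM")
    return findings

# Hypothetical readings taken at regular intervals.
report = memory_findings(
    pages_per_sec=[1, 6, 8, 7, 9],
    paging_file_usage_pct=[60, 72, 65, 70, 68],
    hard_faults_per_sec=[2, 7, 9, 8, 6],
)
for line in report:
    print(line)
```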
Sometimes software applications use the system's RAM very inefficiently, causing performance problems. Inefficient use of memory occurs for at least two reasons: poor program design and failure to return memory to the server after a process is complete (leaking memory). Leaking memory is a very common problem that has a cumulative impact, because the program may go through several cycles in which it repeatedly allocates blocks of memory that aren't released. The result is that the page file continually grows, making performance slower and slower. Adding RAM or increasing the page file size in that case is not likely to address the performance problem. A better solution is to identify the program and redesign it or purchase one that is more efficient. In System Monitor, track the Process object and the counters Page File Bytes and Page Faults/sec for each process that you suspect is causing a problem. A high rate of page faults for one process in relation to the total number of page faults is a strong indicator that there's a problem with that process.
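Comparing one process's page faults with the total is simple arithmetic, sketched below in Python. The process names and rates are made up for the example, and the 50% cutoff for flagging a process is an assumption, not a documented limit.

```python
def fault_shares(faults_by_process):
    """Return each process's share of the total page faults per second."""
    total = sum(faults_by_process.values())
    if total == 0:
        return {}
    return {name: rate / total for name, rate in faults_by_process.items()}

def suspects(faults_by_process, share_limit=0.5):
    """Names of processes responsible for at least `share_limit` of all page faults."""
    shares = fault_shares(faults_by_process)
    return sorted(name for name, share in shares.items() if share >= share_limit)

# Hypothetical Process object readings for the Page Faults/sec counter.
readings = {"app_a.exe": 180, "app_b.exe": 30, "app_c.exe": 10}

print(fault_shares(readings))
print(suspects(readings))
```

A process flagged this way is a candidate for closer tracking with Page File Bytes, to see whether its page file use keeps growing over time.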
Processor: There are three important components to studying the processor load:
- The percent of time the processor is in use
- The length of the queue containing processes waiting to run
- The frequency of interrupt requests from hardware
Processor % Processor Time: indicates the percentage of time the processor is in use (not idle). It's normal for processor use to fluctuate between 10% and 50%. A value consistently higher than 60% (especially in the 80-100% range) indicates a potential problem. At that point, collect additional data by monitoring the number of processes waiting in line for their turn on the processor. The Processor Queue Length counter for the System object shows whether there's a queue of waiting processes. If the processor is often at 100% but there are no processes waiting in the queue, the processor is handling the load. If 4-5 processes are always in line, it's time to consider a faster processor. Before deciding that you need to purchase a new processor, make sure the processor load is not due to a malfunctioning hardware component, such as a NIC or disk adapter. When you monitor the processor load, also check % Interrupt Time and Interrupts/sec for the Processor object. A high frequency of interrupts per second, such as over 1000, is an indication that there's a problem. A frequently high % Interrupt Time (over 80%) is another indication of a hardware problem. These counters don't locate the component, but they do show that the overload is unlikely to be solved by a new processor. Check the system log for information about hardware problems.
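The order of checks described above (rule out hardware first, then look at the queue) can be written as a small decision function. This Python sketch uses the thresholds from the notes. The function name and the wording of the results are hypothetical, and each argument is assumed to be a typical value observed over a monitoring period rather than a single instant.

```python
def diagnose_processor(cpu_pct, queue_length, interrupts_per_sec, interrupt_time_pct):
    """Suggest a next step from typical processor readings."""
    # A flood of interrupts points to hardware, which a new processor won't fix.
    if interrupts_per_sec > 1000 or interrupt_time_pct > 80:
        return "suspect a hardware component; check the system log"
    # Busy processor with processes always waiting in line.
    if cpu_pct > 60 and queue_length >= 4:
        return "consider a faster processor"
    # Busy processor, but nothing (or little) waiting.
    if cpu_pct > 60:
        return "busy but handling the load; keep watching the queue"
    return "normal"

print(diagnose_processor(95, 5, 300, 10))
print(diagnose_processor(100, 0, 300, 10))
print(diagnose_processor(90, 5, 1500, 10))
```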
Disk: LogicalDisk and PhysicalDisk. Use LogicalDisk to observe activity on a set of disks, such as a striped volume. Use PhysicalDisk to monitor a specific disk, such as disk 0 in a set of five disks. PhysicalDisk % Disk Time shows the amount of activity on a disk, and PhysicalDisk Current Disk Queue Length shows the number of requests waiting to access the disk. If one disk is frequently busy at nearly 100%, information on the number of waiting requests helps to diagnose the problem. If there are normally 0-1 requests in the queue, the disk load is acceptable. If the queue generally has 2 or more requests, it's time to move some files from the overloaded disk to one that is less busy. The best way to determine which files to move is to understand what applications and data are on the server and how they are used. If all of the server disks are constantly busy, it may be necessary to add more disks to distribute the data or to invest in a RAID array.
Another source of disk activity is the page file. Monitor the Memory counter Pages/sec and PhysicalDisk % Disk Time simultaneously. This shows the paging activity in relation to the activity on the disk. Sometimes the disk data transfer rate, which is measured by PhysicalDisk Disk Bytes/sec, is also a problem. Use all three counters to track page file activity and how fast the page file is written to disk. This gives you a good idea of the page file activity and the disk speed at the same time. If page file activity is a problem, consider increasing RAM or implementing a page file on more than one disk:
- Place the paging file on a striped volume to optimize performance (but a memory dump can't be performed then), or on any partition other than the boot partition.
- Create multiple smaller paging files on different physical disks for faster access.

If paging activity is low but the transfer rate is slow for large files, such as the page file or a database file, consider upgrading to faster disks. A visible indication that a disk may be a bottleneck is that its LED is lit constantly and you can hear the disk busily reading and writing data. There are three general reasons why a disk is busy. One reason is simply that it's experiencing heavy sustained use. That's not a problem if the disk can handle the load. If the Current Disk Queue Length and Avg. Disk Queue Length generally stay in the 1-2 range, the disk is handling the load, even though you may see its light on frequently. If the queue length is often 3 or more, you need to explore the problem further, which leads to the other reasons why a disk is busy. Another reason is that there's really a memory shortage causing disk activity, because of heavy use of the page file. Determine whether there's a memory shortage and, if so, add RAM.
Some additional reasons why a disk is busy or a source of bottlenecks are fragmentation (run Disk Defragmenter) and the location of disk files (files may need to be distributed more evenly among different disk sets). In some situations, a hard disk may simply have a slow transfer rate and need to be upgraded, particularly if it's an older disk. Set up a test by transferring large files to that disk and measure PhysicalDisk Disk Bytes/sec along with % Disk Time. Consider replacing slow hardware with high-speed SCSI adapters.
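The reasoning about a busy disk can be summarized as a short decision function. The Python sketch below uses the queue-length range from the paragraphs above (3 or more requests means look further) and the Pages/sec rule of thumb from the memory counters. The function name and result strings are hypothetical.

```python
def diagnose_disk(avg_queue_length, pages_per_sec):
    """Suggest a next step for a disk that appears constantly busy."""
    # A short queue means the disk is keeping up, even if its light is always on.
    if avg_queue_length < 3:
        return "disk is handling the load"
    # A long queue together with heavy paging points to memory, not the disk.
    if pages_per_sec > 3:
        return "check for a memory shortage; paging is driving disk activity"
    # Otherwise look at the disk itself.
    return "check fragmentation, file placement, and disk transfer rate"

print(diagnose_disk(avg_queue_length=1.5, pages_per_sec=1))
print(diagnose_disk(avg_queue_length=4, pages_per_sec=12))
print(diagnose_disk(avg_queue_length=4, pages_per_sec=1))
```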
 
LogicalDisk % Free Space: indicates what percentage of space on the logical drive is still available. A low or rapidly decreasing value indicates a problem and a need to add more space.
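To judge how rapidly free space is decreasing, you can estimate from a series of % Free Space readings how long it will be until the drive is full. The Python sketch below assumes a steady, linear rate of decline, which real disks rarely follow exactly. The function name and sample readings are hypothetical.

```python
def days_until_full(free_pct_samples, days_between_samples=1):
    """Estimate days until 0% free, assuming the decline continues at the same rate.

    Returns None when there are too few samples or free space is not shrinking.
    """
    if len(free_pct_samples) < 2:
        return None
    drop = free_pct_samples[0] - free_pct_samples[-1]
    if drop <= 0:
        return None
    elapsed_days = (len(free_pct_samples) - 1) * days_between_samples
    rate_per_day = drop / elapsed_days
    return free_pct_samples[-1] / rate_per_day

# Hypothetical daily readings of LogicalDisk % Free Space.
print(days_until_full([40, 38, 36, 34]))
```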