|
Last night at work our use of OMSA and Nagios paid off (it often does). Three crucial production servers showed up in monitoring with degraded RAID5 arrays. It appears all three servers had 1 out of 3 drives in state "foreign"! I was able to quickly log in, bring the foreign disks online, and perform consistency checks. Without Nagios and OMSA we would never have known, and the servers would have lived on with zero redundancy until their inevitable failure. Some people may not know about OMSA or IPMI, so I thought I would write a quick blurb.

Overview

OMSA is an application designed by Dell that allows for administration of hardware on Dell servers. It offers both a web browser based GUI and a CLI. It relies on the OpenIPMI driver and thus requires that OpenIPMI be installed. Check out Dell's OpenManage site for mounds of mostly useless detail.

Anyway, the company I work for uses Dell servers almost exclusively. We have well over 10,000 PowerEdge servers deployed throughout several DCs, so we don't have the luxury of manually auditing server health by peering at a server's front panel; it must be automated. We use the Dell OMSA application on a growing percentage of our servers to monitor the state of Physical Disks, Virtual Disks, Memory, Controllers, Controller Batteries, etc. We monitor the status of these components via Nagios, using the NRPE daemon and a couple of custom check plug-ins that I wrote in bash.

In most instances we install only the RPMs required for using the CLI. The Web package uses a built-in web server, and the additional system resources it requires make it less appealing. Plus, bash scripts don't need GUIs ;). The CLI includes several commands, of which only two will be used by most people: omreport and omconfig. Omreport is used for...reporting, and omconfig for...configuring. Some of the common uses include:

omreport storage controller - controller status information (firmware version, driver version, status, etc).
omreport storage vdisk controller=0 - virtual disk status information.
omreport storage pdisk controller=0 - physical disk status information.
omreport system summary - firmware version and OMSA version information, which may be important when determining which omconfig command to use.

Using the above you can determine whether any of your virtual disks or physical disks are in a degraded state. You can also add an additional disk as a hot spare using omconfig. One thing that is less obvious about omconfig is how to get a disk that is showing as foreign into a RAID array.

omconfig storage controller action=clearforeignconfig controller=CONTROLLERID - This removes any previous config info held on that disk, and its state will now be "ready". It's a little scary running this command on a production server because you do not specify any physical disk ID, which leaves you wondering whether the command was meant for physical disks or something else. Worry not: only foreign physical disks will be cleared.

omconfig storage pdisk action=assignglobalhotspare assign=yes controller=CONTROLLERID pdisk=PDID - This assigns the newly "readied" drive as a global hot spare. With assign=yes specified, the drive is used automatically when an array is degraded. If you now re-issue the omreport storage pdisk controller=CONTROLLERID command, you will see the rebuild status of the physical disk.
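
To give a feel for how the omreport output can feed a Nagios check, here is a minimal sketch of a bash plug-in. It is not the plug-in we actually run. It assumes that omreport storage pdisk prints one "ID : ..." and one "State : ..." line per physical disk, and the function name check_pdisk_states and the severity mapping (Online and Ready are fine, Rebuilding is a warning, anything else is critical) are illustrative choices of mine. The exit codes follow the usual Nagios plug-in convention of 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN.

```shell
#!/bin/bash
# Sketch of a Nagios-style physical disk check (hypothetical, not a shipped plug-in).
# Assumption: the input uses "Key : Value" lines, with one "ID" line and one
# "State" line per physical disk.

# Reads omreport-style text on stdin, prints a one-line summary and returns
# the matching Nagios exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
check_pdisk_states() {
    awk '
        # Return the text after the first colon, with surrounding blanks removed.
        function value(line) {
            sub(/^[^:]*:[ \t]*/, "", line)
            sub(/[ \t\r]+$/, "", line)
            return line
        }
        /^ID[ \t]*:/    { id = value($0) }
        /^State[ \t]*:/ {
            state = value($0)
            total++
            # Assumed policy: Online and Ready are healthy.
            if (state == "Online" || state == "Ready") next
            # Assumed policy: a rebuild is a warning, everything else is critical.
            if (state == "Rebuilding") warn = warn " " id "=" state
            else                       crit = crit " " id "=" state
        }
        END {
            if (total == 0) { print "UNKNOWN: no physical disks found in input"; exit 3 }
            if (crit != "") { print "CRITICAL:" crit warn; exit 2 }
            if (warn != "") { print "WARNING:" warn; exit 1 }
            print "OK: " total " physical disks healthy"
            exit 0
        }
    '
}

# Demo with hand-written sample text (not captured from a real server).
sample='ID                        : 0:0:0
State                     : Online
ID                        : 0:0:1
State                     : Online
ID                        : 0:0:2
State                     : Foreign'

rc=0
printf '%s\n' "$sample" | check_pdisk_states || rc=$?
echo "exit code: $rc"
# prints "CRITICAL: 0:0:2=Foreign" then "exit code: 2"
```

On a host with OMSA installed, the same function could be fed real data with omreport storage pdisk controller=0 | check_pdisk_states, and the resulting script exposed through NRPE like any other plug-in. Check the state names against your own omreport output before trusting the mapping.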
There are so many ways to make use of the info available via OMSA and IPMI. The data is a perfect match for Nagios and for graphing with Munin. Of course, the number of Nagios service checks can grow quickly when you start doing large amounts of client-side checks. We tend to stay away from the less important ones that a small company might make use of (fans, power, case intrusion, etc). |