Nagios XI 2014-R1.0RC3 brief installation testing May 2014

This report contains some details of testing Nagios XI 2014R1.0RC3 on a small network in Australia. Since this report was written, the full Nagios XI release has been announced, and our testing of it was published here:

Nagios XI 2014-R1 brief installation testing May 2014

Scenario on the test network for a customer:

  • Main server is CentOS 6.5 on the x86_64 platform.
  • Installed Nagios XI 2014R1.0RC3 with Standard Edition license.
  • The server runs four KVM virtual machines:
    Solaris 11.1
    OpenSUSE 13.1
    Oracle Linux 6.5
    Ubuntu 14.04
    
Several issues or concerns were uncovered or checked:
  1. Memory footprint: For simple monitoring of four small virtual machines (each with one network interface and 1 GB RAM) and one host server, Nagios XI initially seemed to have grabbed a lot of memory. Further investigation showed that the memory utilisation by Nagios was actually quite low, which was good news because scalability in large networks is one of my evaluation goals.

    The following non-standard services and processes were enabled on the host server when Nagios XI was installed:

    nagios
    nagiosxi
    ndo2db
    nrpe
    postgresql
    mrtg
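
One way to check the footprint per daemon is to sum resident memory by command name. The following is only a minimal sketch: the function name sum_rss_by_command is made up for illustration, and the process names in the pattern are assumptions based on the service list above (for example, PostgreSQL may show up as postmaster or postgres, and nagiosxi is a service rather than a process name).

```shell
#!/bin/sh
# Sum resident set size (RSS, in KB) per command name.
# Input on stdin: lines of "<rss_kb> <command>", for example from
#   ps -e -o rss= -o comm=
# Only commands matching the (assumed) names below are counted.
sum_rss_by_command() {
    awk '
        $2 ~ /^(nagios|ndo2db|nrpe|npcd|postmaster|postgres|mrtg)$/ {
            rss[$2] += $1
        }
        END {
            for (cmd in rss) printf "%s %d\n", cmd, rss[cmd]
        }
    ' | sort
}

# Typical use on the monitoring host:
#   ps -e -o rss= -o comm= | sum_rss_by_command
```

The output is one line per matching command with its total RSS in KB, which makes it easy to compare against the overall figure reported by free or top.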
    
  2. Straight after the installation, I started getting a lot of emails from the MRTG cron job. They contained the following warnings:
    Subroutine SNMP_Session::pack_sockaddr_in6 redefined at
    /usr/local/share/perl5/Exporter.pm line 66.
     at /usr/bin/../lib/mrtg2/SNMP_Session.pm line 149.
    Subroutine SNMPv1_Session::pack_sockaddr_in6 redefined at
    /usr/local/share/perl5/Exporter.pm line 66.
     at /usr/bin/../lib/mrtg2/SNMP_Session.pm line 604.
    
    To eliminate them, I modified the cron job to redirect standard error to /dev/null (the entry must be on a single line):
    */5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l --confcache-file /var/lib/mrtg/mrtg.ok 2>/dev/null
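
Redirecting all of standard error to /dev/null also hides genuine MRTG errors. A gentler alternative is to drop only the known "redefined" warnings and let everything else through to cron's email. This is a sketch under assumptions, not something I tested on the server: the function name filter_mrtg_noise is hypothetical, and the patterns are derived only from the warning text quoted above.

```shell
#!/bin/sh
# Remove the harmless "Subroutine ... redefined" warnings from MRTG's
# stderr while keeping any other message. Reads stdin, writes stdout.
filter_mrtg_noise() {
    grep -Ev \
        -e 'Subroutine SNMP(v1)?_Session::pack_sockaddr_in6 redefined' \
        -e '^[[:space:]]*/usr/local/share/perl5/Exporter\.pm line [0-9]+\.$' \
        -e '^[[:space:]]*at .*SNMP_Session\.pm line [0-9]+\.$' \
        || true   # grep exits 1 when nothing is left; that is not an error here
}

# Possible use in a wrapper script called from cron (stderr only is piped):
#   /usr/bin/mrtg /etc/mrtg/mrtg.cfg 2>&1 >/dev/null | filter_mrtg_noise >&2
```

With this approach cron stays silent for the known warnings but still sends mail if MRTG reports anything unexpected.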
    
  3. The Nagios XI installation completely replaced /etc/sysconfig/iptables on the Linux server. It was necessary to revert to the original file and add the lines needed to accommodate Nagios:
    # SNMP
    -A INPUT -m state --state NEW -m udp -p udp --dport 161 -j ACCEPT
    # SNMP Traps
    -A INPUT -m state --state NEW -m udp -p udp --dport 162 -j ACCEPT
    # NRPE
    -A INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT
    # NSCA
    -A INPUT -m state --state NEW -m tcp -p tcp --dport 5667 -j ACCEPT
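
After restoring the original rules file, it is easy to miss one of the four ports. The sketch below lists which of them have no ACCEPT rule in a given iptables rules file. The function name missing_nagios_ports is invented for this example, and the matching is deliberately simple (it only recognises rules written in the same style as the lines above).

```shell
#!/bin/sh
# Print "proto:port" for every Nagios-related port that has no
# "-A INPUT ... -p <proto> ... --dport <port> ... -j ACCEPT" rule
# in the given iptables rules file.
missing_nagios_ports() {
    rules_file=$1
    for spec in udp:161 udp:162 tcp:5666 tcp:5667; do
        proto=${spec%%:*}
        port=${spec##*:}
        if ! grep -Eq -- "^-A INPUT .*-p $proto .*--dport $port .*-j ACCEPT" "$rules_file"; then
            echo "$spec"
        fi
    done
}

# Typical use:
#   missing_nagios_ports /etc/sysconfig/iptables
```

An empty result means all four rules (SNMP, SNMP traps, NRPE and NSCA) are present.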
    
  4. The Nagios XI system component status page reports "NPCD not running" for the performance grapher.

    This message appears to be misleading, because the process listing and the log file clearly show that npcd is running:

    # ps -elf | grep npc[d]
    5 S nagios  3454   1  0  80   0 - 92222 hrtime May15 ? 00:00:01
    /usr/local/nagios/bin/npcd -d -f
    /usr/local/nagios/etc/pnp/npcd.cfg
    
    The GUI also fails to restart it through the "Action" menu.
  5. NRPE Agent install failed:
    ubuntu14.04 is not currently supported.
    oracleserver6.5 is not currently supported.
    
    Note that Oracle Linux 6.5 is almost identical to RHEL 6.5. In fact, Oracle Linux has two files that describe it:
    # cat /etc/redhat-release
    Red Hat Enterprise Linux Server release 6.5 (Santiago)
    
    # cat /etc/oracle-release
    Oracle Linux Server release 6.5
    
    And Ubuntu is a fork of Debian.
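
Given these similarities, an installer could fall back to the distribution family instead of rejecting the exact release string. The following is a rough sketch of that idea only; it is not how the Nagios XI agent installer works. The function name distro_family and the optional root-directory argument (used here so the logic can be tried against a test tree) are assumptions of this example.

```shell
#!/bin/sh
# Guess the distribution family from well-known release files.
# $1 is an optional root prefix (empty means the live system).
distro_family() {
    root=${1:-}
    if [ -f "$root/etc/oracle-release" ] || [ -f "$root/etc/redhat-release" ]; then
        echo rhel      # Oracle Linux ships both files, RHEL and CentOS the second
    elif [ -f "$root/etc/debian_version" ]; then
        echo debian    # present on Debian and on Ubuntu
    else
        echo unknown
    fi
}

# Typical use:
#   family=$(distro_family)
```

With a check like this, Oracle Linux 6.5 would be treated like RHEL 6.5 and Ubuntu 14.04 like Debian, which is probably good enough for installing NRPE and the plugins.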
  6. Within "Monitoring Wizard: Solaris Server" the installation menu reported:
    Only Solaris 10 supported.
    
    Solaris 11 has been around for almost three years, and it is a pity that it is not supported yet. It is still one of the main commercial Unices.
  7. When I attempted the Solaris 10 agent install on the Solaris 11 server, it hung at this point:
    =================================
    Nagios XI Solaris Agent Installer
    =================================
    
    This script will do a complete install of the Nagios XI
    Solaris agent by executing all necessary sub-scripts.
    
    IMPORTANT: This script should only be used on a Solaris
               system.  Do NOT use this on a system running
                   any other operating system.
    
    ./fullinstall[18]: /usr/ucb/echo: not found [No such file or directory]
    
    On Solaris 11, the /usr/ucb directory tree is deprecated. There is a workaround, but it should be avoided:
    # pkg install compatibility/ucb
    
    Therefore, the Nagios XI agent installer should drop its dependency on /usr/ucb commands on Solaris 11.

    The fix for the echo command problem is easy. Here is the modification in the fullinstall script:

    #/usr/ucb/echo -n "Do you want to continue? [Y/n] "
    printf "Do you want to continue? [Y/n] "
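
The printf replacement works because printf never appends a newline unless asked, so it behaves the same in every POSIX shell, unlike echo -n. A self-contained sketch of such a prompt is shown below; the function name confirm is made up for this example and is not part of the Nagios script.

```shell
#!/bin/sh
# Ask a yes/no question without relying on "echo -n".
# Returns 0 for yes (the default on an empty answer), 1 otherwise.
confirm() {
    printf '%s [Y/n] ' "$1"
    read -r answer || :      # tolerate end-of-file on stdin
    case "$answer" in
        ""|[Yy]|[Yy][Ee][Ss]) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use:
#   confirm "Do you want to continue?" || exit 1
```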
    
    Unfortunately, the agent installation on Solaris 11 still failed because it was an unsupported platform:
    =================================
    Nagios XI Solaris Agent Installer
    =================================
    
    This script will do a complete install of the Nagios XI
    Solaris agent by executing all necessary sub-scripts.
    
    IMPORTANT: This script should only be used on a Solaris
               system.  Do NOT use this on a system running
                   any other operating system.
    
    Do you want to continue? [Y/n] Y
    Proceeding with installation...
    This platform () is not currently supported.
    
  8. Within "Monitoring Wizard: Linux Server", the wizard did not find an OpenSUSE 13.1 agent (a Debian agent was not available for download).
  9. The Auto-Discovery Wizard required its setup script to be run manually from the command line:
    # chmod +x /usr/local/nagiosxi/html/includes/components/autodiscovery/setup.sh
    # cd /usr/local/nagiosxi/html/includes/components/autodiscovery/
    # ./setup.sh
    
  10. My changes to improve NRPE Agent installation on Solaris 11:
    # ln -s nagios-plugins-1.4.16-sol10-i386-local \
      nagios-plugins-1.4.16-sol11-i386-local
    
    # ln -s nrpe-2.14-sol10-i386-local nrpe-2.14-sol11-i386-local
    
    # ln -s top-3.6.1-sol10-x86-local top-3.6.1-sol11-x86-local
    
    Edit init.sh and add Solaris 11 support:
    if [ $os = "SunOS" ] ; then
            if [ $ver = "5.10" ] ; then
                    platform="sol10"
            fi
            if [ $ver = "5.11" ] ; then
                    platform="sol11"
            fi
    fi
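
The same mapping can be written so that an unknown platform is detected explicitly instead of leaving the variable empty (compare the earlier "This platform () is not currently supported" message). This is only an illustrative sketch; the function name detect_platform is my own and the real init.sh may be structured differently.

```shell
#!/bin/sh
# Map "uname -s" / "uname -r" values to the platform tag used in the
# agent package names. Prints the tag, or returns 1 if unsupported.
detect_platform() {
    os=$1
    ver=$2
    case "$os:$ver" in
        SunOS:5.10) echo sol10 ;;
        SunOS:5.11) echo sol11 ;;
        *) return 1 ;;
    esac
}

# Typical use:
#   platform=$(detect_platform "$(uname -s)" "$(uname -r)") || {
#       echo "Platform $(uname -s) $(uname -r) is not supported" >&2
#       exit 1
#   }
```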
    
    Edit fullinstall.sh and add support for Solaris 11:
    # Check platform and architecture
    case "$platform" in
            "sol10"|"sol11")
    ...
    
    The installation then completed, but additional errors appeared; the logs were summarised in logs-all-steps-20140511-170345.tar.gz. The following fault was reported:
    SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
    EVENT-TIME: Sun May 11 17:03:16 EST 2014
    PLATFORM: KVM, CSN: unknown,
    HOSTNAME: sol11-vm2.circlingcycle.com.au
    SOURCE: software-diagnosis, REV: 0.1
    EVENT-ID: bf04b56a-9e55-ed1e-a381-c2a4c7b47fef
    DESC: A service failed - a method is failing in a retryable
    manner but too often.
    AUTO-RESPONSE: The service has been placed into the maintenance
    state.
    IMPACT: svc:/application/nagios/nrpe:default is unavailable.
    REC-ACTION: Run 'svcs -xv svc:/application/nagios/nrpe:default'
    to determine the generic reason why the service failed, the
    location of any logfiles, and a list of other services
    impacted. Please refer to the associated reference document at
    
    http://support.oracle.com/msg/SMF-8000-YX
    
    for the latest service procedures and policies regarding this
    diagnosis.
    
    SMF could not start the NRPE service for two reasons: one was related to an obsolete OpenSSL library (libssl.so.0.9.7), and the other to the requirement for a physical network interface, as this Solaris 11 server was just a KVM virtual machine:
    # cat /var/svc/log/application-nagios-nrpe:default.log
    [ May 11 17:01:46 Disabled. ]
    [ May 11 17:01:53 Rereading configuration. ]
    [ May 11 17:03:13 Enabled. ]
    [ May 11 17:03:13 Executing start method
    ("/lib/svc/method/nrpe start"). ]
    ld.so.1: nrpe: fatal: libssl.so.0.9.7: open failed: No such file
    or directory
    /lib/svc/method/nrpe: line 10: 13937: Killed
    [ May 11 17:03:15 Method "start" exited with status 9. ]
    [ May 11 17:03:15 Executing start method
    ("/lib/svc/method/nrpe start"). ]
    ld.so.1: nrpe: fatal: libssl.so.0.9.7: open failed: No such file
    or directory
    /lib/svc/method/nrpe: line 10: 13939: Killed
    [ May 11 17:03:15 Method "start" exited with status 9. ]
    [ May 11 17:03:15 Executing start method
    ("/lib/svc/method/nrpe start"). ]
    ld.so.1: nrpe: fatal: libssl.so.0.9.7: open failed: No such file
    or directory
    /lib/svc/method/nrpe: line 10: 13941: Killed
    [ May 11 17:03:15 Method "start" exited with status 9. ]
    
    # svcadm enable -r nrpe
    svcadm: svc:/application/nagios/nrpe:default depends on
    svc:/network/physical, which has multiple instances.
    
    # svcs
    ...
    maintenance    17:03:15 svc:/application/nagios/nrpe:default
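
The missing libssl.so.0.9.7 could have been spotted before enabling the service by checking the binary's shared-library dependencies with ldd. The helper below is a small sketch for that; the name missing_libs is hypothetical, and it simply assumes that ldd marks unresolved libraries with the words "not found", as it does on Solaris and Linux.

```shell
#!/bin/sh
# Read ldd output on stdin and print the names of libraries that
# could not be resolved (lines containing "not found").
missing_libs() {
    awk '/not found/ { print $1 }' | sort -u
}

# Typical use (the path to the nrpe binary is a placeholder):
#   ldd /path/to/nrpe | missing_libs
```

If the list is not empty, the service will fail in the same way as in the SMF log above until the libraries are provided or NRPE is rebuilt against the OpenSSL version shipped with Solaris 11.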
    
  11. The Monitoring Wizard seems to take a long time during configuration verification to apply submitted changes. This needs to be investigated further.
  12. In spite of these initial glitches, there is no doubt that Nagios XI is a great product with plenty of features. Here is a small snapshot of how it looks on the network I am currently testing: