Class MonitorImpl

  • All Implemented Interfaces:
    Monitor, Serializable, Remote

    public class MonitorImpl
    extends UnicastRemoteObject
    implements Monitor
    The main starting point for the monitor.
     TODO: DNS lookups via dnsjava, cycle through all DNS entries slowly
  failed today (2019-09-13), so this is bumped up the list
           During this failure, TCP connections were stalled 15 seconds, then successful.
               So monitoring for connections taking over 10 seconds might also catch this failure mode.
     TODO: bind/named: rndc status
     TODO: Monitor PDU load levels
     TODO: Fan speed, CPU voltages, system/CPU temps (BMC) (Hardware Sensors)
     TODO: ECC reports
     TODO: Managed switches
     TODO: Watch /var/log/audit/audit.log for any SELinux denials or messages that indicate intrusion.
         Group servers by cluster and then by rack, can add power warning on rack node itself
     monitor available entropy
     monitor master stats, including shared entropy pool
     watch route scripts on gw1/gw2 pairs making sure consistent with each other
     watchdog to detect failure between RMI client and server
         handle failed RMI
     3ware, verify last battery test interval, test at least once a year
         same for LSI (PDList look for Media Error Count: non-zero)
     Monitor all syslog stuff for errors, split to separate logs for error, warning, info.  errors and higher to one file to monitor.
         watch for SMART status (May 20 04:42:31 xen907-5 smartd[3454]: Device: /dev/hde, 1 Offline uncorrectable sectors) in /var/log/messages
             smartd -q onecheck?
         watch for procmail/spamc failures in: /var/log/mail/info
             Dec  8 19:25:10 www3 sendmail[1795]: mB88rS9R029955: to=\\jl, delay=16:31:40, xdelay=00:00:04, mailer=local, pri=6962581,
                                                                  dsn=4.0.0, stat=Deferred: local mailer (/usr/bin/procmail) exited with EX_TEMPFAIL
     Watch the count of files in "/var/spool/aoserv/spamassassin" - a large number of files (not directories) indicates training failing.  Also watch parsed times.
     mrtg auto-check and graphs
     /proc/version against template
     Other hardware (temps, fans, ...)
        reboot detections using uptime? (or last command)
     port monitoring
         also enable port monitoring on all ports (including via aoserv-daemon)
         update all code and procedures that add net_binds to start them all as monitored
         maximum alert level based on account level?
         monitor the ports on all IP addresses (including, minimize use of wildcard to reduce monitoring rate - or monitor on separate node for
     AOServ data integrity?
     AOServ Daemon errors and warnings
     DNS, forward and reverse (non-AO name servers)
         Also query each nameserver for all expected values slowly over time
     kernel parameters (/proc/sys/...)
         shared memory segments
     other stuff that used to be in server reports
     smart monitoring?
       make sure backups not going to the same primary or secondary physical machine
       low priority if successful but scanned 0 (like no backups configured)
         mysql myisam corruption (check table ... quick fast)
         credit card scanner from aoserv-daemon
     distro scans
     software updates (could we auto-search for them)?
     snapshot-backups space and timing (adpserver)
     jilter state
     aoserv daemon/master/website errors (and anything else that used to get emailed to address)
     sendmail queues
         watch for files older than 7 days to help keep things clean - eventually delete outright?
         watch for growing - this is a sign of a problem
     nmap - other tools that Mark Akins uses - nessus
     domain registration expiration :)
     apache (make sure there are no empty apache logs after rotation!)
     max latency setting on a per-server basis (or per netdevice) - or a hierarchy server_farms, server, net_device, ip_address
     TODO: Log history and persist over time?  Simple files to disk?  Locking to prevent multiple JVMs from corrupting them?
           server_reports-style logging?
     Need to monitor log rotation success states (big log files built-up on keepandshare and awstats broken as a result)
          configurable limits per alert level
          based on 5-minute averages, sampled every minute, will take up to 9 minutes to alert
       snapshot space
     Monitor syslog for ECC errors (at least for i5000 module with most recent 2.6.18-92.1.10+ kernels).  See wiki page for
     Monitor all SSL certificates, ours and customers, could have on a single SSL
     Certificates node per server, perhaps?  Read from filesystem or connect to port?
         Make part of port monitoring
     3ware/BIOS firmware version monitoring?
     LSI monitoring (MegaCli LdInfo/PdList, battery monitoring, too)?
     UPS Monitor:
          battery calibration once a year or when load is increased
          Perhaps just as a procedure - how to schedule in NOC interface?
     Open Resolvers:
     Process Monitoring
          Any process linked to libraries or binaries that do not match the current on-disk versions: this might mean the process needs to be restarted to pick up the latest security updates.
     PostgreSQL and MySQL concurrency monitoring, too, like Apache instances
          Certain application pool sizes, too, such as Tomcat and custom applications (aoserv-master)?
     Tomcat Manager monitoring
     Tomcat log file monitoring (watch Tomcat log files for key things like OutOfMemoryError and "Too many open files" (and translations?))
     Watch the jvm_crashes.log file?
     Integrate NOC with Amazon CloudWatch
     Monitor for certificates issued in domains that are not expected.
         Jonathon Moldenhaur described how there are lists of certificates issued.
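
     The first TODO notes that TCP connections stalled for 15 seconds before
     succeeding, and that alerting on connections taking over 10 seconds might
     catch that failure mode. A minimal sketch of such a check follows. The
     class ConnectTimeCheck, its Level enum, and the 30-second connect timeout
     are hypothetical names and values chosen for illustration; none of them
     are part of MonitorImpl or the existing monitoring code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper for illustration only; not part of MonitorImpl. */
final class ConnectTimeCheck {

  /** Hypothetical result levels for this sketch. */
  enum Level { OK, SLOW, FAILED }

  /**
   * Classifies one connect attempt.
   * A negative elapsed time means the attempt failed or timed out.
   */
  static Level classify(long elapsedMillis, long slowThresholdMillis) {
    if (elapsedMillis < 0) {
      return Level.FAILED;
    }
    return elapsedMillis > slowThresholdMillis ? Level.SLOW : Level.OK;
  }

  /**
   * Times a TCP connect. The host name is resolved before the clock starts,
   * so only the connect itself is measured.
   * Returns the elapsed milliseconds, or -1 on failure or timeout.
   */
  static long timeConnect(String host, int port, int timeoutMillis) {
    InetSocketAddress address = new InetSocketAddress(host, port);
    long start = System.nanoTime();
    try (Socket socket = new Socket()) {
      socket.connect(address, timeoutMillis);
      return (System.nanoTime() - start) / 1_000_000L;
    } catch (IOException e) {
      // Covers refused connections, unresolved hosts, and connect timeouts.
      return -1;
    }
  }

  public static void main(String[] args) {
    if (args.length != 2) {
      System.out.println("usage: ConnectTimeCheck <host> <port>");
      return;
    }
    // Assumed values: 30 s connect timeout, 10 s "slow" threshold.
    long elapsed = timeConnect(args[0], Integer.parseInt(args[1]), 30_000);
    System.out.println(classify(elapsed, 10_000) + " (" + elapsed + " ms)");
  }
}
```

     The connect timeout is deliberately longer than the 10-second threshold,
     so a 15-second stall like the one described would be reported as SLOW
     rather than FAILED.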
    AO Industries, Inc.
    See Also:
    Serialized Form