Class MonitorImpl

  • All Implemented Interfaces:
    Monitor, Serializable, Remote

    public class MonitorImpl
    extends UnicastRemoteObject
    implements Monitor
    The main starting point for the monitor. TODO: DNS lookups via dnsjava, cycle through all DNS entries slowly ns1.aoindustries.com failed today (2019-09-13), so this is bumped up the list During this failure, TCP connections were stalled 15 seconds, then successful. So monitoring for connections taking over 10 seconds might also catch this failure mode. TODO: bind/named: rndc status TODO: Monitor PDU load levels TODO: Fan speed, CPU voltages, system/CPU temps (BMC) (Hardware Sensors) TODO: ECC reports TODO: Managed switches TODO: Watch /var/log/audit/audit.log for any SELinux denials or messages that indicate intrusion. TODO: Group servers by cluster and then by rack, can add power warning on rack node itself /cluster/ /rack1 /rack2 /virtual TODO: monitor available entropy monitor master stats, including shared entropy pool watch route scripts on gw1/gw2 pairs making sure consistent with each other watchdog to detect failure between RMI client and server handle failed RMI 3ware, verify last battery test interval, test at least once a year same for LSI (PDList look for Media Error Count: non-zero) Monitor all syslog stuff for errors, split to separate logs for error, warning, info. errors and higher to one file to monitor. watch for SMART status (May 20 04:42:31 xen907-5 smartd[3454]: Device: /dev/hde, 1 Offline uncorrectable sectors) in /var/log/messages smartd -q onecheck? smartctl? watch for procmail/spamc failures in: /var/log/mail/info Dec 8 19:25:10 www3 sendmail[1795]: mB88rS9R029955: to=\\jl, delay=16:31:40, xdelay=00:00:04, mailer=local, pri=6962581, dsn=4.0.0, stat=Deferred: local mailer (/usr/bin/procmail) exited with EX_TEMPFAIL Watch the count of files in "/var/spool/aoserv/spamassassin" - a large number of files (not directories) indicates training failing. Also watch parsed times. mrtg auto-check and graphs /proc/version against template Other hardware (temps, fans, ...) reboot detections using uptime? (or last command) DiskIO port monitoring also enable port monitoring on all ports (including 127.0.0.1 via aoserv-daemon) update all code and procedures that adds net_binds to start them all as monitored maximum alert level based on account level? monitor the 0.0.0.0 ports on all IP addresses (including 127.0.0.1), minimize use of wildcard to reduce monitoring rate - or monitor on separate node for 0.0.0.0? AOServ data integrity? AOServ Daemon errors and warnings DNS, forward and reverse (non-AO name servers) Also query each nameserver for all expected values slowly over time kernel parameters (/proc/sys/...) shared memory segments other stuff that used to be in server reports smart monitoring? kernel? 3ware? Dell/LSI? backups make sure backups not going to the same primary or secondary physical machine low priority if successful but scanned 0 (like no backups configured) mysql replications mysql myisam corruption (check table ... quick fast) credit card scanner from aoserv-daemon distro scans software updates (could we auto-search for them)? snapshot-backups space and timing (adpserver) jilter state netstat aoserv daemon/master/website errors (and anything else that used to get emailed to aoserv@aoindustries.com address) sendmail queues watch for files older than 7 days to help keep things clean - eventually delete outright? watch for growing - this is a sign of a problem nmap - other tools that Mark Akins uses - nessus domain registration expiration :) apache (make sure no empty apache logs after rotate !) max latency setting on a per-server basis (or per netdevice) - or a hierachy server_farms, server, net_device, ip_address TODO: Log history and persist over time? Simple files to disk? Locking to prevent multiple JVM's corrupting? server_reports-style logging? Need to monitor log rotation success states (big log files built-up on keepandshare and awstats broken as a result) CPU configurable limits per alert level based on 5-minute averages, sampled every minute, will take up to 9 minutes to alert LVM: snapshot space vgck Monitor syslog for ECC errors (at least for i5000 module with most recent 2.6.18-92.1.10+ kernels). See wiki page for xen917-5.fc.aoindustries.com Monitor all SSL certificates, ours and customers, could have on a single SSL Certificates node per server, perhaps? Read from filesystem or connect to port? HTTPS IMAPS/IMAP+TLS POP3S/POP3+TLS MySQL SMTPS/SMTP+TLS PostgreSQL Make part of port monitoring 3ware/BIOS firmware version monitoring? LSI monitoring (MegaCli LdInfo/PdList, battery monitoring, too)? UPS Monitor: battery calibration once a year or when load is increased Perhaps just as a procedure - how to schedule in NOC interface? Open Resolvers: http://openresolverproject.org/ Process Monitoring Any process linked to libraries or binaries that do not match current on-disk: This might mean a process needs restarted to have latest security updates. PostgreSQL and MySQL concurrency monitoring, too, like Apache instances Certain application pool sizes, too, such as Tomcat and custom applications (aoserv-master)? Tomcat Manager monitoring JMX Tomcat Log file monitoring (watch tomcat log files for key things like OutOfMemoryError and "Too many open files" (and translations?) Watch the jvm_crashes.log file? Integrate NOC with Amazon Cloud Watch Monitor for certificates issued in domains that are not expected. Jonathon Moldenhaur described how there are lists of certificates issued.
    Author:
    AO Industries, Inc.
    See Also:
    Serialized Form