The main starting point for the monitor.
TODO: DNS lookups via dnsjava, cycle through all DNS entries slowly
ns1.aoindustries.com failed today (2019-09-13), so this is bumped up the list
During this failure, TCP connections were stalled 15 seconds, then successful.
So monitoring for connections taking over 10 seconds might also catch this failure mode.
TODO: bind/named: rndc status
TODO: Monitor PDU load levels
TODO: Fan speed, CPU voltages, system/CPU temps (BMC) (Hardware Sensors)
TODO: ECC reports
TODO: Managed switches
TODO: Watch /var/log/audit/audit.log for any SELinux denials or messages that indicate intrusion.
Group servers by cluster and then by rack, can add power warning on rack node itself
monitor available entropy
monitor master stats, including shared entropy pool
watch route scripts on gw1/gw2 pairs making sure consistent with each other
watchdog to detect failure between RMI client and server
handle failed RMI
3ware, verify last battery test interval, test at least once a year
same for LSI (PDList look for Media Error Count: non-zero)
Monitor all syslog stuff for errors, split to separate logs for error, warning, info. errors and higher to one file to monitor.
watch for SMART status (May 20 04:42:31 xen907-5 smartd: Device: /dev/hde, 1 Offline uncorrectable sectors) in /var/log/messages
smartd -q onecheck?
watch for procmail/spamc failures in: /var/log/mail/info
Dec 8 19:25:10 www3 sendmail: mB88rS9R029955: to=\\jl, delay=16:31:40, xdelay=00:00:04, mailer=local, pri=6962581,
dsn=4.0.0, stat=Deferred: local mailer (/usr/bin/procmail) exited with EX_TEMPFAIL
Watch the count of files in "/var/spool/aoserv/spamassassin" - a large number of files (not directories) indicates training failing. Also watch parsed times.
mrtg auto-check and graphs
/proc/version against template
Other hardware (temps, fans, ...)
reboot detections using uptime? (or last command)
also enable port monitoring on all ports (including 127.0.0.1 via aoserv-daemon)
update all code and procedures that adds net_binds to start them all as monitored
maximum alert level based on account level?
monitor the 0.0.0.0 ports on all IP addresses (including 127.0.0.1), minimize use of wildcard to reduce monitoring rate - or monitor on separate node for 0.0.0.0?
AOServ data integrity?
AOServ Daemon errors and warnings
DNS, forward and reverse (non-AO name servers)
Also query each nameserver for all expected values slowly over time
kernel parameters (/proc/sys/...)
shared memory segments
other stuff that used to be in server reports
make sure backups not going to the same primary or secondary physical machine
low priority if successful but scanned 0 (like no backups configured)
mysql myisam corruption (check table ... quick fast)
credit card scanner from aoserv-daemon
software updates (could we auto-search for them)?
snapshot-backups space and timing (adpserver)
aoserv daemon/master/website errors (and anything else that used to get emailed to firstname.lastname@example.org address)
watch for files older than 7 days to help keep things clean - eventually delete outright?
watch for growing - this is a sign of a problem
nmap - other tools that Mark Akins uses - nessus
domain registration expiration :)
apache (make sure no empty apache logs after rotate !)
max latency setting on a per-server basis (or per netdevice) - or a hierachy server_farms, server, net_device, ip_address
TODO: Log history and persist over time? Simple files to disk? Locking to prevent multiple JVM's corrupting?
Need to monitor log rotation success states (big log files built-up on keepandshare and awstats broken as a result)
configurable limits per alert level
based on 5-minute averages, sampled every minute, will take up to 9 minutes to alert
Monitor syslog for ECC errors (at least for i5000 module with most recent 2.6.18-92.1.10+ kernels). See wiki page for xen917-5.fc.aoindustries.com
Monitor all SSL certificates, ours and customers, could have on a single SSL
Certificates node per server, perhaps? Read from filesystem or connect to port?
Make part of port monitoring
3ware/BIOS firmware version monitoring?
LSI monitoring (MegaCli LdInfo/PdList, battery monitoring, too)?
battery calibration once a year or when load is increased
Perhaps just as a procedure - how to schedule in NOC interface?
Open Resolvers: http://openresolverproject.org/
Any process linked to libraries or binaries that do not match current on-disk: This might mean a process needs restarted to have latest security updates.
PostgreSQL and MySQL concurrency monitoring, too, like Apache instances
Certain application pool sizes, too, such as Tomcat and custom applications (aoserv-master)?
Tomcat Manager monitoring
Tomcat Log file monitoring (watch tomcat log files for key things like OutOfMemoryError and "Too many open files" (and translations?)
Watch the jvm_crashes.log file?
Integrate NOC with Amazon Cloud Watch
Monitor for certificates issued in domains that are not expected.
Jonathon Moldenhaur described how there are lists of certificates issued.