We need to set up a Nagios server that alerts us to system failures, such as machines that get disconnected from Jenkins or run out of disk space. That would make our response proactive rather than reactive. This is a long-running goal, but for the moment I'll settle for a Nagios server and Nagios clients on all machines. If we want to replace Nagios with an equivalent like Icinga, that works too.
So, we need Nagios in the internal VLAN so it can monitor everything. On the easy side, we can monitor ping, various metrics (disk space), and which services are running. What policy do we want for alerts? And what SLA/SLE, especially given the timezone differences?
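To make the easy side concrete, here is a minimal sketch of the kind of plugin Nagios would run for a disk-space metric; the mount point and thresholds are placeholders, not a proposal for actual values.

```python
#!/usr/bin/env python3
# Minimal Nagios-style disk check: exit 0 (OK), 1 (WARNING) or 2 (CRITICAL)
# depending on how full the filesystem is. Path and thresholds are placeholders.
import shutil
import sys

PATH = "/"           # filesystem to check (placeholder)
WARN, CRIT = 85, 95  # percent-used thresholds (placeholders)

usage = shutil.disk_usage(PATH)
percent = usage.used * 100 / usage.total

if percent >= CRIT:
    status, code = "CRITICAL", 2
elif percent >= WARN:
    status, code = "WARNING", 1
else:
    status, code = "OK", 0

# Nagios takes the first line of output as the status text, plus the exit code.
print(f"DISK {status} - {PATH} is {percent:.1f}% full | used_pct={percent:.1f}%")
sys.exit(code)
```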
I'd say alert to a mailing list, something like alerts. We'll still do best-effort working-day coverage. This only improves our ability to spot a failure before someone else notices it.
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result this bug is being closed. If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, request that it be reopened and the Version field be marked appropriately.
So, I reused the existing role I had and set up a Nagios server. Now, I need to:
- move Munin internally (the server is installed, I need to clean up the role and move the data)
- connect Munin and Nagios
- add more checks to Nagios (the hard part: doing that without repeating data all over the place)
- add more servers
So far it works, because I got paged for an IPv6 problem in the cage (even though there is no IPv6 in the cage in the first place...).
So: all servers managed by Ansible are now monitored for ping/SSH (which immediately revealed that our FreeBSD hosts blocked ping, because I got paged for it as soon as I deployed). That is, all but gerrit prod. I have added an SMTP port check on supercolony, and vhost checking for a couple of web sites; see the ansible repo for details. For now, while I clean up the roles and such, I am the only one receiving alerts, but we will need a plan for the future; I discussed it with nigel on IRC. Notes for myself (and people who care), here is the list of things to do:
- investigate NRPE further (e.g., the security impact of having it open on the NATed IP of the cage)
- add the Munin/Nagios connection
- add process checks: cron, custom processes
- add custom checks (Gerrit, the Jenkins server being offline, etc.)
- refine the httpd check (more than just "HTTP 200"; see the sketch below)
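On that last point, a refined httpd check could verify a specific vhost and a string in the body rather than just the status code. A minimal sketch, assuming that is the kind of refinement we want; the URL and expected string are placeholders:

```python
#!/usr/bin/env python3
# Sketch of a vhost check that goes beyond "did we get a 200": it also
# verifies that the page body contains an expected string.
# URL and EXPECTED are placeholders, not our real configuration.
import sys
import urllib.request

URL = "https://www.gluster.org/"  # vhost to check (placeholder)
EXPECTED = "Gluster"              # string that must appear in the body (placeholder)

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
except Exception as exc:
    print(f"HTTP CRITICAL - {URL} unreachable or returned an error: {exc}")
    sys.exit(2)

if EXPECTED not in body:
    print(f"HTTP WARNING - {URL} answered but '{EXPECTED}' not found in body")
    sys.exit(1)

print(f"HTTP OK - {URL} answered and contains '{EXPECTED}'")
sys.exit(0)
```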
So, the Munin -> Nagios connection does work, but:
- I hit an SELinux issue:
type=AVC msg=audit(1537977117.718:115791): avc: denied { search } for pid=19206 comm="send_nsca" name="nagios" dev="dm-0" ino=271810 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:nagios_etc_t:s0 tclass=dir
This one shouldn't be too hard to fix.
- I have to understand how Munin is supposed to be integrated. For example, I see:
[1537976773] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;supercolony.gluster.org;Disk usage in percent;1;WARNINGs: / is 93.80 (outside range [:92]).
[1537976773] Warning: Passive check result was received for service 'Disk usage in percent' on host 'supercolony.gluster.org', but the service could not be found!
- I need to see why supercolony does alert, but not the builder at 100% CPU that I set up.
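For reference on the second point: passive results like that one reach Nagios through its external command file, and the host name and service description have to match objects that are actually defined, otherwise you get exactly that "service could not be found" warning. A minimal sketch of submitting such a result, assuming the common default command file path (/var/spool/nagios/cmd/nagios.cmd); host, service and output are taken from the log above:

```python
#!/usr/bin/env python3
# Sketch: submit a passive check result to Nagios through its external
# command file. The command file path is an assumption (a common default);
# HOST and SERVICE must match a host and service_description defined in
# Nagios, otherwise it logs "the service could not be found".
import time

CMD_FILE = "/var/spool/nagios/cmd/nagios.cmd"  # assumed path
HOST = "supercolony.gluster.org"
SERVICE = "Disk usage in percent"  # must match the service_description exactly
CODE = 1                           # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
OUTPUT = "WARNING: / is 93.80 (outside range [:92])"

line = f"[{int(time.time())}] PROCESS_SERVICE_CHECK_RESULT;{HOST};{SERVICE};{CODE};{OUTPUT}\n"
with open(CMD_FILE, "w") as cmd:
    cmd.write(line)
```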
Now I am blocked on:
type=AVC msg=audit(1537979718.243:116446): avc: denied { name_connect } for pid=27096 comm="send_nsca" dest=5667 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket
I guess I might need to write my own policy.
First step: https://github.com/fedora-selinux/selinux-policy/pull/229
Second step: https://github.com/fedora-selinux/selinux-policy-contrib/pull/72 In the meantime, I will run Munin unconfined on the server side until I can work on a send_nsca policy.
So, I deployed NRPE internally and am testing it on the Munin server. Right now it just checks the load and looks for zombie processes, but I have code for SELinux and for checking the RPM DB, and I think an architecture for adding more; a sketch of the zombie check is below.
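For the record, the zombie part boils down to counting processes in Z state; a minimal sketch of that kind of NRPE plugin (thresholds are placeholders, the actual check lives in the ansible repo):

```python
#!/usr/bin/env python3
# Sketch of an NRPE-style check counting zombie (Z state) processes by
# scanning /proc. Thresholds are placeholders; the real check is in the
# ansible repo.
import os
import sys

WARN, CRIT = 5, 20  # placeholder thresholds

zombies = 0
for pid in os.listdir("/proc"):
    if not pid.isdigit():
        continue
    try:
        with open(f"/proc/{pid}/stat") as f:
            # the field after the closing ')' in /proc/<pid>/stat is the state
            state = f.read().rsplit(")", 1)[1].split()[0]
    except (OSError, IndexError):
        continue  # process disappeared or stat was unreadable
    if state == "Z":
        zombies += 1

if zombies >= CRIT:
    print(f"PROCS CRITICAL - {zombies} zombie processes")
    sys.exit(2)
if zombies >= WARN:
    print(f"PROCS WARNING - {zombies} zombie processes")
    sys.exit(1)
print(f"PROCS OK - {zombies} zombie processes")
sys.exit(0)
```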
So, status (again mostly for myself):
- the check for processes stuck in Z state is done and working
- the check for SELinux is done and tested (see the sketch after this list)
- the Munin notifications should now clean themselves up
- the check for a specific process is done and working, tested on squid/unbound
Next steps:
- verify NRPE again in detail (e.g., is it properly confined by SELinux, what can a rogue client achieve)
- improve notifications
- add more checks on various servers
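For illustration only, an SELinux check can be as simple as verifying that the system is in enforcing mode; this sketch assumes that is all we care about and is not necessarily what the deployed check does:

```python
#!/usr/bin/env python3
# Illustrative SELinux check: CRITICAL unless the system is enforcing.
# This is an assumption about what the check verifies, not the deployed code.
import sys

ENFORCE_FILE = "/sys/fs/selinux/enforce"

try:
    with open(ENFORCE_FILE) as f:
        enforcing = f.read().strip() == "1"
except OSError:
    print("SELINUX CRITICAL - selinuxfs not available, SELinux appears disabled")
    sys.exit(2)

if enforcing:
    print("SELINUX OK - enforcing")
    sys.exit(0)
print("SELINUX CRITICAL - permissive mode")
sys.exit(2)
```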
So, NRPE seems to be confined, the notifications got improved (the text of the messages is better than before), and I am adding servers one by one.
This bug has been moved to https://github.com/gluster/project-infrastructure/issues/41 and will be tracked there from now on. Visit the GitHub issue URL for further details.