Bug 1564372 - Setup Nagios server
Summary: Setup Nagios server
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: project-infrastructure
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: M. Scherer
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-06 06:09 UTC by Nigel Babu
Modified: 2020-03-12 13:03 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 13:03:25 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Nigel Babu 2018-04-06 06:09:12 UTC
We need to set up a nagios server that alerts us to system failures, such as machines that are disconnected from Jenkins and/or have run out of disk space. It would let us be proactive rather than reactive.

This is a long-running goal, but for the moment I'll settle for a nagios server and all machines having nagios clients.

If we want to replace nagios with an equivalent such as icinga, that works too.

Comment 1 M. Scherer 2018-04-09 12:21:53 UTC
So, we need to have the nagios server in the internal VLAN so it can monitor everything.

On the easy side, we can monitor ping, various metrics (disk space), and services that are running.
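
For example, a minimal host definition plus a ping check would look something like this (the hostname, address and templates are placeholders from a stock nagios install, not our real config):

define host {
  use        linux-server            ; stock template
  host_name  builder00.int.example   ; placeholder
  address    192.0.2.10              ; placeholder
}

define service {
  use                  generic-service
  host_name            builder00.int.example
  service_description  PING
  check_command        check_ping!100.0,20%!500.0,60%   ; warn/crit on RTA and packet loss
}

Disk space and running services would need an agent on the host (NRPE or similar).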

What policy do we want for alerts? And what SLA/SLE, especially given the timezone differences?

Comment 2 Nigel Babu 2018-04-09 16:20:52 UTC
I'd say send alerts to a mailing list, something like an "alerts" list. We'll still do best-effort working-day coverage. This just improves our ability to see what fails before someone else notices the failure.
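
For reference, routing alerts to a list is just a nagios contact whose email is the list address, roughly like this (the contact name, list address and group name are placeholders):

define contact {
  use           generic-contact      ; stock template
  contact_name  gluster-infra-alerts
  email         alerts@example.org   ; placeholder for the real list address
}

define contactgroup {
  contactgroup_name  admins
  alias              Infra admins
  members            gluster-infra-alerts
}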

Comment 3 Shyamsundar 2018-06-20 18:25:33 UTC
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline gluster repository, request that it be reopened and the Version field be marked appropriately.

Comment 4 M. Scherer 2018-09-10 15:28:36 UTC
So, I reused the existing role I had, and set up a nagios server.

Now, I need to:
- move munin internally (the server is installed, I need to clean the role and move the data)
- connect munin/nagios
- add more checks to nagios (the hard part: doing that without repeating data all over the place; see the hostgroup sketch after this list)
- add more servers
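
One way to handle the "without repeating data" part is hostgroup-based services: define a check once, attach it to a group, and let ansible add each host to the group. A rough sketch (the group name is illustrative):

define hostgroup {
  hostgroup_name  ansible-managed
  alias           Hosts managed by ansible
}

define service {
  use                  generic-service
  hostgroup_name       ansible-managed      ; applied to every member of the group
  service_description  SSH
  check_command        check_ssh
}

Each host definition then only needs a "hostgroups ansible-managed" line, which the role can template out.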

So far, it works, because I got paged for an IPv6 problem in the cage (even though there is no IPv6 in the cage in the first place...)

Comment 5 M. Scherer 2018-09-26 11:24:14 UTC
So:

All servers managed by ansible are now monitored for ping/ssh (which let us see that our FreeBSD hosts block ping, because I got paged for that as soon as I deployed). That is, all servers except gerrit prod.
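
For the hosts that block ping, one workaround is to use the ssh check as the host alive check instead of the default ping-based one, e.g. (names are placeholders, and the template is whatever host template we standardize on):

define host {
  use            linux-server
  host_name      freebsd0.int.example    ; placeholder
  address        192.0.2.20              ; placeholder
  check_command  check_ssh               ; host is considered up as long as sshd answers
}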


I have added an SMTP port check on supercolony, and vhost checking for a couple of web sites; see the ansible repo for details.
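
The underlying plugin calls are roughly these (the vhost name is just an example; the real list is in the ansible repo):

check_smtp -H supercolony.gluster.org
check_http -H www.gluster.org --ssl -u /    ; vhost check: sends a Host: header for that site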

For now, while I clean up the roles and related bits, I am the only one receiving alerts, but we will need a plan for the future; I discussed this with Nigel on IRC.

Notes for myself (and people who care), here is the list of things to do:
- investigate NRPE more (e.g., the security impact of having it open on the NATed IP of the cage)
- add the munin/nagios connection
- add checks for processes:
   - cron
   - custom processes

- add custom checks (gerrit, the jenkins server being offline, etc.)

- refine the httpd checks (e.g., more than just "http 200")
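
For the httpd part, check_http can already do quite a bit more than "did I get a 200", e.g. (URL, string and thresholds are placeholders):

check_http -H www.gluster.org --ssl -u /                  ; basic "is it up"
check_http -H www.gluster.org --ssl -u / -s "Gluster"     ; also require a string in the response body
check_http -H www.gluster.org --ssl -C 14                 ; warn if the TLS certificate expires in less than 14 days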

Comment 6 M. Scherer 2018-09-26 15:57:17 UTC
So, the munin -> nagios connection does work, but:

- hit an SELinux issue:

type=AVC msg=audit(1537977117.718:115791): avc:  denied  { search } for  pid=19206 comm="send_nsca" name="nagios" dev="dm-0" ino=271810 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:nagios_etc_t:s0 tclass=dir


This one shouldn't be too hard to fix.

- have to understand how munin is supposed to be integrated (see the passive service sketch after this list). For example, I see:

[1537976773] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;supercolony.gluster.org;Disk usage in percent;1;WARNINGs: / is 93.80 (outside range [:92]).
[1537976773] Warning:  Passive check result was received for service 'Disk usage in percent' on host 'supercolony.gluster.org', but the service could not be found!

- see why supercolony does alert, but not the builder I set up at 100% CPU
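
For the "service could not be found" warning, the fix should be a matching passive-only service on the nagios side, with exactly the description that send_nsca submits. Something like this (assuming a check_dummy command is defined for the mandatory check_command field):

define service {
  use                     generic-service
  host_name               supercolony.gluster.org
  service_description     Disk usage in percent    ; must match the passive result exactly
  active_checks_enabled   0
  passive_checks_enabled  1
  check_command           check_dummy!0            ; never run actively, just a placeholder
}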

Comment 7 M. Scherer 2018-09-26 16:36:39 UTC
Now, blocked with:

type=AVC msg=audit(1537979718.243:116446): avc:  denied  { name_connect } for  pid=27096 comm="send_nsca" dest=5667 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket

Guess I might need to write my own policy.
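
Until a proper policy lands upstream, a small local module generated from the denials is probably enough; a sketch (the module name is arbitrary):

# build and load a local policy module from the send_nsca denials
ausearch -m avc -c send_nsca | audit2allow -M munin_send_nsca
semodule -i munin_send_nsca.pp

The generated .te should end up containing allow rules matching the AVCs above (the nagios_etc_t dir search and the name_connect to port 5667).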

Comment 8 M. Scherer 2018-09-27 13:50:37 UTC
First step:

https://github.com/fedora-selinux/selinux-policy/pull/229

Comment 9 M. Scherer 2018-09-27 15:03:48 UTC
Second step:
https://github.com/fedora-selinux/selinux-policy-contrib/pull/72


In the meantime, I will make munin run unconfined on the server side until I can work on a send_nsca policy.
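
One way to do roughly that without touching the policy package is to make the munin domain permissive, so its denials are logged but not enforced (and revert once the send_nsca policy exists):

semanage permissive -a munin_t    # stop enforcing for the munin_t domain
semanage permissive -d munin_t    # revert later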

Comment 10 M. Scherer 2018-09-28 15:21:43 UTC
So, I deployed NRPE internally and am testing it on the munin server. Right now, it just checks the load and looks for zombie processes, but I have code for SELinux and for checking the rpm db, and I think an architecture for adding more.
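
For reference, the load and zombie checks map to nrpe.cfg entries along these lines (the plugin path is the EL default, adjust per distro):

command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z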

Comment 11 M. Scherer 2018-09-28 17:54:40 UTC
So, status (again, mostly for myself):
- the check for processes stuck in Z state is done and working
- the check for selinux is done and tested
- the munin notifications should now clean themselves up
- the check for specific processes is done and working, tested on squid/ubunoun
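
A specific-process check like the squid one is basically a check_procs call on the NRPE side plus a service on the nagios server (assuming the usual check_nrpe command with -c $ARG1$); a sketch, with illustrative command and host names:

# nrpe.cfg on the monitored host
command[check_squid_proc]=/usr/lib64/nagios/plugins/check_procs -C squid -c 1:

# service definition on the nagios server
define service {
  use                  generic-service
  host_name            proxy0.int.example     ; placeholder
  service_description  squid process
  check_command        check_nrpe!check_squid_proc
}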

Next steps:
- verify NRPE again in detail (e.g., is it properly confined by SELinux, what can a rogue client achieve)
- improve notifications
- add more verification on various servers

Comment 12 M. Scherer 2019-02-19 11:28:21 UTC
So, NRPE seems to be confined, notifications got improved (the text messages are better than before), and I am adding servers one by one.

Comment 15 Worker Ant 2020-03-12 13:03:25 UTC
This bug has been moved to https://github.com/gluster/project-infrastructure/issues/41, and will be tracked there from now on. Visit the GitHub issue URL for further details.

