Bug 1427895

Summary: EPEL7: Updating Nagios from 4.0.8 to 4.2.4 breaks existing installations
Product: [Fedora] Fedora EPEL
Reporter: Lenz Grimmer <lenz>
Component: nagios
Assignee: Stephen John Smoogen <smooge>
Status: CLOSED CANTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: epel7
CC: affix, athmanem, b.heden, jose.p.oliveira.oss, lemenkov, linux, ondrejj, shawn.starr, smooge, smooge, s, swilkerson
Last Closed: 2017-03-02 16:32:37 UTC
Type: Bug
Attachments:
Differences in nagios.cfg from version 4.0.8 to 4.2.4 (flags: none)

Description Lenz Grimmer 2017-03-01 13:21:40 UTC
Created attachment 1258679
Differences in nagios.cfg from version 4.0.8 to 4.2.4

Description of problem:

When running "yum update" on an EL7 system that is subscribed to the EPEL7 yum repo, Nagios will be updated from version 4.0.8 to 4.2.4. After the update, Nagios refuses to start with an old configuration:

# systemctl start nagios
Job for nagios.service failed because the control process exited with error code. See "systemctl status nagios.service" and "journalctl -xe" for details.

# systemctl status nagios
● nagios.service - Nagios Network Monitoring
   Loaded: loaded (/usr/lib/systemd/system/nagios.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2017-03-01 13:25:25 CET; 1min 5s ago
     Docs: https://www.nagios.org/documentation/
  Process: 16882 ExecStopPost=/usr/bin/rm -f /var/spool/nagios/cmd/nagios.cmd (code=exited, status=0/SUCCESS)
  Process: 16894 ExecStartPre=/usr/sbin/nagios -v /etc/nagios/nagios.cfg (code=exited, status=1/FAILURE)
 Main PID: 1047 (code=exited, status=0/SUCCESS)

Mar 01 13:25:25 centos7.fritz.box systemd[1]: Starting Nagios Network Monitoring...
Mar 01 13:25:25 centos7.fritz.box systemd[1]: nagios.service: control process exited, code=exited status=1
Mar 01 13:25:25 centos7.fritz.box systemd[1]: Failed to start Nagios Network Monitoring.
Mar 01 13:25:25 centos7.fritz.box systemd[1]: Unit nagios.service entered failed state.
Mar 01 13:25:25 centos7.fritz.box systemd[1]: nagios.service failed.

# nagios --verify-config /etc/nagios/nagios.cfg

Nagios Core 4.2.4
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 12-07-2016
License: GPL

Website: https://www.nagios.org
Reading configuration data...
Error in configuration file '/etc/nagios/nagios.cfg' - Line 454 (Check result path '/var/log/nagios/spool/checkresults' is not a valid directory)
   Error processing main config file!
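
The failing condition above can be checked without starting the service. The following is a minimal sketch, assuming the line that fails is the check_result_path directive in nagios.cfg. The function names get_check_result_path and verify_check_result_path are hypothetical helpers, not part of Nagios, and taking the last uncommented occurrence of the directive is an assumption. The sketch only reads the config file and changes nothing on disk.

```shell
# Print the value of check_result_path from the given Nagios main config
# file. Commented-out lines are ignored; if the directive appears more
# than once, the last occurrence is printed (an assumption of this sketch).
get_check_result_path() {
    sed -n '/^[[:space:]]*check_result_path[[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*$//;p;}' "$1" | tail -n 1
}

# Report whether the configured check result directory exists.
# Returns 0 if it is a directory, 1 if it is unset or missing.
verify_check_result_path() {
    cfg=$1
    dir=$(get_check_result_path "$cfg")
    if [ -z "$dir" ]; then
        echo "no check_result_path set in $cfg"
        return 1
    fi
    if [ -d "$dir" ]; then
        echo "ok: $dir"
        return 0
    fi
    echo "missing: $dir"
    return 1
}

# Example call, using the config path from the error message above:
# verify_check_result_path /etc/nagios/nagios.cfg
```

If the helper reports a missing directory, the old configuration still points at a location that is not present after the update, which matches the error reported for line 454.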

Version-Release number of selected component (if applicable):

# rpm -q nagios
nagios-4.2.4-2.el7.x86_64

How reproducible:

Always when performing an update from a previously installed older version of Nagios.

Steps to Reproduce:
1. Take an existing EL7/EPEL7 installation still running Nagios 4.0.8
2. Run "yum update"
3. Observe that Nagios fails to start

Actual results:

Nagios fails to restart:

# systemctl start nagios
Job for nagios.service failed because the control process exited with error code. See "systemctl status nagios.service" and "journalctl -xe" for details.

Expected results:

Nagios should continue to function after an update and not be rendered in a broken state.

Additional info:

Attached please find a diff of the various configuration options that changed during the update.

I'd like to point out that this is the second time that Nagios in EPEL7 was updated to a new major version, introducing a lot of disruptive changes. The previous update from version 3.x to 4.0.8 already broke several things (e.g. the PNP4Nagios plugin). This update also changes the directory layout and the location of several files, leaving existing 4.0.8 installations in a broken state after the update.

FWIW, this is in violation of the package update policies for EPEL - https://fedoraproject.org/wiki/EPEL_Updates_Policy

"All updates should strive to avoid situations that require manual intervention to keep the package functioning after update."

"Major updates with changes to user experience are to be avoided."

As the member of a project that depends on a stable version of Nagios in EPEL, this repeated breaking of Nagios causes a lot of confusion and headache on our end and for our users.

Comment 1 Stephen John Smoogen 2017-03-01 14:18:23 UTC
I understand that this is a second large update. The package had been left without an active maintainer for 2 years and had multiple unfixed security problems in it. The first massive update occurred before I took over and at that time it was either upgrade to latest or remove the package from the archive. 

FWIW, I have blogged and emailed about it being a large disruptive breakage and asked for feedback and testing multiple times. I also got a go ahead from the EPEL Steering Committee that this update was going to happen and followed all the rules for that.  I will also bring it up at this week's meeting that you feel that this package is not meeting guidelines. [I will recuse myself from the debate because I am currently on the committee.]

If you are wanting to help on this I would really appreciate constructive feedback and work on getting things working for people. I know nagios is an important product and don't want to screw with people.

Upstream does not offer fixes for the old versions without a contract with them and they are mostly focused on getting their customers to the latest version. The code changes greatly in between versions so trying to back port patches in the PHP parts which have had most of the security problems is more than anyone has been willing to do.

Comment 2 Lenz Grimmer 2017-03-01 15:57:38 UTC
Hi John,

thanks for your reply, much appreciated!

(In reply to Stephen John Smoogen from comment #1)

> I understand that this is a second large update. The package had been left
> without an active maintainer for 2 years and had multiple unfixed security
> problems in it. The first massive update occurred before I took over and at
> that time it was either upgrade to latest or remove the package from the
> archive. 

I see, thank you for the background information, and for looking after this package to begin with.
 
> FWIW, I have blogged and emailed about it being a large disruptive breakage
> and asked for feedback and testing multiple times.

That information unfortunately did not reach us or our users who have reported issues with Nagios to us :/

Would you mind sharing the URLs of these blog posts, so we can refer our users to them?

> I also got a go ahead
> from the EPEL Steering Committee that this update was going to happen and
> followed all the rules for that.  I will also bring it up at this week's
> meeting that you feel that this package is not meeting guidelines. [I will
> recuse myself from the debate because I am currently on the committee.]

Thanks, that'd be appreciated. I think some of the breakage could have been avoided by sticking to the path names and file locations as they were established by the previous versions. Admittedly they were not perfect (and likely violated the FHS in some places), but the combination of updating to a new major version *combined* with the shuffling of files and directories was a tad bit too much change in my opinion.
 
> If you are wanting to help on this I would really appreciate constructive
> feedback and work on getting things working for people. I know nagios is an
> important product and don't want to screw with people.

I wish I had known about the upcoming changes beforehand, I admit I do not follow the development of EPEL closely. The cat's out of the bag now anyway, so changing things again at this point will likely cause even more confusion.
 
> Upstream does not offer fixes for the old versions without a contract with
> them and they are mostly focused on getting their customers to the latest
> version. The code changes greatly in between versions so trying to back port
> patches in the PHP parts which have had most of the security problems is
> more than anyone has been willing to do.

Which is understandable. In our case we also suffered from the fact that the PNP4Nagios Broker Module npcdmod.o no longer works on Nagios 4.x, but that's a different story.

At this point there is probably not much that can be done about this other than having users migrate their configurations manually after an update. So I guess you can close this one as WONTFIX...

FWIW, I've documented the process for our users in our bug tracker now: https://tracker.openattic.org/browse/OP-1955

And if you're curious about the pain that we had with the last update to 4.0.8, feel free to take a look at https://tracker.openattic.org/browse/OP-1955

Comment 3 Stephen John Smoogen 2017-03-01 16:16:44 UTC
Blog posts and other places where I brought up the changes to the nagios package:

http://smoogespace.blogspot.com/2016/11/updating-nagios-in-epel-7-looking-for.html

http://smoogespace.blogspot.com/2017/02/major-update-to-nagios-in-fedora.html

http://smoogespace.blogspot.com/2017/02/major-update-to-fedoraepel-moving-to.html

https://lists.fedorahosted.org/archives/list/epel-devel@lists.fedoraproject.org/thread/R7IXWBY5DTHXEEQPYYJFO53MJK4QD2GW/

The work on updating nagios has been going on since last October or so, when it was found that the previous package maintainer was no longer responding to emails. So communication on it has been spread out over some time.

Comment 4 Lenz Grimmer 2017-03-01 17:08:12 UTC
Thanks for the pointers, much appreciated. We'll probably write up a blog post to inform our users about this change and how they can fix their installations.

Comment 5 Stephen John Smoogen 2017-03-02 16:32:16 UTC
I am going to close this as CANTFIX rather than WONTFIX. The time it could have been fixed was while this was in testing, but it can't be fixed after the fact. I am going to put a couple of READMEs into place to address the issues found.