Bug 1275986 - Director update to 7.1 fails due to httpd service restart failure
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: y2
Target Release: 7.0 (Kilo)
Assigned To: Jiri Stransky
QA Contact: Marius Cornea
Keywords: Triaged
Reported: 2015-10-28 05:53 EDT by Jan Provaznik
Modified: 2015-12-21 11:57 EST
CC: 11 users

Fixed In Version: openstack-tripleo-heat-templates-0.8.6-78.el7ost
Doc Type: Bug Fix
Doc Text:
Pacemaker prevented Puppet from restarting services during an Overcloud update. This caused Puppet runs to fail, which in turn caused the Overcloud update to fail. This fix puts Pacemaker into maintenance mode for the duration of the Puppet run during the Overcloud update. Pacemaker now allows Puppet to restart services, and the Overcloud update no longer fails.
Last Closed: 2015-12-21 11:57:38 EST
Type: Bug




External Trackers: OpenStack gerrit 245093

Description Jan Provaznik 2015-10-28 05:53:40 EDT
Description of problem:
Update from 7.0 to 7.1 may fail intermittently because the httpd service fails to restart (strictly speaking, the restart completes and the httpd service ends up running, but the exit code is 1).

How reproducible:
This happens intermittently; it can be reproduced easily just by running "systemctl restart httpd" on an OC controller node. From time to time it fails with:
'/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.
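
A minimal reproduction sketch (the attempt count is arbitrary; run as root on the controller):

    # Restart httpd repeatedly to catch the intermittent
    # "Job for httpd.service canceled." failure.
    for i in $(seq 1 20); do
        systemctl restart httpd || { echo "restart failed on attempt $i"; break; }
    done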

When this is hit during OC update, the failed UpdateDeployment stderr is:
    "deploy_stderr": "\u001b[1;31mWarning: Scope(Class[Keystone]): Execution of db_sync does not depend on $enabled anymore. Please use sync_db instead.\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Glance::Registry]): Execution of db_sync does not depend on $manage_service or $enabled anymore. Please use sync_db instead.\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_host'; class ::nova::compute has not been evaluated\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_protocol'; class ::nova::compute has not been evaluated\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_port'; class ::nova::compute has not been evaluated\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_path'; class ::nova::compute has not been evaluated\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Concat::Setup]): concat::setup is deprecated as a public API of the concat module and should no longer be directly included in the manifest.\u001b[0m\n\u001b[1;31mWarning: Scope(Swift::Storage::Server[6002]): The default incoming_chmod set to 0644 may yield in error prone directories and will be changed in a later release.\u001b[0m\n\u001b[1;31mWarning: Scope(Swift::Storage::Server[6002]): The default outgoing_chmod set to 0644 may yield in error prone directories and will be changed in a later release.\u001b[0m\n\u001b[1;31mWarning: Scope(Swift::Storage::Server[6001]): The default incoming_chmod set to 0644 may yield in error prone directories and will be changed in a later release.\u001b[0m\n\u001b[1;31mWarning: Scope(Swift::Storage::Server[6001]): The default outgoing_chmod set to 0644 may yield in error prone directories and will be changed in a later release.\u001b[0m\n\u001b[1;31mWarning: Scope(Swift::Storage::Server[6000]): The default incoming_chmod set to 0644 may yield in error prone directories and will be changed in a later release.\u001b[0m\n\u001b[1;31mWarning: Scope(Swift::Storage::Server[6000]): The default outgoing_chmod set to 0644 may yield in error prone directories and will be changed in a later release.\u001b[0m\n\u001b[1;31mError: /Stage[main]/Apache::Service/Service[httpd]: Failed to call refresh: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.\u001b[0m\n\u001b[1;31mError: /Stage[main]/Apache::Service/Service[httpd]: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.\nWrapped exception:\nExecution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.\u001b[0m\n", 

This occurs whether or not pacemaker is in maintenance mode. I also tried bumping the systemd unit timeout, but that didn't help either. The issue might be caused by the fact that horizon CSS compression is done during httpd service start; this very suboptimal solution makes httpd take ~20 seconds to start.
Comment 2 Jan Provaznik 2015-10-29 06:46:06 EDT
Because of the randomness of this issue it's quite tricky to track down; anyway, so far it seems that the issue doesn't occur if pacemaker is stopped, or if the following pre-start scripts are commented out in /lib/systemd/system/httpd.service.d/openstack-dashboard.conf:
[Service]
ExecStartPre=/usr/bin/python /usr/share/openstack-dashboard/manage.py collectstatic --noinput 
ExecStartPre=/usr/bin/python /usr/share/openstack-dashboard/manage.py compress --force
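
As a hypothetical workaround sketch (not a fix), the pre-start scripts can also be neutralized with a systemd drop-in instead of editing the packaged unit; assigning an empty ExecStartPre= in a drop-in clears all previously defined pre-start commands:

    # Hypothetical drop-in path:
    # /etc/systemd/system/httpd.service.d/no-dashboard-compress.conf
    [Service]
    ExecStartPre=

    # then reload unit files:
    systemctl daemon-reload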
Comment 3 James Slagle 2015-11-02 14:57:06 EST
Jan, I thought we were always stopping pacemaker in yum_update.sh? Or do you mean we need to do a systemctl stop pacemaker instead of just a pcs cluster stop?

If it's the horizon compress css that's causing it, does this need to be a core bug instead?
Comment 4 Jan Provaznik 2015-11-02 15:17:48 EST
I think pacemaker is stopped during "yum update" run but is started again when yum finishes - https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/yum_update.sh#L73

httpd service (among others) is being stopped/started during post-deployment steps by puppet - and at this point it sometimes fails.

I wasn't able to track this deeply enough to identify the root issue. But I think that hooking CSS compression (especially one that takes a long time) into httpd service restart is a bug by itself (a postinst script sounds like a much better place). I think the horizon team is the best fit for this BZ :).
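
For reference, the two cluster states discussed above behave differently (a sketch, assuming pcs defaults):

    # Stopping the cluster on a node takes its managed resources down:
    pcs cluster stop

    # Maintenance mode leaves resources running but stops pacemaker from
    # reacting when something else (e.g. puppet) restarts them:
    pcs property set maintenance-mode=true
    # (perform the puppet run / service restarts here)
    pcs property set maintenance-mode=false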
Comment 5 Mark Chappell 2015-11-05 09:09:51 EST
This will also trigger similar issues for any template updates that trigger service restarts (for example changing configuration settings)
Comment 6 Mark Chappell 2015-11-05 09:15:53 EST
(config settings in any of the pcs managed resources, including most of the endpoints)
Comment 7 James Slagle 2015-11-06 13:11:53 EST
jan, is there a workaround for this one? please document it if there is one
Comment 8 Jan Provaznik 2015-11-06 13:56:12 EST
I don't have a workaround for this - IMO the most straightforward and cleanest fix would be to update the horizon package and move compression of CSS files to the RPM postinst phase.
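
A hypothetical sketch of that suggestion as an RPM scriptlet, reusing the commands from the ExecStartPre lines quoted in comment 2:

    %post
    # Compress dashboard assets once at package install time instead of
    # on every httpd start:
    /usr/bin/python /usr/share/openstack-dashboard/manage.py collectstatic --noinput
    /usr/bin/python /usr/share/openstack-dashboard/manage.py compress --force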
Comment 9 James Slagle 2015-11-06 17:23:35 EST
I actually got almost all the way through a complete update of an HA 3-controller, 1-compute, 1-ceph, net-iso single-NIC-with-VLANs environment on virt. But I hit this bug on 2 of the controllers. Now when I try to update again, to see if the issue is transient and can be worked around by just running the update again, I hit:

https://bugzilla.redhat.com/show_bug.cgi?id=1278975
https://bugzilla.redhat.com/show_bug.cgi?id=1278544
Comment 10 James Slagle 2015-11-10 21:53:01 EST
Here's what I'm seeing with this one in my environment. Note that I'm testing with the patch from https://review.openstack.org/#/c/242141 applied to my copy of tht. Note, however, that pacemaker is only put into maintenance mode on steps 1-3, and this httpd restart error happens during step 4. Do you think we need to expand the patch to also apply to step 4?

Puppet fails like this:

Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: to call refresh: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.\u001b[0m\n\u001b[1;31mError: /Stage[main]/Apache::Service/Service[httpd]: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.\nWrapped exception:\nExecution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.\u001b[0m\n", "deploy_status_code": 6}
Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: [2015-11-10 18:11:17,188] (heat-config) [DEBUG] [2015-11-10 18:08:26,715] (heat-config) [DEBUG] Running FACTER_heat_outputs_path="/var/run/heat-config/heat-config-puppet/f83000df-6c52-4fa3-b46a-08a8f73e9058"  FACTER_fqdn="overcloud-controller-1.localdomain"  FACTER_deploy_config_name="overcloud-ControllerNodesPostDeployment-b7xinlxthgsl-ControllerPuppetConfig-rsoqvnfattbx-ControllerPuppetConfigImpl-npn7t3oehyz5"  puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/f83000df-6c52-4fa3-b46a-08a8f73e9058.pp
Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: [2015-11-10 18:11:17,182] (heat-config) [INFO] Return code 6


<snip>

Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: Error: /Stage[main]/Apache::Service/Service[httpd]: Failed to call refresh: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.
Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: Error: /Stage[main]/Apache::Service/Service[httpd]: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.
Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: Wrapped exception:
Nov 10 18:11:17 overcloud-controller-1.localdomain os-collect-config[4076]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service canceled.

Notice it says the job was canceled.

journalctl -u httpd shows:

Nov 10 18:11:06 overcloud-controller-1.localdomain systemd[1]: Stopping Cluster Controlled httpd...
Nov 10 18:11:07 overcloud-controller-1.localdomain systemd[1]: Starting Cluster Controlled httpd...
Nov 10 18:11:08 overcloud-controller-1.localdomain python[19352]: WARNING:root:"dashboards" and "default_dashboard" in (local_)settings is DEPRECATED now and may be unsupported in some future release. The preferred way to spe
Nov 10 18:11:08 overcloud-controller-1.localdomain systemd[1]: Stopping Cluster Controlled httpd...
Nov 10 18:11:10 overcloud-controller-1.localdomain python[19352]: 0 static files copied to '/usr/share/openstack-dashboard/static', 818 unmodified.
Nov 10 18:11:12 overcloud-controller-1.localdomain python[19602]: WARNING:root:"dashboards" and "default_dashboard" in (local_)settings is DEPRECATED now and may be unsupported in some future release. The preferred way to spe
Nov 10 18:11:13 overcloud-controller-1.localdomain python[19602]: RemovedInDjango19Warning: "requires_model_validation" is deprecated in favor of "requires_system_checks".
Nov 10 18:11:13 overcloud-controller-1.localdomain python[19602]: WARNING:py.warnings:RemovedInDjango19Warning: "requires_model_validation" is deprecated in favor of "requires_system_checks".
Nov 10 18:11:13 overcloud-controller-1.localdomain python[19602]: RemovedInDjango19Warning: SortedDict is deprecated and will be removed in Django 1.9.
Nov 10 18:11:13 overcloud-controller-1.localdomain python[19602]: WARNING:py.warnings:RemovedInDjango19Warning: SortedDict is deprecated and will be removed in Django 1.9.
Nov 10 18:11:13 overcloud-controller-1.localdomain python[19602]: RemovedInDjango19Warning: Loading the `url` tag from the `future` library is deprecated and will be removed in Django 1.9. Use the default `url` tag instead.
Nov 10 18:11:13 overcloud-controller-1.localdomain python[19602]: WARNING:py.warnings:RemovedInDjango19Warning: Loading the `url` tag from the `future` library is deprecated and will be removed in Django 1.9. Use the default 
Nov 10 18:11:14 overcloud-controller-1.localdomain python[19602]: RemovedInDjango19Warning: SortedDict is deprecated and will be removed in Django 1.9.
Nov 10 18:11:14 overcloud-controller-1.localdomain python[19602]: WARNING:py.warnings:RemovedInDjango19Warning: SortedDict is deprecated and will be removed in Django 1.9.
Nov 10 18:11:16 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:18 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:18 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:18 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:18 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:18 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:30 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:31 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:31 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:31 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:31 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:31 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:41 overcloud-controller-1.localdomain python[19602]: ERROR:scss.expression:Function not found: twbs-font-path:1
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: Found 'compress' tags in:
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: /usr/lib/python2.7/site-packages/horizon/templates/horizon/_conf.html
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: /usr/share/openstack-dashboard/openstack_dashboard/dashboards/theme/templates/_stylesheets.html
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: /usr/lib/python2.7/site-packages/horizon/templates/horizon/_scripts.html
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: /usr/share/openstack-dashboard/openstack_dashboard/templates/_stylesheets.html
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: Compressing... done
Nov 10 18:11:42 overcloud-controller-1.localdomain python[19602]: Compressed 6 block(s) from 4 template(s).
Nov 10 18:11:42 overcloud-controller-1.localdomain systemd[1]: Stopped Cluster Controlled httpd.

So the sequence was: Stopping, Starting, Stopping, Stopped.

It's very likely that pacemaker is competing with puppet here for the service restart.

I do see the following in the pacemaker journalctl output:

Nov 10 18:11:28 overcloud-controller-1.localdomain crmd[28629]: error: Operation httpd_stop_0: Timed Out (node=overcloud-controller-1, call=765, timeout=31ms)

Finally, on the node that showed this error, /run/systemd/system/httpd.service.d/50-pacemaker.conf no longer existed at all. So something definitely got out of sync.
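
As a quick check for that desync (a sketch): pacemaker injects the transient drop-in while it controls the service, so its presence can be verified before puppet restarts httpd:

    # Should exist while httpd is cluster-controlled:
    ls -l /run/systemd/system/httpd.service.d/50-pacemaker.conf
    # Show the effective unit file, including any pacemaker drop-in:
    systemctl cat httpd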
Comment 11 James Slagle 2015-11-10 21:56:52 EST
(In reply to James Slagle from comment #10)
> Here's what I'm seeing with this one in my environment. Note that I'm
> testing with the patch from https://review.openstack.org/#/c/242141 applied
> to my copy of tht. Note that however pacemaker is only put into maintenance
> mode on steps 1-3, and this httpd restart error happens during step 4. Do
> you think we need to expand the patch to also apply to step 4?

Actually, I might be incorrect on that. Looks like the httpd restart failure was still under step 3.
Comment 12 James Slagle 2015-11-10 21:58:57 EST
(In reply to James Slagle from comment #11)
> (In reply to James Slagle from comment #10)
> > Here's what I'm seeing with this one in my environment. Note that I'm
> > testing with the patch from https://review.openstack.org/#/c/242141 applied
> > to my copy of tht. Note that however pacemaker is only put into maintenance
> > mode on steps 1-3, and this httpd restart error happens during step 4. Do
> > you think we need to expand the patch to also apply to step 4?
> 
> Actually, I might be incorrect on that. Looks like the httpd restart failure
> was still under step 3.

Indeed the hiera value of the step variable was 3. But the resource that applies step 3 is called: ControllerOvercloudServicesDeployment_Step4

A bit confusing there :)
Comment 13 Mark Chappell 2015-11-11 04:58:52 EST
Expanding to step 4 should be fine; what we need to be careful about is making sure that we don't put *new* clusters into maintenance mode, as that prevents the whole thing from starting up. I was starting to lean towards a script run before the initial puppet run and after step 3/4 (in puppet terms, rather than template terms), which would bring the cluster into and out of maintenance mode, but would do nothing if the cluster hadn't been started up yet (i.e. fresh builds); see the sketch below.
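
A rough sketch of such a script (hypothetical; the guard is the important part):

    #!/bin/bash
    # Only toggle maintenance mode on clusters that are already running,
    # so fresh deployments (cluster not yet started) are left alone.
    if pcs status >/dev/null 2>&1; then
        pcs property set maintenance-mode=true
        # (run the puppet steps here)
        pcs property set maintenance-mode=false
    fi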
Comment 14 James Slagle 2015-11-16 15:14:09 EST
jistr, assigning this one to you as I think it will be fixed under the larger umbrella of fixing pacemaker/puppet service restart issues.
Comment 16 Marius Cornea 2015-12-15 09:38:43 EST
I did several updates, both 7.0 -> 7.2 and 7.1 -> 7.2, and I didn't hit the issue reported here, so I will move this to verified.
Comment 18 errata-xmlrpc 2015-12-21 11:57:38 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2651
