Bug 1383780
| Summary: | rhel-osp-director: Overcloud update fails with "httpd has stopped: ERROR: cluster remained unstable for more than 1800 seconds, exiting" | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | ||||||||
| Component: | rhosp-director | Assignee: | Andrew Beekhof <abeekhof> | ||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> | ||||||||
| Severity: | medium | Docs Contact: | |||||||||
| Priority: | medium | ||||||||||
| Version: | 8.0 (Liberty) | CC: | abeekhof, arkady_kanevsky, chjones, david_paterson, dbecker, fdinitto, gael_rehault, mburns, michele, morazi, randy_perryman, rhel-osp-director-maint, rscarazz, sasha, smerrow, sumedh_sathaye, wayne_allen | ||||||||
| Target Milestone: | async | Keywords: | Reopened, Triaged | ||||||||
| Target Release: | 8.0 (Liberty) | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2018-08-02 07:59:21 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1305654, 1335596 | ||||||||||
Description
Alexander Chuzhoy
2016-10-11 17:55:53 UTC
Created attachment 1209283 [details]
versionlock list
Anything in the journal on the controllers?

So without sosreports I will try to put out some thoughts here in the meantime. We fail in the following snippet of code:

+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait

Now when we call the disable for openstack-keystone in Liberty, we are basically asking to stop all the child services of the resource: http://acksyn.org/files/tripleo/liberty-new-install.pdf

A few possibilities come to mind:
- In OSP 8 we do not have the correct stop timeout for systemd resources (200s), so one of the child services failed to stop and this broke the process. Will need a sosreport to double-check this.
- We actually hit a known pacemaker bug that makes crm_resource --wait never terminate: https://bugzilla.redhat.com/show_bug.cgi?id=1349493

(In reply to Michele Baldessari from comment #5)
> A few possibilities come to mind:
> - In OSP 8 we do not have the correct stop timeout for systemd resources
> (200s), so one of the child services failed to stop and this broke the
> process.

This is what happened. The new nova-compute clone has the default timeouts instead of 200s or 300s. These operations timed out, and without fencing enabled the cluster was unable to do anything to continue recovery. This prevented openstack-nova-conductor-clone, libvirtd-compute-clone, and their dependencies from being stopped and caused the update to fail.
You want to run the following and re-test:

for RESOURCE in neutron-openvswitch-agent-compute-clone libvirtd-compute-clone ceilometer-compute-clone nova-compute-clone; do
    sudo pcs resource update $RESOURCE op start timeout=200s op stop timeout=200s
done

*** This bug has been marked as a duplicate of bug 1386186 ***

for RESOURCE in neutron-openvswitch-agent-compute-clone libvirtd-compute-clone ceilometer-compute-clone nova-compute-clone; do
    sudo pcs resource update $RESOURCE op start timeout=200s op stop timeout=200s
done
These resources do not exist. Can you define the correct resources?
If you have done this successfully, can you post your commands?

Created attachment 1212867 [details]
Timeouts from Updated and Upgraded Install
These are the timeouts from my install; as you can see, they are all set to 200 or more where asked.
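
For reference, the step that timed out in the trace quoted earlier in this bug (timeout -k 10 1800 crm_resource --wait) can be approximated as below. This is only a sketch reconstructed from that trace and from the error text in the summary, not the actual director script. The function name and the WAIT_CMD override are assumptions, added so the logic can be tried on a machine without a cluster.

```shell
# Hypothetical reconstruction, not the real update script.
# WAIT_CMD is an assumed override; by default it is the command from the trace.
WAIT_CMD=${WAIT_CMD:-"crm_resource --wait"}

wait_for_stable_cluster() {
    limit=$1
    rc=0
    # timeout(1) sends SIGTERM after $limit seconds and exits with 124;
    # -k 10 follows up with SIGKILL 10 seconds later if the command is
    # still running.
    timeout -k 10 "$limit" $WAIT_CMD || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "ERROR: cluster remained unstable for more than $limit seconds, exiting" >&2
    fi
    return "$rc"
}
```

Calling wait_for_stable_cluster 1800 on a controller would mirror the traced call. If a stop operation never completes (for example because it timed out and fencing is disabled), the wait cannot finish and the 1800 second limit is what ends it.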
Created attachment 1212871 [details]
resource timeouts from stock JS-5.0 install (OSP8)
timeout values for resources in my stock JS-5.0 (OSP8) install, fyi. Generated by:
sudo pcs resource | grep -v r8 | awk '{print $3}' | while read sedon ; do sudo pcs resource show $sedon; done > timeouts.dat
Are these ok? Seems like it AFAIK
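
One way to check a dump like timeouts.dat mechanically is to scan it for start/stop operations below a threshold. A minimal sketch, assuming the "pcs resource show" layout seen in this bug (a "Resource:" line followed by operation lines carrying timeout=N or timeout=Ns). The function name is hypothetical, other time units are not handled, and a resource with no start/stop operation defined at all will not be listed.

```shell
# Hypothetical helper: reads "pcs resource show" output on stdin and prints
# "<resource> <op> <timeout>" for start/stop ops below the given minimum
# (seconds, default 200). Ops that have no timeout= field print "unset".
flag_short_timeouts() {
    awk -v min="${1:-200}" '
        $1 == "Resource:" { res = $2; next }
        {
            # The first operation shares a line with the "Operations:" label.
            first = ($1 == "Operations:") ? 2 : 1
            op = $first
            if (op != "start" && op != "stop") next
            found = 0
            for (i = first + 1; i <= NF; i++) {
                if ($i ~ /^timeout=[0-9]+s?$/) {
                    found = 1
                    t = $i
                    gsub(/[^0-9]/, "", t)
                    if (t + 0 < min + 0) print res, op, t "s"
                }
            }
            if (!found) print res, op, "unset"
        }
    '
}
```

Usage would be along the lines of: flag_short_timeouts 200 < timeouts.dat. No output means every start/stop operation that was found meets the threshold.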
(In reply to Wayne Allen from comment #11)
> Created attachment 1212871 [details]
> resource timeouts from stock JS-5.0 install (OSP8)
>
> timeout values for resources in my stock JS-5.0 (OSP8) install, fyi.
> Generated by:
>
> sudo pcs resource | grep -v r8 | awk '{print $3}' | while read sedon ; do
> sudo pcs resource show $sedon; done > timeouts.dat
>
> Are these ok? Seems like it AFAIK

Yes, the starts and stops are all set to 200 or higher.

(In reply to Randy Perryman from comment #8)
> for RESOURCE in neutron-openvswitch-agent-compute-clone
> libvirtd-compute-clone ceilometer-compute-clone nova-compute-clone; do
> sudo pcs resource update $RESOURCE op start timeout=200s op stop
> timeout=200s
> done
>
>
> These Resource do not exist. Can you define the correct Resources?

Hi Randy, sorry for the delay. Those are all created as part of the instance HA overlay feature, which won't be part of a basic TripleO installation. Somehow I missed that this was an overcloud update. We do not currently expect updates or upgrades to work when the IHA feature has been configured (because it's not integrated with puppet and confuses the update logic), although we are working on addressing that. There was a thread on this in mid-October that I will bounce to you again, which details the current process for updates.

Reopening this BZ. We need the fix backported to OSP8 and OSP9, not just OSP10. Use this BZ for the OSP8/Liberty fix. Will dup 1386186 for OSP9.

We haven't seen this in the 6.0.1 update since we started patching it to set the timeout to 300s. I'll try to remember to see what the current rabbitmq timeout default is in a fresh 6.0.1 install to see if it should be closed...

In a fresh OSP 10 deployment the rabbit stop timeout is still set to 200 by default (we had issues with anything under 300):
 Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true
  Operations: monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
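
The 300s patch mentioned above is not shown in this bug. A sketch of what it might look like, reusing the "pcs resource update ... op stop timeout=..." form from the earlier comment. The function name and the DRY_RUN switch are hypothetical, and 300 is simply the value the reporters said worked for them, not a recommended default.

```shell
# Hypothetical helper: raise the stop timeout of one pacemaker resource.
# With DRY_RUN=1 (the default) it only prints the command it would run.
set_stop_timeout() {
    resource=$1
    seconds=$2
    cmd="pcs resource update $resource op stop timeout=${seconds}s"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        sudo $cmd
    fi
}
```

Running set_stop_timeout rabbitmq 300 prints the command for review; DRY_RUN=0 set_stop_timeout rabbitmq 300 would apply it on a controller.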
Setting needinfo to Andrew.

200s is already quite a long time. Can we get some updated logs that I can pass on to our rabbit engineers to ensure there isn't some deeper issue?

Can we get the logs asked for on https://bugzilla.redhat.com/show_bug.cgi?id=1383780#c19 ? We need to understand why 200s is not enough for rabbit to stop.

Sorry, but those logs are no longer available. The stamp in question has been rebuilt.

Thanks David, I'll close this one out for now and we can revisit if needed.