Bug 1703946
| Summary: | Nova_compute container stuck in restarting | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sandeep Yadav <sandyada> |
| Component: | puppet-pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | nlevinki <nlevinki> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 13.0 (Queens) | CC: | abeekhof, afariasa, chjones, dhill, jjoyce, jschluet, kgaillot, kmehta, kthakre, ltamagno, michele, pkomarov, rhos-maint, slinaber, tvignaud, ykulkarn |
| Target Milestone: | async | Keywords: | Triaged, ZStream |
| Target Release: | 13.0 (Queens) | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-25 11:40:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1704870, 1721198 | ||
| Bug Blocks: | | | |
Description
Sandeep Yadav
2019-04-29 07:50:51 UTC
> Waiting for evacuations to complete or fail
This is the key issue.
Nova compute must not be started until all evacuations have either completed or failed.
The problem we have seen in the past is that some evacuations get stuck.
Although there is a nova command to list non-live migrations, there is no command that allows non-live migrations to be purged.
What you will need to do is connect to Galera and update the relevant database table directly, setting the status of the affected migration to 'failed'.
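A minimal sketch of that update, assuming the standard Queens nova schema (the table is `migrations`; the instance UUID below is a hypothetical placeholder). Building the statement first lets you review it before running it against Galera:

```shell
# Sketch only: assumes the standard nova 'migrations' table; the UUID is a
# hypothetical placeholder for the stuck evacuation's instance.
UUID="11111111-2222-3333-4444-555555555555"
SQL="UPDATE migrations SET status='failed' WHERE instance_uuid='${UUID}' AND status='pre-migrating';"
echo "$SQL"
# Then, on a Galera member, run it against the nova database, e.g.:
#   mysql nova -e "$SQL"
```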
Also, this is not a command that should be run on a regular basis:
> [root@wb-sdc-controller03 ~]# pcs resource cleanup --force;sleep 60
The load it introduces in a large cluster can even create failures on occasion.
Hello Andrew,

Earlier when this issue occurred, the status in the migrations table was "pre-migrating". This time, however, the instances are in the "error" state. Shall we still proceed with updating the migrations-table records from "error" to "failed" for all affected instances?

Also, the instances with "error" status in the migrations table do not sit under compute node 7, where the nova-compute container is unhealthy. They all reside on compute node 14.

Regards,
Yadnesh K

(In reply to Yadnesh Kulkarni from comment #11)
> Earlier when this issue occurred, the status in the migrations table was "pre-migrating". This time, however, the instances are in the "error" state. Shall we still proceed with updating the migrations-table records from "error" to "failed" for all affected instances?

Shouldn't be necessary. According to https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/extraconfig/tasks/instanceha/check-run-nova-compute#L52 , "error" is fine.

> Also, the instances with "error" status in the migrations table do not sit under compute node 7, where the nova-compute container is unhealthy. They all reside on compute node 14.

Are these live migrations or the evacuation kind? It might be worth running check-run-nova-compute manually to see if it still thinks there is a problem.

From the output of x-text/check-run-nova-compute, we see that we are delaying nova from coming up because the node has yet to be unfenced:
"Waiting for fence-down flag to be cleared"
I'm currently looking into why that is the case.
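The gating behaviour described above can be sketched roughly as follows. This is an illustration, not the actual check-run-nova-compute script; the exact set of "unfinished" state names here is an assumption based on the states discussed in this bug ("pre-migrating", "accepted", "running") versus the terminal ones ("error", "failed", "completed"):

```shell
# Rough illustration of the startup gate -- NOT the real script.
# Unfinished evacuation states hold nova-compute back; terminal states do not.
blocks_startup() {
  case "$1" in
    accepted|pre-migrating|running) echo "yes" ;;
    *) echo "no" ;;
  esac
}
blocks_startup "pre-migrating"   # prints "yes" -> keep waiting
blocks_startup "error"           # prints "no"  -> safe to start nova-compute
```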
Looking at the migrations table, I would expect wb-sdc-compute08 to have trouble coming back up next time:
[abeekhof@collab-shell x-text]$ cat migrations.sql | sed 's/),(/),\n(/'g | grep -e running -e progress -e accepted
('2019-04-11 13:21:45','2019-05-23 18:31:05','2019-04-15 12:03:51',17985,'wb-sdc-compute08.wbsdc.in',NULL,NULL,'accepted','94cfd444-93a3-4d6c-8c35-080090dfeebf',NULL,NULL,'wb-sdc-compute08.wbsdc.in',NULL,17985,'evacuation',0,NULL,NULL,NULL,NULL,NULL,NULL,'a41eac31-7f37-4ed6-b97b-7e1c2286333a'),
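The same split-and-filter idea can be tried on a small sample to see what it keeps; the two rows below are invented for illustration, in the same one-line tuple style as the dump:

```shell
# Invented two-row sample; the filter splits the tuples onto separate lines
# and keeps only rows in unfinished states (running/progress/accepted).
cat > /tmp/migrations_sample.sql <<'EOF'
('2019-04-11','wb-sdc-compute08.wbsdc.in','accepted','evacuation'),('2019-04-12','wb-sdc-compute14.wbsdc.in','completed','evacuation')
EOF
sed 's/),(/),\n(/g' /tmp/migrations_sample.sql | grep -e running -e progress -e accepted
# keeps only the 'accepted' row; the 'completed' one is dropped
```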
Running crm_simulate against sos_commands/pacemaker/crm_report/wb-sdc-controller03/cib.xml.live, I see:
Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
compute-unfence-trigger (ocf::pacemaker:Dummy): FAILED wb-sdc-compute08 (blocked)
compute-unfence-trigger (ocf::pacemaker:Dummy): FAILED wb-sdc-compute14 (blocked)
compute-unfence-trigger (ocf::pacemaker:Dummy): FAILED wb-sdc-compute07 (blocked)
compute-unfence-trigger (ocf::pacemaker:Dummy): FAILED wb-sdc-compute13 (blocked)
The reason those are blocked is that stonith has been disabled:
<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
Presumably it was disabled to avoid those nodes being repeatedly fenced because of failed stop actions such as:
May 27 03:45:51 [648538] wb-sdc-controller03 pengine: warning: unpack_rsc_op_failure: Processing failed stop of compute-unfence-trigger:11 on wb-sdc-compute07: unknown | rc=189
This is symptomatic of a connection bug that is being tracked in pacemaker (I don't have the bug number handy at the moment).
Until we get that bug resolved, can we please use the following as a work-around:
pcs resource update compute-unfence-trigger op stop timeout=20 on-fail=block
This will allow fencing to be re-enabled without resulting in the compute being fenced every time compute-unfence-trigger fails in this way.
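Putting the two steps together (a command fragment for reference; it assumes the pcs CLI on a live cluster, so it cannot be dry-run here):

```shell
# 1. Apply the stop-op workaround so a failed stop blocks instead of fencing:
pcs resource update compute-unfence-trigger op stop timeout=20 on-fail=block
# 2. With that in place, fencing can be turned back on:
pcs property set stonith-enabled=true
```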
Progress... It looks like we've found a root cause.

There is a bug related to poorly timed remote shutdowns that results in the controller's connection to that remote being permanently wedged (until pacemaker is restarted on that controller). This explains why the cleanup did nothing.

While we work on a fix, it would be a good idea to check how the client's IPMI devices work. Some devices implement OFF as a soft power down (the equivalent of running shutdown from the command line) followed by a hard power off (cutting the power). If this is the case for the client, it would definitely contribute to them experiencing this issue, and it would be worth reconfiguring the devices to just do the hard power off if at all possible.
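One way to compare the two behaviours on a BMC is with ipmitool (a sketch: the address and credentials are placeholders, and not every BMC distinguishes the two actions):

```shell
# Placeholders: replace <bmc-address>, <user>, <pass> with real values.
# "power off" is the immediate hard cut -- what a fence device should do:
ipmitool -I lanplus -H <bmc-address> -U <user> -P <pass> chassis power off
# ...as opposed to the graceful, ACPI-style shutdown request:
# ipmitool -I lanplus -H <bmc-address> -U <user> -P <pass> chassis power soft
```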