Description of problem:

* Nova_compute container stuck in restarting

# docker ps
CONTAINER ID        IMAGE                                                                     COMMAND         CREATED         STATUS                       PORTS   NAMES
d2cba025bf7e        wb-sdc-sat01.wbsdc.in:5000/wbsdc-osp13_containers-nova-compute:13.0-78   "kolla_start"   2 months ago    Up 16 minutes (unhealthy)            nova_compute
c5afdd8d57e5        wb-sdc-sat01.wbsdc.in:5000/wbsdc-osp13_containers-nova-compute:13.0-78   "kolla_start"   2 months ago    Up 16 minutes (healthy)              nova_migration_target

docker logs nova_compute
~~~
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for evacuations to complete or fail
Waiting for evacuations to complete or fail
Waiting for evacuations to complete or fail
Waiting for evacuations to complete or fail
~~~

NO Instance on compute
~~~
# virsh list --all
 Id    Name                           State
--------------------------------------------------
~~~

* If we Enable fencing, Compute nodes are not coming up
* When we disable fencing and manually fence, it is working

~~~
]# attrd_updater --query --all --name=evacuate
name="evacuate" host="*-compute07" value="no"
name="evacuate" host="*-compute08." value="no"
name="evacuate" host="*-compute04" value="yes"
~~~

Version-Release number of selected component (if applicable):
OSP 13 + Instance HA

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
> Waiting for evacuations to complete or fail

This is the key issue. nova-compute must not be started until all evacuations have either completed or failed. The problem we have seen in the past is that some evacuations get stuck. Although there is a nova command to list non-live migrations, there is no command that allows a non-live migration to be purged. You will need to connect to Galera and update the relevant database table directly, setting the status of the affected migration to 'failed'.
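For reference, a minimal sketch of that database work-around, assuming the standard nova database and the migrations table shown later in this report; exactly how you reach mysql (directly or through the galera container) depends on the deployment, and the id in the UPDATE is a placeholder for the stuck row:

~~~
# Find evacuation records that are neither finished nor failed.
mysql nova -e "SELECT id, instance_uuid, source_compute, status \
               FROM migrations \
               WHERE migration_type='evacuation' \
                 AND status NOT IN ('done','completed','failed','error');"

# Mark the stuck record as failed so nova-compute is allowed to start.
# <migration_id> is a placeholder -- substitute the id returned above.
mysql nova -e "UPDATE migrations SET status='failed' WHERE id=<migration_id>;"
~~~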
Also, this is not a command that should be run on a regular basis:

> [root@wb-sdc-controller03 ~]# pcs resource cleanup --force;sleep 60

The load it introduces in a large cluster can even create failures on occasion.
Hello Andrew,

The last time this issue occurred, the status in the migrations table was "pre-migrating". This time, however, the records are in "error" state. Shall we still proceed with updating the migrations table from "error" to "failed" for all affected instances?

Also, the instances which have "error" status in the migrations table do not belong to compute node 7, where the nova_compute container is unhealthy. They all reside on compute node 14.

Regards,
Yadnesh K
(In reply to Yadnesh Kulkarni from comment #11)
> Hello Andrew,
>
> The last time this issue occurred, the status in the migrations table was
> "pre-migrating". This time, however, the records are in "error" state.
> Shall we still proceed with updating the migrations table from "error" to
> "failed" for all affected instances?

That shouldn't be necessary. According to
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/extraconfig/tasks/instanceha/check-run-nova-compute#L52 ,
"error" is fine.

> Also, the instances which have "error" status in the migrations table do
> not belong to compute node 7, where the nova_compute container is
> unhealthy. They all reside on compute node 14.

Are these live migrations or the evacuation kind? It might be worth running check-run-nova-compute manually to see if it still thinks there is a problem.
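If it is easier than running the script, the migration type can also be read straight from the database; a sketch along the lines of the earlier query, with placeholder instance UUIDs for the affected instances on compute node 14:

~~~
# Placeholder UUIDs -- substitute the instances reported on compute node 14.
# The migration_type column distinguishes 'evacuation' from live migrations.
mysql nova -e "SELECT id, instance_uuid, source_compute, dest_compute, \
                      migration_type, status \
               FROM migrations \
               WHERE status='error' \
                 AND instance_uuid IN ('<uuid-1>','<uuid-2>');"
~~~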
From the output of x-text/check-run-nova-compute, we see that we are delaying nova from coming up because the node is yet to be unfenced:

  "Waiting for fence-down flag to be cleared"

I'm currently looking into why that's the case.

Looking at the migrations table, I would expect wb-sdc-compute08 to have trouble coming back up next time:

[abeekhof@collab-shell x-text]$ cat migrations.sql | sed 's/),(/),\n(/g' | grep -e running -e progress -e accepted
('2019-04-11 13:21:45','2019-05-23 18:31:05','2019-04-15 12:03:51',17985,'wb-sdc-compute08.wbsdc.in',NULL,NULL,'accepted','94cfd444-93a3-4d6c-8c35-080090dfeebf',NULL,NULL,'wb-sdc-compute08.wbsdc.in',NULL,17985,'evacuation',0,NULL,NULL,NULL,NULL,NULL,NULL,'a41eac31-7f37-4ed6-b97b-7e1c2286333a'),
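As a quick check of the related per-host Instance HA state (a sketch; whether the "fence-down flag" the script prints corresponds exactly to this attribute would need confirming against the script source), the evacuate attribute from the problem description can be queried for the suspect node:

~~~
# Same query as in the problem description, filtered to the node whose
# stuck 'accepted' evacuation record is shown above.
attrd_updater --query --all --name=evacuate | grep compute08
~~~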
Running crm_simulate against sos_commands/pacemaker/crm_report/wb-sdc-controller03/cib.xml.live, I see:

 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     compute-unfence-trigger   (ocf::pacemaker:Dummy):   FAILED wb-sdc-compute08 (blocked)
     compute-unfence-trigger   (ocf::pacemaker:Dummy):   FAILED wb-sdc-compute14 (blocked)
     compute-unfence-trigger   (ocf::pacemaker:Dummy):   FAILED wb-sdc-compute07 (blocked)
     compute-unfence-trigger   (ocf::pacemaker:Dummy):   FAILED wb-sdc-compute13 (blocked)

The reason those are blocked is that stonith has been disabled:

  <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>

Presumably it was disabled to avoid those nodes being repeatedly fenced because of failed stop actions such as:

May 27 03:45:51 [648538] wb-sdc-controller03 pengine: warning: unpack_rsc_op_failure: Processing failed stop of compute-unfence-trigger:11 on wb-sdc-compute07: unknown | rc=189

This is symptomatic of a connection bug that is being tracked in pacemaker (I don't have the bug number handy at the moment). Until we get that bug resolved, can we please use the following as a work-around:

  pcs resource update compute-unfence-trigger op stop timeout=20 on-fail=block

This will allow fencing to be re-enabled without resulting in the computes being fenced every time compute-unfence-trigger fails in this way.
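With the work-around in place, re-enabling fencing would then just be a matter of flipping the property shown above back; a sketch, to be run from any controller:

~~~
# Work-around: block rather than fence when a stop of
# compute-unfence-trigger fails (the pacemaker bug mentioned above).
pcs resource update compute-unfence-trigger op stop timeout=20 on-fail=block

# Then turn fencing back on (currently "false" in the CIB snippet above).
pcs property set stonith-enabled=true
~~~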
Progress... It looks like we've found a root cause.

There is a bug related to poorly timed remote shutdowns that results in the controller's connection to that remote being permanently wedged (until pacemaker is restarted on that controller). This explains why the cleanup did nothing.

While we work on a fix, it would be a good idea to check how the client's IPMI devices work. Some devices implement OFF as a soft power down (the equivalent of running shutdown from the command line) followed by a hard power off (cutting the power). If this is the case for the client, it would definitely contribute to them experiencing this issue, and it would be worth reconfiguring the devices to just do the hard power off if at all possible.
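One way to verify the OFF behaviour (a sketch, assuming ipmitool access to the BMCs; the address and credentials are placeholders) is to issue the power-off directly and watch whether the OS gets a chance to shut down first:

~~~
# Placeholders: substitute the real BMC address and credentials.
# "chassis power off" should be an immediate hard power off; if the node
# instead performs an orderly OS shutdown first, the BMC is treating OFF
# as soft-then-hard and should be reconfigured to do only the hard off.
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power off
~~~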