Bug 1703946 - Nova_compute container stuck in restarting
Summary: Nova_compute container stuck in restarting
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 13.0 (Queens)
Assignee: Andrew Beekhof
QA Contact: nlevinki
URL:
Whiteboard:
Depends On: 1704870 1721198
Blocks:
Reported: 2019-04-29 07:50 UTC by Sandeep Yadav
Modified: 2024-06-13 22:06 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-25 11:40:36 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-20438 0 None None None 2022-11-24 12:23:23 UTC
Red Hat Knowledge Base (Solution) 4847411 0 None None None 2020-02-20 18:48:17 UTC

Description Sandeep Yadav 2019-04-29 07:50:51 UTC
Description of problem:


* Nova_compute container stuck in restarting

# docker ps
CONTAINER ID        IMAGE                                                                                  COMMAND             CREATED             STATUS                      PORTS               NAMES

d2cba025bf7e        wb-sdc-sat01.wbsdc.in:5000/wbsdc-osp13_containers-nova-compute:13.0-78                 "kolla_start"       2 months ago        Up 16 minutes (unhealthy)                       nova_compute
c5afdd8d57e5        wb-sdc-sat01.wbsdc.in:5000/wbsdc-osp13_containers-nova-compute:13.0-78                 "kolla_start"       2 months ago        Up 16 minutes (healthy)                         nova_migration_target


docker logs nova_compute
~~~
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared

Waiting for evacuations to complete or fail
Waiting for evacuations to complete or fail
Waiting for evacuations to complete or fail
Waiting for evacuations to complete or fail
~~~

No instances on the compute node:
~~~
# virsh list --all
 Id    Name                           State
--------------------------------------------------
~~~

* If we enable fencing, the compute nodes do not come up.
* When we disable fencing and fence manually, it works.

~~~
]# attrd_updater --query --all --name=evacuate
name="evacuate" host="*-compute07" value="no"
name="evacuate" host="*-compute08." value="no"
name="evacuate" host="*-compute04" value="yes"
~~~

Version-Release number of selected component (if applicable):

OSP 13 + Instance HA

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 9 Andrew Beekhof 2019-05-21 08:34:15 UTC
> Waiting for evacuations to complete or fail

This is the key issue.
Nova compute must not be started until all evacuations have either completed or failed.

The problem we have seen in the past is that some evacuations get stuck.
Although there is a nova command to list non-live migrations, there is no command that allows non-live migrations to be purged.

You will need to connect to Galera and update the relevant database table directly, setting the status of the affected migration to 'failed'.
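
As a rough sketch of that kind of cleanup, assuming a mysql client on a controller with access to the nova database and the stock `migrations` schema (the placeholder UUID and the status filter are illustrative, so verify them against the live table first):

~~~
# List evacuation records that are not in a terminal state
mysql nova -e "SELECT id, source_compute, dest_compute, status, instance_uuid
               FROM migrations
               WHERE migration_type = 'evacuation'
                 AND status NOT IN ('completed', 'done', 'failed', 'error');"

# Mark a stuck evacuation as failed (replace <instance-uuid> with the real UUID)
mysql nova -e "UPDATE migrations SET status = 'failed'
               WHERE instance_uuid = '<instance-uuid>'
                 AND migration_type = 'evacuation'
                 AND status = 'accepted';"
~~~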

Comment 10 Andrew Beekhof 2019-05-21 08:37:12 UTC
Also, this is not a command that should be run on a regular basis:

> [root@wb-sdc-controller03 ~]# pcs resource cleanup --force;sleep 60

In a large cluster, the load it introduces can itself cause failures on occasion.
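
If a cleanup is genuinely needed, scoping it to a single resource and node keeps the load down; a sketch using the pcs 0.9 syntax that ships with OSP 13 and the resource/node names from this cluster:

~~~
# Clean up only the failed resource, on only the affected node,
# instead of a cluster-wide forced cleanup
pcs resource cleanup compute-unfence-trigger --node wb-sdc-compute07
~~~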

Comment 11 Yadnesh Kulkarni 2019-05-21 10:01:24 UTC
Hello Andrew,

Earlier, when this issue occurred, the status in the migrations table was "pre-migrating". This time, however, the instances are in the "error" state. Shall we still proceed with updating the migrations table, setting the records for all instances with status "error" to "failed"?

Also, the instances that have "error" status in the migrations table do not lie under compute node 7, where the nova-compute container is unhealthy. They all reside on compute node 14.

Regards,
Yadnesh K

Comment 13 Andrew Beekhof 2019-05-21 11:34:43 UTC
(In reply to Yadnesh Kulkarni from comment #11)
> Hello Andrew,
> 
> Earlier when this issue occurred the status was "pre-migrating" in the
> migrations table. Although this time the instances are in "error" state.
> Shall we still proceed with updating the records migrations table for all
> the instances having status error to failed?

Shouldn't be necessary.
According to https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/extraconfig/tasks/instanceha/check-run-nova-compute#L52 , "error" is fine.

> 
> Also, the instances which have error status in migrations table do not lie
> under compute node 7 where the nova-compute container is unhealthy. They all
> reside on a the compute node 14.

Are these live migrations or the evacuation kind?
It might be worth running check-run-nova-compute manually to see if it still thinks there is a problem.
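
One quick way to answer the first question is to look at the migration_type of the records in question; a sketch assuming the same mysql access to the nova database as above:

~~~
# Show whether the "error" records are evacuations or live migrations
mysql nova -e "SELECT id, migration_type, source_compute, dest_compute, status
               FROM migrations
               WHERE status = 'error';"
~~~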

Comment 28 Andrew Beekhof 2019-05-28 03:15:50 UTC
From the output of x-text/check-run-nova-compute, we see that we are delaying nova from coming up because the node is yet to be unfenced:

"Waiting for fence-down flag to be cleared"


I'm currently looking into why that's the case.

Looking at the migrations table, I would expect wb-sdc-compute08 to have trouble coming back up next time:

[abeekhof@collab-shell x-text]$ cat migrations.sql | sed 's/),(/),\n(/'g | grep -e running -e progress -e accepted
('2019-04-11 13:21:45','2019-05-23 18:31:05','2019-04-15 12:03:51',17985,'wb-sdc-compute08.wbsdc.in',NULL,NULL,'accepted','94cfd444-93a3-4d6c-8c35-080090dfeebf',NULL,NULL,'wb-sdc-compute08.wbsdc.in',NULL,17985,'evacuation',0,NULL,NULL,NULL,NULL,NULL,NULL,'a41eac31-7f37-4ed6-b97b-7e1c2286333a'),
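
To see whether the fence-down flag is still set for that node, the per-node attribute can be queried directly; this assumes the flag maps onto the per-node evacuate attribute that the description already queries:

~~~
# Query the evacuate attribute for the node expected to have trouble
attrd_updater --query --name=evacuate --node=wb-sdc-compute08

# A value of "yes" here would suggest the evacuation for this node
# has not been processed yet and nova_compute will keep waiting
~~~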

Comment 29 Andrew Beekhof 2019-05-28 03:53:54 UTC
Running crm_simulate against sos_commands/pacemaker/crm_report/wb-sdc-controller03/cib.xml.live, I see:

 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     compute-unfence-trigger	(ocf::pacemaker:Dummy):	FAILED wb-sdc-compute08 (blocked)
     compute-unfence-trigger	(ocf::pacemaker:Dummy):	FAILED wb-sdc-compute14 (blocked)
     compute-unfence-trigger	(ocf::pacemaker:Dummy):	FAILED wb-sdc-compute07 (blocked)
     compute-unfence-trigger	(ocf::pacemaker:Dummy):	FAILED wb-sdc-compute13 (blocked)

The reason those are blocked is that stonith has been disabled:

        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>

Presumably it was disabled to avoid those nodes being repeatedly fenced because of failed stop actions such as:

May 27 03:45:51 [648538] wb-sdc-controller03    pengine:  warning: unpack_rsc_op_failure:	Processing failed stop of compute-unfence-trigger:11 on wb-sdc-compute07: unknown | rc=189

Which is symptomatic of a connection bug that is being tracked in pacemaker (I don't have the bug number handy at the moment).

Until we get that bug resolved, can we please use the following as a work-around:

   pcs resource update compute-unfence-trigger op stop timeout=20 on-fail=block 

This will allow fencing to be re-enabled without resulting in the compute being fenced every time compute-unfence-trigger fails in this way.
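
For completeness, a sketch of applying the workaround and then turning fencing back on, using the pcs 0.9 syntax available on OSP 13 (the property and resource names are the ones already discussed):

~~~
# Apply the workaround and confirm the stop operation settings took effect
pcs resource update compute-unfence-trigger op stop timeout=20 on-fail=block
pcs resource show compute-unfence-trigger

# Re-enable fencing afterwards
pcs property set stonith-enabled=true
pcs property show stonith-enabled
~~~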

Comment 38 Andrew Beekhof 2019-06-04 04:35:11 UTC
Progress...

It looks like we've found a root cause.
There is a bug related to poorly timed remote shutdowns that results in the controller's connection to that remote being permanently wedged (until pacemaker is restarted on that controller).

This explains why the cleanup did nothing. 

While we work on a fix, it would be a good idea to check how the client's IPMI devices work.
Some devices implement OFF as a soft power down (the equivalent of running shutdown from the command line) followed by a hard power off (cutting the power).
If this is the case for the client, it would definitely contribute to them experiencing this issue, and it would be worth reconfiguring it to just do the hard power off if at all possible.
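
A rough way to review that from a controller, assuming fence_ipmilan-style stonith devices (the BMC address and credentials below are placeholders); whether OFF is a graceful shutdown followed by a power cut is usually a BMC-side setting, so the device documentation or web UI is the place to confirm it:

~~~
# Review how the fence devices are configured (agent, method, delays, etc.)
pcs stonith show --full

# Confirm BMC reachability and current chassis state with the same credentials
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status
~~~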

