Bug 1565950 - [Virt Setup] After rebooting controller, using 'openstack server reboot', openstack says controller is active, virsh says 'shut off'
Summary: [Virt Setup] After rebooting controller, using 'openstack server reboot', ope...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-11 07:04 UTC by Roee Agiman
Modified: 2020-12-21 19:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-04 08:41:11 UTC
Target Upstream Version:



Description Roee Agiman 2018-04-11 07:04:43 UTC
Description of problem:
I have a non-HA setup with 1 controller, 2 computes and 1 undercloud.
I tried to change the configuration in neutron.conf, adding 'segments' to 'service_plugins'.
Instead of restarting the service, I decided to reboot the whole controller.
The openstack client showed that the reboot went through fine, and according to the client the controller is now back in ACTIVE state with the same IP.

openstack server list (source stackrc):
83d4b8b3-0744-43e5-98e7-b6276c153c16 | controller-0 | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | controller

But when trying to ping/SSH - the controller is unreachable.
Ping:
[stack@undercloud-0 ~]$ ping 192.168.24.11
PING 192.168.24.11 (192.168.24.11) 56(84) bytes of data.
From 192.168.24.1 icmp_seq=1 Destination Host Unreachable
SSH:
[stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.11
ssh: connect to host 192.168.24.11 port 22: No route to host

Running virsh list on the hypervisor shows this:
[]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     undercloud-0                   running
 12    compute-0                      running
 13    compute-1                      running
 -     controller-0                   shut off


Version-Release number of selected component (if applicable):
OSP13 Puddle 2018-04-03.3

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP13
2. Add 'segments' to 'service_plugins' in neutron.conf (not sure if related)
3. Reboot controller

Actual results:
Controller unreachable

Expected results:
Controller reachable

Additional info:

Comment 1 Roee Agiman 2018-04-11 07:13:00 UTC
Update - step 2 is definitely unrelated; the reboot alone, with 'openstack server reboot controller-0', is enough to reproduce this.

Update 2 - A working workaround is to execute 'virsh start controller-0' from the hypervisor.

Comment 3 Artom Lifshitz 2018-08-01 13:56:43 UTC
Removing triaged keyword, this was never actually looked at by the Compute DFG.

Comment 4 Lee Yarwood 2018-08-03 12:56:07 UTC
Moving this back to DF DFG and openstack-ironic, as this isn't a compute DFG issue. We don't support any of the lower-level OVB or virtualBMC code used to manage the power state of the virt overcloud hosts.

If there's an issue with how Nova is handling the high level reboot call then please spell that out here, otherwise this looks like an issue specific to OVB/virtualBMC or whatever ironic is talking to reboot these hosts.

Comment 5 Derek Higgins 2018-09-20 15:50:52 UTC
Hi Roee,
   Are you still able to reproduce this? I'm looking for some logs if you have them, in particular the ironic-conductor and virsh logs from your virt host.

Did you notice whether "controller-0" on the virt host started and then shut down again, or did it just stay shut down?

Also what was the output of 
$ openstack baremetal node list

Comment 6 Roee Agiman 2018-10-02 10:47:23 UTC
I restarted the controller using the command provided in the description.
Right after the command was executed, the controller was in 'REBOOT' state (status column of openstack server list).
In virsh, controller-0 was shown as 'running'.

After the status switched back to 'ACTIVE' in the openstack client, the controller was still refusing connections:
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.10
ssh: connect to host 192.168.24.10 port 22: Connection refused

This looks a bit different from before though - at least now I'm getting pings back.

After a while of 'Connection refused', the controller switched to 'shut off' in the virsh output.

Now the ssh request is no longer refused but fails with 'No route to host'.
Pings are also gone.

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| 318ea15b-d3c2-486c-8382-e7b054ee9cea | compute-0    | da421ea2-2889-4905-bf1e-250aaf66a7d5 | power on    | active             | False       |
| d0999304-be95-4f27-a955-48d1d744d59b | compute-1    | 87464425-a446-4e8e-9fce-4a7971810edc | power on    | active             | False       |
| 172df189-c548-451a-870e-9dc2eff9913f | controller-0 | fe69dac5-3a54-48c5-8a88-77a07add2927 | power off   | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Let me know if additional information is required.
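
The mismatch in the table above (controller-0 is 'power off' while its provisioning state is still 'active') can be picked out mechanically. This is only a minimal Python sketch, assuming the text printed by `openstack baremetal node list` has been saved as a string. The names `parse_table` and `powered_off_but_active` are made up for illustration and are not part of any OpenStack client.

```python
# Sketch: find nodes that ironic reports as powered off although they are
# still provisioned ("active"). Helper names are hypothetical.

SAMPLE = """
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| 318ea15b-d3c2-486c-8382-e7b054ee9cea | compute-0    | da421ea2-2889-4905-bf1e-250aaf66a7d5 | power on    | active             | False       |
| d0999304-be95-4f27-a955-48d1d744d59b | compute-1    | 87464425-a446-4e8e-9fce-4a7971810edc | power on    | active             | False       |
| 172df189-c548-451a-870e-9dc2eff9913f | controller-0 | fe69dac5-3a54-48c5-8a88-77a07add2927 | power off   | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
"""


def parse_table(text):
    """Turn the ASCII table into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines and the +----+ separators
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def powered_off_but_active(rows):
    """Names of nodes that are provisioned but reported as powered off."""
    return [
        row["Name"]
        for row in rows
        if row["Power State"] == "power off"
        and row["Provisioning State"] == "active"
    ]


print(powered_off_but_active(parse_table(SAMPLE)))
```

With the table from this comment, the only node reported is controller-0.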

Comment 7 Roee Agiman 2018-10-02 10:48:00 UTC
By the way - this is the puddle.

(undercloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed 
13   -p 2018-09-28.1

Comment 8 Derek Higgins 2018-11-05 14:58:58 UTC
(In reply to Roee Agiman from comment #6)
> Let me know if additional information is required.

Could you attach the ironic-conductor logs and the virsh logs from your virt host?

Comment 11 Derek Higgins 2018-11-12 11:35:22 UTC
According to your conductor logs, the nova conductor ran a soft shutdown at
2018-11-12 05:37:51, but the VM shut down at 2018-11-12 10:59:38.

Are the times in sync, or was there really a 5+ hour delay?
If so, do you have BMC logs associated with the same event so we can see what it was doing? Also, if you check the controller journal log, does it show anything happening over the same time period?
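
For reference, the gap between the two timestamps quoted in this comment can be worked out as below. This is only a small Python sketch; it assumes both timestamps are in the same time zone, which is exactly what the question about clock sync is trying to establish. The helper name `gap` is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gap(start, end):
    """Difference between two timestamps given in the format above."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Timestamps taken from the comment: soft shutdown request vs. VM shutdown.
delay = gap("2018-11-12 05:37:51", "2018-11-12 10:59:38")
print(delay)  # 5:21:47
```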

Comment 13 Derek Higgins 2018-11-13 11:31:26 UTC
The fact that ssh stopped responding immediately while ping is still working suggests that the shutdown has been initiated but something is blocking it.

If a service is ignoring SIGTERM, the shutdown will be blocked until a timeout (probably 2.5 minutes). You'll see this in the controller journal:

Nov 13 09:46:07 controller systemd[1]: XXX.service stop-sigterm timed out. Killing.
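
To look for that message in a saved journal dump, something like the following Python sketch could be used. It assumes the journal has been exported to plain text lines and only matches the wording quoted above; other systemd versions may phrase the message differently. The function name `units_killed_after_sigterm` is made up for illustration.

```python
import re

# Matches the message wording quoted in this comment. This is an assumption
# about the exact format; adjust it if your systemd prints it differently.
TIMEOUT_RE = re.compile(
    r"systemd\[1\]: (?P<unit>\S+) stop-sigterm timed out\. Killing\."
)


def units_killed_after_sigterm(journal_lines):
    """Return the units that had to be killed after ignoring SIGTERM."""
    units = []
    for line in journal_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            units.append(match.group("unit"))
    return units


sample = [
    "Nov 13 09:46:07 controller systemd[1]: XXX.service stop-sigterm timed out. Killing.",
    "some unrelated line",  # placeholder, not real log output
]
print(units_killed_after_sigterm(sample))  # ['XXX.service']
```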

But you've mentioned that the delay is more like 10 minutes, so I suspect it's a process that has gone into an "uninterruptible sleep" and is ignoring SIGKILL. This usually happens if a process is waiting on IO (e.g. an NFS server can't be contacted). Can you check the journal log on the controller for any indication of a process not shutting down when it should?
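
One way to check for such processes while the shutdown is hanging is to look for state "D" in /proc. The Python sketch below assumes a Linux /proc filesystem and would have to be run on the controller itself, for example from a console, since ssh is already down at that point. The function names are made up for illustration.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so the state is the first field after the last ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, name) for processes in uninterruptible sleep (state D)."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as fh:
                line = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        if parse_stat_state(line) == "D":
            name = line[line.index("(") + 1:line.rindex(")")]
            found.append((int(entry), name))
    return found


if __name__ == "__main__":
    for pid, name in uninterruptible_processes():
        print(pid, name)
```

A process that stays in this list for minutes during shutdown would be a good candidate for whatever is blocking the power off.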

