Description of problem:

Hey, I have a non-HA setup with 1 controller, 2 computes and 1 undercloud. I tried to change the configuration in neutron.conf, adding 'segments' to 'service_plugins'. Instead of restarting the service I decided to reboot the whole controller. The openstack client showed that it went through the reboot process fine, and according to the client it is now back to ACTIVE with the same IP.

openstack server list (source stackrc):
| 83d4b8b3-0744-43e5-98e7-b6276c153c16 | controller-0 | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | controller |

But when trying to ping/SSH, the controller is unreachable.

Ping:
[stack@undercloud-0 ~]$ ping 192.168.24.11
PING 192.168.24.11 (192.168.24.11) 56(84) bytes of data.
From 192.168.24.1 icmp_seq=1 Destination Host Unreachable

SSH:
[stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.11
ssh: connect to host 192.168.24.11 port 22: No route to host

Looking at virsh list on the hypervisor, I'm seeing this:
[]# virsh list --all
 Id    Name                 State
----------------------------------------------------
 3     undercloud-0         running
 12    compute-0            running
 13    compute-1            running
 -     controller-0         shut off

Version-Release number of selected component (if applicable):
OSP13 Puddle 2018-04-03.3

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP13
2. Add 'segments' to 'service_plugins' in neutron.conf (not sure if related)
3. Reboot the controller

Actual results:
Controller unreachable

Expected results:
Controller reachable

Additional info:
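For reference, the change in step 2 amounts to something like the following in neutron.conf (the exact file location and the existing plugin list depend on the deployment; the value below is only an example):

[DEFAULT]
# example only - append 'segments' to whatever plugins are already configured
service_plugins = router,qos,segments

followed by a restart of neutron-server rather than a full controller reboot.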
Update - step 2 is unrelated for sure; the issue reproduces with just the reboot via 'openstack server reboot controller-0'.

Update 2 - A working workaround is to execute 'virsh start controller-0' from the hypervisor.
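To spell out the workaround, on the hypervisor something along these lines does it (domain name taken from the virsh output above):

# confirm the domain is really shut off
virsh list --all
# start it again manually
virsh start controller-0
# check that it stays running
virsh domstate controller-0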
Removing triaged keyword, this was never actually looked at by the Compute DFG.
Moving this back to the DF DFG and openstack-ironic as this isn't a compute DFG issue; we don't support any of the lower-level OVB or VirtualBMC code used to manage the power state of the virt overcloud hosts. If there's an issue with how Nova is handling the high-level reboot call then please spell that out here; otherwise this looks like an issue specific to OVB/VirtualBMC or whatever ironic is talking to in order to reboot these hosts.
Hi Roee, are you still able to reproduce this? I'm looking for some logs if you have them, in particular the ironic-conductor and virsh logs from your virt host. Did you notice whether "controller-0" on the virt host started and then shut down again, or did it just stay shut down? Also, what was the output of:

$ openstack baremetal node list
Restarted the controller using the command provided in the description.

First thing after the command executed, the server went into 'REBOOT' state (openstack server list - status), and on virsh controller-0 appears to be 'running'.

After switching back to 'ACTIVE' in the openstack client, the controller is still refusing connections:

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.10
ssh: connect to host 192.168.24.10 port 22: Connection refused

This looks a bit different from before though - at least now I'm getting pings back.

After a while of 'Connection refused', the domain switched to 'shut off' in the virsh output. Now the ssh request is no longer refused but fails with 'No route to host', and pings are gone as well.

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| 318ea15b-d3c2-486c-8382-e7b054ee9cea | compute-0    | da421ea2-2889-4905-bf1e-250aaf66a7d5 | power on    | active             | False       |
| d0999304-be95-4f27-a955-48d1d744d59b | compute-1    | 87464425-a446-4e8e-9fce-4a7971810edc | power on    | active             | False       |
| 172df189-c548-451a-870e-9dc2eff9913f | controller-0 | fe69dac5-3a54-48c5-8a88-77a07add2927 | power off   | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Let me know if additional information is required.
By the way, this is the puddle:

(undercloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
13 -p 2018-09-28.1
(In reply to Roee Agiman from comment #6)
> Let me know if additional information is required.

Could you attach the ironic-conductor logs and the virsh logs from your virt host?
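In case it helps with gathering them, something like this is usually enough (the paths are a guess - on a containerized OSP13 undercloud the conductor log normally lives under /var/log/containers/ironic/, and libvirt keeps per-domain logs on the virt host):

(undercloud) $ sudo tail -n 500 /var/log/containers/ironic/ironic-conductor.log
(virt host)  $ sudo tail -n 500 /var/log/libvirt/qemu/controller-0.log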
According to your conductor logs, the nova conductor ran a soft shutdown at 2018-11-12 05:37:51 but the VM didn't shut down until 2018-11-12 10:59:38. Are the times in sync, i.e. was there really a 5+ hour delay? If so, do you have BMC logs associated with the same event so we can see what it was doing? Also, if you check the controller journal, does it show anything happening over the same time period?
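If it helps narrow things down, something along these lines on the controller (once it has been started again via the virsh workaround) should show what the journal recorded in that window - the timestamps are the ones from the conductor log above and assume the clocks are roughly in sync:

$ sudo journalctl --since "2018-11-12 05:30:00" --until "2018-11-12 11:05:00" | grep -iE "sigterm|timed out|killing|shutdown"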
The fact that ssh stopped responding immediately while ping kept working suggests that the shutdown was initiated but something is blocking it. If a service is ignoring SIGTERM then the shutdown will be blocked until a timeout (probably 2.5 minutes); you'll see this in the controller journal as

Nov 13 09:46:07 controller systemd[1]: XXX.service stop-sigterm timed out. Killing.

But you've mentioned that the delay is more like 10 minutes, so I suspect it's a process that has gone into an "uninterruptible sleep" and is ignoring SIGKILL. This usually happens when a process is waiting on I/O (e.g. an NFS server can't be contacted). Can you check the journal log on the controller for any indication of a process not shutting down when it should?
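If you can catch it before the next reboot attempt, something like this should list any process stuck in uninterruptible sleep (state 'D' in the ps output) along with what it is waiting on:

$ ps -eo pid,stat,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'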