Description: When using rhel-osp-director Beta 7 and its associated overcloud image (overcloud-full-8.0-20160219.3-beta-7.tar), I cannot cleanly reboot Controller machines when Cinder uses NFS-backed storage. The nodes also boot from SAN via iSCSI and have remote root disks. Compute machines show the same problem if they are running an instance backed by a volume hosted on the remote NFS storage.

If the system has a remote NFS mount and you try to reboot it, shutdown appears to hang while unmounting the NFS volumes; the node never reboots unless you hard power it off, or, after roughly 30 minutes, systemd forces the reboot. See the attached console screenshot illustrating the observed behavior, and the attached /var/log/messages snippet covering 20:47 to 21:17.

Can you tell me why this is happening, and what can be done to address it? This worked in OSP 6, where we also booted '/' from SAN via iSCSI and used remote NFS mounts for Cinder.

My environment now:
openstack-tripleo-image-elements-0.9.7-2.el7ost.noarch
openstack-tripleo-heat-templates-0.8.7-12.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.2-1.el7ost.noarch
openstack-tripleo-common-0.1.1-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.7-12.el7ost.noarch
python-tripleoclient-0.1.1-2.el7ost.noarch
openstack-tripleo-0.0.7-1.el7ost.noarch

If you need any other logs, please let me know.

Bugzilla dependencies (if any): N/A
Hardware dependencies (if any): N/A
Upstream information
Date it will be upstream: N/A
Version: RHEL-OSP8 Beta7
External links:
Severity (U/H/M/L): H
Business Priority: Must
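A note for anyone debugging this class of hang: one common cause (not confirmed as the root cause here) is NFS mounts that lack the _netdev option, so systemd tears the network down before it tries to unmount the share, and the unmount then blocks until a timeout forces progress. A minimal fstab sketch with the ordering hint in place; the server address and mount point below are hypothetical placeholders, not taken from this environment:

```
# /etc/fstab (sketch -- server address and path are made-up placeholders)
# _netdev marks the filesystem as network-dependent, so systemd unmounts
# it before shutting the network down at reboot/poweroff
192.0.2.10:/export/cinder  /var/lib/cinder/mnt  nfs  defaults,_netdev  0 0
```

With _netdev in place, systemd orders the unmount before network teardown; for mounts made at runtime outside fstab, the ordering depends on systemd recognizing the filesystem type as network-backed.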
Created attachment 1136347 [details] Console output from node having problems unmounting remote NFS storage used for Cinder
Created attachment 1136348 [details] Syslog output from host starting reboot until it rebooted 30 minutes later
Seems like Storage DFG, but I'm not sure. Can somebody from that group at least help investigate where the issue is?
There's a lot of info here, but this was reported against the director beta on 8. This just came over to Storage, and I'm adding a number of people to take a look. Not sure whether this is still an issue or whether it was reproduced after the beta.
Tzach, do we have any recent NFS reboot tests to check whether this is still an issue? If not, is this something someone can try when NFS testing is being done? I moved it off the 10 list to be reviewed for the next release, but please bring it back if it's still an issue.
Paul, we don't have any NFS reboot testing that I know of. I'm bringing up an NFS-based system to help Omri's upgrade testing; I'll use that system to check this bug and report here. Keeping the needinfo flag for tracking.
I've just tested this on a RHOS 10 OSPD deployment (2016-11-29.1) with an NFS backend for Glance and Cinder. I created a bootable volume from Cirros, plus another empty volume, and booted three instances:
- from the bootable volume
- from an image with an attached Cinder volume
- from an image without any Cinder volume, as a reference
The instances were spread across both compute nodes. On the undercloud I sourced stackrc and issued:
nova reboot controller-0 (where the volume service was running)
nova reboot compute-0
nova reboot compute-1
All three worked; the reset command rebooted them very fast, a matter of a few seconds. Post-procedure, all three servers were active and running. The instances were all in shutdown state, the default expected behavior after a compute node reboot. So this looks like a non-issue on RHOS 10. Paul, should I check older versions (9/8/7)?
Tzach, thanks for the verification and testing on OSP 10! Given that this issue originated on the director 8 beta, I think verifying on the current release is sufficient to close it out. Setting the state to mark this fixed in the current release (OSP 10 RC).
We are experiencing this same behavior (the Glance NFS share does not unmount when a 'reboot' command is issued from the shell, and systemd then forces the machine down 30 minutes later). We are using the OSP 10 GA release. We did see behavior similar to comment 8 when issuing a 'nova reboot' of the node, so we see consistency there. However, it seems that issuing a 'nova reboot' of a node does not attempt a clean shutdown of the underlying operating system; it issues a hard power reset of the node, which could leave artifacts behind. Is this expected behavior when using 'nova reboot'? I (and I'm sure others) would like to see a graceful shutdown of the controller/compute node when issuing a reboot command, without having to wait 30 minutes.
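For what it's worth, the 30-minute wait matches the stock systemd reboot.target, which ships with JobTimeoutSec=30min and JobTimeoutAction=reboot-force, so a hung NFS unmount stalls the whole shutdown until that timer forces the reboot. One workaround sketch (not a validated fix) is a drop-in that caps how long systemd waits to unmount the share; the unit name below is a hypothetical placeholder and must be derived from the real mount point:

```
# /etc/systemd/system/var-lib-cinder-mnt.mount.d/timeout.conf
# (unit name is a placeholder -- generate the real one with:
#  systemd-escape -p --suffix=mount /var/lib/cinder/mnt)
[Mount]
TimeoutSec=30s
```

After creating the drop-in, run systemctl daemon-reload for it to take effect. This does not fix the unmount hang itself; it only bounds how long shutdown waits on it.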
I'm not directly familiar with the underlying operation, but it almost sounds like something is timing out on a clean shutdown, which then leads to a hard power reset. I'll look for some Nova help here, but you may want to ask your expected-behavior question on the openstack-dev [nova] mailing list to get more info.
(In reply to tim.darnell from comment #10)
> However, it seems that issuing a 'nova reboot' of a node does not attempt a
> clean shutdown of the underlying operating system on the node, it issues a
> hard power of the node which could leave artifacts.
>
> Is this expected behavior when using 'nova reboot'? I (and I'm sure others)
> would like to see a graceful shutdown of the controller/compute node when
> issuing a reboot command without having to wait for 30 minutes.

Since this is an Ironic reboot, it always performs a hard reboot in Newton. The ability to perform soft reboots only landed last week in Ocata and will not be backported to OSP 10:

Ironic: Add soft reboot support to ironic driver
https://review.openstack.org/#/c/403745/