Bug 1317730 - [OSP 8.0 Bug]: Reboot errors unmounting NFS backed Cinder backends
Summary: [OSP 8.0 Bug]: Reboot errors unmounting NFS backed Cinder backends
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Paul Grist
QA Contact: Tzach Shefi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-15 02:13 UTC by Dave Cain
Modified: 2020-03-11 15:03 UTC
CC: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-29 18:56:58 UTC
Target Upstream Version:


Attachments (Terms of Use)
Console output from node having problems unmounting remote NFS storage used for Cinder (273.09 KB, image/jpeg)
2016-03-15 02:15 UTC, Dave Cain
Syslog output from host starting reboot until it rebooted 30 minutes later (1.59 MB, text/plain)
2016-03-15 02:16 UTC, Dave Cain

Description Dave Cain 2016-03-15 02:13:16 UTC
Description: When using the rhel-osp-director Beta7 and the associated Overcloud image (overcloud-full-8.0-20160219.3-beta-7.tar), I cannot cleanly reboot Controller machines when using NFS-backed storage for Cinder.  The nodes also boot from SAN via iSCSI and have remote root disks.  Compute machines have the same problem if they are running an instance backed by a volume hosted on the remote NFS storage.

If the system has a remote NFS mount and you try to reboot it, the node appears stuck trying to unmount the NFS volumes and never reboots without a hard power-off; alternatively, if you wait 30 minutes, systemd forces a reboot.  See the attached screenshot, which illustrates the observed behavior on the console.  See also the attached /var/log/messages snippet covering 20:47 to 21:17.

Can you tell me why this is happening, and what can be done to address it?  This used to work in OSP 6, where we had a boot-from-SAN '/' via iSCSI and remote NFS mounts for Cinder.


My environment now:
openstack-tripleo-image-elements-0.9.7-2.el7ost.noarch
openstack-tripleo-heat-templates-0.8.7-12.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.2-1.el7ost.noarch
openstack-tripleo-common-0.1.1-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.7-12.el7ost.noarch
python-tripleoclient-0.1.1-2.el7ost.noarch
openstack-tripleo-0.0.7-1.el7ost.noarch

If you need any other logs, please let me know.

Bugzilla dependencies (if any): N/A

Hardware dependencies (if any): N/A

Upstream information

Date it will be upstream: N/A

Version: RHEL-OSP8 Beta7

External links:


Severity (U/H/M/L): H

Business Priority: Must

Comment 2 Dave Cain 2016-03-15 02:15:49 UTC
Created attachment 1136347 [details]
Console output from node having problems unmounting remote NFS storage used for Cinder

Comment 3 Dave Cain 2016-03-15 02:16:35 UTC
Created attachment 1136348 [details]
Syslog output from host starting reboot until it rebooted 30 minutes later

Comment 4 Jaromir Coufal 2016-10-10 03:31:54 UTC
This seems like Storage DFG territory, but I'm not sure. Can somebody from the group at least help investigate where the issue is?

Comment 5 Paul Grist 2016-10-14 00:44:55 UTC
There's a lot of info here, but this was against the director beta on 8. This just came over to storage, so I'm adding a number of people to take a look. I'm not sure whether this is still an issue or was ever reproduced after the beta.

Comment 6 Paul Grist 2016-10-14 00:48:52 UTC
Tzach, do we have any recent NFS reboot tests to check whether this is still an issue? If not, is this something someone can try when NFS testing is being done?

I moved it off the 10 list to be reviewed for the next release, but please bring it back if it's still an issue.

Comment 7 Tzach Shefi 2016-11-14 12:33:39 UTC
Paul, we don't have any NFS reboot testing that I know of.

I'm bringing up an NFS-based system to help Omri's upgrade testing.
I'll use that system to check this bug and report here.

Keeping the needinfo flag for tracking.

Comment 8 Tzach Shefi 2016-11-29 16:53:21 UTC
I've just tested this on an RHOS 10 OSPD deployment (2016-11-29.1) with an NFS backend for Glance and Cinder.

Created a bootable volume from Cirros, plus another empty volume.
Booted three instances:
From the bootable volume
From an image, with a Cinder volume attached
From an image, without referencing a Cinder volume
The instances were spread across both compute nodes.

On the undercloud I sourced stackrc and issued:
nova reboot controller-0  (where the volume service was running)
nova reboot compute-0
nova reboot compute-1

All three worked; the reboot command rebooted them very fast, in a matter of a few seconds.

Post-procedure status: all three servers active and running.
The instances were all in SHUTOFF state, the default expected behavior after a compute node reboot.

So this looks like a non-issue on RHOS 10.
Paul - should I check older versions (9, 8, 7)?
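For the record, the verification steps above can be scripted so the same check is easy to repeat on older versions. A minimal sketch (node names are the ones from this report; the script defaults to a dry run that only prints the commands, since the real thing must run on an undercloud after sourcing the admin credentials):

```shell
#!/bin/sh
# Reboot check from comment 8, as a script. Defaults to a dry run
# (prints the commands); set DRY_RUN=0 on a real undercloud after
# loading credentials with: . ~/stackrc
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Reboot the controller hosting the volume service, then both computes.
for node in controller-0 compute-0 compute-1; do
    run nova reboot "$node"
done
```

After the reboots, `nova list` (or `openstack server list`) shows whether the nodes came back ACTIVE and the instances went to SHUTOFF as expected.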

Comment 9 Paul Grist 2016-11-29 18:56:58 UTC
Tzach, thanks for the verification and testing on OSP10!  Given the origin of this issue on director-8 beta, I think verifying on the current release is sufficient to close it out. Setting the state to mark this fixed in current release OSP10 rc.

Comment 10 tim.darnell 2017-01-12 23:21:10 UTC
We are experiencing this same behavior (a Glance NFS connection not unmounting when a 'reboot' command is issued from the shell, with systemd shutting the machine down 30 minutes later). We are using the OSP 10 GA release.

We did get behavior similar to comment 8 when issuing a 'nova reboot' of the node, so we see consistency there.

However, it seems that issuing a 'nova reboot' of a node does not attempt a clean shutdown of the underlying operating system; it issues a hard power reset of the node, which could leave artifacts.

Is this expected behavior when using 'nova reboot'? I (and I'm sure others) would like to see a graceful shutdown of the controller/compute node when issuing a reboot command, without having to wait 30 minutes.
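The 30-minute figure matches systemd's stock reboot.target, which ships with JobTimeoutSec=30min and JobTimeoutAction=reboot-force: if the shutdown transaction (including stuck NFS unmount jobs) hasn't completed within 30 minutes, systemd forces the reboot. As a stopgap (it shortens the hang but does not fix the stuck unmount itself), that window can be narrowed with a drop-in; a sketch, assuming a stock RHEL 7 systemd:

```ini
# /etc/systemd/system/reboot.target.d/timeout.conf
[Unit]
# Force the reboot after 5 minutes instead of the default 30
# if unmount jobs are still hanging.
JobTimeoutSec=5min
JobTimeoutAction=reboot-force
```

Run `systemctl daemon-reload` afterwards for the drop-in to take effect.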

Comment 11 Paul Grist 2017-01-30 19:10:04 UTC
I'm not directly familiar with the underlying operation, but it sounds like something is timing out on the clean shutdown and falling back to a hard power reset.

I will look for some Nova help here, but you may also want to ask your expected-behavior question on the openstack-dev [nova] mailing list to get more information.

Comment 12 Lee Yarwood 2017-01-31 14:25:49 UTC
(In reply to tim.darnell from comment #10)
> However, it seems that issuing a 'nova reboot' of a node does not attempt a
> clean shutdown of the underlying operating system on the node, it issues a
> hard power of the node which could leave artifacts.
> 
> Is this expected behavior when using 'nova reboot'? I (and I'm sure others)
> would like to see a graceful shutdown of the controller/compute node when
> issuing a reboot command without having to wait for 30 minutes.

As this is an Ironic reboot, it always performs a hard reboot in Newton. The ability to perform soft reboots only landed last week in Ocata and will not be backported to OSP 10:

Ironic: Add soft reboot support to ironic driver
https://review.openstack.org/#/c/403745/

