It took a while, but I tracked down the problem. When cinder-volume runs in a container, kolla itself (prior to launching cinder-volume) will do a "chown -R cinder:kolla /var/lib/cinder". This is conceptually OK, because cinder's service directory is meant to be used only by the cinder service. The problem occurs when there's an active NFS mount under /var/lib/cinder, because that causes the ownership of all files on the NFS share to be changed. Under normal circumstances, there won't be any active NFS mounts inside the cinder-volume container prior to when the service starts. However, in an FFU scenario, there may be an NFS mount on the host left over from when cinder ran on the host. The FFU (and normal upgrade) process needs to ensure there are no NFS shares mounted under /var/lib/cinder prior to launching the containerized cinder-volume service.
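To illustrate the kind of guard needed (a rough sketch only, with illustrative commands, not the actual fix): before kolla's chown runs, any NFS mounts under /var/lib/cinder have to be found and unmounted, e.g.

# list any active NFS mounts under cinder's state directory
findmnt -rn -t nfs,nfs4 -o TARGET | grep '^/var/lib/cinder'

# unmount each one before the recursive chown runs
for m in $(findmnt -rn -t nfs,nfs4 -o TARGET | grep '^/var/lib/cinder'); do
    umount "$m"
done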
It's definitely not a NetApp issue, and because the problem is specific to cinder I plan to fix it. That's why I assigned the BZ to myself, and I've already started working on it.
Kolla executes the recursive chown only when the top-level /var/lib/cinder directory's ownership isn't cinder:kolla. Kolla should only need to execute the chown once, so the customer shouldn't experience any more problems, unless, of course, the customer has additional clouds scheduled for FFU to OSP-13.
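Conceptually, the check kolla performs amounts to something like this (a sketch of the described behavior, not kolla's actual code):

if [ "$(stat -c '%U:%G' /var/lib/cinder)" != "cinder:kolla" ]; then
    chown -R cinder:kolla /var/lib/cinder
fi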
Alan, as this involves FFU, which is a long and tedious process, I'd like to confirm my verification steps before I take a stab at this. My plan of action:
1. Deploy an OSP10 system with Cinder using NetApp NFS as a backend.
2. Boot up an instance or two with volumes attached, and write to the volumes.
3. Start the FFU upgrade to OSP13 and reach the controller upgrade step.
4. Verify that I still have access to the volumes from inside the instances.
5. Complete the FFU and recheck instance/volume access.
Sounds easy enough; the only bit that worries me is this from comment #5: "Under normal circumstances, there won't be any active NFS mounts inside the cinder-volume container prior to when the service starts. However, in an FFU scenario, there may be an NFS mount on the host left over from when cinder ran on the host."
Is there a way I can trigger this? Should I manually create a mount on the host just to test comment #5?
Thanks
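If a manual mount were needed, I assume something along these lines would simulate the leftover mount (the NFS server, export, and mount point are placeholders):

mkdir -p /var/lib/cinder/mnt/test
mount -t nfs netapp.example.com:/cinder_share /var/lib/cinder/mnt/test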
Sorry Tzach, I can see how that statement is concerning, but your plan of action looks fine. What I meant is that in a fully containerized deployment, at the time kolla executes the recursive chown there will not be any active NFS mounts associated with the cinder-volume service. That's because kolla hasn't started c-vol yet! That's what I meant by "under normal circumstances." Your steps 1 and 2 will create the FFU situation where there -are- NFS mounts (the ones left over from OSP-10). The fix ensures these mounts are removed during the FFU process, so they're torn down prior to kolla executing the chown.
Verified on: openstack-tripleo-heat-templates-8.4.1-68.el7ost.noarch

Installed an OSP10 system with Cinder backed by NetApp NFS. Created two NFS-backed volumes attached to two separate instances, one on each of the two compute nodes. Created a filesystem on and mounted each volume, and wrote a test file on both volumes. Used a "watch -n 5" command to review both test files every 5 seconds.

Started the FFU process:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud upgrade run --roles Controller --skip-tags validation
..
..
PLAY RECAP *********************************************************************
controller-0 : ok=21 changed=4 unreachable=0 failed=0
controller-1 : ok=21 changed=4 unreachable=0 failed=0
controller-2 : ok=21 changed=4 unreachable=0 failed=0
Thursday 15 October 2020 08:00:29 -0400 (0:00:00.389) 0:00:34.897 ******
===============================================================================
Updated nodes - Controller
Success
Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']

Up to this point there was no issue; both instances' volumes and files remained accessible during the controller upgrade. BZ verified as working properly: before this fix the volumes would disconnect, which did not happen in my case.

For anyone doing this upgrade: during the undercloud upgrade I had to bump OSP10 to 13.0-RHEL-7/7.7-latest/ (2020-03-10.1); I don't recall which 13z this is. As OSP10 is RHEL 7.7 and OSP13z13 is RHEL 7.9, without this temporary upgrade step I hit dependency issues. With this workaround I was able to upgrade the undercloud from OSP10 to OSP13z13 (RHEL 7.9) and then start the overcloud upgrade.
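For reference, the check I kept running inside each instance was of this form (the guest-side mount point and file name are illustrative):

watch -n 5 cat /mnt/vol1/testfile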
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388