Bug 1854950
| Summary: | Instances can't access their volumes during FFU OSP10->OSP13 | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ganesh Kadam <gkadam> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Alan Bishop <abishop> |
| Status: | CLOSED ERRATA | QA Contact: | Tzach Shefi <tshefi> |
| Severity: | high | Docs Contact: | Chuck Copello <ccopello> |
| Priority: | high | ||
| Version: | 13.0 (Queens) | CC: | abishop, igallagh, mburns, nbourgeo |
| Target Milestone: | z13 | Keywords: | Triaged, ZStream |
| Target Release: | 13.0 (Queens) | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-heat-templates-8.4.1-63.el7ost | Doc Type: | Bug Fix |
| Doc Text: |
Before this update, instances were unable to access their volumes after upgrading from RHOSP 10 to RHOSP 13, because the NFS share used as a backend for OpenStack Block Storage (cinder) was not unmounted before the OpenStack Block Storage services were migrated from the host to containers. As a result, when the containerized service started up and changed the ownership of all files in the OpenStack Block Storage service directory, it also changed the ownership of the files on the NFS share.
With this update, OpenStack Block Storage NFS shares are unmounted prior to upgrading the services to run in containers. This resolves the issue, and instances can now access their volumes after upgrading to RHOSP 13.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-10-28 18:23:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
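To illustrate the behavior the Doc Text describes, here is a minimal shell sketch of unmounting leftover Block Storage NFS shares before the containerized service takes over. This is not the actual tripleo-heat-templates upgrade task; the /var/lib/cinder/mnt mount-point layout is an assumption based on the cinder NFS driver's default behavior.

```bash
# Sketch only -- not the actual tripleo-heat-templates fix.
# Assumes the cinder NFS driver's default layout, where shares are
# mounted under /var/lib/cinder/mnt/<hash> on the controller host.
awk '$3 ~ /^nfs/ && $2 ~ /^\/var\/lib\/cinder\// {print $2}' /proc/mounts |
while read -r mnt; do
    echo "Unmounting leftover cinder NFS share: $mnt"
    umount "$mnt"
done
```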
Comment 5
Alan Bishop
2020-07-10 18:41:05 UTC
It's definitely not a NetApp issue, and because the problem is specific to cinder I plan to fix it. That's why I assigned the bz to myself, and I've already started working on it. Kolla executes the recursive chown only when the top level /var/lib/cinder directory's ownership isn't cinder:kolla. Kolla should only need to execute the chown once, and so the customer shouldn't experience any more problems. Unless, of course, the customer has additional clouds that are scheduled for FFU to OSP-13.

Tzach Shefi

Alan,

As this involves FFU, which is a long and tedious process, I'd like to confirm my verification steps before I take a stab at this. My plan of action:

1. Deploy an OSP10 system, with cinder using NetApp NFS as a backend.
2. Boot up an instance or two with volumes attached, write to the volumes.
3. Start the FFU upgrade to OSP13, reach the controller upgrade step.
4. Verify that I still have access to the volumes from inside the instances.
5. Complete the FFU and recheck instance/volume access.

Sounds easy enough. The only bit that worries me is your comment #5 -> "Under normal circumstances, there won't be any active NFS mounts inside the cinder-volume container prior to when the service starts. However, in a FFU scenario, there may be an NFS mount on the host leftover from when cinder ran on the host"

Is there a way I can trigger this? Should I manually create a mount on the host just to test comment #5?

Thanks

Alan Bishop

Sorry Tzach, I can see how that statement is concerning, but your plan of action looks fine. What I meant is that in a fully containerized deployment, at the time kolla executes the recursive chown there will not be any active NFS mounts associated with the cinder-volume service. That's because kolla hasn't started c-vol yet! That's what I meant by "under normal circumstances."

Your steps 1 and 2 will create the FFU situation where there -are- NFS mounts (the ones left over from OSP-10). The fix ensures these mounts are removed during the FFU process, so they're torn down prior to kolla executing the chown.
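As an aside, here is a minimal sketch of the conditional recursive chown described above. It is an illustration only, not kolla's actual start script, and assumes /var/lib/cinder is the state directory bind-mounted into the cinder-volume container.

```bash
# Illustration of the ownership check described in comment 5 -- not
# kolla's actual start script. Assumes /var/lib/cinder is bind-mounted
# into the cinder-volume container.
CINDER_DIR=/var/lib/cinder
if [ "$(stat -c '%U:%G' "$CINDER_DIR")" != "cinder:kolla" ]; then
    # If an NFS share is still mounted under $CINDER_DIR at this point,
    # the recursion also rewrites ownership of the files on the share,
    # which is what broke volume access during the FFU.
    chown -R cinder:kolla "$CINDER_DIR"
fi
```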
Tzach Shefi

Verified on: openstack-tripleo-heat-templates-8.4.1-68.el7ost.noarch

Installed an OSP10 system with cinder backed by NetApp NFS. Created two NFS-backed volumes attached to two separate instances, one on each of the two compute nodes. Created filesystems and mounted the volumes, wrote a test file on both volumes. Used a watch -n command to re-read both test files every 5 seconds.

Started the FFU process:

```
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud upgrade run --roles Controller --skip-tags validation
..
..
PLAY RECAP *********************************************************************
controller-0 : ok=21 changed=4 unreachable=0 failed=0
controller-1 : ok=21 changed=4 unreachable=0 failed=0
controller-2 : ok=21 changed=4 unreachable=0 failed=0

Thursday 15 October 2020 08:00:29 -0400 (0:00:00.389) 0:00:34.897 ******
===============================================================================
Updated nodes - Controller
Success
Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
```

Up to this point there was no issue; both instances' volumes and files remained accessible during the controller upgrade. BZ verified as working properly: before this fix the volumes would disconnect, which didn't happen in my case.

For anyone doing this upgrade: during the undercloud upgrade I had to bump OSP10 to 13.0-RHEL-7/7.7-latest/ (2020-03-10.1); I don't recall which 13z this is. As OSP10 is RHEL 7.7 and OSP13 z13 is RHEL 7.9, without this temporary upgrade step I hit dependency issues. With this workaround I was able to upgrade the undercloud from OSP10 to OSP13 z13 (RHEL 7.9) and then start the overcloud upgrade.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388
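For anyone reproducing the verification above, a minimal sketch of the in-guest check (format the attached volume, write a test file, and re-read it during the controller upgrade). The /dev/vdb device name and the mount point are assumptions; adjust for your environment.

```bash
# Run inside each guest after the cinder volume is attached.
# Assumes the volume appears as /dev/vdb (hypothetical device name).
mkfs.ext4 /dev/vdb
mkdir -p /mnt/vol
mount /dev/vdb /mnt/vol
echo "pre-upgrade data" > /mnt/vol/testfile
# Re-read the file every 5 seconds; a loss of access to the volume
# during the controller upgrade shows up here as read/I-O errors.
watch -n 5 cat /mnt/vol/testfile
```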