It took a while, but I tracked down the problem. When cinder-volume runs in a container, kolla itself (prior to launching cinder-volume) will do a "chown -R cinder:kolla /var/lib/cinder". This is conceptually OK, because cinder's service directory is meant to be used only by the cinder service. The problem occurs when there's an active NFS mount under /var/lib/cinder, because that causes the ownership of all files on the NFS share to be changed. Under normal circumstances, there won't be any active NFS mounts inside the cinder-volume container prior to when the service starts. However, in an FFU scenario, there may be an NFS mount on the host left over from when cinder ran on the host. The FFU (and normal upgrade) process needs to ensure there are no NFS shares mounted under /var/lib/cinder prior to launching the containerized cinder-volume service.
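To illustrate the kind of guard needed (a rough sketch only, with illustrative commands, not the actual fix): before kolla's chown runs, any NFS mounts under /var/lib/cinder have to be found and unmounted, e.g.

# list any active NFS mounts under cinder's state directory
findmnt -rn -t nfs,nfs4 -o TARGET | grep '^/var/lib/cinder'

# unmount each one before the recursive chown runs
for m in $(findmnt -rn -t nfs,nfs4 -o TARGET | grep '^/var/lib/cinder'); do
    umount "$m"
done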
It's definitely not a NetApp issue, and because the problem is specific to cinder I plan to fix it. That's why I assigned the BZ to myself, and I've already started working on it.
Kolla executes the recursive chown only when the top-level /var/lib/cinder directory's ownership isn't cinder:kolla. Kolla should only need to execute the chown once, so the customer shouldn't experience any more problems, unless, of course, the customer has additional clouds scheduled for FFU to OSP-13.
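Conceptually, the check kolla performs amounts to something like this (a sketch of the described behavior, not kolla's actual code):

if [ "$(stat -c '%U:%G' /var/lib/cinder)" != "cinder:kolla" ]; then
    chown -R cinder:kolla /var/lib/cinder
fi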
Alan, as this involves FFU, which is a long and tedious process, I'd like to confirm my verification steps before I take a stab at this. My plan of action:
1. Deploy an OSP10 system with Cinder using NetApp NFS as a backend.
2. Boot up an instance or two with volumes attached, and write to the volumes.
3. Start the FFU upgrade to OSP13 and reach the controller upgrade step.
4. Verify that I still have access to the volumes from inside the instances.
5. Complete the FFU and recheck instance/volume access.
Sounds easy enough; the only bit that worries me is this from comment #5: "Under normal circumstances, there won't be any active NFS mounts inside the cinder-volume container prior to when the service starts. However, in an FFU scenario, there may be an NFS mount on the host left over from when cinder ran on the host."
Is there a way I can trigger this? Should I manually create a mount on the host just to test comment #5?
Thanks
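If a manual mount were needed, I assume something along these lines would simulate the leftover mount (the NFS server, export, and mount point are placeholders):

mkdir -p /var/lib/cinder/mnt/test
mount -t nfs netapp.example.com:/cinder_share /var/lib/cinder/mnt/test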
Sorry Tzach, I can see how that statement is concerning, but your plan of action looks fine. What I meant is that in a fully containerized deployment, at the time kolla executes the recursive chown there will not be any active NFS mounts associated with the cinder-volume service. That's because kolla hasn't started c-vol yet! That's what I meant by "under normal circumstances." Your steps 1 and 2 will create the FFU situation where there -are- NFS mounts (the ones left over from OSP-10). The fix ensures these mounts are removed during the FFU process, so they're torn down prior to kolla executing the chown.
Verified on: openstack-tripleo-heat-templates-8.4.1-68.el7ost.noarch

Installed an OSP10 system with Cinder backed by NetApp NFS. Created two NFS-backed volumes attached to two separate instances, one on each of the two compute nodes. Created a filesystem on and mounted each volume, and wrote a test file on both volumes. Used a "watch -n 5" command to review both test files every 5 seconds.

Started the FFU process:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud upgrade run --roles Controller --skip-tags validation
..
..
PLAY RECAP *********************************************************************
controller-0 : ok=21 changed=4 unreachable=0 failed=0
controller-1 : ok=21 changed=4 unreachable=0 failed=0
controller-2 : ok=21 changed=4 unreachable=0 failed=0
Thursday 15 October 2020 08:00:29 -0400 (0:00:00.389) 0:00:34.897 ******
===============================================================================
Updated nodes - Controller
Success
Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']

Up to this point there was no issue; both instances' volumes and files remained accessible during the controller upgrade. BZ verified as working properly: before this fix the volumes would disconnect, which did not happen in my case.

For anyone doing this upgrade: during the undercloud upgrade I had to bump OSP10 to 13.0-RHEL-7/7.7-latest/ (2020-03-10.1); I don't recall which 13z this is. As OSP10 is RHEL 7.7 and OSP13z13 is RHEL 7.9, without this temporary upgrade step I hit dependency issues. With this workaround I was able to upgrade the undercloud from OSP10 to OSP13z13 (RHEL 7.9) and then start the overcloud upgrade.
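For reference, the check I kept running inside each instance was of this form (the guest-side mount point and file name are illustrative):

watch -n 5 cat /mnt/vol1/testfile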
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388