Bug 1816918

Summary: Error: relabel failed "/var/lib/nova": operation not supported
Product: Red Hat OpenStack
Reporter: Shyam <shyam.biradar>
Component: openstack-tripleo-heat-templates
Assignee: Ollie Walsh <owalsh>
Status: CLOSED CURRENTRELEASE
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium
Docs Contact:
Priority: medium
Version: 16.0 (Train)
CC: broose, drosenfe, emacchi, jamsmith, jansari, jjoyce, jpichon, jschluet, lhh, lvrabec, mburns, owalsh, pchavva, slinaber, tvignaud, zcaplovi
Target Milestone: z2
Keywords: TestOnly, Triaged, ZStream
Target Release: 16.0 (Train on RHEL 8.1)
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200405044623.ec9970c.el8ost
Doc Type: Bug Fix
Doc Text:
This update fixes a bug that prevented nova-compute containers from restarting. Previously, podman used an SELinux relabelling process that failed if any NFS exports were mounted below `/var/lib/nova`. The failed relabel attempt prevented the container restart, with the error message:
----
Error: relabel failed "/var/lib/nova": operation not supported
----
This update replaces the SELinux relabelling process with a custom process that tolerates NFS mounts. nova-compute containers now restart even when there are NFS exports mounted below `/var/lib/nova`.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-05 14:38:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1824852    

Description Shyam 2020-03-25 06:22:46 UTC
Description of problem:

TrilioVault's datamover container mounts "/var/lib/nova" with the "shared,z" options.
Here is the heat template code for it: https://github.com/shyam-biradar/triliovault-cfg-scripts/blob/master/redhat-director-scripts/docker/services/trilio-datamover-osp16.yaml#L176

The Trilio Datamover service then mounts an NFS share under "/var/lib/nova" inside the datamover container.
The main goal of this is to make Trilio's NFS share available to the 'nova_compute' and 'nova_libvirt' containers.
Because the "/var/lib/nova" directory is mounted and shared among nova_compute and nova_libvirt, we achieved this goal by mounting the NFS share under "/var/lib/nova".

This worked fine up to RHOSP14, but in RHOSP16 we are facing an issue during overcloud deployment with TrilioVault containers.

The first overcloud deployment with TrilioVault containers works fine, but subsequent upgrades of the Trilio containers fail with the following relabelling error.


"fatal: [overcloud-novacompute-0]: FAILED! => {"ansible_job_id": "692027900534.93212", "attempts": 2, "changed": false, "finished": 1, "msg": "Paunch failed with config_id tripleo_step3", "rc": 126, "stderr": "Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=nova_statedir_owner', '--filter', 'label=config_id=tripleo_step3', '--format', '{{.Names}}']\" - retrying without config_id\nDid not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=nova_statedir_owner', '--format', '{{.Names}}']\"\nError running ['podman', 'run', '--name', 'nova_statedir_owner', '--label', 'config_id=tripleo_step3', '--label', 'container_name=nova_statedir_owner', '--label', 'managed_by=tripleo-Compute', '--label', 'config_data={\"command\": \"/container-config-scripts/pyshim.sh /container-config-scripts/nova_statedir_ownership.py\", \"detach\": false, \"environment\": {\"TRIPLEO_DEPLOY_IDENTIFIER\": \"1585050745\", \"__OS_DEBUG\": \"false\"}, \"image\": \"devundercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-nova-compute:16.0-83\", \"net\": \"none\", \"privileged\": false, \"user\": \"root\", \"volumes\": [\"/var/lib/nova:/var/lib/nova:shared,z\", \"/var/lib/container-config-scripts/:/container-config-scripts/:z\"]}', '--conmon-pidfile=/var/run/nova_statedir_owner.pid', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/nova_statedir_owner.log', '--env=TRIPLEO_DEPLOY_IDENTIFIER=1585050745', '--env=__OS_DEBUG=false', '--net=none', '--privileged=false', '--user=root', '--volume=/var/lib/nova:/var/lib/nova:shared,z', '--volume=/var/lib/container-config-scripts/:/container-config-scripts/:z', '--cpuset-cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15', 'devundercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-nova-compute:16.0-83', '/container-config-scripts/pyshim.sh', '/container-config-scripts/nova_statedir_ownership.py']. 
[126]\n\nstdout: \nstderr: Error: relabel failed \"/var/lib/nova\": operation not supported\n\n", "stderr_lines": ["Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=nova_statedir_owner', '--filter', 'label=config_id=tripleo_step3', '--format', '{{.Names}}']\" - retrying without config_id", "Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=nova_statedir_owner', '--format', '{{.Names}}']\"", "Error running ['podman', 'run', '--name', 'nova_statedir_owner', '--label', 'config_id=tripleo_step3', '--label', 'container_name=nova_statedir_owner', '--label', 'managed_by=tripleo-Compute', '--label', 'config_data={\"command\": \"/container-config-scripts/pyshim.sh /container-config-scripts/nova_statedir_ownership.py\", \"detach\": false, \"environment\": {\"TRIPLEO_DEPLOY_IDENTIFIER\": \"1585050745\", \"__OS_DEBUG\": \"false\"}, \"image\": \"devundercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-nova-compute:16.0-83\", \"net\": \"none\", \"privileged\": false, \"user\": \"root\", \"volumes\": [\"/var/lib/nova:/var/lib/nova:shared,z\", \"/var/lib/container-config-scripts/:/container-config-scripts/:z\"]}', '--conmon-pidfile=/var/run/nova_statedir_owner.pid', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/nova_statedir_owner.log', '--env=TRIPLEO_DEPLOY_IDENTIFIER=1585050745', '--env=__OS_DEBUG=false', '--net=none', '--privileged=false', '--user=root', '--volume=/var/lib/nova:/var/lib/nova:shared,z', '--volume=/var/lib/container-config-scripts/:/container-config-scripts/:z', '--cpuset-cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15', 'devundercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-nova-compute:16.0-83', '/container-config-scripts/pyshim.sh', '/container-config-scripts/nova_statedir_ownership.py']. [126]", "", "stdout: ", 
"stderr: Error: relabel failed \"/var/lib/nova\": operation not supported", ""], "stdout": "", "stdout_lines": []}
"

Version-Release number of selected component (if applicable):
RHOSP16



How reproducible:
Always reproducible. 


Steps to Reproduce:
1. Deploy RHOSP16
2. Deploy TrilioVault 4.0 through overcloud deploy command
3. Upgrade TrilioVault build on RHOSP16 through overcloud deploy command
It fails with the above error.


Actual results:
Overcloud deploy fails.

Expected results:
Overcloud deploy should be successful. 


Additional info:
If there is any SELinux option or label that would make this work for the Trilio NFS share, that would be great.

Comment 1 Emilien Macchi 2020-03-25 13:31:19 UTC
Hi, please post your /var/log/audit/audit.log (grep for AVC) from overcloud-novacompute-0, or a full sos report, so we can help identify which rule is missing.
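
For anyone without grep at hand, the filtering requested here can be sketched in a few lines of Python. This is only an illustration that mirrors `grep AVC`: the helper name `avc_lines` is made up, and it assumes the audit log is plain text with one record per line.

```python
from typing import Iterable, List


def avc_lines(lines: Iterable[str]) -> List[str]:
    """Keep only the records that mention AVC, like `grep AVC` would."""
    return [line.rstrip("\n") for line in lines if "AVC" in line]
```

On the compute node this could be fed with the open log file, for example `avc_lines(open("/var/log/audit/audit.log", errors="replace"))`; reading that file normally requires root.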

Comment 2 Cédric Jeanneret 2020-03-25 13:46:56 UTC
Some additional context since we don't have everything here.

This issue was also discussed on #tripleo this week. owalsh stepped in with some proposals.

The thing is:
- the Trilio datamover mounts an NFS share in a subdirectory of /var/lib/nova, and this prevents the relabelling because the share is NFS
- this solution was apparently approved at some point by Red Hat eng

Solution
- one of the proposals was to mount that NFS share elsewhere in the containers (there are, iirc, 3 containers, including triliovault)
- to do so, a new param would be needed for the libvirt container, something like NovaLibvirtOptVolumes
- NovaComputeOptVolumes could then be used in addition

Doing so would allow triliovault to work as expected, but would require some work to get the NFS share name right (there's apparently some kind of hashing of the name at some point).

The current solution (mounting the share directly in /var/lib/nova) worked with Docker because SELinux separation was deactivated back then. The move to podman enforces SELinux separation, so we had to add some flags to the shared volumes, such as that "z" one, which requires a recursive relabelling of the volume.

NFS doesn't really support SELinux. We can pass a "context" (mount -t nfs -o context="...") but this won't make the recursive relabelling work. Moving the share elsewhere is probably the best move at this point.
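
To make the failure mode concrete, here is a minimal sketch of a recursive relabel that tolerates filesystems answering "operation not supported". It is not podman's code and not the code from the eventual fix; the function names, the injectable `set_label` callback and the idea of returning the skipped paths are all assumptions made for illustration. The only real API it relies on is `os.setxattr` (Linux only) writing the `security.selinux` extended attribute.

```python
import errno
import os
from typing import Callable, List


def _set_selinux_xattr(path: str, context: str) -> None:
    # SELinux stores the label in the security.selinux extended attribute.
    os.setxattr(path, "security.selinux", context.encode(), follow_symlinks=False)


def relabel_tree(
    root: str,
    context: str,
    set_label: Callable[[str, str], None] = _set_selinux_xattr,
) -> List[str]:
    """Label root recursively; return the paths that could not be labelled."""
    skipped = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            try:
                set_label(path, context)
            except OSError as exc:
                if exc.errno != errno.ENOTSUP:
                    raise  # a real error, not "operation not supported"
                skipped.append(path)  # e.g. a path on an NFS mount
    return skipped
```

A strict relabel aborts on the first such error, which matches the symptom reported in this bug; the sketch records the path and carries on instead.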

@Shyam: would the proposed solution be OK for you, i.e. adding a new param to get extra volumes into the libvirt container? I know this means you'll need to rehash some data to provide the right path at deploy time, but it shouldn't be that complicated, right? Maybe you can even pre-hash things once and re-use the generated value during the whole deploy (i.e. as a param).

Thank you for your feedback.

Cheers,

C. (aka Tengu on #tripleo)

Comment 3 Shyam 2020-03-26 07:11:07 UTC
Hi Team,


Moving the NFS share somewhere else is very difficult. In that case we need to achieve many things at the same time.
1. We need to make the NFS share available to the nova_compute container
2. We need to make the NFS share available to the nova_libvirt container
3. The mount point of the NFS share is static; our datamover service calculates the hash for the given NFS share and uses it as the mount point.
4. This approach makes it less dynamic.
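
As a purely hypothetical illustration of point 3, a hash-derived mount point could look like the sketch below. The actual scheme used by the datamover service is not described in this bug, so the hash function (SHA-256), the base directory and the function name are all assumptions; the sketch only shows that such a path is deterministic for a given share string.

```python
import hashlib
import posixpath


def mount_point_for_share(share: str, base_dir: str = "/var/lib/nova/example-mounts") -> str:
    """Derive a mount point from the share string (hypothetical scheme)."""
    # Assumption: SHA-256 hex digest; the real service may use something else.
    digest = hashlib.sha256(share.encode("utf-8")).hexdigest()
    return posixpath.join(base_dir, digest)
```

Because the result depends only on the share string, the same value could in principle be computed once ahead of the deployment and passed in as a parameter.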

owalsh is proposing something else. Here is the PR owalsh raised for this: https://review.opendev.org/#/c/715015/
You will find more details on owalsh's approach there.

Thank you.

Let me know if you need additional information.

Comment 4 Cédric Jeanneret 2020-03-27 06:36:00 UTC
Hello Shyam,

Thank you for your feedback. I've indeed seen Oliver's proposal and it seems to solve your issue (among others).

I'll let Oliver manage this BZ as well (I made him the assignee yesterday).

Cheers,

C.

Comment 5 Shyam 2020-03-30 04:40:33 UTC
Thank you Cedric.

Comment 11 Ollie Walsh 2020-04-30 13:24:49 UTC
*** Bug 1813941 has been marked as a duplicate of this bug. ***

Comment 12 Lon Hohberger 2020-05-18 10:43:30 UTC
According to our records, this should be resolved by openstack-tripleo-heat-templates-11.3.2-0.20200405044625.ec9970c.el8ost.  This build is available now.

Comment 13 Shyam 2020-05-26 07:56:35 UTC
Hi,


With this fix, our trilio_datamover container does not start. It remains in the 'Created' state.
I tried to start the container using the 'podman start' command, but it fails with the following error.

[root@overcloud-novacompute-0 heat-admin]# podman ps --all | grep trilio
70464925affe  devundercloud.ctlplane.localdomain:8787/trilio/trilio-datamover:4.0.91-rhosp16                    kolla_start           18 hours ago  Created                         trilio_datamover


[root@overcloud-novacompute-0 heat-admin]# podman start trilio_datamover
Error: unable to start container "trilio_datamover": relabel failed "/var/lib/nova": operation not supported



When this happens:
We deployed the 4.0.90 TrilioVault containers and it worked fine. The trilio_datamover container started and the NFS share was mounted under '/var/lib/nova/...'.
But when I tried to upgrade the cloud to the 4.0.91 TrilioVault containers, the 4.0.91 trilio_datamover container did not start.
Overcloud deployment fails intermittently.

Let us know your thoughts.

Comment 14 Shyam 2020-05-26 09:50:27 UTC
Hi,


The following error was found in the ansible logs. The deployment failed at step5 while starting the 'trilio_datamover' container.

------------------------------------------------------------------------------------
2020-05-26 13:41:53,780 p=25153 u=mistral |  FAILED - RETRYING: Wait for containers to start for step 5 using paunch (1189 retries left).
2020-05-26 13:41:57,105 p=25153 u=mistral |  FAILED - RETRYING: Wait for containers to start for step 5 using paunch (1188 retries left).
2020-05-26 13:42:00,377 p=25153 u=mistral |  ok: [overcloud-controller-0] => {"action": ["Applying config_id tripleo_step5"], "ansible_job_id": "439545201027.925672", "attempts": 14, "changed": false, "finished": 1, "rc": 0, "stderr": "Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=cinder_volume_init_bundle', '--filter', 'label=config_id=tripleo_step5', '--format', '{{.Names}}']" - retrying without config_id\nDid not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=cinder_volume_init_bundle', '--format', '{{.Names}}']"\nRemoved /etc/systemd/system/multi-user.target.wants/tripleo_trilio_dmapi.service.\nDid not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=trilio_dmapi', '--filter', 'label=config_id=tripleo_step5', '--format', '{{.Names}}']" - retrying without config_id\nDid not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=trilio_dmapi', '--format', '{{.Names}}']"\nCreated symlink /etc/systemd/system/multi-user.target.wants/tripleo_trilio_dmapi.service → /etc/systemd/system/tripleo_trilio_dmapi.service.\n", "stderr_lines": ["Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=cinder_volume_init_bundle', '--filter', 'label=config_id=tripleo_step5', '--format', '{{.Names}}']" - retrying without config_id", "Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=cinder_volume_init_bundle', '--format', '{{.Names}}']"", "Removed /etc/systemd/system/multi-user.target.wants/tripleo_trilio_dmapi.service.", "Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=trilio_dmapi', '--filter', 'label=config_id=tripleo_step5', '--format', '{{.Names}}']" - retrying without config_id", "Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=trilio_dmapi', '--format', '{{.Names}}']"", "Created symlink 
/etc/systemd/system/multi-user.target.wants/tripleo_trilio_dmapi.service → /etc/systemd/system/tripleo_trilio_dmapi.service."], "stdout": "Info: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: Loading facts\nInfo: 
------------------------------------------------------------------------------------

Comment 15 Ollie Walsh 2020-05-26 10:10:26 UTC
(In reply to Shyam from comment #13)
> Hi,
> 
> 
> With this fix, our trilio_datamover container is not getting started. It's
> remained in 'Created' state.
> I tried to start this container using 'podman start' command, it's failing
> following error.
> 
> [root@overcloud-novacompute-0 heat-admin]# podman ps --all | grep trilio
> 70464925affe 
> devundercloud.ctlplane.localdomain:8787/trilio/trilio-datamover:4.0.91-
> rhosp16                    kolla_start           18 hours ago  Created      
> trilio_datamover
> 
> 
> [root@overcloud-novacompute-0 heat-admin]# podman start trilio_datamover
> Error: unable to start container "trilio_datamover": relabel failed
> "/var/lib/nova": operation not supported

This suggests /var/lib/nova is bind mounted with SELinux relabelling enabled, e.g. /var/lib/nova:/var/lib/nova:shared,z
Change this to /var/lib/nova:/var/lib/nova:shared instead.
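
The change suggested here amounts to dropping the relabel option from the volume string. A small sketch of that transformation follows; the helper name `drop_relabel_option` is made up, and it assumes the usual `source:destination[:options]` form with comma-separated options, as seen in the volume strings in this bug.

```python
def drop_relabel_option(volume: str) -> str:
    """Remove the SELinux relabel options (z, Z) from a volume string."""
    parts = volume.split(":")
    if len(parts) < 3:
        return volume  # no options field, nothing to remove
    options = [opt for opt in parts[2].split(",") if opt not in ("z", "Z")]
    # Drop the options field entirely if nothing is left in it.
    return ":".join(parts[:2] + ([",".join(options)] if options else []))
```

For the volume in question, `drop_relabel_option("/var/lib/nova:/var/lib/nova:shared,z")` gives `/var/lib/nova:/var/lib/nova:shared`. Whether removing the option is safe for other volumes depends on how those paths are labelled, so this is not meant as a blanket rewrite.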

> 
> 
> 
> When this is happening:
> We deployed 4.0.90 containers of Triliovault, it worked fine.
> trilio_datamover container started well. NFS share mounted under
> '/var/lib/nova/...'.
> But when I tried to upgrade the cloud with 4.0.91 containers of triliovault,
> 4.0.91 trilio_datamover container is not getting started.

I doubt the version matters, just that the NFS mounts exist when the container is restarted.

> Overcloud deployment intermittently failing. 
> 
> Let us know your thoughts.

Comment 16 Shyam 2020-05-26 10:18:54 UTC
Hi,

Yes, we use the 'shared,z' flags when mounting '/var/lib/nova'.
Here is the code.
https://github.com/trilioData/triliovault-cfg-scripts/blob/stable/4.0/redhat-director-scripts/docker/services/trilio-datamover-osp16.yaml#L158

Let me try removing the 'z' flag.

Thank you.

Comment 18 Ollie Walsh 2021-02-05 14:38:26 UTC
(In reply to Shyam from comment #16)
> Hi,
> 
> Yes, we use 'shared,z' flags while mounting the '/var/lib/nova'.
> Here is the code.
> https://github.com/trilioData/triliovault-cfg-scripts/blob/stable/4.0/redhat-
> director-scripts/docker/services/trilio-datamover-osp16.yaml#L158
> 
> Let me try by removing 'z' flag.
> 
> Thank you.

Hi,

I can see this was committed in https://github.com/trilioData/triliovault-cfg-scripts/commit/55db096d5ab7496797dfed5c9f23670f119a74ba, so I assume it works and I am closing the BZ.

Thanks,
Ollie