Bug 2058626
Summary: | Multiple Azure upstream kube fsgroupchangepolicy tests are permafailing, expecting gid "1000" but getting "root" | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Devan Goodwin <dgoodwin> |
Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
Storage sub component: | Storage | QA Contact: | Wei Duan <wduan> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | | |
Priority: | unspecified | CC: | jsafrane, sippy, stbenjam, wking |
Version: | 4.11 | | |
Target Milestone: | --- | | |
Target Release: | 4.11.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | see failing tests below |
Last Closed: | 2022-08-10 10:51:23 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |

Environment (failing tests):

External Storage [Driver: disk.csi.azure.com] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with pvc data source in parallel
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with different fsgroup applied to the volume contents
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with same fsgroup applied to the volume contents
Description
Devan Goodwin
2022-02-25 12:46:09 UTC
In addition to the failing tests above, OCP does not install on Azure since yesterday: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure/1496906555121995776

Prometheus pods report permission denied:

ts=2022-02-24T19:08:33.801Z caller=query_logger.go:86 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-azure/1496904660361940992/artifacts/e2e-azure/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log

The CSI migration PR in MCO landed around the same time: https://github.com/openshift/machine-config-operator/pull/2949, i.e. all Azure Disk setup and mounting is now done by the CSI driver. In the Prometheus PV I can see:

fsType: ""

Together with the missing fsGroupPolicy in the CSIDriver instance, this means that fsGroup was not applied and Prometheus can't access its volume. The Azure Disk in-tree volume plugin used "fsType: ext4" here.

We decided that while this PR can fix installation, we would need a backport to 4.10 to fix upgrade, and it is quite late for such changes. In addition, we can't fix the Cinder CSIDriver in the same way, as we have shipped its CSIDriver since 4.9. It is therefore better to revert the MCO changes and fix the CSIDriver instances in our operators properly: ship "CSIDriver.fsGroupPolicy: File" everywhere it makes sense, together with code that deletes and re-creates the CSIDriver during update (CSIDriver is read-only after creation). In addition, we should fix the translation library upstream so that an in-tree PV with an empty fsType is translated to a CSI PV with "fsType: ext4" in all volume plugins that default to ext4, so that other Kubernetes distributions don't hit the same issue.

The failure has not happened in these three days. Updated status to "Verified".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
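
For reference, a minimal sketch of the operator-side fix described above: ensure the CSIDriver object carries fsGroupPolicy: File and, because that field is immutable after creation, delete and re-create the object when an existing instance differs. The package, function name, and structure are illustrative assumptions, not the actual operator code.

```go
// Sketch only: ensure a CSIDriver ships fsGroupPolicy: File, assuming a
// standard client-go clientset. Not the actual OpenShift operator code.
package csidriverfix

import (
	"context"
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureCSIDriver creates the desired CSIDriver, deleting any existing object
// first when its fsGroupPolicy differs, because the field cannot be updated
// in place.
func ensureCSIDriver(ctx context.Context, client kubernetes.Interface) error {
	filePolicy := storagev1.FileFSGroupPolicy
	desired := &storagev1.CSIDriver{
		ObjectMeta: metav1.ObjectMeta{Name: "disk.csi.azure.com"},
		Spec: storagev1.CSIDriverSpec{
			FSGroupPolicy: &filePolicy,
		},
	}

	existing, err := client.StorageV1().CSIDrivers().Get(ctx, desired.Name, metav1.GetOptions{})
	switch {
	case errors.IsNotFound(err):
		// Nothing to clean up; fall through to Create below.
	case err != nil:
		return err
	default:
		if existing.Spec.FSGroupPolicy != nil && *existing.Spec.FSGroupPolicy == filePolicy {
			return nil // already correct
		}
		// fsGroupPolicy is read-only after creation, so delete and re-create.
		if err := client.StorageV1().CSIDrivers().Delete(ctx, desired.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}

	_, err = client.StorageV1().CSIDrivers().Create(ctx, desired, metav1.CreateOptions{})
	if err == nil {
		fmt.Printf("CSIDriver %s ensured with fsGroupPolicy=File\n", desired.Name)
	}
	return err
}
```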
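
And a sketch of the proposed translation-library defaulting: when an in-tree Azure Disk PV leaves fsType empty, the translated CSI PV should carry the plugin's implicit ext4 default. The function name and shape below are illustrative assumptions, not the actual csi-translation-lib API.

```go
// Sketch only: default an empty in-tree fsType to ext4 during in-tree -> CSI
// translation for Azure Disk. Not the real csi-translation-lib implementation.
package translationsketch

import (
	corev1 "k8s.io/api/core/v1"
)

// defaultAzureDiskFSType is the filesystem the in-tree plugin formatted with
// when no fsType was set.
const defaultAzureDiskFSType = "ext4"

// translateAzureDiskToCSI converts an in-tree AzureDisk volume source into a
// CSI persistent volume source, filling in fsType when the in-tree PV left it
// empty so the CSI driver applies the same default.
func translateAzureDiskToCSI(inTree *corev1.AzureDiskVolumeSource) *corev1.CSIPersistentVolumeSource {
	fsType := defaultAzureDiskFSType
	if inTree.FSType != nil && *inTree.FSType != "" {
		fsType = *inTree.FSType
	}
	return &corev1.CSIPersistentVolumeSource{
		Driver:       "disk.csi.azure.com",
		VolumeHandle: inTree.DataDiskURI,
		FSType:       fsType,
	}
}
```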