Bug 2058626 - Multiple Azure upstream kube fsgroupchangepolicy tests are permafailing expecting gid "1000" but getting "root"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jan Safranek
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-02-25 12:46 UTC by Devan Goodwin
Modified: 2022-08-10 10:51 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
External Storage [Driver: disk.csi.azure.com] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with pvc data source in parallel
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with different fsgroup applied to the volume contents
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with same fsgroup applied to the volume contents
Last Closed: 2022-08-10 10:51:23 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2968 0 None open Bug 2058626: Revert "Bump(openshift/api): to get CSI changes" 2022-02-25 16:40:47 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:51:56 UTC

Description Devan Goodwin 2022-02-25 12:46:09 UTC
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]

This test (and several other similar tests) has been permafailing in CI since Feb 23:

https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-storage%5D%20In-tree%20Volumes%20%5BDriver%3A%20azure-disk%5D%20%5BTestpattern%3A%20Dynamic%20PV%20(default%20fs)%5D%20fsgroupchangepolicy%20(Always)%5BLinuxOnly%5D%2C%20pod%20created%20with%20an%20initial%20fsgroup%2C%20new%20pod%20fsgroup%20applied%20to%20volume%20contents%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%20%5BSuite%3Ak8s%5D

The failures are blocking 4.11 nightly payloads, so this is quite important to fix.

A sample failure prow job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1496904650283028480

The failure looks relatively clear:

Run #0: Failed (5m23s)
fail [k8s.io/kubernetes.0/test/e2e/storage/utils/utils.go:728]: Expected
    <string>: root
to equal
    <string>: 1000

TRT took a look at fixing this, but given the implications of a test expecting gid 1000 and getting root, we figured we should turn it over to the storage team.
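For context, the upstream fsgroupchangepolicy tests create a pod with securityContext.fsGroup set and then check the group ownership of files on the mounted volume. A minimal sketch of such a pod (the name, image, command, and PVC are illustrative, not the exact e2e fixture):

```yaml
# A pod whose volume contents should be group-owned by fsGroup (gid 1000)
# after mount; the failing assertion saw "root" instead.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-test          # illustrative name
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "Always"   # the tests also exercise "OnRootMismatch"
  containers:
  - name: checker
    image: busybox            # illustrative image
    command: ["sh", "-c", "ls -ln /mnt/volume1 && sleep 3600"]
    volumeMounts:
    - name: vol
      mountPath: /mnt/volume1
  volumes:
  - name: vol
    persistentVolumeClaim:
      claimName: azure-disk-pvc   # illustrative claim name
```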

Comment 1 Jan Safranek 2022-02-25 14:01:02 UTC
In addition, OCP does not install on Azure since yesterday:

  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure/1496906555121995776

Prometheus pods report permission denied:

ts=2022-02-24T19:08:33.801Z caller=query_logger.go:86 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-azure/1496904660361940992/artifacts/e2e-azure/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log


A CSI migration PR in MCO landed around the same time: https://github.com/openshift/machine-config-operator/pull/2949, i.e. all Azure Disk setup and mounting is now done by the CSI driver.

I can see in the Prometheus PV:
    fsType: ""

Together with the missing fsGroupPolicy in the CSIDriver instance, this means that fsGroup was not applied and Prometheus can't access its volume.

Azure Disk in-tree volume plugin used "fsType: ext4" here.
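For reference, the relevant fragment of the migrated PV spec would look roughly like this (only the empty fsType is taken from the observation above; the surrounding fields are illustrative):

```yaml
# Fragment of the Prometheus PV after CSI migration (sketch):
csi:
  driver: disk.csi.azure.com
  volumeHandle: /subscriptions/.../disks/...   # illustrative
  fsType: ""   # empty: with the CSIDriver's fsGroupPolicy unset, it defaults to
               # ReadWriteOnceWithFSType, and kubelet skips the fsGroup ownership
               # change when fsType is not defined
# The in-tree azure-disk plugin effectively used:
#   fsType: ext4
```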

Comment 2 Jan Safranek 2022-02-25 15:42:18 UTC
We decided that while this PR could fix installation, we would also need a backport to 4.10 to fix upgrades, and it's quite late for such changes.
In addition, we can't fix the Cinder CSIDriver the same way, as we have shipped its CSIDriver since 4.9. It's therefore better to revert the MCO changes, fix the CSIDriver instances in our operators properly, and ship "CSIDriver.fsGroupPolicy: File" everywhere it makes sense, together with some code that deletes and re-creates the CSIDriver during update (CSIDriver is read-only after creation).

In addition, we should fix the translation library upstream so that an in-tree PV with an empty fsType is translated to a CSI PV with "fsType: ext4" in all volume plugins that default to ext4, so other Kubernetes distributions don't hit the same issue.

Comment 5 Wei Duan 2022-03-02 04:22:58 UTC
This has not happened in the last three days.
Updating status to "Verified".

Comment 8 errata-xmlrpc 2022-08-10 10:51:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

