[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents [Suite:openshift/conformance/parallel] [Suite:k8s] (and several other similar tests) has been permafailing in CI since Feb 23:

https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-storage%5D%20In-tree%20Volumes%20%5BDriver%3A%20azure-disk%5D%20%5BTestpattern%3A%20Dynamic%20PV%20(default%20fs)%5D%20fsgroupchangepolicy%20(Always)%5BLinuxOnly%5D%2C%20pod%20created%20with%20an%20initial%20fsgroup%2C%20new%20pod%20fsgroup%20applied%20to%20volume%20contents%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%20%5BSuite%3Ak8s%5D

The test is blocking 4.11 nightly payloads and is thus quite important to get fixed. A sample failing prow job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1496904650283028480

The failure looks relatively clear:

Run #0: Failed (5m23s)
fail [k8s.io/kubernetes.0/test/e2e/storage/utils/utils.go:728]: Expected
    <string>: root
to equal
    <string>: 1000

TRT took a look at fixing this, but given the implications of a test asserting 1000 and getting root, we figured we should turn it over to the storage team.
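For context, the failing assertion comes from the fsGroup e2e flow: a pod is created with securityContext.fsGroup set, and the test then checks the group owner of the volume contents, expecting the fsGroup GID (1000) rather than root. A rough sketch of the kind of pod involved (a hypothetical helper, not the actual e2e code; the image and paths are illustrative):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildFSGroupPod sketches what the test exercises: a pod whose
// securityContext.fsGroup asks the kubelet to change the group ownership
// of the volume contents to GID 1000. The failed assertion means the
// group stayed "root", i.e. fsGroup was never applied to the volume.
func buildFSGroupPod(pvcName string) *corev1.Pod {
	fsGroup := int64(1000)
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "fsgroup-test"},
		Spec: corev1.PodSpec{
			SecurityContext: &corev1.PodSecurityContext{FSGroup: &fsGroup},
			Containers: []corev1.Container{{
				Name:    "test",
				Image:   "busybox",
				// Print the numeric group owner of the mount point.
				Command: []string{"sh", "-c", "stat -c '%g' /mnt/volume1; sleep 3600"},
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "volume1",
					MountPath: "/mnt/volume1",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "volume1",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}

func main() {
	fmt.Println(buildFSGroupPod("example-pvc").Name)
}
```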
In addition, OCP does not install on Azure since yesterday:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure/1496906555121995776

Prometheus pods report permission denied:

ts=2022-02-24T19:08:33.801Z caller=query_logger.go:86 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-azure/1496904660361940992/artifacts/e2e-azure/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log

The CSI migration PR in MCO landed around the same time: https://github.com/openshift/machine-config-operator/pull/2949, i.e. all Azure Disk setup and mounting is now done by the CSI driver. In the Prometheus PV I can see:

fsType: ""

Together with the missing FSGroupPolicy in the CSIDriver instance, this means that fsGroup was not applied and Prometheus can't access its volume. The in-tree Azure Disk volume plugin used "fsType: ext4" here.
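The CSIDriver side of the failure follows from the API defaults: when spec.fsGroupPolicy is unset, it defaults to ReadWriteOnceWithFSType, which applies fsGroup only when the PV has a non-empty fsType, so an empty fsType means no ownership change at all. A minimal sketch in Go of the CSIDriver object with fsGroupPolicy: File (disk.csi.azure.com is the Azure Disk CSI driver name; the other fields and the helper are illustrative, not the operator's actual code):

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// desiredAzureDiskCSIDriver sketches a CSIDriver with fsGroupPolicy: File,
// which tells the kubelet to always apply the pod's fsGroup to the volume,
// even when the PV's fsType is empty.
func desiredAzureDiskCSIDriver() *storagev1.CSIDriver {
	attachRequired := true
	fsGroupPolicy := storagev1.FileFSGroupPolicy // "File"
	return &storagev1.CSIDriver{
		ObjectMeta: metav1.ObjectMeta{Name: "disk.csi.azure.com"},
		Spec: storagev1.CSIDriverSpec{
			AttachRequired: &attachRequired,
			FSGroupPolicy:  &fsGroupPolicy,
		},
	}
}

func main() {
	fmt.Println(*desiredAzureDiskCSIDriver().Spec.FSGroupPolicy)
}
```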
We decided that while this PR can fix installation, we would need a backport to 4.10 to fix upgrades, and it's quite late for such changes. In addition, we can't fix the Cinder CSIDriver in the same way, as we have shipped its CSIDriver since 4.9. Therefore it's better to revert the MCO changes and fix the CSIDriver instances in our operators properly: ship "CSIDriver.FSGroupPolicy: File" everywhere it makes sense, together with some code that deletes and re-creates the CSIDriver during update (CSIDriver is read-only after creation). In addition, we should fix the upstream translation library to translate an in-tree PV with an empty fsType into a CSI PV with "fsType: ext4" in all volume plugins that default to ext4, so other Kubernetes distros don't hit the same issue; a sketch of that change follows below.
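A sketch of the proposed upstream translation fix, assuming a heavily simplified shape of the csi-translation-lib azure-disk translator (the function names and signatures here are hypothetical; only the fsType-defaulting idea is the point):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// defaultFSType mirrors the proposed upstream change: in-tree plugins that
// formatted an empty fsType as ext4 should translate an empty fsType into
// an explicit "ext4" in the CSI PV, so that the default fsGroupPolicy
// (ReadWriteOnceWithFSType) still applies fsGroup after CSI migration.
func defaultFSType(fsType string) string {
	if fsType == "" {
		return "ext4"
	}
	return fsType
}

// translateAzureDiskPV is a hypothetical, simplified stand-in for the
// in-tree-PV-to-CSI-PV translation path for azure-disk.
func translateAzureDiskPV(inTree *corev1.AzureDiskVolumeSource) *corev1.CSIPersistentVolumeSource {
	fsType := ""
	if inTree.FSType != nil {
		fsType = *inTree.FSType
	}
	return &corev1.CSIPersistentVolumeSource{
		Driver:       "disk.csi.azure.com",
		VolumeHandle: inTree.DataDiskURI,
		FSType:       defaultFSType(fsType),
	}
}

func main() {
	src := &corev1.AzureDiskVolumeSource{DiskName: "disk1", DataDiskURI: "uri-of-disk1"}
	fmt.Println(translateAzureDiskPV(src).FSType) // prints "ext4"
}
```

The delete-and-re-create step in the operators is needed for the same reason noted above: CSIDriver is read-only after creation, so an existing object without fsGroupPolicy: File cannot simply be updated in place.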
The failure has not happened in the past three days. Updating status to "Verified".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069