+++ This bug was initially created as a clone of Bug #1977179 +++

Description of problem:
PVC stays in Pending when using hostpath-provisioner.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-202106281541.p0.git.1077b05.assembly.stream-1077b05
Server Version: 4.8.0-rc.1
Kubernetes Version: v1.21.0-rc.0+766a5fe

$ oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES                                  PHASE
openshift-cnv                          kubevirt-hyperconverged-operator.v4.8.0        OpenShift Virtualization      4.8.0                   kubevirt-hyperconverged-operator.v2.6.5   Succeeded
openshift-local-storage                local-storage-operator.4.7.0-202102110027.p0   Local Storage                 4.7.0-202102110027.p0                                             Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                                                            Succeeded
openshift-storage                      ocs-operator.v4.8.0-431.ci                     OpenShift Container Storage   4.8.0-431.ci                                              Succeeded

How reproducible:
Always

Steps to Reproduce:
1. Create a DataVolume using the hostpath-provisioner storage class:

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv1
spec:
  source:
    http:
      url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

2. Create a VirtualMachine to consume the DataVolume:

---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm1
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 512M
        devices:
          rng: {}
          disks:
            - disk:
                bus: virtio
              name: dv1
      volumes:
        - name: dv1
          dataVolume:
            name: dv1
  running: true

3. Check the status of the dv, pvc, pod, vm, and vmi.
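As a sketch, the steps above can be driven with oc like this (the manifest filenames are assumptions, not from the report):

```shell
# Apply the DataVolume and VirtualMachine manifests (filenames assumed)
oc create -f dv1.yaml
oc create -f vm1.yaml

# Watch the DataVolume; with WaitForFirstConsumer binding, the import only
# starts once the consuming virt-launcher pod has been scheduled
oc get dv dv1 -w
```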
Actual results:

$ oc get dv
NAME   PHASE                  PROGRESS   RESTARTS   AGE
dv1    WaitForFirstConsumer   N/A                   21m

$ oc get pvc
NAME   STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS           AGE
dv1    Pending                                      hostpath-provisioner   22m

$ oc get pod
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-crrvd   0/1     Pending   0          21m

$ oc get vm
NAME   AGE   VOLUME
vm1    23m

$ oc get vmi
NAME   AGE   PHASE     IP   NODENAME
vm1    23m   Pending

Expected results:
The PVC can be bound and the VM can be running.

Additional info:
Detailed log attached.

--- Additional comment from Yan Du on 2021-06-29 06:41:29 UTC ---

hpp build: hostpath-provisioner-operator-container-v4.8.0-16

--- Additional comment from Yan Du on 2021-06-29 06:42:25 UTC ---

--- Additional comment from Yan Du on 2021-06-29 06:46:53 UTC ---

It works on hpp build hostpath-provisioner-operator-container-v4.8.0-15, so this should be a regression.

--- Additional comment from Yan Du on 2021-06-29 07:21:43 UTC ---

must-gather log: https://drive.google.com/drive/folders/1iaJ8uHDiqOSARB_n9Zz_bmrRwn7yNLzY

--- Additional comment from Alexander Wels on 2021-06-29 14:39:25 UTC ---

Created a fix in the attached PR link. Basically we needed to modify the SCC to include projected volumes.
Projected volumes are enabled by default for all pods in 4.8.

--- Additional comment from Alexander Wels on 2021-06-29 14:45:41 UTC ---

Backported to the 4.8 branch.

--- Additional comment from Bartosz Rybacki on 2021-06-30 07:37:22 UTC ---

Fixed in: hostpath-provisioner-operator v4.8.0-17
hco bundle: v4.8.0-444

--- Additional comment from Alex Kalenyuk on 2021-06-30 08:03:19 UTC ---

Need to verify this on OpenShift RC.1 (the bug does not occur on fc7).

--- Additional comment from Dan Kenigsberg on 2021-06-30 08:39:12 UTC ---

@akalenyu can you point to the OpenShift RC.1 change that triggered this bug (bz, jira, pr)?

--- Additional comment from Fabian Deutsch on 2021-06-30 08:42:52 UTC ---

Is there any workaround for this bug that an admin could perform?

--- Additional comment from Alex Kalenyuk on 2021-06-30 09:42:45 UTC ---

(In reply to Fabian Deutsch from comment #10)
> Is there any workaround for this bug that an admin could perform?

I don't think we can work around this in production, as any workaround will involve scaling down our operator so the SCC doesn't get reconciled:
- Scale down the HPP operator
- Manually add `- projected` to the HPP SCC's .volumes[] (named hostpath-provisioner)
- Edit the daemonset to trigger an attempt to start the pods (named hostpath-provisioner as well)

(Or, if we want to verify this without redeploying CNV, we could scale down HCO and replace the HPP operator image in the cluster.)

(In reply to Dan Kenigsberg from comment #9)
> @akalenyu can you point to the OpenShift RC.1 change that
> triggered this bug (bz, jira, pr)?

The reason we hit this is that the BoundServiceAccountTokenVolume feature gate is now enabled by default in k8s 1.21:
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#overview
https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
(Thanks awels for these findings.)
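The SCC change described above (adding `projected` to the allowed volume types) would look roughly like this; the entries other than `projected` are illustrative, not the HPP SCC's actual list:

```yaml
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: hostpath-provisioner
# ...other SCC fields unchanged...
volumes:
  - hostPath      # illustrative: pre-existing entries stay as they are
  - secret
  - projected     # new: the kubelet now mounts the service account token
                  # through a projected volume (BoundServiceAccountTokenVolume)
```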
As for the reason we only see it in the RC of OCP, the changelog of FC.9 (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.8.0-fc.9) points to Bug 1946479, which would explain why the feature gate was disabled on the OCP side before.

--- Additional comment from Dan Kenigsberg on 2021-06-30 10:02:15 UTC ---

Alex, would it be possible for a customer to disable this OpenShift feature gate? If so, it would be a valid workaround that should be documented here.

--- Additional comment from Fabian Deutsch on 2021-06-30 10:27:12 UTC ---

IIUIC then these are three distinct steps:
- Scale down the HPP operator --> oc scale
- Manually add `- projected` to the HPP SCC's .volumes[] (named hostpath-provisioner) --> oc patch?
- Edit the daemonset to trigger an attempt to start the pods (named hostpath-provisioner as well) --> oc delete -l… ?

Once we have a fix, then it's about scaling it up again: oc scale …

Is this correct?

--- Additional comment from Yan Du on 2021-06-30 10:54:54 UTC ---

(In reply to Dan Kenigsberg from comment #12)
> Alex, would it be possible for a customer to disable this openshift feature
> gate? If so it would be a valid workaround that should be documented here.

I am not sure I understand that right, but if we want to disable the BoundServiceAccountTokenVolume feature gate, we would need to stop the kubelet on all nodes and restart it without the BoundServiceAccountTokenVolume parameter. Scaling down HCO and replacing the HPP operator image is probably better than this.
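Put together, the three workaround steps discussed above could be sketched as oc commands. This is a sketch only: the SCC and daemonset names come from the thread, while the operator deployment name and namespace are assumptions:

```shell
# 1. Scale down the HPP operator so the SCC is no longer reconciled
#    (deployment name and namespace are assumed here)
oc scale deployment hostpath-provisioner-operator -n openshift-cnv --replicas=0

# 2. Append "projected" to the SCC's allowed volume types
oc patch scc hostpath-provisioner --type=json \
  -p '[{"op": "add", "path": "/volumes/-", "value": "projected"}]'

# 3. Restart the provisioner pods so they are re-admitted under the patched SCC
oc rollout restart daemonset hostpath-provisioner -n openshift-cnv

# Once a fixed build is available, scale the operator back up
oc scale deployment hostpath-provisioner-operator -n openshift-cnv --replicas=1
```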
Tested on OCP 4.8.0-rc.3 with CNV 2.6.5; the issue cannot be reproduced.
Tested on CNV 2.6.6 with hostpath-provisioner-operator-container-v2.6.6-3; the issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3119