Bug 1977756 - [2.6.z] PVC keeps in pending when using hostpath-provisioner
Summary: [2.6.z] PVC keeps in pending when using hostpath-provisioner
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.6.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 2.6.6
Assignee: Bartosz Rybacki
QA Contact: Yan Du
URL:
Whiteboard:
Depends On: 1977179 1977383
Blocks:
Reported: 2021-06-30 12:31 UTC by Adam Litke
Modified: 2021-08-10 17:33 UTC (History)
CC List: 10 users

Fixed In Version: hostpath-provisioner-rhel8-operator-v2.6.6-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1977179
Environment:
Last Closed: 2021-08-10 17:33:37 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID                                                Private   Priority   Status   Summary                                                                           Last Updated
Github                   kubevirt hostpath-provisioner-operator pull 112   0         None       closed   [release-v0.7] Add projected volume to SCC to be compatible with Open Shift 4.8   2021-07-02 08:20:29 UTC
Red Hat Product Errata   RHSA-2021:3119                                    0         None       None     None                                                                              2021-08-10 17:33:48 UTC

Description Adam Litke 2021-06-30 12:31:13 UTC
+++ This bug was initially created as a clone of Bug #1977179 +++

Description of problem:
The PVC stays in Pending when using hostpath-provisioner.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-202106281541.p0.git.1077b05.assembly.stream-1077b05
Server Version: 4.8.0-rc.1
Kubernetes Version: v1.21.0-rc.0+766a5fe
$ oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES                                  PHASE
openshift-cnv                          kubevirt-hyperconverged-operator.v4.8.0        OpenShift Virtualization      4.8.0                   kubevirt-hyperconverged-operator.v2.6.5   Succeeded
openshift-local-storage                local-storage-operator.4.7.0-202102110027.p0   Local Storage                 4.7.0-202102110027.p0                                             Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                                                            Succeeded
openshift-storage                      ocs-operator.v4.8.0-431.ci                     OpenShift Container Storage   4.8.0-431.ci                                                      Succeeded


How reproducible:
Always

Steps to Reproduce:
1. Create a dv with hostpath-provisioner
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv1
spec:
  source:
    http:
      url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt
2. Create a vm to consume the dv
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm1
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 512M
        devices:
          rng: {}
          disks:
          - disk:
              bus: virtio
            name: dv1
      volumes:
      - name: dv1
        dataVolume:
          name: dv1
  running: true


Actual results:
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get dv
NAME   PHASE                  PROGRESS   RESTARTS   AGE
dv1    WaitForFirstConsumer   N/A                   21m
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get pvc
NAME   STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS           AGE
dv1    Pending                                      hostpath-provisioner   22m
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get pod
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-crrvd   0/1     Pending   0          21m
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get vm
NAME   AGE   VOLUME
vm1    23m   
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get vmi
NAME   AGE   PHASE     IP    NODENAME
vm1    23m   Pending   


Expected results:
The PVC is bound and the VM is running.

Additional info:
Detailed log attached

--- Additional comment from Yan Du on 2021-06-29 06:41:29 UTC ---

hpp build: hostpath-provisioner-operator-container-v4.8.0-16

--- Additional comment from Yan Du on 2021-06-29 06:42:25 UTC ---



--- Additional comment from Yan Du on 2021-06-29 06:46:53 UTC ---

It works on HPP build hostpath-provisioner-operator-container-v4.8.0-15, so this looks like a regression.

--- Additional comment from Yan Du on 2021-06-29 07:21:43 UTC ---

must-gather log: https://drive.google.com/drive/folders/1iaJ8uHDiqOSARB_n9Zz_bmrRwn7yNLzY

--- Additional comment from Alexander Wels on 2021-06-29 14:39:25 UTC ---

Created a fix in the attached PR link.

Basically, we needed to modify the SCC to include projected volumes, which are enabled by default for all pods in 4.8.

--- Additional comment from Alexander Wels on 2021-06-29 14:45:41 UTC ---

backport to 4.8 branch

--- Additional comment from Bartosz Rybacki on 2021-06-30 07:37:22 UTC ---

fixed in: hostpath-provisioner-operator v4.8.0-17

hco bundle: v4.8.0-444

--- Additional comment from Alex Kalenyuk on 2021-06-30 08:03:19 UTC ---

Need to verify this on OpenShift RC.1 (bug does not occur on fc7)

--- Additional comment from Dan Kenigsberg on 2021-06-30 08:39:12 UTC ---

@akalenyu can you point to the OpenShift RC.1 change that triggered this bug (bz, jira, pr)?

--- Additional comment from Fabian Deutsch on 2021-06-30 08:42:52 UTC ---

Is there any workaround for this bug that an admin could perform?

--- Additional comment from Alex Kalenyuk on 2021-06-30 09:42:45 UTC ---

(In reply to Fabian Deutsch from comment #10)
> Is there any workaround for this bug that an admin could perform?

I don't think we can work around this in production, as any workaround will involve scaling down our operator so the SCC doesn't get reconciled:
- Scale down HPP operator
- Manually add `- projected` to the HPP SCC's .volumes[] (named hostpath-provisioner)
- Edit daemonset to trigger attempt to start pods (named hostpath-provisioner as well)

(Or if we want to verify this without redeploying CNV, we could scale down HCO and replace the HPP operator image in the cluster).
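
For reference, the three manual steps listed above could be sketched as follows. This is only an illustration under stated assumptions: the SCC and DaemonSet name `hostpath-provisioner` comes from this comment, while the `openshift-cnv` namespace and the `hostpath-provisioner-operator` Deployment name are guesses, and `oc rollout restart` stands in for "edit the daemonset". The script only builds the command strings and previews the patch on a local dict; it does not touch a cluster.

```python
import json
import shlex

# The SCC and DaemonSet name ("hostpath-provisioner") comes from the comment
# above. The namespace and the operator Deployment name are assumptions.
NAMESPACE = "openshift-cnv"
OPERATOR_DEPLOYMENT = "hostpath-provisioner-operator"
HPP_NAME = "hostpath-provisioner"


def projected_volume_patch():
    """JSON Patch (RFC 6902) that appends "projected" to the SCC's .volumes list."""
    return [{"op": "add", "path": "/volumes/-", "value": "projected"}]


def preview_patch(scc):
    """Return a copy of an SCC-like dict with the patch's append applied locally."""
    patched = dict(scc)
    patched["volumes"] = list(scc.get("volumes", []))
    for op in projected_volume_patch():
        patched["volumes"].append(op["value"])
    return patched


def workaround_commands():
    """Build the three oc invocations as strings; nothing is executed."""
    patch = shlex.quote(json.dumps(projected_volume_patch()))
    return [
        f"oc scale deployment/{OPERATOR_DEPLOYMENT} --replicas=0 -n {NAMESPACE}",
        f"oc patch scc {HPP_NAME} --type=json -p {patch}",
        f"oc rollout restart daemonset/{HPP_NAME} -n {NAMESPACE}",
    ]


if __name__ == "__main__":
    for command in workaround_commands():
        print(command)
```

Once a fixed operator build is available, the operator would be scaled back up so that it reconciles the SCC itself.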

(In reply to Dan Kenigsberg from comment #9)
> @akalenyu can you point to the OpenShift RC.1 change that
> triggered this bug (bz, jira, pr)?

The reason we hit this is that BoundServiceAccountTokenVolume feature gate is now enabled by default in k8s 1.21:
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#overview
https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

(Thanks awels for these findings).
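
To illustrate the failure mode, here is a minimal sketch of the volume-type check, assuming a simplified model in which an SCC is just a list of allowed volume types. The volume lists below are made up for the example and are not the actual HPP SCC or pod spec. With the feature gate on, the service account token is mounted as a `projected` volume rather than a `secret` volume, so an SCC that does not list `projected` rejects the pod.

```python
def disallowed_volume_types(scc_volumes, pod_volumes):
    """Return the pod volume types that the SCC does not permit, in pod order."""
    # An SCC that lists "*" permits every volume type.
    if "*" in scc_volumes:
        return []
    allowed = set(scc_volumes)
    return [volume for volume in pod_volumes if volume not in allowed]


if __name__ == "__main__":
    # Hypothetical SCC that predates the feature gate.
    scc_volumes = ["hostPath", "secret"]

    # Token mounted as a secret volume: the pod is admitted.
    print(disallowed_volume_types(scc_volumes, ["hostPath", "secret"]))

    # Token mounted as a projected volume: "projected" is rejected.
    print(disallowed_volume_types(scc_volumes, ["hostPath", "projected"]))

    # After adding "projected" to the SCC, the pod is admitted again.
    print(disallowed_volume_types(scc_volumes + ["projected"], ["hostPath", "projected"]))
```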

As for the reason we only see it in RC of OCP, changelog of FC.9 (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.8.0-fc.9)
points to Bug 1946479 which would explain why the feature gate was disabled on the OCP side before.

--- Additional comment from Dan Kenigsberg on 2021-06-30 10:02:15 UTC ---

Alex, would it be possible for a customer to disable this openshift feature gate? If so it would be a valid workaround that should be documented here.

--- Additional comment from Fabian Deutsch on 2021-06-30 10:27:12 UTC ---

IIUC, these are three distinct steps:

- Scale down HPP operator --> oc scale
- Manually add `- projected` to the HPP SCC's .volumes[] (named hostpath-provisioner) --> oc patch?
- Edit daemonset to trigger attempt to start pods (named hostpath-provisioner as well) --> oc delete -l… ?

Once we have a fix then it's about scaling it up again: oc scale …

Is this correct?

--- Additional comment from Yan Du on 2021-06-30 10:54:54 UTC ---

(In reply to Dan Kenigsberg from comment #12)
> Alex, would it be possible for a customer to disable this openshift feature
> gate? If so it would be a valid workaround that should be documented here.

I'm not sure I understand that correctly. I think that to disable the BoundServiceAccountTokenVolume feature gate, we would need to stop the kubelet on all nodes and restart it without the BoundServiceAccountTokenVolume parameter. Scaling down HCO and replacing the HPP operator image is probably better than that.

Comment 1 Yan Du 2021-07-08 05:40:26 UTC
Tested on OCP 4.8.0-rc.3 with CNV 2.6.5; the issue cannot be reproduced.

Comment 2 Yan Du 2021-07-09 02:46:21 UTC
Tested on CNV 2.6.6 with hostpath-provisioner-operator-container-v2.6.6-3.

The issue has been fixed.

Comment 7 errata-xmlrpc 2021-08-10 17:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119

