Bug 1977179 - PVC keeps in pending when using hostpath-provisioner
Summary: PVC keeps in pending when using hostpath-provisioner
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.8.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Alexander Wels
QA Contact: Yan Du
URL:
Whiteboard:
Depends On: 1977383
Blocks: 1977756
 
Reported: 2021-06-29 06:38 UTC by Yan Du
Modified: 2021-08-10 20:02 UTC (History)
CC List: 10 users

Fixed In Version: hostpath-provisioner-rhel8-operator v4.8.0-17
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1977756
Environment:
Last Closed: 2021-07-27 14:32:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt hostpath-provisioner-operator pull 110 0 None closed Add projected volume to SCC to be compatible with Open Shift 4.8 2021-06-30 01:27:17 UTC
Github kubevirt hostpath-provisioner-operator pull 111 0 None closed [release-v0.8] Add projected volume to SCC to be compatible with Open Shift 4.8 2021-06-30 01:27:21 UTC
Red Hat Product Errata RHSA-2021:2920 0 None None None 2021-07-27 14:33:33 UTC

Description Yan Du 2021-06-29 06:38:38 UTC
Description of problem:
The PVC stays in Pending when using hostpath-provisioner.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-202106281541.p0.git.1077b05.assembly.stream-1077b05
Server Version: 4.8.0-rc.1
Kubernetes Version: v1.21.0-rc.0+766a5fe
$ oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES                                  PHASE
openshift-cnv                          kubevirt-hyperconverged-operator.v4.8.0        OpenShift Virtualization      4.8.0                   kubevirt-hyperconverged-operator.v2.6.5   Succeeded
openshift-local-storage                local-storage-operator.4.7.0-202102110027.p0   Local Storage                 4.7.0-202102110027.p0                                             Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                                                            Succeeded
openshift-storage                      ocs-operator.v4.8.0-431.ci                     OpenShift Container Storage   4.8.0-431.ci                                                      Succeeded


How reproducible:
Always

Steps to Reproduce:
1. Create a dv with hostpath-provisioner
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv1
spec:
  source:
    http:
      url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt
2. Create a vm to consume the dv
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm1
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 512M
        devices:
          rng: {}
          disks:
          - disk:
              bus: virtio
            name: dv1
      volumes:
      - name: dv1
        dataVolume:
          name: dv1
  running: true
3. 


Actual results:
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get dv
NAME   PHASE                  PROGRESS   RESTARTS   AGE
dv1    WaitForFirstConsumer   N/A                   21m
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get pvc
NAME   STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS           AGE
dv1    Pending                                      hostpath-provisioner   22m
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get pod
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-crrvd   0/1     Pending   0          21m
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get vm
NAME   AGE   VOLUME
vm1    23m   
[cnv-qe-jenkins@infra-debug3b-cgz2l-executor cnv-tests]$ oc get vmi
NAME   AGE   PHASE     IP    NODENAME
vm1    23m   Pending   


Expected results:
The PVC is bound and the VM is running.

Additional info:
Detailed log attached

Comment 1 Yan Du 2021-06-29 06:41:29 UTC
hpp build: hostpath-provisioner-operator-container-v4.8.0-16

Comment 5 Alexander Wels 2021-06-29 14:39:25 UTC
Created fix in attached PR link

Basically, the SCC needed to be modified to include projected volumes, which are enabled by default for all pods in 4.8.
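
A minimal sketch of the change described above, assuming the SCC is handled as a plain dictionary. The helper name `ensure_volume_allowed` and the trimmed-down SCC object are hypothetical and used for illustration only; this is not the operator's actual code from the linked PR. The starting volume list (`hostPath`, `secret`) is assumed from the SCC listing that appears later in this thread.

```python
def ensure_volume_allowed(scc: dict, volume_type: str = "projected") -> dict:
    """Return a copy of the SCC with volume_type present in its volumes list."""
    updated = dict(scc)
    volumes = list(scc.get("volumes") or [])
    # Only append when missing, so repeated calls do not create duplicates.
    if volume_type not in volumes:
        volumes.append(volume_type)
    updated["volumes"] = volumes
    return updated


# Hypothetical, trimmed-down SCC; a real object carries many more fields.
scc = {
    "kind": "SecurityContextConstraints",
    "metadata": {"name": "hostpath-provisioner"},
    "volumes": ["hostPath", "secret"],
}

fixed = ensure_volume_allowed(scc)
print(fixed["volumes"])
```

The helper is idempotent, which matters for anything that runs in a reconcile loop: applying it to an SCC that already allows projected volumes leaves the list unchanged.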

Comment 6 Alexander Wels 2021-06-29 14:45:41 UTC
backport to 4.8 branch

Comment 7 Bartosz Rybacki 2021-06-30 07:37:22 UTC
Fixed in: hostpath-provisioner-operator v4.8.0-17

hco bundle: v4.8.0-444

Comment 9 Dan Kenigsberg 2021-06-30 08:39:12 UTC
@akalenyu can you point to the OpenShift RC.1 change that triggered this bug (bz, jira, pr)?

Comment 10 Fabian Deutsch 2021-06-30 08:42:52 UTC
Is there any workaround for this bug that an admin could perform?

Comment 11 Alex Kalenyuk 2021-06-30 09:42:45 UTC
(In reply to Fabian Deutsch from comment #10)
> Is there any workaround for this bug that an admin could perform?

I don't think we can work around this in production, as any workaround involves scaling down our operator so that the SCC doesn't get reconciled:
- Scale down HPP operator
- Manually add `- projected` to the HPP SCC's .volumes[] (named hostpath-provisioner)
- Edit daemonset to trigger attempt to start pods (named hostpath-provisioner as well)

(Or if we want to verify this without redeploying CNV, we could scale down HCO and replace the HPP operator image in the cluster).

(In reply to Dan Kenigsberg from comment #9)
> @akalenyu can you point to the OpenShift RC.1 change that
> triggered this bug (bz, jira, pr)?

The reason we hit this is that the BoundServiceAccountTokenVolume feature gate is now enabled by default in k8s 1.21:
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#overview
https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

(Thanks awels for these findings).

As for the reason we only see it in RC of OCP, changelog of FC.9 (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.8.0-fc.9)
points to Bug 1946479 which would explain why the feature gate was disabled on the OCP side before.
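
To illustrate why the feature gate matters here: with BoundServiceAccountTokenVolume enabled, the service account token is mounted through a `projected` volume rather than a `secret` volume, so a pod admitted under an SCC that does not list `projected` is rejected. The sketch below is a simplified model of that check, not the real SCC admission logic. The function name, the volume names, and the host path are hypothetical.

```python
def disallowed_volume_types(pod_spec: dict, scc: dict) -> list:
    """List the volume types used by the pod that the SCC does not allow."""
    allowed = set(scc.get("volumes") or [])
    if "*" in allowed:  # an SCC may allow every volume type with "*"
        return []
    # Each volume entry has a "name" plus one key naming its source type.
    used = {
        key
        for volume in pod_spec.get("volumes", [])
        for key in volume
        if key != "name"
    }
    return sorted(used - allowed)


# Hypothetical pod spec: one hostPath volume plus the injected token volume.
pod_spec = {
    "volumes": [
        {"name": "pv-volume", "hostPath": {"path": "/var/hpvolumes"}},
        {
            "name": "kube-api-access-abcde",
            "projected": {"sources": [{"serviceAccountToken": {"path": "token"}}]},
        },
    ]
}

scc_before = {"volumes": ["hostPath", "secret"]}
scc_after = {"volumes": ["hostPath", "secret", "projected"]}

print(disallowed_volume_types(pod_spec, scc_before))
print(disallowed_volume_types(pod_spec, scc_after))
```

Under these assumptions, the first call reports `projected` as the offending type and the second reports nothing, which matches the behaviour seen before and after the SCC change.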

Comment 12 Dan Kenigsberg 2021-06-30 10:02:15 UTC
Alex, would it be possible for a customer to disable this openshift feature gate? If so it would be a valid workaround that should be documented here.

Comment 13 Fabian Deutsch 2021-06-30 10:27:12 UTC
IIUC, these are three distinct steps:

- Scale down HPP operator --> oc scale
- Manually add `- projected` to the HPP SCC's .volumes[] (named hostpath-provisioner) --> oc patch?
- Edit daemonset to trigger attempt to start pods (named hostpath-provisioner as well) --> oc delete -l… ?

Once we have a fix then it's about scaling it up again: oc scale …

Is this correct?

Comment 14 Yan Du 2021-06-30 10:54:54 UTC
(In reply to Dan Kenigsberg from comment #12)
> Alex, would it be possible for a customer to disable this openshift feature
> gate? If so it would be a valid workaround that should be documented here.

Not sure I understand that correctly. I think that to disable the BoundServiceAccountTokenVolume feature gate, we would need to stop the kubelet on all nodes and restart it without the BoundServiceAccountTokenVolume parameter. Scaling down HCO and replacing the HPP operator image may be better than this.

Comment 15 Yan Du 2021-06-30 13:57:05 UTC
Verified the workaround and it works well

1. oc scale deployment/hostpath-provisioner-operator --replicas=0

2. oc edit scc hostpath-provisioner, adding '- projected':
volumes:
- hostPath
- secret
- projected

3. edit/patch the daemonset to trigger a reconcile on it 

4. oc scale deployment/hostpath-provisioner-operator --replicas=1

After that, create the dv and vm with hostpath-provisioner sc:
$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
dv1    Bound    pvc-af0d07a3-8b6b-436b-91b9-83af8bd00213   245Gi      RWO            hostpath-provisioner   25s
$ oc get vmi
NAME   AGE   PHASE     IP             NODENAME
vm1    19s   Running   10.131.1.223   infra-debug3a-5g7ld-worker-0-2hd8n
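
The four steps above could be scripted roughly as follows. This sketch only builds and prints the `oc` command lines; it does not run them. It rests on several assumptions that are not confirmed in this thread: that the operator deployment and the daemonset live in the `openshift-cnv` namespace, that a JSON patch is an acceptable stand-in for the manual `oc edit`, and that `oc rollout restart` is an acceptable way to make the daemonset retry pod creation. The function name `workaround_commands` is hypothetical.

```python
import json
import shlex

NAMESPACE = "openshift-cnv"  # assumption: where the HPP operator and daemonset run


def workaround_commands(namespace: str = NAMESPACE) -> list:
    """Build the oc invocations for the workaround, without executing them."""
    # JSON patch that appends "projected" to the SCC's .volumes list.
    patch = json.dumps([{"op": "add", "path": "/volumes/-", "value": "projected"}])
    operator = "deployment/hostpath-provisioner-operator"
    return [
        ["oc", "scale", operator, "--replicas=0", "-n", namespace],
        # SCCs are cluster-scoped, so no namespace flag is passed here.
        ["oc", "patch", "scc", "hostpath-provisioner", "--type=json", "-p", patch],
        ["oc", "rollout", "restart", "daemonset/hostpath-provisioner", "-n", namespace],
        ["oc", "scale", operator, "--replicas=1", "-n", namespace],
    ]


commands = workaround_commands()
for argv in commands:
    print(shlex.join(argv))
```

Note that the JSON patch appends unconditionally, so it should be applied only once; running it against an SCC that already lists `projected` would add a duplicate entry.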

Comment 16 Yan Du 2021-07-02 08:33:22 UTC
Test passed with hostpath-provisioner-operator-container-v4.8.0-17

still waiting for new OCP rc1 build to verify the bug

Comment 17 Yan Du 2021-07-06 04:01:54 UTC
The issue is fixed with the versions below:

Client Version: 4.8.0-202106281541.p0.git.1077b05.assembly.stream-1077b05
Server Version: 4.8.0-rc.1
Kubernetes Version: v1.21.0-rc.0+766a5fe
hostpath-provisioner-operator-container-v4.8.0-17

Comment 18 Yan Du 2021-07-06 06:46:31 UTC
Retest on Server Version: 4.8.0-rc.3, hpp works fine.

Move bug to verified

Comment 21 errata-xmlrpc 2021-07-27 14:32:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

