Bug 1708128 - PVC events are lacking in detail - hard to understand root cause for simple failures
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jan Safranek
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-09 08:13 UTC by Qixuan Wang
Modified: 2020-10-01 07:45 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-01 07:42:42 UTC
Target Upstream Version:



Description Qixuan Wang 2019-05-09 08:13:24 UTC
Description of problem:
If local storage is the default storage class in the cluster, creating a VM via the web console Wizard fails during the import process. The generated VM YAML doesn't specify storageClassName (here, the local storage class hdd) in the PVC spec, so the PVC cannot bind.
  

Version-Release number of selected component (if applicable):
cnv-libvirt-container-v1.4.0-6.1556622302


How reproducible:
100%


Steps to Reproduce:
Set the local storage class as the cluster default (see the sketch below).
Follow the Wizard in the web console to create a VM.
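
For reference, marking a storage class as the cluster default comes down to the storageclass.kubernetes.io/is-default-class annotation. A minimal sketch of the StorageClass used here (the name hdd and the no-provisioner come from the oc get sc output below; WaitForFirstConsumer binding mode is an assumption based on the PVC events):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hdd
  annotations:
    # This annotation is what makes a class the cluster default.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer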


Actual results:

Pod status
Importer Error
"0/3 nodes are available: 1 node(s) didn't match node selector, 2 node(s) didn't find available persistent volumes to bind.

========================================================================

PVC status
PVC is pending with message "waiting for first consumer to be created before binding"

========================================================================

PV status
[root@cnv-executor-shanks-master1 ~]# oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                STORAGECLASS   REASON    AGE
local-pv-17957d4f   6521Mi     RWO            Delete           Available                        hdd                      22h
local-pv-25e8c722   6521Mi     RWO            Delete           Available                        hdd                      22h
local-pv-3c0d2778   6521Mi     RWO            Delete           Available                        hdd                      14h
local-pv-4327fee4   6521Mi     RWO            Delete           Available                        hdd                      22h
local-pv-6017277d   6521Mi     RWO            Delete           Available                        hdd                      15h
local-pv-72584191   6521Mi     RWO            Delete           Available                        hdd                      22h
local-pv-7b1a6494   6521Mi     RWO            Delete           Bound       shanks/upload-test   hdd                      1h
local-pv-aa631362   6521Mi     RWO            Delete           Bound       shanks/cirros-dv     hdd                      1h
local-pv-c71f73db   6521Mi     RWO            Delete           Available                        hdd                      22h

=========================================================================

Which storage class is used
[root@cnv-executor-shanks-master1 ~]# oc get sc
NAME                PROVISIONER                    AGE
glusterfs-storage   kubernetes.io/glusterfs        23h
hdd (default)       kubernetes.io/no-provisioner   22h

=========================================================================


apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  annotations:
    name.os.template.cnv.io/rhel7.6: Red Hat Enterprise Linux 7.6
  selfLink: /apis/kubevirt.io/v1alpha3/namespaces/test/virtualmachines/qwang-test
  resourceVersion: '289061'
  name: qwang-test
  uid: af77cb5a-7220-11e9-b335-fa163eb3676b
  creationTimestamp: '2019-05-09T06:07:11Z'
  generation: 1
  namespace: test
  labels:
    app: qwang-test
    flavor.template.cnv.io/small: 'true'
    os.template.cnv.io/rhel7.6: 'true'
    template.cnv.ui: openshift_rhel7-generic-small
    vm.cnv.io/template: rhel7-generic-small
    workload.template.cnv.io/generic: 'true'
spec:
  dataVolumeTemplates:
    - metadata:
        name: rootdisk-qwang-test
      spec:
        pvc:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
        source:
          http:
            url: >-
              https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
  running: true
  template:
    metadata:
      labels:
        vm.cnv.io/name: qwang-test
    spec:
      domain:
        cpu:
          cores: 1
          sockets: 1
          threads: 1
        devices:
          disks:
            - bootOrder: 1
              disk:
                bus: virtio
              name: rootdisk
          interfaces:
            - bridge: {}
              name: nic0
          rng: {}
        resources:
          requests:
            memory: 2G
      networks:
        - name: nic0
          pod: {}
      terminationGracePeriodSeconds: 0
      volumes:
        - dataVolume:
            name: rootdisk-qwang-test
          name: rootdisk



Expected results:
The PVC should bind to a PV, the import process should complete, and the VM should run.
Users expect a kind of "one-button" VM creation in the UI; they don't care which storage is used behind the scenes.


Additional info:

Comment 1 Qixuan Wang 2019-05-09 08:37:14 UTC
If storageClassName is not specified in the PVC spec, the default storage class should be used. I'm not sure why this default local storage class has the problem.
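
For reference, a minimal sketch of a PVC that relies on the default class (the name is made up for illustration); the DefaultStorageClass admission plugin is expected to inject the cluster default (hdd here) because storageClassName is omitted:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc   # hypothetical name, for illustration only
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  # storageClassName is intentionally omitted; the DefaultStorageClass
  # admission plugin fills it in with the cluster default (hdd here).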

Comment 2 Tomas Jelinek 2019-05-09 10:29:13 UTC
I believe the UI uses the API correctly and the issue is in the underlying storage.

@Adam, can you please have a look at the provided YAML and confirm whether it is the correct way to express that we want to use the default storage class? And if so, why is it not working? Thank you!

Comment 3 Tomas Jelinek 2019-05-14 07:34:20 UTC
Moving to storage for handling there. Please feel free to reassign to UI if you believe it is not using the API correctly.

Comment 4 Adam Litke 2019-05-14 12:39:22 UTC
This seems like a configuration problem with your cluster.  Are all of your nodes marked as schedulable to run VMs?  Do you have local volumes available on all nodes?

Comment 6 Qixuan Wang 2019-05-21 08:08:58 UTC
(In reply to Adam Litke from comment #4)
> Are all of your nodes marked as schedulable to run VMs?  
Yes. Nodes are schedulable
[root@cnv-executor-qwang-master1 ~]# oc get node -o yaml  | grep schedulable
      kubevirt.io/schedulable: "true"
      kubevirt.io/schedulable: "true"

> Do you have local volumes available on all nodes?
Yes. 
vdc                                                                               253:32   0   20G  0 disk
|-vg_local_storage-lv_local1                                                      252:5    0  6.6G  0 lvm  /mnt/local-storage/hdd/disk1
|-vg_local_storage-lv_local2                                                      252:6    0  6.6G  0 lvm  /mnt/local-storage/hdd/disk2
`-vg_local_storage-lv_local3                                                      252:7    0  6.6G  0 lvm  /mnt/local-storage/hdd/disk3

Comment 7 Adam Litke 2019-05-28 20:54:15 UTC
Qixuan, Thanks for providing information.  From the info it seems 2/3 nodes are schedulable.  Does each node have three PVs configured?  I'm concerned that the situation is that the only node with available PVs remaining happens to be unschedulable.  Please also show me the result of 'oc describe pvc rootdisk-qwang-test'

Comment 9 Qixuan Wang 2019-05-29 10:59:34 UTC
[root@cnv-executor-qwang-master1 ~]# oc describe pvc rootdisk-qwang-vm-cirros
Name:          rootdisk-qwang-vm-cirros
Namespace:     bug
StorageClass:  hdd
Status:        Pending
Volume:
Labels:        app=containerized-data-importer
               cdi-controller=rootdisk-qwang-vm-cirros
Annotations:   cdi.kubevirt.io/storage.contentType=kubevirt
               cdi.kubevirt.io/storage.import.endpoint=https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
               cdi.kubevirt.io/storage.import.importPodName=importer-rootdisk-qwang-vm-cirros-pt52n
               cdi.kubevirt.io/storage.import.source=http
               cdi.kubevirt.io/storage.pod.phase=Pending
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Events:
  Type    Reason                Age                From                         Message
  ----    ------                ----               ----                         -------
  Normal  WaitForFirstConsumer  12s (x15 over 3m)  persistentvolume-controller  waiting for first consumer to be created before binding

Comment 10 Adam Litke 2019-06-04 12:54:12 UTC
@Qixuan, maybe it would be faster if you could provide me an environment where this is happening for you and I can take a look.

Comment 11 Qixuan Wang 2019-06-05 10:05:49 UTC
I can reproduce it with CNV 1.4 async build. I didn't see this problem with CNV 2.0.

Comment 12 Adam Litke 2019-06-11 11:55:56 UTC
Hi Qixuan,

I looked at your environment and your PVC is requesting 10Gi of storage but your PVs are only 6521Mi.  Please retry the steps with a smaller VM disk size or provide larger PVs.
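
For example, the dataVolumeTemplates request from the description could be lowered so it fits within the 6521Mi local PVs. A sketch of the relevant part of the spec (5Gi is an arbitrary value that fits; everything else is unchanged from the original YAML):

dataVolumeTemplates:
  - metadata:
      name: rootdisk-qwang-test
    spec:
      pvc:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi   # must fit within the 6521Mi capacity of the local PVs
      source:
        http:
          url: >-
            https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img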

Comment 13 Qixuan Wang 2019-06-11 12:01:56 UTC
Thanks, Adam. I'm not sure why a VM created from the UI wizard needs 10Gi of storage for the rootdisk. Is 10Gi the result of a tradeoff?

Comment 14 Adam Litke 2019-06-11 12:25:02 UTC
We need to look at adding more informative events on PVCs to quickly indicate such a simple problem (no PVs of the requested size available).

Comment 15 Tomas Jelinek 2019-07-31 14:22:58 UTC
As per Comment 14, the problem is that the PVC events are lacking in detail, making it hard to figure out the root cause of the issue. Changing the title accordingly and moving to openshift/storage.

Comment 16 Jan Safranek 2019-07-31 16:04:21 UTC
The pod has events like "2 node(s) didn't find available persistent volumes to bind"; maybe we can send those to the PVC too (Michelle approves :-)

Comment 22 Jan Safranek 2020-06-17 09:32:25 UTC
This bug has been fixed upstream with https://github.com/kubernetes/kubernetes/pull/91455


PVC events should look like this:

Events:
  Type    Reason                Age                From                         Message
  ----    ------                ----               ----                         -------
  Normal  WaitForFirstConsumer  20s (x6 over 87s)  persistentvolume-controller  waiting for first consumer to be created before binding
  Normal  WaitForPodScheduled   5s                 persistentvolume-controller  waiting for pod pod-0 to be scheduled

This should give the user a hint that something may be wrong with pod-0 scheduling.

Waiting for 1.19 rebase to land.

Comment 23 Jan Safranek 2020-07-08 08:31:26 UTC
Waiting for 1.19 rebase to land.

Comment 24 Jan Safranek 2020-07-29 16:52:58 UTC
Rebase (rc2) has landed; please check if it's OK.

Comment 27 Chao Yang 2020-08-18 07:02:08 UTC
Verified on 4.6.0-0.nightly-2020-08-16-072105
1. oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
local-pv-4b27b3af   1Gi        RWO            Delete           Available           lvs-file                22h
2. Create a PVC requesting 2Gi of storage
3. Create a pod that uses the PVC (a sketch of the objects from steps 2-3 follows the pod events output below)
4. oc describe pvc
Events:
  Type       Reason                Age                From                         Message
  ----       ------                ----               ----                         -------
  Normal     WaitForFirstConsumer  36s (x2 over 43s)  persistentvolume-controller  waiting for first consumer to be created before binding
  Normal     WaitForPodScheduled   6s (x2 over 21s)   persistentvolume-controller  waiting for pod pod1 to be scheduled

5. oc describe pod
Events:
  Type     Reason            Age        From  Message
  ----     ------            ----       ----  -------
  Warning  FailedScheduling  <unknown>        0/6 nodes are available: 3 node(s) didn't find available persistent volumes to bind, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
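
For completeness, the PVC and pod from steps 2-3 were roughly the following (an illustrative sketch; the object names, image, and mount path are assumptions, while the 2Gi request, the lvs-file class, and the pod name pod1 come from the output above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1   # assumed name
spec:
  storageClassName: lvs-file
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi   # intentionally larger than the 1Gi local PV, to trigger the new events
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1   # matches the pod name in the PVC events above
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi   # assumed image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc1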

Comment 28 Jan Safranek 2020-10-01 07:42:42 UTC
We won't release any 4.1 update.

