Bug 1867593
| Summary: | Clone of large DV remains stuck at a certain point in progress | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Alex Kalenyuk <akalenyu> |
| Component: | Storage | Assignee: | Adam Litke <alitke> |
| Status: | CLOSED NOTABUG | QA Contact: | Alex Kalenyuk <akalenyu> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.4.0 | CC: | cnv-qe-bugs, fdeutsch, gouyang, ndevos, ngavrilo, ratamir, yadu, ycui |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | 2.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-12 18:53:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I can reproduce the issue on my cluster as well.

Client Version: 4.5.0-202007240519.p0-b66f2d3
Server Version: 4.5.4
Kubernetes Version: v1.18.3+012b3ec
CNV 2.4

$ oc get dv
NAME        PHASE             PROGRESS   RESTARTS   AGE
dv-source   Succeeded         100.0%     0          10h
dv-target   CloneInProgress   41.86%     0          10h

$ oc get pod
NAME                                              READY   STATUS    RESTARTS   AGE
5d9a5650-bef0-4159-9dfb-dfaab43bef5b-source-pod   1/1     Running   0          10h
cdi-upload-dv-target                              1/1     Running   0          10h

$ oc logs cdi-upload-dv-target
I0810 15:00:15.880462 1 uploadserver.go:63] Upload destination: /dev/cdi-block-volume
I0810 15:00:15.880695 1 uploadserver.go:65] Running server on 0.0.0.0:8443
I0810 15:00:47.259486 1 uploadserver.go:319] Content type header is "blockdevice-clone"
I0810 15:00:47.259625 1 data-processor.go:277] Calculating available size
I0810 15:00:47.262408 1 data-processor.go:285] Checking out block volume size.
I0810 15:00:47.262454 1 data-processor.go:297] Request image size not empty.
I0810 15:00:47.262540 1 data-processor.go:302] Target size 38Gi.
I0810 15:00:47.288226 1 data-processor.go:206] New phase: TransferDataFile
I0810 15:00:47.290368 1 util.go:161] Writing data...

Some updates:

CNV 2.4.0: this issue seems to be specific to OCS; the HPP clone was successful (NFS still ongoing):
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ oc get dv
NAME            PHASE       PROGRESS   RESTARTS   AGE
dv-source-hpp   Succeeded   100.0%     0          74m
dv-target-hpp   Succeeded   100.0%     0          69m

NAME          AGE   PHASE     IP            NODENAME
vmi-windows   41s   Running   10.128.2.47   akalenyu24-pwwpj-worker-4fq6v

CNV 2.5: this reproduces on 2.5 (OCS):
[cnv-qe-jenkins@akalenyu25-gkvrt-executor clone-hangs]$ oc get dv -w
NAME        PHASE             PROGRESS   RESTARTS   AGE
dv-source   Succeeded         100.0%                98m
dv-target   CloneInProgress   45.84%                60m
dv-target   CloneInProgress   45.84%                60m

I'm planning to try cloning the slightly smaller win_19.qcow2 image and will post an update.

Ran the same scenario with the win_19.qcow2 10G image (as opposed to the raw image, which is 30G) on the 2.4.0 env:
It hits 100%, but stays in CloneInProgress:
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ oc get dv -w
NAME PHASE PROGRESS RESTARTS AGE
dv-source Succeeded 100.0% 0 27m
dv-target CloneInProgress 17.06% 0 16m
dv-target CloneInProgress 17.10% 0 16m
dv-target CloneInProgress 17.12% 0 16m
...
dv-target CloneInProgress 99.68% 0 36m
dv-target CloneInProgress 99.94% 0 36m
dv-target CloneInProgress 100.00% 0 36m
dv-target CloneInProgress 100.00% 0 36m
dv-target CloneInProgress 100.00% 0 36m
dv-target CloneInProgress 100.00% 0 36m
...
dv-target CloneInProgress 100.00% 0 67m
dv-target CloneInProgress 100.00% 0 67m
dv-target CloneInProgress 100.00% 0 67m
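
When the clone stalls like this, the two pods CDI creates for a host-assisted clone and the target PVC are the first things to check. A hedged sketch using the object names from this run (the clone source pod name is generated per clone, so the <uid> below is a placeholder):

oc describe dv dv-target         # DataVolume conditions and bound PVC
oc describe pvc dv-target        # events from the storage provisioner
oc logs cdi-upload-dv-target     # target-side writer (uploadserver / data-processor)
oc get pods | grep source-pod    # the clone source pod is named <uid>-source-pod
oc logs <uid>-source-pod         # source-side reader; <uid> is a placeholder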
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ cat dv-target.yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: dv-target
spec:
  source:
    pvc:
      namespace: default
      name: dv-source
  pvc:
    accessModes:
      - ReadWriteMany
    storageClassName: ocs-storagecluster-ceph-rbd
    volumeMode: Block
    resources:
      requests:
        storage: 38Gi
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ cat dv-source.yaml
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source
spec:
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 38Gi
    storageClassName: ocs-storagecluster-ceph-rbd
    volumeMode: Block
  source:
    http:
      url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/windows-images/win_19.qcow2"
In both cases (the 30G raw image and the 10G qcow2), once the clone DV gets stuck at a certain percentage, I can no longer create any new OCS DV at all:
The importer pod of the new DV just stays in Pending; manifest, pod status, and describe output below:
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ cat cirros-dv.yaml
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source-cirros
spec:
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 1Gi
    storageClassName: ocs-storagecluster-ceph-rbd
    volumeMode: Block
  source:
    http:
      url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2"
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ oc create -f cirros-dv.yaml
datavolume.cdi.kubevirt.io/dv-source-cirros created
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ oc get dv dv-source-cirros
NAME PHASE PROGRESS RESTARTS AGE
dv-source-cirros Pending N/A 0 4m42s
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ oc get pod importer-dv-source-cirros
NAME READY STATUS RESTARTS AGE
importer-dv-source-cirros 0/1 Pending 0 4m47s
[cnv-qe-jenkins@akalenyu24-pwwpj-executor clone-dv]$ oc describe pod importer-dv-source-cirros
Name: importer-dv-source-cirros
Namespace: default
Priority: 0
Node: <none>
Labels: app=containerized-data-importer
        cdi.kubevirt.io=importer
        prometheus.cdi.kubevirt.io=
Annotations: cdi.kubevirt.io/storage.createdByController: yes
Status: Pending
IP:
IPs: <none>
Controlled By: PersistentVolumeClaim/dv-source-cirros
Containers:
  importer:
    Image:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-cdi-importer@sha256:75c1bc2fd3484318a47d7753a92ece00eae4297ede3ac76aaa4f0d534688ec82
    Port:       8443/TCP
    Host Port:  0/TCP
    Args:
      -v=1
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:     0
      memory:  0
    Environment:
      IMPORTER_SOURCE:       http
      IMPORTER_ENDPOINT:     http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
      IMPORTER_CONTENTTYPE:  kubevirt
      IMPORTER_IMAGE_SIZE:   1Gi
      OWNER_UID:             b7645616-fca7-4000-bfb7-de9deeaae85d
      INSECURE_TLS:          false
      IMPORTER_DISK_ID:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-g4h5h (ro)
    Devices:
      /dev/cdi-block-volume from cdi-data-vol
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cdi-data-vol:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  dv-source-cirros
    ReadOnly:   false
  default-token-g4h5h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-g4h5h
    Optional:    false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  67s (x7 over 5m39s)  default-scheduler  running "VolumeBinding" filter plugin for pod "importer-dv-source-cirros": pod has unbound immediate PersistentVolumeClaims
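
The FailedScheduling message means the PVC was never bound, i.e. the OCS provisioner did not deliver a PV for it. A few hedged checks on the provisioning side (the label and container names below are the usual rook-ceph CSI ones and may differ between OCS versions):

oc describe pvc dv-source-cirros                                       # provisioning events on the claim itself
oc get events --field-selector involvedObject.name=dv-source-cirros
oc -n openshift-storage get pods                                       # are the CSI provisioner pods healthy?
oc -n openshift-storage logs -l app=csi-rbdplugin-provisioner -c csi-provisioner --tail=50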
Update: OCS configured with 50G:
Apparently our envs are configured with 50G for OCS, so that might be the cause of this failure. I went through some of the OCS pod logs but couldn't find any indication that I ran out of capacity. Where else could I look for such an error?

NFS: Can't confirm anything regarding NFS yet, since I can't get an NFS DV at all at the moment:
[cnv-qe-jenkins@akalenyu24-pwwpj-executor cnv-tests]$ oc get dv dv-source-nfs
NAME            PHASE              PROGRESS   RESTARTS   AGE
dv-source-nfs   ImportInProgress   39.25%     7          99m
Will look further into this.

Update: NFS: NFS successfully clones the large 30G win19.raw image, so this bug is specific to OCS. We should also test this on RHCOS BM, in case it's mostly an issue with PSI.

Hi Niels, can we get some OCS expert eyes on this data?

In order to look into the OCS behaviour, warnings, errors, and failures, we need an OCS must-gather as well. Please see https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.4/html/troubleshooting_openshift_container_storage/downloading-log-files-and-diagnostic-information_rhocs on how to gather these details. Clear steps on how to reproduce this would help as well. It is also important to mention:
- the exact sizes of the qcow2 images (in bytes)
- the versions of the containers used (these will be in the OCS must-gather too)

Niels, any updates on this?

Hi,

Looking at this BZ it seems like OCS behaves as it is currently designed. When a cluster reaches, and passes, the last threshold of 95%, it becomes read-only and all operations will fail. Deleting a PVC involves a write operation, which means it will fail too, as described above.

For real-world scenarios, clusters will have multiple TB of storage, which will give the cluster admin enough time to add capacity as soon as the first alert pops up.

If CDI can query for the available size and block the operation if it would exceed the available capacity, that would be the right solution IMO.

Thoughts?

(In reply to Raz Tamir from comment #13)
> Hi,
>
> Looking at this BZ it seems like OCS behaves as it is currently designed.
> When a cluster reaches, and passes, the last threshold of 95%, it becomes
> read-only and all operations will fail. Deleting a PVC involves a write
> operation, which means it will fail too, as described above.
>
> For real-world scenarios, clusters will have multiple TB of storage, which
> will give the cluster admin enough time to add capacity as soon as the
> first alert pops up.
>
> If CDI can query for the available size and block the operation if it
> would exceed the available capacity, that would be the right solution IMO.
>
> Thoughts?

This is about PVC creation. If the OCS provisioner gives us a PVC to use, we will assume (like any other application) that we can use it. Now, if freezing writes is truly the OCS design, then we should close this as NOTABUG (although I still think OCS should exempt certain PVCs from the write block (i.e. etcd) in order to avoid having a completely broken cluster).

Niels, how would you recommend we proceed?

(In reply to Adam Litke from comment #14)
...
> This is about PVC creation. If the OCS provisioner gives us a PVC to use,
> we will assume (like any other application) that we can use it. Now, if
> freezing writes is truly the OCS design, then we should close this as
> NOTABUG (although I still think OCS should exempt certain PVCs from the
> write block (i.e. etcd) in order to avoid having a completely broken
> cluster).
>
> Niels, how would you recommend we proceed?
We could probably come up with a way to do thick provisioning for PVCs backed by RBD volumes (thin provisioning is the default for RBD). This would likely result in an option on the StorageClass; there is no easy way to set the option only for selected provision requests. So, as a result, we may end up with two StorageClasses: one for thin provisioning (default, current behaviour) and a second for thick provisioning (new, optional, internally writing zeros after thin provisioning).

Or, when host-assisted cloning in CDI is used, you could consider running fallocate(2) with FALLOC_FL_ZERO_RANGE and verifying that the device has sufficient backing storage before copying the data onto it. Although I am not sure this fails in a timely manner; it may result in such a blocking write as well...

It seems that OCS is working as it was designed to, so I am closing this bug. I don't necessarily agree with the behavior, but bugs around the 85% threshold should be reported against OCS, not CNV.
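
For reference on the earlier question about where to look for a capacity error: if the rook-ceph toolbox pod is deployed (a hedged sketch; the toolbox is optional in OCS and its label/namespace may differ per version), raw usage and the near-full/full states can be read straight from Ceph:

TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage rsh "$TOOLS" ceph status        # HEALTH_WARN/HEALTH_ERR, e.g. "full osd(s)"
oc -n openshift-storage rsh "$TOOLS" ceph df            # raw and per-pool usage
oc -n openshift-storage rsh "$TOOLS" ceph osd df        # per-OSD %USE vs the nearfull/full ratios
oc -n openshift-storage rsh "$TOOLS" ceph health detail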
Description of problem:
Cloning a large DV (Windows) gets stuck in CloneInProgress at a certain point.

Version-Release number of selected component (if applicable):
2.4.0

How reproducible:
100%

Steps to Reproduce:
1. Create a DV from a large image (used win19.raw)
2. Clone this DV

Actual results:
The DV gets stuck at a certain percentage in CloneInProgress.

Expected results:
The target DV reaches Succeeded.

Additional info:
[cnv-qe-jenkins@natalie-7gvlk-executor clone-ocs]$ cat dv-source.yaml
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source
spec:
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 38Gi
    storageClassName: ocs-storagecluster-ceph-rbd
    volumeMode: Block
  source:
    http:
      url: "http://PATH/win19.raw"

[cnv-qe-jenkins@natalie-7gvlk-executor clone-ocs]$ cat dv-target.yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-target
spec:
  source:
    pvc:
      namespace: default
      name: dv-source
  pvc:
    accessModes:
      - ReadWriteMany
    storageClassName: ocs-storagecluster-ceph-rbd
    volumeMode: Block
    resources:
      requests:
        storage: 38Gi

Progress: http://pastebin.test.redhat.com/891979
Pods: http://pastebin.test.redhat.com/891981

Note: Did not delete the DV, since the first time this occurred, deleting the DV resulted in env pods dying. As of now, it is still in CloneInProgress at the same percentage.