Bug 1967086

Summary: Cloning DataVolumes between namespaces fails while creating cdi-upload pod
Product: Container Native Virtualization (CNV)
Reporter: nijin ashok <nashok>
Component: Storage
Assignee: Alexander Wels <awels>
Status: CLOSED ERRATA
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.5.5
CC: alitke, awels, cnv-qe-bugs, kgershon, yadu
Target Milestone: ---
Target Release: 2.6.6
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: v2.6.6-37 registry-proxy.engineering.redhat.com/rh-osbs/iib:89865
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1982269 (view as bug list)
Environment:
Last Closed: 2021-08-10 17:33:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1982269

Description nijin ashok 2021-06-02 12:26:13 UTC
Description of problem:

While cloning a DataVolume between namespaces, the clone is scheduled but never starts.

$ oc get dvs
NAME                   PHASE            PROGRESS   RESTARTS   AGE
dv-tests-cloning-001   CloneScheduled   N/A                   30s

The PVC status is "Bound".

$ oc get pvc
NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
dv-tests-cloning-001   Bound    pvc-326a53af-b8f7-4328-a30f-76ec1d21ee21   12Gi       RWO            ocs-external-storagecluster-ceph-rbd   33s

But there is no cdi-upload pod.

The cdi-deployment logs show the error "Pod \"cdi-upload-dv-tests-cloning-001\" is invalid: spec.containers[0].resources.requests: Invalid value: \"1m\": must be less than or equal to cpu limit".

===
{"level":"error","ts":1622451245.2904956,"logger":"controller","msg":"Reconciler error","controller":"upload-controller","name":"dv-tests-cloning-001","namespace":"tests-cloning","error":"Pod \"cdi-upload-dv-tests-cloning-001\" is invalid: spec.containers[0].resources.requests: Invalid value: \"1m\": must be less than or equal to cpu limit","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/kubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
===

As per my understanding, pod.Spec.Containers[0].Resources is populated from the defaultPodResourceRequirements in the CDIConfig, which here has all-zero defaults:

===
$ oc get cdiconfig -o yaml
apiVersion: v1
items:
- apiVersion: cdi.kubevirt.io/v1beta1
  kind: CDIConfig
  status:
    defaultPodResourceRequirements:
      limits:
        cpu: "0"
        memory: "0"
      requests:
        cpu: "0"
        memory: "0"
===

I cannot find a way to see the pod spec sent by the CDI controller, but it looks like it is sending a request higher than the limit. However, that doesn't make sense since the CDIConfig has the all-zero defaults shown above.

There are no quotas or limits defined for the namespace.

The permissions are also mapped correctly.

Version-Release number of selected component (if applicable):

2.5.5

How reproducible:

Observed in a customer environment and not reproduced locally.

Steps to Reproduce:

1. Clone a DataVolume between namespaces (the issue was observed during cross-namespace cloning).

Actual results:

Cloning DataVolumes between namespaces fails while creating the cdi-upload pod.

Expected results:

Cloning should work.

Additional info:

Comment 2 Alexander Wels 2021-06-04 17:31:48 UTC
Can you check if the target namespace has a LimitRange defined?
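Something like the following should show whether one exists (tests-cloning is the target namespace from the log above):

$ oc get limitrange -n tests-cloning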

Comment 3 nijin ashok 2021-06-07 02:53:40 UTC
(In reply to Alexander Wels from comment #2)
> Can you check if the target namespace has a LimitRange defined?

The target namespace doesn't have a LimitRange defined.

Comment 9 Alexander Wels 2021-06-28 11:43:55 UTC
So I triple-checked the code, and there is nothing we do that sets the limits or requests to anything other than what is specified in the defaultPodResourceRequirements in the CDIConfig object (which is set from the CDI CR). So there must be a mutating webhook somewhere that automatically modifies those values, and the usual suspect is a LimitRange for those fields. However, as we saw, the must-gather doesn't report anything about a LimitRange, and there is no cluster-wide LimitRange object in OpenShift.
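
For reference, a LimitRange is a namespaced object of roughly the following shape (the name and values here are hypothetical, purely to illustrate the kind of object being ruled out); its defaultRequest/default values are applied only to containers that do not set their own requests/limits:

===
apiVersion: v1
kind: LimitRange
metadata:
  name: example-limits      # hypothetical name
  namespace: tests-cloning
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 1m               # applied as a container's CPU request when none is set
===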

That being said, all zeros is probably not a great default. After some testing, the following values turned out to be reasonable defaults, and we created a PR to apply them when nothing is specified:

CPULimit: 750m (3/4 of a CPU max)
MemLimit: 600M (600M of memory max)
CPURequest: 100m (1/10 of a CPU minimum)
MemRequest: 60M (60M of memory minimum)

These should be sufficient for most workloads. As a workaround, you can set those values in the CDI CR, and we can see if that lets them continue testing. The linked PR makes these the default values.
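
A minimal sketch of that workaround, assuming the podResourceRequirements field under spec.config in the CDI CR (verify the field name against the CDI version in use):

===
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi
spec:
  config:
    podResourceRequirements:   # assumed field; feeds defaultPodResourceRequirements in the CDIConfig status
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
===

Once applied, the values should show up under status.defaultPodResourceRequirements in the CDIConfig (oc get cdiconfig -o yaml, as in the output above).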

Comment 12 Yan Du 2021-07-07 12:31:23 UTC
Moving back to POST because we haven't modified the release branch yet. Please attach the cherry-pick PR to this bug.

Comment 25 errata-xmlrpc 2021-08-10 17:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119