Bug 1751193

Summary: DataVolume CRD is missing
Product: Container Native Virtualization (CNV)
Reporter: Marek Libra <mlibra>
Component: Storage
Assignee: Michael Henriksen <mhenriks>
Status: CLOSED DUPLICATE
QA Contact: Qixuan Wang <qixuan.wang>
Severity: medium
Priority: medium
Version: 2.1.0
CC: alitke, bthurber, cnv-qe-bugs, fsimonce, mhenriks, ngavrilo, rhallise, ycui
Target Milestone: ---
Target Release: 2.2.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-01-07 19:31:46 UTC
Type: Bug

Description Marek Libra 2019-09-11 11:25:08 UTC
Description of problem:
The DataVolume CRD is removed after CNV installation. A condition (unknown to me at the moment) leads to a rollback of the CDI deployment during the operator's reconciliation.

How reproducible:
Not sure at the moment.


Actual results:
$ oc get crd|grep -i datavolume
$ oc get crd datavolumes.cdi.kubevirt.io
  Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "datavolumes.cdi.kubevirt.io" not found

Additional info:
$ oc describe cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster
Name:         cdi-hyperconverged-cluster
Namespace:
Labels:       app=hyperconverged-cluster
Annotations:  <none>
API Version:  cdi.kubevirt.io/v1alpha1
Kind:         CDI
Metadata:
  Creation Timestamp:  2019-09-10T08:43:44Z
  Generation:          21
  Owner References:
    API Version:           hco.kubevirt.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  HyperConverged
    Name:                  hyperconverged-cluster
    UID:                   ae3b782b-d30f-11e9-9484-5254005adc66
  Resource Version:        753210
  Self Link:               /apis/cdi.kubevirt.io/v1alpha1/cdis/cdi-hyperconverged-cluster
  UID:                     18e8776a-d3a7-11e9-985c-5254005adc66
Spec:
Status:
  Conditions:
    Last Heartbeat Time:   2019-09-10T08:50:03Z
    Last Transition Time:  2019-09-10T08:43:44Z
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2019-09-10T08:50:03Z
    Last Transition Time:  2019-09-10T08:43:44Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2019-09-10T08:50:03Z
    Last Transition Time:  2019-09-10T08:43:44Z
    Message:               ConfigMap not owned by cr
    Reason:                ConfigError
    Status:                True
    Type:                  Degraded
  Phase:                   Error
Events:                    <none>

Comment 2 Ryan Hallisey 2019-09-16 15:02:15 UTC
Adding the CDI operator folks. The CDI operator is responsible for managing the DataVolume API. Can you provide the cdi-operator logs?

cc mhenriks alitke
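
For reference, the operator logs can be grabbed like this (a sketch; this assumes the operator runs in the openshift-cnv namespace under the usual deployment name):

$ oc logs deployment/cdi-operator -n openshift-cnv --tail=200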

Comment 3 Michael Henriksen 2019-09-16 15:29:41 UTC
Looked at this last week. Basically, the CDI operator was unable to create a special configmap, which caused the CDI CR to get into the "Error" phase. That phase is terminal and requires manual intervention. I did that on the test system (deleted the "status.phase" property of the CDI CR), and everything CDI-related ended up getting created correctly (including the DataVolume CRD, of course).

IMO this particular failure should not have required manual intervention, though.
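
The same intervention can be scripted instead of done through an interactive edit (a sketch; this assumes the CDI CRD does not enable the status subresource, so a plain JSON patch can modify status):

$ oc patch cdi cdi-hyperconverged-cluster --type=json \
    -p '[{"op": "remove", "path": "/status/phase"}]'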

Comment 5 Adam Litke 2019-09-16 20:13:34 UTC
Since there are no reproduction steps, we need logs in order to understand what might have happened here. If we can't have either of those, I think we will have to close this bug as not reproducible.

Comment 6 Marek Libra 2019-09-17 07:05:06 UTC
I observed this issue on one of QE's environments.
Additional changes have been made there since Michael looked into the issue last week, so I am not able to grab logs or additional info at the moment.

IIUC, QE is currently waiting for a working HCO release so they can redeploy to the latest CNV version. Once that is ready, we will repeat the testing and add either logs or reproduction steps here.

Comment 7 Brett Thurber 2019-09-18 06:48:38 UTC
@Marek, wouldn't this impact VM creation in general?

Comment 8 Marek Libra 2019-09-18 06:56:39 UTC
Sure, most (if not all) of CDI would be effectively broken.

Comment 10 Ying Cui 2019-09-18 12:42:14 UTC
Tested this issue on the latest HCO build, hco-bundle-registry-v2.1.0-47. Cannot reproduce it.

We set severity and priority to Medium in the bug scrub; we first need to see how to trigger this issue.

$ oc get crd|grep -i datavolume
datavolumes.cdi.kubevirt.io                                      2019-09-16T23:25:37Z

$ oc get crd datavolumes.cdi.kubevirt.io
NAME                          CREATED AT
datavolumes.cdi.kubevirt.io   2019-09-16T23:25:37Z

$ oc describe cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster
Name:         cdi-hyperconverged-cluster
Namespace:    
Labels:       app=hyperconverged-cluster
Annotations:  <none>
API Version:  cdi.kubevirt.io/v1alpha1
Kind:         CDI
Metadata:
  Creation Timestamp:  2019-09-16T23:25:37Z
  Finalizers:
    operator.cdi.kubevirt.io
  Generation:  9
  Owner References:
    API Version:           hco.kubevirt.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  HyperConverged
    Name:                  hyperconverged-cluster
    UID:                   958ba315-d83d-11e9-aa51-fa163ee2fcab
  Resource Version:        850308
  Self Link:               /apis/cdi.kubevirt.io/v1alpha1/cdis/cdi-hyperconverged-cluster
  UID:                     49e512ca-d8d9-11e9-a645-fa163ed3ad0d
Spec:
Status:
  Conditions:
    Last Heartbeat Time:   2019-09-16T23:26:42Z
    Last Transition Time:  2019-09-16T23:25:38Z
    Message:               Deployment Completed
    Reason:                DeployCompleted
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2019-09-16T23:26:42Z
    Last Transition Time:  2019-09-16T23:25:38Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2019-09-16T23:26:42Z
    Last Transition Time:  2019-09-16T23:26:42Z
    Status:                False
    Type:                  Degraded
  Observed Version:        v2.1.0-14
  Operator Version:        v2.1.0-14
  Phase:                   Deployed
  Target Version:          v2.1.0-14
Events:                    <none>

Comment 11 Adam Litke 2019-09-18 13:26:35 UTC
We don't anticipate this happening very often, and there is a remediation step (listed below), so pushing this out to 2.1.1.

Remediation:
To work around this issue, you can either delete the CDI CR (do not do this if you already have DataVolumes in your environment) or edit the CDI CR's status phase, setting it to empty.
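
In command form (a sketch; the resource name assumes the default HCO deployment shown in this bug):

# Option 1: delete the CDI CR -- only if there are no DataVolumes yet
$ oc delete cdi cdi-hyperconverged-cluster

# Option 2: clear the status phase (see also comment 3)
$ oc edit cdi cdi-hyperconverged-cluster
# then delete the 'phase:' line under 'status:'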

Comment 12 Marek Libra 2019-10-11 11:49:46 UTC
I am facing the same issue again, on a different environment, 4 days after its deployment.
Still not sure about reproduction steps, as the environment was already in the CDI-failing state when I got to it.

$ oc get crd|grep -i datavolume
# empty response

$ oc describe cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster
Name:         cdi-hyperconverged-cluster
Namespace:    
Labels:       app=hyperconverged-cluster
Annotations:  <none>
API Version:  cdi.kubevirt.io/v1alpha1
Kind:         CDI
Metadata:
  Creation Timestamp:  2019-10-09T05:25:11Z
  Generation:          17
  Owner References:
    API Version:           hco.kubevirt.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  HyperConverged
    Name:                  hyperconverged-cluster
    UID:                   8674375d-e90d-11e9-a447-5254003e64e0
  Resource Version:        3105791
  Self Link:               /apis/cdi.kubevirt.io/v1alpha1/cdis/cdi-hyperconverged-cluster
  UID:                     2a77a00a-ea55-11e9-b9de-5254003e64e0
Spec:
Status:
  Conditions:
    Last Heartbeat Time:   2019-10-11T08:08:36Z
    Last Transition Time:  2019-10-09T05:25:12Z
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2019-10-11T08:08:36Z
    Last Transition Time:  2019-10-09T05:25:12Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2019-10-11T08:08:36Z
    Last Transition Time:  2019-10-09T05:25:12Z
    Message:               Reconciling to error state, no configmap
    Reason:                ConfigError
    Status:                True
    Type:                  Degraded
  Phase:                   Error
Events:                    <none>



Cluster version is 4.2.0-0.nightly-2019-10-07-011045

Comment 13 Marek Libra 2019-10-11 11:55:02 UTC
Per the suggestion above, I did

$ oc edit cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster

and removed the "Status.Phase" row; the operator restarted the deployment and the CRD was back.
But the cdi-uploadproxy-9bf4c55dc-vxwmn pod is in CreateContainerConfigError:

$ oc describe pod cdi-uploadproxy-9bf4c55dc-vxwmn
Name:               cdi-uploadproxy-9bf4c55dc-vxwmn
Namespace:          openshift-cnv
Priority:           0
PriorityClassName:  <none>
Node:               working-wgxsr-worker-0-xlvtj/192.168.126.52
Start Time:         Fri, 11 Oct 2019 13:49:52 +0200
Labels:             cdi.kubevirt.io=cdi-uploadproxy
                    operator.cdi.kubevirt.io/createVersion=v2.1.0-20
                    pod-template-hash=9bf4c55dc
Annotations:        k8s.v1.cni.cncf.io/networks-status:
                      [{
                          "name": "openshift-sdn",
                          "interface": "eth0",
                          "ips": [
                              "10.130.0.94"
                          ],
                          "default": true,
                          "dns": {}
                      }]
                    openshift.io/scc: restricted
Status:             Running
IP:                 10.130.0.94
Controlled By:      ReplicaSet/cdi-uploadproxy-9bf4c55dc
Containers:
  cdi-uploadproxy:
    Container ID:  cri-o://d3c54815f51329b686532d2b8aec1055b3e7133b15a218c00226c03b56c8b126
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20
    Image ID:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy@sha256:05bd9aac15ab9ce821dc0a5a4401f87cd2e81716c19824738e2d16c497f2c836
    Port:          <none>
    Host Port:     <none>
    Args:
      -v=1
    State:          Running
      Started:      Fri, 11 Oct 2019 13:51:07 +0200
    Ready:          True
    Restart Count:  0
    Readiness:      http-get https://:8443/healthz delay=2s timeout=1s period=5s #success=1 #failure=3
    Environment:
      APISERVER_PUBLIC_KEY:       <set to the key 'id_rsa.pub' in secret 'cdi-api-signing-key'>        Optional: false
      UPLOAD_SERVER_CLIENT_KEY:   <set to the key 'tls.key' in secret 'cdi-upload-server-client-key'>  Optional: false
      UPLOAD_SERVER_CLIENT_CERT:  <set to the key 'tls.crt' in secret 'cdi-upload-server-client-key'>  Optional: false
      UPLOAD_SERVER_CA_CERT:      <set to the key 'ca.crt' in secret 'cdi-upload-server-client-key'>   Optional: false
      SERVICE_TLS_KEY:            <set to the key 'tls.key' in secret 'cdi-upload-proxy-server-key'>   Optional: false
      SERVICE_TLS_CERT:           <set to the key 'tls.crt' in secret 'cdi-upload-proxy-server-key'>   Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cdi-uploadproxy-token-989xt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  cdi-uploadproxy-token-989xt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cdi-uploadproxy-token-989xt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Normal   Scheduled  2m2s               default-scheduler                      Successfully assigned openshift-cnv/cdi-uploadproxy-9bf4c55dc-vxwmn to working-wgxsr-worker-0-xlvtj
  Normal   Pulling    114s               kubelet, working-wgxsr-worker-0-xlvtj  Pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20"
  Normal   Pulled     90s                kubelet, working-wgxsr-worker-0-xlvtj  Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20"
  Warning  Failed     89s (x2 over 90s)  kubelet, working-wgxsr-worker-0-xlvtj  Error: secrets "cdi-api-signing-key" not found
  Warning  Failed     63s (x2 over 74s)  kubelet, working-wgxsr-worker-0-xlvtj  Error: secrets "cdi-upload-server-client-key" not found
  Normal   Pulled     48s (x4 over 89s)  kubelet, working-wgxsr-worker-0-xlvtj  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20" already present on machine
  Normal   Created    47s                kubelet, working-wgxsr-worker-0-xlvtj  Created container cdi-uploadproxy
  Normal   Started    47s                kubelet, working-wgxsr-worker-0-xlvtj  Started container cdi-uploadproxy

Comment 14 Michael Henriksen 2019-10-11 13:01:52 UTC
The "CreateContainerConfigError" error always happens when CDI is initially deployed but should resolve eventually once cdi-apiserver/cdi-deployment start up fully and kubernetes retries creating the cdi-uploadproxy pod.

Comment 15 Adam Litke 2019-10-15 11:52:20 UTC
This issue is very rare and there is a workaround, so moving it out to 2.2.

Comment 16 Adam Litke 2019-11-06 21:42:47 UTC
@Michael, is there anything to be done here?

Comment 17 Michael Henriksen 2019-11-13 14:11:39 UTC
Yeah, the operator should retry creating the cdi-config configmap when appropriate.

Comment 18 Adam Litke 2020-01-07 19:31:46 UTC

*** This bug has been marked as a duplicate of bug 1781336 ***