Bug 1751193
| Summary: | DataVolume CRD is missing | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Marek Libra <mlibra> |
| Component: | Storage | Assignee: | Michael Henriksen <mhenriks> |
| Status: | CLOSED DUPLICATE | QA Contact: | Qixuan Wang <qixuan.wang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.1.0 | CC: | alitke, bthurber, cnv-qe-bugs, fsimonce, mhenriks, ngavrilo, rhallise, ycui |
| Target Milestone: | --- | | |
| Target Release: | 2.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-07 19:31:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marek Libra
2019-09-11 11:25:08 UTC
Adding CDI operator folks. The CDI operator is responsible for managing the Data Volume API. Can you provide the cdi-operator logs? cc mhenriks alitke

Looked at this last week. Basically, the CDI operator was unable to create a special configmap. That caused the CDI CR to get into the "Error" phase, which is terminal and requires manual intervention. I did that on the test system (deleted the "status.phase" property of the CDI CR) and everything CDI-related ended up getting created correctly (including the DataVolume CRD, of course). IMO this particular failure should not have required manual intervention, though. Since there are no reproduction steps, we need logs in order to understand what might have happened here. If we can't have either of those things, I think we will have to close this bug as not reproducible.

It was one of the QE environments I observed this issue on. As there were additional changes performed since last week, when Michael looked into this issue there, I am not able to grab logs or additional info at the moment. IIUC, QE is currently waiting for a working HCO release so they can redeploy to the latest CNV version. When ready, we will repeat testing and potentially add either logs or reproduction steps here.

@Marek, wouldn't this impact VM creation in general?

Sure, most (if not all) of CDI would be effectively broken.

Tested this issue on the latest HCO build, hco-bundle-registry-v2.1.0-47. Can NOT reproduce it. We set severity and priority to Medium in the bug scrub; we first need to figure out how to trigger this issue.

```
$ oc get crd | grep -i datavolume
datavolumes.cdi.kubevirt.io   2019-09-16T23:25:37Z

$ oc get crd datavolumes.cdi.kubevirt.io
NAME                          CREATED AT
datavolumes.cdi.kubevirt.io   2019-09-16T23:25:37Z

$ oc describe cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster
Name:         cdi-hyperconverged-cluster
Namespace:
Labels:       app=hyperconverged-cluster
Annotations:  <none>
API Version:  cdi.kubevirt.io/v1alpha1
Kind:         CDI
Metadata:
  Creation Timestamp:  2019-09-16T23:25:37Z
  Finalizers:
    operator.cdi.kubevirt.io
  Generation:  9
  Owner References:
    API Version:           hco.kubevirt.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  HyperConverged
    Name:                  hyperconverged-cluster
    UID:                   958ba315-d83d-11e9-aa51-fa163ee2fcab
  Resource Version:        850308
  Self Link:               /apis/cdi.kubevirt.io/v1alpha1/cdis/cdi-hyperconverged-cluster
  UID:                     49e512ca-d8d9-11e9-a645-fa163ed3ad0d
Spec:
Status:
  Conditions:
    Last Heartbeat Time:   2019-09-16T23:26:42Z
    Last Transition Time:  2019-09-16T23:25:38Z
    Message:               Deployment Completed
    Reason:                DeployCompleted
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2019-09-16T23:26:42Z
    Last Transition Time:  2019-09-16T23:25:38Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2019-09-16T23:26:42Z
    Last Transition Time:  2019-09-16T23:26:42Z
    Status:                False
    Type:                  Degraded
  Observed Version:  v2.1.0-14
  Operator Version:  v2.1.0-14
  Phase:             Deployed
  Target Version:    v2.1.0-14
Events:  <none>
```

We don't anticipate this happening very often and there is a remediation step (listed below), so pushing out to 2.1.1.

Remediation: To work around this issue you can either delete the CDI CR (don't do this if you already have DataVolumes in your environment) or edit the CDI CR status phase and set it to empty.
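For reference, a non-interactive form of that remediation might look like the following minimal sketch. It assumes the CR name shown in the `oc describe` output above, and that removing `status.phase` with a JSON patch behaves the same as deleting the row in `oc edit` (plausible for this v1alpha1 CR, where status is stored on the main resource rather than a separate subresource):

```shell
# Hedged sketch of the documented workaround: clear the terminal "Error"
# phase on the CDI CR so the operator re-enters its reconcile loop.
oc patch cdi cdi-hyperconverged-cluster --type=json \
  -p '[{"op": "remove", "path": "/status/phase"}]'

# Afterwards, the operator should recreate the missing DataVolume CRD.
oc get crd datavolumes.cdi.kubevirt.io
```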
I face the same issue again, in a different environment, 4 days after its deployment. Still not sure about reproduction steps, as when I got to the environment it was already in the CDI-failing state.

```
$ oc get crd | grep -i datavolume
# empty response

$ oc describe cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster
Name:         cdi-hyperconverged-cluster
Namespace:
Labels:       app=hyperconverged-cluster
Annotations:  <none>
API Version:  cdi.kubevirt.io/v1alpha1
Kind:         CDI
Metadata:
  Creation Timestamp:  2019-10-09T05:25:11Z
  Generation:          17
  Owner References:
    API Version:           hco.kubevirt.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  HyperConverged
    Name:                  hyperconverged-cluster
    UID:                   8674375d-e90d-11e9-a447-5254003e64e0
  Resource Version:        3105791
  Self Link:               /apis/cdi.kubevirt.io/v1alpha1/cdis/cdi-hyperconverged-cluster
  UID:                     2a77a00a-ea55-11e9-b9de-5254003e64e0
Spec:
Status:
  Conditions:
    Last Heartbeat Time:   2019-10-11T08:08:36Z
    Last Transition Time:  2019-10-09T05:25:12Z
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2019-10-11T08:08:36Z
    Last Transition Time:  2019-10-09T05:25:12Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2019-10-11T08:08:36Z
    Last Transition Time:  2019-10-09T05:25:12Z
    Message:               Reconciling to error state, no configmap
    Reason:                ConfigError
    Status:                True
    Type:                  Degraded
  Phase:                   Error
Events:  <none>
```

Cluster version is 4.2.0-0.nightly-2019-10-07-011045.

Per the suggestion above, I ran `oc edit cdi.cdi.kubevirt.io/cdi-hyperconverged-cluster` and removed the `Status.Phase` row. The operator started the deployment and the CRDs were back, but the cdi-uploadproxy-9bf4c55dc-vxwmn pod is in CreateContainerConfigError:

```
$ oc describe pod cdi-uploadproxy-9bf4c55dc-vxwmn
Name:               cdi-uploadproxy-9bf4c55dc-vxwmn
Namespace:          openshift-cnv
Priority:           0
PriorityClassName:  <none>
Node:               working-wgxsr-worker-0-xlvtj/192.168.126.52
Start Time:         Fri, 11 Oct 2019 13:49:52 +0200
Labels:             cdi.kubevirt.io=cdi-uploadproxy
                    operator.cdi.kubevirt.io/createVersion=v2.1.0-20
                    pod-template-hash=9bf4c55dc
Annotations:        k8s.v1.cni.cncf.io/networks-status:
                      [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.130.0.94" ], "default": true, "dns": {} }]
                    openshift.io/scc: restricted
Status:             Running
IP:                 10.130.0.94
Controlled By:      ReplicaSet/cdi-uploadproxy-9bf4c55dc
Containers:
  cdi-uploadproxy:
    Container ID:  cri-o://d3c54815f51329b686532d2b8aec1055b3e7133b15a218c00226c03b56c8b126
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20
    Image ID:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy@sha256:05bd9aac15ab9ce821dc0a5a4401f87cd2e81716c19824738e2d16c497f2c836
    Port:          <none>
    Host Port:     <none>
    Args:
      -v=1
    State:          Running
      Started:      Fri, 11 Oct 2019 13:51:07 +0200
    Ready:          True
    Restart Count:  0
    Readiness:      http-get https://:8443/healthz delay=2s timeout=1s period=5s #success=1 #failure=3
    Environment:
      APISERVER_PUBLIC_KEY:       <set to the key 'id_rsa.pub' in secret 'cdi-api-signing-key'>        Optional: false
      UPLOAD_SERVER_CLIENT_KEY:   <set to the key 'tls.key' in secret 'cdi-upload-server-client-key'>  Optional: false
      UPLOAD_SERVER_CLIENT_CERT:  <set to the key 'tls.crt' in secret 'cdi-upload-server-client-key'>  Optional: false
      UPLOAD_SERVER_CA_CERT:      <set to the key 'ca.crt' in secret 'cdi-upload-server-client-key'>   Optional: false
      SERVICE_TLS_KEY:            <set to the key 'tls.key' in secret 'cdi-upload-proxy-server-key'>   Optional: false
      SERVICE_TLS_CERT:           <set to the key 'tls.crt' in secret 'cdi-upload-proxy-server-key'>   Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cdi-uploadproxy-token-989xt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  cdi-uploadproxy-token-989xt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cdi-uploadproxy-token-989xt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Normal   Scheduled  2m2s               default-scheduler                      Successfully assigned openshift-cnv/cdi-uploadproxy-9bf4c55dc-vxwmn to working-wgxsr-worker-0-xlvtj
  Normal   Pulling    114s               kubelet, working-wgxsr-worker-0-xlvtj  Pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20"
  Normal   Pulled     90s                kubelet, working-wgxsr-worker-0-xlvtj  Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20"
  Warning  Failed     89s (x2 over 90s)  kubelet, working-wgxsr-worker-0-xlvtj  Error: secrets "cdi-api-signing-key" not found
  Warning  Failed     63s (x2 over 74s)  kubelet, working-wgxsr-worker-0-xlvtj  Error: secrets "cdi-upload-server-client-key" not found
  Normal   Pulled     48s (x4 over 89s)  kubelet, working-wgxsr-worker-0-xlvtj  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/container-native-virtualization/virt-cdi-uploadproxy:v2.1.0-20" already present on machine
  Normal   Created    47s                kubelet, working-wgxsr-worker-0-xlvtj  Created container cdi-uploadproxy
  Normal   Started    47s                kubelet, working-wgxsr-worker-0-xlvtj  Started container cdi-uploadproxy
```
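The failed events reference two secrets that had not been created yet. A minimal sketch for checking whether they have appeared (the secret names, namespace, and pod label are taken from the output above):

```shell
# Check for the secrets the kubelet reported as missing; once they exist,
# the pod's CreateContainerConfigError should clear on a later retry.
oc get secret cdi-api-signing-key cdi-upload-server-client-key -n openshift-cnv

# Watch the uploadproxy pod until it becomes Ready.
oc get pods -n openshift-cnv -l cdi.kubevirt.io=cdi-uploadproxy -w
```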
The "CreateContainerConfigError" error always happens when CDI is initially deployed, but it should resolve eventually once cdi-apiserver/cdi-deployment start up fully and Kubernetes retries creating the cdi-uploadproxy pod.

This issue is very rare and there is a workaround, so moving out to 2.2.

@Michael, is there anything to be done here?

Yeah, the operator should retry creating the cdi-config configmap when appropriate.

*** This bug has been marked as a duplicate of bug 1781336 ***
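For anyone hitting the same symptom, a hedged detection sketch that combines the checks discussed in this bug. The CR name comes from the outputs above; the `cdi-config` configmap name comes from the comment above, and its namespace is an assumption:

```shell
#!/bin/sh
# Sketch: detect the failure mode described in this bug.

# 1. Is the DataVolume CRD missing?
oc get crd datavolumes.cdi.kubevirt.io >/dev/null 2>&1 \
  || echo "DataVolume CRD missing"

# 2. Is the CDI CR stuck in the terminal Error phase?
phase=$(oc get cdi cdi-hyperconverged-cluster -o jsonpath='{.status.phase}')
[ "$phase" = "Error" ] && echo "CDI CR in Error phase; see remediation above"

# 3. Does the configmap the operator failed to create exist?
#    (Namespace is an assumption based on this deployment.)
oc get configmap cdi-config -n openshift-cnv >/dev/null 2>&1 \
  || echo "cdi-config configmap missing"
```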