Bug 2079916 - KubeVirt CR seems to be in DeploymentInProgress state and not recovering
Summary: KubeVirt CR seems to be in DeploymentInProgress state and not recovering
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.10.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.12.0
Assignee: Itamar Holder
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Duplicates: 2099635
Depends On:
Blocks:
Reported: 2022-04-28 13:36 UTC by Kedar Bidarkar
Modified: 2023-01-24 13:37 UTC (History)

Fixed In Version: hco-bundle-registry-container-v4.12.0-479
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-24 13:36:09 UTC
Target Upstream Version:
Embargoed:


Attachments
virt_operator1 (63.57 KB, text/plain), 2022-04-28 15:04 UTC, Kedar Bidarkar
virt_operator2 (127.19 KB, text/plain), 2022-04-28 15:05 UTC, Kedar Bidarkar


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt issues 8031 | 0 | None | open | virt-operator updates the `kubevirt.io/generation` annotation on its operands even when no reconcile needed | 2022-07-03 10:08:30 UTC
Red Hat Issue Tracker | CNV-17887 | 0 | None | None | None | 2022-11-08 14:00:53 UTC
Red Hat Product Errata | RHSA-2023:0408 | 0 | None | None | None | 2023-01-24 13:37:30 UTC

Description Kedar Bidarkar 2022-04-28 13:36:42 UTC
Description of problem:
KubeVirt CR seems to be in DeploymentInProgress state and not recovering

oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
ReconcileComplete	True	Reconcile completed successfully
Available	False	KubeVirt is not available: Deploying version sha256:48b123381f4aec379a24cd6bb2d641721919a7c4d95a6d42c7934a41177a0f37 with registry registry.redhat.io/container-native-virtualization
Progressing	True	KubeVirt is progressing: Deploying version sha256:48b123381f4aec379a24cd6bb2d641721919a7c4d95a6d42c7934a41177a0f37 with registry registry.redhat.io/container-native-virtualization
Degraded	False	Reconcile completed successfully
Upgradeable	False	KubeVirt is progressing: Deploying version sha256:48b123381f4aec379a24cd6bb2d641721919a7c4d95a6d42c7934a41177a0f37 with registry registry.redhat.io/container-native-virtualization





[kbidarka@localhost auth]$ oc get kubevirt kubevirt-kubevirt-hyperconverged -n openshift-cnv -o yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1alpha3
  creationTimestamp: "2022-04-27T14:49:03Z"
  ...
  name: kubevirt-kubevirt-hyperconverged
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: hco.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: HyperConverged
    name: kubevirt-hyperconverged
    uid: f19055d3-f566-4214-b596-9ae2be777f79
  resourceVersion: "1538042"
  uid: ab64cffc-c144-46a9-b083-1dae1e2eeddc
spec:
  certificateRotateStrategy:
    selfSigned:
      ca:
        duration: 48h0m0s
        renewBefore: 24h0m0s
      server:
        duration: 24h0m0s
        renewBefore: 12h0m0s
  configuration:
    developerConfiguration:
      diskVerification:
        memoryLimit: 2G
      featureGates:
      - DataVolumes
      - SRIOV
      - CPUManager
      - CPUNodeDiscovery
      - Snapshot
      - HotplugVolumes
      - ExpandDisks
      - GPU
      - HostDevices
      - DownwardMetrics
      - NUMA
      - WithHostModelCPU
      - HypervStrictCheck
      - SRIOVLiveMigration
      - LiveMigration
    machineType: pc-q35-rhel8.4.0
    migrations:
      completionTimeoutPerGiB: 800
      network: migration-nad
      parallelMigrationsPerCluster: 5
      parallelOutboundMigrationsPerNode: 2
      progressTimeout: 150
    network:
      defaultNetworkInterface: masquerade
    obsoleteCPUModels:
      "486": true
      Conroe: true
      athlon: true
      core2duo: true
      coreduo: true
      kvm32: true
      kvm64: true
      n270: true
      pentium: true
      pentium2: true
      pentium3: true
      pentiumpro: true
      phenom: true
      qemu32: true
      qemu64: true
    selinuxLauncherType: virt_launcher.process
    smbios:
      family: Red Hat
      manufacturer: Red Hat
      product: Container-native virtualization
      sku: 4.10.1
      version: 4.10.1
  customizeComponents: {}
  productComponent: compute
  productName: hyperconverged-cluster
  productVersion: 4.10.1
  uninstallStrategy: BlockUninstallIfWorkloadsExist
  workloadUpdateStrategy:
    batchEvictionInterval: 1m0s
    batchEvictionSize: 10
    workloadUpdateMethods:
    - LiveMigrate
status:
  conditions:
  - lastProbeTime: "2022-04-28T11:36:08Z"
    lastTransitionTime: "2022-04-28T11:36:08Z"
    message: Deploying version sha256:48b123381f4aec379a24cd6bb2d641721919a7c4d95a6d42c7934a41177a0f37
      with registry registry.redhat.io/container-native-virtualization
    reason: DeploymentInProgress
    status: "False"
    type: Available
  - lastProbeTime: "2022-04-28T11:36:08Z"
    lastTransitionTime: "2022-04-28T11:36:08Z"
    message: Deploying version sha256:48b123381f4aec379a24cd6bb2d641721919a7c4d95a6d42c7934a41177a0f37
      with registry registry.redhat.io/container-native-virtualization
    reason: DeploymentInProgress
    status: "True"
    type: Progressing
  - lastProbeTime: "2022-04-28T11:36:08Z"
    lastTransitionTime: "2022-04-28T11:36:08Z"
    message: Deploying version sha256:48b123381f4aec379a24cd6bb2d641721919a7c4d95a6d42c7934a41177a0f37
      with registry registry.redhat.io/container-native-virtualization
    reason: DeploymentInProgress
    status: "False"
    type: Degraded
  - lastProbeTime: "2022-04-28T08:23:53Z"
    lastTransitionTime: null
    message: All resources were created.
    reason: AllResourcesCreated
    status: "True"
    type: Created
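
The same check can be made programmatically instead of reading the conditions by eye. The following is only a minimal sketch, assuming the CR has been exported with `-o json` and parsed into a dict; the helper name `deployment_in_progress` and the trimmed `SAMPLE` document are illustrative and not part of any KubeVirt tooling.

```python
import json


def deployment_in_progress(kubevirt_cr: dict) -> bool:
    """Return True when the CR reports Progressing=True with reason DeploymentInProgress."""
    conditions = kubevirt_cr.get("status", {}).get("conditions", [])
    for cond in conditions:
        if (
            cond.get("type") == "Progressing"
            and cond.get("status") == "True"
            and cond.get("reason") == "DeploymentInProgress"
        ):
            return True
    return False


# Trimmed copy of the status shown in this report (messages and timestamps omitted).
SAMPLE = json.loads("""
{"status": {"conditions": [
  {"type": "Available",   "status": "False", "reason": "DeploymentInProgress"},
  {"type": "Progressing", "status": "True",  "reason": "DeploymentInProgress"},
  {"type": "Degraded",    "status": "False", "reason": "DeploymentInProgress"},
  {"type": "Created",     "status": "True",  "reason": "AllResourcesCreated"}
]}}
""")

if __name__ == "__main__":
    print(deployment_in_progress(SAMPLE))
```

In practice the dict would come from the JSON output of the `oc get kubevirt` command above rather than from the embedded sample.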

Version-Release number of selected component (if applicable):
4.10.1

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
The KubeVirt CR stays in the DeploymentInProgress state: the Progressing condition is True and the Available condition is False.

KubeVirt is not Ready.

Expected results:

KubeVirt is in Ready state.

Additional info:

Comment 1 Kedar Bidarkar 2022-04-28 15:04:22 UTC
Created attachment 1875704 [details]
virt_operator1

Comment 2 Kedar Bidarkar 2022-04-28 15:05:05 UTC
Created attachment 1875705 [details]
virt_operator2

Comment 5 sgott 2022-05-27 13:30:53 UTC
Is it possible to reproduce this?

Deferring to the next release due to capacity.

Comment 8 Debarati Basu-Nag 2022-06-24 22:14:11 UTC
Must gather is being attached. Since this is impacting smoke on 4.12, adding testblocker label.

Comment 15 Debarati Basu-Nag 2022-06-29 21:38:06 UTC
When a node maintenance CR is created, I see kubevirt.status.conditions staying in the DeploymentInProgress state until the CR is deleted:
============================
[cnv-qe-jenkins@c01-dbn-4012-8k679-executor ~]$ kubectl get kubevirt kubevirt-kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.conditions"
[
  {
    "lastProbeTime": "2022-06-29T21:32:36Z",
    "lastTransitionTime": "2022-06-29T21:32:36Z",
    "message": "Deploying version sha256:f9904655a1c579b7db4f55f621852795397c2a01c72d0c420c916ec8f0466024 with registry registry.redhat.io/container-native-virtualization",
    "reason": "DeploymentInProgress",
    "status": "False",
    "type": "Available"
  },
  {
    "lastProbeTime": "2022-06-29T21:32:36Z",
    "lastTransitionTime": "2022-06-29T21:32:36Z",
    "message": "Deploying version sha256:f9904655a1c579b7db4f55f621852795397c2a01c72d0c420c916ec8f0466024 with registry registry.redhat.io/container-native-virtualization",
    "reason": "DeploymentInProgress",
    "status": "True",
    "type": "Progressing"
  },
  {
    "lastProbeTime": "2022-06-29T21:32:36Z",
    "lastTransitionTime": "2022-06-29T21:32:36Z",
    "message": "Deploying version sha256:f9904655a1c579b7db4f55f621852795397c2a01c72d0c420c916ec8f0466024 with registry registry.redhat.io/container-native-virtualization",
    "reason": "DeploymentInProgress",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastProbeTime": "2022-06-24T03:59:37Z",
    "lastTransitionTime": null,
    "message": "All resources were created.",
    "reason": "AllResourcesCreated",
    "status": "True",
    "type": "Created"
  }
]
[cnv-qe-jenkins@c01-dbn-4012-8k679-executor ~]$
============================

Comment 16 lpivarc 2022-06-30 07:53:25 UTC
(In reply to Debarati Basu-Nag from comment #15)

This is expected behavior, as one of our handlers will be missing. We can look into whether this can be improved, but it is not related to the issue described in this bug.

Comment 18 lpivarc 2022-06-30 08:19:29 UTC
*** Bug 2099635 has been marked as a duplicate of this bug. ***

Comment 19 Igor Bezukh 2022-07-03 08:45:38 UTC
From my observations, whenever the KubeVirt CR is changed, KubeVirt unconditionally updates the `kubevirt.io/generation` annotation on its operands, even when an operand does not require reconciliation. For example, adding the `Spec.configuration.cpuModel` field to the KV CR changes the `kubevirt.io/generation` annotation of every operand.

This is indeed an issue: it may trigger a status flip-flop such as the one described in this bug.

However, the status recovers very quickly, as I can see in the reproductions. IMO this bug is not a blocker. It would be a blocker if the status in the KV CR were really stuck at deploying, but that is not the case.
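
To illustrate the difference between an unconditional and a conditional update of the annotation, here is a minimal sketch in Python. It is not virt-operator's code (which is written in Go) and is not necessarily the fix that shipped; the function `reconcile_operand`, the plain-dict operand layout, and the `kv_generation` parameter are all assumptions made for illustration.

```python
import copy
from typing import Tuple

GENERATION_ANNOTATION = "kubevirt.io/generation"


def reconcile_operand(existing: dict, desired_spec: dict, kv_generation: int) -> Tuple[dict, bool]:
    """Return (operand, changed); the operand is only touched when its spec differs."""
    if existing.get("spec") == desired_spec:
        # Nothing to reconcile: leave the generation annotation alone so that
        # no update (and no status flip-flop) is triggered for this operand.
        return existing, False

    updated = copy.deepcopy(existing)
    updated["spec"] = copy.deepcopy(desired_spec)
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[GENERATION_ANNOTATION] = str(kv_generation)
    return updated, True


if __name__ == "__main__":
    operand = {
        "metadata": {"annotations": {GENERATION_ANNOTATION: "3"}},
        "spec": {"replicas": 2},
    }
    # Same spec: the operand is returned untouched.
    print(reconcile_operand(operand, {"replicas": 2}, kv_generation=4))
    # Different spec: the operand is updated and the annotation is bumped.
    print(reconcile_operand(operand, {"replicas": 3}, kv_generation=4))
```

The unconditional behavior described in this comment corresponds to always taking the second branch, regardless of whether the spec changed.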

Comment 22 sgott 2022-07-05 21:32:38 UTC
Per Comment #19, the impact of this issue on clusters with OpenShift CNV deployed is that unnecessary reconciliations can occur. This was due to a conscious design decision to keep virt-operator operands with a common generation annotation. However, there doesn't appear to be an obvious reason why that is necessary. That is still being investigated.

Regardless of whether there is value in reconciling resources that did not otherwise change, this transient state passes extremely rapidly and should not even be noticed in typical deployments.

Reconciliation is also only triggered in the first place when the HCO CR is modified, so it is generally expected to be infrequent.

Because there is no danger of loss of data, and the cluster's ability to upgrade will not be impaired, risk of disruption to a cluster is minimal. Consequently we've removed the blocker flag and deferred this BZ to the next major release.

Comment 24 sgott 2022-07-12 13:13:14 UTC
To clarify comment #22, "extremely rapidly" isn't clear enough. The reconciliation usually takes just a few seconds, almost always less than 5.
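
Given that timing, a check that wants to tell a short transient from a genuinely stuck CR could compare the age of the Progressing condition against a grace period. The sketch below is one way to do that, assuming the conditions list has the shape shown earlier in this bug; the function names and the 60-second default are arbitrary illustrative choices, not values taken from KubeVirt.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def progressing_for(conditions: list, now: datetime) -> Optional[timedelta]:
    """Return how long Progressing has been True, or None if it is not True or has no timestamp."""
    for cond in conditions:
        if cond.get("type") == "Progressing" and cond.get("status") == "True":
            stamp = cond.get("lastTransitionTime")
            if stamp is None:
                return None
            started = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
            return now - started
    return None


def looks_stuck(conditions: list, now: datetime, grace: timedelta = timedelta(seconds=60)) -> bool:
    """Treat the CR as stuck only when Progressing has been True for longer than the grace period."""
    elapsed = progressing_for(conditions, now)
    return elapsed is not None and elapsed > grace


if __name__ == "__main__":
    conditions = [
        {"type": "Progressing", "status": "True", "lastTransitionTime": "2022-06-29T21:32:36Z"},
    ]
    shortly_after = datetime(2022, 6, 29, 21, 32, 39, tzinfo=timezone.utc)
    much_later = datetime(2022, 6, 29, 21, 40, 0, tzinfo=timezone.utc)
    print(looks_stuck(conditions, shortly_after), looks_stuck(conditions, much_later))
```

One limitation: if the status keeps flip-flopping, `lastTransitionTime` is reset on every flip, so a check like this could miss a CR that is flapping rather than continuously progressing.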

Comment 27 Denys Shchedrivyi 2022-11-18 18:24:16 UTC
Verified: we didn't see the issue during automation runs.

Comment 32 errata-xmlrpc 2023-01-24 13:36:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

