Bug 2058167

Summary: Post deploy on a baremetal cluster SSP is looping attempting to reconcile
Product: Container Native Virtualization (CNV) Reporter: Debarati Basu-Nag <dbasunag>
Component: SSP Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED ERRATA QA Contact: Geetika Kapoor <gkapoor>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.10.0 CC: agilboa, akrejcir, cnv-qe-bugs, dholler, rkishner, stirabos, ycui
Target Milestone: --- Keywords: Regression
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kubevirt-ssp-operator-container-v4.10.0-50, hco-bundle-registry-4.10.0-696 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-16 16:07:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
ssp log none

Description Debarati Basu-Nag 2022-02-24 12:21:20 UTC
Created attachment 1863175
ssp log

Description of problem:
After deploying the BM cluster bm02-cnvqe2-rdu2, I noticed that HCO is in a degraded state because SSP is not available. The SSP operator log shows that the operator is continuously attempting to reconcile and failing.

Version-Release number of selected component (if applicable):
4.10.0 - 686 

How reproducible:
Not sure.

Steps to Reproduce:
1. Not sure.

Actual results:
HCO Status condition:
====================
{
      "lastTransitionTime": "2022-02-23T16:59:54Z",
      "message": "Reconcile completed successfully",
      "observedGeneration": 73,
      "reason": "ReconcileCompleted",
      "status": "True",
      "type": "ReconcileComplete"
    },
    {
      "lastTransitionTime": "2022-02-24T00:24:04Z",
      "message": "SSP is not available: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPNotAvailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2022-02-24T00:24:04Z",
      "message": "SSP is progressing: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPProgressing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2022-02-24T00:46:30Z",
      "message": "SSP is degraded: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPDegraded",
      "status": "True",
      "type": "Degraded"
    },
    {
      "lastTransitionTime": "2022-02-24T00:24:04Z",
      "message": "SSP is progressing: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPProgressing",
      "status": "False",
      "type": "Upgradeable"
    }
===================
SSP status:
===================
{
  "conditions": [
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Progressing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Degraded",
      "status": "True",
      "type": "Degraded"
    }
  ],
  "observedGeneration": 6,
  "observedVersion": "4.10.0",
  "operatorVersion": "4.10.0",
  "phase": "Deploying",
  "targetVersion": "4.10.0"
}
The SSP operator log shows this message repeatedly:
=================
{"level":"error","ts":1645655734.9615152,"logger":"controller-runtime.manager.controller.ssp","msg":"Reconciler error","reconciler group":"ssp.kubevirt.io","reconciler kind":"SSP","name":"ssp-kubevirt-hyperconverged","namespace":"openshift-cnv","error":"Operation cannot be fulfilled on ssps.ssp.kubevirt.io \"ssp-kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
================
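
The "object has been modified" error above is the API server's optimistic-concurrency conflict: an update is rejected when it was computed from an outdated copy of the object. The following is a minimal Python sketch of that mechanism only. The `Store` class, its methods, and the writer names are hypothetical stand-ins, not the Kubernetes or controller-runtime API, and the report does not say which writers raced on the SSP object.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale resource version."""


class Store:
    """Hypothetical in-memory stand-in for the API server (one object)."""

    def __init__(self, obj):
        self.resource_version = 1
        self.obj = dict(obj)

    def get(self):
        # Return the version the object was read at plus a private copy.
        return self.resource_version, dict(self.obj)

    def update(self, read_version, obj):
        # Reject writes that were computed from an outdated snapshot.
        if read_version != self.resource_version:
            raise Conflict("the object has been modified")
        self.obj = dict(obj)
        self.resource_version += 1


store = Store({"phase": "Deploying"})

# Two writers read the same version of the object.
version_a, copy_a = store.get()
version_b, copy_b = store.get()

# Writer A succeeds, which bumps the resource version.
copy_a["note"] = "written by A"
store.update(version_a, copy_a)

# Writer B still holds the old version, so its update is rejected.
copy_b["phase"] = "Deployed"
try:
    store.update(version_b, copy_b)
    conflicted = False
except Conflict:
    conflicted = True
```

A single conflict like this is normally harmless, because the controller re-reads the object and retries. It becomes a problem when something keeps changing the object between every read and write.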

Attached are the HCO operator log and the SSP operator log.

Expected results:


Additional info:

Comment 1 Debarati Basu-Nag 2022-02-24 12:36:16 UTC
Moving to Storage, as triage by Oren indicated this is a CDI issue.

Comment 2 Arnon Gilboa 2022-02-24 14:09:13 UTC
Moved to SSP after having a debug session with @akrejcir

Comment 3 Andrej Krejcir 2022-02-24 15:19:01 UTC
I reproduced this on my development cluster. It is not related to bare metal.

The problem is that SSP and CDI modify the same labels in a loop.
This is the update done by SSP:

@ ["metadata","labels","app.kubernetes.io/component"]
- "storage"
+ "templating"
@ ["metadata","labels","app.kubernetes.io/managed-by"]
- "cdi-controller"
+ "ssp-operator"

And CDI reverts it.

I will post a PR to SSP to break the loop.
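
The loop described in this comment can be illustrated with a small Python simulation. This is a sketch under assumptions, not the SSP or CDI code: the function names are hypothetical, and `reconcile_if_unset` is only one possible way to break such a loop, not necessarily what the SSP PR does. The label keys and values are the ones from the diff above.

```python
SSP_LABELS = {
    "app.kubernetes.io/component": "templating",
    "app.kubernetes.io/managed-by": "ssp-operator",
}
CDI_LABELS = {
    "app.kubernetes.io/component": "storage",
    "app.kubernetes.io/managed-by": "cdi-controller",
}


def reconcile(labels, desired):
    """Overwrite labels with the desired values; report whether anything changed."""
    changed = any(labels.get(k) != v for k, v in desired.items())
    labels.update(desired)
    return changed


def reconcile_if_unset(labels, desired):
    """Hypothetical variant: only fill in labels that nobody has set yet."""
    changed = False
    for key, value in desired.items():
        if key not in labels:
            labels[key] = value
            changed = True
    return changed


def count_writes(ssp_reconcile, rounds=5):
    """Alternate SSP and CDI reconciles and count how many writes happen."""
    labels = dict(CDI_LABELS)  # the object starts out labelled by CDI
    writes = 0
    for _ in range(rounds):
        writes += ssp_reconcile(labels, SSP_LABELS)  # SSP's turn
        writes += reconcile(labels, CDI_LABELS)      # CDI's turn
    return writes


looping_writes = count_writes(reconcile)           # both sides overwrite
settled_writes = count_writes(reconcile_if_unset)  # SSP leaves set labels alone
```

When both controllers overwrite the labels, every reconcile produces a write and the object never settles. When one side stops overwriting labels that are already set, the write count drops to zero.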

Comment 5 Roni Kishner 2022-03-03 10:46:57 UTC
Verified on kubevirt-ssp-operator-container-v4.10.0-50

Note: The fix mentions that the labels are now set when auto-update is disabled. This could mean the bug will occur again when auto-update is disabled, but I could not reproduce it even then, so I can't say for sure.

Comment 8 errata-xmlrpc 2022-03-16 16:07:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947