Bug 2058167 - Post deploy on a baremetal cluster SSP is looping attempting to reconcile
Summary: Post deploy on a baremetal cluster SSP is looping attempting to reconcile
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: SSP
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Andrej Krejcir
QA Contact: Geetika Kapoor
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-24 12:21 UTC by Debarati Basu-Nag
Modified: 2022-03-16 16:07 UTC (History)

Fixed In Version: kubevirt-ssp-operator-container-v4.10.0-50, hco-bundle-registry-4.10.0-696
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 16:07:15 UTC
Target Upstream Version:
Embargoed:


Attachments
ssp log (14.15 MB, text/plain)
2022-02-24 12:21 UTC, Debarati Basu-Nag


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt ssp-operator pull 316 | 0 | None | Merged | DataSources: Set app labels only when auto-update is disabled | 2022-02-25 14:53:56 UTC
Github | kubevirt ssp-operator pull 317 | 0 | None | Merged | [release-v0.13] DataSources: Set app labels only when auto-update is disabled | 2022-02-25 14:53:54 UTC
Red Hat Issue Tracker | CNV-16644 | 0 | None | None | None | 2022-03-07 14:23:22 UTC
Red Hat Product Errata | RHSA-2022:0947 | 0 | None | None | None | 2022-03-16 16:07:19 UTC

Description Debarati Basu-Nag 2022-02-24 12:21:20 UTC
Created attachment 1863175
ssp log

Description of problem:
After deploying BM cluster bm02-cnvqe2-rdu2, I noticed that HCO is in a degraded state because SSP is not available. The SSP operator log shows that the operator is continuously attempting to reconcile and failing.

Version-Release number of selected component (if applicable):
4.10.0 - 686 

How reproducible:
Not sure.

Steps to Reproduce:
1. Not sure.

Actual results:
HCO Status condition:
====================
{
      "lastTransitionTime": "2022-02-23T16:59:54Z",
      "message": "Reconcile completed successfully",
      "observedGeneration": 73,
      "reason": "ReconcileCompleted",
      "status": "True",
      "type": "ReconcileComplete"
    },
    {
      "lastTransitionTime": "2022-02-24T00:24:04Z",
      "message": "SSP is not available: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPNotAvailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2022-02-24T00:24:04Z",
      "message": "SSP is progressing: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPProgressing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2022-02-24T00:46:30Z",
      "message": "SSP is degraded: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPDegraded",
      "status": "True",
      "type": "Degraded"
    },
    {
      "lastTransitionTime": "2022-02-24T00:24:04Z",
      "message": "SSP is progressing: Reconciling SSP resources",
      "observedGeneration": 73,
      "reason": "SSPProgressing",
      "status": "False",
      "type": "Upgradeable"
    }
===================
SSP status:
===================
{
  "conditions": [
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Progressing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Degraded",
      "status": "True",
      "type": "Degraded"
    }
  ],
  "observedGeneration": 6,
  "observedVersion": "4.10.0",
  "operatorVersion": "4.10.0",
  "phase": "Deploying",
  "targetVersion": "4.10.0"
}
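
To read a status block like the one above at a glance, a small helper can list the conditions that deviate from a healthy state. This is only a sketch: the function name `unhealthy_conditions` is hypothetical, and it assumes the usual convention that `Available` should be "True" while `Progressing` and `Degraded` should be "False". The input is assumed to be the parsed status object, for example taken from the JSON output of `oc get ssp`.

```python
def unhealthy_conditions(status):
    """Return (type, message) pairs for conditions that indicate a problem.

    Assumption: a healthy resource reports Available=True,
    Progressing=False and Degraded=False. Other condition types are ignored.
    """
    expected = {"Available": "True", "Progressing": "False", "Degraded": "False"}
    problems = []
    for cond in status.get("conditions", []):
        want = expected.get(cond.get("type"))
        if want is not None and cond.get("status") != want:
            problems.append((cond["type"], cond.get("message", "")))
    return problems


# Trimmed copy of the SSP status reported above.
ssp_status = {
    "conditions": [
        {"type": "Available", "status": "False", "message": "Reconciling SSP resources"},
        {"type": "Progressing", "status": "True", "message": "Reconciling SSP resources"},
        {"type": "Degraded", "status": "True", "message": "Reconciling SSP resources"},
    ],
    "phase": "Deploying",
}

for cond_type, message in unhealthy_conditions(ssp_status):
    print(f"{cond_type}: {message}")
```

For the status in this report, all three conditions are flagged, which matches the HCO conditions that name SSP as the cause.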
The following message appears repeatedly in the SSP operator log:
=================
{"level":"error","ts":1645655734.9615152,"logger":"controller-runtime.manager.controller.ssp","msg":"Reconciler error","reconciler group":"ssp.kubevirt.io","reconciler kind":"SSP","name":"ssp-kubevirt-hyperconverged","namespace":"openshift-cnv","error":"Operation cannot be fulfilled on ssps.ssp.kubevirt.io \"ssp-kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
================
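
To gauge how often this conflict repeats in a large log, the JSON log lines can be counted with a short script. This is a sketch, not part of the operator tooling: the function name `count_conflicts` is hypothetical, and it relies only on the `level`, `msg` and `error` fields visible in the entry above.

```python
import json

# Substring of the API server conflict message seen in the log entry above.
CONFLICT_TEXT = "the object has been modified"


def count_conflicts(lines):
    """Count error-level "Reconciler error" entries caused by update conflicts."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip lines that are not JSON objects
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        if (
            entry.get("level") == "error"
            and entry.get("msg") == "Reconciler error"
            and CONFLICT_TEXT in str(entry.get("error", ""))
        ):
            count += 1
    return count


# Minimal sample entry built from the fields shown above.
sample = json.dumps({
    "level": "error",
    "msg": "Reconciler error",
    "name": "ssp-kubevirt-hyperconverged",
    "error": 'Operation cannot be fulfilled on ssps.ssp.kubevirt.io '
             '"ssp-kubevirt-hyperconverged": the object has been modified; '
             'please apply your changes to the latest version and try again',
})

print(count_conflicts([sample, sample, "not a json line"]))
```

An open file object can be passed directly as `lines`, so the attached log could be scanned with `count_conflicts(open(path))`, where `path` is wherever the attachment was saved.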

Attached are the HCO operator log and the SSP operator log.

Expected results:


Additional info:

Comment 1 Debarati Basu-Nag 2022-02-24 12:36:16 UTC
Moving to Storage: triage by Oren indicated that this is a CDI issue.

Comment 2 Arnon Gilboa 2022-02-24 14:09:13 UTC
Moved to SSP after a debug session with @akrejcir.

Comment 3 Andrej Krejcir 2022-02-24 15:19:01 UTC
I reproduced this on my development cluster. It is not related to bare metal.

The problem is that SSP and CDI modify the same labels in a loop.
This is the update done by SSP:

@ ["metadata","labels","app.kubernetes.io/component"]
- "storage"
+ "templating"
@ ["metadata","labels","app.kubernetes.io/managed-by"]
- "cdi-controller"
+ "ssp-operator"

And CDI reverts it back.

I will post a PR to SSP to break the loop.
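
The loop described above can be illustrated with a toy simulation. This is not the SSP or CDI code: the `reconcile` and `simulate` functions are hypothetical, and the only values taken from this report are the two label keys and their competing values. The `ssp_sets_labels` flag models one possible way to break the loop, namely SSP no longer writing these labels on objects that CDI also labels, which is what the titles of the linked pull requests suggest.

```python
# Label values each controller wants, as shown in the diff above.
SSP_LABELS = {
    "app.kubernetes.io/component": "templating",
    "app.kubernetes.io/managed-by": "ssp-operator",
}
CDI_LABELS = {
    "app.kubernetes.io/component": "storage",
    "app.kubernetes.io/managed-by": "cdi-controller",
}


def reconcile(labels, desired):
    """Make `labels` match `desired`; return True if an update was written."""
    if all(labels.get(key) == value for key, value in desired.items()):
        return False  # already in the desired state, nothing to write
    labels.update(desired)
    return True


def simulate(rounds, ssp_sets_labels):
    """Run both controllers in turn and count how many updates they write."""
    labels = {}
    writes = 0
    for _ in range(rounds):
        if ssp_sets_labels and reconcile(labels, SSP_LABELS):
            writes += 1
        if reconcile(labels, CDI_LABELS):
            writes += 1
    return writes


print("both controllers set the labels:", simulate(5, ssp_sets_labels=True))
print("only CDI sets the labels:", simulate(5, ssp_sets_labels=False))
```

When both controllers own the same keys, every round produces two writes and the object never settles. When only one of them sets the labels, a single write is enough and later rounds are no-ops.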

Comment 5 Roni Kishner 2022-03-03 10:46:57 UTC
Verified on kubevirt-ssp-operator-container-v4.10.0-50

Note: The fix mentions that the labels are now set only when auto-update is disabled. This could mean the bug will occur again when auto-update is disabled, but I did not manage to reproduce it even then, so I can't say for sure.

Comment 8 errata-xmlrpc 2022-03-16 16:07:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

