Bug 2128906 - CDI operator is not always collecting/aggregating Progressing and Degraded conditions from its operands
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.12.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: 4.12.5
Assignee: Alex Kalenyuk
QA Contact: Yan Du
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-21 22:03 UTC by Simone Tiraboschi
Modified: 2023-07-18 13:55 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-18 13:55:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-21402 0 None None None 2022-10-31 13:16:15 UTC

Description Simone Tiraboschi 2022-09-21 22:03:32 UTC
Description of problem:
We got a case were cdi-deployment wasn't able to start due to PSA.
The deployment controller was clearly stating that on the conditions on the deployment status:

$ oc get deployments -n openshift-cnv cdi-deployment  -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-09-21T21:21:28Z",
    "lastUpdateTime": "2022-09-21T21:21:28Z",
    "message": "Deployment does not have minimum availability.",
    "reason": "MinimumReplicasUnavailable",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-09-21T21:21:28Z",
    "lastUpdateTime": "2022-09-21T21:21:28Z",
    "message": "pods \"cdi-deployment-6f4888b5cb-r9f5h\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"cdi-controller\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"cdi-controller\" must set securityContext.capabilities.drop=[\"ALL\"]), seccompProfile (pod or container \"cdi-controller\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")",
    "reason": "FailedCreate",
    "status": "True",
    "type": "ReplicaFailure"
  },
  {
    "lastTransitionTime": "2022-09-21T21:31:29Z",
    "lastUpdateTime": "2022-09-21T21:31:29Z",
    "message": "ReplicaSet \"cdi-deployment-6f4888b5cb\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]
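For reference, the PodSecurity violations quoted in the ReplicaFailure message above all point at a missing restricted-profile securityContext on the cdi-controller container. A minimal sketch of the fields the message asks for (field names are from the standard Kubernetes Pod spec; this is illustrative, not the actual CDI manifest):

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```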


although cdi-operator was still reporting Progressing=True, Degraded=False on its CR:

$ oc get cdi cdi-kubevirt-hyperconverged -o yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  annotations:
    cdi.kubevirt.io/configAuthority: ""
  creationTimestamp: "2022-09-21T21:21:24Z"
  finalizers:
  - operator.cdi.kubevirt.io
  generation: 2
  labels:
    app: kubevirt-hyperconverged
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: hco-operator
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.11.0
  name: cdi-kubevirt-hyperconverged
  resourceVersion: "39138"
  uid: 8cf09f52-9ad4-48d0-8831-d97923fdfe29
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  config:
    featureGates:
    - HonorWaitForFirstConsumer
  infra: {}
  uninstallStrategy: BlockUninstallIfWorkloadsExist
  workload: {}
status:
  conditions:
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    message: Started Deployment
    reason: DeployStarted
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Degraded
  operatorVersion: 4.11.0
  phase: Deploying
  targetVersion: 4.11.0


so HCO is going to report Progressing forever as well (HCO reads only the conditions on the CDI CR, not the ones on its operands)
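The aggregation the report expects can be sketched with plain structs (a hypothetical helper mirroring `appsv1.DeploymentCondition`, not the actual cdi-operator code): a ReplicaFailure=True or Progressing=False/ProgressDeadlineExceeded condition on the operand Deployment should surface as Degraded=True and Progressing=False on the CDI CR, instead of the CR staying at Progressing=True forever.

```go
package main

import "fmt"

// DeploymentCondition mirrors the fields of appsv1.DeploymentCondition
// that matter here.
type DeploymentCondition struct {
	Type   string // "Available", "Progressing", "ReplicaFailure"
	Status string // "True" / "False"
	Reason string
}

// OperatorConditions is the aggregate that would be written to the CDI CR.
type OperatorConditions struct {
	Available, Progressing, Degraded bool
	Reason                           string
}

// aggregate derives operator-level conditions from an operand Deployment's
// conditions. A deployment that has timed out progressing, or that cannot
// create replicas, flips the operator to Degraded=true / Progressing=false.
func aggregate(conds []DeploymentCondition) OperatorConditions {
	out := OperatorConditions{Progressing: true} // deploy has started
	for _, c := range conds {
		switch c.Type {
		case "Available":
			out.Available = c.Status == "True"
		case "Progressing":
			if c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
				out.Progressing = false
				out.Degraded = true
				out.Reason = c.Reason
			}
		case "ReplicaFailure":
			if c.Status == "True" {
				out.Degraded = true
				out.Reason = c.Reason
			}
		}
	}
	return out
}

func main() {
	// The conditions reported on cdi-deployment in this bug:
	got := aggregate([]DeploymentCondition{
		{Type: "Available", Status: "False", Reason: "MinimumReplicasUnavailable"},
		{Type: "ReplicaFailure", Status: "True", Reason: "FailedCreate"},
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	})
	fmt.Printf("Available=%v Progressing=%v Degraded=%v Reason=%s\n",
		got.Available, got.Progressing, got.Degraded, got.Reason)
}
```

With the conditions from this bug, such an aggregate would report Degraded=true rather than the Progressing=true/DeployStarted shown on the CDI CR above.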


Version-Release number of selected component (if applicable):
4.11.0

How reproducible:
100%

Steps to Reproduce:
1. try to deploy with cdi-deployment getting stuck (for instance CNV 4.11.0 on OCP 4.12.0 with enforcing restricted PSA on openshift-cnv)
2. check `oc get deployments -n openshift-cnv cdi-deployment  -o json | jq .status.conditions`
3. compare with `oc get cdi cdi-kubevirt-hyperconverged -o json | jq .status.conditions`

Actual results:
a mismatch between cdi-deployment and CDI CR:

on CDI deployment:
status:
  conditions:
  - lastTransitionTime: "2022-09-21T21:21:28Z"
    lastUpdateTime: "2022-09-21T21:21:28Z"
    message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates PodSecurity
      "restricted:latest": allowPrivilegeEscalation != false (container "cdi-controller"
      must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities
      (container "cdi-controller" must set securityContext.capabilities.drop=["ALL"]),
      seccompProfile (pod or container "cdi-controller" must set securityContext.seccompProfile.type
      to "RuntimeDefault" or "Localhost")'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  - lastTransitionTime: "2022-09-21T21:31:29Z"
    lastUpdateTime: "2022-09-21T21:31:29Z"
    message: ReplicaSet "cdi-deployment-6f4888b5cb" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing

on CDI CR:
status:
  conditions:
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    message: Started Deployment
    reason: DeployStarted
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Degraded



Expected results:
The CDI CR conditions correctly aggregate the ones from its operands.

Additional info:
Please notice that correctly reporting a failed install/upgrade is actually a prerequisite for the unsafe fail-forward upgrades feature:
https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/

Currently we are still not opting in, but that feature will eventually let customers try to recover from stuck upgrades.

Comment 1 Adam Litke 2022-11-23 18:38:21 UTC
Alex please have a look.

Comment 3 Yan Du 2023-02-01 13:20:30 UTC
Alex, could you please provide some update?

Comment 5 Alex Kalenyuk 2023-07-18 13:55:53 UTC
Tracking this at https://issues.redhat.com/browse/CNV-31135

