Bug 1847672 - Changes to probe fields in operator manifests are not applied during upgrade
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dan Mace
QA Contact: ge liu
Depends On:
Blocks: 1829923 1848729 1849619
Reported: 2020-06-16 18:55 UTC by Dan Mace
Modified: 2020-10-27 16:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator ignored several probe properties, including timeoutSeconds. Consequence: Operators that changed their release manifests to adjust those properties did not get the changes applied to clusters on updating to the new release image. Fix: The cluster-version operator now applies these probe properties. Result: The cluster-version operator ensures that the in-cluster probe state matches the requested state from the operator's release manifests.
Clone Of:
Last Closed: 2020-10-27 16:07:36 UTC
Target Upstream Version:

System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 383 | None | closed | Bug 1847672: Expand supported set of probe field mutations | 2020-10-27 19:34:34 UTC
Red Hat Product Errata | RHBA-2020:4196 | None | None | None | 2020-10-27 16:07:56 UTC

Description Dan Mace 2020-06-16 18:55:29 UTC
Description of problem:

Changes to the readiness or liveness probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of an operator deployment manifest are not applied to operator deployment resources during upgrades.

How reproducible:

Commit new probe timeout values for a CVO-managed operator, e.g. https://github.com/openshift/machine-config-operator/pull/1818.

Actual results:

On a new installation, the correct values are applied. When performing a Y or Z upgrade to a release containing the new commit, however, the new values are not applied to the deployment.

Expected results:

The new values should be applied during an upgrade.
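
For illustration only, here is a minimal Python sketch of the reconciliation behavior expected above. It is not the cluster-version operator's code (the operator is written in Go); the helper name `ensure_probe` and the in-cluster values are hypothetical, and only the four field names come from this report.

```python
from typing import Any, Dict, Tuple

# Integer probe fields named in this bug report.
PROBE_INT_FIELDS = (
    "initialDelaySeconds",
    "periodSeconds",
    "failureThreshold",
    "timeoutSeconds",
)


def ensure_probe(existing: Dict[str, Any], required: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
    """Hypothetical helper: copy integer probe fields from the manifest.

    Returns a merged copy of `existing` plus a flag that is True when at
    least one field had to change. Fields absent from `required` are left
    untouched.
    """
    merged = dict(existing)
    modified = False
    for field in PROBE_INT_FIELDS:
        if field in required and merged.get(field) != required[field]:
            merged[field] = required[field]
            modified = True
    return merged, modified


# Illustrative values only: an in-cluster probe that lags the manifest.
in_cluster = {"initialDelaySeconds": 5, "periodSeconds": 5, "timeoutSeconds": 1}
manifest = {
    "initialDelaySeconds": 5,
    "periodSeconds": 5,
    "failureThreshold": 3,
    "timeoutSeconds": 3,
}
merged, modified = ensure_probe(in_cluster, manifest)
```

In this sketch, a reconciler that skipped any of the four fields would leave `modified` False for a manifest change touching only that field, which matches the symptom reported here.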

Comment 1 Dan Mace 2020-06-16 18:56:44 UTC
This blocks a fix for quorum-guard which involves changing probe timeout values (https://bugzilla.redhat.com/show_bug.cgi?id=1829923).

Comment 2 Dan Mace 2020-06-16 19:00:11 UTC
https://github.com/openshift/cluster-version-operator/pull/383 is an attempt to fix the problem, but the fix doesn't appear to work and we don't yet know why.

Comment 5 W. Trevor King 2020-06-18 22:41:57 UTC
Looks good in CI [1], where 4.5.0-rc.1 -> 4.6.0-0.ci-2020-06-18-154744 started with:

$ oc adm release extract --to=4.5 quay.io/openshift-release-dev/ocp-release:4.5.0-rc.1-x86_64
Extracted release payload from digest sha256:7ea01a3c4d91f852f480ea40189f1762fcd2e77b8843a0662c471889f0b72028 created at 2020-06-05T17:58:18Z
$ oc adm release extract --to=4.6 registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-06-18-154744
Extracted release payload from digest sha256:25910e71a3bd53e86bdad8aeb4ea6453b944e54b6c0a70806bc8d673dcf17c28 created at 2020-06-18T15:48:15Z
$ diff -u 4.{5,6}/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml 
--- 4.5/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml	2020-06-05 00:05:51.000000000 -0700
+++ 4.6/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml	2020-06-18 06:02:23.000000000 -0700
@@ -52,9 +52,9 @@
         operator: Exists
         effect: NoSchedule
-      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c4520124123a1425128d1f90740530069be125c911fe3e1d760d9bf6d1ce19c1
+      - name: guard
+        image: registry.svc.ci.openshift.org/ocp/4.6-2020-06-18-154744@sha256:e4e40d4fd585029f7287f7bcdb45067c696d126869a3d817891049cd5039f04d
         imagePullPolicy: IfNotPresent
-        name: guard
         terminationMessagePolicy: FallbackToLogsOnError
         - mountPath: /mnt/kube
@@ -82,8 +82,10 @@
                 export NSS_SDB_USE_CACHE=no
                 [[ -z $cert || -z $key ]] && exit 1
                 curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'
-            initialDelaySecond: 5
-            periodSecond: 5
+          initialDelaySeconds: 5
+          periodSeconds: 5
+          failureThreshold: 3
+          timeoutSeconds: 3
             cpu: 10m
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040/artifacts/launch/pods.json | jq -r '.items[] | select(.metadata.name | startswith("etcd-quorum-guard")).spec.containers[].readinessProbe | {initialDelaySeconds, periodSeconds, failureThreshold, timeoutSeconds} | tostring' | uniq

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040

Comment 6 Dan Mace 2020-06-19 11:46:33 UTC
Thanks for the follow-up, Trevor.

Comment 7 liujia 2020-06-22 03:20:48 UTC
According to comment 5, the verification steps from the QE side should be:
1. Install OCP v4.5 without the backport PR (such as 4.5.0-rc.1).
2. Upgrade to the latest v4.6 nightly build that includes PR 383.
3. Check whether the probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of etcd-quorum-guard were updated.
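
Step 3 could be scripted roughly as follows. This is a sketch that mirrors the jq filter from comment 5 in Python; the function name and the sample pod list are hypothetical, and the expected values are taken from the 4.6 manifest in the diff above. Unlike `uniq`, it de-duplicates across all pods rather than only adjacent lines.

```python
from typing import Any, Dict, List, Optional

FIELDS = ("initialDelaySeconds", "periodSeconds", "failureThreshold", "timeoutSeconds")


def readiness_probe_fields(
    pod_list: Dict[str, Any], prefix: str = "etcd-quorum-guard"
) -> List[Dict[str, Optional[int]]]:
    """Collect the distinct readiness-probe settings of pods whose name
    starts with `prefix`. Missing fields are reported as None."""
    results: List[Dict[str, Optional[int]]] = []
    for pod in pod_list.get("items", []):
        if not pod["metadata"]["name"].startswith(prefix):
            continue
        for container in pod["spec"]["containers"]:
            probe = container.get("readinessProbe") or {}
            summary = {field: probe.get(field) for field in FIELDS}
            if summary not in results:
                results.append(summary)
    return results


# Hand-written sample standing in for a real pod list (hypothetical pod names).
probe = {"initialDelaySeconds": 5, "periodSeconds": 5, "failureThreshold": 3, "timeoutSeconds": 3}
sample = {
    "items": [
        {"metadata": {"name": "etcd-quorum-guard-a"},
         "spec": {"containers": [{"name": "guard", "readinessProbe": dict(probe)}]}},
        {"metadata": {"name": "etcd-quorum-guard-b"},
         "spec": {"containers": [{"name": "guard", "readinessProbe": dict(probe)}]}},
        {"metadata": {"name": "unrelated-pod"},
         "spec": {"containers": [{"name": "other"}]}},
    ]
}
observed = readiness_probe_fields(sample)
expected = [probe]
```

Against a real cluster, the input would be a parsed pod list such as the pods.json artifact from comment 5 or the output of `oc get pods --all-namespaces -o json`; verification passes when `observed` equals `expected`.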

Comment 10 errata-xmlrpc 2020-10-27 16:07:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.