1847672 – Changes to probe fields in operator manifests are not applied during upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Dan Mace
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1829923 1848729 1849619

Reported:	2020-06-16 18:55 UTC by Dan Mace
Modified:	2020-10-27 16:07 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The cluster-version operator ignored several probe properties, including timeoutSeconds. Consequence: Operators that changed their release manifests to adjust those properties did not get the changes applied to clusters on updating to the new release image. Fix: The cluster-version operator now applies these probe properties. Result: The cluster-version operator ensures that the in-cluster probe state matches the requested state from the operator's release manifests.
Clone Of:
Environment:
Last Closed:	2020-10-27 16:07:36 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 383	0	None	closed	Bug 1847672: Expand supported set of probe field mutations	2020-10-27 19:34:34 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:07:56 UTC

Description Dan Mace 2020-06-16 18:55:29 UTC

Description of problem:

Changes to the readiness or liveness probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of an operator deployment manifest aren't applied to operator deployment resources during upgrades.

How reproducible:

Commit new probe timeout values for a CVO-managed operator, e.g. https://github.com/openshift/machine-config-operator/pull/1818.

Actual results:

On a new installation, the correct values are applied, but when performing a Y or Z upgrade to the new commit, the new values are not applied to the deployment.

Expected results:

The new values should be applied during an upgrade.
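
To illustrate the failure mode, here is a minimal Python sketch. It is not the cluster-version operator's actual Go code. It assumes the operator reconciles an existing in-cluster probe against the manifest's probe by copying a fixed list of fields; `merge_probe`, the field lists, and the sample probe values are all hypothetical and chosen only for illustration. When the list omits the integer fields, an upgrade leaves the old in-cluster values in place.

```python
import copy

# Integer probe fields named in this bug report.
INT_FIELDS = ("initialDelaySeconds", "periodSeconds", "failureThreshold", "timeoutSeconds")


def merge_probe(existing, required, fields):
    """Copy the listed fields from `required` into a copy of `existing`.

    Returns (merged_probe, modified). Fields that are not listed are left
    untouched, which is the assumed cause of the bug when the list is
    incomplete.
    """
    merged = copy.deepcopy(existing)
    modified = False
    for field in fields:
        if field in required and merged.get(field) != required[field]:
            merged[field] = copy.deepcopy(required[field])
            modified = True
    return merged, modified


# Probe as it exists in the cluster before the upgrade (hypothetical values).
in_cluster = {"exec": {"command": ["/bin/check"]}, "timeoutSeconds": 1}
# Probe requested by the new release manifest (hypothetical values).
manifest = {"exec": {"command": ["/bin/check"]}, "timeoutSeconds": 3, "failureThreshold": 3}

# Assumed pre-fix behavior: only the handler is reconciled.
before, changed_before = merge_probe(in_cluster, manifest, ("exec",))
# Assumed post-fix behavior: the integer fields are reconciled as well.
after, changed_after = merge_probe(in_cluster, manifest, ("exec",) + INT_FIELDS)

print(before["timeoutSeconds"], changed_before)
print(after["timeoutSeconds"], changed_after)
```

In this sketch a fresh install is unaffected because there is no existing object to merge into, which matches the observation that new installations get the correct values.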

Comment 1 Dan Mace 2020-06-16 18:56:44 UTC

This blocks a fix for quorum-guard which involves changing probe timeout values (https://bugzilla.redhat.com/show_bug.cgi?id=1829923).

Comment 2 Dan Mace 2020-06-16 19:00:11 UTC

https://github.com/openshift/cluster-version-operator/pull/383 is an attempt to fix the problem, but the fix doesn't appear to work and we don't yet know why.

Comment 5 W. Trevor King 2020-06-18 22:41:57 UTC

Looks good in CI [1], where 4.5.0-rc.1 -> 4.6.0-0.ci-2020-06-18-154744 started with:

$ oc adm release extract --to=4.5 quay.io/openshift-release-dev/ocp-release:4.5.0-rc.1-x86_64
Extracted release payload from digest sha256:7ea01a3c4d91f852f480ea40189f1762fcd2e77b8843a0662c471889f0b72028 created at 2020-06-05T17:58:18Z
$ oc adm release extract --to=4.6 registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-06-18-154744
Extracted release payload from digest sha256:25910e71a3bd53e86bdad8aeb4ea6453b944e54b6c0a70806bc8d673dcf17c28 created at 2020-06-18T15:48:15Z
$ diff -u 4.{5,6}/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml 
--- 4.5/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml	2020-06-05 00:05:51.000000000 -0700
+++ 4.6/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml	2020-06-18 06:02:23.000000000 -0700
@@ -52,9 +52,9 @@
         operator: Exists
         effect: NoSchedule
       containers:
-      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c4520124123a1425128d1f90740530069be125c911fe3e1d760d9bf6d1ce19c1
+      - name: guard
+        image: registry.svc.ci.openshift.org/ocp/4.6-2020-06-18-154744@sha256:e4e40d4fd585029f7287f7bcdb45067c696d126869a3d817891049cd5039f04d
         imagePullPolicy: IfNotPresent
-        name: guard
         terminationMessagePolicy: FallbackToLogsOnError
         volumeMounts:
         - mountPath: /mnt/kube
@@ -82,8 +82,10 @@
                 export NSS_SDB_USE_CACHE=no
                 [[ -z $cert || -z $key ]] && exit 1
                 curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'
-            initialDelaySecond: 5
-            periodSecond: 5
+          initialDelaySeconds: 5
+          periodSeconds: 5
+          failureThreshold: 3
+          timeoutSeconds: 3
         resources:
           requests:
             cpu: 10m
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040/artifacts/launch/pods.json | jq -r '.items[] | select(.metadata.name | startswith("etcd-quorum-guard")).spec.containers[].readinessProbe | {initialDelaySeconds, periodSeconds, failureThreshold, timeoutSeconds} | tostring' | uniq
{"initialDelaySeconds":5,"periodSeconds":5,"failureThreshold":3,"timeoutSeconds":3}

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040

Comment 6 Dan Mace 2020-06-19 11:46:33 UTC

Thanks for the followup, Trevor

Comment 7 liujia 2020-06-22 03:20:48 UTC

According to comment 5, the verification steps from the QE side should be:
1. Install OCP v4.5 without the backport PR in v4.5 (such as 4.5.0-rc.1).
2. Upgrade to the latest v4.6 nightly build that includes PR 383.
3. Check whether the probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of etcd-quorum-guard were updated.
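
Step 3 could be scripted along the lines of the jq filter in comment 5. The following is a rough Python sketch under stated assumptions: the input is a pod list in the same JSON shape as the `pods.json` artifact, the function name `quorum_guard_probe_settings` is hypothetical, and the embedded sample is hand-written for illustration rather than taken from a real cluster.

```python
import json

# Probe fields to verify, as listed in step 3.
FIELDS = ("initialDelaySeconds", "periodSeconds", "failureThreshold", "timeoutSeconds")


def quorum_guard_probe_settings(pod_list):
    """Return the distinct readiness-probe settings of etcd-quorum-guard pods."""
    seen = []
    for pod in pod_list.get("items", []):
        if not pod["metadata"]["name"].startswith("etcd-quorum-guard"):
            continue
        for container in pod["spec"]["containers"]:
            probe = container.get("readinessProbe") or {}
            # Missing fields are reported as None, like jq's null.
            settings = {field: probe.get(field) for field in FIELDS}
            if settings not in seen:
                seen.append(settings)
    return seen


# Hand-written sample in the shape of a Kubernetes pod list (illustrative only).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "etcd-quorum-guard-a"},
   "spec": {"containers": [{"name": "guard", "readinessProbe":
     {"initialDelaySeconds": 5, "periodSeconds": 5, "failureThreshold": 3, "timeoutSeconds": 3}}]}},
  {"metadata": {"name": "etcd-quorum-guard-b"},
   "spec": {"containers": [{"name": "guard", "readinessProbe":
     {"initialDelaySeconds": 5, "periodSeconds": 5, "failureThreshold": 3, "timeoutSeconds": 3}}]}},
  {"metadata": {"name": "unrelated-pod"},
   "spec": {"containers": [{"name": "other", "readinessProbe": {"timeoutSeconds": 1}}]}}
]}
""")

expected = {"initialDelaySeconds": 5, "periodSeconds": 5, "failureThreshold": 3, "timeoutSeconds": 3}
result = quorum_guard_probe_settings(sample)
print(result == [expected])
```

On a live cluster, the pod list could instead come from the output of `oc get pods --all-namespaces -o json`. The check passes when the only distinct setting matches the values in the new release manifest.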

Comment 10 errata-xmlrpc 2020-10-27 16:07:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
