Bug 1847672

Summary: Changes to probe fields in operator manifests are not applied during upgrade
Product: OpenShift Container Platform Reporter: Dan Mace <dmace>
Component: Cluster Version Operator Assignee: Dan Mace <dmace>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: medium    
Version: 4.5 CC: aos-bugs, jokerman, wking
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator ignored several probe properties, including timeoutSeconds. Consequence: Operators which changed their release manifests to adjust those properties did not get the changes applied to clusters when updating to the new release image. Fix: The cluster-version operator now applies these probe properties. Result: The cluster-version operator ensures that the in-cluster probe state matches the requested state from the operator's release manifests.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:07:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1829923, 1848729, 1849619    

Description Dan Mace 2020-06-16 18:55:29 UTC
Description of problem:

Changes to the readiness or liveness probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of an operator deployment manifest are not applied to operator deployment resources during upgrades.

How reproducible:

Commit new probe timeout values for a CVO-managed operator, e.g. https://github.com/openshift/machine-config-operator/pull/1818.

Actual results:

On a new installation, the correct values are applied, but after a Y or Z upgrade to a release containing the new commit, the new values are not applied to the deployment.

Expected results:

The new values should be applied during an upgrade.
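
For illustration only, the expected behavior can be sketched as a small reconcile step that copies the probe's integer fields from the manifest onto the in-cluster object and reports whether anything changed. This is not the cluster-version operator's code (the operator is written in Go); the Probe class, the ensure_probe helper, and the rule that unset manifest fields are left alone are all assumptions made for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Probe:
    """Hypothetical stand-in holding only the integer probe fields named in this bug."""
    initialDelaySeconds: Optional[int] = None
    periodSeconds: Optional[int] = None
    failureThreshold: Optional[int] = None
    timeoutSeconds: Optional[int] = None


def ensure_probe(existing: Probe, required: Probe) -> bool:
    """Copy every field the manifest sets onto the in-cluster probe.

    Returns True when the in-cluster probe was changed and would need
    to be written back. Fields the manifest leaves unset (None) are not
    touched; that rule is an assumption of this sketch.
    """
    modified = False
    for f in fields(Probe):
        want = getattr(required, f.name)
        if want is not None and getattr(existing, f.name) != want:
            setattr(existing, f.name, want)
            modified = True
    return modified


# Illustrative in-cluster values before the upgrade (assumed, not taken from a cluster).
in_cluster = Probe(initialDelaySeconds=0, periodSeconds=10, failureThreshold=3, timeoutSeconds=1)
# Values requested by the new manifest, matching the 4.6 diff in comment 5.
from_manifest = Probe(initialDelaySeconds=5, periodSeconds=5, failureThreshold=3, timeoutSeconds=3)

changed = ensure_probe(in_cluster, from_manifest)
print(changed, in_cluster)
```

With the behavior reported here, the equivalent step left fields such as timeoutSeconds untouched, so the deployment kept its old probe settings after the upgrade.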

Comment 1 Dan Mace 2020-06-16 18:56:44 UTC
This blocks a fix for quorum-guard which involves changing probe timeout values (https://bugzilla.redhat.com/show_bug.cgi?id=1829923).

Comment 2 Dan Mace 2020-06-16 19:00:11 UTC
https://github.com/openshift/cluster-version-operator/pull/383 is an attempt to fix the problem, but the fix doesn't appear to work and we don't yet know why.

Comment 5 W. Trevor King 2020-06-18 22:41:57 UTC
Looks good in CI [1], where 4.5.0-rc.1 -> 4.6.0-0.ci-2020-06-18-154744 started with:

$ oc adm release extract --to=4.5 quay.io/openshift-release-dev/ocp-release:4.5.0-rc.1-x86_64
Extracted release payload from digest sha256:7ea01a3c4d91f852f480ea40189f1762fcd2e77b8843a0662c471889f0b72028 created at 2020-06-05T17:58:18Z
$ oc adm release extract --to=4.6 registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-06-18-154744
Extracted release payload from digest sha256:25910e71a3bd53e86bdad8aeb4ea6453b944e54b6c0a70806bc8d673dcf17c28 created at 2020-06-18T15:48:15Z
$ diff -u 4.{5,6}/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml 
--- 4.5/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml	2020-06-05 00:05:51.000000000 -0700
+++ 4.6/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml	2020-06-18 06:02:23.000000000 -0700
@@ -52,9 +52,9 @@
         operator: Exists
         effect: NoSchedule
       containers:
-      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c4520124123a1425128d1f90740530069be125c911fe3e1d760d9bf6d1ce19c1
+      - name: guard
+        image: registry.svc.ci.openshift.org/ocp/4.6-2020-06-18-154744@sha256:e4e40d4fd585029f7287f7bcdb45067c696d126869a3d817891049cd5039f04d
         imagePullPolicy: IfNotPresent
-        name: guard
         terminationMessagePolicy: FallbackToLogsOnError
         volumeMounts:
         - mountPath: /mnt/kube
@@ -82,8 +82,10 @@
                 export NSS_SDB_USE_CACHE=no
                 [[ -z $cert || -z $key ]] && exit 1
                 curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'
-            initialDelaySecond: 5
-            periodSecond: 5
+          initialDelaySeconds: 5
+          periodSeconds: 5
+          failureThreshold: 3
+          timeoutSeconds: 3
         resources:
           requests:
             cpu: 10m
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040/artifacts/launch/pods.json | jq -r '.items[] | select(.metadata.name | startswith("etcd-quorum-guard")).spec.containers[].readinessProbe | {initialDelaySeconds, periodSeconds, failureThreshold, timeoutSeconds} | tostring' | uniq
{"initialDelaySeconds":5,"periodSeconds":5,"failureThreshold":3,"timeoutSeconds":3}

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040

Comment 6 Dan Mace 2020-06-19 11:46:33 UTC
Thanks for the followup, Trevor

Comment 7 liujia 2020-06-22 03:20:48 UTC
According to comment 5, the verification steps from the QE side should be:
1. Install OCP v4.5 from a build without the v4.5 backport PR (such as 4.5.0-rc.1).
2. Upgrade to the latest v4.6 nightly build that includes PR 383.
3. Check whether the probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of etcd-quorum-guard were updated.
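
Step 3 could be scripted roughly as below, mirroring the jq filter from comment 5. This is a sketch under assumptions: the quorum_guard_probes helper and the sample pod names are hypothetical, and the input is assumed to be a pod list in the same JSON shape as the pods.json artifact used in comment 5.

```python
import json

PROBE_FIELDS = ("initialDelaySeconds", "periodSeconds", "failureThreshold", "timeoutSeconds")


def quorum_guard_probes(pod_list: dict, prefix: str = "etcd-quorum-guard") -> set:
    """Return the distinct readiness-probe settings of pods whose name starts with prefix."""
    seen = set()
    for pod in pod_list.get("items", []):
        if not pod["metadata"]["name"].startswith(prefix):
            continue
        for container in pod["spec"]["containers"]:
            probe = container.get("readinessProbe") or {}
            # A field missing from the probe shows up as None in the tuple.
            seen.add(tuple(probe.get(name) for name in PROBE_FIELDS))
    return seen


# Hypothetical, trimmed-down pod list; real input would be loaded from a file.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "etcd-quorum-guard-aaaa"},
   "spec": {"containers": [{"name": "guard", "readinessProbe":
     {"initialDelaySeconds": 5, "periodSeconds": 5, "failureThreshold": 3, "timeoutSeconds": 3}}]}},
  {"metadata": {"name": "etcd-quorum-guard-bbbb"},
   "spec": {"containers": [{"name": "guard", "readinessProbe":
     {"initialDelaySeconds": 5, "periodSeconds": 5, "failureThreshold": 3, "timeoutSeconds": 3}}]}},
  {"metadata": {"name": "unrelated-pod"},
   "spec": {"containers": [{"name": "main"}]}}
]}
""")

# Values requested by the 4.6 manifest in comment 5.
expected = {(5, 5, 3, 3)}
print("probe fields updated:", quorum_guard_probes(sample) == expected)
```

Any result other than the single expected tuple would mean at least one etcd-quorum-guard pod is still running with probe settings that differ from the manifest.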

Comment 10 errata-xmlrpc 2020-10-27 16:07:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196