Description of problem:
Changes to the readiness or liveness probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) in an operator deployment manifest aren't applied to the operator's deployment resources during upgrades.

How reproducible:
Commit new probe timeout values for a CVO-managed operator, e.g. https://github.com/openshift/machine-config-operator/pull/1818.

Actual results:
On a new installation, the correct values are applied, but after a Y or Z upgrade to the new commit, the new values are not applied to the deployment.

Expected results:
The new values should be applied during an upgrade.
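For illustration, the expected upgrade behaviour can be sketched as a merge that copies the four integer probe fields from the manifest in the release payload onto the in-cluster deployment and reports whether anything changed. This is a minimal Python sketch under stated assumptions, not the cluster-version-operator's actual (Go) code: `ensure_probe`, `PROBE_INT_FIELDS`, and the dict layout are hypothetical names chosen for the example, and the "existing" values are assumed to be the Kubernetes probe defaults.

```python
# Hypothetical sketch: NOT the CVO's real merge logic.
PROBE_INT_FIELDS = (
    "initialDelaySeconds",
    "periodSeconds",
    "failureThreshold",
    "timeoutSeconds",
)


def ensure_probe(existing: dict, required: dict) -> bool:
    """Copy integer probe fields set in `required` onto `existing`.

    Returns True if `existing` was modified, i.e. the deployment
    would need to be updated in the cluster.
    """
    modified = False
    for field in PROBE_INT_FIELDS:
        if field in required and existing.get(field) != required[field]:
            existing[field] = required[field]
            modified = True
    return modified


# Probe as it might look on a cluster installed before the change
# (Kubernetes defaults are assumed here for the example).
existing = {"initialDelaySeconds": 0, "periodSeconds": 10,
            "failureThreshold": 3, "timeoutSeconds": 1}
# Probe values from the new manifest.
required = {"initialDelaySeconds": 5, "periodSeconds": 5,
            "failureThreshold": 3, "timeoutSeconds": 3}

changed = ensure_probe(existing, required)
print(changed, existing)
```

With a merge of this shape, an upgrade would flag the deployment as modified whenever one of the four fields differs, which is the behaviour this report expects and does not observe.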
This blocks a fix for quorum-guard which involves changing probe timeout values (https://bugzilla.redhat.com/show_bug.cgi?id=1829923).
https://github.com/openshift/cluster-version-operator/pull/383 is an attempt to fix the problem, but the fix doesn't appear to work and we don't yet know why.
Looks good in CI [1], where 4.5.0-rc.1 -> 4.6.0-0.ci-2020-06-18-154744 started with:

$ oc adm release extract --to=4.5 quay.io/openshift-release-dev/ocp-release:4.5.0-rc.1-x86_64
Extracted release payload from digest sha256:7ea01a3c4d91f852f480ea40189f1762fcd2e77b8843a0662c471889f0b72028 created at 2020-06-05T17:58:18Z
$ oc adm release extract --to=4.6 registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-06-18-154744
Extracted release payload from digest sha256:25910e71a3bd53e86bdad8aeb4ea6453b944e54b6c0a70806bc8d673dcf17c28 created at 2020-06-18T15:48:15Z
$ diff -u 4.{5,6}/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
--- 4.5/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml 2020-06-05 00:05:51.000000000 -0700
+++ 4.6/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml 2020-06-18 06:02:23.000000000 -0700
@@ -52,9 +52,9 @@
         operator: Exists
         effect: NoSchedule
       containers:
-      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c4520124123a1425128d1f90740530069be125c911fe3e1d760d9bf6d1ce19c1
+      - name: guard
+        image: registry.svc.ci.openshift.org/ocp/4.6-2020-06-18-154744@sha256:e4e40d4fd585029f7287f7bcdb45067c696d126869a3d817891049cd5039f04d
         imagePullPolicy: IfNotPresent
-        name: guard
         terminationMessagePolicy: FallbackToLogsOnError
         volumeMounts:
         - mountPath: /mnt/kube
@@ -82,8 +82,10 @@
               export NSS_SDB_USE_CACHE=no
               [[ -z $cert || -z $key ]] && exit 1
               curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'
-          initialDelaySecond: 5
-          periodSecond: 5
+          initialDelaySeconds: 5
+          periodSeconds: 5
+          failureThreshold: 3
+          timeoutSeconds: 3
         resources:
           requests:
             cpu: 10m
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040/artifacts/launch/pods.json | jq -r '.items[] | select(.metadata.name | startswith("etcd-quorum-guard")).spec.containers[].readinessProbe | {initialDelaySeconds, periodSeconds, failureThreshold, timeoutSeconds} | tostring' | uniq
{"initialDelaySeconds":5,"periodSeconds":5,"failureThreshold":3,"timeoutSeconds":3}

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1273719951609303040
Thanks for the followup, Trevor
According to comment 5, the verification steps from the QE side should be:
1. Install OCP v4.5 without the backport PR (such as 4.5.0-rc.1).
2. Upgrade to the latest v4.6 nightly build that includes PR 383.
3. Check whether the probe integer fields (initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds) of etcd-quorum-guard were updated.
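The check in step 3 could be scripted along the following lines. This is a minimal sketch, assuming the deployment JSON has already been obtained (for example with `oc get deployment -o json`) and parsed into a dict. `probe_mismatches` is a hypothetical helper, the `deployment` value below is made-up sample data standing in for a cluster where the new values were not applied, and the expected values and the `guard` container name are taken from the 4.6 manifest quoted in this report.

```python
def probe_mismatches(deployment: dict, container: str, expected: dict) -> dict:
    """Return {field: (expected, observed)} for readinessProbe fields that differ.

    `deployment` is a parsed Deployment object; `container` is the name of
    the container whose readiness probe should be checked.
    """
    containers = deployment["spec"]["template"]["spec"]["containers"]
    target = next(c for c in containers if c["name"] == container)
    probe = target.get("readinessProbe", {})
    return {
        field: (want, probe.get(field))
        for field, want in expected.items()
        if probe.get(field) != want
    }


# Values from the 4.6 etcd-quorum-guard manifest.
expected = {"initialDelaySeconds": 5, "periodSeconds": 5,
            "failureThreshold": 3, "timeoutSeconds": 3}

# Made-up sample standing in for a deployment that kept stale probe values.
deployment = {"spec": {"template": {"spec": {"containers": [
    {"name": "guard",
     "readinessProbe": {"periodSeconds": 10,
                        "failureThreshold": 3,
                        "timeoutSeconds": 1}},
]}}}}

mismatches = probe_mismatches(deployment, "guard", expected)
print(mismatches or "all probe fields match")
```

An empty result means all four fields carry the values from the manifest, so verification passes; any entry in the result names a field that the upgrade failed to update.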
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196