Bug 1974520 - [release-4.9] CI update from 4.7 to 4.8 sticks on: EncryptionMigrationController_Error: EncryptionMigrationControllerDegraded: etcdserver: request timed out
Summary: [release-4.9] CI update from 4.7 to 4.8 sticks on: EncryptionMigrationControl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Lukasz Szaszkiewicz
QA Contact: liyao
URL:
Whiteboard: tag-ci
Depends On:
Blocks: 1979303
 
Reported: 2021-06-21 22:31 UTC by W. Trevor King
Modified: 2021-10-18 17:36 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The encryption conditions are now cleared when there is no work to be done. Previously, especially during an upgrade from 4.7 to 4.8, these conditions could remain indefinitely and be reported to end users as a Degraded condition on some operators.
Clone Of:
: 1979303
Environment:
Last Closed: 2021-10-18 17:35:54 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 462 0 None closed clear encryption conditions when there is no work to be done 2021-07-15 18:51:20 UTC
Github openshift cluster-kube-apiserver-operator pull 1172 0 None closed clear encryption conditions when there is no work to be done 2021-07-15 18:51:19 UTC
Github openshift cluster-openshift-apiserver-operator pull 460 0 None closed clear encryption conditions when there is no work to be done 2021-07-15 18:51:20 UTC
Github openshift library-go pull 1125 0 None closed clear encryption conditions when there is no work to be done 2021-07-05 14:20:38 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:36:15 UTC

Description W. Trevor King 2021-06-21 22:31:33 UTC
Update from 4.7.17 to 4.8.0-0.ci-2021-06-14-090014 stuck on [1]:

  : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	2h31m26s
    fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-1fiidp14/release@sha256:92fa389aa952b8ed952a3ffe624be6353e4be9ec2bec1a811f4172b324a7f84d
    Unexpected error:
        <*errors.errorString | 0xc0012d4d90>: {
            s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded",
        }
        Cluster did not complete upgrade: timed out waiting for the condition: 
    Cluster operator authentication is degraded
    occurred

  : [sig-cluster-lifecycle] Cluster completes upgrade	2h30m0s
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded

  : Operator upgrade authentication	0s
    Failed to upgrade authentication, operator was degraded 
(EncryptionMigrationController_Error): EncryptionMigrationControllerDegraded: etcdserver: request timed out

  : [bz-apiserver-auth] clusteroperator/authentication should not change condition/Degraded
    Run #0: Failed	0s
    2 unexpected clusteroperator state transitions during e2e test run 

    Jun 21 01:16:31.171 - 7104s E clusteroperator/authentication condition/Degraded status/True reason/EncryptionMigrationControllerDegraded: etcdserver: request timed out
    1 tests failed during this blip (2021-06-21 01:16:31.171240906 +0000 UTC to 2021-06-21 01:16:31.171240906 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

There were some other signs that etcd was briefly sad, but only the auth operator seems to have decided to stay mad about etcd for hours.  Checking on the auth pods:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace | contains("auth")) | .metadata as $m | .status.containerStatuses[].state.running.startedAt + " " + $m.namespace + " " + $m.name' | sort | tail -n3
2021-06-21T01:16:24Z openshift-oauth-apiserver apiserver-7cfbf75f7d-dvvnn
2021-06-21T01:19:18Z openshift-oauth-apiserver apiserver-7cfbf75f7d-bscbj
2021-06-21T03:16:41Z openshift-oauth-apiserver apiserver-7cfbf75f7d-v7pxg

Poking at that late one, it looks like a single restart:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "apiserver-7cfbf75f7d-v7pxg").status.containerStatuses[] | {restartCount, lastState,state}'
{
  "restartCount": 1,
  "lastState": {
    "terminated": {
      "containerID": "cri-o://46452442dcccabe46958da0d3cce07860b9ba893df9e0c90212878f2d0ccfcbf",
      "exitCode": 0,
      "finishedAt": "2021-06-21T03:16:41Z",
      "reason": "Completed",
      "startedAt": "2021-06-21T01:17:50Z"
    }
  },
  "state": {
    "running": {
      "startedAt": "2021-06-21T03:16:41Z"
    }
  }
}

Not sure why that container was terminated, but we have logs from its whole run, including the termination:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods/openshift-oauth-apiserver_apiserver-7cfbf75f7d-v7pxg_oauth-apiserver_previous.log | grep -1 03:15:40.011632 
I0621 03:15:20.999809       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0621 03:15:40.011632       1 genericapiserver.go:684] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"kube-system", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationStart' Received signal to terminate, becoming unready, but keeping serving
I0621 03:15:40.011692       1 genericapiserver.go:684] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"kube-system", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationPreShutdownHooksFinished' All pre-shutdown hooks have been finished

Seems fairly rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-devfile-integration-tests-main-v4.8.odo-integration-devfile-odo-periodic (all) - 24 runs, 29% failed, 14% of failures match = 4% impact
pull-ci-openshift-machine-api-operator-release-4.8-e2e-aws-disruptive (all) - 33 runs, 97% failed, 3% of failures match = 3% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 31 runs, 65% failed, 5% of failures match = 3% impact
periodic-ci-openshift-windows-machine-config-operator-release-4.9-vsphere-e2e-periodic (all) - 18 runs, 100% failed, 6% of failures match = 6% impact
pull-ci-openshift-installer-master-e2e-aws-workers-rhel7 (all) - 71 runs, 66% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 68 runs, 59% failed, 3% of failures match = 1% impact

Anyhow, I expect there was an etcd instability trigger at the beginning of this, but this bug is about the auth operator not recovering from that instability despite having lots of time to do so.  It's also possible that I'm just completely misreading these assets :p  And despite the rarity, the severity of hanging an update makes me think this is high-severity, at least until we have a better idea of what's going on.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512

Comment 1 Lukasz Szaszkiewicz 2021-07-01 21:37:08 UTC
The new instance:

acquired a lock at:
2021-06-21T01:14:51.630754688Z I0621 01:14:51.630702       1 leaderelection.go:253] successfully acquired lease openshift-authentication-operator/cluster-authentication-operator-lock

waited for caches for the encryption controllers at:
2021-06-21T01:14:51.706080667Z I0621 01:14:51.705910       1 base_controller.go:66] Waiting for caches to sync for EncryptionPruneController
2021-06-21T01:14:51.706093456Z I0621 01:14:51.705933       1 base_controller.go:66] Waiting for caches to sync for EncryptionMigrationController

observed a degraded condition at:
2021-06-21T01:14:51.806808962Z I0621 01:14:51.806784       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-06-21T00:40:05Z","message":"EncryptionMigrationControllerDegraded: etcdserver: request timed out\nAPIServerStaticResourcesDegraded: \"oauth-apiserver/svc.yaml\" (string): etcdserver: request timed out\nAPIServerStaticResourcesDegraded: \nRouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: etcdserver: request timed out","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-06-21T00:28:17Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-06-21T00:40:05Z","message":"OAuthServerDeploymentAvailable: availableReplicas==2","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-06-21T00:18:52Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}],"versions":[{"name":"oauth-apiserver","version":"4.7.17"},{"name":"operator","version":"4.8.0-0.ci-2021-06-14-090014"},{"name":"oauth-openshift","version":"4.7.17_openshift"}]}}

started the encryption controllers at:
2021-06-21T01:14:55.306409205Z I0621 01:14:55.306165       1 base_controller.go:72] Caches are synced for EncryptionMigrationController 
2021-06-21T01:14:55.306423053Z I0621 01:14:55.306411       1 base_controller.go:109] Starting #1 worker of EncryptionMigrationController controller ...

Based on the above and the attached screenshots, the degraded condition must have been set by the previous (4.7) instance. The new precondition checker added in 4.8 prevents the encryption controllers from running, so they never get a chance to clear the stale conditions.
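
For illustration, a minimal Go sketch of the pattern the linked "clear encryption conditions when there is no work to be done" PRs describe. The Condition type and the syncEncryption/preconditionsFulfilled names below are hypothetical stand-ins, not the actual library-go code:

package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified stand-in for an operator status condition
// (hypothetical type, for illustration only).
type Condition struct {
	Type    string
	Status  string
	Message string
}

// syncEncryption mimics the shape of the fix: compute the controller's
// degraded/progressing conditions as usual, but when the precondition
// check says encryption has no work to do, clear the conditions instead
// of leaving whatever a previous (4.7) operator instance wrote.
func syncEncryption(preconditionsFulfilled func() (bool, error), work func() error) []Condition {
	degraded := &Condition{Type: "EncryptionMigrationControllerDegraded", Status: "False"}
	progressing := &Condition{Type: "EncryptionMigrationControllerProgressing", Status: "False"}

	if ok, err := preconditionsFulfilled(); err != nil || !ok {
		// No work to be done: clear the conditions so a stale
		// Degraded=True does not persist indefinitely.
		degraded, progressing = nil, nil
	} else if err := work(); err != nil {
		degraded.Status = "True"
		degraded.Message = err.Error()
	}

	conditions := []Condition{}
	for _, c := range []*Condition{degraded, progressing} {
		if c != nil {
			conditions = append(conditions, c)
		}
	}
	return conditions
}

func main() {
	// Encryption has no work to do: the controller reports no
	// encryption conditions at all, dropping any stale ones.
	fmt.Println(syncEncryption(
		func() (bool, error) { return false, nil },
		func() error { return errors.New("etcdserver: request timed out") },
	))
}

In the sketch, a nil condition is simply dropped, so once the precondition check reports there is no work, nothing is left to keep the operator Degraded.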

Comment 5 W. Trevor King 2021-07-16 03:31:08 UTC
Seems more rare now, with no hits in jobs that could touch 4.9 code:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 6 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 19 runs, 100% failed, 5% of failures match = 5% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 6 runs, 100% failed, 17% of failures match = 17% impact

Back in c0, we had a few hits in periodic-ci-openshift-windows-machine-config-operator-release-4.9-vsphere-e2e-periodic and pull-ci-openshift-installer-master-e2e-aws-workers-rhel7.

Comment 6 liyao 2021-07-23 03:47:17 UTC
From the PR code:
1. It clears the stale conditions when the preconditions are not fulfilled, as shown below:
    if err != nil {
        degradedCondition = nil
        progressingCondition = nil
    }
2. All encryption condition types are covered in the code:
$ oc get kubeapiserver -o yaml | grep -i encr
      type: EncryptionPruneControllerDegraded
      type: EncryptionStateControllerDegraded
      type: Encrypted
      type: EncryptionKeyControllerDegraded
      type: EncryptionMigrationControllerDegraded
      type: EncryptionMigrationControllerProgressing 

From the CI results, there are no hits in jobs that touch 4.9 code:
$ curl -s 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | sed 's/<[^>]*>//g' | grep 'failures match' | sort
periodic-ci-openshift-release-master-nightly-4.8-e2e-azure (all) - 9 runs, 11% failed, 100% of failures match = 11% impact

Moving to VERIFIED based on the above.

Comment 9 errata-xmlrpc 2021-10-18 17:35:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

