Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1913341

Summary: GCP: strange cluster behavior in CI run
Product: OpenShift Container Platform
Reporter: Michael Gugino <mgugino>
Component: Cloud Compute
Assignee: Danil Grigorev <dgrigore>
Cloud Compute sub component: Other Providers
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: aos-bugs, dgrigore, jerzhang, mfojtik, xxia
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: A proxy update caused a full cluster configuration update, including a kube-apiserver restart. Consequence: Some operator tests timed out because of the unexpected API server outage. Fix: Separate the proxy tests and add postconditions that wait for cluster operators to become healthy again. Result: CI no longer flakes on this particular issue.
Story Points: ---
Last Closed: 2021-07-27 22:35:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Michael Gugino 2021-01-06 14:47:12 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/780/pull-ci-openshift-machine-api-operator-master-e2e-gcp-operator/1346779212207362048

Artifacts link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/780/pull-ci-openshift-machine-api-operator-master-e2e-gcp-operator/1346779212207362048/artifacts/e2e-gcp-operator/gather-extra/

First symptom: orphaned/superfluous replicaset
----------------------------------------------
It seems a superfluous replicaset (unready, 0 replicas) was created for the machine-api-controllers deployment: "machine-api-controllers-6cd6b9556f".

This replicaset has the right ownerRef UID. Its "creationTimestamp" is "2021-01-06T12:04:29Z".

The correct (wanted, and ready with 1 replica) replicaset is "name": "machine-api-controllers-6cd6b9556f",

It also has the right ownerRef UID and matches the name from the deployment's status. It was created at "creationTimestamp": "2021-01-06T12:00:26Z".

The unwanted replicaset is newer, but the deployment controller is not attempting to roll that replicaset out.  The cluster persisted for at least another hour during the CI run, so there's no reason this replicaset should exist.
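For anyone triaging similar runs, the check above can be automated over the gathered artifacts. This is a minimal sketch (a hypothetical helper, not part of the MAO or CI tooling) that flags unready replicasets owned by a given deployment; the input is assumed to be the "items" list from `oc get replicasets -o json` in the artifacts directory.

```python
# Hypothetical triage helper: find superfluous replicasets (0 ready replicas)
# owned by a given deployment, from `oc get replicasets -o json` output.
def superfluous_replicasets(items, deployment_name):
    """Return names of replicasets owned by `deployment_name` with 0 ready replicas."""
    flagged = []
    for rs in items:
        owners = rs.get("metadata", {}).get("ownerReferences", [])
        owned = any(
            o.get("kind") == "Deployment" and o.get("name") == deployment_name
            for o in owners
        )
        ready = rs.get("status", {}).get("readyReplicas", 0)
        if owned and ready == 0:
            flagged.append(rs["metadata"]["name"])
    return flagged
```

In a healthy steady state this should return an empty list (old replicasets scaled to 0 are only expected transiently during a rollout, not an hour later as seen here).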

Next Symptom: metrics
---------------------
mapi_mao_collector_up is not available until about 12:12.  It reports until 12:13 and then there is a gap in metrics until 12:16.  Metrics appear stable after that point.

cluster:nodes_roles doesn't appear until 12:13 and is stable after that point.

Next Symptom: kube-apiserver
----------------------------
There are some messages from the authentication operator in events.json complaining about waiting for the kube-apiserver:

            "lastTimestamp": "2021-01-06T11:49:25Z",
            "message": "Status for clusteroperator/authentication changed: Available message changed from \"OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints\\nOAuthServiceCheckEndpointAccessibleControllerAvailable: Get \\\"https://172.30.242.238:443/healthz\\\": dial tcp 172.30.242.238:443: i/o timeout (Client.Timeout exceeded while awaiting headers)\\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \\\"https://oauth-openshift.apps.ci-op-btl51m23-5f4c2.origin-ci-int-gce.dev.openshift.com/healthz\\\": EOF\\nWellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap \\\"oauth-openshift\\\" not found (check authentication operator, it is supposed to create this)\" to \"OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints\\nOAuthServiceCheckEndpointAccessibleControllerAvailable: Get \\\"https://172.30.242.238:443/healthz\\\": dial tcp 172.30.242.238:443: i/o timeout (Client.Timeout exceeded while awaiting headers)\\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \\\"https://oauth-openshift.apps.ci-op-btl51m23-5f4c2.origin-ci-int-gce.dev.openshift.com/healthz\\\": EOF\\nWellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)\"",


This goes on for a while and then goes away; then we see that the kube-apiserver might be dead:

            "lastTimestamp": "2021-01-06T12:06:23Z",
            "message": "Status for clusteroperator/authentication changed: Available changed from True to False (\"WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.3:6443: connect: connection refused (kube-apiserver is probably not running on that node, killed before graceful termination or crash-looping)\")",


These symptoms combined look like there may be problems with the API server.  Most concerning is the extra replicaset that is hanging around.

Comment 1 Michael Gugino 2021-01-06 14:48:58 UTC
Moving this over to kube-apiserver for further investigation.

Comment 2 Michael Gugino 2021-01-06 15:00:51 UTC
The MCO also decided it was appropriate to drain and update masters during the middle of the run?

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2021-01-06T12:04:24Z",
            "involvedObject": {
                "kind": "Node",
                "name": "ci-op-btl51m23-5f4c2-9trms-master-0",
                "uid": "afa93698-bf83-4966-bd4c-076ff331c723"
            },
            "kind": "Event",
            "lastTimestamp": "2021-01-06T12:04:24Z",
            "message": "Draining node to update config.",


I don't know if this is expected or some kind of bug, but it's going to wreck CI runs.

Comment 3 Yu Qi Zhang 2021-01-06 15:37:45 UTC
The MCO actually should have rebooted the nodes twice during your run:

1. once due to a change in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt, which went from empty to:

data:,%0A-----BEGIN%20CERTIFICATE-----%0AMIICQDCCAamgAwIBAgIUXyjCAhFKon7pffzuFrHRBmLQRqwwDQYJKoZIhvcNAQEL%0ABQAwMjELMAkGA1UEBhMCVVMxCzAJBgNVBAgMAk5DMRYwFAYDVQQDDA1taXRtLXBy%0Ab3h5LWNhMB4XDTIwMTEyMDIxMTk1NVoXDTQ4MDUwMTIxMTk1NVowMjELMAkGA1UE%0ABhMCVVMxCzAJBgNVBAgMAk5DMRYwFAYDVQQDDA1taXRtLXByb3h5LWNhMIGfMA0G%0ACSqGSIb3DQEBAQUAA4GNADCBiQKBgQDPqHuUZz1Qt236e03XWjsVCy1OnuuWQ3a1%0AOQX21waqypGPem6iKiSmAmf%2BYrKdkX2O5L2sqPjpZ%2Bciv9z5h9d0xCCNy%2B06UQTu%0A%2BpIrgIxm2p5wQmSGgI5KNUL6dgU3dC5aRDMU7hV0RGXKrJktTmU2SnZsOUe4SZ5L%0AVyanpk8N1QIDAQABo1MwUTAdBgNVHQ4EFgQUFD2D%2FLygThvwLRKukmH1Y%2B%2BhxLIw%0AHwYDVR0jBBgwFoAUFD2D%2FLygThvwLRKukmH1Y%2B%2BhxLIwDwYDVR0TAQH%2FBAUwAwEB%0A%2FzANBgkqhkiG9w0BAQsFAAOBgQCLZeGgfFRCOWqSXcDzUzeJE03HD%2Fy2rLY%2BhDMn%0Akj69xY3%2BK0z4DRbb1vpMiYsTYF%2FZ1d6blND4KW%2B7oi9R6PPYg6XATDMf9tVfTQIe%0AqBkDNfqABPlwJABpixwD20XXQUBqADEyO3tdQLtiMi5Qr6QHOX3%2BFepiHAgdxAFt%0AMqy4Gw%3D%3D%0A-----END%20CERTIFICATE-----%0A

2. once due to a change to the proxy configs, which also went from empty to:

[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://172.30.236.49:8080\nEnvironment=HTTPS_PROXY=http://172.30.236.49:8080\nEnvironment=NO_PROXY=.cluster.local,.com,.org,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-btl51m23-5f4c2.origin-ci-int-gce.dev.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,metadata,metadata.google.internal,metadata.google.internal.,quay.io\n"

Both are read directly from the cluster-level objects watched by the MCO; this is expected MCO behaviour. The first change happened ~18 minutes after the bootstrap MCC rendered the initial configs (reading cluster-level configs), and the second happened ~5 minutes after that. If that timeline doesn't sound correct, perhaps look into why those configs were generated late.
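For context, the cluster-level object behind the second change is the cluster-scoped Proxy resource that the MCO (among other operators) watches. A sketch of its shape, with the values taken from the proxy config quoted above (the noProxy list here is abbreviated, not the full list from this run):

```yaml
# Illustrative shape of the cluster-scoped Proxy object; the proxy address
# matches the event above, the noProxy list is abbreviated.
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://172.30.236.49:8080
  httpsProxy: http://172.30.236.49:8080
  noProxy: .cluster.local,.svc,10.128.0.0/14,127.0.0.1,localhost
```

When a test mutates this object mid-run, every watcher reacts: the MCO renders new MachineConfigs and drains/reboots nodes, which lines up with the drain events seen here.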

Comment 4 Michael Gugino 2021-01-06 15:54:19 UTC
The unwanted replicaSet was created suspiciously close to when the MCO rollout began: 
"creationTimestamp": "2021-01-06T12:04:29Z"


MCO event:
"firstTimestamp": "2021-01-06T12:04:24Z",


Comparing the two replicasets, the unwanted one contains PROXY/NOPROXY info, while the wanted one does not. How did this proxy information get injected into a replicaset but not the deployment? Possibly something updated the deployment, the deployment created a new replicaset, and then, before anything else could happen, the MAO rolled the deployment back for some reason. If the MAO rolled it back, it would seem the MAO didn't add the proxy information in the first place.
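The comparison described above can also be scripted against the gathered JSON. A minimal sketch (hypothetical helper, not MAO code) that extracts the proxy-related env vars from each replicaset's pod template and diffs them:

```python
# Hypothetical triage helper: diff proxy env vars between two replicaset
# pod templates, as found in `oc get replicasets -o json` output.
PROXY_VARS = {"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}

def proxy_env(rs):
    """Collect proxy env vars set on any container in the pod template."""
    found = {}
    containers = (
        rs.get("spec", {}).get("template", {}).get("spec", {}).get("containers", [])
    )
    for c in containers:
        for env in c.get("env", []):
            if env.get("name") in PROXY_VARS:
                found[env["name"]] = env.get("value")
    return found

def proxy_env_diff(rs_a, rs_b):
    """Return (vars only in rs_a, vars only in rs_b)."""
    a, b = proxy_env(rs_a), proxy_env(rs_b)
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    return only_a, only_b
```

Run against the two replicasets from this run, the unwanted one would show the injected HTTP_PROXY/HTTPS_PROXY/NO_PROXY values while the wanted one shows none.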

Comment 5 Michael Gugino 2021-01-06 16:39:56 UTC
Moving back to cloud.  This is most likely due to our proxy tests.

Comment 6 Joel Speed 2021-02-08 10:03:46 UTC
We are planning to move this test to the beginning of the test suite as a disruptive test during the next cycle; this should resolve the issue.

Comment 7 Joel Speed 2021-02-24 13:54:02 UTC
Some progress was made on a blocker for this issue during this cycle. @dgrigore would you be able to handle this during the next sprint once your existing PR is merged?

Comment 11 errata-xmlrpc 2021-07-27 22:35:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 12 Red Hat Bugzilla 2023-09-15 00:57:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.