Bug 1913341
| Summary: | GCP: strange cluster behavior in CI run | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino> |
| Component: | Cloud Compute | Assignee: | Danil Grigorev <dgrigore> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, dgrigore, jerzhang, mfojtik, xxia |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:35:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause: A proxy update caused a full cluster configuration update, including an API server restart.
Consequence: Some operator tests timed out because of the unexpected API server outage.
Fix: The proxy tests were separated out, and postconditions were added requiring cluster operators to become healthy again.
Result: CI no longer flakes on this particular issue.
Description
Michael Gugino
2021-01-06 14:47:12 UTC
Moving this over to kube-apiserver for further investigation. The MCO also decided it was appropriate to drain and update masters during the middle of the run?
{
  "apiVersion": "v1",
  "count": 1,
  "eventTime": null,
  "firstTimestamp": "2021-01-06T12:04:24Z",
  "involvedObject": {
    "kind": "Node",
    "name": "ci-op-btl51m23-5f4c2-9trms-master-0",
    "uid": "afa93698-bf83-4966-bd4c-076ff331c723"
  },
  "kind": "Event",
  "lastTimestamp": "2021-01-06T12:04:24Z",
  "message": "Draining node to update config.",
  ...
}
I don't know if this is expected or some kind of bug, but it's going to wreck CI runs.
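One way to catch this kind of surprise in CI is to scan the run's event stream for MCO drain events on Node objects inside the test window. The sketch below mirrors the Event shape excerpted above; the helper name and the notion of a "test window" are illustrative assumptions, not part of the MCO or the OpenShift test suite.

```python
from datetime import datetime, timezone

def unexpected_drains(events, window_start, window_end):
    """Return drain events on Node objects inside the given window.

    `events` is a list of dicts shaped like `oc get events -o json` items;
    the MCO reports drains with the message "Draining node to update config."
    """
    hits = []
    for ev in events:
        if ev.get("involvedObject", {}).get("kind") != "Node":
            continue
        if "Draining node" not in ev.get("message", ""):
            continue
        # Kubernetes timestamps are RFC 3339 with a trailing "Z".
        ts = datetime.fromisoformat(ev["lastTimestamp"].replace("Z", "+00:00"))
        if window_start <= ts <= window_end:
            hits.append(ev)
    return hits

# The event from this bug report, reduced to the fields the helper reads:
event = {
    "involvedObject": {"kind": "Node", "name": "ci-op-btl51m23-5f4c2-9trms-master-0"},
    "lastTimestamp": "2021-01-06T12:04:24Z",
    "message": "Draining node to update config.",
}
window = (datetime(2021, 1, 6, 12, 0, tzinfo=timezone.utc),
          datetime(2021, 1, 6, 13, 0, tzinfo=timezone.utc))
print(len(unexpected_drains([event], *window)))  # 1 drain inside the window
```

A test-suite postcondition could fail the run whenever this returns a non-empty list during a phase where no MCO rollout is expected.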
The MCO actually should have rebooted nodes twice during your run:

1. Once due to a change in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt, which went from empty to:

   data:,%0A-----BEGIN%20CERTIFICATE-----%0AMIICQDCCAamgAwIBAgIUXyjCAhFKon7pffzuFrHRBmLQRqwwDQYJKoZIhvcNAQEL%0ABQAwMjELMAkGA1UEBhMCVVMxCzAJBgNVBAgMAk5DMRYwFAYDVQQDDA1taXRtLXBy%0Ab3h5LWNhMB4XDTIwMTEyMDIxMTk1NVoXDTQ4MDUwMTIxMTk1NVowMjELMAkGA1UE%0ABhMCVVMxCzAJBgNVBAgMAk5DMRYwFAYDVQQDDA1taXRtLXByb3h5LWNhMIGfMA0G%0ACSqGSIb3DQEBAQUAA4GNADCBiQKBgQDPqHuUZz1Qt236e03XWjsVCy1OnuuWQ3a1%0AOQX21waqypGPem6iKiSmAmf%2BYrKdkX2O5L2sqPjpZ%2Bciv9z5h9d0xCCNy%2B06UQTu%0A%2BpIrgIxm2p5wQmSGgI5KNUL6dgU3dC5aRDMU7hV0RGXKrJktTmU2SnZsOUe4SZ5L%0AVyanpk8N1QIDAQABo1MwUTAdBgNVHQ4EFgQUFD2D%2FLygThvwLRKukmH1Y%2B%2BhxLIw%0AHwYDVR0jBBgwFoAUFD2D%2FLygThvwLRKukmH1Y%2B%2BhxLIwDwYDVR0TAQH%2FBAUwAwEB%0A%2FzANBgkqhkiG9w0BAQsFAAOBgQCLZeGgfFRCOWqSXcDzUzeJE03HD%2Fy2rLY%2BhDMn%0Akj69xY3%2BK0z4DRbb1vpMiYsTYF%2FZ1d6blND4KW%2B7oi9R6PPYg6XATDMf9tVfTQIe%0AqBkDNfqABPlwJABpixwD20XXQUBqADEyO3tdQLtiMi5Qr6QHOX3%2BFepiHAgdxAFt%0AMqy4Gw%3D%3D%0A-----END%20CERTIFICATE-----%0A

2. Once due to a change to the proxy configs, which also went from empty to:

   [Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://172.30.236.49:8080\nEnvironment=HTTPS_PROXY=http://172.30.236.49:8080\nEnvironment=NO_PROXY=.cluster.local,.com,.org,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-btl51m23-5f4c2.origin-ci-int-gce.dev.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,metadata,metadata.google.internal,metadata.google.internal.,quay.io\n

Both are read directly from the cluster-level objects watched by the MCO. This is expected MCO behaviour. The first change happened ~18 minutes after the bootstrap MCC rendered the initial configs (reading cluster-level configs), and the second happened ~5 minutes after that. If that timeline doesn't sound correct, perhaps look into why those configs were generated late.
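For readers unfamiliar with the NO_PROXY list shown above: it mixes literal hosts, leading-dot domain suffixes, and CIDR blocks. Matching semantics vary between components, so the sketch below is only an approximation of the common behaviour, not the exact logic any OpenShift component uses.

```python
import ipaddress

def bypasses_proxy(host, no_proxy):
    """Rough NO_PROXY matching: literal hosts, leading-dot domain
    suffixes, and CIDR blocks for IP-literal hosts. Real proxy clients
    differ in detail; treat this as an illustration."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None  # host is a name, not an IP literal
    for entry in no_proxy.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if "/" in entry and addr is not None:
            try:
                if addr in ipaddress.ip_network(entry, strict=False):
                    return True  # IP falls inside a CIDR entry
            except ValueError:
                continue
        elif entry.startswith("."):
            # ".svc" matches "registry.svc" and the bare domain "svc"
            if host.endswith(entry) or host == entry[1:]:
                return True
        elif host == entry:
            return True
    return False

# A trimmed version of the NO_PROXY value from the proxy config above:
no_proxy = ".cluster.local,.svc,10.0.0.0/16,127.0.0.1,localhost,metadata"
print(bypasses_proxy("10.0.3.7", no_proxy))         # True (in 10.0.0.0/16)
print(bypasses_proxy("registry.svc", no_proxy))     # True (suffix .svc)
print(bypasses_proxy("quay.example.com", no_proxy)) # False
```

This also shows why a proxy rollout is so disruptive here: every in-cluster endpoint that is not covered by a NO_PROXY entry suddenly routes through the proxy once the MCO applies the new config.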
The unwanted ReplicaSet was created suspiciously close to when the MCO rollout began:

ReplicaSet: "creationTimestamp": "2021-01-06T12:04:29Z"
MCO event: "firstTimestamp": "2021-01-06T12:04:24Z"

Comparing the two ReplicaSets, the unwanted one contains PROXY/NOPROXY info, while the wanted one does not. How did this proxy information get injected into a ReplicaSet but not a Deployment? Possibly something updated the Deployment, the Deployment created the new ReplicaSet, and before anything else could take place, the MAO rolled the Deployment back for some reason. If the MAO rolled it back, it would seem the MAO didn't add the proxy information in the first place. Moving back to cloud.

This is most likely due to our proxy tests. We are planning to move this test to the beginning of the test suite as a disruptive test during the next cycle; this should resolve the issue.

Some progress was made on a blocker for this issue during this cycle. @dgrigore, would you be able to handle this during the next sprint once your existing PR is merged?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
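The fix recorded in the Doc Text, adding postconditions so cluster operators return to health after the disruptive proxy test, amounts to a poll-until-healthy loop over ClusterOperator conditions. A minimal sketch, with a caller-supplied `get_operators` function standing in for a real `oc get clusteroperators` query (the function name and status shape are assumptions for illustration):

```python
import time

def wait_for_operators_healthy(get_operators, timeout=600, interval=5,
                               clock=time.monotonic, sleep=time.sleep):
    """Poll until every cluster operator reports Available=True,
    Progressing=False, Degraded=False, or raise on timeout.

    `get_operators` returns {name: {"Available": bool,
    "Progressing": bool, "Degraded": bool}}.
    """
    deadline = clock() + timeout
    while True:
        unhealthy = [
            name for name, cond in get_operators().items()
            if not cond["Available"] or cond["Progressing"] or cond["Degraded"]
        ]
        if not unhealthy:
            return True
        if clock() >= deadline:
            raise TimeoutError(f"operators still unhealthy: {unhealthy}")
        sleep(interval)

# Simulated rollout: kube-apiserver finishes progressing on the second poll.
states = iter([
    {"kube-apiserver": {"Available": True, "Progressing": True, "Degraded": False}},
    {"kube-apiserver": {"Available": True, "Progressing": False, "Degraded": False}},
])
print(wait_for_operators_healthy(lambda: next(states), timeout=10, interval=0))
# True
```

Running this as a postcondition after the proxy test keeps a mid-rollout API server restart from bleeding into the next test's window, which is exactly the flake this bug describes.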