We saw the following symptoms in this CI job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1414662507020161024

- Multiple unrelated pods fail probes around the same time
- The API server is unreachable on both old and new connections
- etcd communication over the host network is disrupted

I dug into the SDN logs and can't find anything obviously wrong, but everything points at something systemic with networking. I also dug through the historical Loki logs for this job in the SDN space, and likewise nothing sticks out: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221625518320244%22,%221626123120244%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade%2F1414662507020161024%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-sdn%5C%22%22%7D%5D
TRT is setting blocker+ due to the impact on upgrades. This appears to affect most GCP upgrade jobs, and there is some evidence it may be happening on AWS too:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1414686006967275520
@bbennett, @vpickard: I will not be able to figure this one out in any timely fashion. I can help however possible, but since this is blocker+ I suggest we find someone more expert in the nitty-gritty of what could be wrong.

The most common failure in this job is "Cluster should remain functional during upgrade", which, if you dig into a few of the runs, is telling us that different APIs are simply not responding due to i/o timeouts.

Another test case that stands out to me is "Check if alerts are firing during or after upgrade success". It usually flakes, but sometimes fails (meaning it failed twice in a row). This is the message:

  alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}
  alert HighOverallControlPlaneCPU fired for 180 seconds with labels: {severity="warning"}

Also, the last three GCP upgrade jobs failed the test about taking too long to upgrade. Is the cluster choking? Is it worth seeing what these jobs look like when run on more resources? We do that with our network-stress job, so I can look into trying that just to give us a data point.

For reference, I found similar types of failures, although not as consistent, in the ovn upgrade job (same versions). The aws upgrade job [1] (not ovn) failed the same top-level test case (cluster should remain functional...), but with a different log, about a KubePodNotReady alert firing.

Another test case that is frequently failing is "Cluster frontend ingress remain available", which may be related to this. There is a BZ for that already, but the PR for it to pre-pull images doesn't seem to help.
[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1417562198238040064
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade/1418013626882592768
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1970985
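The i/o-timeout failures above come down to windows during which the API server stops answering GET requests. As a hand-triage aid, here is a minimal, self-contained sketch (my own, not the origin test code) of collapsing a series of timestamped probe results into disruption windows; the probe data below is synthetic:

```python
from datetime import datetime, timedelta

def disruption_windows(samples):
    """Collapse (timestamp, ok) probe samples into [(start, end), ...]
    intervals during which probes were failing."""
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                   # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends
            start = None
    if start is not None:                # still failing at end of series
        windows.append((start, samples[-1][0]))
    return windows

# Synthetic data: one probe per second, with a 3-second outage in the middle.
t0 = datetime(2021, 7, 12, 12, 0, 0)
samples = [(t0 + timedelta(seconds=i), i not in (4, 5, 6)) for i in range(10)]
for start, end in disruption_windows(samples):
    print(f"API unreachable for {(end - start).total_seconds():.0f}s "
          f"starting at {start.time()}")
```

In a real triage you would feed this the poller output from the job artifacts rather than synthetic samples.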
> alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}

This is bad component behavior.

> alert HighOverallControlPlaneCPU fired for 180 seconds with labels: {severity="warning"}

This one might be bad component behavior? Or might be an overly-touchy alert rule? Or might be underprovisioned CI machines? Or might be overly hungry OpenShift components? I'd opened [1] for "it's fine, just ignore", but will try to drum up some interest in getting this addressed one way or the other. I expect it's orthogonal to the SDN/API-connectivity issues.

[1]: https://github.com/openshift/origin/pull/26341
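The fired-alert messages quoted above have a fixed shape, so when comparing many job runs it can help to parse them into structured records. A throwaway sketch (my own, not part of any test suite), assuming the message format stays exactly as quoted:

```python
import re

# Matches lines like:
#   alert AggregatedAPIErrors fired for 30 seconds with labels: {severity="warning"}
ALERT_RE = re.compile(
    r'alert (?P<name>\S+) fired for (?P<seconds>\d+) seconds '
    r'with labels: \{(?P<labels>[^}]*)\}'
)

def parse_fired_alert(msg):
    """Parse one 'alert ... fired for ... seconds' line into a dict, or None."""
    m = ALERT_RE.search(msg)
    if m is None:
        return None
    labels = dict(
        part.split('=', 1) for part in m.group('labels').split(', ') if part
    )
    # Strip the quotes around each label value.
    labels = {k: v.strip('"') for k, v in labels.items()}
    return {'name': m.group('name'),
            'seconds': int(m.group('seconds')),
            'labels': labels}

msg = ('alert AggregatedAPIErrors fired for 30 seconds with labels: '
       '{name="v1.packages.operators.coreos.com", namespace="default", '
       'severity="warning"}')
print(parse_fired_alert(msg))
```

Running this over the "Check if alerts are firing" output from many runs makes it easy to tally which alerts recur and for how long.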
> Another test case that shows up to me is "Check if alerts are firing during
> or after upgrade success". It usually flakes, but sometimes
> fails (means it failed twice in a row). This is the message:
>
> alert AggregatedAPIErrors fired for 30 seconds with labels:
> {name="v1.packages.operators.coreos.com", namespace="default",
> severity="warning"}
> alert HighOverallControlPlaneCPU fired for 180 seconds with labels:
> {severity="warning"}

Because of this warning, I wrote a Jira ticket to investigate what these upgrade jobs look like on larger nodes. I was able to test it, and there is no difference in test results. Totally just an FYI: https://issues.redhat.com/browse/SDN-2069
Hi Jamo,

Do we have a status update on this bug? Any progress made or help needed?

/Alex
(In reply to Alexander Constantinescu from comment #6)
> Hi Jamo
>
> Do we have a status update on this bug? Any progress made or help needed?
>
> /Alex

@aconstan, back in comment #3 I was asking Ben and Vic if we could get help. This is essentially a catch-all BZ because our gcp upgrade job is failing most of the time for multiple reasons (see description). I moved this BZ into our current sprint so I can finally take a closer look.

I think it was actually worse back in mid July when this was filed. Currently (the past ~2 weeks) the job is failing 40% of the time, and in this period the failures are almost always because of one failing test case:

"[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]"

The causes of this test case failure appear to be multiple as well, although I can't say whether or not they share a single root cause. There are 15 unresolved BZs that have some comment calling out this test case [0]. Digging through the last two weeks of failures of this test case, I see:

1) "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is degraded" -and- "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded"

- There are plenty of open bugs in bz matching 'Cluster operator <X> is degraded'. Do we need deep dives into each case to open new bugs, or can we attribute them to existing open BZs?
2) "kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests"

- I think this is tracked in this NEW bz: https://bugzilla.redhat.com/show_bug.cgi?id=1955333

3) "alert HighOverallControlPlaneCPU fired for 120 seconds with labels" -and- "alert ExtremelyHighIndividualControlPlaneCPU fired for 90 seconds with labels"

- This seems likely to be https://bugzilla.redhat.com/show_bug.cgi?id=1985073, which has some PRs in POST.

4) "alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}"

- This had a similar bug that seemed specific to oVirt, but it was closed recently due to insufficient data: https://bugzilla.redhat.com/show_bug.cgi?id=1974283. Either we can try to re-open that one or file a new one.

What is the consensus here? Keep this open as a catch-all until the upgrade jobs start passing more reliably? If so, I would not keep it as a blocker; making upgrade jobs pass is not something we have blocked releases on in the past. Or can we close this bug and create some tasks (Jira or BZs) to dig into 1), and possibly re-open 4)?

[0] https://bugzilla.redhat.com/buglist.cgi?bug_status=ASSIGNED&bug_status=POST&bug_status=NEW&bug_status=MODIFIED&known_name=Openshift%20Container%20Search&list_id=12085872&longdesc=Cluster%20should%20remain%20functional%20during%20upgrade&longdesc_type=substring&product=OpenShift%20Container%20Platform&query_based_on=Openshift%20Container%20Search&query_format=advanced
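For the "Cluster operator <X> is degraded" bucket in 1), pulling the degraded operators out of a live cluster or a must-gather is mechanical. A minimal sketch that filters `oc get clusteroperators -o json` output; the condition field names follow the config.openshift.io/v1 ClusterOperator API, and the sample data here is made up:

```python
import json

def degraded_operators(clusteroperators_json):
    """Return [(name, message)] for operators whose Degraded condition is True."""
    out = []
    for co in clusteroperators_json['items']:
        for cond in co['status'].get('conditions', []):
            if cond['type'] == 'Degraded' and cond['status'] == 'True':
                out.append((co['metadata']['name'], cond.get('message', '')))
    return out

# Made-up sample standing in for `oc get clusteroperators -o json` output.
sample = json.loads('''{
  "items": [
    {"metadata": {"name": "openshift-apiserver"},
     "status": {"conditions": [
       {"type": "Degraded", "status": "True",
        "message": "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"}]}},
    {"metadata": {"name": "network"},
     "status": {"conditions": [
       {"type": "Degraded", "status": "False", "message": ""}]}}
  ]
}''')

for name, message in degraded_operators(sample):
    print(f"{name}: {message}")
```

The Degraded messages themselves are usually what decides which existing BZ a failure belongs to.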
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1996187 for 1) in comment #7. Closing this out since we have multiple other bugs tracking failures in the upgrade job. I don't expect the upgrade job to turn fully green once all of them are fixed, but it would behoove us to find a way to have some person or people consistently dig into the failures here (and in the ovn job) and file specific BZs as they are uncovered.