Description of problem:
Tracker bug for release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 test failures.

Version-Release number of selected component (if applicable):
Is this related to the MCO? Can you be more specific? Logs/artifacts?
I see the first error reported as:

Sep 20 08:24:28.457: INFO: Unexpected error occurred: Cluster did not complete upgrade: timed out waiting for the condition: unable to retrieve cluster version during upgrade: Get https://api.ci-op-s77fxbzv-45560.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions/version: dial tcp 52.206.198.100:6443: connect: connection refused

Failing jobs:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/176
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/169
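To decode that nested error: the outer "timed out waiting for the condition" is the generic polling timeout (the error text wait.PollImmediate returns), and the inner "connection refused" is what each individual GET of the clusterversions/version resource returned while the apiservers were unreachable. For illustration only, here is a minimal sketch of that kind of poll loop in Go. It is not the actual e2e harness code; the URL, interval, timeout, and skipped TLS verification are all assumptions.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Hypothetical endpoint; the real job hits the cluster's api URL on :6443.
	url := "https://api.example.invalid:6443/apis/config.openshift.io/v1/clusterversions/version"

	client := &http.Client{
		Timeout: 10 * time.Second,
		// CI clusters use self-signed serving certs; skip verification for this probe only.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	var lastErr error
	// wait.PollImmediate is what produces the outer "timed out waiting for the condition".
	err := wait.PollImmediate(15*time.Second, 75*time.Minute, func() (bool, error) {
		resp, err := client.Get(url)
		if err != nil {
			// While the kube-apiservers are down this is the inner
			// "dial tcp ...:6443: connect: connection refused".
			lastErr = err
			return false, nil
		}
		resp.Body.Close()
		lastErr = nil
		// Real code authenticates with the test kubeconfig and inspects the
		// ClusterVersion status; for this sketch any HTTP response means the
		// apiserver is reachable again.
		return true, nil
	})
	if err != nil {
		fmt.Printf("Cluster did not complete upgrade: %v: unable to retrieve cluster version during upgrade: %v\n", err, lastErr)
	}
}

The practical point is that the reported failure is only the last symptom of the poll loop; the interesting question is why the endpoint refused connections for the whole window.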
*** Bug 1754537 has been marked as a duplicate of this bug. ***
*** Bug 1754526 has been marked as a duplicate of this bug. ***
I actually filed this bug to collect the individual bugs in Dependencies. So, the dependencies aren't really duplicates. Sorry I didn't make it clear early on.
Okay, let's make one bug per job type if there is a similar pattern and it is recent and relatively high-frequency. Usually the first failure leads to a cascade of failures, so we don't need a bug per failure in a failing job.
*** Bug 1754555 has been marked as a duplicate of this bug. ***
kube-apiserver is not responding to requests. A quick look doesn't show any clear indication as to why, and must-gather is of course failing, so I'm not sure how much we'll get here. Moving this to the kube-apiserver component.
installer-e2e-aws-upgrade seems to have a similar failure log: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7466
Working from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/176

Etcd stops on node 43 and we lose all connectivity for 7 minutes. When it comes back, the kube-apiservers are failing health checks, but they are still somehow responding (they must all have failed the health check for the AWS route?).

Two noteworthy bugs so far (plus one wish):

1. The kubelet is taking pods from Running to Pending. 101 instances like:
   Sep 23 07:42:58.748 W ns/openshift-console pod/console-9dd8ff7c5-bcqwv node/ip-10-0-141-43.ec2.internal invariant violation (bug): pod should not transition Running->Pending even when terminated

2. The MCO reboot event doesn't appear reliable. We see it for node 211 before draining starts, but node 43 starts draining without that event. Or we really do drain twice:
   Sep 23 07:32:23.805 I node/ip-10-0-142-211.ec2.internal Draining node to update config.
   Sep 23 07:35:00.728 W node/ip-10-0-142-211.ec2.internal Node ip-10-0-142-211.ec2.internal has been rebooted, boot id: 20996f4a-ec9f-442d-8d1e-d6f4a71df06e
   Sep 23 07:35:33.733 I node/ip-10-0-147-211.ec2.internal Draining node to update config.

3. Thing I wish I had: kube-apiserver healthz. I'll see about updating the e2e monitor (see the sketch at the end of this comment).

Discounted theories:

1. etcd quorum destroyed: false. If this were the case, we wouldn't be able to watch events from the e2e monitor.

Unexplained timeline:

0. Something triggers an upgrade.
1. Sep 23 07:24:07.047: kube-apiserver finishes "upgrade" to revision 8.
2. MCO starts rebooting machines:
   Sep 23 07:34:46.543 W node/ip-10-0-143-11.ec2.internal Node ip-10-0-143-11.ec2.internal has been rebooted, boot id: 299f6b83-a581-4ff7-b6a4-9d821153b4c0
   Sep 23 07:35:00.728 W node/ip-10-0-142-211.ec2.internal Node ip-10-0-142-211.ec2.internal has been rebooted, boot id: 20996f4a-ec9f-442d-8d1e-d6f4a71df06e
3. MCO issued the reboot, and now we see workloads being killed, including etcd:
   Sep 23 07:35:08.269 W ns/openshift-etcd pod/etcd-member-ip-10-0-142-211.ec2.internal node/ip-10-0-142-211.ec2.internal graceful deletion within 0s
   Sep 23 07:35:08.278 W ns/openshift-etcd pod/etcd-member-ip-10-0-142-211.ec2.internal node/ip-10-0-142-211.ec2.internal deleted
4. Cool kubelet bug happens again:
   Sep 23 07:35:27.477 W ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-142-211.ec2.internal node/ip-10-0-142-211.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
5. Node 43 starts draining *before* etcd is back up. Looks like that kubelet reboot event isn't reliable:
   Sep 23 07:35:47.754 I node/ip-10-0-141-43.ec2.internal Draining node to update config.
6. The next mention of etcd is it stopping on node 43, *before* 211 is back up:
   Sep 23 07:35:55.522 I ns/openshift-etcd pod/etcd-member-ip-10-0-141-43.ec2.internal Stopping container etcd-metrics
   Sep 23 07:35:55.534 I ns/openshift-etcd pod/etcd-member-ip-10-0-141-43.ec2.internal Stopping container etcd-member
7. All kube-apiservers start reporting 500 status, probably because they cannot contact etcd:
   Sep 23 07:42:57.295 W ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-141-43.ec2.internal Readiness probe failed: HTTP probe failed with statuscode: 500 (2 times)
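Re: item 3 above (kube-apiserver healthz in the e2e monitor), here is a rough sketch of what a per-master healthz sampler could look like, so outage windows show up in the event stream. This is not the monitor's actual code; the endpoint addresses and intervals are invented for illustration, and depending on cluster RBAC the probe may need a bearer token instead of the anonymous GET shown here.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical master endpoints; a real monitor would discover these from
	// the node or endpoints objects rather than hard-coding them.
	targets := []string{
		"https://10.0.141.43:6443/healthz",
		"https://10.0.142.211:6443/healthz",
		"https://10.0.143.11:6443/healthz",
	}

	client := &http.Client{
		Timeout: 5 * time.Second,
		// The serving cert is cluster-internal; skip verification for the probe.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for _, url := range targets {
			ts := time.Now().Format("Jan 02 15:04:05.000")
			resp, err := client.Get(url)
			if err != nil {
				// These gaps are exactly what is missing from the current
				// event stream, e.g. the ~7 minute outage noted above.
				fmt.Printf("%s E %s unreachable: %v\n", ts, url, err)
				continue
			}
			fmt.Printf("%s I %s healthz=%d\n", ts, url, resp.StatusCode)
			resp.Body.Close()
		}
	}
}

Emitting one line per target per tick would make gaps like the outage above directly visible next to the drain and reboot events.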
Is this a dup of bug 1754133?
*** Bug 1755041 has been marked as a duplicate of this bug. ***
Verified the CI issue yesterday; machine content was out of date.