Bug 1754523 - [build-cop] Tracker for release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 failures
Summary: [build-cop] Tracker for release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 failures
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1754526 1754537 1754555 1755041
Depends On: 1754526 1754537 1754555
Blocks: 1755066
 
Reported: 2019-09-23 13:24 UTC by Lokesh Mandvekar
Modified: 2019-09-24 17:06 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1755066
Environment:
Last Closed: 2019-09-24 17:06:45 UTC
Target Upstream Version:
Embargoed:



Description Lokesh Mandvekar 2019-09-23 13:24:56 UTC
Description of problem:


Version-Release number of selected component (if applicable):

Comment 1 Lokesh Mandvekar 2019-09-23 13:26:22 UTC
Bug tracker for release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 test failures.

Comment 2 Kirsten Garrison 2019-09-23 17:09:59 UTC
Is this related to the MCO? Can you be more specific? Logs/artifacts?

Comment 3 Mrunal Patel 2019-09-23 17:20:29 UTC
I see the first error reported as:

    Sep 20 08:24:28.457: INFO: Unexpected error occurred: Cluster did not complete upgrade: timed out waiting for the condition: unable to retrieve cluster version during upgrade: Get https://api.ci-op-s77fxbzv-45560.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions/version: dial tcp 52.206.198.100:6443: connect: connection refused

Failing jobs:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/176
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/169
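
For context, the "unable to retrieve cluster version" failure above is just a GET on the config.openshift.io/v1 ClusterVersion object named "version", and the connection refused means the apiserver endpoint itself was unreachable. A minimal sketch of that lookup (not the actual test code; assumes a recent client-go, and the kubeconfig path is illustrative):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Illustrative kubeconfig path; the e2e test wires up its own client.
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusterversions"}
        cv, err := client.Resource(gvr).Get(context.TODO(), "version", metav1.GetOptions{})
        if err != nil {
            // A down apiserver endpoint surfaces here as the "connection refused"
            // seen in the failure above.
            panic(err)
        }
        fmt.Println("observed ClusterVersion:", cv.GetName())
    }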

Comment 4 Mrunal Patel 2019-09-23 17:23:38 UTC
*** Bug 1754537 has been marked as a duplicate of this bug. ***

Comment 5 Mrunal Patel 2019-09-23 17:26:46 UTC
*** Bug 1754526 has been marked as a duplicate of this bug. ***

Comment 6 Lokesh Mandvekar 2019-09-23 17:30:00 UTC
I actually filed this bug to collect the individual bugs in Dependencies. So, the dependencies aren't really duplicates. Sorry I didn't make it clear early on.

Comment 7 Mrunal Patel 2019-09-23 17:34:10 UTC
Okay, let's make one bug per job type when there is a similar pattern that is recent and occurs at relatively high frequency. Usually the first failure leads to a cascade of failures, so we don't need a bug per failure in a failing job.

Comment 8 Lokesh Mandvekar 2019-09-23 17:47:18 UTC
*** Bug 1754555 has been marked as a duplicate of this bug. ***

Comment 9 Scott Dodson 2019-09-23 18:28:07 UTC
kube-apiserver is not responding to requests. A quick look doesn't show any clear indication as to why, and must-gather is of course failing, so I'm not sure how much we'll get here. Moving to kube-apiserver.

Comment 10 Lokesh Mandvekar 2019-09-23 19:40:53 UTC
installer-e2e-aws-upgrade seems to have a similar failure log: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7466

Comment 11 David Eads 2019-09-23 20:19:00 UTC
Working from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/176

Etcd stops on node 43 and we lose all connection for 7 minutes. When it comes back, the kube-apiservers are failing health checks but are still somehow responding (they must have all failed the health check for the AWS route?).


Two noteworthy bugs so far (plus one thing I wish I had):
1. The kubelet is taking pods from Running to Pending. 101 instances like:
   Sep 23 07:42:58.748 W ns/openshift-console pod/console-9dd8ff7c5-bcqwv node/ip-10-0-141-43.ec2.internal invariant violation (bug): pod should not transition Running->Pending even when terminated


2. The MCO reboot event doesn't appear reliable. We see it for node 211 before draining starts, but node 43 starts draining without that event. Or we really do drain twice:
   Sep 23 07:32:23.805 I node/ip-10-0-142-211.ec2.internal Draining node to update config.
   Sep 23 07:35:00.728 W node/ip-10-0-142-211.ec2.internal Node ip-10-0-142-211.ec2.internal has been rebooted, boot id: 20996f4a-ec9f-442d-8d1e-d6f4a71df06e
   Sep 23 07:35:33.733 I node/ip-10-0-147-211.ec2.internal Draining node to update config.

3. Thing I wish I had: kube-apiserver healthz. I'll see about updating the e2e monitor; a rough sketch of such a poll is below.
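
A rough sketch of that kind of poll (purely illustrative; the endpoint is a placeholder, and the real monitor would drive this off its existing rest.Config and record samples alongside the event stream):

    package main

    import (
        "crypto/tls"
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Placeholder endpoint; substitute the cluster's real API URL.
        endpoint := "https://api.example-cluster.devcluster.example.com:6443/healthz"
        client := &http.Client{
            Timeout: 3 * time.Second,
            // The apiserver serving cert usually isn't in the local trust store;
            // skipping verification is acceptable for a monitoring probe like this.
            Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
        }
        for range time.Tick(5 * time.Second) {
            resp, err := client.Get(endpoint)
            now := time.Now().Format(time.RFC3339)
            if err != nil {
                fmt.Printf("%s healthz unreachable: %v\n", now, err)
                continue
            }
            fmt.Printf("%s healthz %d\n", now, resp.StatusCode)
            resp.Body.Close()
        }
    }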


Discounted theories:
1. etcd quorum destroyed: false. If this were the case, we wouldn't be able to watch events from the e2e monitor.


Unexplained:


Timeline:

0. Something triggers an upgrade.
1. Sep 23 07:24:07.047: kube-apiserver finishes "upgrade" to revision 8.

2. MCO starts rebooting machines:
   Sep 23 07:34:46.543 W node/ip-10-0-143-11.ec2.internal Node ip-10-0-143-11.ec2.internal has been rebooted, boot id: 299f6b83-a581-4ff7-b6a4-9d821153b4c0
   Sep 23 07:35:00.728 W node/ip-10-0-142-211.ec2.internal Node ip-10-0-142-211.ec2.internal has been rebooted, boot id: 20996f4a-ec9f-442d-8d1e-d6f4a71df06e


3. MCO issued the reboot, and now we see workloads being killed, including etcd:
   Sep 23 07:35:08.269 W ns/openshift-etcd pod/etcd-member-ip-10-0-142-211.ec2.internal node/ip-10-0-142-211.ec2.internal graceful deletion within 0s
   Sep 23 07:35:08.278 W ns/openshift-etcd pod/etcd-member-ip-10-0-142-211.ec2.internal node/ip-10-0-142-211.ec2.internal deleted

4. Cool kubelet bug happens again:
   Sep 23 07:35:27.477 W ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-142-211.ec2.internal node/ip-10-0-142-211.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID

5. We start draining node 43 *before* etcd is back up. Looks like that kubelet reboot event isn't reliable:
   Sep 23 07:35:47.754 I node/ip-10-0-141-43.ec2.internal Draining node to update config.


6. The next mention of etcd is it being stopped on node 43 *before* node 211 is back up:
   Sep 23 07:35:55.522 I ns/openshift-etcd pod/etcd-member-ip-10-0-141-43.ec2.internal Stopping container etcd-metrics
   Sep 23 07:35:55.534 I ns/openshift-etcd pod/etcd-member-ip-10-0-141-43.ec2.internal Stopping container etcd-member

7. All kube-apiservers start reporting 500 status. This is probably because they cannot contact etcd (a quick check for this is sketched below):
   Sep 23 07:42:57.295 W ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-141-43.ec2.internal Readiness probe failed: HTTP probe failed with statuscode: 500 (2 times)
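
A quick way to confirm step 7 really is the etcd check: /healthz?verbose lists each sub-check, so a failing "[-]etcd" line would show the 500s come from the apiserver losing etcd rather than from the apiserver itself. Sketch only; the host is a placeholder, and this assumes /healthz is reachable anonymously as on a default install (otherwise add a bearer token):

    package main

    import (
        "crypto/tls"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        client := &http.Client{
            // Skip cert verification for a one-off diagnostic against the API endpoint.
            Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
        }
        // Placeholder host; use the failing cluster's API URL.
        resp, err := client.Get("https://api.example-cluster.devcluster.example.com:6443/healthz?verbose")
        if err != nil {
            fmt.Println("healthz unreachable:", err)
            return
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        // Expect lines like "[+]ping ok" / "[-]etcd failed: ..." in the body.
        fmt.Printf("status %d\n%s", resp.StatusCode, body)
    }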

Comment 12 W. Trevor King 2019-09-23 23:30:16 UTC
Is this a dup of bug 1754133?

Comment 14 Neelesh Agrawal 2019-09-24 16:51:33 UTC
*** Bug 1755041 has been marked as a duplicate of this bug. ***

Comment 15 Kirsten Garrison 2019-09-24 17:06:45 UTC
Verified this was a CI issue yesterday; the machine content was out of date.

