1818104 – [upgrade] Frontends were unreachable during disruption for at least...

Keywords:
Status:	CLOSED DUPLICATE of bug 1809665
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Dan Mace
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:	buildcop
Depends On:	1809665 1809668 1869785
Blocks:

Reported:	2020-03-27 17:19 UTC by Hongkai Liu
Modified:	2022-08-04 22:27 UTC
CC List:	6 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-08 23:36:41 UTC
Target Upstream Version:
Embargoed:

Description Hongkai Liu 2020-03-27 17:19:09 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23242#1:build-log.txt%3A11083

Failing tests:
[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]
2020/03/26 23:40:16 Container test in pod e2e-aws-upgrade failed, exit code 1, reason Error
2020/03/26 23:49:13 Copied 176.08MB of artifacts from e2e-aws-upgrade to /logs/artifacts/e2e-aws-upgrade
2020/03/26 23:49:13 Releasing lease for "aws-quota-slice"
2020/03/26 23:49:14 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/03/26 23:49:15 Ran for 1h31m34s
error: could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: the pod ci-op-7shj0sm3/e2e-aws-upgrade failed after 1h30m9s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
---
ard
Mar 26 23:38:53.271 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-9498659d4-8qb76 node/ created
Mar 26 23:38:53.281 I ns/openshift-machine-config-operator replicaset/etcd-quorum-guard-9498659d4 Created pod: etcd-quorum-guard-9498659d4-8qb76
Mar 26 23:38:53.284 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-9498659d4-8qb76 0/6 nodes are available: 3 node(s) didn't match node selector, 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't satisfy existing pods anti-affinity rules.
Mar 26 23:38:54.725 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-5c9b9b597c-49t6w node/ip-10-0-135-254.us-west-2.compute.internal deleted
Mar 26 23:38:54.802 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-9498659d4-8qb76 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-9498659d4-8qb76 to ip-10-0-135-254.us-west-2.compute.internal
Mar 26 23:38:55.362 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-9498659d4-8qb76 Container image "registry.svc.ci.openshift.org/ocp/4.3-2020-03-26-221327@sha256:9e71afa828f820ece9d26153a3ba52ea597609b4298acf57c4db20096e52b0d5" already present on machine
Mar 26 23:38:55.515 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-9498659d4-8qb76 Created container guard
Mar 26 23:38:55.545 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-9498659d4-8qb76 Started container guard
Mar 26 23:39:17.714 W clusterversion/version cluster reached 4.3.0-0.ci-2020-03-26-221327
Mar 26 23:39:17.714 W clusterversion/version changed Progressing to False: Cluster version is 4.3.0-0.ci-2020-03-26-221327
Mar 26 23:39:28.850 I ns/openshift-ingress service/router-default Updated load balancer with new hosts (3 times)
Mar 26 23:40:15.008 I test="[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]" failed
Failing tests:
[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]

Comment 1 W. Trevor King 2020-03-31 04:29:17 UTC

I think that "ard" bit is just a truncated line. The job actually failed because of:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Mar 26 23:39:42.393: Frontends were unreachable during disruption for at least 9m8s of 45m9s (20%):

which is a pretty severe outage.  This was a 4.2.26 -> 4.3.0-0.ci-2020-03-26-221327 update job.
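
The 20% in the failure message can be checked by hand: 9m8s is 548 seconds and 45m9s is 2709 seconds, so roughly 20.2% of the monitored window. Below is a minimal Python sketch of that arithmetic. The helper names `parse_go_duration` and `disruption_ratio` are hypothetical and not part of openshift/origin, and the regular expression assumes the message keeps the wording quoted above.

```python
import re

# Seconds per unit for the Go-style duration suffixes seen in the message.
_UNITS = {"h": 3600, "m": 60, "s": 1}


def parse_go_duration(text: str) -> int:
    """Convert a duration such as '9m8s' or '1h31m34s' to whole seconds.

    Only integer hour/minute/second components are handled; anything
    else raises ValueError rather than being silently misread.
    """
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def disruption_ratio(message: str) -> float:
    """Return unreachable time divided by total monitored time."""
    match = re.search(r"for at least (\S+) of (\S+) \(", message)
    if match is None:
        raise ValueError("message does not match the assumed format")
    down = parse_go_duration(match.group(1))
    total = parse_go_duration(match.group(2))
    return down / total


if __name__ == "__main__":
    msg = ("Frontends were unreachable during disruption "
           "for at least 9m8s of 45m9s (20%):")
    print(f"{disruption_ratio(msg):.1%}")
```

Run as a script, this prints the unreachable fraction for the message from this job, which makes it easier to compare against other upgrade runs that report the same kind of failure.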

Comment 2 W. Trevor King 2020-03-31 04:49:46 UTC

Might also be an SDN issue like bug 1793635.

Comment 3 Dan Mace 2020-03-31 17:46:56 UTC

This bug is just another manifestation of bug 1809665 and isn't really adding any new information, but I'll keep it open and set a Depends On for now.

Comment 4 Ben Bennett 2020-05-08 23:36:41 UTC


*** This bug has been marked as a duplicate of bug 1809665 ***