Bug 1819943 - openstack cluster lost all nodes during e2e run
Summary: openstack cluster lost all nodes during e2e run
Keywords:
Status: CLOSED DUPLICATE of bug 1817568
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-01 21:35 UTC by Ben Parees
Modified: 2020-04-02 17:07 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-02 17:07:42 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2020-04-01 21:35:15 UTC
Description of problem:
This cluster appears to have self-destructed catastrophically during the e2e run:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1335

Among other things, there appear to be no available nodes:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1335/artifacts/e2e-openstack/nodes.json
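
For anyone repeating this triage, a minimal sketch of scanning that artifact for node readiness, assuming it is a standard Kubernetes NodeList as produced by `oc get nodes -o json` (the local filename here is illustrative):

    #!/usr/bin/env python3
    # Sketch: summarize node readiness from a downloaded nodes.json artifact.
    # Assumes the standard NodeList shape; "nodes.json" is an illustrative path.
    import json

    with open("nodes.json") as f:
        node_list = json.load(f)

    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        # The Ready condition carries the kubelet's last reported health.
        ready = next(
            (c for c in node["status"].get("conditions", [])
             if c["type"] == "Ready"),
            None,
        )
        status = ready["status"] if ready else "absent"
        print(f"{name}: Ready={status}")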

My theory is that this led to the variety of other errors seen (connections being reset when talking to the apiserver, failures to contact the apiserver, watches being closed).

Version-Release number of selected component (if applicable):
4.4 on openstack

Comment 1 Ben Parees 2020-04-01 22:04:58 UTC
Possibly related: this run (also on openstack) saw similar failures, and in this case all the nodes are marked as "unreachable" (maybe a networking issue?).

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1348/artifacts/e2e-openstack/nodes.json


e.g.:
               "taints": [
                    {
                        "effect": "NoSchedule",
                        "key": "node-role.kubernetes.io/master"
                    },
                    {
                        "effect": "NoSchedule",
                        "key": "node.kubernetes.io/unreachable",
                        "timeAdded": "2020-03-28T15:53:22Z"
                    }
                ]
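
A minimal sketch of pulling out exactly the nodes carrying that taint, under the same NodeList assumption (filename illustrative):

    # Sketch: list nodes carrying the unreachable taint and when it was added.
    import json

    with open("nodes.json") as f:
        node_list = json.load(f)

    for node in node_list.get("items", []):
        for taint in node["spec"].get("taints", []):
            if taint["key"] == "node.kubernetes.io/unreachable":
                print(node["metadata"]["name"],
                      taint.get("timeAdded", "unknown"))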

Comment 2 Ben Parees 2020-04-01 22:06:54 UTC
One more incident:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1330

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1330/artifacts/e2e-openstack/nodes.json

Again, the nodes were marked unreachable.


Raising severity to urgent as this seems to represent a fundamental stability problem for clusters on openstack.

Comment 4 Ryan Phillips 2020-04-02 17:07:42 UTC
This is a problem with exec liveness probes within conmon. We have a fix and are getting it backported into the tree.
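
For context, a generic sketch of the kind of container spec that exercises that path (illustrative names only, not the actual workload from these runs):

    # Sketch: a container spec with an exec liveness probe. Exec probes run a
    # command inside the container via the runtime; comment 4 identifies the
    # conmon side of that path as where the bug sat.
    import json

    container = {
        "name": "example",                # hypothetical container name
        "image": "registry.example/app",  # hypothetical image
        "livenessProbe": {
            "exec": {"command": ["/bin/sh", "-c", "true"]},
            "periodSeconds": 10,
            "timeoutSeconds": 1,
        },
    }
    print(json.dumps(container, indent=2))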

Severity of 1817568 has been raised to Urgent.

*** This bug has been marked as a duplicate of bug 1817568 ***

