Bug 1755164
Summary: | [buildcop] e2e-aws-upgrade-rollback-4.1 failure w/healthcheck failure on packageserver | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Robert Krawitz <rkrawitz>
Component: | Node | Assignee: | Seth Jennings <sjenning>
Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.1.z | CC: | aos-bugs, jokerman, kgarriso, rphillips
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-09-25 20:50:35 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Robert Krawitz
2019-09-24 22:19:52 UTC

It looks like node ip-10-0-145-66.ec2.internal didn't reboot: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1/255/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-46dh6_machine-config-daemon.log

The MCD log ends at:

    I0924 21:28:38.948985 5660 update.go:89] pod "openshift-service-catalog-apiserver-operator-7885d585b6-th2n6" removed (evicted)
    I0924 21:28:39.148458 5660 update.go:89] pod "cluster-monitoring-operator-5ffff6c8-rlwpc" removed (evicted)

Looking through masters.journal I see, as expected:

    Sep 24 21:28:18 ip-10-0-145-66 root[94841]: machine-config-daemon[5660]: Update prepared; beginning drain

and pods start to get evicted. However, there seem to be some issues with the packageserver the whole time:

    Sep 24 21:28:41 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:41.732405 1024 prober.go:111] Readiness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    Sep 24 21:28:41 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:41.732474 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    Sep 24 21:28:45 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:45.356177 1024 kubelet.go:1995] SyncLoop (container unhealthy): "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098)"
    Sep 24 21:28:45 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:45.356282 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    ...
    Sep 24 21:28:25 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:25.356022 1024 prober.go:111] Liveness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    ...
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.443937 1024 kubelet_pods.go:898] Pod "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098)" is terminated, but some containers are still running
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.613445 1024 prober.go:118] Liveness probe for "kube-controller-manager-ip-10-0-145-66.ec2.internal_openshift-kube-controller-manager(56ac7561ff1ff5c8d3fb7c947b180640):kube-controller-manager-4" succeeded
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.732350 1024 prober.go:111] Readiness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.732532 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    ...

This continues until the end of the journal, and the pod never seems to get completely deleted. That might be why the node never logs something like "systemd[1]: Starting Reboot...". I'm not seeing any MCO errors, and looking through masters.journal this is the only visible error I see. For the nodes that updated successfully, packageserver was eventually deleted completely and the node then moved on to drain complete/reboot.

Passing this to the node team, PTAL.
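For background on why a pod can linger like this during a drain: the kubelet evicts pods by sending each container SIGTERM and only force-kills them once the pod's termination grace period expires, so a container that does not react to SIGTERM hangs around until that deadline. Below is a generic Go sketch of the graceful-shutdown pattern a server container would use; the :5443 /healthz endpoint is borrowed from the log lines above, and everything else is illustrative rather than packageserver's actual code.

    // Illustrative sketch only -- not packageserver's code. It shows the pattern a
    // server container needs so that a drain/eviction (SIGTERM) makes it exit
    // promptly instead of lingering until the kubelet's final SIGKILL.
    package main

    import (
    	"context"
    	"net/http"
    	"os"
    	"os/signal"
    	"syscall"
    	"time"
    )

    func main() {
    	mux := http.NewServeMux()
    	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    		w.WriteHeader(http.StatusOK)
    	})
    	srv := &http.Server{Addr: ":5443", Handler: mux}

    	// Exit the wait below as soon as SIGTERM (pod eviction) arrives.
    	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
    	defer stop()

    	go func() {
    		// Plain HTTP to keep the sketch short; the real endpoint is HTTPS.
    		_ = srv.ListenAndServe()
    	}()

    	<-ctx.Done() // SIGTERM received, e.g. during a node drain

    	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    	defer cancel()
    	_ = srv.Shutdown(shutdownCtx) // finish in-flight requests, then return
    }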
I agree with Kirsten. I do not think the node is rebooting, likely because the MCD is still stuck in its drain loop. I'm pretty sure we need this [1] patch for 4.1 to change the KILL time to about 2 minutes. The kubelet sends a SIGTERM to signal a pod to terminate; if a pod does not handle SIGTERM correctly, it will then wait 10 minutes (the old behavior). The patch changes this time to 2 minutes, which is the default.

1. https://github.com/openshift/machine-config-operator/pull/1121
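As background on the kind of knob that patch adjusts (not a claim about its exact diff): the upstream drain helper lets the caller bound how long an eviction that never completes can hold up a node drain. A minimal sketch, assuming a recent k8s.io/kubectl/pkg/drain; field names shift slightly between Kubernetes releases, and this is not the MCD's actual drain code.

    // Illustrative sketch only -- shows bounding a node drain so a pod that never
    // finishes terminating cannot block the drain (and the reboot) indefinitely.
    package drainexample

    import (
    	"context"
    	"fmt"
    	"os"
    	"time"

    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/kubectl/pkg/drain"
    )

    func drainNode(cfg *rest.Config, nodeName string) error {
    	client, err := kubernetes.NewForConfig(cfg)
    	if err != nil {
    		return err
    	}

    	helper := &drain.Helper{
    		Ctx:                 context.Background(), // newer releases of the package take a context
    		Client:              client,
    		Force:               true,
    		IgnoreAllDaemonSets: true,
    		GracePeriodSeconds:  -1,              // honor each pod's terminationGracePeriodSeconds
    		Timeout:             2 * time.Minute, // stop waiting on a stuck pod after ~2 minutes
    		Out:                 os.Stdout,
    		ErrOut:              os.Stderr,
    	}

    	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
    		return fmt.Errorf("drain of node %s did not complete: %w", nodeName, err)
    	}
    	return nil
    }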
*** This bug has been marked as a duplicate of bug 1753293 ***