Bug 1755164
Summary: | [buildcop] e2e-aws-upgrade-rollback-4.1 failure w/healthcheck failure on packageserver | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Robert Krawitz <rkrawitz>
Component: | Node | Assignee: | Seth Jennings <sjenning>
Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.1.z | CC: | aos-bugs, jokerman, kgarriso, rphillips
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-09-25 20:50:35 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Robert Krawitz
2019-09-24 22:19:52 UTC

It looks like node ip-10-0-145-66.ec2.internal didn't reboot: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1/255/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-46dh6_machine-config-daemon.log

The MCD log ends at:

    I0924 21:28:38.948985 5660 update.go:89] pod "openshift-service-catalog-apiserver-operator-7885d585b6-th2n6" removed (evicted)
    I0924 21:28:39.148458 5660 update.go:89] pod "cluster-monitoring-operator-5ffff6c8-rlwpc" removed (evicted)

Looking through masters.journal I see, as expected:

    Sep 24 21:28:18 ip-10-0-145-66 root[94841]: machine-config-daemon[5660]: Update prepared; beginning drain

and pods start to get evicted. However, there seem to be some issues with the packageserver the whole time:

    Sep 24 21:28:41 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:41.732405 1024 prober.go:111] Readiness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    Sep 24 21:28:41 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:41.732474 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    Sep 24 21:28:45 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:45.356177 1024 kubelet.go:1995] SyncLoop (container unhealthy): "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098)"
    Sep 24 21:28:45 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:45.356282 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    ...
    Sep 24 21:28:25 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:25.356022 1024 prober.go:111] Liveness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    ...
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.443937 1024 kubelet_pods.go:898] Pod "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098)" is terminated, but some containers are still running
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.613445 1024 prober.go:118] Liveness probe for "kube-controller-manager-ip-10-0-145-66.ec2.internal_openshift-kube-controller-manager(56ac7561ff1ff5c8d3fb7c947b180640):kube-controller-manager-4" succeeded
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.732350 1024 prober.go:111] Readiness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.732532 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
    ...

This continues until the end of the journal, and the pod never seems to get completely deleted. That might be why the node never logs something like "systemd[1]: Starting Reboot...". I'm not seeing any MCO errors, and looking through masters.journal this is the only visible error I see. For the nodes that updated successfully, packageserver was eventually deleted completely and the node then moved on to drain complete/reboot.

Passing this to the node team, PTAL.
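For background on why a pod can linger like this during a drain: the kubelet evicts pods by sending each container SIGTERM and only force-kills them once the pod's termination grace period expires, so a container that does not react to SIGTERM hangs around until that deadline. Below is a generic Go sketch of the graceful-shutdown pattern a server container would use; the :5443 /healthz endpoint is borrowed from the log lines above, and everything else is illustrative rather than packageserver's actual code.

    // Illustrative sketch only -- not packageserver's code. It shows the pattern a
    // server container needs so that a drain/eviction (SIGTERM) makes it exit
    // promptly instead of lingering until the kubelet's final SIGKILL.
    package main

    import (
    	"context"
    	"net/http"
    	"os"
    	"os/signal"
    	"syscall"
    	"time"
    )

    func main() {
    	mux := http.NewServeMux()
    	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    		w.WriteHeader(http.StatusOK)
    	})
    	srv := &http.Server{Addr: ":5443", Handler: mux}

    	// Exit the wait below as soon as SIGTERM (pod eviction) arrives.
    	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
    	defer stop()

    	go func() {
    		// Plain HTTP to keep the sketch short; the real endpoint is HTTPS.
    		_ = srv.ListenAndServe()
    	}()

    	<-ctx.Done() // SIGTERM received, e.g. during a node drain

    	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    	defer cancel()
    	_ = srv.Shutdown(shutdownCtx) // finish in-flight requests, then return
    }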
I agree with Kirsten. I do not think the node is rebooting, likely because the MCD is still stuck in its drain loop. I'm pretty sure we need this [1] patch for 4.1 to change the KILL time to about 2 minutes. The kubelet sends a SIGTERM to signal a pod to terminate; if a pod does not handle SIGTERM correctly, it will then wait 10 minutes (the old behavior). The patch changes this time to 2 minutes, which is the default.

1. https://github.com/openshift/machine-config-operator/pull/1121
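As background on the kind of knob that patch adjusts (not a claim about its exact diff): the upstream drain helper lets the caller bound how long an eviction that never completes can hold up a node drain. A minimal sketch, assuming a recent k8s.io/kubectl/pkg/drain; field names shift slightly between Kubernetes releases, and this is not the MCD's actual drain code.

    // Illustrative sketch only -- shows bounding a node drain so a pod that never
    // finishes terminating cannot block the drain (and the reboot) indefinitely.
    package drainexample

    import (
    	"context"
    	"fmt"
    	"os"
    	"time"

    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/kubectl/pkg/drain"
    )

    func drainNode(cfg *rest.Config, nodeName string) error {
    	client, err := kubernetes.NewForConfig(cfg)
    	if err != nil {
    		return err
    	}

    	helper := &drain.Helper{
    		Ctx:                 context.Background(), // newer releases of the package take a context
    		Client:              client,
    		Force:               true,
    		IgnoreAllDaemonSets: true,
    		GracePeriodSeconds:  -1,              // honor each pod's terminationGracePeriodSeconds
    		Timeout:             2 * time.Minute, // stop waiting on a stuck pod after ~2 minutes
    		Out:                 os.Stdout,
    		ErrOut:              os.Stderr,
    	}

    	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
    		return fmt.Errorf("drain of node %s did not complete: %w", nodeName, err)
    	}
    	return nil
    }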
*** This bug has been marked as a duplicate of bug 1753293 ***