Reference: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1/255

Sep 24 21:31:11.747 W ns/openshift-operator-lifecycle-manager pod/packageserver-578df64994-6rgzk Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused (17 times)
Sep 24 21:31:21.746 W ns/openshift-operator-lifecycle-manager pod/packageserver-578df64994-6rgzk Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused (18 times)
Sep 24 21:31:31.747 W ns/openshift-operator-lifecycle-manager pod/packageserver-578df64994-6rgzk Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused (19 times)
Sep 24 21:31:41.747 W ns/openshift-operator-lifecycle-manager pod/packageserver-578df64994-6rgzk Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused (20 times)
Sep 24 21:31:51.747 W ns/openshift-operator-lifecycle-manager pod/packageserver-578df64994-6rgzk Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused (21 times)
Sep 24 21:33:28.387 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-72znt 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector. (36 times)
Sep 24 21:34:09.763 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-0.nightly-2019-09-23-204738 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 2, updated: 2, unavailable: 1)
Sep 24 21:35:09.117 I test="[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]" failed

Failing tests:
[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]
It looks like node ip-10-0-145-66.ec2.internal didn't reboot: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1/255/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-46dh6_machine-config-daemon.log

The MCD log ends at:

I0924 21:28:38.948985 5660 update.go:89] pod "openshift-service-catalog-apiserver-operator-7885d585b6-th2n6" removed (evicted)
I0924 21:28:39.148458 5660 update.go:89] pod "cluster-monitoring-operator-5ffff6c8-rlwpc" removed (evicted)

Looking through masters.journal I see, correctly:

Sep 24 21:28:18 ip-10-0-145-66 root[94841]: machine-config-daemon[5660]: Update prepared; beginning drain

and pods start to get evicted. However, there seem to be issues with the packageserver the whole time:

Sep 24 21:28:41 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:41.732405 1024 prober.go:111] Readiness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
Sep 24 21:28:41 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:41.732474 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
Sep 24 21:28:45 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:45.356177 1024 kubelet.go:1995] SyncLoop (container unhealthy): "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098)"
Sep 24 21:28:45 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:45.356282 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
...
Sep 24 21:28:25 ip-10-0-145-66 hyperkube[1024]: I0924 21:28:25.356022 1024 prober.go:111] Liveness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
...
Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.443937 1024 kubelet_pods.go:898] Pod "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098)" is terminated, but some containers are still running
Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.613445 1024 prober.go:118] Liveness probe for "kube-controller-manager-ip-10-0-145-66.ec2.internal_openshift-kube-controller-manager(56ac7561ff1ff5c8d3fb7c947b180640):kube-controller-manager-4" succeeded
Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.732350 1024 prober.go:111] Readiness probe for "packageserver-578df64994-6rgzk_openshift-operator-lifecycle-manager(765fe771-df10-11e9-90e2-0a0980118098):packageserver" failed (failure): Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
Sep 24 21:35:21 ip-10-0-145-66 hyperkube[1024]: I0924 21:35:21.732532 1024 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-operator-lifecycle-manager", Name:"packageserver-578df64994-6rgzk", UID:"765fe771-df10-11e9-90e2-0a0980118098", APIVersion:"v1", ResourceVersion:"27377", FieldPath:"spec.containers{packageserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.58:5443/healthz: dial tcp 10.129.0.58:5443: connect: connection refused
...

This continues until the end of the log, and the pod never seems to get completely deleted. That might be why the node never logs something like "systemd[1]: Starting Reboot...". I'm not seeing any MCO errors, and looking through masters.journal this is the only visible error I see. On the successfully updated nodes, packageserver was eventually completely deleted, and the MCD then moved on to drain complete/node rebooting. Passing this to the node team, PTAL.
I agree with Kirsten. I do not think the node is rebooting, likely because the MCD is still stuck in a drain loop. I'm pretty sure we need this patch [1] backported to 4.1 to change the KILL timeout to about 2 minutes. The kubelet sends SIGTERM to signal a pod to terminate; if a pod does not handle SIGTERM correctly, the MCD previously waited 10 minutes before forcing the kill. The patch changes this wait to 2 minutes, which is the default.

1. https://github.com/openshift/machine-config-operator/pull/1121
*** This bug has been marked as a duplicate of bug 1753293 ***