This doesn't seem to happen every time, but in a local MCO upgrade the downloads pod did not respond to the MCD's eviction request and was only removed after the 600s timeout.

https://coreos.slack.com/archives/CEKNRGF25/p1566853880227400
https://bugzilla.redhat.com/show_bug.cgi?id=1729979

One suggestion is to change the container command to add `exec`, so the server runs as PID 1 and receives the termination signal directly.
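As a minimal illustration of why that helps (a sketch, not the actual downloads manifest):

```
# Without exec, the wrapper shell stays as PID 1 and may not forward SIGTERM
# to the server, so the kubelet waits out the full termination grace period
# before sending SIGKILL:
/bin/sh -c 'python2 /tmp/serve.py'

# With exec, the shell replaces itself with the server, which then runs as
# PID 1 and receives SIGTERM directly when the pod is evicted:
/bin/sh -c 'exec python2 /tmp/serve.py'
```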
*** Bug 1746154 has been marked as a duplicate of this bug. ***
So this is still happening: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/439

From one of the MCD logs in the job above: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/439/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-daemon-vh9c7_machine-config-daemon_previous.log

I0828 13:00:45.145248   10307 update.go:89] pod "router-default-69f9bd7c5f-cdpl7" removed (evicted)
I0828 13:10:51.146813   10307 update.go:89] pod "downloads-cc47b785c-whz2k" removed (evicted)
I0828 13:10:51.146853   10307 update.go:89] drained node "ip-10-0-130-216.ec2.internal"

The downloads pod still takes 600s to be evicted, which suggests something is still wrong with how the pod process handles signals. Apparently the `exec` fix alone is not enough. Reopening.
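One hedged way to check this on a live cluster (the pod name is taken from the log above and is only illustrative) is to look at what is actually running as PID 1 inside the downloads container:

```
# Print the PID 1 command line inside the downloads container; if this shows a
# shell rather than the python server, SIGTERM is likely never reaching the
# server and eviction falls back to the grace-period timeout.
oc -n openshift-console exec downloads-cc47b785c-whz2k -- cat /proc/1/cmdline | tr '\0' ' '; echo
```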
$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-01-224700 --commits | grep console
  console            https://github.com/openshift/console            821d11c1a1cef7ea0e568ed8f2301c546a47d90a
  console-operator   https://github.com/openshift/console-operator   320d99ef571b200ac8528206086572bbf9e094fa
$ cd /path/to/console-operator/repo; git pull
[yapei@New_Mac console-operator]$ git log 320d99ef571b200ac8528206086572bbf9e094fa | grep '#282'
// PR is not found

Payload 4.2.0-0.nightly-2019-09-01-224700 doesn't include the latest fix; will verify when there's a new payload.
console-operator PR '#282' is picked up in 4.2.0-0.nightly-2019-09-02-172410, and I did not find downloads pods being evicted in the 'machine-config-operator_machine-config-daemon_*' logs. Will continue to monitor the next few builds before verifying the bug.
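For reference, a sketch of how the MCD logs can be scanned for downloads-pod evictions over the next few builds (the `k8s-app=machine-config-daemon` label selector and container name are assumptions about the MCO daemonset):

```
# Grep each machine-config-daemon pod's log for downloads evictions and the
# final "drained node" line, then compare timestamps to spot ~600s gaps.
for p in $(oc -n openshift-machine-config-operator get pods \
    -l k8s-app=machine-config-daemon -o name); do
  echo "== $p"
  oc -n openshift-machine-config-operator logs "$p" -c machine-config-daemon \
    | grep -E 'downloads.*removed \(evicted\)|drained node'
done
```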
Not reproduced on 4.2.0-0.nightly-2019-09-03-062724 or 4.2.0-0.nightly-2019-09-03-102130. Moving to VERIFIED.

$ oc adm release extract --from=registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-02-172410 --to=payload-4.2.0-0.nightly-2019-09-02-172410
[yapei@dhcp-141-192 payload-4.2.0-0.nightly-2019-09-02-172410]$ tail -2 0000_50_console-operator_07-downloads-deployment.yaml
            exec python2 /tmp/serve.py # the cli image only has Python 2.7
      terminationGracePeriodSeconds: 1
It seems to be happening again:

$ oc version
Client Version: openshift-clients-4.2.0-201909081401
Server Version: 4.2.0-0.nightly-2019-09-10-014843
Kubernetes Version: v1.14.6+eded62f

https://github.com/openshift/console/issues/2656
Hmm, I'm having trouble reproducing this. [1] has:

  Termination Grace Period:  600s

but:

$ oc adm release extract --to manifests registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-10-014843
Extracted release payload from digest sha256:dfc1a8724aeeabac4536cf9dd93113778d32d13d18106918e54c5841d58de5d2 created at 2019-09-10T01:52:14Z
$ tail -n1 manifests/0000_50_console-operator_07-downloads-deployment.yaml
      terminationGracePeriodSeconds: 1

and CI for that same nightly shows that it is indeed propagated to the pods [2]. Eduardo, can you walk back up the chain and confirm that your Deployment and CVO manifest have the 'terminationGracePeriodSeconds: 1' we expect? I want to see where it's getting lost in your cluster.

[1]: https://github.com/openshift/console/issues/2656#issuecomment-529881664
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/487/artifacts/e2e-aws/must-gather/namespaces/openshift-console/pods/downloads-5cdcbb69d4-87cz7/downloads-5cdcbb69d4-87cz7.yaml
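A consolidated sketch of that walk back up the chain, reusing the commands from this bug (the jsonpath queries are one way to read the value off the live objects):

```
# 1. What the release payload ships:
oc adm release extract --to manifests \
  registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-10-014843
grep terminationGracePeriodSeconds manifests/0000_50_console-operator_07-downloads-deployment.yaml

# 2. What the downloads Deployment in the cluster actually has:
oc -n openshift-console get deployment downloads \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'

# 3. What the running pods inherited:
oc -n openshift-console get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.terminationGracePeriodSeconds}{"\n"}{end}'
```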
This is really weird, as it now says 30 seconds...

```
$ oc get deploy console -o yaml -n openshift-console | grep -i termination
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      terminationGracePeriodSeconds: 30
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-10-014843   True        False         23h     Cluster version is 4.2.0-0.nightly-2019-09-10-014843
```

oc get deploy console -o yaml -> http://pastebin.test.redhat.com/796223
oc logs console-operator-58c4f4b45b-fx2db -> http://pastebin.test.redhat.com/796226
*** Bug 1751275 has been marked as a duplicate of this bug. ***
> $ oc get deploy console -o yaml -n openshift-console| grep -i termination

That's the console deployment, not the downloads deployment. Poking around in Eduardo's cluster:

$ oc -n openshift-console get -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}' deployment downloads
1
$ oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.terminationGracePeriodSeconds}{"\n"}{end}' pods
console-6547578b9f-5kks6	30
console-6547578b9f-z49ff	30
downloads-5cdcbb69d4-gzh86	1
downloads-5cdcbb69d4-q9dvd	1

So that all looks fine. Testing a delete:

$ time oc -n openshift-console delete pod downloads-5cdcbb69d4-gzh86
pod "downloads-5cdcbb69d4-gzh86" deleted

real	0m11.045s
user	0m0.186s
sys	0m0.054s
$ oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.terminationGracePeriodSeconds}{"\n"}{end}' pods
console-6547578b9f-5kks6	30
console-6547578b9f-z49ff	30
downloads-5cdcbb69d4-q9dvd	1
downloads-5cdcbb69d4-s7jpp	1

11 seconds is not super-fast, but it's not 10m either. And you can see the Deployment went ahead and created a replacement Pod to get replicas back up to the requested 2. Can you point me at the specific script you're running (more specific than the repo link in [1])?

[1]: https://github.com/openshift/console/issues/2656#issue-491594608
(In reply to W. Trevor King from comment #13)
> > $ oc get deploy console -o yaml -n openshift-console| grep -i termination
>
> That's the console deployment, not the downloads deployment.

My bad, sorry.

> 11 seconds is not super-fast, but it's not 10m either. And you can see the
> Deployment went ahead and created a replacement Pod to get replicas back up
> to the requested 2. Can you point me at the specific script you're running
> (more specific than the repo link in [1])?
>
> [1]: https://github.com/openshift/console/issues/2656#issue-491594608

It happened while deploying an OCP4 cluster using https://github.com/openshift-kni/install-scripts. The environment is 3 x baremetal masters.

To be 100% accurate, the issue happened when adding a machine config, i.e. running this script, which applies some machine configs and then waits for the reboot to happen: https://github.com/openshift-kni/install-scripts/blob/master/OpenShift/99_post_install.sh. In a different terminal I was running `oc get pods -A | grep -v -E 'Running|Complete'`, and I observed the downloads pod stuck in 'Terminating' for too long.
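For reference, the second-terminal check mentioned above, wrapped in `watch` (purely illustrative):

```
# Re-run the pod listing every 10s and show only pods that are not Running or
# Completed, to catch a downloads pod stuck in Terminating during the rollout.
watch -n 10 "oc get pods -A | grep -v -E 'Running|Complete'"
```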
This is still an issue today and greatly increases rollout/upgrade time:

I0917 07:40:56.930178    4695 update.go:89] pod "network-operator-fd8dbc8db-qbptc" removed (evicted)
I0917 07:50:53.895129    4695 update.go:89] pod "downloads-86876867bf-wbxbk" removed (evicted)

Tested with the very latest code, 4.2.0-0.okd-2019-09-16-050907.
(In reply to Antonio Murdaca from comment #15)
> This is still an issue today and greatly increases rollout/upgrade time:
>
> I0917 07:40:56.930178    4695 update.go:89] pod "network-operator-fd8dbc8db-qbptc" removed (evicted)
> I0917 07:50:53.895129    4695 update.go:89] pod "downloads-86876867bf-wbxbk" removed (evicted)
>
> Tested with very latest code 4.2.0-0.okd-2019-09-16-050907

This was AWS IPI, so it is highly reproducible today. For testing, the way to verify this would be to create a MachineConfig and check that the "downloads" pod on the nodes isn't taking 10m to be evicted. Either that, or `kubectl drain` should reproduce and verify it as well.
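A sketch of that drain-based check (node name is a placeholder, and `oc adm drain` is used as the equivalent of the `kubectl drain` mentioned above; with the fix in place the downloads pod should be evicted in seconds rather than ~10m):

```
# Drain a node hosting a downloads pod and time how long the drain takes,
# then put the node back into service.
NODE=<node-running-a-downloads-pod>   # placeholder, pick a real node name
time oc adm drain "$NODE" --ignore-daemonsets --delete-local-data --force
oc adm uncordon "$NODE"
```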
Seems like this is more likely to occur when draining masters - not fully validated but that's what I'm actually experiencing.
Is there an explicit reason for moving this to 4.2? This adds a 10m slowdown to master rollouts.
We can't ship a 5x regression in master upgrade times.
Manually associating with https://github.com/openshift/machine-config-operator/pull/1154, because we may not want to wait for the 4.3 bug to be verified in order to keep the Bugzilla bot happy.
Verified with 4.2.0-0.nightly-2019-10-04-222458

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-10-04-222458   True        False         18s     Cluster version is 4.2.0-0.nightly-2019-10-04-222458
$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://origin-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched
$ oc adm upgrade --force --to-image registry.svc.ci.openshift.org/origin/release:4.2.0-0.okd-2019-10-04-225223
Updating to release image registry.svc.ci.openshift.org/origin/release:4.2.0-0.okd-2019-10-04-225223
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-10-04-222458   True        True          20m     Working towards 4.2.0-0.okd-2019-10-04-225223: 88% complete, waiting on machine-config
$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.okd-2019-10-04-225223   True        False         16m     Cluster version is 4.2.0-0.okd-2019-10-04-225223
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-130-98.us-west-2.compute.internal    Ready    master   68m   v1.14.6+463c73f1f
ip-10-0-141-66.us-west-2.compute.internal    Ready    worker   61m   v1.14.6+463c73f1f
ip-10-0-146-142.us-west-2.compute.internal   Ready    worker   61m   v1.14.6+463c73f1f
ip-10-0-146-64.us-west-2.compute.internal    Ready    master   68m   v1.14.6+463c73f1f
ip-10-0-162-250.us-west-2.compute.internal   Ready    master   68m   v1.14.6+463c73f1f
ip-10-0-172-152.us-west-2.compute.internal   Ready    worker   61m   v1.14.6+463c73f1f

### Example master node
$ oc debug node/ip-10-0-130-98.us-west-2.compute.internal
Starting pod/ip-10-0-130-98us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.130.98
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b | grep update.go
<snip>
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:36.096056   31045 update.go:89] pod "network-operator-5dfb8d7cf6-n7bpz" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:36.696492   31045 update.go:89] pod "catalog-operator-64f798855d-4hkxq" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:37.895699   31045 update.go:89] pod "openshift-apiserver-operator-6f9cf8984d-5x479" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:38.295692   31045 update.go:89] pod "openshift-kube-scheduler-operator-f5c6665d5-5cv54" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:38.497816   31045 update.go:89] pod "etcd-quorum-guard-6c6566fd76-fckw2" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:38.895478   31045 update.go:89] pod "machine-config-operator-76cc587f6f-v4g85" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:39.695666   31045 update.go:89] pod "insights-operator-577689f998-sqnf9" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:40.895272   31045 update.go:89] pod "cluster-storage-operator-864ffd6864-6ssfd" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:42.095654   31045 update.go:89] pod "kube-apiserver-operator-794bc546c5-qrdbk" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:42.495786   31045 update.go:89] pod "openshift-controller-manager-operator-677c9c984-5gzgq" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:43.095483   31045 update.go:89] pod "cluster-image-registry-operator-5db8d685b4-p6sj5" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:43.295672   31045 update.go:89] pod "kube-controller-manager-operator-7fccc5f577-8qdfk" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:44.295627   31045 update.go:89] pod "dns-operator-c9b6c84-fz42q" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:44.695804   31045 update.go:89] pod "ingress-operator-99f7b6888-npd29" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:45.299833   31045 update.go:89] pod "downloads-5bdfb4c6fc-7b54b" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:45.299870   31045 update.go:89] drained node "ip-10-0-130-98.us-west-2.compute.internal"
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:45.299879   31045 update.go:984] drain complete

### ~10s for drain complete; ~1s per pod eviction

### Example worker node
$ oc debug node/ip-10-0-141-66.us-west-2.compute.internal
Starting pod/ip-10-0-141-66us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.141.66
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b | grep update.go
<snip>
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:35.558167   27162 update.go:984] Update prepared; beginning drain
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:35.572732   27162 update.go:89] cordoned node "ip-10-0-141-66.us-west-2.compute.internal"
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:35.657860   27162 update.go:93] ignoring DaemonSet-managed pods: tuned-cv8r4, dns-default-4fqrq, node-ca-5vfd2, machine-config-daemon-h7hcl, node-exporter-45tmz, multus-7w466, ovs-dfd8k, sdn-rl7tl; deleting pods with local storage: alertmanager-main-0, alertmanager-main-2, kube-state-metrics-5b957b4d97-pdr6q, prometheus-adapter-7f7dbfb579-qn8lf, prometheus-k8s-0
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:42.854994   27162 update.go:89] pod "community-operators-68676d7f98-zkk6w" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:43.051225   27162 update.go:89] pod "router-default-6858dccf87-wlxzb" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:43.455441   27162 update.go:89] pod "image-registry-65cdcfd89b-qrnzj" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:44.051305   27162 update.go:89] pod "redhat-operators-bdc56bfc-86wrg" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:45.052461   27162 update.go:89] pod "prometheus-adapter-7f7dbfb579-qn8lf" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:46.252416   27162 update.go:89] pod "alertmanager-main-2" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:46.651470   27162 update.go:89] pod "kube-state-metrics-5b957b4d97-pdr6q" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:47.451456   27162 update.go:89] pod "openshift-state-metrics-5949866f78-h2hzm" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.052858   27162 update.go:89] pod "prometheus-k8s-0" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.252552   27162 update.go:89] pod "alertmanager-main-0" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.841844   27162 update.go:89] pod "certified-operators-659776856-6m26h" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.841877   27162 update.go:89] drained node "ip-10-0-141-66.us-west-2.compute.internal"
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.841884   27162 update.go:984] drain complete

### ~13s for drain complete, ~1s per pod eviction
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922