Bug 1745772 - downloads pod doesn't respond to drain
Summary: downloads pod doesn't respond to drain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard:
Duplicates: 1746154 1751275 (view as bug list)
Depends On: 1758343
Blocks: 1758345
 
Reported: 2019-08-26 21:22 UTC by Colin Walters
Modified: 2019-10-16 06:38 UTC (History)
11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1758343 1758345 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:37:35 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift/console-operator pull 275 (closed): Bug 1745772: manifests/07-downloads-deployment: Exec Python for signal handling (last updated 2020-12-18 15:14:33 UTC)
- Github openshift/console-operator pull 282 (closed): Bug 1745772: manifests/07-downloads-deployment: Set terminationGracePeriodSeconds (last updated 2020-12-18 15:14:33 UTC)
- Github openshift/machine-config-operator pull 1154 (closed): Bug 1745772: pkg/daemon: default drain grace period to -1 (last updated 2020-12-18 15:14:33 UTC)
- Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:38:03 UTC)

Description Colin Walters 2019-08-26 21:22:37 UTC
This doesn't seem to happen every time, but in a local MCO upgrade the downloads pod didn't respond to the MCD evicting it before the 600s timeout.

https://coreos.slack.com/archives/CEKNRGF25/p1566853880227400

https://bugzilla.redhat.com/show_bug.cgi?id=1729979


One suggestion is to change the downloads container command to use `exec`, so the server process runs as PID 1 and receives the termination signal directly.
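
For reference, a quick way to check whether an `exec`-based fix took effect is to look at what PID 1 is inside a running downloads pod. This is just a sketch; the pod name below is a placeholder:

```
# List the downloads pods (exact names will differ per cluster).
$ oc -n openshift-console get pods | grep downloads

# Inspect PID 1 inside the container. With `exec` in place it should be the
# Python server itself rather than a wrapping shell that swallows SIGTERM.
$ oc -n openshift-console exec downloads-<pod-suffix> -- cat /proc/1/comm
```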

Comment 3 Antonio Murdaca 2019-08-28 09:15:34 UTC
*** Bug 1746154 has been marked as a duplicate of this bug. ***

Comment 4 Antonio Murdaca 2019-08-28 16:13:23 UTC
So this is still happening: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/439

From the job above from one of the MCD logs: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/439/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-daemon-vh9c7_machine-config-daemon_previous.log

I0828 13:00:45.145248   10307 update.go:89] pod "router-default-69f9bd7c5f-cdpl7" removed (evicted)
I0828 13:10:51.146813   10307 update.go:89] pod "downloads-cc47b785c-whz2k" removed (evicted)
I0828 13:10:51.146853   10307 update.go:89] drained node "ip-10-0-130-216.ec2.internal"

It still takes 600s to be evicted, which suggests something is wrong with how signals are handled by the pod process.

Apparently, the `exec` fix still isn't enough. Reopening.
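
For anyone checking other jobs or clusters, the symptom shows up in the MCD log as a roughly 10-minute gap before the downloads pod is removed. A rough way to pull the relevant lines out of a live cluster (the pod name is a placeholder; the container name matches the log path above):

```
# Find the machine-config-daemon pod running on the node being drained.
$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon

# Compare timestamps on the eviction lines; the downloads pod is the one
# that lags ~600s behind the others.
$ oc -n openshift-machine-config-operator logs machine-config-daemon-<suffix> \
    -c machine-config-daemon | grep -E 'removed \(evicted\)|drained node'
```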

Comment 6 Yadan Pei 2019-09-02 02:14:31 UTC
$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-01-224700 --commits | grep console
  console                                       https://github.com/openshift/console                                       821d11c1a1cef7ea0e568ed8f2301c546a47d90a
  console-operator                              https://github.com/openshift/console-operator                              320d99ef571b200ac8528206086572bbf9e094fa

$ cd /path/to/console-operator/repo; git pull

[yapei@New_Mac console-operator]$ git log 320d99ef571b200ac8528206086572bbf9e094fa | grep '#282'      // PR is not found

Payload 4.2.0-0.nightly-2019-09-01-224700 doesn't include the latest fix; will verify when there's a new payload.

Comment 7 Yadan Pei 2019-09-03 01:42:41 UTC
console-operator PR #282 is picked up in 4.2.0-0.nightly-2019-09-02-172410.

Did not find the downloads pods being evicted in the 'machine-config-operator_machine-config-daemon_*' logs.

Will continue to monitor the next few builds before verifying the bug.

Comment 8 Yadan Pei 2019-09-04 01:43:23 UTC
Not reproduced on 4.2.0-0.nightly-2019-09-03-062724, 4.2.0-0.nightly-2019-09-03-102130

Moving to VERIFIED


$ oc adm release extract --from=registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-02-172410 --to=payload-4.2.0-0.nightly-2019-09-02-172410
[yapei@dhcp-141-192 payload-4.2.0-0.nightly-2019-09-02-172410]$ tail -2 0000_50_console-operator_07-downloads-deployment.yaml
          exec python2 /tmp/serve.py  # the cli image only has Python 2.7
      terminationGracePeriodSeconds: 1

Comment 9 Eduardo Minguez 2019-09-10 12:37:09 UTC
It seems to be happening again:

$ oc version
Client Version: openshift-clients-4.2.0-201909081401
Server Version: 4.2.0-0.nightly-2019-09-10-014843
Kubernetes Version: v1.14.6+eded62f

https://github.com/openshift/console/issues/2656

Comment 10 W. Trevor King 2019-09-11 04:14:47 UTC
Hmm, I'm having trouble reproducing this.  [1] has:

  Termination Grace Period:  600s

but:

  $ oc adm release extract --to manifests registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-10-014843
  Extracted release payload from digest sha256:dfc1a8724aeeabac4536cf9dd93113778d32d13d18106918e54c5841d58de5d2 created at 2019-09-10T01:52:14Z
  $ tail -n1 manifests/0000_50_console-operator_07-downloads-deployment.yaml
        terminationGracePeriodSeconds: 1

and CI for that same nightly shows that is indeed propagated to the pods [2].  Eduardo, can you walk back up the chain and confirm that your Deployment and CVO manifest have the 'terminationGracePeriodSeconds: 1' we expect?  I want to see where it's getting lost in your cluster.

[1]: https://github.com/openshift/console/issues/2656#issuecomment-529881664
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/487/artifacts/e2e-aws/must-gather/namespaces/openshift-console/pods/downloads-5cdcbb69d4-87cz7/downloads-5cdcbb69d4-87cz7.yaml
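
In case it's useful, here is one way to walk that chain from the running pods back to the payload manifest (a sketch; it assumes the cluster is running the payload reported by clusterversion):

```
# Pod-level value (what the kubelet actually honors):
$ oc -n openshift-console get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.terminationGracePeriodSeconds}{"\n"}{end}'

# Deployment template value:
$ oc -n openshift-console get deployment downloads -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'

# Value shipped in the release payload the cluster reports:
$ oc adm release extract --to manifests "$(oc get clusterversion version -o jsonpath='{.status.desired.image}')"
$ grep terminationGracePeriodSeconds manifests/0000_50_console-operator_07-downloads-deployment.yaml
```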

Comment 11 Eduardo Minguez 2019-09-11 10:16:23 UTC
This is really weird as it now says 30 seconds...

```
$ oc get deploy console -o yaml -n openshift-console| grep -i termination
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      terminationGracePeriodSeconds: 30
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-10-014843   True        False         23h     Cluster version is 4.2.0-0.nightly-2019-09-10-014843
```

oc get deploy console -o yaml -> http://pastebin.test.redhat.com/796223
oc logs console-operator-58c4f4b45b-fx2db -> http://pastebin.test.redhat.com/796226

Comment 12 Antonio Murdaca 2019-09-11 16:20:28 UTC
*** Bug 1751275 has been marked as a duplicate of this bug. ***

Comment 13 W. Trevor King 2019-09-11 18:32:04 UTC
> $ oc get deploy console -o yaml -n openshift-console| grep -i termination

That's the console deployment, not the downloads deployment.

Poking around in Eduardo's cluster:

$ oc -n openshift-console get -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}' deployment downloads
1
$ oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.terminationGracePeriodSeconds}{"\n"}{end}' pods
console-6547578b9f-5kks6	30
console-6547578b9f-z49ff	30
downloads-5cdcbb69d4-gzh86	1
downloads-5cdcbb69d4-q9dvd	1

So that all looks fine.  Testing a delete:

$ time oc -n openshift-console delete pod downloads-5cdcbb69d4-gzh86
pod "downloads-5cdcbb69d4-gzh86" deleted

real	0m11.045s
user	0m0.186s
sys	0m0.054s
$ oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.terminationGracePeriodSeconds}{"\n"}{end}' pods
console-6547578b9f-5kks6	30
console-6547578b9f-z49ff	30
downloads-5cdcbb69d4-q9dvd	1
downloads-5cdcbb69d4-s7jpp	1

11 seconds is not super-fast, but it's not 10m either.  And you can see the Deployment went ahead and created a replacement Pod to get replicas back up to the requested 2.  Can you point me at the specific script you're running (more specific than the repo link in [1])?

[1]: https://github.com/openshift/console/issues/2656#issue-491594608

Comment 14 Eduardo Minguez 2019-09-12 09:07:32 UTC
(In reply to W. Trevor King from comment #13)
> > $ oc get deploy console -o yaml -n openshift-console| grep -i termination
> 
> That's the console deployment, not the downloads deployment.
> 

My bad, sorry.

> Poking around in Eduardo's cluster:
> 
> $ oc -n openshift-console get -o
> jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'
> deployment downloads
> 1
> $ oc -n openshift-console get -o jsonpath='{range
> .items[*]}{.metadata.name}{"\t"}{.spec.
> terminationGracePeriodSeconds}{"\n"}{end}' pods
> console-6547578b9f-5kks6	30
> console-6547578b9f-z49ff	30
> downloads-5cdcbb69d4-gzh86	1
> downloads-5cdcbb69d4-q9dvd	1
> 
> So that all looks fine.  Testing a delete:
> 
> $ time oc -n openshift-console delete pod downloads-5cdcbb69d4-gzh86
> pod "downloads-5cdcbb69d4-gzh86" deleted
> 
> real	0m11.045s
> user	0m0.186s
> sys	0m0.054s
> $ oc -n openshift-console get -o jsonpath='{range
> .items[*]}{.metadata.name}{"\t"}{.spec.
> terminationGracePeriodSeconds}{"\n"}{end}' pods
> console-6547578b9f-5kks6	30
> console-6547578b9f-z49ff	30
> downloads-5cdcbb69d4-q9dvd	1
> downloads-5cdcbb69d4-s7jpp	1
> 
> 11 seconds is not super-fast, but it's not 10m either.  And you can see the
> Deployment went ahead an created a replacement Pod to get replicas back up
> to the requested 2.  Can you point me at the specific script you're running
> (more-specific than the repo link in [1])?

It happened when deploying an OCP4 cluster using https://github.com/openshift-kni/install-scripts
The environment is 3 x baremetal masters. The issue happened to me when adding a machine config (to be 100% accurate, running this script, which applies some machine-configs and then waits for the reboot to happen: https://github.com/openshift-kni/install-scripts/blob/master/OpenShift/99_post_install.sh) while running `oc get pods -A | grep -v -E 'Running|Complete'` in a different terminal; I then observed that the downloads pod stayed in 'Terminating' for too long.

> 
> [1]: https://github.com/openshift/console/issues/2656#issue-491594608
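
For what it's worth, a lightweight way to spot and measure the symptom during such a rollout (a sketch; the pod name is a placeholder):

```
# Watch for pods that stay in Terminating while the nodes reboot.
$ watch -n 10 "oc get pods -A | grep -v -E 'Running|Complete'"

# A Terminating pod carries a deletionTimestamp; compare it to the current
# time to see how long the downloads pod has been stuck.
$ oc -n openshift-console get pod downloads-<suffix> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
$ date -u
```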

Comment 15 Antonio Murdaca 2019-09-17 07:51:59 UTC
This is still an issue today and greatly increases rollout/upgrade time:

I0917 07:40:56.930178    4695 update.go:89] pod "network-operator-fd8dbc8db-qbptc" removed (evicted)
I0917 07:50:53.895129    4695 update.go:89] pod "downloads-86876867bf-wbxbk" removed (evicted)

Tested with very latest code 4.2.0-0.okd-2019-09-16-050907

Comment 16 Antonio Murdaca 2019-09-17 07:54:27 UTC
(In reply to Antonio Murdaca from comment #15)
> This is still an issue today and greatly increases rollout/upgrade time:
> 
> I0917 07:40:56.930178    4695 update.go:89] pod
> "network-operator-fd8dbc8db-qbptc" removed (evicted)
> I0917 07:50:53.895129    4695 update.go:89] pod "downloads-86876867bf-wbxbk"
> removed (evicted)
> 
> Tested with very latest code 4.2.0-0.okd-2019-09-16-050907

AWS IPI, so this is highly reproducible today. For testing, the way to verify this would be to create a MachineConfig and check that the "downloads" pod isn't taking 10m to be evicted on the nodes. Either that, or a kubectl drain should reproduce and verify it as well.
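
A sketch of the drain-based check, assuming a 4.2-era `oc` and substituting a real node name:

```
# Cordon and drain one node directly, and time how long it takes; before the
# fix the downloads pod alone pushes this to ~10 minutes.
$ time oc adm drain ip-10-0-xxx-xxx.ec2.internal \
    --ignore-daemonsets --delete-local-data --force

# Make the node schedulable again afterwards.
$ oc adm uncordon ip-10-0-xxx-xxx.ec2.internal
```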

Comment 17 Antonio Murdaca 2019-09-17 08:04:31 UTC
Seems like this is more likely to occur when draining masters - not fully validated but that's what I'm actually experiencing.

Comment 18 Antonio Murdaca 2019-09-17 17:03:52 UTC
Any explicit reason for the move of this to 4.2? This adds a 10m slowdown to master rollouts.

Comment 20 Clayton Coleman 2019-10-03 20:31:37 UTC
We can't ship a 5x regression in master upgrade times.

Comment 21 W. Trevor King 2019-10-04 15:45:13 UTC
Manually associating with https://github.com/openshift/machine-config-operator/pull/1154 because we may not want to wait for the 4.3 bug to be verified to make the Bugzilla bot happy.
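
For context on that PR (going by its title): a drain grace period of -1 means "use each pod's own terminationGracePeriodSeconds" rather than overriding it, so the downloads pod's 1s setting is honored. The equivalent behaviour with the CLI looks roughly like this (node name is a placeholder):

```
# --grace-period=-1 (also the kubectl/oc default) defers to each pod's
# spec.terminationGracePeriodSeconds; a positive value would override it.
$ oc adm drain ip-10-0-xxx-xxx.ec2.internal \
    --ignore-daemonsets --delete-local-data --grace-period=-1
```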

Comment 23 Micah Abbott 2019-10-06 03:00:21 UTC
Verified with 4.2.0-0.nightly-2019-10-04-222458


```
$ oc get clusterversion                                                                              
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                               
version   4.2.0-0.nightly-2019-10-04-222458   True        False         18s     Cluster version is 4.2.0-0.nightly-2019-10-04-222458

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://origin-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched                                                                                                                                                                   

$ oc adm upgrade --force --to-image registry.svc.ci.openshift.org/origin/release:4.2.0-0.okd-2019-10-04-225223                                                                                                                                                                                                            
Updating to release image registry.svc.ci.openshift.org/origin/release:4.2.0-0.okd-2019-10-04-225223                       

$ oc get clusterversion                                                                              
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                               
version   4.2.0-0.nightly-2019-10-04-222458   True        True          20m     Working towards 4.2.0-0.okd-2019-10-04-225223: 88% complete, waiting on machine-config

$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.okd-2019-10-04-225223   True        False         16m     Cluster version is 4.2.0-0.okd-2019-10-04-225223

$ oc get node                                       
NAME                                         STATUS   ROLES    AGE   VERSION                                                                                                                                         
ip-10-0-130-98.us-west-2.compute.internal    Ready    master   68m   v1.14.6+463c73f1f                                                                                                                               
ip-10-0-141-66.us-west-2.compute.internal    Ready    worker   61m   v1.14.6+463c73f1f                                                                                                                               
ip-10-0-146-142.us-west-2.compute.internal   Ready    worker   61m   v1.14.6+463c73f1f                                                                                                                               
ip-10-0-146-64.us-west-2.compute.internal    Ready    master   68m   v1.14.6+463c73f1f                                                                                                                               
ip-10-0-162-250.us-west-2.compute.internal   Ready    master   68m   v1.14.6+463c73f1f                                                                                                                               
ip-10-0-172-152.us-west-2.compute.internal   Ready    worker   61m   v1.14.6+463c73f1f                                                                                                                               


### Example master node

$ oc debug node/ip-10-0-130-98.us-west-2.compute.internal
Starting pod/ip-10-0-130-98us-west-2computeinternal-debug ...                                                                                                                                                        
To use host binaries, run `chroot /host`                                                                                                                                                                             
Pod IP: 10.0.130.98                                                                                                                                                                                                  
If you don't see a command prompt, try pressing enter.                                                                                                                                                               
sh-4.2# chroot /host                                                                                                                                                                                                 
sh-4.4# journalctl -b | grep update.go     

<snip>

Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:36.096056   31045 update.go:89] pod "network-operator-5dfb8d7cf6-n7bpz" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:36.696492   31045 update.go:89] pod "catalog-operator-64f798855d-4hkxq" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:37.895699   31045 update.go:89] pod "openshift-apiserver-operator-6f9cf8984d-5x479" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:38.295692   31045 update.go:89] pod "openshift-kube-scheduler-operator-f5c6665d5-5cv54" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:38.497816   31045 update.go:89] pod "etcd-quorum-guard-6c6566fd76-fckw2" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:38.895478   31045 update.go:89] pod "machine-config-operator-76cc587f6f-v4g85" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:39.695666   31045 update.go:89] pod "insights-operator-577689f998-sqnf9" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:40.895272   31045 update.go:89] pod "cluster-storage-operator-864ffd6864-6ssfd" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:42.095654   31045 update.go:89] pod "kube-apiserver-operator-794bc546c5-qrdbk" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:42.495786   31045 update.go:89] pod "openshift-controller-manager-operator-677c9c984-5gzgq" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:43.095483   31045 update.go:89] pod "cluster-image-registry-operator-5db8d685b4-p6sj5" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:43.295672   31045 update.go:89] pod "kube-controller-manager-operator-7fccc5f577-8qdfk" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:44.295627   31045 update.go:89] pod "dns-operator-c9b6c84-fz42q" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:44.695804   31045 update.go:89] pod "ingress-operator-99f7b6888-npd29" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:45.299833   31045 update.go:89] pod "downloads-5bdfb4c6fc-7b54b" removed (evicted)
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:45.299870   31045 update.go:89] drained node "ip-10-0-130-98.us-west-2.compute.internal"
Oct 06 02:55:37 ip-10-0-130-98 hyperkube[1489]: I1006 02:33:45.299879   31045 update.go:984] drain complete

### ~10s for drain complete; ~1s per pod eviction

### Example worker node

$ oc debug node/ip-10-0-141-66.us-west-2.compute.internal                                            
Starting pod/ip-10-0-141-66us-west-2computeinternal-debug ...                                                                                                                                                        
To use host binaries, run `chroot /host`                                                                                                                                                                             
Pod IP: 10.0.141.66                                                                                                                                                                                                  
If you don't see a command prompt, try pressing enter.                                                                                                                                                               
sh-4.2# chroot /host                                                                                                                                                                                                 
sh-4.4# journalctl -b | grep update.go

<snip>

Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:35.558167   27162 update.go:984] Update prepared; beginning drain
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:35.572732   27162 update.go:89] cordoned node "ip-10-0-141-66.us-west-2.compute.internal"
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:35.657860   27162 update.go:93] ignoring DaemonSet-managed pods: tuned-cv8r4, dns-default-4fqrq, node-ca-5vfd2, machine-config-daemon-h7hcl, node-exporter-45tmz, multus-7w466, ovs-dfd8k, sdn-rl7tl; deleting pods with local storage: alertmanager-main-0, alertmanager-main-2, kube-state-metrics-5b957b4d97-pdr6q, prometheus-adapter-7f7dbfb579-qn8lf, prometheus-k8s-0
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:42.854994   27162 update.go:89] pod "community-operators-68676d7f98-zkk6w" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:43.051225   27162 update.go:89] pod "router-default-6858dccf87-wlxzb" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:43.455441   27162 update.go:89] pod "image-registry-65cdcfd89b-qrnzj" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:44.051305   27162 update.go:89] pod "redhat-operators-bdc56bfc-86wrg" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:45.052461   27162 update.go:89] pod "prometheus-adapter-7f7dbfb579-qn8lf" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:46.252416   27162 update.go:89] pod "alertmanager-main-2" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:46.651470   27162 update.go:89] pod "kube-state-metrics-5b957b4d97-pdr6q" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:47.451456   27162 update.go:89] pod "openshift-state-metrics-5949866f78-h2hzm" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.052858   27162 update.go:89] pod "prometheus-k8s-0" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.252552   27162 update.go:89] pod "alertmanager-main-0" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.841844   27162 update.go:89] pod "certified-operators-659776856-6m26h" removed (evicted)
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.841877   27162 update.go:89] drained node "ip-10-0-141-66.us-west-2.compute.internal"
Oct 06 02:57:45 ip-10-0-141-66 hyperkube[1218]: I1006 02:35:48.841884   27162 update.go:984] drain complete


### ~13s for drain complete; ~1s per pod eviction
```

Comment 24 errata-xmlrpc 2019-10-16 06:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

