Bug 1501514

Summary: One single daemonset controller is responsible for 1/2 of all write workload on the cluster
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Master
Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED CURRENTRELEASE
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.6.1
CC: aos-bugs, ccoleman, decarr, jcantril, jokerman, juzhao, mfojtik, mmccomas, pportant, tkatarki, trankin, wmeng
Target Milestone: ---
Keywords: OpsBlocker, Reopened, UpcomingRelease
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: DaemonSet nodes were restricted by a project default node selector. Consequence: DS pods were deleted and created in a loop on the nodes that became restricted once the project default node selector was added. Fix: The upstream DS logic was patched to be aware of the project default node selector. Result: DS pods are no longer created and deleted on the nodes restricted by the project default node selector.
Story Points: ---
Clone Of:
Clones: 1571093
Environment:
Last Closed: 2018-12-20 21:40:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1536304, 1574773    
Bug Blocks: 1571093    

Description Clayton Coleman 2017-10-12 17:18:30 UTC
us-east-2 has a single daemonset on the cluster, the logging daemonset, which is driving a huge portion of write load:

{resource="endpoints",verb="PUT"}	33.325
{resource="pods",verb="PUT"}	17.179166666666667
{resource="nodes",verb="PATCH"}	15.575
{resource="pods",verb="DELETE"}	14.454166666666664
{resource="events",verb="POST"}	12.762500000000001
{resource="pods",verb="POST"}	12.695833333333331
{resource="builds",verb="PATCH"}	9.983333333333333
{resource="localsubjectaccessreviews",verb="POST"}	9.920833333333334
{resource="daemonsets",verb="PUT"}	4.775
{resource="replicasets",verb="POST"}	3.1
{resource="subjectaccessreviews",verb="POST"}	1.8
{resource="tokenreviews",verb="POST"}	0.9708333333333332
{resource="replicationcontrollers",verb="PUT"}	0.20000000000000004
{resource="deploymentconfigs",verb="PUT"}	0.17083333333333336
{resource="deploymentconfigs",verb="POST"}	0.09583333333333334
{resource="events",verb="PATCH"}	0.06666666666666667
{resource="bindings",verb="POST"}	0.029166666666666667
{resource="imagestreamimports",verb="POST"}	0.025

Endpoints, pods, and daemonset traffic are all coming from the logging daemonset which is flapping on a particular node.

oc get ds -n logging
NAME              DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR                AGE
logging-fluentd   153       150       150       150          150         logging-infra-fluentd=true   16d

Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  2d		6m		3437570	daemon-set				Normal		SuccessfulDelete	(combined from similar events): Deleted pod: logging-fluentd-qqbbh
  2d		1m		3441276	daemon-set				Normal		SuccessfulCreate	(combined from similar events): Created pod: logging-fluentd-pkmq1
  2d		1m		3441276	daemonset-controller			Warning		FailedDaemonPod		(combined from similar events): Found failed daemon pod logging/ip-172-31-72-226.us-east-2.compute.internal on node logging-fluentd-pkmq1, will try to kill it

3 million events is a bit high.

This is *1* daemon set - we need to determine whether DS is fundamentally broken in error cases and harden it, or we can't use it.

Comment 1 Clayton Coleman 2017-10-12 18:02:39 UTC
Wow - oc get pods -n logging -w spews

```
logging-fluentd-c9xh6   0/1       Terminating   0         0s
logging-fluentd-rn345   0/1       Terminating   0         0s
logging-fluentd-btlgn   0/1       Pending   0         0s
logging-fluentd-pxxck   0/1       Pending   0         0s
logging-fluentd-kmjp9   0/1       Pending   0         0s
logging-fluentd-pxxck   0/1       MatchNodeSelector   0         0s
logging-fluentd-kmjp9   0/1       MatchNodeSelector   0         0s
logging-fluentd-btlgn   0/1       MatchNodeSelector   0         0s
logging-fluentd-pxxck   0/1       Terminating   0         0s
logging-fluentd-kmjp9   0/1       Terminating   0         0s
logging-fluentd-btlgn   0/1       Terminating   0         0s
logging-fluentd-pxxck   0/1       Terminating   0         0s
logging-fluentd-kmjp9   0/1       Terminating   0         0s
logging-fluentd-btlgn   0/1       Terminating   0         0s
logging-fluentd-436jz   0/1       Pending   0         0s
logging-fluentd-j1m5v   0/1       Pending   0         0s
logging-fluentd-58231   0/1       Pending   0         0s
logging-fluentd-j1m5v   0/1       MatchNodeSelector   0         0s
logging-fluentd-436jz   0/1       MatchNodeSelector   0         0s
logging-fluentd-58231   0/1       MatchNodeSelector   0         0s
logging-fluentd-436jz   0/1       Terminating   0         0s
logging-fluentd-58231   0/1       Terminating   0         0s
logging-fluentd-j1m5v   0/1       Terminating   0         0s
logging-fluentd-58231   0/1       Terminating   0         0s
logging-fluentd-436jz   0/1       Terminating   0         0s
logging-fluentd-j1m5v   0/1       Terminating   0         0s
logging-fluentd-jx8pt   0/1       Pending   0         0s
logging-fluentd-4n8n9   0/1       Pending   0         0s
logging-fluentd-z8zzw   0/1       Pending   0         0s
logging-fluentd-z8zzw   0/1       MatchNodeSelector   0         0s
logging-fluentd-jx8pt   0/1       MatchNodeSelector   0         0s
logging-fluentd-4n8n9   0/1       MatchNodeSelector   0         0s
logging-fluentd-jx8pt   0/1       Terminating   0         0s
logging-fluentd-z8zzw   0/1       Terminating   0         0s
logging-fluentd-4n8n9   0/1       Terminating   0         0s
logging-fluentd-jx8pt   0/1       Terminating   0         0s
logging-fluentd-4n8n9   0/1       Terminating   0         0s
logging-fluentd-z8zzw   0/1       Terminating   0         0s
logging-fluentd-sdwrr   0/1       Pending   0         0s
logging-fluentd-gm4kx   0/1       Pending   0         0s
logging-fluentd-w8q1f   0/1       Pending   0         0s
logging-fluentd-w8q1f   0/1       MatchNodeSelector   0         0s
```

Comment 2 Tomáš Nožička 2017-10-16 13:46:19 UTC
ccoleman do you know why it was flapping? Any chance I could get the DS and pod yaml? I'd like to reproduce it locally but I can't get it to fail.

Comment 3 Clayton Coleman 2017-10-19 11:35:44 UTC
I don't know why it was flapping; I think the node was broken.

Comment 5 Tomáš Nožička 2017-10-27 19:22:24 UTC
I have to admit I failed to reproduce this issue; we will need more information the next time this happens, if it happens again.

The first idea about the matching selector was off, but I nevertheless tried different combinations of nodeSelectors, project nodeSelectors, and node labels.

But judging from this event:

Found failed daemon pod logging/ip-172-31-72-226.us-east-2.compute.internal on node logging-fluentd-pkmq1, will try to kill it

Looking at the code where we emit this event, it implies the pod has been in the PodFailed phase. That is particularly tricky because all daemon pods are forced to have RestartPolicy=Always, and looking at the kubelet it is very hard, almost impossible, to bring such a pod into the failed state. I tried bringing down the node the pod was on, and although the k8s docs say the pod will be failed, it isn't. I also tried running out of disk space, but in that case the kubelet deletes the failed pod and won't schedule new ones.

I would be very interested in how the pod could fail on that node. Maybe it was an older version where the kubelet could still transition the pod to the failed phase...

It will be essential to find out what really happens if we encounter this again.

Depending on what actually caused it, we can decide if there is a better path forward, but the controller is doing what it's supposed to do - reconciling the state of the world. It sees a failed pod where one should be running, so it deletes it and creates it again.

Comment 6 Michal Fojtik 2017-12-07 09:44:29 UTC
Clayton, we can't reproduce this. Can you provide any guidance for Tomas on where to start looking? Is this still happening?

Comment 7 Michal Fojtik 2018-01-09 10:18:52 UTC
ping? :-)

Comment 9 Tomáš Nožička 2018-03-12 20:26:02 UTC
Ok, I got a reproducer and I think this is what happened to Clayton.

The issue occurs if you create a project with a default node selector (or set one in the master config) whose label is not actually present on all (or any) of the nodes the DS targets, e.g.:
  
  oc adm new-project myproject --node-selector='type=user-node'

The reason is that the project node selector and the DaemonSet's own (internal) scheduling are incompatible.

The flow is (a reproduction sketch follows the list):
 - DS looks for the nodes it should place pods on, filtering them by its nodeSelector
 - DS creates pods with that nodeSelector and targets them directly at those nodes by setting nodeName (bypassing the scheduler)
 - admission for the project default node selector kicks in and adds another nodeSelector
 - the kubelet picks up the pods targeted at it by nodeName and tries to match the selectors
 - with the nodeSelector added by admission, the pod is no longer valid for some nodes, so the kubelet fails the pod
 - DS sees a failed pod, so it reconciles by deleting it and creating a new one
 - and the whole thing goes in a loop
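A minimal sketch of that reproduction, assuming an illustrative DaemonSet manifest, image, and label values rather than the ones from the original report:

```
# Project whose default node selector does not match (some of) the nodes.
oc adm new-project myproject --node-selector='type=user-node'

# A DaemonSet without its own nodeSelector targets every node.
oc create -n myproject -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hello-daemonset
spec:
  selector:
    matchLabels:
      app: hello-daemonset
  template:
    metadata:
      labels:
        app: hello-daemonset
    spec:
      containers:
      - name: hello
        image: registry.access.redhat.com/rhel7/rhel-tools
        command: ["sleep", "infinity"]
EOF

# On every node that lacks the type=user-node label, the DS controller
# binds a pod directly via nodeName, admission injects the project
# selector, the kubelet rejects the pod with MatchNodeSelector, and the
# controller recreates it - in a loop:
oc get pods -n myproject -w
```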


My recommendation is not to set project node selectors when using DaemonSets, as those are incompatible features.

Comment 10 Clayton Coleman 2018-03-12 21:31:42 UTC
So reading through this, it really sounds like a bug in the admission plugin - it doesn't apply to the DS.

We support project node selector for security reasons, and it sounds like the same rules that apply to pods need to be applied to daemonsets, or the daemonset scheduler needs to take it into account.

Either:

1. on create / update the default project selector should be overlaid with the daemonset selector - conflicts should result in forbidden, defaults should be merged (foo=bar and foo=baz are incompatible, foo=bar and fiz=bar are compatible)
2. the daemonset scheduler itself reads the project node selector and applies it using the same rules

The second seems harder, but if it's actually easy / possible then it's something we need to support.
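
For illustration, the merge rule from option 1 can be checked by hand today, assuming the project default selector is stored in the openshift.io/node-selector namespace annotation (namespace and DS names below are illustrative):

```
# Project default node selector (assumed annotation location):
oc get namespace logging \
  -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}'
# -> type=user-node

# The DaemonSet's own selector:
oc get ds logging-fluentd -n logging \
  -o jsonpath='{.spec.template.spec.nodeSelector}'
# -> map[logging-infra-fluentd:true]

# Different keys (type vs logging-infra-fluentd): the selectors can be
# merged. The same key with different values (e.g. type=user-node on the
# project and type=infra on the DS): a conflict that, under option 1,
# should make the DS create/update forbidden.
```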

Comment 11 Clayton Coleman 2018-03-12 21:32:10 UTC
for 1, that should be "create/update of the daemonset"

Comment 12 Tomáš Nožička 2018-03-12 23:02:31 UTC
I don't see option 2 as realistic, since a Kubernetes resource would need code changes to depend on an OpenShift feature.

Option 1 is doable, but it's fairly late for 3.9.0 if that's the target here (assuming so from the fact that it blocks the console).

Tightening the admission on DS might get us in trouble with upgrades, as we can't deal with backwards-incompatible changes there AFAIK (storage migration) - for example, if we already have a DS whose nodeSelector conflicts with the one to be set from the project. But there usually aren't that many DSs, and they are usually owned by admins, so maybe we can leave that to manual intervention. Or we allow the update if the object was already broken in its old version.

A nit is that the nodeSelector will differ between e.g. a DS and a DC, as one will have it modified "inline" while the other has it set only on the pod. Any other openshift/kube or third-party controller using nodeName will still be broken.

Comment 13 Clayton Coleman 2018-03-12 23:11:53 UTC
> Tightening the admission on DS might get us in trouble with upgrades as we can't deal with backwards incompatible changes there AFAIK (storage migration).

During an update we know what the stored version is, so we can explicitly bypass it. Since this is so broken, we need to fix it anyway.

Anyone using nodeName with OpenShift is already broken, so no change there.

Comment 14 Clayton Coleman 2018-03-12 23:14:22 UTC
For LTS reasons it probably needs to be in 3.9 too.

Comment 16 Tomáš Nožička 2018-03-13 11:59:38 UTC
I think the admission for DS is problematic because, if the admin changes the default project node selector after a DS has been created, we are back where we started.

Comment 18 Tomáš Nožička 2018-03-13 18:11:56 UTC
The low-risk, short-term "fix" is likely documentation telling admins to keep DaemonSets in a separate namespace and to disable the project default node selector in any namespace containing a DS (a sketch of the workaround follows). Misconfiguration will lead to the cluster being on fire, though!
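
A sketch of that workaround, assuming the project default node selector lives in the openshift.io/node-selector namespace annotation (namespace names are illustrative):

```
# Disable the project default node selector on an existing namespace
# that contains DaemonSets:
oc annotate namespace logging openshift.io/node-selector="" --overwrite

# Or create a dedicated namespace for DaemonSets with no default selector:
oc adm new-project daemonset-project --node-selector=""
```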

And we should change the default RBAC to restrict DS creation to admins only, which we likely should do anyway.


I think documentation and fixing RBAC are our only choice for 3.9, because having an admission plugin for DS won't really fix the issue we have: it can be broken by the first modification to the namespace annotation or the master config, with no way to react to it. It would also likely break storage migration for the next upgrade, and probably any update to the DS after such a change.

The only real fix I see now is to have DS pods scheduled by the scheduler, which will not happen for 3.10. There is some work upstream to offer that as an optional alpha feature in the next kube release, making this a long-term effort.

Comment 19 Scott Dodson 2018-03-13 18:55:54 UTC
*** Bug 1543727 has been marked as a duplicate of this bug. ***

Comment 20 Tomáš Nožička 2018-03-14 21:37:12 UTC
Before upstream manages to rewrite daemonsets to use the scheduler, we could write a controller to do one of the following (see the sketch after this list):

a) detect these misconfigurations and disable such daemonsets

b) periodically sync the default project node selector to the appropriate daemonsets. The interim state is OK, as the actual enforcement also happens at the pod level with admission. This would save the cluster, but the daemonset still likely wouldn't target the intended nodes. Also, if the default node selector changes, the old values would pile up in the daemonset's nodeSelector, likely becoming invalid or targeting no nodes.
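
A naive sketch of option (b), assuming the default selector is stored in the openshift.io/node-selector annotation and handling only a single key=value selector; the namespace and DS names are illustrative:

```
NS=myproject
DS=hello-daemonset
while true; do
  SELECTOR=$(oc get namespace "$NS" \
    -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}')
  if [ -n "$SELECTOR" ]; then
    KEY=${SELECTOR%%=*}
    VALUE=${SELECTOR#*=}
    # Copy the project selector into the DS pod template so the DS
    # controller and the admission plugin agree on the target nodes.
    oc patch ds "$DS" -n "$NS" --type=merge \
      -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"$KEY\":\"$VALUE\"}}}}}"
  fi
  sleep 60
done
```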

But we now document (or will, when that merges) that before using any daemonset the user should disable the node selector on the project, so it's a question of whether this is worth it or whether we just wait for upstream to switch to the scheduler, now that it's covered in our docs and daemonsets are restricted to admins.

Otherwise we are fairly limited by the fact that daemonsets are an upstream resource.

Comment 21 Tomáš Nožička 2018-03-16 13:23:34 UTC
*** Bug 1557295 has been marked as a duplicate of this bug. ***

Comment 22 Tomáš Nožička 2018-03-19 16:38:34 UTC
docs PR for 3.9 https://github.com/openshift/openshift-docs/pull/8197

Comment 24 Jordan Liggitt 2018-04-23 18:49:03 UTC
included in 3.10.0-0.21.0

will be in 3.9.25-1

Comment 25 DeShuai Ma 2018-04-24 05:34:37 UTC
1. Verified on v3.10.0-0.27.0 that the loop-recreate issue has been fixed.
But there is still one issue: desiredNumberScheduled never matches the current count.
DESIRED=3, CURRENT=2; the two counts should match.

[root@ip-172-18-9-197 ~]# oc get no
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-11-225.ec2.internal   Ready     compute   4h        v1.10.0+b81c8f8
ip-172-18-12-238.ec2.internal   Ready     compute   4h        v1.10.0+b81c8f8
ip-172-18-9-197.ec2.internal    Ready     master    4h        v1.10.0+b81c8f8

[root@ip-172-18-9-197 ~]# oc get ds -n dma
NAME              DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hello-daemonset   3         2         0         2            0           <none>          7s

Steps to verify ref: https://bugzilla.redhat.com/show_bug.cgi?id=1543727#c5

Comment 26 DeShuai Ma 2018-04-24 05:36:22 UTC
3.9.25 still has the issue.
Could you help clone a bug for 3.9.z?

[root@host-192-168-100-6 ~]# oc version
oc v3.9.25
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.100.6:8443
openshift v3.9.25
kubernetes v1.9.1+a0ce1bc657
[root@host-192-168-100-6 ~]# vim /etc/origin/master/master-config.yaml
[root@host-192-168-100-6 ~]# oc adm new-project dma
Created project dma
[root@host-192-168-100-6 ~]# oc create -f https://raw.githubusercontent.com/mdshuai/testfile-openshift/master/daemonset/daemonset.yaml -n dma
daemonset "hello-daemonset" created
[root@host-192-168-100-6 ~]# oc get po -n dma
NAME                    READY     STATUS              RESTARTS   AGE
hello-daemonset-cxsg2   0/1       ContainerCreating   0          5s
hello-daemonset-rx89k   0/1       Pending             0          2s
[root@host-192-168-100-6 ~]# oc get po -n dma -w
NAME                    READY     STATUS    RESTARTS   AGE
hello-daemonset-cxsg2   1/1       Running   0          9s
hello-daemonset-hj9gk   0/1       Pending   0          0s
hello-daemonset-hj9gk   0/1       MatchNodeSelector   0         0s
hello-daemonset-hj9gk   0/1       Terminating   0         0s
hello-daemonset-hj9gk   0/1       Terminating   0         0s
hello-daemonset-rqfx9   0/1       Pending   0         0s
hello-daemonset-rqfx9   0/1       MatchNodeSelector   0         1s
hello-daemonset-rqfx9   0/1       Terminating   0         1s
hello-daemonset-rqfx9   0/1       Terminating   0         1s
hello-daemonset-b2hgm   0/1       Pending   0         0s
hello-daemonset-b2hgm   0/1       MatchNodeSelector   0         1s
hello-daemonset-b2hgm   0/1       Terminating   0         1s
hello-daemonset-b2hgm   0/1       Terminating   0         1s
hello-daemonset-phxzn   0/1       Pending   0         0s
hello-daemonset-phxzn   0/1       MatchNodeSelector   0         0s
hello-daemonset-phxzn   0/1       Terminating   0         0s
hello-daemonset-phxzn   0/1       Terminating   0         0s
hello-daemonset-jq7lb   0/1       Pending   0         0s
hello-daemonset-jq7lb   0/1       MatchNodeSelector   0         1s
hello-daemonset-jq7lb   0/1       Terminating   0         1s
hello-daemonset-jq7lb   0/1       Terminating   0         1s
hello-daemonset-v7cx7   0/1       Pending   0         0s
hello-daemonset-v7cx7   0/1       MatchNodeSelector   0         0s
hello-daemonset-v7cx7   0/1       Terminating   0         0s
hello-daemonset-v7cx7   0/1       Terminating   0         0s
hello-daemonset-2dlp7   0/1       Pending   0         0s

Comment 27 Tomáš Nožička 2018-04-24 06:19:12 UTC
Cloned 3.9 here: https://bugzilla.redhat.com/show_bug.cgi?id=1571093

Please create a separate bug for the status issue. It should be desired==2.
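
For reference, the two status fields in question can be read directly (namespace and DS name from the verification above):

```
oc get ds hello-daemonset -n dma \
  -o jsonpath='{.status.desiredNumberScheduled} {.status.currentNumberScheduled}{"\n"}'
# With the project default node selector excluding the master node,
# both numbers are expected to be 2.
```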

Comment 28 DeShuai Ma 2018-04-24 06:35:37 UTC
(In reply to Tomáš Nožička from comment #27)
> Cloned 3.9 here: https://bugzilla.redhat.com/show_bug.cgi?id=1571093
> 
> Please create a separate bug for the status issue. It should be desired==2.

Moved to verified and created a separate bug for the status issue: https://bugzilla.redhat.com/show_bug.cgi?id=1571111

Comment 31 Red Hat Bugzilla 2023-09-14 04:09:52 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days