us-east-2 has a single daemonset on the cluster, the logging daemonset, which is driving a huge portion of write load:

```
{resource="endpoints",verb="PUT"}                 33.325
{resource="pods",verb="PUT"}                      17.179166666666667
{resource="nodes",verb="PATCH"}                   15.575
{resource="pods",verb="DELETE"}                   14.454166666666664
{resource="events",verb="POST"}                   12.762500000000001
{resource="pods",verb="POST"}                     12.695833333333331
{resource="builds",verb="PATCH"}                  9.983333333333333
{resource="localsubjectaccessreviews",verb="POST"} 9.920833333333334
{resource="daemonsets",verb="PUT"}                4.775
{resource="replicasets",verb="POST"}              3.1
{resource="subjectaccessreviews",verb="POST"}     1.8
{resource="tokenreviews",verb="POST"}             0.9708333333333332
{resource="replicationcontrollers",verb="PUT"}    0.20000000000000004
{resource="deploymentconfigs",verb="PUT"}         0.17083333333333336
{resource="deploymentconfigs",verb="POST"}        0.09583333333333334
{resource="events",verb="PATCH"}                  0.06666666666666667
{resource="bindings",verb="POST"}                 0.029166666666666667
{resource="imagestreamimports",verb="POST"}       0.025
```

Endpoints, pods, and daemonset traffic are all coming from the logging daemonset, which is flapping on a particular node.

```
$ oc get ds -n logging
NAME              DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR                AGE
logging-fluentd   153       150       150       150          150         logging-infra-fluentd=true   16d
```

Events:

```
FirstSeen  LastSeen  Count    From                  SubObjectPath  Type     Reason            Message
---------  --------  -----    ----                  -------------  ----     ------            -------
2d         6m        3437570  daemon-set                           Normal   SuccessfulDelete  (combined from similar events): Deleted pod: logging-fluentd-qqbbh
2d         1m        3441276  daemon-set                           Normal   SuccessfulCreate  (combined from similar events): Created pod: logging-fluentd-pkmq1
2d         1m        3441276  daemonset-controller                 Warning  FailedDaemonPod   (combined from similar events): Found failed daemon pod logging/ip-172-31-72-226.us-east-2.compute.internal on node logging-fluentd-pkmq1, will try to kill it
```

3 million events is a bit high.
This is *1* daemon set. We need to determine whether DS is fundamentally broken in error cases and harden it, or we can't use it.
Wow - `oc get pods -n logging -w` spews:

```
logging-fluentd-c9xh6   0/1       Terminating         0         0s
logging-fluentd-rn345   0/1       Terminating         0         0s
logging-fluentd-btlgn   0/1       Pending             0         0s
logging-fluentd-pxxck   0/1       Pending             0         0s
logging-fluentd-kmjp9   0/1       Pending             0         0s
logging-fluentd-pxxck   0/1       MatchNodeSelector   0         0s
logging-fluentd-kmjp9   0/1       MatchNodeSelector   0         0s
logging-fluentd-btlgn   0/1       MatchNodeSelector   0         0s
logging-fluentd-pxxck   0/1       Terminating         0         0s
logging-fluentd-kmjp9   0/1       Terminating         0         0s
logging-fluentd-btlgn   0/1       Terminating         0         0s
logging-fluentd-pxxck   0/1       Terminating         0         0s
logging-fluentd-kmjp9   0/1       Terminating         0         0s
logging-fluentd-btlgn   0/1       Terminating         0         0s
logging-fluentd-436jz   0/1       Pending             0         0s
logging-fluentd-j1m5v   0/1       Pending             0         0s
logging-fluentd-58231   0/1       Pending             0         0s
logging-fluentd-j1m5v   0/1       MatchNodeSelector   0         0s
logging-fluentd-436jz   0/1       MatchNodeSelector   0         0s
logging-fluentd-58231   0/1       MatchNodeSelector   0         0s
logging-fluentd-436jz   0/1       Terminating         0         0s
logging-fluentd-58231   0/1       Terminating         0         0s
logging-fluentd-j1m5v   0/1       Terminating         0         0s
logging-fluentd-58231   0/1       Terminating         0         0s
logging-fluentd-436jz   0/1       Terminating         0         0s
logging-fluentd-j1m5v   0/1       Terminating         0         0s
logging-fluentd-jx8pt   0/1       Pending             0         0s
logging-fluentd-4n8n9   0/1       Pending             0         0s
logging-fluentd-z8zzw   0/1       Pending             0         0s
logging-fluentd-z8zzw   0/1       MatchNodeSelector   0         0s
logging-fluentd-jx8pt   0/1       MatchNodeSelector   0         0s
logging-fluentd-4n8n9   0/1       MatchNodeSelector   0         0s
logging-fluentd-jx8pt   0/1       Terminating         0         0s
logging-fluentd-z8zzw   0/1       Terminating         0         0s
logging-fluentd-4n8n9   0/1       Terminating         0         0s
logging-fluentd-jx8pt   0/1       Terminating         0         0s
logging-fluentd-4n8n9   0/1       Terminating         0         0s
logging-fluentd-z8zzw   0/1       Terminating         0         0s
logging-fluentd-sdwrr   0/1       Pending             0         0s
logging-fluentd-gm4kx   0/1       Pending             0         0s
logging-fluentd-w8q1f   0/1       Pending             0         0s
logging-fluentd-w8q1f   0/1       MatchNodeSelector   0         0s
```
ccoleman do you know why it was flapping? Any chance I could get the DS and pod yaml? I'd like to reproduce it locally but I can't get it to fail.
I don't know why it was flapping, I think the node was broken.
I have to admit I failed to reproduce this issue; we'll need more info the next time this happens, if it happens again. My first idea about selector matching was off, but I nevertheless tried different combinations of nodeSelectors, project nodeSelectors, and node labels.

Judging from this event:

```
Found failed daemon pod logging/ip-172-31-72-226.us-east-2.compute.internal on node logging-fluentd-pkmq1, will try to kill it
```

and looking at the code where we emit it, the pod must have been in the PodFailed phase. That is particularly tricky because all daemon pods are forced to have RestartPolicy=Always, and looking at the kubelet it is very hard, almost impossible, to bring such a pod into the failed state. I tried bringing down the node the pod was on; although the k8s docs say the pod will be failed, it isn't. I tried running out of disk space, but in that case the kubelet deletes the failed pod and won't schedule new ones. I would be very interested in how the pod could fail on that node. Maybe it was on an older version where the kubelet could still transition a pod to the failed phase.

It will be essential to find out what really happens if we encounter this again. Depending on the actual cause we can decide whether there is a better path forward, but the controller is doing what it's supposed to do: reconciling the state of the world. It sees a failed pod where one should be running, so it deletes it and creates it again.
Clayton, we can't reproduce this. Can you provide any guidance for Tomas on where to start looking? Is this still happening?
ping? :-)
Ok, I got a reproducer and I think this is what happened to Clayton. Create a project with a default node selector (or set one in the master config) that is not actually satisfied by all of the nodes the DS targets (or by any of them):

```
oc adm new-project myproject --node-selector='type=user-node'
```

The reason is that the project node selector and the DaemonSet's own (internal) scheduler are incompatible. The flow is:

- The DS controller looks for nodes it should place pods on and filters them by matching the nodeSelector.
- The DS controller creates pods with that nodeSelector and targets them directly at those nodes by setting nodeName (bypassing the scheduler).
- Admission for the project default node selector kicks in and adds its nodeSelector to the pod.
- The kubelet picks up pods targeted at it by nodeName and tries to match the selectors; with the added nodeSelector from admission the pod is no longer valid for some nodes, so the kubelet fails the pod.
- The DS controller sees a failed pod, so it reconciles: it deletes the pod and creates a new one, and everything goes in a loop.

My recommendation is not to set project node selectors when using DaemonSets, as those are incompatible features.
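The failure step can be sketched in a few lines of Go. This is a hypothetical illustration of the selector check, not the actual kubelet code; the label values are taken from the fluentd example above, and the `type=user-node` entry stands in for what the project admission plugin adds:

```go
package main

import "fmt"

// matches reports whether every key=value term in selector is present in
// labels, mirroring how a pod's nodeSelector is checked against node labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// A node the DS targets, which lacks the project's default label.
	nodeLabels := map[string]string{"logging-infra-fluentd": "true"}
	dsSelector := map[string]string{"logging-infra-fluentd": "true"}

	// The DS controller picks the node because its own selector matches.
	fmt.Println(matches(dsSelector, nodeLabels)) // true

	// Admission then adds the project default selector to the pod...
	podSelector := map[string]string{
		"logging-infra-fluentd": "true",
		"type":                  "user-node", // added by admission
	}

	// ...and the kubelet's re-check now rejects the pod (MatchNodeSelector);
	// the DS controller sees a failed pod and recreates it, looping forever.
	fmt.Println(matches(podSelector, nodeLabels)) // false
}
```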
So reading through this, it really sounds like a bug in the admission plugin: it doesn't apply to the DS. We support project node selectors for security reasons, and the same rules that apply to pods need to be applied to daemonsets, or the daemonset scheduler needs to take the project selector into account. Either:

1. On create / update, the default project selector should be overlaid with the daemonset selector; conflicts should result in forbidden, and defaults should be merged (foo=bar and foo=baz are incompatible, foo=bar and fiz=bar are compatible).
2. The daemonset scheduler itself reads the project node selector and applies it using the same rules.

The second seems harder, but if it's actually easy / possible then it's something we need to support.
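Option 1's merge rule can be sketched like this. A minimal, hypothetical illustration of the proposed overlay semantics (the function name and shape are mine, not an existing API):

```go
package main

import "fmt"

// mergeSelectors overlays the project default node selector onto the
// DaemonSet's selector, as proposed for DS create/update admission:
// distinct keys merge, identical key=value pairs are fine, and the same
// key set to different values is a conflict that should yield forbidden.
func mergeSelectors(dsSel, projectSel map[string]string) (map[string]string, error) {
	merged := make(map[string]string, len(dsSel)+len(projectSel))
	for k, v := range dsSel {
		merged[k] = v
	}
	for k, v := range projectSel {
		if existing, ok := merged[k]; ok && existing != v {
			return nil, fmt.Errorf("conflicting node selector: %s=%s vs %s=%s", k, existing, k, v)
		}
		merged[k] = v
	}
	return merged, nil
}

func main() {
	// foo=bar and fiz=bar are compatible: keys are distinct, so they merge.
	m, err := mergeSelectors(map[string]string{"foo": "bar"}, map[string]string{"fiz": "bar"})
	fmt.Println(m, err)

	// foo=bar and foo=baz are incompatible: same key, different values.
	_, err = mergeSelectors(map[string]string{"foo": "bar"}, map[string]string{"foo": "baz"})
	fmt.Println(err)
}
```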
for 1, that should be "create/update of the daemonset"
I don't see 2. as realistic, as a Kubernetes resource would need code changes to depend on an OpenShift feature. 1. is doable, but fairly late for 3.9.0 if that's the target here (assuming so, from the fact that it blocks the console).

Tightening the admission on DS might get us in trouble with upgrades, as we can't deal with backwards-incompatible changes there AFAIK (storage migration): for example, a DS whose nodeSelector conflicts with the one to be set from the project. But there usually aren't that many DSs and they are usually owned by admins, so maybe we can leave that to manual intervention. Or we allow it on update if the object was already broken in its old version.

A nit is that the nodeSelector will differ between e.g. a DS and a DC, as one will have it modified "inline" and the other only on the pod. Any OpenShift, Kube, or third-party controller using nodeName will still be broken.
> Tightening the admission on DS might get us in trouble with upgrades as we can't deal with backwards incompatible changes there AFAIK (storage migration).

During update we know what the stored version is, so we can explicitly bypass it. Since this is so broken, we need to fix it anyway. Anyone using nodeName with OpenShift is already broken, so no change there.
For LTS reasons it probably needs to be in 3.9 too.
I think the admission for DS is problematic because if the admin changes the default project node selector after a DS has been created, we are back where we started.
The low-risk, short-term "fix" is likely documentation for admins: keep DaemonSets in a separate namespace and disable the project default node selector in namespaces containing a DS. Misconfiguration will still set the cluster on fire, though! We should also change the default RBAC to restrict DS creation to admins, which we should likely do anyway.

I think documentation plus fixing RBAC is our only choice for 3.9. Adding admission for DS won't really fix the issue, as it can be broken by the first modification to the namespace annotation or master config with no way to react to it, and it would likely break storage migration for the next upgrade as well, and probably any update to the DS after that change.

The only real fix I see is to have DaemonSets scheduled by the scheduler, which will not happen for 3.10. There is some work upstream to offer that as an optional alpha feature in the next Kubernetes release, making this long-term.
*** Bug 1543727 has been marked as a duplicate of this bug. ***
Before upstream manages to rewrite daemonsets to use the scheduler, we could write a controller to:

a) detect these misconfigurations and disable such daemonsets, or
b) periodically sync the default project node selector into the appropriate daemonsets.

The interim is OK, as the actual enforcement also happens at the pod level with admission. This would save the cluster, but the daemonset still likely wouldn't target the intended nodes. Also, when the default node selector changes, the old values would pile up in the daemonset's nodeSelector, likely becoming invalid or targeting no nodes.

But we now document (or will, when that merges) that before using any daemonset the user shall disable the node selector on the project. So it's a question of whether it's worth it, or whether we just wait for upstream to switch to the scheduler, now that this is covered in our docs and daemonsets are restricted to admins. Otherwise we are fairly limited by the fact that daemonsets are an upstream resource.
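The detection check in option a) could look roughly like this. A sketch only, under the assumption that the controller has the node labels, the DS's pod-template nodeSelector, and the project's default selector in hand (the function names are mine, not an existing API):

```go
package main

import "fmt"

// matchesAll reports whether labels satisfy every key=value term in sel.
func matchesAll(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// misconfigured reports whether any node the DaemonSet targets (i.e. one
// matching dsSel) would be rejected once admission adds the project default
// selector projSel -- the condition that puts the DS into the fail/recreate
// loop, and which the proposed controller would detect before disabling the DS.
func misconfigured(nodes []map[string]string, dsSel, projSel map[string]string) bool {
	for _, labels := range nodes {
		if matchesAll(dsSel, labels) && !matchesAll(projSel, labels) {
			return true
		}
	}
	return false
}

func main() {
	nodes := []map[string]string{
		{"logging-infra-fluentd": "true"},                      // lacks the project label
		{"logging-infra-fluentd": "true", "type": "user-node"}, // has both labels
	}
	dsSel := map[string]string{"logging-infra-fluentd": "true"}

	// A project default of type=user-node breaks the first node: loop detected.
	fmt.Println(misconfigured(nodes, dsSel, map[string]string{"type": "user-node"})) // true

	// With no project default selector there is nothing to conflict with.
	fmt.Println(misconfigured(nodes, dsSel, map[string]string{})) // false
}
```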
*** Bug 1557295 has been marked as a duplicate of this bug. ***
docs PR for 3.9 https://github.com/openshift/openshift-docs/pull/8197
https://github.com/openshift/origin/pull/18989
https://github.com/openshift/ose/pull/1205

Long-term issue for a fix upstream: https://github.com/kubernetes/kubernetes/issues/61886
Included in 3.10.0-0.21.0; will be in 3.9.25-1.
1. Verified on v3.10.0-0.27.0 that the recreate loop has been fixed. But there is still one issue: the desiredNumberScheduled never equals current. DESIRED=3, CURRENT=2; we need DESIRED=3 to be reached.

```
[root@ip-172-18-9-197 ~]# oc get no
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-11-225.ec2.internal   Ready     compute   4h        v1.10.0+b81c8f8
ip-172-18-12-238.ec2.internal   Ready     compute   4h        v1.10.0+b81c8f8
ip-172-18-9-197.ec2.internal    Ready     master    4h        v1.10.0+b81c8f8
[root@ip-172-18-9-197 ~]# oc get ds -n dma
NAME              DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hello-daemonset   3         2         0         2            0           <none>          7s
```

Steps to verify, ref: https://bugzilla.redhat.com/show_bug.cgi?id=1543727#c5
3.9.25 still has the issue. Could you help clone a bug for 3.9.z?

```
[root@host-192-168-100-6 ~]# oc version
oc v3.9.25
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.100.6:8443
openshift v3.9.25
kubernetes v1.9.1+a0ce1bc657
[root@host-192-168-100-6 ~]# vim /etc/origin/master/master-config.yaml
[root@host-192-168-100-6 ~]# oc adm new-project dma
Created project dma
[root@host-192-168-100-6 ~]# oc create -f https://raw.githubusercontent.com/mdshuai/testfile-openshift/master/daemonset/daemonset.yaml -n dma
daemonset "hello-daemonset" created
[root@host-192-168-100-6 ~]# oc get po -n dma
NAME                    READY     STATUS              RESTARTS   AGE
hello-daemonset-cxsg2   0/1       ContainerCreating   0          5s
hello-daemonset-rx89k   0/1       Pending             0          2s
[root@host-192-168-100-6 ~]# oc get po -n dma -w
NAME                    READY     STATUS    RESTARTS   AGE
hello-daemonset-cxsg2   1/1       Running   0          9s
hello-daemonset-hj9gk   0/1       Pending             0         0s
hello-daemonset-hj9gk   0/1       MatchNodeSelector   0         0s
hello-daemonset-hj9gk   0/1       Terminating         0         0s
hello-daemonset-hj9gk   0/1       Terminating         0         0s
hello-daemonset-rqfx9   0/1       Pending             0         0s
hello-daemonset-rqfx9   0/1       MatchNodeSelector   0         1s
hello-daemonset-rqfx9   0/1       Terminating         0         1s
hello-daemonset-rqfx9   0/1       Terminating         0         1s
hello-daemonset-b2hgm   0/1       Pending             0         0s
hello-daemonset-b2hgm   0/1       MatchNodeSelector   0         1s
hello-daemonset-b2hgm   0/1       Terminating         0         1s
hello-daemonset-b2hgm   0/1       Terminating         0         1s
hello-daemonset-phxzn   0/1       Pending             0         0s
hello-daemonset-phxzn   0/1       MatchNodeSelector   0         0s
hello-daemonset-phxzn   0/1       Terminating         0         0s
hello-daemonset-phxzn   0/1       Terminating         0         0s
hello-daemonset-jq7lb   0/1       Pending             0         0s
hello-daemonset-jq7lb   0/1       MatchNodeSelector   0         1s
hello-daemonset-jq7lb   0/1       Terminating         0         1s
hello-daemonset-jq7lb   0/1       Terminating         0         1s
hello-daemonset-v7cx7   0/1       Pending             0         0s
hello-daemonset-v7cx7   0/1       MatchNodeSelector   0         0s
hello-daemonset-v7cx7   0/1       Terminating         0         0s
hello-daemonset-v7cx7   0/1       Terminating         0         0s
hello-daemonset-2dlp7   0/1       Pending             0         0s
```
Cloned 3.9 here: https://bugzilla.redhat.com/show_bug.cgi?id=1571093

Please create a separate bug for the status. It should be desired==2.
(In reply to Tomáš Nožička from comment #27)
> Cloned 3.9 here: https://bugzilla.redhat.com/show_bug.cgi?id=1571093
>
> Please create a separate bug for the status. I should be desired==2.

Moved to verified and created a separate bug for the status: https://bugzilla.redhat.com/show_bug.cgi?id=1571111
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days