| Summary: | OpenShift NodeController occasionally thinks nodes have been deleted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Akram Ben Aissi <abenaiss> |
| Component: | Node | Assignee: | Paul Morie <pmorie> |
| Status: | CLOSED WONTFIX | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.0.0 | CC: | abenaiss, agoldste, aos-bugs, eric.mountain, hirad, jdmartin, jkhelil, jokerman, mmccomas, pep, rbinkhor, salvatore-dario.minonne, tstclair |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-03-17 16:25:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1267746 | | |
Akram, could you please provide more details on the environment?

- how many masters?
- how many nodes?
- any observed high CPU load when the node controller observes the deletion?
- any observed high network load when the node controller observes the deletion?
- any errors in the etcd logs?

Also, is NTP enabled on the masters and nodes?

- how many masters? 1
- how many nodes? 2 or 5 on the clusters we examined (on the 5-node cluster, 1 of the nodes was marked unschedulable due to the SDN not having set itself up correctly, cf. https://github.com/openshift/openshift-ansible/pull/1057)
- any observed high CPU load when the node controller observes the deletion? Will try to get data.
- any observed high network load when the node controller observes the deletion? Will try to get data.
- any errors in the etcd logs? We use the in-process etcd. There are no etcd-looking errors issued by openshift-master; cf. the log links in this bugzilla.

Wrt the load questions above, note that blocking packets between nodes and master with iptables did not produce the same side effect.

Forgot to answer the NTP part: we have occurrences on VIO with NTP installed, and on RH OpenStack without NTP installed. Note we also did experiments changing the clock on the master and did not successfully reproduce the issue.

Eric, could you please provide the exact OpenShift version? `rpm -qa | grep openshift` Thanks!

Andy, here you go:

```
openshift-master-3.0.2.0-0.git.32.898bb7c.el7ose.x86_64
openshift-3.0.2.0-0.git.32.898bb7c.el7ose.x86_64
```

Also, I have emailed you a link with CPU/network graphs. No load was observed at the time of the occurrence. Cheers, Eric

How often/frequently do these node evictions happen?

Here's a sampling of occurrences from one cluster (all times UTC):

```
Jan  4 12:13:35
Jan  8 00:14:57
Jan 10 19:16:10
Jan 12 08:16:56
Jan 12 15:16:57
Jan 13 13:17:22
Jan 15 11:18:13
Jan 16 02:18:21
Jan 24 00:21:42
Jan 25 17:20:55
Jan 26 23:23:05
Jan 29 01:24:02
Jan 31 08:24:46
```
And every occurrence had 'NodeController observed a Node deletion: ...' in the master log? Is that the complete list from that cluster? Debug-level logs (`--loglevel=4`) would probably be useful here.

@Andy: yes to both questions. I established today that all the nodes in the cluster appear to be deleted within the same iteration of the node controller's loop, FWIW.

@pmorie: based on the timings and loop frequency, I concur.

Hello, we are experiencing the same issue: somehow the master thinks that nodes are deleted. We are also using the 3.0.2 release of OpenShift:

```
openshift-3.0.2.0-0.git.16.4d626fe.el7ose.x86_64
openshift-node-3.0.2.0-0.git.16.4d626fe.el7ose.x86_64
```

Hello, it has bitten again. Logs by email. @Paul

Hello, digging through the logs, GitHub, and bugzilla, I found this: https://github.com/openshift/origin/issues/6290, which could be related (not sure...), since in our logs we can find something like:

```
Feb 04 21:23:58 master.example.com openshift-master[26707]: W0204 21:23:58.978939 26707 reflector.go:227] pkg/project/cache/cache.go:102: watch of *api.Namespace ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803849/1802958]) [1804848]
Feb 04 21:23:59 master.example.com openshift-master[26707]: I0204 21:23:59.646469 26707 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Feb 04 21:23:59 master.example.com openshift-master[26707]: I0204 21:23:59.646520 26707 reflector.go:304] pkg/admission/resourcequota/admission.go:59: Watch close - *api.ResourceQuota total 0 items received
Feb 04 21:23:59 master.example.com openshift-master[26707]: W0204 21:23:59.648673 26707 reflector.go:227] pkg/admission/resourcequota/admission.go:59: watch of *api.ResourceQuota ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
Feb 04 21:23:59 master.example.com openshift-master[26707]: I0204 21:23:59.740355 26707 reflector.go:304] pkg/admission/namespace/lifecycle/admission.go:107: Watch close - *api.Namespace total 0 items received
Feb 04 21:23:59 master.example.com openshift-master[26707]: W0204 21:23:59.743068 26707 reflector.go:227] pkg/admission/namespace/lifecycle/admission.go:107: watch of *api.Namespace ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
Feb 04 21:24:00 master.example.com openshift-master[26707]: I0204 21:24:00.014429 26707 reflector.go:304] pkg/admission/serviceaccount/admission.go:101: Watch close - *api.ServiceAccount total 0 items received
Feb 04 21:24:00 master.example.com openshift-master[26707]: W0204 21:24:00.017736 26707 reflector.go:227] pkg/admission/serviceaccount/admission.go:101: watch of *api.ServiceAccount ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
Feb 04 21:24:00 master.example.com openshift-master[26707]: I0204 21:24:00.564571 26707 reflector.go:304] pkg/security/admission/admission.go:59: Watch close - *api.SecurityContextConstraints total 0 items received
Feb 04 21:24:00 master.example.com openshift-master[26707]: W0204 21:24:00.569122 26707 reflector.go:227] pkg/security/admission/admission.go:59: watch of *api.SecurityContextConstraints ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
```

My 2 cents, Dario

Dario, the last log messages you posted are red herrings. The watches terminate on the server side on a period, and in older versions of OSE this gave you a distressing-sounding log message. The watches are automatically reestablished.

Paul, regarding the watcher messages above, two things made us wonder if there was a connection to this issue:

- the timing, i.e. just before "node deletion observed", and within the same second or so (not checked whether _all_ cases show this pattern, however)
- the fact that it is related to something "going wrong" in the communication between openshift/kube and etcd, and notably the complaints about out-of-date indices (though I admit freely I don't understand the precise significance of the logs in question).

The "node deletion" issue seems related to the API server/etcd not giving the expected list of nodes back to the node controller. Is it because:

- etcd has lost those keys? Seems not to be the case, since they reappear subsequently.
- etcd has got in a muddle and sent a bad list to the API server?
- the API server provided bad input on the query to etcd, got some kind of error/partial/out-of-date-data response it did not handle properly, and sent an empty/partial list back to the node controller as though it were correct?
- an all-encapsulating "something else"?

Eric, Dario: I have basically the exact same questions as Eric. I had an idea today: since this version of OpenShift has a bug at high log levels, what if we use etcdctl to watch the etcd instance directly for changes to the key that stores node data, i.e.:

```
etcdctl watch registry/minions
```

This will tell us whether something is actually deleting the node record from etcd, which I think is key to understanding this issue. I just discussed this with Dario on a conference call and this seems like a way we can move forward while holding the version of OpenShift constant.
Paul, thanks. As discussed, going to do and let you know. Dario

The customer made a comment which I have made private; scrubbed version here:

Paul, if we retrieve the minion key from etcd for a node on which we have the issue, we can see the huge gap between Created-Index and Modified-Index, leading us to confirm that these keys are not deleted from etcd and ruling out that something is deleting the node. Notice that in OpenShift the modified index is incremented by 5 every 5 seconds, which matches the heartbeat frequency for node status. We got the issue again yesterday at Feb 09 21:21:04.

```
sudo ./etcdctl --peers "<address of etcd peers>" \
  --ca-file=/etc/openshift/master/ca.crt \
  --cert-file=/etc/openshift/master/master.etcd-client.crt \
  --key-file=/etc/openshift/master/master.etcd-client.key \
  -o extended get /kubernetes.io/minions/node-name

Key: /kubernetes.io/minions/node-name
Created-Index: 405
Modified-Index: 4254610
TTL: 0
Etcd-Index: 4254610
Raft-Index: 6657294
Raft-Term: 7
```

Yesterday I ran 2 scripts in the background: one checking etcd with the `etcdctl` command, the other with `oc get nodes`. If some node or some key is deleted, the scripts log a timestamp plus the list of minions or the list of nodes whenever they detect an update. The scripts didn't log anything from yesterday 18:00 (UTC+1) until today 10:30 (UTC+1). I also checked the openshift-master logs for "NodeController observed a Node deletion" and it was not logged.

Note to our customers currently collaborating on this bugzilla: please be aware that the support tool to engage Red Hat Support is https://access.redhat.com/. Please log in there and open a support case to benefit from responses within SLA and the collaborative processes within support and the escalation options this official channel provides.

(In reply to Robert Binkhorst from comment #30)
> Note to our customers currently collaborating on this bugzilla: please be
> aware that the support tool to engage Red Hat Support is through
> https://access.redhat.com/. Please login there and open a support case to
> benefit from responses within SLA and the collaborative processes within
> support and escalation options this official channel provides.

Thanks for the info; in the future we're going to open issues directly at access.redhat.com.

We were not able to reproduce this anywhere and haven't encountered the bug since February; is it okay to close this one for now?

@pmorie: given the code change wrt error handling in nodecontroller.go, clearly we won't get the case of nodes being observed deleted and pods being evicted any longer. What we now see is the "Error monitoring node status" log, since the error is being caught correctly. I agree we close this bugzilla, since we also have a case open and there is little point tracking the issue in two places. I will try to follow up on the remaining error cases and determine whether they are OpenShift or IaaS related by mid-next-week at the latest, and follow up in the support case. Thanks, Eric

Thanks, Eric!
Description of problem:

Occasionally, the OpenShift master (NodeController component) thinks that nodes have been deleted even though they are up and running:

```
Jan 10 19:16:10 master openshift-master: I0110 19:16:10.420577 18676 nodecontroller.go:230] NodeController observed a Node deletion: node5.example.com
```

Pod eviction ensues. Note the node is not permanently lost; it is later re-registered:

```
Jan 10 19:16:17 master openshift-master: I0110 19:16:17.302129 18676 nodecontroller.go:216] NodeController observed a new Node: api.Node{TypeMeta:api.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"node5.example.com", [...]
```

Relating the logs to the code, it seems that the node controller, in monitorNodeStatus(), receives a list of nodes back from the API server that is missing node(s) compared to its in-memory list from the previous round, and proceeds to consider those nodes deleted. This would normally indicate the node keys are no longer present in etcd; however:

- deleting a node key via etcdctl does lead to the node deletion sequence, but the node is subsequently unable to re-register
- deleting a node via `oc delete node` similarly leads to the node deletion sequence, but we are quite certain we are not running this command (we don't know whether the node re-registers in that case)
- the TTL on the keys is 0

Note also that other experiments, e.g. changing the IP addresses of nodes, changing master/node clocks, and simulating a broken network, came nowhere near reproducing the issue. Also, nodes typically disappear in bunches, though not all nodes of a cluster (e.g. 3 or 4 out of 5).

This is currently a major pain point on our systems, since the pod rescheduling causes disturbance and breaks some apps which (unfortunately still) require a pre-defined startup order, leading to application instability and unreliability on our clusters.

Version-Release number of selected component (if applicable): 3.0.2.0

Environment:
- OpenShift 3.0.2 (@all: maybe someone would be so kind as to provide the exact version)
- OpenStack, both Red Hat and VMware VIO flavours

How reproducible: No reproducer found for the moment, but it happened and still happens quite often.
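The failure mode described in this report can be illustrated with a minimal sketch. The function and variable names below are hypothetical, not the actual nodecontroller.go code; the point is only to show why a partial or empty node list, if passed through unchecked, looks exactly like a batch of deletions to a known-vs-current diff.

```go
package main

import "fmt"

// diffNodes compares the controller's in-memory view of nodes against a
// freshly listed set and returns the names it would consider deleted.
// Any node absent from the fresh list is treated as an observed deletion,
// regardless of why it is absent.
func diffNodes(known, current []string) []string {
	currentSet := make(map[string]bool, len(current))
	for _, n := range current {
		currentSet[n] = true
	}
	var deleted []string
	for _, n := range known {
		if !currentSet[n] {
			deleted = append(deleted, n)
		}
	}
	return deleted
}

func main() {
	known := []string{"node1", "node2", "node3", "node4", "node5"}
	// Simulate the API server returning a partial list, e.g. because an
	// error or stale etcd response was not caught upstream.
	partial := []string{"node2", "node4"}
	for _, n := range diffNodes(known, partial) {
		fmt.Printf("NodeController observed a Node deletion: %s\n", n)
	}
}
```

This matches the symptom above: several nodes "disappear" in one iteration of the loop, then re-register on the next successful list, which is consistent with a bad list response rather than actual key deletions in etcd.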