Bug 1301060 - OpenShift NodeController occasionally thinks nodes have been deleted
Status: CLOSED WONTFIX
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.0.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Paul Morie
QA Contact: DeShuai Ma
Depends On:
Blocks: 1267746
 
Reported: 2016-01-22 08:27 EST by Akram Ben Aissi
Modified: 2016-03-17 17:23 EDT
CC List: 13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-17 12:25:21 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Comment 2 Andy Goldstein 2016-01-28 13:20:11 EST
Description of problem:

Occasionally, the OpenShift master (NodeController component) thinks that nodes have been deleted even though they are up and running:

Jan 10 19:16:10 master openshift-master: I0110 19:16:10.420577   18676 nodecontroller.go:230] NodeController observed a Node deletion: node5.example.com

Pod eviction ensues.

Note the node is not permanently lost, it is later registered:

Jan 10 19:16:17 master openshift-master: I0110 19:16:17.302129   18676 nodecontroller.go:216] NodeController observed a new Node: api.Node{TypeMeta:api.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"node5.example.com", [...]

Relating logs and code, it seems that the node controller, in monitorNodeStatus(), receives a list of nodes back from the API server that is missing node(s) compared to its in-memory list from the previous round, and proceeds to consider those nodes deleted.
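
To illustrate the pattern we infer from the logs, here is a self-contained Go sketch (this is not the actual nodecontroller.go code; names and structure are ours):

package main

import "fmt"

// observedDeletions mimics the comparison we believe monitorNodeStatus()
// performs: any node present in the previous in-memory set but absent from
// the freshly listed set is treated as deleted. A partial or empty list
// from the API server is indistinguishable from real deletions here.
func observedDeletions(known map[string]bool, listed []string) []string {
    current := make(map[string]bool, len(listed))
    for _, name := range listed {
        current[name] = true
    }
    var deleted []string
    for name := range known {
        if !current[name] {
            deleted = append(deleted, name)
        }
    }
    return deleted
}

func main() {
    known := map[string]bool{
        "node1.example.com": true,
        "node5.example.com": true,
    }
    // Simulate an API response that is missing node5 (e.g. a bad/partial list):
    listed := []string{"node1.example.com"}
    for _, name := range observedDeletions(known, listed) {
        fmt.Printf("NodeController observed a Node deletion: %s\n", name)
    }
}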

This tends to indicate the node keys are no longer present in etcd, however:

- deleting a node key via etcdctl indeed leads to the node deletion sequence, but the node is subsequently unable to re-register
- deleting a node via "oc delete node" similarly leads to the node deletion sequence; however, we are quite certain we are not running this command (we don't know whether the node re-registers in that case)
- TTL on keys is 0

Note also that other experiments, e.g. changing the IP addresses of nodes, changing master/node clocks, or simulating a broken network, came nowhere near reproducing the issue.

Also, nodes typically disappear in bunches, though not all nodes of a cluster (e.g. 3 or 4 out of 5).

This is currently a major pain point on our systems: the pod rescheduling causes disturbance and breaks some apps which (unfortunately still) require a pre-defined startup order, leading to application instability and unreliability on our clusters.

Version-Release number of selected component (if applicable):
3.0.2.0


Environment:
- OpenShift 3.0.2. (@all - maybe someone would be so kind as to provide the exact version)
- OpenStack, both Red Hat and VMware VIO flavours.

How reproducible:

No reproducer found for the moment, but it has happened and still happens quite often.
Comment 4 Andy Goldstein 2016-01-28 13:38:53 EST
Akram, could you please provide more details on the environment?

- how many masters?
- how many nodes?
- any observed high cpu load when the node controller observes the deletion?
- any observed high network load when the node controller observes the deletion?
- any errors in the etcd logs?
Comment 5 Andy Goldstein 2016-01-28 14:59:32 EST
Also, is NTP enabled on the masters and nodes?
Comment 6 eric.mountain 2016-01-29 04:00:59 EST
- how many masters? 1

- how many nodes? 2 or 5 on the clusters we examined (on the 5-node cluster, 1 of the nodes was marked unschedulable due to the SDN not having set itself up correctly c.f. https://github.com/openshift/openshift-ansible/pull/1057)

- any observed high cpu load when the node controller observes the deletion?
Will try to get data.

- any observed high network load when the node controller observes the deletion?
Will try to get data.

- any errors in the etcd logs?
We use the in-process etcd.  There are no etcd-looking errors issued by openshift-master; cf. the log links in this bugzilla.

Wrt the load questions above, note that blocking packets between nodes and master with iptables did not produce the same side-effect.
Comment 7 eric.mountain 2016-01-29 05:20:35 EST
Forgot to answer the NTP part: we have occurrences on VIO w/ NTP installed, and RH OpenStack w/o NTP installed.

Note we also did experiments changing the clock on the master and didn't successfully reproduce the issue.
Comment 8 Andy Goldstein 2016-01-29 09:38:33 EST
Eric, could you please provide the exact OpenShift version? rpm -qa | grep openshift

Thanks!
Comment 9 eric.mountain 2016-01-29 10:35:55 EST
Andy, here you go:

openshift-master-3.0.2.0-0.git.32.898bb7c.el7ose.x86_64
openshift-3.0.2.0-0.git.32.898bb7c.el7ose.x86_64

Also, I have emailed you a link with cpu/network graphs.  No load observed at the time of the occurrence.

Cheers,
Eric
Comment 10 Andy Goldstein 2016-01-29 14:11:46 EST
How often/frequently do these node evictions happen?
Comment 11 eric.mountain 2016-02-01 02:57:19 EST
Here's a sampling of occurrences from one cluster:

Jan  4 12:13:35
Jan  8 00:14:57
Jan 10 19:16:10
Jan 12 08:16:56
Jan 12 15:16:57
Jan 13 13:17:22
Jan 15 11:18:13
Jan 16 02:18:21
Jan 24 00:21:42
Jan 25 17:20:55
Jan 26 23:23:05
Jan 29 01:24:02
Jan 31 08:24:46

All times UTC.
Comment 12 Andy Goldstein 2016-02-01 11:15:25 EST
And every occurrence had 'NodeController observed a Node deletion: ...' in the master log?

Is that the complete list from that cluster?
Comment 13 Timothy St. Clair 2016-02-01 11:32:45 EST
Debug-level logs (`--loglevel=4`) would probably be useful here.
Comment 15 eric.mountain 2016-02-01 12:37:54 EST
@Andy: yes to both questions.
Comment 16 Paul Morie 2016-02-02 11:47:33 EST
I established today that all the nodes in the cluster appear to be deleted within the same iteration of the node controller's loop, FWIW.
Comment 17 eric.mountain 2016-02-02 12:17:46 EST
@pmorie: based on the timings and loop frequency I concur.
Comment 18 Hirad 2016-02-04 13:17:07 EST
Hello,

We are experiencing the same issue: somehow the master thinks that nodes have been deleted. We are also using the 3.0.2 release of OpenShift.
openshift-3.0.2.0-0.git.16.4d626fe.el7ose.x86_64
openshift-node-3.0.2.0-0.git.16.4d626fe.el7ose.x86_64
Comment 19 Dario Minonne 2016-02-05 04:49:17 EST
Hello,

it has bitten us again.
Logs sent by email to @Paul.
Comment 20 Dario Minonne 2016-02-05 08:30:32 EST
Hello,
digging through the logs, GitHub, and bugzilla, I may have found this:

https://github.com/openshift/origin/issues/6290
which could be related (not sure...) since in our logs we can find something like:

Feb 04 21:23:58 master.example.com openshift-master[26707]: W0204 21:23:58.978939   26707 reflector.go:227] pkg/project/cache/cache.go:102: watch of *api.Namespace ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803849/1802958]) [1804848]
Feb 04 21:23:59 master.example.com openshift-master[26707]: I0204 21:23:59.646469   26707 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Feb 04 21:23:59 master.example.com openshift-master[26707]: I0204 21:23:59.646520   26707 reflector.go:304] pkg/admission/resourcequota/admission.go:59: Watch close - *api.ResourceQuota total 0 items received
Feb 04 21:23:59 master.example.com openshift-master[26707]: W0204 21:23:59.648673   26707 reflector.go:227] pkg/admission/resourcequota/admission.go:59: watch of *api.ResourceQuota ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
Feb 04 21:23:59 master.example.com openshift-master[26707]: I0204 21:23:59.740355   26707 reflector.go:304] pkg/admission/namespace/lifecycle/admission.go:107: Watch close - *api.Namespace total 0 items received
Feb 04 21:23:59 master.example.com openshift-master[26707]: W0204 21:23:59.743068   26707 reflector.go:227] pkg/admission/namespace/lifecycle/admission.go:107: watch of *api.Namespace ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
Feb 04 21:24:00 master.example.com openshift-master[26707]: I0204 21:24:00.014429   26707 reflector.go:304] pkg/admission/serviceaccount/admission.go:101: Watch close - *api.ServiceAccount total 0 items received
Feb 04 21:24:00 master.example.com openshift-master[26707]: W0204 21:24:00.017736   26707 reflector.go:227] pkg/admission/serviceaccount/admission.go:101: watch of *api.ServiceAccount ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]
Feb 04 21:24:00 master.example.com openshift-master[26707]: I0204 21:24:00.564571   26707 reflector.go:304] pkg/security/admission/admission.go:59: Watch close - *api.SecurityContextConstraints total 0 items received
Feb 04 21:24:00 master.example.com openshift-master[26707]: W0204 21:24:00.569122   26707 reflector.go:227] pkg/security/admission/admission.go:59: watch of *api.SecurityContextConstraints ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1803850/1802958]) [1804849]

My 2 cents

Dario
Comment 21 Paul Morie 2016-02-05 14:14:31 EST
Dario, the last log messages you posted are red herrings.  The watches are terminated on the server side periodically, and in older versions of OSE that produced this distressing-sounding log message.  The watches are automatically reestablished.
Comment 22 eric.mountain 2016-02-08 06:19:18 EST
Paul,

Regarding the watcher messages above, two things made us wonder if there was a connection to this issue:

- the timing, i.e. just before "node deletion observed", and within the same second or so (we have not checked whether _all_ cases show this pattern, however)

- the fact that it is related to something "going wrong" in the communication between openshift/kube and etcd, notably the complaints about out-of-date indices (though I freely admit I don't understand the precise significance of the logs in question).

The "node deletion" issue seems related to APIServer/etcd not giving the expected list of nodes back to the node controller.  Is it because:

- etcd has lost those keys?  Seems not to be the case since they reappear subsequently.
- etcd has got in a muddle and sent a bad list to the APIServer?
- APIServer provided bad input on the query to etcd and got some kind of error/partial/out-of-date-data response it did not handle properly, and sent back an empty/partial list as though it were correct to the node controller?
- all-encapsulating "something else"?
Comment 23 Paul Morie 2016-02-08 13:17:57 EST
Eric, Dario-

I have basically the exact same questions as Eric.  I had an idea today: since this version of OpenShift has a bug at high log levels, what if we use etcdctl to watch the etcd instance directly for changes to the key that stores node data, i.e.:

etcdctl watch registry/minions

This will tell us whether something is actually deleting the node record from etcd, which I think is key to understanding this issue.  I just discussed this with Dario on a conference call, and this seems like a way we can move forward while holding the version of OpenShift constant.
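
If it helps, here is a rough Go equivalent of that kind of watch, written directly against the etcd v2 HTTP API (the endpoint, port, and plain-HTTP transport below are assumptions for this sketch; a secured setup would need the same client certificates passed to etcdctl, and a production tool would track waitIndex so events are not missed between polls):

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Long-poll the etcd v2 keys API; each request blocks until the next
    // change anywhere under the minions directory.
    url := "http://127.0.0.1:2379/v2/keys/kubernetes.io/minions?wait=true&recursive=true"
    for {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println("watch error:", err)
            return
        }
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Println("read error:", err)
            return
        }
        // An "action":"delete" (or "expire") event here would prove that
        // something really removed a node key from etcd.
        fmt.Println(string(body))
    }
}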
Comment 24 Dario Minonne 2016-02-09 04:35:50 EST
Paul,
thanks
As discussed, I am going to do this and let you know.

Dario
Comment 26 Paul Morie 2016-02-10 10:10:03 EST
The customer made a comment which I have made private, scrubbed version here:

Paul,
If we retrieve the minion key from etcd for a node on which we have the issue, we can see a huge gap between Created-Index and Modified-Index, which confirms that these keys are not deleted from etcd and rules out something deleting the node.
Note that in OpenShift the modified index is incremented by 5 every 5 seconds, which is the heartbeat frequency for node status, and we got the issue again yesterday at Feb 09 21:21:04.

 
sudo ./etcdctl  --peers "<address of etcd peers>"  --ca-file=/etc/openshift/master/ca.crt --cert-file=/etc/openshift/master/master.etcd-client.crt  --key-file=/etc/openshift/master/master.etcd-client.key -o extended get /kubernetes.io/minions/node-name
Key: /kubernetes.io/minions/node-name
Created-Index: 405
Modified-Index: 4254610
TTL: 0
Etcd-Index: 4254610
Raft-Index: 6657294
Raft-Term: 7
Comment 27 jdmartin 2016-02-11 05:03:22 EST
Yesterday I ran 2 scripts in the background: one checking etcd with the "etcdctl" command, and the other with "oc get nodes".
If some node or some key is deleted, the scripts log a timestamp and a list of minions or a list of nodes when they detect an update.
The scripts haven't logged anything from yesterday 18:00 (UTC+1) until today 10:30 (UTC+1).
I also checked the openshift-master logs for "NodeController observed a Node deletion" and nothing was logged.
Comment 30 Robert Binkhorst 2016-02-12 04:58:42 EST
Note to our customers currently collaborating on this bugzilla: please be aware that the support tool to engage Red Hat Support is through https://access.redhat.com/. Please login there and open a support case to benefit from responses within SLA and the collaborative processes within support and escalation options this official channel provides.
Comment 31 Dario Minonne 2016-02-12 11:31:49 EST
(In reply to Robert Binkhorst from comment #30)
> Note to our customers currently collaborating on this bugzilla: please be
> aware that the support tool to engage Red Hat Support is through
> https://access.redhat.com/. Please login there and open a support case to
> benefit from responses within SLA and the collaborative processes within
> support and escalation options this official channel provides.

Thanks for the info. In the future we will open the issue directly at access.redhat.com.
Comment 33 Paul Morie 2016-03-15 11:53:42 EDT
We were not able to reproduce this anywhere and haven't encountered the bug since February. Is it okay to close this one for now?
Comment 34 eric.mountain 2016-03-17 11:46:16 EDT
@pmorie: given the code change wrt error handling in nodecontroller.go, we clearly won't get the case of nodes being observed deleted and pods being evicted any longer.  What we now see is the "Error monitoring node status" log, since the error is being caught correctly.  I agree we should close this bugzilla, since we also have a case open and there is little point tracking the issue in two places.  I will try to follow up on the remaining error cases and determine whether they are OpenShift or IaaS related by the middle of next week at the latest, and follow up in the support case.
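
For the record, a minimal sketch of the behavioural difference described above (this is not the actual fix; names are hypothetical): a failed list is now reported and the round is skipped, rather than an empty or partial result being compared against the previous node set.

package main

import (
    "errors"
    "fmt"
)

// monitorOnce sketches the fixed behaviour: if listing nodes fails, the
// error is logged and the previous in-memory set is kept, so a failed or
// partial list no longer looks like a mass node deletion.
func monitorOnce(list func() ([]string, error), known map[string]bool) {
    nodes, err := list()
    if err != nil {
        fmt.Printf("Error monitoring node status: %v\n", err)
        return // keep the previous known set; do not evict anything
    }
    current := make(map[string]bool, len(nodes))
    for _, name := range nodes {
        current[name] = true
    }
    for name := range known {
        if !current[name] {
            fmt.Printf("NodeController observed a Node deletion: %s\n", name)
        }
    }
    // (the real controller would now replace its in-memory set with current)
}

func main() {
    known := map[string]bool{"node5.example.com": true}
    failingList := func() ([]string, error) { return nil, errors.New("etcd list failed") }
    monitorOnce(failingList, known) // logs the error; node5 is not treated as deleted
}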

Thanks,
Eric
Comment 35 Paul Morie 2016-03-17 12:25:21 EDT
Thanks, Eric!
