Bug 1584995 - After deleting the node on the master host, the master API and controller restart continuously
Summary: After deleting the node on the master host, the master API and controller restart continuously
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-01 06:21 UTC by zhou ying
Modified: 2018-07-30 19:17 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:16:54 UTC
Target Upstream Version:


Attachments
logs from the api pod (59.44 KB, text/plain) - 2018-06-01 06:21 UTC, zhou ying
logs from etcd pod (6.62 KB, text/plain) - 2018-06-01 09:39 UTC, zhou ying
logs from node (200.94 KB, text/plain) - 2018-06-04 07:30 UTC, zhou ying
post delete yaml file (4.12 KB, text/plain) - 2018-06-04 07:30 UTC, zhou ying
pre deletion node yaml file (4.14 KB, text/plain) - 2018-06-04 07:31 UTC, zhou ying


Links
System ID                                Last Updated
Red Hat Product Errata RHBA-2018:1816    2018-07-30 19:17:31 UTC
Origin (Github) 19987                    2018-06-13 13:39:57 UTC
Origin (Github) 19992                    2018-06-14 12:52:30 UTC

Description zhou ying 2018-06-01 06:21:31 UTC
Created attachment 1446537 [details]
logs from the api pod

Description of problem:
Delete the node object for the master host, then restart the node service and the master API service; after this, the master API and controller pods restart continuously.

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.56.0


How reproducible:
always

Steps to Reproduce:
1. Delete the node object for the master host:
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    18m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   15m       v1.10.0+b81c8f8
[root@qe-yinzhou-master-etcd-1 ~]# oc delete node qe-yinzhou-master-etcd-1
node "qe-yinzhou-master-etcd-1" deleted

2. Restart the node service:
  systemctl restart atomic-openshift-node.service
3. Restart the master API service.

Actual results:
3. The master API and controller pods restart continuously.
[root@qe-yinzhou-master-etcd-1 system]# oc get po -n kube-system
NAME                                          READY     STATUS    RESTARTS   AGE
master-api-qe-yinzhou-master-etcd-1           0/1       Running   40         2h
master-controllers-qe-yinzhou-master-etcd-1   1/1       Running   20         2h
master-etcd-qe-yinzhou-master-etcd-1          1/1       Running   0          2h


Expected results:
3. The master API and controller keep running normally.


Additional info:

Comment 2 Michal Fojtik 2018-06-01 08:48:20 UTC
From the logs it seems the etcd check failed; can we see the logs from the etcd container?

Comment 6 zhou ying 2018-06-01 09:39:47 UTC
Created attachment 1446608 [details]
logs from etcd pod

Comment 7 Jordan Liggitt 2018-06-02 18:48:58 UTC
the API server is complaining about etcd being unhealthy:

[+]ping ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/project.openshift.io-projectcache ok
[+]poststarthook/project.openshift.io-projectauthorizationcache ok
[+]poststarthook/security.openshift.io-bootstrapscc ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/authorization.openshift.io-bootstrapclusterroles ok
[+]poststarthook/authorization.openshift.io-ensureopenshift-infra ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-AdmissionInit ok
[+]poststarthook/openshift.io-StartInformers ok
[+]poststarthook/oauth.openshift.io-StartOAuthClientsBootstrapping ok
healthz check failed


can you provide the following info (or a pointer to an environment in this state):

* node logs from the point where it is restarted post-delete
* output of `oc get --raw /healthz/etcd` against the restarting API server
* output of `oc get nodes` after the node has been restarted

Comment 8 Jordan Liggitt 2018-06-03 00:06:03 UTC
Also, can we get the full output of the node pre-deletion and post-recreation?

It wouldn't surprise me if deletion destroys labels that ansible sets up on the master nodes.

Comment 9 Jordan Liggitt 2018-06-03 00:06:47 UTC
> the full output of the node pre-deletion and post-recreation?

meaning `oc get node <name> -o yaml`

Comment 11 zhou ying 2018-06-04 07:30:03 UTC
Created attachment 1447322 [details]
logs from node

Comment 12 zhou ying 2018-06-04 07:30:32 UTC
Created attachment 1447323 [details]
post delete yaml file

Comment 13 zhou ying 2018-06-04 07:31:01 UTC
Created attachment 1447324 [details]
pre deletion node yaml file

Comment 14 Jordan Liggitt 2018-06-04 17:27:09 UTC
result of `oc get --raw /healthz/etcd`:

Error from server (InternalError): an error on the server ("internal server error: etcd failed") has prevented the request from succeeding

Comment 15 Michal Fojtik 2018-06-04 20:32:53 UTC
I was able to stabilize the apiserver by adding:

172.16.120.46 qe-yinzhou-master-etcd-nfs-1

into /etc/hosts...

It might mean that SkyDNS or DNS resolution somehow broke after the node was removed from the API. In general, any curl request against 'qe-yinzhou-master-etcd-nfs-1' without the /etc/hosts entry takes >10s.
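
For anyone checking a similar setup, a minimal Go diagnostic (purely illustrative; the hostname is the one from this bug and nothing here is part of any fix) that times the bare-hostname lookup through the host's resolv.conf search path:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Resolve the bare machine name through the host resolver (whose search
	// path includes cluster.local) and report how long the lookup takes.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	start := time.Now()
	addrs, err := net.DefaultResolver.LookupHost(ctx, "qe-yinzhou-master-etcd-nfs-1")
	fmt.Printf("lookup took %v, addrs=%v, err=%v\n", time.Since(start), addrs, err)
}

With the stale dnsmasq rule in place this lookup is the >10s case described above; with the /etc/hosts entry (or after restarting dnsmasq) it should return almost immediately.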

Comment 16 Jordan Liggitt 2018-06-04 20:39:12 UTC
the only difference I see is that before deletion the node has these labels:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: qe-yinzhou-master-etcd-nfs-1
    node-role.kubernetes.io/master: "true"
    role: node

after recreate it has these:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: qe-yinzhou-master-etcd-nfs-1
    node-role.kubernetes.io/master: "true"

Is there anything network-related that requires or selects on the "role: node" label?

Comment 17 Michal Fojtik 2018-06-04 20:43:49 UTC
Ben, this looks like a DNS/networking issue; can somebody from the networking team investigate while the cluster is up and running?

Comment 18 Michal Fojtik 2018-06-04 20:52:02 UTC
After the apiserver on the master stabilized, the networking pods are now running on the master node. I removed the entry from /etc/hosts and name resolution now seems to work fine...

I wonder if this is just bad timing: the controller never had a ready apiserver for long enough to create the SDN/OVS pods, which crippled the networking, which led to DNS resolution issues that made the API server lag when performing the etcd health check?

Comment 19 Ben Bennett 2018-06-05 18:36:27 UTC
As you spotted, it looks like it is trying to resolve the name "qe-yinzhou-310master-etcd-nfs-1" and that is taking a long time because we send the requests into our resolver.

We set the nodes up with dnsmasq and then set the upstream resolvers as the default servers.

When atomic-openshift-node starts it sends a dbus message to dnsmasq to register the masters as endpoints for cluster.local.

Since cluster.local is in the search path, all requests for a bare machine name will go through the search path and dnsmasq will attempt to resolve them using the apiserver. But when the node breaks after it has been running, dnsmasq still has the dynamic rule set.

If you restart dnsmasq and flush it out, everything comes back cleanly.

So we need to work out some way to flush the rule when things are broken... not sure of the best way to do that. But it may be a pod-team problem, since they are the ones programming dnsmasq from the node now.
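
A minimal sketch of the registration mechanism described above, assuming the godbus client and dnsmasq's DBus SetDomainServers method (illustrative only; the real logic lives in pkg/dns/dnsmasq.go and its exact arguments may differ):

package main

import (
	"fmt"

	"github.com/godbus/dbus"
)

const dnsmasqDBusName = "uk.org.thekelleys.dnsmasq"

// setClusterLocalServers points the local dnsmasq at the given resolver for
// cluster.local (and a reverse zone). Until this dynamic rule is cleared,
// every bare hostname that falls through the cluster.local search domain
// keeps being forwarded to that resolver, even if it is unreachable.
func setClusterLocalServers(resolverIP string) error {
	conn, err := dbus.SystemBus()
	if err != nil {
		return fmt.Errorf("connect to system bus: %v", err)
	}
	obj := conn.Object(dnsmasqDBusName, dbus.ObjectPath("/uk/org/thekelleys/dnsmasq"))
	servers := []string{
		"/in-addr.arpa/" + resolverIP,
		"/cluster.local/" + resolverIP,
	}
	return obj.Call(dnsmasqDBusName+".SetDomainServers", 0, servers).Err
}

func main() {
	if err := setClusterLocalServers("172.30.0.1"); err != nil {
		fmt.Println("dnsmasq registration failed:", err)
	}
}

The key point for this bug is that the rule is dynamic state inside dnsmasq: nothing re-derives it, so nothing removes it when the node that installed it stops working.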

Comment 20 Jordan Liggitt 2018-06-05 19:02:56 UTC
> We set the nodes up with dnsmasq and then set the upstream resolvers as the default servers.
> 
> When atomic-openshift-node starts it sends a dbus message to dnsmasq to register the masters as endpoints for cluster.local.
> Since cluster.local is in the search path, all requests for a bare machine name will go through the search path and dnsmasq will attempt to resolve them using the apiserver. But when the node breaks after it has been running, dnsmasq still has the dynamic rule set.


is that an issue on HA clusters or only on a single machine like this?

Comment 21 Jordan Liggitt 2018-06-06 01:40:38 UTC
> When atomic-openshift-node starts it sends a dbus message to dnsmasq to register the masters as endpoints for cluster.local.
> 
> Since cluster.local is in the search path, all requests for a bare machine name will go through the search path and dnsmasq will attempt to resolve them using the apiserver. But when the node breaks after it has been running, dnsmasq still has the dynamic rule set.
> 
> If you restart dnsmasq and flush it out, everything comes back cleanly.
> 
> So, we need to work out some way to flush the rule when things are broken... not sure the best way to do that.  But it may be a pod team problem since they are programming dnsmasq now from the node.

seth, any insight on whether this can be improved, and on whether this is only an issue in single-master installations?

Comment 22 Seth Jennings 2018-06-06 15:52:55 UTC
It is not the kubelet that does this registration with dnsmasq. It is the SDN DS pods, I would guess (https://github.com/openshift/origin/blob/master/pkg/dns/dnsmasq.go).

Not sure what is meant by "node has been broken".

From the description:
Delete the node on master host, then restart the node service, restart the api service, the master api and controller will always restart.

Does "delete the node on the master host" mean "systemctl stop atomic-openshift-node.service on the master" or "deleting the master Node from the apiserver"?

Is the problem that the SDN pod goes down on the master, disconnecting it from the cluster network where the on-cluster DNS resolver is running?

Comment 23 zhou ying 2018-06-07 03:14:46 UTC
@Seth 

delete the node on the master with command:
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    18m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   15m       v1.10.0+b81c8f8
[root@qe-yinzhou-master-etcd-1 ~]# oc delete node qe-yinzhou-master-etcd-1
node "qe-yinzhou-master-etcd-1" deleted

Comment 24 Hongan Li 2018-06-07 05:31:59 UTC
(In reply to Seth Jennings from comment #22)
> It is not the kubelet that does this registration with dnsmasq.  It is the
> SDN DS pods I would guess
> (https://github.com/openshift/origin/blob/master/pkg/dns/dnsmasq.go).
> 
> Not sure what is meant by "node has been broken".
> 
> From the description:
> Delete the node on master host, then restart the node service, restart the
> api service, the master api and controller will always restart.
> 
> Does "delete the node on the master host" mean "systemctl stop
> atomic-openshift-node.service on the master" or "deleting the master Node
> from the apiserver"?
> 
> Is the problem that the SDN pod goes down on the master, disconnecting it
> from the cluster network where the on-cluster DNS resolver is running?

After executing the command "oc delete node qe-yinzhou-master-etcd-1", only the containers below are running; the SDN pod on the master is terminated.

[root@qe-yinzhou-master-etcd-nfs-1 ~]# docker ps
CONTAINER ID        IMAGE                                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
f9cebe33c096        49ab4fcaa92e                                                           "/bin/bash -c '#!/..."   2 days ago          Up 2 days                               k8s_controllers_master-controllers-qe-yinzhou-master-etcd-nfs-1_kube-system_1fc0a155ad7dfe664863006577d4d60d_109
cc7242a85cbe        49ab4fcaa92e                                                           "/bin/bash -c '#!/..."   2 days ago          Up 2 days                               k8s_api_master-api-qe-yinzhou-master-etcd-nfs-1_kube-system_f8ddfa563d27a7f9ca95644fe2f8cdd0_223
7fd1a5572bc9        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.58.0   "/usr/bin/pod"           2 days ago          Up 2 days                               k8s_POD_master-api-qe-yinzhou-master-etcd-nfs-1_kube-system_f8ddfa563d27a7f9ca95644fe2f8cdd0_0
d1de465d55f8        4f35b6516d22                                                           "/bin/sh -c '#!/bi..."   3 days ago          Up 3 days                               k8s_etcd_master-etcd-qe-yinzhou-master-etcd-nfs-1_kube-system_1e016e57a10177673720c8d3321f3c8c_0
c25f5e28b003        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.58.0   "/usr/bin/pod"           3 days ago          Up 3 days                               k8s_POD_master-controllers-qe-yinzhou-master-etcd-nfs-1_kube-system_1fc0a155ad7dfe664863006577d4d60d_0
83081c6d8653        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.58.0   "/usr/bin/pod"           3 days ago          Up 3 days                               k8s_POD_master-etcd-qe-yinzhou-master-etcd-nfs-1_kube-system_1e016e57a10177673720c8d3321f3c8c_0

Comment 25 Seth Jennings 2018-06-07 13:58:37 UTC
It seems the fix should be that when the SDN pod (or whatever sends the dbus messages to dnsmasq) goes down, it cleans up the dnsmasq configuration it set up.

This is chicken-egg city:

The master kubelet comes up and tries to self-register its Node with the master, but it can't, because the master static pod isn't running, because it can't resolve etcd, because the SDN DS pod is not running, because the master Node doesn't exist, because the master kubelet can't self-register.

Comment 26 Jordan Liggitt 2018-06-12 15:06:08 UTC
The component registering with dnsmasq is the SDN pod; moving to the network team.

Talked through the intended flow with Clayton and came up with some additional avenues of investigation:

questions:
* does deleting the Node restart static pods? (it shouldn't)
* does the master register itself with dnsmasq (it shouldn't)
* what dns policy is the static apiserver pod using?

possible mitigations:
* config: can we fully qualify etcd hostnames in config to avoid lookup? (see if trailing '.' plumbs through and works properly)
* apiserver: should we change the etcd health check from a dial to an active client connection? (cf. https://github.com/kubernetes/kubernetes/issues/64909; a sketch of an active check follows this list)
* sdn: can the sdn unregister with dnsmasq on clean shutdown?
* sdn: can the sdn unregister with dnsmasq on bringup pre-data?
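
On the health-check mitigation, a minimal sketch (assuming the etcd clientv3 library; this is not the apiserver's actual check, and TLS/auth config is omitted) of what an active client check looks like compared to a plain dial:

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// checkEtcd issues a real request against etcd instead of only dialing, so an
// endpoint that accepts connections but cannot serve (or one stuck behind
// slow DNS) is reported unhealthy within a bounded time.
func checkEtcd(endpoints []string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	// Reading an arbitrary key exercises the full request path.
	_, err = cli.Get(ctx, "health")
	return err
}

func main() {
	if err := checkEtcd([]string{"https://qe-yinzhou-master-etcd-1:2379"}); err != nil {
		fmt.Println("etcd health check failed:", err)
	}
}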

Comment 27 Clayton Coleman 2018-06-12 16:31:51 UTC
I don't think trailing . works.

We can set the timeout much shorter for local dnsmasq -> local kube-proxy.  200ms should be enough.  Still looking at other options.

Comment 28 Ben Bennett 2018-06-12 17:06:17 UTC
SDN^h Clayton:
- Will make the SDN pod register with dnsmasq only after it has synced the rules
- Will make the SDN pod unregister with dnsmasq when it gets a term signal
- Will make the SDN pod unregister with dnsmasq when starting up (register/unregister lifecycle sketched below)

Master team will:
- Fix the etcd health check to make it longer
- Change the manifest to increase the times a little
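
A minimal sketch of the register/unregister lifecycle described above (the two helpers are assumed stand-ins for the dnsmasq DBus calls, not the actual SDN code):

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// Assumed stand-in: clear any cluster.local rule previously pushed to dnsmasq.
func unregisterFromDnsmasq() { fmt.Println("dnsmasq rule for cluster.local cleared") }

// Assumed stand-in: push the cluster.local rule to dnsmasq.
func registerWithDnsmasq() { fmt.Println("dnsmasq rule for cluster.local set") }

func main() {
	// Unregister on startup, before the data path is known to be healthy.
	unregisterFromDnsmasq()

	// ... sync SDN rules here; only register once they are in place ...
	registerWithDnsmasq()

	// Unregister again on a clean shutdown so a stale rule never outlives
	// a working SDN pod.
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM, syscall.SIGINT)
	<-term
	unregisterFromDnsmasq()
}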

Comment 29 Clayton Coleman 2018-06-12 21:00:44 UTC
https://github.com/openshift/origin/pull/19987 is the DNSmasq config.

Comment 30 Ben Bennett 2018-06-14 12:46:33 UTC
Jordan, do you want me to assign this to the Master team since the SDN side is now posted and approved?

Comment 31 Ben Bennett 2018-06-14 12:52:12 UTC
Jordan: Sorry, I just saw https://github.com/openshift/origin/pull/19992 which looks like it addresses the Master team's tasks.  Is that right?  Thanks

Comment 32 Jordan Liggitt 2018-06-14 14:09:53 UTC
> Sorry, I just saw https://github.com/openshift/origin/pull/19992 which looks like it addresses the Master team's tasks.  Is that right?  Thanks

yes, that addresses the issue with the etcd healthz check

Comment 34 zhou ying 2018-06-19 03:32:37 UTC
Confirmed with the openshift version below; can't reproduce:
openshift version
openshift v3.10.1

[root@qe-yinzhou-master-etcd-1 ~]# oc get po -n kube-system
NAME                                          READY     STATUS    RESTARTS   AGE
master-api-qe-yinzhou-master-etcd-1           1/1       Running   0          13m
master-controllers-qe-yinzhou-master-etcd-1   1/1       Running   0          13m
master-etcd-qe-yinzhou-master-etcd-1          1/1       Running   0          13m
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    13m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   37m       v1.10.0+b81c8f8
[root@qe-yinzhou-master-etcd-1 ~]# oc delete node qe-yinzhou-master-etcd-1
node "qe-yinzhou-master-etcd-1" deleted


[root@qe-yinzhou-master-etcd-1 ~]# systemctl restart  atomic-openshift-node.service
[root@qe-yinzhou-master-etcd-1 ~]# oc get po -n kube-system
NAME                                          READY     STATUS    RESTARTS   AGE
master-api-qe-yinzhou-master-etcd-1           1/1       Running   0          39m
master-controllers-qe-yinzhou-master-etcd-1   1/1       Running   0          39m
master-etcd-qe-yinzhou-master-etcd-1          1/1       Running   0          39m
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    39m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   1h        v1.10.0+b81c8f8

Comment 36 errata-xmlrpc 2018-07-30 19:16:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

