Created attachment 1446537 [details]
logs from the api pod

Description of problem:
Delete the node on the master host, then restart the node service and restart the api service; the master api and controller restart continually.

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.56.0

How reproducible:
always

Steps to Reproduce:
1. Delete the node on the master host:
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    18m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   15m       v1.10.0+b81c8f8
[root@qe-yinzhou-master-etcd-1 ~]# oc delete node qe-yinzhou-master-etcd-1
node "qe-yinzhou-master-etcd-1" deleted
2. Restart the node service:
systemctl restart atomic-openshift-node.service
3. Restart the master api service

Actual results:
3. The master api and controller restart continually.
[root@qe-yinzhou-master-etcd-1 system]# oc get po -n kube-system
NAME                                          READY     STATUS    RESTARTS   AGE
master-api-qe-yinzhou-master-etcd-1           0/1       Running   40         2h
master-controllers-qe-yinzhou-master-etcd-1   1/1       Running   20         2h
master-etcd-qe-yinzhou-master-etcd-1          1/1       Running   0          2h

Expected results:
3. The master api and controller work well.

Additional info:
From the logs it seems it failed the etcd check; can we see the logs from the etcd container?
Created attachment 1446608 [details] logs from etcd pod
the API server is complaining about etcd being unhealthy:

[+]ping ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/project.openshift.io-projectcache ok
[+]poststarthook/project.openshift.io-projectauthorizationcache ok
[+]poststarthook/security.openshift.io-bootstrapscc ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/authorization.openshift.io-bootstrapclusterroles ok
[+]poststarthook/authorization.openshift.io-ensureopenshift-infra ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-AdmissionInit ok
[+]poststarthook/openshift.io-StartInformers ok
[+]poststarthook/oauth.openshift.io-StartOAuthClientsBootstrapping ok
healthz check failed

can you provide the following info (or a pointer to an environment in this state):
* node logs from the point where it is restarted post-delete
* output of `oc get --raw /healthz/etcd` against the restarting API server
* output of `oc get nodes` after the node has been restarted
Also, can we get the full output of the node pre-deletion and post-recreation? It wouldn't surprise me if deletion destroys labels that ansible sets up on the master nodes.
> the full output of the node pre-deletion and post-recreation? meaning `oc get node <name> -o yaml`
Created attachment 1447322 [details] logs from node
Created attachment 1447323 [details] post delete yaml file
Created attachment 1447324 [details] pre deletion node yaml file
result of `oc get --raw /healthz/etcd`:

Error from server (InternalError): an error on the server ("internal server error: etcd failed") has prevented the request from succeeding
I was able to stabilize the apiserver by adding:

172.16.120.46 qe-yinzhou-master-etcd-nfs-1

into /etc/hosts... It might mean that SkyDNS or DNS resolution somehow broke after the node was removed from the API. In general, any curl request against 'qe-yinzhou-master-etcd-nfs-1' without the /etc/hosts entry takes >10s.
the only difference I see is that before deletion the node has these labels:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: qe-yinzhou-master-etcd-nfs-1
    node-role.kubernetes.io/master: "true"
    role: node

after recreation it has these:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: qe-yinzhou-master-etcd-nfs-1
    node-role.kubernetes.io/master: "true"

Is there anything network-related that requires or selects on the "role: node" label?
Ben, this looks like a DNS/networking issue, can somebody from the networking team investigate while the cluster is up and running?
After the apiserver on the master stabilized, the networking pods are now running on the master node. I removed the entry from /etc/hosts and resolution now seems to work fine... I wonder if this is just bad timing: the controller never got enough time, or a "ready" apiserver, to create the SDN/OVS pods, which crippled the networking, which resulted in DNS resolution issues that caused the apiserver to lag when performing the etcd health check?
As you spotted, it looks like it is trying to resolve the name "qe-yinzhou-310master-etcd-nfs-1", and that is taking a long time because we send the requests into our resolver.

We set the nodes up with dnsmasq and then set the upstream resolvers as the default servers.

When atomic-openshift-node starts, it sends a dbus message to dnsmasq to register the masters as endpoints for cluster.local. Since cluster.local is in the search path, all requests for a bare machine name will go through the search path, and dnsmasq will attempt to resolve them using the apiserver. But when the node breaks after it has been running, dnsmasq still has the dynamic rule set.

If you restart dnsmasq and flush it out, everything comes back cleanly.

So, we need to work out some way to flush the rule when things are broken... not sure of the best way to do that. But it may be a pod team problem, since they are programming dnsmasq now from the node.
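The dynamic rule being described is equivalent to a dnsmasq `server=` directive for the cluster domain. A minimal illustrative fragment (the node actually pushes these rules over dbus at runtime rather than via a static file, and the 172.30.0.1 address is an assumption based on the default OpenShift 3.x service network, not taken from this cluster):

```
# Forward cluster-domain lookups to the kube DNS service endpoint.
# While this rule is present, any bare hostname that falls through the
# cluster.local search domain is sent to the apiserver-backed resolver;
# that is exactly the path that hangs once the master is broken.
server=/cluster.local/172.30.0.1
server=/in-addr.arpa/172.30.0.1
```

Restarting dnsmasq drops the dynamically pushed rule, which is why a restart brings resolution back cleanly.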
> We set the nodes up with dnsmasq and then set the upstream resolvers as the default servers.
>
> When atomic-openshift-node starts it sends a dbus message to dnsmasq to register the masters as endpoints for cluster.local.
> Since cluster.local is in the search path, all requests for a bare machine name will go through the search path and dnsmasq will attempt to resolve them using the apiserver. But when the node has been broken after it was running dnsmasq still has the dynamic rule set.

is that an issue on HA clusters, or only on a single machine like this?
> When atomic-openshift-node starts it sends a dbus message to dnsmasq to register the masters as endpoints for cluster.local.
>
> Since cluster.local is in the search path, all requests for a bare machine name will go through the search path and dnsmasq will attempt to resolve them using the apiserver. But when the node has been broken after it was running dnsmasq still has the dynamic rule set.
>
> If you restart dnsmasq and flush it out, everything comes back cleanly.
>
> So, we need to work out some way to flush the rule when things are broken... not sure the best way to do that. But it may be a pod team problem since they are programming dnsmasq now from the node.

Seth, any insight on whether this can be improved, and on whether this is only an issue in single-master installations?
It is not the kubelet that does this registration with dnsmasq. It is the SDN DS pods, I would guess (https://github.com/openshift/origin/blob/master/pkg/dns/dnsmasq.go).

Not sure what is meant by "node has been broken".

From the description:
Delete the node on master host, then restart the node service, restart the api service, the master api and controller will always restart.

Does "delete the node on the master host" mean "systemctl stop atomic-openshift-node.service on the master" or "deleting the master Node from the apiserver"?

Is the problem that the SDN pod goes down on the master, disconnecting it from the cluster network where the on-cluster DNS resolver is running?
@Seth delete the node on the master with this command:

[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    18m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   15m       v1.10.0+b81c8f8
[root@qe-yinzhou-master-etcd-1 ~]# oc delete node qe-yinzhou-master-etcd-1
node "qe-yinzhou-master-etcd-1" deleted
(In reply to Seth Jennings from comment #22)
> It is not the kubelet that does this registration with dnsmasq. It is the
> SDN DS pods I would guess
> (https://github.com/openshift/origin/blob/master/pkg/dns/dnsmasq.go).
>
> Not sure what is meant by "node has been broken".
>
> From the description:
> Delete the node on master host, then restart the node service, restart the
> api service, the master api and controller will always restart.
>
> Does "delete the node on the master host" mean "systemctl stop
> atomic-openshift-node.service on the master" or "deleting the master Node
> from the apiserver"?
>
> Is the problem that the SDN pod goes down on the master, disconnecting it
> from the cluster network where the on-cluster DNS resolver is running?

After executing the command "oc delete node qe-yinzhou-master-etcd-1", only the containers below are running; the SDN pod on the master is terminated.

[root@qe-yinzhou-master-etcd-nfs-1 ~]# docker ps
CONTAINER ID   IMAGE                                                                  COMMAND                  CREATED      STATUS      PORTS   NAMES
f9cebe33c096   49ab4fcaa92e                                                           "/bin/bash -c '#!/..."   2 days ago   Up 2 days           k8s_controllers_master-controllers-qe-yinzhou-master-etcd-nfs-1_kube-system_1fc0a155ad7dfe664863006577d4d60d_109
cc7242a85cbe   49ab4fcaa92e                                                           "/bin/bash -c '#!/..."   2 days ago   Up 2 days           k8s_api_master-api-qe-yinzhou-master-etcd-nfs-1_kube-system_f8ddfa563d27a7f9ca95644fe2f8cdd0_223
7fd1a5572bc9   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.58.0   "/usr/bin/pod"           2 days ago   Up 2 days           k8s_POD_master-api-qe-yinzhou-master-etcd-nfs-1_kube-system_f8ddfa563d27a7f9ca95644fe2f8cdd0_0
d1de465d55f8   4f35b6516d22                                                           "/bin/sh -c '#!/bi..."   3 days ago   Up 3 days           k8s_etcd_master-etcd-qe-yinzhou-master-etcd-nfs-1_kube-system_1e016e57a10177673720c8d3321f3c8c_0
c25f5e28b003   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.58.0   "/usr/bin/pod"           3 days ago   Up 3 days           k8s_POD_master-controllers-qe-yinzhou-master-etcd-nfs-1_kube-system_1fc0a155ad7dfe664863006577d4d60d_0
83081c6d8653   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.58.0   "/usr/bin/pod"           3 days ago   Up 3 days           k8s_POD_master-etcd-qe-yinzhou-master-etcd-nfs-1_kube-system_1e016e57a10177673720c8d3321f3c8c_0
It seems the fix should be that when the sdn pod goes down (or whatever sends the dbus messages to dnsmasq), it should clean up the dnsmasq configuration that it set up.

This is chicken-and-egg city: the master kubelet comes up and tries to self-register its Node with the master, but it can't, because the master static pod isn't running, because it can't resolve etcd, because the sdn DS pod is not running, because the master Node doesn't exist, because the master kubelet can't self-register.
the component registering with dnsmasq is the sdn pod, moving to the network team

talked through the intended flow with Clayton, and came up with some additional avenues of investigation:

questions:
* does deleting the Node restart static pods? (it shouldn't)
* does the master register itself with dnsmasq? (it shouldn't)
* what dns policy is the static apiserver pod using?

possible mitigations:
* config: can we fully qualify etcd hostnames in config to avoid lookup? (see if a trailing '.' plumbs through and works properly)
* apiserver: should we change the etcd health check from a dial to an active client connection? (c.f. https://github.com/kubernetes/kubernetes/issues/64909)
* sdn: can the sdn unregister with dnsmasq on clean shutdown?
* sdn: can the sdn unregister with dnsmasq on bringup pre-data?
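The "trailing dot" mitigation rests on standard stub-resolver behavior: a name ending in '.' is treated as fully qualified and bypasses the search list, so the cluster.local suffix (and hence the broken apiserver-backed resolver) is never tried. A toy Python model of that expansion, for illustration only; `candidate_names` is a made-up helper approximating glibc-style behavior, not the real resolver:

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the list of names a stub resolver would try, in order."""
    if name.endswith("."):
        # Fully qualified: tried exactly as given, no search-list expansion.
        return [name]
    tries = []
    if name.count(".") < ndots:
        # Fewer dots than ndots: search domains are tried first.
        tries += [f"{name}.{d}" for d in search_domains]
        tries.append(name)
    else:
        tries.append(name)
        tries += [f"{name}.{d}" for d in search_domains]
    return tries

if __name__ == "__main__":
    search = ["cluster.local", "example.com"]
    # Bare host name: goes through the search path, so the cluster.local
    # attempt is forwarded by dnsmasq to the (broken) apiserver endpoint.
    print(candidate_names("qe-yinzhou-master-etcd-nfs-1", search))
    # Trailing dot: resolved exactly as given, skipping cluster.local.
    print(candidate_names("qe-yinzhou-master-etcd-nfs-1.", search))
```

As the next comment notes, whether etcd config and client code actually preserve a trailing dot end to end is a separate question.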
I don't think the trailing '.' works. We can set the timeout much shorter for local dnsmasq -> local kube-proxy; 200ms should be enough. Still looking at other options.
Clayton:
- Will make the SDN pod register with dnsmasq only after it has synced the rules
- Will make the SDN pod unregister with dnsmasq when it gets a term signal
- Will make the SDN pod unregister with dnsmasq when starting up

Master team will:
- Fix the etcd health check to make it longer
- Change the manifest to increase the times a little
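The SDN-side changes amount to a register-late / unregister-early lifecycle around the dnsmasq rule. A minimal Python sketch of that shape (hypothetical: a plain dict stands in for the real dbus interface to dnsmasq, and `DnsRegistration` is an illustrative name, not actual origin code):

```python
import signal

class DnsRegistration:
    """Toy model of the agreed SDN-pod lifecycle for the cluster.local rule."""

    def __init__(self, dnsmasq_rules):
        # dnsmasq_rules maps domain -> list of upstream servers.
        self.rules = dnsmasq_rules

    def startup(self, masters):
        # Unregister first on bringup: clear any stale rule left over from a
        # previous, possibly broken, run ("unregister pre-data").
        self.rules.pop("cluster.local", None)
        self.sync_rules()
        # Register only after the SDN rules are actually synced.
        self.rules["cluster.local"] = masters
        # Unregister again on clean shutdown (SIGTERM).
        signal.signal(signal.SIGTERM, self._on_term)

    def sync_rules(self):
        pass  # stand-in for programming OVS/iptables

    def _on_term(self, signum, frame):
        self.rules.pop("cluster.local", None)

if __name__ == "__main__":
    rules = {"cluster.local": ["stale-master"]}  # leftover from a broken run
    reg = DnsRegistration(rules)
    reg.startup(["172.16.120.46"])
    print(rules)  # stale entry replaced only after sync
```

With this ordering, a node that dies before syncing never leaves a rule behind, which breaks the chicken-and-egg loop described above.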
https://github.com/openshift/origin/pull/19987 is the DNSmasq config.
Jordan, do you want me to assign this to the Master team since the SDN side is now posted and approved?
Jordan: Sorry, I just saw https://github.com/openshift/origin/pull/19992 which looks like it addresses the Master team's tasks. Is that right? Thanks
> Sorry, I just saw https://github.com/openshift/origin/pull/19992 which looks like it addresses the Master team's tasks. Is that right? Thanks yes, that addresses the issue with the etcd healthz check
Confirmed with openshift v3.10.1, can't reproduce:

[root@qe-yinzhou-master-etcd-1 ~]# oc get po -n kube-system
NAME                                          READY     STATUS    RESTARTS   AGE
master-api-qe-yinzhou-master-etcd-1           1/1       Running   0          13m
master-controllers-qe-yinzhou-master-etcd-1   1/1       Running   0          13m
master-etcd-qe-yinzhou-master-etcd-1          1/1       Running   0          13m
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    13m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   37m       v1.10.0+b81c8f8
[root@qe-yinzhou-master-etcd-1 ~]# oc delete node qe-yinzhou-master-etcd-1
node "qe-yinzhou-master-etcd-1" deleted
[root@qe-yinzhou-master-etcd-1 ~]# systemctl restart atomic-openshift-node.service
[root@qe-yinzhou-master-etcd-1 ~]# oc get po -n kube-system
NAME                                          READY     STATUS    RESTARTS   AGE
master-api-qe-yinzhou-master-etcd-1           1/1       Running   0          39m
master-controllers-qe-yinzhou-master-etcd-1   1/1       Running   0          39m
master-etcd-qe-yinzhou-master-etcd-1          1/1       Running   0          39m
[root@qe-yinzhou-master-etcd-1 ~]# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-yinzhou-master-etcd-1            Ready     master    39m       v1.10.0+b81c8f8
qe-yinzhou-node-registry-router-1   Ready     compute   1h        v1.10.0+b81c8f8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816