Description of problem: upgrade failed at TASK [openshift_node : Approve node certificates when bootstrapping] Version-Release number of the following components: openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch How reproducible: not sure Steps to Reproduce: 1. upgrade OCP v3.10 to v3.11 Actual results: upgrade failed. fatal: [wmeng11ug310bs-master-etcd-1.0917-k10.qe.rhcloud.com -> wmeng11ug310bs-master-etcd-1.0917-k10.qe.rhcloud.com]: FAILED! => { "all_subjects_found": [], "attempts": 30, "changed": false, "client_approve_results": [], "client_csrs": null, "invocation": { "module_args": { "node_list": [ "wmeng11ug310bs-master-etcd-1" ], "oc_bin": "oc", "oc_conf": "/etc/origin/master/admin.kubeconfig" } }, "msg": "The connection to the server wmeng11ug310bs-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "oc_get_nodes": null, "rc": 0, "server_approve_results": [], "server_csrs": null, "state": "unknown", "unwanted_csrs": [] } Expected results: upgrade succeeds Additional info: Please attach logs from ansible-playbook with the -vvv flag
Created attachment 1484071 [details] static pod: apiserver.yml
Created attachment 1484072 [details] master config
Created attachment 1484073 [details] node-config.yml
Created attachment 1484074 [details] docker ps output
Created attachment 1484075 [details] journalctl atomic-openshift-node output
Node is failing to start api static pod, see this spammed in logs: Sep 17 11:20:38 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: I0917 11:20:38.809983 11559 config.go:100] Looking for [api file], have seen map[file:{}] Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: I0917 11:21:15.419735 11559 kuberuntime_manager.go:720] Backing Off restarting container &Container{Name:api,Image:registry.reg-aws.openshift.com:443/openshift3/ose-control-plane:v3.11,Command:[/bin/bash -c],Args:[#!/bin/bash Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: set -euo pipefail Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: if [[ -f /etc/origin/master/master.env ]]; then Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: set -o allexport Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: source /etc/origin/master/master.env Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: fi Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: exec openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=${DEBUG_LOGLEVEL:-2} Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: ],WorkingDir:,Ports:[],Env:[],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[{master-config false /etc/origin/master/ <nil>} {master-cloud-provider false /etc/origin/cloudprovider/ <nil>} {master-data false /var/lib/origin/ <nil>}],LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz,Port:8443,Host:,Scheme:HTTPS,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:45,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,},ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz/ready,Port:8443,Host:,Scheme:HTTPS,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[],TerminationMessagePolicy:File,VolumeDevices:[],} in pod master-api-wmeng11ug310bs-master-etcd-1_kube-system(20506aaee850345a01e6ce64c31ec14f) Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: E0917 11:21:15.419853 11559 pod_workers.go:186] Error syncing pod 20506aaee850345a01e6ce64c31ec14f ("master-api-wmeng11ug310bs-master-etcd-1_kube-system(20506aaee850345a01e6ce64c31ec14f)"), skipping: failed to "StartContainer" for "api" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=api pod=master-api-wmeng11ug310bs-master-etcd-1_kube-system(20506aaee850345a01e6ce64c31ec14f)" More details in attached journalctl output. "Wait for master API to come back online" works intermittently on this host. I have tested this several times by running this task in it's own playbook; sometimes it passes, other times it does not. Either the API comes up for a very brief while and then dies or the wait_for module in ansible is buggy, I'm not sure.
Rebooting the host in question clears the issue and api seems to start up fine. Then, as we do in upgrade, I stopped atomic-openshift-node.service followed by stopping docker.service. Wait a few seconds for pods to die, start docker.service, start atomic-openshift-node.service. API does not come back online, but the port is bound to, logs again spammed, also noticed this: Sep 17 12:10:29 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[3848]: I0917 12:10:29.891371 3848 handler.go:91] Unable to get network stats from pid 4242: couldn't read network stats: failure opening /proc/4242/net/dev: open /proc/4242/net/dev: no such file or directory It appears on this host, stopping and restarting docker leaves the node service in some kind of dirty state. The api does bind to port 8443 for some time, but then later it dies. Here's a synopsis of what I was doing during this time period: < immediately after reboot > [root@wmeng11ug310bs-master-etcd-1 ~]# /usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C "https://wmeng11ug310bs-master-etcd-1:2379" cluster-health member b9d362a0686e92bb is healthy: got healthy result from https://10.240.0.250:2379 cluster is healthy [root@wmeng11ug310bs-master-etcd-1 ~]# oc get pods NAME READY STATUS RESTARTS AGE docker-registry-1-v7mjg 1/1 Running 5 7h registry-console-1-gdfxq 1/1 Running 3 7h router-1-dkzls 1/1 Running 0 7h < stop/start docker & atomic-openshift-node > [root@wmeng11ug310bs-master-etcd-1 ~]# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES cd967412b4f0 bb825675e651 "/bin/bash -c '#!/..." 4 minutes ago Up 4 minutes k8s_controllers_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2 62b0b4dba27d bb2f1d4dd3a7 "/bin/sh -c '#!/bi..." 4 minutes ago Up 4 minutes k8s_etcd_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2 5a2cedc94778 registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7 "/usr/bin/pod" 4 minutes ago Up 4 minutes k8s_POD_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2 2dc4e9d2d850 registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7 "/usr/bin/pod" 4 minutes ago Up 4 minutes k8s_POD_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2 ad380c4c1ae0 registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7 "/usr/bin/pod" 4 minutes ago Up 4 minutes k8s_POD_master-api-wmeng11ug310bs-master-etcd-1_kube-system_20506aaee850345a01e6ce64c31ec14f_2 [root@wmeng11ug310bs-master-etcd-1 ~]# oc get pods ^C < just hung for about 20 seconds before I killed > [root@wmeng11ug310bs-master-etcd-1 ~]# telnet localhost 8443 Trying ::1... telnet: connect to address ::1: Connection refused Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. Connection closed by foreign host. [root@wmeng11ug310bs-master-etcd-1 ~]# curl https://localhost:8443/ -k curl: (7) Failed connect to localhost:8443; Connection refused [root@wmeng11ug310bs-master-etcd-1 ~]# curl https://localhost:8443/ -k curl: (7) Failed connect to localhost:8443; Connection refused [root@wmeng11ug310bs-master-etcd-1 ~]# telnet localhost 8443 Trying ::1... telnet: connect to address ::1: Connection refused Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused [root@wmeng11ug310bs-master-etcd-1 ~]# ss -n | grep 8443 < no output > [root@wmeng11ug310bs-master-etcd-1 ~]# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES cd967412b4f0 bb825675e651 "/bin/bash -c '#!/..." 6 minutes ago Up 6 minutes k8s_controllers_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2 62b0b4dba27d bb2f1d4dd3a7 "/bin/sh -c '#!/bi..." 6 minutes ago Up 6 minutes k8s_etcd_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2 5a2cedc94778 registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7 "/usr/bin/pod" 6 minutes ago Up 6 minutes k8s_POD_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2 2dc4e9d2d850 registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7 "/usr/bin/pod" 6 minutes ago Up 6 minutes k8s_POD_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2 ad380c4c1ae0 registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7 "/usr/bin/pod" 6 minutes ago Up 6 minutes k8s_POD_master-api-wmeng11ug310bs-master-etcd-1_kube-system_20506aaee850345a01e6ce64c31ec14f_2 [root@wmeng11ug310bs-master-etcd-1 ~]# telnet localhost 8443 Trying ::1... telnet: connect to address ::1: Connection refused Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused [root@wmeng11ug310bs-master-etcd-1 ~]# telnet wmeng11ug310bs-master-etcd-1 8443 Trying fe80::4001:aff:fef0:fa%eth0... telnet: connect to address fe80::4001:aff:fef0:fa%eth0: Connection refused Trying fe80::780b:63ff:fe6d:ac41%tun0... telnet: connect to address fe80::780b:63ff:fe6d:ac41%tun0: Connection refused Trying 10.240.0.250... telnet: connect to address 10.240.0.250: Connection refused Trying 172.17.0.1... telnet: connect to address 172.17.0.1: Connection refused Trying 10.128.0.1... telnet: connect to address 10.128.0.1: Connection refused [root@wmeng11ug310bs-master-etcd-1 ~]# /usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C "https://wmeng11ug310bs-master-etcd-1:2379" cluster-health cluster may be unhealthy: failed to list members Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://wmeng11ug310bs-master-etcd-1:2379 exceeded header timeout error #0: client: endpoint https://wmeng11ug310bs-master-etcd-1:2379 exceeded header timeout It appears that static pods do not like docker and the node service being stopped/restarted on this host, and there is no obvious reason why they shouldn't all come back online.
Restarting dnsmasq and then restarting pods via master-restart shim seems to clear the issue. We are blocked/caused by: https://bugzilla.redhat.com/show_bug.cgi?id=1624448
Workaround PR created in master: https://github.com/openshift/openshift-ansible/pull/10112
https://github.com/openshift/openshift-ansible/pull/10115 release-3.11 backport
The PR 10115 has been merged to openshift-ansible-3.11.8-1,please check the bug.
Fixed. openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch