Bug 1629700 - upgrade failed at TASK [openshift_node : Approve node certificates when bootstrapping]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Michael Gugino
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks: 1632863
 
Reported: 2018-09-17 11:12 UTC by Weihua Meng
Modified: 2018-10-11 19:12 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1632863
Environment:
Last Closed: 2018-10-11 19:12:31 UTC
Target Upstream Version:
Embargoed:


Attachments
docker ps output (2.43 KB, text/plain)
2018-09-17 15:19 UTC, Michael Gugino
journalctl atomic-openshift-node output (74.36 KB, text/plain)
2018-09-17 15:19 UTC, Michael Gugino

Description Weihua Meng 2018-09-17 11:12:24 UTC
Description of problem:
upgrade failed at TASK [openshift_node : Approve node certificates when bootstrapping]

Version-Release number of the following components:
openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

How reproducible:
not sure

Steps to Reproduce:
1. upgrade OCP v3.10 to v3.11


Actual results:
upgrade failed.

fatal: [wmeng11ug310bs-master-etcd-1.0917-k10.qe.rhcloud.com -> wmeng11ug310bs-master-etcd-1.0917-k10.qe.rhcloud.com]: FAILED! => {
    "all_subjects_found": [], 
    "attempts": 30, 
    "changed": false, 
    "client_approve_results": [], 
    "client_csrs": null, 
    "invocation": {
        "module_args": {
            "node_list": [
                "wmeng11ug310bs-master-etcd-1"
            ], 
            "oc_bin": "oc", 
            "oc_conf": "/etc/origin/master/admin.kubeconfig"
        }
    }, 
    "msg": "The connection to the server wmeng11ug310bs-master-etcd-1:8443 was refused - did you specify the right host or port?\n", 
    "oc_get_nodes": null, 
    "rc": 0, 
    "server_approve_results": [], 
    "server_csrs": null, 
    "state": "unknown", 
    "unwanted_csrs": []
}
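
For reference, the failing task automates CSR approval against the master API. A minimal sketch of the equivalent manual steps, assuming the API were reachable (which, per the error above, it is not):

# Use the same kubeconfig the module is invoked with
export KUBECONFIG=/etc/origin/master/admin.kubeconfig

# List pending node client/server CSRs
oc get csr

# Approve all pending CSRs (roughly what the module retries up to 30 times)
oc get csr -o name | xargs -r oc adm certificate approve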


Expected results:
upgrade succeeds

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
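
A hypothetical invocation for capturing those logs (the inventory and playbook paths here are illustrative, not taken from this report):

# Re-run the upgrade with verbose output and keep a log
ansible-playbook -vvv -i /path/to/inventory \
    playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml | tee upgrade.log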

Comment 3 Michael Gugino 2018-09-17 15:17:48 UTC
Created attachment 1484071 [details]
static pod: apiserver.yml

Comment 4 Michael Gugino 2018-09-17 15:18:13 UTC
Created attachment 1484072 [details]
master config

Comment 5 Michael Gugino 2018-09-17 15:18:38 UTC
Created attachment 1484073 [details]
node-config.yml

Comment 6 Michael Gugino 2018-09-17 15:19:11 UTC
Created attachment 1484074 [details]
docker ps output

Comment 7 Michael Gugino 2018-09-17 15:19:42 UTC
Created attachment 1484075 [details]
journalctl atomic-openshift-node output

Comment 8 Michael Gugino 2018-09-17 16:05:36 UTC
The node is failing to start the api static pod; these messages are spammed in the logs:

Sep 17 11:20:38 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: I0917 11:20:38.809983   11559 config.go:100] Looking for [api file], have seen map[file:{}]

Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: I0917 11:21:15.419735   11559 kuberuntime_manager.go:720] Backing Off restarting container &Container{Name:api,Image:registry.reg-aws.openshift.com:443/openshift3/ose-control-plane:v3.11,Command:[/bin/bash -c],Args:[#!/bin/bash
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: set -euo pipefail
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: if [[ -f /etc/origin/master/master.env ]]; then
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: set -o allexport
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: source /etc/origin/master/master.env
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: fi
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: exec openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=${DEBUG_LOGLEVEL:-2}
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: ],WorkingDir:,Ports:[],Env:[],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[{master-config false /etc/origin/master/  <nil>} {master-cloud-provider false /etc/origin/cloudprovider/  <nil>} {master-data false /var/lib/origin/  <nil>}],LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz,Port:8443,Host:,Scheme:HTTPS,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:45,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,},ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz/ready,Port:8443,Host:,Scheme:HTTPS,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[],TerminationMessagePolicy:File,VolumeDevices:[],} in pod master-api-wmeng11ug310bs-master-etcd-1_kube-system(20506aaee850345a01e6ce64c31ec14f)
Sep 17 11:21:15 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[11559]: E0917 11:21:15.419853   11559 pod_workers.go:186] Error syncing pod 20506aaee850345a01e6ce64c31ec14f ("master-api-wmeng11ug310bs-master-etcd-1_kube-system(20506aaee850345a01e6ce64c31ec14f)"), skipping: failed to "StartContainer" for "api" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=api pod=master-api-wmeng11ug310bs-master-etcd-1_kube-system(20506aaee850345a01e6ce64c31ec14f)"


More details in attached journalctl output.
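
A rough way to inspect the crashing static pod directly on the host; the manifest path and container-name filter below are assumptions based on the log output above:

# Static pod manifests for the master live under the node config dir
ls /etc/origin/node/pods/

# Find the failing api container and pull its logs
docker ps -a | grep k8s_api
docker logs $(docker ps -aq --filter name=k8s_api_master-api | head -1)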

"Wait for master API to come back online" works intermittently on this host.  I have tested this several times by running this task in it's own playbook; sometimes it passes, other times it does not.  Either the API comes up for a very brief while and then dies or the wait_for module in ansible is buggy, I'm not sure.

Comment 9 Michael Gugino 2018-09-17 16:43:02 UTC
Rebooting the host in question clears the issue and the API seems to start up fine.  Then, as we do during upgrade, I stopped atomic-openshift-node.service, then docker.service, waited a few seconds for the pods to die, started docker.service, and started atomic-openshift-node.service, as sketched below.
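
That sequence as shell commands (a sketch; the sleep duration is illustrative):

systemctl stop atomic-openshift-node.service
systemctl stop docker.service
sleep 10    # give the static pods time to die
systemctl start docker.service
systemctl start atomic-openshift-node.service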

The API does not come back online, although the port does get bound for a time; the logs are again spammed as above, and I also noticed this:
Sep 17 12:10:29 wmeng11ug310bs-master-etcd-1 atomic-openshift-node[3848]: I0917 12:10:29.891371    3848 handler.go:91] Unable to get network stats from pid 4242: couldn't read network stats: failure opening /proc/4242/net/dev: open /proc/4242/net/dev: no such file or directory

It appears that, on this host, stopping and restarting docker leaves the node service in some kind of dirty state.  The API does bind to port 8443 for a while, but then dies.

Here's a synopsis of what I was doing during this time period:

< immediately after reboot >

[root@wmeng11ug310bs-master-etcd-1 ~]# /usr/local/bin/master-exec etcd etcd etcdctl  --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C "https://wmeng11ug310bs-master-etcd-1:2379" cluster-health
member b9d362a0686e92bb is healthy: got healthy result from https://10.240.0.250:2379
cluster is healthy
[root@wmeng11ug310bs-master-etcd-1 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-v7mjg    1/1       Running   5          7h
registry-console-1-gdfxq   1/1       Running   3          7h
router-1-dkzls             1/1       Running   0          7h

< stop/start docker & atomic-openshift-node >

[root@wmeng11ug310bs-master-etcd-1 ~]# docker ps
CONTAINER ID        IMAGE                                                           COMMAND                  CREATED             STATUS              PORTS               NAMES
cd967412b4f0        bb825675e651                                                    "/bin/bash -c '#!/..."   4 minutes ago       Up 4 minutes                            k8s_controllers_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2
62b0b4dba27d        bb2f1d4dd3a7                                                    "/bin/sh -c '#!/bi..."   4 minutes ago       Up 4 minutes                            k8s_etcd_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2
5a2cedc94778        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7   "/usr/bin/pod"           4 minutes ago       Up 4 minutes                            k8s_POD_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2
2dc4e9d2d850        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7   "/usr/bin/pod"           4 minutes ago       Up 4 minutes                            k8s_POD_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2
ad380c4c1ae0        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7   "/usr/bin/pod"           4 minutes ago       Up 4 minutes                            k8s_POD_master-api-wmeng11ug310bs-master-etcd-1_kube-system_20506aaee850345a01e6ce64c31ec14f_2


[root@wmeng11ug310bs-master-etcd-1 ~]# oc get pods
^C < just hung for about 20 seconds before I killed it >


[root@wmeng11ug310bs-master-etcd-1 ~]# telnet localhost 8443
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.


[root@wmeng11ug310bs-master-etcd-1 ~]# curl https://localhost:8443/ -k
curl: (7) Failed connect to localhost:8443; Connection refused

[root@wmeng11ug310bs-master-etcd-1 ~]# curl https://localhost:8443/ -k
curl: (7) Failed connect to localhost:8443; Connection refused

[root@wmeng11ug310bs-master-etcd-1 ~]# telnet localhost 8443
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused

[root@wmeng11ug310bs-master-etcd-1 ~]# ss -n | grep 8443
< no output >

[root@wmeng11ug310bs-master-etcd-1 ~]# docker ps
CONTAINER ID        IMAGE                                                           COMMAND                  CREATED             STATUS              PORTS               NAMES
cd967412b4f0        bb825675e651                                                    "/bin/bash -c '#!/..."   6 minutes ago       Up 6 minutes                            k8s_controllers_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2
62b0b4dba27d        bb2f1d4dd3a7                                                    "/bin/sh -c '#!/bi..."   6 minutes ago       Up 6 minutes                            k8s_etcd_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2
5a2cedc94778        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7   "/usr/bin/pod"           6 minutes ago       Up 6 minutes                            k8s_POD_master-etcd-wmeng11ug310bs-master-etcd-1_kube-system_b42bd04e236f6e97cd9ed4137ec29abb_2
2dc4e9d2d850        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7   "/usr/bin/pod"           6 minutes ago       Up 6 minutes                            k8s_POD_master-controllers-wmeng11ug310bs-master-etcd-1_kube-system_e35b9fa799adc39f08fadc32559045da_2
ad380c4c1ae0        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.7   "/usr/bin/pod"           6 minutes ago       Up 6 minutes                            k8s_POD_master-api-wmeng11ug310bs-master-etcd-1_kube-system_20506aaee850345a01e6ce64c31ec14f_2


[root@wmeng11ug310bs-master-etcd-1 ~]# telnet localhost 8443
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused


[root@wmeng11ug310bs-master-etcd-1 ~]# telnet wmeng11ug310bs-master-etcd-1 8443
Trying fe80::4001:aff:fef0:fa%eth0...
telnet: connect to address fe80::4001:aff:fef0:fa%eth0: Connection refused
Trying fe80::780b:63ff:fe6d:ac41%tun0...
telnet: connect to address fe80::780b:63ff:fe6d:ac41%tun0: Connection refused
Trying 10.240.0.250...
telnet: connect to address 10.240.0.250: Connection refused
Trying 172.17.0.1...
telnet: connect to address 172.17.0.1: Connection refused
Trying 10.128.0.1...
telnet: connect to address 10.128.0.1: Connection refused

[root@wmeng11ug310bs-master-etcd-1 ~]# /usr/local/bin/master-exec etcd etcd etcdctl  --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C "https://wmeng11ug310bs-master-etcd-1:2379" cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://wmeng11ug310bs-master-etcd-1:2379 exceeded header timeout

error #0: client: endpoint https://wmeng11ug310bs-master-etcd-1:2379 exceeded header timeout



It appears that the static pods do not tolerate docker and the node service being stopped and restarted on this host, and there is no obvious reason why they should not all come back online.

Comment 10 Michael Gugino 2018-09-17 17:15:07 UTC
Restarting dnsmasq and then restarting the pods via the master-restart shim seems to clear the issue.  This is caused by, and blocked on, https://bugzilla.redhat.com/show_bug.cgi?id=1624448
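
For anyone hitting this before the fix lands, the manual workaround described above looks roughly like this (a sketch; master-restart takes the static pod name as its argument):

systemctl restart dnsmasq
/usr/local/bin/master-restart api
/usr/local/bin/master-restart controllers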

Comment 11 Michael Gugino 2018-09-17 18:40:50 UTC
Workaround PR created in master: https://github.com/openshift/openshift-ansible/pull/10112

Comment 12 Scott Dodson 2018-09-17 20:18:01 UTC
https://github.com/openshift/openshift-ansible/pull/10115 release-3.11 backport

Comment 13 Wei Sun 2018-09-19 06:44:06 UTC
PR 10115 has been merged into openshift-ansible-3.11.8-1; please check the bug.

Comment 14 Weihua Meng 2018-09-21 05:57:42 UTC
Fixed.

openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch
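
To verify the fixed build is present on the installer host:

# Confirm the installed openshift-ansible build is 3.11.8-1 or later
rpm -q openshift-ansible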

