Description of problem:
Upgrade failed at TASK [openshift_node : Approve the node]

Version-Release number of the following components:
openshift-ansible-3.11.0-0.11.0.git.0.3c66516None.noarch.rpm

How reproducible:
3/4

Steps to Reproduce:
1. Upgrade OCP v3.10 to v3.11

Actual results:
TASK [openshift_node : Approve the node] ***************************************
task path: /home/slave5/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Using module file /home/slave5/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/lib_openshift/library/oc_adm_csr.py
<qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com> ESTABLISH SSH CONNECTION FOR USER: root
<qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/slave5/workspace/Run-Ansible-Playbooks-Nextge/private/config/keys/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/slave5/.ansible/cp/%C qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com> (1, '\n{"exception": " File \\"/tmp/ansible_QXihG7/ansible_module_oc_adm_csr.py\\", line 47, in <module>\\n import ruamel.yaml as yaml\\n", "changed": false, "finished": false, "results": [], "failed": true, "state": "approve", "timeout": true, "invocation": {"module_args": {"approve_all": false, "fail_on_timeout": true, "service_account": "system:serviceaccount:openshift-infra:node-bootstrapper", "kubeconfig": "/etc/origin/master/admin.kubeconfig", "state": "approve", "timeout": 60, "debug": false, "nodes": ["qe-wmengah310-master-etcd-1"]}}, "nodes": [{"server_accepted": false, "csrs": {}, "client_accepted": false, "name": "qe-wmengah310-master-etcd-1", "denied": false}], "msg": "Timed out accepting certificate signing requests. Failing as requested."}\n', '')
The full traceback is:
  File "/tmp/ansible_QXihG7/ansible_module_oc_adm_csr.py", line 47, in <module>
    import ruamel.yaml as yaml

fatal: [qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com -> qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com]: FAILED! => {
    "changed": false,
    "finished": false,
    "invocation": {
        "module_args": {
            "approve_all": false,
            "debug": false,
            "fail_on_timeout": true,
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "nodes": [
                "qe-wmengah310-master-etcd-1"
            ],
            "service_account": "system:serviceaccount:openshift-infra:node-bootstrapper",
            "state": "approve",
            "timeout": 60
        }
    },
    "msg": "Timed out accepting certificate signing requests. Failing as requested.",
    "nodes": [
        {
            "client_accepted": false,
            "csrs": {},
            "denied": false,
            "name": "qe-wmengah310-master-etcd-1",
            "server_accepted": false
        }
    ],
    "results": [],
    "state": "approve",
    "timeout": true
}

Failure summary:

  1. Hosts:   qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com
     Play:    Update master nodes
     Task:    Approve the node
     Message: Timed out accepting certificate signing requests. Failing as requested.

Expected results:
Upgrade succeeds.
Still hitting this on openshift-ansible-3.11.0-0.13.0.git.0.16dc599None.noarch. It blocks the v3.11 upgrade test.
*** This bug has been marked as a duplicate of bug 1609907 ***
Hit the issue from comment 4. More information for debugging:

Regarding "Wait for node to be ready": the node was actually ready and running well, but the master API service could not start.

[root@qe-jliu-a-master-etcd-1 ~]# docker ps -a|grep master-api
b7f61961eb34 7f9f698aa5c8 "/bin/bash -c '#!/..." 3 minutes ago Exited (255) 2 minutes ago k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_6
996b3943ebef registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0 "/usr/bin/pod" 11 minutes ago Up 11 minutes k8s_POD_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_0

Checked the master API log:

I0815 07:44:09.377332 1 master_config.go:539] Using the lease endpoint reconciler with TTL=15s and interval=10s
I0815 07:44:09.377438 1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jliu-a-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0815 07:44:19.379486 1 start_api.go:68] context deadline exceeded

It seems that the etcd client cannot connect to the etcd server via hostname:2379, while ip:2379 works well.

[root@qe-jliu-a-master-etcd-1 ~]# cat /etc/origin/master/master-config.yaml |grep -A 2 urls
urls:
- https://qe-jliu-a-master-etcd-1:2379
etcdStorageConfig:

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jliu-a-master-etcd-1:2379
https://qe-jliu-a-master-etcd-1:2379 is unhealthy: failed to connect: context deadline exceeded
Error: unhealthy cluster

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://10.240.0.123:2379
https://10.240.0.123:2379 is healthy: successfully committed proposal: took = 1.279727ms

After changing master-config.yaml to use the etcd IP, the master API service starts:

[root@qe-jliu-a-master-etcd-1 ~]# docker ps -a|grep master-api
2c6181ba9f66 7f9f698aa5c8 "/bin/bash -c '#!/..." 44 seconds ago Up 43 seconds k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_9
b52f02c92012 7f9f698aa5c8 "/bin/bash -c '#!/..." 6 minutes ago Exited (255) 5 minutes ago k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_8
996b3943ebef registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0 "/usr/bin/pod" 24 minutes ago Up 24 minutes k8s_POD_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_0

Alternatively, after restarting dnsmasq, the master API starts correctly.

Moreover, before the upgrade (v3.10), the client could connect to the server using the etcd hostname:

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jliu-a-master-etcd-1:2379
https://qe-jliu-a-master-etcd-1:2379 is healthy: successfully committed proposal: took = 1.301758ms
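
Since the hostname endpoint fails while the IP endpoint works, and restarting dnsmasq clears the problem, it may help to separate name resolution from TCP reachability when reproducing this. The following is a minimal sketch using only the Python standard library, not part of openshift-ansible. The function name check_endpoint and the shape of its return value are hypothetical, and it performs a plain TCP connect only (no TLS handshake, no etcd health check).

```python
import socket
import time


def check_endpoint(host, port, timeout=5.0):
    """Resolve `host`, then try a plain TCP connect to each resolved address.

    Hypothetical helper for illustration only. Note that getaddrinfo() has no
    timeout parameter of its own; `timeout` applies to the TCP connect step.
    """
    result = {
        "host": host,
        "port": port,
        "addresses": [],
        "resolve_seconds": None,
        "resolve_error": None,
        "connect": {},
    }

    # Step 1: name resolution, timed so a slow or stuck resolver is visible.
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["resolve_seconds"] = time.monotonic() - start
        result["resolve_error"] = str(exc)
        return result
    result["resolve_seconds"] = time.monotonic() - start

    # The address is the first element of each sockaddr tuple.
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: TCP reachability of every address the name resolved to.
    for addr in result["addresses"]:
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                result["connect"][addr] = "ok"
        except OSError as exc:
            result["connect"][addr] = "failed: %s" % exc
    return result
```

On the affected master, one could compare check_endpoint("qe-jliu-a-master-etcd-1", 2379) with check_endpoint("10.240.0.123", 2379). If the hostname call reports a resolve error, a long resolve time, or an address other than 10.240.0.123, that would point at the resolver (dnsmasq) rather than at etcd itself.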