1612144 – upgrade failed at TASK [openshift_node : Approve the node]

Keywords:
Status:	CLOSED DUPLICATE of bug 1609907
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.0
Assignee:	Russell Teague
QA Contact:	Weihua Meng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-08-03 15:07 UTC by Weihua Meng
Modified:	2018-08-28 08:38 UTC
CC List:	4 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-08-15 19:40:30 UTC
Target Upstream Version:
Embargoed:

Description Weihua Meng 2018-08-03 15:07:30 UTC

Description of problem:
The upgrade failed at TASK [openshift_node : Approve the node].

Version-Release number of the following components:
openshift-ansible-3.11.0-0.11.0.git.0.3c66516None.noarch.rpm

How reproducible:
3/4

Steps to Reproduce:
1. Upgrade OCP v3.10 to v3.11.

Actual results:
TASK [openshift_node : Approve the node] ***************************************
task path: /home/slave5/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Using module file /home/slave5/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/lib_openshift/library/oc_adm_csr.py
<qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com> ESTABLISH SSH CONNECTION FOR USER: root
<qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/slave5/workspace/Run-Ansible-Playbooks-Nextge/private/config/keys/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/slave5/.ansible/cp/%C qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com> (1, '\n{"exception": "  File \\"/tmp/ansible_QXihG7/ansible_module_oc_adm_csr.py\\", line 47, in <module>\\n    import ruamel.yaml as yaml\\n", "changed": false, "finished": false, "results": [], "failed": true, "state": "approve", "timeout": true, "invocation": {"module_args": {"approve_all": false, "fail_on_timeout": true, "service_account": "system:serviceaccount:openshift-infra:node-bootstrapper", "kubeconfig": "/etc/origin/master/admin.kubeconfig", "state": "approve", "timeout": 60, "debug": false, "nodes": ["qe-wmengah310-master-etcd-1"]}}, "nodes": [{"server_accepted": false, "csrs": {}, "client_accepted": false, "name": "qe-wmengah310-master-etcd-1", "denied": false}], "msg": "Timed out accepting certificate signing requests. Failing as requested."}\n', '')
The full traceback is:
  File "/tmp/ansible_QXihG7/ansible_module_oc_adm_csr.py", line 47, in <module>
    import ruamel.yaml as yaml

fatal: [qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com -> qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com]: FAILED! => {
    "changed": false, 
    "finished": false, 
    "invocation": {
        "module_args": {
            "approve_all": false, 
            "debug": false, 
            "fail_on_timeout": true, 
            "kubeconfig": "/etc/origin/master/admin.kubeconfig", 
            "nodes": [
                "qe-wmengah310-master-etcd-1"
            ], 
            "service_account": "system:serviceaccount:openshift-infra:node-bootstrapper", 
            "state": "approve", 
            "timeout": 60
        }
    }, 
    "msg": "Timed out accepting certificate signing requests. Failing as requested.", 
    "nodes": [
        {
            "client_accepted": false, 
            "csrs": {}, 
            "denied": false, 
            "name": "qe-wmengah310-master-etcd-1", 
            "server_accepted": false
        }
    ], 
    "results": [], 
    "state": "approve", 
    "timeout": true
}

Failure summary:


  1. Hosts:    qe-wmengah310-master-etcd-1.0802-kgy.qe.rhcloud.com
     Play:     Update master nodes
     Task:     Approve the node
     Message:  Timed out accepting certificate signing requests. Failing as requested.
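
The module result above can be summarized per node. The following is a minimal sketch, not part of openshift-ansible: the `summarize` helper is hypothetical, and `RESULT` is a trimmed copy of the JSON printed by the failed task. It lists each node that was not fully approved and how many CSRs the module reported for it.

```python
import json

# Trimmed copy of the module result printed by the failed task in this report.
RESULT = json.loads("""
{
  "failed": true,
  "timeout": true,
  "state": "approve",
  "msg": "Timed out accepting certificate signing requests. Failing as requested.",
  "nodes": [
    {"name": "qe-wmengah310-master-etcd-1", "csrs": {},
     "client_accepted": false, "server_accepted": false, "denied": false}
  ]
}
""")


def summarize(result):
    """Return one line per node whose client or server CSR was not accepted.

    Hypothetical helper for reading the task output; it only inspects the
    fields that appear in the result shown in this report.
    """
    lines = []
    for node in result.get("nodes", []):
        if node.get("client_accepted") and node.get("server_accepted"):
            continue  # fully approved, nothing to report
        lines.append(
            "{}: client_accepted={} server_accepted={} csrs_seen={}".format(
                node["name"],
                node.get("client_accepted"),
                node.get("server_accepted"),
                len(node.get("csrs", {})),
            )
        )
    return lines
```

Calling `summarize(RESULT)` on the result from this run reports `csrs_seen=0` for the node, which matches the empty `csrs` map in the task output: the module reported no CSRs for the node before the 60-second timeout.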

Expected results:
The upgrade succeeds.

Comment 2 liujia 2018-08-10 02:57:03 UTC

Still hitting this on openshift-ansible-3.11.0-0.13.0.git.0.16dc599None.noarch.

This blocks the v3.11 upgrade test.

Comment 3 Russell Teague 2018-08-13 12:44:48 UTC


*** This bug has been marked as a duplicate of bug 1609907 ***

Comment 5 liujia 2018-08-15 09:01:48 UTC

Hit the issue in comment 4. More debugging information:

Regarding "Wait for node to be ready": the node was actually ready and running well, but the master API service could not start.

[root@qe-jliu-a-master-etcd-1 ~]# docker ps -a|grep master-api
b7f61961eb34        7f9f698aa5c8                                                                                                                                        "/bin/bash -c '#!/..."   3 minutes ago       Exited (255) 2 minutes ago                        k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_6
996b3943ebef        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0                                                                                "/usr/bin/pod"           11 minutes ago      Up 11 minutes                                     k8s_POD_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_0

Checked master api log:

I0815 07:44:09.377332       1 master_config.go:539] Using the lease endpoint reconciler with TTL=15s and interval=10s
I0815 07:44:09.377438       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jliu-a-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0815 07:44:19.379486       1 start_api.go:68] context deadline exceeded
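
The gap between the last informational entry and the fatal entry can be measured directly from the timestamps. The following is a minimal sketch under stated assumptions: the regular expression covers only the line layout seen in this log excerpt, the year is assumed to be 2018 (the log lines do not carry one), and the helper names `parse_glog` and `seconds_before_fatal` are hypothetical. The second sample line is shortened for readability.

```python
import re
from datetime import datetime

# Matches lines shaped like: "I0815 07:44:09.377332       1 file.go:539] message"
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def parse_glog(line, year=2018):
    """Return (level, timestamp, source, message), or None if the line does not match."""
    m = GLOG_LINE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(
        "{}-{}-{} {}".format(year, m.group("month"), m.group("day"), m.group("time")),
        "%Y-%m-%d %H:%M:%S.%f",
    )
    return m.group("level"), stamp, m.group("source"), m.group("message")


def seconds_before_fatal(lines):
    """Seconds between the first fatal entry and the entry just before it."""
    previous = None
    for line in lines:
        entry = parse_glog(line)
        if entry is None:
            continue
        level, stamp = entry[0], entry[1]
        if level == "F":
            return None if previous is None else (stamp - previous).total_seconds()
        previous = stamp
    return None


LOG = [
    "I0815 07:44:09.377332       1 master_config.go:539] Using the lease endpoint reconciler with TTL=15s and interval=10s",
    "I0815 07:44:09.377438       1 storage_factory.go:285] storing { apiServerIPInfo} in v1 (shortened)",
    "F0815 07:44:19.379486       1 start_api.go:68] context deadline exceeded",
]
```

For the three entries above, `seconds_before_fatal(LOG)` comes to roughly ten seconds, so the API server appears to give up about ten seconds after setting up the etcd storage backend.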

It seems that the etcd client cannot connect to the etcd server using hostname:2379, whereas ip:2379 works.

[root@qe-jliu-a-master-etcd-1 ~]# cat /etc/origin/master/master-config.yaml |grep -A 2 urls
  urls:
  - https://qe-jliu-a-master-etcd-1:2379
etcdStorageConfig:

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jliu-a-master-etcd-1:2379
https://qe-jliu-a-master-etcd-1:2379 is unhealthy: failed to connect: context deadline exceeded
Error:  unhealthy cluster

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://10.240.0.123:2379
https://10.240.0.123:2379 is healthy: successfully committed proposal: took = 1.279727ms


However, after changing master-config to use the etcd IP, the master API service starts.

[root@qe-jliu-a-master-etcd-1 ~]# docker ps -a|grep master-api
2c6181ba9f66        7f9f698aa5c8                                                                                                                                        "/bin/bash -c '#!/..."   44 seconds ago       Up 43 seconds                                         k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_9
b52f02c92012        7f9f698aa5c8                                                                                                                                        "/bin/bash -c '#!/..."   6 minutes ago        Exited (255) 5 minutes ago                            k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_8
996b3943ebef        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0                                                                                "/usr/bin/pod"           24 minutes ago       Up 24 minutes                                         k8s_POD_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_0

Alternatively, after restarting dnsmasq, the master API starts correctly.

Moreover, before the upgrade (v3.10), the client could connect to the server using the etcd hostname.

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jliu-a-master-etcd-1:2379
https://qe-jliu-a-master-etcd-1:2379 is healthy: successfully committed proposal: took = 1.301758ms
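
The hostname-versus-IP comparison in this comment can be sketched as a small resolver check. This is an illustration only, assuming the problem is visible to the host's standard resolver: `resolve`, `diagnose`, and `ETCD_PORT` are hypothetical names, and the sketch does not measure lookup latency, so a resolver that is slow rather than failing would need a timing wrapper around the call. The `resolver` parameter exists so the logic can be exercised without real DNS.

```python
import socket

ETCD_PORT = 2379  # client port used by the etcd endpoints in this report


def resolve(hostname, port=ETCD_PORT, resolver=socket.getaddrinfo):
    """Return the sorted addresses the resolver yields for hostname, or [] on failure."""
    try:
        infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def diagnose(hostname, known_ip, port=ETCD_PORT, resolver=socket.getaddrinfo):
    """Classify how the hostname resolves relative to an IP known to work."""
    addresses = resolve(hostname, port, resolver)
    if not addresses:
        return "unresolvable"
    if known_ip not in addresses:
        return "resolves-elsewhere"
    return "ok"
```

On the affected master, `diagnose("qe-jliu-a-master-etcd-1", "10.240.0.123")` would be the call to try before and after restarting dnsmasq. A result other than `"ok"` would point at name resolution rather than at etcd itself, which would be consistent with the IP endpoint being healthy.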
