Bug 1633570

Summary: cluster with crio runtime upgrade failed at [openshift_node : Approve node certificates when bootstrapping] due to missing connection to etcd hostname.
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 3.11.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Reporter: Johnny Liu <jialiu>
Assignee: Russell Teague <rteague>
QA Contact: Weihua Meng <wmeng>
CC: aos-bugs, jokerman, mmccomas, rteague, wmeng
Keywords: Regression, Triaged
Doc Type: No Doc Update
Type: Bug
Last Closed: 2019-03-14 02:17:59 UTC
Attachments (description, flags):
upgrade log with inventory file embedded, none
upgrade log with inventory file embedded, none

Description Johnny Liu 2018-09-27 10:10:59 UTC
Created attachment 1487703
upgrade log with inventory file embedded

Description of problem:


Version-Release number of the following components:
openshift-ansible-3.11.16-1.git.0.4ac6f81.el7.noarch
atomic-openshift-3.11.16-1.git.0.b48b8f8.el7.x86_64
cri-o-1.11.5-2.rhaos3.11.git1c8a4b1.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a 3.10 cluster with crio runtime enabled
2. Trigger an upgrade to 3.11

Actual results:
Upgrade failed.
TASK [openshift_node : Approve node certificates when bootstrapping] ***********
<--snip-->
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).

fatal: [qe-jialiu3101-master-etcd-1.0927-nqc.qe.rhcloud.com -> qe-jialiu3101-master-etcd-1.0927-nqc.qe.rhcloud.com]: FAILED! => {"all_subjects_found": [], "attempts": 30, "changed": false, "client_approve_results": [], "client_csrs": null, "msg": "The connection to the server qe-jialiu3101-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "oc_get_nodes": null, "rc": 0, "server_approve_results": [], "server_csrs": null, "state": "unknown", "unwanted_csrs": []}

Checking the API server log:
[root@qe-jialiu3101-master-etcd-1 ~]# crictl logs b4200a17855cc
<--snip-->
I0927 08:21:14.253799       1 master_config.go:414] Initializing cache sizes based on 0MB limit
I0927 08:21:14.253907       1 master_config.go:476] Using the lease endpoint reconciler with TTL=15s and interval=10s
I0927 08:21:14.253955       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu3101-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0927 08:21:24.255210       1 start_api.go:68] context deadline exceeded


Restarting dnsmasq restores the connection to etcd.
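
Because restarting dnsmasq restores the connection, the failure looks like a name-resolution problem for the short etcd hostname. The following is a minimal sketch, not part of the original report, of how that could be checked from the master host. It pulls the etcd URLs out of a storagebackend.Config log line like the one above and tests whether each hostname resolves. The function names etcd_endpoints and resolves are hypothetical, and the check reflects only the resolver of the host it runs on, which may differ from what the API server container sees.

```python
import re
import socket
from urllib.parse import urlsplit

# Matches the ServerList field as printed in the API server log line above.
SERVER_LIST_RE = re.compile(r'ServerList:\[\]string\{([^}]*)\}')


def etcd_endpoints(log_line):
    """Return the etcd URLs listed in a storagebackend.Config log line."""
    match = SERVER_LIST_RE.search(log_line)
    if match is None:
        return []
    return re.findall(r'"([^"]+)"', match.group(1))


def resolves(url, resolver=socket.getaddrinfo):
    """Report whether the host in url resolves; resolver is injectable for testing."""
    parts = urlsplit(url)
    try:
        resolver(parts.hostname, parts.port)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    # Shortened copy of the log line from this report.
    sample = ('storagebackend.Config{Type:"etcd3", '
              'ServerList:[]string{"https://qe-jialiu3101-master-etcd-1:2379"}, '
              'Quorum:true}')
    for endpoint in etcd_endpoints(sample):
        state = "resolves" if resolves(endpoint) else "does NOT resolve"
        print(f"{endpoint}: {state}")
```

Running this before and after restarting dnsmasq would show whether the short hostname starts resolving again, which would support the DNS explanation without proving it.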

Expected results:
The upgrade succeeds.

Additional info:
This bug is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1623145#c7 and https://bugzilla.redhat.com/show_bug.cgi?id=1624448, so I also tried an upgrade without the crio runtime; that upgrade completed successfully.

QE successfully upgraded a cluster with the crio runtime a few days ago, but I cannot recall the exact version now.

Comment 10 Johnny Liu 2018-11-05 10:58:36 UTC
Retested with openshift-ansible-3.11.39-1.git.0.fe42b3b.el7.noarch; the bug still reproduces.

The upgrade log is attached.

Comment 11 Johnny Liu 2018-11-05 11:00:08 UTC
Created attachment 1501783
upgrade log with inventory file embedded

Comment 19 Russell Teague 2018-12-20 19:40:49 UTC
Possibly related, given the upgrade is being performed on OpenStack: https://bugzilla.redhat.com/show_bug.cgi?id=1661232

Comment 20 Russell Teague 2019-01-30 16:50:12 UTC
Can this be tested again with the information provided in https://bugzilla.redhat.com/show_bug.cgi?id=1661232#c6?

Comment 21 Weihua Meng 2019-02-12 10:25:17 UTC
Fixed.

openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch

Upgrade succeeded on OpenStack v10 and v13, AWS, and GCE.

Comment 23 errata-xmlrpc 2019-03-14 02:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0407