Bug 1633570

Summary:

Upgrade of a cluster with the CRI-O runtime fails at [openshift_node : Approve node certificates when bootstrapping] because the connection to the etcd hostname is lost.

Product:

OpenShift Container Platform

Reporter:

Johnny Liu <jialiu>

Component:

Cluster Version Operator

Assignee:

Russell Teague <rteague>

Status:

CLOSED ERRATA

QA Contact:

Weihua Meng <wmeng>

Severity:

high

Docs Contact:

Priority:

high

Version:

3.11.0

CC:

aos-bugs, jokerman, mmccomas, rteague, wmeng

Target Milestone:

---

Keywords:

Regression, Triaged

Target Release:

3.11.z

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:


Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-03-14 02:17:59 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
upgrade log with inventory file embedded	none
upgrade log with inventory file embedded	none

Description Johnny Liu 2018-09-27 10:10:59 UTC

Created attachment 1487703
upgrade log with inventory file embedded

Description of problem:


Version-Release number of the following components:
openshift-ansible-3.11.16-1.git.0.4ac6f81.el7.noarch
atomic-openshift-3.11.16-1.git.0.b48b8f8.el7.x86_64
cri-o-1.11.5-2.rhaos3.11.git1c8a4b1.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a 3.10 cluster with the CRI-O runtime enabled.
2. Trigger an upgrade to 3.11.

Actual results:
Upgrade failed.
TASK [openshift_node : Approve node certificates when bootstrapping] ***********
<--snip-->
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).

fatal: [qe-jialiu3101-master-etcd-1.0927-nqc.qe.rhcloud.com -> qe-jialiu3101-master-etcd-1.0927-nqc.qe.rhcloud.com]: FAILED! => {"all_subjects_found": [], "attempts": 30, "changed": false, "client_approve_results": [], "client_csrs": null, "msg": "The connection to the server qe-jialiu3101-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "oc_get_nodes": null, "rc": 0, "server_approve_results": [], "server_csrs": null, "state": "unknown", "unwanted_csrs": []}
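The failure payload above is plain JSON, so the refused endpoint can be pulled out of `msg` mechanically when comparing several upgrade logs. The following is a minimal Python sketch and not part of openshift-ansible; the helper name `refused_endpoint` is hypothetical, and the payload is abbreviated from the line above.

```python
import json
import re

# Abbreviated copy of the failure payload printed by the Ansible task above.
PAYLOAD = r'''{"attempts": 30, "changed": false, "msg": "The connection to the server qe-jialiu3101-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "rc": 0, "state": "unknown"}'''


def refused_endpoint(payload):
    """Return (host, port) from a 'connection ... was refused' message, or None."""
    msg = json.loads(payload).get("msg", "")
    match = re.search(r"connection to the server (\S+):(\d+) was refused", msg)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(refused_endpoint(PAYLOAD))  # ('qe-jialiu3101-master-etcd-1', 8443)
```

Port 8443 is the API server port, so the refusal here reflects the API server being down rather than a direct etcd error.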

Checking the API server log:
[root@qe-jialiu3101-master-etcd-1 ~]# crictl logs b4200a17855cc
<--snip-->
I0927 08:21:14.253799       1 master_config.go:414] Initializing cache sizes based on 0MB limit
I0927 08:21:14.253907       1 master_config.go:476] Using the lease endpoint reconciler with TTL=15s and interval=10s
I0927 08:21:14.253955       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu3101-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0927 08:21:24.255210       1 start_api.go:68] context deadline exceeded
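Two details in the excerpt above matter for triage: the etcd endpoint the API server was configured with, and the gap between the last informational line and the fatal one. A minimal Python sketch for extracting both from a saved log follows; the helper names are hypothetical, and the sample is abbreviated from the excerpt (the `storagebackend.Config` fields after `ServerList` are dropped).

```python
import re
from datetime import datetime

# Abbreviated from the API server log excerpt above.
LOG = """\
I0927 08:21:14.253955       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu3101-master-etcd-1:2379"}}
F0927 08:21:24.255210       1 start_api.go:68] context deadline exceeded
"""


def etcd_endpoints(log):
    """Collect the URLs listed in every ServerList:[]string{...} entry."""
    endpoints = []
    for servers in re.findall(r'ServerList:\[\]string\{([^}]*)\}', log):
        endpoints.extend(re.findall(r'"([^"]+)"', servers))
    return endpoints


def seconds_before_fatal(log):
    """Seconds between the first fatal (F) line and the log line just before it."""
    stamps = []
    for line in log.splitlines():
        m = re.match(r'([IWEF])\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})', line)
        if m:
            stamps.append((m.group(1), datetime.strptime(m.group(2), "%H:%M:%S.%f")))
    for prev, cur in zip(stamps, stamps[1:]):
        if cur[0] == "F":
            return (cur[1] - prev[1]).total_seconds()
    return None


print(etcd_endpoints(LOG))
print(seconds_before_fatal(LOG))
```

For this excerpt the gap is about 10 seconds, which would be consistent with a startup timeout while waiting on etcd, though the log alone does not confirm which call timed out.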


Restarting dnsmasq restores the connection to etcd.
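Because a dnsmasq restart is enough to recover, the symptom is consistent with the short etcd hostname failing to resolve on the master, although this report does not confirm the root cause. A minimal Python sketch for telling a resolution failure apart from a refused or timed-out connection is shown here; `check_endpoint` is a hypothetical helper, and its `resolve` and `connect` parameters exist only so the logic can be exercised without a network.

```python
import socket


def check_endpoint(host, port, timeout=3.0,
                   resolve=socket.getaddrinfo,
                   connect=socket.create_connection):
    """Return 'dns-failure', 'connect-failure', or 'ok' for host:port."""
    try:
        # Name resolution goes through the system resolver configuration.
        resolve(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns-failure"
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections and timeouts.
        return "connect-failure"
    conn.close()
    return "ok"
```

Running `check_endpoint("qe-jialiu3101-master-etcd-1", 2379)` on the master before and after restarting dnsmasq would show whether the result moves from `dns-failure` to `ok`.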

Expected results:
The upgrade completes successfully.

Additional info:
This bug is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1623145#c7 and https://bugzilla.redhat.com/show_bug.cgi?id=1624448, so I also tried an upgrade without the CRI-O runtime; that upgrade completed successfully.

QE successfully upgraded a CRI-O cluster a few days ago, but I cannot remember the exact version now.

Comment 10 Johnny Liu 2018-11-05 10:58:36 UTC

Retested with openshift-ansible-3.11.39-1.git.0.fe42b3b.el7.noarch; the bug still reproduces.

The upgrade log is attached.

Comment 11 Johnny Liu 2018-11-05 11:00:08 UTC

Created attachment 1501783
upgrade log with inventory file embedded

Comment 19 Russell Teague 2018-12-20 19:40:49 UTC

Possibly related, given the upgrade is being performed on OpenStack: https://bugzilla.redhat.com/show_bug.cgi?id=1661232

Comment 20 Russell Teague 2019-01-30 16:50:12 UTC

Can this be tested again with the information provided in https://bugzilla.redhat.com/show_bug.cgi?id=1661232#c6?

Comment 21 Weihua Meng 2019-02-12 10:25:17 UTC

Fixed. Verified with:

openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch

The upgrade succeeded on OpenStack v10 and v13, AWS, and GCE.

Comment 23 errata-xmlrpc 2019-03-14 02:17:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0407