Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1915853

Summary: etcd fails during install on RHV
Product: OpenShift Container Platform
Reporter: Peter Larsen <plarsen>
Component: Etcd
Assignee: Suresh Kolichala <skolicha>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: medium
Docs Contact:
Priority: low
Version: 4.7
CC: trees
Target Milestone: ---
Flags: mfojtik: needinfo?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: 47hack LifecycleStale
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-09 15:38:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Debug data based on https://access.redhat.com/articles/4971311 (Flags: none)
Attempted sosreport from one rhcoreos master node (Flags: none)

Description Peter Larsen 2021-01-13 14:52:18 UTC
Description of problem:
Using ocp-4.7-fc2 on RHV, the install fails shortly after bringing the masters online.

The etcd logs (viewed with crictl logs) show lots of errors like embed: rejected connection from "192.168.11.216:43430" (error "EOF", ServerName "") (the IP is that of the master). Running etcdctl gives:

[root@ocp47-fkqjp-master-0 /]# etcdctl check datascale
fetch error: Get http://https//192.168.11.216:2379/metrics: dial tcp: lookup https on 192.168.11.23:53: no such host
FAIL: Could not read process_resident_memory_bytes before the put operations.

The URL, with a hostname of "https", is definitely wrong.
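
To illustrate why the resolver ends up looking for a host named "https", here is a minimal Python sketch that parses the URL from the error message. The helper metrics_url is hypothetical and only shows one way to avoid prefixing a scheme twice; it is an assumption about how such a URL could arise, not a description of how etcdctl actually builds it.

```python
from urllib.parse import urlsplit

# URL copied verbatim from the etcdctl "fetch error" message above.
bad_url = "http://https//192.168.11.216:2379/metrics"
parts = urlsplit(bad_url)

# Everything between "http://" and the next "/" is treated as the host.
print(parts.scheme)    # http
print(parts.hostname)  # https
print(parts.path)      # //192.168.11.216:2379/metrics


def metrics_url(endpoint: str) -> str:
    """Hypothetical helper: build a /metrics URL, adding a scheme only if missing."""
    if "://" not in endpoint:
        endpoint = "http://" + endpoint
    return endpoint.rstrip("/") + "/metrics"


print(metrics_url("https://192.168.11.216:2379"))
```

With the endpoint from the member list, the helper leaves the existing https:// scheme alone, so the host stays 192.168.11.216.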

# etcdctl member list
18feb806d6d7ca6c, started, etcd-bootstrap, https://192.168.11.180:2380, https://192.168.11.180:2379, false
785938451488e8ae, started, ocp47-fkqjp-master-0, https://192.168.11.216:2380, https://192.168.11.216:2379, false

All calls from openshift-apiserver are failing with:
W0113 14:49:32.409938       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://192.168.11.183:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.11.183:2379: connect: connection refused". Reconnecting...

which matches what the etcd log shows (connections being rejected).
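
As a quick cross-check of the two pastes above, the following sketch compares the endpoint dialed by openshift-apiserver against the client URLs in the etcdctl member list output. It assumes the field order shown in the paste (ID, status, name, peer URL, client URL, learner flag) with one URL per field; the function name client_hosts is made up for illustration. It only reports whether the address appears in the list and does not by itself establish the cause of the failure.

```python
from urllib.parse import urlsplit

# Output of "etcdctl member list" as pasted above.
MEMBER_LIST = """\
18feb806d6d7ca6c, started, etcd-bootstrap, https://192.168.11.180:2380, https://192.168.11.180:2379, false
785938451488e8ae, started, ocp47-fkqjp-master-0, https://192.168.11.216:2380, https://192.168.11.216:2379, false
"""

# Endpoint taken from the openshift-apiserver warning above.
DIALED = "https://192.168.11.183:2379"


def client_hosts(member_list):
    """Return the set of hosts found in the client URL column (fifth field)."""
    hosts = set()
    for line in member_list.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 6:
            continue  # skip blank or unexpected lines
        hosts.add(urlsplit(fields[4]).hostname)
    return hosts


known = client_hosts(MEMBER_LIST)
dialed_host = urlsplit(DIALED).hostname

print(sorted(known))         # ['192.168.11.180', '192.168.11.216']
print(dialed_host in known)  # False
```

For the data in this report, 192.168.11.183 is not among the listed members, which may be worth noting when reading the connection refused errors.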

Version-Release number of selected component (if applicable):
4.7-fc2
RHV 4.4.3.12-0.1

How reproducible:
Every time
The system is unstable: you can "fix" the above by rebooting the masters, but similar failures may occur later on.

Steps to Reproduce:
1. openshift-install with a RHV install-config.yaml

Additional info:
I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1896384 may be related.

Comment 1 Peter Larsen 2021-01-13 16:09:30 UTC
Created attachment 1747104 [details]
Debug data based on https://access.redhat.com/articles/4971311

Basic dump of journalctl and other data from the frozen masters.

Comment 2 Peter Larsen 2021-01-13 16:34:59 UTC
Created attachment 1747110 [details]
Attempted sosreport from one rhcoreos master node

Generated by running sosreport -k crio.all=on -k crio.logs=on on the first master.

Comment 7 Michal Fojtik 2021-02-26 15:07:04 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 8 Suresh Kolichala 2021-04-09 15:38:31 UTC
Closing this as a duplicate of #1899316. If you think the bug is not a duplicate of it, please reopen it.

*** This bug has been marked as a duplicate of bug 1899316 ***