Bug 1811530
| Summary: | Install failed due to mdns record changed | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | weiwei jiang <wjiang> |
| Component: | Etcd Operator | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | fbrychta, ikarpukh, jzmeskal, m.andre, pprinett, scuppett, smilner, wewang, wsun, yanyang, yprokule |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-11 13:29:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1809238, 1810490 | | |
This is also affecting upstream CI. I'm looking into it.

This is likely caused by https://github.com/openshift/cluster-etcd-operator/pull/233 (backported to 4.4 in https://github.com/openshift/cluster-etcd-operator/pull/239). I'll port https://github.com/openshift/machine-config-operator/commit/2908ca449b46200cbed67ae5a465243a7919f144 to OpenStack; hopefully this is enough to fix our issue.

When installing OCP on GCP, I hit the same issue:

```
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-03-09-234759: 76% complete"
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=error msg="Cluster operator kube-apiserver Degraded is True with InstallerPodContainerWaiting_CreateContainerError::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-wewang-vw88w-m-1.c.openshift-qe.internal\" on node \"wewang-vw88w-m-1.c.openshift-qe.internal\" container \"installer\" is waiting for 38m21.732926905s because \"the container name \\\"k8s_installer_installer-2-wewang-vw88w-m-1.c.openshift-qe.internal_openshift-kube-apiserver_243574fe-ebe9-4756-9e7f-6e8a446bf457_0\\\" is already in use by \\\"df46526127e942582cf15846967911f37d4a8db5abd712b0500d561131974176\\\". You have to remove that container to be able to reuse that name.: that name is already in use\"\nStaticPodsDegraded: nodes/wewang-vw88w-m-2.c.openshift-qe.internal pods/kube-apiserver-wewang-vw88w-m-2.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/wewang-vw88w-m-2.c.openshift-qe.internal pods/kube-apiserver-wewang-vw88w-m-2.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is waiting: \"CrashLoopBackOff\" - \"back-off 5m0s restarting failed container=kube-apiserver-cert-regeneration-controller pod=kube-apiserver-wewang-vw88w-m-2.c.openshift-qe.internal_openshift-kube-apiserver(b3014b7a2f1c6b8515fe65cbb22372bd)\"\nStaticPodsDegraded: pods \"kube-apiserver-wewang-vw88w-m-1.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-wewang-vw88w-m-0.c.openshift-qe.internal\" not found"
```

FYI, regarding comment 8: that cluster is an IPI-on-GCP OCP cluster.

Should be fixed with https://github.com/openshift/cluster-kube-apiserver-operator/pull/791.

Today I set up the cluster successfully for 'IPI on GCP with http_proxy & OVN' against 4.4.0-0.nightly-2020-03-10-194324.

I can confirm this issue was also encountered when deploying OCP on RHV with 4.4.0-0.nightly-2020-03-09-175442.

*** Bug 1811855 has been marked as a duplicate of this bug. ***

*** This bug has been marked as a duplicate of bug 1812071 ***
Description of problem:
Checked with a recent OCP-on-OSP installation; the kube-apiserver cannot become ready.

```
# oc get pods -n openshift-kube-apiserver -o wide
NAME                                       READY   STATUS      RESTARTS   AGE    IP             NODE                        NOMINATED NODE   READINESS GATES
installer-2-qe-wjios44-6bf4h-master-2      0/1     Completed   0          116s   10.128.0.26    qe-wjios44-6bf4h-master-2   <none>           <none>
kube-apiserver-qe-wjios44-6bf4h-master-2   3/4     Running     3          104s   192.168.0.13   qe-wjios44-6bf4h-master-2   <none>           <none>

# oc -n openshift-kube-apiserver logs kube-apiserver-qe-wjios44-b8bvg-master-1
W0309 05:30:32.293133       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-0.qe-wjios44.0309-xtg.qe.rhcloud.com:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-0.qe-wjios44.0309-xtg.qe.rhcloud.com on 192.168.0.6:53: no such host". Reconnecting...
W0309 05:30:32.769123       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-2.qe-wjios44.0309-xtg.qe.rhcloud.com:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-2.qe-wjios44.0309-xtg.qe.rhcloud.com on 192.168.0.6:53: no such host". Reconnecting...
W0309 05:30:33.301544       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.qe-wjios44.0309-xtg.qe.rhcloud.com:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.qe-wjios44.0309-xtg.qe.rhcloud.com on 192.168.0.6:53: no such host". Reconnecting...

[root@qe-wjios44-b8bvg-master-0 core]# dig +short -t SRV @127.0.0.1 _etcd-server-ssl._tcp.qe-wjios44.0309-xtg.qe.rhcloud.com
0 10 2380 qe-wjios44-b8bvg-etcd-0.qe-wjios44.0309-xtg.qe.rhcloud.com.
0 10 2380 qe-wjios44-b8bvg-etcd-2.qe-wjios44.0309-xtg.qe.rhcloud.com.
0 10 2380 qe-wjios44-b8bvg-etcd-1.qe-wjios44.0309-xtg.qe.rhcloud.com.
[root@qe-wjios44-b8bvg-master-0 core]# dig +short @127.0.0.1 qe-wjios44-b8bvg-etcd-0.qe-wjios44.0309-xtg.qe.rhcloud.com.
192.168.0.17
[root@qe-wjios44-b8bvg-master-0 core]# dig +short @127.0.0.1 etcd-0.qe-wjios44.0309-xtg.qe.rhcloud.com.
[root@qe-wjios44-b8bvg-master-0 core]#
```

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-08-235004

How reproducible:
Always

Steps to Reproduce:
1. Try to install an IPI-on-OSP cluster

Actual results:
```
INFO Waiting up to 20m0s for the Kubernetes API at https://api.qe-wjios44.0309-xtg.qe.rhcloud.com:6443...
DEBUG Still waiting for the Kubernetes API: Get https://api.qe-wjios44.0309-xtg.qe.rhcloud.com:6443/version?timeout=32s: dial tcp 10.0.98.45:6443: i/o timeout
INFO API v1.17.1 up
INFO Waiting up to 40m0s for bootstrapping to complete...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
INFO Cluster operator authentication Progressing is Unknown with NoData:
INFO Cluster operator authentication Available is Unknown with NoData:
ERROR Cluster operator kube-apiserver Degraded is True with StaticPods_Error: StaticPodsDegraded: nodes/qe-wjios44-b8bvg-master-1 pods/kube-apiserver-qe-wjios44-b8bvg-master-1 container="kube-apiserver" is not ready
StaticPodsDegraded: nodes/qe-wjios44-b8bvg-master-1 pods/kube-apiserver-qe-wjios44-b8bvg-master-1 container="kube-apiserver" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-qe-wjios44-b8bvg-master-1_openshift-kube-apiserver(4ad72f4ecbd4b2d85c8988b8b3aa8a4f)"
StaticPodsDegraded: nodes/qe-wjios44-b8bvg-master-1 pods/kube-apiserver-qe-wjios44-b8bvg-master-1 container="kube-apiserver-cert-regeneration-controller" is not ready
StaticPodsDegraded: nodes/qe-wjios44-b8bvg-master-1 pods/kube-apiserver-qe-wjios44-b8bvg-master-1 container="kube-apiserver-cert-regeneration-controller" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=kube-apiserver-cert-regeneration-controller pod=kube-apiserver-qe-wjios44-b8bvg-master-1_openshift-kube-apiserver(4ad72f4ecbd4b2d85c8988b8b3aa8a4f)"
StaticPodsDegraded: pods "kube-apiserver-qe-wjios44-b8bvg-master-0" not found
StaticPodsDegraded: pods "kube-apiserver-qe-wjios44-b8bvg-master-2" not found
```

Expected results:
The installation should succeed.

Additional info:
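The `dig` output above makes the mismatch concrete: the SRV record for `_etcd-server-ssl._tcp.<cluster domain>` now advertises per-machine names (`qe-wjios44-b8bvg-etcd-N...`), while the kube-apiserver log shows it still dialing the old `etcd-N.<cluster domain>` names, which no longer resolve. A minimal sketch of that comparison, using the record values copied from this report (the helper logic is illustrative only, not part of any etcd or OpenShift API):

```python
# SRV targets as returned by `dig +short -t SRV` on the master node
# (trailing dots stripped). These are the names mDNS/CoreDNS now serves.
srv_targets = {
    "qe-wjios44-b8bvg-etcd-0.qe-wjios44.0309-xtg.qe.rhcloud.com",
    "qe-wjios44-b8bvg-etcd-1.qe-wjios44.0309-xtg.qe.rhcloud.com",
    "qe-wjios44-b8bvg-etcd-2.qe-wjios44.0309-xtg.qe.rhcloud.com",
}

# Endpoints the kube-apiserver was still configured with,
# taken from its "no such host" log lines.
configured = [
    f"etcd-{i}.qe-wjios44.0309-xtg.qe.rhcloud.com" for i in range(3)
]

# Any configured endpoint absent from the SRV targets will fail DNS
# lookup, which is exactly the "no such host" error in the log.
stale = [name for name in configured if name not in srv_targets]
print(stale)  # all three configured endpoints are stale
```

Under this reading, the fix (cluster-kube-apiserver-operator PR 791 referenced above) amounts to making the configured endpoints match what the record actually serves.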