Bug 1779005 - [upi-vmware] Fail to finish cluster initialization after bootstrap complete
Summary: [upi-vmware] Fail to finish cluster initialization after bootstrap complete
Keywords:
Status: CLOSED DUPLICATE of bug 1750606
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard: devex
Depends On:
Blocks:
 
Reported: 2019-12-03 03:31 UTC by liujia
Modified: 2019-12-09 02:22 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-05 01:42:58 UTC
Target Upstream Version:
Embargoed:
jiajliu: needinfo-


Attachments
logs (9.75 MB, application/gzip)
2019-12-03 03:34 UTC, liujia

Description liujia 2019-12-03 03:31:42 UTC
Description of problem:
The installation consistently fails to finish at the stage where the image registry is patched, which comes after a successful bootstrap. Even after waiting 60 minutes, the image registry operator still has not appeared.

[root@preserve-jliu-worker tmp]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                                                               True        False         False      74m
dns                                        unknown                             False       True          True       71m
insights                                   4.3.0-0.nightly-2019-11-27-041100   True        True          False      72m
kube-apiserver                             4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
kube-controller-manager                    4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
kube-scheduler                             4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
machine-api                                4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
machine-config                             4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
network                                    4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
openshift-apiserver                        4.3.0-0.nightly-2019-11-27-041100   Unknown     False         False      72m
openshift-controller-manager                                                   False       True          False      72m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-11-27-041100   True        True          False      71m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-11-27-041100   True        True          False      71m
operator-lifecycle-manager-packageserver                                       False       True          False      71m
service-ca                                 4.3.0-0.nightly-2019-11-27-041100   True        False         False      72m
[root@preserve-jliu-worker tmp]# oc get configs.imageregistry.operator.openshift.io cluster
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found
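
For reference, the post-bootstrap step that can never be performed here is the image registry storage patch. A minimal sketch of that step, assuming the non-production emptyDir storage described for vSphere UPI installs (it cannot be applied in this cluster because the "cluster" config above is never created):

# wait until the registry operator creates its config, then patch in storage
oc get configs.imageregistry.operator.openshift.io cluster
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
  --patch '{"spec":{"storage":{"emptyDir":{}}}}'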

In this broken state, must-gather does not work.
[root@preserve-jliu-worker tmp]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT 
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-nqxkf created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rfqgc created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-d2ft8] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
...
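
Because the must-gather pod cannot reach the service network (dial tcp 172.30.0.1:443: i/o timeout), logs have to be collected without it. A rough sketch of the kind of commands used instead, assuming the API server is still reachable from the bastion:

# pull operator and node logs directly instead of through must-gather
oc logs -n openshift-cluster-version deployment/cluster-version-operator > cvo.log
oc adm node-logs --role=master -u kubelet > master-kubelet.log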

So I attached the CVO log and the master/worker node logs for debugging.

Some logs related to openshift-apiserver:
# oc describe co openshift-apiserver
Name:         openshift-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T02:21:41Z
  Generation:          1
  Resource Version:    2595
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:                 ee1def8d-dfdf-45ef-b222-b95342d653f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T02:21:42Z
    Message:               EncryptionPruneControllerDegraded: daemonset.apps "apiserver" not found
EncryptionMigrationControllerDegraded: daemonset.apps "apiserver" not found
EncryptionStateControllerDegraded: daemonset.apps "apiserver" not found
ResourceSyncControllerDegraded: namespaces "openshift-apiserver" not found
EncryptionKeyControllerDegraded: daemonset.apps "apiserver" not found
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-11-29T02:21:42Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-11-29T02:21:41Z
    Reason:                NoData
    Status:                Unknown
    Type:                  Available
    Last Transition Time:  2019-11-29T02:21:42Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
...

# oc logs pod/openshift-apiserver-operator-5f7dcd8c88-lc9nf -n openshift-apiserver-operator
I1129 04:08:57.737594       1 cmd.go:188] Using service-serving-cert provided certificates
I1129 04:08:57.738165       1 observer_polling.go:136] Starting file observer
I1129 04:08:57.738273       1 observer_polling.go:97] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "ee4e4285ab6420066fac19de6bafd4e52ee8d92f6d3be1e31be188904ab35cb6", lastKnown: "ee4e4285ab6420066fac19de6bafd4e52ee8d92f6d3be1e31be188904ab35cb6")
...
W1129 04:09:27.739583       1 builder.go:181] unable to get owner reference (falling back to namespace): Get https://172.30.0.1:443/api/v1/namespaces/openshift-apiserver-operator/pods: dial tcp 172.30.0.1:443: i/o timeout
...

If a preserved cluster is needed, please contact me; I can reproduce the issue and reserve the cluster.


Version-Release number of the following components:
4.3.0-0.nightly-2019-11-27-041100

How reproducible:
Always (3/3 attempts)

Steps to Reproduce:
1. Trigger a UPI/vSphere installation with the OVN network type (QE's CI test profile; see the install-config excerpt after these steps)
2. After bootstrap completes, the image registry config needed for the storage patch never appears.
3.
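
For context, the OVN network type from step 1 corresponds to the following install-config.yaml setting (an illustrative excerpt, not the actual QE CI profile):

networking:
  networkType: OVNKubernetes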

Actual results:
Installation cannot finish.

Expected results:
Installation succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 liujia 2019-12-03 03:34:39 UTC
Created attachment 1641539 [details]
logs

Comment 2 liujia 2019-12-03 03:42:17 UTC
Hit it again on 4.3.0-0.nightly-2019-11-29-013902 when deploying a UPI/vSphere cluster with an HTTP proxy enabled.
http_proxy: "http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@139.178.76.57:3128"
https_proxy: "http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@139.178.76.57:3128"
no_proxy: "test.no-proxy.com"
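
These variables map to the cluster-wide proxy stanza in install-config.yaml, roughly as below (an illustrative excerpt with placeholder credentials, not the real ones above):

proxy:
  httpProxy: http://<user>:<password>@139.178.76.57:3128
  httpsProxy: http://<user>:<password>@139.178.76.57:3128
  noProxy: test.no-proxy.com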

[root@preserve-jliu-worker tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          77m     Working towards 4.3.0-0.nightly-2019-11-29-013902: 72% complete
[root@preserve-jliu-worker tmp]# 
[root@preserve-jliu-worker tmp]# oc get configs.imageregistry.operator.openshift.io cluster
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found
[root@preserve-jliu-worker tmp]# oc get co
NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                                                         True        False         False      74m
dns                                  4.3.0-0.nightly-2019-11-29-013902   True        False         False      66m
insights                             4.3.0-0.nightly-2019-11-29-013902   True        False         False      70m
kube-apiserver                       4.3.0-0.nightly-2019-11-29-013902   True        True          True       67m
kube-controller-manager              4.3.0-0.nightly-2019-11-29-013902   True        True          True       66m
kube-scheduler                       4.3.0-0.nightly-2019-11-29-013902   True        True          True       66m
machine-api                          4.3.0-0.nightly-2019-11-29-013902   True        False         False      67m
machine-config                       4.3.0-0.nightly-2019-11-29-013902   False       True          True       70m
network                              4.3.0-0.nightly-2019-11-29-013902   True        False         False      61m
openshift-apiserver                  4.3.0-0.nightly-2019-11-29-013902   False       False         False      67m
openshift-controller-manager                                             False       True          False      70m
operator-lifecycle-manager-catalog   4.3.0-0.nightly-2019-11-29-013902   True        False         False      67m
service-ca                           4.3.0-0.nightly-2019-11-29-013902   True        False         False      70m
[root@preserve-jliu-worker tmp]# oc get machineconfig
NAME            GENERATEDBYCONTROLLER   IGNITIONVERSION   CREATED
99-master-ssh                           2.2.0             70m
99-worker-ssh                           2.2.0             70m


[root@preserve-jliu-worker tmp]# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T06:32:57Z
  Generation:          1
  Resource Version:    16403
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 02e1ecab-cdb9-4c34-80ee-4de528c4e7e6
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T06:32:57Z
    Message:               Cluster not available for 4.3.0-0.nightly-2019-11-29-013902
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-29T06:32:57Z
    Message:               Cluster is bootstrapping 4.3.0-0.nightly-2019-11-29-013902
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-11-29T06:46:36Z
    Message:               Failed to resync 4.3.0-0.nightly-2019-11-29-013902 because: timed out waiting for the condition during waitForDeploymentRollout: Deployment machine-config-controller is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-11-29T06:46:36Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:

[root@preserve-jliu-worker tmp]# oc describe pod machine-config-controller-65d4889785-2c9kc -n openshift-machine-config-operator
Name:                 machine-config-controller-65d4889785-2c9kc
Namespace:            openshift-machine-config-operator
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=machine-config-controller
                      pod-template-hash=65d4889785
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/machine-config-controller-65d4889785
Containers:
  machine-config-controller:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a439c4a128260accac47c791bed2a318f95bdd17d93b5903ab7f8780ef99baf
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/bin/machine-config-controller
    Args:
      start
      --resourcelock-namespace=openshift-machine-config-operator
      --v=2
    Requests:
      cpu:        20m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from machine-config-controller-token-zcfn5 (ro)
Volumes:
  machine-config-controller-token-zcfn5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-config-controller-token-zcfn5
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:          <none>


# oc describe co openshift-apiserver
Name:         openshift-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T06:33:30Z
  Generation:          1
  Resource Version:    5894
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:                 079fc917-a6b4-4766-80d7-a4137f5471b5
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T06:36:19Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-11-29T06:36:39Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-11-29T06:36:26Z
    Message:               Available: no openshift-apiserver daemon pods available on any node.
    Reason:                AvailableNoAPIServerPod
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-29T06:33:31Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

Since must-gather does not work in this broken state, only partial information is provided above. I will try to reproduce again and keep the cluster for debugging.

