Bug 1717257 - [upi-vmware] Fail to finish cluster initialization after bootstrap complete
Summary: [upi-vmware] Fail to finish cluster initialization after bootstrap complete
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.z
Assignee: Abhinav Dahiya
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1702615
Blocks:
 
Reported: 2019-06-05 02:33 UTC by liujia
Modified: 2019-12-03 03:32 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1702615
Environment:
Last Closed: 2019-12-02 18:21:02 UTC
Target Upstream Version:
Embargoed:


Attachments

Comment 2 Matthew Staebler 2019-06-21 13:43:55 UTC
This has been reported as https://github.com/openshift/installer/issues/1884 by a user, too.

Comment 5 Abhinav Dahiya 2019-07-03 01:01:45 UTC
Is this issue reproducible still?

Comment 6 liujia 2019-07-03 01:25:36 UTC
(In reply to Abhinav Dahiya from comment #5)
> Is this issue reproducible still?

No, we did not hit it recently.

Comment 7 Brenton Leanhardt 2019-07-08 17:38:15 UTC
If this happens again just let us know.

Comment 8 liujia 2019-11-29 04:17:45 UTC
QE is currently hitting this issue several times in daily CI testing. The latest occurrence (3 out of 3 runs) is on 4.3.0-0.nightly-2019-11-27-041100 for a UPI/vSphere installation with the OVN network. The installation consistently fails to finish at the stage where the image registry is patched, which comes after bootstrap succeeds. After waiting 60 minutes, the image registry operator config has still not been generated.

[root@preserve-jliu-worker tmp]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                                                               True        False         False      74m
dns                                        unknown                             False       True          True       71m
insights                                   4.3.0-0.nightly-2019-11-27-041100   True        True          False      72m
kube-apiserver                             4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
kube-controller-manager                    4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
kube-scheduler                             4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
machine-api                                4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
machine-config                             4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
network                                    4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
openshift-apiserver                        4.3.0-0.nightly-2019-11-27-041100   Unknown     False         False      72m
openshift-controller-manager                                                   False       True          False      72m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-11-27-041100   True        True          False      71m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-11-27-041100   True        True          False      71m
operator-lifecycle-manager-packageserver                                       False       True          False      71m
service-ca                                 4.3.0-0.nightly-2019-11-27-041100   True        False         False      72m
[root@preserve-jliu-worker tmp]# oc get configs.imageregistry.operator.openshift.io cluster
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found
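
For reference, the post-bootstrap step that is stuck here is the image registry storage patch from the vSphere UPI flow, roughly the command below (the exact patch applied in our CI may differ); it keeps failing with the same NotFound error as long as the "cluster" config object is missing:
# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'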

In this broken state, must-gather does not work.
[root@preserve-jliu-worker tmp]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT 
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-nqxkf created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rfqgc created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-d2ft8] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
...

So I am attaching the CVO log and the master/worker node logs for debugging.
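
Since must-gather cannot run, the logs were pulled manually along these lines (a rough sketch; the node hostnames below are placeholders and the exact commands used may differ):
# oc logs -n openshift-cluster-version deployment/cluster-version-operator > cvo.log
# ssh core@<master-node> sudo journalctl > master-journal.log
# ssh core@<worker-node> sudo journalctl > worker-journal.log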

Some logs about openshift-apiserver:
# oc describe co openshift-apiserver
Name:         openshift-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T02:21:41Z
  Generation:          1
  Resource Version:    2595
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:                 ee1def8d-dfdf-45ef-b222-b95342d653f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T02:21:42Z
    Message:               EncryptionPruneControllerDegraded: daemonset.apps "apiserver" not found
EncryptionMigrationControllerDegraded: daemonset.apps "apiserver" not found
EncryptionStateControllerDegraded: daemonset.apps "apiserver" not found
ResourceSyncControllerDegraded: namespaces "openshift-apiserver" not found
EncryptionKeyControllerDegraded: daemonset.apps "apiserver" not found
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-11-29T02:21:42Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-11-29T02:21:41Z
    Reason:                NoData
    Status:                Unknown
    Type:                  Available
    Last Transition Time:  2019-11-29T02:21:42Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
...

# oc logs pod/openshift-apiserver-operator-5f7dcd8c88-lc9nf -n openshift-apiserver-operator
I1129 04:08:57.737594       1 cmd.go:188] Using service-serving-cert provided certificates
I1129 04:08:57.738165       1 observer_polling.go:136] Starting file observer
I1129 04:08:57.738273       1 observer_polling.go:97] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "ee4e4285ab6420066fac19de6bafd4e52ee8d92f6d3be1e31be188904ab35cb6", lastKnown: "ee4e4285ab6420066fac19de6bafd4e52ee8d92f6d3be1e31be188904ab35cb6")
...
W1129 04:09:27.739583       1 builder.go:181] unable to get owner reference (falling back to namespace): Get https://172.30.0.1:443/api/v1/namespaces/openshift-apiserver-operator/pods: dial tcp 172.30.0.1:443: i/o timeout
...

If a preserved cluster is needed, please contact me; I can reproduce the issue and reserve the cluster.

Comment 10 liujia 2019-11-29 08:21:35 UTC
Hit it again on 4.3.0-0.nightly-2019-11-29-013902 when deploying a UPI/vSphere cluster with an HTTP proxy enabled.
http_proxy: "http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@139.178.76.57:3128"
https_proxy: "http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@139.178.76.57:3128"
no_proxy: "test.no-proxy.com"
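
One additional check that could be run against the partially installed cluster (not captured in this run) is to confirm that the proxy settings above propagated to the cluster-wide proxy object once the API is reachable:
# oc get proxy/cluster -o yaml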

[root@preserve-jliu-worker tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          77m     Working towards 4.3.0-0.nightly-2019-11-29-013902: 72% complete
[root@preserve-jliu-worker tmp]# 
[root@preserve-jliu-worker tmp]# oc get configs.imageregistry.operator.openshift.io cluster
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found
[root@preserve-jliu-worker tmp]# oc get co
NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                                                         True        False         False      74m
dns                                  4.3.0-0.nightly-2019-11-29-013902   True        False         False      66m
insights                             4.3.0-0.nightly-2019-11-29-013902   True        False         False      70m
kube-apiserver                       4.3.0-0.nightly-2019-11-29-013902   True        True          True       67m
kube-controller-manager              4.3.0-0.nightly-2019-11-29-013902   True        True          True       66m
kube-scheduler                       4.3.0-0.nightly-2019-11-29-013902   True        True          True       66m
machine-api                          4.3.0-0.nightly-2019-11-29-013902   True        False         False      67m
machine-config                       4.3.0-0.nightly-2019-11-29-013902   False       True          True       70m
network                              4.3.0-0.nightly-2019-11-29-013902   True        False         False      61m
openshift-apiserver                  4.3.0-0.nightly-2019-11-29-013902   False       False         False      67m
openshift-controller-manager                                             False       True          False      70m
operator-lifecycle-manager-catalog   4.3.0-0.nightly-2019-11-29-013902   True        False         False      67m
service-ca                           4.3.0-0.nightly-2019-11-29-013902   True        False         False      70m
[root@preserve-jliu-worker tmp]# oc get machineconfig
NAME            GENERATEDBYCONTROLLER   IGNITIONVERSION   CREATED
99-master-ssh                           2.2.0             70m
99-worker-ssh                           2.2.0             70m


[root@preserve-jliu-worker tmp]# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T06:32:57Z
  Generation:          1
  Resource Version:    16403
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 02e1ecab-cdb9-4c34-80ee-4de528c4e7e6
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T06:32:57Z
    Message:               Cluster not available for 4.3.0-0.nightly-2019-11-29-013902
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-29T06:32:57Z
    Message:               Cluster is bootstrapping 4.3.0-0.nightly-2019-11-29-013902
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-11-29T06:46:36Z
    Message:               Failed to resync 4.3.0-0.nightly-2019-11-29-013902 because: timed out waiting for the condition during waitForDeploymentRollout: Deployment machine-config-controller is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-11-29T06:46:36Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:

[root@preserve-jliu-worker tmp]# oc describe pod machine-config-controller-65d4889785-2c9kc -n openshift-machine-config-operator
Name:                 machine-config-controller-65d4889785-2c9kc
Namespace:            openshift-machine-config-operator
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=machine-config-controller
                      pod-template-hash=65d4889785
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/machine-config-controller-65d4889785
Containers:
  machine-config-controller:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a439c4a128260accac47c791bed2a318f95bdd17d93b5903ab7f8780ef99baf
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/bin/machine-config-controller
    Args:
      start
      --resourcelock-namespace=openshift-machine-config-operator
      --v=2
    Requests:
      cpu:        20m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from machine-config-controller-token-zcfn5 (ro)
Volumes:
  machine-config-controller-token-zcfn5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-config-controller-token-zcfn5
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:          <none>


# oc describe co openshift-apiserver
Name:         openshift-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T06:33:30Z
  Generation:          1
  Resource Version:    5894
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:                 079fc917-a6b4-4766-80d7-a4137f5471b5
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T06:36:19Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-11-29T06:36:39Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-11-29T06:36:26Z
    Message:               Available: no openshift-apiserver daemon pods available on any node.
    Reason:                AvailableNoAPIServerPod
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-29T06:33:31Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

Comment 11 Scott Dodson 2019-12-02 18:21:02 UTC
All previously reported failures here are different from one another. Rather than re-opening this one, please create a new bug with a complete set of standard Installer debugging data and we'll look into that.

Comment 12 liujia 2019-12-03 03:22:40 UTC
> All previously reported failures here are different from one another. Rather than re-opening this one, please create a new bug with a complete set of standard Installer debugging data and we'll look into that.
This bug (bz1717257) is only meant to track one issue: "no image registry config generated after bootstrap completes". There are not many previous failures here, only the one that was split out from bug #1702615. Since it was not 100% reproducible in v4.1 and did not reproduce in v4.2, the bug was closed as INSUFFICIENT_DATA against v4.2. Now in v4.3, starting from 4.3.0-0.nightly-2019-11-27-041100, we always hit it, so the bug was reopened for the same issue. If it is not convenient to track the same issue in this bug, we can open a new bug to track it and restore this one to the correct status.

Comment 13 liujia 2019-12-03 03:32:28 UTC
Tracked the v4.3 issue in https://bugzilla.redhat.com/show_bug.cgi?id=1779005.

