Description of problem:

The installation can never finish: it always gets stuck at the stage where the image registry is patched, right after bootstrap succeeds. Even after waiting 60 minutes, the image registry operator's config resource has still not been created.

[root@preserve-jliu-worker tmp]# oc get co
NAME                                        VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                                                                True        False         False      74m
dns                                         unknown                             False       True          True       71m
insights                                    4.3.0-0.nightly-2019-11-27-041100   True        True          False      72m
kube-apiserver                              4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
kube-controller-manager                     4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
kube-scheduler                              4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
machine-api                                 4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
machine-config                              4.3.0-0.nightly-2019-11-27-041100   False       True          False      72m
network                                     4.3.0-0.nightly-2019-11-27-041100   True        False         False      71m
openshift-apiserver                         4.3.0-0.nightly-2019-11-27-041100   Unknown     False         False      72m
openshift-controller-manager                                                    False       True          False      72m
operator-lifecycle-manager                  4.3.0-0.nightly-2019-11-27-041100   True        True          False      71m
operator-lifecycle-manager-catalog          4.3.0-0.nightly-2019-11-27-041100   True        True          False      71m
operator-lifecycle-manager-packageserver                                        False       True          False      71m
service-ca                                  4.3.0-0.nightly-2019-11-27-041100   True        False         False      72m

[root@preserve-jliu-worker tmp]# oc get configs.imageregistry.operator.openshift.io cluster
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

In this broken state, must-gather cannot work:

[root@preserve-jliu-worker tmp]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-nqxkf created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rfqgc created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-d2ft8] POD Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
...

So I have attached the CVO log and the master/worker node logs for debugging.

Some logs about openshift-apiserver:

# oc describe co openshift-apiserver
Name:         openshift-apiserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T02:21:41Z
  Generation:          1
  Resource Version:    2595
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:                 ee1def8d-dfdf-45ef-b222-b95342d653f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T02:21:42Z
    Message:               EncryptionPruneControllerDegraded: daemonset.apps "apiserver" not found
                           EncryptionMigrationControllerDegraded: daemonset.apps "apiserver" not found
                           EncryptionStateControllerDegraded: daemonset.apps "apiserver" not found
                           ResourceSyncControllerDegraded: namespaces "openshift-apiserver" not found
                           EncryptionKeyControllerDegraded: daemonset.apps "apiserver" not found
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-11-29T02:21:42Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-11-29T02:21:41Z
    Reason:                NoData
    Status:                Unknown
    Type:                  Available
    Last Transition Time:  2019-11-29T02:21:42Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
...
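Since must-gather cannot run in this state, the data in this report was presumably collected straight through the external API endpoint, which still answers even though the in-cluster service IP does not. A minimal sketch of that kind of manual collection, in case it helps triage (these commands are not from the original run):

# Not from the original run -- a sketch of manual collection, relying on the
# external API endpoint still answering while 172.30.0.1 does not:
oc get clusterversion version -o yaml > clusterversion.yaml
oc get clusteroperators -o yaml > clusteroperators.yaml

# Cluster-version-operator log:
oc logs -n openshift-cluster-version deployment/cluster-version-operator > cvo.log

# Kubelet journals pulled straight from the nodes, bypassing pod networking:
oc adm node-logs --role=master -u kubelet > kubelet-master.log

The openshift-apiserver-operator log, gathered the same way: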
# oc logs pod/openshift-apiserver-operator-5f7dcd8c88-lc9nf -n openshift-apiserver-operator
I1129 04:08:57.737594       1 cmd.go:188] Using service-serving-cert provided certificates
I1129 04:08:57.738165       1 observer_polling.go:136] Starting file observer
I1129 04:08:57.738273       1 observer_polling.go:97] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "ee4e4285ab6420066fac19de6bafd4e52ee8d92f6d3be1e31be188904ab35cb6", lastKnown: "ee4e4285ab6420066fac19de6bafd4e52ee8d92f6d3be1e31be188904ab35cb6")
...
W1129 04:09:27.739583       1 builder.go:181] unable to get owner reference (falling back to namespace): Get https://172.30.0.1:443/api/v1/namespaces/openshift-apiserver-operator/pods: dial tcp 172.30.0.1:443: i/o timeout
...

If a preserved cluster is needed, please contact me; I can reproduce the issue and reserve the cluster.

Version-Release number of the following components:
4.3.0-0.nightly-2019-11-27-041100

How reproducible:
Always (hit 3 times out of 3)

Steps to Reproduce:
1. Trigger a UPI/vSphere installation with the OVN network (QE's CI test profile).
2. After bootstrap completes, there is no image registry config to patch with storage.

Actual results:
The installation cannot finish.

Expected results:
The installation succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
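The repeated "dial tcp 172.30.0.1:443: i/o timeout" against the in-cluster kubernetes service VIP points at the pod-to-service path on the OVN network rather than at the apiserver itself. For the next reproduce, a rough connectivity check along these lines might narrow it down (a sketch only; the angle-bracket names are placeholders, not objects from this cluster):

# Placeholders, not from this cluster: <ovnkube-node-pod>, <master-node-name>.
# Are the OVN pods healthy?
oc get pods -n openshift-ovn-kubernetes -o wide

# Logs from the per-node OVN pod on an affected master:
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> --all-containers --tail=100

# Can the master host reach the kubernetes service VIP at all?
oc debug node/<master-node-name> -- chroot /host curl -k -m 5 https://172.30.0.1:443/healthz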
Created attachment 1641539 [details] logs
Hit this again on 4.3.0-0.nightly-2019-11-29-013902 when deploying a UPI/vSphere cluster with an HTTP proxy enabled:

http_proxy: "http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@139.178.76.57:3128"
https_proxy: "http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@139.178.76.57:3128"
no_proxy: "test.no-proxy.com"

[root@preserve-jliu-worker tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          77m     Working towards 4.3.0-0.nightly-2019-11-29-013902: 72% complete

[root@preserve-jliu-worker tmp]# oc get configs.imageregistry.operator.openshift.io cluster
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

[root@preserve-jliu-worker tmp]# oc get co
NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                                                         True        False         False      74m
dns                                  4.3.0-0.nightly-2019-11-29-013902   True        False         False      66m
insights                             4.3.0-0.nightly-2019-11-29-013902   True        False         False      70m
kube-apiserver                       4.3.0-0.nightly-2019-11-29-013902   True        True          True       67m
kube-controller-manager              4.3.0-0.nightly-2019-11-29-013902   True        True          True       66m
kube-scheduler                       4.3.0-0.nightly-2019-11-29-013902   True        True          True       66m
machine-api                          4.3.0-0.nightly-2019-11-29-013902   True        False         False      67m
machine-config                       4.3.0-0.nightly-2019-11-29-013902   False       True          True       70m
network                              4.3.0-0.nightly-2019-11-29-013902   True        False         False      61m
openshift-apiserver                  4.3.0-0.nightly-2019-11-29-013902   False       False         False      67m
openshift-controller-manager                                             False       True          False      70m
operator-lifecycle-manager-catalog   4.3.0-0.nightly-2019-11-29-013902   True        False         False      67m
service-ca                           4.3.0-0.nightly-2019-11-29-013902   True        False         False      70m

[root@preserve-jliu-worker tmp]# oc get machineconfig
NAME            GENERATEDBYCONTROLLER   IGNITIONVERSION   CREATED
99-master-ssh                           2.2.0             70m
99-worker-ssh                           2.2.0             70m
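Only the 99-*-ssh machineconfigs exist; no rendered configs were generated. With the proxy enabled, one thing worth ruling out first is in-cluster traffic being sent through the proxy. The installer should append the cluster and service networks to noProxy automatically, but it is easy to verify (a sketch, not output from this cluster):

# Not part of the original report: check that in-cluster traffic is excluded
# from the proxy. status.noProxy should include the service network
# (172.30.0.0/16) and the cluster network, so 172.30.0.1:443 is never proxied.
oc get proxy/cluster -o yaml

# What an operator pod actually sees in its environment:
oc set env -n openshift-apiserver-operator deployment/openshift-apiserver-operator --list | grep -i proxy

The machine-config operator itself reports: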
[root@preserve-jliu-worker tmp]# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T06:32:57Z
  Generation:          1
  Resource Version:    16403
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 02e1ecab-cdb9-4c34-80ee-4de528c4e7e6
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T06:32:57Z
    Message:               Cluster not available for 4.3.0-0.nightly-2019-11-29-013902
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-29T06:32:57Z
    Message:               Cluster is bootstrapping 4.3.0-0.nightly-2019-11-29-013902
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-11-29T06:46:36Z
    Message:               Failed to resync 4.3.0-0.nightly-2019-11-29-013902 because: timed out waiting for the condition during waitForDeploymentRollout: Deployment machine-config-controller is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-11-29T06:46:36Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:

[root@preserve-jliu-worker tmp]# oc describe pod machine-config-controller-65d4889785-2c9kc -n openshift-machine-config-operator
Name:                 machine-config-controller-65d4889785-2c9kc
Namespace:            openshift-machine-config-operator
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=machine-config-controller
                      pod-template-hash=65d4889785
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/machine-config-controller-65d4889785
Containers:
  machine-config-controller:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a439c4a128260accac47c791bed2a318f95bdd17d93b5903ab7f8780ef99baf
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/bin/machine-config-controller
    Args:
      start
      --resourcelock-namespace=openshift-machine-config-operator
      --v=2
    Requests:
      cpu:        20m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from machine-config-controller-token-zcfn5 (ro)
Volumes:
  machine-config-controller-token-zcfn5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-config-controller-token-zcfn5
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:          <none>

# oc describe co openshift-apiserver
Name:         openshift-apiserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-29T06:33:30Z
  Generation:          1
  Resource Version:    5894
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:                 079fc917-a6b4-4766-80d7-a4137f5471b5
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-29T06:36:19Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-11-29T06:36:39Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-11-29T06:36:26Z
    Message:               Available: no openshift-apiserver daemon pods available on any node.
    Reason:                AvailableNoAPIServerPod
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-29T06:33:31Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

Since must-gather does not work in this broken state, only partial info is provided above. I will try to set up another reproducer and keep the cluster for debugging.
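Once that reproducer is up, the first thing to check for the Pending machine-config-controller pod (Node: <none>, no events at all) is node readiness and scheduling; a sketch (the angle-bracket names are placeholders, not from this cluster):

# Placeholders, not from this cluster: <master-node-name>, <scheduler-pod>.
# A Pending pod with no node assigned and no events usually means no
# schedulable node matched its node selector and tolerations:
oc get nodes -o wide
oc describe node <master-node-name> | grep -A 5 Taints

# Scheduler pods and logs, in case the scheduler itself is unhealthy
# (kube-scheduler reported Degraded=True above):
oc get pods -n openshift-kube-scheduler
oc logs -n openshift-kube-scheduler <scheduler-pod> --tail=50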