Bug 1702615
| Summary: | [upi-vmware] Fail to finish cluster initialization after bootstrap complete |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Installer |
| Installer sub component: | openshift-installer |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 4.1.0 |
| Target Release: | 4.2.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | liujia <jiajliu> |
| Assignee: | Matthew Staebler <mstaeble> |
| QA Contact: | liujia <jiajliu> |
| CC: | bleanhar, jima, sponnaga |
| Type: | Bug |
| Bug Blocks: | 1717257 (view as bug list) |
| Last Closed: | 2019-06-04 10:47:56 UTC |
Description (liujia, 2019-04-24 09:29:22 UTC)
I'm assuming Scott didn't intend to change the QA contact.

Hit it again based on the same payload, with the same steps as in the description. The difference this time is that emptyDir storage was added successfully before running "openshift-install wait-for install-complete", but the installation still failed.

# ./openshift-install wait-for bootstrap-complete
INFO Waiting up to 30m0s for the Kubernetes API at https://api.jliu-demo.qe.devcluster.openshift.com:6443...
INFO API v1.13.4+81fc896 up
INFO Waiting up to 30m0s for bootstrapping to complete...
INFO It is now safe to remove the bootstrap resources

# ./terraform apply -auto-approve -var 'bootstrap_complete=true'

# oc get csr --no-headers | grep -i pending | awk '{print $1}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-62pdr approved
certificatesigningrequest.certificates.k8s.io/csr-7ln8w approved
certificatesigningrequest.certificates.k8s.io/csr-9242m approved
certificatesigningrequest.certificates.k8s.io/csr-bmxqh approved
certificatesigningrequest.certificates.k8s.io/csr-jhkx4 approved
certificatesigningrequest.certificates.k8s.io/csr-jlcj4 approved
certificatesigningrequest.certificates.k8s.io/csr-jtqnf approved
certificatesigningrequest.certificates.k8s.io/csr-lm4zg approved
certificatesigningrequest.certificates.k8s.io/csr-m7l2s approved
certificatesigningrequest.certificates.k8s.io/csr-nr94d approved
certificatesigningrequest.certificates.k8s.io/csr-pqz76 approved
certificatesigningrequest.certificates.k8s.io/csr-w7mbc approved

# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
config.imageregistry.operator.openshift.io/cluster patched

# ./openshift-install wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.jliu-demo.qe.devcluster.openshift.com:6443 to initialize...
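As a side note for anyone reproducing this: a single pass of CSR approval is often not enough, because each node first submits a client CSR and, once that is approved, a serving CSR. A rough sketch of a loop that keeps approving until nothing is pending (the go-template filter for unapproved CSRs follows the upstream UPI docs; the pass count and sleep interval are arbitrary assumptions):

    # Approve CSRs that have no status yet (i.e. pending). Run several
    # passes: the serving CSR only appears after the client CSR is approved.
    for i in 1 2 3; do
        oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
            | xargs --no-run-if-empty oc adm certificate approve
        sleep 30
    done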
FATAL failed to initialize the cluster: Cluster operator image-registry is still updating: timed out waiting for the condition

Checked the cluster status:

# oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.1.0-rc.0   True        False         False     15h
cloud-credential                     4.1.0-rc.0   True        False         False     16h
cluster-autoscaler                   4.1.0-rc.0   True        False         False     16h
console                              4.1.0-rc.0   True        False         False     15h
dns                                  4.1.0-rc.0   True        False         False     16h
image-registry                                    False       True          True      16h
ingress                              4.1.0-rc.0   True        False         False     16h
kube-apiserver                       4.1.0-rc.0   True        False                   16h
kube-controller-manager              4.1.0-rc.0   True        False                   16h
kube-scheduler                       4.1.0-rc.0   True        False                   16h
machine-api                          4.1.0-rc.0   True        False         False     16h
machine-config                       4.1.0-rc.0   True        False         False     16h
marketplace                          4.1.0-rc.0   True        False         False     16h
monitoring                           4.1.0-rc.0   True        False         False     15h
network                              4.1.0-rc.0   True        False                   16h
node-tuning                          4.1.0-rc.0   True        False         False     16h
openshift-apiserver                  4.1.0-rc.0   True        False                   16h
openshift-controller-manager         4.1.0-rc.0   True        False                   16h
openshift-samples                    4.1.0-rc.0   True        False         False     16h
operator-lifecycle-manager           4.1.0-rc.0   True        False         False     16h
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        False         False     16h
service-ca                           4.1.0-rc.0   True        False         False     16h
service-catalog-apiserver            4.1.0-rc.0   True        False         False     15h
service-catalog-controller-manager   4.1.0-rc.0   True        False         False     15h
storage                              4.1.0-rc.0   True        False         False     16h

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          16h     Unable to apply 4.1.0-rc.0: the cluster operator image-registry has not yet successfully rolled out

# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
config.imageregistry.operator.openshift.io/cluster patched (no change)

# oc get config.imageregistry.operator.openshift.io/cluster -o json | jq ".spec.storage"
{
  "filesystem": {
    "volumeSource": {
      "emptyDir": {}
    }
  }
}

That is not the correct command to run to set the storage for the image registry. Use:

# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'

See https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md#updating-image-registry-to-emptydir-storage-backend

The original failure captured in this bug does not appear to be related to the image registry. From the output, it looks like there were widespread failures across the operators. The latest failure looks to be just the image registry not being configured correctly.

(In reply to Matthew Staebler from comment #4)
> That is not the correct command to run to set the storage for image registry.
>
> # oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
>
> See https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md#updating-image-registry-to-emptydir-storage-backend

Thanks for pointing that out. I pasted the command from doc[1] in pr[2].

[1] http://file.rdu.redhat.com/kalexand/0418/osdocs324/installing/installing_vsphere/installing-vsphere.html
[2] https://github.com/openshift/openshift-docs/pull/14523

I think this may be the key point for the issue in comment 3. I will try it again, and if it works, I will change this to a doc issue. Thanks. But for the issue in the description, I don't think it is related to the wrong command, since image-registry never shows up in the operator list at all.
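To make the difference between the two patches concrete, here is a quick sketch that applies the documented form and confirms what the operator actually recorded (the jq inspection mirrors the one already used above):

    # Documented patch: "emptyDir" goes directly under .spec.storage.
    oc patch configs.imageregistry.operator.openshift.io cluster \
        --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'

    # This should now print {"emptyDir": {}}, not the
    # "filesystem.volumeSource" nesting produced by the incorrect patch.
    oc get configs.imageregistry.operator.openshift.io/cluster -o json | jq '.spec.storage'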
And the command to add emptyDir storage cannot succeed at all:

# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

Additional info: On failure, the image-registry operator is still not available.

# oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     89m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     89m
dns                                  4.1.0-rc.0   True        False         False     87m
kube-apiserver                       4.1.0-rc.0   False       True                    88m
kube-controller-manager              4.1.0-rc.0   True        True                    86m
kube-scheduler                       4.1.0-rc.0   False       True                    89m
machine-api                          4.1.0-rc.0   True        False         False     89m
machine-config                       4.1.0-rc.0   True        False         False     88m
network                              4.1.0-rc.0   True        False                   89m
openshift-apiserver                  4.1.0-rc.0   Unknown     Unknown                 88m
openshift-controller-manager         4.1.0-rc.0   True        True                    87m
operator-lifecycle-manager           4.1.0-rc.0   True        True          False     85m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        True          False     85m
service-ca                           4.1.0-rc.0   True        False         False     88m

About the issue in comment 3: it is due to the wrong command, so there is no need to track it here. I have added a comment at https://github.com/openshift/openshift-docs/pull/14523#issuecomment-486517176.

About the issue in the description: tried again on two builds that should be the same. On 4.1.0-0.nightly-2019-04-22-005054 it succeeded. On 4.1.0-rc.0 it still failed with the same error. After "./openshift-install wait-for bootstrap-complete" succeeds, all nodes stay unschedulable (Ready,SchedulingDisabled), which means many operators/pods cannot be scheduled, so the emptyDir storage cannot be added.

[root@preserve-jliu-worker 20190424]# oc get node
NAME              STATUS                     ROLES    AGE    VERSION
compute-0         Ready,SchedulingDisabled   worker   112m   v1.13.4+0e917197e
compute-1         Ready,SchedulingDisabled   worker   112m   v1.13.4+0e917197e
compute-2         Ready,SchedulingDisabled   worker   112m   v1.13.4+0e917197e
control-plane-0   Ready,SchedulingDisabled   master   112m   v1.13.4+0e917197e
control-plane-1   Ready,SchedulingDisabled   master   112m   v1.13.4+0e917197e
control-plane-2   Ready,SchedulingDisabled   master   112m   v1.13.4+0e917197e

[root@preserve-jliu-worker 20190424]# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

[root@preserve-jliu-worker 20190424]# ./oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     112m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     112m
dns                                  4.1.0-rc.0   True        False         False     112m
kube-apiserver                       4.1.0-rc.0   True        True                    111m
kube-controller-manager              4.1.0-rc.0   False       True                    112m
kube-scheduler                       4.1.0-rc.0   False       True                    112m
machine-api                          4.1.0-rc.0   True        False         False     112m
machine-config                       4.1.0-rc.0   True        False         False     112m
network                              4.1.0-rc.0   True        False                   113m
openshift-apiserver                  4.1.0-rc.0   False       False                   111m
openshift-controller-manager         4.1.0-rc.0   True        True                    111m
operator-lifecycle-manager           4.1.0-rc.0   True        True          False     110m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        True          False     110m
service-ca                           4.1.0-rc.0   True        False         False     112m

[root@preserve-jliu-worker 20190424]# ./oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          114m    Unable to apply 4.1.0-rc.0: an unknown error has occurred
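The NotFound errors above are expected until the image-registry operator itself has created its config resource, so a bare patch is racy. A hedged sketch of waiting for the resource before patching; note that this only helps when the operator does eventually come up, which was not the case here (interval and message are arbitrary assumptions):

    # The cluster config object is created by the image-registry operator,
    # so poll for it before patching.
    until oc get configs.imageregistry.operator.openshift.io cluster >/dev/null 2>&1; do
        echo "waiting for image-registry config resource..."
        sleep 30
    done
    oc patch configs.imageregistry.operator.openshift.io cluster \
        --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'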
[root@preserve-jliu-worker 20190424]# ./oc describe co openshift-apiserver | grep Message
Message: Available: no openshift-apiserver daemon pods available on any node.

[root@preserve-jliu-worker 20190424]# ./oc describe co kube-scheduler | grep Message
Message: StaticPodsDegraded: nodes/control-plane-1 pods/openshift-kube-scheduler-control-plane-1 container="scheduler" is not ready
Message: Progressing: 3 nodes are at revision 0
Message: Available: 0 nodes are active; 3 nodes are at revision 0

[root@preserve-jliu-worker 20190424]# ./oc describe co kube-controller-manager | grep Message
Message: StaticPodsDegraded: nodes/control-plane-2 pods/kube-controller-manager-control-plane-2 container="kube-controller-manager-2" is not ready
Message: Progressing: 3 nodes are at revision 0
Message: Available: 0 nodes are active; 3 nodes are at revision 0

# oc get events
LAST SEEN   TYPE     REASON           OBJECT                 MESSAGE
124m        Normal   RegisteredNode   node/compute-0         Node compute-0 event: Registered Node compute-0 in Controller
124m        Normal   Starting         node/compute-0         Starting openshift-sdn.
119m        Normal   Drain            node/compute-0         Draining node to update config.
118m        Normal   Reboot           node/compute-0         Node will reboot into config rendered-worker-ca38126bd905bba006bb3127d588ed8c
124m        Normal   RegisteredNode   node/compute-1         Node compute-1 event: Registered Node compute-1 in Controller
123m        Normal   Starting         node/compute-1         Starting openshift-sdn.
119m        Normal   Drain            node/compute-1         Draining node to update config.
118m        Normal   Reboot           node/compute-1         Node will reboot into config rendered-worker-ca38126bd905bba006bb3127d588ed8c
124m        Normal   RegisteredNode   node/compute-2         Node compute-2 event: Registered Node compute-2 in Controller
124m        Normal   Starting         node/compute-2         Starting openshift-sdn.
119m        Normal   Drain            node/compute-2         Draining node to update config.
118m        Normal   Reboot           node/compute-2         Node will reboot into config rendered-worker-ca38126bd905bba006bb3127d588ed8c
124m        Normal   RegisteredNode   node/control-plane-0   Node control-plane-0 event: Registered Node control-plane-0 in Controller
122m        Normal   Starting         node/control-plane-0   Starting openshift-sdn.
122m        Normal   NodeReady        node/control-plane-0   Node control-plane-0 status is now: NodeReady
119m        Normal   Drain            node/control-plane-0   Draining node to update config.
124m        Normal   RegisteredNode   node/control-plane-1   Node control-plane-1 event: Registered Node control-plane-1 in Controller
123m        Normal   Starting         node/control-plane-1   Starting openshift-sdn.
123m        Normal   NodeReady        node/control-plane-1   Node control-plane-1 status is now: NodeReady
119m        Normal   Drain            node/control-plane-1   Draining node to update config.
124m        Normal   RegisteredNode   node/control-plane-2   Node control-plane-2 event: Registered Node control-plane-2 in Controller
123m        Normal   Starting         node/control-plane-2   Starting openshift-sdn.
123m        Normal   NodeReady        node/control-plane-2   Node control-plane-2 status is now: NodeReady
119m        Normal   Drain            node/control-plane-2   Draining node to update config.
122m        Normal   Status upgrade   clusteroperator/machine-api   Progressing towards operator: 4.1.0-rc.0

[root@compute-0 core]# journalctl --no-page -u kubelet.service -n 200
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.588051 894 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.589071 894 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.589566 894 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.659826 894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.760019 894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.860197 894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:02 compute-0 hyperkube[894]: I0425 05:17:02.927966 894 kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
Apr 25 05:17:02 compute-0 hyperkube[894]: I0425 05:17:02.928917 894 kubelet_node_status.go:72] Attempting to register node compute-0
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.930043 894 kubelet_node_status.go:94] Unable to register node "compute-0" with API server: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.960413 894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:03 compute-0 hyperkube[894]: E0425 05:17:03.060618 894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:03 compute-0 hyperkube[894]: E0425 05:17:03.160818 894 kubelet.go:2273] node "compute-0" not found

[root@preserve-jliu-worker 20190424]# oc describe co kube-apiserver
Name:         kube-apiserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-04-25T02:58:21Z
  Generation:          1
  Resource Version:    6402
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 fc42aed2-6705-11e9-be5d-0050568b6e23
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-04-25T02:59:32Z
    Message:               StaticPodsDegraded: pods "kube-apiserver-control-plane-2" not found
    Reason:                StaticPodsDegradedError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-04-25T02:59:04Z
    Message:               Progressing: 2 nodes are at revision 1; 1 nodes are at revision 0
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-04-25T02:59:32Z
    Message:               Available: 2 nodes are active; 2 nodes are at revision 1; 1 nodes are at revision 0
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-04-25T02:58:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.1.0-rc.0
    Name:     kube-apiserver
    Version:  1.13.4
Events:  <none>
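On the kubelet side, the Unauthorized errors in the journal above usually mean the node's client certificate was never issued (its CSR is still pending) or has expired. A minimal diagnostic sketch for checking both ends; the PEM path is the upstream kubelet default and is an assumption here:

    # From the affected node: inspect the kubelet client certificate, if any.
    openssl x509 -noout -subject -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem

    # From a working client: look for CSRs still waiting for approval.
    oc get csr | grep -i pending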
[root@preserve-jliu-worker 20190424]# oc get all -n openshift-kube-apiserver-operator
NAME                                           READY   STATUS        RESTARTS   AGE
pod/kube-apiserver-operator-7fccf67d7b-6qcdm   1/1     Terminating   1          145m

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/metrics   ClusterIP   172.30.54.46   <none>        443/TCP   145m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kube-apiserver-operator   1/1     1            1           145m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/kube-apiserver-operator-7fccf67d7b   1         1         1       145m

[root@preserve-jliu-worker 20190424]# oc describe pod/kube-apiserver-operator-7fccf67d7b-6qcdm -n openshift-kube-apiserver-operator
Name:               kube-apiserver-operator-7fccf67d7b-6qcdm
Namespace:          openshift-kube-apiserver-operator
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               control-plane-1/139.178.76.10
Start Time:         Thu, 25 Apr 2019 02:56:51 +0000
Labels:             app=kube-apiserver-operator
                    pod-template-hash=7fccf67d7b
Annotations:        k8s.v1.cni.cncf.io/networks-status:
                      [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.131.0.2" ], "default": true, "dns": {} }]
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  600s
IP:                        10.131.0.2
Controlled By:             ReplicaSet/kube-apiserver-operator-7fccf67d7b
Containers:
  operator:
    Container ID:  cri-o://fc13234debb2a875346ac38f140f98441734ea1b77acfaae23f582a6ed5b2c0f
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b572a949d8b36574b020056aca2dc8c902c302068bae3924567a674396daeaf
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b572a949d8b36574b020056aca2dc8c902c302068bae3924567a674396daeaf
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      cluster-kube-apiserver-operator
      operator
    Args:
      --config=/var/run/configmaps/config/config.yaml
      -v=2
    State:          Running
      Started:      Thu, 25 Apr 2019 02:59:41 +0000
    Last State:     Terminated
      Reason:       Error
      Message:      59 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c6615cff-6705-11e9-be5d-0050568b6e23", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ServiceMonitorCreateFailed' Failed to create ServiceMonitor.monitoring.coreos.com/v1: the server could not find the requested resource
I0425 02:59:33.682882 1 installer_controller.go:337] "control-plane-0" needs new revision 1
I0425 02:59:33.683006 1 installer_controller.go:346] "control-plane-0" moving to (v1.NodeStatus) {
  NodeName: (string) (len=15) "control-plane-0",
  CurrentRevision: (int32) 0,
  TargetRevision: (int32) 1,
  LastFailedRevision: (int32) 0,
  LastFailedRevisionErrors: ([]string) <nil>
}
I0425 02:59:34.885744 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c6615cff-6705-11e9-be5d-0050568b6e23", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-1-control-plane-0 -n openshift-kube-apiserver because it was missing
I0425 02:59:35.483424 1 installer_controller.go:318] "control-plane-0" is in transition to 1, but has not made progress
I0425 02:59:36.683225 1 installer_controller.go:318] "control-plane-0" is in transition to 1, but has not made progress
I0425 02:59:40.874900 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "2c138ee47b915e13d56fc84f84b4ee8c149083a4292a6aaa386f63ddc1001276", lastKnown: "")
W0425 02:59:40.875365 1 builder.go:108] Restart triggered because of file /var/run/secrets/serving-cert/tls.key was created
I0425 02:59:40.875470 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "29bca5782263e753c31f017607943d281f1f40866e6eed79fb9e69310d0e10c0", lastKnown: "")
F0425 02:59:40.875489 1 builder.go:217] server exited

      Exit Code:    255
      Started:      Thu, 25 Apr 2019 02:58:20 +0000
      Finished:     Thu, 25 Apr 2019 02:59:40 +0000
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      IMAGE:                   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1883673bcfead0a959ca9de08df3cf017f507ebd14d287b0045edba6ad7a3c24
      OPERATOR_IMAGE:          quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b572a949d8b36574b020056aca2dc8c902c302068bae3924567a674396daeaf
      OPERAND_IMAGE_VERSION:   1.13.4
      OPERATOR_IMAGE_VERSION:  4.1.0-rc.0
      POD_NAME:                kube-apiserver-operator-7fccf67d7b-6qcdm (v1:metadata.name)
    Mounts:
      /var/run/configmaps/config from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-apiserver-operator-token-22z8t (ro)
      /var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-apiserver-operator-serving-cert
    Optional:    true
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-apiserver-operator-config
    Optional:  false
  kube-apiserver-operator-token-22z8t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-apiserver-operator-token-22z8t
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason           Age                   From                      Message
  ----     ------           ----                  ----                      -------
  Normal   Scheduled        153m                  default-scheduler         Successfully assigned openshift-kube-apiserver-operator/kube-apiserver-operator-7fccf67d7b-6qcdm to control-plane-1
  Warning  NetworkNotReady  152m (x25 over 153m)  kubelet, control-plane-1  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]

[root@preserve-jliu-worker 20190424]# oc get all -n openshift-kube-apiserver
NAME                                 READY   STATUS    RESTARTS   AGE
pod/kube-apiserver-control-plane-0   2/2     Running   0          142m
pod/kube-apiserver-control-plane-1   2/2     Running   0          143m
NAME                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/apiserver   ClusterIP   172.30.28.13   <none>        443/TCP   143m

With 4.1.0-0.ci-2019-04-29-123411, I am no longer able to reproduce this bug.

Still hit it on 4.1.0-0.nightly-2019-04-28-064010.

Tried 4.1.0-0.nightly-2019-04-29-235412 and hit bz1699456. Will try again when a newer build is available.

Did not hit it on 4.1.0-0.nightly-2019-05-04-210601, and the installation for upi/vmware succeeded. Since this bug does not reproduce 100% of the time, verifying it now; if it shows up again in following builds, it will be reopened.

The `oc patch configs.imageregistry.operator.openshift.io cluster` command will fail until the resource has been created by the image-registry operator, which is the error that you had when attempting the patch. I will open a separate bug to track adding recommendations to the user to reattempt the patch if it fails.

As for the latest failure, the `oc get co` command was from 5 minutes after the operators were created. It takes longer than that for the installation to complete, so it is not unexpected that there are failing operators at that point. The `oc get clusterversion` was from 78 minutes after the operators were created. If the storage backend for the image registry was never patched, then it is not unexpected that the clusterversion is still progressing.

The information provided is not sufficient for me to reason that this is a re-occurrence of the original issue for this bug. In the original issue, the kube-apiserver pods were not running, and this was not related to whether the storage backend was configured for the image registry.

Here is the bug to track the documentation change for re-attempting setting the image registry storage backend: https://bugzilla.redhat.com/show_bug.cgi?id=1716427

Hi Matthew,

> The `oc patch configs.imageregistry.operator.openshift.io cluster` command will fail until the resource has been created by the image-registry operator, which is the error that you had when attempting the patch.

I agree with this, but the issue is not why the command fails; it is why the installation cannot finish. None of the cluster operators are running well, and the cluster version status hangs at 81%.

> I will open a separate bug to track adding recommendations to the user to reattempt the patch if it fails.

This is a good recommendation for users to reattempt the patch command if it fails on the first try. But as for this issue, retrying cannot resolve the problem, because the image-registry operator was not running at all, even after several hours.
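When the control-plane operators report nodes stuck at revision 0 like this, it may also help to capture the static-pod rollout state directly. A sketch using the kubeapiservers.operator.openshift.io "cluster" resource listed in the Related Objects output earlier; the jsonpath field names assume the 4.1-era operator API:

    # Per-node static pod revision status for the kube-apiserver.
    oc get kubeapiserver cluster -o jsonpath='{.status.nodeStatuses}'

    # Installer pods push each new revision onto the masters; a failed or
    # missing installer pod usually explains nodes stuck at revision 0.
    oc get pods -n openshift-kube-apiserver | grep installer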
I kept the cluster for the whole night without any extra operations and checked it now; it is still in a broken status.

[root@preserve-jliu-worker 20190603_2784]# ./oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   20h   v1.13.4+cb455d664
control-plane-0   Ready    master   20h   v1.13.4+cb455d664
control-plane-1   Ready    master   20h   v1.13.4+cb455d664
control-plane-2   Ready    master   20h   v1.13.4+cb455d664

[root@preserve-jliu-worker 20190603_2784]# ./oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                     4.1.0-rc.8   True        False         False      20h
cluster-autoscaler                   4.1.0-rc.8   True        False         False      20h
dns                                  unknown      False       True          True       20h
kube-apiserver                       4.1.0-rc.8   False       True          True       20h
kube-controller-manager                           False       True          True       20h
kube-scheduler                       4.1.0-rc.8   False       True          True       20h
machine-api                          4.1.0-rc.8   True        False         False      20h
machine-config                       4.1.0-rc.8   False       True          False      20h
network                              4.1.0-rc.8   True        False         False      20h
openshift-apiserver                  4.1.0-rc.8   Unknown     Unknown       True       20h
openshift-controller-manager         4.1.0-rc.8   True        True          False      20h
operator-lifecycle-manager           4.1.0-rc.8   True        False         False      20h
operator-lifecycle-manager-catalog   4.1.0-rc.8   True        False         False      20h
service-ca                           4.1.0-rc.8   True        False         False      20h

[root@preserve-jliu-worker 20190603_2784]# ./oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.8   False       True          20h     Working towards 4.1.0-rc.8: 81% complete

[root@preserve-jliu-worker 20190603_2784]# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

> The information provided is not sufficient for me to reason that this is a re-occurrence of the original issue for this bug. In the original issue, the kube apiserver pods were not running, and this was not related to whether the storage backend was configured for the image registry.

The kube-apiserver pods were not running in the latest issue either (refer to the log about the kube-apiserver in my last comment). Since it is the same scenario with the same appearance, I just reopened the bug. But if you think it is another issue, I am OK with filing a new bug to track it.

Reopening the bug and setting the target to 4.2 first. If it turns out to have another root cause after dev debugging, please feel free to close this one and file a new bug for it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758