Bug 1702615

Summary:	[upi-vmware] Fail to finish cluster initialation after bootsrap complete
Product:	OpenShift Container Platform	Reporter:	liujia <jiajliu>
Component:	Installer	Assignee:	Matthew Staebler <mstaeble>
Installer sub component:	openshift-installer	QA Contact:	liujia <jiajliu>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	bleanhar, jima, sponnaga
Version:	4.1.0
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1717257 (view as bug list)		Environment:
Last Closed:	2019-06-04 10:47:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1717257

Description liujia 2019-04-24 09:29:22 UTC

Description of problem:
Trigger upi on vmware installation against 4.1.0-rc.0 payload. when run "./openshift-install wait-for install-complete", hit following error:

# ./openshift-install wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.jliu-demo.qe.devcluster.openshift.com:6443 to initialize... 
FATAL failed to initialize the cluster: Multiple errors are preventing progress:
* Cluster operator authentication has not yet reported success
* Cluster operator image-registry has not yet reported success
* Cluster operator ingress has not yet reported success
* Cluster operator kube-apiserver is reporting a failure: StaticPodsDegraded: pods "kube-apiserver-control-plane-0" not found
StaticPodsDegraded: pods "kube-apiserver-control-plane-2" not found
* Cluster operator kube-controller-manager has not yet reported success
* Cluster operator kube-scheduler is reporting a failure: StaticPodsDegraded: nodes/control-plane-1 pods/openshift-kube-scheduler-control-plane-1 container="scheduler" is not ready
StaticPodsDegraded: pods "openshift-kube-scheduler-control-plane-0" not found
StaticPodsDegraded: pods "openshift-kube-scheduler-control-plane-2" not found
* Cluster operator marketplace has not yet reported success
* Cluster operator monitoring has not yet reported success
* Cluster operator node-tuning has not yet reported success
* Cluster operator openshift-apiserver is reporting a failure: ResourceSyncControllerDegraded: namespaces "openshift-apiserver" not found
* Cluster operator operator-lifecycle-manager has not yet reported success
* Cluster operator service-catalog-apiserver has not yet reported success
* Cluster operator service-catalog-controller-manager has not yet reported success
* Cluster operator storage has not yet reported success
* Could not update oauthclient "console" (269 of 333): the server does not recognize this resource, check extension API servers
* Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (232 of 333): resource may have been deleted
* Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (329 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-authentication-operator/openshift-authentication-operator" (305 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (332 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-image-registry/image-registry" (311 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (320 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (323 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (326 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (314 of 333): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (317 of 333): the server does not recognize this resource, check extension API servers: timed out waiting for the condition 
[root@preserve-jliu-worker 20190424]# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          75m     Unable to apply 4.1.0-rc.0: an unknown error has occurred

It should be related with no storage backend, but when create empty dir for image registry before run "openshift-install wait-for install-complete", it report failed due to image registry operator is not avialable.
# ./oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     50m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     50m
dns                                  4.1.0-rc.0   True        False         False     48m
kube-apiserver                       4.1.0-rc.0   False       True                    49m
kube-controller-manager              4.1.0-rc.0   True        True                    47m
kube-scheduler                       4.1.0-rc.0   False       True                    50m
machine-api                          4.1.0-rc.0   True        False         False     50m
machine-config                       4.1.0-rc.0   True        False         False     49m
network                              4.1.0-rc.0   True        False                   50m
openshift-apiserver                  4.1.0-rc.0   Unknown     Unknown                 49m
openshift-controller-manager         4.1.0-rc.0   True        True                    48m
operator-lifecycle-manager           4.1.0-rc.0   True        True          False     46m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        True          False     46m
service-ca                           4.1.0-rc.0   True        False         False     49m

Version-Release number of the following components:
# ./openshift-install version
./openshift-install v4.1.0-201904211700-dirty
built from commit f3b726cc151f5a3d66bc7e23e81b3013f1347a7e
release image registry.svc.ci.openshift.org/ocp/release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867

4.1.0-rc.0 payload

How reproducible:
sometimes

Steps to Reproduce:
Refer to https://github.com/openshift/installer/tree/master/upi/vsphere
1. finish step1-step6
2. ./openshift-install wait-for bootstrap-complete
INFO Waiting up to 30m0s for the Kubernetes API at https://api.jliu-demo.qe.devcluster.openshift.com:6443... 
INFO API v1.13.4+81fc896 up                       
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO It is now safe to remove the bootstrap resources 
4. run "terraform apply -auto-approve -var 'bootstrap_complete=true'" to remove bootstrap resource
5. approve csr according to [1]
# ./oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve
6. create empty dir for image registry
# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found
7. run "openshift-install wait-for install-complete" to finish the install


[1] https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md#approving-server-certificates-for-nodes
[2] https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md#updating-image-registry-to-emptydir-storage-backend

Actual results:
Fail to install the cluster.

Expected results:
Installation succeed.

Additional info:
When fail, check the image-registry operator still not available.
# oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     89m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     89m
dns                                  4.1.0-rc.0   True        False         False     87m
kube-apiserver                       4.1.0-rc.0   False       True                    88m
kube-controller-manager              4.1.0-rc.0   True        True                    86m
kube-scheduler                       4.1.0-rc.0   False       True                    89m
machine-api                          4.1.0-rc.0   True        False         False     89m
machine-config                       4.1.0-rc.0   True        False         False     88m
network                              4.1.0-rc.0   True        False                   89m
openshift-apiserver                  4.1.0-rc.0   Unknown     Unknown                 88m
openshift-controller-manager         4.1.0-rc.0   True        True                    87m
operator-lifecycle-manager           4.1.0-rc.0   True        True          False     85m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        True          False     85m
service-ca                           4.1.0-rc.0   True        False         False     88m

Comment 2 Brenton Leanhardt 2019-04-24 15:36:41 UTC

I'm assuming Scott didn't intend to change the QA contact.

Comment 3 liujia 2019-04-25 01:28:46 UTC

Hit it again based the same payload.

The same steps as the description. The difference this time is that empty dir storage was added successfully before run "openshift-install wait-for install-complete". But the installation still failed.

# ./openshift-install wait-for bootstrap-complete
INFO Waiting up to 30m0s for the Kubernetes API at https://api.jliu-demo.qe.devcluster.openshift.com:6443... 
INFO API v1.13.4+81fc896 up                       
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO It is now safe to remove the bootstrap resources 

# ./terraform apply -auto-approve -var 'bootstrap_complete=true'

# oc get csr --no-headers|grep -i pending|awk '{print $1}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-62pdr approved
certificatesigningrequest.certificates.k8s.io/csr-7ln8w approved
certificatesigningrequest.certificates.k8s.io/csr-9242m approved
certificatesigningrequest.certificates.k8s.io/csr-bmxqh approved
certificatesigningrequest.certificates.k8s.io/csr-jhkx4 approved
certificatesigningrequest.certificates.k8s.io/csr-jlcj4 approved
certificatesigningrequest.certificates.k8s.io/csr-jtqnf approved
certificatesigningrequest.certificates.k8s.io/csr-lm4zg approved
certificatesigningrequest.certificates.k8s.io/csr-m7l2s approved
certificatesigningrequest.certificates.k8s.io/csr-nr94d approved
certificatesigningrequest.certificates.k8s.io/csr-pqz76 approved
certificatesigningrequest.certificates.k8s.io/csr-w7mbc approved

# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
config.imageregistry.operator.openshift.io/cluster patched

# ./openshift-install wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.jliu-demo.qe.devcluster.openshift.com:6443 to initialize... 
FATAL failed to initialize the cluster: Cluster operator image-registry is still updating: timed out waiting for the condition 

Checked the cluster status:
# oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.1.0-rc.0   True        False         False     15h
cloud-credential                     4.1.0-rc.0   True        False         False     16h
cluster-autoscaler                   4.1.0-rc.0   True        False         False     16h
console                              4.1.0-rc.0   True        False         False     15h
dns                                  4.1.0-rc.0   True        False         False     16h
image-registry                                    False       True          True      16h
ingress                              4.1.0-rc.0   True        False         False     16h
kube-apiserver                       4.1.0-rc.0   True        False                   16h
kube-controller-manager              4.1.0-rc.0   True        False                   16h
kube-scheduler                       4.1.0-rc.0   True        False                   16h
machine-api                          4.1.0-rc.0   True        False         False     16h
machine-config                       4.1.0-rc.0   True        False         False     16h
marketplace                          4.1.0-rc.0   True        False         False     16h
monitoring                           4.1.0-rc.0   True        False         False     15h
network                              4.1.0-rc.0   True        False                   16h
node-tuning                          4.1.0-rc.0   True        False         False     16h
openshift-apiserver                  4.1.0-rc.0   True        False                   16h
openshift-controller-manager         4.1.0-rc.0   True        False                   16h
openshift-samples                    4.1.0-rc.0   True        False         False     16h
operator-lifecycle-manager           4.1.0-rc.0   True        False         False     16h
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        False         False     16h
service-ca                           4.1.0-rc.0   True        False         False     16h
service-catalog-apiserver            4.1.0-rc.0   True        False         False     15h
service-catalog-controller-manager   4.1.0-rc.0   True        False         False     15h
storage                              4.1.0-rc.0   True        False         False     16h

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          16h     Unable to apply 4.1.0-rc.0: the cluster operator image-registry has not yet successfully rolled out

# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
config.imageregistry.operator.openshift.io/cluster patched (no change)


# oc get config.imageregistry.operator.openshift.io/cluster -o json| jq ".spec.storage"
{
  "filesystem": {
    "volumeSource": {
      "emptyDir": {}
    }
  }
}

Comment 4 Matthew Staebler 2019-04-25 02:15:05 UTC

That is not the correct command to run to set the storage for image registry.

# oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'

See https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md#updating-image-registry-to-emptydir-storage-backend

Comment 5 Matthew Staebler 2019-04-25 02:16:53 UTC

The original failure captured in this bug does not appear to be related to the image registry. From the output it looks like there was wide-spread failures across the operators. The latest failure looks to be just the image-registry not being configured correctly.

Comment 6 liujia 2019-04-25 02:37:08 UTC

(In reply to Matthew Staebler from comment #4)
> That is not the correct command to run to set the storage for image registry.
> 
> # oc patch configs.imageregistry.operator.openshift.io cluster --type merge
> --patch '{"spec":{"storage":{"emptyDir":{}}}}'
> 
> See
> https://github.com/openshift/installer/blob/master/docs/user/vsphere/
> install_upi.md#updating-image-registry-to-emptydir-storage-backend

Thx for point out. I pasted the command from doc[1] in pr[2]

[1] http://file.rdu.redhat.com/kalexand/0418/osdocs324/installing/installing_vsphere/installing-vsphere.html
[2] https://github.com/openshift/openshift-docs/pull/14523

I think maybe this is the key point for the issue in comment3. I will try it again. If it works, i will change this issue to doc. Thx.

But for the issue in description. I don't think it is much related with the wrong command since the image-registry is not shown in the operators list all the time. And the command to add empty dir can not succeed at all.
# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"filesystem":{"volumeSource": {"emptyDir":{}}}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

Additional info:
When fail, check the image-registry operator still not available.
# oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     89m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     89m
dns                                  4.1.0-rc.0   True        False         False     87m
kube-apiserver                       4.1.0-rc.0   False       True                    88m
kube-controller-manager              4.1.0-rc.0   True        True                    86m
kube-scheduler                       4.1.0-rc.0   False       True                    89m
machine-api                          4.1.0-rc.0   True        False         False     89m
machine-config                       4.1.0-rc.0   True        False         False     88m
network                              4.1.0-rc.0   True        False                   89m
openshift-apiserver                  4.1.0-rc.0   Unknown     Unknown                 88m
openshift-controller-manager         4.1.0-rc.0   True        True                    87m
operator-lifecycle-manager           4.1.0-rc.0   True        True          False     85m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        True          False     85m
service-ca                           4.1.0-rc.0   True        False         False     88m

Comment 7 liujia 2019-04-25 05:31:31 UTC

About the issue in comment3, it's due to wrong command, so no need to track it here, i have added comment it https://github.com/openshift/openshift-docs/pull/14523#issuecomment-486517176.
About the issue in description, tried again on two builds which should be the same.

Tried on 4.1.0-0.nightly-2019-04-22-005054 succeed.
Tried on 4.1.0-rc.0, still failed for the same error.

After run "./openshift-install wait-for bootstrap-complete" succeed. All nodes can not be ready, which result that many opearators/pods can not be scheduled, so the empty dir can not be added.

root@preserve-jliu-worker 20190424]# oc get node
NAME              STATUS                     ROLES    AGE    VERSION
compute-0         Ready,SchedulingDisabled   worker   112m   v1.13.4+0e917197e
compute-1         Ready,SchedulingDisabled   worker   112m   v1.13.4+0e917197e
compute-2         Ready,SchedulingDisabled   worker   112m   v1.13.4+0e917197e
control-plane-0   Ready,SchedulingDisabled   master   112m   v1.13.4+0e917197e
control-plane-1   Ready,SchedulingDisabled   master   112m   v1.13.4+0e917197e
control-plane-2   Ready,SchedulingDisabled   master   112m   v1.13.4+0e917197e

[root@preserve-jliu-worker 20190424]# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

[root@preserve-jliu-worker 20190424]# ./oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     112m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     112m
dns                                  4.1.0-rc.0   True        False         False     112m
kube-apiserver                       4.1.0-rc.0   True        True                    111m
kube-controller-manager              4.1.0-rc.0   False       True                    112m
kube-scheduler                       4.1.0-rc.0   False       True                    112m
machine-api                          4.1.0-rc.0   True        False         False     112m
machine-config                       4.1.0-rc.0   True        False         False     112m
network                              4.1.0-rc.0   True        False                   113m
openshift-apiserver                  4.1.0-rc.0   False       False                   111m
openshift-controller-manager         4.1.0-rc.0   True        True                    111m
operator-lifecycle-manager           4.1.0-rc.0   True        True          False     110m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        True          False     110m
service-ca                           4.1.0-rc.0   True        False         False     112m

[root@preserve-jliu-worker 20190424]# ./oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          114m    Unable to apply 4.1.0-rc.0: an unknown error has occurred
[root@preserve-jliu-worker 20190424]# 
[root@preserve-jliu-worker 20190424]# ./oc describe co openshift-apiserver|grep Message
    Message:               Available: no openshift-apiserver daemon pods available on any node.
[root@preserve-jliu-worker 20190424]# 
[root@preserve-jliu-worker 20190424]# ./oc describe co kube-scheduler|grep Message
    Message:               StaticPodsDegraded: nodes/control-plane-1 pods/openshift-kube-scheduler-control-plane-1 container="scheduler" is not ready
    Message:               Progressing: 3 nodes are at revision 0
    Message:               Available: 0 nodes are active; 3 nodes are at revision 0
[root@preserve-jliu-worker 20190424]# 
[root@preserve-jliu-worker 20190424]# ./oc describe co kube-controller-manager|grep Message
    Message:               StaticPodsDegraded: nodes/control-plane-2 pods/kube-controller-manager-control-plane-2 container="kube-controller-manager-2" is not ready
    Message:               Progressing: 3 nodes are at revision 0
    Message:               Available: 0 nodes are active; 3 nodes are at revision 0

[root@preserve-jliu-worker 20190424]# oc get events
LAST SEEN   TYPE     REASON           OBJECT                        MESSAGE
124m        Normal   RegisteredNode   node/compute-0                Node compute-0 event: Registered Node compute-0 in Controller
124m        Normal   Starting         node/compute-0                Starting openshift-sdn.
119m        Normal   Drain            node/compute-0                Draining node to update config.
118m        Normal   Reboot           node/compute-0                Node will reboot into config rendered-worker-ca38126bd905bba006bb3127d588ed8c
124m        Normal   RegisteredNode   node/compute-1                Node compute-1 event: Registered Node compute-1 in Controller
123m        Normal   Starting         node/compute-1                Starting openshift-sdn.
119m        Normal   Drain            node/compute-1                Draining node to update config.
118m        Normal   Reboot           node/compute-1                Node will reboot into config rendered-worker-ca38126bd905bba006bb3127d588ed8c
124m        Normal   RegisteredNode   node/compute-2                Node compute-2 event: Registered Node compute-2 in Controller
124m        Normal   Starting         node/compute-2                Starting openshift-sdn.
119m        Normal   Drain            node/compute-2                Draining node to update config.
118m        Normal   Reboot           node/compute-2                Node will reboot into config rendered-worker-ca38126bd905bba006bb3127d588ed8c
124m        Normal   RegisteredNode   node/control-plane-0          Node control-plane-0 event: Registered Node control-plane-0 in Controller
122m        Normal   Starting         node/control-plane-0          Starting openshift-sdn.
122m        Normal   NodeReady        node/control-plane-0          Node control-plane-0 status is now: NodeReady
119m        Normal   Drain            node/control-plane-0          Draining node to update config.
124m        Normal   RegisteredNode   node/control-plane-1          Node control-plane-1 event: Registered Node control-plane-1 in Controller
123m        Normal   Starting         node/control-plane-1          Starting openshift-sdn.
123m        Normal   NodeReady        node/control-plane-1          Node control-plane-1 status is now: NodeReady
119m        Normal   Drain            node/control-plane-1          Draining node to update config.
124m        Normal   RegisteredNode   node/control-plane-2          Node control-plane-2 event: Registered Node control-plane-2 in Controller
123m        Normal   Starting         node/control-plane-2          Starting openshift-sdn.
123m        Normal   NodeReady        node/control-plane-2          Node control-plane-2 status is now: NodeReady
119m        Normal   Drain            node/control-plane-2          Draining node to update config.
122m        Normal   Status upgrade   clusteroperator/machine-api   Progressing towards operator: 4.1.0-rc.0

[root@compute-0 core]# journalctl --no-page -u kubelet.service -n 200
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.588051     894 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.589071     894 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.589566     894 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.659826     894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.760019     894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.860197     894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:02 compute-0 hyperkube[894]: I0425 05:17:02.927966     894 kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
Apr 25 05:17:02 compute-0 hyperkube[894]: I0425 05:17:02.928917     894 kubelet_node_status.go:72] Attempting to register node compute-0
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.930043     894 kubelet_node_status.go:94] Unable to register node "compute-0" with API server: Unauthorized
Apr 25 05:17:02 compute-0 hyperkube[894]: E0425 05:17:02.960413     894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:03 compute-0 hyperkube[894]: E0425 05:17:03.060618     894 kubelet.go:2273] node "compute-0" not found
Apr 25 05:17:03 compute-0 hyperkube[894]: E0425 05:17:03.160818     894 kubelet.go:2273] node "compute-0" not found

[root@preserve-jliu-worker 20190424]# oc describe co kube-apiserver
Name:         kube-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-04-25T02:58:21Z
  Generation:          1
  Resource Version:    6402
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 fc42aed2-6705-11e9-be5d-0050568b6e23
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-04-25T02:59:32Z
    Message:               StaticPodsDegraded: pods "kube-apiserver-control-plane-2" not found
    Reason:                StaticPodsDegradedError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-04-25T02:59:04Z
    Message:               Progressing: 2 nodes are at revision 1; 1 nodes are at revision 0
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-04-25T02:59:32Z
    Message:               Available: 2 nodes are active; 2 nodes are at revision 1; 1 nodes are at revision 0
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-04-25T02:58:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.1.0-rc.0
    Name:     kube-apiserver
    Version:  1.13.4
Events:       <none>

[root@preserve-jliu-worker 20190424]# oc describe co kube-apiserver
Name:         kube-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-04-25T02:58:21Z
  Generation:          1
  Resource Version:    6402
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 fc42aed2-6705-11e9-be5d-0050568b6e23
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-04-25T02:59:32Z
    Message:               StaticPodsDegraded: pods "kube-apiserver-control-plane-2" not found
    Reason:                StaticPodsDegradedError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-04-25T02:59:04Z
    Message:               Progressing: 2 nodes are at revision 1; 1 nodes are at revision 0
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-04-25T02:59:32Z
    Message:               Available: 2 nodes are active; 2 nodes are at revision 1; 1 nodes are at revision 0
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-04-25T02:58:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.1.0-rc.0
    Name:     kube-apiserver
    Version:  1.13.4
Events:       <none>
[root@preserve-jliu-worker 20190424]# oc get all -n openshift-kube-apiserver-operator
NAME                                           READY   STATUS        RESTARTS   AGE
pod/kube-apiserver-operator-7fccf67d7b-6qcdm   1/1     Terminating   1          145m

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/metrics   ClusterIP   172.30.54.46   <none>        443/TCP   145m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kube-apiserver-operator   1/1     1            1           145m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/kube-apiserver-operator-7fccf67d7b   1         1         1       145m

[root@preserve-jliu-worker 20190424]# oc describe pod/kube-apiserver-operator-7fccf67d7b-6qcdm -n openshift-kube-apiserver-operator
Name:                      kube-apiserver-operator-7fccf67d7b-6qcdm
Namespace:                 openshift-kube-apiserver-operator
Priority:                  2000000000
PriorityClassName:         system-cluster-critical
Node:                      control-plane-1/139.178.76.10
Start Time:                Thu, 25 Apr 2019 02:56:51 +0000
Labels:                    app=kube-apiserver-operator
                           pod-template-hash=7fccf67d7b
Annotations:               k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "openshift-sdn",
                                 "interface": "eth0",
                                 "ips": [
                                     "10.131.0.2"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  600s
IP:                        10.131.0.2
Controlled By:             ReplicaSet/kube-apiserver-operator-7fccf67d7b
Containers:
  operator:
    Container ID:  cri-o://fc13234debb2a875346ac38f140f98441734ea1b77acfaae23f582a6ed5b2c0f
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b572a949d8b36574b020056aca2dc8c902c302068bae3924567a674396daeaf
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b572a949d8b36574b020056aca2dc8c902c302068bae3924567a674396daeaf
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      cluster-kube-apiserver-operator
      operator
    Args:
      --config=/var/run/configmaps/config/config.yaml
      -v=2
    State:       Running
      Started:   Thu, 25 Apr 2019 02:59:41 +0000
    Last State:  Terminated
      Reason:    Error
      Message:   59       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c6615cff-6705-11e9-be5d-0050568b6e23", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ServiceMonitorCreateFailed' Failed to create ServiceMonitor.monitoring.coreos.com/v1: the server could not find the requested resource
I0425 02:59:33.682882       1 installer_controller.go:337] "control-plane-0" needs new revision 1
I0425 02:59:33.683006       1 installer_controller.go:346] "control-plane-0" moving to (v1.NodeStatus) {
 NodeName: (string) (len=15) "control-plane-0",
 CurrentRevision: (int32) 0,
 TargetRevision: (int32) 1,
 LastFailedRevision: (int32) 0,
 LastFailedRevisionErrors: ([]string) <nil>
}
I0425 02:59:34.885744       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c6615cff-6705-11e9-be5d-0050568b6e23", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-1-control-plane-0 -n openshift-kube-apiserver because it was missing
I0425 02:59:35.483424       1 installer_controller.go:318] "control-plane-0" is in transition to 1, but has not made progress
I0425 02:59:36.683225       1 installer_controller.go:318] "control-plane-0" is in transition to 1, but has not made progress
I0425 02:59:40.874900       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "2c138ee47b915e13d56fc84f84b4ee8c149083a4292a6aaa386f63ddc1001276", lastKnown: "")
W0425 02:59:40.875365       1 builder.go:108] Restart triggered because of file /var/run/secrets/serving-cert/tls.key was created
I0425 02:59:40.875470       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "29bca5782263e753c31f017607943d281f1f40866e6eed79fb9e69310d0e10c0", lastKnown: "")
F0425 02:59:40.875489       1 builder.go:217] server exited

      Exit Code:    255
      Started:      Thu, 25 Apr 2019 02:58:20 +0000
      Finished:     Thu, 25 Apr 2019 02:59:40 +0000
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      IMAGE:                   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1883673bcfead0a959ca9de08df3cf017f507ebd14d287b0045edba6ad7a3c24
      OPERATOR_IMAGE:          quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b572a949d8b36574b020056aca2dc8c902c302068bae3924567a674396daeaf
      OPERAND_IMAGE_VERSION:   1.13.4
      OPERATOR_IMAGE_VERSION:  4.1.0-rc.0
      POD_NAME:                kube-apiserver-operator-7fccf67d7b-6qcdm (v1:metadata.name)
    Mounts:
      /var/run/configmaps/config from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-apiserver-operator-token-22z8t (ro)
      /var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-apiserver-operator-serving-cert
    Optional:    true
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-apiserver-operator-config
    Optional:  false
  kube-apiserver-operator-token-22z8t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-apiserver-operator-token-22z8t
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     
                 node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason           Age                   From                      Message
  ----     ------           ----                  ----                      -------
  Normal   Scheduled        153m                  default-scheduler         Successfully assigned openshift-kube-apiserver-operator/kube-apiserver-operator-7fccf67d7b-6qcdm to control-plane-1
  Warning  NetworkNotReady  152m (x25 over 153m)  kubelet, control-plane-1  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]


[root@preserve-jliu-worker 20190424]# oc get all -n openshift-kube-apiserver
NAME                                 READY   STATUS    RESTARTS   AGE
pod/kube-apiserver-control-plane-0   2/2     Running   0          142m
pod/kube-apiserver-control-plane-1   2/2     Running   0          143m

NAME                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/apiserver   ClusterIP   172.30.28.13   <none>        443/TCP   143m

Comment 11 Matthew Staebler 2019-04-30 02:15:58 UTC

With 4.1.0-0.ci-2019-04-29-123411, I am no longer able to reproduce this bug.

Comment 13 liujia 2019-04-30 09:55:54 UTC

Still hit it on 4.1.0-0.nightly-2019-04-28-064010.
Try 4.1.0-0.nightly-2019-04-29-235412 and hit bz1699456.
Will try it when newer build avilable.

Comment 14 liujia 2019-05-05 07:11:03 UTC

Did not hit it on 4.1.0-0.nightly-2019-05-04-210601 and installation for upi/vmware succeed.
Considering this bug is not a 100% reproduce one, so verify it now, and if hit it in following builds, will re-open it again.

Comment 19 Matthew Staebler 2019-06-03 13:29:36 UTC

The `oc patch configs.imageregistry.operator.openshift.io cluster` command will fail until the resource has been created by the image-registry operator, which is the error that you had when attempting the patch. I will open a separate bug to track adding recommendations to the user to reattempt the patch if it fails.

As for the lastest failure, the `oc get co` command was from 5 minutes after the operators were created. It takes longer than that for the installation to complete, so it is not unexpected that there are failing operators at that point. The `oc get clusterversion` was from 78 minutes after the operators were created. If the storage backend for the image registry was never patched, then it is not unexpected that the clusterversion is still progressing. The information provided is not sufficient for me to reason that this is a re-occurrence of the original issue for this bug. In the original issue, the kube apiserver pods were not running, and this was not related to whether the storage backend was configured for the image registry.

Comment 20 Matthew Staebler 2019-06-03 13:35:43 UTC

Here is the bug to track the documentation change for re-attempting setting the image registry storage backend. https://bugzilla.redhat.com/show_bug.cgi?id=1716427

Comment 22 liujia 2019-06-04 02:19:26 UTC

Hi Matthew,

>The `oc patch configs.imageregistry.operator.openshift.io cluster` command will fail until the resource has been created by the image-registry operator, which is the error that you had when attempting the patch. 

I agree this, but the issue is not why the command fail, but why the installation can not finish. All clusteroperators are not running well and the cluster version status hang on 81%.

>I will open a separate bug to track adding recommendations to the user to reattempt the patch if it fails.
This is a good recommendation for users to reattempt the patch command if it fails when first try. But as for this issue, retry can not resolve the problem that image-operator was not running at all even after several hours.

I kept the cluster for the whole night without any extra operations. and checked it now, still in a broken status.

[root@preserve-jliu-worker 20190603_2784]# ./oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   20h   v1.13.4+cb455d664
control-plane-0   Ready    master   20h   v1.13.4+cb455d664
control-plane-1   Ready    master   20h   v1.13.4+cb455d664
control-plane-2   Ready    master   20h   v1.13.4+cb455d664
[root@preserve-jliu-worker 20190603_2784]# ./oc get co
NAME                                 VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                     4.1.0-rc.8   True        False         False      20h
cluster-autoscaler                   4.1.0-rc.8   True        False         False      20h
dns                                  unknown      False       True          True       20h
kube-apiserver                       4.1.0-rc.8   False       True          True       20h
kube-controller-manager                           False       True          True       20h
kube-scheduler                       4.1.0-rc.8   False       True          True       20h
machine-api                          4.1.0-rc.8   True        False         False      20h
machine-config                       4.1.0-rc.8   False       True          False      20h
network                              4.1.0-rc.8   True        False         False      20h
openshift-apiserver                  4.1.0-rc.8   Unknown     Unknown       True       20h
openshift-controller-manager         4.1.0-rc.8   True        True          False      20h
operator-lifecycle-manager           4.1.0-rc.8   True        False         False      20h
operator-lifecycle-manager-catalog   4.1.0-rc.8   True        False         False      20h
service-ca                           4.1.0-rc.8   True        False         False      20h
[root@preserve-jliu-worker 20190603_2784]# ./oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.8   False       True          20h     Working towards 4.1.0-rc.8: 81% complete
[root@preserve-jliu-worker 20190603_2784]# ./oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

>The information provided is not sufficient for me to reason that this is a re-occurrence of the original issue for this bug. In the original issue, the kube apiserver pods were not running, and this was not related to whether the storage backend was configured for the image registry.
The kube apiserver pods were not running either in the latest issue(refer to the log in my last comment about kube api server). Cause the same scenario and the same apprance, so just re-open the bug. But if you think it is another issue. I'm ok to file a new bug for tracking it.

Comment 23 liujia 2019-06-04 03:12:34 UTC

Reopen the bug and set the target to 4.2 first. If it is another root casue after dev debug, please feel free to close this one and file a new bug for it.

Comment 28 errata-xmlrpc 2019-06-04 10:47:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758