Bug 2047397 - 4.10 fc builds fail to install with cluster version stuck at a certain percentage and etcd pod in a bad state
Summary: 4.10 fc builds fail to install with cluster version stuck at a certain percent...
Keywords:
Status: CLOSED DUPLICATE of bug 1961204
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ian Miller
QA Contact: yliu1
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-27 18:41 UTC by yliu1
Modified: 2022-01-27 20:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-27 20:34:29 UTC
Target Upstream Version:
Embargoed:



Description yliu1 2022-01-27 18:41:12 UTC
Description of problem:
During installation of 4.10 fc builds, the agentclusterinstall times out waiting for the cluster version to become available.

The cluster version gets stuck at a certain percentage with no further progress, and some pods on the cluster, including etcd, are in a bad state.

Version-Release number of selected component (if applicable):
4.10 fc builds (tried 4.10.0-fc.1/2/3)

How reproducible:
100% reproducible on a specific server (cnfocto2); this server has no issue installing 4.9 z-stream builds.
It was also seen once on a different server.

Steps to Reproduce:
1. Trigger a DU node install via ZTP
The siteconfig used is here:
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/b280ce1d9bf0c203af26ebe18fe382eecb65bcf4/siteconfig/cnfocto2.yaml
2. Wait for the install to begin by monitoring the agentclusterinstall
3. Wait for the install to complete by monitoring the agentclusterinstall (see the example monitoring command below)
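# Example way to watch the agentclusterinstall conditions on the hub (a sketch only; it
# assumes the siteconfig creates the namespace and AgentClusterInstall both named
# "cnfocto2" -- adjust the names to match the hub):
oc -n cnfocto2 get agentclusterinstall cnfocto2 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'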

Actual results:
# agentclusterinstall timed out waiting for the cluster version:
    state: error
    stateInfo: 'Timeout while waiting for cluster version to be available: timed out'
  machineNetwork:
  - cidr: 10.16.231.0/24

Expected results:
OCP installs successfully


Additional info:
# clusterversion on the spoke got stuck for hours with no further progress (in previous installs, it eventually failed):
  - lastTransitionTime: "2022-01-27T15:22:53Z"
    message: 'Working towards 4.10.0-fc.3: 370 of 769 done (48% complete)'
    status: "True"
    type: Progressing
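
# The stuck rollout can be narrowed down on the spoke with something like the following
# (a sketch; run against the spoke cluster's kubeconfig):
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'
oc get clusteroperators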

# some pods in bad state including etcd:
NAMESPACE                                          NAME                                                           READY   STATUS              RESTARTS         AGE
openshift-cloud-credential-operator                cloud-credential-operator-69b7f7d54f-jf49f                     0/2     ContainerCreating   0                133m
openshift-cluster-machine-approver                 machine-approver-755948d859-ccdqv                              0/2     ContainerCreating   0                133m
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-555f4f9b5f-5nksx                  0/1     ContainerCreating   0                133m
openshift-cluster-storage-operator                 csi-snapshot-webhook-c948dc568-ppbrp                           0/1     ContainerCreating   0                129m
openshift-cluster-version                          cluster-version-operator-655b45f98b-lzd2z                      0/1     ContainerCreating   0                133m
openshift-dns-operator                             dns-operator-f85999bff-fgj2s                                   0/2     ContainerCreating   0                133m
openshift-etcd                                     etcd-cnfocto2.ptp.lab.eng.bos.redhat.com                       3/4     CrashLoopBackOff    31 (106s ago)    124m
openshift-ingress-operator                         ingress-operator-5b78cd4b5b-7r65r                              0/2     ContainerCreating   0                133m
openshift-kube-controller-manager                  kube-controller-manager-cnfocto2.ptp.lab.eng.bos.redhat.com    3/4     CrashLoopBackOff    26 (4m30s ago)   112m
openshift-machine-config-operator                  machine-config-daemon-2gr8r                                    0/2     ContainerCreating   0                130m
openshift-marketplace                              marketplace-operator-97478fb5c-rxk4z                           0/1     ContainerCreating   0                133m
openshift-monitoring                               cluster-monitoring-operator-7b87d99dd7-lshpg                   0/2     ContainerCreating   0                133m
openshift-multus                                   ip-reconciler-27388425-ldjvp                                   0/1     Pending             0                111s
openshift-multus                                   multus-admission-controller-crhhm                              0/2     ContainerCreating   0                130m
openshift-multus                                   network-metrics-daemon-smn2s                                   0/2     ContainerCreating   0                132m
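
# A quick way to get the same view of unhealthy pods (sketch; filters out healthy pods):
oc get pods -A --no-headers | grep -Ev 'Running|Completed'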

# etcd-health-monitor container fails:
  - containerID: cri-o://979553aaf386828cb595c5514a3971a5a8f2a52640e4d9c3914ebe240ec4718a
    image: registry.ran-vcl01.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev@sha256:3fc3fda52bc4c683f7dd30c4c121221b3a12156afea60437b09424ee5b568855
    imageID: registry.ran-vcl01.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev@sha256:3fc3fda52bc4c683f7dd30c4c121221b3a12156afea60437b09424ee5b568855
    lastState:
      terminated:
        containerID: cri-o://979553aaf386828cb595c5514a3971a5a8f2a52640e4d9c3914ebe240ec4718a
        exitCode: 255
        finishedAt: "2022-01-27T16:50:48Z"
        message: "8s.io/apimachinery.0/pkg/util/wait/wait.go:167 +0x13b\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0,
          0x12a05f200, 0x0, 0x5, 0xc0001ecfd0)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:133
          +0x89\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:90\nk8s.io/apimachinery/pkg/util/wait.Forever(0x0,
          0xc00026b140)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:81 +0x28\ncreated
          by k8s.io/component-base/logs.InitLogs\n\tk8s.io/component-base.0/logs/logs.go:179
          +0x85\n\ngoroutine 87 [select]:\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run.func1()\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:139
          +0x85\ncreated by github.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:138
          +0x18f\n\ngoroutine 88 [IO wait]:\ninternal/poll.runtime_pollWait(0x7f3c9c9b4798,
          0x72)\n\truntime/netpoll.go:229 +0x89\ninternal/poll.(*pollDesc).wait(0xc000768000,
          0x4, 0x0)\n\tinternal/poll/fd_poll_runtime.go:84 +0x32\ninternal/poll.(*pollDesc).waitRead(...)\n\tinternal/poll/fd_poll_runtime.go:89\ninternal/poll.(*FD).Accept(0xc000768000)\n\tinternal/poll/fd_unix.go:402
          +0x22c\nnet.(*netFD).accept(0xc000768000)\n\tnet/fd_unix.go:173 +0x35\nnet.(*TCPListener).accept(0xc000522000)\n\tnet/tcpsock_posix.go:140
          +0x28\nnet.(*TCPListener).Accept(0xc000522000)\n\tnet/tcpsock.go:262 +0x3d\nnet/http.(*Server).Serve(0xc0001420e0,
          {0x296f1a0, 0xc000522000})\n\tnet/http/server.go:3001 +0x394\nnet/http.(*Server).ListenAndServe(0xc0001420e0)\n\tnet/http/server.go:2930
          +0x7d\nnet/http.ListenAndServe(...)\n\tnet/http/server.go:3184\ngithub.com/openshift/library-go/pkg/serviceability.StartProfiler.func1()\n"
        reason: Error
        startedAt: "2022-01-27T16:50:28Z"
    name: etcd-health-monitor
    ready: false
    restartCount: 21
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-cnfocto2.ptp.lab.eng.bos.redhat.com_openshift-etcd(c31234d31c7c177b8475cd874811e26a)
        reason: CrashLoopBackOff

# pod describe:
  Warning  Unhealthy  178m  kubelet  Readiness probe failed: + unset ETCDCTL_ENDPOINTS
+ /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://10.16.231.12:0 endpoint health -w json
+ grep '"health":true'
{"level":"warn","ts":1643298097.4124234,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00035ac40/#initially=[unixs://10.16.231.12:0]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial unix 10.16.231.12:0: connect: no such file or directory\""}
Error: unhealthy cluster
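
# The full crash output of the etcd-health-monitor container (truncated in the status above)
# can be pulled with something like the following (sketch; pod name as shown above):
oc -n openshift-etcd logs etcd-cnfocto2.ptp.lab.eng.bos.redhat.com -c etcd-health-monitor --previous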

