Bug 2047397 - 4.10 fc builds fail to install with cluster version stuck at a certain percentage and etcd pod in a bad state
Summary: 4.10 fc builds fail to install with cluster version stuck at a certain percent...
Keywords:
Status: CLOSED DUPLICATE of bug 1961204
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ian Miller
QA Contact: yliu1
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-27 18:41 UTC by yliu1
Modified: 2022-01-27 20:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-27 20:34:29 UTC
Target Upstream Version:
Embargoed:



Description yliu1 2022-01-27 18:41:12 UTC
Description of problem:
During installation of 4.10 fc builds, the agentclusterinstall times out waiting for the cluster version to become available.

The cluster version gets stuck at a certain percentage with no further progress, and some pods on the cluster, including etcd, are in a bad state.

Version-Release number of selected component (if applicable):
4.10 fc builds (tried 4.10.0-fc.1/2/3)

How reproducible:
100% reproducible on a specific server (cnfocto2); this server has no issue installing 4.9 z-stream builds.
It was also seen once on a different server.

Steps to Reproduce:
1. Trigger a DU node install via ZTP
The siteconfig used is here:
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/b280ce1d9bf0c203af26ebe18fe382eecb65bcf4/siteconfig/cnfocto2.yaml
2. Wait for the install to begin by monitoring the agentclusterinstall
3. Wait for the install to complete by monitoring the agentclusterinstall (see the example monitoring command below)
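# Example way to watch the agentclusterinstall conditions on the hub (a sketch only; it
# assumes the siteconfig creates the namespace and AgentClusterInstall both named
# "cnfocto2" -- adjust the names to match the hub):
oc -n cnfocto2 get agentclusterinstall cnfocto2 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'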

Actual results:
# agentclusterinstall timed out waiting for the cluster version:
    state: error
    stateInfo: 'Timeout while waiting for cluster version to be available: timed out'
  machineNetwork:
  - cidr: 10.16.231.0/24

Expected results:
OCP installs successfully


Additional info:
# clusterversion on the spoke got stuck for hours with no further progress (in previous installs, it eventually failed):
  - lastTransitionTime: "2022-01-27T15:22:53Z"
    message: 'Working towards 4.10.0-fc.3: 370 of 769 done (48% complete)'
    status: "True"
    type: Progressing
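
# The stuck rollout can be narrowed down on the spoke with something like the following
# (a sketch; run against the spoke cluster's kubeconfig):
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'
oc get clusteroperators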

# some pods in bad state including etcd:
NAMESPACE                                          NAME                                                           READY   STATUS              RESTARTS         AGE
openshift-cloud-credential-operator                cloud-credential-operator-69b7f7d54f-jf49f                     0/2     ContainerCreating   0                133m
openshift-cluster-machine-approver                 machine-approver-755948d859-ccdqv                              0/2     ContainerCreating   0                133m
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-555f4f9b5f-5nksx                  0/1     ContainerCreating   0                133m
openshift-cluster-storage-operator                 csi-snapshot-webhook-c948dc568-ppbrp                           0/1     ContainerCreating   0                129m
openshift-cluster-version                          cluster-version-operator-655b45f98b-lzd2z                      0/1     ContainerCreating   0                133m
openshift-dns-operator                             dns-operator-f85999bff-fgj2s                                   0/2     ContainerCreating   0                133m
openshift-etcd                                     etcd-cnfocto2.ptp.lab.eng.bos.redhat.com                       3/4     CrashLoopBackOff    31 (106s ago)    124m
openshift-ingress-operator                         ingress-operator-5b78cd4b5b-7r65r                              0/2     ContainerCreating   0                133m
openshift-kube-controller-manager                  kube-controller-manager-cnfocto2.ptp.lab.eng.bos.redhat.com    3/4     CrashLoopBackOff    26 (4m30s ago)   112m
openshift-machine-config-operator                  machine-config-daemon-2gr8r                                    0/2     ContainerCreating   0                130m
openshift-marketplace                              marketplace-operator-97478fb5c-rxk4z                           0/1     ContainerCreating   0                133m
openshift-monitoring                               cluster-monitoring-operator-7b87d99dd7-lshpg                   0/2     ContainerCreating   0                133m
openshift-multus                                   ip-reconciler-27388425-ldjvp                                   0/1     Pending             0                111s
openshift-multus                                   multus-admission-controller-crhhm                              0/2     ContainerCreating   0                130m
openshift-multus                                   network-metrics-daemon-smn2s                                   0/2     ContainerCreating   0                132m
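
# A quick way to get the same view of unhealthy pods (sketch; filters out healthy pods):
oc get pods -A --no-headers | grep -Ev 'Running|Completed'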

# etcd-health-monitor container fails:
  - containerID: cri-o://979553aaf386828cb595c5514a3971a5a8f2a52640e4d9c3914ebe240ec4718a
    image: registry.ran-vcl01.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev@sha256:3fc3fda52bc4c683f7dd30c4c121221b3a12156afea60437b09424ee5b568855
    imageID: registry.ran-vcl01.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev@sha256:3fc3fda52bc4c683f7dd30c4c121221b3a12156afea60437b09424ee5b568855
    lastState:
      terminated:
        containerID: cri-o://979553aaf386828cb595c5514a3971a5a8f2a52640e4d9c3914ebe240ec4718a
        exitCode: 255
        finishedAt: "2022-01-27T16:50:48Z"
        message: "8s.io/apimachinery.0/pkg/util/wait/wait.go:167 +0x13b\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0,
          0x12a05f200, 0x0, 0x5, 0xc0001ecfd0)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:133
          +0x89\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:90\nk8s.io/apimachinery/pkg/util/wait.Forever(0x0,
          0xc00026b140)\n\tk8s.io/apimachinery.0/pkg/util/wait/wait.go:81 +0x28\ncreated
          by k8s.io/component-base/logs.InitLogs\n\tk8s.io/component-base.0/logs/logs.go:179
          +0x85\n\ngoroutine 87 [select]:\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run.func1()\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:139
          +0x85\ncreated by github.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:138
          +0x18f\n\ngoroutine 88 [IO wait]:\ninternal/poll.runtime_pollWait(0x7f3c9c9b4798,
          0x72)\n\truntime/netpoll.go:229 +0x89\ninternal/poll.(*pollDesc).wait(0xc000768000,
          0x4, 0x0)\n\tinternal/poll/fd_poll_runtime.go:84 +0x32\ninternal/poll.(*pollDesc).waitRead(...)\n\tinternal/poll/fd_poll_runtime.go:89\ninternal/poll.(*FD).Accept(0xc000768000)\n\tinternal/poll/fd_unix.go:402
          +0x22c\nnet.(*netFD).accept(0xc000768000)\n\tnet/fd_unix.go:173 +0x35\nnet.(*TCPListener).accept(0xc000522000)\n\tnet/tcpsock_posix.go:140
          +0x28\nnet.(*TCPListener).Accept(0xc000522000)\n\tnet/tcpsock.go:262 +0x3d\nnet/http.(*Server).Serve(0xc0001420e0,
          {0x296f1a0, 0xc000522000})\n\tnet/http/server.go:3001 +0x394\nnet/http.(*Server).ListenAndServe(0xc0001420e0)\n\tnet/http/server.go:2930
          +0x7d\nnet/http.ListenAndServe(...)\n\tnet/http/server.go:3184\ngithub.com/openshift/library-go/pkg/serviceability.StartProfiler.func1()\n"
        reason: Error
        startedAt: "2022-01-27T16:50:28Z"
    name: etcd-health-monitor
    ready: false
    restartCount: 21
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-cnfocto2.ptp.lab.eng.bos.redhat.com_openshift-etcd(c31234d31c7c177b8475cd874811e26a)
        reason: CrashLoopBackOff

# pod describe:
  Warning  Unhealthy  178m  kubelet  Readiness probe failed: + unset ETCDCTL_ENDPOINTS
+ /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://10.16.231.12:0 endpoint health -w json
+ grep '"health":true'
{"level":"warn","ts":1643298097.4124234,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00035ac40/#initially=[unixs://10.16.231.12:0]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial unix 10.16.231.12:0: connect: no such file or directory\""}
Error: unhealthy cluster
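
# The full crash output of the etcd-health-monitor container (truncated in the status above)
# can be pulled with something like the following (sketch; pod name as shown above):
oc -n openshift-etcd logs etcd-cnfocto2.ptp.lab.eng.bos.redhat.com -c etcd-health-monitor --previous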

