Bug 2030413 - Unable to connect to etcd service using OCP 4.10.0-0.nightly-s390x-2021-12-08-123942 with disconnected installation
Summary: Unable to connect to etcd service using OCP 4.10.0-0.nightly-s390x-2021-12-08-123942 with disconnected installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.10
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Philip Chan
QA Contact: Douglas Slavens
URL:
Whiteboard:
Depends On:
Blocks: 2009709
 
Reported: 2021-12-08 17:42 UTC by Philip Chan
Modified: 2022-01-26 11:00 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-14 22:38:09 UTC
Target Upstream Version:
Embargoed:


Attachments
The bootkube.service log from bootstrap-0 node. (75.50 KB, text/plain)
2021-12-08 17:42 UTC, Philip Chan
openshift-install gather bootstrap (75.50 KB, text/plain)
2021-12-08 22:21 UTC, Philip Chan
Full journalctl from bootstrap node (18.22 MB, application/gzip)
2021-12-14 02:30 UTC, Philip Chan
journalctl of kubelet from bootstrap (7.78 MB, application/gzip)
2021-12-14 02:31 UTC, Philip Chan


Links
Red Hat Issue Tracker MULTIARCH-2013 (last updated 2021-12-08 17:43:26 UTC)

Description Philip Chan 2021-12-08 17:42:53 UTC
Created attachment 1845310 [details]
The bootkube.service log from bootstrap-0 node.

Description of problem:
When performing a disconnected installation using any of the latest OCP 4.10 nightly builds, bootkube.service encounters this error and loops:

Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: {"level":"warn","ts":1638985238.1713336,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00043c380/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: Error: unhealthy cluster
Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: etcdctl failed. Retrying in 5 seconds...

The problem seems to only happen when performing disconnected installs on KVM.  We do not see this issue on zVM.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-s390x-2021-12-08-123942

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Use OCP build 4.10.0-0.nightly-s390x-2021-12-08-123942 to perform disconnected install.
2. Start the bootstrap node.
3. Monitor bootkube.service on the bootstrap node for the etcd error (for example, with the command shown below).  Attached is a copy of the bootkube.service log.
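
On the bootstrap node, the service can be followed with something like:

[core@bootstrap-0 ~]$ journalctl -b -f -u bootkube.service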

Actual results:
The etcd server is unavailable.

Expected results:
A healthy etcd server to establish a connection to.

Additional info:
The etcd issue was first reported in an earlier Bugzilla, bug 2029289.  For KVM, we had two issues: the problem common to both KVM and zVM was resolved by consolidating the pullSecret stanza.  However, only KVM would then go on to encounter the etcd failure detailed above.

Comment 1 Prashanth Sundararaman 2021-12-08 18:51:01 UTC
Can you please attach the `openshift-install gather bootstrap` logs? Also, what state are the masters in? Are the VMs provisioned, and are they up and reachable?
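
For the gather, something along these lines should work; the directory and IP addresses are placeholders for your environment:

openshift-install gather bootstrap --dir <install-dir> --bootstrap <bootstrap-ip> --master <master-0-ip> --master <master-1-ip> --master <master-2-ip>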

Comment 2 Philip Chan 2021-12-08 22:20:26 UTC
Hi Prashanth,

Yes, the master nodes are provisioned, but none of the three is reachable, as they are stuck at the Ignition phase:

[   ***] A start job is running for Ignition (fetch) (28min 5s / no limit)[ 1689.088688] ignition[746]: GET https://api-int.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com:22623/config/master: attempt #341
[ 1689.088745] ignition[746]: GET error: Get "https://api-int.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com:22623/config/master": dial tcp: lookup api-int.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com on 192.168.79.1:53: dial udp 192.168.79.1:53: connect: network is unreachable

Since the bootstrap has not yet finished coming up, I've run a gather of the bootstrap logs to attach.

Comment 3 Philip Chan 2021-12-08 22:21:10 UTC
Created attachment 1845348 [details]
openshift-install gather bootstrap

Comment 4 krmoser 2021-12-09 07:51:39 UTC
Prashanth,

Thank you for your assistance.  

1. Phil and I have determined that this issue was introduced with OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2021-11-17-220453.  Per this OCP 4.10 build's release.txt file, the corresponding RHCOS build is 410.84.202111172002-0.

2. This issue is only seen when conducting disconnected installs in an OCP 4.10 on Z KVM environment.

3. This issue is not seen when conducting disconnected installs in an OCP 4.10 on Z zVM environment.

4. This issue is not seen with OCP 4.6, 4.7, 4.8, or 4.9 on Z KVM and zVM environments.

Thank you,
Kyle

Comment 5 Prashanth Sundararaman 2021-12-10 15:30:50 UTC
Kyle/Phil,

The bootstrap gather should have the logs of the containers on the bootstrap node. Did you not see those getting generated?

If they are not there, could you check the status of the various containers on the bootstrap by running `sudo crictl ps -a` and try to get the logs of the running containers?
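
For example, with <container-id> taken from the `crictl ps -a` output:

sudo crictl ps -a
sudo crictl logs <container-id>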

Thanks,
Prashanth

Comment 6 Philip Chan 2021-12-11 04:24:53 UTC
Hi Prashanth,

There do not seem to be any containers running on the bootstrap node.  If I run `sudo crictl ps -a`, it shows no containers:

[core@bootstrap-0 ~]$ sudo crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
[core@bootstrap-0 ~]$

If I run podman to view any containers that tried to run but exited, this is the output:

[core@bootstrap-0 ~]$ sudo podman ps -a
CONTAINER ID  IMAGE                                                                                                                   COMMAND               CREATED         STATUS                     PORTS       NAMES
9a73ffd0bec8  bastion:5000/ocp4/openshift4@sha256:ee27b0a0106f2a1d90f74e181797f31138213a8eb1ad62c31570222339b73484                    render --output-d...  35 minutes ago  Exited (0) 35 minutes ago              cvo-render
ba07201951c9  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f150b7b4e61aa31616cc493f0ae5989e20c5f00938e4a3037e35d20443bb3af  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago              etcd-render
3ab059c94698  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c28c6bfe172d19fd08f2c63c07c54bb6d88cc7076a967b19d0a0ed5e02d66b4  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago              config-render
5627c634e900  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4329ccc4bbbdf1e26b5f0ba75f75611c53286750a74bb9854822120edbd57220  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago              kube-apiserver-render
65d9575c4248  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b2392f28c4bf6188986103dbb4ce46c0c2f15669c011e652e49314b55f51060b  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago              kube-controller-render
1927f923c1e9  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ae774a0083d2ffba8dbd14113b905fdeb45e5e83b3e056fb0dda428427b1448  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago              kube-scheduler-render
ae953b9eedf8  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f8ee2501ec4d026332d401ff1b1d5a797904ac841a3b0653231366b39618e53  render --prefix=c...  35 minutes ago  Exited (0) 35 minutes ago              ingress-render
2419090636d3  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:46b7f2c29bca7b1e4dbb1a2b560cfb10673c026bdceed858530825b077874c14  bootstrap --root-...  35 minutes ago  Exited (0) 35 minutes ago              mco-render
f5bc855dc2d9  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f4ecbd62ce8778b79e5e890d50925a301c103966cef0f99e90ad9b044caaeba2  render --dest-dir...  35 minutes ago  Exited (0) 35 minutes ago              cco-render
2176b4157192  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6bc4dc417794513a56d52a80510ba4fbe54418ba1307fee28cf6eab3e500798  --dial-timeout=10...  1 second ago    Up 1 second ago                        etcdctl

As you can see from the last container (2176b4157192), it is trying to restart etcdctl over and over again.

Comment 7 Dan Li 2021-12-13 15:20:13 UTC
Assigning this bug back to Phil per discussion with the bug triage team, as we are still gathering information on this bug. 

Hi Prashanth, per Phil's Comment 6, he couldn't get any additional information with regard to any running containers on the bootstrap. Is there any more information that we need from him?

Comment 8 Philip Chan 2021-12-13 15:22:15 UTC
Hi Prashanth,

I believe this etcd container that is stuck trying to restart is the problem.  Would you know how to gather the logs of a container that stops running? It seems to provide a window of only about 5 seconds before the container stops.  Once it's down, I can no longer view any of the logs.
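
One approach I can try, assuming the container name stays stable across restarts, is to follow the logs by name as soon as the container comes up:

[core@bootstrap-0 ~]$ sudo podman logs -f etcdctl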

Comment 9 Prashanth Sundararaman 2021-12-13 23:04:02 UTC
Could you get the kubelet logs via `journalctl -u kubelet`, and also a full copy of the journal logs with just `journalctl`?
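
For example, something like this to capture and compress both (the file names are just suggestions):

sudo journalctl -u kubelet > kubelet.log && gzip kubelet.log
sudo journalctl > journal.log && gzip journal.log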

Comment 10 Philip Chan 2021-12-14 02:29:41 UTC
Hi Prashanth,

Yes, I was able to capture both the kubelet and full journal logs from the bootstrap node.  Please see attached.

Comment 11 Philip Chan 2021-12-14 02:30:43 UTC
Created attachment 1846162 [details]
Full journalctl from bootstrap node

Comment 12 Philip Chan 2021-12-14 02:31:38 UTC
Created attachment 1846163 [details]
journalctl of kubelet from bootstrap

Comment 13 Prashanth Sundararaman 2021-12-14 15:25:32 UTC
OK, looking at the logs I see:

Dec 11 06:07:13 bootstrap-0.pok-87-nov-qemu.ocptest.pok.stglabs.ibm.com kubelet.sh[2311]: E1211 06:07:13.689256    2323 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_bootstrap-machine-config-operator-bootstrap-0.pok-87-nov-qemu.ocptest.pok.stglabs.ibm.com_default_0d9c273f06d500edc3a13759cd20eabf_0\": initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:101f7335a2364288cbba63643da2504be563fece98c76acfbe6b9c4a2dab003d: (Mirrors also failed: [bastion:5000/ocp4/openshift4@sha256:101f7335a2364288cbba63643da2504be563fece98c76acfbe6b9c4a2dab003d: pinging container registry bastion:5000: Get \"https://bastion:5000/v2/\": x509: certificate relies on legacy Common Name field, use SANs instead]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:101f7335a2364288cbba63643da2504be563fece98c76acfbe6b9c4a2dab003d: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 3.214.233.154:443: connect: no route to host" pod="default/bootstrap-machine-config-operator-bootstrap-0.pok-87-nov-qemu.ocptest.pok.stglabs.ibm.com"

This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1882191. Instead of using commonName in the certificates for your private registry, you need to use SANs. Note that with Go 1.17, GODEBUG=x509ignoreCN=0 is no longer supported, so you have to use SANs.
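
For example, with OpenSSL 1.1.1 or later, a self-signed registry certificate with a SAN can be generated along these lines (the bastion hostname here is only an example; substitute your registry's name):

openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
    -keyout registry.key -out registry.crt \
    -subj "/CN=bastion" \
    -addext "subjectAltName=DNS:bastion"

The registry would then need to be restarted with the new certificate.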

Please let me know after you make this change and test and we can close this out.

Comment 14 Philip Chan 2021-12-14 19:56:31 UTC
Hi Prashanth,

I created the certificate using a SAN instead of CN, and was able to successfully perform a disconnected install with one of the latest OCP nightly builds, 4.10.0-0.nightly-s390x-2021-12-14-101550.  Thank you again for your help.

Comment 15 Prashanth Sundararaman 2021-12-14 22:38:09 UTC
Thanks for the confirmation Phil.

