Created attachment 1845310 [details]
The bootkube.service log from the bootstrap-0 node.

Description of problem:
When performing a disconnected installation with any of the latest OCP 4.10 nightly builds, bootkube.service encounters the following error and loops:

Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: {"level":"warn","ts":1638985238.1713336,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00043c380/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: Error: unhealthy cluster
Dec 08 17:40:38 bootstrap-0.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com bootkube.sh[2345]: etcdctl failed. Retrying in 5 seconds...

The problem only seems to happen when performing disconnected installs on KVM. We do not see this issue on zVM.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-s390x-2021-12-08-123942

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Use OCP build 4.10.0-0.nightly-s390x-2021-12-08-123942 to perform a disconnected install.
2. Start the bootstrap node.
3. Monitor bootkube.service on the bootstrap node for the etcd error.

Attached is a copy of the bootkube.service log.

Actual results:
The etcd server is unavailable.

Expected results:
A healthy etcd server to establish a connection to.

Additional info:
The etcd issue was first reported in an earlier bugzilla, 2029289. For KVM, we had two issues: the problem common to KVM and zVM was resolved by consolidating the pullSecret stanza. However, only KVM would then go on to encounter the etcd failure detailed above.
Can you please attach the `openshift-install gather bootstrap` logs? Also, what state are the masters in? Are the VMs provisioned, and are they up and reachable?
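For reference, a minimal sketch of how the gather is typically invoked from the installation host; the node addresses and SSH key path below are placeholders, not values from this environment:

# Run from the directory containing the install assets; addresses and key path are placeholders.
openshift-install gather bootstrap \
  --bootstrap <bootstrap-node-address> \
  --master <master-0-address> \
  --master <master-1-address> \
  --master <master-2-address> \
  --key ~/.ssh/id_rsa

This produces a log bundle tarball that can be attached to the bug.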
Hi Prashanth,

Yes, the master nodes are provisioned. All three are not reachable, as they are stuck at the Ignition phase:

[ ***] A start job is running for Ignition (fetch) (28min 5s / no limit)
[ 1689.088688] ignition[746]: GET https://api-int.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com:22623/config/master: attempt #341
[ 1689.088745] ignition[746]: GET error: Get "https://api-int.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com:22623/config/master": dial tcp: lookup api-int.pok-239-dec-qemu.ocptest.pok.stglabs.ibm.com on 192.168.79.1:53: dial udp 192.168.79.1:53: connect: network is unreachable

Since the bootstrap has not yet completed coming up, I've run a gather of the bootstrap logs and will attach it.
Created attachment 1845348 [details] openshift-install gather bootstrap
Prashanth,

Thank you for your assistance.

1. Phil and I have determined that this issue was introduced with OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2021-11-17-220453. Per this OCP 4.10 build's release.txt file, the corresponding RHCOS build is 410.84.202111172002-0.
2. This issue is only seen when conducting disconnected installs in an OCP 4.10 on Z KVM environment.
3. This issue is not seen when conducting disconnected installs in an OCP 4.10 on Z zVM environment.
4. This issue is not seen with OCP 4.6, 4.7, 4.8, or 4.9 on Z KVM and zVM environments.

Thank you,
Kyle
Kyle/Phil,

The bootstrap gather should have the logs of the containers on the bootstrap node. Did you not see those getting generated? If not, could you check the status of the various containers on the bootstrap by running `sudo crictl ps -a` and try to get the logs of the containers that are running?

Thanks,
Prashanth
Hi Prashanth,

There do not seem to be any containers running on the bootstrap node. If I run `sudo crictl ps -a`, it shows no containers:

[core@bootstrap-0 ~]$ sudo crictl ps -a
CONTAINER  IMAGE  CREATED  STATE  NAME  ATTEMPT  POD ID
[core@bootstrap-0 ~]$

If I try running podman to view any container that has tried running but exited, this is the output:

[core@bootstrap-0 ~]$ sudo podman ps -a
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
9a73ffd0bec8  bastion:5000/ocp4/openshift4@sha256:ee27b0a0106f2a1d90f74e181797f31138213a8eb1ad62c31570222339b73484  render --output-d...  35 minutes ago  Exited (0) 35 minutes ago  cvo-render
ba07201951c9  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f150b7b4e61aa31616cc493f0ae5989e20c5f00938e4a3037e35d20443bb3af  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago  etcd-render
3ab059c94698  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c28c6bfe172d19fd08f2c63c07c54bb6d88cc7076a967b19d0a0ed5e02d66b4  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago  config-render
5627c634e900  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4329ccc4bbbdf1e26b5f0ba75f75611c53286750a74bb9854822120edbd57220  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago  kube-apiserver-render
65d9575c4248  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b2392f28c4bf6188986103dbb4ce46c0c2f15669c011e652e49314b55f51060b  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago  kube-controller-render
1927f923c1e9  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ae774a0083d2ffba8dbd14113b905fdeb45e5e83b3e056fb0dda428427b1448  /usr/bin/cluster-...  35 minutes ago  Exited (0) 35 minutes ago  kube-scheduler-render
ae953b9eedf8  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f8ee2501ec4d026332d401ff1b1d5a797904ac841a3b0653231366b39618e53  render --prefix=c...  35 minutes ago  Exited (0) 35 minutes ago  ingress-render
2419090636d3  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:46b7f2c29bca7b1e4dbb1a2b560cfb10673c026bdceed858530825b077874c14  bootstrap --root-...  35 minutes ago  Exited (0) 35 minutes ago  mco-render
f5bc855dc2d9  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f4ecbd62ce8778b79e5e890d50925a301c103966cef0f99e90ad9b044caaeba2  render --dest-dir...  35 minutes ago  Exited (0) 35 minutes ago  cco-render
2176b4157192  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6bc4dc417794513a56d52a80510ba4fbe54418ba1307fee28cf6eab3e500798  --dial-timeout=10...  1 second ago  Up 1 second ago  etcdctl

As you can see with the last container (2176b4157192), it keeps trying to restart etcdctl over and over again.
Assigning this bug back to Phil per discussion with the bug triage team, as we are still gathering information on this bug.

Hi Prashanth, per Phil's Comment 6, he couldn't get any additional information with regard to any running containers on the bootstrap. Is there any more information that we need from him?
Hi Prashanth,

I believe this etcd container that is stuck trying to restart is the problem. Would you know how to gather the logs of a container that stops running? It only seems to provide a window of about 5 seconds before the container stops. Once it's down, I can no longer view any of the logs.
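One way this could be captured, sketched here under the assumption that the etcdctl container is being run by podman (as in the `podman ps -a` output above) and that the name "etcdctl" stays the same across restarts, is to poll for the container and append its logs to a file each time it appears:

# Poll for the etcdctl container and append its logs to a file before it exits again.
# The container name "etcdctl" matches the podman output above; adjust if it differs.
while true; do
  ID=$(sudo podman ps -a --filter name=etcdctl -q | head -n 1)
  if [ -n "$ID" ]; then
    sudo podman logs "$ID" >> etcdctl-container.log 2>&1
  fi
  sleep 1
done

The log file will contain duplicates from repeated dumps, but it should at least preserve the output of the short-lived container.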
Could you get the kubelet logs through a `journalctl -u kubelet`, and also a full copy of the journal logs by just running `journalctl`?
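For attaching to the bug, something like the following (run on the bootstrap node; the output file names are arbitrary) should capture both without the pager truncating anything:

# Capture the kubelet unit log and the full journal to files for attachment.
sudo journalctl -u kubelet --no-pager > kubelet.log
sudo journalctl --no-pager > full-journal.log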
Hi Prashanth, Yes, I was able to capture both the kubelet and full journal logs from the bootstrap node. Please see attached.
Created attachment 1846162 [details] Full journalctl from bootstrap node
Created attachment 1846163 [details] journalctl of kubelet from bootstrap
OK, looking at the logs I see:

Dec 11 06:07:13 bootstrap-0.pok-87-nov-qemu.ocptest.pok.stglabs.ibm.com kubelet.sh[2311]: E1211 06:07:13.689256 2323 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_bootstrap-machine-config-operator-bootstrap-0.pok-87-nov-qemu.ocptest.pok.stglabs.ibm.com_default_0d9c273f06d500edc3a13759cd20eabf_0\": initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:101f7335a2364288cbba63643da2504be563fece98c76acfbe6b9c4a2dab003d: (Mirrors also failed: [bastion:5000/ocp4/openshift4@sha256:101f7335a2364288cbba63643da2504be563fece98c76acfbe6b9c4a2dab003d: pinging container registry bastion:5000: Get \"https://bastion:5000/v2/\": x509: certificate relies on legacy Common Name field, use SANs instead]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:101f7335a2364288cbba63643da2504be563fece98c76acfbe6b9c4a2dab003d: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 3.214.233.154:443: connect: no route to host" pod="default/bootstrap-machine-config-operator-bootstrap-0.pok-87-nov-qemu.ocptest.pok.stglabs.ibm.com"

This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1882191. Instead of using commonName in the certificates for your private registry, you need to use SANs. Note that with golang 1.17, GODEBUG=x509ignoreCN=0 is no longer supported, so you have to use SANs.

Please let me know after you make this change and test, and we can close this out.
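For reference, a minimal sketch of regenerating a self-signed registry certificate with a SAN; the hostname values, key/cert file names, and validity period are assumptions for this environment, not taken from the bug:

# Generate a certificate for the mirror registry with a subjectAltName covering
# the registry hostname, instead of relying on the legacy CN field alone.
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout registry.key -out registry.crt \
  -subj "/CN=bastion" \
  -addext "subjectAltName=DNS:bastion,DNS:bastion.example.com"

The regenerated certificate would then need to be trusted again wherever the old one was, for example in the additionalTrustBundle of install-config.yaml for the disconnected install.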
Hi Prashanth,

I created the certificate using a SAN instead of CN, and was able to successfully perform a disconnected install with one of the latest OCP nightly builds, 4.10.0-0.nightly-s390x-2021-12-14-101550.

Thank you again for your help.
Thanks for the confirmation Phil.