Description of problem:

1. When attempting to install a z/VM hosted OCP 4.9 on Z cluster with any of the three current builds from August 12, 2021, the installation fails: the master nodes either do not boot successfully from the bootstrap node (builds 4.9.0-0.nightly-s390x-2021-08-12-033000 and 4.9.0-0.nightly-s390x-2021-08-12-044003) or never complete installation (build 4.9.0-0.nightly-s390x-2021-08-12-055355).

2. With the (current) latest OCP 4.9 build from August 12, 2021, 4.9.0-0.nightly-s390x-2021-08-12-055355, the master nodes install but never join the OCP cluster.

3. Here is an install of the 4.9.0-0.nightly-s390x-2021-08-12-055355 build. Logging in to the bootstrap node as the core userid shows a "crio.service" failure, and the "journalctl -b -f -u release-image.service -u bootkube.service" command reports an unhealthy cluster:

This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service.
To watch their status, run e.g.
  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 07:53:37 UTC. --
Aug 12 08:02:08 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:09 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: {"level":"warn","ts":1628755339.550245,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001a2000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: {"level":"warn","ts":1628755350.320205,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000550380/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
q
^C
[core@bootstrap-0 ~]$

4. Here is a second install of the 4.9.0-0.nightly-s390x-2021-08-12-055355 build. Logging in to the bootstrap node as the core userid again shows a "crio.service" failure, and the "journalctl -b -f -u release-image.service -u bootkube.service" command reports an unhealthy cluster:

This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service.
To watch their status, run e.g.
  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 08:15:58 UTC. --
Aug 12 08:20:56 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:20:57 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: {"level":"warn","ts":1628756467.5806026,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000394a80/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: {"level":"warn","ts":1628756478.3589513,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003e4700/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
^C
[core@bootstrap-0 ~]$

Version-Release number of selected component (if applicable):
1. 4.9.0-0.nightly-s390x-2021-08-12-033000
2. 4.9.0-0.nightly-s390x-2021-08-12-044003
3. 4.9.0-0.nightly-s390x-2021-08-12-055355

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Initiate installation of a z/VM hosted OCP 4.9.0 on Z cluster using the OCP 4.9.0-0.nightly-s390x-2021-08-12-055355 build and the corresponding RHCOS 49.84.202108100912-0 build.

Actual results:
Failed installation of the z/VM hosted OCP 4.9 on Z cluster. Installation does not proceed beyond the master/control plane node installation phase, and no master/control plane nodes join the OCP cluster.

Expected results:
Successful installation of the z/VM hosted OCP 4.9 on Z cluster.
Additional info: Thank you.
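(For reference, here is a rough sketch of how the bootstrap-node data in this report can be collected; the install directory, SSH key path, and node IPs below are placeholders rather than values from this environment.)

  # From the installer workstation, using the install directory:
  openshift-install gather bootstrap --dir <install-dir> \
      --key ~/.ssh/<private-key> \
      --bootstrap <bootstrap-ip> \
      --master <master-0-ip> --master <master-1-ip> --master <master-2-ip>

  # Or directly on the bootstrap node as the core userid:
  systemctl --failed
  journalctl -b -u release-image.service -u bootkube.service --no-pager | tail -n 100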
Folks,

Here is some additional information to help with this issue.

1. The same issue seen with OCP 4.9 build 4.9.0-0.nightly-s390x-2021-08-12-055355 is now also seen with subsequent OCP 4.9 builds:
   1. 4.9.0-0.nightly-s390x-2021-08-12-094356
   2. 4.9.0-0.nightly-s390x-2021-08-12-102845

2. A snippet from the bootstrap node upon login as the core userid shows the failed crio.service and the following "journalctl -b -f -u release-image.service -u bootkube.service" output:

This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service.
To watch their status, run e.g.
  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 12:45:44 UTC. --
Aug 12 12:51:11 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:11 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: {"level":"warn","ts":1628772681.832625,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004328c0/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:22 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: {"level":"warn","ts":1628772692.6208522,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004c21c0/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...

3. Here is the output from the "systemctl status crio.service" command:

[core@bootstrap-0 ~]$ systemctl status crio.service
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-08-12 12:56:32 UTC; 9s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 22286 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAIL>
 Main PID: 22286 (code=exited, status=1/FAILURE)

Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102416230Z" level=info msg="Node configuration value for hugetlb cgroup is>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102444233Z" level=info msg="Node configuration value for pid cgroup is tru>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102472479Z" level=info msg="Node configuration value for memoryswap cgroup>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.109244230Z" level=info msg="Node configuration value for systemd CollectMo>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.114660118Z" level=info msg="Node configuration value for systemd AllowedCP>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.115798982Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.184122261Z" level=fatal msg="validating runtime config: conmon validation:>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
[core@bootstrap-0 ~]$

Thank you,
Kyle
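(For anyone hitting the same crio.service failure, here is a hedged sketch of checks that can be run on the bootstrap node to see the full, untruncated "conmon validation" message and which conmon path CRI-O is configured to use; the config locations below are the usual CRI-O defaults and may differ on RHCOS.)

  # Full crio failure message without journal line truncation:
  journalctl -b -u crio.service --no-pager | grep -iE 'fatal|conmon'

  # Where CRI-O expects to find conmon (main config plus drop-ins):
  grep -ri conmon /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null

  # Whether a conmon binary is actually present on the node:
  command -v conmon
  rpm -q conmon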
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1992557. The fix is https://github.com/openshift/machine-config-operator/pull/2712.
*** This bug has been marked as a duplicate of bug 1992557 ***
FYI. All 10 OCP 4.9 builds from yesterday, August 12th, and the 3 current OCP 4.9 builds from today, August 13th, are broken with the existing crio issue. Specifically:

August 12th builds:
1. 4.9.0-0.nightly-s390x-2021-08-12-033000
2. 4.9.0-0.nightly-s390x-2021-08-12-044003
3. 4.9.0-0.nightly-s390x-2021-08-12-055355
4. 4.9.0-0.nightly-s390x-2021-08-12-094356
5. 4.9.0-0.nightly-s390x-2021-08-12-102845
6. 4.9.0-0.nightly-s390x-2021-08-12-121353
7. 4.9.0-0.nightly-s390x-2021-08-12-171820
8. 4.9.0-0.nightly-s390x-2021-08-12-183918
9. 4.9.0-0.nightly-s390x-2021-08-12-195352
10. 4.9.0-0.nightly-s390x-2021-08-12-220446

August 13th builds:
1. 4.9.0-0.nightly-s390x-2021-08-13-000210
2. 4.9.0-0.nightly-s390x-2021-08-13-011451
3. 4.9.0-0.nightly-s390x-2021-08-13-020317

Thank you,
Kyle
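(In case it helps correlate these payloads with the RHCOS content they pull in, here is a rough sketch of how the embedded machine-os version of a given nightly can be checked; RELEASE_PULLSPEC is a placeholder for the nightly's pull spec from the s390x release controller.)

  RELEASE_PULLSPEC=<nightly-release-pullspec>

  # Component versions of the payload, including the machine-os (RHCOS) build:
  oc adm release info "$RELEASE_PULLSPEC" | grep -iE 'machine-os|Name:'

  # Pull spec of the machine-os-content image shipped in that payload:
  oc adm release info --image-for=machine-os-content "$RELEASE_PULLSPEC"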
FYI. Coupled with the RHCOS 49.84.202108130913-0 build (which seems to resolve the crio issue), the last 3 OCP 4.9 on Z builds from August 13th and the 2 OCP 4.9 on Z builds from August 14th all successfully install. Specifically:

August 13th builds:
1. 4.9.0-0.nightly-s390x-2021-08-13-143049
2. 4.9.0-0.nightly-s390x-2021-08-13-145651
3. 4.9.0-0.nightly-s390x-2021-08-13-175700

August 14th builds:
1. 4.9.0-0.nightly-s390x-2021-08-14-044503
2. 4.9.0-0.nightly-s390x-2021-08-14-065517

Thank you,
Kyle
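(For completeness, a quick sketch of how the RHCOS build actually running on the nodes can be confirmed after a successful install; run on a node as the core userid, or via "oc debug node/<name>".)

  # On the node: the booted deployment version corresponds to the RHCOS build, e.g. 49.84.202108130913-0
  rpm-ostree status
  grep -i version /etc/os-release

  # From the installer workstation: the OS-IMAGE column shows the RHCOS version
  oc get nodes -o wide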