Bug 1992995
| Summary: | OCP 4.9.0-0.nightly-s390x-2021-08-12-033000, 044003, 055355 builds fail installation with master/control plane node issues | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | krmoser |
| Component: | Multi-Arch | Assignee: | Dennis Gilmore <dgilmore> |
| Status: | CLOSED DUPLICATE | QA Contact: | Barry Donahue <bdonahue> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | chanphil, christian.lapolt, Holger.Wolf, psundara, sorth, wolfgang.voesch |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-08-12 14:02:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Folks,
Here is some additional information to help with this issue.
1. The same issue seen with OCP 4.9 build 4.9.0-0.nightly-s390x-2021-08-12-055355 is also seen with the subsequent OCP 4.9 builds:
1. 4.9.0-0.nightly-s390x-2021-08-12-094356
2. 4.9.0-0.nightly-s390x-2021-08-12-102845
2. A snippet from the bootstrap node, upon login as the core userid, shows crio.service in a failed state and the following "journalctl -b -f -u release-image.service -u bootkube.service" output (see also the manual checks sketched after item 3 below):
This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.
journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 12:45:44 UTC. --
Aug 12 12:51:11 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:11 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: {"level":"warn","ts":1628772681.832625,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004328c0/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:22 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: {"level":"warn","ts":1628772692.6208522,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004c21c0/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
3. Here is the output from the "systemctl status crio.service" command:
[core@bootstrap-0 ~]$ systemctl status crio.service
● crio.service - Container Runtime Interface for OCI (CRI-O)
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2021-08-12 12:56:32 UTC; 9s ago
Docs: https://github.com/cri-o/cri-o
Process: 22286 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAIL>
Main PID: 22286 (code=exited, status=1/FAILURE)
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102416230Z" level=info msg="Node configuration value for hugetlb cgroup is>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102444233Z" level=info msg="Node configuration value for pid cgroup is tru>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102472479Z" level=info msg="Node configuration value for memoryswap cgroup>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.109244230Z" level=info msg="Node configuration value for systemd CollectMo>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.114660118Z" level=info msg="Node configuration value for systemd AllowedCP>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.115798982Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.184122261Z" level=fatal msg="validating runtime config: conmon validation:>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
[core@bootstrap-0 ~]$
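Two quick manual checks tie the symptoms in items 2 and 3 together; these are only a sketch of follow-up commands run as the core user on the bootstrap node, not steps taken from this report. The "connection refused" on [::1]:2379 in item 2 means nothing is listening on the etcd client port yet, which is consistent with crio.service being down (no etcd container can be created), and the truncated fatal line in item 3 points at CRI-O's conmon validation of its runtime config. The conmon paths below are assumptions and may differ on this RHCOS build:
[core@bootstrap-0 ~]$ sudo ss -tlnp | grep 2379                               # no listener expected while the etcd member is not running
[core@bootstrap-0 ~]$ sudo crictl ps -a | grep etcd                           # errors or comes back empty here, since CRI-O itself is not running
[core@bootstrap-0 ~]$ sudo /usr/bin/crio config 2>/dev/null | grep -i conmon  # dump the active CRI-O config and inspect the conmon setting
[core@bootstrap-0 ~]$ ls -l /usr/libexec/crio/conmon /usr/bin/conmon          # check which candidate conmon binaries are actually present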
Thank you,
Kyle
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1992557. The fix is https://github.com/openshift/machine-config-operator/pull/2712.

*** This bug has been marked as a duplicate of bug 1992557 ***

FYI. All 10 OCP 4.9 builds from yesterday, August 12th, and the 3 current OCP 4.9 builds from today, August 13th, are broken with the existing crio issue. Specifically:

August 12th:
1. 4.9.0-0.nightly-s390x-2021-08-12-033000
2. 4.9.0-0.nightly-s390x-2021-08-12-044003
3. 4.9.0-0.nightly-s390x-2021-08-12-055355
4. 4.9.0-0.nightly-s390x-2021-08-12-094356
5. 4.9.0-0.nightly-s390x-2021-08-12-102845
6. 4.9.0-0.nightly-s390x-2021-08-12-121353
7. 4.9.0-0.nightly-s390x-2021-08-12-171820
8. 4.9.0-0.nightly-s390x-2021-08-12-183918
9. 4.9.0-0.nightly-s390x-2021-08-12-195352
10. 4.9.0-0.nightly-s390x-2021-08-12-220446

August 13th:
1. 4.9.0-0.nightly-s390x-2021-08-13-000210
2. 4.9.0-0.nightly-s390x-2021-08-13-011451
3. 4.9.0-0.nightly-s390x-2021-08-13-020317

Thank you,
Kyle

FYI. Coupled with the RHCOS 49.84.202108130913-0 build (which seems to resolve the crio issue), the last 3 OCP 4.9 on Z builds from August 13th and the 2 OCP 4.9 on Z builds from August 14th all successfully install. Specifically:

August 13th:
1. 4.9.0-0.nightly-s390x-2021-08-13-143049
2. 4.9.0-0.nightly-s390x-2021-08-13-145651
3. 4.9.0-0.nightly-s390x-2021-08-13-175700

August 14th:
1. 4.9.0-0.nightly-s390x-2021-08-14-044503
2. 4.9.0-0.nightly-s390x-2021-08-14-065517

Thank you,
Kyle
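As a sanity check before retrying an install, the machine-os (RHCOS) build carried by a given nightly can be read from the release image metadata with oc adm release info. The pull spec below is only an assumed example of an s390x nightly release image, not one quoted in this report:
$ oc adm release info registry.ci.openshift.org/ocp-s390x/release-s390x:4.9.0-0.nightly-s390x-2021-08-13-143049 | grep machine-os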
Description of problem:

1. When attempting to install a z/VM hosted OCP 4.9 on Z cluster with any of the 3 current builds from August 12, 2021, the installations fail: either the master nodes do not boot successfully from the bootstrap node (builds 4.9.0-0.nightly-s390x-2021-08-12-033000 and 4.9.0-0.nightly-s390x-2021-08-12-044003) or the installation never completes (build 4.9.0-0.nightly-s390x-2021-08-12-055355).

2. Using the (current) latest OCP 4.9 build from August 12, 2021, build 4.9.0-0.nightly-s390x-2021-08-12-055355, the master nodes install but never join the OCP cluster.

3. Here is an install of the 4.9.0-0.nightly-s390x-2021-08-12-055355 build with the bootstrap core userid login, which shows a "crio.service" failure and the following unhealthy cluster information as displayed by the "journalctl -b -f -u release-image.service -u bootkube.service" command:

This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.
journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 07:53:37 UTC. --
Aug 12 08:02:08 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:09 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: {"level":"warn","ts":1628755339.550245,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001a2000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: {"level":"warn","ts":1628755350.320205,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000550380/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
q
^C
[core@bootstrap-0 ~]$
4. Here is a second install of the 4.9.0-0.nightly-s390x-2021-08-12-055355 build with the bootstrap core userid login, which shows a "crio.service" failure and the following unhealthy cluster information as displayed by the "journalctl -b -f -u release-image.service -u bootkube.service" command:

This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.
journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 08:15:58 UTC. --
Aug 12 08:20:56 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:20:57 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: {"level":"warn","ts":1628756467.5806026,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000394a80/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: {"level":"warn","ts":1628756478.3589513,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003e4700/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
^C
[core@bootstrap-0 ~]$

Version-Release number of selected component (if applicable):
1. 4.9.0-0.nightly-s390x-2021-08-12-033000
2. 4.9.0-0.nightly-s390x-2021-08-12-044003
3. 4.9.0-0.nightly-s390x-2021-08-12-055355

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Initiate installation of a z/VM hosted OCP 4.9.0 on Z cluster using the OCP 4.9.0-0.nightly-s390x-2021-08-12-055355 build and the corresponding RHCOS 49.84.202108100912-0 build.

Actual results:
Failed installation of z/VM hosted OCP 4.9 on Z cluster. Installation does not proceed beyond the master/control plane node installation phase, with no master/control plane nodes joining the OCP cluster.

Expected results:
Successful installation of the z/VM hosted OCP 4.9 on Z cluster.
Additional info:
Thank you.
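For failures at this stage, the standard installer commands can capture the bootstrap-time evidence before the nodes are torn down; a minimal sketch, with <install-dir>, <bootstrap-ip> and <master-ip> as placeholders:
$ openshift-install wait-for bootstrap-complete --dir <install-dir> --log-level=debug    # surfaces the bootstrap failure reason
$ openshift-install gather bootstrap --dir <install-dir> --bootstrap <bootstrap-ip> --master <master-ip>    # bundles bootstrap and control plane journals for the bug report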