Bug 1992995 - OCP 4.9.0-0.nightly-s390x-2021-08-12-033000, 044003, 055355 builds fail installation with master/control plane node issues
Summary: OCP 4.9.0-0.nightly-s390x-2021-08-12-033000, 044003, 055355 builds fail installation with master/control plane node issues
Keywords:
Status: CLOSED DUPLICATE of bug 1992557
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.9
Hardware: s390x
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dennis Gilmore
QA Contact: Barry Donahue
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-12 08:25 UTC by krmoser
Modified: 2021-08-16 04:06 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-12 14:02:49 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
  Red Hat Issue Tracker MULTIARCH-1583 (last updated 2021-08-12 08:26:05 UTC)

Description krmoser 2021-08-12 08:25:17 UTC
Description of problem:
1. When attempting to install a z/VM hosted OCP 4.9 on Z cluster with any of the 3 current builds from August 12, 2021, the installations fail: with builds 4.9.0-0.nightly-s390x-2021-08-12-033000 and 4.9.0-0.nightly-s390x-2021-08-12-044003, the master nodes do not boot successfully from the bootstrap node; with build 4.9.0-0.nightly-s390x-2021-08-12-055355, the installation does not complete successfully.
  
2. Using the (current) latest OCP 4.9 build from August 12, 2021, build 4.9.0-0.nightly-s390x-2021-08-12-055355, the master nodes install but never join the OCP cluster.

3. Here is an install of the 4.9.0-0.nightly-s390x-2021-08-12-055355 build. Logging in to the bootstrap node as the core userid shows a "crio.service" failure, and the
"journalctl -b -f -u release-image.service -u bootkube.service" command displays the following unhealthy cluster information:


This is the bootstrap node; it will be destroyed when the master is fully up.

The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 07:53:37 UTC. --
Aug 12 08:02:08 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:09 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: {"level":"warn","ts":1628755339.550245,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001a2000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:19 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: {"level":"warn","ts":1628755350.320205,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000550380/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: Error: unhealthy cluster
Aug 12 08:02:30 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2409]: etcdctl failed. Retrying in 5 seconds...
^C
[core@bootstrap-0 ~]$
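
For reference, the etcdctl retry behavior shown above is effectively a loop along these lines (a sketch of the observed behavior only, not the shipped bootkube.sh script; the real invocation also passes the etcd client certificate flags):

  # Sketch: retry the etcd health check every 5 seconds until it passes
  until etcdctl --endpoints=https://localhost:2379 endpoint health; do
      echo "etcdctl failed. Retrying in 5 seconds..."
      sleep 5
  done

The "connection refused" on port 2379 in the log indicates the check never passes because etcd never comes up on the bootstrap node.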


4. Here is a second install of the 4.9.0-0.nightly-s390x-2021-08-12-055355 build. Logging in to the bootstrap node as the core userid again shows a "crio.service" failure, and the
"journalctl -b -f -u release-image.service -u bootkube.service" command displays the following unhealthy cluster information:

This is the bootstrap node; it will be destroyed when the master is fully up.

The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 08:15:58 UTC. --
Aug 12 08:20:56 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:20:57 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: {"level":"warn","ts":1628756467.5806026,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000394a80/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:21:07 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: {"level":"warn","ts":1628756478.3589513,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003e4700/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: Error: unhealthy cluster
Aug 12 08:21:18 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2388]: etcdctl failed. Retrying in 5 seconds...
^C
[core@bootstrap-0 ~]$




Version-Release number of selected component (if applicable):
1. 4.9.0-0.nightly-s390x-2021-08-12-033000
2. 4.9.0-0.nightly-s390x-2021-08-12-044003
3. 4.9.0-0.nightly-s390x-2021-08-12-055355

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Initiate installation of a z/VM hosted OCP 4.9.0 on Z cluster using the OCP 4.9.0-0.nightly-s390x-2021-08-12-055355 build and the corresponding RHCOS 49.84.202108100912-0 build.
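2. Monitor the bootstrap progress with the standard installer command, e.g. (<install_dir> is the installation directory used in step 1):

  # Waits for the bootstrap-complete event; times out when the control plane never forms
  ./openshift-install wait-for bootstrap-complete --dir=<install_dir> --log-level=debug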


Actual results:
Failed installation of the z/VM hosted OCP 4.9 on Z cluster. Installation does not proceed beyond the master/control plane node installation phase; no master/control plane nodes join the OCP cluster.

Expected results:
Successful installation of the z/VM hosted OCP 4.9 on Z cluster.

Additional info:


Thank you.

Comment 1 krmoser 2021-08-12 12:58:27 UTC
Folks,

Here is some additional information to help with this issue.

1. The same issue seen with OCP 4.9 4.9.0-0.nightly-s390x-2021-08-12-055355 is now also seen with subsequent OCP 4.9 builds:
  1. 4.9.0-0.nightly-s390x-2021-08-12-094356
  2. 4.9.0-0.nightly-s390x-2021-08-12-102845

2. A snippet from the bootstrap node, upon login as the core userid, shows the crio.service failure and the following "journalctl -b -f -u release-image.service -u bootkube.service" output:


This is the bootstrap node; it will be destroyed when the master is fully up.

The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  crio.service
[core@bootstrap-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-08-12 12:45:44 UTC. --
Aug 12 12:51:11 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:11 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: {"level":"warn","ts":1628772681.832625,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004328c0/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 12:51:21 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:22 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: {"level":"warn","ts":1628772692.6208522,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004c21c0/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: Error: unhealthy cluster
Aug 12 12:51:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com bootkube.sh[2413]: etcdctl failed. Retrying in 5 seconds...



3. Here is the output from the "systemctl status crio.service" command:

[core@bootstrap-0 ~]$ systemctl status crio.service
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-08-12 12:56:32 UTC; 9s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 22286 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAIL>
 Main PID: 22286 (code=exited, status=1/FAILURE)

Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102416230Z" level=info msg="Node configuration value for hugetlb cgroup is>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102444233Z" level=info msg="Node configuration value for pid cgroup is tru>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.102472479Z" level=info msg="Node configuration value for memoryswap cgroup>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.109244230Z" level=info msg="Node configuration value for systemd CollectMo>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.114660118Z" level=info msg="Node configuration value for systemd AllowedCP>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.115798982Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com crio[22286]: time="2021-08-12 12:56:32.184122261Z" level=fatal msg="validating runtime config: conmon validation:>
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 12:56:32 bootstrap-0.pok-220.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
[core@bootstrap-0 ~]$
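
The fatal "validating runtime config: conmon validation" message points at the conmon binary on the bootstrap node. A few quick checks that could confirm this (assuming the standard CRI-O layout on RHCOS):

  # Is the conmon binary present, executable, and reporting a version?
  which conmon
  conmon --version
  # Which conmon path is CRI-O configured to use?
  sudo crio config 2>/dev/null | grep -i conmon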


Thank you,
Kyle

Comment 3 Prashanth Sundararaman 2021-08-12 14:02:49 UTC

*** This bug has been marked as a duplicate of bug 1992557 ***

Comment 4 krmoser 2021-08-13 09:56:30 UTC
FYI.

All 10 OCP 4.9 builds from yesterday, August 12th, and the 3 current OCP 4.9 builds from today, August 13th, are broken by the existing crio issue. Specifically:

 1. 4.9.0-0.nightly-s390x-2021-08-12-033000
 2. 4.9.0-0.nightly-s390x-2021-08-12-044003
 3. 4.9.0-0.nightly-s390x-2021-08-12-055355
 4. 4.9.0-0.nightly-s390x-2021-08-12-094356
 5. 4.9.0-0.nightly-s390x-2021-08-12-102845
 6. 4.9.0-0.nightly-s390x-2021-08-12-121353
 7. 4.9.0-0.nightly-s390x-2021-08-12-171820
 8. 4.9.0-0.nightly-s390x-2021-08-12-183918
 9. 4.9.0-0.nightly-s390x-2021-08-12-195352
10. 4.9.0-0.nightly-s390x-2021-08-12-220446

 1. 4.9.0-0.nightly-s390x-2021-08-13-000210
 2. 4.9.0-0.nightly-s390x-2021-08-13-011451
 3. 4.9.0-0.nightly-s390x-2021-08-13-020317


Thank you,
Kyle

Comment 5 krmoser 2021-08-16 04:06:31 UTC
FYI.

Coupled with the RHCOS 49.84.202108130913-0 build (which seems to resolve the crio issue), the last 3 OCP 4.9 on Z builds from August 13th and the 2 OCP 4.9 on Z builds from August 14th all successfully install.
Specifically:

1. 4.9.0-0.nightly-s390x-2021-08-13-143049
2. 4.9.0-0.nightly-s390x-2021-08-13-145651
3. 4.9.0-0.nightly-s390x-2021-08-13-175700

1. 4.9.0-0.nightly-s390x-2021-08-14-044503
2. 4.9.0-0.nightly-s390x-2021-08-14-065517
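
For reference, the RHCOS build each node booted can be confirmed after installation (a quick check, assuming oc access to the cluster):

  # The OS-IMAGE column shows the RHCOS build, e.g. 49.84.202108130913-0
  oc get nodes -o wide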


Thank you,
Kyle

