Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2073944

Summary: OCP 4.11 on Z builds with RHCOS 411 builds built on RHEL 8.6 fail installation for most zVM and all KVM environments tested
Product: OpenShift Container Platform
Reporter: krmoser
Component: Multi-Arch
Assignee: Jeremy Poulin <jpoulin>
Multi-Arch sub component: Multi-Arch
QA Contact: Douglas Slavens <dslavens>
Status: CLOSED DUPLICATE
Severity: high
Priority: high
CC: aos-bugs, chanphil, christian.lapolt, danili, dbenoit, dwalsh, fleber, Holger.Wolf, jschinta
Version: 4.11
Target Milestone: ---
Target Release: 4.11.0
Hardware: s390x
OS: Linux
Last Closed: 2022-04-18 16:58:51 UTC
Type: Bug
Attachments:
- must-gather for failing install of 4.11.0-0.nightly-s390x-2022-04-10-104512
- master-0 node journalctl log
- master-1 node journalctl log
- master-2 node journalctl log
- bootstrap node journalctl log
- podman journal log

Description krmoser 2022-04-11 08:16:12 UTC
Description of problem:
Our Solution Test team has been testing the April 8th through 10th, 2022 OCP 4.11 on Z builds, which now require RHCOS 411 builds built on RHEL 8.6, identified by the "411.86" prefix in their build names.  To date, with these OCP 4.11 builds based on RHCOS 411.86, we have had limited success with installations in zVM environments and no success in KVM environments.

1. For the RHCOS 411 on Z builds based on RHEL 8.6 and released on April 7th and 8th, 2022, only specific zVM environments successfully complete OCP 4.11 installations; most zVM environments and all KVM environments fail to complete installation.

2. All of the KVM and zVM environments' hypervisors used to date are installed with the March 8th RHEL 8.5 kernel.

# uname -a
Linux ospcmgr1 4.18.0-348.20.1.el8_5.s390x #1 SMP Tue Mar 8 13:01:59 EST 2022 s390x s390x s390x GNU/Linux
#



3. The same KVM and zVM environments, without any changes, successfully complete installation of all previously tested OCP 4.11 on Z builds based on RHCOS 411 RHEL 8.5 builds from March and April 2022. (The exception is the OCP 4.11 on Z builds from April 1st through mid-way April 4th, which failed all installations due to a known etcd issue that was resolved midway through the April 4, 2022 OCP 4.11 builds.)

4. For the OCP 4.11 on Z builds that require RHCOS 411 builds 411.86.202204072033-0 and 411.86.202204081208-0, installation does not complete successfully except in a zVM 7.2 environment hosted on a z16 server.

5. Multiple KVM environments hosted on the same z16 server fail to complete installation, as do KVM and zVM 7.1 environments hosted on z14 and z15 servers.  

6. The installation failures for these OCP 4.11 on Z builds that require the RHCOS 411 RHEL 8.6 builds appear to be environment related, most likely network related, rather than server related.


7. In those environments where the OCP 4.11 on Z builds based on RHCOS 411 builds 411.86.202204072033-0 and 411.86.202204081208-0 fail to install, for both KVM and zVM OCP 4.11 installations:
(1) The master nodes seem to complete the ignition process.

(2) The worker nodes do not complete the ignition process.
examples:
failed: [9.12.23.72] (item=worker-0) => {"ansible_loop_var": "item", "changed": false, "elapsed": 5000, "item": "worker-0", "msg": "Timeout when waiting for search string OpenSSH in 10.20.116.94:22"}

failed: [9.12.23.72] (item=worker-1) => {"ansible_loop_var": "item", "changed": false, "elapsed": 5000, "item": "worker-1", "msg": "Timeout when waiting for search string OpenSSH in 10.20.116.95:22"}


(3) The OCP installation does not appear to register the master node resources: the OC CLI command "oc get nodes" returns no resources, specifically the message "No resources found".

(4) The installations do not appear to install any cluster operators: per the OC CLI command "oc get co", installation does not progress past the "cloud-credential" cluster operator, and none of the cluster operators are successfully installed.

Here is an example:

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                                                                                    
baremetal                                                                                         
cloud-controller-manager                                                                          
cloud-credential                                     True        False         False      21m     
cluster-autoscaler                                                                                
config-operator                                                                                   
console                                                                                           
csi-snapshot-controller                                                                           
dns                                                                                               
etcd                                                                                              
image-registry                                                                                    
ingress                                                                                           
insights                                                                                          
kube-apiserver                                                                                    
kube-controller-manager                                                                           
kube-scheduler                                                                                    
kube-storage-version-migrator                                                                     
machine-api                                                                                       
machine-approver                                                                                  
machine-config                                                                                    
marketplace                                                                                       
monitoring                                                                                        
network                                                                                           
node-tuning                                                                                       
openshift-apiserver                                                                               
openshift-controller-manager                                                                      
openshift-samples                                                                                 
operator-lifecycle-manager                                                                        
operator-lifecycle-manager-catalog                                                                
operator-lifecycle-manager-packageserver                                                          
service-ca                                                                                        
storage                                                                                           
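
The state above can be checked mechanically from saved "oc get co" output. A minimal sketch (the saved file name and the two sample rows are assumptions, mimicking the failing cluster where only cloud-credential ever reports Available):

```shell
# Sketch: from saved 'oc get co' output with the header removed, print
# only the operators whose AVAILABLE column reads "True".
# co-sample.txt is a hypothetical stand-in for the real output.
cat > co-sample.txt <<'EOF'
cloud-credential   True   False   False   21m
etcd
EOF
awk 'NF >= 2 && $2 == "True" {print $1}' co-sample.txt
```

On the failing clusters this prints only cloud-credential; on a healthy cluster it would list every operator.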


8. For these OCP 4.11 build installation failures, in both the KVM and zVM environments, the master nodes do appear to complete their Ignition installations, although the master nodes do not progress with any further status updates on the console. In addition, the "at least XX boots ago" information differs between failing and working environments.


For the zVM 7.1 environments where the OCP 4.11 builds based on RHCOS 411.86 fail installation, here are examples of the master-0 node static console update:
==============================================================================================================================================================
04/10/22 17:47:50 Red Hat Enterprise Linux CoreOS 411.86.202204081208-0 (Ootpa) 4.11                   
04/10/22 17:47:50 Ignition: ran on 2022/04/10 21:46:10 UTC (at least 1 boot ago)                       
04/10/22 17:47:50 Ignition: user-provided config was applied                                           
04/10/22 17:47:50 SSH host key: SHA256:Qu8WiyY5Qpy4wjlF0Q/zsVsEiF6EtFGG/goD7yqW5n8 (ECDSA)             
04/10/22 17:47:50 SSH host key: SHA256:+c9S2MIniSa6l5aRqPgES6K36qAJt5z9npiKS7bMrJQ (ED25519)           
04/10/22 17:47:50 SSH host key: SHA256:/0vRoEO2oFH+s+mVIGCowqqFog/rkARzf2kDrr01shM (RSA)               
04/10/22 17:47:50 enc2e0:                                                                              
04/10/22 17:47:50 master-0 login:                                                                      


For the zVM 7.2 environment where the OCP 4.11 builds based on RHCOS 411.86 pass installation, here are examples of master-0 node regular console updates:
==========================================================================================================================================================
07:25:25 Red Hat Enterprise Linux CoreOS 411.86.202204081208-0 (Ootpa) 4.11                       
07:25:25 Ignition: ran on 2022/04/11 02:07:33 UTC (at least 0 boots ago)                          
07:25:25 Ignition: user-provided config was applied                                               
07:25:25 SSH host key: SHA256:baeWGM25+jcxvdtbu3E4ePs0foOHoDU7+A4NIWBkqig (RSA)                   
07:25:25 SSH host key: SHA256:ZwgZnExNgRxCdxvPi6iV0i1nWEFt2uYwuXaYVEl5IJA (ECDSA)                 
07:25:25 SSH host key: SHA256:DAmx/CgabhF1JznMFvpIcw0mOrjZVpOwDIKI+Mgcwqc (ED25519)               
07:25:25 enc2e0:                                                                                  
07:25:25 master-0 login:                                                                          
07:25:27 Red Hat Enterprise Linux CoreOS 411.86.202204081208-0 (Ootpa) 4.11                       
07:25:27 Ignition: ran on 2022/04/11 02:07:33 UTC (at least 0 boots ago)                          
07:25:27 Ignition: user-provided config was applied                                               
07:25:27 SSH host key: SHA256:baeWGM25+jcxvdtbu3E4ePs0foOHoDU7+A4NIWBkqig (RSA)                   
07:25:27 SSH host key: SHA256:ZwgZnExNgRxCdxvPi6iV0i1nWEFt2uYwuXaYVEl5IJA (ECDSA)                 
07:25:27 SSH host key: SHA256:DAmx/CgabhF1JznMFvpIcw0mOrjZVpOwDIKI+Mgcwqc (ED25519)               
07:25:27 enc2e0:                                                                                  
07:25:27 master-0 login:                                                                          
07:25:35 [19000.894790] device bd7decafa0d1b52 left promiscuous mode                              
07:25:36 Red Hat Enterprise Linux CoreOS 411.86.202204081208-0 (Ootpa) 4.11                       
07:25:36 Ignition: ran on 2022/04/11 02:07:33 UTC (at least 0 boots ago)                          
07:25:36 Ignition: user-provided config was applied                                               
07:25:36 SSH host key: SHA256:baeWGM25+jcxvdtbu3E4ePs0foOHoDU7+A4NIWBkqig (RSA)                   
07:25:36 SSH host key: SHA256:ZwgZnExNgRxCdxvPi6iV0i1nWEFt2uYwuXaYVEl5IJA (ECDSA)                 
07:25:36 SSH host key: SHA256:DAmx/CgabhF1JznMFvpIcw0mOrjZVpOwDIKI+Mgcwqc (ED25519)               
07:25:36 enc2e0:                                                                                  



  

Version-Release number of selected component (if applicable):
All of these builds from April 8th through 10th, 2022:

 1. 4.11.0-0.nightly-s390x-2022-04-08-132726
 2. 4.11.0-0.nightly-s390x-2022-04-08-140813
 3. 4.11.0-0.nightly-s390x-2022-04-08-145045
 4. 4.11.0-0.nightly-s390x-2022-04-08-172829
 5. 4.11.0-0.nightly-s390x-2022-04-08-172829
 6. 4.11.0-0.nightly-s390x-2022-04-08-181147
 7. 4.11.0-0.nightly-s390x-2022-04-08-184511
 8. 4.11.0-0.nightly-s390x-2022-04-08-195412

 9. 4.11.0-0.nightly-s390x-2022-04-09-142341

10. 4.11.0-0.nightly-s390x-2022-04-10-051912
11. 4.11.0-0.nightly-s390x-2022-04-10-104512
12. 4.11.0-0.nightly-s390x-2022-04-10-231906


How reproducible:
1. Consistently reproducible in almost all zVM hypervisor environments and all KVM hypervisor environments. 

2. In the one zVM 7.2 environment where these OCP 4.11 builds requiring RHCOS 411.86 builds install successfully, no installation has failed to date across 12+ different builds' install attempts.

3. In all other zVM and KVM environments, these OCP 4.11 on Z builds requiring RHCOS 411.86 builds have failed ALL installations to date across 12+ different builds' install attempts.


Steps to Reproduce:
1. Attempt an OCP 4.11 on Z build installation for any builds from April 8th through April 10th that require RHCOS builds 411.86.202204072033-0 or 411.86.202204081208-0.


Actual results:
Many OCP 4.11 on Z builds, with RHCOS builds built on RHEL 8.6, fail installation. 

Expected results:
All OCP 4.11 on Z builds, with RHCOS builds built on RHEL 8.6, should pass installation.

Additional info:
We will be providing additional information in this bugzilla, including OC CLI must-gather data.

Thank you.

Comment 1 krmoser 2022-04-11 08:55:27 UTC
1. Attempted to collect an "oc adm must-gather" for the OCP 4.11 build 4.11.0-0.nightly-s390x-2022-04-10-104512, and encountered numerous errors:

[root@ospamgr4 ~]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 35a1d176-cff0-4357-8668-0556b0c98dc2
ClusterVersion: Installing "4.11.0-0.nightly-s390x-2022-04-10-104512" for 6 minutes: Unable to apply 4.11.0-0.nightly-s390x-2022-04-10-104512: an unknown error has occurred: MultipleErrors
ClusterOperators:
        clusteroperator/authentication is not available (<missing>) because <missing>
        clusteroperator/baremetal is not available (<missing>) because <missing>
        clusteroperator/cloud-controller-manager is not available (<missing>) because <missing>
        clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
        clusteroperator/config-operator is not available (<missing>) because <missing>
        clusteroperator/console is not available (<missing>) because <missing>
        clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
        clusteroperator/dns is not available (<missing>) because <missing>
        clusteroperator/etcd is not available (<missing>) because <missing>
        clusteroperator/image-registry is not available (<missing>) because <missing>
        clusteroperator/ingress is not available (<missing>) because <missing>
        clusteroperator/insights is not available (<missing>) because <missing>
        clusteroperator/kube-apiserver is not available (<missing>) because <missing>
        clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
        clusteroperator/kube-scheduler is not available (<missing>) because <missing>
        clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
        clusteroperator/machine-api is not available (<missing>) because <missing>
        clusteroperator/machine-approver is not available (<missing>) because <missing>
        clusteroperator/machine-config is not available (<missing>) because <missing>
        clusteroperator/marketplace is not available (<missing>) because <missing>
        clusteroperator/monitoring is not available (<missing>) because <missing>
        clusteroperator/network is not available (<missing>) because <missing>
        clusteroperator/node-tuning is not available (<missing>) because <missing>
        clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
        clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
        clusteroperator/openshift-samples is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
        clusteroperator/service-ca is not available (<missing>) because <missing>
        clusteroperator/storage is not available (<missing>) because <missing>


[must-gather      ] OUT namespace/openshift-must-gather-wkrbg created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-r6x9g created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created

[must-gather-v2r9x] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-r6x9g deleted
[must-gather      ] OUT namespace/openshift-must-gather-wkrbg deleted


Error running must-gather collection:
    gather did not start for pod must-gather-v2r9x: timed out waiting for the condition

Falling back to `oc adm inspect clusteroperators.v1.config.openshift.io` to collect basic cluster information.
Gathering data for ns/openshift-cloud-controller-manager...
Gathering data for ns/openshift-cloud-controller-manager-operator...
Gathering data for ns/openshift-cloud-credential-operator...
Gathering data for ns/openshift-machine-api...
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-etcd-operator...
Gathering data for ns/openshift-etcd...
Gathering data for ns/openshift-ingress-operator...
Gathering data for ns/openshift-kube-apiserver-operator...
Gathering data for ns/openshift-kube-apiserver...
Gathering data for ns/openshift-kube-controller-manager...
Gathering data for ns/openshift-kube-controller-manager-operator...
Gathering data for ns/kube-system...
Gathering data for ns/openshift-kube-scheduler-operator...
Gathering data for ns/openshift-kube-scheduler...
Gathering data for ns/openshift-kube-storage-version-migrator-operator...
Gathering data for ns/openshift-cluster-machine-approver...
Gathering data for ns/openshift-machine-config-operator...
Gathering data for ns/openshift-monitoring...
Gathering data for ns/openshift-user-workload-monitoring...
Gathering data for ns/openshift-cluster-samples-operator...
Wrote inspect data to must-gather.local.8858696351402996231/inspect.local.6850847521026171443.
error running backup collection: errors occurred while gathering data:
    [skipping gathering namespaces/openshift-ingress due to error: namespaces "openshift-ingress" not found, skipping gathering namespaces/openshift-ingress-canary due to error: namespaces "openshift-ingress-canary" not found, skipping gathering securitycontextconstraints.security.openshift.io due to error: the server doesn't have a resource type "securitycontextconstraints", skipping gathering podnetworkconnectivitychecks.controlplane.operator.openshift.io due to error: the server doesn't have a resource type "podnetworkconnectivitychecks", skipping gathering namespaces/openshift-kube-storage-version-migrator due to error: namespaces "openshift-kube-storage-version-migrator" not found, skipping gathering controllerconfigs.machineconfiguration.openshift.io due to error: the server doesn't have a resource type "controllerconfigs", skipping gathering configs.samples.operator.openshift.io/cluster due to error: configs.samples.operator.openshift.io "cluster" not found, skipping gathering templates.template.openshift.io due to error: the server doesn't have a resource type "templates", skipping gathering imagestreams.image.openshift.io due to error: the server doesn't have a resource type "imagestreams"]


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 35a1d176-cff0-4357-8668-0556b0c98dc2
ClusterVersion: Installing "4.11.0-0.nightly-s390x-2022-04-10-104512" for 16 minutes: Unable to apply 4.11.0-0.nightly-s390x-2022-04-10-104512: an unknown error has occurred: MultipleErrors
ClusterOperators:
        clusteroperator/authentication is not available (<missing>) because <missing>
        clusteroperator/baremetal is not available (<missing>) because <missing>
        clusteroperator/cloud-controller-manager is not available (<missing>) because <missing>
        clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
        clusteroperator/config-operator is not available (<missing>) because <missing>
        clusteroperator/console is not available (<missing>) because <missing>
        clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
        clusteroperator/dns is not available (<missing>) because <missing>
        clusteroperator/etcd is not available (<missing>) because <missing>
        clusteroperator/image-registry is not available (<missing>) because <missing>
        clusteroperator/ingress is not available (<missing>) because <missing>
        clusteroperator/insights is not available (<missing>) because <missing>
        clusteroperator/kube-apiserver is not available (<missing>) because <missing>
        clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
        clusteroperator/kube-scheduler is not available (<missing>) because <missing>
        clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
        clusteroperator/machine-api is not available (<missing>) because <missing>
        clusteroperator/machine-approver is not available (<missing>) because <missing>
        clusteroperator/machine-config is not available (<missing>) because <missing>
        clusteroperator/marketplace is not available (<missing>) because <missing>
        clusteroperator/monitoring is not available (<missing>) because <missing>
        clusteroperator/network is not available (<missing>) because <missing>
        clusteroperator/node-tuning is not available (<missing>) because <missing>
        clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
        clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
        clusteroperator/openshift-samples is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
        clusteroperator/service-ca is not available (<missing>) because <missing>
        clusteroperator/storage is not available (<missing>) because <missing>


error: gather did not start for pod must-gather-v2r9x: timed out waiting for the condition
[root@ospamgr4 ~]#
[root@ospamgr4 ~]#



2. Will attach this "must-gather" tar.gz file to this bugzilla.

Thank you.

Comment 2 krmoser 2022-04-11 08:57:00 UTC
Created attachment 1871728 [details]
must-gather for failing install of 4.11.0-0.nightly-s390x-2022-04-10-104512

must-gather for failing install of 4.11.0-0.nightly-s390x-2022-04-10-104512

Comment 4 Jeremy Poulin 2022-04-11 21:45:10 UTC
Hey Kyle!

Thanks for all the input. All this info is good, but in order to see what's going on, we could really use complete journal logs for the nodes so we can track them from boot to join. If you're saying that the master nodes aren't reported in 'oc get nodes', I think we definitely need to start with their boot logs to see where they got stuck.

Comment 5 krmoser 2022-04-12 07:04:38 UTC
Jeremy (and Jan, Prashanth, and Red Hat colleagues), 

Thanks, sure thing.

1. Please find attached the complete journal logs (journalctl) for the bootstrap node and 3 master nodes, running with network type OVNKubernetes and:
  (1) OCP 4.11 build 4.11.0-0.nightly-s390x-2022-04-11-224006 
  (2) RHCOS build 411.86.202204111446-0

2. The same install failure issue occurs with this OCP 4.11 build and corresponding RHCOS build, as described in the initial description section of this bugzilla defect, and as seen for all of the April 8th through April 11th OCP 4.11 builds with corresponding RHCOS 411.86 builds in zVM and KVM hypervisor environments (except for a zVM 7.2 environment hosted on a z16).


Thank you,
Kyle

Comment 6 krmoser 2022-04-12 07:05:50 UTC
Created attachment 1871895 [details]
master-0 node journalctl log

Comment 7 krmoser 2022-04-12 07:08:45 UTC
Created attachment 1871896 [details]
master-1 node journalctl log

Comment 8 krmoser 2022-04-12 07:09:29 UTC
Created attachment 1871897 [details]
master-2 node journalctl log

Comment 9 krmoser 2022-04-12 07:17:59 UTC
Created attachment 1871900 [details]
bootstrap node journalctl log

Comment 10 krmoser 2022-04-12 07:22:34 UTC
Folks,

Please see the following continual error messages in each of the master-XX nodes' journalctl log attachments:


Error: setxattr /etc/systemd/system/basic.target.wants/coreos-ignition-firstboot-complete.service: read-only file system
...
Error: setxattr /etc/systemd/system/dbus-org.freedesktop.timedate1.service: read-only file system
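
For reference, these recurring failures can be tallied from a saved node journal with a filter along these lines (the file name and sample lines are hypothetical stand-ins for the attached journalctl output):

```shell
# Sketch: count the recurring setxattr/read-only-file-system failures in
# a saved node journal. sample-journal.log stands in for the attached
# master-node journalctl logs.
cat > sample-journal.log <<'EOF'
Error: setxattr /etc/systemd/system/basic.target.wants/coreos-ignition-firstboot-complete.service: read-only file system
Error: setxattr /etc/systemd/system/dbus-org.freedesktop.timedate1.service: read-only file system
EOF
grep -c ': read-only file system' sample-journal.log
```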


Thank you,
Kyle

Comment 11 jschinta 2022-04-12 10:53:25 UTC
Created attachment 1871935 [details]
podman journal log

Comment 12 jschinta 2022-04-12 10:54:55 UTC
(In reply to Jeremy Poulin from comment #3)
Hi Jeremy,

I don't think this corresponds, as https://bugzilla.redhat.com/show_bug.cgi?id=2047242 is a kernel bug and we are using the 8.5 kernel, which is known to be good. My thought is that this is more likely a problem with podman.
When I tried this in a cluster myself, I saw podman hit a panic in the logs. I also don't know why podman repeatedly prints blank lines in the log.

Comment 13 Jeremy Poulin 2022-04-12 13:31:58 UTC
I'll review the logs, but Micah suggested that https://bugzilla.redhat.com/show_bug.cgi?id=2074090 could be related, and you noted that this looks like a podman issue.

Comment 14 jschinta 2022-04-12 14:57:57 UTC
Looking at the dependency of Micah's BZ, this does look like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2072072.
The only thing that is new is the podman panic, which at least I don't see anywhere in the logs from that BZ.

Comment 15 Dan Li 2022-04-13 11:55:06 UTC
Moving to Assigned based on Comment 13. Hi Jeremy, once you review the logs would you set the "Blocker?" flag to see if this is (or is not) a blocker?

Comment 18 Jeremy Poulin 2022-04-13 18:13:08 UTC
Based on an examination of the logs, it is clear that the behavior is at least in part caused by https://bugzilla.redhat.com/show_bug.cgi?id=2072072. There is clear evidence that the mislabeled SELinux issue is causing the same read-only filesystem signature.

However, the podman segfault is another concern. I'm going to go ahead and mark this a blocker+ since the impact is critical to Z. With the podman fix being pulled in via Monday's builds, we should be able to see whether this issue can be recreated with the new podman.

Comment 20 Jeremy Poulin 2022-04-13 18:22:47 UTC
Update:
Reading through the BZ linked above, it is noted that the underlying issue is that you shouldn't relabel the FS from within a container. This resulted in a follow-up patch in MCO:
See https://bugzilla.redhat.com/show_bug.cgi?id=2074613

New builds with this patch *may* resolve this issue. The SIGSEGV remains unaccounted for.

Comment 21 Jeremy Poulin 2022-04-13 18:41:33 UTC
Hi Dan - would you mind taking a look at the podman stacktrace attached to this bug?

It appears to me as though we are in an invalid state after re-labeling the system directory (podman issue: https://bugzilla.redhat.com/show_bug.cgi?id=2074090#c4 -> mco bug: https://bugzilla.redhat.com/show_bug.cgi?id=2074613)

My concern here is that podman is able to get into a state where it segfaults, and weirdly this only seems to happen on Z (that I've seen so far).

Comment 23 krmoser 2022-04-15 17:32:06 UTC
Jeremy,

Here's an update on our current OCP 4.11 on Z install and upgrade testing using the April 15th OCP 4.11 on Z builds and corresponding April 15th RHCOS 411.86 build.

1. Our Solution Test team has successfully conducted a large number of OCP 4.11 on Z KVM and zVM install and upgrade tests with the latest available RHCOS 411.86 build, 411.86.202204150312-0, and the 7 currently available, corresponding OCP 4.11 on Z builds from April 15, 2022.

2. As a comparison, for all 10 OCP 4.11 on Z builds from yesterday, April 14, 2022, and the corresponding RHCOS 411.86 build, 411.86.202204132133-0, all install and upgrade tests failed for KVM and zVM environments, except for the z16 based zVM environment.

3. Today's install and upgrade tests using the April 15th OCP 4.11 on Z builds and RHCOS 411.86 build 411.86.202204150312-0 are successful on a wide range of environments including z14, z15, and z16 servers.

4. Our Solution Test team will follow up with another update by tomorrow, as we complete additional KVM and zVM install and upgrade testing.

Thank you,
Kyle

Comment 24 krmoser 2022-04-18 05:32:05 UTC
Jeremy,

Here's an update on our Solution Test team's current OCP 4.11 on Z install and upgrade testing using the 12 OCP 4.11 on Z builds released from April 15th through 17th, and the corresponding April 15th RHCOS 411.86 build 411.86.202204150312-0 required for each of these 12 OCP 4.11 on Z builds.

1. For both zVM z15 and z16 environments, all 12 OCP 4.11 on Z builds from April 15th through 17th were successfully installed.
Specifically, these 12 OCP 4.11 on Z builds:

  1. 4.11.0-0.nightly-s390x-2022-04-15-054202
  2. 4.11.0-0.nightly-s390x-2022-04-15-062432
  3. 4.11.0-0.nightly-s390x-2022-04-15-071851
  4. 4.11.0-0.nightly-s390x-2022-04-15-075527
  5. 4.11.0-0.nightly-s390x-2022-04-15-102054
  6. 4.11.0-0.nightly-s390x-2022-04-15-105554
  7. 4.11.0-0.nightly-s390x-2022-04-15-120558
  8. 4.11.0-0.nightly-s390x-2022-04-15-183515
  9. 4.11.0-0.nightly-s390x-2022-04-15-204301

 10. 4.11.0-0.nightly-s390x-2022-04-16-092549
 11. 4.11.0-0.nightly-s390x-2022-04-16-120844

 12. 4.11.0-0.nightly-s390x-2022-04-17-124011


2. For both KVM z15 and z16 environments, all 12 OCP 4.11 on Z builds from April 15th through 17th were successfully installed.


3. For zVM z15 environments, all 12 OCP 4.11 on Z builds from April 15th through 17th were successfully upgraded to, from the OCP 4.10.10 build.

4. For zVM z16 environments, a subset of these 12 OCP 4.11 on Z builds from April 15th through 17th were successfully upgraded to, from the OCP 4.10.10 build.


5. For KVM z15 environments, a subset of these 12 OCP 4.11 on Z builds from April 15th through 17th were successfully upgraded to, from the OCP 4.10.10 build.

6. For KVM z16 environments, a subset of these 12 OCP 4.11 on Z builds from April 15th through 17th were successfully upgraded to, from the OCP 4.10.10 build.


Thank you,
Kyle

Comment 25 Dan Li 2022-04-18 14:42:59 UTC
Hi Jeremy, per Kyle's Comment 24, can this bug be closed out? Or, if you believe additional investigation is needed, can we set the "reviewed-in-sprint" flag so this bug continues into the next sprint?

Comment 26 Jeremy Poulin 2022-04-18 16:58:51 UTC
Closing this bug out as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2074613, since that issue corresponds to the fix.

*** This bug has been marked as a duplicate of bug 2074613 ***