Description of problem (please be as detailed as possible and provide log snippets):
While running an OCS-CI test under "pv_services" with CephFileSystem and CephBlockPool, the test failed with a MountFailed error, and the pod description shows the mount failing.

Version of all relevant components (if applicable):
OCS version: 4.4
OCP version: 4.4
Ceph version: ceph version 14.2.8-79.el8cparch (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2 - simple scenario I tried.

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Set up OCP and OCS on a KVM guest on s390x hardware
2. Run a test related to CephFileSystem and CephBlockPool, e.g. tests/manage/pv_services/test_dynamic_pvc_accessmodes_with_reclaim_policies.py::TestDynamicPvc::test_rwx_dynamic_pvc[CephFileSystem-Delete]

Actual results:
The pod is not created during test execution; its STATUS stays in "ContainerCreating".

Expected results:
The pod should be created and the test case should execute.
Additional info:

(venv) root@s83lp83:~/ocs-ci# oc describe pod pod-test-cephfs-a33f2ff647014bea8b5d4cea27ab70f0
Name:         pod-test-cephfs-a33f2ff647014bea8b5d4cea27ab70f0
Namespace:    namespace-test-1d38209307cc477da74029a0e21ca128
Priority:     0
Node:         test1-f9xps-worker-0-k2hvx/192.168.126.51
Start Time:   Tue, 28 Jul 2020 15:51:09 +0200
Labels:       <none>
Annotations:  openshift.io/scc: anyuid
Status:       Pending
IP:
IPs:          <none>
Containers:
  web-server:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fg8tm (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-test-df590d0b329047a996d615d8b2932e3c
    ReadOnly:   false
  default-token-fg8tm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fg8tm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   SuccessfulAttachVolume  32m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f"
  Warning  FailedMount             32m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (2974) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-257629566,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  [seven further FailedMount events at 32m-31m are identical apart from the mount process id (3076, 3178, 3280, 3382, 3484, 3586, 3688) and the generated keyfile name]
  Warning  FailedMount             12m (x8 over 30m)    kubelet, test1-f9xps-worker-0-k2hvx  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc default-token-fg8tm]: timed out waiting for the condition
  Warning  FailedMount             7m9s (x14 over 29m)  kubelet, test1-f9xps-worker-0-k2hvx  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (4912) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-208307645,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             66s                  kubelet, test1-f9xps-worker-0-k2hvx  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[default-token-fg8tm mypvc]: timed out waiting for the condition
(venv) root@s83lp83:~/ocs-ci#
(venv) root@s83lp83:~/ocs-ci# oc logs
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79d8db4bqd2l6 debug 2020-07-27 17:25:11.871 3ffaf9ec430 0 set uid:gid to 167:167 (ceph:ceph) debug 2020-07-27 17:25:11.871 3ffaf9ec430 0 ceph version 14.2.8-79.el8cparch (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable), process ceph-mds, pid 1 debug 2020-07-27 17:25:11.871 3ffaf9ec430 0 pidfile_write: ignore empty --pid-file starting mds.ocs-storagecluster-cephfilesystem-a at debug 2020-07-27 17:25:12.001 3ffa0bff910 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 3 from mon.1 debug 2020-07-27 17:25:16.671 3ffa0bff910 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 4 from mon.1 debug 2020-07-27 17:25:16.671 3ffa0bff910 1 mds.ocs-storagecluster-cephfilesystem-a Monitors have assigned me to become a standby. debug 2020-07-27 17:25:16.681 3ffa0bff910 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 5 from mon.1 debug 2020-07-27 17:25:16.681 3ffa0bff910 1 mds.0.5 handle_mds_map i am now mds.0.5 debug 2020-07-27 17:25:16.681 3ffa0bff910 1 mds.0.5 handle_mds_map state change up:boot --> up:creating debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x1 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x100 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x600 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x601 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x602 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x603 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x604 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x605 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x606 debug 2020-07-27 17:25:16.681 
3ffa0bff910 0 mds.0.cache creating system inode with ino:0x607 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x608 debug 2020-07-27 17:25:16.681 3ffa0bff910 0 mds.0.cache creating system inode with ino:0x609 debug 2020-07-27 17:25:16.701 3ff817fa910 1 mds.0.5 creating_done debug 2020-07-27 17:25:17.701 3ffa0bff910 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 6 from mon.1 debug 2020-07-27 17:25:17.701 3ffa0bff910 1 mds.0.5 handle_mds_map i am now mds.0.5 debug 2020-07-27 17:25:17.701 3ffa0bff910 1 mds.0.5 handle_mds_map state change up:creating --> up:active debug 2020-07-27 17:25:17.701 3ffa0bff910 1 mds.0.5 recovery_done -- successful recovery! debug 2020-07-27 17:25:17.701 3ffa0bff910 1 mds.0.5 active_start debug 2020-07-27 17:25:17.711 3ffa0bff910 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 7 from mon.1 (venv) root@s83lp83:~/ocs-ci# (venv) root@s83lp83:~/ocs-ci# oc logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-578cfc89279wf debug 2020-07-27 17:25:13.414 3ff88eec430 0 set uid:gid to 167:167 (ceph:ceph) debug 2020-07-27 17:25:13.414 3ff88eec430 0 ceph version 14.2.8-79.el8cparch (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable), process ceph-mds, pid 1 debug 2020-07-27 17:25:13.414 3ff88eec430 0 pidfile_write: ignore empty --pid-file starting mds.ocs-storagecluster-cephfilesystem-b at debug 2020-07-27 17:25:13.544 3ff6dffb910 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 3 from mon.2 debug 2020-07-27 17:25:16.664 3ff6dffb910 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 4 from mon.2 debug 2020-07-27 17:25:16.684 3ff6dffb910 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 5 from mon.2 debug 2020-07-27 17:25:17.694 3ff6dffb910 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 6 from mon.2 debug 2020-07-27 17:25:17.694 3ff6dffb910 1 
mds.ocs-storagecluster-cephfilesystem-b Monitors have assigned me to become a standby. debug 2020-07-27 17:25:17.714 3ff6dffb910 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 7 from mon.2 debug 2020-07-27 17:25:17.714 3ff6dffb910 1 mds.0.0 handle_mds_map i am now mds.25028.0 replaying mds.0.0 debug 2020-07-27 17:25:17.714 3ff6dffb910 1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay debug 2020-07-27 17:25:17.714 3ff6dffb910 1 mds.0.0 replay_start debug 2020-07-27 17:25:17.714 3ff6dffb910 1 mds.0.0 recovery set is debug 2020-07-27 17:25:17.714 3ff527fc910 0 mds.0.cache creating system inode with ino:0x100 debug 2020-07-27 17:25:17.714 3ff527fc910 0 mds.0.cache creating system inode with ino:0x1 (venv) root@s83lp83:~/ocs-ci# [140776.205942] libceph: mon1 (1)172.30.251.141:6789 session established [140776.206199] libceph: client119443 fsid d3dffa2c-0ba8-4f3c-bf8d-74c273b1a87d [140776.209513] ceph: problem parsing mds trace -5 [140776.209619] ceph: mds parse_reply err -5 [140776.209672] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1) [140776.209744] header: 00000000: 02 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [140776.209745] header: 00000010: 1a 00 7f 00 01 00 b4 01 00 00 00 00 00 00 00 00 ................ [140776.209746] header: 00000020: 00 00 00 00 02 00 00 00 00 00 00 00 00 01 00 00 ................ [140776.209747] header: 00000030: 00 d3 94 4d 24 ...M$ [140776.209748] front: 00000000: 01 01 00 00 00 00 00 00 07 00 00 00 01 00 01 39 ...............9 [140776.209748] front: 00000010: 01 00 00 03 01 33 01 00 00 10 00 00 00 00 01 00 .....3.......... [140776.209749] front: 00000020: 00 fe ff ff ff ff ff ff ff 00 00 00 00 1c 00 00 ................ [140776.209750] front: 00000030: 00 00 00 00 00 01 00 00 00 00 00 00 00 55 00 00 .............U.. [140776.209751] front: 00000040: 00 00 00 00 00 b7 0e 00 00 00 00 00 00 01 00 00 ................ 
[140776.209752] front: 00000050: 00 00 00 00 00 10 00 00 00 00 01 00 00 01 00 00 ................ [140776.209752] front: 00000060: 40 00 01 00 00 00 00 00 40 00 00 00 00 00 00 00 @.......@....... [140776.209753] front: 00000070: 00 00 00 00 00 00 05 00 00 00 11 2d 20 5f 6e 70 ...........- _np [140776.209754] front: 00000080: b6 1b 11 2d 20 5f bc 86 64 1b 11 2d 20 5f bc 86 ...- _..d..- _.. [140776.209755] front: 00000090: 64 1b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d............... [140776.209756] front: 000000a0: 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 01 00 ................ [140776.209757] front: 000000b0: 00 00 ff 41 00 00 00 00 00 00 00 00 00 00 01 00 ...A............ [140776.209758] front: 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [140776.209758] front: 000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [140776.209759] front: 000000e0: 00 00 01 00 00 00 00 00 00 00 11 2d 20 5f 6e 70 ...........- _np [140776.209760] front: 000000f0: b6 1b 00 00 00 00 00 00 00 00 02 00 00 00 00 00 ................ [140776.209761] front: 00000100: 00 00 04 00 00 00 00 00 00 00 ff ff ff ff ff ff ................ [140776.209762] front: 00000110: ff ff 00 00 00 00 01 01 10 00 00 00 00 00 00 80 ................ [140776.209762] front: 00000120: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [140776.209763] front: 00000130: 11 2d 20 5f bc 86 64 1b 03 00 00 00 00 00 00 00 .- _..d......... [140776.209764] front: 00000140: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................ [140776.209765] front: 00000150: 60 00 00 00 10 00 00 00 00 01 00 00 01 00 00 00 `............... [140776.209766] front: 00000160: 00 00 00 00 01 00 00 00 00 00 00 00 02 00 00 00 ................ [140776.209766] front: 00000170: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ [140776.209767] front: 00000180: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 
[140776.209768] front: 00000190: 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ................ [140776.209769] front: 000001a0: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ [140776.209769] front: 000001b0: 00 00 00 00 .... [140776.209770] footer: 00000000: bd 9e 7b c2 00 00 00 00 00 00 00 00 74 9a d9 fa ..{.........t... [140776.209771] footer: 00000010: 0b 6b 0d 0a 05 .k... [140900.319870] libceph: mon1 (1)172.30.251.141:6789 session established [140900.320153] libceph: client119569 fsid d3dffa2c-0ba8-4f3c-bf8d-74c273b1a87d [140900.324348] ceph: problem parsing mds trace -5 [140900.324453] ceph: mds parse_reply err -5 [140900.324504] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)
Log files for OCS and OCP can be found at this location: https://drive.google.com/drive/folders/1CK2PcG63pW9Z1XB74aXD8v50WmVlOzZe?usp=sharing
What kernel was the client running? It probably needs this patch (which is already merged for RHEL8.3): https://marc.info/?l=ceph-devel&m=158807659304587&w=2 You may want to update the kernel on the client to the latest RHEL8.3 beta kernel and see whether this is still reproducible.
The client is running the following kernel version:
Linux test1-f9xps-worker-0-rwvmr 4.18.0-193.13.2.el8_2.s390x #1 SMP Mon Jul 13 23:23:50 UTC 2020 s390x s390x s390x GNU/Linux
Thanks. Yeah, that kernel doesn't have the endianness fix, as that went into -207.el8. See: https://bugzilla.redhat.com/show_bug.cgi?id=1827767 If you have the ability to run this on a RHEL8.3 kernel, then it should (hopefully) work.
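For context, the bug class here is byte-order handling of on-the-wire fields: a field the Ceph wire protocol defines as little-endian must be decoded with an explicit byte order, not the host's. A minimal sketch of the pattern in Python (my own illustration; the actual fix is in the kernel cephfs client, which is C):

```python
import struct

# A 32-bit field as a peer sends it: Ceph wire fields are little-endian.
wire = struct.pack("<I", 0x1234)

# Correct: decode with an explicit little-endian format.
# The result is the same on x86_64 and on big-endian s390x.
good = struct.unpack("<I", wire)[0]

# The buggy pattern: interpreting the bytes in the host's byte order.
# On a little-endian host that happens to match; on s390x the host order
# is big-endian, which is what this line simulates.
bad = struct.unpack(">I", wire)[0]

print(hex(good))  # 0x1234
print(hex(bad))   # 0x34120000 -- the garbage a big-endian reader sees
```

That garbage value is why the client reports "problem parsing mds trace" and the mount fails with EIO.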
I don't have access to view that bug ("You are not authorized to access bug #1827767"): https://bugzilla.redhat.com/show_bug.cgi?id=1827767
Chidanand, I cc'ed you on the RHEL8.3 update bug, but it's not very interesting as it's just a big rollup of upstream patches. What I'd probably suggest is getting the latest RHEL-8.3.0 candidate kernel you can find and seeing whether this is still reproducible with it. See: https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=1231
Assigning this to Jeff so he can make sure it gets tested with the latest RHEL-8.3.0 candidate kernel
Chidanand, do you have the ability to run this test and override the kernel that it uses?
Jeff, I've never tried this and don't know much about how to override the kernel. If you can share some instructions, I can give it a try.
I think that depends on the testing infrastructure you were using. Maybe Raz (our QA contact) can help?
Hi Jeff, I'm not sure what is the ask here. Are you suggesting to run the tests over RHEL 8.3? or any other modifications to the kernel itself? We are running our tests over RHEL which is supported in OCP.
(In reply to Raz Tamir from comment #12)
> Hi Jeff,
>
> I'm not sure what is the ask here.
> Are you suggesting to run the tests over RHEL 8.3? or any other
> modifications to the kernel itself?
> We are running our tests over RHEL which is supported in OCP.

The initial description was "Setup OCP and OCS on KVM guest on s390x hardware" - I don't think you are running on that one. Setting NEEDINFO on the initial reporter to ask whether the kernel can be updated on that VM.
The setup is on an s390x LPAR, and I did the OCP 4.4 installation using libvirt, following the steps at https://github.com/openshift/installer/tree/release-4.4/docs/dev/libvirt. The OCP cluster is set up with 3 master and 3 worker nodes; all nodes are running Red Hat CoreOS (Linux test1-f9xps-worker-0-rwvmr 4.18.0-193.13.2.el8_2.s390x #1 SMP Mon Jul 13 23:23:50 UTC 2020 s390x s390x s390x GNU/Linux). Regarding the kernel update, I can try to update the kernel on those KVMs. It would be very helpful if you could share a link to the latest kernel that supports s390x hardware.
A quick update on the nature of the kernel bug.

A while ago, the kernel cephfs component was extended to support a new element of the on-the-wire protocol between cephfs and the MDS daemon that allows negotiation of the supported feature set. The initial version of that patch was broken on big-endian machines, causing *any* cephfs mount to fail. This bug was fixed by the patch in comment #3.

So we have three possible states:
1) old cephfs without feature selection - works on s390x
2) cephfs with (buggy) feature selection - broken on s390x
3) cephfs with fixed feature selection - works on s390x

When we initially noticed and fixed that problem, it looked like the RHEL kernels didn't have that issue, because at the time they were still using an old kernel (1) without the feature selection code. That's why we just fixed it upstream and didn't ask for a backport. However, it looks like we didn't notice that in the meantime, the feature selection code was actually backported to the RHEL kernel -- unfortunately (initially) without the fix, so now we have situation (2) in the current RHEL kernel, where cephfs mounts are completely broken. As mentioned in comment #5, the fix has now been backported as well, and will be in the RHEL 8.3 kernel.

This means as far as RHEL kernels are concerned, we have the following status:
- before 4.18.0-154: state (1) - works on s390x
- since 4.18.0-154 but before 4.18.0-207: state (2) - broken on s390x
- since 4.18.0-207: state (3) - works on s390x

For RHEL 8.x releases this implies:
- RHEL 8.1 - GA kernel 4.18.0-147 - works on s390x
- RHEL 8.2 - GA kernel 4.18.0-193 - broken on s390x
- RHEL 8.3 - GA kernel >4.18.0-207 - will work on s390x

So in a sense, RHEL 8.2 introduced a cephfs regression that will be fixed again in RHEL 8.3. Given that this is a regression, I guess one question would be whether this ought to be fixed in the RHEL 8.2 maintenance stream as well.
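The build-number boundaries above can be checked mechanically against a `uname -r` string. A small illustrative helper (the function name and the release-string parsing are my own; the cut-off numbers are the ones stated above, and note that a z-stream backport can later fix a given build line, so this reflects only the main-line boundaries):

```python
def cephfs_s390x_state(release: str) -> str:
    """Classify a RHEL 4.18.0-based kernel release against the cephfs
    feature-negotiation states described above (illustrative only)."""
    # e.g. "4.18.0-193.13.2.el8_2.s390x" -> build number 193
    build = int(release.split("-", 1)[1].split(".", 1)[0])
    if build < 154:
        return "state 1: no feature negotiation - works on s390x"
    if build < 207:
        return "state 2: buggy feature negotiation - broken on s390x"
    return "state 3: fixed feature negotiation - works on s390x"

print(cephfs_s390x_state("4.18.0-193.13.2.el8_2.s390x"))
# state 2: buggy feature negotiation - broken on s390x
```

This matches the failing client in this report, which is running a -193 kernel.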
However, the RHEL product isn't really what is relevant for our use case here, because we're using OCP on RH CoreOS, we are not actually using RHEL. (OCP is only supported on top of CoreOS on Z.) So the interesting question is how to get the fix into *CoreOS* (and therefore OCP). It looks like both OCP 4.4 and OCP 4.5 currently use the same -193 kernel that is in RHEL 8.2, and are therefore broken on s390x.

I'm not so familiar with how the CoreOS upgrade process works, so here's a few questions:
- What kernel will OCP 4.6 use? How can we ensure this bug will be fixed there?
- Should this regression get fixed as a maintenance update for OCP 4.5 (and possibly 4.4)?
- If there will be no official fix on 4.5, is there some way we can work around the bug? I believe you cannot simply install another kernel in CoreOS ...

As an aside, note that it now appears that installing OCP on KVM vs. z/VM actually doesn't make any difference here; it's just that our z/VM install was still based on OCP 4.3 (using the RHEL 8.1 kernel), and therefore did not yet show this bug.
Thanks for the succinct explanation Ulrich.

> So in a sense, RHEL 8.2 introduced a cephfs regression that will be fixed
> again in RHEL 8.3. Given that this is a regression, I guess one question
> would be whether this ought to be fixed in the RHEL 8.2 maintenance stream
> as well.

I'd be fine with that. The patch is pretty safe, so backporting should be no big deal. I'm not well versed enough in CoreOS to know what we'd need to do for that though.
Oof, spoke a bit too soon about the safety of that patch. I just opened BZ#1866018 today, which is a regression that was caused by that endianness fix. If you pull that patch into CoreOS, you'll also want this one (not yet merged upstream). https://marc.info/?l=ceph-devel&m=159655872206314&w=2
In an offline conversation with Elad and Chidanand it became clear that OCP 4.5 CoreOS is running kernel version 4.18.0-193, which is what causes the problem on s390x hardware; the cephfs issue is fixed in kernel version -207 and above. Because of this, OCS tier1 test cases are failing. This can be hit with any OCS version under these conditions, so it can be a test blocker, but I don't think it should be a blocker for OCS 4.5. I would like to remove the blocker flag for 4.5 and/or move it to the next release, unless someone thinks otherwise.
Jeff, I don't seem to be able to access BZ#1866018, could you add me on CC?

Following up on comment #14, I looked more closely at the CoreOS situation:
- The CoreOS/OCP 4.4 GA release actually still uses the RHEL 8.1 kernel, so this works on s390x
- However, at some point (around 2020-07-14) the nightly pre-release stream of OCP 4.4 switched over to the RHEL 8.2 kernel, from which point on it fails
- CoreOS/OCP 4.5 has always used the RHEL 8.2 kernel, and therefore always fails
- The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel, so should work again (however, note the new regression in BZ#1866018)

So I believe the next steps should be:
- Get the regression fix in BZ#1866018 accepted upstream and included into RHEL 8.3 (and then CoreOS 4.6)
- Port both fixes into the RHEL 8.2.z maintenance stream
- Update the CoreOS 4.5 kernel with the latest RHEL 8.2.z kernel
Does that look reasonable?

As to release blocker status, I agree that this bug is not tied to a particular *OCS version* as such; any OCS version will fail if the kernel has this bug. However, I'd consider presence of this bug a release blocker for *OCS on Z* in general. Depending on which version of OCS we're targeting for the initial Z release, this would then become a blocker for that version. I believe at this point we still have not made the final decision; it could be either some post-GA OCS 4.5.z release or else OCS 4.6.
Yes, that's more or less what I'm planning to do. We should probably clone this bug for 8.2.z and we'll just make sure we pull in both patches for that.
RHEL-8.2.z bug : https://bugzilla.redhat.com/show_bug.cgi?id=1866386
> - The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel,
> so should work again (however, note the new regression in BZ#1866018)

We are currently using a bespoke 8.3 kernel to work around a selinux patch that is being backported to RHEL 8.2. OCP 4.6 will not move to RHEL 8.3 during its lifecycle, so please be sure to continue backporting to 8.2 for OCP 4.6 fixes.

> So I believe the next steps should be:
> - Get the regression fix in BZ#1866018 accepted upstream and included into
> RHEL 8.3 (and then CoreOS 4.6)
> - Port both fixes into the RHEL 8.2.z maintenance stream
> - Update the CoreOS 4.5 kernel with the latest RHEL 8.2.z kernel
> Does that look reasonable?

That sounds like a reasonable approach to make sure RHCOS consumes this fix correctly, but there could be a brief window between RHCOS switching back to the 8.2 kernel and patches landing in the proper z streams.
(In reply to Jeff Layton from comment #16)
> I'd be fine with that. The patch is pretty safe, so backporting should be no
> big deal. I'm not well versed enough in CoreOS to know what we'd need to do
> for that though.

How to install a new kernel in CoreOS for testing/development
-------------------------------------------------------------
Download all the relevant kernel RPM packages (kernel, kernel-core, kernel-modules, etc.), then replace the running kernel:

# rpm-ostree override replace /path/to/kernel-XYZ*.rpm \
    /path/to/kernel-core*.rpm \
    /path/to/kernel-modules*.rpm

Reboot the node and ensure the new kernel is running. (Note: you can still select the old kernel from the grub menu during bootup.)

It is important to note that this is only for development/testing purposes. Once you get the actual update in the official kernel, you first need to revert to the original kernel (1) and then follow this to upgrade the OS:
https://github.com/openshift/os/blob/master/FAQ.md#q-how-do-i-upgrade-the-os

Note (1):

// To undo some of the overrides you have done in the past
# rpm-ostree override reset

// To discard all local modifications and go back to the original tree
# rpm-ostree reset

This should help us to proceed with testing.

Ref: https://github.com/openshift/os/blob/master/FAQ.md#q-what-happens-when-i-use-rpm-ostree-override-replace-to-replace-an-rpm
(In reply to Jeff Ligon from comment #25)
> > - The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel,
> > so should work again (however, note the new regression in BZ#1866018)
>
> We are currently using a bespoke 8.3 kernel to work around a selinux patch
> that is being backported to RHEL 8.2. OCP 4.6 will not move to RHEL 8.3
> during its lifecycle, so please be sure to continue backporting to 8.2 for
> OCP 4.6 fixes.

Jeff Layton opened a bug to track backporting to RHEL 8.2 (see above). Does this mean that the change will then flow to OCP 4.6 automatically, or do we need to open *another* bug against OCP/CoreOS to track that?
As noted, this issue is only a tracker for BZ#1866018.
There is a backport BZ for 8.2 (https://bugzilla.redhat.com/show_bug.cgi?id=1875787), so the fix is already backported to 8.2 and can be tested there.
Based on the automation run results of BUILD ID v4.6.0-131.ci, RUN ID 1603961686 (tier1 over IBM ROKS), in which this test case passed, I am moving this to VERIFIED:
tests/manage/pv_services/test_dynamic_pvc_accessmodes_with_reclaim_policies.py::TestDynamicPvc::test_rwx_dynamic_pvc[CephFileSystem-Delete]
With OCP 4.6.3 (kernel version 4.18.0-193.28.1.el8_2.s390x) the issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days