Bug 1861780 - [Tracker BZ1866386][IBM s390x] Mount Failed for CEPH while running couple of OCS test cases.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.4
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Scott Ostapovicz
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-29 14:23 UTC by Chidanand Harlapur
Modified: 2023-09-14 06:04 UTC
CC List: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:23:13 UTC
Embargoed:




Links:
  Red Hat Product Errata RHSA-2020:5605 -- last updated 2020-12-17 06:24:28 UTC

Description Chidanand Harlapur 2020-07-29 14:23:41 UTC
Description of problem (please be detailed as possible and provide log
snippets):
While running an OCS-CI test under "pv_services" with CephFileSystem and CephBlockPool, the test failed with a MountFailed error, and the pod description shows that the mount is failing.


Version of all relevant components (if applicable):
OCS version : 4.4
OCP version : 4.4
Ceph Version : ceph version 14.2.8-79.el8cparch (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2 -- the scenario I tried was simple.



Is this issue reproducible?
yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up OCP and OCS on a KVM guest on s390x hardware.
2. Run tests/manage/pv_services/test_dynamic_pvc_accessmodes_with_reclaim_policies.py::TestDynamicPvc::test_rwx_dynamic_pvc[CephFileSystem-Delete],
   or any other test involving CephFileSystem and CephBlockPool (an example invocation is sketched below).
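
For reference, a minimal sketch of how the failing test might be invoked from an ocs-ci checkout is shown below. The run-ci wrapper and the activated virtualenv are assumptions; the exact entry point and any cluster/configuration flags depend on the ocs-ci version in use.

source venv/bin/activate
run-ci "tests/manage/pv_services/test_dynamic_pvc_accessmodes_with_reclaim_policies.py::TestDynamicPvc::test_rwx_dynamic_pvc[CephFileSystem-Delete]"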




Actual results:
The pod is not created during test execution; it stays in "ContainerCreating" status.


Expected results:
The pod should be created and the test case should run to completion.

Additional info:

(venv) root@s83lp83:~/ocs-ci# oc describe pod pod-test-cephfs-a33f2ff647014bea8b5d4cea27ab70f0
Name:         pod-test-cephfs-a33f2ff647014bea8b5d4cea27ab70f0
Namespace:    namespace-test-1d38209307cc477da74029a0e21ca128
Priority:     0
Node:         test1-f9xps-worker-0-k2hvx/192.168.126.51
Start Time:   Tue, 28 Jul 2020 15:51:09 +0200
Labels:       <none>
Annotations:  openshift.io/scc: anyuid
Status:       Pending
IP:           
IPs:          <none>
Containers:
  web-server:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fg8tm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-test-df590d0b329047a996d615d8b2932e3c
    ReadOnly:   false
  default-token-fg8tm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fg8tm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                  From                                 Message
  ----     ------                  ----                 ----                                 -------
  Normal   SuccessfulAttachVolume  32m                  attachdetach-controller              AttachVolume.Attach succeeded for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f"
  Warning  FailedMount             32m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (2974) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-257629566,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             32m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3076) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-853909311,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             32m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3178) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-819052180,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             32m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3280) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-135149837,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             32m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3382) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-482107770,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             31m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3484) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-338771179,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             31m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3586) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-278313904,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             31m                  kubelet, test1-f9xps-worker-0-k2hvx  MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (3688) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-758372953,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             12m (x8 over 30m)    kubelet, test1-f9xps-worker-0-k2hvx  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc default-token-fg8tm]: timed out waiting for the condition
  Warning  FailedMount             7m9s (x14 over 29m)  kubelet, test1-f9xps-worker-0-k2hvx  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f" : rpc error: code = Internal desc = an error occurred while running (4912) mount [-t ceph 172.30.173.238:6789,172.30.251.141:6789,172.30.223.131:6789:/volumes/csi/csi-vol-3f77a9c8-d0d9-11ea-be8c-0a580a800211/34590dd0-f198-43db-b1e9-82b2e9f90639 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b7919eef-4b3e-4b0b-b316-636d0fa2722f/globalmount -o name=admin,secretfile=/tmp/csi/keys/keyfile-208307645,mds_namespace=ocs-storagecluster-cephfilesystem]: exit status 32: mount error 5 = Input/output error
  Warning  FailedMount             66s                  kubelet, test1-f9xps-worker-0-k2hvx  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[default-token-fg8tm mypvc]: timed out waiting for the condition
(venv) root@s83lp83:~/ocs-ci# 
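
A quicker way to pull just the mount failures out of the event stream is an events query filtered by reason. This is a generic sketch (the namespace is the one from the pod description above):

oc get events -n namespace-test-1d38209307cc477da74029a0e21ca128 --field-selector reason=FailedMount --sort-by=.lastTimestamp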




(venv) root@s83lp83:~/ocs-ci# oc logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79d8db4bqd2l6
debug 2020-07-27 17:25:11.871 3ffaf9ec430  0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-27 17:25:11.871 3ffaf9ec430  0 ceph version 14.2.8-79.el8cparch (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable), process ceph-mds, pid 1
debug 2020-07-27 17:25:11.871 3ffaf9ec430  0 pidfile_write: ignore empty --pid-file
starting mds.ocs-storagecluster-cephfilesystem-a at 
debug 2020-07-27 17:25:12.001 3ffa0bff910  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 3 from mon.1
debug 2020-07-27 17:25:16.671 3ffa0bff910  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 4 from mon.1
debug 2020-07-27 17:25:16.671 3ffa0bff910  1 mds.ocs-storagecluster-cephfilesystem-a Monitors have assigned me to become a standby.
debug 2020-07-27 17:25:16.681 3ffa0bff910  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 5 from mon.1
debug 2020-07-27 17:25:16.681 3ffa0bff910  1 mds.0.5 handle_mds_map i am now mds.0.5
debug 2020-07-27 17:25:16.681 3ffa0bff910  1 mds.0.5 handle_mds_map state change up:boot --> up:creating
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x1
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x100
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x600
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x601
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x602
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x603
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x604
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x605
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x606
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x607
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x608
debug 2020-07-27 17:25:16.681 3ffa0bff910  0 mds.0.cache creating system inode with ino:0x609
debug 2020-07-27 17:25:16.701 3ff817fa910  1 mds.0.5 creating_done
debug 2020-07-27 17:25:17.701 3ffa0bff910  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 6 from mon.1
debug 2020-07-27 17:25:17.701 3ffa0bff910  1 mds.0.5 handle_mds_map i am now mds.0.5
debug 2020-07-27 17:25:17.701 3ffa0bff910  1 mds.0.5 handle_mds_map state change up:creating --> up:active
debug 2020-07-27 17:25:17.701 3ffa0bff910  1 mds.0.5 recovery_done -- successful recovery!
debug 2020-07-27 17:25:17.701 3ffa0bff910  1 mds.0.5 active_start
debug 2020-07-27 17:25:17.711 3ffa0bff910  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 7 from mon.1
(venv) root@s83lp83:~/ocs-ci# 



(venv) root@s83lp83:~/ocs-ci# oc logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-578cfc89279wf
debug 2020-07-27 17:25:13.414 3ff88eec430  0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-27 17:25:13.414 3ff88eec430  0 ceph version 14.2.8-79.el8cparch (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable), process ceph-mds, pid 1
debug 2020-07-27 17:25:13.414 3ff88eec430  0 pidfile_write: ignore empty --pid-file
starting mds.ocs-storagecluster-cephfilesystem-b at 
debug 2020-07-27 17:25:13.544 3ff6dffb910  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 3 from mon.2
debug 2020-07-27 17:25:16.664 3ff6dffb910  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 4 from mon.2
debug 2020-07-27 17:25:16.684 3ff6dffb910  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 5 from mon.2
debug 2020-07-27 17:25:17.694 3ff6dffb910  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 6 from mon.2
debug 2020-07-27 17:25:17.694 3ff6dffb910  1 mds.ocs-storagecluster-cephfilesystem-b Monitors have assigned me to become a standby.
debug 2020-07-27 17:25:17.714 3ff6dffb910  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 7 from mon.2
debug 2020-07-27 17:25:17.714 3ff6dffb910  1 mds.0.0 handle_mds_map i am now mds.25028.0 replaying mds.0.0
debug 2020-07-27 17:25:17.714 3ff6dffb910  1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
debug 2020-07-27 17:25:17.714 3ff6dffb910  1 mds.0.0 replay_start
debug 2020-07-27 17:25:17.714 3ff6dffb910  1 mds.0.0  recovery set is 
debug 2020-07-27 17:25:17.714 3ff527fc910  0 mds.0.cache creating system inode with ino:0x100
debug 2020-07-27 17:25:17.714 3ff527fc910  0 mds.0.cache creating system inode with ino:0x1
(venv) root@s83lp83:~/ocs-ci#




[140776.205942] libceph: mon1 (1)172.30.251.141:6789 session established
[140776.206199] libceph: client119443 fsid d3dffa2c-0ba8-4f3c-bf8d-74c273b1a87d
[140776.209513] ceph: problem parsing mds trace -5
[140776.209619] ceph: mds parse_reply err -5
[140776.209672] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)
[140776.209744] header: 00000000: 02 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[140776.209745] header: 00000010: 1a 00 7f 00 01 00 b4 01 00 00 00 00 00 00 00 00  ................
[140776.209746] header: 00000020: 00 00 00 00 02 00 00 00 00 00 00 00 00 01 00 00  ................
[140776.209747] header: 00000030: 00 d3 94 4d 24                                   ...M$
[140776.209748]  front: 00000000: 01 01 00 00 00 00 00 00 07 00 00 00 01 00 01 39  ...............9
[140776.209748]  front: 00000010: 01 00 00 03 01 33 01 00 00 10 00 00 00 00 01 00  .....3..........
[140776.209749]  front: 00000020: 00 fe ff ff ff ff ff ff ff 00 00 00 00 1c 00 00  ................
[140776.209750]  front: 00000030: 00 00 00 00 00 01 00 00 00 00 00 00 00 55 00 00  .............U..
[140776.209751]  front: 00000040: 00 00 00 00 00 b7 0e 00 00 00 00 00 00 01 00 00  ................
[140776.209752]  front: 00000050: 00 00 00 00 00 10 00 00 00 00 01 00 00 01 00 00  ................
[140776.209752]  front: 00000060: 40 00 01 00 00 00 00 00 40 00 00 00 00 00 00 00  @.......@.......
[140776.209753]  front: 00000070: 00 00 00 00 00 00 05 00 00 00 11 2d 20 5f 6e 70  ...........- _np
[140776.209754]  front: 00000080: b6 1b 11 2d 20 5f bc 86 64 1b 11 2d 20 5f bc 86  ...- _..d..- _..
[140776.209755]  front: 00000090: 64 1b 00 00 00 00 00 00 00 00 00 00 00 00 00 00  d...............
[140776.209756]  front: 000000a0: 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 01 00  ................
[140776.209757]  front: 000000b0: 00 00 ff 41 00 00 00 00 00 00 00 00 00 00 01 00  ...A............
[140776.209758]  front: 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209758]  front: 000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209759]  front: 000000e0: 00 00 01 00 00 00 00 00 00 00 11 2d 20 5f 6e 70  ...........- _np
[140776.209760]  front: 000000f0: b6 1b 00 00 00 00 00 00 00 00 02 00 00 00 00 00  ................
[140776.209761]  front: 00000100: 00 00 04 00 00 00 00 00 00 00 ff ff ff ff ff ff  ................
[140776.209762]  front: 00000110: ff ff 00 00 00 00 01 01 10 00 00 00 00 00 00 80  ................
[140776.209762]  front: 00000120: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209763]  front: 00000130: 11 2d 20 5f bc 86 64 1b 03 00 00 00 00 00 00 00  .- _..d.........
[140776.209764]  front: 00000140: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209765]  front: 00000150: 60 00 00 00 10 00 00 00 00 01 00 00 01 00 00 00  `...............
[140776.209766]  front: 00000160: 00 00 00 00 01 00 00 00 00 00 00 00 02 00 00 00  ................
[140776.209766]  front: 00000170: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209767]  front: 00000180: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209768]  front: 00000190: 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
[140776.209769]  front: 000001a0: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
[140776.209769]  front: 000001b0: 00 00 00 00                                      ....
[140776.209770] footer: 00000000: bd 9e 7b c2 00 00 00 00 00 00 00 00 74 9a d9 fa  ..{.........t...
[140776.209771] footer: 00000010: 0b 6b 0d 0a 05                                   .k...
[140900.319870] libceph: mon1 (1)172.30.251.141:6789 session established
[140900.320153] libceph: client119569 fsid d3dffa2c-0ba8-4f3c-bf8d-74c273b1a87d
[140900.324348] ceph: problem parsing mds trace -5
[140900.324453] ceph: mds parse_reply err -5
[140900.324504] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)
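
The kernel client messages above come from the worker node's dmesg. A hedged sketch for collecting them from an RHCOS node via a debug pod (node name taken from the pod description earlier; adjust as needed):

oc debug node/test1-f9xps-worker-0-k2hvx -- chroot /host dmesg | grep -iE 'libceph|ceph:'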

Comment 2 Chidanand Harlapur 2020-07-29 14:40:50 UTC
OCS and OCP log files can be found at this location:

https://drive.google.com/drive/folders/1CK2PcG63pW9Z1XB74aXD8v50WmVlOzZe?usp=sharing

Comment 3 Jeff Layton 2020-08-03 11:23:52 UTC
What kernel was the client running? It probably needs this patch (which is already merged for RHEL8.3):

    https://marc.info/?l=ceph-devel&m=158807659304587&w=2

You may want to update the kernel on the client to the latest RHEL8.3 beta kernel and see whether this is still reproducible.

Comment 4 Chidanand Harlapur 2020-08-03 11:38:20 UTC
The client is running the following kernel version:

Linux test1-f9xps-worker-0-rwvmr 4.18.0-193.13.2.el8_2.s390x #1 SMP Mon Jul 13 23:23:50 UTC 2020 s390x s390x s390x GNU/Linux

Comment 5 Jeff Layton 2020-08-03 11:45:56 UTC
Thanks. Yeah, that kernel doesn't have the endianness fix, as that went into -207.el8. See:

https://bugzilla.redhat.com/show_bug.cgi?id=1827767

If you have the ability to run this on a RHEL8.3 kernel, then it should (hopefully) work.
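
A quick way to see which kernel each cluster node is running, before and after any update, is a custom-columns query against the node status. Generic sketch:

oc get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

Per the discussion here, any node reporting a kernel older than the fixed release would be expected to hit the cephfs mount failure on s390x.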

Comment 6 Chidanand Harlapur 2020-08-03 11:51:14 UTC
Don't have access (You are not authorized to access bug #1827767) to view this bug https://bugzilla.redhat.com/show_bug.cgi?id=1827767

Comment 7 Jeff Layton 2020-08-03 13:36:49 UTC
Chidanand, I CC'ed you on the RHEL 8.3 update bug, but it's not very interesting as it's just a big rollup of upstream patches. What I'd suggest is getting the latest RHEL-8.3.0 candidate kernel you can find and seeing whether this is still reproducible with it. See:

https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=1231

Comment 8 Scott Ostapovicz 2020-08-03 14:05:58 UTC
Assigning this to Jeff so he can make sure it gets tested with the latest RHEL-8.3.0 candidate kernel

Comment 9 Jeff Layton 2020-08-03 14:15:55 UTC
Chidanand, do you have the ability to run this test and override the kernel that it uses?

Comment 10 Chidanand Harlapur 2020-08-03 14:38:53 UTC
Jeff, I have never tried that and don't have much idea of how to override the kernel. If you can share some instructions, I can give it a try.

Comment 11 Jeff Layton 2020-08-03 14:52:21 UTC
I think that depends on the testing infrastructure you were using. Maybe Raz (our QA contact) can help?

Comment 12 Raz Tamir 2020-08-04 07:04:49 UTC
Hi Jeff,

I'm not sure what the ask is here.
Are you suggesting we run the tests over RHEL 8.3, or make some other modification to the kernel itself?
We run our tests over the RHEL version that is supported in OCP.

Comment 13 Yaniv Kaul 2020-08-04 08:57:58 UTC
(In reply to Raz Tamir from comment #12)
> Hi Jeff,
> 
> I'm not sure what the ask is here.
> Are you suggesting we run the tests over RHEL 8.3, or make some other
> modification to the kernel itself?
> We run our tests over the RHEL version that is supported in OCP.

The initial description was "Setup OCP and OCS on KVM guest on s390x hardware" - I don't think you are running on that one.

Setting NEEDINFO on the initial reporter, if the kernel can be updated on that VM.

Comment 14 Chidanand Harlapur 2020-08-04 09:21:33 UTC
The setup is on an s390x LPAR, and I installed OCP 4.4 using libvirt by following the steps at https://github.com/openshift/installer/tree/release-4.4/docs/dev/libvirt.

The OCP cluster is set up with 3 master and 3 worker nodes, all running Red Hat CoreOS (Linux test1-f9xps-worker-0-rwvmr 4.18.0-193.13.2.el8_2.s390x #1 SMP Mon Jul 13 23:23:50 UTC 2020 s390x s390x s390x GNU/Linux).

Regarding the kernel update, I can try to update the kernel on those KVM guests. It would be helpful if you could share a link to the latest kernel that supports s390x hardware.

Comment 15 Ulrich Weigand 2020-08-04 13:27:41 UTC
A quick update on the nature of the kernel bug.  A while ago, the kernel cephfs component was extended to support a new element of the on-the-wire protocol between cephfs and the MDS daemon that allows negotiation of the supported feature set.  The initial version of that patch was broken on big-endian machines, causing *any* cephfs mount to fail.  This bug was fixed by the patch in comment #3.  So we have three possible states:

1) old cephfs without feature selection - works on s390x
2) cephfs with (buggy) feature selection - broken on s390x
3) cephfs with fixed feature selection - works on s390x

When we initially noticed and fixed that problem, it looked like the RHEL kernels didn't have that issue, because at the time they were still using an old kernel (1) without the feature selection code.  That's why we just fixed it upstream and didn't ask for a backport.

However, it looks like we didn't notice that, in the meantime, the feature selection code was actually backported to the RHEL kernel -- unfortunately (initially) without the fix -- so we now have situation (2) in the current RHEL kernel, where cephfs mounts are completely broken.

As mentioned in comment #5, the fix has now been backported as well, and will be in the RHEL 8.3 kernel.  This means as far as RHEL kernels are concerned, we have the following status:

before 4.18.0-154: state (1) - works on s390x
since 4.18.0-154 but before 4.18.0-207: state (2) - broken on s390x
since 4.18.0-207: state (3) - works on s390x

For RHEL 8.x releases this implies:

RHEL 8.1 - GA kernel 4.18.0-147 - works on s390x
RHEL 8.2 - GA kernel 4.18.0-193 - broken on s390x
RHEL 8.3 - GA kernel >4.18.0-207 - will work on s390x

So in a sense, RHEL 8.2 introduced a cephfs regression that will be fixed again in RHEL 8.3.   Given that this is a regression, I guess one question would be whether this ought to be fixed in the RHEL 8.2 maintenance stream as well.
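
As a hedged convenience, the -207 boundary can be checked directly on a node with a simple version-sort comparison; this is only an illustration, not part of any official tooling:

fixed="4.18.0-207"
current="$(uname -r | sed 's/\.el8.*//')"     # e.g. 4.18.0-193.13.2
lowest="$(printf '%s\n%s\n' "$fixed" "$current" | sort -V | head -n1)"
if [ "$lowest" = "$fixed" ]; then
    echo "kernel $current should contain the cephfs endianness fix"
else
    echo "kernel $current predates $fixed -- cephfs mounts are expected to fail on s390x"
fi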


However, the RHEL product isn't really what is relevant for our use case here: we're using OCP on RH CoreOS, not RHEL directly.  (OCP is only supported on top of CoreOS on Z.)

So the interesting question is how to get the fix into *CoreOS* (and therefore OCP).  It looks like both OCP 4.4 and OCP 4.5 currently use the same -193 kernel that is in RHEL 8.2, and are therefore broken on s390x.

I'm not so familiar with how the CoreOS upgrade process works, so here's a few questions:
- What kernel will OCP 4.6 use?  How can we ensure this bug will be fixed there?
- Should this regression get fixed as a maintenance update for OCP 4.5 (and possibly 4.4)?
- If there will be no official fix on 4.5, is there some way we can work around the bug?  I believe you cannot simply install another kernel in CoreOS ...


As an aside, note that it now appears that installing OCP on KVM vs. z/VM actually doesn't make any difference here, it's just that our z/VM install was still based on OCP 4.3 (using the RHEL 8.1 kernel), and therefore did not yet show this bug.

Comment 16 Jeff Layton 2020-08-04 13:39:42 UTC
Thanks for the succinct explanation Ulrich.

> So in a sense, RHEL 8.2 introduced a cephfs regression that will be fixed again in RHEL 8.3.   Given that this is a regression, I guess one question would be whether this ought to be fixed in the RHEL 8.2 maintenance stream as well.

I'd be fine with that. The patch is pretty safe, so backporting should be no big deal. I'm not well versed enough in CoreOS to know what we'd need to do for that though.

Comment 17 Jeff Layton 2020-08-04 16:43:05 UTC
Oof, spoke a bit too soon about the safety of that patch. I just opened BZ#1866018 today, which is a regression that was caused by that endianness fix. If you pull that patch into CoreOS, you'll also want this one (not yet merged upstream).

    https://marc.info/?l=ceph-devel&m=159655872206314&w=2

Comment 18 Mudit Agarwal 2020-08-05 10:28:26 UTC
In an offline conversation with Elad and Chidanand it became clear that OCP 4.5 CoreOS is running kernel version 4.18.0-193, which is what causes the problem on s390x hardware. The cephfs issue is fixed in kernel version -207 and above, and because of this the OCS tier1 test cases are failing.

This can be hit with any OCS version under these conditions, so it can be a test blocker, but I don't think it should be a blocker for OCS 4.5.

I would like to remove the blocker flag for 4.5 and/or move it to the next release, unless someone thinks otherwise.

Comment 19 Ulrich Weigand 2020-08-05 13:18:36 UTC
Jeff, I don't seem to be able to access BZ#1866018, could you add me on CC?

Following up on comment #14, I looked more closely at the CoreOS situation:
- The CoreOS/OCP 4.4 GA release actually still uses the RHEL 8.1 kernel, so this works on s390x
- However, at some point (around 2020-07-14) the nightly dev pre-release stream of OCP 4.4 switched over to the RHEL 8.2 kernel, from which point on it fails
- CoreOS/OCP 4.5 has always used the RHEL 8.2 kernel, and therefore always fails
- The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel, so should work again (however, note the new regression in BZ#1866018)

So I believe the next steps should be:
- Get the regression fix in BZ#1866018 accepted upstream and included into RHEL 8.3 (and then CoreOS 4.6)
- Port both fixes into the RHEL 8.2.z maintenance stream
- Update the CoreOS 4.5 kernel with the latest RHEL 8.2.z kernel
Does that look reasonable?

As to release blocker status, I agree that this bug is not tied to a particular *OCS version* as such; any OCS version will fail if the kernel has this bug.  However, I'd consider presence of this bug a release blocker for *OCS on Z* in general.  Depending on which version of OCS we're targeting for the initial Z release, this would then become a blocker for that version.  I believe at this point, we still have not made the final decision; it could be either some post-GA OCS 4.5.z release or else OCS 4.6.

Comment 20 Jeff Layton 2020-08-05 13:27:30 UTC
Yes, that's more or less what I'm planning to do. We should probably clone this bug for 8.2.z and we'll just make sure we pull in both patches for that.

Comment 21 Jeff Layton 2020-08-05 13:49:56 UTC
RHEL-8.2.z bug : https://bugzilla.redhat.com/show_bug.cgi?id=1866386

Comment 25 Jeff Ligon 2020-08-13 14:35:04 UTC
> - The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel,
> so should work again (however, note the new regression in BZ#1866018)
> 

We are currently using a bespoke 8.3 kernel to work around a selinux patch that is being backported to RHEL 8.2. OCP 4.6 will not move to RHEL 8.3 during its lifecycle, so please be sure to continue backporting to 8.2 for OCP 4.6 fixes.

> So I believe the next steps should be:
> - Get the regression fix in BZ#1866018 accepted upstream and included into
> RHEL 8.3 (and then CoreOS 4.6)
> - Port both fixes into the RHEL 8.2.z maintenance stream
> - Update the CoreOS 4.5 kernel with the latest RHEL 8.2.z kernel
> Does that look reasonable?
> 

That sounds like a reasonable approach to making sure RHCOS consumes this fix correctly, but there could be a brief window between RHCOS switching back to the 8.2 kernel and the patches landing in the proper z-streams.

Comment 26 Saravanakumar 2020-08-19 07:04:28 UTC
(In reply to Jeff Layton from comment #16)

> I'd be fine with that. The patch is pretty safe, so backporting should be no
> big deal. I'm not well versed enough in CoreOS to know what we'd need to do
> for that though.

How to install a new kernel in CoreOS for testing/development
--------------------------------------------------------------

Download all the relevant kernel RPM packages (kernel, kernel-core, kernel-modules, etc.).

// For replacing a test kernel:
# rpm-ostree override replace /path/to/kernel-XYZ*.rpm \
                              /path/to/kernel-core*.rpm \
                              /path/to/kernel-modules*.rpm

Reboot the node and ensure the latest kernel is running.
(Note: You can still select the old kernel from the grub menu during bootup.)

It is important to note that this is only for development/testing purposes.
Once the fix lands in an official kernel update, you first need to revert to the original kernel (1)
and then follow this to upgrade the OS: https://github.com/openshift/os/blob/master/FAQ.md#q-how-do-i-upgrade-the-os


Note :
(1)
// To undo some of the overrides you have done in the past
# rpm-ostree override reset
// To discard all local modifications and go back to the original tree
# rpm-ostree reset

This should help us proceed with testing.

Ref: https://github.com/openshift/os/blob/master/FAQ.md#q-what-happens-when-i-use-rpm-ostree-override-replace-to-replace-an-rpm
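
A hedged follow-up after the node reboots, to confirm the override actually took effect (<node-name> is a placeholder):

// Check the running kernel and the override state from a debug pod (illustrative)
# oc debug node/<node-name> -- chroot /host uname -r
# oc debug node/<node-name> -- chroot /host rpm-ostree status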

Comment 28 Ulrich Weigand 2020-08-20 16:06:55 UTC
(In reply to Jeff Ligon from comment #25)
> > - The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel,
> > so should work again (however, note the new regression in BZ#1866018)
> > 
> 
> We are currently using a bespoke 8.3 kernel to work around a selinux patch
> that is being backported to RHEL 8.2. OCP 4.6 will not move to RHEL 8.3
> during its lifecycle, so please be sure to continue backporting to 8.2 for
> OCP 4.6 fixes.

Jeff Layton opened a bug to track backporting to RHEL 8.2 (see above).

Does this mean that the change will then flow to OCP 4.6 automatically, or do we need to open *another* bug against OCP/CoreOS to track that?

Comment 29 Scott Ostapovicz 2020-08-24 14:16:46 UTC
As noted, this issue is only a tracker for BZ#1866018.

Comment 33 Mudit Agarwal 2020-09-28 03:24:28 UTC
There is a backport BZ for 8.2 (https://bugzilla.redhat.com/show_bug.cgi?id=1875787), so the fix is already backported to 8.2 and can be tested there.

Comment 36 Elad 2020-11-26 14:31:24 UTC
Based on the automation run results of BUILD ID: v4.6.0-131.ci, RUN ID: 1603961686 (tier1 over IBM ROKS), in which this test case passed, I am moving this to VERIFIED:

tests/manage/pv_services/test_dynamic_pvc_accessmodes_with_reclaim_policies.py::TestDynamicPvc::test_rwx_dynamic_pvc[CephFileSystem-Delete]

Comment 37 Chidanand Harlapur 2020-11-26 15:50:00 UTC
With OCP 4.6.3 (kernel version 4.18.0-193.28.1.el8_2.s390x), the issue has been fixed.

Comment 39 errata-xmlrpc 2020-12-17 06:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

Comment 40 Red Hat Bugzilla 2023-09-14 06:04:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

