Description of problem:
1) Deployment of a hybrid Ceph cluster crashes.
2) The monitor daemon of a homogeneous cluster built of s390x nodes crashes repeatedly under a specific workload on CephFS.

Ceph version: 15.2.3 ("Octopus")

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 4

Reproducible: Yes

Steps to reproduce:

1) Build a "hybrid" cluster:
   - Bootstrap a monitor node on an s390x box.
   - Add a number of OSD nodes, one of which is on x86_64.
   - Start a Ceph manager and a monitor daemon.
   - Start an OSD daemon on the x86_64 node.

Actual results: the OSD daemon crashes.
Expected results: the OSD daemon works without crashing.

2) Build a cluster (all nodes on s390x) with 1 monitor node, 3 OSDs, 1 MDS, and 1 MGR:
   - Add 2 more monitor nodes that are already OSDs. The resulting cluster has 3 monitor nodes, 2 of which are both OSDs and monitors.
   - Create and mount a CephFS.
   - Run the stress-ng tool.

Actual results: the monitor daemon crashes repeatedly.
Expected results: the monitor daemon does not crash.

Additional info:

Topology:

root@m8330013:~# ceph node ls all
{
    "mon": {
        "m8330013": [
            "m8330013"
        ],
        "m8330014": [
            "m8330014"
        ],
        "m8330015": [
            "m8330015"
        ]
    },
    "osd": {
        "m8330014": [
            0
        ],
        "m8330015": [
            1
        ],
        "m8330016": [
            2
        ]
    },
    "mds": {
        "m8330013": [
            "m8330013"
        ]
    },
    "mgr": {
        "m8330013": [
            "m8330013"
        ],
        "m8330015": [
            "m8330015"
        ]
    }
}

Stress-ng job file:

run sequential
metrics
verbose
timeout 5m
times
timestamp
#0 means 1 stressor per CPU
access 0
bind-mount 0
chdir 0
chmod 0
chown 0
copy-file 0
dentry 0
dir 0
dirdeep 0
dnotify 0
dup 0
eventfd 0
fallocate 0
fanotify 0
fcntl 0
fiemap 0
file-ioctl 0
filename 0
flock 0
fstat 0
getdent 0
handle 0
inode-flags 0
inotify 0
io 0
iomix 0
ioprio 0
lease 0
link 0
locka 0
lockf 0
lockofd 0
mknod 0
open 0
procfs 0
rename 0
symlink 0
sync-file 0
utime 0
xattr 0

Command for execution:

stress-ng --job <job_file> --temp-path <cephfs_mountpoint> --log-file <log_file>
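The hybrid-cluster crash in scenario 1 comes down to byte order: s390x is big-endian while x86_64 is little-endian, so the same multi-byte value has opposite in-memory byte layouts on the two architectures. The following is a minimal C++ sketch (illustrative only, not Ceph code) of why serializing raw host bytes cannot interoperate between the two, and how picking a fixed wire order avoids the problem:

#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
    uint32_t v = 0x01020304u;

    // Naive "encoding": copy the raw bytes of the host representation.
    // On x86_64 this writes 04 03 02 01; on s390x it writes 01 02 03 04,
    // so the two hosts cannot decode each other's messages.
    uint8_t raw[4];
    std::memcpy(raw, &v, sizeof(v));

    // Portable encoding: serialize in a fixed wire order (little-endian
    // here) regardless of the host's native byte order.
    uint8_t wire[4];
    for (int i = 0; i < 4; ++i)
        wire[i] = static_cast<uint8_t>(v >> (8 * i));

    // The wire bytes are identical on every architecture.
    assert(wire[0] == 0x04 && wire[1] == 0x03 &&
           wire[2] == 0x02 && wire[3] == 0x01);
    return 0;
}

The decoder reverses the shift loop; as long as both sides agree on the wire order, the host's endianness never leaks into the message format.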
The root cause of the problems was identified as a number of endianness bugs in Ceph message encoding/decoding that were causing various issues on IBM Z. Proposed fixes were sent upstream. The related upstream pull requests are:

https://github.com/ceph/ceph/pull/35920 (msg/msg_types: entity_addrvec_t: fix decode on big-endian hosts)
This is probably not critical on its own, but it did cause Ceph unit test failures.

https://github.com/ceph/ceph/pull/36697 (messages,mds: Fix decoding of enum types on big-endian systems)
This showed up as MON daemons crashing whenever they receive a HEALTH_WARN message (and probably other problems).

https://github.com/ceph/ceph/pull/36992 (include/encoding: Fix encode/decode of float types on big-endian systems)
This caused immediate crashes of an IBM Z OSD attempting to join an x86-based Ceph cluster (and probably other problems).
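As a rough illustration of the float issue addressed by the third pull request: a float can be encoded host-independently by reinterpreting its IEEE-754 bit pattern as a uint32_t and always serializing that integer in a fixed (little-endian) wire order. This is a sketch only; the helper names below are hypothetical and not Ceph's actual encode/decode API:

#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: serialize a float's IEEE-754 bits in
// little-endian order so s390x and x86_64 produce identical wire bytes.
static void encode_float_le(float f, uint8_t out[4]) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));  // host-order bit pattern
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<uint8_t>(bits >> (8 * i));  // fixed wire order
}

static float decode_float_le(const uint8_t in[4]) {
    uint32_t bits = 0;
    for (int i = 0; i < 4; ++i)
        bits |= static_cast<uint32_t>(in[i]) << (8 * i);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    uint8_t buf[4];
    encode_float_le(1.5f, buf);
    assert(decode_float_le(buf) == 1.5f);
    // 1.5f is 0x3FC00000 in IEEE-754; little-endian wire order puts
    // the low byte first and the 0x3F byte last.
    assert(buf[0] == 0x00 && buf[3] == 0x3F);
    return 0;
}

A decoder that instead copies raw host bytes would read the four wire bytes in reverse on a big-endian host, producing a garbage float value, which is consistent with the immediate OSD crash described above.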
Does not seem to be an OCS bug. Moving to RHCS.
Eduard, please follow these instructions to get this into the product repo: https://mojo.redhat.com/docs/DOC-1221008. Let me know if you run into trouble.
*** Bug 1890640 has been marked as a duplicate of this bug. ***
OK, noted. Removing this BZ from the release notes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0081