Bug 2223380 - [IBM Z] ODF deployed on IBM Z with DASD ( OSD CLBO failed to load OSD map for epoch ) [NEEDINFO]
Summary: [IBM Z] ODF deployed on IBM Z with DASD ( OSD CLBO failed to load OSD map for epoch )
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-17 14:05 UTC by khover
Modified: 2023-08-09 16:37 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-31 11:53:22 UTC
Embargoed:
khover: needinfo? (tstober)
khover: needinfo? (rzarzyns)



Description khover 2023-07-17 14:05:04 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

After initiating an OCP MachineConfig (MC) update, the OSD is in CrashLoopBackOff (CLBO).

2023-07-17T08:06:34.180507921Z debug     -2> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 osd.0 0 failed to load OSD map for epoch 14, got 0 bytes


Full log excerpt around the failure:

2023-07-17T08:06:34.180481051Z debug    -11> 2023-07-17T08:06:34.146+0000 3ff8926a500  2 osd.0 0 journal looks like ssd
2023-07-17T08:06:34.180481051Z debug    -10> 2023-07-17T08:06:34.146+0000 3ff8926a500  2 osd.0 0 boot
2023-07-17T08:06:34.180481051Z debug     -9> 2023-07-17T08:06:34.146+0000 3ff7abfe900  5 prioritycache tune_memory target: 4294967296 mapped: 18661376 unmapped: 1228800 heap: 19890176 old mem: 134217728 new mem: 2833635839
2023-07-17T08:06:34.180481051Z debug     -8> 2023-07-17T08:06:34.146+0000 3ff7abfe900  5 prioritycache tune_memory target: 4294967296 mapped: 18685952 unmapped: 1204224 heap: 19890176 old mem: 2833635839 new mem: 2845364582
2023-07-17T08:06:34.180481051Z debug     -7> 2023-07-17T08:06:34.146+0000 3ff7abfe900  5 bluestore.MempoolThread(0x2aa560f2b90) _resize_shards cache_size: 2845364582 kv_alloc: 1207959552 kv_used: 40592 kv_onode_alloc: 167772160 kv_onode_used: 14704 meta_alloc: 1207959552 meta_used: 21263 data_alloc: 218103808 data_used: 0
2023-07-17T08:06:34.180481051Z debug     -6> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180481051Z debug     -5> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180507921Z debug     -4> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180507921Z debug     -3> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180507921Z debug     -2> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 osd.0 0 failed to load OSD map for epoch 14, got 0 bytes
2023-07-17T08:06:34.180507921Z debug     -1> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 /builddir/build/BUILD/ceph-16.2.10/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 3ff8926a500 time 2023-07-17T08:06:34.160411+0000
2023-07-17T08:06:34.180507921Z /builddir/build/BUILD/ceph-16.2.10/src/osd/OSD.h: 752: FAILED ceph_assert(ret)
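
The sequence above is: BlueStore fails crc32c verification on the on-disk osdmap.14 object, the read therefore comes back as 0 bytes, and the OSD aborts at ceph_assert(ret) in OSDService::get_map() while booting. For reference, a minimal sketch (Python wrapper) of how one might try to dump that on-disk epoch with ceph-objectstore-tool; the data path, the epoch, and running this from a debug pod with the OSD stopped are assumptions, not a confirmed procedure:

#!/usr/bin/env python3
# Hedged sketch: attempt to extract the on-disk osdmap for the failing epoch
# with ceph-objectstore-tool. ASSUMPTIONS: the tool is available in the OSD
# debug/toolbox context, the OSD is stopped, and the data path below matches
# this deployment. This only reads data; nothing is written back.
import subprocess

DATA_PATH = "/var/lib/ceph/osd/ceph-0"   # assumed OSD data path, taken from the log
EPOCH = 14                               # epoch reported in the assert
OUT_FILE = f"/tmp/osdmap.{EPOCH}"

cmd = [
    "ceph-objectstore-tool",
    "--data-path", DATA_PATH,
    "--op", "get-osdmap",
    "--epoch", str(EPOCH),
    "--file", OUT_FILE,
]
result = subprocess.run(cmd, capture_output=True, text=True)
print("exit code:", result.returncode)
print(result.stdout or result.stderr)
# A non-zero exit or a short/empty output file would be consistent with the
# "got 0 bytes" read failure seen in the OSD log above.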




Version of all relevant components (if applicable):

ODF v4.12.4

ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

The customer cannot reliably deploy ODF on IBM Z.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 khover 2023-07-17 14:14:12 UTC
Must-gather uploaded to supportshell:

/cases/03562792

0010-odf-mustgather-c2nwedi01.tgz

namespaces/openshift-storage/pods/rook-ceph-osd-0-754b566b47-d578b/osd/osd/logs/previous.log

namespaces/openshift-storage/pods/rook-ceph-osd-0-754b566b47-d578b/osd/osd/logs/current.log
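
For triage, a small helper sketch (Python) that scans one of those OSD log files for _verify_csum and "failed to load OSD map" lines and summarizes the affected objects and device ranges; the log paths above are the intended input, nothing else is assumed:

#!/usr/bin/env python3
# Hedged sketch: summarize checksum and osdmap-load errors in a must-gather
# OSD log. Pass the path to previous.log or current.log as the first argument.
import re
import sys
from collections import Counter

CSUM_RE = re.compile(
    r"_verify_csum .* device location \[([^\]]+)\].*object (#[^#]+#)"
)
OSDMAP_RE = re.compile(r"failed to load OSD map for epoch (\d+)")

def main(path: str) -> None:
    csum_hits = Counter()
    epochs = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            m = CSUM_RE.search(line)
            if m:
                csum_hits[(m.group(2), m.group(1))] += 1
            m = OSDMAP_RE.search(line)
            if m:
                epochs[int(m.group(1))] += 1
    for (obj, loc), n in csum_hits.most_common():
        print(f"{n:4d}x bad checksum  object={obj}  device location=[{loc}]")
    for epoch, n in epochs.most_common():
        print(f"{n:4d}x failed to load OSD map for epoch {epoch}")

if __name__ == "__main__":
    main(sys.argv[1])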

Comment 6 khover 2023-07-19 20:07:17 UTC
These are two different clusters for the same customer, and the BZ 2222728 error is:

 - inferring bluefs devices from bluestore path
   2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs

Adam may have added that HINT: to the wrong BZ.

==========================================================

The question from the customer remains:

The customer is asking whether there is a way to target the exact area on disk where the bad CRC occurred, based on:

2023-07-17T08:06:34.180481051Z debug     -6> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180481051Z debug     -5> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180507921Z debug     -4> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180507921Z debug     -3> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, device location [0x471000~1000], logical extent 0x1000~1000, object #-1:3b30e826:::osdmap.14:0#
2023-07-17T08:06:34.180507921Z debug     -2> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 osd.0 0 failed to load OSD map for epoch 14, got 0 bytes

Or is there additional info we could collect to achieve this?
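
One hedged answer sketch: the device location [0x471000~1000] field in the _verify_csum message is already an offset~length pair in hex bytes, and on a single-device (DASD) OSD this should correspond to the main BlueStore block device. A minimal Python sketch for inspecting that range, assuming the usual /var/lib/ceph/osd/ceph-0/block symlink (to be confirmed on the node) and read-only access:

#!/usr/bin/env python3
# Hedged sketch: read the byte range flagged by _verify_csum from the OSD's
# BlueStore block device. The "device location [0x471000~1000]" field is an
# offset~length pair in hex bytes; the device path below is an ASSUMPTION
# (the usual `block` symlink under the OSD data directory from the log) and
# should be confirmed on the node. Run read-only, ideally with the OSD stopped.
import re
import sys

LOG_LINE = (
    "bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 "
    "checksum at blob offset 0x1000, got 0x6706be76, expected 0x7b34f36b, "
    "device location [0x471000~1000], logical extent 0x1000~1000, "
    "object #-1:3b30e826:::osdmap.14:0#"
)

def parse_device_location(line: str):
    """Return (offset, length) in bytes from a _verify_csum log line."""
    m = re.search(r"device location \[0x([0-9a-f]+)~([0-9a-f]+)\]", line)
    if not m:
        raise ValueError("no device location found")
    return int(m.group(1), 16), int(m.group(2), 16)

def main():
    dev = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/ceph/osd/ceph-0/block"
    offset, length = parse_device_location(LOG_LINE)
    with open(dev, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    print(f"read {len(data)} bytes at offset {offset:#x} from {dev}")
    print("all zeroes" if data.count(0) == len(data) else "contains non-zero data")

if __name__ == "__main__":
    main()

If the range reads back as all zeroes, that would be consistent with the device returning zero-filled blocks for that region rather than logical corruption inside BlueStore, which would narrow where to look next.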

