Description of problem (please be as detailed as possible and provide log snippets):

OSD CLBO:

```
inferring bluefs devices from bluestore path
2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs
2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 bluestore(/var/lib/ceph/osd/ceph-1) _open_db erroring opening db:
/builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::expand_devices(std::ostream&)' thread 3ff8ab68800 time 2023-07-12T09:21:44.526558+0000
/builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: 6911: FAILED ceph_assert(r == 0)
 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x3ff81a6d8be]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x26dad2) [0x3ff81a6dad2]
 3: (BlueStore::expand_devices(std::ostream&)+0x54a) [0x2aa2fb6d18a]
 4: main()
 5: __libc_start_main()
 6: ceph-bluestore-tool(+0x1f4bec) [0x2aa2fa74bec]
 7: [(nil)]
*** Caught signal (Aborted) **
 in thread 3ff8ab68800 thread_name:ceph-bluestore-
2023-07-12T09:21:44.524+0000 3ff8ab68800 -1 /builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::expand_devices(std::ostream&)' thread 3ff8ab68800 time 2023-07-12T09:21:44.526558+0000
/builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: 6911: FAILED ceph_assert(r == 0)
```

Version of all relevant components (if applicable):
ODF 4.12
ceph-16.2.10

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
IBM Z DASD POC on hold

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Regarding:

```
2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs
```

Errors like this one usually are symptoms of an underlying hardware problem. Has it been excluded? We will need an IBM Z expert for that.
Abdul tried to reproduce the issue on my DASD-based cluster. After applying the certificate change, the nodes were rebooted and ODF came back completely healthy, without issues.

Versions tried: OCP 4.12.23, ODF 4.12.4
Question: Was the customer able to create a new OSD and re-attach it to the cluster? And did the cluster work with the replaced OSD?
And one more question: Can someone please verify that the logs provided by the customer were created on the correct date (July 14th, NOT June 26th)?
The customer has not replaced the OSD at this time, in case further info was needed for this BZ.

All logs surrounding this BZ are from July 13/14.

```
2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs
```

> Errors like this one usually are symptoms of an underlying hardware problem. Has it been excluded?
(In reply to khover from comment #7)
> The customer has not replaced the OSD at this time, in case further info was needed for this BZ.
>
> All logs surrounding this BZ are from July 13/14.
>
> 2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs
>
> > Errors like this one usually are symptoms of an underlying hardware problem. Has it been excluded?

Hi Kevan,

The customer ran a script, dbginfo.sh, on Friday during the call and collected a DBGINFO-...tarball. Did you get that data?
Hi,

Yes, the tarball is uploaded to supportshell.

supportshell/cases/03539410

DBGINFO-2023-06-26-07-08-03-c02ns001-25B658.tgz
(In reply to khover from comment #11)
> Hi,
>
> Yes, the tarball is uploaded to supportshell.
>
> supportshell/cases/03539410
>
> DBGINFO-2023-06-26-07-08-03-c02ns001-25B658.tgz

But this is not the one taken last Friday; it's an old one. The one taken during the call on Friday should have a name like DBGINFO-2023-07-14-...
I'll reach out to the customer for that.
Hi,

The customer states that DBGINFO-2023-07-14-13-30-46 was collected from the case 03562792 / BZ 2223380 cluster.

```
entries (bad CRC)
2023-07-17T08:06:34.180507921Z debug -2> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 osd.0 0 failed to load OSD map for epoch 14, got 0 bytes
```

Do we need a fresh debug collection for this cluster? I can get one if needed.
We may have identified the issue in the gchat collab. I asked if there was a diff between the IBM test env and the customer env: ID 0.0.0a00 is being partitioned on each node for OSDs. The correct way to configure would look like the following?

```
# lszdev
TYPE       ID        ON   PERS  NAMES
dasd-eckd  0.0.0100  yes  no    dasda
dasd-eckd  0.0.0190  no   no
dasd-eckd  0.0.0191  no   no
dasd-eckd  0.0.01fd  yes  no    <<-- use for node 1
dasd-eckd  0.0.01fe  yes  no    <<-- use for node 2
dasd-eckd  0.0.01ff  yes  no
dasd-eckd  0.0.0592  no   no
dasd-eckd  0.0.0a00  yes  yes   <<-- use for node 3
```

@Abdul replied:

> Yes, Kevan. Please use different DASD IDs for each OSD, as you listed below. As Sa mentioned, we always had different DASD IDs for each OSD disk in our test environment.

This will be set up in the customer env for case 03562792 / BZ 2223380 for testing, initially in case we need additional info collected for this case/BZ.
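For illustration, a minimal sketch of what that per-node configuration could look like, following the mapping above and the oc debug pattern used in the customer script later in this BZ (the node names are placeholders, not from the case):

```
# Hypothetical: enable a *different* DASD ID on each storage node,
# so no two nodes share the same device for their OSD.
oc debug node/worker-1 -- chroot /host chzdev -e 0.0.01fd
oc debug node/worker-2 -- chroot /host chzdev -e 0.0.01fe
oc debug node/worker-3 -- chroot /host chzdev -e 0.0.0a00
```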
Raising the escalation flag on the BZ ticket to match the ongoing internal escalation. The deployment of applications on OpenShift on Z is being blocked by these issues.
This morning in the customer call, a problem was identified: the block device /dev/dasde1 had been accidentally overwritten and showed up as a file with size 0 in the filesystem. The customer would like to correct it and reinstall ODF. The result will be shown in the customer call later today.
This is not the root cause of the initial CRC errors.

- We observed that reading 4k bytes from /dev/dasde1 returned 0 bytes. Running strace further showed the read() call returning 0.
- From the CLI history, we observed that `dd if=/dev/zero of=/dev/dasde1` had been run to wipe the disk as part of earlier troubleshooting.

This dd was run with the customer in an attempt to recover from the OSD down / CLBO outlined in the initial BZ description.
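For reference, a minimal sketch of the kind of checks described above; the exact flags are assumptions, not the literal commands run on the call:

```
# Read the first 4k from the partition; in this case it returned 0 bytes.
dd if=/dev/dasde1 of=/dev/null bs=4k count=1

# Trace the read() syscalls to confirm read() returns 0 (EOF).
strace -e trace=read dd if=/dev/dasde1 of=/dev/null bs=4k count=1
```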
Yesterday in the customer call we did a fresh installation of OCP and ODF. Then, after the certificate update and reboot, everything worked fine.

The only difference compared with the setup before is the disk layout: CDL (compatible disk layout) is used instead of LDL (Linux® disk layout).
https://www.ibm.com/docs/en/linux-on-systems?topic=know-disk-layout-summary

Although we could reproduce the customer issue using LDL formatting in our lab, the CDL formatting is the default for DASD disks and the recommended one. It is also required for OCP installation on IBM Z:
https://docs.openshift.com/container-platform/4.13/installing/installing_ibm_z/installing-ibm-z.html

The difference is that after formatting with CDL, the customer needs to run another command, "fdasd", to do the partitioning. This step is not needed with LDL. This may lead to some confusion.
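To make the difference concrete, a sketch using the device name from this case (CDL is the dasdfmt default when no -d option is given; the commands match the examples and customer script elsewhere in this BZ):

```
# CDL (default): format, then create the partition explicitly
dasdfmt /dev/dasde -b 4096 -p -y -F
fdasd -a /dev/dasde      # -a auto-creates one partition spanning the disk

# LDL: the format itself leaves an implicit partition; no fdasd step
dasdfmt /dev/dasde -b 4096 -p -y -F -d ldl
```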
> Although we could reproduce the customer issue using LDL formatting in our
> lab, the CDL formatting is the default for DASD disks and the recommended one.

Just reviewed the comment I wrote yesterday and need to correct: We could NOT reproduce the customer issue using LDL formatting...
Customer updated today:

After the call we have done some testing. We did a reinstall of the OpenShift and ODF cluster using LDL-formatted disks and noticed inconsistencies with the install compared to the one we did on the call. We couldn't create an ODF Storage Cluster as it wasn't available; only a generic storage system was available.

We then rebuilt the OpenShift and ODF cluster again using CDL formatting and used our scripts and yamls rather than the UI, and it worked fine. We then performed multiple updates that triggered reboots, and everything is fine at the moment.

I have attached the yamls for the storage components; if these could be checked just to make sure they look fine, that would be good.
@saliu

I am still working on the RH documentation for DASD. You stated:

> the customer needs to run another command "fdasd" to do the partitioning. This step is not needed with LDL.

Here is what I have documented for RH; please help me verify or correct any errors. Are we missing an "fdasd" command?

(EXAMPLE)

```
# chzdev -e 0.0.0a00
dasd-eckd  0.0.0a00  yes  yes  dasde
```

Formatting is then issued, which gives the device a partition as part of the formatting:

```
# dasdfmt /dev/dasde -b 4096 -p -y -F -d ldl
Releasing space for the entire device...
Skipping format check due to --force.
Finished formatting the device.
Rereading the partition table... ok
```

ITEM 2

If for some reason we had a failed OSD/device, what should the cleaning of the device be?

```
# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
dasda      94:0    0 103.2G  0 disk
|-dasda1   94:1    0   384M  0 part /boot
`-dasda2   94:2    0 102.8G  0 part /sysroot
dasde      94:16   0 811.6G  0 disk
`-dasde1   94:17   0 811.6G  0 part
```

On the partition?

```
# /usr/bin/dd if=/dev/zero of=/dev/dasde1 bs=1M count=10 conv=fsync
```

Or on the whole device, dasde?

```
# /usr/bin/dd if=/dev/zero of=/dev/dasde bs=1M count=10 conv=fsync
```
Formatting disks and partitioning, from the customer script:

```
for x in $storage_nodes; do oc debug node/${x} -- chroot /host chzdev -e 0.0.0a00; done
for x in $storage_nodes; do oc debug node/${x} -- chroot /host dasdfmt /dev/dasde -b 4096 -p -y -F; done
for x in $storage_nodes; do oc debug node/${x} -- chroot /host fdasd -a /dev/dasde; done
```
As discussed in the call, the commands for formatting look good. Perhaps add a `sleep 60` after the dasdfmt command to wait for the formatting to finish. This is only needed because the customer is using ESE DASD, which is by default formatted in quick mode (only the first two tracks); a full-mode dasdfmt of a big disk can take longer than an hour.

One additional recommendation to the customer: to make best use of PAV when formatting a DASD that has one base device (0a00) and four alias devices (0afc, 0afd, 0afe, 0aff), specify five cylinders per format request by adding the option "-r 5":

```
dasdfmt /dev/dasde -b 4096 -p -y -F -r 5
```

I'm wondering why they need to run 3 loops to execute these 3 commands. Isn't it possible to run all three commands at once for each node?
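As a sketch of that consolidation (untested; it assumes a POSIX shell is available in the node's /host chroot, and reuses the device ID, device name, and the sleep and -r 5 suggestions from this BZ):

```
# Hypothetical single loop: one oc debug session per node runs all three steps
for x in $storage_nodes; do
  oc debug node/${x} -- chroot /host sh -c \
    'chzdev -e 0.0.0a00 && dasdfmt /dev/dasde -b 4096 -p -y -F -r 5 && sleep 60 && fdasd -a /dev/dasde'
done
```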