Bug 2222728

Summary: [IBM Z] ODF deployed on IBM Z with DASD (BlueStore FAILED ceph_assert)
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Version: 4.12
Reporter: khover
Assignee: Adam Kupczyk <akupczyk>
QA Contact: Elad <ebenahar>
Status: NEW
Severity: high
Priority: high
Type: Bug
Hardware: Unspecified
OS: Unspecified
Flags: khover: needinfo? (tstober)
CC: akandath, akupczyk, bniver, gjose, glaw, jquinn, mgokhool, mhackett, muagarwa, nojha, odf-bz-bot, rzarzyns, saliu, sarora, sostapov, tstober

Description khover 2023-07-13 14:54:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OSD in CLBO (CrashLoopBackOff):

inferring bluefs devices from bluestore path
2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs
2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 bluestore(/var/lib/ceph/osd/ceph-1) _open_db erroring opening db:
/builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::expand_devices(std::ostream&)' thread 3ff8ab68800 time 2023-07-12T09:21:44.526558+0000
/builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: 6911: FAILED ceph_assert(r == 0)
 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x3ff81a6d8be]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x26dad2) [0x3ff81a6dad2]
 3: (BlueStore::expand_devices(std::ostream&)+0x54a) [0x2aa2fb6d18a]
 4: main()
 5: __libc_start_main()
 6: ceph-bluestore-tool(+0x1f4bec) [0x2aa2fa74bec]
 7: [(nil)]
*** Caught signal (Aborted) **
 in thread 3ff8ab68800 thread_name:ceph-bluestore-
2023-07-12T09:21:44.524+0000 3ff8ab68800 -1 /builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::expand_devices(std::ostream&)' thread 3ff8ab68800 time 2023-07-12T09:21:44.526558+0000
/builddir/build/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: 6911: FAILED ceph_assert(r == 0)


Version of all relevant components (if applicable):

ODF 4.12

ceph-16.2.10

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

The IBM Z DASD POC is on hold.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 3 khover 2023-07-13 17:08:10 UTC
Regarding:

 Errors like this one usually are symptoms of an underlying hardware
 problem. Has it been excluded?


We will need an IBM Z expert for that.

Comment 4 tstober 2023-07-18 07:17:14 UTC
Abdul tried to reproduce the issue on my DASD-based cluster. After applying the certificate change, the nodes got rebooted and ODF came back completely healthy, without issues.
Version tried is:
OCP 4.12.23
ODF 4.12.4

Comment 5 tstober 2023-07-18 07:18:27 UTC
Question:
Was the customer able to create a new OSD and re-attach it to the cluster? And did the cluster work after the replacement?

Comment 6 tstober 2023-07-18 09:15:10 UTC
And one more question:
Can someone please verify that the logs provided by the customer were created on the correct date (July 14th, NOT June 26th)?

Comment 7 khover 2023-07-18 10:42:58 UTC
The customer has not replaced the OSD yet, in case further info is needed for this BZ.


All logs surrounding this BZ are from July 13/14.


2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is ahead of WALs

Errors like this one usually are symptoms of an underlying hardware
problem. Has it been excluded?

Comment 10 saliu 2023-07-18 14:08:46 UTC
(In reply to khover from comment #7)
> The customer has not replaced the OSD at this time in case further info was
> needed for this BZ.
> 
> 
> All logs surrounding this BZ are from July 13/14.
> 
> 
> 2023-07-12T09:21:44.004+0000 3ff8ab68800 -1 rocksdb: Corruption: SST file is
> ahead of WALs
> ```
> 
> Errors like this one usually are symptoms of an underlying hardware
> problem. Has it been excluded?

Hi Kevan, 
The customer ran the script dbginfo.sh on Friday during the call and collected a DBGINFO-... tarball. Did you get that data?

Comment 11 khover 2023-07-18 15:01:44 UTC
Hi,

Yes, the tarball is uploaded to supportshell.

supportshell/cases/03539410

DBGINFO-2023-06-26-07-08-03-c02ns001-25B658.tgz

Comment 12 saliu 2023-07-18 15:25:26 UTC
(In reply to khover from comment #11)
> Hi,
> 
> Yes, the tarball is uploaded to supportshell.
> 
> supportshell/cases/03539410
> 
> DBGINFO-2023-06-26-07-08-03-c02ns001-25B658.tgz

But this is not the one taken on last Friday, it's an old one. The one taken during the call on Friday should have a name like DBGINFO-2023-07-14-...

Comment 13 khover 2023-07-18 16:34:50 UTC
I'll reach out to the customer for that.

Comment 14 khover 2023-07-19 11:38:12 UTC
Hi 

The customer states that DBGINFO-2023-07-14-13-30-46 was collected for the cluster in case 03562792/BZ 2223380.

Log entries (bad CRC):

 2023-07-17T08:06:34.180507921Z debug     -2> 2023-07-17T08:06:34.146+0000 3ff8926a500 -1 osd.0 0 failed to load OSD map for epoch 14, got 0 bytes

=======================================================================

Do we need a fresh debug collected for this cluster ?

I can get that if needed.

Comment 15 khover 2023-07-19 11:52:40 UTC
We may have identified the issue in gchat collab.

I asked whether there was a difference between the IBM test env and the customer env.

The same DASD ID, 0.0.0a00, is being partitioned on each node for the OSDs.

Would the correct way to configure it look like the following?

# lszdev
TYPE         ID                          ON   PERS  NAMES
dasd-eckd    0.0.0100                    yes  no    dasda
dasd-eckd    0.0.0190                    no   no
dasd-eckd    0.0.0191                    no   no
dasd-eckd    0.0.01fd                    yes  no              <<-- use for node 1 
dasd-eckd    0.0.01fe                    yes  no              <<-- use for node 2
dasd-eckd    0.0.01ff                    yes  no
dasd-eckd    0.0.0592                    no   no
dasd-eckd    0.0.0a00                    yes  yes             <<-- use for node 3 

@Abdul

Yes, Kevan,

Please use different DASD IDs for each OSD, as you listed below.

As Sa mentioned, we always had different DASD IDs for the OSD disks in our test environment.

=============================================================

This will be set up in the customer env for case 03562792/BZ 2223380 for testing initially, in case we need additional info collected for this case/BZ.
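The per-node assignment discussed above can be sketched as a dry run. This is an illustration only: the node names and the node-to-DASD-ID pairing below are assumptions, not taken from the customer environment, and the commands are printed rather than executed.

```shell
# Dry-run sketch: build one chzdev enable command per node, each node
# getting its own DASD ID (pairing below is illustrative, not real).
pairs="node1:0.0.01fd node2:0.0.01fe node3:0.0.0a00"
cmds=""
for pair in $pairs; do
  node=${pair%%:*}   # text before the first ':' is the node name
  id=${pair##*:}     # text after the last ':' is the DASD ID
  cmds="${cmds}oc debug node/${node} -- chroot /host chzdev -e ${id}
"
done
printf '%s' "$cmds"
```

In the real environment each printed line would be run as-is, giving every OSD its own base DASD device instead of partitioning the same ID (0.0.0a00) on every node.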

Comment 18 Mariya Gokhool 2023-07-21 09:41:33 UTC
Raising the escalation flag on the BZ ticket to match the ongoing internal escalation.

The deployment of applications on OpenShift on Z is being blocked by these issues.

Comment 19 saliu 2023-07-24 12:43:52 UTC
This morning in the customer call, a problem was identified: the block device /dev/dasde1 had been accidentally overwritten and showed up as a file of size 0 in the filesystem.
The customer would like to correct it and reinstall ODF. The result will be shown in the customer call later today.

Comment 21 khover 2023-07-24 13:41:06 UTC
This is not the root cause of the initial CRC errors.

 - We observed that reading 4k bytes from /dev/dasde1 returned 0 bytes. Running strace further showed the read() call returning 0. From the CLI history, we observed that dd if=/dev/zero of=/dev/dasde1 had been run to wipe the disk as part of earlier troubleshooting.

This dd was run with the customer in an attempt to recover the OSD down/CLBO outlined in the initial BZ description.
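The read() observation above can be demonstrated with a regular file standing in for the overwritten device node (the comment notes /dev/dasde1 showed up as a size-0 file). This is an illustration of the symptom, not the original diagnostic:

```shell
# Illustration: reading 4k from a zero-length target returns 0 bytes,
# matching what was observed on the overwritten /dev/dasde1.
tmp=$(mktemp)                # temp file as a stand-in for the device node
: > "$tmp"                   # ensure it is 0 bytes long
bytes=$(dd if="$tmp" bs=4k count=1 2>/dev/null | wc -c | tr -d ' ')
echo "read ${bytes} bytes"   # → read 0 bytes
rm -f "$tmp"
```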

Comment 24 saliu 2023-07-25 13:38:48 UTC
Yesterday in the customer call we did a fresh installation of OCP and ODF. Then, after the certificate update and reboot, everything worked fine.
The only difference compared with the previous setup is the disk layout: CDL (compatible disk layout) is used instead of LDL (Linux disk layout). https://www.ibm.com/docs/en/linux-on-systems?topic=know-disk-layout-summary
Although we could reproduce the customer issue using LDL formatting in our lab, the CDL formatting is the default for DASD disks and more recommended. It is also required for OCP installation on IBM Z: https://docs.openshift.com/container-platform/4.13/installing/installing_ibm_z/installing-ibm-z.html

The difference is that after formatting with CDL, the customer needs to run another command, "fdasd", to do the partitioning. This step is not needed with LDL, which may lead to some confusion.
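A hedged sketch of the two sequences described above (the device name /dev/dasde is taken from the comments; the commands are held in strings and printed rather than executed, since they target real DASD hardware):

```shell
# CDL (the dasdfmt default layout): format, then partition explicitly.
cdl_steps="dasdfmt /dev/dasde -b 4096 -p -y -F
fdasd -a /dev/dasde"

# LDL: the format itself leaves a usable partition, so no fdasd step.
ldl_steps="dasdfmt /dev/dasde -b 4096 -p -y -F -d ldl"

printf 'CDL:\n%s\n\nLDL:\n%s\n' "$cdl_steps" "$ldl_steps"
```

The "fdasd -a" form auto-creates a single partition spanning the disk, which is the extra step CDL requires.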

Comment 25 saliu 2023-07-26 12:39:21 UTC
> Although we could reproduce the customer issue using LDL formatting in our
> lab, the CDL formatting is the default for DASD disks and more recommended.

Just reviewed the comment I wrote yesterday and need to correct it: we could NOT reproduce the customer issue using LDL formatting...

Comment 26 khover 2023-07-27 19:29:16 UTC
Customer updated today:

After the call we did some testing. We reinstalled the OpenShift and ODF cluster using LDL-formatted disks and noticed inconsistencies with the install compared to the one we did on the call. We couldn't create an ODF Storage Cluster, as it wasn't available; only a generic storage system was available.

We then rebuilt the OpenShift and ODF cluster again using CDL formatting, used our scripts and YAMLs rather than the UI, and it worked fine. We then performed multiple updates that triggered reboots, and everything is fine at the moment. I have attached the YAMLs for the storage components; it would be good if these could be checked to make sure they look fine.

Comment 27 khover 2023-07-27 19:39:53 UTC
@saliu

I am still working on the RH documentation for DASD.

You stated:

     > the customer needs to run another command "fdasd" to do the partitioning. This step is not needed with LDL.


Here is what I have documented for RH, please help me verify or correct any errors.

Are we missing an "fdasd" command?

( EXAMPLE )

# chzdev -e 0.0.0a00

dasd-eckd    0.0.0a00                    yes  yes   dasde

Formatting is issued, which creates a partition on the device as part of the formatting.

# dasdfmt /dev/dasde -b 4096 -p -y -F -d ldl
Releasing space for the entire device...
Skipping format check due to --force.
Finished formatting the device.
Rereading the partition table... ok


ITEM 2

If for some reason we had a failed OSD/device, should the cleaning of the device be done as follows?

# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
dasda     94:0    0 103.2G  0 disk
|-dasda1  94:1    0   384M  0 part /boot
`-dasda2  94:2    0 102.8G  0 part /sysroot
dasde     94:16   0 811.6G  0 disk
`-dasde1  94:17   0 811.6G  0 part

On the partition?

# /usr/bin/dd if=/dev/zero of=/dev/dasde1 bs=1M count=10 conv=fsync

Or on the whole device, dasde:

# /usr/bin/dd if=/dev/zero of=/dev/dasde bs=1M count=10 conv=fsync

Comment 28 khover 2023-07-28 12:44:54 UTC
# formatting disks and partitioning from customer script 

for x in $storage_nodes; do oc debug node/${x} -- chroot /host chzdev -e 0.0.0a00; done
for x in $storage_nodes; do oc debug node/${x} -- chroot /host dasdfmt /dev/dasde -b 4096 -p -y -F; done
for x in $storage_nodes; do oc debug node/${x} -- chroot /host fdasd -a /dev/dasde; done

Comment 29 saliu 2023-07-28 13:43:45 UTC
As discussed in the call, the commands for formatting look good. Perhaps add a "sleep 60" after the dasdfmt command to wait for the formatting to finish. This is only needed because the customer is using ESE DASD, which by default is formatted in quick mode (only the first two tracks). A full-mode dasdfmt of a big disk can take longer than an hour.

One additional recommendation to the customer: to make best use of PAV when formatting a DASD that has one base device (0a00) and four alias devices (0afc, 0afd, 0afe, 0aff), specify five cylinders by adding the option "-r 5":
dasdfmt /dev/dasde -b 4096 -p -y -F -r 5

I'm wondering why they need to run 3 loops to execute these 3 commands. Isn't it possible to run all three commands at once for each node?
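The suggestion above could plausibly be sketched as one debug session per node running all three steps (plus the suggested sleep). The node names are illustrative assumptions; the plan is printed rather than executed, since the real commands need the cluster and the DASD hardware:

```shell
# Sketch: collapse the customer's three loops into one 'oc debug' per node.
# Device address and steps mirror the customer's script; node names are made up.
storage_nodes="node1 node2 node3"
steps='chzdev -e 0.0.0a00 && dasdfmt /dev/dasde -b 4096 -p -y -F && sleep 60 && fdasd -a /dev/dasde'
plan=""
for x in $storage_nodes; do
  plan="${plan}oc debug node/${x} -- chroot /host sh -c '${steps}'
"
done
printf '%s' "$plan"
```

This keeps one debug pod per node instead of three, with the steps chained by && so a failed format stops before partitioning.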