Description of problem (please be as detailed as possible and provide log snippets):

After a cluster-wide reboot on cert auth, an ODF node reboot removed the DASD partition and all 3 OSDs were lost.

Customer followed this IBM documentation to partition the DASD:
https://www.ibm.com/docs/en/linux-on-systems?topic=architecture-storage
See Section "4.1.2 Steps specific for DASD devices"

ODF deployed successfully with LSO and the OSDs mapped to dasde1.

To use host binaries, run `chroot /host`
NAME       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop1        7:1    0 811.6G  0 loop
dasda       94:0    0 103.2G  0 disk
|-dasda1    94:1    0   384M  0 part /host/boot
`-dasda2    94:2    0 102.8G  0 part /host/sysroot
dasde       94:16   0 811.6G  0 disk
`-dasde1    94:17   0 811.6G  0 part

After the cluster-wide reboot on cert auth, the OSD pods fail with:

MapVolume.EvalHostSymlinks failed for volume "local-pv-ef04e88d" : lstat /dev/disk/by-id/ccw-IBM.750000000KHF61.baee.40-part1: no such file or directory

Events log:

3m27s  Warning  FailedMapVolume  pod/rook-ceph-osd-0-59c9db848-5rp9f  MapVolume.EvalHostSymlinks failed for volume "local-pv-ef04e88d" : lstat /dev/disk/by-id/ccw-IBM.750000000KHF61.baee.40-part1: no such file or directory

23m  Warning  FailedMount  pod/rook-ceph-osd-0-59c9db848-5rp9f  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-odf-cluster-storage-0-data-1wgxd7], unattached volumes=[ocs-deviceset-odf-cluster-storage-0-data-1wgxd7 ocs-deviceset-odf-cluster-storage-0-data-1wgxd7-bridge kube-api-access-25kpz rook-data rook-config-override rook-ceph-log rook-ceph-crash run-udev]: timed out waiting for the condition

NAME       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
dasda       94:0    0 103.2G  0 disk
|-dasda1    94:1    0   384M  0 part /boot
`-dasda2    94:2    0 102.8G  0 part /sysroot
dasde       94:16   0 811.6G  0 disk

Note that the dasde1 partition is gone after the reboot.

Version of all relevant components (if applicable):
OCP/ODF 4.12

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Node reboot destroys the OSD path to dasde1.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Yes, on reboot of the node.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The DASD partition persists across reboot.

Additional info:
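One way to confirm this failure mode is to compare the device path the local PV references with what actually exists on the node. A minimal triage sketch, assuming oc debug access to the affected node (the PV name and by-id link are taken from the events above; <node-name> is a placeholder):

# Check whether the by-id symlink the PV needs still exists on the node:
oc debug node/<node-name> -- chroot /host sh -c \
  'ls -l /dev/disk/by-id/ | grep ccw-IBM.750000000KHF61.baee.40'

# Compare with the device path recorded in the LSO-provisioned PV:
oc get pv local-pv-ef04e88d -o jsonpath='{.spec.local.path}{"\n"}'

If the first command shows the base ccw-... link but no -part1 entry while the PV still points at the -part1 path, the partition itself was lost across the reboot.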
Hi. I got to know that it could just be an issue with how the partition was configured and not a real issue at all. Are there any new updates on this BZ?
Hi Santosh,

I believe you're correct. It seemed to be an issue with the partition config. The customer is going to do some testing to ensure partition persistence on reboot. I asked the customer to upload the testing details to the case and will share them here when provided. The next action items will most likely target documentation, supportability, and QE testing. I will open a new Doc BZ for that.
Thanks for the update, Kevan. I'll wait for more details.
Hi Santosh,

This was definitely a misconfiguration by the customer. I suspect they ran # chzdev -e dasde instead of the ID.

======================================

lszdev shows that the persistent flag is not set:

[root@c02ns001 ~]# lszdev
TYPE         ID                          ON   PERS  NAMES
dasd-eckd    0.0.0100                    yes  no    dasda
dasd-eckd    0.0.0190                    no   no
dasd-eckd    0.0.0191                    no   no
dasd-eckd    0.0.01fd                    yes  no
dasd-eckd    0.0.01fe                    yes  no
dasd-eckd    0.0.01ff                    yes  no
dasd-eckd    0.0.0592                    no   no
dasd-eckd    0.0.0a00                    yes  no    dasde
dasd-eckd    0.0.0afc                    yes  no
dasd-eckd    0.0.0afd                    yes  no
dasd-eckd    0.0.0afe                    yes  no
dasd-eckd    0.0.0aff                    yes  no
qeth         0.0.2d00:0.0.2d01:0.0.2d02  yes  no    enc2d00
generic-ccw  0.0.0009                    yes  no
generic-ccw  0.0.000c                    no   no
generic-ccw  0.0.000d                    no   no
generic-ccw  0.0.000e                    no   no

Then the chzdev command is issued:

[root@c02ns001 ~]# chzdev -e 0.0.0a00
ECKD DASD 0.0.0a00 configured

Now the persistent flag is set:

[root@c02ns001 ~]# lszdev
TYPE         ID                          ON   PERS  NAMES
dasd-eckd    0.0.0100                    yes  no    dasda
dasd-eckd    0.0.0190                    no   no
dasd-eckd    0.0.0191                    no   no
dasd-eckd    0.0.01fd                    yes  no
dasd-eckd    0.0.01fe                    yes  no
dasd-eckd    0.0.01ff                    yes  no
dasd-eckd    0.0.0592                    no   no
dasd-eckd    0.0.0a00                    yes  yes   dasde
dasd-eckd    0.0.0afc                    yes  no
dasd-eckd    0.0.0afd                    yes  no
dasd-eckd    0.0.0afe                    yes  no
dasd-eckd    0.0.0aff                    yes  no
qeth         0.0.2d00:0.0.2d01:0.0.2d02  yes  no    enc2d00
generic-ccw  0.0.0009                    yes  no
generic-ccw  0.0.000c                    no   no
generic-ccw  0.0.000d                    no   no
generic-ccw  0.0.000e                    no   no

The format is issued with LDL instead of CDL, which creates a partition as part of the formatting:

[root@c02ns001 ~]# dasdfmt /dev/dasde -b 4096 -p -y -F -d ldl
Releasing space for the entire device...
Skipping format check due to --force.
Finished formatting the device.
Rereading the partition table... ok

lsblk shows the partition dasde1:

[root@c02ns001 ~]# lsblk
NAME       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
dasda       94:0    0 103.2G  0 disk
|-dasda1    94:1    0   384M  0 part /boot
`-dasda2    94:2    0 102.8G  0 part /sysroot
dasde       94:16   0 811.6G  0 disk
`-dasde1    94:17   0 811.6G  0 part

After a reboot, the partition persists:

[core@c02ns001 ~]$ lsblk
NAME       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0        7:0    0 811.6G  0 loop
dasda       94:0    0 103.2G  0 disk
|-dasda1    94:1    0   384M  0 part /boot
`-dasda2    94:2    0 102.8G  0 part /sysroot
dasde       94:16   0 811.6G  0 disk
`-dasde1    94:17   0 811.6G  0 part

I have put together the following KCS for the RH ODF team for awareness on DASD config:
https://access.redhat.com/solutions/7022104
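As a follow-up for anyone checking other nodes: this condition is easy to audit, since lszdev prints the ON and PERS columns shown above. A minimal sketch, assuming the default lszdev table layout (TYPE ID ON PERS NAMES):

#!/bin/bash
# Flag dasd-eckd devices that are enabled in the active configuration
# ($3 == "yes") but missing from the persistent configuration ($4 == "no").
lszdev dasd-eckd | awk 'NR > 1 && $3 == "yes" && $4 == "no" { print $2 }' |
while read -r id; do
    echo "DASD $id is active but not persistent; consider: chzdev -e $id"
done

Note that chzdev -e <ID> updates both the active and the persistent configuration by default, which is why the PERS column for 0.0.0a00 flips to yes above.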
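For completeness: if a customer wants to stay on CDL (the layout the IBM doc in the description covers), the partition must be created explicitly after formatting, since only LDL creates one implicitly. A sketch of that alternative path, reusing the device and blocksize from above (this is not what was run in this case):

# CDL format does not create a partition by itself:
dasdfmt /dev/dasde -b 4096 -y -d cdl
# fdasd -a auto-creates a single partition spanning the whole device:
fdasd -a /dev/dasde

Either way the partition shows up as dasde1, and it persists across reboots as long as the device itself is persistently enabled via chzdev.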
We may need to put this on hold for a moment. There are still pending issues in the customer environment, so I'm not sure we have a 100% complete picture of the doc update items needed at the moment.