Description of problem (please be as detailed as possible and provide log snippets):

OCS had been OK for a week or so. The customer came into the office to find an OSD down and lots of alerts firing. All nodes were rebooted. One of the 3 OSDs is now in CLBO (CrashLoopBackOff). Logs show an init container error from the 'encryption-open' container. The log message is:

+ KEY_FILE_PATH=/etc/ceph/luks_key
+ BLOCK_PATH=/var/lib/ceph/osd/ceph-0/block-tmp
+ DM_NAME=ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
+ DM_PATH=/dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
+ dmsetup version
Library version:   1.02.175-RHEL8 (2021-01-28)
Driver version:    4.43.0
+ '[' -b /dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt ']'
+ open_encrypted_block
+ echo 'Opening encrypted device /var/lib/ceph/osd/ceph-0/block-tmp at /dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt'
Opening encrypted device /var/lib/ceph/osd/ceph-0/block-tmp at /dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
+ cryptsetup luksOpen --verbose --disable-keyring --allow-discards --key-file /etc/ceph/luks_key /var/lib/ceph/osd/ceph-0/block-tmp ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
Device /var/lib/ceph/osd/ceph-0/block-tmp is not a valid LUKS device.
WARNING: Locking directory /run/cryptsetup is missing!
Command failed with code -1 (wrong or missing parameters).

From the affected node:

sh-4.4# dd if=/dev/sdd count=1 | hexdump -C
1+0 records in
1+0 records out
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
512 bytes copied, 7.6701e-05 s, 6.7 MB/s
00000200
sh-4.4# cryptsetup luksDump /dev/sdd | grep Slot
Device /dev/sdd is not a valid LUKS device.

Version of all relevant components (if applicable):
Azure deployment
OCS version: 4.8.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
User impact: an OSD is down and the cluster is unhealthy.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Can this issue be reproduced?
Unknown

Can this issue be reproduced from the UI?
Unknown

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
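For reference, a minimal sketch of how the LUKS state of the backing disk can be checked manually from a debug shell on the affected node. The /dev/sdd path is taken from the output above and must be adjusted to the device that actually backs the failing OSD:

# Show the device and its major/minor numbers to confirm it is the one the OSD PV uses
lsblk -o NAME,MAJ:MIN,SIZE,TYPE,FSTYPE /dev/sdd

# Ask cryptsetup whether a LUKS header is present (exit code 0 means yes)
cryptsetup isLuks -v /dev/sdd

# Dump the header and key slots if one is found
cryptsetup luksDump /dev/sdd

# A healthy LUKS device begins with the "LUKS\xba\xbe" magic in its first sector;
# all zeroes, as in the output above, means the header was wiped or this is not
# the original disk
dd if=/dev/sdd bs=512 count=1 2>/dev/null | hexdump -C | head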
Waiting for more feedback/investigation from support. Presently, it seems that the device is simply not "our" device or has been wiped. It's unclear why though. In any case, there is nothing we can do about that situation.
shan, can a procedure be developed for replacing a disk in an OCS/Azure environment? Our docs currently say to replace the node in an OCS/Azure env. I do not have access to an OCS/Azure cluster to test/develop this process.

Regards,
Kevan
Please reach out to QE to get an azure environment to try out the OSD replacement.
Please reopen if we can get a repro environment.
We have another cu hitting this.

As encryption was enabled at install time, we removed the dm-crypt managed device-mapper and then the persistent volume. Our assumption was that once the node had the NVMe replaced, the OSD would be replaced cleanly.

After HW remediation, we found:
- osd.15 did not recover.
- None of the other OSDs from the affected server recovered either.
- 8 OSDs down, and osd.15 in "new" state but not yet assigned to a container storage node.

Pod status for the node (ukwest11):

ah-1107689-001:~$ oc get pods -o wide | grep lcami514xsdi004.emea.sdi.corp.bankofamerica.com
csi-cephfsplugin-h44pn                                            3/3   Running                 0     80d    30.195.93.74   lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
csi-rbdplugin-nrnn8                                               3/3   Running                 0     80d    30.195.93.74   lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-crashcollector-2a9a1b271b41fae96a8d44220779ce53-w29vk   1/1   Running                 0     2d21h  30.125.6.15    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-12-b46b7fb86-lz2ft                                  0/2   Init:CrashLoopBackOff   820   3d2h   30.125.6.17    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-49-b9bd64bd7-pb5wk                                  0/2   Init:CrashLoopBackOff   819   3d2h   30.125.6.16    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-58-847c45b46c-vf5mw                                 0/2   Init:CrashLoopBackOff   819   3d2h   30.125.6.18    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-8-74b7dcb4fd-g7srp                                  0/2   Init:CrashLoopBackOff   820   3d2h   30.125.6.22    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-83-5db8f96f4f-k66dt                                 0/2   Init:CrashLoopBackOff   820   3d2h   30.125.6.19    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-84-76f67b98c7-7rtsp                                 0/2   Init:CrashLoopBackOff   821   3d2h   30.125.6.21    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-9-54556bd979-bjx98                                  0/2   Init:CrashLoopBackOff   819   3d2h   30.125.6.20    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-32wr9j4-fsmv9          0/1   Completed               0     2d21h  30.125.6.24    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>   <none>

Extract from the OSD Init:CrashLoopBackOff log:

+ echo 'Opening encrypted device /var/lib/ceph/osd/ceph-12/block-tmp at /dev/mapper/ocs-deviceset-0-data-124qlt7-block-dmcrypt'
Opening encrypted device /var/lib/ceph/osd/ceph-12/block-tmp at /dev/mapper/ocs-deviceset-0-data-124qlt7-block-dmcrypt
+ cryptsetup luksOpen --verbose --disable-keyring --allow-discards --key-file /etc/ceph/luks_key /var/lib/ceph/osd/ceph-12/block-tmp ocs-deviceset-0-data-124qlt7-block-dmcrypt
Device /var/lib/ceph/osd/ceph-12/block-tmp is not a valid LUKS device.
WARNING: Locking directory /run/cryptsetup is missing!
Command failed with code -1 (wrong or missing parameters).
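For context, a rough sketch of how the dm-crypt mapping for a failed OSD can be inspected and torn down on the node before the hardware is replaced. This is only a sketch (run as root on the affected node; the mapping name is taken from the log above) and not the official replacement procedure:

# List active crypt mappings; the OSD ones follow the ocs-deviceset-...-block-dmcrypt pattern
dmsetup ls --target crypt

# Inspect the mapping for the failing OSD (name from the log above)
dmsetup info ocs-deviceset-0-data-124qlt7-block-dmcrypt
dmsetup status ocs-deviceset-0-data-124qlt7-block-dmcrypt

# Close the mapping before pulling the disk
cryptsetup luksClose ocs-deviceset-0-data-124qlt7-block-dmcrypt
# equivalent: dmsetup remove ocs-deviceset-0-data-124qlt7-block-dmcrypt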
Re-opening.

Just to make sure, you replaced an OSD drive and after rebooting the node, none of the OSDs are coming up?
What is osd.15? The one that disk got replaced?

Can I get all the init container logs for a failing OSD?
Can I access the env?

Thanks.
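For reference, a sketch of how the init container logs for one failing OSD pod can be collected. The pod name below is an example taken from the listing above, and the usual openshift-storage namespace is assumed:

POD=rook-ceph-osd-12-b46b7fb86-lz2ft
NS=openshift-storage

# List the pod's init containers
oc -n "$NS" get pod "$POD" -o jsonpath='{.spec.initContainers[*].name}{"\n"}'

# Grab the log of each init container, preferring the previous (crashed) attempt
for c in $(oc -n "$NS" get pod "$POD" -o jsonpath='{.spec.initContainers[*].name}'); do
  echo "===== $c ====="
  oc -n "$NS" logs "$POD" -c "$c" --previous 2>/dev/null || oc -n "$NS" logs "$POD" -c "$c"
done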
Hi Sebastien,

Just to make sure, you replaced an OSD drive and after rebooting the node, none of the OSDs are coming up? >> correct, CLBO

What is osd.15? The one that disk got replaced? >> correct

Can I get all the init container logs for a failing OSD?
Can I access the env?

All the logs for this case are in supportshell if you can access it:

/cases/03071794

-rw-rwxrw-+ 1 yank yank 5658918912 Nov  2 18:26 0010-must-gather-uscentral12.tar.gz
drwxrwxrwx+ 4 yank yank       4096 Nov  3 17:21 0020-tarball.tar.gz
drwxr-xr-x+ 3 yank yank       4096 Nov  3 17:20 0030-must-gather-useast12.tar.gz
-rw-rwxrw-+ 1 yank yank      50075 Nov  4 15:06 0040-ceph_daignostics_useast12.txt
-rw-rwxrw-+ 1 yank yank      50019 Nov  4 15:16 0050-useast12_case_03071794.txt
drwxrwxrwx+ 3 yank yank         42 Nov  5 06:55 0060-must-gather-useast12.tar.gz
-rw-rwxrw-+ 1 yank yank      49906 Nov  5 12:28 0070-ceph_daignostics_useast12.txt
-rw-rwxrw-+ 1 yank yank    1255390 Nov 11 18:04 0080-rook-ceph-operator-log.txt
-rw-rwxrw-+ 1 yank yank      29925 Nov 19 19:38 0090-ACHP-Replaceafaileddrive.pdf
-rw-rwxrw-+ 1 yank yank      37103 Nov 19 19:39 0100-rook-ceph-raw-notes.txt
-rw-rwxrw-+ 1 yank yank  468721039 Nov 22 22:29 0110-cluster-info-dump.txt
-rw-rw-rw-+ 1 yank yank       3010 Nov 24 09:51 0120-LocalVolumeSet.txt

Env is now stable but the cu is trying to avoid a similar situation in the future.

Not sure if you can access the env (the cu is Bank of America), but if needed I can ask or schedule a remote session for you to join.
(In reply to khover from comment #11)
> Hi Sebastien,
>
> Just to make sure, you replaced an OSD drive and after rebooting the node,
> none of the OSDs are coming up? >> correct CLBO
>
> What is osd.15? The one that disk got replaced? >> correct
>
> Can I get all the init container logs for a failing OSD?
> Can I access the env?
>
> All the logs for this case are in supportshell if you can access.
>
> /cases/03071794
>
> -rw-rwxrw-+ 1 yank yank 5658918912 Nov  2 18:26 0010-must-gather-uscentral12.tar.gz
> drwxrwxrwx+ 4 yank yank       4096 Nov  3 17:21 0020-tarball.tar.gz
> drwxr-xr-x+ 3 yank yank       4096 Nov  3 17:20 0030-must-gather-useast12.tar.gz
> -rw-rwxrw-+ 1 yank yank      50075 Nov  4 15:06 0040-ceph_daignostics_useast12.txt
> -rw-rwxrw-+ 1 yank yank      50019 Nov  4 15:16 0050-useast12_case_03071794.txt
> drwxrwxrwx+ 3 yank yank         42 Nov  5 06:55 0060-must-gather-useast12.tar.gz
> -rw-rwxrw-+ 1 yank yank      49906 Nov  5 12:28 0070-ceph_daignostics_useast12.txt
> -rw-rwxrw-+ 1 yank yank    1255390 Nov 11 18:04 0080-rook-ceph-operator-log.txt
> -rw-rwxrw-+ 1 yank yank      29925 Nov 19 19:38 0090-ACHP-Replaceafaileddrive.pdf
> -rw-rwxrw-+ 1 yank yank      37103 Nov 19 19:39 0100-rook-ceph-raw-notes.txt
> -rw-rwxrw-+ 1 yank yank  468721039 Nov 22 22:29 0110-cluster-info-dump.txt
> -rw-rw-rw-+ 1 yank yank       3010 Nov 24 09:51 0120-LocalVolumeSet.txt

Ok, I'll try to access the logs.

> Env is now stable but cu is trying to avoid similar situation in the future.

Does that mean OSDs are running normally now? What happened? Which action was taken to resolve this?

> Not sure if you can access the env cu is bank of america, but if needed I
> can ask or schedule a remote session for you to join.

Probably not needed anymore if the env is stable and I can access the logs.
(In reply to Sébastien Han from comment #12)
> (In reply to khover from comment #11)
> > Hi Sebastien,
> >
> > Just to make sure, you replaced an OSD drive and after rebooting the node,
> > none of the OSDs are coming up? >> correct CLBO
> >
> > What is osd.15? The one that disk got replaced? >> correct
> >
> > Can I get all the init container logs for a failing OSD?
> > Can I access the env?
> >
> > All the logs for this case are in supportshell if you can access.
> >
> > /cases/03071794
> >
> > -rw-rwxrw-+ 1 yank yank 5658918912 Nov  2 18:26 0010-must-gather-uscentral12.tar.gz
> > drwxrwxrwx+ 4 yank yank       4096 Nov  3 17:21 0020-tarball.tar.gz
> > drwxr-xr-x+ 3 yank yank       4096 Nov  3 17:20 0030-must-gather-useast12.tar.gz
> > -rw-rwxrw-+ 1 yank yank      50075 Nov  4 15:06 0040-ceph_daignostics_useast12.txt
> > -rw-rwxrw-+ 1 yank yank      50019 Nov  4 15:16 0050-useast12_case_03071794.txt
> > drwxrwxrwx+ 3 yank yank         42 Nov  5 06:55 0060-must-gather-useast12.tar.gz
> > -rw-rwxrw-+ 1 yank yank      49906 Nov  5 12:28 0070-ceph_daignostics_useast12.txt
> > -rw-rwxrw-+ 1 yank yank    1255390 Nov 11 18:04 0080-rook-ceph-operator-log.txt
> > -rw-rwxrw-+ 1 yank yank      29925 Nov 19 19:38 0090-ACHP-Replaceafaileddrive.pdf
> > -rw-rwxrw-+ 1 yank yank      37103 Nov 19 19:39 0100-rook-ceph-raw-notes.txt
> > -rw-rwxrw-+ 1 yank yank  468721039 Nov 22 22:29 0110-cluster-info-dump.txt
> > -rw-rw-rw-+ 1 yank yank       3010 Nov 24 09:51 0120-LocalVolumeSet.txt
>
> Ok I'll try to access the logs.
>
> > Env is now stable but cu is trying to avoid similar situation in the future.
>
> Does that mean OSDs are running normally now? What happened? Which action
> was taken to resolve this?
>
> > Not sure if you can access the env cu is bank of america, but if needed I
> > can ask or schedule a remote session for you to join.
>
> Probably not needed anymore if the env is stable and I can access the logs.

> Does that mean OSDs are running normally now? What happened? Which action was taken to resolve this?

Yes, they are running normally now. The cu had to remove the OSDs from the cluster, wipe the disks, and redeploy them as new OSDs.
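For reference, the "wipe the disks, redeploy as new OSDs" step usually boils down to something like the sketch below, run on the node that owns the disk. The disk path and mapping name are examples and must match the OSD that was removed; this is a sketch, not the documented replacement procedure:

# Close any leftover dm-crypt mapping for the removed OSD first
dmsetup ls --target crypt                                          # find the ocs-deviceset-*-block-dmcrypt name
cryptsetup luksClose ocs-deviceset-0-data-124qlt7-block-dmcrypt    # example name from this case

DISK=/dev/sdd                                                      # example: disk that backed the removed OSD
sgdisk --zap-all "$DISK"                                           # destroy GPT/MBR partition tables
wipefs -a "$DISK"                                                  # clear remaining LUKS/LVM/filesystem signatures
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync      # zero the first 100 MiB for good measure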
(In reply to khover from comment #13)
> (In reply to Sébastien Han from comment #12)
> > (In reply to khover from comment #11)
> > > Hi Sebastien,
> > >
> > > Just to make sure, you replaced an OSD drive and after rebooting the node,
> > > none of the OSDs are coming up? >> correct CLBO
> > >
> > > What is osd.15? The one that disk got replaced? >> correct
> > >
> > > Can I get all the init container logs for a failing OSD?
> > > Can I access the env?
> > >
> > > All the logs for this case are in supportshell if you can access.
> > >
> > > /cases/03071794
> > >
> > > [file listing snipped, same as comment #11]
> >
> > Ok I'll try to access the logs.
> >
> > > Env is now stable but cu is trying to avoid similar situation in the future.
> >
> > Does that mean OSDs are running normally now? What happened? Which action
> > was taken to resolve this?
> >
> > > Not sure if you can access the env cu is bank of america, but if needed I
> > > can ask or schedule a remote session for you to join.
> >
> > Probably not needed anymore if the env is stable and I can access the logs.
>
> > Does that mean OSDs are running normally now? What happened? Which action was taken to resolve this?
>
> Yes, they are running normally now, cu had to remove osds from cluster, wipe
> the disks, re deploy as new osds.

Wow, that's drastic. Thanks.

Shay, Rachael, have you ever observed something similar during testing? If you reboot a node, have you ever seen OSDs not coming up again?
ok, as discussed in the support call, we are going to schedule a call with the customer to inspect the nodes. Moving to 4.9.z too.
(In reply to Sébastien Han from comment #16)
> ok, as discussed in the support call, we are going to schedule a call with
> the customer to inspect the nodes. Moving to 4.9.z too.

Reached out to the customer re their availability windows for a remote session this week.

I will update when they respond.

Confirmed the deployment is OCS with LSO on AWS.
(In reply to khover from comment #17)
> (In reply to Sébastien Han from comment #16)
> > ok, as discussed in the support call, we are going to schedule a call with
> > the customer to inspect the nodes. Moving to 4.9.z too.
>
> Reached out to customer re their availability windows for a remote session
> this week.
>
> I will update when they respond.
>
> Confirmed the deployment is OCS with LSO on AWS.

Hi Sebastien, the cu provided the following availability for a remote session over the next few days:

Thursday 2nd December - 11:00 EST / 16:00 GMT
Friday 3rd December - 09:00 EST / 14:00 GMT
Monday 6th December - directly after our TAM call, approximately 09:30 EST / 14:30 GMT
Let's go with Monday 6th December at 09:30 EST / 14:30 GMT, thanks. Please send me an invite.
(In reply to Sébastien Han from comment #19)
> Let's go with Monday 6th December at 09:30 EST / 14:30 GMT, thanks.
> Please send me an invite.

cu confirmed Monday 6th December at 09:30 EST / 14:30 GMT.
https://goto.webex.com/goto/j.php?MTID=m4a5fb1d55d63628641945801e07d3ea1
Thanks!
Customer version is 4.7.3, so I changed the version to 4.7.

Also, this might help a bit: https://bugzilla.redhat.com/show_bug.cgi?id=2030291
Hi Sebastien,

Just curious if any discovery was made on this issue.

Do you want me to link https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case as well?

cheers
(In reply to khover from comment #23)
> Hi Sebastien,
>
> Just curious if any discovery was made on this issue.
>
> do you want me to link https://bugzilla.redhat.com/show_bug.cgi?id=2030291
> to the case as well ?
>
> cheers

Hi, no progress so far, I think we are stuck at the same point we were before. Without any logs we cannot debug further :(

We can add https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case, which will certainly help with the upgrade.
(In reply to Sébastien Han from comment #24) > (In reply to khover from comment #23) > > Hi Sebastien, > > > > Just curious if any discovery was made on this issue. > > > > do you want me to link https://bugzilla.redhat.com/show_bug.cgi?id=2030291 > > to the case as well ? > > > > cheers > > Hi, no progress so far, I think we are stuck at the same point we were > before. Without any logs we cannot debug further :( > We can add https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case, > which will certainly help with the upgrade. Can I pass along to cu any specific logs/data you need to capture when the issue happens again ? cheers
(In reply to khover from comment #25)
> Can I pass along to cu any specific logs/data you need to capture when the
> issue happens again?
>
> cheers

Hi Kevan,

Next time this happens (see the command sketch below for a concrete walk-through):

* ideally, do not re-install the nodes, so we can hop onto the machine
* identify the failing OSD
* find the associated PV
* log onto the machine where that PV is mapped
* find the underlying block device of the PV (use the "lsblk" and "ls -al" commands to find the major/minor numbers)
* make sure it is the same device used by the OSD deployment
* run cryptsetup luksDump <disk>

If the disk shows the error "not a valid LUKS device", there is a problem.

If it's bare metal, this could mean:

* this is a different disk (unlikely?)
* the disk was wiped (FYI, the OSD removal job does not wipe the disk)

If it's a virtualized/cloud env, it more likely means that it's a different disk, e.g. a different EBS volume got attached.

But again, being hands-on in the env would be ideal. I have to close this again, unfortunately.

Thanks
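A rough command-level sketch of the steps above. Pod/PVC/PV/device names are placeholders, and the usual openshift-storage namespace is assumed:

# 1. Identify the failing OSD and the PVC/PV behind it
oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide | grep -vi running
oc -n openshift-storage get deployment rook-ceph-osd-12 -o yaml | grep -i pvc    # example OSD id
oc -n openshift-storage get pvc <pvc-name>                                       # shows the bound PV
oc get pv <pv-name> -o yaml | grep -A2 'local:'                                  # LSO: local device path

# 2. On the node hosting that PV, resolve the device and compare its major/minor
#    numbers with the block device the OSD deployment mounts
ls -al /dev/disk/by-id/
lsblk -o NAME,MAJ:MIN,SIZE,SERIAL,TYPE

# 3. Check whether the device still carries a LUKS header
cryptsetup luksDump /dev/sdX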