Bug 2004746

Summary: OSD Failure with init container encryption-open logs reporting device is not a valid LUKS device
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: khover
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: high
Priority: high
Docs Contact:
Version: 4.7
CC: assingh, bniver, hnallurv, khover, madam, mhackett, mpandey, muagarwa, nravinas, ocs-bugs, odf-bz-bot, rgeorge, shan, srozen, tnielsen
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-14 10:28:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2030291
Bug Blocks:

Description khover 2021-09-15 20:56:21 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OCS had been running fine for about a week when the customer came into the office to find an OSD down and a lot of alerts firing. All nodes were rebooted.

One of the 3 OSDs is in CrashLoopBackOff (CLBO).
The logs show an error from the 'encryption-open' init container. The log message is:
+ KEY_FILE_PATH=/etc/ceph/luks_key
+ BLOCK_PATH=/var/lib/ceph/osd/ceph-0/block-tmp
+ DM_NAME=ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
+ DM_PATH=/dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
+ dmsetup version
Library version:   1.02.175-RHEL8 (2021-01-28)
Driver version:    4.43.0
+ '[' -b /dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt ']'
+ open_encrypted_block
+ echo 'Opening encrypted device /var/lib/ceph/osd/ceph-0/block-tmp at /dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt'
Opening encrypted device /var/lib/ceph/osd/ceph-0/block-tmp at /dev/mapper/ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
+ cryptsetup luksOpen --verbose --disable-keyring --allow-discards --key-file /etc/ceph/luks_key /var/lib/ceph/osd/ceph-0/block-tmp ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
Device /var/lib/ceph/osd/ceph-0/block-tmp is not a valid LUKS device.
WARNING: Locking directory /run/cryptsetup is missing!
Command failed with code -1 (wrong or missing parameters).
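
For context, a minimal sketch of the logic the 'encryption-open' init container appears to follow, reconstructed from the shell trace above (variable values are taken from the log; the actual Rook script may differ):

#!/bin/bash
# Sketch of the encryption-open init container flow (reconstructed, not the actual Rook script).
set -ex

KEY_FILE_PATH=/etc/ceph/luks_key
BLOCK_PATH=/var/lib/ceph/osd/ceph-0/block-tmp
DM_NAME=ocs-deviceset-managed-premium-2-data-0-p2rxb-block-dmcrypt
DM_PATH="/dev/mapper/${DM_NAME}"

open_encrypted_block() {
  echo "Opening encrypted device ${BLOCK_PATH} at ${DM_PATH}"
  # This is the call that fails with "not a valid LUKS device" when the
  # backing disk no longer carries a LUKS header.
  cryptsetup luksOpen --verbose --disable-keyring --allow-discards \
    --key-file "${KEY_FILE_PATH}" "${BLOCK_PATH}" "${DM_NAME}"
}

dmsetup version
# Only open the mapping if it is not already present.
if [ ! -b "${DM_PATH}" ]; then
  open_encrypted_block
fi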

From the affected node:

sh-4.4# dd if=/dev/sdd count=1 | hexdump -C
1+0 records in
1+0 records out
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
512 bytes copied, 7.6701e-05 s, 6.7 MB/s
00000200
sh-4.4# cryptsetup luksDump /dev/sdd | grep Slot
Device /dev/sdd is not a valid LUKS device.
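
For reference, a quick way to confirm whether a device still carries a LUKS header (a minimal sketch; /dev/sdd is the device from the output above):

# cryptsetup returns 0 if the device has a LUKS header, non-zero otherwise.
cryptsetup isLuks /dev/sdd && echo "LUKS header present" || echo "no LUKS header"

# An intact LUKS device starts with the magic bytes "LUKS\xba\xbe"; an all-zero
# first sector, as in the hexdump above, means the header is gone.
dd if=/dev/sdd bs=512 count=1 2>/dev/null | hexdump -C | head -n 2

# blkid should report TYPE="crypto_LUKS" for an intact header.
blkid /dev/sdd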

Version of all relevant components (if applicable):

Azure deployment

OCS version: 4.8.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

User impact: an OSD is down and the cluster is unhealthy.

Is there any workaround available to the best of your knowledge?

No 

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

5

Is this issue reproducible?

Unknown

Can this issue be reproduced from the UI?

Unknown 

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 Sébastien Han 2021-09-20 14:48:25 UTC
Waiting for more feedback/investigation from support.
Presently, it seems that the device is simply not "our" device or has been wiped.

It's unclear why though. In any case, there is nothing we can do about that situation.

Comment 5 khover 2021-09-21 15:24:59 UTC
shan,

Can a procedure be developed for replacing a disk in an OCS/Azure environment?

Our docs state to replace the whole node in an OCS/Azure env.

I do not have access to an OCS/Azure cluster to test/develop this process.

Regards,

Kevan

Comment 7 Travis Nielsen 2021-09-27 15:17:41 UTC
Please reach out to QE to get an azure environment to try out the OSD replacement.

Comment 8 Travis Nielsen 2021-10-11 15:21:28 UTC
Please reopen if we can get a repro environment.

Comment 9 khover 2021-11-24 16:24:35 UTC
We have another cu hitting this.

As encryption was enabled at the time of the install, we removed the dm-crypt managed device-mapper mapping and then the persistent volume.
Our assumption was that once the node had the NVMe replaced, the OSD would be replaced cleanly.

After HW remediation, we found:
- osd.15 did not recover.
- None of the other OSDs from the affected server recovered either: 8 OSDs down, and osd.15 in "new" state but not yet assigned to a container storage node.
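
For clarity, the removal described above roughly corresponds to steps like the following (a hedged sketch; the mapper name is taken from the OSD log further down, and the PV name is a placeholder, not from this case):

# Close the dm-crypt mapping left behind by the failed OSD (mapper name from the log below).
cryptsetup luksClose ocs-deviceset-0-data-124qlt7-block-dmcrypt \
  || dmsetup remove ocs-deviceset-0-data-124qlt7-block-dmcrypt

# Delete the PersistentVolume that pointed at the failed disk (PV name is a placeholder).
oc delete pv local-pv-xxxxxxxx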


Pod status for the node:
(ukwest11) ah-1107689-001:~$ oc get pods -o wide|grep lcami514xsdi004.emea.sdi.corp.bankofamerica.com
csi-cephfsplugin-h44pn                                            3/3     Running                 0          80d     30.195.93.74    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
csi-rbdplugin-nrnn8                                               3/3     Running                 0          80d     30.195.93.74    lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-crashcollector-2a9a1b271b41fae96a8d44220779ce53-w29vk   1/1     Running                 0          2d21h   30.125.6.15     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-12-b46b7fb86-lz2ft                                  0/2     Init:CrashLoopBackOff   820        3d2h    30.125.6.17     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-49-b9bd64bd7-pb5wk                                  0/2     Init:CrashLoopBackOff   819        3d2h    30.125.6.16     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-58-847c45b46c-vf5mw                                 0/2     Init:CrashLoopBackOff   819        3d2h    30.125.6.18     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-8-74b7dcb4fd-g7srp                                  0/2     Init:CrashLoopBackOff   820        3d2h    30.125.6.22     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-83-5db8f96f4f-k66dt                                 0/2     Init:CrashLoopBackOff   820        3d2h    30.125.6.19     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-84-76f67b98c7-7rtsp                                 0/2     Init:CrashLoopBackOff   821        3d2h    30.125.6.21     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-9-54556bd979-bjx98                                  0/2     Init:CrashLoopBackOff   819        3d2h    30.125.6.20     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-32wr9j4-fsmv9          0/1     Completed               0          2d21h   30.125.6.24     lcami514xsdi004.emea.sdi.corp.bankofamerica.com   <none>           <none>


Extract from an OSD in crash loop back-off:
+ echo 'Opening encrypted device /var/lib/ceph/osd/ceph-12/block-tmp at /dev/mapper/ocs-deviceset-0-data-124qlt7-block-dmcrypt'
Opening encrypted device /var/lib/ceph/osd/ceph-12/block-tmp at /dev/mapper/ocs-deviceset-0-data-124qlt7-block-dmcrypt
+ cryptsetup luksOpen --verbose --disable-keyring --allow-discards --key-file /etc/ceph/luks_key /var/lib/ceph/osd/ceph-12/block-tmp ocs-deviceset-0-data-124qlt7-block-dmcrypt
Device /var/lib/ceph/osd/ceph-12/block-tmp is not a valid LUKS device.
WARNING: Locking directory /run/cryptsetup is missing!
Command failed with code -1 (wrong or missing parameters).

Comment 10 Sébastien Han 2021-11-24 17:44:40 UTC
Re-opening.

Just to make sure, you replaced an OSD drive and after rebooting the node, none of the OSDs are coming up?
What is osd.15? The one that disk got replaced? 
Can I get all the init container logs for a failing OSD?
Can I access the env?

Thanks.

Comment 11 khover 2021-11-24 19:14:52 UTC
Hi Sebastien,

Just to make sure, you replaced an OSD drive and after rebooting the node, none of the OSDs are coming up? >> correct CLBO

What is osd.15? The one that disk got replaced? >> correct


Can I get all the init container logs for a failing OSD?
Can I access the env?

All the logs for this case are in supportshell if you can access it.

/cases/03071794

-rw-rwxrw-+ 1 yank yank 5658918912 Nov  2 18:26 0010-must-gather-uscentral12.tar.gz
drwxrwxrwx+ 4 yank yank       4096 Nov  3 17:21 0020-tarball.tar.gz
drwxr-xr-x+ 3 yank yank       4096 Nov  3 17:20 0030-must-gather-useast12.tar.gz
-rw-rwxrw-+ 1 yank yank      50075 Nov  4 15:06 0040-ceph_daignostics_useast12.txt
-rw-rwxrw-+ 1 yank yank      50019 Nov  4 15:16 0050-useast12_case_03071794.txt
drwxrwxrwx+ 3 yank yank         42 Nov  5 06:55 0060-must-gather-useast12.tar.gz
-rw-rwxrw-+ 1 yank yank      49906 Nov  5 12:28 0070-ceph_daignostics_useast12.txt
-rw-rwxrw-+ 1 yank yank    1255390 Nov 11 18:04 0080-rook-ceph-operator-log.txt
-rw-rwxrw-+ 1 yank yank      29925 Nov 19 19:38 0090-ACHP-Replaceafaileddrive.pdf
-rw-rwxrw-+ 1 yank yank      37103 Nov 19 19:39 0100-rook-ceph-raw-notes.txt
-rw-rwxrw-+ 1 yank yank  468721039 Nov 22 22:29 0110-cluster-info-dump.txt
-rw-rw-rw-+ 1 yank yank       3010 Nov 24 09:51 0120-LocalVolumeSet.txt


The env is now stable, but the cu is trying to avoid a similar situation in the future.

Not sure if you can access the env (the cu is Bank of America), but if needed I can ask or schedule a remote session for you to join.

Comment 12 Sébastien Han 2021-11-25 09:24:47 UTC
(In reply to khover from comment #11)
> Hi Sebastien,
> 
> Just to make sure, you replaced an OSD drive and after rebooting the node,
> none of the OSDs are coming up? >> correct CLBO
> 
> What is osd.15? The one that disk got replaced? >> correct
> 
> 
> Can I get all the init container logs for a failing OSD?
> Can I access the env?
> 
> All the logs for this case are in suportshell if you can access.
> 
> /cases/03071794
> 
> -rw-rwxrw-+ 1 yank yank 5658918912 Nov  2 18:26
> 0010-must-gather-uscentral12.tar.gz
> drwxrwxrwx+ 4 yank yank       4096 Nov  3 17:21 0020-tarball.tar.gz
> drwxr-xr-x+ 3 yank yank       4096 Nov  3 17:20
> 0030-must-gather-useast12.tar.gz
> -rw-rwxrw-+ 1 yank yank      50075 Nov  4 15:06
> 0040-ceph_daignostics_useast12.txt
> -rw-rwxrw-+ 1 yank yank      50019 Nov  4 15:16
> 0050-useast12_case_03071794.txt
> drwxrwxrwx+ 3 yank yank         42 Nov  5 06:55
> 0060-must-gather-useast12.tar.gz
> -rw-rwxrw-+ 1 yank yank      49906 Nov  5 12:28
> 0070-ceph_daignostics_useast12.txt
> -rw-rwxrw-+ 1 yank yank    1255390 Nov 11 18:04
> 0080-rook-ceph-operator-log.txt
> -rw-rwxrw-+ 1 yank yank      29925 Nov 19 19:38
> 0090-ACHP-Replaceafaileddrive.pdf
> -rw-rwxrw-+ 1 yank yank      37103 Nov 19 19:39 0100-rook-ceph-raw-notes.txt
> -rw-rwxrw-+ 1 yank yank  468721039 Nov 22 22:29 0110-cluster-info-dump.txt
> -rw-rw-rw-+ 1 yank yank       3010 Nov 24 09:51 0120-LocalVolumeSet.txt


Ok I'll try to access the logs.

> 
> Env is now stable but cu is trying to avoid similar situation in the future.

Does that mean OSDs are running normally now? What happened? Which action was taken to resolve this?

> Not sure if you can access the env cu is bank of america, but if needed I
> can ask or schedule a remote session for you to join.

Probably not needed anymore if the env is stable and I can access the logs.

Comment 13 khover 2021-11-26 13:32:25 UTC
(In reply to Sébastien Han from comment #12)
> Ok I'll try to access the logs.
> 
> > 
> > Env is now stable but cu is trying to avoid similar situation in the future.
> 
> Does that mean OSDs are running normally now? What happened? Which action
> was taken to resolve this?
> 
> > Not sure if you can access the env cu is bank of america, but if needed I
> > can ask or schedule a remote session for you to join.
> 
> Probably not needed anymore if the env is stable and I can access the logs.

> Does that mean OSDs are running normally now? What happened? Which action was taken to resolve this?
Yes, they are running normally now. The cu had to remove the OSDs from the cluster, wipe the disks, and redeploy them as new OSDs.
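
For reference, "wipe the disks and redeploy as new OSDs" typically amounts to something like the following per affected device (a hedged sketch; the OSD ID and /dev/sdX are placeholders, and the authoritative steps are the documented OCS device-replacement procedure):

# Purge the failed OSD from the Ceph cluster (OSD ID is a placeholder).
ceph osd purge 15 --yes-i-really-mean-it

# Wipe partition and LUKS metadata from the disk so it is picked up as a
# fresh device by the next OSD prepare job.
sgdisk --zap-all /dev/sdX
dd if=/dev/zero of=/dev/sdX bs=1M count=100 oflag=direct,dsync

# Restart the rook-ceph operator so it reconciles and prepares the disk again.
oc -n openshift-storage rollout restart deploy/rook-ceph-operator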

Comment 14 Sébastien Han 2021-11-26 13:42:20 UTC
(In reply to khover from comment #13)
> > Does that mean OSDs are running normally now? What happened? Which action was taken to resolve this?
> Yes, they are running normally now. The cu had to remove the OSDs from the
> cluster, wipe the disks, and redeploy them as new OSDs.

Wow that's drastic. Thanks.

Shay, Rachael, have you ever observed something similar during testing? If you reboot a node, have you ever seen OSDs not coming up again?

Comment 16 Sébastien Han 2021-11-29 15:43:59 UTC
OK, as discussed on the support call, we are going to schedule a call with the customer to inspect the nodes. Moving to 4.9.z too.

Comment 17 khover 2021-11-29 21:32:43 UTC
(In reply to Sébastien Han from comment #16)
> ok, as discussed in the support call, we are going to schedule a call with
> the customer to inspect the nodes. Moving to 4.9.z too.

Reached out to customer re their availability windows for a remote session this week.

I will update when they respond.

Confirmed the deployment is OCS with LSO on AWS.

Comment 18 khover 2021-12-01 14:05:01 UTC
(In reply to khover from comment #17)
> (In reply to Sébastien Han from comment #16)
> > ok, as discussed in the support call, we are going to schedule a call with
> > the customer to inspect the nodes. Moving to 4.9.z too.
> 
> Reached out to customer re their availability windows for a remote session
> this week.
> 
> I will update when they respond.
> 
> Confirmed the deployment is OCS with LSO on AWS.

Hi Sebastien,

The cu provided the following availability for a remote session this week:

Thursday 2nd December - 11:00 EST / 16:00 GMT
Friday 3rd December - 09:00 EST / 14:00 GMT
Monday 6th December - directly after our TAM call, approximately 09:30 EST / 14:30 GMT

Comment 19 Sébastien Han 2021-12-02 08:39:04 UTC
Let's go with Monday 6th December at 09:30 EST / 14:30 GMT, thanks.
Please send me an invite.

Comment 20 khover 2021-12-02 15:46:26 UTC
(In reply to Sébastien Han from comment #19)
> Let's go with Monday 6th December at 09:30 EST / 14:30 GMT, thanks.
> Please send me an invite.

cu confirmed Monday 6th December at 09:30 EST / 14:30 GMT 

https://goto.webex.com/goto/j.php?MTID=m4a5fb1d55d63628641945801e07d3ea1

Comment 21 Sébastien Han 2021-12-02 16:53:27 UTC
Thanks!

Comment 22 Sébastien Han 2021-12-08 11:46:17 UTC
Customer version is 4.7.3, so I changed the bug version to 4.7.
Also, this might help a bit: https://bugzilla.redhat.com/show_bug.cgi?id=2030291

Comment 23 khover 2021-12-10 18:46:21 UTC
Hi Sebastien,

Just curious if any discovery was made on this issue.

Do you want me to link https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case as well?

cheers

Comment 24 Sébastien Han 2021-12-13 15:34:58 UTC
(In reply to khover from comment #23)
> Hi Sebastien,
> 
> Just curious if any discovery was made on this issue.
> 
> do you want me to link  https://bugzilla.redhat.com/show_bug.cgi?id=2030291
> to the case as well ?
> 
> cheers

Hi, no progress so far, I think we are stuck at the same point we were before. Without any logs we cannot debug further :(
We can add https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case, which will certainly help with the upgrade.

Comment 25 khover 2021-12-13 16:20:24 UTC
(In reply to Sébastien Han from comment #24)
> (In reply to khover from comment #23)
> > Hi Sebastien,
> > 
> > Just curious if any discovery was made on this issue.
> > 
> > do you want me to link  https://bugzilla.redhat.com/show_bug.cgi?id=2030291
> > to the case as well ?
> > 
> > cheers
> 
> Hi, no progress so far, I think we are stuck at the same point we were
> before. Without any logs we cannot debug further :(
> We can add https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case,
> which will certainly help with the upgrade.

Can I pass along to cu any specific logs/data you need to capture when the issue happens again ?

cheers

Comment 26 Sébastien Han 2021-12-14 10:28:30 UTC
(In reply to khover from comment #25)
> (In reply to Sébastien Han from comment #24)
> > Hi, no progress so far, I think we are stuck at the same point we were
> > before. Without any logs we cannot debug further :(
> > We can add https://bugzilla.redhat.com/show_bug.cgi?id=2030291 to the case,
> > which will certainly help with the upgrade.
> 
> Can I pass along to cu any specific logs/data you need to capture when the
> issue happens again ?
> 
> cheers

Hi Kevan,

Next time this happens:

* ideally, do not re-install the nodes so we can hop onto the machine
* identify the failing OSD
* find the associated PV
* log onto the machine where the PV is mapped
* find the underlying block device of the PV (use the "lsblk" and "ls -al" commands to find the major/minor numbers)
* make sure it is the same device used by the OSD deployment
* run cryptsetup luksDump <disk>

If cryptsetup reports that the disk is "not a valid LUKS device", there is a problem.

If it's bare metal, this could mean:

* this is a different disk (unlikely?)
* the disk was wiped (FYI, the OSD removal job does not wipe the disk)

If it's a virtualized/cloud env, this more likely means that it's a different disk, for instance a different EBS volume.
But again, being hands-on in the env would be ideal.
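
To make those steps concrete, a rough command sketch (the deviceset, PV, and device names are placeholders to be replaced with the ones belonging to the failing OSD):

# 1. Identify the failing OSD pod and its PVC/PV.
oc -n openshift-storage get pods -o wide | grep rook-ceph-osd
oc -n openshift-storage get pvc | grep ocs-deviceset
oc get pv | grep ocs-deviceset

# 2. On the node where the PV is mapped, find the underlying block device and
#    its major/minor numbers, and compare them with the OSD deployment.
lsblk
ls -al /dev/sdX /dev/mapper/ocs-deviceset-*-block-dmcrypt

# 3. Check whether the device still has a LUKS header.
cryptsetup luksDump /dev/sdX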

I have to close this again, unfortunately.
Thanks