1576021 – RFE: if re-encrypt gets to "100% complete" yet still fails, and existing LUKS files still exist, tell user to try again?

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	cryptsetup
Sub Component:
Version:	7.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Ondrej Kozina
QA Contact:	Storage QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-05-08 14:55 UTC by Corey Marthaler
Modified:	2021-09-06 15:04 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-08-26 15:08:21 UTC
Target Upstream Version:
Embargoed:

Description Corey Marthaler 2018-05-08 14:55:04 UTC

Description of problem:
I set up a test case in which a re-encrypt attempt gets really close to finishing (the progress display reaches 100%) but can't fully finish because the underlying storage runs out of space. I got "scared" when the next luksOpen attempt couldn't find a key associated with the correct passphrase, and thought the data might be lost. With the LUKS re-encrypt files still around in /tmp, could we tell the user to try the re-encrypt again (assuming the hardware issue is remedied)? Would that even be possible?

-r--------. 1 root root 1052672 May  7 23:05 LUKS-4e667655-3410-417f-8c1d-e9322155b24f.org
-rw-------. 1 root root 1052672 May  7 23:05 LUKS-4e667655-3410-417f-8c1d-e9322155b24f.new
-rw-------. 1 root root     512 May  7 23:08 LUKS-4e667655-3410-417f-8c1d-e9322155b24f.log

The man page clearly states that it's not resistant to hardware issues, so really it would be the user's own fault if this ever happened:
WARNING:  The  cryptsetup-reencrypt  program  is  not resistant to hardware or kernel failures during reencryption (you can lose you data in this
       case).

# In every attempt, I see the status get to 100%, yet it still fails. Maybe that could be changed as well?
cryptsetup-reencrypt /dev/snapper_thinp/origin
Progress: 100.0%, ETA 00:00, 6016 MiB written, speed  38.5 MiB/s
Cannot restore LUKS header on device /dev/snapper_thinp/origin.
couldn't reencrypt virt volume


May  7 22:23:08 host-087 lvm[15902]: WARNING: Thin pool snapper_thinp-POOL-tpool data is now 89.36% full.
May  7 22:23:12 host-087 kernel: device-mapper: thin: 253:4: reached low water mark for data device: sending event.
May  7 22:23:14 host-087 kernel: device-mapper: thin: 253:4: switching pool to out-of-data-space (queue IO) mode
May  7 22:23:18 host-087 lvm[15902]: WARNING: Thin pool snapper_thinp-POOL-tpool data is now 100.00% full.
May  7 22:24:14 host-087 kernel: device-mapper: thin: 253:4: switching pool to out-of-data-space (error IO) mode
May  7 22:24:14 host-087 kernel: Buffer I/O error on dev dm-14, logical block 1534208, lost async page write
May  7 22:24:14 host-087 kernel: Buffer I/O error on dev dm-14, logical block 1534209, lost async page write


# No key! Here a user could think they're out of luck, since the re-encrypt did get to 100% finished, yet the data is no longer accessible.
[root@host-087 ~]#  cryptsetup luksOpen /dev/snapper_thinp/origin virt_luksvolume
Enter passphrase for /dev/snapper_thinp/origin: 
No key available with this passphrase.

# However, if you know what you're doing, you can extend the pool and attempt the re-encrypt again, and you "should" get your data back.
[root@host-087 ~]# lvs
  LV            VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  POOL          snapper_thinp twi-aotzD-   6.00g             100.00 40.14
  online_resize snapper_thinp Vwi-a-tz--   1.00g POOL origin 14.73
  origin        snapper_thinp Vwi-a-tz--   5.88g POOL        99.49

[root@host-087 ~]# lvextend -L +500M snapper_thinp/POOL
  WARNING: Sum of all thin volume sizes (11.88 GiB) exceeds the size of thin pools (<6.49 GiB).
  WARNING: You have not turned on protection against thin pools running out of space.
  WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
  Size of logical volume snapper_thinp/POOL_tdata changed from 6.00 GiB (1536 extents) to <6.49 GiB (1661 extents).
  Logical volume snapper_thinp/POOL_tdata successfully resized.

[root@host-087 ~]# lvs
  LV            VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  POOL          snapper_thinp twi-aotz--  <6.49g             92.47  40.14
  online_resize snapper_thinp Vwi-a-tz--   1.00g POOL origin 14.73
  origin        snapper_thinp Vwi-a-tz--   5.88g POOL        99.49

[root@host-087 ~]# cryptsetup-reencrypt /dev/snapper_thinp/origin
Log file LUKS-2a706e19-08d8-4375-a663-033e28f78d1c.log exists, resuming reencryption.
Enter any existing passphrase: 

[root@host-087 ~]# cryptsetup luksOpen /dev/snapper_thinp/origin virt_luksvolume
Enter passphrase for /dev/snapper_thinp/origin: 

[root@host-087 ~]# dmsetup ls
snapper_thinp-origin    (253:6)
virt_luksvolume (253:12)
snapper_thinp-online_resize     (253:13)
snapper_thinp-POOL      (253:5)
snapper_thinp-POOL-tpool        (253:4)
snapper_thinp-POOL_tdata        (253:3)
snapper_thinp-POOL_tmeta        (253:2)

# I was able to get the data back and verify the integrity 
[root@host-087 ~]# mount /dev/mapper/virt_luksvolume /mnt/origin/
[root@host-087 ~]# /usr/tests/sts-rhel7.5/bin/checkit -w /mnt/origin -f /tmp/checkit_LUKS -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_LUKS
Working dir:        /mnt/origin


Feel free to close this RFE if this is all expected behavior, since one would hope the user would attempt the re-encrypt again before doing something more drastic on the assumption that the data was lost.
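
The hint requested above could be approximated outside cryptsetup with a small wrapper. The following is a minimal sketch, not part of cryptsetup: the function name `find_reencrypt_leftovers` is hypothetical, and it assumes the temporary files follow the `LUKS-<uuid>.log` naming seen in the listing above and live in the directory passed as the first argument.

```shell
#!/bin/sh
# Hypothetical helper (not a cryptsetup feature): look for leftover
# reencryption log files in a directory and suggest a retry.
# Assumes the LUKS-<uuid>.log naming shown in the report.
find_reencrypt_leftovers() {
    dir=${1:-.}
    found=1    # exit status 1 means "nothing found"
    for log in "$dir"/LUKS-*.log; do
        # An unmatched glob stays literal, so skip non-existent paths.
        [ -e "$log" ] || continue
        found=0
        echo "Interrupted reencryption state found: $log"
        echo "Fix the underlying storage problem, then rerun cryptsetup-reencrypt on the same device from $dir to resume."
    done
    return "$found"
}
```

For example, `find_reencrypt_leftovers /tmp` prints the hint and returns 0 when a log file is present, and returns 1 silently otherwise.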


Version-Release number of selected component (if applicable):
cryptsetup-reencrypt-1.7.4-4.el7.x86_64


How reproducible:
Every time

Comment 3 Milan Broz 2018-05-08 16:05:24 UTC

Running reencryption over thin provisioning is not a good idea at all: it will basically plunder the whole space (including unused space) up to the encrypted device size.

If you enlarge the pool, then yes, it can probably continue. But without that you have already destroyed your data, because some part of the device is now encrypted with a different key. Do you really get the correct checksum if you enlarge the pool and continue the reencryption?

Anyway, my preferred solution would be to somehow tell the underlying storage to preallocate space for the whole device (because reencryption WILL write to the whole device), rather than run out of space later and scare the user.

(Can we use something like fallocate() for a thin pool block device?)
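
Whether preallocation is possible is left open here. A weaker alternative is a pre-flight check that refuses to start when the pool could not absorb a full rewrite of the device. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and it takes the conservative view that reencryption may need as much free pool space as the full size of the thin LV (for example, when its blocks are shared with a snapshot).

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of cryptsetup or LVM.

# pool_free_bytes POOL_SIZE_BYTES POOL_DATA_PERCENT
# Prints the unallocated pool space in bytes (rounded).
pool_free_bytes() {
    awk -v size="$1" -v pct="$2" \
        'BEGIN { printf "%.0f\n", size * (100 - pct) / 100 }'
}

# has_room_for_reencrypt POOL_SIZE_BYTES POOL_DATA_PERCENT LV_SIZE_BYTES
# Succeeds only if the free pool space covers the whole LV size
# (a conservative worst-case assumption).
has_room_for_reencrypt() {
    [ "$(pool_free_bytes "$1" "$2")" -ge "$3" ]
}
```

The inputs could be taken from `lvs --noheadings --nosuffix --units b -o lv_size,data_percent` for the pool and the thin LV. For instance, `has_room_for_reencrypt 1000 25 700` succeeds (750 bytes free), while a pool reported as 100.00% full fails for any non-zero LV size.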

Comment 4 Ondrej Kozina 2019-08-26 15:12:23 UTC

Hi Corey,

Actually, this is a very interesting test case for the new (online) reencryption code. We should be able to recover from such a failure without losing data. After reencryption fails to write to a full thin LV, and you then add new storage to the thin pool, it should be able to recover.

O.
