281691 – kernel dm crypt: ext3 fs errors if using encrypted swap and suspend

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	9
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Milan Broz
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-09-07 00:53 UTC by Jason Haar
Modified:	2013-03-01 04:05 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-12-19 12:53:52 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
lvmdump of affected system (14.75 KB, application/octet-stream) 2007-10-18 07:59 UTC, Jason Haar

Description Jason Haar 2007-09-07 00:53:51 UTC

Description of problem:

I am using cryptsetup to fully encrypt my harddisk (except for a small /boot
partition to boot off). It works really well - I had to create my own initrd to
include all the crypto and dm kernel modules, and cryptsetup - but it rocks.

However, every "few" reboots (e.g. full reboot or recovering from a
hibernate-to-disk) fsck reports the file system is unclean and kicks off a full
fsck. Sometimes it finds nothing wrong, and sometimes it has to fix up some file
- typically in /tmp. 

When I notice this, if I go back through the syslogs, I can see this was bound
to happen, as the kernel would have been reporting an ext3 error beforehand.

Sep  2 16:54:18 tnz-jhaar-lt kernel: EXT3-fs error (device dm-0):
ext3_free_blocks_sb: bit already cleared for block 7548941

dm-0 is my ext3-based "/" partition. I also encrypt my swap partition - and that
has never caused a problem BTW...

I first saw this on FC6 and thought it indicated I had a bad disk. I replaced
the disk and took the opportunity to install from scratch FC7. So maybe this is
actually a hardware problem too - but that's pretty unlikely.


Version-Release number of selected component (if applicable):

FC7, fully updated via yum, running 2.6.22.4-65.fc7
cryptsetup-luks-1.0.3-4.fc7

How reproducible:

Happens every reboot I do after those ext3 errors show up in syslog. Happened
Aug 14, Aug 28 and Sep 6.

Steps to Reproduce:
1. no idea

Actual results:


Expected results:


Additional info:

Comment 1 Till Maas 2007-09-07 10:10:57 UTC

(In reply to comment #0)

> Sep  2 16:54:18 tnz-jhaar-lt kernel: EXT3-fs error (device dm-0):
> ext3_free_blocks_sb: bit already cleared for block 7548941

> I first saw this on FC6 and thought it indicated I had a bad disk. I replaced
> the disk and took the opportunity to install from scratch FC7. So maybe this is
> actually a hardware problem too - but that's pretty unlikely.

Maybe it is not the disk but the controller/mainboard or memory that is defective
here. Can you run memtest86+ to check your memory?

> dm-0 is my ext3-based "/" partition. I also encrypt my swap partition - and 
> that has never caused a problem BTW...

Do you use your swap partition a lot or is it rather unused?


Btw. there is already a patch for encrypted root being developed, maybe you want
to use it and help there, see #124789

Comment 2 Till Maas 2007-09-07 10:15:13 UTC

Ah, and you can probably use smartmontools to check your hard drives. Do you
see only one ext3 error each time it happens, or several of them?

Comment 3 Jason Haar 2007-09-10 23:24:27 UTC

I've let memtest86+ run overnight on the machine - it detected no RAM problems.

I have also sat down and gone through the syslogs. This problem with "EXT3-fs
error" errors occurring happens minutes to hours after a reboot or
hibernate-to-disk - and produces either one or many " EXT3-fs error" records. It
just happened this morning when I brought my laptop into work - got 8500+ in a
one minute period! I immediately rebooted and fsck'ed the disk - other than a
bunch of inode errors, it looked fine. However, I can see two files in
/lost+found from a few months ago - one is an /sbin/iscsid binary - something I
don't think I've ever used...
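
For anyone triaging similar logs, a burst like this can be sized with a small helper. The following is a minimal sketch and not part of the original report: it assumes the traditional syslog line layout seen in the excerpts above ("Mon DD HH:MM:SS host kernel: ..."), and the function name count_ext3_errors is made up for illustration.

```shell
#!/bin/sh
# Count "EXT3-fs error" lines per minute and per device in a syslog file.
# Assumption: lines start with "Mon DD HH:MM:SS host kernel: ...".
count_ext3_errors() {
    # $1: path to a syslog file (for example /var/log/messages)
    awk '/EXT3-fs error \(device / {
        minute = substr($3, 1, 5)          # HH:MM taken from HH:MM:SS
        dev = $0
        sub(/.*EXT3-fs error \(device /, "", dev)
        sub(/\).*/, "", dev)               # keep only the device name
        count[$1 " " $2 " " minute " " dev]++
    }
    END {
        # one line per bucket: <count> <month> <day> <HH:MM> <device>
        for (k in count) print count[k], k
    }' "$1" | sort -k2
}
```

Calling it as `count_ext3_errors /var/log/messages` would print one line per minute and device, which makes a one-minute burst of thousands of errors easy to tell apart from an isolated message.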

Comment 4 Jason Haar 2007-09-20 05:24:12 UTC

Well big things have happened since last time. I managed to convince Dell it was
a hardware fault, and they replaced both my harddisk again and my motherboard
(with the disk controller).


...but here it is a week later and I've just had a serious occurrence of the
"EXT3-fs error" yet again. This time ext3 lost the inodes of 4 files on my
harddisk - including /usr/bin/swatch and /lib/iptables/libipt_CLASSIFY.so -
which means after the fsck they were GONE.

This has got to be a software problem doesn't it?

One thing. When I went to restore my FC7 system onto the new harddisk, I hit the
same problem I had the first time - namely that the FC7 boot CD/DVD doesn't have
any support for cryptsetup or the appropriate kernel modules. So I couldn't use
FC7 to actually create the cryptsetup partitions to restore onto. So I grabbed
Ubuntu (which does support cryptsetup) and used it to create the encrypted
partitions, and then restored onto that. 

So the question is: does the  Sept release of cryptsetup on Ubuntu match what
you'd expect? If not, if you can tell me how to create an encrypted partition
using a FC7 DVD I'd be happy to do it again...
 
BTW: "cryptsetup luksDump" and "cryptsetup status" don't return anything that
looks like a version number. If there are issues with cryptsetup, being
able to tell what version created a partition would probably help from a support
perspective?

Thanks

Jason

Comment 5 Jason Haar 2007-09-22 06:27:56 UTC

I've just had a thought - could this be a configuration problem more than a
software one?

I created my own initrd to mount the encrypted root and swap partitions at boot
time. 

mkblkdevs
mkdir -p /dev/mapper
cryptsetup luksOpen /dev/sda3 root-enc
mkdir /mnt-root
mount -t ext3 /dev/mapper/root-enc /mnt-root
cryptsetup luksOpen /dev/sda2 swap-enc --key-file=/mnt-root/etc/crypto-swap.key
umount /mnt-root
resume /dev/mapper/swap-enc
echo Creating root device.
mkrootdev -t ext3 -o defaults,noreservation,ro /dev/mapper/root-enc

but I've done nothing to correctly umount it all during a halt,reboot or (more
importantly) hibernate.

As it 99.99% works, I wonder if it could be that root&swap are umounted
correctly - but there isn't a "cryptsetup remove"? Could that cause subtle
corruption?

BTW I don't use /etc/crypttab as I specifically mount root and swap in initrd...
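
The missing shutdown-side cleanup could look roughly like the sketch below. It is an illustration under stated assumptions, not a tested fix for this bug: the mapping name swap-enc is taken from the initrd above, the helper names run and teardown_crypt are hypothetical, and the root mapping is deliberately left alone because it cannot be closed while the root filesystem is still in use. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of a shutdown-time teardown for the mappings opened in the initrd.
# Prints the commands by default; set DRY_RUN=0 (as root) to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

teardown_crypt() {
    swap_name=$1    # dm-crypt mapping used for swap, e.g. swap-enc
    run sync
    run swapoff "/dev/mapper/$swap_name"
    run cryptsetup luksClose "$swap_name"
    # The root mapping stays open: it backs the filesystem we are running
    # from, so the best this script can do is remount it read-only.
    run mount -o remount,ro /
    run sync
}
```

Running `teardown_crypt swap-enc` with the default DRY_RUN prints the five commands in order, which is a safe way to review the sequence before wiring it into a halt script.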

Comment 6 Jason Haar 2007-10-18 07:57:50 UTC

I have reinstalled again - this time placing the dm_crypt root and swap
partitions on top of LVM - which appears to be the more "correct" way (although
a waste of time on a laptop IMHO). Nothing but Redhat tools were used to
construct this version.

Anyway, as Milan Broz requested - attached is the lvmdump from this system. 

More symptoms. I successfully suspended (via pm-hibernate) 6+ times today, each
time it booted, initrd would ask for the password to unlock the root partition,
and then called a password file on (the now unencrypted) /etc/ to decrypt the
swap - so it could resume. All worked splendidly...

Until the last time. It resumed well, but almost immediately started reporting

Oct 18 17:59:17 tnz-jhaar-lt kernel: EXT3-fs error (device dm-2):
ext3_free_blocks_sb: bit already cleared for block 7162385

Over 300 such events in ONE sec - and then no more reports. But I didn't
notice and merrily went about my business installing software and generally
doing I/O.

Then 2 hours later there was a sudden burst of over 700 events - and the system
actually froze at  that stage and I rebooted, and had to manually "fsck -y /" to
fix it. Didn't lose any files that time - but I normally do :-(

So this laptop has had its disk replaced 3 times and its motherboard twice on
my insistence that this isn't a software problem. But this must be?

Here's the section of my custom "init" in my initrd related to cryptsetup:

echo Scanning logical volumes
lvm vgscan --ignorelockingfailure
echo Activating logical volumes
lvm vgchange -ay --ignorelockingfailure  VolGroup00
cryptsetup luksOpen /dev/VolGroup00/LogVol00 root-enc
mkdir /mnt-root
mount -t ext3 /dev/mapper/root-enc /mnt-root
cryptsetup luksOpen /dev/VolGroup00/LogVol01 swap-enc --key-file=/mnt-root/etc/crypto-swap.key
umount /mnt-root
resume /dev/mapper/swap-enc
echo Creating root device.
mkrootdev -t ext3 -o defaults,noreservation,ro /dev/mapper/root-enc
echo Mounting root filesystem.
mount /sysroot

Comment 7 Jason Haar 2007-10-18 07:59:17 UTC

Created attachment 230781 [details]
lvmdump of affected system

Comment 8 Till Maas 2007-11-10 00:24:46 UTC

There is now a discussion about corruptions with dm-crypt on the dm-crypt
mailinglist:

http://thread.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/2381

Maybe this is the same issue that you reported.

Comment 9 Jason Haar 2007-11-10 00:40:58 UTC

That looks like a different issue to mine.

I have a new piece of information (I am using dm-crypt to encrypt my entire system
- both root (/) and swap. i.e. only /boot isn't encrypted, and initrd calls
cryptsetup to initialize the crypto).

My problem appears to occur exclusively when I suspend-to-disk. If I do a full
shutdown and restart, then I never seem to trigger the problem. However, if I
suspend, then there's around a 1-in-4 chance that ext3 will declare something's
wrong and will do a full fsck. If I'm lucky, it will find nothing wrong, if I'm
unlucky, files go missing. The fact that it "mostly" works makes me feel this
cannot be a configuration or "things being done in the wrong order during
hibernation" problem.

This laptop has had all hardware components replaced (thanks Dell) and still has
this symptom - so I'm left thinking this has to be a software problem.
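
Taking the rough 1-in-4 estimate at face value, and assuming (a simplification) that each suspend/resume cycle fails independently with probability p = 1/4, the chance of seeing n clean cycles in a row works out as:

```latex
P(\text{no corruption in } n \text{ cycles}) = (1 - p)^n = \left(\tfrac{3}{4}\right)^n,
\qquad
\left(\tfrac{3}{4}\right)^{4} = \tfrac{81}{256} \approx 0.32,
\qquad
\left(\tfrac{3}{4}\right)^{10} = \tfrac{59049}{1048576} \approx 0.056
```

So a handful of clean resumes says little either way, while ten or more in a row would be fairly unlikely if the fault were still present at that rate.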

Comment 10 Jason Haar 2007-11-10 00:42:47 UTC

Oh yeah - I've reinstalled this laptop in both LVM (with dm-crypt on top) and
raw partition mode (i.e. dm-crypt on top of /dev/sda) and got the same issue -
so this isn't a  LVM problem for me.


Jason

Comment 11 Milan Broz 2007-11-10 14:33:16 UTC

(In reply to comment #8)
No, this is a different issue (the issue you pointed out was caused by faulty hw).

(In reply to comment #9)
> My problem appears to occur exclusively when I suspend-to-disk.

yes, this is very important information.

Do you see corruption without encrypted swap?
(using encrypted root filesystem only)

(I am trying to find out in which part of the process corruption happens.)

Comment 12 Jason Haar 2007-11-10 20:33:27 UTC

You mean run it with unencrypted swap?

OK, I've re-jigged it and we'll see what happens. I should know within a few
days/week if it's going to happen or not

Comment 13 Jason Haar 2007-11-15 03:09:14 UTC

OK, I've been hauling my laptop between home and work all week, suspending to
(unencrypted) disk exclusively, and have had ZERO problems.

So it looks like this issue only occurs when the swap partition is encrypted -
and then only some of the time.

Hope that helps

Comment 14 Milan Broz 2007-11-15 08:48:25 UTC

Kernel problem; it probably sometimes loses data while writing to swap
through dm-crypt during hibernate.

Comment 15 Jason Haar 2007-11-20 07:23:33 UTC

Hi there

So is this a known problem, or should I be reporting it to someone else?...

Thanks

Jason

PS: It is still working fine (ie suspend to disk) with unencrypted swap.

Comment 16 Milan Broz 2007-11-28 07:21:26 UTC

(In reply to comment #15)
> So is this a known problem, or should I be reporting it to someone else?...

If you are able to reproduce this on an upstream kernel, maybe someone on the kernel
list could help.

(I expect that some flush is missing in the suspend process, so there is
still some unfinished work in the crypt queue. Things may get a
little more complicated because of recent changes in the block layer - zero-sized barriers,
which are still rejected by DM targets. Just quick thoughts, this needs some
analysis...)

I have this problem in my dm-crypt TODO list but currently there are some issues
with higher priority.

Comment 17 Christopher Brown 2008-01-13 23:14:55 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

There hasn't been much activity on this bug for a while.

Jason, have you been able to test this on an upstream kernel? If not, do you need
one built?

Milan, have you been able to look further into this issue?

Comment 18 Jason Haar 2008-01-14 05:40:47 UTC

I've been running dm-crypt on my / partition - but have removed it from swap as
that was where the problem was.

I've just re-enabled that and will start suspending-to-disk again 100% crypto.

I'll let you know in a few days if the problem comes back again

Comment 19 Jason Haar 2008-01-15 07:32:54 UTC

Well that didn't take long :-(

I did 4 cycles of suspending encrypted disk to encrypted swap and each time let
it reboot all the way up to a working state. It looked good.

However, on the 4th time, it also looked good, but then 10 minutes after the
final services had restarted/unfrozen, I started seeing these infamous words again

kernel: EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared
for block 7907504
kernel: EXT3-fs error (device dm-2): ext3_free_inode: bit already cleared for
inode 3935185
kernel: EXT3-fs warning (device dm-2): ext3_unlink: Deleting nonexistent file
(3344191), 0


a real mess :-(

Unless you have any other ideas, I'm going back to unencrypted swap - before I
lose any more /usr/bin files...

Comment 20 Milan Broz 2008-01-15 12:10:31 UTC

We need to add another sync in the suspend path - I have already read the code but still
have had no time to create a patch and run some tests + a kernel build.
Anyway, increasing the severity of this bug.

Comment 21 Jason Haar 2008-02-17 23:11:06 UTC

FYI I have just replaced my Dell X300 with a Dell 430 laptop and moved up to FC8.

I implemented encrypted "/" again - but left swap unencrypted due to this fault.

It has been working 100% well for 3 weeks (hibernating to disk several times a
week) - until today...

Same problem - ext3 errors all over the place after coming out of
hibernate/suspend. I've just rebooted, typed in the password to decrypt the disk
and now I'm seeing

ata1.00: BMDMA stat 0x25
ata1.00: cmd c8/00........
EXT3-fs: Can't read superblock on 2nd try
mount failed.


It's toast :-(

Either my disk just died (yeah, right) or dm-crypt just killed it.

I'm going back to unencrypted with encfs. That was rock-solid. :-(

Comment 22 Milan Broz 2008-02-18 08:46:45 UTC

The last issue looks like a hw fault to me... or not?
(There should be no problems with an encrypted root only.)

Comment 23 Jason Haar 2008-02-18 09:14:57 UTC

Yes. I think I jumped the gun. It's just that I've had 3 disk replacements since
I started using dm-crypt - I'm starting to blame it for everything.

Do you know if there's any work on the encrypted-swap-and-hibernation bug I've
been seeing?

Thanks

Comment 24 Milan Broz 2008-02-18 09:49:35 UTC

Adding this bug to the F9 blocker list because encrypted root and swap is a supported
configuration in the F9 time frame.

Comment 25 Milan Broz 2008-02-21 12:22:59 UTC

Jason, please could you confirm my assumptions (from attached logs):
- corruption was seen even on uniprocessor (no dualcore/SMP, just single CPU)
- you are using different encryption for swap and root (aes+twofish)

Comment 26 Jesse Keating 2008-03-31 17:56:52 UTC

Can we get a statement as to what this bug actually is, and if it's really a
blocker for Fedora 9 (of which there is very little time left for development)?

Comment 27 Jason Haar 2008-03-31 18:44:56 UTC

(In reply to comment #25)
> Jason, please could you confirm my assumptions (from attached logs):
> - corruption was seen even on uniprocessor (no dualcore/SMP, just single CPU)
> - you are using different encryption for swap and root (aes+twofish)
> 
Sorry I took so long to answer this - I never received an email alert.

I have had it on two laptops: one single-processor, and now this one - a dualcore.

As far as what crypto type is in use, I *think* they are different. Can you tell
me what command I could run that would tell me what dm-crypt settings are on each?

Thanks

Jason

Comment 28 Jason Haar 2008-04-01 03:09:22 UTC

(In reply to comment #27)
> me what command I could run that would tell me what dm-crypt settings are on 
> each?

Don't worry - lvmdump did the trick.

With this newer machine I have been unsuccessful even with using the same crypto
options for swap as well as the root - i.e aes-cbc-essiv:sha256
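
Besides lvmdump, the cipher of each active mapping can be read from the device-mapper table directly. The helper below is a sketch under assumptions: it relies on the crypt target's table layout (name, start, length, target type, cipher, key, ...) and the function name crypt_ciphers is hypothetical. Depending on the dmsetup version, the raw `dmsetup table` output may include the key itself, so it should not be pasted into a public report unfiltered; this helper prints only the mapping name and cipher.

```shell
#!/bin/sh
# Read `dmsetup table` output on stdin and print "<mapping> <cipher>" for
# every dm-crypt mapping. Assumed crypt target line layout:
#   <name>: <start> <length> crypt <cipher> <key> <iv_offset> <device> <offset>
crypt_ciphers() {
    awk '$4 == "crypt" {
        name = $1
        sub(/:$/, "", name)   # drop the trailing colon after the mapping name
        print name, $5        # the key field ($6) is never printed
    }'
}
```

As root, `dmsetup table | crypt_ciphers` would list something like one line per encrypted mapping; "cryptsetup status <name>" and "cryptsetup luksDump <device>" report the cipher as well.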

Comment 29 Jon Stanley 2008-04-17 00:58:08 UTC

I was under the impression that hibernate was unsupported with encrypted swap -
am I wrong here?  I'm just going through the F9 blocker list.  I realize that
encrypted swap is the default with F9 if you tick the 'encrypt system' box in
anaconda.  I tried to hibernate my encrypted rawhide laptop and completely
failed today - the system just booted when I turned it back on rather than
resuming from swap.

If it is true that hibernate is unsupported w/encrypted swap, then we're going
to need a release note...

Comment 30 Jason Haar 2008-04-17 01:43:50 UTC

To do "proper" full disk encryption (like all the commercial Windows products do
BTW...), you really have to encrypt the swap. 

What's really missing with cryptsetup is some form of kernel password storage 
area, where a "cryptsetkey" command early in the initrd boot process could
prompt for the password, and then use it on any future invocation of cryptsetup.
That way you could prompt for the password, and then use it to decrypt swap
and/or root before doing the resume. Without it I for one am stuck in the
hand-crafted hell of creating a static password file on (encrypted) root, and
running cryptsetup on root first to grab the key file to decrypt swap - before
the resume!

I only came up with the "cryptsetkey" concept last night - I might have to
harass the cryptsetup author about it :-) After it had mounted everything it
needed to in initrd, you could run "cryptsetkey --delete" to trash the password
from "kernel memory" (I'm no programmer - but hopefully you get the gist ;-)

So to get back to your question, yes - you are probably correct. However, I
think it's a bit bizarre Linux distributions still haven't figured out how to do
"proper" whole disk encryption when Windows figured it out many years ago. :-(

Comment 31 Milan Broz 2008-04-17 07:00:52 UTC

(In reply to comment #30)
> That way you could prompt for the password, and then use it to decrypt swap
> and/or root before doing the resume. Without it I for one am stuck in the
> hand-crafted hell of creating a static password file on (encrypted) root, and
> running cryptsetup on root first to grab the key file to decrypt swap - before
> the resume!

Exactly this is now possible in Fedora 9. It asks for the LUKS password before
running the LVM scan, so the whole physical volume can be encrypted. And resume runs
from the logical volume mapped to swap on this encrypted volume.

I run two notebooks; both (in simple tests) resumed from encrypted swap.
(But the bug this bugzilla is about is still here; I just haven't been able to
reproduce it without additional hacks yet.)

Anyway, I saw another problem: because root and swap are encrypted, the standard
initscripts don't correctly umount/remove the encryption mapping (because cryptsetup
and the initscripts run from the device which needs to be umounted/luksClosed!).
So there should be some shutdown ramdisk or similar (I need to check whether this is
still true in a recent version of F9; if so, it needs a new bug report).
 
> I only came up with the "cryptsetkey" concept last night - I might have to
> harass the cryptsetup author about it :-) After it had mounted everything it
> needed to in initrd, you could run "cryptsetkey --delete" to trash the password
> from "kernel memory" (I'm no programmer - but hopefully you get the gist ;-)

You mean password for unlocking LUKS? It is not stored in memory after unlocking
IMHO.

A command to wipe the master key (used for the encryption algorithm in dm-crypt)
from kernel memory is already supported in the dm-crypt kernel module through the
dm message interface... no idea why it is not used. (It is already possible with
dmsetup - see some thread on the dm-crypt mailing list.)

I'll make some more notes to this bugzilla later, just currently busy with some
other work, sorry.
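
The key wipe through the dm message interface mentioned above could be driven from dmsetup roughly as follows. This is a sketch under assumptions rather than a recommended procedure: the helper names run and wipe_crypt_key are hypothetical, and it assumes dm-crypt accepts key messages only while the mapping is suspended, which is why the suspend comes first. After the wipe the mapping stays suspended and blocks all I/O until a key is set again and the device is resumed, so it must not be tried on the mapping that backs the running root filesystem. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch: wipe the dm-crypt master key of a mapping from kernel memory.
# Prints the commands by default; set DRY_RUN=0 (as root) to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

wipe_crypt_key() {
    name=$1    # dm-crypt mapping name, e.g. swap-enc
    # Assumption: key messages are only accepted on a suspended device.
    run dmsetup suspend "$name"
    run dmsetup message "$name" 0 key wipe
    # The mapping now blocks I/O until a key is set again
    # ("dmsetup message <name> 0 key set <key>") and it is resumed.
}
```

Running `wipe_crypt_key swap-enc` in the default dry-run mode prints the two dmsetup commands for review.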

Comment 32 Jason Haar 2008-04-17 07:25:56 UTC

what about remounting readonly? Having to add a ramdisk just to do shutdown
cleanly is a bit severe. Can't remounting readonly before powering off get
around the problem? Or is it that dm-crypt still has some unfinished writes
hanging about? If so, wouldn't that be a bug?

jason

Comment 33 Milan Broz 2008-04-17 07:53:46 UTC

Sure, it remounts read-only if it cannot umount. But this is enough for
a non-encrypted system, not for a dm-crypted one.
The master key is still in memory after a read-only remount (possible DRAM data
retention attack etc.) 

I am not sure if dm-crypt internal queue is flushed here properly (but sync
should be enough here in shutdown path - so probably not big problem).

Comment 34 Jesse Keating 2008-04-18 21:35:15 UTC

Ok, our default setup, which is swap as part of LVM, and encrypting the LVM
physical volume, works just fine with suspend and hibernate.  I'm going to
remove this from the blocker list, as it's not really a case our installers will
hit.
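
For readers building this layout by hand, the default setup described here (one encrypted LVM physical volume holding both root and swap) can be sketched as a command plan. This is an illustration under assumptions, not the installer's actual procedure: the partition /dev/sda2, the mapping name pv-enc, and the 2G swap size are hypothetical, while the VolGroup00/LogVol00/LogVol01 names follow the ones used earlier in this report. The commands destroy existing data on the target partition, so the sketch only prints them.

```shell
#!/bin/sh
# Print (do not run) the commands for one LUKS container that holds an LVM
# volume group with both root and swap. Executing them would wipe the
# target partition, so this sketch only echoes the plan.
plan_lvm_on_luks() {
    part=$1         # partition to encrypt, e.g. /dev/sda2 (hypothetical)
    swap_size=$2    # size of the swap volume, e.g. 2G (hypothetical)
    map=pv-enc      # dm-crypt mapping name (hypothetical)
    vg=VolGroup00   # volume group name as used earlier in this report

    echo "cryptsetup luksFormat $part"
    echo "cryptsetup luksOpen $part $map"
    echo "pvcreate /dev/mapper/$map"
    echo "vgcreate $vg /dev/mapper/$map"
    echo "lvcreate -L $swap_size -n LogVol01 $vg"
    echo "lvcreate -l 100%FREE -n LogVol00 $vg"
    echo "mkswap /dev/$vg/LogVol01"
    echo "mkfs.ext3 /dev/$vg/LogVol00"
}
```

With this layout there is a single dm-crypt mapping underneath both root and swap, so only one passphrase is needed before the LVM scan and the resume.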

Comment 35 Jason Haar 2008-04-18 23:52:21 UTC

I talked to the primary authors of dm-crypt this week and they said I should be
doing that too.

I'll reinstall my laptop next week - and put both swap and root into the same
LVM. We'll see what that does :-)

Comment 36 Jason Haar 2008-05-06 03:13:25 UTC

I've been running FC8 on an LVM'ed dm-crypt volume as per the above suggestions
for over 3 weeks now with ZERO problems.

That appears to be it! Having separate dm-crypt partitions for swap and root was
the problem - putting both on the same dm-crypt partition appears to have solved
everything for me :-)

Crypt rocks.

So yes, I'm now looking forward to FC9 with the supported crypto - no more need
for me to manually create initrd's :-)

Thanks again!

Jason

Comment 37 Bug Zapper 2008-05-14 03:12:07 UTC

Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 38 Michal Kučera 2008-06-19 07:24:02 UTC

Hi, I have the same problem, but I use neither LVM nor an encrypted filesystem,
and I get a similar error:
kernel: EXT3-fs error (device /dev/sda1): ext3_free_blocks_sb: bit already cleared
for block 7907504

It appeared after updating to kernel 2.6.25.6. After shutting down the
computer and starting the system again, I get this error and the whole partition is bad. My
root partition is full because the file /var/log/messages fills the whole
partition. If I delete this file, the partition is still full. After checking the
partition with e2fsck I get many errors, and now I'm not able to boot the system.
Booting fails.

Comment 39 Milan Broz 2008-12-19 12:53:52 UTC

The bug reported in comment #38 is some ext3 corruption, certainly not related to volume encryption.

I was not able to reproduce it, and several kernel versions (and even Fedora versions) have been released since this bug was opened...

Closing this bug. If you still see corruption when using a recent Fedora version and encrypted swap, please open a new bug with an exact description of the kernel version and how to reproduce it.
(Because F9 and F10 support encryption in the installer and there have been no bug reports so far, I expect it is fixed...)
