728723 – LUKS partitions fail to get unmounted cleanly on shutdown resulting in filesystem corruption

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	15
Hardware:	i386
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Lennart Poettering
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-08-06 17:57 UTC by ell1e
Modified:	2011-08-25 15:03 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-08-21 12:26:12 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
smartctl output (19.07 KB, text/plain) 2011-08-06 17:57 UTC, ell1e

Description ell1e 2011-08-06 17:57:51 UTC

Created attachment 517001
smartctl output

Description of problem:
I filed this bug against systemd because it is shutdown-related, but that was a pure guess.

On shutdown, I recently noticed that three messages always appear on screen for just a second or two (sorry, I don't have the full text; they disappear too quickly to read completely):
 cryptsetup /some/long/path1 cannot umount/remove/..: resource is busy
 cryptsetup /some/long/path2 cannot umount/remove/..: resource is busy
 cryptsetup /some/long/path3 cannot umount/remove/..: resource is busy

These correspond to my three encrypted hard disk partitions (/, /home, swap).
When booting the system afterwards, I always get a recovered journal, so the filesystems weren't unmounted cleanly.

At some point they apparently became so badly corrupted that I got disk read errors like the one below. The errors went away completely after an fsck run in maintenance mode (which also left some files lost forever, so actual data loss!!):

[ 1418.029152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1418.032714] ata1.00: BMDMA stat 0x25
[ 1418.036078] ata1.00: failed command: READ DMA EXT
[ 1418.039440] ata1.00: cmd 25/00:06:56:03:2d/00:00:12:00:00/e0 tag 0 dma 3072 in
[ 1418.039443] res 51/40:00:56:03:2d/40:00:12:00:00/e0 Emask 0x9 (media error)
[ 1418.046305] ata1.00: status: { DRDY ERR }
[ 1418.049716] ata1.00: error { UNC }
[ 1418.062550] end_request: I/O error, dev sda, sector 304939862
[ 1418.066023] Buffer I/O error on device dm-2, logical block 122683401
[ 1418.069467] Buffer I/O error on device dm-2, logical block 122683402
[ 1418.072831] Buffer I/O error on device dm-2, logical block 122683403
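
When triaging a log like the one above, it can help to pull out the device and sector of each block-layer read failure, so they can be compared with the drive's own error log (for example, the attached smartctl output). The following is only a minimal sketch, assuming a POSIX shell with `sed`; the function name `failed_sectors` is made up for illustration and is not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "device sector" for every line of the form
#   end_request: I/O error, dev sda, sector 304939862
# Lines such as "Buffer I/O error on device dm-2, ..." are ignored,
# because they report logical blocks rather than disk sectors.
failed_sectors() {
    sed -n 's/.*I\/O error, dev \([^,]*\), sector \([0-9][0-9]*\).*/\1 \2/p'
}
```

One possible use is `dmesg | failed_sectors`; fed the excerpt above, it prints `sda 304939862`.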

That fsck run fixed the filesystem and made the errors go away for now, but the source of the problem persists, and I fear running into new data loss and read errors like the above soon unless it is fixed one way or another.

The failing unmount occurs no matter whether I use "init 0", "init 6" or "reboot" for shutdown.

Version-Release number of selected component (if applicable):
bash-4.2$ systemctl --version
systemd 26
fedora
+PAM +LIBWRAP +AUDIT +SELINUX +SYSVINIT +LIBCRYPTSETUP
bash-4.2$ uname -a
Linux jth 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386 GNU/Linux
bash-4.2$


How reproducible:
Always

Steps to Reproduce:
1. Shut down
2. Boot
  
Actual results:
On shutdown, the three errors above are printed. During boot, the journal is recovered. After many boots, I get read errors and other issues until I run fsck, which clearly shows a broken filesystem.

Expected results:
On shutdown, none of the above errors are printed. During a normal boot, everything is fine, and there is no journal recovery or any other indication of an unclean unmount.

Additional info:
smartctl output is attached just in case this is related to hard disk failure. The read errors in there date from the point where the filesystem was so badly trashed; after the fsck repair they are all gone. The hard disk self-test, which ran for an hour, was started by me afterwards, so it is very recent and up to date (and as far as I can see, pretty much OK).

If you need more info, then just ask me for it and I will see whether I can gather it.

Comment 1 ell1e 2011-08-06 18:12:27 UTC

By the way, in case it is related, I use the following in /etc/rc.local:
 echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
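
For readers unfamiliar with this tunable: the value is in hundredths of a second, so 1500 makes the kernel's periodic writeback wake up every 15 seconds, compared with the usual kernel default of 500 (5 seconds). A minimal sketch of the conversion, assuming a POSIX shell; the function name `writeback_interval_seconds` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: convert a vm.dirty_writeback_centisecs value
# (hundredths of a second) to whole seconds.
writeback_interval_seconds() {
    echo $(( $1 / 100 ))
}
```

To check the value currently in effect, one could run `writeback_interval_seconds "$(cat /proc/sys/vm/dirty_writeback_centisecs)"`.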

Comment 2 ell1e 2011-08-12 19:53:11 UTC

Since this is a production system, some advice or a workaround other than not rebooting (which is what I have relied on so far) would be nice.

Comment 3 ell1e 2011-08-12 21:58:50 UTC

I got a better look at the error message now. Sorry, I don't have the numbers; it isn't on the screen for very long:

[...numbers....] systemd-cryptsetup[number]: failed to deactivate: device or resource is busy

Comment 4 Lennart Poettering 2011-08-21 12:26:12 UTC

[ 1418.029152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1418.032714] ata1.00: BMDMA stat 0x25
[ 1418.036078] ata1.00: failed command: READ DMA EXT
[ 1418.039440] ata1.00: cmd 25/00:06:56:03:2d/00:00:12:00:00/e0 tag 0 dma 3072 in
[ 1418.039443] res 51/40:00:56:03:2d/40:00:12:00:00/e0 Emask 0x9 (media error)
[ 1418.046305] ata1.00: status: { DRDY ERR }
[ 1418.049716] ata1.00: error { UNC }

This is a hardware/driver problem and is unrelated to systemd.

If / is encrypted, we cannot detach it on shutdown in F15 (or any older Fedora version), since we cannot unmount the root file system. In F16, for the first time, we will be able to jump back into the initrd, which then unmounts the root fs and detaches all remaining crypto disks afterwards.

The fact that we cannot detach/unmount the root fs is not a problem, however, since we sync everything to disk, and the kernel will do so again. So there's no systemd problem here.

Please file a new bug about your ATA media error problem, against the kernel.

Comment 5 ell1e 2011-08-25 15:03:38 UTC

I am just asking to be sure:

It seems to me that /home and swap aren't cleanly unmounted either. Is that also normal for Fedora 15, and is it certain not to cause any file system corruption? Note that the subsequent data loss and file system corruption happened on /home, not on /.
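
One way to gather evidence for this question is to list which device-mapper volumes are still mounted at a given moment, for example just before shutdown. This is a rough sketch only, assuming a POSIX shell with `awk` and that the encrypted volumes appear under /dev/mapper/ (their actual names are not given in this report); the function name `mapper_mounts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "device mountpoint" for every
# /dev/mapper/* entry in a mounts-format file.
# Reads /proc/mounts unless another file is given as $1.
mapper_mounts() {
    awk '$1 ~ /^\/dev\/mapper\// { print $1, $2 }' "${1:-/proc/mounts}"
}
```

If /home still shows up in that list, `fuser -vm /home` can show which processes are keeping it busy.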
