Bug 1002144 - LVM snapshot creation sometimes fails
Summary: LVM snapshot creation sometimes fails
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: lvm2
Version: 19
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Peter Rajnoha
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1014067
Blocks:
 
Reported: 2013-08-28 13:54 UTC by John F Sullivan
Modified: 2014-01-15 14:38 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-15 14:38:11 UTC
Type: Bug
Embargoed:


Attachments
Script used to automate snapshot create/remove and reproduce the error during snapshot creation (1.05 KB, application/x-shellscript)
2013-08-28 13:54 UTC, John F Sullivan
output of "lsblk -a" (6.30 KB, text/plain)
2013-08-30 13:30 UTC, John F Sullivan

Description John F Sullivan 2013-08-28 13:54:59 UTC
Created attachment 791409 [details]
Script used to automate snapshot create/remove and reproduce the error during snapshot creation

Description of problem:
Creating an LVM snapshot occasionally fails.  I found this issue while running rsnapshot to back up my system, but I can easily reproduce it simply by creating and removing snapshots.

Version-Release number of selected component (if applicable):
kernel-3.10.9-200.fc19.x86_64
lvm2-2.02.98-12.fc19.x86_64
lvm2-libs-2.02.98-12.fc19.x86_64
device-mapper-1.02.77-12.fc19.x86_64
systemd-204-9.fc19.x86_64
dmraid-events-1.0.0.rc16-21.fc19.x86_64
mdadm-3.2.6-21.fc19.x86_64

How reproducible:
Seems somewhat random, but reproduces within twenty iterations of the attached shell script.

Steps to Reproduce:
1. Create a snapshot
2. Remove the snapshot
3. If no error occurs, go to step 1, else stop and file a bug
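
For reference, the loop the attached script runs looks roughly like this - a sketch rather than the script itself, using the same volume group, origin LV, and snapshot name as the run shown below:

  #!/bin/bash
  # Usage: ./lvm-snapshot-test.sh <vg> <origin_lv> <snap_name>
  VG=$1; ORIGIN=$2; SNAP=$3
  i=0
  while true; do
      i=$((i + 1))
      echo "-> Iteration $i, creating..."
      # the 1G COW size is an arbitrary choice for this sketch
      lvcreate --snapshot --size 1G --name "$SNAP" "$VG/$ORIGIN" || exit 1
      echo "-> Removing..."
      lvremove -f "/dev/$VG/$SNAP" || exit 1
  done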

Actual results:
# ./lvm-snapshot-test.sh vg_rivendell root_fedora_15 rsnapshot
-> Starting...
-> Iteration 1, creating...
  Logical volume "rsnapshot" created
-> Removing...
  Logical volume "rsnapshot" successfully removed
-> Iteration 2, creating...
  Logical volume "rsnapshot" created
-> Removing...
  Logical volume "rsnapshot" successfully removed
-> Iteration 3, creating...
  device-mapper: create ioctl on vg_rivendell-rsnapshot failed: Device or resource busy
  Aborting. Failed to activate snapshot exception store.

Note that in this specific instance the snapshot was actually created - usually it is not.  I was unable to remove the snapshot, however, and had to reboot the system, at which point the snapshot no longer existed in the volume group:

# ls /dev/vg_rivendell/rsnapshot
/dev/vg_rivendell/rsnapshot
# lvremove /dev/vg_rivendell/rsnapshot
  One or more specified logical volume(s) not found.
# lvdisplay /dev/vg_rivendell/rsnapshot
  One or more specified logical volume(s) not found.
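
If it happens again, one thing that might be worth checking is whether device-mapper still holds nodes for the snapshot even though LVM can no longer see it - just a guess on my part, and the "-cow" suffix is my assumption about how LVM names the exception store device:

  # raw device-mapper view, independent of the LVM metadata
  dmsetup ls | grep rsnapshot
  dmsetup info vg_rivendell-rsnapshot
  dmsetup info vg_rivendell-rsnapshot-cow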

Expected results:
No error

Additional info:
Physical system with four drives configured as a RAID 6 array; the resulting MD device is the physical volume of the affected volume group.

The dmesg output shows no messages logged as a result of these operations, and /var/log/messages shows no difference between the successful and failing cases:
Aug 28 09:41:02 rivendell lvm[2230]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 28 09:41:02 rivendell lvm[2230]: No longer monitoring snapshot vg_rivendell-rsnapshot
Aug 28 09:41:02 rivendell systemd-udevd[2202]: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
Aug 28 09:41:02 rivendell systemd-udevd[2220]: inotify_add_watch(7, /dev/dm-18, 10) failed: No such file or directory
Aug 28 09:41:03 rivendell lvm[2230]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 28 09:41:03 rivendell lvm[2230]: No longer monitoring snapshot vg_rivendell-rsnapshot
Aug 28 09:41:03 rivendell systemd-udevd[2220]: inotify_add_watch(7, /dev/dm-18, 10) failed: No such file or directory
Aug 28 09:41:03 rivendell systemd-udevd[2202]: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
Aug 28 09:41:03 rivendell systemd-udevd[2246]: inotify_add_watch(7, /dev/dm-17, 10) failed: No such file or directory

Comment 1 John F Sullivan 2013-08-29 14:06:30 UTC
The same utility run on a system with the same versions of packages but without a RAID configuration does not exhibit the problem.  I ran 200 iterations of the test and did not see the issue.

Original system on which the problem occurs:
  Intel Core 2 6400 @ 2.13GHz
  8GB RAM
  4x SATA drives configured as RAID 6

New system on which the problem does not occur:
  Intel Core i7 860 @ 2.80GHz
  8GB RAM
  2x SATA drives with physical volumes belonging to the same volume group

When reproducing the problem again on the original system, I found a difference in /var/log/messages that I had not seen earlier, as well as a message I had not noticed before.  The following snippet shows the entries logged during several iterations; note the difference in timing between the iteration prior to the failure and the failure itself:

Aug 29 09:58:43 rivendell lvm[12012]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:43 rivendell lvm[12012]: No longer monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:43 rivendell systemd-udevd[11984]: inotify_add_watch(7, /dev/dm-18, 10) failed: No such file or directory
Aug 29 09:58:43 rivendell systemd-udevd[11982]: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
Aug 29 09:58:44 rivendell lvm[12012]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:44 rivendell lvm[12012]: No longer monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:44 rivendell systemd-udevd[11984]: inotify_add_watch(7, /dev/dm-18, 10) failed: No such file or directory
Aug 29 09:58:44 rivendell systemd-udevd[11982]: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
Aug 29 09:58:44 rivendell lvm[12012]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:45 rivendell lvm[12012]: No longer monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:45 rivendell systemd-udevd[11982]: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
Aug 29 09:58:45 rivendell systemd-udevd[11984]: inotify_add_watch(7, /dev/dm-18, 10) failed: No such file or directory
Aug 29 09:58:45 rivendell lvm[12012]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:45 rivendell lvm[12012]: No longer monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:45 rivendell lvm[12012]: Monitoring snapshot vg_rivendell-rsnapshot
Aug 29 09:58:45 rivendell systemd-udevd[11993]: inotify_add_watch(7, /dev/dm-18, 10) failed: No such file or directory
Aug 29 09:58:45 rivendell systemd-udevd[11984]: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
Aug 29 09:58:46 rivendell lvm[12012]: Snapshot vg_rivendell-rsnapshot changed state to:
Aug 29 09:58:46 rivendell lvm[12012]: dm_task_run failed, errno = 6, No such device or address
Aug 29 09:58:46 rivendell lvm[12012]: vg_rivendell-rsnapshot disappeared, detaching
Aug 29 09:58:46 rivendell lvm[12012]: No longer monitoring snapshot vg_rivendell-rsnapshot

I suspect this problem may have more to do with device-mapper/udev or some interaction between device-mapper and lvm2 that is exacerbated by the RAID 6 configuration, but I do not know how to extract additional information to further troubleshoot and triage.  Any suggestions would be greatly appreciated - thank you in advance.
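
In case it helps, this is the kind of extra tracing I could collect on the next failing run - assuming these are the right knobs, which I am not certain of:

  # run the failing lvcreate with maximum verbosity and keep the output
  # (the 1G snapshot size here is arbitrary)
  lvcreate -vvvv --snapshot --size 1G --name rsnapshot vg_rivendell/root_fedora_15 2> lvcreate-debug.log

  # in a second terminal, watch kernel and udev events while the test loop runs
  udevadm monitor --kernel --udev > udev-events.log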

Comment 2 Peter Rajnoha 2013-08-30 12:30:43 UTC
(In reply to John F Sullivan from comment #1)
> I suspect this problem may have more to do with device-mapper/udev or some
> interaction between device-mapper and lvm2 that is exacerbated by the RAID 6
> configuration, but I do not know how to extract additional information to
> further troubleshoot and triage.  Any suggestions would be greatly
> appreciated - thank you in advance.

Please try setting global/use_lvmetad=0 in /etc/lvm/lvm.conf. This will disable the lvmetad daemon and the accompanying LVM autoactivation - there might be a problem with LVM autoactivation interfering here.
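
For example, in the global section of /etc/lvm/lvm.conf (only the use_lvmetad value needs to change; the rest is just context):

  global {
      ...
      # disable lvmetad and the autoactivation it drives
      use_lvmetad = 0
  }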

Comment 3 John F Sullivan 2013-08-30 12:47:57 UTC
Setting use_lvmetad=0 in /etc/lvm/lvm.conf and rebooting the system appears to have made the problem disappear.  With this setting, I ran 200 iterations of the attached test script without error.  I should be able to get my backups running again - thank you!

Please let me know if I can gather any additional information that may shed light on what's going wrong with LVM auto-activation.  I would be happy to continue triaging and gathering information to help solve the underlying issue so this workaround is no longer necessary.
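
If it is useful, here are a couple of ways to double-check that the setting took effect after the reboot - assuming these are the right interfaces to query:

  # effective value of the setting as LVM sees it
  lvm dumpconfig | grep use_lvmetad

  # the lvmetad socket/service should no longer be active
  systemctl status lvm2-lvmetad.socket lvm2-lvmetad.service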

Comment 4 Peter Rajnoha 2013-08-30 13:06:01 UTC
Thanks for testing. Well, we need to fix how LVM autoactivation is done on MD RAID - we already have a similar bug reported, bug #985638 (though it might not seem similar at first sight). Just to be sure, please send the output of the lsblk command as well so I can see the exact device layout. Thanks.

Comment 5 John F Sullivan 2013-08-30 13:30:19 UTC
Created attachment 792178 [details]
output of "lsblk -a"

Hi Peter - you're very welcome, and thank you for the quick responses.  I never would have identified that bug as the same issue, partly because I didn't associate the problem with RAID when I first experienced it.

I've attached the output of "lsblk -a" - please let me know if there's any other data I can provide.

Comment 6 Peter Rajnoha 2013-09-11 12:00:47 UTC
This patch should fix the issue reported:

  https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=8d1d83504dcf9c86ad42d34d3bd0b201d7bab8f6

I'll try to add that in next F19 update.

Comment 7 Peter Rajnoha 2013-10-01 11:29:48 UTC
(In reply to Peter Rajnoha from comment #6)
> This patch should fix the issue reported:
> 
> 
>   https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=8d1d83504dcf9c86ad42d34d3bd0b201d7bab8f6
> 
> I'll try to add that in next F19 update.

We also need a fix for bug #1014067 (dracut).

Comment 8 Peter Rajnoha 2014-01-15 14:38:11 UTC
I really recommend using Fedora 20, which has the most recent bug fixes for lvmetad and the accompanying autoactivation feature. It would be a bit complicated to backport all the necessary patches to F19 (there is a high risk of regressions appearing if the backport is not done correctly). Since we have a workaround for F19, I'm closing this bug as NEXTRELEASE. This problem should be completely fixed in F20.

