Bug 1018171
Summary: /dev/md/xxx symlink not removed

Product: Red Hat Enterprise Linux 6
Reporter: Jan Safranek <jsafrane>
Component: udev
Assignee: Harald Hoyer <harald>
Status: CLOSED ERRATA
QA Contact: Leos Pol <lpol>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 6.4
CC: jsafrane, lpol, mnavrati, psklenar, udev-maint-list
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When udev processed a backlog of queued uevents for a device that had already been removed, the internal handling failed because the device no longer existed. Consequently, some symbolic links for such devices were never removed. With this update, udev no longer relies on the device still existing when it works through the uevent queue backlog, and all symbolic links are removed as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-22 07:19:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:

dmesg of mdadm -C and mdadm -S at the time of the buggy behavior; IMHO nothing interesting:

md: bind<sda>
md: bind<sdb1>
bio: create slab <bio-1> at 1
md/raid0:md127: md_size is 2287616 sectors.
md: RAID0 configuration for md127 - 2 zones
md: zone0=[sda/sdb1] zone-offset= 0KB, device-offset= 0KB, size= 191488KB
md: zone1=[sda] zone-offset= 191488KB, device-offset= 95744KB, size= 952320KB
md127: detected capacity change from 0 to 1171259392
md127: unknown partition table
md127: detected capacity change from 1171259392 to 0
md: md127 stopped.
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda>
md: export_rdev(sda)

Created attachment 811006 [details]
reproducer

Don't forget to change the DISKS variable inside!

Attached a new reproducer script; now I can reproduce it with a 100% success rate.
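The attached script itself is not inlined in this report; purely as an illustration, a minimal create/stop loop of the kind described might look like the sketch below. The DISKS values, RAID level, and the /dev/md/test array name are assumptions taken from the dmesg and description in this bug, not from the actual attachment.

    #!/bin/bash
    # Hypothetical sketch of the reproducer loop -- NOT the attached script.
    # DISKS must point to scratch devices that may be overwritten.
    DISKS="/dev/sda /dev/sdb1"

    i=0
    while true; do
        i=$((i + 1))
        # Create a 2-member RAID0 array with the named symlink /dev/md/test.
        mdadm -C /dev/md/test -l 0 -n 2 --run $DISKS || exit 1
        udevadm settle   # the original used "sleep 2"; settle was suggested later
        # Stop the array; udev is expected to remove the /dev/md/test symlink.
        mdadm -S /dev/md/test || exit 1
        udevadm settle
        if [ -L /dev/md/test ]; then
            # With the buggy udev the symlink is sometimes left dangling here.
            echo "Failed after: $i"
            exit 1
        fi
    done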
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

(In reply to Jan Safranek from comment #3)
> Created attachment 811006 [details]
> reproducer
>
> Don't forget to change the DISKS variable inside!
>
> Attached a new reproducer script; now I can reproduce it with a 100% success rate.

Instead of "sleep 2", you should probably use "udevadm settle". See if you can reproduce it then.

Result is the same with udevadm settle; it fails after several loops.

(In reply to Jan Safranek from comment #6)
> Result is the same with udevadm settle; it fails after several loops.

And after the script ends/fails, are the symlinks still there and never removed, even after 3 minutes?

Maybe add:

# udevadm control --log-priority=info

or run in parallel:

# udevadm monitor

and attach the last relevant messages from /var/log/messages and the monitor output.

I can't reproduce the bug with 'udevadm control --log-priority=info'. This probably indicates the bug depends on exact timing / a race.

Created attachment 817971 [details]
Updated reproducer
Updated reproducer script with enhanced logging (basically, it runs udevadm monitor in the background and prints timestamps as the script progresses).
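As a rough idea of what such enhanced logging could look like, building on the sketch earlier in this report (the log file path and date format are illustrative assumptions, not taken from the attachment):

    # Hypothetical logging additions -- NOT the attached script.
    # Run udevadm monitor in the background and keep its output for later.
    udevadm monitor > /tmp/udev-monitor.log 2>&1 &
    MONITOR_PID=$!

    # Print a timestamp before each step so the script output can be
    # correlated with the monitor log and /var/log/messages afterwards.
    echo "$(date +%s.%N) mdadm -C"
    mdadm -C /dev/md/test -l 0 -n 2 --run $DISKS
    echo "$(date +%s.%N) mdadm -S"
    mdadm -S /dev/md/test

    kill "$MONITOR_PID"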
Created attachment 817976 [details]
script output
Log with udevadm and script output mixed together. udevadm seems to write its log in batches; I can't tell whether that is due to buffered output or some delay in the kernel/udev.
Anyway, be aware that the udevadm lines show up late in the log and need to be merged back into the appropriate place.
Created attachment 817979 [details]
syslog messages
Syslog from the time the reproducer was running.
Umm, syslog shows this error at the time the bug occurs:

Oct 31 16:10:01 rhel6_test udevd-work[1613]: inotify_add_watch(6, /dev/md127, 10) failed: No such file or directory

(= 1383232199.000000000 UNIX time)

And after the script ends/fails, are the symlinks still there and never removed, even after 3 minutes?

Yes, after a few minutes of poking around the system, the symlink is still there. I guess stracing udevd and its workers would change the timing as well.

(In reply to Jan Safranek from comment #10)
> Log with udevadm + script mixed together. udevadm seems to write log in
> batches, I'm not able to judge if it's due to buffered output or some delay
> in kernel/udev.

Redirect stdout to stderr for udevadm monitor; stdout is buffered, stderr is not (see the short sketch below).

# rpm -q udev
udev-147-2.57.el6.x86_64
....
lrwxrwxrwx. 1 root root 8 15. dub 04.21 /dev/md/test -> ../md127
Failed after: 2

# rpm -q udev
udev-147-2.61.el6.x86_64
....
runs forever

Honza, thanks for the reproducer.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1382.html
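For completeness, applying that stdout/stderr suggestion to the monitoring sketch above might look as follows; this is only a sketch, and whether it actually removes the batching depends on where the script's stderr points in a given run.

    # Send udevadm monitor output to stderr instead of stdout, as suggested
    # above, so it is not held back by stdout buffering.
    udevadm monitor >&2 &
    MONITOR_PID=$!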
Created attachment 811002 [details]
reproducer. Don't forget to change the DISKS variable inside!

'mdadm -S /dev/md/test' sometimes does not remove the /dev/md/test symlink, while the /dev/md127 device itself is destroyed correctly.

Version-Release number of selected component (if applicable):
udev-147-2.48.el6.x86_64
kernel-2.6.32-422.el6.x86_64

How reproducible:
sometimes

Steps to Reproduce:
1. Run the attached test script.

Actual results (typically):
Failed after: 2

Expected results:
An endless loop of mdadm -C and mdadm -S.

Additional info:
It is reproducible only sometimes. When I clear both RAID members with 'dd', I usually get correct behavior. I have a complex test suite for the OpenLMI project which creates/destroys MD RAIDs, partitions, LVs, various filesystems, etc., and after the test suite finishes, I can reproduce the bug reliably. It seems some stray data on the RAID members or in memory makes the bug reproducible.
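Clearing both RAID members with 'dd', as mentioned in the additional info, presumably means something along these lines; the device names and the amount wiped are illustrative assumptions only.

    # Wipe the start of both members so no stale MD metadata or filesystem
    # signatures survive; with freshly cleared members the symlink removal
    # usually behaves correctly, per the report above.
    dd if=/dev/zero of=/dev/sda bs=1M count=16
    dd if=/dev/zero of=/dev/sdb1 bs=1M count=16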