Bug 876789
Summary: | FormatDestroyError: error wiping old signatures from /dev/md/Volume0_0p2: 1 | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
Component: | lvm2 | Assignee: | LVM and device-mapper development team <lvm-team> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 18 | CC: | agk, bmarzins, bmr, dwysocha, g.kaviyarasu, heinzm, Jes.Sorensen, jkortus, joerg.lechner, jonathan, kay, kparal, kzak, lpoetter, lvm-team, mschmidt, msnitzer, prajnoha, prockai, robatino, rsterenb, rvykydal, sbueno, stephent98, vanmeeuwen+fedora, zkabelac |
Target Milestone: | --- | Keywords: | CommonBugs |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Unspecified | ||
Whiteboard: | abrt_hash:ef387c87fc27789661a312ada280a0285dba4e5df7e4e8fca1f49fa37c659b4b https://fedoraproject.org/wiki/Common_F18_bugs#old-signatures-crash AcceptedBlocker | ||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2012-12-19 00:13:19 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 752661 | | |
Attachments: |
Description
Adam Williamson
2012-11-15 00:23:09 UTC
Created attachment 645262 [details]
File: anaconda-tb
Created attachment 645263 [details]
File: product
Created attachment 645264 [details]
File: type
Created attachment 645265 [details]
File: ifcfg.log
Created attachment 645266 [details]
File: storage.log
Created attachment 645267 [details]
File: version
Created attachment 645268 [details]
File: environ
Created attachment 645269 [details]
File: executable
Created attachment 645270 [details]
File: anaconda.log
Created attachment 645271 [details]
File: syslog
Created attachment 645272 [details]
File: hashmarkername
Created attachment 645273 [details]
File: packaging.log
Created attachment 645274 [details]
File: cmdline_file
Created attachment 645275 [details]
File: release
Created attachment 645276 [details]
File: program.log
Re-testing install to an IMSM RAID-1 array, with the short-term fix for bug 873576 in place, to a newly-created array. I seem to hit this bug dependably. It may be an interaction with the resync process, but F17 did not have that problem: it can install just fine to a newly-created array while the array is re-syncing.

Package: anaconda-18.29
OS Release: Fedora release 18

Nominating as a blocker: this seems to be a showstopper for the fairly simple case of creating an IMSM RAID-1 array and installing to it.

This appears to be the problem, from program.log:

19:16:48,018 INFO program: Running... wipefs -a /dev/md/Volume0_0p2
19:16:48,026 ERR program: wipefs: error: /dev/md/Volume0_0p2: probing initialization failed: Device or resource busy

Note that the last test was with a freshly-created array, so it had no partitions or anything. I think anaconda is trying to run wipefs on a partition it just created, for some reason.

Created attachment 646083 [details]
program.log from successful f17 install (for comparison)
I did an f17 run for comparison, attaching program.log and storage.log. It looks like f17 did call wipefs too, but the call succeeds, no problems.
Created attachment 646096 [details]
program.log from successful f17 install (for comparison)
I did an identical F17 install (freshly created RAID-1 array, LVM autopart) for comparison. It works fine. Attaching program.log and storage.log. Note wipefs calls do happen, but work fine, no 'busy' errors.
Created attachment 646097 [details]
storage.log from successful f17 install (for comparison)
dlehman asked for the contents of /dev/md/ when the bug happens, though I think he may have been thinking of another bug when he asked for that. But just to confirm, /dev/md/Volume0_0p2 does exist (the contents are Volume0_0, Volume0_0p1, Volume0_0p2 and imsm0). Note that the error is that it's 'busy', not that it doesn't exist.

Oh, and note that resync is in progress in both the F17 and F18 cases when partitioning happens.

Further info asked for by dlehman: after the bug happens, manually running the wipefs command from ctrl-alt-f2 results in the same error. I'll attach the result of running 'ps aux' at that point. I will wait out the resync process and see if the command succeeds after that.

Created attachment 646653 [details]
ps aux output after the bug happens (and while the command is still erroring when run manually)
After waiting out resync, wipefs still fails. And if I reboot and re-attempt the install, without resync happening at all during the install process, I still hit this traceback. So it's not resync interfering; it's something else. This doesn't happen with a softraid install on the same system, btw. Pretty different case, but we thought it might be useful data.

Can you provide the output of /proc/mdstat? Thanks, Jes

In addition, please check that mdmon is indeed running: ps aux | grep dmon. Don't grep for 'mdmon', as it may appear as '@dmon' if launched with --off-root. Jes

Adam, when you say the array never had any data on it, do you mean it had never been created as a RAID array and you just created the container in the BIOS, or did you have the RAID-1 array already created? Testing with an IMSM setup here with two RAID-1 arrays on it, I don't see the problem you report. Jes

Note that here it runs wipefs like this: wipefs -a /dev/md/IMSM00_0p3

jes: happens either way, for me. But mostly I've been testing with a completely freshly created array: go into the Intel RAID BIOS, remove the existing array, reboot (probably not necessary, but eh), create a new array, boot straight into the installer and try to install.

jes: that's about the same. I believe the Volume0 / IMSM00 difference is just how our BIOSes by default name the created array. I could change mine to anything else at the creation step if I wanted to.

Discussed at the 2012-11-19 QA meeting, acting as a blocker review meeting. Rejected as a blocker on the basis that it's working for Jes; we're pretty generous with the RAID criteria for Beta, and read them more as 'it at least has to work for someone' than 'it must work for everyone'. This can be re-visited if we figure out what's going wrong in more detail and it looks serious. Accepted as NTH.

Adam: OK, I'll try to blow away the arrays I have on this scratch box and report back.

Adam: I still need your /proc/mdstat output from just after you get the error.
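The resync state that keeps coming up in this exchange can be checked mechanically from /proc/mdstat, which shows a progress line containing 'resync' or 'recovery' while an array is synchronizing. A minimal sketch of such a check (the function name is made up for illustration; the optional file argument exists only so the check can be exercised against a saved copy of mdstat):

```shell
#!/bin/sh
# md_resync_active [MDSTAT_FILE]
# True (exit 0) if any md array listed in MDSTAT_FILE (default
# /proc/mdstat) is currently resyncing or recovering, i.e. the file
# contains a progress line such as "resync = 12.6% (...)".
md_resync_active() {
    grep -Eq 'resync|recovery' "${1:-/proc/mdstat}"
}
```

Running something like this from a ctrl-alt-f2 shell before and after reproducing the traceback would help separate the 'resync interferes' theory from the 'something holds the device open' theory.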
However, I have a theory which matches my rather limited testing here: if the array is clean, the installation process succeeds fine; if the array is dirty and in the process of being resynchronized, wipefs fails for me. Jes

Nope. I already tested that, see above. I've tested running the wipefs command manually after resync completed, and also waiting for resync to complete, rebooting, verifying that resync was not running, and trying the install: I hit the bug, and manually running wipefs fails. See https://bugzilla.redhat.com/show_bug.cgi?id=876789#c26 . That was what I explicitly spent three hours testing that day.

Oh, and just for comprehensiveness, I also checked that it works in F17 even if resync is in progress: the attached logs from an F17 install are from an install which was run with resync in progress the whole time. The entire install succeeds, including the same wipefs calls that F18 uses. My testing suggests pretty solidly that this has nothing to do with resync.

Hmm, maybe I know where the problem is: wipefs opens devices with O_EXCL (in F18 and rawhide). If there is a partition table or mounted filesystem, then it returns EBUSY. See bug #872831 for more details.

Karel, I just came to the same conclusion here. The problem is as follows: wipefs uses O_EXCL to open the device it is trying to wipe. However, O_EXCL will fail on a device if it is already open, per 'man 2 open'. If one has existing LVM volume groups on the disk(s), anaconda will assemble those, and they will leave the open count > 0, causing the O_EXCL open to fail.

I just managed to reproduce the problem here. If I install onto LVM, then try to do another install onto LVM, it fails for me too. If I jump to the shell and manually wipe out the PV and VG, I can run wipefs again on the partition. It looks like anaconda needs to make sure it doesn't leave anything assembled on a device it is intending to run wipefs on.
I don't think this is in fact limited to IMSM RAID arrays; it ought to happen on anything where there can be LVM devices on top, which raises the severity of the problem. Jes

rjones suggested --force for wipefs to avoid O_EXCL. I think especially for anaconda, where we want to delete everything, it seems like a good idea. I'll probably improve wipefs; the O_EXCL is pretty useless on devices with a partition table (e.g. sda), because for the kernel such a device is in use and open returns EBUSY.

It might be useful to have --force; however, I still believe that trying to wipe out the underlying RAID array while LVM volumes are still assembled on top of it is a rather serious problem here. Bypassing the issue by using --force would just risk causing data corruption IMHO.

I agree :-)

(In reply to comment #40)
> If one has existing LVM volume groups on the disk(s), Anaconda will assemble
> those, and they will leave the open count > 0, causing the O_EXCL to fail.

Anaconda is not activating any LVM in the attached logs. Are we talking about LVM on top of degraded md? The case of reinstalling over a sane LVM configuration has been done more than a few times, so I rather doubt that this is a case of anaconda just not having a clue how to manage an install with preexisting LVM.

We're talking about LVM volumes created on top of a clean IMSM RAID array. If I do an install to IMSM RAID-1 with LVM, then try to re-do the same install, it will fail. You can see it happen quite easily by performing the first install, then re-launching the installer, selecting the same device, 'reclaiming' the space used, and then going to the point just before you launch the install. At this point, jump to the shell and check /proc/partitions or /dev/mapper: you'll see the old LVM devices are still visible. Jes

Please provide the following logs from the scenario you just described: /tmp/syslog /tmp/storage.log /tmp/program.log /tmp/anaconda.log Thanks.
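The manual recovery described above (wipe out the PV and VG, then re-run wipefs) amounts to deactivating whatever LVM has assembled on the partition before wiping it. A hedged sketch of that ordering follows; the helper name is invented, the pvs/vgchange/wipefs invocations are standard LVM and util-linux commands, and the command names are overridable purely so the logic can be dry-run tested without touching real block devices:

```shell
#!/bin/sh
# deactivate_then_wipe DEV
# Find any volume group with a physical volume on DEV, deactivate it,
# then wipe the signatures on DEV. PVS/VGCHANGE/WIPEFS may be overridden
# (e.g. with stubs) for dry runs; they default to the real tools.
deactivate_then_wipe() {
    dev=$1
    pvs_cmd=${PVS:-pvs}
    vgchange_cmd=${VGCHANGE:-vgchange}
    wipefs_cmd=${WIPEFS:-wipefs}

    # Ask LVM which volume group(s) claim this device as a physical volume.
    vgs=$("$pvs_cmd" --noheadings -o vg_name "$dev" 2>/dev/null | tr -d ' ')
    for vg in $vgs; do
        # Deactivating the VG closes the LV nodes that hold the device open.
        "$vgchange_cmd" -an "$vg" || return 1
    done
    # With nothing stacked on top, the O_EXCL open inside wipefs can succeed.
    "$wipefs_cmd" -a "$dev"
}
```

With the stacked volumes deactivated first, nothing keeps the partition's open count above zero, so the EBUSY failure discussed in this bug should not occur.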
So this is interesting info, but I'm still not sure it's capturing my case. Remember, each time I test, I wipe the existing RAID-1 array and create a new one. Are you saying that somehow there's LVM config data from some previous install remaining even after I do that? Or is the stuff above not actually what's causing the problem for me?

David: I'll get it to you, but I can't do it before Thursday. It's 22:40 here now and I am not near the test box tomorrow.

Adam: That would explain things, especially if you create the same RAID device over and over again. It should be easy to check, though: just jump to the shell before clicking go, and check what you have in /dev/mapper. Jes

Tried to do a custom btrfs install over the top of a previous software RAID install. Package: anaconda-18.29.2 OS Release: Fedora release 18-Beta

So that last time I hit it a different way: I did a software RAID install, and then I tried to do an install with a different layout over the top of that install. Basically, it looks like wipefs is no longer reliable for destroying existing RAID devices. I don't believe LVM was involved, but I may be missing something. Will attach storage.log.

Created attachment 649549 [details]
storage.log from the install-over-softRAID case
As you seem to be able to hit this with various 'install over existing RAID' cases, commonbugs. We should document that you need to zero out the drives...

Created attachment 649692 [details]
anaconda.log with old lvm
Created attachment 649693 [details]
program.log with existing lvm
Created attachment 649694 [details]
storage.log with old lvm
Created attachment 649695 [details]
syslog with old lvm
Created attachment 649696 [details]
ls -l /dev/mapper output confirming the old lvm volumes present
Adam, it can be a real headache for people to zero out their drives. There are going to be users who simply wish to reinstall on top of an existing install rather than doing an upgrade. They will get hit by this, and it may not be that simple for them to zero out the drives. It's still fine to document 'if a RAID install crashes, try zeroing your drives as a workaround', but I do not think it should be documented as a must-do. Jes

Yes, that was the idea.

I have started a netinst+inst.repo=http:// installation. This happened when "creating ext4 on /dev/sda1". In the partitioning step I used guided partitioning and marked all existing partitions to be deleted. Package: anaconda-18.29.2 Architecture: i686 OS Release: Fedora release 18-Beta

Created attachment 649905 [details]
anaconda-tb
Traceback from the crash. I _wasn't_ using any RAID, just a single standard disk.
Can't reproduce, because my existing layout was destroyed in the process.

From Kamil's log:

10:41:40,761 INFO program: Running... wipefs -a /dev/sda2
10:41:40,790 ERR program: wipefs: error: /dev/sda2: probing initialization failed: Device or resource busy

So we can hit this with no RAID involved at all, it seems...

Adam: I can confirm this finding; reproducing this without RAID is rather easy. It matches my theory smack on.

1) Do a fresh install onto a fresh disk, and click yes to encrypt it
2) Reboot back into the installer, now select to install back onto the same disk
3) Select reclaim of all the partitions that were created in the previous install
4) Watch the fireworks

I'll post logs in a minute. Speaking of debug output: please add lsblk output to the list of default generated output when anaconda runs. Jes

Created attachment 650285 [details]
syslog luks
Created attachment 650286 [details]
anaconda.log luks
Created attachment 650287 [details]
anaconda-tb luks
Created attachment 650288 [details]
program.log luks
Created attachment 650289 [details]
storage.log luks
Created attachment 650290 [details]
lsblk luks
Guys,

Running... wipefs -a /dev/sda2
wipefs: error: /dev/sda2: probing initialization failed: Device or resource busy

lsblk:
`-sda2 8:2 0 29.5G 0 part
  |-fedora-swap (dm-1) 253:1 0 7.8G 0 lvm
  `-fedora-root (dm-2) 253:2 0 21.8G 0 lvm

You cannot call wipefs in a situation where the device (sda2) is actively used; just deactivate the LVM volumes first.

Thanks Jes! This is most certainly a Final blocker, proposing as one. "The installer must be able to create and install to any workable partition layout using any file system offered in a default installer configuration, LVM, software, hardware or BIOS RAID, or combination of the above" https://fedoraproject.org/wiki/Fedora_18_Final_Release_Criteria

Jes: "please add lsblk output to the list of default generated output when anaconda runs" - can you please file a separate bug for that? The request will likely get lost if it's just mentioned as part of a separate bug.

(In reply to comment #73)
> Jes: "please add lsblk output to the list of default generated output when
> anaconda runs" - can you please file a separate bug for that?

Done, BZ#879940. Jes

Apparently someone started automatically activating all LVM between TC8 and TC9. Also from Kamil's logs:

15:41:36,945 ERR systemd-udevd: '/usr/sbin/lvm pvscan --cache --activate ay --major 8 --minor 4' [1272] terminated by signal 6 (Aborted)

That is almost certainly the problem. Comment 61 appears to be caused by us trying to wipe any potential old metadata from sda2 immediately after creating the partition. It fails because something is running on that partition, triggered automatically by its appearance, thereby preventing us from wiping it. What a joke.

Here's a theory on why there's no LVM data in sda2's udev db entry.
From the syslog provided in comment 65:

09:26:51,257 INFO kernel:[ 16.113655] lvm[928]: segfault at 38 ip 0000000000442ca0 sp 00007fff79186d18 error 4 in lvm[400000+e5000]
09:26:51,257 ERR systemd-udevd: '/sbin/lvm pvs --config 'global {locking_type=4}' --units k --nosuffix --nameprefixes --rows --unquoted --noheadings -opv_name,pv_uuid,pv_size,pv_pe_count,pv_pe_alloc_count,pe_start,vg_name,vg_uuid,vg_size,vg_free,vg_extent_size,vg_extent_count,vg_free_count,pv_count /dev/sda2' [928] terminated by signal 11 (Segmentation fault)
09:26:51,265 ERR systemd-udevd: '/sbin/lvm pvs --config 'global {locking_type=4}' --units k --nosuffix --nameprefixes --rows --unquoted --noheadings -olv_name,lv_uuid,lv_size,lv_attr /dev/sda2' [930] terminated by signal 11 (Segmentation fault)
09:26:51,265 INFO kernel:[ 16.121496] lvm[930]: segfault at 38 ip 0000000000442ca0 sp 00007fffeb7ac468 error 4 in lvm[400000+e5000]
09:26:51,580 INFO kernel:[ 16.436417] lvm[943]: segfault at 38 ip 0000000000442ca0 sp 00007ffff6ce5c98 error 4 in lvm[400000+e5000]
09:26:51,581 ERR systemd-udevd: '/sbin/lvm pvs --config 'global {locking_type=4}' --units k --nosuffix --nameprefixes --rows --unquoted --noheadings -opv_name,pv_uuid,pv_size,pv_pe_count,pv_pe_alloc_count,pe_start,vg_name,vg_uuid,vg_size,vg_free,vg_extent_size,vg_extent_count,vg_free_count,pv_count /dev/sdb2' [943] terminated by signal 11 (Segmentation fault)
09:26:51,588 INFO kernel:[ 16.444455] lvm[945]: segfault at 38 ip 0000000000442ca0 sp 00007fff2a205288 error 4 in lvm[400000+e5000]
09:26:51,589 ERR systemd-udevd: '/sbin/lvm pvs --config 'global {locking_type=4}' --units k --nosuffix --nameprefixes --rows --unquoted --noheadings -olv_name,lv_uuid,lv_size,lv_attr /dev/sdb2' [945] terminated by signal 11 (Segmentation fault)

I would recommend that you open a bug against lvm to address this.

"Apparently someone started automatically activating all lvm between TC8 and TC9."
That sounds extremely like it would've come from the RAID 'fix', https://bugzilla.redhat.com/show_bug.cgi?id=873576 , which went in for TC9 and could certainly have the effect described, if it's triggered by udev. (The 'fix' was to revert a udev patch, introduced with systemd 183, which kills processes forked off from udev rules; see https://bugzilla.redhat.com/show_bug.cgi?id=873576#c57 .) Still, we had mdmon activated by udev in F17, and this problem apparently didn't crop up. Did we suppress the activation of LVM by udev in previous releases somehow? Or did udev have nothing with which to activate LVM in previous releases?

CCing Lennart, Kay and the LVM folks. This is starting to look an awful lot like a clusterf**k; we might need to get the systemd, udev, anaconda and storage folks all together to thrash out how this ought to actually work.

Please try the latest lvm2-2.02.98-3.fc18 and see if you can still hit the problem (it should be in the testing repo at the moment). The version before used lvmetad by default if lvm.conf was not used and/or the use_lvmetad configuration setting was not defined. This should be disabled by default in F18 (only used if directly set in the conf). However, if this is the case, then we need to analyze the problem more, as it shouldn't fail even if lvmetad is used! If lvmetad is used, the activation is done automatically based on incoming udev events, and once the VG is complete, the volumes are all activated unless they're limited by the activation/auto_activation_volume_list lvm.conf setting.

Discussed at the 2012-11-29 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-11-29/f18final-blocker-review-1.1.2012-11-29-17.01.log.txt . Accepted as a blocker per the criterion "The installer's custom partitioning mode must be capable of the following: Creating, destroying and assigning mount points to partitions of any specified size using most commonly-used filesystem types ..." (though actually this is not custom-specific).
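For reference, the two lvm.conf settings discussed in that comment look roughly like this. The values shown are illustrative (disabling the lvmetad-driven auto-activation path entirely), not the Fedora 18 defaults:

```
# /etc/lvm/lvm.conf (fragment, illustrative values)
global {
    # When 0, lvmetad is not consulted, so incoming udev events do not
    # drive automatic activation via 'pvscan --cache --activate ay'.
    use_lvmetad = 0
}
activation {
    # If set, only volumes matching the list are auto-activated;
    # an empty list auto-activates nothing.
    auto_activation_volume_list = []
}
```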
Looks like we're waiting for a retest involving a tree with the version of the lvm2 package mentioned in comment #78, right?

Yes, but annoyingly, I'm having trouble reproducing this reliably. Jes' reproduction steps don't seem to be working for me. I found *another* bug, but couldn't make this one happen on demand...

Installing Fedora 18 Beta. LXDE Desktop, options: LXDE tools, administration, LibreOffice. Package: anaconda-18.29.2 Architecture: i686 OS Release: Fedora release 18-Beta

I'm installing x86_64 F18 in KVM on x86_64 F17. VM configuration: 1 vCPU, 2GB RAM, 50GB disk (virtio) and qxl video. When installing F18 Beta, I just ran anaconda: select timezone, packages, partitioning. The latter is quirky in itself, as I'm unable to reclaim existing partitions (the reclaim button stays disabled no matter what I do): I had to create a new LV or erase any partition on the LV before install. When the disk should be partitioned and the root password must be entered (not getting the chance to do that), the error occurs. Package: anaconda-18.29.2 OS Release: Fedora release 18-Beta

http://dl.fedoraproject.org/pub/alt/qa/20121205_f18-smoke4/ was built with lvm2-2.02.98-3.fc18, please retest with it. -3 is stable now.

If no-one is able to confirm directly, we can probably just close this after a week or two of no more duplicate reports with Final TC1 or later.

Created attachment 662386 [details]
F18 TC1 logs
Sorry for spoiling the party - I finally got my new IMSM test box assembled, pulled TC1, and I see what looks to me like the same problem.
Jes, the crash you hit is totally unrelated to this bug. Please open a separate bug.

anaconda 18.36 exception report
Traceback (most recent call first):
  File "/usr/lib/python2.7/site-packages/dbus/connection.py", line 651, in call_blocking
    message, timeout)
  File "/usr/lib/python2.7/site-packages/dbus/proxies.py", line 145, in __call__
    **keywords)
  File "/usr/lib/python2.7/site-packages/dbus/proxies.py", line 70, in __call__
    return self._proxy_method(*args, **keywords)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/network.py", line 467, in createMissingDefaultIfcfgs
    setting = con.GetSettings()
  File "/usr/lib64/python2.7/site-packages/pyanaconda/network.py", line 1028, in networkInitialize
    createMissingDefaultIfcfgs()
  File "/usr/lib64/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/threads.py", line 91, in run
    threading.Thread.run(self, *args, **kwargs)
DBusException: org.freedesktop.DBus.Error.UnknownMethod: Method "GetSettings" with signature "" on interface "(null)" doesn't exist

(In reply to comment #89)
> Jes, the crash you hit is totally unrelated to this bug. Please open a

The traceback from comment #89 is fixed in bug #884660 in version 18.37 of anaconda.

If this is already fixed, I won't bother with a new bug for it. Any ETA for when there will be an install image with the update I can pull for testing? Thanks, Jes

There have been a few already, but 18.37 keeps having much more serious bugs that would probably block you from testing this. Another smoke build with http://koji.fedoraproject.org/koji/buildinfo?buildID=372476 should show up soon; if it doesn't seem to have any obvious facepalms, I'll let you know.
smoke6 looks to be basically workable, so give it a shot: https://dl.fedoraproject.org/pub/alt/qa/20121212_f18-smoke6/

TC3 should also include the fix: https://dl.fedoraproject.org/pub/alt/stage/18-TC3/

I've spent the day running a number of tests with 18-TC3 on my IMSM test box, and I can no longer reproduce the install problem. Note that this is not the old box that I reproduced it on earlier; that is currently in pieces. Note that I still see the hang on reboot due to / not being unmounted correctly, which we see in BZ#834245, but it is less critical than not being able to install, and I think we can fix that post-release. I would opt for closing this bug for now. Jes

OK, let's go with that. 18.37.3 is now stable.