Bug 873576 - mdmon not run for existing RAID-1 arrays when anaconda is initializing, resulting in parted hanging while trying to inspect a newly-created Intel fwraid RAID-1 array
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 18
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Duplicates: 874012
Depends On:
Blocks: F18Beta, F18BetaBlocker
 
Reported: 2012-11-06 08:32 UTC by Adam Williamson
Modified: 2013-01-10 08:34 UTC (History)
CC List: 31 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-20 07:17:11 UTC
Type: Bug
Embargoed:


Attachments
strace of 'parted -s /dev/md126 p' on a fresh RAID-1 array booted from Beta TC7 desktop live (15.43 KB, text/plain)
2012-11-06 17:19 UTC, Adam Williamson
/var/log/messages from live boot after running stuck parted command (78.97 KB, text/plain)
2012-11-06 17:19 UTC, Adam Williamson
/var/log/messages after doing alt-sysrq-t with a stuck parted command running (twice) (923.87 KB, text/plain)
2012-11-06 22:11 UTC, Adam Williamson
Possible fix for mdadm to make it play nice with udev and systemd (864 bytes, patch)
2012-11-15 22:17 UTC, Doug Ledford

Description Adam Williamson 2012-11-06 08:32:10 UTC
To reproduce this problem:

* Create a RAID-1 array in Intel firmware RAID (I'm using an Asus P8P67 Deluxe with 2x500GB Seagate 7200.12 disks attached to the Intel SATA ports)
* Verify the RAID BIOS shows 'normal' as the state of the array when it flashes up during boot
* Attempt to boot a Fedora 18 image

The install will get stuck around the time of transitioning to X, with a black background and a cursor that you can move.

You can access the console and examine the logs. It is clear from the logs that things get stuck when anaconda tries to examine the RAID device - /dev/md126 - via parted. bcl suggested trying to run this command from the console:

parted -s /dev/md126 p

If you run that command, it will hang and cannot be ctrl-C'ed.

If you follow this exact procedure with a Fedora 17 (final) image, everything is fine. Install runs, you can pick the array as a target disk, and if you switch to a console and run that parted command, it returns a result quickly. In both the F17 and F18 cases the newly-created array starts resyncing at boot, but I'm told that's normal, and parted should be able to run even with the resync in progress (as it can in F17).
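For reference, the resync state can be watched from a console with standard md tools (the device name /dev/md126 is assumed from the steps above):

cat /proc/mdstat                  # array state and resync progress
mdadm --detail /dev/md126         # State: and Resync Status: fields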

The bug can also be reproduced by booting a Fedora 18 live image and running the parted command. That may be a better avenue for exploring the issue.

This bug is reproducible all the way back to Alpha TC4, the earliest 18 build I can actually get to boot. It also happens to the latest 18 build, Beta TC7 (this is where I first observed it).

The system never seems to 'come unstuck'. Twice I waited long enough for the resync process to complete, and even after that, the install was still stuck, and if I tried to run the parted command, it hung.

This does not appear to be an anaconda issue, as all anaconda does is use parted, and the issue can be reproduced just by booting live and running parted.

When rebooting from the stuck installer, the array always seems to be in 'dirty' state, even if you wait for the resync to complete and do *not* run parted before rebooting. The RAID BIOS will show the status as 'Initialize' rather than 'Normal'.

Alpha TC4 (the earliest known broken case) has kernel 3.6.0-0.rc2.git2.1.fc18. I don't have the package version of mdadm handy, but mdadm --version shows 3.2.5. Not sure what other version info may be useful.

F17 (the working case) has kernel 3.3.4-5.fc17, and mdadm 3.2.3.

Comment 1 Adam Williamson 2012-11-06 08:34:50 UTC
Proposing as Beta blocker as this obviously blocks install to Intel FW RAID-1. Oh, it's probably worth noting that RAID-0 worked fine.

Comment 2 Adam Williamson 2012-11-06 08:42:48 UTC
Just to confirm, with the same disks, following the same process but creating a RAID-0 array instead of RAID-1, install proceeds fine and I can run the parted command successfully. The issue is definitely limited to RAID-1. (As the controller only has two SATA ports, you can only do RAID-0 or RAID-1.)

Comment 3 Adam Williamson 2012-11-06 17:19:12 UTC
Created attachment 639474 [details]
strace of 'parted -s /dev/md126 p' on a fresh RAID-1 array booted from Beta TC7 desktop live

I created the RAID-1 array fresh, booted to F18 Beta TC7 desktop live to avoid interference by anaconda, installed strace, and ran 'strace -o parted.trace parted -s /dev/md126 p'. This is the result.

Comment 4 Adam Williamson 2012-11-06 17:19:56 UTC
Created attachment 639475 [details]
/var/log/messages from live boot after running stuck parted command

In case it helps, this is the contents of /var/log/messages on the live boot, after stracing the parted command.

Comment 5 Dave Jones 2012-11-06 18:52:04 UTC
What does 'cat /proc/$(pidof parted)/stack' say?

Comment 6 Josh Boyer 2012-11-06 19:16:11 UTC
Jes is really the best person to poke at this probably.

Have you tried other F17 kernels?  It would be good to know if something a bit, uh, newer than 3.3.4 works.  F17 should have 3.3 -> 3.6 available in koji.

Can you also maybe get the output of sysrq-t to show the backtraces of everything?  Maybe it will be more clear on what it's hung on.

Comment 7 Adam Williamson 2012-11-06 20:10:36 UTC
jwb: older kernels get trashcanned from koji after a while; I don't think anything much before 3.6 is available, IIRC. But I'll try and look.

Comment 8 Adam Williamson 2012-11-06 20:58:06 UTC
I'll provide the requested info in a minute, but in the mean time I was testing something else, which relates to the severity of the bug rather than the cause. Unfortunately, the bug also affects the case of an existing install to a RAID-1 array, which makes it more severe (it doesn't only affect a brand new array).

To test I just created the array as described above and installed F17 to it. I booted the installed F17 system and waited for resync to complete. Then I rebooted once to confirm the array was behaving sensibly - system booted, and did not start resyncing again. Then I booted to the f18 beta tc7 netinst, and the bug happened. There's no resync going on, but the installer is stuck at the black screen and logs indicate it's trying to scan the RAID array, just as reported.

The only good news is that rebooting cleanly from the stuck installer doesn't leave the array in 'dirty' status, as I was afraid it would - the BIOS still reports 'Normal' and booting the installed F17 system doesn't prompt a resync. Just hitting the 'reset' button, as people might do when they see the stuck X display, might leave it dirty, though.

Comment 9 Josh Boyer 2012-11-06 20:58:57 UTC
(In reply to comment #7)
> jwb: older kernels get trashcanned from koji after a while, I don't think
> anything much before 3.6 is available IIRC. but i'll try and look.

I wouldn't ask you to do something that is impossible.  Koji doesn't trashcan kernels that made it to stable updates.  There are lots of them there.

3.6 is easy to get, so for your testing pleasure:

3.5: http://koji.fedoraproject.org/koji/buildinfo?buildID=344970
3.4: http://koji.fedoraproject.org/koji/buildinfo?buildID=321678

There are stable version updates of those in koji as well.

Comment 10 Adam Williamson 2012-11-06 21:59:55 UTC
davej: it says:

[<ffffffff814a8245>] md_write_start+0xa5/0x1b0
[<ffffffffa04bda01>] make_request+0x41/0xc60 [raid1]
[<ffffffff814a12cd>] md_make_request+0xcd/0x200
[<ffffffff812bc632>] generic_make_request+0xc2/0x110
[<ffffffff812bcf65>] submit_bio+0x85/0x110
[<ffffffff812bf50d>] blkdev_issue_flush+0x8d/0xe0
[<ffffffff811c565e>] blkdev_fsync+0x3e/0x50
[<ffffffff811bce10>] do_fsync+0x50/0x80
[<ffffffff811bd080>] sys_fsync+0x10/0x20
[<ffffffff81629729>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

I'll attach the alt-sysrq-t output.

Comment 11 Adam Williamson 2012-11-06 22:11:39 UTC
Created attachment 639670 [details]
/var/log/messages after doing alt-sysrq-t with a stuck parted command running (twice)

Comment 12 Adam Williamson 2012-11-06 23:35:37 UTC
Something interesting: I tried with the latest F17 respin from Fedora Unity:

https://alt.fedoraproject.org/pub/alt/live-respins/

It has kernel 3.6.3-1.fc17. The problem does not manifest: I can call parted and it works fine. So this isn't a straightforward 'old kernel works, new kernel fails' thing. It's either in the kernel config, or another layer, or something more complex.

Comment 13 Brian Lane 2012-11-07 01:38:58 UTC
Can you install the f18 parted on there and give it a try?

Comment 14 Adam Williamson 2012-11-07 01:56:50 UTC
Michal, can you test and see if you are affected by this? You seem to have Intel BIOS RAID. It'd be useful to know if this is limited to me or not. Thanks!

Comment 15 Adam Williamson 2012-11-07 02:34:37 UTC
On updating parted/mdadm: it doesn't make any difference. I booted the updated F17 live image and updated first parted, then mdadm, to the F18 versions, trying the parted command at each point along the way; it worked every time. I tried the latest F18 packages from updates-testing, no change. So it's either not in those two packages, or updating the package post-boot isn't enough to produce the bug. I could try building an F17 live image with those two packages updated, I guess...

Comment 16 Michal Kovarik 2012-11-07 08:07:50 UTC
I have the same issue with RAID-1 (black screen, freeze after 'parted -s /dev/md126 p')

Comment 17 Chris Lumens 2012-11-07 15:30:02 UTC
*** Bug 874012 has been marked as a duplicate of this bug. ***

Comment 18 Adam Williamson 2012-11-07 17:53:02 UTC
Discussed 2012-11-07 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-11-07/f18beta-blocker-review-7.2012-11-07-17.03.log.txt . Accepted as a blocker per criterion "The installer must be able to create and install to software, hardware or BIOS RAID-0, RAID-1 or RAID-5 partitions for anything except /boot", in the case of Intel FW RAID-1 and RAID-5 (see dupe).

Comment 19 Josh Boyer 2012-11-07 18:38:54 UTC
(In reply to comment #12)
> Something interesting: I tried with the latest F17 respin from Fedora Unity:
> 
> https://alt.fedoraproject.org/pub/alt/live-respins/
> 
> It has kernel 3.6.3-1.fc17. The problem does not manifest: I can call parted
> and it works fine. So this isn't a straightforward 'old kernel works, new
> kernel fails' thing. It's either in the kernel config, or another layer, or
> something more complex.

Here's the diff of the kernel f17 vs. f18 kernel configs:

http://paste.stg.fedoraproject.org/1550/

basically, it's the modsign stuff, some MTD stuff, and a few DRM specific config options.  Nothing to do with RAID or block in general.  We also aren't carrying any patches related to this in either kernel.

Comment 20 SpuyMore 2012-11-07 21:21:58 UTC
(In reply to comment #18)
> Discussed 2012-11-07 blocker review meeting:
> http://meetbot.fedoraproject.org/fedora-qa/2012-11-07/f18beta-blocker-review-
> 7.2012-11-07-17.03.log.txt . Accepted as a blocker per criterion "The
> installer must be able to create and install to software, hardware or BIOS
> RAID-0, RAID-1 or RAID-5 partitions for anything except /boot", in the case
> of Intel FW RAID-1 and RAID-5 (see dupe).

I read you need additional reports on this issue. I experience exactly the same problem with RAID-1 array. I have an Intel Desktop Board DQ77MK. Let me know if I can be of any assistance.

Comment 21 Doug Ledford 2012-11-07 21:26:26 UTC
This sounds more like a case of an "mdmon is not running when parted attempts to write to the device" issue.  The fact that it blocks on md_write_start would seem to back that up.  For raid0 devices, we don't need mdmon to mark the device clean or dirty, it has no meaning.  For raid1 devices, we do.  Can you log into a hung machine and see if mdmon is running for the affected device?  If not, then we need to find out why not, as that is the most likely cause of this problem (md_write_start is waiting for mdmon to mark the superblocks dirty and then update the kernel that it has done so via writing to a sysfs entry for the drive, and the kernel is waiting forever because mdmon is not there to handle the write start).
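For anyone checking a hung machine, something like the following (plain shell; /dev/md126 and its imsm container md127 are assumed from the reports in this bug) should show whether an mdmon process exists for the container:

ps ax | grep '[m]dmon'        # is any mdmon running at all?
mdadm --detail /dev/md126     # which container the volume belongs to
cat /proc/mdstat              # the container typically shows up as md127 (inactive, imsm)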

Comment 22 Josh Boyer 2012-11-07 21:56:25 UTC
Doug, I noticed the md126_resync thread is stuck in D state after doing a sync_request.  So it seems like the array is still in resync too (my RAID knowledge is non-existent)?  Would that also back up your theory of mdmon not being around to do things, or is that perhaps another reason it could be hung?

Comment 23 Adam Williamson 2012-11-08 06:00:32 UTC
jwb: I mentioned the resync thing above. It's apparently normal that a newly created array goes into a resync when it's first booted, so you'll see indications of resync occurring in many tests / logs for this, but it does not actually appear to affect the issue. I've tested waiting out the resync, and letting the resync happen in a working F17 boot and *then* booting F18; in that case resync does not occur, but the bug still happens. I'm pretty sure the bug has nothing at all to do with resyncs.

Comment 24 SpuyMore 2012-11-08 10:31:07 UTC
(In reply to comment #21)
> This sounds more like a case of "mdmon is not running when parted attempts
> to write to the device" issue.  The fact that it blocks on md_write_start
> would seem to back that up.  For raid0 devices, we don't need mdmon to mark
> the device clean or dirty, it has no meaning.  For raid1 devices, we do. 
> Can you log into a hung machine and see if mdmon is running for the effected
> device?  If not, then we need to find out why not as that is the most likely
> cause of this problem (md_write_start is waiting for mdmon to mark the
> superblocks dirty and then update the kernel that it has done so via writing
> to a sysfs entry for the drive, and the kernel is waiting forever because
> mdmon is not there to handle the write start).

Confirmed, when I start mdmon --all in the ctrl+alt+F2 tty the installer continues and mdadm --query --detail /dev/md126 shows me the array is clean and resyncing.

Comment 25 Josh Boyer 2012-11-08 13:47:06 UTC
(In reply to comment #23)
> jwb: I mentioned about the resync thing above. 

Ugh, so you did.  OK, I should have read that more closely.  I stared at the code a while longer anyway, and it didn't seem that the paths parted and the resync thread were in would deadlock against each other.

Comment 26 Josh Boyer 2012-11-08 13:47:50 UTC
(In reply to comment #24)
> (In reply to comment #21)
> > This sounds more like a case of "mdmon is not running when parted attempts
> > to write to the device" issue.  The fact that it blocks on md_write_start
> > would seem to back that up.  For raid0 devices, we don't need mdmon to mark
> > the device clean or dirty, it has no meaning.  For raid1 devices, we do. 
> > Can you log into a hung machine and see if mdmon is running for the effected
> > device?  If not, then we need to find out why not as that is the most likely
> > cause of this problem (md_write_start is waiting for mdmon to mark the
> > superblocks dirty and then update the kernel that it has done so via writing
> > to a sysfs entry for the drive, and the kernel is waiting forever because
> > mdmon is not there to handle the write start).
> 
> Confirmed, when I start mdmon --all in the ctrl+alt+F2 tty the installer
> continues and mdadm --query --detail /dev/md126 shows me the array is clean
> and resyncing.

Well, that's good news.  Now, which component should this bug move to for not starting up mdmon?

Comment 27 Matthew Garrett 2012-11-08 16:22:05 UTC
mdmonitor isn't in any of Anaconda's systemd targets.

Comment 28 Adam Williamson 2012-11-08 18:14:04 UTC
On a fresh boot of F17 netinst I see 'mdmonitor-takeover.service', which runs mdmon --takeover --all, as 'active (exited)' after booting, which looks like what we want.

On a fresh boot of F18 netinst the mdmonitor-takeover.service service is present but shown as 'inactive (dead)'.

So I think we have our bunny.

Probing further...

Okay, yeah, I've got it. It's systemd packaging. When it was converted to the new macros / presets setup, mdadm didn't include presets to enable its services by default.

In the F17 spec file, there's this in %post:

%post
if [ $1 -eq 1 ] ; then
    /bin/systemctl enable mdmonitor.service mdmonitor-takeover.service >/dev/nu
fi

But in F18 there's only this, and no presets:

%post
%systemd_post mdmonitor.service mdmonitor-takeover.service

That's not going to enable the services. According to https://fedoraproject.org/wiki/Packaging:ScriptletSnippets#Systemd:

"If your package includes one or more systemd units that need to be enabled by default on package installation, they need to be covered by the default Fedora preset policy. The default fedora preset policy is shipped as part of systemd.rpm. If your unit files are missing from this list, please file a bug against the systemd package."

So moving this to systemd. systemd folks, can we please have mdmonitor.service and mdmonitor-takeover.service started by default? Thanks.
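For context, the %systemd_post macro only applies the preset policy on initial install; it expands to roughly the following (a sketch based on the systemd macros of that era, not a verbatim copy of the packaged macro):

if [ $1 -eq 1 ] ; then
        # Initial installation: apply the distribution preset policy
        /usr/bin/systemctl preset mdmonitor.service mdmonitor-takeover.service >/dev/null 2>&1 || :
fi

So without a matching preset entry, the services stay disabled.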

Comment 29 Adam Williamson 2012-11-08 18:30:18 UTC
Huh. https://bugzilla.redhat.com/show_bug.cgi?id=855372 is filed, and 90-default.preset has:

# https://bugzilla.redhat.com/show_bug.cgi?id=855372
enable mdmonitor.service
enable mdmonitor-takeover.service

but they definitely aren't enabled.

Comment 30 Adam Williamson 2012-11-08 18:41:12 UTC
So mjg59 points out that on non-live, anaconda doesn't use multi-user.target, which probably explains the non-live case.

The live case looks weirder. mdmonitor.service fails because /etc/mdadm.conf doesn't exist, but I'm not sure that's a problem. What we really want to run is mdmonitor-takeover.service , which is what runs mdmon. It looks like that gets started and then gets stopped three seconds later:

mdmonitor-takeover.service - Software RAID Monitor Takeover
	  Loaded: loaded (/usr/lib/systemd/system/mdmonitor-takeover.service; disabled)
	  Active: inactive (dead) since Thu, 2012-11-08 13:33:35 EST; 7min ago
	 Process: 373 ExecStart=/sbin/mdmon --takeover --all (code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/system/mdmonitor-takeover.service

Nov 08 13:33:32 localhost.localdomain systemd[1]: Started Software RAID Monitor Takeover.
Nov 08 13:33:35 localhost.localdomain systemd[1]: Stopping Software RAID Monitor Takeover...
Nov 08 13:33:35 localhost.localdomain systemd[1]: Stopped Software RAID Monitor Takeover.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Comment 31 Matthew Garrett 2012-11-08 18:58:37 UTC
Ignore what I said in Comment 27 - the image I was looking at was too old. I believe that the target is present and correct in current images, which just leaves us with the failure of -takeover.

Comment 32 Adam Williamson 2012-11-08 19:02:19 UTC
OK, so I was planning to take a mental health day today so I'm damn well going to do it. But here's what I've got.

For the live case, livesys.service - which is not native systemd but old-skool sysv, /etc/init.d/livesys - tries to disable and stop mdmonitor-takeover.service , which kind of ties in with comment #30. Only that code is in f17 too, but the bug doesn't seem to affect f17. So I'm not at the bottom of this yet. My next step would be to look at the systemctl output on a system which actually has RAID, so far I've just been looking at live images for convenience. Please, if other folks can take a look at the two cases - live and non-live - and see if you can figure out what's going on, that'd be aces.

Comment 33 Matthew Garrett 2012-11-08 19:57:04 UTC
No, actually, I'm wrong in comment 31 - the mdmonitor links aren't present in Beta-TC6 either. So definitely something wrong in the build.

Comment 34 Jóhann B. Guðmundsson 2012-11-08 20:00:45 UTC
Anaconda runs its own systemd implementation, so both Anaconda and "Spins" are in control of their own preset policy; it could just be the case that Anaconda's own preset file/policy needs to be fixed/updated (and spins as well).

If those are present, I would check the udev rules and units, and whether /etc/mdadm.conf exists and is correctly configured.

Comment 35 Jes Sorensen 2012-11-09 10:50:55 UTC
Just to be clear on this issue - is this caused by anaconda getting its
list wrong, or do I need to update mdadm with something additional as well
as per comment #28?

Jes

Comment 36 Adam Williamson 2012-11-09 18:41:51 UTC
Jes: right now we're not sure. It looked clear-cut to me for a bit but then it got complicated again :). To summarize:

We know for sure the problem happens because mdmon isn't running.
We don't know for sure _why_ mdmon isn't running in either the non-live or the live case. It's possible the two cases are the same, and it's possible they're different.

I'm going to look into it more today. I think the chances are high that it'll turn out to need fixing in lorax or anaconda or systemd and/or livecd-tools and/or spin-kickstarts or something, and the chances of it actually being in mdadm are fairly slim, but they're not 0%, so please stay handy :) thanks!

Comment 37 Jes Sorensen 2012-11-13 09:41:59 UTC
Adam,

Ok let me know how it goes and if you want me to try anything.

I tried reproducing this locally using TC8, but anaconda never launches,
except I end up with a message that the pane is dead when I switch to VC1.
This is worse than Alpha which would explode when it detected that I had
pre-existing raid arrays.

Jes

Comment 38 Václav Pavlín 2012-11-13 14:21:09 UTC
The service mdmonitor-takeover is not enabled although it is in the presets (if you call systemctl preset mdmonitor-takeover.service, it will be enabled).

It seems to me that dracut does not enable the service although it should - Harald, what do you think?
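A quick way to compare the preset policy with the actual state on a booted image (plain systemctl commands, nothing specific to this bug):

systemctl is-enabled mdmonitor-takeover.service   # what the image actually has
systemctl preset mdmonitor-takeover.service       # apply the preset policy to just this unit
systemctl is-enabled mdmonitor-takeover.service   # should now report 'enabled' if the preset says so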

Comment 39 Adam Williamson 2012-11-14 01:42:56 UTC
vaclav: it's important to differentiate between the two cases of the bug (see comment #36). I'm inferring that you're talking about the non-live case, but it would probably be good to state that explicitly in further discussion.

Comment 40 Adam Williamson 2012-11-14 09:13:28 UTC
OK, so looking at the non-live case on F17, here's what I see.

'mdmon md127' is running.

the mdmonitor-takeover.service is listed as 'active (exited)' with status SUCCESS, which is what I'd expect in a 'working' case.

mdmonitor.service is inactive, so we can probably ignore that so far as this bug is concerned; it doesn't appear to be needed.

/etc/systemd/system/sysinit.target.wants/mdmonitor-takeover.service
/etc/systemd/system/multi-user.target.wants/mdmonitor.service

so that implies sysinit.target.wants is getting hit but multi-user.target.wants isn't, so we're in something below multi-user.target here, I guess.

I'll compare f18.

Comment 41 Adam Williamson 2012-11-14 09:48:45 UTC
OKAY!

I think I have it this time.

Come on down, lorax. From share/runtime-postinstall.tmpl , around line 29:

## Disable unwanted systemd services
systemctl disable systemd-readahead-collect.service \
                  systemd-readahead-replay.service \
                  mdmonitor.service \
                  mdmonitor-takeover.service \
                  lvm2-monitor.service

Thank you, Mr. Will Woods:

commit 3a75b2e07dcc076b357cb0d9fca69efea5464050
Author: Will Woods <wwoods>
Date:   Tue Jun 19 15:14:27 2012 -0400

    add 'systemctl' command and use it in postinstall
    
    The 'systemctl' command can be used to enable, disable, or mask systemd
    units inside the runtime being modified. Modify runtime-postinstall.tmpl
    to use the 'systemctl' command.
    
    We also no longer remove quota*.service or kexec*.service, since
    these aren't enabled by default. And systemd-remount-api-vfs.service
    should work correctly now, so we can leave it alone as well.

That covers the non-live case at least. The live case I will look at tomorrow or something. I have to wake up to go snowboarding in five hours.

Comment 42 Adam Williamson 2012-11-14 09:53:28 UTC
note that I checked F18 env and all is consistent with the above: everything looks like mdmonitor-takeover.service should be enabled - all the dependencies of the targets are correct, 90-default.preset lists it to be enabled, and the requirements and %post script of the mdadm package are correct - but it mysteriously is *not* enabled. This of course lines up perfectly with a scenario where it is properly enabled at package install time, then lorax comes along and disables it.

should be easy enough to comment out that line in lorax and spin a test build, if someone wants to do it (I've never been able to build non-live images successfully). If someone builds such an image, I'll test it.
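For clarity, the kind of change meant is simply dropping the takeover service from the disable list in share/runtime-postinstall.tmpl - an untested sketch of the edit, not the patch that eventually went in:

## Disable unwanted systemd services
systemctl disable systemd-readahead-collect.service \
                  systemd-readahead-replay.service \
                  mdmonitor.service \
                  lvm2-monitor.service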

Comment 43 Adam Williamson 2012-11-14 10:05:09 UTC
I've tried to track back the history of that line, and in f17, it seems like it was basically this line in share/runtime-postinstall.tmpl:

removefrom mdadm /lib/systemd/system/*

which was clearly intended to remove all of mdadm's services. Only, as a later commit on the master branch shows, it was broken by the /usr move. So mdmonitor-takeover.service is actually present and active _by accident_ in F17, it seems.

Ultimately, this code intending to remove mdadm's services was introduced in this commit:

commit 6962fe3e80037cb266e122c709c772d7d30180ce
Author: Will Woods <wwoods>
Date:   Tue Aug 9 17:39:07 2011 -0400

    move systemd cleanup to runtime-postinstall
    
    also make sure we clean a bunch more unneeded services, but don't bother
    deleting target files that would just be ignored anyway.
    
    also also, delete everything in /etc/systemd/system/default.target.wants
    so that we don't get readahead stuff in anaconda.

I guess mdmonitor-takeover.service was considered to be one of the 'bunch more unneeded services', and only survived in F17 by sheer luck (/usr move). Crazy, huh?

Comment 44 Martin Gracik 2012-11-14 11:05:09 UTC
So do we want to enable it or not?

We also disable other services, anything we need to enable with it?

## Disable unwanted systemd services
systemctl disable systemd-readahead-collect.service \
                  systemd-readahead-replay.service \
                  mdmonitor.service \
                  mdmonitor-takeover.service \
                  lvm2-monitor.service

Comment 45 Jes Sorensen 2012-11-14 14:08:55 UTC
Martin,

From the mdadm perspective, not running these bits scares me. I am pretty sure
you need the takeover service to run, either by letting systemd launch it or
having it launched manually by something else.

I also strongly recommend checking with the lvm2 guys to find out whether it is
safe to disable lvm2-monitor.service.

Jes

Comment 46 Jaroslav Reznik 2012-11-14 15:14:22 UTC
Is there a specific reason why these services are disabled? Or is it more a result of clean-up?

To solve this bug, at least mdmonitor-takeover service has to be run. Jes, is mdmonitor.service needed too? From Adam's comment it seems it does not run in F17 (from comment #40) and thus it's not needed.

For the lvm2-monitor service, I'd prefer to limit the scope of the fix to mdadm part only not to introduce a new bug related to LVM.

Comment 47 Doug Ledford 2012-11-14 15:31:04 UTC
Yes, we absolutely want the mdmonitor-takeover service, it is non-optional on any Intel firmware raid arrays.  The mdmonitor service is more optional, but also entirely benign if there are no raid arrays to monitor, so silly to turn it off.

My only lingering concern is that there might be an additional issue at play here.  In particular, if an array exists already, then the mdmonitor-takeover service *RE*-starts the array management daemon for that array, it is not intended to *start* the daemon (the purpose is to move the pid and sock files off of the initrd and onto the real root filesystem, as well as the binary executable, so that the old initrd can be freed...something we don't care so much about with the new way dracut unrolls the initrd and shutdown...so it's really only intended to be run after switchroot has been run).

If we create a new array with mdadm, and that array is an Intel array, and we run the array at the time we create it, then mdadm starts the mdmon process after completing the array startup.  Or if we assemble an existing imsm array using mdadm, mdadm starts mdmon for that array.  In all cases, if we run an array using mdadm, either as part of a create or a later assemble, mdadm runs mdmon on the array.  I'm concerned because it is almost sounding like either A) mdadm can't find the mdmon binary to execute and so it isn't getting run or B) maybe these arrays are being created by anaconda and run by anaconda, and anaconda is not starting mdmon (or maybe anaconda uses mdadm to create the arrays, but not run them, and then it assembles the arrays itself and starts the array itself without running mdmon on the array).

As a side issue, Jes, it might be worth renaming the takeover service to something that makes more sense and more accurately reflects what mdmon does.  Something like intel-fw-raid-manager-restart.service would at least not be mistaken for something that can be disabled with no consequences.  Although, given the dracut changes I mentioned above, is this service even necessary any more?  I can't remember the full details of the earlier discussions...

Comment 48 Doug Ledford 2012-11-14 15:33:10 UTC
Allow me to clarify the first sentence of my previous post: the takeover service is absolutely necessary if we don't have dracut unrolling the initrd at shutdown.  It's what allows any initrd to be unmounted.  However, for a live image, this isn't really an issue as we are booting from the image, correct?  So it won't matter there.  The real issue, as I see it, is that we need to confirm that mdmon is being properly *started* on array creation/assemble, not that mdmon is being *re*started and moved off the initrd image.

Comment 49 Adam Williamson 2012-11-14 15:40:23 UTC
dledford: I don't know the ins and outs, but the bottom line is that for non-live cases, on f17 mdmonitor-takeover.service is enabled, it starts, an mdmon process shows up, and the bug does not happen. on f18 mdmonitor-takeover.service is disabled, it does not start, no mdmon process shows up, the bug happens, and starting mdmonitor-takeover.service manually starts an mdmon process and fixes it.

I cannot find any code in anaconda that attempts to call mdmon directly, going all the way back to f13. There may be code that ought to result in it being called indirectly, I guess; that would be beyond a simple grep. But I do recall broadly that some storage stuff is disabled on the principle that anaconda will do the necessary itself at storage init/scan. Clearly, if that's the design here, it is not working.

None of this applies to live, where the case is clearly different, BTW. All of the above applied to the non-live case. live behaves differently. I was going to look into live today.

A scenario where anaconda ought to be making sure mdmon gets run itself somehow, but isn't, would be consistent with all the evidence we have so far, to be clear. Again this could have broken as far back as f17 and the accidental enablement of mdmonitor-takeover.service would have serendipitously concealed the bug.

Comment 50 Brian Lane 2012-11-14 16:08:53 UTC
Doug,

anaconda uses mdadm to manipulate the arrays.

Comment 51 Brian Lane 2012-11-14 17:47:18 UTC
Here's a test iso. Could those who are hitting this bug give it a try?

https://alt.fedoraproject.org/pub/alt/anaconda/bz873576-boot.iso

This was built using updates-testing and the bleed repo so hopefully everything on it is current.

Comment 52 Doug Ledford 2012-11-14 19:16:54 UTC
Adam, Brian: the mdmonitor-takeover service will start an mdmon process for all existing imsm arrays, even if one does not already exist (which it should if the array was started properly).  I knew in the past that we used mdadm to manipulate arrays; I thought it had been taken internal to anaconda with the storage backend rewrite, but I see I was wrong about that.  That then leaves me to believe that for whatever reason the mdadm binary is not starting the mdmon binary when assembling/running imsm arrays.  This may be as simple as the mdmon binary path encoded in the mdadm binary being wrong, as mdadm doesn't search paths for the binary - either it's where it expects it or it isn't.  The mdmonitor-takeover.service is different in that it calls mdmon directly and the path is encoded in the service file, not in the mdadm binary.  So there is the possibility that the bug is in the mdadm binary's understanding of the mdmon location and that running the takeover service works around the bug in mdadm.  It's worth inspecting.

Adam, an easy test for this is to start up an install on a machine with existing imsm arrays, jump to F2 prior to disk selection, then use mdadm -As to assemble all existing arrays and see if it started any mdmon processes.

Comment 53 Adam Williamson 2012-11-14 23:59:40 UTC
running mdadm -As on a system in the 'stuck' situation does not unstick the install and does not cause any mdmon processes to start. so there's possibly a problem in the mdadm mechanism for launching mdmon, whatever that is.

Comment 54 Adam Williamson 2012-11-15 01:04:46 UTC
Wow, so more news on this. This bug is pretty devious.

tl;dr summary: forget mdmonitor-takeover.service. It's just a big evil decoy. It has nothing to do with the bug. Something else fires mdmon on F17 and should be firing it on F18, but obviously is not. I don't know what yet, but we need to find out and fix it.

So all the stuff about mdmonitor-takeover.service? Forget it. Not relevant. The fact that there is a service which can start mdmon, which is enabled and runs on F17 (where the bug doesn't happen and mdmon runs) but is disabled and doesn't run on F18 (where the bug happens and mdmon doesn't run), and which actually starts mdmon and 'fixes' the bug if you launch it manually in F18?

Yeah, that has nothing to do with the bug.

Crazy, huh?

bcl spun up an F18 ISO that enables the service. I booted it. I checked the service is enabled, the service runs, all good. But it doesn't actually run mdmon, and the bug still happens. Running 'systemctl stop mdmonitor-takeover.service' then 'systemctl start mdmonitor-takeover.service' does start mdmon and 'fix the bug', but that's just a side note now. Just another red herring this evil bug is using to confuse us. :)

So I dug a little more, and checked F17 again. The mdmon process that runs doesn't actually seem to be associated with the mdmonitor-takeover service - it's not in its cgroup and it doesn't stop when you stop the service. Just to be double sure, I booted the F17 image with rd.break , disabled mdmonitor-takeover.service from the dracut environment, and continued the boot. Result? mdmonitor-takeover.service doesn't run...and mdmon still starts up and there's no problem.

So yeah: mdmonitor-takeover.service is a giant, evil, red herring. Forget about it. Something else should be starting mdmon, and did in F17, but it isn't now. I've no idea what, but that's where we are now. dledford, Jes, any ideas?

Comment 55 Jes Sorensen 2012-11-15 09:54:07 UTC
Adam,

mdadm will exec() mdmon directly, so the location of the binary is rather
important. From util.c:

int start_mdmon(int devnum)
{
        int i, skipped;
        int len;
        pid_t pid;
        int status;
        char pathbuf[1024];
        char *paths[4] = {
                pathbuf,
                "/sbin/mdmon",
                "mdmon",
                NULL
        };

If you have mdmon sitting anywhere else on the image, you'll be out of
luck basically.

Cheers,
Jes

Comment 56 Jaroslav Reznik 2012-11-15 13:36:04 UTC
(In reply to comment #55)
> Adam,
> 
> mdadm will exec() mdmon directly, so the location of the binary is rather
> important. From util.c:

bcl's F18 ISO:
# which mdmon
/sbin/mdmon

Comment 57 Michal Schmidt 2012-11-15 14:26:06 UTC
(In reply to comment #55)
> mdadm will exec() mdmon directly

Forking off daemons from user commands is dangerous in general.
If spawned like this, mdmon will run in the same cgroup as mdadm. systemd will therefore treat the process as a part of an unrelated service, or of a user's session. If it's done from a udev rule, udevd will kill it.
From the NEWS file for systemd+udev v183:

        * udev: when udevd is started by systemd, processes which are left
          behind by forking them off of udev rules, are unconditionally cleaned
          up and killed now after the event handling has finished. Services or
          daemons must be started as systemd services. Services can be
          pulled-in by udev to get started, but they can no longer be directly
          forked by udev rules.

Comment 58 Doug Ledford 2012-11-15 14:33:56 UTC
Interesting, as mdadm definitely wants to start daemons itself.  Not sure I like this change, but no doubt this is what our problem is.

Jes, I guess we need a new systemd service file, imsm-manager.service (or similar), that we can call multiple times.  Every time we would normally exec mdmon, we need to call systemctl start imsm-manager.service (OK, I'm really thinking this is just frikkin ridiculous that a low-level system binary is forced to call systemd over and over again just to start a daemon that is a specifically hardware-tied daemon, and as such really belongs to the udev rule, but I guess we missed the discussion on this decision until after it was made...)

BTW, I assume we can call this from the initrd context?

Comment 59 Jes Sorensen 2012-11-15 15:02:31 UTC
Doug,

'dislike' is an understatement of how I feel about this. I guess we need to
disallow the fork() and exec() calls too in glibc while we're at it :(

Having to call systemd like this is totally ridiculous as you point out, and
will result in something completely convoluted for the sake of obfuscating it.

:(

Right now I have zero idea how to even pass arguments to a systemd script
this way.

Jes

Comment 60 Michal Schmidt 2012-11-15 15:14:51 UTC
Just to provide some explanation: It's not done this way just for the fun of it. The primary reason for spawning daemons from systemd is to ensure clean, deterministic environment (and by that I don't mean just env, but various contexts, limits, state, ...).

Comment 61 Jes Sorensen 2012-11-15 15:22:12 UTC
Ok to be a bit more explicit, we need to be able to do the following:

- Launch mdmon with a variable number of arguments and different arguments
  passed in by mdadm. Having to write those arguments into a text file and then
  having the systemd script read in that file and pass the content as arguments
  when launching mdmon makes me want to use highly inappropriate language,
  which I shall refrain from putting in here.
- Run multiple instances of mdmon

This stuff is really black magic, so someone who understands it, please step
in and show us how.

Comment 62 Jaroslav Reznik 2012-11-15 15:45:53 UTC
Could we try (for now) to work around it by reverting the udev killing change in systemd (Michal, could you point us to the offending patch)? At least to check that it's the real issue, so we're not chasing ghosts again.

Comment 63 Michal Schmidt 2012-11-15 15:51:20 UTC
Interestingly, the idea with mdmon@.service units has been floated before, within this thread:
http://thread.gmane.org/gmane.linux.raid/30471

(In reply to comment #62)
The udev killing patch is this one:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=194bbe33382f5365be3865ed1779147cb680f1d3

Comment 64 Michal Schmidt 2012-11-15 16:04:31 UTC
A scratch build with the udev killing patch reverted:
http://koji.fedoraproject.org/koji/taskinfo?taskID=4692242

Comment 65 Michal Schmidt 2012-11-15 16:11:35 UTC
It does not seem that the hypothetical mdmon@.service would need to support entirely arbitrary command-line arguments. mdadm calls it using one of two forms:

                                if (__offroot) {
                                        execl(paths[i], "mdmon", "--offroot",
                                              devnum2devname(devnum),
                                              NULL);
                                } else {
                                        execl(paths[i], "mdmon",
                                              devnum2devname(devnum),
                                              NULL);
                                }

Only the devname is variable. So having two templates could be sufficient: mdmon@.service, mdmon-offroot@.service. The instance name would be the devname.
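A rough sketch of what such a template unit might look like (the name, options, and exact mdmon invocation here are assumptions, not the unit that was eventually shipped):

# /usr/lib/systemd/system/mdmon@.service (hypothetical)
[Unit]
Description=MD metadata monitor for /dev/%I
DefaultDependencies=no

[Service]
# mdmon daemonizes itself, so let systemd pick up the forked child
Type=forking
ExecStart=/sbin/mdmon %I

The --offroot variant would differ only in passing --offroot to mdmon.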

Comment 66 Jes Sorensen 2012-11-15 16:26:04 UTC
That is true, but you have no guarantee that some other daemon won't need
to be able to launch with arbitrary arguments in the future.

We could hack around this with two templates - it's gross, and it does nothing
to address the point Doug made above.

Comment 67 Michal Schmidt 2012-11-15 16:47:02 UTC
(In reply to comment #58)
> to start a daemon that is a specifically hardware tied daemon, and as such
> really belongs to the udev rule

There are ways to spawn services from udev rules:
..., TAG+="systemd", ENV{SYSTEMD_WANTS}+="mdmon@%k.service"
or:
..., RUN+="/usr/bin/systemctl start mdmon@%k.service"

Comment 68 Michal Schmidt 2012-11-15 16:48:34 UTC
(In reply to comment #67)
> ..., RUN+="/usr/bin/systemctl start mdmon@%k.service"

rather:
..., RUN+="/usr/bin/systemctl --no-block start mdmon@%k.service"
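Putting the two pieces together, a complete rule line might look roughly like this (the match keys here are placeholders to illustrate the SYSTEMD_WANTS mechanism, not mdadm's actual udev rules):

# hypothetical addition to the md rules file
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", ENV{MD_METADATA}=="imsm", TAG+="systemd", ENV{SYSTEMD_WANTS}+="mdmon@%k.service"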

Comment 69 Doug Ledford 2012-11-15 16:55:55 UTC
I get that there are ways to make it happen, that doesn't change how frikkin' ugly it is.  Nor does it change the fact that starting mdmon was the final step in a multi-step process internal to mdadm, where the final step should only be taken if all previous steps succeeded, so now what we really need to do is run mdadm, capture the exit code, and only run mdmon service on a successful exit code.  Again, compared to how mdadm used to do things, this is just frikkin' ugly.  And this is to satisfy someone else's idea of "these things should be in these cgroups and not in these other cgroups", when my original statement about this being a hardware tied daemon and as such it legitimately belonging to udev was meant to imply "as such it rightfully belongs to udev's cgroup and this method of starting it violates that by moving it to systemd's cgroup".

And this still begs the question as to whether or not systemd is up and running on the initrd, because if it isn't, then this is all a moot point, we *can't* do things this way no matter how much you want us to.  We have to be able to do whatever we are talking about doing here as part of bringing the root filesystem up.  If that doesn't work, then this solution is a non-starter.

Comment 70 Kay Sievers 2012-11-15 17:37:58 UTC
(In reply to comment #69)
> and as such it legitimately belonging to udev was meant to imply "as such it
> rightfully belongs to udev's cgroup and this method of starting it violates
> that by moving it to systemd's cgroup".

It "frikkin" does never belong into udev's cgroup. Udev's cgroup is udev's
one, not the one from a random other service.

Udev rightfully cleans up its _own_ cgroup from left-over processes that
did not clean up after themselves properly after the event handling. Udev is an
event handler, not a service manager.

Udev is not, and never was, an environment to start services, you can add
as many bad words here as you like, it does not make it more right. It is
an entirely broken idea to directly start long-running services from udev
rules ever.

The OS needs to be able to introspect services, needs to be able to track
failing services, needs to be able to handle supervision, propagate errors
to the logs/admin, and log with the proper context. None of that is reliably
possible from udev, and it is unlikely to ever be made so.

All the issues you see here are shortcomings of the design and integration of
mdmon/mdadm. What it does is just a wild hack, which needs an update to match
a modern service managing OS and today's reality.

Comment 71 Doug Ledford 2012-11-15 18:14:52 UTC
(In reply to comment #70)
> (In reply to comment #69)
> > and as such it legitimately belonging to udev was meant to imply "as such it
> > rightfully belongs to udev's cgroup and this method of starting it violates
> > that by moving it to systemd's cgroup".
> 
> It "frikkin" does never belong into udev's cgroup. Udev's cgroup is udev's
> one, not the one from a random other service.

Udev's cgroup is related to system hardware.  Although mdmon looks like a service, it really isn't.  It's a specifically hardware tied item that, unlike most other similar cases, is a user process instead of a kernel thread.  It should not be treated like a service, it should not be under systemd's control, it should be started when the specific hardware item that needs it is brought live in the kernel, and should go away when that hardware has been deleted from the kernel.  The long drawn out thread you had with Neil Brown over how to get systemd to ignore the mdmon application by putting an @ symbol at the beginning of the service name should have been a nice big red flag to you that this is *not* a systemd service and should not be lumped in as such.

> Udev rightfully cleans up its _own_ cgroup from left-over processes that
> did not cleanup itself properly after the event handling. Udev is an
> event handler not a service manager.

And on device add event, this device handler thread is created.  That it's in user space is an implementation detail, but does not change the fact that this is really a hardware tied thread, not a system level service.  The claim that no hardware activation should ever create a user space process that manages that hardware is a circular argument.  It's true only because you state it is true, not because it actually is.  The udev developers decided that this should be the case and then wrote code to enforce it.  I, however, have yet to hear an actual argument for *why* this is true, just simple assertions that it is.

> Udev is not, and never was, an environment to start services, you can add
> as many bad words here as you like, it does not make it more right. It is
> an entirely broken idea to directly start long-running services from udev
> rules ever.

Just like it is an equally bad idea to treat a hardware handler thread as though it is a system level service.  They are not the same and should not be treated the same.

> The OS need to be able to introspect services, need to be able to track
> failing services, needs to be able to handle supervision, propagate errors
> to the logs/admin, logs with the proper context. None of that is reliably
> possible from udev, and it is unlikely to ever be made so.

Likewise when I bring up the InfiniBand stack in my kernel and it creates a whole bunch of infiniband related kernel worker threads, udev is not responsible for those, and neither is systemd.  Mdmon should be treated just the same: udev and systemd shouldn't be touching it or worrying about it.  The mdmon application is subject to the same level of reliability as any other hardware management kernel thread: a failure of those threads usually means a kernel oops or worse, likewise mdmon is hardened against failure and it is not the responsibility of udev or systemd to ever worry about mdmon.  It was intentionally kept very simple and reliable for exactly that reason.  Putting mdmon under systemd's control as though it were any other service has been a source of headache and failure from the day systemd thought it knew best about when to start/stop/restart/shut down this user space hardware management thread that it erroneously treats as a system level service.

> All the issues you see here are shortcomings of the design and integration of
> mdmon/mdadm. What it does is just a wild hack, which needs an update to match
> a modern service managing OS and today's reality.

That systemd and udev do not have a concept of a user space hardware management thread does not make a user space hardware management thread a necessarily bad thing, it just means that systemd and udev don't want to acknowledge a valid method of accomplishing a specific goal.

All this arguing aside though, there is still the question as to whether or not systemctl start can be run from the dracut-created initrd.  My f18 test box is not on a remote console where I can run the necessary tests or I would check it myself.  In any case, if it can't, then this is all a moot point, as the suggested fix cannot, and will not, work.  Harald?  Does dracut create a working systemd environment before mounting the root filesystem or not?

If it does, then this is what we will need Jes:

1) a new systemd service unit for mdmon startup (probably two as pointed out, one for --offroot and one without)
2) modify mdadm to not call start_mdmon, but in all cases where we would have called start_mdmon, we fork/exec a call to systemctl start instead, using the correct service unit based upon the --offroot option (a rough sketch follows below)
3) add a hard Requires: systemd to the mdadm spec file as we now can not operate without it
4) make sure none of these changes go back to any older fedora releases by accident via a merge as they will break the old sysv init compatibility on those systems where people opted for sysv init instead of systemd

If the initrd doesn't have a running systemd environment well, then, we're just SOL.
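To make step 2 above concrete, here is a rough sketch of the sort of change being discussed - an illustration only, not the actual mdadm patch; the unit names are the hypothetical templates from comment #65:

/* sketch: replace the direct exec of mdmon with a request to systemd */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int start_mdmon_via_systemd(const char *devname, int offroot)
{
        char unit[64];
        pid_t pid;
        int status;

        snprintf(unit, sizeof(unit), "%s@%s.service",
                 offroot ? "mdmon-offroot" : "mdmon", devname);

        pid = fork();
        if (pid == 0) {
                execl("/usr/bin/systemctl", "systemctl",
                      "--no-block", "start", unit, NULL);
                _exit(127);                     /* exec failed */
        }
        if (pid < 0)
                return -1;
        waitpid(pid, &status, 0);
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}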

Comment 72 Adam Williamson 2012-11-15 19:32:24 UTC
Back on the topic of actually fixing this for Beta, I've been testing the systemd scratch build, and results look good for at least the live case (we're back to live and non-live probably being the same now, BTW). I built two live images, one with regular systemd 195-6, one with the patched systemd, otherwise identical. 'parted -s /dev/md126 p' hangs on the regular systemd build, and runs fine on the patched systemd build. So that's good. tflink is now building me a boot.iso to confirm the non-live case is fixed.

Comment 73 Kay Sievers 2012-11-15 19:51:48 UTC
(In reply to comment #71)
> Udev's cgroup is related to system hardware.

No, that's not right in that context, it's the processes udev tracks, and
not any hardware stuff that runs in the background. The "udev" cgroup defines
the state of the udev service itself.

> Although mdmon looks like a service, it really isn't.

No problem with that idea, but it makes it immediately clear that the
PIDs do not belong in any service tracking then, also not in the service
"udev". We cannot have PIDs be members of a service cgroup and "not be
a service" at the same time; it makes no sense really.

> It's a specifically hardware tied item that,
> unlike most other similar cases, is a user process instead of a kernel
> thread.  It should not be treated like a service, it should not be under
> systemd's control, it should be started when the specific hardware item that
> needs it is brought live in the kernel, and should go away when that
> hardware has been deleted from the kernel.  The long drawn out thread you
> had with Neil Brown over how to get systemd to ignore the mdmon application
> by putting an @ symbol at the beginning of the service name should have been
> a nice big red flag to you that this is *not* a systemd service and should
> not be lumped in as such.

No, that was only about the then "broken idea" of managing the rootfs
with a service that is stored on and runs from the same rootfs it manages,
which just can't really work, not even in theory.

It again all sounds like the real fix would be to end that mdmon experiment
and just do that from inside the kernel where it belongs.

As long as that does not happen, I have no conceptual problem if you want
to make all mdmon processes behave like the one for the rootfs; it would
imply moving the forked PIDs entirely out of any service cgroup, which
would make it look more like a kernel thread.

Comment 74 Doug Ledford 2012-11-15 20:29:53 UTC
(In reply to comment #73)
> (In reply to comment #71)
> > Udev's cgroup is related to system hardware.
> 
> No, that's not right in that context, it's the processes udev tracks, and
> not any hardware stuff that runs in the background. The "udev" cgroup defines
> the state of the udev service itself.

Fine, I can accept that.

> > Although mdmon looks like a service, it really isn't.
> 
> No problem with that idea, but it makes it immediately clear that the
> PIDs do not belong into any service tracking then, also not in the service
> "udev". We cannot have PIDs be member of a service cgroup and "not being
> a service" at the same time, it make no sense really.

I'm fine with that too.

> > It's a specifically hardware tied item that,
> > unlike most other similar cases, is a user process instead of a kernel
> > thread.  It should not be treated like a service, it should not be under
> > systemd's control, it should be started when the specific hardware item that
> > needs it is brought live in the kernel, and should go away when that
> > hardware has been deleted from the kernel.  The long drawn out thread you
> > had with Neil Brown over how to get systemd to ignore the mdmon application
> > by putting an @ symbol at the beginning of the service name should have been
> > a nice big red flag to you that this is *not* a systemd service and should
> > not be lumped in as such.
> 
> No, that was only about the that time "broken idea" to manage the rootfs
> with a service that is stored and runs from the same rootfs it mamages
> which just can't really work, not even in theory.

Sure it can.  And it *did* work back when we had sysv init.  It relies on the fact that the page cache can contain data even after the filesystem is mounted read-only and that data can still do useful things, but that's standard Unix behavior so not like we relied upon black magic voodoo.

> It again all sounds like the real fix would be to end that mdmon experiment
> and just do that from inside the kernel where it belongs.

Not my call, but not that I disagree with you either.

> As long as that does not happen, I have no conceptual problem if you want
> to make all mdmon processes behave like the one for the rootfs, it would
> imply moving the forked PIDs entirely out of any service cgroup, that
> would make it more look like a kernel thread.

The question is how we go about doing this.  I know nothing about the udev/systemd usage of cgroups, nor do I know how to move the application into a different cgroup that udev will then not kill (assuming the "kill services started from udev rules" patch is put back in place).  What is it that would be needed to get udev/systemd to leave the mdmon process alone?

Comment 75 Adam Williamson 2012-11-15 20:30:44 UTC
non-live case confirmed too. So this is definitely the bug here, and the workaround of disabling the systemd patch should fix it for beta. I'll test further to make sure I can actually install, but we seem to have nailed down this specific bug for sure.

If we're going with the workaround of squelching the systemd patch for Beta, can we please get a systemd build with that change - and no other - from 195-6 done and submitted as an update, please? Thanks!

Comment 76 Michal Schmidt 2012-11-15 21:19:22 UTC
(In reply to comment #74)
> I know nothing about the udev/systemd usage of cgroups, nor do I know how to
> move the application into a different cgroup that udev will then not kill

Write your PID number into /sys/fs/cgroup/systemd/tasks.
(You'll want to check for SELinux AVCs and ask Dan Walsh to allow them in the policy.)

Comment 77 Fedora Update System 2012-11-15 21:24:46 UTC
systemd-195-7.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/systemd-195-7.fc18

Comment 78 Kay Sievers 2012-11-15 21:30:03 UTC
(In reply to comment #74)
> > No, that was only about the that time "broken idea" to manage the rootfs
> > with a service that is stored and runs from the same rootfs it mamages
> > which just can't really work, not even in theory.
> 
> Sure it can.  And it *did* work back when we had sysv init.  It relies on
> the fact that the page cache can contain data even after the filesystem is
> mounted read-only and that data can still do useful things, but that's
> standard Unix behavior so not like we relied upon black magic voodoo.

No, it was based on pure luck. Daemons pull in libraries, and you never know
which ones and how many, because glibc has plugins. Libraries get updated, and
the original files are deleted. The running process pins the deleted files
until it exec()s or exit()s, which does not happen on a library update -- and
there goes the naive idea of "mounted read-only".

It's just not possible to remount read-only with busy, deleted files. Mdmon
here is relying on luck, hoping that black magic fixes it, but no such magic
exists. And _this_ is all real, standard UNIX. :) I keep calling it a wild
hack that cannot work. :)

Comment 79 Doug Ledford 2012-11-15 21:34:56 UTC
(In reply to comment #78)
> (In reply to comment #74)
> > > No, that was only about the "broken idea" of that time: managing the rootfs
> > > with a service that is stored on and runs from the same rootfs it manages,
> > > which just can't really work, not even in theory.
> > 
> > Sure it can.  And it *did* work back when we had sysv init.  It relies on
> > the fact that the page cache can contain data even after the filesystem is
> > mounted read-only and that data can still do useful things, but that's
> > standard Unix behavior so not like we relied upon black magic voodoo.
> 
> No, it was based on pure luck. Daemons pull in libraries, and you never know
> which ones and how many, because glibc has plugins.

I think you underestimate how many times we have done this in the past. We statically linked mdmon, and upon execution it pins all of its memory permanently into physical RAM so it can't be swapped out. No issues with shared libs, no issues with updates, no issues with being swapped out, etc. It wasn't luck, it was by design.
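
For illustration, that kind of pinning comes down to a single call at startup. The sketch below is hypothetical (it is not mdmon's actual code) and assumes mlockall(2) with the standard flags is the mechanism being described:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical sketch: lock all current and future mappings into physical
 * RAM so the process is never swapped out and never needs to fault code or
 * data back in from disk while it is running. */
int main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return EXIT_FAILURE;
    }

    /* ... the monitoring loop would run here, fully resident ... */
    return EXIT_SUCCESS;
}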

Comment 80 Kay Sievers 2012-11-15 21:39:17 UTC
Nope, it is still based on luck and hoping for the best. Glibc cannot really
be statically linked; it pulls in dynamic plugins even when you link it
statically. The whole issue only came to our attention because it failed that
way, and it is not fixable without a re-exec of all tools, which would have to
be triggered by any library change.

Comment 81 Doug Ledford 2012-11-15 21:46:22 UTC
If such is the case, had it been brought to our attention, we would have fixed it.  We had been statically linking mdadm and mdmon (and mpathd before that) for a long time (all the way back into the mkinitrd days when the apps on the initrd had to be static), and the makefile gives us a number of options for what to statically link against (glibc, uclibc, diet_gcc, klibc).  If glibc broke static linking, it really wouldn't have been any problem to work around.  And it's not really fair to say "glibc broke static linking behind your back, see, that's proof static linking can never work" because it isn't, it's just proof that glibc broke static linking in an unexpected way.

Comment 82 Doug Ledford 2012-11-15 22:17:17 UTC
Created attachment 646014 [details]
Possible fix for mdadm to make it play nice with udev and systemd

Since udev really doesn't want mdmon to be part of its cgroup, if the systemd cgroup exists, write our mdmon PID into its tasks file.
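
The patch itself is not quoted in this comment, but a rough, hypothetical sketch of the approach it describes (the helper name here is invented, not mdadm's) could look like this:

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define SYSTEMD_TASKS "/sys/fs/cgroup/systemd/tasks"

/* Hypothetical sketch: if the systemd cgroup hierarchy is mounted, write our
 * own PID into its top-level tasks file so mdmon leaves the cgroup created
 * for the udev rule and udev's cleanup no longer kills it. */
void escape_udev_cgroup(void)
{
    struct stat st;
    FILE *fp;

    if (stat(SYSTEMD_TASKS, &st) != 0)
        return;          /* no systemd hierarchy (e.g. very early boot) */

    fp = fopen(SYSTEMD_TASKS, "w");
    if (!fp)
        return;          /* possibly an SELinux denial; see comments 76 and 86 */

    fprintf(fp, "%d\n", (int)getpid());
    fclose(fp);
}

Writing into the root of the systemd hierarchy, rather than creating a dedicated cgroup, matches the suggestion in comment 76.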

Comment 83 Doug Ledford 2012-11-15 22:18:25 UTC
I would proceed with the beta using the new udev and the existing mdadm, and update to an mdadm that includes this possible fix and the udev that kills processes in its cgroup between beta and rc.  Just my $.02

Comment 84 Bill Nottingham 2012-11-15 22:20:03 UTC
If reparent-to-systemd's-cgroup (for lack of a better description) is intended to be a common operation, should we include a utility to do that so we don't have to add SELinux policy for each app that might need it?

Comment 85 Michal Schmidt 2012-11-15 22:28:38 UTC
(In reply to comment #83)
> I would proceed with the beta using the new udev and the existing mdadm, and
> update to an mdadm that includes this possible fix and the udev that kills
> processes in its cgroup between beta and rc.  Just my $.02

I agree.

(In reply to comment #84)
> If reparent-to-systemd's-cgroup (for lack of a better description) is
> intended to be a common operation, should we include a utility to do that so
> we don't have to add SELinux policy for each app that might need it?

It should be an unusual operation and it's good if the SELinux policy prevents ordinary daemons from doing it.

Comment 86 Doug Ledford 2012-11-15 22:40:03 UTC
I agree that this should be a rare thing, not a standard utility open to lots of processes.  Dan, can we get a policy update for f18 (and on) that allows mdadm to write to /sys/fs/cgroup/systemd/tasks?

Comment 87 Doug Ledford 2012-11-15 22:41:07 UTC
BTW, I still don't know if this fix will work on very early dracut images...will the systemd cgroup exist that early?

Comment 88 Michal Schmidt 2012-11-15 22:53:03 UTC
(In reply to comment #87)
> BTW, I still don't know if this fix will work on very early dracut
> images...will the systemd cgroup exist that early?

F18's initramfs uses systemd, so the systemd cgroup hierarchy will exist.
But even if it did not exist, that would just imply there's no need to escape anywhere. If the hierarchy is mounted later, preexisting tasks fall into the root of it.

Comment 89 Adam Williamson 2012-11-16 00:51:16 UTC
Note that when using the image that fixes this bug, I still can't complete an install to a RAID-1 array successfully: I hit https://bugzilla.redhat.com/show_bug.cgi?id=876789 when partitioning happens. /dev/md/Volume0_0p2 appears to be 'busy' when anaconda attempts to call wipefs on it (I don't know why it calls wipefs on a partition it just created, but it did that in F17 too, and it worked fine there). If anyone has any ideas on that bug, that'd be great.

Comment 90 Milan Broz 2012-11-16 05:22:15 UTC
(In reply to comment #89)
> (I don't know why it calls wipefs on a partition it just
> created, but it did that in F17 too, and it worked fine there).

I think it is because there can already be existing metadata.

Imagine someone wipes the partition table with dd and later anaconda recreates exactly the same partition (same offset). Immediately you have the old RAID, LVM, or LUKS metadata back... (I have seen this problem reappear regularly in the installer over the past years.)

Whatever the real problem is, please do not work around it by removing that wipefs :-)

Comment 91 Daniel Walsh 2012-11-16 17:02:52 UTC
Added policy for this in selinux-policy-3.11.1-55.fc18.noarch

Comment 92 Adam Williamson 2012-11-16 23:48:41 UTC
So this is 'fixed' in Beta TC9 by the systemd workaround. Do we want to close this bug when that goes stable and open a new bug for the correct fix, or do we want to keep this bug for the correct fix and just drop the blocker status when systemd goes stable?

Comment 93 Fedora Update System 2012-11-17 02:26:46 UTC
Package systemd-195-7.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-195-7.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-18353/systemd-195-7.fc18
then log in and leave karma (feedback).

Comment 94 Fedora Update System 2012-11-20 07:17:15 UTC
systemd-195-7.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

