Description of problem:
While working on improving LVM activation speed for large numbers of LVs, systemd usually kills my session on resume when such a large set of devices is active in the system, with some message telling me "parameter list too long" - it tends to kill things like my Xsession and other daemons, probably because it hits some internal error.

This raises another question - why does systemd care about these devices and create some service for all of them? The devices are not mentioned anywhere except in my script - so why does systemd consume extra resources for them?

Version-Release number of selected component (if applicable):
systemd-13-1.fc15.x86_64

How reproducible:

Steps to Reproduce:
1. Create some large enough TestingVG (e.g. over a loop device)
2. for ((i=1; i<=2000; i++)) ; do echo "lvcreate -n tstA$i -l1 TestingVG" ; done | lvm
3. (note: current upstream lvm is quite slow, so this takes some time)

Actual results:

Expected results:
My Xsession is not killed on resume.

Additional info:
Hmm, what gives you the conclusion that systemd is killing that? Note that systemd monitors block devices coming and going in order to build proper dependency trees for them, and so that we can bind stuff like cryptsetup and similar things to their appearance. Also, at boot we wait for exactly the moment that the needed devices have shown up, instead of waiting some arbitrary amount of time and assuming that after it has passed "all" block devices have shown up, as was traditionally done on boot and still is for the LVM stuff.

Please provide logs that can explain why you think systemd is misbehaving here. In particular I'd be interested in the kmsg logs that are generated when this happens after you booted with systemd.log_level=debug systemd.log_target=kmsg on the kernel command line.
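(As background for why a unit exists for every block device: a hypothetical sketch of the kind of generated unit this enables - not from this bug, the device path is made up, and BindsTo= was spelled BindTo= in systemd versions of this era:

# Hypothetical generated cryptsetup unit; the device path is invented.
# Binding to the .device unit ties the service's lifetime to the
# device's appearance and disappearance.
[Unit]
Description=Cryptography Setup for mydisk
BindsTo=dev-disk-by\x2duuid-aaaa.device
After=dev-disk-by\x2duuid-aaaa.device

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/cryptsetup luksOpen /dev/disk/by-uuid/aaaa mydisk
ExecStop=/sbin/cryptsetup luksClose mydisk

Dependencies like this are the reason systemd tracks every block device as a unit.)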
OK - I'm attaching a simplified version of my demo script which only uses dmsetup to create a larger set of devices, so you may try it yourself and kill your desktop as many times as you need.

It looks like ~450 devices is the limit for systemd. The crash happens as soon as you run 'dmsetup remove_all', which quickly removes all unopened devices from the dm table. It's usually a good idea to kill the gvfs tasks to get at least some speed - they probably have some strange exponential complexity (which needs to be fixed as well).

Here are some snips from the logs:

--- dmsetup create time
Running GC...
dev-disk-by\x2did-dm\x2dname\x2ddevb4_22.device changed dead -> plugged
dev-mapper-devb4_22.device changed dead -> plugged
dev-dm\x2d433.device changed dead -> plugged
sys-devices-virtual-block-dm\x2d433.device changed dead -> plugged
dev-disk-by\x2did-dm\x2dname\x2ddevb6_24.device changed dead -> plugged
dev-mapper-devb6_24.device changed dead -> plugged
dev-dm\x2d432.device changed dead -> plugged
sys-devices-virtual-block-dm\x2d432.device changed dead -> plugged
Failed to load device unit: Argument list too long
Failed to load device unit: Argument list too long
Failed to load device unit: Argument list too long
sys-devices-virtual-block-dm\x2d435.device changed dead -> plugged
Failed to load device unit: Argument list too long

--- crash time - after dmsetup remove_all
sys-devices-virtual-block-dm\x2d47.device changed plugged -> dead
Running GC...
Collecting sys-devices-virtual-block-dm\x2d47.device
Collecting dev-dm\x2d47.device
Collecting dev-mapper-dev6_8.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddev6_8.device
Collecting sys-devices-virtual-block-dm\x2d356.device
Collecting dev-dm\x2d356.device
Collecting dev-mapper-devb2_11.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddevb2_11.device
Collecting sys-devices-virtual-block-dm\x2d384.device
Collecting dev-dm\x2d384.device
Collecting dev-mapper-devb1_14.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddevb1_14.device
Collecting sys-devices-virtual-block-dm\x2d420.device
Collecting dev-dm\x2d420.device
Collecting dev-mapper-devb4_20.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddevb4_20.device
dev-disk-by\x2did-dm\x2dname\x2ddev1_5.device changed plugged -> dead
dev-mapper-dev1_5.device changed plugged -> dead
dev-dm\x2d29.device changed plugged -> dead
sys-devices-virtual-block-dm\x2d29.device changed plugged -> dead
Assertion 'events == EPOLLIN' failed at src/device.c:527, function device_fd_event(). Aborting.
Caught <ABRT>, dumped core as pid 6811.
Executing crash shell in 10s...

--- after this, the system kills my Xsession and ends up in some not really usable state in an unknown terminal.
Created attachment 462310 [details]
Testing script

Creates a loop device and tries to map many dm devices onto it.
"Argument list too long" is E2BIG. Perhaps it's hitting this limit in unit_add_name(): if (hashmap_size(u->meta.manager->units) >= MANAGER_MAX_NAMES) { r = -E2BIG; goto fail; }
I can reproduce the crash with the script easily. A couple of observations:

- On my system "dmsetup remove_all" is not necessary to reproduce it. Simply running the script itself was enough to trigger the assertion failure. "events" was 9 (EPOLLIN|EPOLLERR). If I add a small delay after creating every device (see the sketch below), the assertion does not appear. Perhaps systemd fails because it cannot keep up with the rate of incoming events from udev?

- With the added delay I got the "Argument list too long" error after a while instead. It seems to me that the MANAGER_MAX_NAMES limit is quite arbitrary and should be removed.

So there are really two bugs here.
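A minimal sketch of the throttled variation I mean - the device names and table parameters here are made up and differ from the attached script, but a delay of this sort after each create is what avoided the assertion for me:

# Hypothetical throttled variant of the test (names and the linear
# table are invented). Without the sleep, systemd aborts; with it,
# only the E2BIG limit is hit.
losetup /dev/loop0 /tmp/backing.img
for ((i=1; i<=500; i++)); do
        dmsetup create "tstdev$i" --table "0 8 linear /dev/loop0 0"
        sleep 0.05    # give systemd time to drain the udev queue
done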
Could be - I have a kernel with some debug options enabled that slow it down. Maybe the debug messages themselves slow things down as well.
Yes, there are two separate issues here.

We should probably drastically raise MANAGER_MAX_NAMES (to 128k or so). That is issue #1. (I have now changed this in git upstream.)

I really wonder which fd it is that causes the EPOLLERR though. Fixing that is issue #2.

Zdenek: systemd automatically does a chvt to a tty on segfault. Is it possible that your Xsession kill is actually just a chvt?
So as of today's Rawhide I observe different behavior:

Now after 'dmsetup remove_all' my X session is switched to the console where I started my session via 'startx' (I'm not using any kdm/xdm/gdm). The Xsession seems to be still running - I could easily switch back to my running session - so that seems to be good. I do not get any weird-looking shell session this time. I remember seeing some backtrace in Xorg.log - but I assumed it was a consequence of systemd's abort and did not analyze it further - though maybe some terminal switch error has been fixed in Xorg recently.

And now the bad parts:

The system could not be rebooted any way other than "SysRq + B".

When I log in on any console session and then exit, the session remains in some 'exiting' mode and never gets back to the login/password getty mode (systemd died).

Here is my current Xorg Rawhide version: xorg-x11-server-Xorg-1.9.1-6.fc15.x86_64
(In reply to comment #7)
> We should probably drastically raise MANAGER_MAX_NAMES (to 128k or so).
> That is issue #1. (I have now changed this in git upstream.)

Is this value actually hardcoded (one value fits all users?) in the source file, or is it configurable through some text file? I assume you are not preallocating 128K entries on startup?
There's no preallocation. In fact, the code I pasted in comment 4 is the only place where the constant is used. It is not configurable.

The existence of such a limit makes me a bit worried, no matter how large the limit may be.
(In reply to comment #7)
> I really wonder which fd it is that causes the EPOLLERR though. Fixing that
> is issue #2.

Since the assertion failure occurs in device_fd_event(), the watch type must be WATCH_UDEV and the fd is the udev monitor socket fd (from udev_monitor_get_fd()).
If the assertion is removed, then recvmsg() inside of udev_monitor_receive_device() gives errno=105 (ENOBUFS).
Hmm, according to Kay the fix is that we need to do

  udev_monitor_set_receive_buffer_size(monitor, 128*1024*1024);

in systemd (and also gracefully handle ENOBUFS).
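As a standalone illustration, a minimal sketch of that approach (assuming libudev headers; this is not the actual systemd patch, just the two pieces - a larger kernel receive buffer, and ENOBUFS treated as a recoverable overflow instead of an assertion failure):

/* Minimal sketch, not the actual systemd code. Build with -ludev. */
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <libudev.h>

int main(void) {
        struct udev *udev = udev_new();
        struct udev_monitor *monitor;
        struct pollfd p;

        if (!udev)
                return 1;
        monitor = udev_monitor_new_from_netlink(udev, "udev");
        if (!monitor)
                return 1;

        /* Allow up to 128MB of queued uevents so that bursts (e.g.
         * thousands of dm devices appearing at once) are not dropped. */
        udev_monitor_set_receive_buffer_size(monitor, 128*1024*1024);
        udev_monitor_enable_receiving(monitor);

        p.fd = udev_monitor_get_fd(monitor);
        p.events = POLLIN;

        for (;;) {
                struct udev_device *device;

                if (poll(&p, 1, -1) < 0)
                        break;

                device = udev_monitor_receive_device(monitor);
                if (!device) {
                        /* ENOBUFS: the kernel dropped events because we were
                         * too slow. Log and continue instead of aborting on
                         * the resulting EPOLLERR. */
                        if (errno == ENOBUFS)
                                fprintf(stderr, "uevent queue overflowed, events lost\n");
                        continue;
                }

                printf("%s: %s\n",
                       udev_device_get_action(device),
                       udev_device_get_syspath(device));
                udev_device_unref(device);
        }

        udev_monitor_unref(monitor);
        udev_unref(udev);
        return 0;
}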
Mind checking if that helps?

We get a 128MB kernel buffer size limit, like udevd itself, and just log an error if we get the netlink overflow notification:
  http://cgit.freedesktop.org/systemd/commit/?id=99448c1f01d79891e0afdfcf3ec8ed9fa92502ae

Cheers,
Kay
(In reply to comment #14)
> Mind checking if that helps?
>
> We get a 128MB kernel buffer size limit, like udevd itself, and just log an
> error if we get the netlink overflow notification:
>   http://cgit.freedesktop.org/systemd/commit/?id=99448c1f01d79891e0afdfcf3ec8ed9fa92502ae

So every properly working udev monitor needs to allocate 128MB just in case? That sounds strange - i.e. having 10 monitors would consume over 1.2GB just for buffers?
That's the maximum buffer size, for the case where something creates events faster than you can read them. It's not actually used in any normal situation - the default buffer size is pretty small, and the kernel only consumes memory for events that are actually queued.

It's a standard problem with multicast: you can't block the sender, so you have to take care on the receiver side that stuff does not get lost.
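At the socket level, a buffer bump like this presumably boils down to something like the following sketch (my illustration, not the actual libudev internals): plain SO_RCVBUF requests are capped by the net.core.rmem_max sysctl, while a privileged process can exceed that cap with SO_RCVBUFFORCE. Either way it only sets an upper bound on queued, unread messages:

/* Hedged illustration of raising a socket's receive buffer limit;
 * not the actual libudev implementation. */
#include <sys/socket.h>

static int bump_rcvbuf(int fd, int size) {
        /* SO_RCVBUFFORCE (needs CAP_NET_ADMIN) may exceed the
         * net.core.rmem_max cap; fall back to the capped SO_RCVBUF
         * for unprivileged callers. */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &size, sizeof(size)) == 0)
                return 0;
        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
}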
With the patch, systemd does not crash anymore. I did receive a strange message during the test though:

systemd[1]: Looping too fast. Throttling execution a little.
(In reply to comment #17)
> With the patch, systemd does not crash anymore.

Good! :)

> I did receive a strange message during the test though:
> systemd[1]: Looping too fast. Throttling execution a little.

Yeah, that's something that needs to be tuned. There is an event rate limit at the moment, to prevent busyloop-like behavior. It seems it does not handle that many devices properly. I've seen that too, with 'modprobe scsi_debug ...', which can easily create tens of thousands of devices.

We will need to get there and fix all that behavior; it just wasn't the focus until now. But we are now almost in 'bug fix' instead of 'adding features' mode. :)

Thanks for the test!
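For reference, the throttle in question is a simple interval/burst rate limit - a hedged sketch of the general pattern (my illustration, not systemd's exact code):

/* Generic interval/burst rate limiter sketch (illustration only).
 * Allows up to 'burst' events per 'interval'; exceeding it is what
 * produces the "Looping too fast" throttling message. */
#include <stdbool.h>
#include <time.h>

typedef struct {
        time_t interval;   /* window length in seconds */
        unsigned burst;    /* events allowed per window */
        time_t begin;      /* start of the current window */
        unsigned num;      /* events seen in the current window */
} RateLimit;

static bool ratelimit_test(RateLimit *r) {
        time_t now = time(NULL);

        if (r->begin == 0 || now >= r->begin + r->interval) {
                r->begin = now;   /* open a new window */
                r->num = 0;
        }

        return ++r->num <= r->burst;
}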
Just an update: I tried 20,000 simulated disks on a laptop now, and all seems to work.

$ time modprobe scsi_debug max_luns=16 add_host=16 num_parts=4 num_tgts=16
14m

We have ~20,000 block devices now:

$ find /sys/class/block/ | wc -l
20489

The logs look fine, and 'systemctl' outputs all these units just fine:

$ systemctl | wc -l
20592
OK, closing this now as UPSTREAM, since this is fixed upstream, and I'll do a new upload shortly.