Bug 655857 - systemd needs to handle ENOBUFS on udev netlink socket
Summary: systemd needs to handle ENOBUFS on udev netlink socket
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: rawhide
Hardware: Unspecified
OS: Unspecified
low
medium
Target Milestone: ---
Assignee: Lennart Poettering
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-11-22 14:45 UTC by Zdenek Kabelac
Modified: 2011-01-04 23:44 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-04 23:44:52 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
Testing script (601 bytes, text/plain)
2010-11-23 12:24 UTC, Zdenek Kabelac
no flags Details

Description Zdenek Kabelac 2010-11-22 14:45:56 UTC
Description of problem:

While improving  lvm activation speed for larger LV numbers Systemd usually 
kills my system on resume when such larger set of devices are active in system with some message telling me parameter list too long - it tends to kill things like Xsession and other daemons - probably because it hits some internal error.

Brings on another question -  why systemd cares about these devices are create some service for all of them - devices are not mentioned anywhere except in my script - so why systemd consumes more resources for them?


Version-Release number of selected component (if applicable):
systemd-13-1.fc15.x86_64

How reproducible:


Steps to Reproduce:
1. Create some large enough TestingVG (i.e. over loop device)
2. for ((i=1;i<=2000;i++)) ; do
 echo "lvcreate -n tstA$i -l1 TestingVG "
done | lvm
3. (note: Current upstream lvm is quite slow so it takes some time)

Actual results:


Expected results:
No kill my xsession on resume 

Additional info:

Comment 1 Lennart Poettering 2010-11-22 20:24:00 UTC
Hmm, what gives you the conclusion that systemd is killing that?

Note that systemd monitors block devices coming and going to build proper dependency trees for them and so that we can bind stuff like cryptsetup and similar things to their appearance. Also, at boot we wait exactly for the moment that the devices needed have shown up, instead of waiting for some arbitrary amount of time assuming that after that passed "all" block devices have shown up, how it was traditionally done on boot and still is for the LVM stuff.

Please provide logs that can explain why you think systemd is misbehaving here. In particular I'd be interested in the kmsg logs that are generated when this happens when you booted with systemd.log_level=debug systend.log_target=kmsg on the kernel cmdline.

Comment 2 Zdenek Kabelac 2010-11-23 12:22:28 UTC
Ok - I'm attaching simplified version of demo script which only uses dmsetup to create larger set of device so you may try yourself and kill your desktop as many times as you need.

It looks like  ~450 devices is the limit for systemd.
Crash happens as soon as you run   'dmsetup remove_all'   which quickly remove all not opened devices from dm table.

It usually good idea to kill  gvfs tasks to get at least some speed - as they have some strange exponential complexity probably (needs to be fixed as well)

Here are some snips from logs:

---  dmsetup create time

Running GC...
dev-disk-by\x2did-dm\x2dname\x2ddevb4_22.device changed dead -> plugged
dev-mapper-devb4_22.device changed dead -> plugged
dev-dm\x2d433.device changed dead -> plugged
sys-devices-virtual-block-dm\x2d433.device changed dead -> plugged
dev-disk-by\x2did-dm\x2dname\x2ddevb6_24.device changed dead -> plugged
dev-mapper-devb6_24.device changed dead -> plugged
dev-dm\x2d432.device changed dead -> plugged
sys-devices-virtual-block-dm\x2d432.device changed dead -> plugged
Failed to load device unit: Argument list too long
Failed to load device unit: Argument list too long
Failed to load device unit: Argument list too long
sys-devices-virtual-block-dm\x2d435.device changed dead -> plugged
Failed to load device unit: Argument list too long


--- crash time - after dmsetup remove_all

sys-devices-virtual-block-dm\x2d47.device changed plugged -> dead
Running GC...
Collecting sys-devices-virtual-block-dm\x2d47.device
Collecting dev-dm\x2d47.device
Collecting dev-mapper-dev6_8.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddev6_8.device
Collecting sys-devices-virtual-block-dm\x2d356.device
Collecting dev-dm\x2d356.device
Collecting dev-mapper-devb2_11.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddevb2_11.device
Collecting sys-devices-virtual-block-dm\x2d384.device
Collecting dev-dm\x2d384.device
Collecting dev-mapper-devb1_14.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddevb1_14.device
Collecting sys-devices-virtual-block-dm\x2d420.device
Collecting dev-dm\x2d420.device
Collecting dev-mapper-devb4_20.device
Collecting dev-disk-by\x2did-dm\x2dname\x2ddevb4_20.device
dev-disk-by\x2did-dm\x2dname\x2ddev1_5.device changed plugged -> dead
dev-mapper-dev1_5.device changed plugged -> dead
dev-dm\x2d29.device changed plugged -> dead
sys-devices-virtual-block-dm\x2d29.device changed plugged -> dead
Assertion 'events == EPOLLIN' failed at src/device.c:527, function device_fd_event(). Aborting.
Caught <ABRT>, dumped core as pid 6811.
Executing crash shell in 10s...

---  after this - system kills my Xsession and end in some not really usable state in unknown terminal.

Comment 3 Zdenek Kabelac 2010-11-23 12:24:28 UTC
Created attachment 462310 [details]
Testing script

Creates loop device and tries to map many dm devices on this.

Comment 4 Michal Schmidt 2010-11-23 12:41:49 UTC
"Argument list too long" is E2BIG.

Perhaps it's hitting this limit in unit_add_name():

        if (hashmap_size(u->meta.manager->units) >= MANAGER_MAX_NAMES) {
                r = -E2BIG;
                goto fail;
        }

Comment 5 Michal Schmidt 2010-11-24 15:37:03 UTC
I can reproduce the crash with the script easily.

A couple of observations:
 - On my system "dmsetup remove_all" is not necessary to reproduce it.
   Simply running the script itself was enough to trigger the assertion failure.
   "events" was 9 (EPOLLIN|EPOLLERR). If I add a small delay after creating
   every device, the assertion does not appear. Perhaps systemd fails as it
   cannot keep up with the rate of incoming events from udev?
 - With the added delay I got the "Argument list too long" after a while instead.
   It seems to me, the MANAGER_MAX_NAMES limit is quite arbitrary and should be
   removed.

So there are really two bugs here.

Comment 6 Zdenek Kabelac 2010-11-24 16:41:58 UTC
Could be it - I've kernel with some slowing down debug options enabled.
Maybe debugging messages itself slows down that as well.

Comment 7 Lennart Poettering 2010-11-24 22:33:20 UTC
Yes, there are two seperate issues here. We should probably drastically raise MANAGER_MAX_NAMES. (to 128k or so) Which is issue #1. (I have now changed this in git upstream)

I really wonder which fd that is that causes the EPOLLERR though. To fix that is issue #2.

Zdenek: systemd automatically does a chvt to tty on segfault. Is it possible that your Xsession kill is actually just a chvt?

Comment 8 Zdenek Kabelac 2010-11-25 12:53:12 UTC
So as of todays rawhide I observe different behavior:

Now after 'dmsetup remove_all'  my X session is switched to the console where I've started my session via 'startx'  (I'm not using any kdm/xdm/gdm)

And Xsession seems to be still running - I could easily switch back to my running session -- that seems to be good.

So I do not get any weird looking shell session this time - I remember seeing some backtrace in Xorg.log - but I've been thinking it's been some consequence of systemd's abort and have not made any analysis of this - but maybe some terminal swich error has been recently fixed in Xorg.

And now bad parts -

System could not be reboot-ed other way then using  "SysRq + B"

When I log on any console session - and I do exit - session remains in some 'exiting' mode and does not get back to  login/password getty mode  (systemd died)

Here is my current xorg rawhide version:
xorg-x11-server-Xorg-1.9.1-6.fc15.x86_64

Comment 9 Zdenek Kabelac 2010-11-25 13:39:00 UTC
(In reply to comment #7)
> Yes, there are two seperate issues here. We should probably drastically raise
> MANAGER_MAX_NAMES. (to 128k or so) Which is issue #1. (I have now changed this
> in git upstream)


Is this thing actually hardcoded (one value fits all users?) into the source file - or is configurable through some text file ?

I assume you are not preallocating 128K entries on the startup ?

Comment 10 Michal Schmidt 2010-11-25 14:25:10 UTC
There's no preallocation. In fact, the code I pasted in comment 4 is the only place where the constant is used. It is not configurable.
The existence of such a limit makes me a bit worried, no matter how large this limit may be.

Comment 11 Michal Schmidt 2010-11-25 15:08:24 UTC
(In reply to comment #7)
> I really wonder which fd that is that causes the EPOLLERR though. To fix that
> is issue #2.

Since the assertion failure occurs in device_fd_event(), the watch type must be WATCH_UDEV and the fd is the udev monitor socket fd (from udev_monitor_get_fd()).

Comment 12 Michal Schmidt 2010-11-25 15:55:08 UTC
If the assertion is removed, then recvmsg() inside of udev_monitor_receive_device() gives errno=105 (ENOBUFS).

Comment 13 Lennart Poettering 2010-11-26 02:22:22 UTC
Hmm, according to kay the fix is that we need to do 

udev_monitor_set_receive_buffer_size(monitor, 128*1024*1024);

in systemd.

(and also gracfully handle ENOBUFS).

Comment 14 Kay Sievers 2010-11-26 03:56:08 UTC
Mind checking if that helps?

We get 128MB kernel buffer size limit, like udevd itself, and just log an error if we get the netlink overflow notification.

http://cgit.freedesktop.org/systemd/commit/?id=99448c1f01d79891e0afdfcf3ec8ed9fa92502ae

Cheers,
Kay

Comment 15 Zdenek Kabelac 2010-11-26 09:01:39 UTC
(In reply to comment #14)
> Mind checking if that helps?
> 
> We get 128MB kernel buffer size limit, like udevd itself, and just log an error
> if we get the netlink overflow notification.
> 
> http://cgit.freedesktop.org/systemd/commit/?id=99448c1f01d79891e0afdfcf3ec8ed9fa92502ae
> 
> Cheers,
> Kay

So every properly working udev monitor needs to allocate 128MB just for the case ???  That sounds strange.

I.e. having 10 monitors - and consuming 1.2G just for buffers ?

Comment 16 Kay Sievers 2010-11-26 11:24:07 UTC
That's the maximum buffer size, in case something can create events faster than you can read them. It's not used in any normal situation. The default buffer size is pretty small.

It's a standard problem with multicast, you can't block the sender, so you have to take care on the receiver side that stuff does not get lost.

Comment 17 Michal Schmidt 2010-11-26 13:43:47 UTC
With the patch systemd does not crash anymore.

I did receive a strange message though during the test:
 systemd[1]: Looping too fast. Throttling execution a little.

Comment 18 Kay Sievers 2010-11-26 13:52:27 UTC
(In reply to comment #17)
> With the patch systemd does not crash anymore.

Good! :)

> I did receive a strange message though during the test:
>  systemd[1]: Looping too fast. Throttling execution a little.

Yeah, that's something that needs to be tuned. There is an event-ratelimit at the moment, to prevent busyloop-like behavior. Seems, it does handle that many devices properly.

I've seen that too, with 'modprobe scsi_debug ...', which can easily create tens of thousands of devices.

We will need to get there and fix all that behavior, it just wasn't the focus until now. But we are now almost in 'bug fix' instead of 'adding features' mode. :)

Thanks for the test!

Comment 19 Kay Sievers 2010-11-30 12:13:03 UTC
Just an update: I tried 20.000 simulated disks on a laptop now, and all seems to work:
  $ time modprobe scsi_debug max_luns=16 add_host=16 num_parts=4 num_tgts=16
  14m

We have ~20.000 block devices now:
  $ find /sys/class/block/ | wc -l
  20489

The logs look fine, and 'systemctl' outputs all these units just fine:
  $ systemctl | wc -l
  20592

Comment 20 Lennart Poettering 2011-01-04 23:44:52 UTC
OK, closing this now as UPSTREAM, since this is fixed upstream, and I'll do a new upload shortly.


Note You need to log in before you can comment on or make changes to this bug.