982734 – manpage: dispatcher timeouts are undocumented

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	NetworkManager
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Dan Williams
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	909577
Depends On:
Blocks:

Reported:	2013-07-09 17:22 UTC by Kamil Páral
Modified:	2014-04-22 08:24 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-04-18 15:44:57 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Kamil Páral 2013-07-09 17:22:26 UTC

Description of problem:
I have been playing with NM-dispatcher lately and I sometimes spotted this in my journal:

Jul 09 18:35:00 kraken nm-dispatcher.action[594]: Script '/etc/NetworkManager/dispatcher.d/99-kparal' took too long; killing it.
Jul 09 18:35:00 kraken NetworkManager[601]: <warn> Dispatcher script timed out: Script '/etc/NetworkManager/dispatcher.d/99-kparal' timed out.

But if I look into "man NetworkManager" there is no indication that there is some time limit imposed, what the time limit is, and whether I can change it.

Please add those 3 pieces of information into the manpage, thank you.

Version-Release number of selected component (if applicable):
NetworkManager-0.9.8.2-1.fc18.x86_64

Comment 1 Jirka Klimes 2013-07-11 12:45:34 UTC

The timeout is not configurable and is currently 3 seconds.
http://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/callouts/nm-dispatcher-action.c?id=0201d6da877d9f19c124298bac8b8cc3d81585e7#n352

It was introduced by a change that makes the dispatcher asynchronous and allows NM to wait for the dispatcher.
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0201d6da877d9f19c124298bac8b8cc3d81585e7
We should ask Dan whether he thinks 3 seconds is enough and how to document the behaviour properly.
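
For readers unfamiliar with this kind of limit, the behaviour described above (run a script, wait a fixed time, kill it if it has not finished) can be sketched as follows. This is only an illustration in Python, not the dispatcher's actual code, which is written in C; the function name `run_with_timeout` and the sleeping one-liner used as a stand-in script are assumptions made for the example.

```python
import subprocess
import sys

# Assumption: 3 seconds mirrors the limit mentioned in this comment.
TIMEOUT_SECONDS = 3.0


def run_with_timeout(argv, timeout=TIMEOUT_SECONDS):
    """Run argv; return its exit status, or None if it overran and was killed."""
    try:
        # subprocess.run() kills and reaps the child when the timeout expires.
        return subprocess.run(argv, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        print(f"Script {argv[0]!r} took too long; killing it.", file=sys.stderr)
        return None


if __name__ == "__main__":
    # Hypothetical slow "script": a Python one-liner that sleeps for 10 seconds.
    status = run_with_timeout([sys.executable, "-c", "import time; time.sleep(10)"])
    print("timed out" if status is None else f"exit status {status}")
```

Any script that needs longer than the limit is cut off no matter how far it has got, which is why the exact value matters to script authors.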

Comment 2 Dan Williams 2013-07-11 15:07:01 UTC

(In reply to Jirka Klimes from comment #1)
> The timeout is not configurable and is currently 3 seconds.
> http://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/callouts/nm-dispatcher-action.c?id=0201d6da877d9f19c124298bac8b8cc3d81585e7#n352
> 
> It was introduced by a change that makes the dispatcher asynchronous and
> allows NM to wait for the dispatcher.
> http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0201d6da877d9f19c124298bac8b8cc3d81585e7
> We should ask Dan whether he thinks 3 seconds is enough and how to document
> the behaviour properly.

The timeout is not currently intended to be configurable, and there are plans to make it go away if/when we actually make NM block on the various dispatcher scripts to implement "pre-down" functionality (which is blocked on some internal state change issues). We should, however, document the timeout in the manpage as suggested.

Comment 3 James Ralston 2013-08-29 17:58:37 UTC

(In reply to Dan Williams from comment #2)

> The timeout is not currently intended to be configurable, and there are
> plans to make it go away if/when we actually make NM block on the various
> dispatcher scripts to implement "pre-down" functionality (which is blocked
> on some internal state change issues).  We should however document the
> timeout in the manpage as suggested.

I use Puppet heavily in my infrastructure. On my laptop, I have two dhclient hook scripts: zz-facter.sh writes all of the dhclient information to /etc/facter/facts.d/dhclient_INTERFACE.yaml, and zz-puppet.sh runs "puppet apply". I do not run Puppet in client/server mode, as that makes no sense on my laptop (because it isn't always connected to a network).

Invoking Puppet is not a lightweight process. In the best case, it takes between 5 and 10 seconds; in the worst case (e.g., a freshly-booted system with nothing already cached), it can take closer to 60 seconds. In my testing so far, it NEVER manages to complete before NetworkManager comes along and chops its legs off with its 3-second timeout.

I can understand needing to protect against a hung dispatcher script that will never return. But I'm curious: why did you assume that a measly >3 seconds< would be sufficient for every NetworkManager dispatcher script that has ever been written, or ever will be written? Did it truly not occur to you that Linux users would be capable of writing more complex dispatcher scripts that would require more time than that?

As it stands now, I can see only 3 (no pun intended) realistic options:

1. Repackage NetworkManager RPMs locally for my own systems, including a patch to raise the timeout to something reasonable (60 seconds, probably). But this requires rebasing every time Fedora pushes out an updated NetworkManager package.

2. Have zz-puppet.sh fork the "puppet apply" command into the background. But this risks having multiple "puppet apply" commands running at the same time, most likely fighting with each other.

3. Set up a full-blown Puppet client/server instance on my laptop, where the server is simply localhost. This would make Puppet executions asynchronous, and would also prevent concurrent executions, as calling condrestart will stop an already-running Puppet execution. But running a same-system client/server setup is needlessly complex.
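
For reference, the concurrency risk in option 2 could in principle be reduced by having the backgrounded job take a non-blocking exclusive lock, so that a second invocation gives up instead of competing with the first. The following is a minimal sketch under stated assumptions: it is written in Python rather than shell, the lock path `/run/lock/zz-puppet.lock` and the helper names `try_lock` and `run_exclusively` are hypothetical, and it has not been tried against Puppet.

```python
import fcntl
import os
import subprocess
import sys

# Hypothetical lock path; any file the script can write to would do.
LOCK_PATH = "/run/lock/zz-puppet.lock"


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    lock_file = open(path, "w")
    try:
        # LOCK_NB makes flock() fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file


def run_exclusively(argv, lock_path=LOCK_PATH):
    """Run argv unless the lock is already held; return exit status or None."""
    lock_file = try_lock(lock_path)
    if lock_file is None:
        return None  # another run is in progress, so skip this one
    try:
        return subprocess.run(argv).returncode
    finally:
        lock_file.close()  # closing the file releases the lock
```

The trade-off is that a skipped run is lost rather than queued; a job that must see the latest network state would need to re-check after the running instance finishes.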

I'm probably going to go with option #3, as I think it's probably the cleanest solution to avoid NetworkManager's 3-second timeout.

But it is frustrating that I have to take the time and effort to work around NetworkManager's deficiencies, particularly when those deficiencies are caused by inappropriate assumptions (namely, that 3 seconds is a reasonable default timeout, and that no one will ever need to raise it) on the part of NetworkManager developers.

Comment 4 Fedora End Of Life 2013-12-21 14:17:29 UTC

This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 5 Charles R. Anderson 2014-04-15 13:48:49 UTC

*** Bug 909577 has been marked as a duplicate of this bug. ***

Comment 6 Dan Winship 2014-04-18 15:44:57 UTC

(In reply to Kamil Páral from comment #0)
> But if I look into "man NetworkManager" there is no indication that there is
> some time limit imposed, what the time limit is, and whether I can change it.

This is now fixed upstream:

>+      Dispatcher scripts are run one at a time, but asynchronously from the main
>+      NetworkManager process, and will be killed if they run for too long. If your script
>+      might take arbitrarily long to complete, you should spawn a child process and have the
>+      parent return immediately.

(The time limit is now 20 seconds rather than 3; this is not documented because we don't guarantee the specific time limit will stay the same, although it should remain closer to 20 than to 3, since 3 seconds turns out to be too short for even "quick" scripts on heavily-loaded machines.)
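
The advice in the manpage text quoted above (spawn a child process and have the parent return immediately) can be sketched as follows. This is a minimal illustration in Python, not an official example; the helper name `spawn_detached` and the sleeping one-liner standing in for the long-running job are assumptions. Starting the child in a new session is a precaution on the assumption that the child should not share the script's session or process group; this thread does not say how the dispatcher kills an overrunning script.

```python
import subprocess
import sys


def spawn_detached(argv):
    """Start argv in its own session and return without waiting for it."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # calls setsid() in the child before exec
    )


if __name__ == "__main__":
    # Hypothetical long-running job; replace with the real command.
    spawn_detached([sys.executable, "-c", "import time; time.sleep(5)"])
    # The parent exits right away, well inside any time limit.
    sys.exit(0)
```

Because the parent no longer waits, it cannot report the child's exit status, and nothing here prevents two children from overlapping if events arrive in quick succession.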

This will eventually be in rawhide [when there's a snapshot with a date later than 20140418], but I'm going to just close the bug now so we don't forget to later.

Comment 7 Kamil Páral 2014-04-22 08:24:29 UTC

Thanks.