Red Hat Bugzilla – Bug 982734
manpage: dispatcher timeouts are undocumented
Last modified: 2014-04-22 04:24:29 EDT
Description of problem:
I have been playing with NM-dispatcher lately, and I sometimes spotted this in my journal:
Jul 09 18:35:00 kraken nm-dispatcher.action: Script '/etc/NetworkManager/dispatcher.d/99-kparal' took too long; killing it.
Jul 09 18:35:00 kraken NetworkManager: <warn> Dispatcher script timed out: Script '/etc/NetworkManager/dispatcher.d/99-kparal' timed out.
But if I look into "man NetworkManager" there is no indication that there is some time limit imposed, what the time limit is, and whether I can change it.
Please add those 3 pieces of information into the manpage, thank you.
Version-Release number of selected component (if applicable):
The timeout is not configurable and is 3 seconds now.
It has been introduced by a change that makes dispatcher asynchronous, and allows NM to wait for dispatcher.
We should ask Dan whether he thinks 3 seconds are enough and how to document the behaviour properly.
(In reply to Jirka Klimes from comment #1)
> The timeout is not configurable and is 3 seconds now.
> It has been introduced by a change that makes dispatcher asynchronous, and
> allows NM to wait for dispatcher.
> We should ask Dan whether he thinks 3 seconds are enough and how to document
> the behaviour properly.
The timeout is not currently intended to be configurable, and there are plans to make it go away if/when we actually make NM block on the various dispatcher scripts to implement "pre-down" functionality (which is blocking on some internal state change issues). We should however document the timeout in the manpage as suggested.
(In reply to Dan Williams from comment #2)
> The timeout is not currently intended to be configurable, and there are
> plans to make it go away if/when we actually make NM block on the various
> dispatcher scripts to implement "pre-down" functionality (which is blocking
> on some internal state change issues). We should however document the
> timeout in the manpage as suggested.
I use Puppet heavily in my infrastructure. On my laptop, I have two dhclient hook scripts: zz-facter.sh writes all of the dhclient information to /etc/facter/facts.d/dhclient_INTERFACE.yaml, and zz-puppet.sh runs "puppet apply". I do not run Puppet in client/server mode, as that makes no sense on my laptop (because it isn't always connected to a network).
Invoking Puppet is not a lightweight process. In the best case, it takes between 5 and 10 seconds; in the worst case (e.g., a freshly-booted system with nothing already cached), it can take closer to 60 seconds. In my testing so far, it NEVER manages to complete before NetworkManager comes along and chops its legs off with its 3-second timeout.
I can understand needing to protect against a hung dispatcher script that will never return. But I'm curious: why did you assume that a measly >3 seconds< would be sufficient for every NetworkManager dispatcher script that has ever been written, or ever will be written? Did it truly not occur to you that Linux users would be capable of writing more complex dispatcher scripts that would require more time than that?
As it stands now, I can see only 3 (no pun intended) realistic options:
1. Repackage NetworkManager RPMs locally for my own systems, including a patch to raise the timeout to something reasonable (60 seconds, probably). But this requires rebasing every time Fedora pushes out an updated NetworkManager package.
2. Have zz-puppet.sh fork the "puppet apply" command into the background. But this risks having multiple "puppet apply" commands running at the same time, most likely fighting with each other.
3. Set up a full-blown Puppet client/server instance on my laptop, where the server is simply localhost. This would make Puppet executions asynchronous, and would also prevent concurrent executions, as calling condrestart will stop an already-running Puppet execution. But running a same-system client/server setup is needlessly complex.
I'm probably going to go with option #3, as I think it's probably the cleanest solution to avoid NetworkManager's 3-second timeout.
But it is frustrating that I have to take the time and effort to work around NetworkManager's deficiencies, particularly when those deficiencies are caused by inappropriate assumptions (namely, that 3 seconds is a reasonable default timeout, and that no one will ever need to raise it) on the part of NetworkManager developers.
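For option 2 above, one way to keep backgrounded runs from fighting each other might be a non-blocking advisory lock. The following is only a minimal Python sketch, not something proposed in this thread: the helper name try_lock and the lock file path are hypothetical, and it assumes a Linux system where fcntl.flock is available.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Hypothetical lock path, chosen for illustration only.
    lock_path = os.path.join(tempfile.gettempdir(), "zz-puppet.lock")
    lock = try_lock(lock_path)
    if lock is None:
        print("another run is in progress; skipping")
    else:
        print("lock acquired; the long-running job would start here")
        lock.close()  # closing the file releases the lock
```

A run that cannot get the lock simply skips its work, so a network event that arrives mid-run is not acted on. Whether skipping or queuing is the better behaviour depends on the script.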
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '18'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 18's end of life.
Thank you for reporting this issue and we are sorry that we may not be
able to fix it before Fedora 18 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version prior to Fedora 18's end of life.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
*** Bug 909577 has been marked as a duplicate of this bug. ***
(In reply to Kamil Páral from comment #0)
> But if I look into "man NetworkManager" there is no indication that there is
> some time limit imposed, what the time limit is, and whether I can change it.
This is now fixed upstream:
>+ Dispatcher scripts are run one at a time, but asynchronously from the main
>+ NetworkManager process, and will be killed if they run for too long. If your script
>+ might take arbitrarily long to complete, you should spawn a child process and have the
>+ parent return immediately.
(The time limit is now 20 seconds rather than 3; this is not documented because we don't guarantee the specific time limit will stay the same, although it should remain closer to 20 than to 3, since 3 seconds turns out to be too short for even "quick" scripts on heavily-loaded machines.)
This will eventually be in rawhide [when there's a snapshot with a date later than 20140418], but I'm going to just close the bug now so we don't forget to later.
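The advice in the new manpage text (spawn a child process and have the parent return immediately) could look roughly like the sketch below. It is an illustration in Python under stated assumptions, not the NetworkManager developers' recommended implementation: the function name spawn_detached and the placeholder job are hypothetical, and this thread does not say whether the dispatcher also cleans up children of a script it kills, so the new session is a precaution rather than a guarantee.

```python
import subprocess
import sys


def spawn_detached(command):
    """Start command in its own session and return without waiting for it."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,   # do not keep the caller's pipes open
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,     # child runs setsid() before exec
    )


if __name__ == "__main__":
    # Placeholder for a slow job; a real script would run its own command here.
    job = [sys.executable, "-c", "import time; time.sleep(5)"]
    child = spawn_detached(job)
    # The parent falls off the end right away, well inside any time limit.
    print("started background job with pid", child.pid)
```

Because the parent never waits for the child, the dispatcher sees the script finish almost immediately, while the slow work continues on its own.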