Bug 1112248 - dnssec-trigger-script fails to configure unbound on dnssec-triggerd restart
Summary: dnssec-trigger-script fails to configure unbound on dnssec-triggerd restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: dnssec-trigger
Version: 20
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
Assignee: Pavel Šimerda (pavlix)
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1113122
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-06-23 12:15 UTC by Pavel Šimerda (pavlix)
Modified: 2015-04-21 18:49 UTC (History)
10 users (show)

Fixed In Version: dnssec-trigger-0.12-20.fc22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-19 10:07:05 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
fix (783 bytes, patch)
2014-06-23 12:15 UTC, Pavel Šimerda (pavlix)
no flags Details | Diff
additional fix to avoid a blocking on systemd (1.05 KB, patch)
2014-06-23 14:18 UTC, Pavel Šimerda (pavlix)
no flags Details | Diff

Description Pavel Šimerda (pavlix) 2014-06-23 12:15:34 UTC
Created attachment 911409 [details]
fix

There's a race condition between restarting NetworkManager from dnssec-trigger-script --cleanup and communicating with it from dnssec-trigger-script --prepare when the service is restarted. This can be easily solved in the service file, see attachment.

Comment 1 Pavel Šimerda (pavlix) 2014-06-23 14:18:21 UTC
Created attachment 911472 [details]
additional fix to avoid a blocking on systemd

Comment 2 Pavel Šimerda (pavlix) 2014-06-24 09:11:28 UTC
Still a problem.

Comment 3 Pavel Šimerda (pavlix) 2014-06-24 12:54:58 UTC
It has probably improved with dnssec-trigger-0.12-10.fc20. But there's still some sort of a race condition that sometimes leads to not installing the forward zones. It may be inherent to the fact that dnssec-trigger-script --update doesn't guard for parallel run or it can be something in NetworkManager.

Comment 4 Pavel Šimerda (pavlix) 2014-06-24 13:52:50 UTC
Guarding for parallel run is now implemented and there's still a high rate of unsuccessful restarts. The problem is definitely in NetworkManager, see the logs with --debug argument added to the call to dnssec-trigger-script from the dispatcher.d script.

Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: Running: run_update
Jun 24 15:46:17 87-84-246-161.wendulka.net NetworkManager[12340]: <info> startup complete
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: <ConnectionList(only_default=True, skip_wifi=False, connections=[<Connection(uuid=57e5480b-08e4-46a4-8d0c-cfbeffdbe267, type=other, default=
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: <Store(name=servers, set(['2a00:1268:1ff:f001::1', '127.0.0.1']))>
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: Global forwarders: 2a00:1268:1ff:f001::1 127.0.0.1
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: <ConnectionList(only_default=False, skip_wifi=True, connections=[<Connection(uuid=57e5480b-08e4-46a4-8d0c-cfbeffdbe267, type=other, default=
Jun 24 15:46:17 87-84-246-161.wendulka.net dnssec-triggerd[12348]: ok
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: <UnboundZoneConfig(data={})>
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: <Store(name=zones, set([]))>
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: removing stored zones not present in both unbound and an active connection
Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]: installing connection provided zones
Jun 24 15:46:17 87-84-246-161.wendulka.net systemd[1]: Unit iscsi.service cannot be reloaded because it is inactive.
Jun 24 15:46:18 87-84-246-161.wendulka.net dhclient[12409]: DHCPREQUEST on enp0s25 to 255.255.255.255 port 67 (xid=0xe533987)
Jun 24 15:46:18 87-84-246-161.wendulka.net dhclient[12409]: DHCPOFFER from 84.246.161.81
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: DHCPREQUEST on enp0s25 to 255.255.255.255 port 67 (xid=0xe533987)
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: DHCPOFFER from 84.246.161.81
Jun 24 15:46:18 87-84-246-161.wendulka.net dhclient[12409]: DHCPACK from 84.246.161.81 (xid=0xe533987)
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: DHCPACK from 84.246.161.81 (xid=0xe533987)
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info> (enp0s25): DHCPv4 state changed preinit -> bound
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   address 84.246.161.87
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   plen 28 (255.255.255.240)
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   gateway 84.246.161.81
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   server identifier 84.246.161.81
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   lease time 43200
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   hostname '87-84-246-161'
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   nameserver '84.246.161.81'
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info>   domain name 'lan'
Jun 24 15:46:18 87-84-246-161.wendulka.net dhclient[12409]: bound to 84.246.161.87 -- renewal in 20325 seconds.
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info> Activation (enp0s25) Stage 5 of 5 (IPv4 Configure Commit) scheduled...
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info> Activation (enp0s25) Stage 5 of 5 (IPv4 Commit) started...
Jun 24 15:46:18 87-84-246-161.wendulka.net avahi-daemon[3350]: Registering new address record for 84.246.161.87 on enp0s25.IPv4.
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: bound to 84.246.161.87 -- renewal in 20325 seconds.
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <warn> could not commit DNS changes: (0) Could not replace /etc/resolv.conf: Operation not permitted
Jun 24 15:46:18 87-84-246-161.wendulka.net NetworkManager[12340]: <info> Activation (enp0s25) Stage 5 of 5 (IPv4 Commit) complete.

NetworkManager sometimes apparently only calls dispatcher.d before finishing the DHCP transaction and thus before adding the domain to the zone list. In my opinion, the dispatcher.d script should be called *after* the DHCP transaction and thus *after* the data is added to the published connection object.

Note that reproducability is not 100%, so it's necessary to restart NetworkManager repeatedly until that happens.

Comment 5 Thomas Haller 2014-06-24 14:42:08 UTC
(In reply to Pavel Šimerda (pavlix) from comment #4)
> Guarding for parallel run is now implemented and there's still a high rate
> of unsuccessful restarts. The problem is definitely in NetworkManager, see
> the logs with --debug argument added to the call to dnssec-trigger-script
> from the dispatcher.d script.
> 
> Jun 24 15:46:17 87-84-246-161.wendulka.net nm-dispatcher.action[12411]:
> Running: run_update
>
> NetworkManager sometimes apparently only calls dispatcher.d before finishing
> the DHCP transaction and thus before adding the domain to the zone list. In
> my opinion, the dispatcher.d script should be called *after* the DHCP
> transaction and thus *after* the data is added to the published connection
> object.
> 
> Note that reproducability is not 100%, so it's necessary to restart
> NetworkManager repeatedly until that happens.

The parts from logging do not show what happens before invoking the dispatcher script. It is also not visible, which dispatcher action was called.

Anyway, a device will be declared ACTIVATED when having IPv6 connectivity. When a device becomes activated, it calls the dispatcher with "UP" action.

When receiving the DHCP4 lease later, it will call dispatcher again with "dhcp4-change" action.


I think, the script should notice that it was called again with dhcp4-change, and do all the configuration again.
It is not enough, to notice that the script is already running from a previous invocation and quit (assuming that the currently running process picks up the most recent changes). Instead, you have to schedule a new configuration run to make sure you really get the latest state.

Comment 6 Pavel Šimerda (pavlix) 2014-06-24 18:43:59 UTC
(In reply to Thomas Haller from comment #5)
> The parts from logging do not show what happens before invoking the
> dispatcher script. It is also not visible, which dispatcher action was
> called.

The problem is that *after* the DHCP configuration and after setting the domain list in the active connection, *no* dispatcher action was called at all. The dnssec-trigger dispatcher script currently doesn't distinguishes actions.

> Anyway, a device will be declared ACTIVATED when having IPv6 connectivity.

Good catch. The testing machine is in an IPv6 capable network. There were related bugs reported in the past, maybe I can find some of them... and they mostly seem to be considered fixed.

Possibly related NM bugzilla tickets:

https://bugzilla.gnome.org/show_bug.cgi?id=563654
https://bugzilla.gnome.org/show_bug.cgi?id=689741
https://bugzilla.gnome.org/show_bug.cgi?id=689742

Our use case only needs any dispatcher script with any action to be run *after* the DHCP transaction is finished and available via active connection API.

> When a device becomes activated, it calls the dispatcher with "UP" action.
> 
> When receiving the DHCP4 lease later, it will call dispatcher again with
> "dhcp4-change" action.

Please note that the reproducibility is somewhere around 20%, so we're certainly not talking about the regular behavior of NetworkManager but about exceptional behavior, although very frequent.

> I think, the script should notice that it was called again with
> dhcp4-change, and do all the configuration again.

The script doesn't read the action field and performs the reconfiguration upon *any* action, see:

http://www.nlnetlabs.nl/svn/dnssec-trigger/trunk/01-dnssec-trigger.in
http://www.nlnetlabs.nl/svn/dnssec-trigger/trunk/dnssec-trigger-script.in

(Some of the latest changes may not yet be upstreamed, including the switch from systemctl --no-block to systemctl --ignore-dependencies and the usage of a locking mechanism used to serialize scripts started in parallel.)

> It is not enough, to notice that the script is already running from a
> previous invocation and quit

That's correct. The script actions are only *serialized*, each run of the script only *waits* for the previous run to finish and then starts its work from the beginning.

I tested NM behavior with a script 00-debug:

#!/bin/bash

echo "*************************"
echo "$0"
echo "$@"
echo "-------------------------"
env
echo "*************************"

And even such a trivial script is *sometimes* only run before the DHCP transaction is logged and is not repeated after. We're definitely talking about NM behavior here, tested in Fedora rawhide:

NetworkManager-0.9.9.1-5.git20140319.fc21.x86_64

Thanks for getting involved.

Cheers,

Pavel

Comment 7 Pavel Šimerda (pavlix) 2014-06-24 18:54:35 UTC
Tomáš (Hozza),

the current build probably still constitutes a regression over the f20 stable one because dnssec-triggerd stop cleans up unbound configuration, then uses a freshly started NetworkManager configuration which is empty, and then *sometimes* fails to be started by NetworkManager dispatcher.d script.

I guess the only reasonable way around (apart from changing NM behavior) is to force dnssec-trigger-script --update a couple of seconds after restarting NM (e.g. by forcing the dnssec-triggerd.service to wait before running the script).

Could you please review correctness of the above information?

Cheers,

Pavel

Comment 8 Tomáš Hozza 2014-06-25 10:29:10 UTC
(In reply to Pavel Šimerda (pavlix) from comment #7)
> Tomáš (Hozza),
> 
> the current build probably still constitutes a regression over the f20
> stable one because dnssec-triggerd stop cleans up unbound configuration,
> then uses a freshly started NetworkManager configuration which is empty, and
> then *sometimes* fails to be started by NetworkManager dispatcher.d script.
> 
> I guess the only reasonable way around (apart from changing NM behavior) is
> to force dnssec-trigger-script --update a couple of seconds after restarting
> NM (e.g. by forcing the dnssec-triggerd.service to wait before running the
> script).

We could use it as temporary workaround, until NM issue is fixed. I think the wait time should be reasonably short (e.g. 1s).

Comment 9 Thomas Haller 2014-06-25 11:43:47 UTC
(In reply to Pavel Šimerda (pavlix) from comment #6)
> (In reply to Thomas Haller from comment #5)
> > The parts from logging do not show what happens before invoking the
> > dispatcher script. It is also not visible, which dispatcher action was
> > called.
> 
> The problem is that *after* the DHCP configuration and after setting the
> domain list in the active connection, *no* dispatcher action was called at
> all. The dnssec-trigger dispatcher script currently doesn't distinguishes
> actions.

When the DHCP configuration changes the dispatcher scripts certainly should be called again with action "dhcp4-change". Anything else would indeed be a bug.

There was a recent bug related to this (https://bugzilla.gnome.org/show_bug.cgi?id=729284), but this got backported to rhel-7, so it really ~should~ work.


Could get a logfile with DEBUG logging enabled. Is that possible to reproduce?


> > Anyway, a device will be declared ACTIVATED when having IPv6 connectivity.
> 
> Good catch. The testing machine is in an IPv6 capable network. There were
> related bugs reported in the past, maybe I can find some of them... and they
> mostly seem to be considered fixed.
> 
> Possibly related NM bugzilla tickets:
> 
> https://bugzilla.gnome.org/show_bug.cgi?id=563654
> https://bugzilla.gnome.org/show_bug.cgi?id=689741
> https://bugzilla.gnome.org/show_bug.cgi?id=689742
> 
> Our use case only needs any dispatcher script with any action to be run
> *after* the DHCP transaction is finished and available via active connection
> API.
> 
> > When a device becomes activated, it calls the dispatcher with "UP" action.
> > 
> > When receiving the DHCP4 lease later, it will call dispatcher again with
> > "dhcp4-change" action.
> 
> Please note that the reproducibility is somewhere around 20%, so we're
> certainly not talking about the regular behavior of NetworkManager but about
> exceptional behavior, although very frequent.
> 
> > I think, the script should notice that it was called again with
> > dhcp4-change, and do all the configuration again.
> 
> The script doesn't read the action field and performs the reconfiguration
> upon *any* action, see:
> 
> http://www.nlnetlabs.nl/svn/dnssec-trigger/trunk/01-dnssec-trigger.in
> http://www.nlnetlabs.nl/svn/dnssec-trigger/trunk/dnssec-trigger-script.in
> 
> (Some of the latest changes may not yet be upstreamed, including the switch
> from systemctl --no-block to systemctl --ignore-dependencies and the usage
> of a locking mechanism used to serialize scripts started in parallel.)
> 
> > It is not enough, to notice that the script is already running from a
> > previous invocation and quit
> 
> That's correct. The script actions are only *serialized*, each run of the
> script only *waits* for the previous run to finish and then starts its work
> from the beginning.
> 
> I tested NM behavior with a script 00-debug:
> 
> #!/bin/bash
> 
> echo "*************************"
> echo "$0"
> echo "$@"
> echo "-------------------------"
> env
> echo "*************************"
> 
> And even such a trivial script is *sometimes* only run before the DHCP
> transaction is logged and is not repeated after. We're definitely talking
> about NM behavior here, tested in Fedora rawhide:
> 
> NetworkManager-0.9.9.1-5.git20140319.fc21.x86_64

Interesting. This version does not contain before mentioned patch 11f9855223966c4fd927fe83e2cd9a623a74acad.

The first (and currently only) release on Rawhide with this patch is NetworkManager-0.9.9.95.git20140609.1963adda.tar.bz2 . Can you still reproduce with this one?

Comment 10 Pavel Šimerda (pavlix) 2014-06-25 13:19:49 UTC
(In reply to Thomas Haller from comment #9)
> > The problem is that *after* the DHCP configuration and after setting the
> > domain list in the active connection, *no* dispatcher action was called at
> > all. The dnssec-trigger dispatcher script currently doesn't distinguishes
> > actions.
> 
> When the DHCP configuration changes the dispatcher scripts certainly should
> be called again with action "dhcp4-change". Anything else would indeed be a
> bug.
> 
> There was a recent bug related to this
> (https://bugzilla.gnome.org/show_bug.cgi?id=729284), but this got backported
> to rhel-7, so it really ~should~ work.

Looks like that is the case. Started bug #1113122 for backporting this fix in Fedora 20.

> > NetworkManager-0.9.9.1-5.git20140319.fc21.x86_64
> 
> Interesting. This version does not contain before mentioned patch
> 11f9855223966c4fd927fe83e2cd9a623a74acad.
> 
> The first (and currently only) release on Rawhide with this patch is
> NetworkManager-0.9.9.95.git20140609.1963adda.tar.bz2 . Can you still
> reproduce with this one?

NetworkManager-0.9.9.95-1.git20140609.fc21.x86_64 seems to be working correctly.

Comment 11 Pavel Šimerda (pavlix) 2014-06-30 07:44:12 UTC
Please confirm that this works correctly with the latest NM[1]. As we are using the same spec file for rawhide and f20, it seems to be impractical to add a dependency on the new version and people should just always use the latest NetworkManager, unbound and dnssec-trigger when testing DNSSEC on Fedora.

[1] NetworkManager-0.9.9.0-40.git20131003.fc20

Comment 12 Pavel Šimerda (pavlix) 2014-06-30 08:42:27 UTC
Finally there's a dependency on the new version in the package, using conditionals to support various fedora versions.

Comment 13 Fedora Update System 2014-06-30 15:30:10 UTC
dnssec-trigger-0.12-12.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/dnssec-trigger-0.12-12.fc20

Comment 14 Fedora Update System 2014-07-01 07:21:50 UTC
Package dnssec-trigger-0.12-12.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing dnssec-trigger-0.12-12.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-7942/dnssec-trigger-0.12-12.fc20
then log in and leave karma (feedback).

Comment 15 Fedora Update System 2014-08-11 11:40:44 UTC
dnssec-trigger-0.12-13.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/dnssec-trigger-0.12-13.fc20

Comment 16 Fedora Update System 2014-09-19 10:07:05 UTC
dnssec-trigger-0.12-13.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 17 Fedora Update System 2015-03-17 11:07:03 UTC
dnssec-trigger-0.12-19.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/FEDORA-2015-3864/dnssec-trigger-0.12-19.fc22

Comment 18 Fedora Update System 2015-03-17 11:07:20 UTC
dnssec-trigger-0.12-19.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/FEDORA-2015-3843/dnssec-trigger-0.12-19.fc21

Comment 19 Fedora Update System 2015-03-26 22:04:50 UTC
dnssec-trigger-0.12-19.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2015-04-21 18:49:07 UTC
dnssec-trigger-0.12-20.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.