1071356 – Racy startup of ipa.service, named.service and dirsrv.target

Bug 1071356 - Racy startup of ipa.service, named.service and dirsrv.target

Summary: Racy startup of ipa.service, named.service and dirsrv.target

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	freeipa
Sub Component:
Version:	21
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Rob Crittenden
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	RejectedBlocker https://fedoraproject...
Depends On:
Blocks:	1139380
TreeView+	depends on / blocked

Reported:	2014-02-28 15:10 UTC by Stef Walter
Modified:	2018-06-01 10:34 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-11-04 18:43:39 UTC
Type:	Bug
Embargoed:
Dependent Products:
Flags:	stefw: needinfo-

Attachments	(Terms of Use)

Description Stef Walter 2014-02-28 15:10:45 UTC

Description of problem:

ipa.service fails to start during boot due to named.service failures.

Booting a Qemu virtual machine with freeipa. This VM is booted on my machine as part of the cockpit integration tests.

Changing the following line in /lib/systemd/system/ipa.service from:

Requires=network.target

to:

Requires=network.target dirsrv.target

Fixes the problem.

[root@f0 ~]# rpm -q freeipa-server
freeipa-server-3.3.4-3.fc20.x86_64

Comment 1 Stef Walter 2014-02-28 15:11:25 UTC

The named startup failures look like this, although I've seen other ones too.

Feb 28 17:00:06 f0.cockpit.lan systemd[1]: Unit named.service cannot be reloaded because it is inactive.
Feb 28 17:00:06 f0.cockpit.lan named[803]: starting BIND 9.9.4-P2-RedHat-9.9.4-11.P2.fc20 -u named
Feb 28 17:00:06 f0.cockpit.lan named[803]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--pre
Feb 28 17:00:06 f0.cockpit.lan named[803]: ----------------------------------------------------
Feb 28 17:00:06 f0.cockpit.lan named[803]: BIND 9 is maintained by Internet Systems Consortium,
Feb 28 17:00:06 f0.cockpit.lan named[803]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Feb 28 17:00:06 f0.cockpit.lan named[803]: corporation.  Support and training for BIND 9 are
Feb 28 17:00:06 f0.cockpit.lan named[803]: available at https://www.isc.org/support
Feb 28 17:00:06 f0.cockpit.lan named[803]: ----------------------------------------------------
Feb 28 17:00:06 f0.cockpit.lan named[803]: adjusted limit on open files from 4096 to 1048576
Feb 28 17:00:06 f0.cockpit.lan named[803]: found 1 CPU, using 1 worker thread
Feb 28 17:00:06 f0.cockpit.lan named[803]: using 1 UDP listener per interface
Feb 28 17:00:06 f0.cockpit.lan named[803]: using up to 4096 sockets
Feb 28 17:00:06 f0.cockpit.lan named[803]: loading configuration from '/etc/named.conf'
Feb 28 17:00:06 f0.cockpit.lan systemd[1]: Stopping Berkeley Internet Name Domain (DNS)...
Feb 28 17:00:06 f0.cockpit.lan systemd[1]: Stopped Berkeley Internet Name Domain (DNS).
Feb 28 17:00:06 f0.cockpit.lan named[803]: using default UDP/IPv4 port range: [1024, 65535]
Feb 28 17:00:06 f0.cockpit.lan named[803]: using default UDP/IPv6 port range: [1024, 65535]
Feb 28 17:00:06 f0.cockpit.lan named[803]: listening on IPv6 interfaces, port 53
Feb 28 17:00:06 f0.cockpit.lan named[803]: could not listen on UDP socket: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: listening on all IPv6 interfaces failed
Feb 28 17:00:06 f0.cockpit.lan named[803]: listening on IPv4 interface lo, 127.0.0.1#53
Feb 28 17:00:06 f0.cockpit.lan named[803]: could not listen on UDP socket: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: creating IPv4 interface lo failed; interface ignored
Feb 28 17:00:06 f0.cockpit.lan named[803]: listening on IPv4 interface eth0, 10.111.111.100#53
Feb 28 17:00:06 f0.cockpit.lan named[803]: could not listen on UDP socket: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: creating IPv4 interface eth0 failed; interface ignored
Feb 28 17:00:06 f0.cockpit.lan named[803]: not listening on any interfaces
Feb 28 17:00:06 f0.cockpit.lan named[803]: generating session key for dynamic DNS
Feb 28 17:00:06 f0.cockpit.lan named[803]: sizing zone task pool based on 6 zones
Feb 28 17:00:06 f0.cockpit.lan named[803]: listening on IPv6 interfaces, port 53
Feb 28 17:00:06 f0.cockpit.lan named[803]: could not listen on UDP socket: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: listening on all IPv6 interfaces failed
Feb 28 17:00:06 f0.cockpit.lan named[803]: additionally listening on IPv4 interface lo, 127.0.0.1#53
Feb 28 17:00:06 f0.cockpit.lan named[803]: could not listen on UDP socket: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: creating IPv4 interface lo failed; interface ignored
Feb 28 17:00:06 f0.cockpit.lan named[803]: additionally listening on IPv4 interface eth0, 10.111.111.100#53
Feb 28 17:00:06 f0.cockpit.lan named[803]: could not listen on UDP socket: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: creating IPv4 interface eth0 failed; interface ignored
Feb 28 17:00:06 f0.cockpit.lan named[803]: loading configuration: shutting down
Feb 28 17:00:06 f0.cockpit.lan named[803]: exiting (due to fatal error)

The ipa.service logs look like this:

Feb 28 16:59:59 f0.cockpit.lan systemd[1]: Starting Identity, Policy, Audit...
Feb 28 17:00:06 f0.cockpit.lan ipactl[358]: Job for named.service failed. See 'systemctl status named.service' and 'journalctl -xn' for details.
Feb 28 17:00:06 f0.cockpit.lan ipactl[358]: Failed to start named Service
Feb 28 17:00:06 f0.cockpit.lan ipactl[358]: Shutting down
Feb 28 17:00:08 f0.cockpit.lan ipactl[358]: Aborting ipactl
Feb 28 17:00:08 f0.cockpit.lan ipactl[358]: Starting Directory Service
Feb 28 17:00:08 f0.cockpit.lan ipactl[358]: Starting krb5kdc Service
Feb 28 17:00:08 f0.cockpit.lan ipactl[358]: Starting kadmin Service
Feb 28 17:00:08 f0.cockpit.lan ipactl[358]: Starting named Service
Feb 28 17:00:08 f0.cockpit.lan systemd[1]: ipa.service: main process exited, code=exited, status=1/FAILURE
Feb 28 17:00:08 f0.cockpit.lan systemd[1]: Failed to start Identity, Policy, Audit.
Feb 28 17:00:08 f0.cockpit.lan systemd[1]: Unit ipa.service entered failed state.

Comment 2 Stef Walter 2014-02-28 15:13:12 UTC

[root@f0 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.111.111.100	f0.cockpit.lan f0
[root@f0 ~]# hostname
f0.cockpit.lan

Comment 3 Stef Walter 2014-02-28 15:13:43 UTC

[root@f0 ~]# cat /etc/named.conf 
options {
	// turns on IPv6 for port 53, IPv4 is on by default for all ifaces
	listen-on-v6 {any;};

	// Put files that named is allowed to write in the data/ directory:
	directory "/var/named"; // the default
	dump-file		"data/cache_dump.db";
	statistics-file		"data/named_stats.txt";
	memstatistics-file	"data/named_mem_stats.txt";

	forward first;
	forwarders { };

	// Any host is permitted to issue recursive queries
	allow-recursion { any; };

	tkey-gssapi-keytab "/etc/named.keytab";
	pid-file "/run/named/named.pid";
};

/* If you want to enable debugging, eg. using the 'rndc trace' command,
 * By default, SELinux policy does not allow named to modify the /var/named directory,
 * so put the default debug log file in data/ :
 */
logging {
	channel default_debug {
		file "data/named.run";
		severity dynamic;
		print-time yes;
	};
};

zone "." IN {
	type hint;
	file "named.ca";
};

include "/etc/named.rfc1912.zones";

dynamic-db "ipa" {
	library "ldap.so";
	arg "uri ldapi://%2fvar%2frun%2fslapd-COCKPIT-LAN.socket";
	arg "base cn=dns, dc=cockpit,dc=lan";
	arg "fake_mname f0.cockpit.lan.";
	arg "auth_method sasl";
	arg "sasl_mech GSSAPI";
	arg "sasl_user DNS/f0.cockpit.lan";
	arg "serial_autoincrement yes";
};

Comment 4 Martin Kosek 2014-03-03 08:10:10 UTC

Stef, thanks for the report! Alexander, Petr2 - are you OK with the proposed fix? If yes, I will clone a bug.

Comment 5 Alexander Bokovoy 2014-03-03 08:34:56 UTC

I'm not yet OK with that. We are experimenting with different settings with Marius of Cockpit team and he sees shutdown lock on ipa.service when we use 'Require=network.target dirsrv.target; After=dirsrv.target'.

More experiments are due today.

I expect we'll need to clone the bug to RHEL7.0 too.

Comment 6 Marius Vollmer 2014-03-03 09:40:21 UTC

I don't think this has anything to do with dirsrv.service.

ipactl starts dirsrv and actually uses it before starting any of the other services, and named doesn't reach the point where it loads the bind-dyndb-ldap module.  It looks like it simply fails to open its own port 53 for listening.

I can't reproduce this, so I can't really do further investigation...

Comment 7 Marius Vollmer 2014-03-03 10:09:26 UTC

I guess named has some requirements on the network configuration that might or might not be fulfilled by the time it is activated.  The "network.target" itself doesn't do much, according to

 http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

so maybe named.service needs to get a Requiers+After=network-online field.

Comment 8 Marius Vollmer 2014-03-03 10:17:22 UTC

But then again, named seems to enumerate all interfaces and then listens on each of them.  So, we must be in a scenario where there is a interface with a ip address, but named can't listen on it.  I have no idea how that might happen.

Comment 9 Alexander Bokovoy 2014-03-03 10:19:52 UTC

Our current working theory is that this is actually named vs network race. Named wants to listen to specific IPv4 addresses (and :::53 on IPv6 too) and the first part (IPv4) fails before NetworkManager actually completes address assignment.

Now, network.target in systemd has little meaning as well as in System V. There is network-online.target (see systemd.special(7)) that can be used to depend on a working networking configuration, whatever that means. 

The dependency actually should be Wants+After, not Requires+After.

At this point I think we need to ask Petr Spacek first for named-specific advise as this seems to be in the hands of named systemd unit configuration.

Comment 10 Petr Spacek 2014-03-03 12:13:58 UTC

I can't say anything until I have a machine with reproducer and time to investigate it.

Named really enumerates network interfaces and but I have no idea what systemd and NetworkManaged do behind scenes.

Comment 11 Marius Vollmer 2014-03-06 09:24:47 UTC

I haven't been able to reproduce the exact journal output in comment 1, but I could verify that there is a time window during named startup where it replies with NXDOMAIN to a legitimate query.

I was running this on one machine:

  $ while true; do dig _ldap._tcp.cockpit.lan SRV | grep status; done

and "systemctl restart named" on the one running named.  I can see this very occasionally:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30286
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34741
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 32372
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50643
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20848

My guess is that named answers queries before actually loading the zone from ldap.  After that, things work as expected.

Thus, this is not a startup failure, and I doubt that anyone has ever run into this outside of our testsuite...

Comment 12 Stef Walter 2014-03-06 10:08:22 UTC

In the original bug report named never comes up.

While there may also be a race during which the name server is up and responding NXDOMAIN, this is a different race than the one reported here.

Comment 13 Marius Vollmer 2014-03-06 10:53:23 UTC

(In reply to Stef Walter from comment #12)
> In the original bug report named never comes up.

Yes, sorry for mixing these two things.

Comment 14 Petr Spacek 2014-03-06 12:30:57 UTC

(In reply to Marius Vollmer from comment #11)
> I could verify that there is a time window during named startup where it
> replies with NXDOMAIN to a legitimate query.
> 
> I was running this on one machine:
> 
>   $ while true; do dig _ldap._tcp.cockpit.lan SRV | grep status; done
> 
> and "systemctl restart named" on the one running named.  I can see this very
> occasionally:
> 
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30286
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34741
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 32372
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50643
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20848
> 
> My guess is that named answers queries before actually loading the zone from
> ldap.  After that, things work as expected.
> 
> Thus, this is not a startup failure, and I doubt that anyone has ever run
> into this outside of our testsuite...

This is "expected" behavior for bind-dyndb-ldap version 4.0 and higher:
See https://git.fedorahosted.org/cgit/bind-dyndb-ldap.git/tree/NEWS for version 4.0 point [5].

It simply takes some time to load all the data from LDAP to named.

If you want to see some other behavior please open a bug against bind-dyndb-ldap component.

Comment 15 Martin Kosek 2014-03-07 11:32:10 UTC

Any update? Is this still a problem? It seems to me that if Stef has a VM reproducing this issue that he could lend to Petr or Alexander, it could help them investigate.

Comment 16 Martin Kosek 2014-03-14 13:08:59 UTC

I see no updates, given that we were not able to reproduce with recent systemd versions andPetr confirmed that issue described in Comment 11 is expected, I will consider closing this bug.

Comment 17 Stef Walter 2014-03-14 16:54:47 UTC

Martin, which systemd version fixes this? Marius and I can update our image (which still fails 100% reliably for me) and retest.

BTW, I sent Alexander the URL of our images on the day this occurred. Will send the URL them again after updating them.

Comment 18 Martin Kosek 2014-03-17 08:37:15 UTC

I was just guessing that this was fixed in systemd given we could not reproduce it on our updated VMs. If you are reproducing the issue with recent systemd version and we access to image, we should be then able to investigate - thanks.

Comment 19 Martin Kosek 2014-08-01 10:59:31 UTC

We did not manage to reproduce the problem ourselves and neither was able to access some VM where the problem reproduces. Closing the bug as WORKSFORME.

Please feel free to reopen it if you reproduce the problem with up to date Fedora which we could get our hands on.

Comment 20 Stef Walter 2014-09-16 12:43:16 UTC

I can duplicate this pretty regularly now via the Cockpit integration tests in virtual machines. I'm happy to work with someone on IRC to help them run the tests and reproduce it too.

As noted above, it's not something that is reproducible every time. But in my case it's about 60% of the time.

Here is the method that I use. I've done this on Fedora 21 updated to TC6:

 $ sudo yum-builddep cockpit
 $ sudo yum install mock qemu qemu-kvm qemu-img python curl trickle nbd-server npm python-libguestfs libvirt-client
 $ git clone git://github.com/stefwalter/cockpit
 $ cd cockpit/test
 $ git checkout broken-ipa-startup
 $ ./testsuite-prepare
 $ ./check-broken-ipa-startup

If it outputs 'IPA started up fine' then it didn't happen. It'll say 'IPA didn't start up right' when it fails.

In this case the test waits for about 30 seconds before declaring success/failure.

The script prints out login information to the IPA virtual machine, and then just hangs waiting for you diagnose the problem further. Press Ctrl-C when you're done.

Comment 21 Stef Walter 2014-09-16 12:50:15 UTC

As before here are some logs: http://fpaste.org/133870/14108717/

A later 'systemctl start ipa' after boot succeeds. But that's not what this bug is about.

Comment 22 Fedora Blocker Bugs Application 2014-10-22 11:58:53 UTC

Proposed as a Blocker for 21-beta by Fedora user sgallagh using the blocker tracking app because:

 Nominating as a Beta blocker per criterion "With the Domain Controller role active and correctly configured:

    Multiple clients must be able to enroll and unenroll in the domain
    Client systems must be able to authenticate users with Kerberos
    The FreeIPA configuration web UI must be available and allow at least basic configuration of user accounts and permissions "

Without proper DNS services, the clients will not function properly or be able to enroll with realmd.

Comment 23 Stef Walter 2014-10-22 15:05:56 UTC

Confirming, can still reproduce.

Comment 24 Adam Williamson 2014-10-22 15:58:57 UTC

Is it possible this is simply caused by the crasher bug in named - https://bugzilla.redhat.com/show_bug.cgi?id=1155334 ? Can you check logs and see if it's not a race, but named going away because it crashed?

Comment 25 Petr Spacek 2014-10-22 16:34:39 UTC

On behalf of IDM team:

(In reply to Fedora Blocker Bugs Application from comment #22)
>  Nominating as a Beta blocker per criterion "With the Domain Controller role
> active and correctly configured:
> 
>     Multiple clients must be able to enroll and unenroll in the domain
>     Client systems must be able to authenticate users with Kerberos
>     The FreeIPA configuration web UI must be available and allow at least
> basic configuration of user accounts and permissions "
> 
> Without proper DNS services, the clients will not function properly or be
> able to enroll with realmd.

The workaround is really simple: run 'ipactl restart' or 'systemctl restart ipa.service' if it accidentally failed to start.

In normal scenario the IPA server is installed and started once and then many clients enroll to it.

The workaround is so simple that we don't consider the bug as a blocker.

We did analysis of the bug and there is not simple way how to fix it so we are unable to provide fix today.

Comment 26 Adam Williamson 2014-10-22 17:37:59 UTC

Discussed at 2014-10-22 blocker review meeting: http://meetbot.fedoraproject.org/fedora-blocker-review/2014-10-22/f21-blocker-review.2014-10-22-16.03.log.txt . Rejected as a blocker: it is a conditional violation (it only breaks sometimes) and we consider the simple workaround from c#25 sufficient mitigation. It will be documented in CommonBugs for Beta.

Comment 27 Stef Walter 2014-10-22 20:31:18 UTC

(In reply to Petr Spacek from comment #25)
> On behalf of IDM team:
> 
> (In reply to Fedora Blocker Bugs Application from comment #22)
> >  Nominating as a Beta blocker per criterion "With the Domain Controller role
> > active and correctly configured:
> > 
> >     Multiple clients must be able to enroll and unenroll in the domain
> >     Client systems must be able to authenticate users with Kerberos
> >     The FreeIPA configuration web UI must be available and allow at least
> > basic configuration of user accounts and permissions "
> > 
> > Without proper DNS services, the clients will not function properly or be
> > able to enroll with realmd.
> 
> The workaround is really simple: run 'ipactl restart' or 'systemctl restart
> ipa.service' if it accidentally failed to start.
> 
> In normal scenario the IPA server is installed and started once and then
> many clients enroll to it.
> 
> The workaround is so simple that we don't consider the bug as a blocker.

IPA server should start up on the machine when the IPA machine boots. 

If your critical infrastructure doesn't come back up after an outage, migration, or other reboot, this is a bug. 

But I understand if you don't want to fix this for F21.

Comment 28 Adam Williamson 2014-10-23 02:07:34 UTC

Blocker status was considered for 21 *Beta*, not Final.

Comment 29 Martin Kosek 2014-10-23 07:30:20 UTC

(In reply to Stef Walter from comment #27)
...
> IPA server should start up on the machine when the IPA machine boots. 

+1, it normally does. We are still not sure about the root cause in your case.

> If your critical infrastructure doesn't come back up after an outage,
> migration, or other reboot, this is a bug. 
> 
> But I understand if you don't want to fix this for F21.

I think we should continue with investigation if you keep helping us with reproduction.

Comment 30 Petr Spacek 2014-10-31 08:27:15 UTC

Martin, could you set expectations for F21 final release?

Comment 31 Martin Kosek 2014-10-31 09:14:29 UTC

Nobody still shared the exact root cause. It is also not trivial to reproduce, as we learned, and it happens only in some setups. However, IIUC correctly, this is caused by some race condition IPA start procedure caused by the IPA systemd service architecture. This may be removed several ways:

1) With systemd refactoring (planned in https://fedorahosted.org/freeipa/ticket/4552). This change affects not only IPA, but also all other dependent services to provide SD notify.

2) If anyone is finally able to find the exact root cause/race condition and add some workaround/fix to ipa.service/ipactl/systemd/...

3) Adding a workaround to Cockpit to check ipa.service status in the affected scenario and start the service again if it is not started. But this is of course just a workaround.

2) or 3) can be done in F21 Final. 1) obviously cannot be done.

Comment 32 Stef Walter 2014-10-31 15:52:43 UTC

(In reply to Martin Kosek from comment #31)
> 3) Adding a workaround to Cockpit to check ipa.service status in the
> affected scenario and start the service again if it is not started. But this
> is of course just a workaround.

We have such a work around in our test suite ... the version of Cockpit in F21 doesn't have deep integration with rolekit or other means for installing IPA.

Comment 33 Martin Kosek 2015-09-15 10:35:20 UTC

Let us reiterate in this long-standing Bugzilla. Is still an issue or at least better reproducible so that it can be either investigated on some VM or closed at all?

Comment 34 Fedora End Of Life 2015-11-04 10:30:15 UTC

This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 21 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 35 Adam Williamson 2015-11-04 18:43:39 UTC

I don't recall this happening ever on my prod server lately, so I guess we can just close it.

Note You need to log in before you can comment on or make changes to this bug.