Bug 1656519 - certmonger.service starts but certmonger does not reply on D-Bus in container
Summary: certmonger.service starts but certmonger does not reply on D-Bus in container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: certmonger
Version: 29
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Nalin Dahyabhai
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1763745
TreeView+ depends on / blocked
 
Reported: 2018-12-05 16:54 UTC by Jan Pazdziora
Modified: 2020-02-10 02:38 UTC (History)
6 users (show)

Fixed In Version: certmonger-0.79.9-1.fc31 certmonger-0.79.9-1.fc30
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1763745 (view as bug list)
Environment:
Last Closed: 2019-09-06 17:49:20 UTC
Type: Bug


Attachments (Terms of Use)

Description Jan Pazdziora 2018-12-05 16:54:53 UTC
Description of problem:

Starting with Fedora 27

Version-Release number of selected component (if applicable):

In the container:

certmonger-0.79.6-3.fc29.x86_64
dbus-1.12.10-1.fc29.x86_64
systemd-239-3.fc29.x86_64

On the host:

docker-1.13.1-62.git9cb56fd.fc29.x86_64

How reproducible:

Not deterministic.

Steps to Reproduce:
1. Allow running systemd in containers:
   setsebool -P container_manage_cgroup 1
2. Run systemd container:
   docker run -e container=oci --name=systemd -d --rm registry.fedoraproject.org/fedora:29 /usr/sbin/init
3. Install certmonger in the container:
   docker exec systemd dnf install -y certmonger
4. In the container, start certmonger, check that systemd sees it running, and try to introspect it:
   docker exec systemd bash -c 'systemctl restart certmonger ; systemctl status certmonger ; dbus-send --system --type=method_call --print-reply --dest=org.fedorahosted.certmonger /org/fedorahosted/certmonger org.freedesktop.DBus.Introspectable.Introspect'

Actual results:

● certmonger.service - Certificate monitoring and PKI enrollment
   Loaded: loaded (/usr/lib/systemd/system/certmonger.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-12-05 15:28:13 UTC; 48ms ago
 Main PID: 241 (certmonger)
   CGroup: /system.slice/docker-d0a9b82bcc42b9759b76b701a6e75e23cccff1e96db61c43a089501e1014da2f.scope/system.slice/certmonger.service
           ├─241 /usr/sbin/certmonger -S -p /var/run/certmonger.pid -n
           └─243 /usr/sbin/certmonger -S -p /var/run/certmonger.pid -n

Dec 05 15:28:12 d0a9b82bcc42 systemd[1]: Starting Certificate monitoring and PKI enrollment...
Dec 05 15:28:13 d0a9b82bcc42 systemd[1]: Started Certificate monitoring and PKI enrollment.

Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Expected results:

● certmonger.service - Certificate monitoring and PKI enrollment
   Loaded: loaded (/usr/lib/systemd/system/certmonger.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-12-05 15:28:13 UTC; 48ms ago
 Main PID: 241 (certmonger)
   CGroup: /system.slice/docker-d0a9b82bcc42b9759b76b701a6e75e23cccff1e96db61c43a089501e1014da2f.scope/system.slice/certmonger.service
           ├─241 /usr/sbin/certmonger -S -p /var/run/certmonger.pid -n
           └─243 /usr/sbin/certmonger -S -p /var/run/certmonger.pid -n

Dec 05 15:28:12 d0a9b82bcc42 systemd[1]: Starting Certificate monitoring and PKI enrollment...
Dec 05 15:28:13 d0a9b82bcc42 systemd[1]: Started Certificate monitoring and PKI enrollment.

method return time=1544023753.120209 sender=:1.3 -> destination=:1.5 serial=22 reply_serial=2
   string "<!DOCTYPE node PUBLIC "-//freedesktop//DTD D-BUS Object Introspection 1.0//EN"
"http://www.freedesktop.org/standards/dbus/1.0/introspect.dtd">

<node name="/org/fedorahosted/certmonger">
 <interface name="org.freedesktop.DBus.Introspectable">
  <method name="Introspect">
   <arg name="xml_data" type="s" direction="out"/>
  </method>
 </interface>
 <interface name="org.freedesktop.DBus.Properties">
  <method name="Get">
   <arg name="interface_name" type="s" direction="in"/>
   <arg name="property_name" type="s" direction="in"/>
   <arg name="value" type="v" direction="out"/>
  </method>
  <method name="Set">
   <arg name="interface_name" type="s" direction="in"/>
   <arg name="property_name" type="s" direction="in"/>
   <arg name="value" type="v" direction="in"/>
  </method>
[...]

Additional info:

https://github.com/freeipa/freeipa-container/issues/187#issuecomment-391687818

Comment 2 Jan Pazdziora 2018-12-05 17:10:24 UTC
The

/usr/lib/systemd/system/certmonger.service

is defined as

[Unit]
Description=Certificate monitoring and PKI enrollment
After=syslog.target network.target dbus.service

[Service]
Type=dbus
PIDFile=/var/run/certmonger.pid
EnvironmentFile=-/etc/sysconfig/certmonger
ExecStart=/usr/sbin/certmonger -S -p /var/run/certmonger.pid -n $OPTS
BusName=org.fedorahosted.certmonger

[Install]
WantedBy=multi-user.target

And systemd.service(5) says

           Behavior of dbus is similar to simple; however, it is expected that
           the daemon acquires a name on the D-Bus bus, as configured by
           BusName=. systemd will proceed with starting follow-up units after
           the D-Bus bus name has been acquired. Service units with this
           option configured implicitly gain dependencies on the dbus.socket
           unit. This type is the default if BusName= is specified.

It does not specifically say that the service will be made active/running after it has acquired the D-Bus bus name but it used to behave this way up until Fedora 26 -- ipa-server-install (where the issues was first observed and from which it was minimized to this bug report) tests were not failing because of certmonger.service startup.

Comment 5 Jan Pazdziora 2019-07-16 14:42:03 UTC
After certmonger was started, list of defunct child processes gradually grows up to

root       786  0.2  0.1  17424  7460 ?        Ss   14:38   0:00 /usr/sbin/certmonger -S -p /var/run/certmonger.pid -n
root       788  4.9  0.0      0     0 ?        Z    14:38   0:01  \_ [ipa-submit] <defunct>
root       793  3.2  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       794  3.2  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       796  3.4  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       798  3.5  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       800  3.7  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       801  3.9  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       803  4.0  0.0      0     0 ?        Z    14:38   0:00  \_ [ipa-submit] <defunct>
root       805  4.1  0.0      0     0 ?        Z    14:38   0:00  \_ [certmaster-subm] <defunct>
root       807  4.4  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       808  4.7  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       810  4.6  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       812  5.0  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       814  5.4  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       816  5.8  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       817  5.6  0.0      0     0 ?        Z    14:39   0:00  \_ [certmaster-subm] <defunct>
root       819  6.2  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       821  6.8  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       823  7.8  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       825  8.2  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       826  8.4  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       828  9.3  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       830 10.7  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       832 12.5  0.0      0     0 ?        Z    14:39   0:00  \_ [dogtag-ipa-rene] <defunct>
root       833 12.5  0.0      0     0 ?        Z    14:39   0:00  \_ [local-submit] <defunct>
root       835 15.2  0.0      0     0 ?        Z    14:39   0:00  \_ [local-submit] <defunct>
root       837 32.7  0.0      0     0 ?        Z    14:39   0:01  \_ [local-submit] <defunct>
root       839 25.3  0.0      0     0 ?        Z    14:39   0:00  \_ [local-submit] <defunct>
root       840 37.5  0.0      0     0 ?        Z    14:39   0:00  \_ [local-submit] <defunct>
root       842 76.0  0.0      0     0 ?        Z    14:39   0:00  \_ [local-submit] <defunct>
root       844  0.0  0.0  17424  1016 ?        R    14:39   0:00  \_ /usr/sbin/certmonger -S -p /var/run/certmonger.pid -n

During that time, certmonger does not seem to respond over D-Bus.

The container does not have anything about dogtag or ipa installed, so I'm not sure what certmonger is trying to do here.

Comment 6 Jan Pazdziora 2019-07-16 14:56:26 UTC
It seems like on the host (outside of container), these children are executed as well during systemctl restart certmonger (bash -c 'echo $$' before and after the systemctl start shows 38 pids go by) but it all happens much faster, at least looking at the journalctl -f output (for certmonger configured with OPTS="-d 10" in /etc/sysconfig/certmonger) so I never manage to see those children in the ps output.

Comment 7 Rob Crittenden 2019-07-16 15:20:14 UTC
certmonger validates the CA configuration. For example, for the local CA if the self-signed CA is not present it will generate one.

Does certmonger ever eventually reply once these tasks clean up?

Comment 8 Jan Pazdziora 2019-07-16 15:31:33 UTC
If I rerun that dbus-send after a while, it will reply. But it can take many seconds for those forked processes to clear out in the container case. On the host, on the same machine, they go away (and certmonger get responsive) much faster.

The problem is that ipa-server-install runs systemctl restart certmonger and when that systemctl command finishes, it starts the next operation which tries to use certmonger. But at that point certmonger is still running those startup validation steps, making FreeIPA setup fail in nondeterministic fashion; it also kills FreeIPA upgrades.

Comment 9 Rob Crittenden 2019-07-16 15:38:37 UTC
Ok, fair point. I've seen others complain about this behavior of tons of spawned processes hanging around a while, even in the non-container case, but haven't been able to replicate it. I'll have to see if there is a way I can tell what is going on.

Comment 10 Rob Crittenden 2019-08-29 16:28:44 UTC
I have a candidate IPA patch which should mitigate this somewhat, https://github.com/freeipa/freeipa/pull/3596

The IPA certmonger CA uses a locker, ipa-server-guard, to ensure serial renewals. This was executing on operations that it didn't need to. Return earlier for operations not supported by ipa-submit. This avoids a bunch of locked calls that are guaranteed to fail.

The other piece is in the renewal script for the CA subsystem certificates. The IPA API was always initialized whether it need to be or not, which is costly. I also moved some imports around so that some are deferred and only used if needed. This should help in startup time.

There is still a flurry of activity during startup but in my testing it ends 50% sooner with less overall load.

Comment 11 Rob Crittenden 2019-09-03 16:43:57 UTC
Duplicating some info from the PR that is more relevant for this more vanilla case.

During startup certmonger forks each CA a number of times performing operations on it to obtain their capabilities, six in total. So the number of forks is 6 x # of CAs.

A strategy for reducing these startup time would be to remove CA's that will not be used in the container using getcert remove-ca <name>

I wasn't able to reproduce the failure on a F30 VM running the F29 container in the steps, it connects the DBus after a second or two pause.

I realize this BZ isn't specifically about IPA/dogtag but is a more vanilla case but IPA exacerbates some of the issues because it ends up serializing some of the requests using locks. The above PR is an attempt to mitigate that specific case.

Comment 12 Rob Crittenden 2019-09-03 16:45:55 UTC
Typo, it's 8 forks per CA not 6.

Comment 13 Jan Pazdziora 2019-09-05 14:00:57 UTC
The real reason for the different and much slower behaviour in containers is the code 

        for (i = getdtablesize() - 1; i >= 3; i--) {
                if ((i == fd) ||
                    (i == fd2) ||
                    (i == fd3)) {
                        continue;
                }
                l = fcntl(i, F_GETFD);
                if (l != -1) {
                        if (fcntl(i, F_SETFD, l | FD_CLOEXEC) != 0) {
                                cm_log(0, "Potentially leaking FD %d.\n", i);
                        }
                }
        }

in src/subproc.c.

In containers, the ulimit seems to be set to

# ulimit -n
1048576

so the processes loop over huge number of invalid file descriptors.

Comment 14 Jan Pazdziora 2019-09-05 14:10:32 UTC
Setting the limit to the same value as observed on the host seems to help:

( echo "[Service]" ; echo "LimitNOFILE=1024" ) > /usr/lib/systemd/system/certmonger.service.d/ulimit.conf

Comment 15 Jan Pazdziora 2019-09-05 14:25:48 UTC
Alternatively:

echo "DefaultLimitNOFILE=1024" >> /etc/systemd/system.conf

Comment 16 Jan Pazdziora 2019-09-05 15:05:46 UTC
I guess this bugzilla can be closed because ensuring FD_CLOEXEC is set is a good thing, and there isn't a better way in Linux than cycling through all potential file descriptors, which getdtablesize() is a good indication. So there's nothing to be done in certmonger, really.

Any patch that certmonger can get to minimize unneeded work is of course great so feel free to track it here if you prefer to. But the real cause needs to be handled in the container, setting the limits appropriately. We have a pull request https://github.com/freeipa/freeipa-container/pull/283 under testing now.

Comment 17 Jan Pazdziora 2019-09-05 15:16:06 UTC
Actually, listing fds with /proc/self/fd instead of using the getdtablesize() approach might be a valid approach, should certmonger team prefer to do that.

Comment 18 Rob Crittenden 2019-09-05 17:21:05 UTC
Suggestion implemented in https://pagure.io/certmonger/pull-request/130

Comment 19 Rob Crittenden 2019-09-06 17:49:20 UTC
master: 9bbb628620d4e586941344e1bdbbc166a885c0a9

Comment 20 Rob Crittenden 2019-10-10 20:46:01 UTC
Additional fix to not close STDERR when fetching CA data.

master: b7bcb1b3b953c2052e2d89cb2b3e9d9ccd1b3864

Comment 21 Fedora Update System 2020-02-01 01:04:11 UTC
FEDORA-2020-773d9d9103 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2020-773d9d9103

Comment 22 Fedora Update System 2020-02-02 01:15:15 UTC
certmonger-0.79.9-1.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-319545c382

Comment 23 Fedora Update System 2020-02-02 02:37:00 UTC
certmonger-0.79.9-1.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-773d9d9103

Comment 24 Fedora Update System 2020-02-10 01:42:15 UTC
certmonger-0.79.9-1.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

Comment 25 Fedora Update System 2020-02-10 02:38:20 UTC
certmonger-0.79.9-1.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.