Bug 1437952

Summary: [openshift] fluentd: fluentd container is keeping shm mount point of other containers busy
Product: OpenShift Container Platform
Reporter: Peter Portante <pportant>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.4.1
CC: aos-bugs, asherkho, brubisch, dma, dwalsh, eparis, jcantril, jgoulding, jokerman, jupierce, juzhao, mmccomas, mpatel, mwoodson, nalin, pportant, pweil, rmeggins, vgoyal, whearn, wsun, xiazhao
Target Milestone: ---
Keywords: OpsBlocker
Target Release: 3.4.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: fluentd mounted a hostPath volume and kept other containers' filesystems busy.
Consequence: the cluster was unable to terminate pods.
Fix: unmount /var/lib/docker/containers/*/shm on fluentd start.
Result: pods are able to be deleted.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-15 18:37:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Peter Portante 2017-03-31 14:14:08 UTC
We see the following emails sent from the root account:

-----
Date: Tue, 28 Mar 2017 12:01:02 GMT
From: "(Cron Daemon)" <root@gprfc061>
To: root@gprfc061
Content-Type: text/plain; charset=UTF-8
Subject: [perf-dept-admins] Cron <root@gprfc061> run-parts /etc/cron.hourly

/etc/cron.hourly/docker-cleanup:

Error response from daemon: Unable to remove filesystem for 1d348b2449ad4e07f928c18cf7ba422dd6954afcc8bf3b7a0108b6152642e618: remove /var/lib/docker/containers/1d348b2449ad4e07f928c18cf7ba422dd6954afcc8bf3b7a0108b6152642e618/shm: device or resource busy
-----

We are running on RHEL 7.3.  docker-storage-setup deferred deletion is false.

Comment 1 Vivek Goyal 2017-03-31 14:24:15 UTC
Peter, so what's the issue? Is the email being sent the issue you want fixed?

I think the cron job sends an email on failure by default. I will check whether there is a way to change that behavior.

Comment 2 Vivek Goyal 2017-03-31 14:34:43 UTC
Container deletion is still failing because the device is still mounted in some other namespace. So either the container is still running or the mount point somehow leaked into another mount namespace.

After the failure, can you run the following script and paste the output here?

https://github.com/rhvgoyal/misc/blob/master/find-busy-mnt.sh

#./find-busy-mnt.sh <failing-container-id>

In this case:

#./find-busy-mnt.sh <1d348b2449ad4e07f928c18cf7b>
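
For reference, a minimal sketch of what such a busy-mount finder does (a sketch only, not the upstream script): walk /proc, grep each process's mountinfo for the container id, and print the PID, command name, and mount namespace.

#!/bin/bash
# Sketch only: find processes whose mount namespace still holds a mount
# whose path contains the given container id (or id prefix).
# Usage: ./find-busy-mnt.sh <container-id>
ID="$1"
printf "%-8s %-16s %s\n" "PID" "NAME" "MNTNS"
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    [ -r "/proc/$pid/mountinfo" ] || continue
    if grep -q "$ID" "/proc/$pid/mountinfo" 2>/dev/null; then
        printf "%-8s %-16s %s\n" "$pid" \
            "$(cat /proc/$pid/comm 2>/dev/null)" \
            "$(readlink /proc/$pid/ns/mnt 2>/dev/null)"
    fi
done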

Comment 3 Anton Sherkhonov 2017-03-31 15:50:46 UTC
probably cause is similar to this: https://bugzilla.redhat.com/show_bug.cgi?id=1391665

Comment 4 Peter Portante 2017-04-04 12:32:25 UTC
We have a number of systems encountering this; here is the output I see:

for i in {53..64} ; do ssh root@gprfc0$i 'echo $(hostname -s); /etc/cron.hourly/docker-cleanup' ; done
gprfc053
gprfc054
gprfc055
Error response from daemon: Unable to remove filesystem for 8c9a2a7d6ae6f004da76e846411f33452d29fd6b5eb6e97edc79fa2ad56a5e4e: remove /var/lib/docker/containers/8c9a2a7d6ae6f004da76e846411f33452d29fd6b5eb6e97edc79fa2ad56a5e4e/shm: device or resource busy
gprfc056
gprfc057
gprfc058
Error response from daemon: Unable to remove filesystem for 0d567fd2201c3a3f3364cb17532e965d86d70738f6719c7c0835f67dbbe3ea39: remove /var/lib/docker/containers/0d567fd2201c3a3f3364cb17532e965d86d70738f6719c7c0835f67dbbe3ea39/shm: device or resource busy
gprfc059
gprfc060
gprfc061
Error response from daemon: Unable to remove filesystem for bfb670b791814013f2cb5aacd0f44f63055875f6456cf916f116d520f1a53e1f: remove /var/lib/docker/containers/bfb670b791814013f2cb5aacd0f44f63055875f6456cf916f116d520f1a53e1f/shm: device or resource busy
Error response from daemon: Unable to remove filesystem for 1d348b2449ad4e07f928c18cf7ba422dd6954afcc8bf3b7a0108b6152642e618: remove /var/lib/docker/containers/1d348b2449ad4e07f928c18cf7ba422dd6954afcc8bf3b7a0108b6152642e618/shm: device or resource busy
gprfc062
Error response from daemon: Unable to remove filesystem for 994c5faf0f869c108d28e3b7c7fa8c1e05b0e314a278981d6bec37a64d048e08: remove /var/lib/docker/containers/994c5faf0f869c108d28e3b7c7fa8c1e05b0e314a278981d6bec37a64d048e08/shm: device or resource busy
gprfc063
gprfc064

[root@gprfc055 ~]# ./find-busy-mnt.sh 8c9a2a7d6ae6f004da76e846411f33452d29fd6b5eb6e97edc79fa2ad56a5e4e
PID	NAME		MNTNS
60931	fluentd		mnt:[4026532996]
61017	fluentd		mnt:[4026532996]

[root@gprfc058 ~]# ./find-busy-mnt.sh 0d567fd2201c3a3f3364cb17532e965d86d70738f6719c7c0835f67dbbe3ea39
PID	NAME		MNTNS
35790	fluentd		mnt:[4026533132]
35939	fluentd		mnt:[4026533132]

[root@gprfc061 ~]# ./find-busy-mnt.sh bfb670b791814013f2cb5aacd0f44f63055875f6456cf916f116d520f1a53e1f
PID	NAME		MNTNS
119836	fluentd		mnt:[4026533026]
119965	fluentd		mnt:[4026533026]
[root@gprfc061 ~]# ./find-busy-mnt.sh 1d348b2449ad4e07f928c18cf7ba422dd6954afcc8bf3b7a0108b6152642e618
PID	NAME		MNTNS
119836	fluentd		mnt:[4026533026]
119965	fluentd		mnt:[4026533026]

[root@gprfc062 ~]# ./find-busy-mnt.sh 994c5faf0f869c108d28e3b7c7fa8c1e05b0e314a278981d6bec37a64d048e08
PID	NAME		MNTNS
114717	fluentd		mnt:[4026533026]
114833	fluentd		mnt:[4026533026]

Comment 5 Vivek Goyal 2017-04-04 13:43:39 UTC
ok, so fluentd is the process which is keeping your mount points busy and that's why you can't remove those containers.

Now the question is who launches fluentd and how did another container's mount point manage to leak into fluentd's mount namespace.

First, can you give me the output of the following:

- ls -l /proc/1/ns/mnt
- ls -l /proc/$(ps -C dockerd-current -o pid=)/ns/mnt

Also give me output of "docker info" on failing node.
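
For illustration, this is the kind of output those two commands produce (the inode numbers below are made up); if the two namespace links differ, dockerd is running in its own mount namespace:

# Hypothetical output -- the mnt:[...] inode numbers are examples only.
$ ls -l /proc/1/ns/mnt
lrwxrwxrwx. 1 root root 0 Apr  4 09:00 /proc/1/ns/mnt -> mnt:[4026531840]
$ ls -l /proc/$(ps -C dockerd-current -o pid=)/ns/mnt
lrwxrwxrwx. 1 root root 0 Apr  4 09:00 /proc/12345/ns/mnt -> mnt:[4026532436]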

Comment 8 Vivek Goyal 2017-04-04 14:24:28 UTC
Ok, so the problem is the recursive bind mounting of /var/lib/docker/containers/ inside the fluentd containers.

That means that, at the time of the mount, any mount point inside /var/lib/docker/containers becomes visible inside the fluentd container. That keeps the mount point busy, which keeps the directory busy, which means the container can't be removed.

For example, one of the mount points inside fluentd looks as follows.

shm /var/lib/docker/containers/7f6ff7bf8670f80291153b7dfe9e200caee3db524be5244b64601cda97c9ed5e/shm tmpfs rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c287,c1002",nosuid,nodev,noexec,relatime,size=65536k 0 0

This is an shm-type mount on the host, and it is now visible inside the fluentd container, which means that deletion of container 7f6ff7bf8670f80 will fail. It will complain that it can't remove the directory /var/lib/docker/containers/<container-id>/shm because it is busy.
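
To illustrate the propagation, here is a small sketch (run as root; the paths are scratch directories for illustration only) showing that a recursive bind mount pulls in submounts like the per-container shm tmpfs, while a plain bind mount does not:

mkdir -p /tmp/containers/demo/shm /tmp/view
mount -t tmpfs -o size=64m shm /tmp/containers/demo/shm   # stands in for a container's shm mount

mount --rbind /tmp/containers /tmp/view   # recursive: the shm tmpfs shows up under /tmp/view too
grep /tmp/view /proc/self/mountinfo       # two entries: the bind itself plus the propagated shm
umount -R /tmp/view

mount --bind /tmp/containers /tmp/view    # non-recursive: only the top-level bind, no shm underneath
grep /tmp/view /proc/self/mountinfo
umount /tmp/view
umount /tmp/containers/demo/shm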

Comment 10 Vivek Goyal 2017-04-04 14:27:02 UTC
My first question to the fluentd container writers is: why do you need access to the /var/lib/docker/containers/ directory and all the mount points underneath it?

We probably should make /var/lib/docker/containers/ a private mount point to make sure any new mounts are not propagated into other containers, like we do for graph drivers.

If all these container mount points leak into other containers, we have a huge issue with mount points being kept busy.

Comment 11 Anton Sherkhonov 2017-04-04 14:32:33 UTC
CC Jeff Cantrill, Rich Megginson.

This folder is mounted to get the log files in case docker uses the json log driver.

We don't need /var/lib/docker/containers/ mounted if we use the journald log driver.

Is getting logs from log files (not journald) still supported?

Comment 12 Vivek Goyal 2017-04-04 14:41:42 UTC
Will you not use "docker logs" command to access container logs?

CCing Dan and Nalin. They have more idea about logging.

Comment 13 Rich Megginson 2017-04-04 14:44:39 UTC
Getting logs from log files is still supported.  The problem is that you cannot "conditionally" mount /var/lib/containers - it has to be baked into the daemonset configuration.  This is what allows us to switch between files/journald at run-time when we detect which log driver docker is using on the node.

If we were to do something like - at install time, use the daemonset-files.yaml or daemonset-journald.yaml - then we would lose the ability to detect which log driver docker is using, and the ability to switch at run-time.

Is there a way to do a bind mount on a directory, but restrict how deep the nesting is?  The only reason we need /var/lib/containers is for the log files - we don't need to go all the way into /var/lib/docker/containers/7f6ff7bf8670f80291153b7dfe9e200caee3db524be5244b64601cda97c9ed5e/shm etc.

Comment 14 Rich Megginson 2017-04-04 14:47:36 UTC
(In reply to Vivek Goyal from comment #12)
> Will you not use "docker logs" command to access container logs?

Do you mean - can fluentd use the docker API to get the logs, rather than scraping the log files from /var/log/containers/*.log?  That would require quite a bit of work on the fluentd side, because the k8s metadata plugin is designed to work with either the files under /var/log/containers/*.log, or reading from journal.

> CCing Dan and Nalin. They have more idea about logging.

Comment 15 Vivek Goyal 2017-04-04 14:50:36 UTC
(In reply to Rich Megginson from comment #13)
> 
> Is there a way to do a bind mount on a directory, but restrict how deep the
> nesting is?  The only reason we need /var/lib/containers is for the log
> files - we don't need to go all the way into
> /var/lib/docker/containers/
> 7f6ff7bf8670f80291153b7dfe9e200caee3db524be5244b64601cda97c9ed5e/shm etc.

One has to do a non-recursive bind mount to make sure mount points underneath don't show up. And I think docker volume mounts are recursive by default, and I am not aware of any option to do non-recursive bind mounts.

Having said that, why don't you go through the docker API to access logs? If you directly try to poke at docker metadata, bad things are bound to happen.

Mrunal, are you aware of any docker option to allow non-recursive volume mounts?

Comment 16 Vivek Goyal 2017-04-04 14:51:43 UTC
(In reply to Rich Megginson from comment #14)
> (In reply to Vivek Goyal from comment #12)
> > Will you not use "docker logs" command to access container logs?
> 
> Do you mean - can fluentd use the docker API to get the logs, rather than
> scraping the log files from /var/log/containers/*.log?  That would require
> quite a bit of work on the fluentd side, because the k8s metadata plugin is
> designed to work with either the files under /var/log/containers/*.log, or
> reading from journal.

I think so. I see a "docker logs <container-id>" option exists to fetch container logs. And that sounds like the right thing to do instead of poking at docker metadata directly.

Comment 17 Rich Megginson 2017-04-04 14:56:52 UTC
(In reply to Vivek Goyal from comment #16)
> (In reply to Rich Megginson from comment #14)
> > (In reply to Vivek Goyal from comment #12)
> > > Will you not use "docker logs" command to access container logs?
> > 
> > Do you mean - can fluentd use the docker API to get the logs, rather than
> > scraping the log files from /var/log/containers/*.log?  That would require
> > quite a bit of work on the fluentd side, because the k8s metadata plugin is
> > designed to work with either the files under /var/log/containers/*.log, or
> > reading from journal.
> 
> I think so. I see "docker logs <container-id> option existing to fetch
> container logs.

So fluentd can use the `docker logs ...` command from inside the fluentd container?  Or I suppose it could use the HTTPS REST API.

> And that sounds like right thing to do instead of poking at
> docker metadata directly.

The problem is that a solution like this won't be available until 3.7 at the earliest, due to the amount of work that will need to be done on the fluentd side, both in origin-aggregated-logging and in fabric8 k8s meta plugin.

In the meantime, this is what I suggest:

* change docker to use --log-driver=journald if not already
* edit the logging-fluentd daemonset to remove the offending volume/volumeMount
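
As a rough sketch of that workaround (the daemonset volume name below is an assumption -- check the actual daemonset first with `oc get ds logging-fluentd -o yaml -n logging`):

# 1. On each node, switch docker to the journald log driver by adding
#    --log-driver=journald to OPTIONS in /etc/sysconfig/docker, then:
systemctl restart docker

# 2. Remove the /var/lib/docker/containers volume/volumeMount from the
#    fluentd daemonset ("varlibdockercontainers" is an assumed name):
oc set volume daemonset/logging-fluentd -n logging \
    --remove --name=varlibdockercontainers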

Comment 18 Mrunal Patel 2017-04-04 20:04:18 UTC
You could bind mount the docker socket from the host and then use the docker client API to access the logs.

Comment 19 Jeff Cantrill 2017-04-04 20:14:39 UTC
This will require some additional integration with docker, either via shell and attaching to streams, or via some Ruby C extension.  The former seems like a bad idea and sets us up for some serious performance issues.  The latter is doable, but I wonder how much effort we want to spend solving this if upstream will be providing an API to address issues such as these.

Comment 20 Rich Megginson 2017-04-04 20:32:13 UTC
(In reply to Jeff Cantrill from comment #19)
> This will require some additional integration to docker either via shell and
> attaching to streams or some ruby C extension to accomplish.  The former
> seems like a bad idea and us setting us up for some serious performance
> issues.  The later is doable but I wonder to what effort we desire to solve
> this if upstream will be providing an api to address issues such as these.

i.e. it will be a few months before we have a solution, one way (implementing support for the docker API in fluentd, bind mounting the docker socket) or another (the K8s logging SIG implements better support for log files/volumes).

Comment 21 Peter Portante 2017-04-05 03:05:03 UTC
Seems like we need to make a fix for 3.3, 3.4, and 3.5, at the least.

Can we agree to get a short-term fix out, and then work to address this for the long term?

Comment 22 Rich Megginson 2017-04-05 03:47:24 UTC
(In reply to Peter Portante from comment #21)
> Seems like we need to make a fix for 3.3, 3.4, and 3.5, at the least.
> 
> Can we agree to get a short-term fix out, and then work to address this for
> the long term?

What is the short term fix?

Comment 25 Jeff Cantrill 2017-04-05 20:16:40 UTC
Trying to reproduce locally to provide a fix.  I noticed that, without fluentd running, deleting some random pod gives me:

Apr  5 20:11:18 vagrant journal: E0405 20:11:18.106768   27494 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/fedcf0e0-1a3a-11e7-8e4f-525400307efd-default-token-r3dg3\" (\"fedcf0e0-1a3a-11e7-8e4f-525400307efd\")" failed. No retries permitted until 2017-04-05 20:11:18.606731854 +0000 UTC (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/fedcf0e0-1a3a-11e7-8e4f-525400307efd-default-token-r3dg3" (volume.spec.Name: "default-token-r3dg3") pod "fedcf0e0-1a3a-11e7-8e4f-525400307efd" (UID: "fedcf0e0-1a3a-11e7-8e4f-525400307efd") with: remove /var/lib/origin/openshift.local.volumes/pods/fedcf0e0-1a3a-11e7-8e4f-525400307efd/volumes/kubernetes.io~secret/default-token-r3dg3: device or resource busy

Mount script outputs:

PID	NAME		MNTNS
22419	openshift		mnt:[4026532206]
22419	openshift		mnt:[4026532206]
22546	journalctl		mnt:[4026532206]
22546	journalctl		mnt:[4026532206]

What does this tell me?  It seems we could/do have issues in general that are not related to anything fluentd is doing?  Is that true?  Does this output matter to this specific issue?  Does the absence of 'fluentd' in the output mean we 'fixed' the issue (at least in the short term)?

Comment 26 Vivek Goyal 2017-04-05 21:30:57 UTC
This issue is not related to this particular error, which happens when a container is removed.

But tearing down a volume is also a device-busy problem, which falls in a similar class: somebody is trying to remove a volume while there are still users of that volume. So this will have to be a separate issue and a separate fix.

Comment 29 Peter Portante 2017-04-06 20:13:50 UTC
@Vivek, isn't title of this wrong?

"fluentd: fluentd container is keeping shm mount point of other containers busy"

Shouldn't that be

"fluentd: other containers are keeping the shm mount point of the fluentd container busy"?

The other containers are not having the unmount problems, it is when a fluentd pod restarts that we see this.

Comment 30 Vivek Goyal 2017-04-06 21:10:28 UTC
(In reply to Peter Portante from comment #29)
> @Vivek, isn't title of this wrong?
> 
> "fluentd: fluentd container is keeping shm mount point of other containers
> busy"
> 
> Shouldn't that be
> 
> "fluentd: other containers are keeping the shm mount point of the fluentd
> container busy"?
> 
> The other containers are not having the unmount problems, it is when a
> fluentd pod restarts that we see this.

Peter,

I see that fluentd is mounting /var/lib/docker/containers/. So it is definitely keeping mount points of other containers busy, and when these containers are deleted, one might face the following error.

Error response from daemon: Unable to remove filesystem for 0d567fd2201c3a3f3364cb17532e965d86d70738f6719c7c0835f67dbbe3ea39: remove /var/lib/docker/containers/0d567fd2201c3a3f3364cb17532e965d86d70738f6719c7c0835f67dbbe3ea39/shm: device or resource busy

Now it is possible that some other container is also keeping some mount point of fluentd busy and we run into issues.

But given the description so far, a container foo failed to be removed, and when we passed foo's id to the find-busy-mnt.sh script, it found that the "fluentd" process is holding that mount point busy. That means fluentd is keeping the mount point of another container busy.

Your testing might be running into other issues when trying to restart the fluentd pod.

Comment 31 Peter Portante 2017-04-11 20:12:58 UTC
@Vivek, thanks.  We changed our fluentd pods to remove the /var/lib/docker/containers references, and restarted our fluentd pods.  This allowed those mounts to be cleaned up.

I filed an issue upstream to fix this in origin-aggregated-logging, https://github.com/openshift/origin-aggregated-logging/issues/376

Comment 32 Eric Paris 2017-05-09 15:48:48 UTC
I don't see it noted here, but I believe openshift online has had success mounting `-v /:/host:rslave`

Couldn't we do the same here, mounting `-v /var/lib/docker/containers:/wherever:rslave`?

Comment 33 Vivek Goyal 2017-05-09 15:51:37 UTC
Fundamentally, a container looking at docker's metadata is wrong. If a container needs something from docker, it needs to go through the docker API. So mounting /var/lib/docker/containers is not a good idea. That data belongs to docker, and no other tool is supposed to poke at it directly. Otherwise, expect the unexpected.

Comment 34 Rich Megginson 2017-05-09 16:08:09 UTC
@vivek - that's great - so that means there is another way the fluentd container can get at the host's docker log files (written with --log-driver=json-file) without having access to /var/lib/docker/container/*/*.json?

Comment 35 Eric Paris 2017-05-09 17:01:15 UTC
I fundamentally disagree with the premise in #33. Obviously we keep realizing it's a bad idea to mount /var/lib/container into a container TODAY, but we need to find a way to make it possible in the future. It's just way too common a usage pattern for us to not find a solution. That bigger problem, however, is not this BZ.


Specific to fluentd log collection, would it make sense to re-architect to make the fluentd container a 'system container' which starts before docker? And then use the docker fluentd log driver instead of json-file?


Operationally that would be harder to deploy and manage, so I don't know if that makes it much better. But it would work around mount propagation problems...


So I'd much rather we try doing `:rslave` on the mounts into the fluentd container to see if it solves the problem of holding mounts...

Comment 36 Rich Megginson 2017-05-09 17:15:13 UTC
(In reply to Eric Paris from comment #35)
> I fundamentally disagree with the premise in #33. Obviously we keep
> realizing it's a bad idea to mount /var/lib/container into a container
> TODAY, but we need to find a way to make it possible in the future. It's
> just way to common of a usage pattern for us to not find a solution. That
> bigger problem, however, is not this BZ.
> 
> 
> Specific to fluentd log collection, would it make sense to re-architect to
> make the fluentd container a 'system container' which starts before docker?
> And then use the docker fluentd log driver instead of json-file?

But doesn't that break `docker logs` and `oc logs` and reading pod logs in the OpenShift UI?

> 
> 
> Operationally that would be harder to deploy and manage, so I don't know if
> that makes it much better. But it would work around mount propagation
> problems...
> 
> 
> So I'd much rather we try doing `:rslave` on the mounts into the fluentd
> container to see if it solves the problem of holding mounts...

Let me know what I need to do for the fluentd daemonset volume configs.

Comment 37 Wesley Hearn 2017-05-09 17:17:40 UTC
(In reply to Rich Megginson from comment #36)
> (In reply to Eric Paris from comment #35)
> > I fundamentally disagree with the premise in #33. Obviously we keep
> > realizing it's a bad idea to mount /var/lib/container into a container
> > TODAY, but we need to find a way to make it possible in the future. It's
> > just way to common of a usage pattern for us to not find a solution. That
> > bigger problem, however, is not this BZ.
> > 
> > 
> > Specific to fluentd log collection, would it make sense to re-architect to
> > make the fluentd container a 'system container' which starts before docker?
> > And then use the docker fluentd log driver instead of json-file?
> 
> But doesn't that break `docker logs` and `oc logs` and reading pod logs in
> the OpenShift UI?
Yes it does; docker logs and probably oc logs depend on either journald or json-file logging for docker.

https://docs.docker.com/engine/reference/commandline/logs/#extended-description

> 
> > 
> > 
> > Operationally that would be harder to deploy and manage, so I don't know if
> > that makes it much better. But it would work around mount propagation
> > problems...
> > 
> > 
> > So I'd much rather we try doing `:rslave` on the mounts into the fluentd
> > container to see if it solves the problem of holding mounts...
> 
> Let me know what I need to do for the fluentd daemonset volume configs.

Comment 38 Vivek Goyal 2017-05-09 17:29:19 UTC
(In reply to Eric Paris from comment #35)
> I fundamentally disagree with the premise in #33. Obviously we keep
> realizing it's a bad idea to mount /var/lib/container into a container
> TODAY, but we need to find a way to make it possible in the future. It's
> just way to common of a usage pattern for us to not find a solution. That
> bigger problem, however, is not this BZ.

Hmm, this is somewhat equivalent to saying: I want the metadata associated with a file, but let's not do "stat" on the file; instead, let's scrub through the raw disk data, figure out where the filesystem has stored it, and get it from there.

I would say the better way is to do "stat". For docker, use "docker logs". If you directly poke into the metadata of a different piece of software, that's a bad idea and very difficult to support, because in general software assumes that nothing else is touching its metadata.

If another application requires access to metadata, then an API needs to be created to access that metadata (if it is not already there).

Comment 39 Vivek Goyal 2017-05-09 17:40:52 UTC
Even if :rslave solves the mount-busy issue, there is another race with container deletion. Say fluentd is looking at the logs of a container and kubernetes/openshift/foo calls into docker to delete that container. Now that container deletion will fail because fluentd is keeping the directory busy.

If fluentd uses the proper API (docker logs), docker can keep track that the container is busy right now and can fail much earlier with a proper error message.

Otherwise a bug will be filed saying that container deletion failed because a directory was busy, and the developer will have no idea what software might be keeping the directory busy.

So there is value in using the proper API and not poking at metadata directly.

Comment 40 Rich Megginson 2017-05-09 18:44:26 UTC
@vivek - great - just let me know how I can use `docker logs` or the log api in the fluentd container.

Comment 41 Vivek Goyal 2017-05-09 18:56:50 UTC
I think you will have to do a few things.

- Bind mount docker socket into container.

  -v /var/run/docker.sock:/var/run/docker.sock

- Package docker client in fluentd container.

- And then run docker logs

  docker logs <container-id>

This should be the basics needed to access the docker API through the docker command line.
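
A rough end-to-end sketch of those steps (the image and names are examples only, and the image is assumed to already have a docker client installed):

# Run the collector-like container with the host docker socket mounted in.
docker run -d --name log-reader \
    -v /var/run/docker.sock:/var/run/docker.sock \
    rhel7 sleep infinity

# From inside that container, the host daemon answers over the socket.
docker exec log-reader docker ps -q
docker exec log-reader docker logs --tail=10 <some-container-id>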

Comment 42 Rich Megginson 2017-05-09 19:13:02 UTC
(In reply to Vivek Goyal from comment #41)
> I think you will have to do few things.
> 
> - Bind mount docker socket into container.
> 
>   -v /var/run/docker.sock:/var/run/docker.sock
> 
> - Package docker client in fluentd container.
> 
> - And then run docker logs
> 
>   docker logs <container-id>
> 
> This should be the basic to access docker API through docker command line.

Can I set up "watches" on docker resources?  For example, I need to do the equivalent of "docker logs -f <container-id>" on _all_ containers.  I need to immediately disconnect "docker logs -f <container-id>" when the container goes away.  I need to immediately connect "docker logs -f <container-id>" when there is a new container id.

I also need a way to get the metadata associated with the container.  For example, using the json-file log driver, Kubernetes writes symlinks under /var/log/containers to the actual log files under /var/lib/docker/containers/*/*.json in the following format:
https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/lib/fluent/plugin/filter_kubernetes_metadata.rb#L35

'var\.log\.containers\.(?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$'

If I use the docker logs command or api, how can I get the pod_name, namespace, container_name, and docker_id?
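
For example, a hypothetical log file name decomposes like this (sketch only, using a simplified version of that regex):

f='var.log.containers.mypod-1-x2x8z_myproject_mycontainer-0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef.log'
re='^var\.log\.containers\.([a-z0-9.-]+)_([^_]+)_(.+)-([a-z0-9]{64})\.log$'
if [[ "$f" =~ $re ]]; then
    echo "pod_name=${BASH_REMATCH[1]}"        # mypod-1-x2x8z
    echo "namespace=${BASH_REMATCH[2]}"       # myproject
    echo "container_name=${BASH_REMATCH[3]}"  # mycontainer
    echo "docker_id=${BASH_REMATCH[4]}"       # the 64-character container id
fi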

Comment 43 Vivek Goyal 2017-05-09 20:08:06 UTC
(In reply to Rich Megginson from comment #42)
> 
> Can I set up "watches" on docker resources?  For example, I need to do the
> equivalent of "docker logs -f <container-id>" on _all_ containers.  I need
> to immediately disconnect "docker logs -f <container-id>" when the container
> goes away.  I need to immediately connect "docker logs -f <container-id>"
> when there is a new container id.

Can you make use of docker events to figure out which containers have been
created and deleted? (Dan just mentioned docker events to me.)

> 
> I also need a way to get the metadata associated with the container.

Checkout "docker inspect <container-id>"

> For
> example, using the json-file log driver, Kubernetes writes symlinks under
> /var/log/containers to the actual log files under
> /var/lib/docker/containers/*/*.json in the following format:
> https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/
> master/lib/fluent/plugin/filter_kubernetes_metadata.rb#L35
> 
> 'var\.log\.containers\.(?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-
> 9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container_name>.+)-
> (?<docker_id>[a-z0-9]{64})\.log$'
> 
> If I use the docker logs command or api, how can I get the pod_name,
> namespace, container_name, and docker_id?

docker does not control pod name. So I think you will continue to get the pod name the way you do now.

what namespace?

Once you have the container_id, you can map it to the container name as well. In fact, I think that might be part of the docker events API, where you can get both the container id and the container name.

what is docker_id?
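
One possible way to get that metadata from docker itself (sketch only; the io.kubernetes.* label keys are assumptions -- verify against a real kubelet-created container first):

docker inspect --format \
  'pod={{index .Config.Labels "io.kubernetes.pod.name"}} ns={{index .Config.Labels "io.kubernetes.pod.namespace"}} name={{index .Config.Labels "io.kubernetes.container.name"}} id={{.Id}}' \
  <container-id>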

Comment 44 Rich Megginson 2017-05-09 20:13:27 UTC
whearn pointed out that I can use `docker info` to get the Kubernetes pod_name, etc.

I think using /var/run/docker.sock is a non-starter:

>docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-6.gitae7d637.fc25.x86_64
 Go version:      go1.7.4
 Git commit:      ae7d637/1.12.6
 Built:           Mon Jan 30 16:15:28 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-6.gitae7d637.fc25.x86_64
 Go version:      go1.7.4
 Git commit:      ae7d637/1.12.6
 Built:           Mon Jan 30 16:15:28 2017
 OS/Arch:         linux/amd64

>docker run --privileged -it -v /var/run/docker.sock:/var/run/docker.sock:ro centos:7 bash
[root@30afc37b9ff6 /]# curl --unix-socket /var/run/docker.sock -H "Content-Type: application/json"     -d '{"Image": "centos", "Cmd": ["echo", "hello world"]}'     -X POST http:/v1.24/containers/create
{"Id":"6f1771f1df0e216666c76eeac53f7a116827a94328a3bb14028ea1546b8ff937","Warnings":null}

That is, the socket is fully writable, meaning fluentd can do whatever it wants with docker, _even though I mounted it :ro_.  Even if docker did honor the "ro", I think that would mean the container would not be able to send commands to docker via the socket.

If we do go this route, I need to be able to restrict what fluentd is able to do - it should have read-only access to docker - it should be able to get info about a container, know when containers are added/deleted, and should be able to read the logs for a container.

Comment 45 Vivek Goyal 2017-05-09 20:17:57 UTC
(In reply to Vivek Goyal from comment #39)
> Even if :rslave solves mount busy issue. There is another race with
> container deletion. Say fluentd is looking at logs of a container and
> kubernetes/openshift/foo calls into docker to delete that container. Now
> that container deletion will fail because fluentd is keeping directory busy.

Here is an example of the races which can happen during container deletion. Say
a directory foo has a file foo.txt, which is a log file being read by fluentd.

- docker creates container foo

  foo/foo.txt

- fluentd notices it and starts reading foo/foo.txt (vi foo/foo.txt)
 
- docker decides to delete container

  unlink foo/foo.txt
  rmdir foo

Now rmdir will fail as long as foo/foo.txt is open, with the following message:

rmdir: failed to remove 'foo': Directory not empty

using "rslave" will not take care of such races.

And it will result in a docker bug that container removal fails once in a blue moon.

No it is not a docker bug. It is an issue with the way containers decided to make use of docker and its metadata.

This race exists in theory. That's a different thing we might not be hitting it in practice for some reason.

Comment 46 Paul Weil 2017-05-12 14:54:44 UTC
*** Bug 1449277 has been marked as a duplicate of this bug. ***

Comment 50 Eric Paris 2017-05-12 18:21:28 UTC
From bz: 1449277

The running fluentd pod is the cause of the other 4 pods being stuck terminating.

Fluentd is mounting /var/lib/docker/containers into itself to get access to the json-log files. When the fluentd container is created, all of the mount points under /var/lib/docker/containers/* (aka all of the shm mount points) are also duplicated inside the fluentd container.

When docker tries to tear down some other container and remove /var/lib/docker/containers/*/* on the host, it will fail because the shm mount is still in use inside the fluentd container.

A potential workaround/solution to this problem might be for the fluentd pod to unmount everything it has inside /var/lib/docker on startup.

Basically, on startup:
umount /var/lib/docker/containers/*/shm  (or */*)

The json-log files will still be in there, but you should not have any mount points under there.
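
A minimal sketch of that startup cleanup (roughly the workaround described above; run early in the fluentd container's entrypoint):

# Unmount any per-container shm mounts that leaked into this mount namespace.
for m in /var/lib/docker/containers/*/shm; do
    if mountpoint -q "$m"; then
        umount "$m" || echo "could not umount $m" >&2
    fi
done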

Comment 51 Daniel Walsh 2017-05-13 10:04:15 UTC
Random thought: if we changed the location of the shm tmpfs from /var/lib/docker to /run, would this help at all, or just cause other issues?  A potential problem would be people mounting /run into containers.

Could it work if we mounted the tmpfs for /dev/shm on the host's /dev/shm?

Comment 52 Peter Portante 2017-05-14 04:13:11 UTC
Regardless of what we do, the fluentd pod will be an instrument for an "action at a distance" problem, right?  It seems that some other configuration or piece of software could innocently add a mount which fluentd would pick up and hold.

And can't this happen after a fluentd pod starts?  Meaning if fluentd starts first, and then containers are created afterwards, won't fluentd pick up those mount points as the containers come into existence?  Or maybe I am not quite understanding the problem?

Comment 53 Eric Paris 2017-05-14 12:32:58 UTC
"Other" mounts getting picked up are possible. And we may have to address those in the future.

No, new containers created won't leak in. It is only a startup problem. fluentd has a private mount namespace.

The 'real' problem is, as Vivek has described, the fact that fluentd is doing what it is doing at all. Its playing around inside docker's internal datastructures (aka the json-log files) instead of using the public apis (aka docker log).

For this moment we know that any mount point under /var/lib/docker/containers/*/* doesn't need to be inside the fluentd container and it should clean those up.

Can we also be sure to get the list together of exactly the shortcommings in journal and try to get those fixed? Peter, maybe you already gave that succinct list of journal asks to Ian and i just missed it?

Comment 54 Rich Megginson 2017-05-15 01:07:43 UTC
(In reply to Eric Paris from comment #53)
> "Other" mounts getting picked up are possible. And we may have to address
> those in the future.
> 
> No, new containers created won't leak in. It is only a startup problem.
> fluentd has a private mount namespace.
> 
> The 'real' problem is, as Vivek has described, the fact that fluentd is
> doing what it is doing at all. Its playing around inside docker's internal
> datastructures (aka the json-log files) instead of using the public apis
> (aka docker log).

The problem with that is what I described in https://bugzilla.redhat.com/show_bug.cgi?id=1437952#c44 - that is, there is no way for fluentd to have read-only access to docker.

> 
> For this moment we know that any mount point under
> /var/lib/docker/containers/*/* doesn't need to be inside the fluentd
> container and it should clean those up.
> 
> Can we also be sure to get the list together of exactly the shortcommings in
> journal and try to get those fixed? Peter, maybe you already gave that
> succinct list of journal asks to Ian and i just missed it?

Comment 55 Eric Paris 2017-05-15 02:20:06 UTC
(In reply to Rich Megginson from comment #54)
> (In reply to Eric Paris from comment #53)

> > The 'real' problem is, as Vivek has described, the fact that fluentd is
> > doing what it is doing at all. Its playing around inside docker's internal
> > datastructures (aka the json-log files) instead of using the public apis
> > (aka docker log).
> 
> The problem with that is what I described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1437952#c44 - that is, there is
> no way for fluentd to have read-only access to docker.

I certainly agree, there is no (good) answer here.


Is there some path out of the logging madness using the fluentd logging driver? Or some new log driver that merges all the best parts of json-file, fluentd, and journal?

Comment 57 Peter Portante 2017-05-16 15:18:22 UTC
(In reply to Eric Paris from comment #55)
> Is there some path out of he logging madness using the fluentd logging
> driver? Or some new log driver that merges all the best parts of json-file,
> fluentd, and journal?

The one path that I can see right now is via journald.  I have created a tracking document which covers the three main things we need from journald: https://docs.google.com/a/redhat.com/document/d/1CrvXa4d1KNiOqf2lTsyBQXWwP96UiFEfVvtni3URzBU/edit?usp=sharing (repeated below):

1. Elimination of “global rate-limiting effect”
   * Docker journald driver shows up as one service,
     so all logs are “global” coming from that service
2. Graceful rate-limit based on priority
   * Drop log events for a configurable priority first
     at a lower threshold, then hard-drop once the upper
     threshold is met
3. Back-pressure vs Drop logs for stdout/stderr
   * Having journald stop reading from a log source
     for stdout/stderr to cause it to block instead of
     dropping the data

It seems we have to support json-file, but there are still too many exposed points for that to be our go-to solution.

Comment 59 openshift-github-bot 2017-05-18 16:38:12 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/0ca7c1e5dad6751d7a286e760b386fb9b852408d
bug 1437952. unmount /var/lib/docker/container/*/shm to allow pods to terminate

Comment 61 Junqi Zhao 2017-06-02 09:16:56 UTC
@Jeff,

Since we have not encountered this defect when testing logging 3.4.1 and 3.5.0, I verified this defect according to the following steps:
1. Deployed logging with the json-file logging driver and the journald logging driver.
2. Checked that the whole logging stack worked well.
3. Deleted the logging stacks.

All the pods, including the fluentd pods, were deleted (the fluentd pods stayed in terminating status for a while, but were deleted later), and there was no "device or resource busy" info.

# openshift version
openshift v3.4.1.27
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Image id from brew registry
# docker images | grep logging
openshift3/logging-kibana             3.4.1               984049164dd4        16 hours ago        338.8 MB
openshift3/logging-elasticsearch      3.4.1               5d7c27a3f943        16 hours ago        399.6 MB
openshift3/logging-fluentd            3.4.1               3d9ed3e93ef0        16 hours ago        232.7 MB
openshift3/logging-curator            3.4.1               bb24347aea98        16 hours ago        244.5 MB
openshift3/logging-auth-proxy         3.4.1               88906188516a        16 hours ago        215.3 MB
openshift3/logging-deployer           3.4.1               67911234dec5        16 hours ago        887.9 MB

xiazhao reported a similar bug on logging 3.6.0, https://bugzilla.redhat.com/show_bug.cgi?id=1449097; maybe it is related to this issue.

Can I close this defect, or do I need to do more testing?

Comment 62 Junqi Zhao 2017-06-07 00:48:40 UTC
Closing it since we did not find this error during testing; if anyone thinks the testing is not correct, please assign it back.

Comment 63 Eric Paris 2017-06-07 01:42:37 UTC
While hopefully the workaround works, I would like to get a slightly different test run.

1. Start a pod on a node. I'm not sure if you need 2 containers in that pod, but that would make sure that /dev/shm was created and shared.
2. After the pod is running, deploy logging with json-file on that node.
3. Stop the pod on the node.

Look for errors about the pod failing to remove volumes because they are busy, either "shm" or the root devicemapper mount point.

Comment 64 Junqi Zhao 2017-06-07 04:14:10 UTC
@Eric,

Tested according to Comment 63.
There were no errors about the pod failing to remove volumes.

But there were no files under /dev/shm after I created one pod with 2 containers and then removed this pod and all the logging pods.
# ls -al /dev/shm
total 0
drwxrwxrwt.  2 root root   40 Jun  6 21:39 .
drwxr-xr-x. 19 root root 3580 Jun  7 00:06 ..

I don't think it's expected.

WDYT?

Comment 65 Eric Paris 2017-06-07 13:03:51 UTC
I think it is normal and expected for /dev/shm to be empty. Thank you! Let's call it verified.

Comment 66 Junqi Zhao 2017-06-08 00:07:10 UTC
Set it to VERIFIED based on Comment 64 and Comment 65

Comment 68 errata-xmlrpc 2017-06-15 18:37:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1425