Bug 1434942 - Symbolic link error for log file of every pod started when docker log driver is journald
Summary: Symbolic link error for log file of every pod started when docker log driver ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.0
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
: 3.5.z
Assignee: Seth Jennings
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-35
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-03-22 16:40 UTC by Mike Fiedler
Modified: 2018-01-09 02:54 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Skips log symlink creation that fails when using journald logging driver.
Clone Of:
Environment:
Last Closed: 2017-12-07 07:10:26 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
node log file (222.36 KB, application/x-gzip)
2017-03-22 16:40 UTC, Mike Fiedler
no flags Details
pod definition (776 bytes, text/plain)
2017-03-22 16:41 UTC, Mike Fiedler
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3389 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Enterprise security, bug fix, and enhancement update 2017-12-07 12:09:10 UTC

Description Mike Fiedler 2017-03-22 16:40:40 UTC
Created attachment 1265437 [details]
node log file

Description of problem:

Every pod started is getting:

Mar 22 12:20:46 ip-172-31-43-89 atomic-openshift-node: E0322 12:20:46.122505   25624 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "cakephp-mysql-example-1-build_mff(73dcb124-0f1b-11e7-9c92-025d2f9e4ccf)" container "sti-build": symlink  /var/log/containers/

I see this on Kubernetes 1.5.4 as well.


Version-Release number of selected component (if applicable): 3.5.0.55


How reproducible: Always


Steps to Reproduce:
1. Set up an OpenShift (or Kubernetes) cluster
2. Create a new pod with the attached json
3. Look at /var/log/messages on the node where the pod starts

Actual results:

Mar 22 12:20:46 ip-172-31-43-89 atomic-openshift-node: E0322 12:20:46.122505   25624 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "cakephp-mysql-example-1-build_mff(73dcb124-0f1b-11e7-9c92-025d2f9e4ccf)" container "sti-build": symlink  /var/log/containers/


Expected results:

No error on normal pod start

Additional info:

Full log attached.

Comment 1 Mike Fiedler 2017-03-22 16:41:09 UTC
Created attachment 1265438 [details]
pod definition

Comment 2 Mike Fiedler 2017-03-22 17:39:46 UTC
/var/log/containers is empty.

I checked a 3.4 system and /var/log/containers has content.  Marking regression.

Comment 3 Seth Jennings 2017-03-23 15:12:51 UTC
Worth noting the node seems to be having other issues.

From the node log:

Mar 22 12:20:48 ip-172-31-43-89 kernel: ------------[ cut here ]------------
Mar 22 12:20:48 ip-172-31-43-89 kernel: WARNING: at net/core/skbuff.c:3824 skb_try_coalesce+0x419/0x450()
Mar 22 12:20:48 ip-172-31-43-89 kernel: Modules linked in: veth overlay(T) xt_statistic xt_nat xt_recent vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch xt_mark xt_comment xt_multiport ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables br_netfilter bridge stp llc isofs xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod cirrus intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev
Mar 22 12:20:48 ip-172-31-43-89 kernel: ttm drm_kms_helper aesni_intel lrw syscopyarea gf128mul sysfillrect glue_helper sysimgblt ablk_helper fb_sys_fops cryptd drm i2c_piix4 i2c_core pcspkr parport_pc parport xen_netfront nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix crct10dif_pclmul crct10dif_common xen_blkfront crc32c_intel libata ixgbevf serio_raw floppy fjes
Mar 22 12:20:48 ip-172-31-43-89 kernel: CPU: 3 PID: 23 Comm: ksoftirqd/3 Tainted: G               ------------ T 3.10.0-514.10.2.el7.x86_64 #1
Mar 22 12:20:48 ip-172-31-43-89 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Mar 22 12:20:48 ip-172-31-43-89 kernel: 0000000000000000 00000000599de0b2 ffff880174aeb988 ffffffff816863ef
Mar 22 12:20:48 ip-172-31-43-89 kernel: ffff880174aeb9c0 ffffffff81085940 ffff8807c1754c00 ffff8807c1755a00
Mar 22 12:20:48 ip-172-31-43-89 kernel: 0000000000000704 0000000000000020 ffff880174aeba34 ffff880174aeb9d0
Mar 22 12:20:48 ip-172-31-43-89 kernel: Call Trace:
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff816863ef>] dump_stack+0x19/0x1b
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81561ba9>] skb_try_coalesce+0x419/0x450
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffffa0095d07>] ? ipt_do_table+0x337/0x710 [ip_tables]
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815c46da>] tcp_try_coalesce+0x6a/0xd0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815c5925>] tcp_queue_rcv+0x55/0x130
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815caf17>] tcp_rcv_established+0x3a7/0x760
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815d586a>] tcp_v4_do_rcv+0x10a/0x340
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff812a8036>] ? security_sock_rcv_skb+0x16/0x20
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815d6fb9>] tcp_v4_rcv+0x799/0x9a0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffffa03dc306>] ? ipv4_confirm+0x86/0x100 [nf_conntrack_ipv4]
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0974>] ip_local_deliver_finish+0xb4/0x1f0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0c59>] ip_local_deliver+0x59/0xd0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b08c0>] ? ip_rcv_finish+0x350/0x350
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b05fa>] ip_rcv_finish+0x8a/0x350
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0f86>] ip_rcv+0x2b6/0x410
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156fe02>] ? netif_receive_skb_sk+0x42/0x70
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0570>] ? inet_del_offload+0x40/0x40
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156f9b2>] __netif_receive_skb_core+0x582/0x800
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156ffec>] ? napi_complete_done+0x2c/0xc0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156fc48>] __netif_receive_skb+0x18/0x60
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81570f5e>] process_backlog+0xae/0x170
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815704e0>] net_rx_action+0x170/0x380
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8108f2cf>] __do_softirq+0xef/0x280
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8108f498>] run_ksoftirqd+0x38/0x50
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b927f>] smpboot_thread_fn+0x12f/0x180
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b9150>] ? lg_double_unlock+0x90/0x90
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Mar 22 12:20:48 ip-172-31-43-89 kernel: ---[ end trace 7bef0d165c4c4827 ]---

Still looking...

Comment 4 Seth Jennings 2017-03-23 15:59:53 UTC
I was able to reproduce on origin with "oc cluster up" version v1.5.0-rc.0 AND v1.4.1 on docker-common-1.12.6-6.

If I change to json-file logging, it works for both versions.

It seems that docker is not creating the log files in /var/lib/docker/containers that openshift is trying to symlink against in /var/log/containers with the logging driver is journald.

Mike, can you provide the docker versions for your 3.4 and 3.5 clusters?

Comment 5 Mike Fiedler 2017-03-23 18:05:24 UTC
for 3.5 install, docker is 1.12.6-14.el7
for 3.4 install, docker is 1.12.5-9.el7

Comment 6 Seth Jennings 2017-03-23 18:31:55 UTC
Thanks Mike.  I am able to reproduce this error in origin 1.4.1.  I'm not convinced this is a regression.  Are you saying that you observe the symlinks being successfully created in 3.4 under /var/log/containers using docker 1.12 and the journald logging driver?

Comment 7 Mike Fiedler 2017-03-23 18:46:24 UTC
On both my 3.4 and 3.5 clusters, docker is started with --log-driver=json-file

3.4 - /var/log/containers has entries:

root@ip-172-31-13-107: ~ # openshift version
openshift v3.4.0.39
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
root@ip-172-31-13-107: ~ # docker version
Client:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-9.el7.x86_64
 Go version:      go1.7.4
 Git commit:      c22650b
 Built:           Thu Jan  5 14:00:55 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-9.el7.x86_64
 Go version:      go1.7.4
 Git commit:      c22650b
 Built:           Thu Jan  5 14:00:55 2017
 OS/Arch:         linux/amd64
root@ip-172-31-13-107: ~ # ps -ef | grep dockerd-current
root       6031      1 37 Mar21 ?        17:51:43 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=json-file --log-opt max-size=50m --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com --insecure-registry registry.ops.openshift.com


3.5 - /var/log/containers empty

root@ip-172-31-60-44: ~ # openshift version
openshift v3.5.0.55
kubernetes v1.5.2+43a9be4
etcd 3.1.0
root@ip-172-31-60-44: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-14.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Thu Mar 16 14:27:53 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-14.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Thu Mar 16 14:27:53 2017
 OS/Arch:         linux/amd64
root@ip-172-31-60-44: ~ # ps -ef | grep dockerd-current
root      39225  38698  0 14:45 pts/0    00:00:00 grep --color=auto dockerd-current
root      44844      1  0 Mar22 ?        00:07:05 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=journald --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com --insecure-registry registry.ops.openshift.com

Comment 8 Seth Jennings 2017-03-23 18:50:23 UTC
Ah, so 3.4 is using --log-driver=json-file and 3.5 is using --log-driver=journald.  That explains why it doesn't appear in the 3.4 cluster. The error doesn't result from a code regression but a change in configuration options from 3.4 to 3.5.

Comment 9 Mike Fiedler 2017-03-23 18:57:32 UTC
Right you are - could have sworn I saw json for both.   I'll remove the regression flag.

Comment 10 Seth Jennings 2017-03-23 18:58:20 UTC
I'm not sure how these clusters were deployed, but openshift-ansible overrides the docker RPM default config and changes the logging driver to json-file

https://bugzilla.redhat.com/show_bug.cgi?id=1335939

https://github.com/openshift/openshift-ansible/pull/1976

Comment 11 Mike Fiedler 2017-03-23 18:59:53 UTC
We do an RPM install of docker in our image provisioner.  It pulls from the rhel7-next mirror on mirror.openshift.com

Comment 12 Seth Jennings 2017-03-23 19:05:31 UTC
Aaaand as soon as I say that, I find the commit that later changed it back to journald :-/

https://github.com/openshift/openshift-ansible/commit/222e0e52f7176ae028caaf8406bb0b08ebff715d

That went into openshift-ansible openshift-ansible-3.5.4-1.  So the change from 3.4 to 3.5 of the installer might explain the difference.

Comment 13 Seth Jennings 2017-03-23 19:35:33 UTC
This is the commit upstream that addressed this
https://github.com/kubernetes/kubernetes/commit/7fed242d5557b5bf82463566e41af8d18905449e

I did the backport to 3.5 and it isn't bad.
https://github.com/sjenning/ose/commit/923432047771ddab9f66d9bf3ae8335c0b86e43c

Or we can UpcomingRelease it and pick it up in the 3.6 rebase.

Comment 14 Seth Jennings 2017-03-23 20:26:55 UTC
OCP 3.5 PR
https://github.com/openshift/ose/pull/663

Comment 15 Kenjiro Nakayama 2017-10-16 01:42:28 UTC
One of the customers hit this issue.

   messages-20171016:Oct  9 15:42:53 xxxxx atomic-openshift-node: E1009 15:42:53.049129   59782 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "xxxxxxxxxxxx(7fdfd342-ac9b-11e7-acc4-001a4a408e0c)" container "xxxxxxxxxxx": symlink  /var/log/containers/xxxxxxxxx-0838efecda57312940f683a44aee9d2ac515f6a23ee67d4b2677b9b0335f3666.log: no such file or directory

Their logging driver is "journald" and versions are:

  atomic-openshift-3.5.5.31.24-1.git.0.ff74e0b.el7.x86_64
  docker-1.12.6-32.git88a4867.el7.x86_64

It looks like the PR c#14 (https://github.com/openshift/ose/pull/663) was closed without merge. Could you please share the status of the fix for OCP 3.5?

Comment 16 Seth Jennings 2017-10-16 13:53:15 UTC
I reopened the PR now that there is a customer request

Comment 17 Derek Carr 2017-10-17 16:19:19 UTC
merged 3.5 fix https://github.com/openshift/ose/pull/663

Comment 19 Mike Fiedler 2017-10-31 18:39:48 UTC
3.5 backport verified on v3.5.5.31.40

Used cluster-loader to create many pods on each node (journald used as log driver).   This error never occurred.

Comment 20 xiuli 2017-11-01 03:27:28 UTC
Verified and the issue can not reproduced again in the env:openshift v3.5.5.31.40
kubernetes v1.5.2+43a9be4

Comment 23 errata-xmlrpc 2017-12-07 07:10:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389


Note You need to log in before you can comment on or make changes to this bug.