Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1434942 - Symbolic link error for log file of every pod started when docker log driver is journald
Symbolic link error for log file of every pod started when docker log driver ...
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod (Show other bugs)
3.5.0
x86_64 Linux
unspecified Severity medium
: ---
: 3.5.z
Assigned To: Seth Jennings
Mike Fiedler
aos-scalability-35
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-03-22 12:40 EDT by Mike Fiedler
Modified: 2018-01-08 21:54 EST (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Skips log symlink creation that fails when using journald logging driver.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-07 02:10:26 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
node log file (222.36 KB, application/x-gzip)
2017-03-22 12:40 EDT, Mike Fiedler
no flags Details
pod definition (776 bytes, text/plain)
2017-03-22 12:41 EDT, Mike Fiedler
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3389 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Enterprise security, bug fix, and enhancement update 2017-12-07 07:09:10 EST

  None (edit)
Description Mike Fiedler 2017-03-22 12:40:40 EDT
Created attachment 1265437 [details]
node log file

Description of problem:

Every pod started is getting:

Mar 22 12:20:46 ip-172-31-43-89 atomic-openshift-node: E0322 12:20:46.122505   25624 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "cakephp-mysql-example-1-build_mff(73dcb124-0f1b-11e7-9c92-025d2f9e4ccf)" container "sti-build": symlink  /var/log/containers/

I see this on Kubernetes 1.5.4 as well.


Version-Release number of selected component (if applicable): 3.5.0.55


How reproducible: Always


Steps to Reproduce:
1. Set up an OpenShift (or Kubernetes) cluster
2. Create a new pod with the attached json
3. Look at /var/log/messages on the node where the pod starts

Actual results:

Mar 22 12:20:46 ip-172-31-43-89 atomic-openshift-node: E0322 12:20:46.122505   25624 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "cakephp-mysql-example-1-build_mff(73dcb124-0f1b-11e7-9c92-025d2f9e4ccf)" container "sti-build": symlink  /var/log/containers/


Expected results:

No error on normal pod start

Additional info:

Full log attached.
Comment 1 Mike Fiedler 2017-03-22 12:41 EDT
Created attachment 1265438 [details]
pod definition
Comment 2 Mike Fiedler 2017-03-22 13:39:46 EDT
/var/log/containers is empty.

I checked a 3.4 system and /var/log/containers has content.  Marking regression.
Comment 3 Seth Jennings 2017-03-23 11:12:51 EDT
Worth noting the node seems to be having other issues.

From the node log:

Mar 22 12:20:48 ip-172-31-43-89 kernel: ------------[ cut here ]------------
Mar 22 12:20:48 ip-172-31-43-89 kernel: WARNING: at net/core/skbuff.c:3824 skb_try_coalesce+0x419/0x450()
Mar 22 12:20:48 ip-172-31-43-89 kernel: Modules linked in: veth overlay(T) xt_statistic xt_nat xt_recent vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch xt_mark xt_comment xt_multiport ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables br_netfilter bridge stp llc isofs xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod cirrus intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev
Mar 22 12:20:48 ip-172-31-43-89 kernel: ttm drm_kms_helper aesni_intel lrw syscopyarea gf128mul sysfillrect glue_helper sysimgblt ablk_helper fb_sys_fops cryptd drm i2c_piix4 i2c_core pcspkr parport_pc parport xen_netfront nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix crct10dif_pclmul crct10dif_common xen_blkfront crc32c_intel libata ixgbevf serio_raw floppy fjes
Mar 22 12:20:48 ip-172-31-43-89 kernel: CPU: 3 PID: 23 Comm: ksoftirqd/3 Tainted: G               ------------ T 3.10.0-514.10.2.el7.x86_64 #1
Mar 22 12:20:48 ip-172-31-43-89 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Mar 22 12:20:48 ip-172-31-43-89 kernel: 0000000000000000 00000000599de0b2 ffff880174aeb988 ffffffff816863ef
Mar 22 12:20:48 ip-172-31-43-89 kernel: ffff880174aeb9c0 ffffffff81085940 ffff8807c1754c00 ffff8807c1755a00
Mar 22 12:20:48 ip-172-31-43-89 kernel: 0000000000000704 0000000000000020 ffff880174aeba34 ffff880174aeb9d0
Mar 22 12:20:48 ip-172-31-43-89 kernel: Call Trace:
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff816863ef>] dump_stack+0x19/0x1b
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81561ba9>] skb_try_coalesce+0x419/0x450
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffffa0095d07>] ? ipt_do_table+0x337/0x710 [ip_tables]
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815c46da>] tcp_try_coalesce+0x6a/0xd0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815c5925>] tcp_queue_rcv+0x55/0x130
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815caf17>] tcp_rcv_established+0x3a7/0x760
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815d586a>] tcp_v4_do_rcv+0x10a/0x340
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff812a8036>] ? security_sock_rcv_skb+0x16/0x20
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815d6fb9>] tcp_v4_rcv+0x799/0x9a0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffffa03dc306>] ? ipv4_confirm+0x86/0x100 [nf_conntrack_ipv4]
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0974>] ip_local_deliver_finish+0xb4/0x1f0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0c59>] ip_local_deliver+0x59/0xd0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b08c0>] ? ip_rcv_finish+0x350/0x350
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b05fa>] ip_rcv_finish+0x8a/0x350
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0f86>] ip_rcv+0x2b6/0x410
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156fe02>] ? netif_receive_skb_sk+0x42/0x70
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0570>] ? inet_del_offload+0x40/0x40
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156f9b2>] __netif_receive_skb_core+0x582/0x800
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156ffec>] ? napi_complete_done+0x2c/0xc0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156fc48>] __netif_receive_skb+0x18/0x60
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81570f5e>] process_backlog+0xae/0x170
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815704e0>] net_rx_action+0x170/0x380
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8108f2cf>] __do_softirq+0xef/0x280
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8108f498>] run_ksoftirqd+0x38/0x50
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b927f>] smpboot_thread_fn+0x12f/0x180
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b9150>] ? lg_double_unlock+0x90/0x90
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Mar 22 12:20:48 ip-172-31-43-89 kernel: ---[ end trace 7bef0d165c4c4827 ]---

Still looking...
Comment 4 Seth Jennings 2017-03-23 11:59:53 EDT
I was able to reproduce on origin with "oc cluster up" version v1.5.0-rc.0 AND v1.4.1 on docker-common-1.12.6-6.

If I change to json-file logging, it works for both versions.

It seems that docker is not creating the log files in /var/lib/docker/containers that openshift is trying to symlink against in /var/log/containers with the logging driver is journald.

Mike, can you provide the docker versions for your 3.4 and 3.5 clusters?
Comment 5 Mike Fiedler 2017-03-23 14:05:24 EDT
for 3.5 install, docker is 1.12.6-14.el7
for 3.4 install, docker is 1.12.5-9.el7
Comment 6 Seth Jennings 2017-03-23 14:31:55 EDT
Thanks Mike.  I am able to reproduce this error in origin 1.4.1.  I'm not convinced this is a regression.  Are you saying that you observe the symlinks being successfully created in 3.4 under /var/log/containers using docker 1.12 and the journald logging driver?
Comment 7 Mike Fiedler 2017-03-23 14:46:24 EDT
On both my 3.4 and 3.5 clusters, docker is started with --log-driver=json-file

3.4 - /var/log/containers has entries:

root@ip-172-31-13-107: ~ # openshift version
openshift v3.4.0.39
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
root@ip-172-31-13-107: ~ # docker version
Client:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-9.el7.x86_64
 Go version:      go1.7.4
 Git commit:      c22650b
 Built:           Thu Jan  5 14:00:55 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-9.el7.x86_64
 Go version:      go1.7.4
 Git commit:      c22650b
 Built:           Thu Jan  5 14:00:55 2017
 OS/Arch:         linux/amd64
root@ip-172-31-13-107: ~ # ps -ef | grep dockerd-current
root       6031      1 37 Mar21 ?        17:51:43 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=json-file --log-opt max-size=50m --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com --insecure-registry registry.ops.openshift.com


3.5 - /var/log/containers empty

root@ip-172-31-60-44: ~ # openshift version
openshift v3.5.0.55
kubernetes v1.5.2+43a9be4
etcd 3.1.0
root@ip-172-31-60-44: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-14.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Thu Mar 16 14:27:53 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-14.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Thu Mar 16 14:27:53 2017
 OS/Arch:         linux/amd64
root@ip-172-31-60-44: ~ # ps -ef | grep dockerd-current
root      39225  38698  0 14:45 pts/0    00:00:00 grep --color=auto dockerd-current
root      44844      1  0 Mar22 ?        00:07:05 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=journald --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com --insecure-registry registry.ops.openshift.com
Comment 8 Seth Jennings 2017-03-23 14:50:23 EDT
Ah, so 3.4 is using --log-driver=json-file and 3.5 is using --log-driver=journald.  That explains why it doesn't appear in the 3.4 cluster. The error doesn't result from a code regression but a change in configuration options from 3.4 to 3.5.
Comment 9 Mike Fiedler 2017-03-23 14:57:32 EDT
Right you are - could have sworn I saw json for both.   I'll remove the regression flag.
Comment 10 Seth Jennings 2017-03-23 14:58:20 EDT
I'm not sure how these clusters were deployed, but openshift-ansible overrides the docker RPM default config and changes the logging driver to json-file

https://bugzilla.redhat.com/show_bug.cgi?id=1335939

https://github.com/openshift/openshift-ansible/pull/1976
Comment 11 Mike Fiedler 2017-03-23 14:59:53 EDT
We do an RPM install of docker in our image provisioner.  It pulls from the rhel7-next mirror on mirror.openshift.com
Comment 12 Seth Jennings 2017-03-23 15:05:31 EDT
Aaaand as soon as I say that, I find the commit that later changed it back to journald :-/

https://github.com/openshift/openshift-ansible/commit/222e0e52f7176ae028caaf8406bb0b08ebff715d

That went into openshift-ansible openshift-ansible-3.5.4-1.  So the change from 3.4 to 3.5 of the installer might explain the difference.
Comment 13 Seth Jennings 2017-03-23 15:35:33 EDT
This is the commit upstream that addressed this
https://github.com/kubernetes/kubernetes/commit/7fed242d5557b5bf82463566e41af8d18905449e

I did the backport to 3.5 and it isn't bad.
https://github.com/sjenning/ose/commit/923432047771ddab9f66d9bf3ae8335c0b86e43c

Or we can UpcomingRelease it and pick it up in the 3.6 rebase.
Comment 14 Seth Jennings 2017-03-23 16:26:55 EDT
OCP 3.5 PR
https://github.com/openshift/ose/pull/663
Comment 15 Kenjiro Nakayama 2017-10-15 21:42:28 EDT
One of the customers hit this issue.

   messages-20171016:Oct  9 15:42:53 xxxxx atomic-openshift-node: E1009 15:42:53.049129   59782 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "xxxxxxxxxxxx(7fdfd342-ac9b-11e7-acc4-001a4a408e0c)" container "xxxxxxxxxxx": symlink  /var/log/containers/xxxxxxxxx-0838efecda57312940f683a44aee9d2ac515f6a23ee67d4b2677b9b0335f3666.log: no such file or directory

Their logging driver is "journald" and versions are:

  atomic-openshift-3.5.5.31.24-1.git.0.ff74e0b.el7.x86_64
  docker-1.12.6-32.git88a4867.el7.x86_64

It looks like the PR c#14 (https://github.com/openshift/ose/pull/663) was closed without merge. Could you please share the status of the fix for OCP 3.5?
Comment 16 Seth Jennings 2017-10-16 09:53:15 EDT
I reopened the PR now that there is a customer request
Comment 17 Derek Carr 2017-10-17 12:19:19 EDT
merged 3.5 fix https://github.com/openshift/ose/pull/663
Comment 19 Mike Fiedler 2017-10-31 14:39:48 EDT
3.5 backport verified on v3.5.5.31.40

Used cluster-loader to create many pods on each node (journald used as log driver).   This error never occurred.
Comment 20 xiuli 2017-10-31 23:27:28 EDT
Verified and the issue can not reproduced again in the env:openshift v3.5.5.31.40
kubernetes v1.5.2+43a9be4
Comment 23 errata-xmlrpc 2017-12-07 02:10:26 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389

Note You need to log in before you can comment on or make changes to this bug.