Created attachment 1265437 [details]
node log file

Description of problem:
Every pod started is getting:

Mar 22 12:20:46 ip-172-31-43-89 atomic-openshift-node: E0322 12:20:46.122505 25624 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "cakephp-mysql-example-1-build_mff(73dcb124-0f1b-11e7-9c92-025d2f9e4ccf)" container "sti-build": symlink /var/log/containers/

I see this on Kubernetes 1.5.4 as well.

Version-Release number of selected component (if applicable):
3.5.0.55

How reproducible:
Always

Steps to Reproduce:
1. Set up an OpenShift (or Kubernetes) cluster
2. Create a new pod with the attached json
3. Look at /var/log/messages on the node where the pod starts

Actual results:
Mar 22 12:20:46 ip-172-31-43-89 atomic-openshift-node: E0322 12:20:46.122505 25624 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "cakephp-mysql-example-1-build_mff(73dcb124-0f1b-11e7-9c92-025d2f9e4ccf)" container "sti-build": symlink /var/log/containers/

Expected results:
No error on normal pod start

Additional info:
Full log attached.
Created attachment 1265438 [details]
pod definition
/var/log/containers is empty. I checked a 3.4 system and /var/log/containers has content. Marking regression.
Worth noting the node seems to be having other issues. From the node log:

Mar 22 12:20:48 ip-172-31-43-89 kernel: ------------[ cut here ]------------
Mar 22 12:20:48 ip-172-31-43-89 kernel: WARNING: at net/core/skbuff.c:3824 skb_try_coalesce+0x419/0x450()
Mar 22 12:20:48 ip-172-31-43-89 kernel: Modules linked in: veth overlay(T) xt_statistic xt_nat xt_recent vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch xt_mark xt_comment xt_multiport ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables br_netfilter bridge stp llc isofs xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod cirrus intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev
Mar 22 12:20:48 ip-172-31-43-89 kernel: ttm drm_kms_helper aesni_intel lrw syscopyarea gf128mul sysfillrect glue_helper sysimgblt ablk_helper fb_sys_fops cryptd drm i2c_piix4 i2c_core pcspkr parport_pc parport xen_netfront nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix crct10dif_pclmul crct10dif_common xen_blkfront crc32c_intel libata ixgbevf serio_raw floppy fjes
Mar 22 12:20:48 ip-172-31-43-89 kernel: CPU: 3 PID: 23 Comm: ksoftirqd/3 Tainted: G ------------ T 3.10.0-514.10.2.el7.x86_64 #1
Mar 22 12:20:48 ip-172-31-43-89 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Mar 22 12:20:48 ip-172-31-43-89 kernel: 0000000000000000 00000000599de0b2 ffff880174aeb988 ffffffff816863ef
Mar 22 12:20:48 ip-172-31-43-89 kernel: ffff880174aeb9c0 ffffffff81085940 ffff8807c1754c00 ffff8807c1755a00
Mar 22 12:20:48 ip-172-31-43-89 kernel: 0000000000000704 0000000000000020 ffff880174aeba34 ffff880174aeb9d0
Mar 22 12:20:48 ip-172-31-43-89 kernel: Call Trace:
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff816863ef>] dump_stack+0x19/0x1b
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81561ba9>] skb_try_coalesce+0x419/0x450
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffffa0095d07>] ? ipt_do_table+0x337/0x710 [ip_tables]
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815c46da>] tcp_try_coalesce+0x6a/0xd0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815c5925>] tcp_queue_rcv+0x55/0x130
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815caf17>] tcp_rcv_established+0x3a7/0x760
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815d586a>] tcp_v4_do_rcv+0x10a/0x340
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff812a8036>] ? security_sock_rcv_skb+0x16/0x20
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815d6fb9>] tcp_v4_rcv+0x799/0x9a0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffffa03dc306>] ? ipv4_confirm+0x86/0x100 [nf_conntrack_ipv4]
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0974>] ip_local_deliver_finish+0xb4/0x1f0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0c59>] ip_local_deliver+0x59/0xd0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b08c0>] ? ip_rcv_finish+0x350/0x350
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b05fa>] ip_rcv_finish+0x8a/0x350
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0f86>] ip_rcv+0x2b6/0x410
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156fe02>] ? netif_receive_skb_sk+0x42/0x70
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815b0570>] ? inet_del_offload+0x40/0x40
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156f9b2>] __netif_receive_skb_core+0x582/0x800
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156ffec>] ? napi_complete_done+0x2c/0xc0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8156fc48>] __netif_receive_skb+0x18/0x60
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81570f5e>] process_backlog+0xae/0x170
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff815704e0>] net_rx_action+0x170/0x380
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8108f2cf>] __do_softirq+0xef/0x280
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff8108f498>] run_ksoftirqd+0x38/0x50
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b927f>] smpboot_thread_fn+0x12f/0x180
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b9150>] ? lg_double_unlock+0x90/0x90
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
Mar 22 12:20:48 ip-172-31-43-89 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Mar 22 12:20:48 ip-172-31-43-89 kernel: ---[ end trace 7bef0d165c4c4827 ]---

Still looking...
I was able to reproduce on origin with "oc cluster up", versions v1.5.0-rc.0 AND v1.4.1, on docker-common-1.12.6-6. If I change to json-file logging, it works for both versions. It seems that when the logging driver is journald, docker does not create the log files in /var/lib/docker/containers that openshift is trying to symlink against in /var/log/containers. Mike, can you provide the docker versions for your 3.4 and 3.5 clusters?
For the 3.5 install, docker is 1.12.6-14.el7; for the 3.4 install, docker is 1.12.5-9.el7.
Thanks Mike. I am able to reproduce this error in origin 1.4.1. I'm not convinced this is a regression. Are you saying that you observe the symlinks being successfully created in 3.4 under /var/log/containers using docker 1.12 and the journald logging driver?
On both my 3.4 and 3.5 clusters, docker is started with --log-driver=json-file

3.4 - /var/log/containers has entries:

root@ip-172-31-13-107: ~ # openshift version
openshift v3.4.0.39
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

root@ip-172-31-13-107: ~ # docker version
Client:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-9.el7.x86_64
 Go version:      go1.7.4
 Git commit:      c22650b
 Built:           Thu Jan 5 14:00:55 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-9.el7.x86_64
 Go version:      go1.7.4
 Git commit:      c22650b
 Built:           Thu Jan 5 14:00:55 2017
 OS/Arch:         linux/amd64

root@ip-172-31-13-107: ~ # ps -ef | grep dockerd-current
root 6031 1 37 Mar21 ? 17:51:43 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=json-file --log-opt max-size=50m --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com --insecure-registry registry.ops.openshift.com

3.5 - /var/log/containers empty:

root@ip-172-31-60-44: ~ # openshift version
openshift v3.5.0.55
kubernetes v1.5.2+43a9be4
etcd 3.1.0

root@ip-172-31-60-44: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-14.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Thu Mar 16 14:27:53 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-14.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Thu Mar 16 14:27:53 2017
 OS/Arch:         linux/amd64

root@ip-172-31-60-44: ~ # ps -ef | grep dockerd-current
root 39225 38698 0 14:45 pts/0 00:00:00 grep --color=auto dockerd-current
root 44844 1 0 Mar22 ? 00:07:05 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=journald --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com --insecure-registry registry.ops.openshift.com
Ah, so 3.4 is using --log-driver=json-file and 3.5 is using --log-driver=journald. That explains why the error doesn't appear in the 3.4 cluster. The error doesn't result from a code regression but from a change in configuration options from 3.4 to 3.5.
Right you are - could have sworn I saw json for both. I'll remove the regression flag.
I'm not sure how these clusters were deployed, but openshift-ansible overrides the docker RPM default config and changes the logging driver to json-file:

https://bugzilla.redhat.com/show_bug.cgi?id=1335939
https://github.com/openshift/openshift-ansible/pull/1976
We do an RPM install of docker in our image provisioner. It pulls from the rhel7-next mirror on mirror.openshift.com
Aaaand as soon as I say that, I find the commit that later changed it back to journald :-/

https://github.com/openshift/openshift-ansible/commit/222e0e52f7176ae028caaf8406bb0b08ebff715d

That went into openshift-ansible-3.5.4-1. So the change in the installer from 3.4 to 3.5 might explain the difference.
This is the upstream commit that addressed this:
https://github.com/kubernetes/kubernetes/commit/7fed242d5557b5bf82463566e41af8d18905449e

I did the backport to 3.5 and it isn't bad:
https://github.com/sjenning/ose/commit/923432047771ddab9f66d9bf3ae8335c0b86e43c

Or we can mark it UpcomingRelease and pick it up in the 3.6 rebase.
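The shape of the fix is roughly: before creating the symlink, notice that docker produced no container log file (as happens with the journald driver) and skip quietly instead of erroring. A hedged sketch under that assumption; `createLogSymlink` and its signature are illustrative, not the exact identifiers from the upstream commit.

```go
package main

import (
	"fmt"
	"os"
)

// createLogSymlink links a container's docker log file into a
// /var/log/containers-style path, but only when docker actually reports a
// log file. With a non-json-file driver (e.g. journald) the reported log
// path is empty, so there is nothing to link to and no error to raise.
func createLogSymlink(containerLogPath, symlinkPath string) error {
	if containerLogPath == "" {
		// Container log is handled by the logging driver itself;
		// skip the symlink instead of failing the pod start.
		fmt.Printf("container log unavailable, skipping symlink %s\n", symlinkPath)
		return nil
	}
	return os.Symlink(containerLogPath, symlinkPath)
}

func main() {
	// journald case: previously an error on every pod start, now a no-op.
	if err := createLogSymlink("", "/var/log/containers/mypod-abc123.log"); err != nil {
		fmt.Println("unexpected error:", err)
	}
}
```

This matches the observed behavior in the bug: the error fires on every pod start with journald, and disappears once the empty-log-path case is handled.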
OCP 3.5 PR https://github.com/openshift/ose/pull/663
One of the customers hit this issue.

messages-20171016:Oct 9 15:42:53 xxxxx atomic-openshift-node: E1009 15:42:53.049129 59782 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "xxxxxxxxxxxx(7fdfd342-ac9b-11e7-acc4-001a4a408e0c)" container "xxxxxxxxxxx": symlink /var/log/containers/xxxxxxxxx-0838efecda57312940f683a44aee9d2ac515f6a23ee67d4b2677b9b0335f3666.log: no such file or directory

Their logging driver is "journald" and the versions are:
atomic-openshift-3.5.5.31.24-1.git.0.ff74e0b.el7.x86_64
docker-1.12.6-32.git88a4867.el7.x86_64

It looks like the PR from comment 14 (https://github.com/openshift/ose/pull/663) was closed without being merged. Could you please share the status of the fix for OCP 3.5?
I reopened the PR now that there is a customer request
Merged the 3.5 fix: https://github.com/openshift/ose/pull/663
3.5 backport verified on v3.5.5.31.40. Used cluster-loader to create many pods on each node (journald used as the log driver). This error never occurred.
Verified; the issue can not be reproduced in the env: openshift v3.5.5.31.40, kubernetes v1.5.2+43a9be4
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389