Bug 1903290 - Kubelet repeatedly log the same log line from exited containers
Summary: Kubelet repeatedly log the same log line from exited containers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.7.0
Assignee: Ryan Phillips
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks: 1975626
TreeView+ depends on / blocked
 
Reported: 2020-12-01 19:00 UTC by Jim Minter
Modified: 2021-08-02 18:24 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:37:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2262 0 None closed Bug 1903290: kubelet: refactor KUBELET_LOG_LEVEL into KubeletConfig field 2021-01-19 15:33:24 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:37:38 UTC

Description Jim Minter 2020-12-01 19:00:17 UTC
I am seeing the kubelet repeatedly log the same log line from exited containers from near the end of that container's log stream.

Example: current time is around 18:40 UTC:
[root@aro-v4-shared-gxqb4-master-1 ~]# journalctl -S '1 hour ago' SYSLOG_IDENTIFIER=hyperkube | cut -d: -f4-|sort |uniq -c |sort -nr|head
    109  I1130 14:03:35.562313       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "AggregatorProxyClientCert"
    109  I1130 14:03:35.561000       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "InternalLoadBalancerServing"
    109  I1130 14:03:35.560982       1 certrotationcontroller.go:556] Shutting down CertRotation
    109  I1130 14:03:35.560971       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "LocalhostRecoveryServing"
    109  I1130 14:03:35.560967       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeSchedulerClient"
    109  I1130 14:03:35.560965       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "ExternalLoadBalancerServing"
    109  I1130 14:03:35.560961       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeControllerManagerClient"
    109  I1130 14:03:35.560932       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "LocalhostServing"
    109  I1130 14:03:35.560928       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "ServiceNetworkServing"
    109  I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"

These are piles of stale logs from some thing that exited around 14:03:35.

[root@aro-v4-shared-gxqb4-master-1 ~]# journalctl -S '1 hour ago' |grep 'client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"' | head
Dec 01 17:42:31 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:42:32 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:43:31 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:43:56 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:44:31 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:45:12 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:45:31 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:46:23 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:46:31 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"
Dec 01 17:47:31 aro-v4-shared-gxqb4-master-1 hyperkube[1493]: I1130 14:03:35.560906       1 client_cert_rotation_controller.go:180] Shutting down CertRotationController - "KubeAPIServerToKubeletClientCert"

We are seeing this commonly in ARO production on 4.4 (e.g. 4.4.27) and 4.5 (e.g. 4.5.22 prerelease) clusters.

I think we need to get this fixed, we need logs to be reliable.

Comment 3 Ryan Phillips 2020-12-01 22:04:10 UTC
Updating the Version of this BZ to 4.4.27. The patch I pasted is in 4.7. A reproducer pod would be super helpful.

Comment 4 Jim Minter 2020-12-01 22:19:02 UTC
The following pod manifest recreates this issue for me: it's the terminationMessagePolicy: FallbackToLogsOnError that's the difference between this happening and not.

apiVersion: v1
kind: Pod
metadata:
  name: logtrouble
spec:
  containers:
  - name: logtrouble
    image: busybox
    args:
    - sh
    - -c
    - "while true; do echo -n 'LOGTROUBLE: '; date; sleep 1; done"
    terminationMessagePolicy: FallbackToLogsOnError
    livenessProbe:
      tcpSocket:
        port: 80
  restartPolicy: Always
  nodeName: aro-v4-shared-gxqb4-worker-eastus1-xh9ml
  terminationGracePeriodSeconds: 0

Comment 5 Ryan Phillips 2020-12-02 00:07:14 UTC
The log gets injected in the pod status field and then subsequently gets printed at this line at log level 3:

https://github.com/openshift/origin/blob/release-4.4/vendor/k8s.io/kubernetes/pkg/kubelet/status/status_manager.go#L434

We have a separate PR to refactor the log level, and I will be proposing to set the log level of the kubelet to 2.

https://github.com/openshift/machine-config-operator/pull/2262

Comment 6 Seth Jennings 2020-12-03 20:50:23 UTC
Why solve the problem once when you can solve it twice
https://github.com/kubernetes/kubernetes/pull/97056

Commenting here just for completeness.  https://github.com/openshift/machine-config-operator/pull/2262 will still be the official fix for this.

Comment 9 MinLi 2020-12-18 08:22:27 UTC
verified! 


run pod in Comment 4, and check log in version 4.7.0-0.nightly-2020-12-17-201522: 
sh-4.4# journalctl -S '1 hour ago' SYSLOG_IDENTIFIER=hyperkube | cut -d: -f4-|sort |uniq -c |sort -nr|head
      2  W1218 07:26:48.916985    1434 conversion.go:111] Could not get instant cpu stats: cumulative stats decrease
      1 08:44 UTC. --
      1  W1218 08:08:39.682254    1434 kubelet_getters.go:300] Path "/var/lib/kubelet/pods/f8971161-785a-4f70-9192-753da4bf6cc6/volumes" does not exist
      1  W1218 08:08:38.777575    1434 conversion.go:111] Could not get instant cpu stats: cumulative stats decrease
      1  W1218 08:08:38.777405    1434 conversion.go:111] Could not get instant cpu stats: cumulative stats decrease
      1  W1218 08:08:37.102257    1434 pod_container_deletor.go:79] Container "9fde818aa1863bffb5ca524f357c98e47080af94726bf39391f271b96ff43289" not found in pod's containers
      1  W1218 08:05:09.681799    1434 kubelet_getters.go:300] Path "/var/lib/kubelet/pods/8c06a856-b749-4708-b5cd-ce6ce7397877/volumes" does not exist
      1  W1218 08:05:07.749987    1434 pod_container_deletor.go:79] Container "25301ae88902e09b906839544448e23a545605b6009ab7d8963fba1892262328" not found in pod's containers
      1  W1218 08:03:18.916752    1434 conversion.go:111] Could not get instant cpu stats: cumulative stats decrease
      1  W1218 08:03:18.916746    1434 conversion.go:111] Could not get instant cpu stats: cumulative stats decrease


check log in version 4.4.27:
sh-4.4# journalctl -S '1 hour ago' SYSLOG_IDENTIFIER=hyperkube | cut -d: -f4-|sort |uniq -c |sort -nr|head
     26  LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:16 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:15 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:14 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:13 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:12 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:11 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:10 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:09 UTC 2020
     26  LOGTROUBLE: Fri Dec 18 07:54:08 UTC 2020

sh-4.4# journalctl -S '1 hour ago' SYSLOG_IDENTIFIER=hyperkube  | grep 'LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020'
Dec 18 07:54:18 ip-10-0-150-213 hyperkube[1379]: I1218 07:54:18.835233    1379 status_manager.go:569] Patch status for pod "logtrouble_default(3716a3b0-55c0-444f-a1fd-75e45d11d3d1)" with "{\"metadata\":{\"uid\":\"3716a3b0-55c0-444f-a1fd-75e45d11d3d1\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-12-18T07:54:18Z\",\"message\":\"containers with unready status: [logtrouble]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-12-18T07:54:18Z\",\"message\":\"containers with unready status: [logtrouble]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://1fe885942bd0ea37d892fd0ee118144f9f62f64c4b9318e0109e4bd12ed57c84\",\"image\":\"docker.io/library/busybox:latest\",\"imageID\":\"docker.io/library/busybox@sha256:31a54a0cf86d7354788a8265f60ae6acb4b348a67efbcf7c1007dd3cf7af05ab\",\"lastState\":{\"terminated\":{\"containerID\":\"cri-o://1fe885942bd0ea37d892fd0ee118144f9f62f64c4b9318e0109e4bd12ed57c84\",\"exitCode\":137,\"finishedAt\":\"2020-12-18T07:54:18Z\",\"message\":\"LOGTROUBLE: Fri Dec 18 07:53:49 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:50 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:51 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:52 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:53 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:54 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:55 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:56 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:57 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:58 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:59 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:00 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:01 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:02 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:03 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:04 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:05 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:06 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:07 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:08 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:09 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:10 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:11 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:12 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:13 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:14 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:15 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:16 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020\\n\",\"reason\":\"Error\",\"startedAt\":\"2020-12-18T07:53:49Z\"}},\"name\":\"logtrouble\",\"ready\":false,\"restartCount\":13,\"started\":false,\"state\":{\"waiting\":{\"message\":\"back-off 5m0s restarting failed container=logtrouble pod=logtrouble_default(3716a3b0-55c0-444f-a1fd-75e45d11d3d1)\",\"reason\":\"CrashLoopBackOff\"}}}]}}"
Dec 18 07:54:18 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:54:33 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:54:46 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:54:57 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:55:09 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:55:23 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:55:37 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:55:49 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:56:01 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:56:15 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:56:30 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:56:44 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:56:57 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:57:11 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:57:26 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:57:40 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:57:51 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:58:06 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:58:18 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:58:31 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:58:42 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:58:55 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:59:07 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:59:18 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:59:20 ip-10-0-150-213 hyperkube[1379]: I1218 07:59:20.630716    1379 status_manager.go:569] Patch status for pod "logtrouble_default(3716a3b0-55c0-444f-a1fd-75e45d11d3d1)" with "{\"metadata\":{\"uid\":\"3716a3b0-55c0-444f-a1fd-75e45d11d3d1\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-12-18T07:59:20Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-12-18T07:59:20Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://88148e7367464eeb81cd853522b7533b5c6f16eb1ed0602f62436483a2d27ff7\",\"image\":\"docker.io/library/busybox:latest\",\"imageID\":\"docker.io/library/busybox@sha256:31a54a0cf86d7354788a8265f60ae6acb4b348a67efbcf7c1007dd3cf7af05ab\",\"lastState\":{\"terminated\":{\"containerID\":\"cri-o://1fe885942bd0ea37d892fd0ee118144f9f62f64c4b9318e0109e4bd12ed57c84\",\"exitCode\":137,\"finishedAt\":\"2020-12-18T07:54:18Z\",\"message\":\"LOGTROUBLE: Fri Dec 18 07:53:49 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:50 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:51 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:52 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:53 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:54 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:55 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:56 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:57 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:58 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:53:59 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:00 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:01 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:02 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:03 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:04 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:05 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:06 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:07 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:08 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:09 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:10 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:11 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:12 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:13 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:14 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:15 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:16 UTC 2020\\nLOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020\\n\",\"reason\":\"Error\",\"startedAt\":\"2020-12-18T07:53:49Z\"}},\"name\":\"logtrouble\",\"ready\":true,\"restartCount\":14,\"started\":true,\"state\":{\"running\":{\"startedAt\":\"2020-12-18T07:59:19Z\"}}}]}}"
Dec 18 07:59:20 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020
Dec 18 07:59:46 ip-10-0-150-213 hyperkube[1379]: LOGTROUBLE: Fri Dec 18 07:54:17 UTC 2020

Comment 12 errata-xmlrpc 2021-02-24 15:37:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.