Bug 1915085 - Pods created and rapidly terminated get stuck [NEEDINFO]
Summary: Pods created and rapidly terminated get stuck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1883991 1898614 1912880 1914022 1934691 1946418 2063138
Depends On:
Blocks: 1896387 1912521 1912880 1914022 1922292 1958371
 
Reported: 2021-01-11 22:20 UTC by Clayton Coleman
Modified: 2022-05-24 07:23 UTC
CC List: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-apps][Feature:DeploymentConfig] deploymentconfigs when run iteratively should immediately start a new deployment
Last Closed: 2021-12-20 20:56:30 UTC
Target Upstream Version:
ehashman: needinfo? (david.karlsen)


Attachments: none

Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes pull 98424 0 None closed kubelet: Pods created and rapidly terminated get stuck 2021-02-15 19:11:06 UTC
Github openshift kubernetes pull 551 0 None closed Bug 1915085: UPSTREAM: 98424: register all pending pod deletions and check for kill 2021-02-19 14:34:30 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:52:26 UTC

Description Clayton Coleman 2021-01-11 22:20:37 UTC
In https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25787/pull-ci-openshift-origin-master-e2e-aws-fips/1348717863535382528 the pod pod-submit-status-0-0 is created and is still pending 5m later (other, later pods were created and failed).  There are no obvious failure events that would explain why the pod did not start.

This test creates and starts lots of pods in bulk - the fact that this run failed twice in a row indicates this is a fairly reproducible failure.

Looking back over the last 14 days of jobs in master, about 2% of failed runs include this failure, which indicates we may be failing to start pods quickly around 2% of the time.

Marking urgent because pods not starting within a reasonable period is certainly a release blocker.

Comment 1 Ryan Phillips 2021-01-14 17:50:10 UTC
This issue is a race that occurs when a pod is created and then deleted before its container has fully started.

┌Jan 11 21:04:47.627101 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.627056    1733 kubelet.go:1895] SyncLoop (ADD, "api"): "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"                                                                                                                                                                                                                                   ┤
│Jan 11 21:04:47.703426 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.703262    1733 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "pod-submit-status-0-0" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")                                                                  ┤
│Jan 11 21:04:47.804122 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.803923    1733 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "pod-submit-status-0-0" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")                                                                                     ┤
│Jan 11 21:04:47.808769 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.808731    1733 operation_generator.go:672] MountVolume.SetUp succeeded for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "pod-submit-status-0-0" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")                                                                                      │
│Jan 11 21:04:47.950956 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.950805    1733 kuberuntime_manager.go:439] No sandbox for pod "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" can be found. Need to start a new one                                                                                                                                                                                      │
│Jan 11 21:04:48.815520 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:48.812665    1733 kubelet.go:1911] SyncLoop (DELETE, "api"): "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"                                                                                                                                                                                                                                │
│Jan 11 21:04:49.025532 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.024632    1733 reconciler.go:196] operationExecutor.UnmountVolume started for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "c76833c3-6e57-4135-8968-2ed47a665cf8" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")                                                                    ┤
│Jan 11 21:04:49.181520 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.175663    1733 operation_generator.go:797] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8" (OuterVolumeSpecName: "default-token-29fj8") pod "c76833c3-6e57-4135-8968-2ed47a665cf8" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8"). InnerVolumeSpecName "default-token-29fj8". PluginName "│
│Jan 11 21:04:49.278204 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.278115    1733 reconciler.go:319] Volume detached for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") on node "ip-10-0-166-126.ec2.internal" DevicePath ""                                                                                                                                │
│Jan 11 21:04:49.475544 ip-10-0-166-126 crio[1696]: 2021-01-11T21:04:49Z [verbose] Add: e2e-pods-3370:pod-submit-status-0-0:c76833c3-6e57-4135-8968-2ed47a665cf8:(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/ca245555-8ee4-42c9-860c-011cc8cb612b"}],"ips":[{"version":"4","interface":0,"address":"10.128.2.55/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.128.2.1"},{"dst":"224.0.0.0│
│Jan 11 21:04:49.475544 ip-10-0-166-126 crio[1696]: I0111 21:04:49.389759  246192 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-pods-3370", Name:"pod-submit-status-0-0", UID:"c76833c3-6e57-4135-8968-2ed47a665cf8", APIVersion:"v1", ResourceVersion:"181788", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.2.55/23]                                                                       │
│Jan 11 21:04:49.664620 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.664296    1733 kubelet.go:1911] SyncLoop (DELETE, "api"): "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"                                                                                                                                                                                                                                │
│Jan 11 21:04:52.326687 ip-10-0-166-126 hyperkube[1733]: W0111 21:04:52.325987    1733 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable podc76833c3-6e57-4135-8968-2ed47a665cf8] : unable to destroy cgroup paths for cgroup [kubepods burstable podc76833c3-6e57-4135-8968-2ed47a665cf8] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-│
│Jan 11 21:04:53.309527 ip-10-0-166-126 crio[1696]: 2021-01-11T21:04:51Z [verbose] Del: e2e-pods-3370:pod-submit-status-0-0:c76833c3-6e57-4135-8968-2ed47a665cf8:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}                                                                                                                                                                                     │
│Jan 11 21:04:53.478638 ip-10-0-166-126 crio[1696]: time="2021-01-11 21:04:53.478493634Z" level=info msg="RunSandbox: releasing container name: k8s_POD_pod-submit-status-0-0_e2e-pods-3370_c76833c3-6e57-4135-8968-2ed47a665cf8_0" id=392e4627-a43c-4c0b-8fbc-3126a016d104 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox                                                                                                              ┤
│Jan 11 21:04:53.478638 ip-10-0-166-126 crio[1696]: time="2021-01-11 21:04:53.478590851Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_pod-submit-status-0-0_e2e-pods-3370_c76833c3-6e57-4135-8968-2ed47a665cf8_0" id=392e4627-a43c-4c0b-8fbc-3126a016d104 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox                                                                                                                │
│Jan 11 21:04:53.482614 ip-10-0-166-126 hyperkube[1733]: E0111 21:04:53.479356    1733 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF                                                                                                                  │
│Jan 11 21:04:53.482614 ip-10-0-166-126 hyperkube[1733]: E0111 21:04:53.479381    1733 kuberuntime_manager.go:755] createPodSandbox for pod "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF                                                                                                                 │
│Jan 11 21:04:53.482614 ip-10-0-166-126 hyperkube[1733]: E0111 21:04:53.479452    1733 pod_workers.go:191] Error syncing pod c76833c3-6e57-4135-8968-2ed47a665cf8 ("pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"), skipping: failed to "CreatePodSandbox" for "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" with CreatePodSandboxError: "CreatePodSandbox for pod

Comment 2 Ryan Phillips 2021-01-14 18:22:26 UTC
https://github.com/openshift/origin/blob/e49fe16bd3eb5a177d04e59739bc91864d608d1c/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L311

Intn() generates a number in [0..2000), used as a delay in milliseconds. The lower the number, the more likely we are to hit the race where the pod attempts to start and the kubelet then tears it down. Perhaps we could use a random range between 2-4 seconds?

Harshal: Could you create an upstream issue for this as well? (If it doesn't exist).

Comment 3 Harshal Patil 2021-01-18 12:49:50 UTC
Just filed an upstream issue - https://github.com/kubernetes/kubernetes/issues/98142

Comment 4 Harshal Patil 2021-01-20 05:19:52 UTC
*** Bug 1883991 has been marked as a duplicate of this bug. ***

Comment 8 Neelesh Agrawal 2021-01-22 18:22:04 UTC
*** Bug 1912880 has been marked as a duplicate of this bug. ***

Comment 9 Clayton Coleman 2021-01-22 19:31:07 UTC
Release impact: customers who create and delete pods in job environments (batch, data processing), who have fast-running or looping pods, or who perform a deployment and then immediately roll out a new one could see substantial (>5m) delays to their apps, preventing new versions from rolling out.

The test deliberately abuses this path because any failures here can have significant impacts during upgrades on applications trying to remain available, causing us to violate our own or our customers' SLOs for applications.

Any bugs in starting pods are always release blockers because running pods is our one job.

Comment 10 Clayton Coleman 2021-01-22 19:32:35 UTC
https://search.ci.openshift.org/?search=Pod+Container+Status+should+never+report+success+for+a+pending+container+&maxAge=48h&context=1&type=junit&name=4%5C.7&maxMatches=5&maxBytes=20971520&groupBy=job

In the last 2 days, 11% of all failed 4.7 CI jobs reported this as one error, so as we've been fixing other flakes this is climbing toward the top blocker.

Comment 11 Neelesh Agrawal 2021-01-22 19:38:41 UTC
Making it a blocker for the release. Please share if there are any mitigating factors for this bug.

Comment 12 Elana Hashman 2021-01-25 21:46:19 UTC
This bug and the underlying race condition are complex but we are still looking into it.

I'm reviewing https://github.com/kubernetes/kubernetes/pull/89155 to see if it might help mitigate this issue, as it touches the sync loop that we think is the underlying problem. If it does, I will pick up the PR.

Comment 15 Francesco Giudici 2021-01-27 08:58:48 UTC
Filling the environment field with related CI job.
The original CI bug was bug #1887857, closed as a duplicate of bug #1883991, which was in turn closed as a duplicate of this one.

Comment 16 Elana Hashman 2021-01-27 16:35:50 UTC
*** Bug 1912880 has been marked as a duplicate of this bug. ***

Comment 17 Elana Hashman 2021-01-27 23:21:53 UTC
*** Bug 1908378 has been marked as a duplicate of this bug. ***

Comment 19 Mike Fiedler 2021-01-28 22:13:29 UTC
Reproduced bug 1912880 on 4.7.0-0.nightly-2021-01-27-110023. This blocks the ability to run a cluster with a reasonable user workload for more than a day. Marking this as a TestBlocker (same as bug 1912880).

Comment 21 Elana Hashman 2021-02-01 19:05:07 UTC
*** Bug 1898614 has been marked as a duplicate of this bug. ***

Comment 22 Elana Hashman 2021-02-02 21:20:55 UTC
How to detect the issue:

1) Check the pod metadata for the creation and deletion timestamps. If they are close together (e.g. <1m apart) and the pod is stuck in termination, you have likely hit this bug. The pods may be in the "Pending" state. Example:

  creationTimestamp: "2021-01-29T19:50:57Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2021-01-29T19:51:27Z"

2) Check the kubelet logs. You will likely see entries similar to those above or in the linked BZs: https://bugzilla.redhat.com/show_bug.cgi?id=1915085#c1

   You may also see that the kubelet main SyncLoop calls "DELETE" on the pod but never calls a final "REMOVE".


How to mitigate issue when present:

Force terminate pods.

This should be low risk once it's confirmed that the pod matches the criteria above in "how to detect the issue". When stateless pods controlled by e.g. ReplicaSets or Deployments are spun up and then quickly killed, it should be fine to force remove them. There is a risk of running into this scenario: https://bugzilla.redhat.com/show_bug.cgi?id=1903228

The biggest risk with force removal is a pod that is long-running and has a long terminationGracePeriodSeconds set, or that belongs to a StatefulSet. Check for these conditions before removal. We do not believe such pods are affected by this bug.
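The safety criteria above can be sketched as a pre-check before force removal. The function name and the 30-second "long grace period" cutoff (the Kubernetes default terminationGracePeriodSeconds) are illustrative assumptions, not part of any official guidance:

```go
package main

import "fmt"

// safeToForceDelete encodes the guidance above: avoid force-removing pods
// that belong to a StatefulSet, or that have a long
// terminationGracePeriodSeconds set.
func safeToForceDelete(ownerKind string, gracePeriodSeconds int64) bool {
	if ownerKind == "StatefulSet" {
		return false
	}
	// 30s is the Kubernetes default; treat anything above it as "long".
	return gracePeriodSeconds <= 30
}

func main() {
	fmt.Println(safeToForceDelete("ReplicaSet", 30))  // true: stateless, default grace
	fmt.Println(safeToForceDelete("StatefulSet", 30)) // false: StatefulSet pod
}
```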

Comment 23 aaleman 2021-02-02 21:33:55 UTC
"Force terminate pods" means running "oc delete pod --force podname".

Comment 28 errata-xmlrpc 2021-02-24 15:51:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 29 Harshal Patil 2021-03-11 09:09:39 UTC
*** Bug 1934691 has been marked as a duplicate of this bug. ***

Comment 30 Ryan Phillips 2021-04-07 13:34:55 UTC
*** Bug 1946418 has been marked as a duplicate of this bug. ***

Comment 31 Ryan Phillips 2021-05-26 14:04:50 UTC
*** Bug 1914022 has been marked as a duplicate of this bug. ***

Comment 32 David J. M. Karlsen 2021-10-28 18:16:35 UTC
I still experience this in recent 4.7.x clusters:

  name: deployment-deployment-7f5d48bdd7-pb69f
  uid: 7d94e616-c460-494e-bb3b-72ee803b609f
  resourceVersion: '1494521179'
  creationTimestamp: '2021-10-28T16:18:03Z'
  deletionTimestamp: '2021-10-28T16:18:09Z'
  deletionGracePeriodSeconds: 1

Comment 33 Elana Hashman 2021-11-01 20:12:39 UTC
@david.karlsen@evry.com can you confirm if the kubelet logs also match the symptoms as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1915085#c22 ?

Comment 34 David J. M. Karlsen 2021-11-01 21:45:36 UTC
I find this in messages.log:

Oct 29 23:53:20 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:20.706492067+02:00" level=info msg="Running pod sandbox: kuberhealthy/deployment-deployment-5468b65549-6g989/POD" id=8e43e8fc-bc05-4929-83de-656559afbf50 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Oct 29 23:53:21 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:21.533273834+02:00" level=info msg="Got pod network &{Name:deployment-deployment-5468b65549-6g989 Namespace:kuberhealthy ID:2dcc07a20c6ccc2e20fdc7dcc982912a95d4c0af34b4c9734b154a4c51193c24 UID:6aaaa5ea-fcfc-404a-91dd-408a33080105 NetNS:/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Oct 29 23:53:21 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:21.533787417+02:00" level=info msg="Adding pod kuberhealthy_deployment-deployment-5468b65549-6g989 to CNI network \"multus-cni-network\" (type=multus)"
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: 2021-10-29T23:53:23+02:00 [verbose] Add: kuberhealthy:deployment-deployment-5468b65549-6g989:6aaaa5ea-fcfc-404a-91dd-408a33080105:openshift-sdn(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0"}],"ips":[{"version":"4","interface":0,"address":"10.200.7.98/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.200.6.1"},{"dst":"224.0.0.0/4"},{"dst":"10.200.0.0/16"}],"dns":{}}
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: I1029 23:53:23.999505 1212108 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kuberhealthy", Name:"deployment-deployment-5468b65549-6g989", UID:"6aaaa5ea-fcfc-404a-91dd-408a33080105", APIVersion:"v1", ResourceVersion:"916476433", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.200.7.98/23] from openshift-sdn
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:24.122058234+02:00" level=info msg="Got pod network &{Name:deployment-deployment-5468b65549-6g989 Namespace:kuberhealthy ID:2dcc07a20c6ccc2e20fdc7dcc982912a95d4c0af34b4c9734b154a4c51193c24 UID:6aaaa5ea-fcfc-404a-91dd-408a33080105 NetNS:/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:24.122942532+02:00" level=info msg="Checking pod kuberhealthy_deployment-deployment-5468b65549-6g989 for CNI network multus-cni-network (type=multus)"
Oct 29 23:53:26 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:26.810628606+02:00" level=info msg="Got pod network &{Name:deployment-deployment-5468b65549-6g989 Namespace:kuberhealthy ID:2dcc07a20c6ccc2e20fdc7dcc982912a95d4c0af34b4c9734b154a4c51193c24 UID:6aaaa5ea-fcfc-404a-91dd-408a33080105 NetNS:/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Oct 29 23:53:26 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:26.810973273+02:00" level=info msg="Deleting pod kuberhealthy_deployment-deployment-5468b65549-6g989 from CNI network \"multus-cni-network\" (type=multus)"
Oct 29 23:53:27 alt-ksx-r-c01oco03 crio: 2021-10-29T23:53:26+02:00 [verbose] Del: kuberhealthy:deployment-deployment-5468b65549-6g989:6aaaa5ea-fcfc-404a-91dd-408a33080105:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Oct 29 23:54:33 alt-ksx-r-c01oco03 hyperkube: E1029 23:54:33.433190    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:54:59 alt-ksx-r-c01oco03 hyperkube: E1029 23:54:59.767814    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:56:34 alt-ksx-r-c01oco03 hyperkube: E1029 23:56:34.487743    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:57:04 alt-ksx-r-c01oco03 hyperkube: E1029 23:57:04.368029    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:57:30 alt-ksx-r-c01oco03 hyperkube: E1029 23:57:30.978273    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:59:04 alt-ksx-r-c01oco03 hyperkube: E1029 23:59:04.782507    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 30 00:01:34 alt-ksx-r-c01oco03 hyperkube: E1030 00:01:34.186534    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 30 00:05:07 alt-ksx-r-c01oco03 hyperkube: E1030 00:05:07.083424    3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 30 00:06:05 alt-ksx-r-c01oco03 hyperkube: E

Comment 35 Elana Hashman 2021-11-02 20:45:12 UTC
@david.karlsen@evry.com the logs you've provided do not contain any SyncLoop logs that match my comment so I can't confirm.

It is possible that you are encountering similar symptoms with a different underlying root cause.

If you can attach a full kubelet log at verbosity v=4 + the full pod statuses as obtained with `oc get pod -o yaml` I will be able to tell for sure.

Comment 38 W. Trevor King 2021-12-20 20:56:30 UTC
Per comment 28, the linked pull requests shipped with 4.7.0, and it's important not to re-open CLOSED ERRATA bugs.  If you are running a later release and see similar issues, please open a new bug, from which you can refer back to this one with "...similar symptoms as bug 1915085" or similar wording if that seems appropriate.

Comment 40 Elana Hashman 2022-03-14 23:00:36 UTC
*** Bug 2063138 has been marked as a duplicate of this bug. ***

