Created attachment 1381297 [details]
fluentd log

Description of problem:
On the free-int cluster, fluentd pods are in Error status on master nodes. The attached fluentd pod logs contain the error "primary shard is not active Timeout".

NAME                                      READY     STATUS    RESTARTS   AGE       IP             NODE
logging-curator-2-99xk9                   1/1       Running   6          40d       10.131.0.23    Node
logging-es-data-master-3g3ip92b-4-qdrmz   2/2       Running   4          25d       10.129.4.16    Node
logging-es-data-master-gnknl4vw-4-d289n   2/2       Running   4          40d       10.131.0.24    Node
logging-es-data-master-ovck58c4-4-gw5dl   2/2       Running   4          25d       10.129.4.15    Node
logging-fluentd-cczft                     1/1       Running   2          40d       10.131.2.124   Node
logging-fluentd-d9r76                     1/1       Running   2          40d       10.129.2.117   Node
logging-fluentd-frgfp                     1/1       Running   6          40d       10.130.2.63    Node
logging-fluentd-gjb9k                     1/1       Running   2          40d       10.129.0.51    Node
logging-fluentd-kmttf                     0/1       Error     4          40d       <none>         Master
logging-fluentd-lqd99                     0/1       Error     5          40d       <none>         Master
logging-fluentd-n4g6n                     1/1       Running   2          40d       10.129.4.18    Node
logging-fluentd-v8pqv                     1/1       Running   2          40d       10.131.0.27    Master
logging-fluentd-vfgch                     0/1       Error     7          40d       <none>         Node
logging-fluentd-vnt4t                     1/1       Running   2          40d       10.128.2.209   Node
logging-kibana-3-m4g6p                    2/2       Running   4          25d       10.131.0.25    Node

Taking logging-fluentd-kmttf as an example, the pod's describe output shows:

Events:
  Type    Reason          Age                   From                                     Message
  ----    ------          ----                  ----                                     -------
  Normal  SandboxChanged  19s (x27596 over 4d)  kubelet, ip-172-31-60-182.ec2.internal   Pod sandbox changed, it will be killed and re-created.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.8.30 (online version 3.6.0.83)
Kubernetes Master: v1.8.1+0d5291c

How reproducible:
Always

Steps to Reproduce:
1. Check the logging pods' status

Actual results:
fluentd pods are in Error status on master nodes.

Expected results:
fluentd pods should be running on master nodes.

Additional info:
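For reference, the status above can be gathered with standard oc commands (a minimal sketch; the "logging" project and the pod name are taken from this cluster, substitute your own):

  # List logging pods with node placement ("-o wide" produces the table above)
  $ oc get pods -n logging -o wide
  # Inspect the failing pod's events
  $ oc describe pod logging-fluentd-kmttf -n logging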
Can we get logs from the node on which this pod failed? The "Pod sandbox changed, it will be killed and re-created" message is normal and can happen for a number of reasons. Node logs would be necessary to track it down further. Thanks!
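For example (a sketch; the hostname comes from the event above, and the unit name assumes an RPM-based OCP 3.x node running the atomic-openshift-node service):

  # On the master node hosting the failing pod
  $ ssh ip-172-31-60-182.ec2.internal
  # Capture node/kubelet logs covering the window of restarts
  $ sudo journalctl -u atomic-openshift-node --since "4 days ago" > node.log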
Logged in and got the details. I think this is another instance of https://github.com/openshift/ose/pull/999, based on the "could not retrieve port mappings: checkpoint is not found." error in the logs.
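To check whether the other failing nodes hit the same issue, the node logs can be grepped for that message (same unit-name assumption as above):

  $ sudo journalctl -u atomic-openshift-node | grep "could not retrieve port mappings: checkpoint is not found"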
v3.8.31 will include this fix.
https://github.com/openshift/ose/pull/998
Clearing the needinfo request since Dan has already gotten the details.