Bug 1534419 - [free-int]fluentd pods are in Error status on master nodes
Summary: [free-int]fluentd pods are in Error status on master nodes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Logging
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.x
Assignee: Jeff Cantrill
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-15 08:23 UTC by Junqi Zhao
Modified: 2018-05-11 14:30 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-11 14:30:58 UTC
Target Upstream Version:
Embargoed:


Attachments
fluentd log (198.78 KB, application/x-gzip), 2018-01-15 08:23 UTC, Junqi Zhao

Description Junqi Zhao 2018-01-15 08:23:14 UTC
Created attachment 1381297
fluentd log

Description of problem:
On the free-int cluster, fluentd pods are in Error status on the master nodes. The fluentd pod logs are attached; they contain the error:
"primary shard is not active Timeout"

NAME                                      READY     STATUS    RESTARTS   AGE       IP             NODE
logging-curator-2-99xk9                   1/1       Running   6          40d       10.131.0.23    Node
logging-es-data-master-3g3ip92b-4-qdrmz   2/2       Running   4          25d       10.129.4.16    Node
logging-es-data-master-gnknl4vw-4-d289n   2/2       Running   4          40d       10.131.0.24    Node
logging-es-data-master-ovck58c4-4-gw5dl   2/2       Running   4          25d       10.129.4.15    Node
logging-fluentd-cczft                     1/1       Running   2          40d       10.131.2.124   Node
logging-fluentd-d9r76                     1/1       Running   2          40d       10.129.2.117   Node
logging-fluentd-frgfp                     1/1       Running   6          40d       10.130.2.63    Node
logging-fluentd-gjb9k                     1/1       Running   2          40d       10.129.0.51    Node
logging-fluentd-kmttf                     0/1       Error     4          40d       <none>         Master
logging-fluentd-lqd99                     0/1       Error     5          40d       <none>         Master
logging-fluentd-n4g6n                     1/1       Running   2          40d       10.129.4.18    Node
logging-fluentd-v8pqv                     1/1       Running   2          40d       10.131.0.27    Master
logging-fluentd-vfgch                     0/1       Error     7          40d       <none>         Node
logging-fluentd-vnt4t                     1/1       Running   2          40d       10.128.2.209   Node
logging-kibana-3-m4g6p                    2/2       Running   4          25d       10.131.0.25    Node

Take logging-fluentd-kmttf for example; describing the pod shows:

Events:
  Type    Reason          Age                   From                                    Message
  ----    ------          ----                  ----                                    -------
  Normal  SandboxChanged  19s (x27596 over 4d)  kubelet, ip-172-31-60-182.ec2.internal  Pod sandbox changed, it will be killed and re-created.
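The events above can be gathered with the standard `oc` CLI. A minimal sketch, assuming a logged-in `oc` session against the affected cluster and the default `logging` namespace used by OpenShift aggregated logging:

```shell
#!/bin/sh
# Hypothetical inspection of the failing pod; assumes the `oc` CLI is
# installed and logged in to the affected cluster, and that aggregated
# logging runs in the `logging` namespace.
if command -v oc >/dev/null 2>&1; then
    # Show the events section for the failing fluentd pod, as above.
    oc describe pod logging-fluentd-kmttf -n logging | grep -A 5 '^Events:'
fi
```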

Version-Release number of selected component (if applicable):

OpenShift Master: v3.8.30 (online version 3.6.0.83)
Kubernetes Master: v1.8.1+0d5291c

How reproducible:
Always

Steps to Reproduce:
1. Check logging pods' status
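Step 1 can be done with `oc`. A minimal sketch, assuming a logged-in `oc` session and the `logging` namespace:

```shell
#!/bin/sh
# Hypothetical status check; assumes the `oc` CLI and cluster access.
if command -v oc >/dev/null 2>&1; then
    # List logging pods with node placement; when the bug reproduces,
    # fluentd pods on master nodes show STATUS "Error".
    oc get pods -n logging -o wide
fi
```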

Actual results:
fluentd pods are in Error status on master nodes 

Expected results:
fluentd pods should be in Running status on master nodes

Additional info:

Comment 1 Dan Williams 2018-01-15 19:36:49 UTC
Can we get logs from the node on which this pod failed? The "pod sandbox changed; it will be killed and re-created" message is normal and can happen for a number of reasons. Node logs would be necessary to track it down further. Thanks!
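For reference, node logs of this kind are typically pulled from the node's journal after ssh-ing to the host. A hedged sketch; the unit name `atomic-openshift-node` matches typical OpenShift 3.x installs but may differ per deployment:

```shell
#!/bin/sh
# Hypothetical node-log collection (run on the affected node);
# the service unit name is an assumption for OpenShift 3.x installs.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u atomic-openshift-node --since "1 hour ago" --no-pager \
        | grep -i 'sandbox' || true
fi
```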

Comment 2 Dan Williams 2018-01-15 20:44:14 UTC
Logged in and got details.  I think this is another instance of https://github.com/openshift/ose/pull/999 due to the "could not retrieve port mappings: checkpoint is not found." in the logs.

Comment 3 Justin Pierce 2018-01-15 21:30:27 UTC
v3.8.31 will include this fix.

Comment 4 Justin Pierce 2018-01-15 21:31:45 UTC
https://github.com/openshift/ose/pull/998

Comment 5 Junqi Zhao 2018-01-16 00:52:28 UTC
Clearing the needinfo request since Dan has the details.

