Created attachment 1381297 [details]
fluentd log

Description of problem:
On the free-int cluster, fluentd pods are in Error status on master nodes. The attached fluentd pod logs contain the error "primary shard is not active Timeout".

NAME                                      READY     STATUS    RESTARTS   AGE       IP             NODE
logging-curator-2-99xk9                   1/1       Running   6          40d       10.131.0.23    Node
logging-es-data-master-3g3ip92b-4-qdrmz   2/2       Running   4          25d       10.129.4.16    Node
logging-es-data-master-gnknl4vw-4-d289n   2/2       Running   4          40d       10.131.0.24    Node
logging-es-data-master-ovck58c4-4-gw5dl   2/2       Running   4          25d       10.129.4.15    Node
logging-fluentd-cczft                     1/1       Running   2          40d       10.131.2.124   Node
logging-fluentd-d9r76                     1/1       Running   2          40d       10.129.2.117   Node
logging-fluentd-frgfp                     1/1       Running   6          40d       10.130.2.63    Node
logging-fluentd-gjb9k                     1/1       Running   2          40d       10.129.0.51    Node
logging-fluentd-kmttf                     0/1       Error     4          40d       <none>         Master
logging-fluentd-lqd99                     0/1       Error     5          40d       <none>         Master
logging-fluentd-n4g6n                     1/1       Running   2          40d       10.129.4.18    Node
logging-fluentd-v8pqv                     1/1       Running   2          40d       10.131.0.27    Master
logging-fluentd-vfgch                     0/1       Error     7          40d       <none>         Node
logging-fluentd-vnt4t                     1/1       Running   2          40d       10.128.2.209   Node
logging-kibana-3-m4g6p                    2/2       Running   4          25d       10.131.0.25    Node

Taking logging-fluentd-kmttf as an example, the pod's describe output shows:

Events:
  Type    Reason          Age                   From                                     Message
  ----    ------          ----                  ----                                     -------
  Normal  SandboxChanged  19s (x27596 over 4d)  kubelet, ip-172-31-60-182.ec2.internal   Pod sandbox changed, it will be killed and re-created.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.8.30 (online version 3.6.0.83)
Kubernetes Master: v1.8.1+0d5291c

How reproducible:
Always

Steps to Reproduce:
1. Check the logging pods' status

Actual results:
fluentd pods are in Error status on master nodes.

Expected results:
fluentd pods should be running on master nodes.

Additional info:
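For reference, the status above can be gathered with standard oc commands (a minimal sketch; the "logging" project and the pod name are taken from this cluster, substitute your own):

  # List logging pods with node placement ("-o wide" produces the table above)
  $ oc get pods -n logging -o wide
  # Inspect the failing pod's events
  $ oc describe pod logging-fluentd-kmttf -n logging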
Can we get logs from the node on which this pod failed? The "Pod sandbox changed, it will be killed and re-created" message is normal and can happen for a number of reasons. Node logs would be necessary to track it down further. Thanks!
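For example (a sketch; the hostname comes from the event above, and the unit name assumes an RPM-based OCP 3.x node running the atomic-openshift-node service):

  # On the master node hosting the failing pod
  $ ssh ip-172-31-60-182.ec2.internal
  # Capture node/kubelet logs covering the window of restarts
  $ sudo journalctl -u atomic-openshift-node --since "4 days ago" > node.log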
Logged in and got the details. I think this is another instance of https://github.com/openshift/ose/pull/999, based on the "could not retrieve port mappings: checkpoint is not found." error in the logs.
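To check whether the other failing nodes hit the same issue, the node logs can be grepped for that message (same unit-name assumption as above):

  $ sudo journalctl -u atomic-openshift-node | grep "could not retrieve port mappings: checkpoint is not found"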
v3.8.31 will include this fix.
https://github.com/openshift/ose/pull/998
Clearing the needinfo request since Dan has already gotten the details.