Bug 1534419

Summary: [free-int] fluentd pods are in Error status on master nodes
Product: OpenShift Online
Component: Logging
Version: 3.x
Target Release: 3.x
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Junqi Zhao <juzhao>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Junqi Zhao <juzhao>
CC: aos-bugs, dcbw, jupierce, juzhao
Keywords: OnlineStarter
Target Milestone: ---
Last Closed: 2018-05-11 14:30:58 UTC
Type: Bug
Attachments:
  fluentd log

Description Junqi Zhao 2018-01-15 08:23:14 UTC
Created attachment 1381297 [details]
fluentd log

Description of problem:
On the free-int cluster, fluentd pods are in Error status on the master nodes. The fluentd pod log is attached; it contains the error
"primary shard is not active Timeout".

NAME                                      READY     STATUS    RESTARTS   AGE       IP             NODE
logging-curator-2-99xk9                   1/1       Running   6          40d       10.131.0.23    Node
logging-es-data-master-3g3ip92b-4-qdrmz   2/2       Running   4          25d       10.129.4.16    Node
logging-es-data-master-gnknl4vw-4-d289n   2/2       Running   4          40d       10.131.0.24    Node
logging-es-data-master-ovck58c4-4-gw5dl   2/2       Running   4          25d       10.129.4.15    Node
logging-fluentd-cczft                     1/1       Running   2          40d       10.131.2.124   Node
logging-fluentd-d9r76                     1/1       Running   2          40d       10.129.2.117   Node
logging-fluentd-frgfp                     1/1       Running   6          40d       10.130.2.63    Node
logging-fluentd-gjb9k                     1/1       Running   2          40d       10.129.0.51    Node
logging-fluentd-kmttf                     0/1       Error     4          40d       <none>         Master
logging-fluentd-lqd99                     0/1       Error     5          40d       <none>         Master
logging-fluentd-n4g6n                     1/1       Running   2          40d       10.129.4.18    Node
logging-fluentd-v8pqv                     1/1       Running   2          40d       10.131.0.27    Master
logging-fluentd-vfgch                     0/1       Error     7          40d       <none>         Node
logging-fluentd-vnt4t                     1/1       Running   2          40d       10.128.2.209   Node
logging-kibana-3-m4g6p                    2/2       Running   4          25d       10.131.0.25    Node
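
For reference, the listing above and the attached fluentd log can presumably be reproduced with the standard CLI, assuming the default 3.x aggregated-logging project name "logging" (the project name is not stated in this report):

  # list the logging pods together with their node placement
  oc get pods -o wide -n logging
  # dump the log of one of the failing fluentd pods
  oc logs logging-fluentd-kmttf -n logging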

Taking logging-fluentd-kmttf as an example, the pod's describe output shows the following events:

Events:
  Type    Reason          Age                   From                                    Message
  ----    ------          ----                  ----                                    -------
  Normal  SandboxChanged  19s (x27596 over 4d)  kubelet, ip-172-31-60-182.ec2.internal  Pod sandbox changed, it will be killed and re-created.
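
The event stream above can presumably be pulled with a plain describe (again assuming the default "logging" project):

  # show status and events for the failing fluentd pod
  oc describe pod logging-fluentd-kmttf -n logging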

Version-Release number of selected component (if applicable):

OpenShift Master: v3.8.30 (online version 3.6.0.83)
Kubernetes Master: v1.8.1+0d5291c

How reproducible:
Always

Steps to Reproduce:
1. Check the status of the logging pods

Actual results:
The fluentd pods on the master nodes are in Error status.

Expected results:
The fluentd pods should run normally on the master nodes.

Additional info:

Comment 1 Dan Williams 2018-01-15 19:36:49 UTC
Can we get logs from the node on which this pod failed?  The "pod sandbox changed; it will be killed and re-created" message is normal and can happen for a number of reasons.  Node logs would be necessary to track it down further.  Thanks!
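
A minimal sketch of collecting those node logs, assuming a systemd-based 3.x install where the node and container runtime units are named atomic-openshift-node and docker (the unit names are an assumption, not confirmed in this report):

  # run on the affected master, e.g. ip-172-31-60-182.ec2.internal
  journalctl -u atomic-openshift-node --since "2 days ago" > node.log
  journalctl -u docker --since "2 days ago" > docker.log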

Comment 2 Dan Williams 2018-01-15 20:44:14 UTC
Logged in and got details.  I think this is another instance of https://github.com/openshift/ose/pull/999, due to the "could not retrieve port mappings: checkpoint is not found." message in the logs.
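
To check whether a node is hitting the same error, the node journal can presumably be searched for that message (same unit-name assumption as above):

  # search the node journal for the port-mapping checkpoint error
  journalctl -u atomic-openshift-node | grep -F "could not retrieve port mappings: checkpoint is not found"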

Comment 3 Justin Pierce 2018-01-15 21:30:27 UTC
v3.8.31 will include this fix.

Comment 4 Justin Pierce 2018-01-15 21:31:45 UTC
https://github.com/openshift/ose/pull/998

Comment 5 Junqi Zhao 2018-01-16 00:52:28 UTC
Clearing the needinfo request since Dan has already gotten the details.