Bug 1534419

Summary: [free-int] fluentd pods are in Error status on master nodes
Product: OpenShift Online
Component: Logging
Version: 3.x
Target Release: 3.x
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Junqi Zhao <juzhao>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Junqi Zhao <juzhao>
CC: aos-bugs, dcbw, jupierce, juzhao
Keywords: OnlineStarter
Target Milestone: ---
Last Closed: 2018-05-11 14:30:58 UTC
Type: Bug
Attachments:
  fluentd log

Description Junqi Zhao 2018-01-15 08:23:14 UTC
Created attachment 1381297 [details]
fluentd log

Description of problem:
On the free-int cluster, fluentd pods are in Error status on the master nodes. The fluentd pod log is attached; it contains the error
"primary shard is not active Timeout".

NAME                                      READY     STATUS    RESTARTS   AGE       IP             NODE
logging-curator-2-99xk9                   1/1       Running   6          40d       10.131.0.23    Node
logging-es-data-master-3g3ip92b-4-qdrmz   2/2       Running   4          25d       10.129.4.16    Node
logging-es-data-master-gnknl4vw-4-d289n   2/2       Running   4          40d       10.131.0.24    Node
logging-es-data-master-ovck58c4-4-gw5dl   2/2       Running   4          25d       10.129.4.15    Node
logging-fluentd-cczft                     1/1       Running   2          40d       10.131.2.124   Node
logging-fluentd-d9r76                     1/1       Running   2          40d       10.129.2.117   Node
logging-fluentd-frgfp                     1/1       Running   6          40d       10.130.2.63    Node
logging-fluentd-gjb9k                     1/1       Running   2          40d       10.129.0.51    Node
logging-fluentd-kmttf                     0/1       Error     4          40d       <none>         Master
logging-fluentd-lqd99                     0/1       Error     5          40d       <none>         Master
logging-fluentd-n4g6n                     1/1       Running   2          40d       10.129.4.18    Node
logging-fluentd-v8pqv                     1/1       Running   2          40d       10.131.0.27    Master
logging-fluentd-vfgch                     0/1       Error     7          40d       <none>         Node
logging-fluentd-vnt4t                     1/1       Running   2          40d       10.128.2.209   Node
logging-kibana-3-m4g6p                    2/2       Running   4          25d       10.131.0.25    Node
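
For reference, the listing above and the attached fluentd log can presumably be reproduced with the standard CLI, assuming the default 3.x aggregated-logging project name "logging" (the project name is not stated in this report):

  # list the logging pods together with their node placement
  oc get pods -o wide -n logging
  # dump the log of one of the failing fluentd pods
  oc logs logging-fluentd-kmttf -n logging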

Taking logging-fluentd-kmttf as an example, the pod's describe output shows the following events:

Events:
  Type    Reason          Age                   From                                    Message
  ----    ------          ----                  ----                                    -------
  Normal  SandboxChanged  19s (x27596 over 4d)  kubelet, ip-172-31-60-182.ec2.internal  Pod sandbox changed, it will be killed and re-created.
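
The event stream above can presumably be pulled with a plain describe (again assuming the default "logging" project):

  # show status and events for the failing fluentd pod
  oc describe pod logging-fluentd-kmttf -n logging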

Version-Release number of selected component (if applicable):

OpenShift Master: v3.8.30 (online version 3.6.0.83)
Kubernetes Master: v1.8.1+0d5291c

How reproducible:
Always

Steps to Reproduce:
1. Check the status of the logging pods

Actual results:
The fluentd pods on the master nodes are in Error status.

Expected results:
The fluentd pods should run normally on the master nodes.

Additional info:

Comment 1 Dan Williams 2018-01-15 19:36:49 UTC
Can we get logs from the node on which this pod failed?  The "pod sandbox changed; it will be killed and re-created" message is normal and can happen for a number of reasons.  Node logs would be necessary to track it down further.  Thanks!
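
A minimal sketch of collecting those node logs, assuming a systemd-based 3.x install where the node and container runtime units are named atomic-openshift-node and docker (the unit names are an assumption, not confirmed in this report):

  # run on the affected master, e.g. ip-172-31-60-182.ec2.internal
  journalctl -u atomic-openshift-node --since "2 days ago" > node.log
  journalctl -u docker --since "2 days ago" > docker.log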

Comment 2 Dan Williams 2018-01-15 20:44:14 UTC
Logged in and got details.  I think this is another instance of https://github.com/openshift/ose/pull/999, due to the "could not retrieve port mappings: checkpoint is not found." message in the logs.
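
To check whether a node is hitting the same error, the node journal can presumably be searched for that message (same unit-name assumption as above):

  # search the node journal for the port-mapping checkpoint error
  journalctl -u atomic-openshift-node | grep -F "could not retrieve port mappings: checkpoint is not found"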

Comment 3 Justin Pierce 2018-01-15 21:30:27 UTC
v3.8.31 will include this fix.

Comment 4 Justin Pierce 2018-01-15 21:31:45 UTC
https://github.com/openshift/ose/pull/998

Comment 5 Junqi Zhao 2018-01-16 00:52:28 UTC
Clearing the needinfo request since Dan has already gotten the details.