Bug 1468845

Summary: logging-fluentd pod segfaulting and restarting in some nodes - configured for logging-mux secure forward
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: LoggingAssignee: Rich Megginson <rmeggins>
Status: CLOSED WORKSFORME QA Contact: Mike Fiedler <mifiedle>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.6.0CC: aos-bugs, jcantril, mifiedle, nhosoi, pportant, rmeggins, xtian
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-07-18 16:23:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Mike Fiedler 2017-07-08 20:50:55 UTC
Description of problem:

- cluster is the 3.6 scalability cluster (more details below).  3 masters, 3 infra nodes where the router, registry, elasticsearch, kibana, logging-mux, cassandra, heapster and hawkular-metrics pods are running.   The infra nodes have 40 vCPU and 110GB memory.   See below for everything running on infra.


- Elastic search cluster (3 pod cluster) and logging-mux (3 replicas) up and running successfully
- tagged a single node with logging-fluentd-infra=true and verified it was working ok (no errors in log, indices created in elasticsearch ok).   Verified with ss that fluentd was using logging-mux service
- started tagging additional nodes to run fluentd.
- Some of the nodes are seeing seg faults, restarts and dead connections to the mux service (full pod log and node syslog attached)

Not all pods have this issue - unclear right now what is different about the ones that have the problem - investigating)


/usr/share/ruby/openssl/buffering.rb:319: [BUG] Segmentation fault
ruby 2.0.0p648 (2015-12-16) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0013 p:---- s:0072 e:000071 CFUNC  :syswrite
c:0012 p:0164 s:0068 e:000067 METHOD /usr/share/ruby/openssl/buffering.rb:319
c:0011 p:0009 s:0059 e:000058 METHOD /usr/share/ruby/openssl/buffering.rb:337
c:0010 p:0011 s:0055 e:000054 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/plugin/buf_memory.rb:61
c:0009 p:0068 s:0051 e:000050 METHOD /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:300
c:0008 p:0075 s:0043 e:000042 METHOD /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:264
c:0007 p:0027 s:0036 e:000035 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:490
c:0006 p:0010 s:0032 e:000031 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:354
c:0005 p:0046 s:0027 e:000026 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:333
c:0004 p:0170 s:0021 e:000020 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:342
c:0003 p:0063 s:0009 e:000008 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:149 [FINISH]
c:0002 p:---- s:0004 e:000003 IFUNC
c:0001 p:---- s:0002 e:000001 TOP    [FINISH]
-- Ruby level backtrace information ----------------------------------------
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:149:in `run'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:342:in `try_flush'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:333:in `pop'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:354:in `write_chunk'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:490:in `write'
/usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:264:in `write_objects'
/usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:300:in `send_data'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/plugin/buf_memory.rb:61:in `write_to'
/usr/share/ruby/openssl/buffering.rb:337:in `write'
/usr/share/ruby/openssl/buffering.rb:319:in `do_write'
/usr/share/ruby/openssl/buffering.rb:319:in `syswrite'

Along with:
                                                                                                  
2017-07-08 16:16:49 -0400 [error]: fluentd main process died unexpectedly. restarting.
2017-07-08 16:17:44 -0400 [error]: fluentd main process died unexpectedly. restarting.

and


2017-07-08 16:04:14 -0400 [warn]: dead connection found: logging-mux, reconnecting...
2017-07-08 16:17:08 -0400 [warn]: dead connection found: logging-mux, reconnecting...
2017-07-08 16:17:13 -0400 [warn]: recovered connection to dead node: logging-mux
2017-07-08 16:17:28 -0400 [warn]: dead connection found: logging-mux, reconnecting...
2017-07-08 16:17:33 -0400 [warn]: recovered connection to dead node: logging-mux



Version-Release number of selected component (if applicable):  Logging 3.6.126.1 containers


How reproducible: Unknown


Infra pods running in scalability cluster:

root@svt-j-1: ~/mffiedler/logs # oc get pods --all-namespaces  -o wide| grep -E "logging|default|openshift-infra|cns"                                                        
cnscluster        glusterfs-rfp4j                            1/1       Running   0          3d        172.16.0.6     172.16.0.6
cnscluster        glusterfs-s2x9t                            1/1       Running   0          3d        172.16.0.23    172.16.0.23
cnscluster        glusterfs-zctj2                            1/1       Running   0          3d        172.16.0.22    172.16.0.22
cnscluster        heketi-1-dzc94                             1/1       Running   1          1d        10.131.0.221   172.16.0.5
default           docker-registry-1-hm07q                    1/1       Running   1          1d        10.133.0.231   172.16.0.7
default           router-1-4t8x1                             1/1       Running   1          1d        172.16.0.5     172.16.0.5
default           router-1-bmzzc                             1/1       Running   0          1d        172.16.0.18    172.16.0.18
default           router-1-fqm25                             1/1       Running   1          1d        172.16.0.7     172.16.0.7
logging           logging-curator-4-c23k2                    1/1       Running   1          1d        10.133.0.230   172.16.0.7
logging           logging-es-data-master-qrvpgxna-10-ldzn4   1/1       Running   0          1h        10.131.0.185   172.16.0.5
logging           logging-es-data-master-xocpf85i-10-xh4vv   1/1       Running   1          1h        10.133.0.228   172.16.0.7
logging           logging-es-data-master-z0pje5ge-10-88hgb   1/1       Running   0          1h        10.132.0.110   172.16.0.18
logging           logging-fluentd-4c4bp                      1/1       Running   0          13m       10.129.0.13    172.16.0.8
logging           logging-fluentd-h7qxj                      1/1       Running   0          13m       10.128.0.15    172.16.0.16
logging           logging-fluentd-jppv8                      1/1       Running   0          13m       10.130.0.12    172.16.0.15
logging           logging-fluentd-mzx8l                      1/1       Running   0          32m       10.132.0.112   172.16.0.18
logging           logging-fluentd-rlgg4                      1/1       Running   0          32m       10.131.0.187   172.16.0.5
logging           logging-fluentd-xfwl0                      1/1       Running   0          37m       10.133.0.234   172.16.0.7
logging           logging-kibana-1-1799n                     2/2       Running   20         1d        10.133.0.232   172.16.0.7
logging           logging-mux-4-53rwx                        1/1       Running   0          40m       10.132.0.111   172.16.0.18
logging           logging-mux-4-7f9hn                        1/1       Running   0          12m       10.131.0.188   172.16.0.5
logging           logging-mux-4-nhw0p                        1/1       Running   0          12m       10.133.0.235   172.16.0.7
openshift-infra   hawkular-cassandra-1-5gccz                 1/1       Running   1          1d        10.133.0.229   172.16.0.7
openshift-infra   hawkular-metrics-mtnq3                     1/1       Running   17         1d        10.131.0.231   172.16.0.5
openshift-infra   heapster-bb32t                             1/1       Running   196        1d        10.132.0.54    172.16.0.18

Comment 2 Rich Megginson 2017-07-10 22:52:11 UTC
Is this a test blocker?  If so, please raise the priority and/or severity.

Comment 3 Rich Megginson 2017-07-11 19:52:31 UTC
I've recently updated the fluentd image to have a lot of new packages, among them new versions of fluentd itself.  openshift3/logging-fluentd:v3.6.140-2 - please try the new image

Also, do you see the core dumps correlated with either fluentd running OOM, or unable to connect to mux?

Comment 4 Mike Fiedler 2017-07-11 20:25:31 UTC
I've moved up to 3.6.140-1 - does it have the new fluentd?   140-2 is not yet available on the mirror.

Comment 5 Rich Megginson 2017-07-11 20:30:44 UTC
(In reply to Mike Fiedler from comment #4)
> I've moved up to 3.6.140-1 - does it have the new fluentd?   140-2 is not
> yet available on the mirror.

I built v3.6.140-2 this morning - not sure how long it takes to show up in the mirrors.  You definitely need v3.6.140-2 - not sure if this image is also tagged v3.6.140 - I suppose you could docker pull both and see if they are the same.

Comment 6 Mike Fiedler 2017-07-11 22:37:16 UTC
I think you need to rebuild the logging-mux pod with the new fluentd too.

Comment 7 Rich Megginson 2017-07-11 22:41:04 UTC
(In reply to Mike Fiedler from comment #6)
> I think you need to rebuild the logging-mux pod with the new fluentd too.

How?  The mux pod and the fluentd daemonset use the same fluentd image.

Comment 8 Mike Fiedler 2017-07-11 22:42:04 UTC
Disregard comment 6 - it uses logging-fluentd in the dc

When I start logging-mux with the new logging-fluentd container image I get:

oc logs logging-mux-11-kktns                                                │Status: Downloaded newer image for brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging
run.sh: line 108: bc: command not found

Comment 9 Mike Fiedler 2017-07-11 22:42:37 UTC
The logging-mux pod goes into CrashLoopBackoff

Comment 10 Noriko Hosoi 2017-07-11 22:49:45 UTC
(In reply to Mike Fiedler from comment #9)
> The logging-mux pod goes into CrashLoopBackoff

The origin Dockefile has this addition to "yum install" list.  I'd think it's needed for the 3.6 build...
--- a/fluentd/Dockerfile
+++ b/fluentd/Dockerfile
@@ -23,6 +23,7 @@ RUN rpmkeys --import file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7 && \
       ruby-devel \
       libcurl-devel \
       make \
+      bc \

Comment 11 Rich Megginson 2017-07-11 22:51:51 UTC
yep - rebuilding

Comment 12 Rich Megginson 2017-07-11 23:06:16 UTC
done - try v3.6.140-3

Comment 13 Rich Megginson 2017-07-13 19:52:53 UTC
Mike, is this still happening with the latest image?  Can we close this?

Comment 14 Mike Fiedler 2017-07-18 16:23:35 UTC
Did not see this in later runs.   Changing status to CLOSED->WORKSFORME since no fix delivered.