Bug 1468845 - logging-fluentd pod segfaulting and restarting in some nodes - configured for logging-mux secure forward
logging-fluentd pod segfaulting and restarting in some nodes - configured for...
Status: CLOSED WORKSFORME
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging (Show other bugs)
3.6.0
x86_64 Linux
unspecified Severity medium
: ---
: ---
Assigned To: Rich Megginson
Mike Fiedler
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-08 16:50 EDT by Mike Fiedler
Modified: 2017-08-16 15 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-18 12:23:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Mike Fiedler 2017-07-08 16:50:55 EDT
Description of problem:

- cluster is the 3.6 scalability cluster (more details below).  3 masters, 3 infra nodes where the router, registry, elasticsearch, kibana, logging-mux, cassandra, heapster and hawkular-metrics pods are running.   The infra nodes have 40 vCPU and 110GB memory.   See below for everything running on infra.


- Elastic search cluster (3 pod cluster) and logging-mux (3 replicas) up and running successfully
- tagged a single node with logging-fluentd-infra=true and verified it was working ok (no errors in log, indices created in elasticsearch ok).   Verified with ss that fluentd was using logging-mux service
- started tagging additional nodes to run fluentd.
- Some of the nodes are seeing seg faults, restarts and dead connections to the mux service (full pod log and node syslog attached)

Not all pods have this issue - unclear right now what is different about the ones that have the problem - investigating)


/usr/share/ruby/openssl/buffering.rb:319: [BUG] Segmentation fault
ruby 2.0.0p648 (2015-12-16) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0013 p:---- s:0072 e:000071 CFUNC  :syswrite
c:0012 p:0164 s:0068 e:000067 METHOD /usr/share/ruby/openssl/buffering.rb:319
c:0011 p:0009 s:0059 e:000058 METHOD /usr/share/ruby/openssl/buffering.rb:337
c:0010 p:0011 s:0055 e:000054 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/plugin/buf_memory.rb:61
c:0009 p:0068 s:0051 e:000050 METHOD /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:300
c:0008 p:0075 s:0043 e:000042 METHOD /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:264
c:0007 p:0027 s:0036 e:000035 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:490
c:0006 p:0010 s:0032 e:000031 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:354
c:0005 p:0046 s:0027 e:000026 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:333
c:0004 p:0170 s:0021 e:000020 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:342
c:0003 p:0063 s:0009 e:000008 METHOD /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:149 [FINISH]
c:0002 p:---- s:0004 e:000003 IFUNC
c:0001 p:---- s:0002 e:000001 TOP    [FINISH]
-- Ruby level backtrace information ----------------------------------------
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:149:in `run'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:342:in `try_flush'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:333:in `pop'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:354:in `write_chunk'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:490:in `write'
/usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:264:in `write_objects'
/usr/share/gems/gems/fluent-plugin-secure-forward-0.4.3/lib/fluent/plugin/out_secure_forward.rb:300:in `send_data'
/usr/share/gems/gems/fluentd-0.12.31/lib/fluent/plugin/buf_memory.rb:61:in `write_to'
/usr/share/ruby/openssl/buffering.rb:337:in `write'
/usr/share/ruby/openssl/buffering.rb:319:in `do_write'
/usr/share/ruby/openssl/buffering.rb:319:in `syswrite'

Along with:
                                                                                                  
2017-07-08 16:16:49 -0400 [error]: fluentd main process died unexpectedly. restarting.
2017-07-08 16:17:44 -0400 [error]: fluentd main process died unexpectedly. restarting.

and


2017-07-08 16:04:14 -0400 [warn]: dead connection found: logging-mux, reconnecting...
2017-07-08 16:17:08 -0400 [warn]: dead connection found: logging-mux, reconnecting...
2017-07-08 16:17:13 -0400 [warn]: recovered connection to dead node: logging-mux
2017-07-08 16:17:28 -0400 [warn]: dead connection found: logging-mux, reconnecting...
2017-07-08 16:17:33 -0400 [warn]: recovered connection to dead node: logging-mux



Version-Release number of selected component (if applicable):  Logging 3.6.126.1 containers


How reproducible: Unknown


Infra pods running in scalability cluster:

root@svt-j-1: ~/mffiedler/logs # oc get pods --all-namespaces  -o wide| grep -E "logging|default|openshift-infra|cns"                                                        
cnscluster        glusterfs-rfp4j                            1/1       Running   0          3d        172.16.0.6     172.16.0.6
cnscluster        glusterfs-s2x9t                            1/1       Running   0          3d        172.16.0.23    172.16.0.23
cnscluster        glusterfs-zctj2                            1/1       Running   0          3d        172.16.0.22    172.16.0.22
cnscluster        heketi-1-dzc94                             1/1       Running   1          1d        10.131.0.221   172.16.0.5
default           docker-registry-1-hm07q                    1/1       Running   1          1d        10.133.0.231   172.16.0.7
default           router-1-4t8x1                             1/1       Running   1          1d        172.16.0.5     172.16.0.5
default           router-1-bmzzc                             1/1       Running   0          1d        172.16.0.18    172.16.0.18
default           router-1-fqm25                             1/1       Running   1          1d        172.16.0.7     172.16.0.7
logging           logging-curator-4-c23k2                    1/1       Running   1          1d        10.133.0.230   172.16.0.7
logging           logging-es-data-master-qrvpgxna-10-ldzn4   1/1       Running   0          1h        10.131.0.185   172.16.0.5
logging           logging-es-data-master-xocpf85i-10-xh4vv   1/1       Running   1          1h        10.133.0.228   172.16.0.7
logging           logging-es-data-master-z0pje5ge-10-88hgb   1/1       Running   0          1h        10.132.0.110   172.16.0.18
logging           logging-fluentd-4c4bp                      1/1       Running   0          13m       10.129.0.13    172.16.0.8
logging           logging-fluentd-h7qxj                      1/1       Running   0          13m       10.128.0.15    172.16.0.16
logging           logging-fluentd-jppv8                      1/1       Running   0          13m       10.130.0.12    172.16.0.15
logging           logging-fluentd-mzx8l                      1/1       Running   0          32m       10.132.0.112   172.16.0.18
logging           logging-fluentd-rlgg4                      1/1       Running   0          32m       10.131.0.187   172.16.0.5
logging           logging-fluentd-xfwl0                      1/1       Running   0          37m       10.133.0.234   172.16.0.7
logging           logging-kibana-1-1799n                     2/2       Running   20         1d        10.133.0.232   172.16.0.7
logging           logging-mux-4-53rwx                        1/1       Running   0          40m       10.132.0.111   172.16.0.18
logging           logging-mux-4-7f9hn                        1/1       Running   0          12m       10.131.0.188   172.16.0.5
logging           logging-mux-4-nhw0p                        1/1       Running   0          12m       10.133.0.235   172.16.0.7
openshift-infra   hawkular-cassandra-1-5gccz                 1/1       Running   1          1d        10.133.0.229   172.16.0.7
openshift-infra   hawkular-metrics-mtnq3                     1/1       Running   17         1d        10.131.0.231   172.16.0.5
openshift-infra   heapster-bb32t                             1/1       Running   196        1d        10.132.0.54    172.16.0.18
Comment 2 Rich Megginson 2017-07-10 18:52:11 EDT
Is this a test blocker?  If so, please raise the priority and/or severity.
Comment 3 Rich Megginson 2017-07-11 15:52:31 EDT
I've recently updated the fluentd image to have a lot of new packages, among them new versions of fluentd itself.  openshift3/logging-fluentd:v3.6.140-2 - please try the new image

Also, do you see the core dumps correlated with either fluentd running OOM, or unable to connect to mux?
Comment 4 Mike Fiedler 2017-07-11 16:25:31 EDT
I've moved up to 3.6.140-1 - does it have the new fluentd?   140-2 is not yet available on the mirror.
Comment 5 Rich Megginson 2017-07-11 16:30:44 EDT
(In reply to Mike Fiedler from comment #4)
> I've moved up to 3.6.140-1 - does it have the new fluentd?   140-2 is not
> yet available on the mirror.

I built v3.6.140-2 this morning - not sure how long it takes to show up in the mirrors.  You definitely need v3.6.140-2 - not sure if this image is also tagged v3.6.140 - I suppose you could docker pull both and see if they are the same.
Comment 6 Mike Fiedler 2017-07-11 18:37:16 EDT
I think you need to rebuild the logging-mux pod with the new fluentd too.
Comment 7 Rich Megginson 2017-07-11 18:41:04 EDT
(In reply to Mike Fiedler from comment #6)
> I think you need to rebuild the logging-mux pod with the new fluentd too.

How?  The mux pod and the fluentd daemonset use the same fluentd image.
Comment 8 Mike Fiedler 2017-07-11 18:42:04 EDT
Disregard comment 6 - it uses logging-fluentd in the dc

When I start logging-mux with the new logging-fluentd container image I get:

oc logs logging-mux-11-kktns                                                │Status: Downloaded newer image for brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging
run.sh: line 108: bc: command not found
Comment 9 Mike Fiedler 2017-07-11 18:42:37 EDT
The logging-mux pod goes into CrashLoopBackoff
Comment 10 Noriko Hosoi 2017-07-11 18:49:45 EDT
(In reply to Mike Fiedler from comment #9)
> The logging-mux pod goes into CrashLoopBackoff

The origin Dockefile has this addition to "yum install" list.  I'd think it's needed for the 3.6 build...
--- a/fluentd/Dockerfile
+++ b/fluentd/Dockerfile
@@ -23,6 +23,7 @@ RUN rpmkeys --import file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7 && \
       ruby-devel \
       libcurl-devel \
       make \
+      bc \
Comment 11 Rich Megginson 2017-07-11 18:51:51 EDT
yep - rebuilding
Comment 12 Rich Megginson 2017-07-11 19:06:16 EDT
done - try v3.6.140-3
Comment 13 Rich Megginson 2017-07-13 15:52:53 EDT
Mike, is this still happening with the latest image?  Can we close this?
Comment 14 Mike Fiedler 2017-07-18 12:23:35 EDT
Did not see this in later runs.   Changing status to CLOSED->WORKSFORME since no fix delivered.

Note You need to log in before you can comment on or make changes to this bug.