Bug 1464024 - [trello 1qpV9jcS] fluentd pods still connecting to master api servers after deploying with openshift_logging_use_mux=true
Summary: [trello 1qpV9jcS] fluentd pods still connecting to master api servers after deploying with openshift_logging_use_mux=true
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Rich Megginson
QA Contact: Xia Zhao
URL:
Whiteboard: aos-scalability-36
Depends On:
Blocks:
 
Reported: 2017-06-22 09:57 UTC by Mike Fiedler
Modified: 2017-10-25 13:02 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd was not removing the kubernetes metadata filter configuration when being used as a mux client.
Consequence: Fluentd was still opening connections to the OpenShift API server.
Fix: Make sure the kubernetes metadata filter configuration file is removed when Fluentd is being used as a mux client.
Result: Fluentd running as a mux client no longer opens connections to the OpenShift API server.
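
For context, the filter being removed here is the fluentd kubernetes metadata plugin, which queries the OpenShift API server to enrich each log record with pod and namespace metadata. A minimal sketch of the kind of stanza involved (the file name, URL variable, and parameter values are illustrative assumptions, not the exact shipped configuration):

# e.g. configs.d/openshift/filter-k8s-meta.conf (assumed name)
<filter kubernetes.**>
  # kubernetes_metadata opens connections to the API server named in
  # kubernetes_url to look up pod/namespace metadata for each record
  @type kubernetes_metadata
  kubernetes_url "#{ENV['K8S_HOST_URL']}"
  verify_ssl true
</filter>

With this file removed on mux clients, fluentd has no remaining component that opens API server connections, which is the Result stated above.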
Clone Of:
Environment:
Last Closed: 2017-10-25 13:02:19 UTC
Target Upstream Version:
Embargoed:


Attachments
logging-fluentd configmaps with and without mux service enabled (1.11 KB, application/x-gzip)
2017-06-22 09:57 UTC, Mike Fiedler
fluentd_daemonset_with_openshift-ansible-playbooks-3.6.133-1.git.0.950bb48.el7.noarch (4.97 KB, text/plain)
2017-07-05 05:42 UTC, Xia Zhao


Links
Github openshift/openshift-ansible pull 4554 (last updated 2017-07-05 13:55:39 UTC)
Red Hat Product Errata RHBA-2017:3049 (SHIPPED_LIVE) - OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update (2017-10-25 15:57:15 UTC)

Description Mike Fiedler 2017-06-22 09:57:26 UTC
Created attachment 1290603 [details]
logging-fluentd configmaps with and without mux service enabled

Description of problem:

HA cluster with 3 masters and 205 nodes

Deployed fluentd without the mux service and observed socket connections (ss -tnpi) to port 8443 on the 3 master API servers. As fluentd started on all nodes, socket connections increased from 75 to 148, which is roughly what is expected when not using the mux service.

Cleaned up the logging project (deleted logging ds, dc, route, svc, configmap, pv, pvc, secrets and sa)

Observed that the socket connections on the master api servers returned to 75.

Deployed logging with the mux service and mux client options enabled:

openshift_logging_use_mux=true
openshift_logging_use_mux_client=true

After the fluentd pods started, the socket connections on each master api server again rose from 75 to 148.  This was not expected - the fluentd connections should have gone to the mux service instead of the API servers.
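
(For reference, the connection counts above came from ss on each master; a minimal way to count sockets on the API port, assuming the default 8443, is something like:

# ss -tnp | grep -c ':8443'

The exact invocation is an assumption; the report used "ss -tnpi" and counted entries.)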

A diff of the logging-fluentd daemonset definitions and configmaps for the "no mux" and "yes mux" configurations shows they are identical except for UUIDs and timestamps. I assume the secure-forward section of the logging-fluentd configmap should contain configuration in the mux case, but it was the same as in the no-mux case.
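
(For comparison, in the mux case one would expect the secure-forward snippet to be populated roughly as follows; the host, port, and cert paths below are assumptions based on the default logging-mux service, not taken from the attached configmaps:

<store>
  @type secure_forward
  self_hostname forwarding-${hostname}
  # CA and shared_key as mounted from the mux client secret (assumed path)
  ca_cert_path /etc/fluent/muxkeys/ca
  shared_key <value from secret>
  secure yes
  <server>
    host logging-mux
    port 24284
  </server>
</store>
)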

Attaching the logging-fluentd configmaps; let me know what other information is required. The full inventory is below.


Version-Release number of selected component (if applicable): 3.6.121


How reproducible: Always


Steps to Reproduce:  See above


Additional info:

[oo_first_master]
192.1.0.8

[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0

openshift_logging_install_logging=true
openshift_logging_use_ops=false
openshift_logging_master_url=https://192.1.0.9:8443
openshift_logging_master_public_url=https://10.19.45.157:8443
openshift_logging_kibana_hostname=kibana.example.com
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.6.121
openshift_logging_es_pvc_dynamic=false
openshift_logging_es_pvc_size=150Gi
openshift_logging_fluentd_use_journal=true
openshift_logging_use_mux=true
openshift_logging_use_mux_client=true
openshift_logging_es_pv_selector=None
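
(With an inventory like the above, logging is deployed via the openshift-ansible logging playbook; a typical invocation, assuming the 3.6 playbook layout:

# ansible-playbook -i <inventory> playbooks/byo/openshift-cluster/openshift-logging.yml
)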

Comment 1 Rich Megginson 2017-06-22 14:59:52 UTC
pretty sure this is just because we haven't updated all of the downstream images yet

Comment 2 Rich Megginson 2017-06-22 15:00:14 UTC
and/or openshift-ansible 3.6 rpm packages

Comment 7 Mike Fiedler 2017-06-28 01:48:54 UTC
Removing TestBlocker since the workaround in comment 4 works OK. Should also be able to clone the openshift-ansible PR branch while the merge is pending. I will try that today.
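
(A sketch of one way to test the pending fix straight from the PR, using GitHub's pull refs; the local branch name here is arbitrary:

# git clone https://github.com/openshift/openshift-ansible.git
# cd openshift-ansible
# git fetch origin pull/4554/head:mux-fix
# git checkout mux-fix
)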

Comment 9 Rich Megginson 2017-06-28 15:21:37 UTC
Merged https://github.com/openshift/openshift-ansible/commit/9613d2e517ced0bc5d165801df3442ab331d214c

Will be fixed downstream when there is a new build of openshift-ansible for 3.6 with this fix

Comment 10 Xia Zhao 2017-07-05 05:41:36 UTC
@mifiedle

The approach in comment #8 didn't work when I tried to verify this issue. Here are the detailed steps; could you help take a look?

1. Have logging + mux deployed in the logging namespace (-n logging):
# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-hn64k                   1/1       Running   0          2h
logging-es-data-master-whsfhhon-1-cww70   1/1       Running   0          2h
logging-fluentd-7dvhq                     1/1       Running   0          2h
logging-fluentd-n74kv                     1/1       Running   0          2h
logging-kibana-1-btk71                    2/2       Running   0          2h
logging-mux-1-v286l                       1/1       Running   0          2h

2. Logged in to the Kibana UI --> log entries are presented fine there

3. Attempted to verify the bug fix; none of the commands gave any output:

[root@host-8-175-70 ~]# oc exec logging-fluentd-7dvhq -- ss -tnpi
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port              
[root@host-8-175-70 ~]# oc exec logging-fluentd-7dvhq -- ss -tnpi
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port              
[root@host-8-175-70 ~]# oc exec logging-fluentd-n74kv -- ss -tnpi
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port              
[root@host-8-175-70 ~]# oc exec logging-mux-1-v286l -- ss -tnpi
Cannot open netlink socket: Permission denied
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port 
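
(Since ss needs netlink access the containers may lack, one possible fallback for spotting API-server connections from inside a pod is to read /proc/net/tcp directly; 20FB is 8443 in hex. This is only a suggestion, not the method used to verify this bug:

# oc exec logging-fluentd-7dvhq -- grep -c ':20FB' /proc/net/tcp
)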

Test env:
# openshift version
openshift v3.6.133
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# rpm -qa | grep ansible
openshift-ansible-callback-plugins-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-docs-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-lookup-plugins-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-filter-plugins-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-playbooks-3.6.133-1.git.0.950bb48.el7.noarch
ansible-2.2.3.0-1.el7.noarch
openshift-ansible-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-roles-3.6.133-1.git.0.950bb48.el7.noarch

@Rich, 
Seems the fix is not in the above ansible packages. I attached the fluentd daemonset; there is no USE_MUX_CLIENT variable in it.

Comment 11 Xia Zhao 2017-07-05 05:42:19 UTC
Created attachment 1294441 [details]
fluentd_daemonset_with_openshift-ansible-playbooks-3.6.133-1.git.0.950bb48.el7.noarch

Comment 12 Mike Fiedler 2017-07-05 14:42:46 UTC
@xiazhao  The PR still hasn't merged.  See https://github.com/openshift/openshift-ansible/pull/4554.   The only way to test mux right now is to use the workaround.

Comment 13 Rich Megginson 2017-07-05 20:50:42 UTC
merged upstream: https://github.com/openshift/openshift-ansible/commit/01f91dfe6257dc1df73df4e12ccd8db899369d27

Awaiting a new downstream package...

Comment 14 Rich Megginson 2017-07-06 13:37:37 UTC
openshift-ansible-3.6.136-1.git.0.ac6bb62.el7

Comment 15 Mike Fiedler 2017-07-06 15:22:38 UTC
Verified on openshift-ansible 3.6.136 rpm install.

Comment 17 errata-xmlrpc 2017-10-25 13:02:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049

