Created attachment 1290603 [details]
logging-fluentd configmaps with and without mux service enabled

Description of problem:

HA cluster with 3 masters and 205 nodes.

Deployed fluentd without the mux service and observed socket connections (ss -tnpi) to port 8443 on the 3 master API servers. As fluentd started on all nodes, socket connections increased from 75 to 148, which is roughly what is expected when not using the mux service.

Cleaned up the logging project (deleted the logging ds, dc, route, svc, configmap, pv, pvc, secrets, and sa) and observed that the socket connections on the master API servers returned to 75.

Deployed logging with the mux service and mux client options enabled:

openshift_logging_use_mux=true
openshift_logging_use_mux_client=true

After the fluentd pods started, the socket connections on each master API server again rose from 75 to 148. This was not expected - the fluentd connections should have gone to the mux service instead of the API servers.

A diff of the logging-fluentd daemonset definitions and configmaps for the "no mux" and "yes mux" configurations shows they are identical except for uuids and timestamps. I assume the secure-forward section of the logging-fluentd configmap should contain configuration in the mux case, but it was the same as in the no mux case.

Attaching the logging-fluentd configmaps; let me know what other information is required. The full inventory is below.

Version-Release number of selected component (if applicable):
3.6.121

How reproducible:
Always

Steps to Reproduce:
See above

Additional info:

[oo_first_master]
192.1.0.8

[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0
openshift_logging_install_logging=true
openshift_logging_use_ops=false
openshift_logging_master_url=https://192.1.0.9:8443
openshift_logging_master_public_url=https://10.19.45.157:8443
openshift_logging_kibana_hostname=kibana.example.com
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.6.121
openshift_logging_es_pvc_dynamic=false
openshift_logging_es_pvc_size=150Gi
openshift_logging_fluentd_use_journal=true
openshift_logging_use_mux=true
openshift_logging_use_mux_client=true
openshift_logging_es_pv_selector=None
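For anyone repeating the measurement, a minimal sketch of how the per-master connection counts above can be gathered (the master hostnames are placeholders; the filter expression is standard ss syntax):

# Count established TCP connections to the API port (8443) on each master.
for h in master1.example.com master2.example.com master3.example.com; do
  printf '%s: ' "$h"
  ssh "$h" "ss -tn state established '( sport = :8443 )' | tail -n +2 | wc -l"
done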
Pretty sure this is just because we haven't updated all of the downstream images yet.
And/or the openshift-ansible 3.6 rpm packages.
Removing TestBlocker since the workaround in comment 4 works OK. Should also be able to clone the openshift-ansible PR branch while the merge is pending. I will try that today.
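For reference, a sketch of checking out the unmerged PR branch (the PR number is the one referenced later in this bug; GitHub's pull/<n>/head refs are standard, but verify the number before use):

# Fetch the open PR head as a local branch and run the playbooks from it.
git clone https://github.com/openshift/openshift-ansible
cd openshift-ansible
git fetch origin pull/4554/head:mux-client-fix
git checkout mux-client-fix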
Merged: https://github.com/openshift/openshift-ansible/commit/9613d2e517ced0bc5d165801df3442ab331d214c

This will be fixed downstream when there is a new build of openshift-ansible for 3.6 that includes this fix.
@mifiedle The approach in comment #8 didn't work when I tried to verify this issue. Here are the detailed steps; could you help take a look?

1. Have logging + mux deployed in -n logging:

# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-hn64k                   1/1       Running   0          2h
logging-es-data-master-whsfhhon-1-cww70   1/1       Running   0          2h
logging-fluentd-7dvhq                     1/1       Running   0          2h
logging-fluentd-n74kv                     1/1       Running   0          2h
logging-kibana-1-btk71                    2/2       Running   0          2h
logging-mux-1-v286l                       1/1       Running   0          2h

2. Log in to the Kibana UI --> log entries are presented fine there.

3. Attempt to verify the bug fix; none of the commands give any output:

[root@host-8-175-70 ~]# oc exec logging-fluentd-7dvhq -- ss -tnpi
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
[root@host-8-175-70 ~]# oc exec logging-fluentd-7dvhq -- ss -tnpi
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
[root@host-8-175-70 ~]# oc exec logging-fluentd-n74kv -- ss -tnpi
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
[root@host-8-175-70 ~]# oc exec logging-mux-1-v286l -- ss -tnpi
Cannot open netlink socket: Permission denied
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port

Test env:

# openshift version
openshift v3.6.133
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# rpm -qa | grep ansible
openshift-ansible-callback-plugins-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-docs-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-lookup-plugins-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-filter-plugins-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-playbooks-3.6.133-1.git.0.950bb48.el7.noarch
ansible-2.2.3.0-1.el7.noarch
openshift-ansible-3.6.133-1.git.0.950bb48.el7.noarch
openshift-ansible-roles-3.6.133-1.git.0.950bb48.el7.noarch

@Rich, it seems the fix is not in the above ansible packages. I attached the fluentd daemonset; there is no USE_MUX_CLIENT in it.
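Since ss inside the pods printed nothing, a few other hedged ways to check whether fluentd is actually wired to mux (pod and service names are taken from the output above; the configmap mount path is the usual one for logging-fluentd and may differ per image):

# 1. Does the daemonset carry the mux client env var?
oc get ds logging-fluentd -n logging -o yaml | grep -i mux

# 2. Is the secure-forward config populated inside a fluentd pod?
oc exec logging-fluentd-7dvhq -- cat /etc/fluent/configs.d/user/secure-forward.conf

# 3. Does the pod hold a TCP connection to the logging-mux service IP?
MUX_IP=$(oc get svc logging-mux -n logging -o jsonpath='{.spec.clusterIP}')
oc exec logging-fluentd-7dvhq -- ss -tn | grep -F "$MUX_IP"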
Created attachment 1294441 [details]
fluentd_daemonset_with_openshift-ansible-playbooks-3.6.133-1.git.0.950bb48.el7.noarch
@xiazhao The PR still hasn't merged. See https://github.com/openshift/openshift-ansible/pull/4554. The only way to test mux right now is to use the workaround.
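The workaround itself is not quoted in this thread, but based on the USE_MUX_CLIENT variable discussed above, a plausible manual sketch would be to set it on the daemonset directly until the ansible fix ships (the variable name and pod label here are assumptions; verify against your version):

# Assumed workaround sketch - set the mux client flag on the ds by hand;
# fluentd pods must be recycled for the daemonset change to take effect.
oc set env ds/logging-fluentd USE_MUX_CLIENT=true -n logging
oc delete pod -l component=fluentd -n logging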
Merged upstream: https://github.com/openshift/openshift-ansible/commit/01f91dfe6257dc1df73df4e12ccd8db899369d27

Awaiting a new downstream package...
The fix is in openshift-ansible-3.6.136-1.git.0.ac6bb62.el7.
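A quick sanity check that the fixed build is installed (expected NVR per the comment above):

rpm -q openshift-ansible
# expect: openshift-ansible-3.6.136-1.git.0.ac6bb62.el7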
Verified on openshift-ansible 3.6.136 rpm install.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049