Created attachment 1591662 [details] connections in rsyslog pod Description of problem: Set ES_REBIND_INTERVAL=2 in rsyslog env, then observe the status of rsyslog pods, the rsyslog pod always restart. Check the connections in rsyslog pod, there are many connections. $ oc get pod -owide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES cluster-logging-operator-5d885c5df4-bbjfk 1/1 Running 0 23h 10.128.2.53 ip-10-0-163-27.eu-west-2.compute.internal <none> <none> curator-1563414000-pxm8m 0/1 Completed 0 3m46s 10.128.2.111 ip-10-0-163-27.eu-west-2.compute.internal <none> <none> elasticsearch-cdm-aoi3n0cd-1-6b8b844d95-5lc8l 2/2 Running 0 16h 10.131.0.65 ip-10-0-153-249.eu-west-2.compute.internal <none> <none> elasticsearch-cdm-aoi3n0cd-2-7d6cbc9584-s6hbd 2/2 Running 0 16h 10.128.2.79 ip-10-0-163-27.eu-west-2.compute.internal <none> <none> kibana-5cbd5cc9c9-9gjrc 2/2 Running 0 23h 10.129.2.49 ip-10-0-136-175.eu-west-2.compute.internal <none> <none> rsyslog-5lkwz 2/2 Running 25 16h 10.129.2.86 ip-10-0-136-175.eu-west-2.compute.internal <none> <none> rsyslog-6sh92 2/2 Running 26 16h 10.128.0.70 ip-10-0-164-63.eu-west-2.compute.internal <none> <none> rsyslog-8mz2j 2/2 Running 15 16h 10.131.0.69 ip-10-0-153-249.eu-west-2.compute.internal <none> <none> rsyslog-lq9tj 2/2 Running 21 16h 10.130.0.72 ip-10-0-128-210.eu-west-2.compute.internal <none> <none> rsyslog-mndsk 2/2 Running 12 16h 10.128.2.82 ip-10-0-163-27.eu-west-2.compute.internal <none> <none> rsyslog-xfqhk 2/2 Running 17 16h 10.129.0.62 ip-10-0-154-151.eu-west-2.compute.internal <none> <none> $ oc get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE elasticsearch ClusterIP 172.30.137.32 <none> 9200/TCP 23h $ oc exec rsyslog-5lkwz env |grep -i es_rebind Defaulting container name to rsyslog. Use 'oc describe pod/rsyslog-5lkwz -n openshift-logging' to see all of the containers in this pod. ES_REBIND_INTERVAL=2 $ oc exec rsyslog-5lkwz -- lsof -i -P -n |grep ESTABLISHED Defaulting container name to rsyslog. Use 'oc describe pod/rsyslog-5lkwz -n openshift-logging' to see all of the containers in this pod. rsyslogd 82154 root 18u IPv4 33084936 0t0 TCP 10.129.2.86:45720->172.30.137.32:9200 (ESTABLISHED) rsyslogd 82154 root 20u IPv4 33037790 0t0 TCP 10.129.2.86:41614->172.30.137.32:9200 (ESTABLISHED) rsyslogd 82154 root 54u IPv4 33037814 0t0 TCP 10.129.2.86:34214->172.30.0.1:443 (ESTABLISHED) rsyslogd 82154 root 55u IPv4 33036156 0t0 TCP 10.129.2.86:41630->172.30.137.32:9200 (ESTABLISHED) rsyslogd 82154 root 56u IPv4 33034987 0t0 TCP 10.129.2.86:41660->172.30.137.32:9200 (ESTABLISHED) rsyslogd 82154 root 57u IPv4 33038674 0t0 TCP 10.129.2.86:41686->172.30.137.32:9200 (ESTABLISHED) ...... rsyslogd 82154 root 2420u IPv4 33235286 0t0 TCP 10.129.2.86:59582->172.30.137.32:9200 (ESTABLISHED) rsyslogd 82154 root 2421u IPv4 33236350 0t0 TCP 10.129.2.86:59588->172.30.137.32:9200 (ESTABLISHED) rsyslogd 82154 root 2422u IPv4 33235307 0t0 TCP 10.129.2.86:59594->172.30.137.32:9200 (ESTABLISHED) rsyslog_e 82207 root 5u IPv6 33043769 0t0 TCP 10.129.2.86:24231->10.129.2.7:40972 (ESTABLISHED) rsyslog_e 82207 root 6u IPv6 33043888 0t0 TCP 10.129.2.86:24231->10.128.2.7:55054 (ESTABLISHED) Version-Release number of selected component (if applicable): ose-cluster-logging-operator-v4.2.0-201907161019 ose-logging-rsyslog-v4.2.0-201907141446 How reproducible: Always Steps to Reproduce: 1.Deploy logging using rsyslog as log collector 2.oc set env ds/rsyslog ES_REBIND_INTERVAL=2 3.wait for a few minutes, check pod status and connections in rsyslog pod Actual results: Expected results: Rsyslog pod won't restart frequently Additional info:
I could not reproduce the problem with my local build from the master branch with ES_REBIND_INTERVAL=2 or even 1. $ oc exec $rsyslog_pod -c rsyslog -- env | egrep ES_REBIND_INTERVAL ES_REBIND_INTERVAL=2 $ oc get pods -l component=rsyslog NAME READY STATUS RESTARTS AGE rsyslog-4bpfs 2/2 Running 0 21m rsyslog-5hkz4 2/2 Running 0 20m rsyslog-ct77m 2/2 Running 0 20m rsyslog-hm6dq 2/2 Running 0 21m rsyslog-lt45x 2/2 Running 0 20m rsyslog-rgwvd 2/2 Running 0 19m $ oc exec rsyslog-4bpfs -c rsyslog -- lsof -i -P -n |grep ESTABLISHED rsyslogd 11 root 17u IPv4 1210629 0t0 TCP 10.129.2.14:44796->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 19u IPv4 1168709 0t0 TCP 10.129.2.14:41474->172.30.0.1:443 (ESTABLISHED) rsyslogd 11 root 58u IPv4 1174673 0t0 TCP 10.129.2.14:41864->172.30.0.1:443 (ESTABLISHED) rsyslogd 11 root 60u IPv4 1210631 0t0 TCP 10.129.2.14:44798->172.30.10.184:9200 (ESTABLISHED) When I set ES_REBIND_INTERVAL to 0, I could observe increasing connections, but not restarting the rsyslog pods as follows. $ oc exec $rsyslog_pod -c rsyslog -- env | egrep ES_REBIND_INTERVAL ES_REBIND_INTERVAL=0 rsyslog-4qhn9 2/2 Running 0 7m6s rsyslog-b6wmv 2/2 Running 0 6m45s rsyslog-grvcc 2/2 Running 0 7m24s rsyslog-j8hgr 2/2 Running 0 8m26s rsyslog-t5p2r 2/2 Running 0 7m49s rsyslog-thwll 2/2 Running 0 8m5s $ oc exec rsyslog-4qhn9 -c rsyslog -- lsof -i -P -n |grep ESTABLISHED rsyslogd 11 root 42u IPv4 1340706 0t0 TCP 10.129.2.16:52414->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 55u IPv4 1339902 0t0 TCP 10.129.2.16:47800->172.30.0.1:443 (ESTABLISHED) rsyslogd 11 root 59u IPv4 1348684 0t0 TCP 10.129.2.16:52742->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 60u IPv4 1355722 0t0 TCP 10.129.2.16:53066->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 61u IPv4 1363086 0t0 TCP 10.129.2.16:53394->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 62u IPv4 1369401 0t0 TCP 10.129.2.16:53726->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 63u IPv4 1375033 0t0 TCP 10.129.2.16:54054->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 64u IPv4 1382677 0t0 TCP 10.129.2.16:54392->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 65u IPv4 1389661 0t0 TCP 10.129.2.16:54718->172.30.10.184:9200 (ESTABLISHED) rsyslogd 11 root 67u IPv4 1395577 0t0 TCP 10.129.2.16:55036->172.30.10.184:9200 (ESTABLISHED) rsyslog_e 436 root 5u IPv6 1384622 0t0 TCP 10.129.2.16:24231->10.129.2.8:33832 (ESTABLISHED) rsyslog_e 436 root 6u IPv6 1386367 0t0 TCP 10.129.2.16:24231->10.128.2.11:49554 (ESTABLISHED) I'd think setting ES_REBIND_INTERVAL to 0 does not make much sense, which means "force a new connection" every time by calling "curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1)". I wonder can we have a minimum value for ES_REBIND_INTERVAL to prevent this kind of misconfiguration, possibly in rsyslog.sh? What would be the appropriate minimum value if we do. Please note that rsyslog.sh sets it to 200, by default.
> I'd think setting ES_REBIND_INTERVAL to 0 does not make much sense, which means "force a new connection" every time by calling "curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1)". There may be cases where you want to force rsyslog to open a new connection every time. I don't think we should make it impossible. Perhaps just let them off with a warning. The problem is that I have no idea what would be an appropriate minimum. I suppose we could guess, but we would probably be wrong.
(In reply to Rich Megginson from comment #2) > > I'd think setting ES_REBIND_INTERVAL to 0 does not make much sense, which means "force a new connection" every time by calling "curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1)". > > There may be cases where you want to force rsyslog to open a new connection > every time. I don't think we should make it impossible. Perhaps just let > them off with a warning. The problem is that I have no idea what would be > an appropriate minimum. I suppose we could guess, but we would probably be > wrong. All right... How about the piled up connections? Now I'm confused that this is benign or a real problem... If we set CURLOPT_FORBID_REUSE, to the curl handler - [1], the count of established connections does not get increased even if setting "ES_REBIND_INTERVAL=0". But I'm not sure how the performance would be impacted. Both man pages of CURLOPT_FRESH_CONNECT and CURLOPT_FORBID_REUSE note "This option should be used with caution and only if you understand what it does as it may seriously impact performance"... COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME rsyslogd 11 root 20u IPv4 5629682 0t0 TCP 10.131.0.31:42800->172.30.128.45:9200 (ESTABLISHED) rsyslogd 11 root 50u IPv4 5609788 0t0 TCP 10.131.0.31:54806->172.30.0.1:443 (ESTABLISHED) rsyslogd 11 root 53u IPv4 5714871 0t0 TCP 10.131.0.31:50740->172.30.128.45:9200 (ESTABLISHED) [1] diff --git a/rsyslog/vendored_src/rsyslog/rsyslog/plugins/omelasticsearch/omelasticsearch.c b/rsyslog/vendored_src/rsyslog/rsyslog/plugins/omelasticsearch/omelasticsearch.c index 3ca8a017..659ad32d 100644 --- a/rsyslog/vendored_src/rsyslog/rsyslog/plugins/omelasticsearch/omelasticsearch.c +++ b/rsyslog/vendored_src/rsyslog/rsyslog/plugins/omelasticsearch/omelasticsearch.c @@ -1475,12 +1475,14 @@ curlPost(wrkrInstanceData_t *pWrkrData, uchar *message, int msglen, uchar **tpls if ((pWrkrData->pData->rebindInterval > -1) && (pWrkrData->nOperations > pWrkrData->pData->rebindInterval)) { - curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1); + curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1L); + curl_easy_setopt(curl, CURLOPT_FORBID_REUSE, 1L); pWrkrData->nOperations = 0; STATSCOUNTER_INC(rebinds, mutRebinds); } else { /* by default, reuse existing connections */ curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 0); + curl_easy_setopt(curl, CURLOPT_FORBID_REUSE, 0); } if(pWrkrData->pData->numServers > 1) { Again, I still cannot reproduce the pods' frequent restarts. rsyslog-29dz8 2/2 Running 0 19m rsyslog-2g5hp 2/2 Running 0 19m rsyslog-dl7j4 2/2 Running 0 19m rsyslog-mz2m2 2/2 Running 0 18m rsyslog-rrhv2 2/2 Running 0 18m rsyslog-t5l47 2/2 Running 0 18m
Maybe we should investigate exactly what happens in libcurl when you set + curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1L); and you perform another operation. Does it call close() on the socket? shutdown()? If so, then what I would expect is that there would be a number of sockets in the TIME_WAIT state, which would be cleaned up after the TCP ttl expiration. I would not expect sockets to still be in the ESTABLISHED state after calling close() or shutdown(). I see this in the libcurl code for rhel7.7 (rhpkg co curl; cd curl; rhpkg prep; cd curl-7.29.0/lib/) - url.c: reuse_fresh is set by CURLOPT_FRESH_CONNECT /* reuse_fresh is TRUE if we are told to use a new connection by force, but we only acknowledge this option if this is not a re-used connection already (which happens due to follow-location or during a HTTP authentication phase). */ if(data->set.reuse_fresh && !data->state.this_is_a_follow) reuse = FALSE; else reuse = ConnectionExists(data, conn, &conn_temp); It's hard to tell which code paths cause libcurl to call close() - there is a lot of caching and connection optimization, as well as platform independent wrappers. It may be that we also need to set CURLOPT_FORBID_REUSE to force the connection to close() sooner rather than later.
I suppose you could step through rsyslog in the debugger.
Verified with ose-logging-rsyslog-v4.2.0-201909011300
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922