Description of problem:
Grafana pushes frequent alerts to the webhook (callback) registered by tendrl-monitoring-integration. The webhook is currently served by an HTTP server that is not suitable for production use, so it is being moved to a CherryPy server.

Version-Release number of selected component (if applicable):
The issue was seen with tendrl-monitoring-integration-1.6.3-2.el7rhgs

How reproducible:
70%

Steps to Reproduce:
1. Generate a large number of Grafana alerts for an RHGSWA-managed Gluster cluster

Actual results:
tendrl-monitoring-integration fails to process some of the alerts

Expected results:
tendrl-monitoring-integration should process all alerts posted by Grafana to the callback webhook
Can you please tell us which log file reports the error and provide an example of the errors in that log file?
This issue is fixed
Filip, this is not about an error in a log. I think Rohan means that when too many alerts arrive, the HTTP server misses some of them because it cannot process that many, so monitoring-integration fails to notify the user about the missed ones.
I don't have the test environment anymore. To reproduce this issue:
1) Send a very large number of HTTP POST requests to tendrl-server port 8789
2) Check the tendrl-monitoring-integration error logs, or check the HTTP responses for error codes (500, 404, etc.)
I tried this with the old (tendrl-monitoring-integration-1.6.3-1.el7rhgs) and new (tendrl-monitoring-integration-1.6.3-4.el7rhgs) versions:
1. Open `journalctl -u tendrl-monitoring-integration -fe` on <server>, where <server> is the address of the server with WA and Grafana.
2. Run `for x in {1..10000}; do curl -d "foo=bar" -X POST http://<server>:8789/$x; done` and check the journal from step 1.

Result for both versions:
The journal shows the request numbers. The server usually freezes after every 2000-3000 requests, and some requests are not logged. There seems to be no improvement with the new version. For example, the log entries for requests 3821-4192 are missing from the journal:
```
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3817 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3818 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3819 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3820 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:54 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:54 -0400] "POST /4193 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:54 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:54 -0400] "POST /4194 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:54 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:54 -0400] "POST /4195 HTTP/1.1" 404 233 "-" "curl/7.29.0"
```
--> ASSIGNED
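Finding such gaps by eye is error-prone with 10000 requests. The following is a minimal sketch (Python 3) of how the missing ranges could be computed from saved journal output. It assumes the access-log format shown above, with the request number as the URL path; the name `find_missing_ranges` is hypothetical and not part of tendrl.

```python
import re

# Matches the request number in access-log lines such as
#   ... "POST /3817 HTTP/1.1" 404 233 ...
# (assumes the numbered-path test from the steps above)
REQUEST_RE = re.compile(r'"POST /(\d+) HTTP/')


def find_missing_ranges(log_lines, first, last):
    """Return (start, end) ranges of request numbers absent from the log."""
    seen = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            seen.add(int(match.group(1)))

    missing = []
    start = None  # first number of the gap currently being tracked
    for number in range(first, last + 1):
        if number not in seen:
            if start is None:
                start = number
        elif start is not None:
            missing.append((start, number - 1))
            start = None
    if start is not None:
        missing.append((start, last))
    return missing
```

Passing the lines of the saved journal together with the bounds 1 and 10000 would list every range of requests that never reached the log.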
I used the correct URL, so that the request reaches the intended handler method, and I verified that all 10000 requests are received correctly. Please modify your test as follows:
STEP 1: Modify the webhook receiver code in /usr/lib/python2.7/site-packages/tendrl/monitoring_integration/webhook/webhook_receiver.py as shown in https://paste.ofcode.org/BavgiVFWzWiFbULUx3kWct
STEP 2: Restart the tendrl-monitoring-integration service:
service tendrl-monitoring-integration restart
STEP 3: Use the script below to send the requests:
for x in {1..10000}; do curl -d "foo=bar" -X POST http://{server-id}:8789/grafana_callback; echo $x; done
STEP 4: After all requests are sent, open the result file on the server machine:
vi /root/result
You should see the numbers 1 to 10000 in sequence. Alternatively, stop partway through and compare the number of requests sent with the last number in the result file.
When a request is received, the webhook increments a counter and saves it to the file, so the number of requests sent should equal the number of requests received. Please ping me if you run into any difficulties.
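Instead of scrolling through the result file in vi, the check in STEP 4 could be automated. This is a rough sketch (Python 3), assuming the modified receiver writes one number per line; the actual file layout depends on the code in the paste, which is not reproduced here, and `check_result_file` is a hypothetical helper.

```python
def check_result_file(path, expected_count):
    """Check that the file holds 1..expected_count, one number per line, in order.

    Returns a list of problems; an empty list means every request was counted.
    """
    problems = []
    with open(path) as handle:
        # Ignore blank lines so a trailing newline does not count as an entry.
        values = [line.strip() for line in handle if line.strip()]

    for position, value in enumerate(values, start=1):
        if value != str(position):
            problems.append(
                "line %d: expected %d, found %r" % (position, position, value)
            )
            break  # later lines are shifted as well, so report only the first
    if len(values) != expected_count:
        problems.append(
            "expected %d lines, found %d" % (expected_count, len(values))
        )
    return problems
```

For the test above this would be called with the path /root/result and the count 10000; an empty list means the number of requests received matches the number sent.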
Please restart monitoring-integration and delete the result file before each run. Otherwise the next run continues numbering in the result file from where the previous run stopped; a restart resets the counter to zero. Only Grafana pushes to the monitoring-integration webhook, and the webhook is configured in Grafana by monitoring-integration itself, so I feel we don't need to worry about requests to invalid URLs. If the correct URL is used and a request is not received correctly, then it is a bug.
I tried the script provided in Comment 11 and it correctly created the /root/result file with 10000 lines, but when I:
* downgraded to tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch,
* deleted /root/result,
* updated /usr/lib/python2.7/site-packages/tendrl/monitoring_integration/webhook/webhook_receiver.py to use the same index that it writes to the file when the webhook is called,
* restarted tendrl-monitoring-integration and executed the script from Comment 11,
it also created a file with 10000 lines. So I was not able to reproduce the issue with the older version. Do you have any idea how to reproduce it on the older version, where the issue should be present?
I don't have any more information about this. Please close this if it is not required.
Comment 15 doesn't answer the need info request.
Apologies, I missed one detail. To reproduce this issue:
1) Send a very large number of HTTP POST requests to "http://$tendrl-server:8789/grafana_callback"
2) Check the tendrl-monitoring-integration error logs, check the HTTP responses for error codes (500, 404, etc.), or check whether any request has been dropped and not processed
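The two steps above can be sketched as a small client-side script (Python 3, standard library only). It sends the POST requests concurrently and tallies the HTTP status codes, so dropped or failed requests show up in the counts. The function names, the default concurrency of 10, and the JSON payload are assumptions for illustration, not part of tendrl.

```python
import collections
import concurrent.futures
import http.client
import urllib.error
import urllib.request


def post_once(url, payload):
    """Send one POST; return the HTTP status code or a label for a failure."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with a 4xx/5xx status
    except (urllib.error.URLError, http.client.HTTPException, OSError) as error:
        # No usable HTTP response: refused, reset, timed out, and so on.
        return "failed: %s" % type(error).__name__


def flood(url, total, concurrency=10, payload=b'{"ruleId":1}'):
    """POST `total` times using `concurrency` workers and tally the outcomes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = pool.map(lambda _: post_once(url, payload), range(total))
        return collections.Counter(results)
```

Calling `flood` with the /grafana_callback URL and a total of 10000 returns a counter keyed by status code; any key other than a success status, or a total below the number sent, points at requests that were rejected or dropped.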
Please verify this issue and close it. We have not faced any issues related to the Grafana callback in the last few builds.
I tested this with the old version:
tendrl-ansible-1.5.4-7.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-commons-1.5.4-9.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-14.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-14.el7rhgs.noarch
tendrl-node-agent-1.5.4-16.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.5.4-6.el7rhgs.noarch

and with the new version:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch

In both cases the server was able to process all the requests that I sent to it, but I have also done load testing with the ApacheBench tool (ab):
$ ab -c 10 -n 10000 -p data.json -T application/json http://<wa-server>:8789/grafana_callback
$ cat data.json
{"ruleId":1}

The results of this load testing suggest that, with CherryPy in use, the server can handle a significantly greater number of requests.
Old version
-----------
Time taken for tests: 1576.683 seconds
Complete requests: 10000
Failed requests: 0
Requests per second: 6.34 [#/sec] (mean)
Time per request: 1576.683 [ms] (mean)
Time per request: 157.668 [ms] (mean, across all concurrent requests)
Transfer rate: 1.12 [Kbytes/sec] received
               1.26 kb/s sent
               2.38 kb/s total

New version
-----------
Time taken for tests: 476.874 seconds
Complete requests: 10000
Failed requests: 0
Requests per second: 20.97 [#/sec] (mean)
Time per request: 476.874 [ms] (mean)
Time per request: 47.687 [ms] (mean, across all concurrent requests)
Transfer rate: 2.85 [Kbytes/sec] received
               3.44 kb/s sent
               6.29 kb/s total

--> VERIFIED
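To put a number on the improvement, the derived ab figures can be recomputed from the raw totals reported above (10000 requests, concurrency 10, total time in seconds). This is only a worked check of the arithmetic; `ab_summary` is a hypothetical helper name.

```python
def ab_summary(total_requests, concurrency, seconds):
    """Recompute ApacheBench's derived figures from its raw totals."""
    requests_per_second = total_requests / seconds
    # Mean time per request as seen by one client while the other
    # (concurrency - 1) clients have requests in flight.
    ms_per_request = concurrency * seconds * 1000.0 / total_requests
    # Mean time per request across all concurrent clients.
    ms_per_request_overall = seconds * 1000.0 / total_requests
    return requests_per_second, ms_per_request, ms_per_request_overall


old = ab_summary(10000, 10, 1576.683)
new = ab_summary(10000, 10, 476.874)
speedup = new[0] / old[0]
print("old: %.2f req/s, new: %.2f req/s, speedup: %.1fx" % (old[0], new[0], speedup))
# prints: old: 6.34 req/s, new: 20.97 req/s, speedup: 3.3x
```

The recomputed values match the ab output, and the new version handles roughly 3.3 times as many requests per second in this test.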
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616