Bug 1576829 - Grafana alert callback webhook fails sometimes
Summary: Grafana alert callback webhook fails sometimes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: gowtham
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2018-05-10 13:30 UTC by Rohan Kanade
Modified: 2018-09-04 07:07 UTC

Fixed In Version: tendrl-ui-1.6.3-2.el7rhgs tendrl-ansible-1.6.3-4.el7rhgs tendrl-notifier-1.6.3-3.el7rhgs tendrl-commons-1.6.3-5.el7rhgs tendrl-api-1.6.3-3.el7rhgs tendrl-monitoring-integration-1.6.3-3.el7rhgs tendrl-node-agent-1.6.3-5.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:06:18 UTC
Target Upstream Version:




Links
System ID | Private | Priority | Status | Summary | Last Updated
Github Tendrl/monitoring-integration issue 451 | 0 | None | None | None | 2018-05-10 13:30:31 UTC
Red Hat Product Errata RHSA-2018:2616 | 0 | None | None | None | 2018-09-04 07:07:07 UTC

Description Rohan Kanade 2018-05-10 13:30:32 UTC
Description of problem:
Grafana pushes frequent alerts to the callback webhook registered by tendrl-monitoring-integration. The webhook is currently handled by a simple HTTP server that is not meant for production use, so the handler is being moved to a CherryPy server.

Version-Release number of selected component (if applicable):
The package version where this was observed was tendrl-monitoring-integration-1.6.3-2.el7rhgs

How reproducible:
70%

Steps to Reproduce:
1. Generate large number of grafana alerts for RHGSWA managed gluster cluster


Actual results:
tendrl-monitoring-integration fails to process some of the alerts

Expected results:
tendrl-monitoring-integration should process/handle all alerts posted by grafana on the callback webhook

Comment 2 Filip Balák 2018-05-10 14:01:05 UTC
Can you please specify which log file reports the error and provide an example of the errors from that log file?

Comment 3 gowtham 2018-05-16 12:38:46 UTC
This issue is fixed

Comment 5 gowtham 2018-05-16 12:50:36 UTC
Filip, it is not about a specific error. I think Rohan means that when too many alerts arrive at once, the HTTP server drops some of them; it cannot process that volume of requests, so monitoring-integration misses some alerts and fails to notify the user.

Comment 9 Rohan Kanade 2018-05-22 04:52:33 UTC
I don't have the test environment anymore.

To reproduce this issue:
1) Send a very large number of HTTP POST requests to tendrl-server port 8789

2) Check the tendrl-monitoring-integration error logs, or check the HTTP responses for error codes such as 500 or 404

Comment 10 Filip Balák 2018-06-13 11:24:26 UTC
I tried this with old (tendrl-monitoring-integration-1.6.3-1.el7rhgs) and new (tendrl-monitoring-integration-1.6.3-4.el7rhgs) versions:
1. Open `journalctl -u tendrl-monitoring-integration -fe` on <server>,
where <server> is the address of the server with WA and Grafana.
2. Run
for x in {1..10000}; do curl -d "foo=bar" -X POST http://<server>:8789/$x; done
and check journal from step 1.

Result for both versions:
The journal shows the request numbers. The server usually freezes after every 2000-3000 requests and some alerts are not logged. There seems to be no improvement with the new version.

For example, logs for requests 3821-4192 are missing from the journal:
```
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3817 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3818 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3819 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:50 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:50 -0400] "POST /3820 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:54 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:54 -0400] "POST /4193 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:54 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:54 -0400] "POST /4194 HTTP/1.1" 404 233 "-" "curl/7.29.0"
Jun 13 07:11:54 <server> tendrl-monitoring-integration[1039]: <some-ip> - - [13/Jun/2018:07:11:54 -0400] "POST /4195 HTTP/1.1" 404 233 "-" "curl/7.29.0"
```

--> ASSIGNED

Comment 11 gowtham 2018-06-20 09:07:05 UTC
I used the correct URL so that requests are received by the proper handler method, and I verified that all 10000 requests were received correctly.

Please verify with the following steps:
STEP 1: Modify the webhook receiver code in /usr/lib/python2.7/site-packages/tendrl/monitoring_integration/webhook/webhook_receiver.py as shown in
https://paste.ofcode.org/BavgiVFWzWiFbULUx3kWct

STEP 2: Restart the tendrl-monitoring-integration service:
service tendrl-monitoring-integration restart

STEP 3: Use the script below to send the requests:
for x in {1..10000}; do curl -d "foo=bar" -X POST http://{server-id}:8789/grafana_callback; echo $x; done

STEP 4: After all requests are sent, open the result file on the server machine:
vi /root/result

You should see the numbers 1 to 10000 in sequence. Alternatively, you can stop part way through and compare the number of requests sent with the last number in the result file.

Each time a request is received, the webhook increments a counter and saves it to the file, so the number of requests sent should equal the number of requests received.
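The counting mechanism described above could be sketched roughly like this. This is a hypothetical illustration, not the contents of the linked paste; the class name is an assumption, and the /root/result path is taken from STEP 4:

```python
import threading


class RequestCounter(object):
    """Append an incrementing number to a file for every request received.

    Hypothetical sketch of the verification tweak described above; the
    real change lives in webhook_receiver.py via the paste in STEP 1.
    """

    def __init__(self, path="/root/result"):
        self.path = path
        self.count = 0
        self.lock = threading.Lock()  # webhook handlers may run concurrently

    def record(self):
        # Called once per received webhook request: bump the counter
        # and append it to the result file.
        with self.lock:
            self.count += 1
            with open(self.path, "a") as f:
                f.write("%d\n" % self.count)
            return self.count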


Please ping me if you face any difficulties with this.

Comment 12 gowtham 2018-06-20 09:16:26 UTC
Please restart monitoring-integration before each check; otherwise it will continue numbering the result file from where it actually stopped last time, whereas a restart reinitializes the counter to zero. Also delete the result file between runs.


Only Grafana pushes to the monitoring-integration webhook, and the webhook is configured in Grafana by monitoring-integration itself, so I feel we don't need to worry about the count of requests to invalid URLs. If we use the correct URL and a request is not received correctly, then it is a bug.

Comment 14 Filip Balák 2018-06-26 12:55:48 UTC
I tried the script provided in Comment 11 and it correctly created the /root/result file with 10000 lines, but when I:
* downgraded to tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch,
* deleted /root/result,
* updated /usr/lib/python2.7/site-packages/tendrl/monitoring_integration/webhook/webhook_receiver.py to write the same index to the file when the webhook is called,
* restarted tendrl-monitoring-integration
and executed the script from Comment 11, it also created a file with 10000 lines. So I was not able to reproduce the problem with the older version. Do you have any idea how to reproduce the issue with the older version, where it should be present?

Comment 15 Rohan Kanade 2018-06-28 12:49:54 UTC
I don't have any more info about this; please close it if not required.

Comment 17 Martin Bukatovic 2018-06-29 16:24:35 UTC
Comment 15 doesn't answer the needinfo request.

Comment 18 Rohan Kanade 2018-07-10 12:24:16 UTC
Apologies, I missed one detail.

To reproduce this issue:
1) Send a very large number of HTTP POST requests to "http://$tendrl-server:8789/grafana_callback"

2) Check the tendrl-monitoring-integration error logs, check the HTTP responses for error codes such as 500 or 404, or check whether any request was dropped and not processed

Comment 19 gowtham 2018-07-12 12:33:55 UTC
Please verify this issue and close it; in the last few builds we have not faced any issue related to the Grafana callback.

Comment 21 Filip Balák 2018-07-25 17:57:02 UTC
I tested this with old version:
tendrl-ansible-1.5.4-7.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-commons-1.5.4-9.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-14.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-14.el7rhgs.noarch
tendrl-node-agent-1.5.4-16.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.5.4-6.el7rhgs.noarch
and with new version:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch

In both cases the server was able to process all requests that I sent to it, but I also did load testing with the ApacheBench tool (ab):

$ ab -c 10 -n 10000 -p data.json -T application/json http://<wa-server>:8789/grafana_callback
$ cat data.json
{"ruleId":1}


The load-testing results suggest that after the switch to CherryPy the number of requests per second the server can handle is significantly greater (20.97 vs 6.34).

Old version
-----------
Time taken for tests:   1576.683 seconds
Complete requests:      10000
Failed requests:        0
Requests per second:    6.34 [#/sec] (mean)
Time per request:       1576.683 [ms] (mean)
Time per request:       157.668 [ms] (mean, across all concurrent requests)
Transfer rate:          1.12 [Kbytes/sec] received
                        1.26 kb/s sent
                        2.38 kb/s total

New version
-----------
Time taken for tests:   476.874 seconds
Complete requests:      10000
Failed requests:        0
Requests per second:    20.97 [#/sec] (mean)
Time per request:       476.874 [ms] (mean)
Time per request:       47.687 [ms] (mean, across all concurrent requests)
Transfer rate:          2.85 [Kbytes/sec] received
                        3.44 kb/s sent
                        6.29 kb/s total

--> VERIFIED

Comment 23 errata-xmlrpc 2018-09-04 07:06:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

