Bug 1406417
| Summary: | http opens too many connections and gets stuck after a while | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Amit Ugol <augol> | ||||
| Component: | openstack-gnocchi | Assignee: | Mehdi ABAAKOUK <mabaakou> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Sasha Smolyak <ssmolyak> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 10.0 (Newton) | CC: | aarapov, akaris, akrzos, apevec, aschultz, awestbro, dmaley, eglynn, fbaudin, gkeegan, jjoyce, jschluet, lhh, mabaakou, mburns, mcornea, mschuppe, ngalvin, pablo.iranzo, pkilambi, rhel-osp-director-maint, rwyse, sacpatil, sclewis, sknauss, ssmolyak, vcojot | ||||
| Target Milestone: | async | Keywords: | Triaged, ZStream | ||||
| Target Release: | 10.0 (Newton) | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | openstack-gnocchi-3.0.11-1.el7ost | Doc Type: | Bug Fix | ||||
| Doc Text: |
The default ceph timeout is now 30 seconds instead of infinite.
|
Story Points: | --- | ||||
| Clone Of: | 1373395 | ||||||
| : | 1472413 (view as bug list) | Environment: | |||||
| Last Closed: | 2017-07-12 14:07:53 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1472413 | ||||||
| Attachments: |
|
||||||
|
Description
Amit Ugol
2016-12-20 13:20:50 UTC
Amit, Do you have log files for keystone and httpd? Also is the environment still up and accessible? -Alex Created attachment 1234241 [details]
gnocchi metricd log
Here is some statistics from one of the controllers, I assume it looks the same across all controllers.
# netstat -planex | grep httpd | wc -l
265
Looking at how many users are connected to each listener (note the 3rd one).
# netstat -planex | grep httpd | grep ACC
unix 2 [ ACC ] STREAM LISTENING 95062029 99018/httpd /var/run/wsgi.99018.3.1.sock
unix 2 [ ACC ] STREAM LISTENING 95062031 99018/httpd /var/run/wsgi.99018.3.2.sock
unix 256 [ ACC ] STREAM LISTENING 95062033 99018/httpd /var/run/wsgi.99018.3.3.sock
unix 2 [ ACC ] STREAM LISTENING 95062036 99018/httpd /var/run/wsgi.99018.3.4.sock
unix 2 [ ACC ] STREAM LISTENING 95062039 99018/httpd /var/run/wsgi.99018.3.5.sock
unix 2 [ ACC ] STREAM LISTENING 95062042 99018/httpd /var/run/wsgi.99018.3.6.sock
What the Q looks like for that socket
# netstat -pane | grep /var/run/wsgi.99018.3.3.sock | grep CONNECTING | wc -l
101
Who is accessing this file:
# lsof /var/run/wsgi.99018.3.3.sock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 99018 root 27u unix 0xffff8802f4a44c00 0t0 95062033 /var/run/wsgi.99018.3.3.sock
httpd 299749 gnocchi 17u unix 0xffff88034e66e400 0t0 97281708 /var/run/wsgi.99018.3.3.sock
httpd 299749 gnocchi 27u unix 0xffff8802f4a44c00 0t0 95062033 /var/run/wsgi.99018.3.3.sock
httpd 299750 gnocchi 17u unix 0xffff8802df38b400 0t0 97279565 /var/run/wsgi.99018.3.3.sock
httpd 299750 gnocchi 27u unix 0xffff8802f4a44c00 0t0 95062033 /var/run/wsgi.99018.3.3.sock
Seeing this, I attached the gnocchi metricd log. I am uncertain if what I shh here is because gnocchi is acting up or this is a symptom of something else.
Example from the beginning of the run:
2016-12-14 13:26:35.100 96366 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
Then minor hops in the middle of the log:
2016-12-17 06:44:35.667 96366 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
2016-12-17 06:46:35.669 96366 INFO gnocchi.cli [-] 36 measurements bundles across 36 metrics wait to be processed.
2016-12-17 06:48:35.668 96366 INFO gnocchi.cli [-] 9 measurements bundles across 9 metrics wait to be processed.
2016-12-17 06:50:35.668 96366 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
Then when things started to get stuck:
2016-12-21 05:30:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
2016-12-21 05:32:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
2016-12-21 05:34:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
2016-12-21 05:36:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
general statistics:
# netstat -s
Ip:
436333421 total packets received
0 forwarded
0 incoming packets discarded
436333376 incoming packets delivered
438030715 requests sent out
152 outgoing packets dropped
16 dropped because of missing route
Icmp:
6431 ICMP messages received
169 input ICMP message failed.
ICMP input histogram:
destination unreachable: 6423
echo replies: 8
379 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 371
echo request: 8
IcmpMsg:
InType0: 8
InType3: 6423
OutType3: 371
OutType8: 8
Tcp:
19527878 active connections openings
4352955 passive connection openings
15440779 failed connection attempts
19273701 connection resets received
837 connections established
423244935 segments received
436782145 segments send out
15526077 segments retransmited
1714 bad segments received.
34222654 resets sent
Udp:
12779382 packets received
340 packets to unknown port received.
0 packet receive errors
12270687 packets sent
0 receive buffer errors
0 send buffer errors
UdpLite:
TcpExt:
83686 invalid SYN cookies received
15429756 resets received for embryonic SYN_RECV sockets
2309217 TCP sockets finished time wait in fast timer
15429819 packets rejects in established connections because of timestamp
9870151 delayed acks sent
1081 delayed acks further delayed because of locked socket
Quick ack mode was activated 33376 times
39213 times the listen queue of a socket overflowed
39214 SYNs to LISTEN sockets dropped
3433066 packets directly queued to recvmsg prequeue.
108386 bytes directly in process context from backlog
1341242179 bytes directly received in process context from prequeue
58612400 packet headers predicted
1424983 packets header predicted and directly queued to user
94424229 acknowledgments not containing data payload received
78694458 predicted acknowledgments
23 times recovered from packet loss by selective acknowledgements
5 congestion windows fully recovered without slow start
18 congestion windows recovered without slow start by DSACK
27735 congestion windows recovered without slow start after partial ack
23 fast retransmits
15467445 other TCP timeouts
TCPLossProbes: 39135
TCPLossProbeRecovery: 33359
10 times receiver scheduled too late for direct processing
33376 DSACKs sent for old packets
35488 DSACKs received
17408610 connections reset due to unexpected data
117320 connections reset due to early user close
1231 connections aborted due to timeout
TCPDSACKIgnoredNoUndo: 34269
TCPSackMerged: 2
TCPSackShiftFallback: 24
TCPDeferAcceptDrop: 41823
TCPRcvCoalesce: 7982726
TCPOFOQueue: 1695
TCPChallengeACK: 1714
TCPSYNChallenge: 1714
TCPAutoCorking: 1829
TCPFromZeroWindowAdv: 502
TCPToZeroWindowAdv: 502
TCPWantZeroWindowAdv: 3807
TCPSynRetrans: 15462317
TCPOrigDataSent: 131607369
TCPHystartTrainDetect: 2672
TCPHystartTrainCwnd: 53325
TCPHystartDelayDetect: 3
TCPHystartDelayCwnd: 82
TCPACKSkippedSeq: 1
IpExt:
InNoRoutes: 2
InBcastPkts: 302300
InOctets: 81610164231
OutOctets: 70685470260
InBcastOctets: 99154400
InNoECTPkts: 436333449
I was interested in this bug since it affect Gnocchi and I have been conducting significant performance and scale integrated testing of the Telemetry stack included with OpenStack. Amit shared his machines so I could take a look. Looking at the httpd process count, we can quickly see that there is a "build up" of httpd processes on the controllers: [root@overcloud-controller-0 ~]# ps aux | grep httpd -c 265 Apache's MaxClients/ServerLimit configuration is 256 on RHEL 7.3 by default and in this environment's configuration. Apache is also running with the prefork module (processes for concurrency) Scanning over the httpd error logs we can see [Tue Dec 20 09:25:47.622278 2016] [mpm_prefork:error] [pid 108059] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting Indicating that httpd has actually spawned enough processes to hit MaxClients (MaxRequestWorkers). When this occurs all services hosted in httpd will be inaccessible. This affects keystone admin&main endpoints, gnocchi api, aodh api, ceilometer api and horizon. I then checked in if Gnocchi can get to its coordination driver (redis) since I saw this same issue before (https://bugzilla.redhat.com/show_bug.cgi?id=1372751) however redis is easily reachable through haproxy. Next I checked ceph's status since Gnocchi is using ceph as a storage driver in this environment: [root@overcloud-controller-0 httpd]# ceph -s cluster 67c6af0a-c1fa-11e6-8887-525207784b92 health HEALTH_ERR clock skew detected on mon.overcloud-controller-0, mon.overcloud-controller-1 3 full osd(s) Monitor clock skew detected monmap e1: 3 mons at {overcloud-controller-0=172.18.0.17:6789/0,overcloud-controller-1=172.18.0.24:6789/0,overcloud-controller-2=172.18.0.16:6789/0} election epoch 20, quorum 0,1,2 overcloud-controller-2,overcloud-controller-0,overcloud-controller-1 osdmap e56: 3 osds: 3 up, 3 in flags full,sortbitwise pgmap v90583: 224 pgs, 6 pools, 2106 MB data, 16914 objects 35485 MB used, 1341 MB / 36827 MB avail 224 active+clean [root@overcloud-controller-0 httpd]# ceph health HEALTH_ERR clock skew detected on mon.overcloud-controller-0, mon.overcloud-controller-1; 3 full osd(s); Monitor clock skew detected [root@overcloud-controller-0 httpd]# ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 36827M 1341M 35485M 96.36 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS rbd 0 0 0 432M 0 metrics 1 186M 30.10 432M 16424 images 2 0 0 432M 1 backups 3 1024M 70.31 432M 260 volumes 4 0 0 432M 1 vms 5 896M 67.45 432M 228 As we can see, ceph's status is HEALTH_ERR and %96.36 Raw used with 3 osds full. Since Gnocchi api is attempting to access ceph it is either (waiting on ceph or timing out against ceph) Since this can consume a lot of time, the number of requests incoming to gnocchi-api via ceilometer collector is building into the httpd "requests queue" faster than they must be able to be processed as a timeout/error, this is causing the build up of httpd processes. Once enough requests are sitting around and the httpd "requests queue" (MaxClients) is full, we end up starving all of the other services including Keystone which effectively cuts off everything in the cloud. This really brings up a bigger issue, hosting everything in httpd means all of those services are sharing httpd resources and if one mis-behaves we don't currently have a mechanism to make sure it doesn't take out Keystone and then effectively the whole cloud. I also hit this issue(256 httpd processes) on an environment with 1 controller and 1 compute with default configuration. Hello, I have another bugzilla which is a possible duplicate of this one, although the symptoms were a bit different: https://bugzilla.redhat.com/show_bug.cgi?id=1438605 It's also possible that I simply looked at this from a different angle. We are seeing this problem at several installations - up to the point that we right now "shoot from the hip" to tell customers to temporarily disable gnocchi if issues with httpd / connection exhaustion occur. Regards, Andreas Hi, I have applied https://review.openstack.org/#/c/452296/1/gnocchi/storage/ceph.py to an OSP10 (newton) environment. How do I verify that it is doing the right thing? Also, I'm seeing these things in /var/log/gnocchi: statsd.log:2017-04-12 14:58:53.825 124343 DEBUG gnocchi.service [-] storage.ceph_timeout = None log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630 statsd.log:2017-04-12 14:58:53.831 124343 DEBUG gnocchi.service [-] storage.swift_timeout = 300 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630 so it looks like storage.ceph_timeout is set to 'None'. On my controllers I only have two httpds running under gnocchi. # pgrep -u gnocchi -afl httpd 262704 /usr/sbin/httpd -DFOREGROUND 262705 /usr/sbin/httpd -DFOREGROUND I'll check later today. For now gnocchi.conf looks like that: controller: # Ceph connection timeout (integer value) #ceph_timeout = <None> openstack-gnocchi-3.0.6-2.el7ost.noarch Sasha, I think we are good for this one, deployer can now set a timeout if their want to mitigate this kind of issue. The fact that you see the option is that you have the good packages :) The default is current None, that means we use the one configured in ceph.conf by default, like all other openstack projects. We recently discus about changing the default to something else upstream, I open https://bugzilla.redhat.com/show_bug.cgi?id=1461773 to track this future change. I have made some sanity check and we need a change just merge upstream before shipping a hotfix. The ceph timeout has been changed to 30 and can now be set from gnocchi.conf Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1748 Guys, From what I understand, the fix will provide an option to configure the default timeout(30 sec) to mitigate the issue of opening too many httpd connections trying to access Ceph. This fix will not address Ceph HEALTH_ERR issue mentioned in comment#6. Right? Exactly. The Ceph HEALTH_ERR is not really a bug as it was due to OSDs that was full. @Mehdi, Thanks a lot for your precise and prompt reply. |