Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1406417

Summary: http opens too many connections and gets stuck after a while
Product: Red Hat OpenStack Reporter: Amit Ugol <augol>
Component: openstack-gnocchiAssignee: Mehdi ABAAKOUK <mabaakou>
Status: CLOSED ERRATA QA Contact: Sasha Smolyak <ssmolyak>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 10.0 (Newton)CC: aarapov, akaris, akrzos, apevec, aschultz, awestbro, dmaley, eglynn, fbaudin, gkeegan, jjoyce, jschluet, lhh, mabaakou, mburns, mcornea, mschuppe, ngalvin, pablo.iranzo, pkilambi, rhel-osp-director-maint, rwyse, sacpatil, sclewis, sknauss, ssmolyak, vcojot
Target Milestone: asyncKeywords: Triaged, ZStream
Target Release: 10.0 (Newton)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-gnocchi-3.0.11-1.el7ost Doc Type: Bug Fix
Doc Text:
The default ceph timeout is now 30 seconds instead of infinite.
Story Points: ---
Clone Of: 1373395
: 1472413 (view as bug list) Environment:
Last Closed: 2017-07-12 14:07:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1472413    
Attachments:
Description Flags
gnocchi metricd log none

Description Amit Ugol 2016-12-20 13:20:50 UTC
Description of problem:
This is a copy of https://bugzilla.redhat.com/show_bug.cgi?id=1373395
The fix introduced in that bug made systems stay responsive for (in this case) ~5.5 days. This was tested as part of RHOS 10 GA longevity test.

Version-Release number of selected component (if applicable):
RHOS 10 GA (2016-12-14.1)

How reproducible:
Its not a question of how but of when.


What I did:
Installed RHOS 10 GA manually on an environment with:
Name        | Num | RAM | SWAP | vCPU
-------------------------------------
Undercloud  |   1 |  16 |    1 |    4
Controllers |   3 |  16 |    1 |    3
Computes    |   2 |  16 |    1 |    3
Ceph        |   3 |   4 |    1 |    2

Deployment command:
openstack overcloud deploy \
    --templates \
    --libvirt-type kvm \
    --control-scale 3 --control-flavor control \
    --compute-scale 2 --compute-flavor compute \
    --ceph-storage-scale 3 --ceph-storage-flavor ceph-storage \
    --neutron-network-type vxlan \
    --neutron-tunnel-types vxlan \
    --ntp-server clock.redhat.com \
    -e ./templates/swap_env.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/services/sahara.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/net-multiple-nics.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
    -e ./templates/ceph.yaml \
    -e ./templates/overrides.yaml

I ran heat_integrationtests 2-3 times a day while stabilizing a related job.

On all controller nodes ~ 9GB ram is used.

Example load:
[heat-admin@overcloud-controller-2 ~]$ uptime
 13:13:27 up 6 days, 15 min,  2 users,  load average: 1.32, 1.27, 1.51

On all nodes all services are up. pcs status is OK.
The only issue is that telnetting to any of the public keystone addresses results in a non communicative server.
(Doing the same on the undercloud machine and typing hi <enter> will close the session as expected.)

The final error message was a 504.

Comment 1 Alex Krzos 2016-12-20 13:57:24 UTC
Amit,

Do you have log files for keystone and httpd?  Also is the environment still up and accessible?

-Alex

Comment 2 Amit Ugol 2016-12-21 06:04:31 UTC
Created attachment 1234241 [details]
gnocchi metricd log

Comment 3 Amit Ugol 2016-12-21 06:12:40 UTC
Here is some statistics from one of the controllers, I assume it looks the same across all controllers.

# netstat -planex | grep httpd | wc -l
265

Looking at how many users are connected to each listener (note the 3rd one). 

# netstat -planex | grep httpd | grep ACC
unix  2      [ ACC ]     STREAM     LISTENING     95062029 99018/httpd          /var/run/wsgi.99018.3.1.sock
unix  2      [ ACC ]     STREAM     LISTENING     95062031 99018/httpd          /var/run/wsgi.99018.3.2.sock
unix  256    [ ACC ]     STREAM     LISTENING     95062033 99018/httpd          /var/run/wsgi.99018.3.3.sock
unix  2      [ ACC ]     STREAM     LISTENING     95062036 99018/httpd          /var/run/wsgi.99018.3.4.sock
unix  2      [ ACC ]     STREAM     LISTENING     95062039 99018/httpd          /var/run/wsgi.99018.3.5.sock
unix  2      [ ACC ]     STREAM     LISTENING     95062042 99018/httpd          /var/run/wsgi.99018.3.6.sock

What the Q looks like for that socket
# netstat -pane | grep  /var/run/wsgi.99018.3.3.sock | grep CONNECTING | wc -l
101

Who is accessing this file:
# lsof /var/run/wsgi.99018.3.3.sock
COMMAND    PID    USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
httpd    99018    root   27u  unix 0xffff8802f4a44c00      0t0 95062033 /var/run/wsgi.99018.3.3.sock
httpd   299749 gnocchi   17u  unix 0xffff88034e66e400      0t0 97281708 /var/run/wsgi.99018.3.3.sock
httpd   299749 gnocchi   27u  unix 0xffff8802f4a44c00      0t0 95062033 /var/run/wsgi.99018.3.3.sock
httpd   299750 gnocchi   17u  unix 0xffff8802df38b400      0t0 97279565 /var/run/wsgi.99018.3.3.sock
httpd   299750 gnocchi   27u  unix 0xffff8802f4a44c00      0t0 95062033 /var/run/wsgi.99018.3.3.sock

Seeing this, I attached the gnocchi metricd log. I am uncertain if what I shh here is because gnocchi is acting up or this is a symptom of something else.

Example from the beginning of the run:

2016-12-14 13:26:35.100 96366 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.

Then minor hops in the middle of the log:
2016-12-17 06:44:35.667 96366 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
2016-12-17 06:46:35.669 96366 INFO gnocchi.cli [-] 36 measurements bundles across 36 metrics wait to be processed.
2016-12-17 06:48:35.668 96366 INFO gnocchi.cli [-] 9 measurements bundles across 9 metrics wait to be processed.
2016-12-17 06:50:35.668 96366 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.

Then when things started to get stuck:
2016-12-21 05:30:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
2016-12-21 05:32:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
2016-12-21 05:34:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.
2016-12-21 05:36:36.495 96366 INFO gnocchi.cli [-] 7 measurements bundles across 1 metrics wait to be processed.


general statistics:
# netstat -s
Ip:
    436333421 total packets received
    0 forwarded
    0 incoming packets discarded
    436333376 incoming packets delivered
    438030715 requests sent out
    152 outgoing packets dropped
    16 dropped because of missing route
Icmp:
    6431 ICMP messages received
    169 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 6423
        echo replies: 8
    379 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 371
        echo request: 8
IcmpMsg:
        InType0: 8
        InType3: 6423
        OutType3: 371
        OutType8: 8
Tcp:
    19527878 active connections openings
    4352955 passive connection openings
    15440779 failed connection attempts
    19273701 connection resets received
    837 connections established
    423244935 segments received
    436782145 segments send out
    15526077 segments retransmited
    1714 bad segments received.
    34222654 resets sent
Udp:
    12779382 packets received
    340 packets to unknown port received.
    0 packet receive errors
    12270687 packets sent
    0 receive buffer errors
    0 send buffer errors
UdpLite:
TcpExt:
    83686 invalid SYN cookies received
    15429756 resets received for embryonic SYN_RECV sockets
    2309217 TCP sockets finished time wait in fast timer
    15429819 packets rejects in established connections because of timestamp
    9870151 delayed acks sent
    1081 delayed acks further delayed because of locked socket
    Quick ack mode was activated 33376 times
    39213 times the listen queue of a socket overflowed
    39214 SYNs to LISTEN sockets dropped
    3433066 packets directly queued to recvmsg prequeue.
    108386 bytes directly in process context from backlog
    1341242179 bytes directly received in process context from prequeue
    58612400 packet headers predicted
    1424983 packets header predicted and directly queued to user
    94424229 acknowledgments not containing data payload received
    78694458 predicted acknowledgments
    23 times recovered from packet loss by selective acknowledgements
    5 congestion windows fully recovered without slow start
    18 congestion windows recovered without slow start by DSACK
    27735 congestion windows recovered without slow start after partial ack
    23 fast retransmits
    15467445 other TCP timeouts
    TCPLossProbes: 39135
    TCPLossProbeRecovery: 33359
    10 times receiver scheduled too late for direct processing
    33376 DSACKs sent for old packets
    35488 DSACKs received
    17408610 connections reset due to unexpected data
    117320 connections reset due to early user close
    1231 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 34269
    TCPSackMerged: 2
    TCPSackShiftFallback: 24
    TCPDeferAcceptDrop: 41823
    TCPRcvCoalesce: 7982726
    TCPOFOQueue: 1695
    TCPChallengeACK: 1714
    TCPSYNChallenge: 1714
    TCPAutoCorking: 1829
    TCPFromZeroWindowAdv: 502
    TCPToZeroWindowAdv: 502
    TCPWantZeroWindowAdv: 3807
    TCPSynRetrans: 15462317
    TCPOrigDataSent: 131607369
    TCPHystartTrainDetect: 2672
    TCPHystartTrainCwnd: 53325
    TCPHystartDelayDetect: 3
    TCPHystartDelayCwnd: 82
    TCPACKSkippedSeq: 1
IpExt:
    InNoRoutes: 2
    InBcastPkts: 302300
    InOctets: 81610164231
    OutOctets: 70685470260
    InBcastOctets: 99154400
    InNoECTPkts: 436333449

Comment 6 Alex Krzos 2016-12-21 13:58:36 UTC
I was interested in this bug since it affect Gnocchi and I have been conducting significant performance and scale integrated testing of the Telemetry stack included with OpenStack.


Amit shared his machines so I could take a look. Looking at the httpd process count, we can quickly see that there is a "build up" of httpd processes on the controllers:

[root@overcloud-controller-0 ~]# ps aux | grep httpd -c
265

Apache's MaxClients/ServerLimit configuration is 256 on RHEL 7.3 by default and in this environment's configuration.  Apache is also running with the prefork module (processes for concurrency)

Scanning over the httpd error logs we can see
[Tue Dec 20 09:25:47.622278 2016] [mpm_prefork:error] [pid 108059] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

Indicating that httpd has actually spawned enough processes to hit MaxClients (MaxRequestWorkers).  When this occurs all services hosted in httpd will be inaccessible.  This affects keystone admin&main endpoints, gnocchi api, aodh api, ceilometer api and horizon.

I then checked in if Gnocchi can get to its coordination driver (redis) since I saw this same issue before (https://bugzilla.redhat.com/show_bug.cgi?id=1372751) however redis is easily reachable through haproxy.

Next I checked ceph's status since Gnocchi is using ceph as a storage driver in this environment:

[root@overcloud-controller-0 httpd]# ceph -s
    cluster 67c6af0a-c1fa-11e6-8887-525207784b92
     health HEALTH_ERR
            clock skew detected on mon.overcloud-controller-0, mon.overcloud-controller-1
            3 full osd(s)
            Monitor clock skew detected 
     monmap e1: 3 mons at {overcloud-controller-0=172.18.0.17:6789/0,overcloud-controller-1=172.18.0.24:6789/0,overcloud-controller-2=172.18.0.16:6789/0}
            election epoch 20, quorum 0,1,2 overcloud-controller-2,overcloud-controller-0,overcloud-controller-1
     osdmap e56: 3 osds: 3 up, 3 in
            flags full,sortbitwise
      pgmap v90583: 224 pgs, 6 pools, 2106 MB data, 16914 objects
            35485 MB used, 1341 MB / 36827 MB avail
                 224 active+clean
[root@overcloud-controller-0 httpd]# ceph health
HEALTH_ERR clock skew detected on mon.overcloud-controller-0, mon.overcloud-controller-1; 3 full osd(s); Monitor clock skew detected
[root@overcloud-controller-0 httpd]# ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED 
    36827M     1341M       35485M         96.36 
POOLS:
    NAME        ID     USED      %USED     MAX AVAIL     OBJECTS 
    rbd         0          0         0          432M           0 
    metrics     1       186M     30.10          432M       16424 
    images      2          0         0          432M           1 
    backups     3      1024M     70.31          432M         260 
    volumes     4          0         0          432M           1 
    vms         5       896M     67.45          432M         228 


As we can see, ceph's status is HEALTH_ERR and %96.36 Raw used with 3 osds full.  Since Gnocchi api is attempting to access ceph it is either (waiting  on ceph or timing out against ceph)  Since this can consume a lot of time, the number of requests incoming to gnocchi-api via ceilometer collector is building into the httpd "requests queue" faster than they must be able to be processed as a timeout/error, this is causing the build up of httpd processes.  Once enough requests are sitting around and the httpd "requests queue" (MaxClients) is full, we end up starving all of the other services including Keystone which effectively cuts off everything in the cloud.

This really brings up a bigger issue, hosting everything in httpd means all of those services are sharing httpd resources and if one mis-behaves we don't currently have a mechanism to make sure it doesn't take out Keystone and then effectively the whole cloud.

Comment 8 Marius Cornea 2017-01-09 19:57:44 UTC
I also hit this issue(256 httpd processes) on an environment with 1 controller and 1 compute with default configuration.

Comment 15 Andreas Karis 2017-04-11 14:26:02 UTC
Hello,

I have another bugzilla which is a possible duplicate of this one, although the symptoms were a bit different: https://bugzilla.redhat.com/show_bug.cgi?id=1438605
It's also possible that I simply looked at this from a different angle.

We are seeing this problem at several installations - up to the point that we right now "shoot from the hip" to tell customers to temporarily disable gnocchi if issues with httpd / connection exhaustion occur.

Regards,

Andreas

Comment 18 Vincent S. Cojot 2017-04-12 15:58:01 UTC
Hi,
I have applied https://review.openstack.org/#/c/452296/1/gnocchi/storage/ceph.py to an OSP10 (newton) environment.

How do I verify that it is doing the right thing?
Also, I'm seeing these things in /var/log/gnocchi:

statsd.log:2017-04-12 14:58:53.825 124343 DEBUG gnocchi.service [-] storage.ceph_timeout           = None log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630
statsd.log:2017-04-12 14:58:53.831 124343 DEBUG gnocchi.service [-] storage.swift_timeout          = 300 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630

so it looks like storage.ceph_timeout is set to 'None'.

On my controllers I only have two httpds running under gnocchi.
# pgrep -u gnocchi -afl httpd
262704 /usr/sbin/httpd -DFOREGROUND
262705 /usr/sbin/httpd -DFOREGROUND

I'll check later today.

Comment 23 Sasha Smolyak 2017-06-14 11:29:37 UTC
For now gnocchi.conf looks like that:

controller:
# Ceph connection timeout (integer value)
#ceph_timeout = <None>
openstack-gnocchi-3.0.6-2.el7ost.noarch

Comment 25 Mehdi ABAAKOUK 2017-06-15 09:57:05 UTC
Sasha, I think we are good for this one, deployer can now set a timeout if their want to mitigate this kind of issue.

The fact that you see the option is that you have the good packages :)

The default is current None, that means we use the one configured in ceph.conf by default, like all other openstack projects.

We recently discus about changing the default to something else upstream, I open https://bugzilla.redhat.com/show_bug.cgi?id=1461773 to track this future change.

Comment 30 Mehdi ABAAKOUK 2017-06-19 08:05:24 UTC
I have made some sanity check and we need a change just merge upstream before shipping a hotfix.

Comment 40 Sasha Smolyak 2017-07-06 07:38:00 UTC
The ceph timeout has been changed to  30 and can now be set from gnocchi.conf

Comment 42 errata-xmlrpc 2017-07-12 14:07:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1748

Comment 43 Sachin 2019-03-14 09:52:22 UTC
Guys,

From what I understand, the fix will provide an option to configure the default timeout(30 sec) to mitigate the issue of opening too many httpd connections trying to access Ceph. This fix will not address Ceph HEALTH_ERR issue mentioned in comment#6. Right?

Comment 44 Mehdi ABAAKOUK 2019-03-14 10:20:15 UTC
Exactly. The Ceph HEALTH_ERR is not really a bug as it was due to OSDs that was full.

Comment 45 Sachin 2019-03-15 08:53:46 UTC
@Mehdi,

Thanks a lot for your precise and prompt reply.