Bug 1028583
| Summary: | gluster-swift with default TCP configuration incurs thousands of errors while running catalyst - please consider modifying the default state of one or more TCP tunables | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nick Dokos <ndokos> |
| Component: | gluster-swift | Assignee: | Luis Pabón <lpabon> |
| Status: | CLOSED WONTFIX | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.1 | CC: | madam, ndokos, ppai, rhs-bugs, sasundar |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-11-20 06:14:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Closing this bug as RHS 2.1 is EOL. If this bug persists in recent versions, it should be opened against RHGS 3.1.x.
Created attachment 821745 [details]
Number of sockets in TIME_WAIT state vs. time

Description of problem:

We ran into this problem while running catalyst with a subset of the standard workload. The configuration consists of 8 clients with 64 threads each and 6 servers in the "standard" configuration. The workload is a subset of the standard workload: 100K small files, where each fileset of 10K files consists of mostly small files (5 bytes to about 10 KB) plus a single somewhat larger file (3 MB).

We ran the PUT phase to create the files on the servers and then ran the GET phase repeatedly, varying the settings of some TCP tunables. We did not drop caches between runs: all the files are served from the servers' page cache.

In the default configuration, the run completes but incurs about 38,000 errors, so we GET only about 60% of the files. That behavior is consistent across clients. Turning on a couple of TCP tunables, net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle, which modify the behavior of the TCP code with respect to TIME_WAIT sockets, makes a marked difference: when either (or both) is turned on, the number of errors drops to 0 and the time it takes to complete the run drops by about 25%.

The accompanying graph shows the number of sockets in TIME_WAIT state during the runs, and the subsequent recovery as the sockets transition out of that state. The vertical lines mark the (rough) completion times of the GETs for three of the configurations. The fourth configuration, with both tunables turned on, roughly coincides with the one where only tcp_tw_recycle was turned on; all three of the "good" configurations completed the GETs within a couple of seconds of each other.

Most, if not all, of the sockets are localhost-only: they connect the local Swift proxy workers to the local Swift object workers. Unfortunately, the tunables affect *all* sockets on the system.
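For reference, the number of sockets in TIME_WAIT can be sampled directly from /proc on Linux. This is an illustrative sketch of how such counts can be collected, not necessarily how the attached graph was produced (the `ss -tan state time-wait` command from iproute2 is an alternative):

```shell
#!/bin/sh
# Count sockets currently in TIME_WAIT by reading /proc/net/tcp and /proc/net/tcp6.
# Column 4 ("st") holds the TCP connection state in hex; TIME_WAIT is 06.
count=$(awk '$4 == "06"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l)
echo "TIME_WAIT sockets: $count"

# To produce a time series (as in the attached graph), sample in a loop, e.g.:
#   while sleep 1; do awk '$4 == "06"' /proc/net/tcp /proc/net/tcp6 | wc -l; done
```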
Nevertheless, we think the default setting of at least one tunable should be changed; otherwise gluster-swift falls down rather badly under this workload. Setting tcp_tw_reuse is probably the most conservative option: it leaves the behavior closer to the default, but still allows TIME_WAIT sockets to be reused when necessary.

Version-Release number of selected component (if applicable):

How reproducible:
Always.

Steps to Reproduce:
1. Please contact me if you need to reproduce.

Actual results:

Expected results:

Additional info:
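The workaround described above can be applied at runtime with sysctl; a sketch follows (requires root). Note for later readers: tcp_tw_recycle is known to break clients behind NAT and was removed entirely in Linux 4.12, so tcp_tw_reuse is the only one of the two still available on modern kernels:

```shell
# Apply at runtime (does not survive reboot):
sysctl -w net.ipv4.tcp_tw_reuse=1

# More aggressive; breaks NAT'd clients, removed in Linux 4.12:
#   sysctl -w net.ipv4.tcp_tw_recycle=1

# To persist across reboots, add to /etc/sysctl.conf (or a file
# under /etc/sysctl.d/) and reload with `sysctl -p`:
#   net.ipv4.tcp_tw_reuse = 1
```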