Bug 823600

Summary: occasional timeouts in openswift services when enough concurrent requests are issued
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ben England <bengland>
Component: gluster-swift
Assignee: Luis Pabón <lpabon>
Status: CLOSED CURRENTRELEASE
QA Contact: Saurabh <saujain>
Severity: medium
Priority: unspecified
Version: 2.0
CC: gluster-bugs, madam, mzywusko, rhs-bugs
Doc Type: Bug Fix
Last Closed: 2013-10-10 10:54:40 UTC
Type: Bug
Attachments:
generates traffic from multiple clients to servers using curl (no flags)

Description Ben England 2012-05-21 15:49:39 UTC
Created attachment 585861 [details]
generates traffic from multiple clients to servers using curl

Description of problem:

With very large object sizes, I can still get HTTP 503 (Service Unavailable) errors on PUT requests.  This seems to be related to a "chunk write timeout" in the logs.  It may be related to the fact that the proxy server has been configured with 8 workers while the other servers have not been similarly configured.
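If the worker-count mismatch is the culprit, one avenue would be to give the backend servers a matching setting. The fragment below is a hypothetical sketch, not taken from this report; the file path and values are assumptions:

```ini
# /etc/swift/object-server.conf  (likewise container-server.conf and
# account-server.conf, which currently run with the default worker count)
[DEFAULT]
# Match the proxy's worker count so backend servers can absorb the same
# level of concurrency the proxy generates.
workers = 8
```

Raising the proxy's node_timeout could also be tried if the backend chunk writes are merely slow rather than stuck, though that only masks the imbalance.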

On other runs not recorded here, I have seen timeouts from account-server requests as well.

I think this problem is different from the previous HTTP 503 error condition, which occurred much more readily.

Note that this error did not occur until this object-size threshold and thread count were reached in the run.

Workload: 1 GB/object, 8 clients, 16 threads (2/client), 4 objects/thread.

Configuration: 4 servers, 2-replica volume, swift configuration changes are:



Version-Release number of selected component (if applicable):

 -- gprfs005 -- 
gluster-swift-object-1.4.8-3.el6.noarch
gluster-swift-1.4.8-3.el6.noarch
gluster-swift-container-1.4.8-3.el6.noarch
gluster-swift-proxy-1.4.8-3.el6.noarch
gluster-swift-account-1.4.8-3.el6.noarch
gluster-swift-plugin-1.0-3.noarch
glusterfs-3.3.0qa38-1.el6.x86_64
glusterfs-fuse-3.3.0qa38-1.el6.x86_64
glusterfs-server-3.3.0qa38-1.el6.x86_64
glusterfs-geo-replication-3.3.0qa38-1.el6.x86_64
glusterfs-rdma-3.3.0qa38-1.el6.x86_64

How reproducible:

Use the attached Python script, which generates a curl workload (also available at http://perf1.lab.bos.redhat.com/bengland/laptop/matte/scalability-cluster/shared/benchmarks/gluster_test/ufo/parallel_curl.py ).
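The attached script is not reproduced here, but the general shape of such a driver is roughly the following sketch. All URLs, the auth token, container name, and source file path are placeholders, not details from the actual script:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def curl_put_cmd(server, token, obj_name, src_file):
    # Hypothetical Swift PUT via curl; account/container layout is assumed.
    # -w %{http_code} makes curl print only the HTTP status code.
    return ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
            "-X", "PUT", "-H", f"X-Auth-Token: {token}",
            "-T", src_file,
            f"http://{server}:8080/v1/AUTH_test/cont/{obj_name}"]

def build_jobs(servers, threads, objects_per_thread):
    # Spread thread slots round-robin across servers, N objects per slot.
    jobs = []
    for t in range(threads):
        server = servers[t % len(servers)]
        for o in range(objects_per_thread):
            jobs.append(curl_put_cmd(server, "tok", f"obj_{t}_{o}", "/tmp/1g.bin"))
    return jobs

def run(jobs, concurrency):
    # Issue PUTs in parallel; anything other than 201 Created is an error,
    # matching the "ERROR: 503" lines in the sample output below.
    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        codes = list(ex.map(
            lambda c: subprocess.run(c, capture_output=True, text=True).stdout,
            jobs))
    errors = [c for c in codes if c != "201"]
    for c in errors:
        print("ERROR:", c)
    if errors:
        print(f"WARNING: {len(errors)} errors found")
```

A 16-thread run over 4 servers would be `run(build_jobs(servers, 16, 4), 16)`, producing the 64 PUTs described in the workload above.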

You should see something like this:
  
ERROR: 503
ERROR: 503
ERROR: 503
WARNING: 3 errors found
doing put 16 threads 4 objects/thread 1048576 KB/object
clients: ['gprfc009', 'gprfc010', 'gprfc011', 'gprfc012', 'gprfc013', 'gprfc014', 'gprfc015', 'gprfc016']
servers: ['gprfs005-10ge', 'gprfs006-10ge', 'gprfs007-10ge', 'gprfs008-10ge']
elapsed time =  152.65 sec
throughput =      0.42 objs/sec
transfer rate =    429.33 MB/s
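The reported figures are self-consistent with the workload parameters; a quick back-of-the-envelope check (pure arithmetic, using only the numbers shown above):

```python
# Workload: 16 threads x 4 objects/thread, 1048576 KB (1 GB) per object
threads = 16
objects_per_thread = 4
kb_per_object = 1048576
elapsed_sec = 152.65

total_objects = threads * objects_per_thread      # 64 objects
total_mb = total_objects * kb_per_object / 1024   # 65536 MB transferred

objs_per_sec = total_objects / elapsed_sec        # ~0.42, as reported
mb_per_sec = total_mb / elapsed_sec               # ~429 MB/s, as reported
```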


Additional info:

[root@gprfs005 swift]# more proxy-server.conf 
[DEFAULT]
#bind_port = 443
#cert_file = /etc/swift/cert.crt
#key_file = /etc/swift/cert.key
bind_port = 8080
user = root
workers = 8
[pipeline:main]
pipeline = healthcheck cache tempauth proxy-server

[app:proxy-server]
use = egg:swift#proxy
allow_account_management = true
account_autocreate = true
log_facility = LOG_LOCAL1
log_level = WARN
log_headers = False
conn_timeout = 5.0

[filter:tempauth]
use = egg:swift#tempauth
user_admin_admin = admin .admin .reseller_admin
user_testfs_tester = testing .admin
user_test2_tester2 = testing2 .admin
user_test_tester3 = testing3

[filter:healthcheck]
use = egg:swift#healthcheck

[filter:cache]
use = egg:swift#memcache
# THIS MUST BE UPDATED WHENEVER THE CONFIGURATION CHANGES
memcache_servers=gprfs005-10ge,gprfs006-10ge,gprfs007-10ge,gprfs008-10ge


[DEFAULT]
mount_path = /mnt/glusterfs
auth_account = auth
#ip of the fs server.
mount_ip = localhost
#fs server need not be local, remote server can also be used,
#set remote_cluster=yes for using remote server.
remote_cluster = no

errors logged by servers during the run are at:

http://perf1.lab.bos.redhat.com/bengland/laptop/matte/scalability-cluster/shared/benchmarks/gluster_test/ufo/ufo-errs.log

Comment 3 Ben England 2012-06-22 19:00:27 UTC
Per Sudhir's instructions I'm marking it verified.