Bug 806841
| Summary: | object-storage: GET for large data set fails | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Saurabh <saujain> |
| Component: | object-storage | Assignee: | Junaid <junaid> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Saurabh <saujain> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | pre-release | CC: | divya, gluster-bugs, mzywusko, redhat, rfortier, vagarwal |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-07-24 17:57:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | DP | CRM: | |
| Verified Versions: | 3.3.0qa45 | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 817967 | | |
Description
Saurabh
2012-03-26 10:41:43 UTC
This issue can be resolved if changes are made to the configuration files; this includes adding the variable node_timeout = 60 to the proxy-server and container-server config files.

Saurabh, the issue was a bottleneck in the code, which is fixed in the new release; it also requires some tuning in the configuration files that ship with the new RPMs.

Created 20000 files of size 1MB inside a subdir of a container and tried to list the container objects, but the result is this:

[root@QA-39 object-dir]# curl -v -H 'X-Storage-Token: AUTH_tk92ce6a0a1224460dbd30a0146a767877' http://172.17.251.90:8080/v1/AUTH_test/cont1/
* About to connect() to 172.17.251.90 port 8080 (#0)
* Trying 172.17.251.90... connected
* Connected to 172.17.251.90 (172.17.251.90) port 8080 (#0)
> GET /v1/AUTH_test/cont1/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.12.9.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: 172.17.251.90:8080
> Accept: */*
> X-Storage-Token: AUTH_tk92ce6a0a1224460dbd30a0146a767877
>
< HTTP/1.1 503 Internal Server Error
< Content-Type: text/html; charset=UTF-8
< Content-Length: 0
< Date: Fri, 01 Jun 2012 00:30:54 GMT
<
* Connection #0 to host 172.17.251.90 left intact
* Closing connection #0

Altogether, for a large number of objects it is failing to list them; this was not seen earlier.

Can you please mention the GlusterFS version and gluster-swift version? Also, the type of GlusterFS volume used, the hierarchy structure of the objects (number of subdirs in the object path) and the timeout values.

GlusterFS version: 3.3.0qa45
gluster-object: swift-rc1-rpms
glusterfs volume: distribute-replicate (2x2)
machine type: four physical nodes with 48GB RAM and 24 CPUs
I didn't modify the node_timeout values, as you mentioned that the issue has been fixed by changes in the code.

Tried with the swift-rc-rpms and 3.3.0qa45. The proxy-server config files were updated with tunables like:
node_timeout = 120
conn_timeout = 5
workers = 4
backlog = 10000
The GET over 10000 objects works and lists all the objects. Below I am just showing the result of the HEAD:

[root@QA-31 ~]# curl -v -H 'X-Auth-Token: AUTH_tk7fa5b73946254642a5c916a296392407' http://10.16.157.63:8080/v1/AUTH_test/cont2/ -X HEAD
* About to connect() to 10.16.157.63 port 8080 (#0)
* Trying 10.16.157.63... connected
* Connected to 10.16.157.63 (10.16.157.63) port 8080 (#0)
> HEAD /v1/AUTH_test/cont2/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.12.9.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: 10.16.157.63:8080
> Accept: */*
> X-Auth-Token: AUTH_tk7fa5b73946254642a5c916a296392407
>
< HTTP/1.1 204 No Content
< X-Container-Object-Count: 10000
< X-Container-Bytes-Used: 10485760000
< Accept-Ranges: bytes
< Content-Length: 0
< Date: Fri, 08 Jun 2012 11:24:33 GMT
<
* Connection #0 to host 10.16.157.63 left intact
* Closing connection #0

I have tried to collect the info for the latest RPMs and found that with the tunables set as in the rc-rpm config the GET for 10000 files works, but without the tunables the GET still fails. I also tried to find the maximum number of objects that can be listed with stat-prefetch on and off:
with stat-prefetch on, 2000 objects got listed
with stat-prefetch off, the response is "503 Internal Server Error"
with stat-prefetch on again, the response is "503 Internal Server Error"
Even if you reduce the number of files to 1900 or 1800 with stat-prefetch on, the issue remains the same.
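For anyone retracing the stat-prefetch experiments above, the option is a GlusterFS volume setting toggled with the standard gluster CLI. A minimal sketch follows; the volume name `vol0` is a placeholder and does not come from this bug.

```
# Sketch of the stat-prefetch toggling described above; "vol0" is a
# placeholder for the volume backing the gluster-swift account.

# Turn the stat-prefetch translator off, re-run the container GET,
# then turn it back on and compare how many objects list cleanly.
gluster volume set vol0 performance.stat-prefetch off
gluster volume set vol0 performance.stat-prefetch on

# Verify the current value (shown under "Options Reconfigured").
gluster volume info vol0
```

As reported above, the 503 responses appear with the option both on and off once the object count grows, which points back at the server-side timeouts rather than stat-prefetch alone.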
It seems there is some issue with stat-prefetch itself; the vol config files were updated with the correct stat-prefetch on/off settings.

Since the load that you are applying is high, to successfully complete the request we should tune the conf files. So, I have tested the GET with some tunables, giving larger values for node_timeout, client_timeout, conn_timeout, workers and backlog, and the GET passed for 10000 objects.

Has anyone thought about what happens if you've got a container with five million files (or more, perhaps 25 million) in it, contained in a directory tree that's a few levels deep and has thousands of directories? Is this completely impossible?
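To make the tuning mentioned throughout this bug concrete, here is a minimal sketch of where those values would typically be set, assuming a stock Swift/gluster-swift layout with configs under /etc/swift/. The file paths, section names and the restart command are assumptions, not quoted from the bug; the numbers are the ones reported in the comments, except client_timeout, for which no value was given.

```
# Sketch only, assuming a standard Swift layout under /etc/swift/.
# Section placement varies by release: workers/backlog normally sit in
# [DEFAULT], while the timeouts belong to the proxy app section.
#
# /etc/swift/proxy-server.conf
#   [DEFAULT]
#   workers = 4            # worker processes serving requests
#   backlog = 10000        # listen backlog for queued connections
#
#   [app:proxy-server]
#   node_timeout = 120     # value used in the verification comments (the resolution note suggests 60)
#   conn_timeout = 5       # time allowed to open a backend connection
#   client_timeout = 60    # placeholder; the bug names the knob but gives no value
#
# /etc/swift/container-server.conf
#   node_timeout = 60      # per the resolution note at the top of this bug

# Restart the Swift services so the new values take effect.
swift-init main restart
```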