Bug 985250 - [Perf] Catalyst workload execution "Page allocation errors"
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-swift
Version: 2.1
Hardware: x86_64 Linux
Priority: high, Severity: medium
Assigned To: Luis Pabón
pushpesh sharma
Depends On:
Blocks:
 
Reported: 2013-07-17 03:49 EDT by pushpesh sharma
Modified: 2016-11-08 17:25 EST (History)
6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-23 18:32:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs & conf & reports (13.87 MB, application/x-bzip)
2013-07-17 03:53 EDT, pushpesh sharma
swift confs (4.05 KB, application/x-bzip)
2013-07-17 03:54 EDT, pushpesh sharma

Description pushpesh sharma 2013-07-17 03:49:52 EDT
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. I ran the Catalyst workload on a volume on localhost.

2. Volume info looks like this:

[root@dhcp207-9 ~]# gluster volume info
 
Volume Name: test
Type: Distribute
Volume ID: 440fdac0-a3bd-4ab1-a70c-f4c390d97100
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.65.207.9:/mnt/lv1/lv1
Brick2: 10.65.207.9:/mnt/lv2/lv2
Brick3: 10.65.207.9:/mnt/lv3/lv3
Brick4: 10.65.207.9:/mnt/lv4/lv4
 
Volume Name: test2
Type: Distribute
Volume ID: 6d922203-6657-4ed3-897a-069ef6c396bf
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.65.207.9:/mnt/lv5/lv5
Brick2: 10.65.207.9:/mnt/lv6/lv6
Brick3: 10.65.207.9:/mnt/lv7/lv7
Brick4: 10.65.207.9:/mnt/lv8/lv8
[root@dhcp207-9 ~]# 

3. All the conf files, including the Catalyst workload conf (cmd file), are attached.

4. As per the discussion, a few things need to be fixed.

As per Peter:

We need at least one BZ for the container listing code, as it is not ignoring
the temporary files created by PUT operations. Definite bug, and one which
affects the PDQ code as well.

Jul 11 19:30:07 dhcp207-9 container-server ERROR __call__ error with HEAD /test/0/AUTH_test/catalyst :
 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/swift/container/server.py", line 519, in __call__
     res = method(req)
   File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1558, in wrapped
     return func(*a, **kw)
   File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 520, in _timing_stats
     resp = func(ctrl, *args, **kwargs)
   File "/usr/lib/python2.6/site-packages/swift/container/server.py", line 290, in HEAD
     info = broker.get_info()
   File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskDir.py", line 424, in get_info
     self._update_object_count()
   File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskDir.py", line 395, in _update_object_count
     objects, object_count, bytes_used = get_container_details(self.datadir)
   File "/usr/lib/python2.6/site-packages/gluster/swift/common/utils.py", line 291, in get_container_details
     bytes_used, obj_list)
   File "/usr/lib/python2.6/site-packages/gluster/swift/common/utils.py", line 271, in update_list
     obj_list)
   File "/usr/lib/python2.6/site-packages/gluster/swift/common/utils.py", line 260, in _update_list
     bytes_used += os_path.getsize(os.path.join(path, obj_name))
   File "/usr/lib64/python2.6/genericpath.py", line 49, in getsize
     return os.stat(filename).st_size
 OSError: [Errno 2] No such file or directory: '/mnt/gluster-object/test/catalyst/run0/insightdemo12/docs/20111114_0/job332380003/.3323800004474.htm.826883c0b80fcc76efd5fa2dfa615407'
 (txn: tx33d22d160ba94addb23698f9902a2799)

 Notice that the file name is of the pattern "."<name>.<random>, which is a temporary file name.
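For context: the failing frame is _update_list() in gluster/swift/common/utils.py, which stats every directory entry while summing bytes_used, so a transient PUT temp file that is renamed or deleted mid-listing fails the whole HEAD. Below is a minimal, standalone sketch of the kind of guard that would avoid this; the function names, the simplified signature, and the dot-prefix check are illustrative only and are not the actual gluster-swift code or the eventual fix.

import errno
import os

def _is_tmp_obj(name):
    # Object PUT writes to ".<name>.<random>" and renames the file into
    # place on commit, so dot-prefixed entries are treated as temporary here.
    return name.startswith('.')

def _sum_container_bytes(path, names):
    # Illustrative stand-in for the size-accounting loop in the traceback:
    # skip temp files and tolerate entries that vanish during the walk.
    obj_list = []
    bytes_used = 0
    for name in names:
        if _is_tmp_obj(name):
            continue
        try:
            bytes_used += os.path.getsize(os.path.join(path, name))
        except OSError as err:
            if err.errno != errno.ENOENT:
                raise
            continue  # renamed or deleted while we were listing
        obj_list.append(name)
    return obj_list, bytes_used

Either guard alone would have prevented the OSError above; filtering the temp names also keeps them out of the container listing itself, which is the bug Peter describes.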

Actual results:

[root@dhcp207-9 ~]# rpm -qa|grep gluster
gluster-swift-object-1.8.0-6.3.el6rhs.noarch
vdsm-gluster-4.10.2-22.7.el6rhs.noarch
gluster-swift-plugin-1.8.0-2.el6rhs.noarch
glusterfs-geo-replication-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-3.4.0.12rhs.beta3-1.el6rhs.x86_64
gluster-swift-1.8.0-6.3.el6rhs.noarch
glusterfs-server-3.4.0.12rhs.beta3-1.el6rhs.x86_64
gluster-swift-proxy-1.8.0-6.3.el6rhs.noarch
gluster-swift-account-1.8.0-6.3.el6rhs.noarch
glusterfs-rdma-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.12rhs.beta3-1.el6rhs.x86_64
gluster-swift-container-1.8.0-6.3.el6rhs.noarch

Expected results:


Additional info:
Comment 1 pushpesh sharma 2013-07-17 03:53:19 EDT
Created attachment 774654 [details]
logs & conf & reports
Comment 2 pushpesh sharma 2013-07-17 03:54:04 EDT
Created attachment 774655 [details]
swift confs
Comment 4 pushpesh sharma 2013-07-19 02:41:02 EDT
The problem arises almost consistently when gluster volume usage reaches 70-71%, with both accurate_size_list=on and off.

Jul 18 21:52:11 dhcp207-9 kernel: swift-proxy-ser: page allocation failure. order:1, mode:0x20
Jul 18 21:52:11 dhcp207-9 kernel: Pid: 4604, comm: swift-proxy-ser Not tainted 2.6.32-358.11.1.el6.x86_64 #1
Jul 18 21:52:11 dhcp207-9 kernel: Call Trace:
Jul 18 21:52:11 dhcp207-9 kernel: <IRQ>  [<ffffffff8112c157>] ? __alloc_pages_nodemask+0x757/0x8d0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81166b02>] ? kmem_getpages+0x62/0x170
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8116771a>] ? fallback_alloc+0x1ba/0x270
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81167499>] ? ____cache_alloc_node+0x99/0x160
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8116841b>] ? kmem_cache_alloc+0x11b/0x190
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81439f18>] ? sk_prot_alloc+0x48/0x1c0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8143aff2>] ? sk_clone+0x22/0x2e0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81489f36>] ? inet_csk_clone+0x16/0xd0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814a2ea3>] ? tcp_create_openreq_child+0x23/0x450
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814a069d>] ? tcp_v4_syn_recv_sock+0x4d/0x310
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814a2c46>] ? tcp_check_req+0x226/0x460
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81498606>] ? tcp_rcv_state_process+0x126/0xa10
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814a013b>] ? tcp_v4_do_rcv+0x35b/0x430
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814a194e>] ? tcp_v4_rcv+0x4fe/0x8d0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81065905>] ? enqueue_entity+0x125/0x410
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8147f6ed>] ? ip_local_deliver_finish+0xdd/0x2d0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8147f978>] ? ip_local_deliver+0x98/0xa0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8147ee3d>] ? ip_rcv_finish+0x12d/0x440
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8147f3c5>] ? ip_rcv+0x275/0x350
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8144858b>] ? __netif_receive_skb+0x4ab/0x750
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814488ca>] ? process_backlog+0x9a/0x100
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8144d133>] ? net_rx_action+0x103/0x2f0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jul 18 21:52:11 dhcp207-9 kernel: <EOI>  [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff810778fa>] ? local_bh_enable_ip+0x9a/0xb0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff815106db>] ? _spin_unlock_bh+0x1b/0x20
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8143905e>] ? release_sock+0xce/0xe0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814b0c28>] ? inet_stream_connect+0x68/0x2c0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81436607>] ? sys_connect+0xd7/0xf0
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8117dd77>] ? fd_install+0x47/0x90
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81434e2a>] ? sock_map_fd+0x2a/0x40
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8121bd46>] ? security_file_fcntl+0x16/0x20
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff81194768>] ? sys_fcntl+0x118/0x530
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff814354a1>] ? sys_socket+0x51/0x80
Jul 18 21:52:11 dhcp207-9 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

#########################################################

[root@dhcp207-9 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_dhcp2079-lv_root
                       13G  2.3G  9.3G  20% /
tmpfs                 3.6G     0  3.6G   0% /dev/shm
/dev/vda1             485M   32M  428M   7% /boot
/dev/mapper/vg-lv1     10G  7.2G  2.9G  72% /mnt/lv1
/dev/mapper/vg-lv2     10G  7.3G  2.8G  73% /mnt/lv2
/dev/mapper/vg-lv3     10G  7.1G  3.0G  71% /mnt/lv3
/dev/mapper/vg-lv4     10G  7.0G  3.1G  70% /mnt/lv4
/dev/mapper/vg-lv5     10G   52M   10G   1% /mnt/lv5
/dev/mapper/vg-lv6     10G   52M   10G   1% /mnt/lv6
/dev/mapper/vg-lv7     10G   52M   10G   1% /mnt/lv7
/dev/mapper/vg-lv8     10G   52M   10G   1% /mnt/lv8
/dev/mapper/vg-lv9     10G   33M   10G   1% /mnt/lv9
10.65.207.9:test       40G   29G   12G  71% /mnt/gluster-object/test
10.65.207.9:test2      40G  206M   40G   1% /mnt/gluster-object/test2


Process: swift-container is in a dead state with 3.4 GB resident memory (RES), almost 50% of total RAM.

All REST requests thereafter result in errors.
Comment 5 pushpesh sharma 2013-07-19 02:50:25 EDT
I am still able to create new data on the gluster volume: I created 5-10 MB of data on the volume using the mount point, so the volume is not full.
Comment 6 Luis Pabón 2013-07-26 08:17:22 EDT
The page allocation error occurred because container listing consumed all the RAM.  Container listing should be off for the catalyst workload.  Large container listing is something we would like to fix for the next version.
Comment 7 Luis Pabón 2013-07-29 16:15:10 EDT
AFAIK, this issue will truly be fixed in future releases when we can do accurate listings on containers with a massive number of objects.

Currently, the fix for this version is to:
1. Change the default value of fs.conf::accurate_container_listing to off (see the sketch below)
2. Apply the fix from Bug 988969
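
For reference, a minimal sketch of what item 1 looks like in the gluster-swift filesystem config; the option name comes from this comment, while the /etc/swift/fs.conf path and the [DEFAULT] section header are assumptions about where the setting lives:

# /etc/swift/fs.conf (sketch; file path and section name assumed)
[DEFAULT]
accurate_container_listing = off

With accurate listing off, the container server should no longer walk the whole container directory on disk to compute exact object counts and byte totals, which is what comment 6 identifies as the source of the memory consumption.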
Comment 8 Luis Pabón 2013-07-29 16:16:29 EDT
Please also create a new bug for the issue highlighted by the traceback above. That will be a lower-priority issue: when accurate_container_listing is ON, we may also be picking up information about temporary files.
Comment 9 pushpesh sharma 2013-08-05 05:36:14 EDT
Catalyst workload execution with the new set of configuration values recommended for performance completed successfully without any errors. The issue appears to be fixed with accurate_container_listing=off and log_request=off.

[root@luigi ~]# rpm -qa|grep swift
gluster-swift-container-1.8.0-6.11.el6rhs.noarch
gluster-swift-1.8.0-6.11.el6rhs.noarch
gluster-swift-proxy-1.8.0-6.11.el6rhs.noarch
gluster-swift-account-1.8.0-6.11.el6rhs.noarch
gluster-swift-object-1.8.0-6.11.el6rhs.noarch
gluster-swift-plugin-1.8.0-4.el6rhs.noarch

The Catalyst workload was executed on a test setup with a 3*2 volume type.

Marking as verified, as the issue is no longer visible on the latest test build.
Comment 12 Scott Haines 2013-09-23 18:32:31 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
