Bug 1761423 - brick process (glusterfsd) killed due to OOM on volume create delete test
Summary: brick process (glusterfsd) killed due to OOM on volume create delete test
Keywords:
Status: CLOSED DUPLICATE of bug 1790336
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-14 11:26 UTC by Nag Pavan Chilakam
Modified: 2020-06-10 12:12 UTC (History)

Fixed In Version: glusterfs-6.0-38
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-10 07:56:41 UTC
Embargoed:



Description Nag Pavan Chilakam 2019-10-14 11:26:42 UTC
Description of problem:
=======================
I am running a test that creates and deletes volumes of different types on a brick-mux setup.
Two volumes remain constant throughout the cycle: a 5 x (4+2) EC base volume (1_base_ecvol) and a 2x3 distributed-replicate base volume (1_basevol).
The test, which is run through a script, creates volumes of any type, i.e. single-brick, 1x3, nx3, 1x(4+2), nx(4+2), 1x(2+1) arbiter, and nx(2+1) arbiter volumes.
It creates about 100 volumes, starts them, stops them, and deletes them (note that the other 2 volumes always exist).

The test ran successfully for about 1 week without any issues, after which the EC volume bricks were OOM-killed on all nodes.




Version-Release number of selected component (if applicable):
===================
glusterfs-6.0-15.el7rhgs

How reproducible:
=============
Hit once, after about 1 week of the test cycle.

Steps to Reproduce:
====================
1. Set up a 6-node cluster with brick mux enabled.
2. Create 2 base volumes: 1_base_ecvol, 5 x (4 + 2) = 30 bricks, and 1_basevol, 2 x 3 = 6 bricks.
3. Run a test that indefinitely repeats {create 100 volumes, start, stop, delete}, with sleeps at each operation.
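
The actual test script is not attached to this report. As a rough illustration only, the cycle in the steps above could be sketched as the dry-run command generator below. The host names, brick root, volume names, shape labels, and the random choice of shape are all assumptions made for the sketch; only the gluster CLI keywords (replica, arbiter, disperse, redundancy, --mode=script) are real.

```python
import itertools
import random

# Hypothetical host names and brick root; the real ones are not in the report.
HOSTS = [f"node{i}" for i in range(1, 7)]   # 6-node cluster
BRICK_ROOT = "/gluster/brick1"

# Volume shapes named in the description: (create keywords, bricks per subvolume).
SHAPES = {
    "single": ([], 1),
    "replica3": (["replica", "3"], 3),
    "arbiter": (["replica", "3", "arbiter", "1"], 3),
    "ec4+2": (["disperse", "6", "redundancy", "2"], 6),
}

GLUSTER = ["gluster", "--mode=script", "volume"]  # script mode skips y/n prompts


def create_cmd(name, shape, subvols=1):
    """Build the 'volume create' command for `subvols` subvolumes of `shape`."""
    keywords, width = SHAPES[shape]
    hosts = itertools.cycle(HOSTS)
    bricks = [
        f"{next(hosts)}:{BRICK_ROOT}/{name}_b{i}" for i in range(subvols * width)
    ]
    return GLUSTER + ["create", name, *keywords, *bricks]


def lifecycle(name, shape, subvols=1):
    """Return the four commands a volume goes through in one cycle."""
    return [
        create_cmd(name, shape, subvols),
        GLUSTER + ["start", name],
        GLUSTER + ["stop", name],
        GLUSTER + ["delete", name],
    ]


def one_cycle(count=100, seed=0):
    """Create and start `count` volumes, then stop and delete all of them."""
    rng = random.Random(seed)
    plans = []
    for i in range(count):
        shape = rng.choice(sorted(SHAPES))
        subvols = 1 if shape == "single" else rng.randint(1, 3)
        plans.append(lifecycle(f"testvol_{i}", shape, subvols))
    bring_up = [cmd for plan in plans for cmd in plan[:2]]
    tear_down = [cmd for plan in plans for cmd in plan[2:]]
    return bring_up + tear_down


if __name__ == "__main__":
    for cmd in one_cycle(count=2):
        print(" ".join(cmd))
```

The sketch only prints commands. To drive a real cluster, each command list could be passed to subprocess.run with a sleep between operations, and one_cycle repeated indefinitely while the two base volumes stay up.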

Actual results:
===============
The EC volume bricks (glusterfsd) were OOM-killed on all nodes.




Additional info:
==================
[root@dhcp35-75 ~]# gluster v info 1_base_ecvol 
 
Volume Name: 1_base_ecvol
Type: Distributed-Disperse
Volume ID: aa4eeab4-1012-44f7-8b0b-c13a99e31346
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x (4 + 2) = 30
Transport-type: tcp
Bricks:
Brick1: dhcp35-75.lab.eng.blr.redhat.com:/gluster/brick2/1_base_ecvol
Brick2: dhcp35-194.lab.eng.blr.redhat.com:/gluster/brick2/1_base_ecvol
Brick3: dhcp35-173.lab.eng.blr.redhat.com:/gluster/brick2/1_base_ecvol
Brick4: dhcp35-108.lab.eng.blr.redhat.com:/gluster/brick2/1_base_ecvol
Brick5: dhcp35-42.lab.eng.blr.redhat.com:/gluster/brick2/1_base_ecvol
Brick6: dhcp35-182.lab.eng.blr.redhat.com:/gluster/brick2/1_base_ecvol
Brick7: dhcp35-75.lab.eng.blr.redhat.com:/gluster/brick3/1_base_ecvol
Brick8: dhcp35-194.lab.eng.blr.redhat.com:/gluster/brick3/1_base_ecvol
Brick9: dhcp35-173.lab.eng.blr.redhat.com:/gluster/brick3/1_base_ecvol
Brick10: dhcp35-108.lab.eng.blr.redhat.com:/gluster/brick3/1_base_ecvol
Brick11: dhcp35-42.lab.eng.blr.redhat.com:/gluster/brick3/1_base_ecvol
Brick12: dhcp35-182.lab.eng.blr.redhat.com:/gluster/brick3/1_base_ecvol
Brick13: dhcp35-75.lab.eng.blr.redhat.com:/gluster/brick4/1_base_ecvol
Brick14: dhcp35-194.lab.eng.blr.redhat.com:/gluster/brick4/1_base_ecvol
Brick15: dhcp35-173.lab.eng.blr.redhat.com:/gluster/brick4/1_base_ecvol
Brick16: dhcp35-108.lab.eng.blr.redhat.com:/gluster/brick4/1_base_ecvol
Brick17: dhcp35-42.lab.eng.blr.redhat.com:/gluster/brick4/1_base_ecvol
Brick18: dhcp35-182.lab.eng.blr.redhat.com:/gluster/brick4/1_base_ecvol
Brick19: dhcp35-75.lab.eng.blr.redhat.com:/gluster/brick5/1_base_ecvol
Brick20: dhcp35-194.lab.eng.blr.redhat.com:/gluster/brick5/1_base_ecvol
Brick21: dhcp35-173.lab.eng.blr.redhat.com:/gluster/brick5/1_base_ecvol
Brick22: dhcp35-108.lab.eng.blr.redhat.com:/gluster/brick5/1_base_ecvol
Brick23: dhcp35-42.lab.eng.blr.redhat.com:/gluster/brick5/1_base_ecvol
Brick24: dhcp35-182.lab.eng.blr.redhat.com:/gluster/brick5/1_base_ecvol
Brick25: dhcp35-75.lab.eng.blr.redhat.com:/gluster/brick6/1_base_ecvol
Brick26: dhcp35-194.lab.eng.blr.redhat.com:/gluster/brick6/1_base_ecvol
Brick27: dhcp35-173.lab.eng.blr.redhat.com:/gluster/brick6/1_base_ecvol
Brick28: dhcp35-108.lab.eng.blr.redhat.com:/gluster/brick6/1_base_ecvol
Brick29: dhcp35-42.lab.eng.blr.redhat.com:/gluster/brick6/1_base_ecvol
Brick30: dhcp35-182.lab.eng.blr.redhat.com:/gluster/brick6/1_base_ecvol
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.brick-multiplex: enable
[root@dhcp35-75 ~]# gluster v info 1_basevol
 
Volume Name: 1_basevol
Type: Distributed-Replicate
Volume ID: 8f5d0079-6a4e-4109-b1e3-8ca420aa01e5
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp35-75.lab.eng.blr.redhat.com:/gluster/brick1/1_basevol
Brick2: dhcp35-194.lab.eng.blr.redhat.com:/gluster/brick1/1_basevol
Brick3: dhcp35-173.lab.eng.blr.redhat.com:/gluster/brick1/1_basevol
Brick4: dhcp35-108.lab.eng.blr.redhat.com:/gluster/brick1/1_basevol
Brick5: dhcp35-42.lab.eng.blr.redhat.com:/gluster/brick1/1_basevol
Brick6: dhcp35-182.lab.eng.blr.redhat.com:/gluster/brick1/1_basevol
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
[root@dhcp35-75 ~]# rpm -qa|grep gluster
glusterfs-6.0-15.el7rhgs.x86_64
glusterfs-server-6.0-15.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-6.0-15.el7rhgs.x86_64
glusterfs-geo-replication-6.0-15.el7rhgs.x86_64
glusterfs-events-6.0-15.el7rhgs.x86_64
glusterfs-rdma-6.0-15.el7rhgs.x86_64
glusterfs-libs-6.0-15.el7rhgs.x86_64
glusterfs-api-6.0-15.el7rhgs.x86_64
glusterfs-fuse-6.0-15.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-23.el7_7.1.x86_64
vdsm-gluster-4.30.18-1.0.el7rhgs.x86_64
glusterfs-cli-6.0-15.el7rhgs.x86_64
python2-gluster-6.0-15.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64




[Fri Oct 11 21:54:08 2019] CPU: 5 PID: 24895 Comm: glfs_epoll00f Kdump: loaded Not tainted 3.10.0-1062.1.2.el7.x86_64 #1
[Fri Oct 11 21:54:08 2019] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[Fri Oct 11 21:54:08 2019] Call Trace:
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb792c2>] dump_stack+0x19/0x1b
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb73c64>] dump_header+0x90/0x229
[Fri Oct 11 21:54:08 2019]  [<ffffffffad70825b>] ? cred_has_capability+0x6b/0x120
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb86fed>] ? do_async_page_fault+0x6d/0xf0
[Fri Oct 11 21:54:08 2019]  [<ffffffffad5bfd74>] oom_kill_process+0x254/0x3e0
[Fri Oct 11 21:54:08 2019]  [<ffffffffad70833e>] ? selinux_capable+0x2e/0x40
[Fri Oct 11 21:54:08 2019]  [<ffffffffad5c05c6>] out_of_memory+0x4b6/0x4f0
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb7477c>] __alloc_pages_slowpath+0x5d6/0x724
[Fri Oct 11 21:54:08 2019]  [<ffffffffad5c6b84>] __alloc_pages_nodemask+0x404/0x420
[Fri Oct 11 21:54:08 2019]  [<ffffffffad618105>] alloc_pages_vma+0xb5/0x200
[Fri Oct 11 21:54:08 2019]  [<ffffffffad5e5407>] ? anon_vma_interval_tree_insert+0x97/0xa0
[Fri Oct 11 21:54:08 2019]  [<ffffffffad5efa54>] handle_pte_fault+0x984/0xe20
[Fri Oct 11 21:54:08 2019]  [<ffffffffad5f200d>] handle_mm_fault+0x39d/0x9b0
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb87653>] __do_page_fault+0x213/0x500
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb87a26>] trace_do_page_fault+0x56/0x150
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb86fa2>] do_async_page_fault+0x22/0xf0
[Fri Oct 11 21:54:08 2019]  [<ffffffffadb837a8>] async_page_fault+0x28/0x30
[Fri Oct 11 21:54:08 2019] Mem-Info:
[Fri Oct 11 21:54:08 2019] active_anon:1503183 inactive_anon:293959 isolated_anon:0
 active_file:0 inactive_file:0 isolated_file:0
 unevictable:14123 dirty:8 writeback:0 unstable:0
 slab_reclaimable:23017 slab_unreclaimable:58554
 mapped:6476 shmem:3411 pagetables:34129 bounce:0
 free:25763 free_pcp:4041 free_cma:0
[Fri Oct 11 21:54:08 2019] Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Fri Oct 11 21:54:08 2019] lowmem_reserve[]: 0 2812 7799 7799
[Fri Oct 11 21:54:08 2019] Node 0 DMA32 free:44028kB min:24320kB low:30400kB high:36480kB active_anon:2063388kB inactive_anon:517544kB active_file:0kB inactive_file:0kB unevictable:23496kB isolated(anon):0kB isolated(file):0kB present:3129328kB managed:2882956kB mlocked:23496kB dirty:0kB writeback:0kB mapped:8048kB shmem:4612kB slab_reclaimable:37468kB slab_unreclaimable:89748kB kernel_stack:15600kB pagetables:48172kB unstable:0kB bounce:0kB free_pcp:6844kB local_pcp:616kB free_cma:0kB writeback_tmp:0kB pages_scanned:14 all_unreclaimable? yes
[Fri Oct 11 21:54:08 2019] lowmem_reserve[]: 0 0 4987 4987
[Fri Oct 11 21:54:08 2019] Node 0 Normal free:43116kB min:43128kB low:53908kB high:64692kB active_anon:3949344kB inactive_anon:658292kB active_file:0kB inactive_file:0kB unevictable:32996kB isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5106928kB mlocked:32996kB dirty:32kB writeback:0kB mapped:17856kB shmem:9032kB slab_reclaimable:54600kB slab_unreclaimable:144468kB kernel_stack:8848kB pagetables:88344kB unstable:0kB bounce:0kB free_pcp:9320kB local_pcp:684kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? yes
[Fri Oct 11 21:54:08 2019] lowmem_reserve[]: 0 0 0 0
[Fri Oct 11 21:54:08 2019] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[Fri Oct 11 21:54:08 2019] Node 0 DMA32: 4563*4kB (UE) 1430*8kB (UE) 101*16kB (UE) 113*32kB (UEM) 89*64kB (UEM) 19*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43052kB
[Fri Oct 11 21:54:08 2019] Node 0 Normal: 10234*4kB (UEM) 2*8kB (E) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 40952kB
[Fri Oct 11 21:54:08 2019] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Fri Oct 11 21:54:08 2019] 19333 total pagecache pages
[Fri Oct 11 21:54:08 2019] 12920 pages in swap cache
[Fri Oct 11 21:54:08 2019] Swap cache stats: add 2933098, delete 2920182, find 1907978/2049507
[Fri Oct 11 21:54:08 2019] Free swap  = 0kB
[Fri Oct 11 21:54:08 2019] Total swap = 8257532kB
[Fri Oct 11 21:54:08 2019] 2097050 pages RAM
[Fri Oct 11 21:54:08 2019] 0 pages HighMem/MovableOnly
[Fri Oct 11 21:54:08 2019] 95602 pages reserved
[Fri Oct 11 21:54:08 2019] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Fri Oct 11 21:54:08 2019] [ 1093]     0  1093    13867     5143      32       51             0 systemd-journal
[Fri Oct 11 21:54:08 2019] [ 1128]     0  1128     9608      634      24      621         -1000 systemd-udevd
[Fri Oct 11 21:54:08 2019] [ 1170]     0  1170    80723     1928      39        0         -1000 multipathd
[Fri Oct 11 21:54:08 2019] [ 1536]     0  1536   263111    12924      67        0         -1000 dmeventd
[Fri Oct 11 21:54:08 2019] [ 2197]     0  2197    13882      106      29       87         -1000 auditd
[Fri Oct 11 21:54:08 2019] [ 2224]     0  2224     6620      280      18       43             0 systemd-logind
[Fri Oct 11 21:54:08 2019] [ 2228]     0  2228     4225      175      14       44             0 alsactl
[Fri Oct 11 21:54:08 2019] [ 2229]   999  2229   153245      356      61     1673             0 polkitd
[Fri Oct 11 21:54:08 2019] [ 2238]     0  2238     1097       97       8       34             0 acpid
[Fri Oct 11 21:54:08 2019] [ 2239]     0  2239    13224      277      30      181             0 smartd
[Fri Oct 11 21:54:08 2019] [ 2240]     0  2240    57069      272      62      483             0 abrtd
[Fri Oct 11 21:54:08 2019] [ 2245]     0  2245     5408      233      16       41             0 irqbalance
[Fri Oct 11 21:54:08 2019] [ 2248]    81  2248    16644      377      34      102          -900 dbus-daemon
[Fri Oct 11 21:54:08 2019] [ 2259]     0  2259    22641      161      47      227             0 rngd
[Fri Oct 11 21:54:08 2019] [ 2261]     0  2261    56438      247      61      360             0 abrt-watch-log
[Fri Oct 11 21:54:08 2019] [ 2276]    32  2276    17319      165      37      106             0 rpcbind
[Fri Oct 11 21:54:08 2019] [ 2281]   998  2281    29451      236      28       87             0 chronyd
[Fri Oct 11 21:54:08 2019] [ 2283]     0  2283    90658      534      99     6418             0 firewalld
[Fri Oct 11 21:54:08 2019] [ 2290]     0  2290    48776      116      35      130             0 gssproxy
[Fri Oct 11 21:54:08 2019] [ 2303]     0  2303   156444      497      91      267             0 NetworkManager
[Fri Oct 11 21:54:08 2019] [ 2459]     0  2459    25724      324      52      457             0 dhclient
[Fri Oct 11 21:54:08 2019] [ 2672]     0  2672    28230      270      57      256         -1000 sshd
[Fri Oct 11 21:54:08 2019] [ 2678]     0  2678   146587      401     101     3235             0 tuned
[Fri Oct 11 21:54:08 2019] [ 2681]     0  2681    28928      121      13       16             0 rhsmcertd
[Fri Oct 11 21:54:08 2019] [ 2704]     0  2704   251168      749     153     1876             0 libvirtd
[Fri Oct 11 21:54:08 2019] [ 2716]     0  2716     6477      143      18       48             0 atd
[Fri Oct 11 21:54:08 2019] [ 2719]     0  2719    27527      164      10       33             0 agetty
[Fri Oct 11 21:54:08 2019] [ 2922]     0  2922    22424      179      44      239             0 master
[Fri Oct 11 21:54:08 2019] [ 2938]    89  2938    22494      254      44      235             0 qmgr
[Fri Oct 11 21:54:08 2019] [ 3029]     0  3029    26993      136      10        6             0 rhnsd
[Fri Oct 11 21:54:08 2019] [17378]     0 17378    39195      345      81      319             0 sshd
[Fri Oct 11 21:54:08 2019] [17382]     0 17382    28913      329      13       43             0 bash
[Fri Oct 11 21:54:08 2019] [18086]     0 18086    31573      247      19      130             0 crond
[Fri Oct 11 21:54:08 2019] [18089]     0 18089    64863     2497      56      154             0 rsyslogd
[Fri Oct 11 21:54:08 2019] [18097]   994 18097    11224      165      26      178             0 nrpe
[Fri Oct 11 21:54:08 2019] [18762]     0 18762   155079    18561     121    15677             0 glusterd
[Fri Oct 11 21:54:08 2019] [18952]     0 18952    39195      357      81      310             0 sshd
[Fri Oct 11 21:54:08 2019] [18956]     0 18956    28913      329      13       44             0 bash
[Fri Oct 11 21:54:08 2019] [19007]     0 19007    32009      183      19      117             0 screen
[Fri Oct 11 21:54:08 2019] [19008]     0 19008    29087      466      13       86             0 bash
[Fri Oct 11 21:54:08 2019] [19123]     0 19123 31593224   903706   17115  1063764             0 glusterfsd
[Fri Oct 11 21:54:08 2019] [19252]     0 19252 27399980   814989   14788   953595             0 glusterfsd
[Fri Oct 11 21:54:08 2019] [21285]     0 21285    32009      167      16      122             0 screen
[Fri Oct 11 21:54:08 2019] [21286]     0 21286    28922      237      13      144             0 bash
[Fri Oct 11 21:54:08 2019] [26598]     0 26598    33608      264      21     4657             0 bash
[Fri Oct 11 21:54:08 2019] [ 6036]    89  6036    22450      441      45        0             0 pickup
[Fri Oct 11 21:54:08 2019] [ 1124]     0  1124    26989       74       9        0             0 sleep
[Fri Oct 11 21:54:08 2019] [ 3568]     0  3568  2384508    39899     449        0             0 glusterfs
[Fri Oct 11 21:54:08 2019] [ 3643]     0  3643    26989       74      10        0             0 sleep
[Fri Oct 11 21:54:08 2019] Out of memory: Kill process 19123 (glusterfsd) score 473 or sacrifice child
[Fri Oct 11 21:54:08 2019] Killed process 19123 (glusterfsd), UID 0, total-vm:126372896kB, anon-rss:3614824kB, file-rss:0kB, shmem-rss:0kB
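
For reading the table above: total_vm, rss, nr_ptes and swapents are counted in pages, so with the 4 kB page size of x86_64 the rss of PID 19123 (903706 pages) is the 3614824 kB reported as anon-rss in the kill line. The sketch below parses a few rows copied from the log, converts them to kB, and approximates the kernel's score. The scoring formula is an assumption about the 3.10-era heuristic (rss + page tables + swap entries, a 3% discount for privileged processes, normalised to managed RAM plus swap), and the uid == 0 check stands in for the kernel's capability check. It is not taken from this report.

```python
import re

PAGE_KB = 4  # base page size on x86_64, in kB

# Matches one row of the per-process table in the OOM report.
ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)\s+"
    r"(?P<rss>\d+)\s+(?P<nr_ptes>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)\s*$"
)

# Rows copied verbatim from the log above.
SAMPLE = """\
[Fri Oct 11 21:54:08 2019] [18762]     0 18762   155079    18561     121    15677             0 glusterd
[Fri Oct 11 21:54:08 2019] [19123]     0 19123 31593224   903706   17115  1063764             0 glusterfsd
[Fri Oct 11 21:54:08 2019] [19252]     0 19252 27399980   814989   14788   953595             0 glusterfsd
[Fri Oct 11 21:54:08 2019] [ 3568]     0  3568  2384508    39899     449        0             0 glusterfs
"""

# "managed" kB of the DMA, DMA32 and Normal zones, and "Total swap", from the log.
MANAGED_KB = 15908 + 2882956 + 5106928
SWAP_KB = 8257532
TOTAL_PAGES = (MANAGED_KB + SWAP_KB) // PAGE_KB


def parse_rows(text):
    """Return one dict per process row; every field except the name is an int."""
    rows = []
    for line in text.splitlines():
        m = ROW.search(line)
        if m:
            rows.append(
                {k: (v if k == "name" else int(v)) for k, v in m.groupdict().items()}
            )
    return rows


def approx_oom_score(row, total_pages=TOTAL_PAGES):
    """Approximate score on a 0..1000 scale (assumed heuristic, see lead-in)."""
    points = row["rss"] + row["nr_ptes"] + row["swapents"]
    if row["uid"] == 0:                # simplification of the capability check
        points -= points * 3 // 100    # assumed 3% discount for root
    return points * 1000 // total_pages


if __name__ == "__main__":
    for row in sorted(parse_rows(SAMPLE), key=approx_oom_score, reverse=True):
        print(
            row["pid"], row["name"],
            "rss_kB=%d" % (row["rss"] * PAGE_KB),
            "swap_kB=%d" % (row["swapents"] * PAGE_KB),
            "score~%d" % approx_oom_score(row),
        )
```

Under these assumptions the two glusterfsd processes dominate both resident memory and swap, and the approximation for PID 19123 comes out at 473, the score printed in the kill line.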

Comment 11 Mohit Agrawal 2020-06-10 07:56:41 UTC

*** This bug has been marked as a duplicate of bug 1790336 ***

