Bug 697945

Summary: clvmd crashes when attempting to create hundreds of VGs
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2 Assignee: Milan Broz <mbroz>
Status: CLOSED ERRATA QA Contact: Corey Marthaler <cmarthal>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.1 CC: agk, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, pvrabec, thornber, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.87-1.el6 Doc Type: Bug Fix
Doc Text:
clvmd crashed when a large number of volume groups were created at once. The cause was the per-process limit on open file descriptors. When that limit was reached under a high number of requests, pipe creation failed, but clvmd did not return an error and continued to use the uninitialised pipes, which ended in a crash. To fix this, clvmd now returns an error message immediately if the pipe creation fails.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 16:56:58 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 743047    
Attachments:
Description             Flags
coredump from taft-02   none

Description Corey Marthaler 2011-04-19 18:12:00 UTC
Description of problem:

for i in b c d e f g h
do
   for j in $(seq 1 100)
   do 
      vgcreate $i$j /dev/sd$i$j &
   done
done

Apr 19 12:08:13 taft-02 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 19 12:10:51 taft-02 kernel: sdb: sdb1 sdb2 sdb3 sdb4 sdb5 sdb6 sdb7 sdb8 sdb9 sdb10 sdb11 sdb12 sdb13 sdb14 sdb15 sdb16 sdb17 sdb18 sdb19 sdb20 sdb21 sdb22 sdb23 sdb24 sdb25 sdb26 sdb27 sdb28 sdb29 sdb30 sdb31 sdb32 sdb33 sdb34 sdb35 sdb36 sdb37 sdb38 sdb39 sdb40 sdb41 sdb42 sdb43 sdb44 sdb45 sdb46 sdb47 sdb48 sdb49 sdb50 sdb51 sdb52 sdb53 sdb54 sdb55 sdb56 sdb57 sdb58 sdb59 sdb60 sdb61 sdb62 sdb63 sdb64 sdb65 sdb66 sdb67 sdb68 sdb69 sdb70 sdb71 sdb72 sdb73 sdb74 sdb75 sdb76 sdb77 sdb78 sdb79 sdb80 sdb81 sdb82 sdb83 sdb84 sdb85 sdb86 sdb87 sdb88 sdb89 sdb90 sdb91 sdb92 sdb93 sdb94 sdb95 sdb96 sdb97 sdb98 sdb99 sdb100
Apr 19 12:10:54 taft-02 kernel: sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 sdc11 sdc12 sdc13 sdc14 sdc15 sdc16 sdc17 sdc18 sdc19 sdc20 sdc21 sdc22 sdc23 sdc24 sdc25 sdc26 sdc27 sdc28 sdc29 sdc30 sdc31 sdc32 sdc33 sdc34 sdc35 sdc36 sdc37 sdc38 sdc39 sdc40 sdc41 sdc42 sdc43 sdc44 sdc45 sdc46 sdc47 sdc48 sdc49 sdc50 sdc51 sdc52 sdc53 sdc54 sdc55 sdc56 sdc57 sdc58 sdc59 sdc60 sdc61 sdc62 sdc63 sdc64 sdc65 sdc66 sdc67 sdc68 sdc69 sdc70 sdc71 sdc72 sdc73 sdc74 sdc75 sdc76 sdc77 sdc78 sdc79 sdc80 sdc81 sdc82 sdc83 sdc84 sdc85 sdc86 sdc87 sdc88 sdc89 sdc90 sdc91 sdc92 sdc93 sdc94 sdc95 sdc96 sdc97 sdc98 sdc99 sdc100
Apr 19 12:10:55 taft-02 kernel: sdd: sdd1 sdd2 sdd3 sdd4 sdd5 sdd6 sdd7 sdd8 sdd9 sdd10 sdd11 sdd12 sdd13 sdd14 sdd15 sdd16 sdd17 sdd18 sdd19 sdd20 sdd21 sdd22 sdd23 sdd24 sdd25 sdd26 sdd27 sdd28 sdd29 sdd30 sdd31 sdd32 sdd33 sdd34 sdd35 sdd36 sdd37 sdd38 sdd39 sdd40 sdd41 sdd42 sdd43 sdd44 sdd45 sdd46 sdd47 sdd48 sdd49 sdd50 sdd51 sdd52 sdd53 sdd54 sdd55 sdd56 sdd57 sdd58 sdd59 sdd60 sdd61 sdd62 sdd63 sdd64 sdd65 sdd66 sdd67 sdd68 sdd69 sdd70 sdd71 sdd72 sdd73 sdd74 sdd75 sdd76 sdd77 sdd78 sdd79 sdd80 sdd81 sdd82 sdd83 sdd84 sdd85 sdd86 sdd87 sdd88 sdd89 sdd90 sdd91 sdd92 sdd93 sdd94 sdd95 sdd96 sdd97 sdd98 sdd99 sdd100
Apr 19 12:10:57 taft-02 kernel: sde: sde1 sde2 sde3 sde4 sde5 sde6 sde7 sde8 sde9 sde10 sde11 sde12 sde13 sde14 sde15 sde16 sde17 sde18 sde19 sde20 sde21 sde22 sde23 sde24 sde25 sde26 sde27 sde28 sde29 sde30 sde31 sde32 sde33 sde34 sde35 sde36 sde37 sde38 sde39 sde40 sde41 sde42 sde43 sde44 sde45 sde46 sde47 sde48 sde49 sde50 sde51 sde52 sde53 sde54 sde55 sde56 sde57 sde58 sde59 sde60 sde61 sde62 sde63 sde64 sde65 sde66 sde67 sde68 sde69 sde70 sde71 sde72 sde73 sde74 sde75 sde76 sde77 sde78 sde79 sde80 sde81 sde82 sde83 sde84 sde85 sde86 sde87 sde88 sde89 sde90 sde91 sde92 sde93 sde94 sde95 sde96 sde97 sde98 sde99 sde100
Apr 19 12:10:58 taft-02 kernel: sdf: sdf1 sdf2 sdf3 sdf4 sdf5 sdf6 sdf7 sdf8 sdf9 sdf10 sdf11 sdf12 sdf13 sdf14 sdf15 sdf16 sdf17 sdf18 sdf19 sdf20 sdf21 sdf22 sdf23 sdf24 sdf25 sdf26 sdf27 sdf28 sdf29 sdf30 sdf31 sdf32 sdf33 sdf34 sdf35 sdf36 sdf37 sdf38 sdf39 sdf40 sdf41 sdf42 sdf43 sdf44 sdf45 sdf46 sdf47 sdf48 sdf49 sdf50 sdf51 sdf52 sdf53 sdf54 sdf55 sdf56 sdf57 sdf58 sdf59 sdf60 sdf61 sdf62 sdf63 sdf64 sdf65 sdf66 sdf67 sdf68 sdf69 sdf70 sdf71 sdf72 sdf73 sdf74 sdf75 sdf76 sdf77 sdf78 sdf79 sdf80 sdf81 sdf82 sdf83 sdf84 sdf85 sdf86 sdf87 sdf88 sdf89 sdf90 sdf91 sdf92 sdf93 sdf94 sdf95 sdf96 sdf97 sdf98 sdf99 sdf100
Apr 19 12:11:00 taft-02 kernel: sdg: sdg1 sdg2 sdg3 sdg4 sdg5 sdg6 sdg7 sdg8 sdg9 sdg10 sdg11 sdg12 sdg13 sdg14 sdg15 sdg16 sdg17 sdg18 sdg19 sdg20 sdg21 sdg22 sdg23 sdg24 sdg25 sdg26 sdg27 sdg28 sdg29 sdg30 sdg31 sdg32 sdg33 sdg34 sdg35 sdg36 sdg37 sdg38 sdg39 sdg40 sdg41 sdg42 sdg43 sdg44 sdg45 sdg46 sdg47 sdg48 sdg49 sdg50 sdg51 sdg52 sdg53 sdg54 sdg55 sdg56 sdg57 sdg58 sdg59 sdg60 sdg61 sdg62 sdg63 sdg64 sdg65 sdg66 sdg67 sdg68 sdg69 sdg70 sdg71 sdg72 sdg73 sdg74 sdg75 sdg76 sdg77 sdg78 sdg79 sdg80 sdg81 sdg82 sdg83 sdg84 sdg85 sdg86 sdg87 sdg88 sdg89 sdg90 sdg91 sdg92 sdg93 sdg94 sdg95 sdg96 sdg97 sdg98 sdg99 sdg100
Apr 19 12:11:02 taft-02 kernel: sdh: sdh1 sdh2 sdh3 sdh4 sdh5 sdh6 sdh7 sdh8 sdh9 sdh10 sdh11 sdh12 sdh13 sdh14 sdh15 sdh16 sdh17 sdh18 sdh19 sdh20 sdh21 sdh22 sdh23 sdh24 sdh25 sdh26 sdh27 sdh28 sdh29 sdh30 sdh31 sdh32 sdh33 sdh34 sdh35 sdh36 sdh37 sdh38 sdh39 sdh40 sdh41 sdh42 sdh43 sdh44 sdh45 sdh46 sdh47 sdh48 sdh49 sdh50 sdh51 sdh52 sdh53 sdh54 sdh55 sdh56 sdh57 sdh58 sdh59 sdh60 sdh61 sdh62 sdh63 sdh64 sdh65 sdh66 sdh67 sdh68 sdh69 sdh70 sdh71 sdh72 sdh73 sdh74 sdh75 sdh76 sdh77 sdh78 sdh79 sdh80 sdh81 sdh82 sdh83 sdh84 sdh85 sdh86 sdh87 sdh88 sdh89 sdh90 sdh91 sdh92 sdh93 sdh94 sdh95 sdh96 sdh97 sdh98 sdh99 sdh100
Apr 19 12:12:04 taft-02 kernel: dlm: Using TCP for communications
Apr 19 12:12:04 taft-02 kernel: dlm: got connection from 1
Apr 19 12:12:04 taft-02 kernel: dlm: connecting to 3
Apr 19 12:12:04 taft-02 kernel: dlm: connecting to 4
Apr 19 12:12:05 taft-02 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 19 12:25:43 taft-02 abrt[26309]: saved core dump of pid 14878 (/usr/sbin/clvmd) to /var/spool/abrt/ccpp-1303233863-14878.new/coredump (4359434240 bytes)
Apr 19 12:25:43 taft-02 abrtd: Directory 'ccpp-1303233863-14878' creation detected
Apr 19 12:25:43 taft-02 abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'kerneloops-1302902280-2099-1'
Apr 19 12:25:43 taft-02 abrt[26309]: size of '/var/spool/abrt' >= 1250 MB, deleting 'kerneloops-1302902280-2099-1'
Apr 19 12:25:43 taft-02 abrtd: Lock file '/var/spool/abrt/kerneloops-1302902280-2099-1.lock' is locked by process 26309
Apr 19 12:25:45 taft-02 abrtd: New crash /var/spool/abrt/ccpp-1303233863-14878, processing



[root@taft-02 ccpp-1303233863-14878]# ls -l /var/spool/abrt/ccpp-1303233863-14878
total 30732
-rw-r-----. 1 abrt root          4 Apr 19 12:24 analyzer
-rw-r-----. 1 abrt root          6 Apr 19 12:24 architecture
-rw-r-----. 1 abrt root         10 Apr 19 12:24 cmdline
-rw-r-----. 1 abrt root          4 Apr 19 12:25 component
-rw-r--r--. 1 root root 4359434240 Apr 19 12:25 coredump
-rw-r-----. 1 abrt root        104 Apr 19 12:25 description
-rw-r-----. 1 abrt root         15 Apr 19 12:24 executable
-rw-r-----. 1 abrt root          7 Apr 19 12:25 hostname
-rw-r-----. 1 abrt root         25 Apr 19 12:24 kernel
-rw-r-----. 1 abrt root         26 Apr 19 12:25 package
-rw-r-----. 1 abrt root         56 Apr 19 12:24 reason
-rw-r-----. 1 abrt root         59 Apr 19 12:24 release
-rw-------. 1 root root    4541420 Apr 19 12:36 sosreport.tar.xz
-rw-r-----. 1 abrt root         10 Apr 19 12:24 time
-rw-r-----. 1 abrt root          1 Apr 19 12:24 uid


[root@taft-02 ccpp-1303233863-14878]# cat cmdline
clvmd -T30

[root@taft-02 ccpp-1303233863-14878]# cat reason
Process /usr/sbin/clvmd was killed by signal 6 (SIGABRT)


Version-Release number of selected component (if applicable):
2.6.32-131.0.1.el6.x86_64

lvm2-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6    BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011

Comment 1 Corey Marthaler 2011-04-19 18:22:28 UTC
Core was generated by `clvmd -T30'.
Program terminated with signal 6, Aborted.
#0  0x000000331fc32a45 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12-41.el6.x86_64 corosynclib-1.2.3-36.el6.x86_64 glibc-2.12-1.25.el6.x86_64 libgcc-4.4.5-6.el6.x86_64 libselinux-2.0.94-5.el6.x86_64 libsepol-2.0.41-3.el6.x86_64 libudev-147-2.35.el6.x86_64
(gdb) bt
#0  0x000000331fc32a45 in raise () from /lib64/libc.so.6
#1  0x000000331fc34225 in abort () from /lib64/libc.so.6
#2  0x000000331fc6fdfb in __libc_message () from /lib64/libc.so.6
#3  0x000000331fc75716 in malloc_printerr () from /lib64/libc.so.6
#4  0x00000033260055df in dm_hash_destroy (t=0x6942d0) at datastruct/hash.c:132
#5  0x000000000040f90e in cmd_client_cleanup (client=0x1b37e90) at clvmd-command.c:343
#6  0x0000000000413c75 in process_work_item (arg=<value optimized out>) at clvmd.c:1903
#7  lvm_thread_fn (arg=<value optimized out>) at clvmd.c:1959
#8  0x00000033200077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x000000331fce68ed in clone () from /lib64/libc.so.6

Comment 2 Corey Marthaler 2011-04-19 19:39:56 UTC
Created attachment 493269 [details]
coredump from taft-02

Comment 3 Milan Broz 2011-08-11 13:02:39 UTC
I hope this is fixed by properly returning an error when clvmd has no more file descriptors; the fix should be part of upstream 2.02.87.

(I was not able to reproduce the clvmd crash, at least with the patch applied, but the backtraces differ.)

Comment 5 Corey Marthaler 2011-09-08 17:56:44 UTC
The crash no longer appears to exist in the latest lvm rpms. I now see countless lock messages:

  Can't get lock for c13.
  cluster request failed: Device or resource busy

Is this the expected behavior now instead of crashing?

2.6.32-193.el6.x86_64

lvm2-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
lvm2-libs-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
lvm2-cluster-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
udev-147-2.37.el6    BUILT: Wed Aug 10 07:48:15 CDT 2011
device-mapper-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-libs-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-event-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-event-libs-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
cmirror-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011

Comment 6 Milan Broz 2011-09-16 15:13:11 UTC
Yes. The busy message is sent when the code runs out of free descriptors (the pipe() call fails). Maybe we should handle this better in the future, but for now returning busy is the only option.
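
For illustration only, here is a minimal C sketch of the pattern described in this comment: check the result of pipe() and report "busy" instead of carrying on with descriptors that were never set. This is not the actual clvmd patch. The names create_pipe_checked() and set_soft_fd_limit() are hypothetical, and mapping the failure to EBUSY is an assumption based on the "Device or resource busy" message seen in comment 5.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper, not taken from clvmd: create a pipe and report
 * failure instead of leaving the caller with unusable descriptors. */
int create_pipe_checked(int fds[2])
{
    if (pipe(fds) != 0) {
        /* pipe() typically fails with EMFILE or ENFILE when the process
         * or the system has run out of file descriptors. */
        fds[0] = -1;
        fds[1] = -1;
        return -EBUSY; /* assumed mapping to a "busy" reply */
    }
    return 0;
}

/* Hypothetical test aid: change the soft RLIMIT_NOFILE so that descriptor
 * exhaustion can be simulated without creating hundreds of VGs. */
int set_soft_fd_limit(rlim_t soft)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -errno;
    lim.rlim_cur = soft;
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -errno;
    return 0;
}
```

On Linux, calling set_soft_fd_limit(0) and then create_pipe_checked() should return -EBUSY with both descriptors set to -1, because no new descriptor can be allocated while the soft limit is 0. Descriptors that are already open stay usable.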

Comment 7 Corey Marthaler 2011-09-16 19:33:37 UTC
Marking verified based on comment #5 and comment #6.

Comment 8 Peter Rajnoha 2011-10-27 12:18:31 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
clvmd crashed when a large number of volume groups were created at once. The cause was the per-process limit on open file descriptors. When that limit was reached under a high number of requests, pipe creation failed, but clvmd did not return an error and continued to use the uninitialised pipes, which ended in a crash. To fix this, clvmd now returns an error message immediately if the pipe creation fails.
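
To see how close a process is to the per-process descriptor limit mentioned in the note above, something like the following C sketch could be used. It is Linux-specific and purely illustrative: get_soft_fd_limit() and count_open_fds() are hypothetical names, and counting entries in /proc/self/fd is one possible approach, not something clvmd is known to do.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Hypothetical helper: store the soft limit on open file descriptors
 * for this process in *out. Returns 0 on success, -1 on failure. */
int get_soft_fd_limit(rlim_t *out)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    *out = lim.rlim_cur;
    return 0;
}

/* Hypothetical helper: count the descriptors currently open in this
 * process by listing /proc/self/fd. Returns -1 if the directory cannot
 * be opened (for example, on a system without /proc). */
long count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *entry;
    long count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue; /* skip "." and ".." */
        if (atoi(entry->d_name) == dirfd(dir))
            continue; /* skip the descriptor used for this listing */
        count++;
    }
    closedir(dir);
    return count;
}
```

Comparing count_open_fds() against the value from get_soft_fd_limit() gives a rough idea of the remaining headroom before calls such as pipe() start to fail.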

Comment 9 errata-xmlrpc 2011-12-06 16:56:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1522.html