Description of problem:
Following update to kernel-3.5.0-2 on two F17 x86_64 machines, the machines won't connect to oVirt 3.1 nfs domains. Returning to 3.4.6-2.fc17.x86_64 resolves the issue.
I don't know what's causing the issue, but I'm happy to help debug it.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Update to this version of the kernel on F17 machine serving as oVirt host.
NFS domain is inaccessible.
I've tried putting selinux into permissive mode and taking down the firewall; neither had any effect. The affected host does mount the nfs share, but fails while attaching to it.
Any chance someone can look at this? It's a critical issue for oVirt.
(In reply to comment #0)
> Description of problem:
> Following update to kernel-3.5.0-2 on two F17 x86_64 machines, the machines
> won't connect to oVirt 3.1 nfs domains. Returning to 3.4.6-2.fc17.x86_64
> resolves the issue.
Can you attach the relevant vdsm logs? Thanks.
Created attachment 603363 [details]
Engine and VDSM logs, from the two boxes involved.
Inside the .zip file are two .tar.bz2 archives: one for the ovirt engine log and one for the vdsm log.
Just to clarify, by "two boxes involved" in my comment, I mean the two servers in my test environment here. They're different boxes from Jason's, the ones in the original report.
Further useful info may be in BZ 847083 too. (same problem, before we knew it was kernel version specific)
Created attachment 603368 [details]
vdsm log from F17 host w/ 3.5 kernel
This is a vdsm log from an F17 ovirt host that had, while running a pre-3.5 kernel, been configured to use an NFS master data domain. Upon booting into the current 3.5 kernel, the host will no longer connect to the NFS domain. Booting back into the earlier kernel resolves the issue.
One of the vdsm processes (from oop) crashed with the following backtrace:
#0 0x00007fa0f9398925 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007fa0f939a0d8 in __GI_abort () at abort.c:91
#2 0x00007fa0f93d864b in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa0f94dbc28 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3 0x00007fa0f93df7ce in malloc_printerr (ptr=0x7fa09c00e000, str=0x7fa0f94dbce8 "double free or corruption (!prev)", action=3) at malloc.c:5027
#4 _int_free (av=0x7fa09c000020, p=0x7fa09c00dff0, have_lock=0) at malloc.c:3948
#5 0x00007fa0f632ee90 in ffi_call_unix64 () from /lib64/libffi.so.5
#6 0x00007fa0f632e8a0 in ffi_call () from /lib64/libffi.so.5
#7 0x00007fa0f6548cc3 in _call_function_pointer (argcount=1, resmem=0x7fa0adff66e0, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fa0adff66c0, pProc=0x7fa0f93e3140 <__GI___libc_free>, flags=4361)
#8 _ctypes_callproc (pProc=pProc@entry=0x7fa0f93e3140 <__GI___libc_free>, argtuple=argtuple@entry=(<c_char_p at remote 0x1b72e60>,), flags=4361, argtypes=argtypes@entry=0x0, restype=
<_ctypes.PyCSimpleType at remote 0x1c4fab0>, checker=0x0) at /usr/src/debug/Python-2.7.3/Modules/_ctypes/callproc.c:1174
#9 0x00007fa0f65423dd in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /usr/src/debug/Python-2.7.3/Modules/_ctypes/_ctypes.c:3913
#10 0x00007fa0fa081a7e in PyObject_Call (func=func@entry=<_FuncPtr(__name__='free') at remote 0x24a8ae0>, arg=arg@entry=(<c_char_p at remote 0x1b72e60>,), kw=kw@entry=0x0)
#11 0x00007fa0fa1113e3 in do_call (nk=<optimized out>, na=1, pp_stack=0x7fa0adff6aa8, func=<_FuncPtr(__name__='free') at remote 0x24a8ae0>) at /usr/src/debug/Python-2.7.3/Python/ceval.c:4316
#12 call_function (oparg=<optimized out>, pp_stack=0x7fa0adff6aa8) at /usr/src/debug/Python-2.7.3/Python/ceval.c:4121
#13 PyEval_EvalFrameEx (f=f@entry=
Frame 0x7fa09c00d5a0, for file /usr/share/vdsm/storage/fileUtils.py, line 272, in _createAlignedBuffer (self=<DirectFile(_closed=False, _writable=False, _mode='dr', _fd=3) at remote 0x1c82d90>, size=1024, pbuff=<c_char_p at remote 0x1b72e60>, ppbuff=<LP_c_char_p at remote 0x1b72cb0>, rc=0), throwflag=<optimized out>) at /usr/src/debug/Python-2.7.3/Python/ceval.c:2740
The relevant part for VDSM is vdsm/storage/fileUtils.py:272
260     def _createAlignedBuffer(self, size):
261         pbuff = ctypes.c_char_p(0)
262         ppbuff = ctypes.pointer(pbuff)
263         # Because we usually have fixed sizes for our reads, caching
264         # buffers might give a slight performance boost.
265         rc = libc.posix_memalign(ppbuff, PAGESIZE, size)
266         if rc:
267             raise OSError(rc, "Could not allocate aligned buffer")
269             ctypes.memset(pbuff, 0, size)
270             yield pbuff
In conclusion, the NFS operation gets stuck because the helper died (why isn't vdsm detecting that the fd has been closed?).
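On the question of noticing that the helper died: the sketch below shows one generic way a parent can tell a normal exit from death by signal. It is not how vdsm's oop code works; the function name `run_helper` and the timeout value are made up for illustration, and it assumes Python 3 and a POSIX host.

```python
import subprocess


def run_helper(args, timeout=10):
    """Run a helper process and fail loudly if it was killed by a signal.

    Hypothetical helper for illustration only; not part of vdsm.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    # communicate() returns once the child closes its end of the pipe,
    # which also happens when the child dies unexpectedly.
    out, _ = proc.communicate(timeout=timeout)
    # On POSIX, a negative returncode means the child died on that signal
    # (for example -6 for SIGABRT after a glibc abort).
    if proc.returncode < 0:
        raise RuntimeError("helper killed by signal %d" % -proc.returncode)
    return out
```

With a check like this the caller gets an error right away instead of waiting on a reply that will never arrive.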
I'm still not sure if this is a VDSM issue that gets exposed only with the newer kernel, or if the glibc and the kernel currently have some issue with posix_memalign+free.
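To make the posix_memalign+free pairing concrete, here is a minimal sketch of the pattern with explicit ctypes prototypes. It is not the vdsm code and not a proposed fix. The name `aligned_buffer`, the default alignment of 4096, and the use of `c_void_p` in place of `c_char_p` are all assumptions made for illustration.

```python
import ctypes
import ctypes.util
from contextlib import contextmanager

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Declare the prototypes explicitly so ctypes does not guess at how to
# convert pointer arguments and return values.
libc.posix_memalign.argtypes = [
    ctypes.POINTER(ctypes.c_void_p), ctypes.c_size_t, ctypes.c_size_t]
libc.posix_memalign.restype = ctypes.c_int
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


@contextmanager
def aligned_buffer(size, alignment=4096):
    """Yield a zeroed, aligned buffer and free it exactly once on exit.

    Hypothetical helper; the alignment default is an assumption.
    """
    ptr = ctypes.c_void_p()
    rc = libc.posix_memalign(ctypes.byref(ptr), alignment, size)
    if rc:
        raise OSError(rc, "Could not allocate aligned buffer")
    try:
        ctypes.memset(ptr, 0, size)
        yield ptr
    finally:
        libc.free(ptr)
```

The try/finally pairs each allocation with a single free, which is the invariant a double free would break.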
Created attachment 603582 [details]
VDSM core dump file.
gdb /bin/python core.7273.1344613714.dump
This is easily reproducible with:
$ uname -sr ; rpm -q glibc
$ cat python_crash.py
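import ctypes
libc = ctypes.CDLL("libc.so.6", use_errno=True)
pbuff = ctypes.c_char_p(0)
ppbuff = ctypes.pointer(pbuff)
SIZE = 100
libc.posix_memalign(ppbuff, libc.getpagesize(), SIZE)
ctypes.memset(pbuff, 0, SIZE)
# The two libc.free calls described in the follow-up comment:
libc.free(pbuff)
libc.free(pbuff)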
$ python python_crash.py
*** glibc detected *** python: double free or corruption (fasttop): 0x0000000001c55000 ***
======= Backtrace: =========
======= Memory map: ========
00400000-00401000 r-xp 00000000 fd:02 546244 /usr/bin/python2.7
00600000-00601000 r--p 00000000 fd:02 546244 /usr/bin/python2.7
00601000-00602000 rw-p 00001000 fd:02 546244 /usr/bin/python2.7
01b59000-01c72000 rw-p 00000000 00:00 0 [heap]
35c2800000-35c2820000 r-xp 00000000 fd:02 524867 /usr/lib64/ld-2.15.so
35c2a1f000-35c2a20000 r--p 0001f000 fd:02 524867 /usr/lib64/ld-2.15.so
35c2a20000-35c2a21000 rw-p 00020000 fd:02 524867 /usr/lib64/ld-2.15.so
35c2a21000-35c2a22000 rw-p 00000000 00:00 0
35c3000000-35c31ac000 r-xp 00000000 fd:02 527324 /usr/lib64/libc-2.15.so
35c31ac000-35c33ac000 ---p 001ac000 fd:02 527324 /usr/lib64/libc-2.15.so
35c33ac000-35c33b0000 r--p 001ac000 fd:02 527324 /usr/lib64/libc-2.15.so
35c33b0000-35c33b2000 rw-p 001b0000 fd:02 527324 /usr/lib64/libc-2.15.so
35c33b2000-35c33b7000 rw-p 00000000 00:00 0
35c3400000-35c3416000 r-xp 00000000 fd:02 527412 /usr/lib64/libpthread-2.15.so
35c3416000-35c3616000 ---p 00016000 fd:02 527412 /usr/lib64/libpthread-2.15.so
35c3616000-35c3617000 r--p 00016000 fd:02 527412 /usr/lib64/libpthread-2.15.so
35c3617000-35c3618000 rw-p 00017000 fd:02 527412 /usr/lib64/libpthread-2.15.so
35c3618000-35c361c000 rw-p 00000000 00:00 0
35c3800000-35c38fa000 r-xp 00000000 fd:02 527882 /usr/lib64/libm-2.15.so
35c38fa000-35c3af9000 ---p 000fa000 fd:02 527882 /usr/lib64/libm-2.15.so
35c3af9000-35c3afa000 r--p 000f9000 fd:02 527882 /usr/lib64/libm-2.15.so
35c3afa000-35c3afb000 rw-p 000fa000 fd:02 527882 /usr/lib64/libm-2.15.so
35c3c00000-35c3c03000 r-xp 00000000 fd:02 527494 /usr/lib64/libdl-2.15.so
35c3c03000-35c3e02000 ---p 00003000 fd:02 527494 /usr/lib64/libdl-2.15.so
35c3e02000-35c3e03000 r--p 00002000 fd:02 527494 /usr/lib64/libdl-2.15.so
35c3e03000-35c3e04000 rw-p 00003000 fd:02 527494 /usr/lib64/libdl-2.15.so
35c5000000-35c5015000 r-xp 00000000 fd:02 527885 /usr/lib64/libgcc_s-4.7.0-20120507.so.1
35c5015000-35c5214000 ---p 00015000 fd:02 527885 /usr/lib64/libgcc_s-4.7.0-20120507.so.1
35c5214000-35c5215000 rw-p 00014000 fd:02 527885 /usr/lib64/libgcc_s-4.7.0-20120507.so.1
35c5c00000-35c5c07000 r-xp 00000000 fd:02 528449 /usr/lib64/libffi.so.5.0.10
35c5c07000-35c5e06000 ---p 00007000 fd:02 528449 /usr/lib64/libffi.so.5.0.10
35c5e06000-35c5e07000 r--p 00006000 fd:02 528449 /usr/lib64/libffi.so.5.0.10
35c5e07000-35c5e08000 rw-p 00007000 fd:02 528449 /usr/lib64/libffi.so.5.0.10
35d4a00000-35d4b6c000 r-xp 00000000 fd:02 546233 /usr/lib64/libpython2.7.so.1.0
35d4b6c000-35d4d6c000 ---p 0016c000 fd:02 546233 /usr/lib64/libpython2.7.so.1.0
35d4d6c000-35d4d6d000 r--p 0016c000 fd:02 546233 /usr/lib64/libpython2.7.so.1.0
35d4d6d000-35d4daa000 rw-p 0016d000 fd:02 546233 /usr/lib64/libpython2.7.so.1.0
35d4daa000-35d4dba000 rw-p 00000000 00:00 0
35d5800000-35d5802000 r-xp 00000000 fd:02 533736 /usr/lib64/libutil-2.15.so
35d5802000-35d5a01000 ---p 00002000 fd:02 533736 /usr/lib64/libutil-2.15.so
35d5a01000-35d5a02000 r--p 00001000 fd:02 533736 /usr/lib64/libutil-2.15.so
35d5a02000-35d5a03000 rw-p 00002000 fd:02 533736 /usr/lib64/libutil-2.15.so
7ff17fb91000-7ff17fbd2000 rw-p 00000000 00:00 0
7ff17fbd2000-7ff17fbd9000 r-xp 00000000 fd:02 532340 /usr/lib64/python2.7/lib-dynload/_struct.so
7ff17fbd9000-7ff17fdd8000 ---p 00007000 fd:02 532340 /usr/lib64/python2.7/lib-dynload/_struct.so
7ff17fdd8000-7ff17fdd9000 r--p 00006000 fd:02 532340 /usr/lib64/python2.7/lib-dynload/_struct.so
7ff17fdd9000-7ff17fddb000 rw-p 00007000 fd:02 532340 /usr/lib64/python2.7/lib-dynload/_struct.so
7ff17fddb000-7ff17fdf4000 r-xp 00000000 fd:02 532139 /usr/lib64/python2.7/lib-dynload/_ctypes.so
7ff17fdf4000-7ff17fff4000 ---p 00019000 fd:02 532139 /usr/lib64/python2.7/lib-dynload/_ctypes.so
7ff17fff4000-7ff17fff5000 r--p 00019000 fd:02 532139 /usr/lib64/python2.7/lib-dynload/_ctypes.so
7ff17fff5000-7ff17fff9000 rw-p 0001a000 fd:02 532139 /usr/lib64/python2.7/lib-dynload/_ctypes.so
7ff17fff9000-7ff186426000 r--p 00000000 fd:02 569679 /usr/lib/locale/locale-archive
7ff186426000-7ff1864d9000 rw-p 00000000 00:00 0
7ff1864da000-7ff186561000 rw-p 00000000 00:00 0
7ff18657c000-7ff18657e000 rw-p 00000000 00:00 0
7fff0cf06000-7fff0cf27000 rw-p 00000000 00:00 0 [stack]
7fff0cf4e000-7fff0cf4f000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)
(In reply to comment #8)
> This is easily reproducible with:
The reproducer above is probably invalid: it calls libc.free twice, which presumes a genuine double free performed by either vdsm or ctypes/python. The original crash is most likely real heap corruption. The glibc error message also differs slightly (!prev vs fasttop):
*** glibc detected *** python: double free or corruption (fasttop): 0x0000000001c55000 ***
*** glibc detected *** python: double free or corruption (!prev): 0x00007fe710027000 ***
Is there any further information I can provide to help advance this bug? The current stable version of oVirt is broken due to this. Perhaps this would be better dealt with as a vdsm bug?
(In reply to comment #10)
> Is there any further information I can provide to help advance this bug? The
> current stable version of oVirt is broken due to this. Perhaps this would be
> better dealt with as a vdsm bug?
We need someone to volunteer to backport these (master branch) to the ovirt-3.1 branch:
8f226cf Change oop to be a new process instead of a fork
41ca78b Fixing broken compilation
b400488 fix logging
5bcb224 Add missing log object to CrabRPCServer
As I suspected, this also affects the vdsm master branch (it's not related to the oop mechanism used). I just found a vdsm host using crabrpc with the same issue (kernel 3.5.1-1.fc17.x86_64).
I noticed this message:
kernel: [86611.703352] python: segfault at 18 ip 0000003fb307b06b sp 00007fff34c3b240 error 4 in libc-2.15.so[3fb3000000+1ab000]
Fede, why is this on vdsm?
(In reply to comment #14)
> Fede, why is this on vdsm?
I doubt that it's a kernel/glibc bug (posix_memalign/free are widely used). It could be a python/ctypes issue, but for now I'd try to figure out whether we're using them properly (for example, whether there's any path that might lead to a double free on the same pointer).
@abaron -- anything we can do to make this a higher priority? It's a blocking issue for all ovirt-node use. If it needs to go to the kernel team or we need to pull in people from a different team, I'll do that, but I need to know who to talk to.
Jason, would you please try Saggi's http://gerrit.ovirt.org/8143 ?
(In reply to comment #18)
> Jason, would you please try Saggi's http://gerrit.ovirt.org/8143 ?
Maybe I'm doing it wrong, but I built a new rpm from vdsm master and, after modding the spec file to work with the version of libvirt that comes with F17, installed the package. It didn't fix the problem. Then, I changed the spec file back to its previous libvirt requirement, and also built a newer libvirt for my F17 host. I installed the packages and it didn't fix the problem.
Again, not sure if I'm testing this wrong...
(In reply to comment #19)
> (In reply to comment #18)
> > Jason, would you please try Saggi's http://gerrit.ovirt.org/8143 ?
> Maybe I'm doing it wrong, but I built a new rpm from vdsm master and, after
> modding the spec file to work with the version of libvirt that comes with
> F17, installed the package. It didn't fix the problem. Then, I changed the
> spec file back to its previous libvirt requirement, and also built a newer
> libvirt for my F17 host. I installed the packages and it didn't fix the
> problem.
> Again, not sure if I'm testing this wrong...
Could you give it a try with the new vdsm build?
* Mon Sep 24 2012 Federico Simoncelli <email@example.com> 4.10.0-9.fc17
- BZ#845660 Use the recommended alignment instead of using pagesize
(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #18)
> > > Jason, would you please try Saggi's http://gerrit.ovirt.org/8143 ?
> > Maybe I'm doing it wrong, but I built a new rpm from vdsm master and, after
> > modding the spec file to work with the version of libvirt that comes with
> > F17, installed the package. It didn't fix the problem. Then, I changed the
> > spec file back to its previous libvirt requirement, and also built a newer
> > libvirt for my F17 host. I installed the packages and it didn't fix the
> > problem.
> > Again, not sure if I'm testing this wrong...
> Could you give it a try with the new vdsm build?
I tried with this build, and unfortunately, the issue remains.
I updated an F17 host (a different one from the one I'd been testing my self-built pkgs on) with the vdsm build referenced below. With kernel 3.5.4, my host would not attach to my existing, gluster-based nfs data domain. I also tried to create a new data domain, nfs based but non-gluster, and the host running 3.5.4 showed the same behavior as reported earlier: it wouldn't attach.
I also tried kernel 3.4.6 with this new vdsm build, and both my gluster and non-gluster nfs domains continued to work normally.
> * Mon Sep 24 2012 Federico Simoncelli <firstname.lastname@example.org> 4.10.0-9.fc17
> - BZ#845660 Use the recommended alignment instead of using pagesize
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #19)
> > > (In reply to comment #18)
> > > > Jason, would you please try Saggi's http://gerrit.ovirt.org/8143 ?
> > >
> > > Maybe I'm doing it wrong, but I built a new rpm from vdsm master and, after
> > > modding the spec file to work with the version of libvirt that comes with
> > > F17, installed the package. It didn't fix the problem. Then, I changed the
> > > spec file back to its previous libvirt requirement, and also built a newer
> > > libvirt for my F17 host. I installed the packages and it didn't fix the
> > > problem.
> > >
> > > Again, not sure if I'm testing this wrong...
> > Could you give it a try with the new vdsm build?
> I tried with this build, and unfortunately, the issue remains.
I tried it myself too and indeed the issue is not solved yet.
# grep _PC_REC_XFER_ALIGN /usr/share/vdsm/storage/fileUtils.py
_PC_REC_XFER_ALIGN = 17
alignment = libc.fpathconf(self.fileno(), _PC_REC_XFER_ALIGN)
# uname -r
# ls -l /var/log/core/core.2187.1348563252.dump
-rw-------. 1 vdsm kvm 2573868 Sep 25 04:54 /var/log/core/core.2187.1348563252.dump
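For anyone who wants to check what alignment a given file reports, the same query shown in the grep output above can be made from plain Python. This is a minimal sketch, not the vdsm implementation: the function name `recommended_alignment` and the fallback to the page size are assumptions, and whether "PC_REC_XFER_ALIGN" is listed in `os.pathconf_names` depends on the platform.

```python
import mmap
import os


def recommended_alignment(fd, default=mmap.PAGESIZE):
    """Return the recommended transfer alignment for fd, or a fallback.

    Hypothetical helper; falling back to the page size is an assumption.
    """
    name = "PC_REC_XFER_ALIGN"
    # The symbolic name is only known on platforms that define it.
    if name not in os.pathconf_names:
        return default
    try:
        value = os.fpathconf(fd, name)
    except (OSError, ValueError):
        return default
    # A non-positive result means there is no usable recommendation.
    return value if value > 0 else default
```

Running it against a file on the NFS mount and on a local filesystem would show whether the two report different values.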
Anything else others can do to help w/ this? I've got a new home lab setup, but stuck on what appears to be this bug (ovirt 3.1 stable branch)...
Jason, Bert - could you please try the patch mentioned in comment #24?
(In reply to comment #25)
> Jason, Bert - could you please try the patch mentioned in comment #24?
OK, this is looking good. On my F17 setup, running kernel 3.5.4 with vdsm built w/ the above patch, my nfs iso domain is up.
I wasn't 100% positive about the right way to build a pkg w/ this patch, so let me confirm that:
I have vdsm from git: git clone http://gerrit.ovirt.org/p/vdsm.git, and that's up to date.
Then I did: git fetch git://gerrit.ovirt.org/vdsm refs/changes/56/8356/3 && git checkout FETCH_HEAD -- That's the checkout line for the patch referenced above. So I ran that, then continued as directed on http://wiki.ovirt.org/wiki/Vdsm_Developers#Building_a_Vdsm_RPM.
I replaced the vdsm pkgs on my f17 ovirt 3.1 test box w/ those, rebooted, it's running 3.5.4, and my nfs iso domain is up.
Fixed in vdsm-4.10.0-10.fc17
Jason can you give it a try? Thanks.
(In reply to comment #27)
> Fixed in vdsm-4.10.0-10.fc17
> Jason can you give it a try? Thanks.
My pleasure -- I just tested w/ vdsm 4.10.0-10 and kernel 3.5.4 on oVirt 3.1, and my nfs iso and data domains are both up.
I believe this is now fixed.