Bug 1162640 - supervdsm segfault in libgfapi while querying volume status detail
Summary: supervdsm segfault in libgfapi while querying volume status detail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.1
Assignee: Darshan
QA Contact: Gil Klein
URL:
Whiteboard: gluster
Depends On:
Blocks: oVirt_3.5.1_tracker
 
Reported: 2014-11-11 12:04 UTC by Jillian Morgan
Modified: 2016-04-29 09:53 UTC
CC List: 18 users

Fixed In Version: ovirt-3.5.1_rc1
Clone Of:
: 1164218 (view as bug list)
Environment:
Last Closed: 2015-01-21 16:02:32 UTC
oVirt Team: Gluster
Embargoed:


Attachments
VDSM log around the time of supervdsm failure and VMs pausing (523.35 KB, text/plain)
2014-11-12 15:20 UTC, Jillian Morgan
VDSM log around the time of supervdsm failure and VMs pausing (523.35 KB, application/x-xz)
2014-11-12 15:22 UTC, Jillian Morgan
Backtrace from supervdsm segfault (26.67 KB, text/plain)
2014-11-13 03:03 UTC, Jillian Morgan
gluster, vdsmd, and sanlock logs from one event (35.48 KB, application/zip)
2014-12-04 18:06 UTC, Darrell


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 35256 0 master MERGED gluster: proper check in gfapi.py to avoid segfault in libgfapi Never
oVirt gerrit 35515 0 ovirt-3.5 MERGED gluster: proper check in gfapi.py to avoid segfault in libgfapi Never

Description Jillian Morgan 2014-11-11 12:04:08 UTC
Description of problem:
supervdsm encounters a segfault in libgfapi, causing all VMs on that host to enter the paused state and remain stuck there forever, even though supervdsm is restarted automatically (by systemd). High-availability VMs are neither migrated nor restarted. The paused VMs cannot be manually un-paused; they must be manually killed and then restarted.

Version-Release number of selected component (if applicable):
vdsm-4.16.7-1.gitdb83943.fc20.x86_64
vdsm-gluster-4.16.7-1.gitdb83943.fc20.noarch
glusterfs-api-3.5.2-1.fc20.x86_64

How reproducible:
Consistent since the upgrade to oVirt release 3.5. oVirt nodes remain operational for anywhere from 8 hours down to as little as 30 minutes between instances of the above-described segfaults.

The issue appears to occur while supervdsm is looping over all gluster volumes and retrieving their details. This happens regularly (every 5 minutes, it seems), usually without crashing, but then randomly does crash. It should also be noted that the volume being queried when the crash occurs is not consistent. Each volume looks like it is queried twice: once plain (the third argument of the call to volumeStatus() is empty ('')) and then again with detail (the third argument is 'detail'). The crash always comes during one of the 'detail' queries.

Steps to Reproduce:
1. vm storage domain on glusterfs
2. run some VMs
3. wait

Actual results:
Here are the relevant log extracts that I can find:

/var/log/messages:
Nov  5 21:46:29 big kernel: supervdsmServer[11638]: segfault at 18 ip 00007f652c41115a sp 00007f5fd3b7cf40 error 4 in libgfapi.so.0.0.0[7f652c40a000+19000]

/var/log/vdsm/supervdsm.log: (looping through getting information about each gluster volume)
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,015::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call wrapper with ('web', '', 'detail') {}
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,015::utils::738::root::(execCmd) /usr/sbin/gluster --mode=script volume status web detail --xml (cwd None)
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,323::utils::758::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,324::supervdsmServer::108::SuperVdsm.ServerCallback::(wrapper) return wrapper with {'bricks': [{'sizeTotal': '5722917.992', 'mntOptions': 'rw,relatime,ssd,space_cache', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924', 'device': '/dev/bcache0', 'blockSize': '4096', 'brick': 'big:/mnt/export/r0/web', 'sizeFree': '5624569.770', 'fsName': 'btrfs'}, {'sizeTotal': '5722917.992', 'mntOptions': 'rw,relatime,ssd,space_cache', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051', 'device': '/dev/bcache0', 'blockSize': '4096', 'brick': 'bang:/mnt/export/r0/web', 'sizeFree': '5619114.816', 'fsName': 'btrfs'}], 'name': 'web'}
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,325::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call wrapper with ('web',) {}
   [*** supervdsm crashed here and is restarted ***]
MainThread::DEBUG::2014-11-05 22:55:21,655::netconfpersistence::134::root::(_getConfigs) Non-existing config set.
MainThread::DEBUG::2014-11-05 22:55:21,655::netconfpersistence::134::root::(_getConfigs) Non-existing config set.
MainThread::DEBUG::2014-11-05 22:55:21,722::supervdsmServer::411::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-05 22:55:21,722::supervdsmServer::420::SuperVdsm.Server::(main) Parsing cmd args
MainThread::DEBUG::2014-11-05 22:55:21,722::supervdsmServer::423::SuperVdsm.Server::(main) Cleaning old socket /var/run/vdsm/svdsm.sock
MainThread::DEBUG::2014-11-05 22:55:21,723::supervdsmServer::427::SuperVdsm.Server::(main) Setting up keep alive thread
MainThread::DEBUG::2014-11-05 22:55:21,723::supervdsmServer::433::SuperVdsm.Server::(main) Creating remote object manager
MainThread::DEBUG::2014-11-05 22:55:21,724::fileUtils::192::Storage.fileUtils::(chown) Changing owner for /var/run/vdsm/svdsm.sock, to (36:36)
MainThread::DEBUG::2014-11-05 22:55:21,725::supervdsmServer::444::SuperVdsm.Server::(main) Started serving super vdsm object

/var/log/vdsm/vdsm.log:
Thread-792::DEBUG::2014-11-05 21:46:28,579::BindingXMLRPC::1132::vds::(wrapper) client [192.168.21.36]::call volumeStatus with ('web', '', '') {} flowID [3d9b0b87]
Thread-792::DEBUG::2014-11-05 21:46:29,002::BindingXMLRPC::1139::vds::(wrapper) return volumeStatus with {'volumeStatus': {'bricks': [{'status': 'ONLINE', 'brick': 'big:/mnt/export/r0/web', 'pid': '4531', 'port': '50154', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924'}, {'status': 'ONLINE', 'brick': 'bang:/mnt/export/r0/web', 'pid': '20996', 'port': '50154', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051'}], 'nfs': [{'status': 'OFFLINE', 'hostname': '192.168.1.83', 'pid': '-1', 'port': 'N/A', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924'}, {'status': 'ONLINE', 'hostname': 'bang', 'pid': '21004', 'port': '2049', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051'}], 'shd': [{'status': 'ONLINE', 'hostname': '192.168.1.83', 'pid': '4547', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924'},
{'status': 'ONLINE', 'hostname': 'bang', 'pid': '21011', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051'}], 'name': 'web'}, 'status': {'message': 'Done', 'code': 0}}
Thread-792::DEBUG::2014-11-05 21:46:29,014::BindingXMLRPC::1132::vds::(wrapper) client [192.168.21.36]::call volumeStatus with ('web', '', 'detail') {} flowID [3d9b0b87]
   [*** expecting "BindingXMLRPC::1139::vds::(wrapper) return volumeStatus with ..." to come next, but of course it doesn't come because supervdsm crashed.]


Expected results:
The volumeStatus() query should return successfully (no crash, obviously). Also, running VMs should not be affected by this random failed volume information query.

Additional info:
I have two identically configured compute nodes that have both experienced this issue since upgrading to oVirt 3.5. Either one may suffer the crash at any given time, though not at the same time; eventually both will suffer the crash and wind up with all VMs paused.

A core dump found in /var/log/core/core.538.1415241989.dump appears to be truncated and cannot produce a stack trace:

# gdb core.538.1415241989.dump
...
BFD: Warning: /var/log/core/core.538.1415241989.dump is truncated: expected core file size >= 19657957376, found: 14539485184.
...
Failed to read a valid object file image from memory.
Core was generated by `/usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f652c41115a in ?? ()

Comment 1 Dan Kenigsberg 2014-11-12 15:09:17 UTC
Ian, can you attach vdsm.log? It may help me understand why the VMs paused, and why unpausing them failed.

How often are the crashes? Can you reproduce them when gdb is attached to supervdsmd, and thus provide the backtrace?

Darshan, I'm 99.9% sure that the segfault occurs in libgfapi, and not in vdsm or Python. Are you aware of such a bug in libgfapi? If it cannot be solved there, we may need to disable this new feature in Vdsm.

Comment 2 Jillian Morgan 2014-11-12 15:20:29 UTC
Created attachment 956777 [details]
VDSM log around the time of supervdsm failure and VMs pausing

Dan,

Here's the log you requested surrounding the previously described instance of the failure. This is a 16MB single-file extract (524K compressed) from the rotated log set.

Comment 3 Jillian Morgan 2014-11-12 15:22:32 UTC
Created attachment 956778 [details]
VDSM log around the time of supervdsm failure and VMs pausing

Take two: accidentally selected type text/plain last time, but the file is XZ compressed.

Comment 4 Dan Kenigsberg 2014-11-12 21:26:05 UTC
> Ian, how often are the crashes? Can you reproduce them when gdb is attached to
> supervdsmd, and thus provide the backtrace?

Comment 5 Jillian Morgan 2014-11-12 23:13:29 UTC
Dan,

The crashes were frequent. At least a couple of times per day. But never on cue (not intentionally reproducible), and never while I was watching.

It's been about two weeks since I did the upgrade, and I have been cursing at it since, trying to find any plausible cause and channelling my deepest Google-fu to find anyone experiencing the same issue. Each of the nodes had been restarted many times since the upgrade while trying to diagnose. I eventually gave up a few days ago and shut the whole cluster down (I actually considered a switch to OpenStack, but that prospect made my scr*tum shrivel up real tight).

Taking a deep breath, I started everything back up and on a lark, stopped all gluster volumes, disconnected the peers, added a 3rd dummy gluster node to the stack (no bricks, just glusterd to help with quorum, even though there has never been any communication issues between the two nodes), re-probed the peers, started up the volumes, started vdsmd on the nodes, started the engine, and attached gdb to the supervdsmd on one node.

And of course now with gdb attached it's been running all day without fail.. sigh. The previous pattern was that it would never fail while I was working with it. Inevitably I would log out of the webadmin portal, and when I came back later I would find that it had crapped out shortly thereafter. I could never witness it failing in real-time.

I will keep gdb attached (a watched pot/program never boils/crashes??) and hope for the best. Will report back when it finally decides to crash again, or if it doesn't. [fingers crossed]

Of course, if it doesn't crash now, it means that either something went haywire in my gluster volumes during the upgrade from 3.4 to 3.5, or some odd metadata that 3.4 handled/ignored was making 3.5 unhappy, and the teardown and rebuild of the gluster stack fixed it (maybe?).

Comment 6 Jillian Morgan 2014-11-13 03:03:33 UTC
Created attachment 956912 [details]
Backtrace from supervdsm segfault

Backtrace showing crash in glfs_set_volfile_server

Comment 7 Jillian Morgan 2014-11-13 03:36:02 UTC
Backtrace attached.

It looks to me like supervdsm called into glfs_set_volfile_server() with a NULL fs argument, and then the segfault comes naturally when it tries to dereference &fs->ctx->cmd_args.

That comes from gfapi.py's glfsInit():

def glfsInit(volumeId, host, port, protocol):
    fs = _glfs_new(volumeId)

    rc = _glfs_set_volfile_server(fs,
                                  protocol,
                                  host,
                                  port)

Symptomatic cause of the segfault: fs is generated from volumeId via _glfs_new() and then used without being checked for NULL.

Root cause: still unknown.

I can see that glfs_new() is being passed a valid volumeId ('gv1'). glfs_new() seems to have 4 places where it could potentially return NULL...
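
For illustration, here is a minimal ctypes sketch of how a NULL from glfs_new() turns into this segfault. This is not the actual vdsm gfapi.py binding; the library handle and wrapper names are assumptions, and only the public libgfapi signatures are taken as given:

import ctypes

# Hypothetical minimal binding, mirroring the pattern used in gfapi.py.
_lib = ctypes.CDLL("libgfapi.so.0", use_errno=True)

_glfs_new = _lib.glfs_new
_glfs_new.argtypes = [ctypes.c_char_p]
_glfs_new.restype = ctypes.c_void_p      # a NULL glfs_t* surfaces as Python None

_glfs_set_volfile_server = _lib.glfs_set_volfile_server
_glfs_set_volfile_server.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                     ctypes.c_char_p, ctypes.c_int]
_glfs_set_volfile_server.restype = ctypes.c_int

fs = _glfs_new(b"gv1")
# If glfs_new() failed, fs is None. ctypes happily marshals None back into a
# NULL pointer argument, so the next call dereferences fs->ctx inside
# libgfapi and the whole process (here: supervdsm) segfaults.
rc = _glfs_set_volfile_server(fs, b"tcp", b"localhost", 24007)

In other words, with restype set to c_void_p a failed glfs_new() comes back as None, and nothing stops that None from being fed straight into the next C call, which is why the NULL check proposed below matters.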

Your insight is greatly appreciated.

Comment 8 Jillian Morgan 2014-11-13 14:02:17 UTC
I've set up another run of gdb attached to supervdsmd, with breakpoints at each of the 4 places that glfs_new() can return NULL, to find out which one is the culprit. If you have any hints to help narrow down the cause more quickly, I'm all ears.

Comment 9 Jillian Morgan 2014-11-13 16:24:16 UTC
I can't seem to get gdb to break where I ask it to. It keeps breaking at some line just before or after the desired breakpoint, so I can't break at the point of an actual "return NULL" in order to inspect _which_ "return NULL" is causing the issue.

Is this because of compiler optimization? What's the point of the -debuginfo packages if they aren't useful for debugging optimized code? Is there any way for me to do anything helpful now without having proper debug builds of all the relevant packages?

I'm stuck at this point and don't know what more I can do without your development expertise on this stuff.

Your insight is greatly appreciated.

Comment 10 Dan Kenigsberg 2014-11-13 19:18:22 UTC
Vdsm must check the return value of glfs_new(). This would give us better hints regarding when it fails and why. Please try with something like the following patch:

diff --git a/vdsm/gluster/gfapi.py b/vdsm/gluster/gfapi.py
index 2aee49b..1359091 100644
--- a/vdsm/gluster/gfapi.py
+++ b/vdsm/gluster/gfapi.py
@@ -50,6 +50,8 @@ class StatVfsStruct(ctypes.Structure):
 
 def glfsInit(volumeId, host, port, protocol):
     fs = _glfs_new(volumeId)
+    if fs is None:
+        raise ge.GlfsInitException(rc=7, err=['glfs_new(%s) failed' % volumeId])
 
     rc = _glfs_set_volfile_server(fs,
                                   protocol,
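
For comparison, a fuller defensive version of glfsInit() might look like the sketch below. This is hypothetical and written against the same module: the _glfs_init()/_glfs_fini() wrappers and the rc values are assumed; only the None check and GlfsInitException mirror the patch above:

def glfsInit(volumeId, host, port, protocol):
    fs = _glfs_new(volumeId)
    if fs is None:
        raise ge.GlfsInitException(
            rc=-1, err=['glfs_new(%s) returned NULL' % volumeId])

    rc = _glfs_set_volfile_server(fs, protocol, host, port)
    if rc != 0:
        _glfs_fini(fs)
        raise ge.GlfsInitException(
            rc=rc, err=['glfs_set_volfile_server(%s) failed' % volumeId])

    rc = _glfs_init(fs)
    if rc != 0:
        _glfs_fini(fs)
        raise ge.GlfsInitException(
            rc=rc, err=['glfs_init(%s) failed' % volumeId])

    return fs

Calling glfs_fini() on the error paths also avoids leaking a half-initialised handle.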

Comment 11 Jillian Morgan 2014-11-13 19:31:51 UTC
Implemented the above patch on both compute nodes. I should expect to see the 'glfs_new(<volumeId>) failed' message in the vdsm.log, right? So now when that happens, what would you like me to report?

Comment 12 Darshan 2014-11-14 11:40:55 UTC
Ian,
     Can you please attach the untruncated core dump and the syslog from around the time this crash occurs? It's unlikely for glfs_new() to return NULL; the core dump and syslog would help to debug the issue. Do you see any abnormal behaviour in glusterd or the volumes when this crash happens?

     I have raised a bug to fix the segmentation fault that occurs when NULL is passed to glfs_set_volfile_server().

Comment 13 Jillian Morgan 2014-11-14 16:04:42 UTC
Darshan,

I have had 4 coredumps from supervdsm now (I don't know why there wasn't one from each segfault), and gdb says every one of them is truncated. They have all been different sizes, from 29MB up to 14GB (although the bigger ones seem to be sparse files; the 14GB file really only consumed 6.6GB, and the rest were a lot smaller than that).

Any idea why these coredumps are truncated (or why gdb _thinks_ they're truncated), or how to produce a good one?

Comment 14 Dan Kenigsberg 2014-11-15 06:26:14 UTC
Darshan, Vdsm must check the return value of glfs_new(). We should also understand why glfs_new() fails. We should not feed glfs_set_volfile_server() with garbage and wonder about the consequences.

Note that with your "gluster: Temporary fix for supervdsm memory leak", this segfault would take place in the helper subprocess and not in the context of supervdsm.

Comment 15 Dan Kenigsberg 2014-11-17 14:44:05 UTC
Darshan, can you tell why _glfs_new(volumeId) has failed and returned NULL? It would be worthwhile to log that information, too.

Comment 16 Jillian Morgan 2014-11-17 15:56:20 UTC
A data point:

With the above patch in place, of course SuperVDSM no longer segfaults. It happily performs normally for some period of hours, then at some point when the (still unknown) failure condition occurs, I start getting the new GlfsInitException "glfs init failed" messages in the logs (one for each volume) every 5 minutes thereafter.

I noted that if I restart glusterd, then the exceptions stop temporarily, but then do start again after some hours.

During the period when the exceptions are occurring, all glusterd commands issued from the command-line continue to work as expected, and the volumes are stable.

I have not witnessed any ill effect upon oVirt from these trapped failures so far.  oVirt still shows all volumes and bricks as UP. The cluster has been stable since putting in the patch to avoid the segfault.

Comment 17 Dan Kenigsberg 2014-11-17 16:54:39 UTC
Can glusterd be asked to provide debug logs, so the reason for glfs_new()'s failure is understood?

Comment 18 Jillian Morgan 2014-11-17 18:11:40 UTC
An update to my last comment: Once the exceptions start, they are NOT consistently occurring every 5 minutes when volume detail is queried. In fact, the interval between exceptions can be anywhere from 5 to 50 minutes, but always on a multiple of 5 minutes (+/- a few seconds), corresponding to when vdsm makes the volume status queries. It is still correct, though, that a restart of glusterd will allow supervdsm to run cleanly for multiple hours before the first exception.

Here's a histogram of the interval between exceptions for the last few days (all 100 current rotated vdsm logs). (Not counting one 13hr interval after a glusterd restart).

Interval	Count
00:05:00	156
00:10:00	71
00:15:00	37
00:20:00	17
00:25:00	11
00:30:00	3
00:35:00	2
00:40:00	1
00:45:00	1
00:50:00	2
00:55:00	0

The glusterfs logs are extensive (lots of INFO-level "noise"), but have nothing interesting corresponding to the times of vdsm's glfs_new failures. It is unclear to me if/how actions via libgfapi are logged. It seems like glfs_set_logging() has to be called on a glfs_t, but we never get that far since glfs_new is failing. It looks like glfs_new() would have to be instrumented to include its own logging to find out which of its 4 failure scenarios is the one causing it to fail and return NULL.
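
As a side note, if glfs_new() did hand back a valid glfs_t, per-handle gfapi logging could be turned up with glfs_set_logging() before init. A rough ctypes sketch follows; the binding names, log path and log level are assumptions, and only the glfs_set_logging() signature comes from the public API:

import ctypes

_lib = ctypes.CDLL("libgfapi.so.0")

_glfs_new = _lib.glfs_new
_glfs_new.argtypes = [ctypes.c_char_p]
_glfs_new.restype = ctypes.c_void_p

_glfs_set_logging = _lib.glfs_set_logging
_glfs_set_logging.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_int]
_glfs_set_logging.restype = ctypes.c_int

fs = _glfs_new(b"gv1")
if fs is not None:
    # Route this handle's gfapi messages to a dedicated file; the numeric
    # level is gluster's own log-level scale (higher = more verbose; the
    # value used here is an assumption).
    _glfs_set_logging(fs, b"/var/log/glusterfs/gfapi-debug.log", 7)
else:
    # This is exactly the case we cannot get past in this bug.
    print("glfs_new() returned NULL; nothing to attach logging to")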

Comment 19 Darshan 2014-11-18 05:57:10 UTC
Raghavendra/Poornima,
     Can you please provide your thoughts on why glfs_new() is returning NULL?

Comment 20 Poornima G 2014-11-18 09:15:48 UTC
glfs_new(), called as part of a new process, failing intermittently is unexpected. All glfs_new() does is memory allocation and hostname translation, but any of those memory allocation failures would suggest an unstable system. glfs_new() doesn't fail even if the volume name passed does not exist, and it makes no socket connection to any other process.
Can you check if you have any gluster log messages in syslog?

Comment 21 Darrell 2014-12-04 17:49:21 UTC
This, or something very similar, bites me too. In conjunction with vdsmd memory leaks, whenever vdsmd restarts on my CentOS 7 hosts, all the VMs running on that host pause and cannot be resumed. Interestingly, there's a CentOS 6 host in the same cluster, and when vdsmd is restarted on that host, VMs do NOT pause, so this seems to be something related to CentOS 7 in some way, at least for me.

I can reproduce this by restarting vdsmd or just supervdsmd. It looks like restarting supervdsmd causes the gluster FUSE mount to break, as I see a lot of gluster errors when this happens (attached).

vdsm-python-4.16.7-1.gitdb83943.el7.noarch
vdsm-python-zombiereaper-4.16.7-1.gitdb83943.el7.noarch
vdsm-xmlrpc-4.16.7-1.gitdb83943.el7.noarch
vdsm-yajsonrpc-4.16.7-1.gitdb83943.el7.noarch
vdsm-4.16.7-1.gitdb83943.el7.x86_64
vdsm-cli-4.16.7-1.gitdb83943.el7.noarch
vdsm-jsonrpc-4.16.7-1.gitdb83943.el7.noarch
glusterfs-cli-3.6.1-1.el7.x86_64
glusterfs-libs-3.6.1-1.el7.x86_64
glusterfs-3.6.1-1.el7.x86_64
glusterfs-rdma-3.6.1-1.el7.x86_64
glusterfs-api-3.6.1-1.el7.x86_64
glusterfs-fuse-3.6.1-1.el7.x86_64

Comment 22 Darrell 2014-12-04 18:06:11 UTC
Created attachment 964782 [details]
gluster, vdsmd, and sanlock logs from one event

vdsmd crash/OOM kill at ~5:27

Comment 23 Sandro Bonazzola 2015-01-15 14:25:35 UTC
This is an automated message: 
This bug should be fixed in oVirt 3.5.1 RC1, moving to QA

Comment 24 Sandro Bonazzola 2015-01-21 16:02:32 UTC
oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.

