Description of problem:

supervdsm encounters a segfault in libgfapi, causing all VMs on that host to enter the paused state and remain stuck there indefinitely, even though supervdsm is restarted automatically (by systemd). High-availability VMs are neither migrated nor restarted. The paused VMs cannot be manually un-paused; they must be manually killed and then restarted.

Version-Release number of selected component (if applicable):
vdsm-4.16.7-1.gitdb83943.fc20.x86_64
vdsm-gluster-4.16.7-1.gitdb83943.fc20.noarch
glusterfs-api-3.5.2-1.fc20.x86_64

How reproducible:
Consistent since the upgrade to oVirt release 3.5. The oVirt nodes remain operational for anywhere from 8 hours down to as little as 30 minutes between instances of the above-described segfaults. The issue appears to occur when supervdsm is looping over all gluster volumes and retrieving their details. This polling happens regularly (every 5 minutes, it seems), usually without crashing, but then randomly does crash. It should also be noted that the volume being queried when the crash occurs is not consistent. Each volume appears to be queried twice, once plain (third arg of the call to volumeStatus() is empty ('')) and then again with detail (third arg of the call to volumeStatus() is 'detail'). The crash always comes during one of the 'detail' queries.

Steps to Reproduce:
1. VM storage domain on glusterfs
2. Run some VMs
3. Wait

Actual results:

Here are the relevant log extracts that I can find:

/var/log/messages:

Nov 5 21:46:29 big kernel: supervdsmServer[11638]: segfault at 18 ip 00007f652c41115a sp 00007f5fd3b7cf40 error 4 in libgfapi.so.0.0.0[7f652c40a000+19000]

/var/log/vdsm/supervdsm.log (looping through, getting information about each gluster volume):

MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,015::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call wrapper with ('web', '', 'detail') {}
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,015::utils::738::root::(execCmd) /usr/sbin/gluster --mode=script volume status web detail --xml (cwd None)
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,323::utils::758::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,324::supervdsmServer::108::SuperVdsm.ServerCallback::(wrapper) return wrapper with {'bricks': [{'sizeTotal': '5722917.992', 'mntOptions': 'rw,relatime,ssd,space_cache', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924', 'device': '/dev/bcache0', 'blockSize': '4096', 'brick': 'big:/mnt/export/r0/web', 'sizeFree': '5624569.770', 'fsName': 'btrfs'}, {'sizeTotal': '5722917.992', 'mntOptions': 'rw,relatime,ssd,space_cache', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051', 'device': '/dev/bcache0', 'blockSize': '4096', 'brick': 'bang:/mnt/export/r0/web', 'sizeFree': '5619114.816', 'fsName': 'btrfs'}], 'name': 'web'}
MainProcess|Thread-792::DEBUG::2014-11-05 21:46:29,325::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call wrapper with ('web',) {}

[*** supervdsm crashed here and is restarted ***]

MainThread::DEBUG::2014-11-05 22:55:21,655::netconfpersistence::134::root::(_getConfigs) Non-existing config set.
MainThread::DEBUG::2014-11-05 22:55:21,655::netconfpersistence::134::root::(_getConfigs) Non-existing config set.
MainThread::DEBUG::2014-11-05 22:55:21,722::supervdsmServer::411::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-05 22:55:21,722::supervdsmServer::420::SuperVdsm.Server::(main) Parsing cmd args
MainThread::DEBUG::2014-11-05 22:55:21,722::supervdsmServer::423::SuperVdsm.Server::(main) Cleaning old socket /var/run/vdsm/svdsm.sock
MainThread::DEBUG::2014-11-05 22:55:21,723::supervdsmServer::427::SuperVdsm.Server::(main) Setting up keep alive thread
MainThread::DEBUG::2014-11-05 22:55:21,723::supervdsmServer::433::SuperVdsm.Server::(main) Creating remote object manager
MainThread::DEBUG::2014-11-05 22:55:21,724::fileUtils::192::Storage.fileUtils::(chown) Changing owner for /var/run/vdsm/svdsm.sock, to (36:36)
MainThread::DEBUG::2014-11-05 22:55:21,725::supervdsmServer::444::SuperVdsm.Server::(main) Started serving super vdsm object

/var/log/vdsm/vdsm.log:

Thread-792::DEBUG::2014-11-05 21:46:28,579::BindingXMLRPC::1132::vds::(wrapper) client [192.168.21.36]::call volumeStatus with ('web', '', '') {} flowID [3d9b0b87]
Thread-792::DEBUG::2014-11-05 21:46:29,002::BindingXMLRPC::1139::vds::(wrapper) return volumeStatus with {'volumeStatus': {'bricks': [{'status': 'ONLINE', 'brick': 'big:/mnt/export/r0/web', 'pid': '4531', 'port': '50154', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924'}, {'status': 'ONLINE', 'brick': 'bang:/mnt/export/r0/web', 'pid': '20996', 'port': '50154', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051'}], 'nfs': [{'status': 'OFFLINE', 'hostname': '192.168.1.83', 'pid': '-1', 'port': 'N/A', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924'}, {'status': 'ONLINE', 'hostname': 'bang', 'pid': '21004', 'port': '2049', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051'}], 'shd': [{'status': 'ONLINE', 'hostname': '192.168.1.83', 'pid': '4547', 'hostuuid': '6956bfd3-7a64-4789-9fda-d83bbea93924'}, {'status': 'ONLINE', 'hostname': 'bang', 'pid': '21011', 'hostuuid': 'e2a5191a-4d72-430f-9227-feb150163051'}], 'name': 'web'}, 'status': {'message': 'Done', 'code': 0}}
Thread-792::DEBUG::2014-11-05 21:46:29,014::BindingXMLRPC::1132::vds::(wrapper) client [192.168.21.36]::call volumeStatus with ('web', '', 'detail') {} flowID [3d9b0b87]

[*** expecting "BindingXMLRPC::1139::vds::(wrapper) return volumeStatus with ..." to come next, but of course it doesn't come because supervdsm crashed. ***]

Expected results:

The volumeStatus() query should return successfully (no crash, duh). Also, running VMs should not be affected by this random failed volume information query.

Additional info:

I have two identically configured compute nodes that have both experienced this issue since upgrading to oVirt 3.5. Either one may suffer the crash at any given time, and not at the same time, but eventually both will suffer the crash and wind up with all VMs paused.

A core dump found in /var/log/core/core.538.1415241989.dump appears to be truncated and cannot produce a stack trace:

# gdb core.538.1415241989.dump
...
BFD: Warning: /var/log/core/core.538.1415241989.dump is truncated: expected core file size >= 19657957376, found: 14539485184.
...
Failed to read a valid object file image from memory.
Core was generated by `/usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f652c41115a in ?? ()
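For completeness, the polling cycle described under "How reproducible" can be mimicked outside of vdsm with the same gluster CLI invocation that appears in supervdsm.log above. This is a rough sketch only (the volume list and the plain, non-detail variant of the command are assumptions for my setup); it exercises the CLI path and does not touch libgfapi, where the segfault itself occurs:

#!/usr/bin/env python
# Rough sketch: mimic the 5-minute volume status polling described above,
# using the gluster CLI command seen in supervdsm.log. The volume list and
# the non-detail variant of the command are assumptions.
import subprocess
import time

VOLUMES = ['web']     # hypothetical: replace with your own volume names
INTERVAL = 300        # vdsm appears to poll every 5 minutes

def volume_status(volume, option=''):
    cmd = ['/usr/sbin/gluster', '--mode=script', 'volume', 'status', volume]
    if option:
        cmd.append(option)   # 'detail', as in the query preceding the crash
    cmd.append('--xml')
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

if __name__ == '__main__':
    while True:
        for vol in VOLUMES:
            for opt in ('', 'detail'):   # each volume is queried twice, as in the logs
                rc, out, err = volume_status(vol, opt)
                print('%s %s -> rc=%d' % (vol, opt or 'plain', rc))
        time.sleep(INTERVAL)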
Ian, can you attach vdsm.log? It may help me understand why the VMs paused and why unpausing them failed. How often are the crashes? Can you reproduce them with gdb attached to supervdsmd, and thus provide the backtrace?

Darshan, I'm 99.9% sure that the segfault occurs in libgfapi, not in vdsm or Python. Are you aware of such a bug in libgfapi? If it cannot be solved there, we may need to disable this new feature in Vdsm.
Created attachment 956777 [details]
VDSM log around the time of supervdsm failure and VMs pausing

Dan,

Here's the log you requested, surrounding the previously described instance of the failure. This is a 16MB single-file extract (524K compressed) from the rotated log set.
Created attachment 956778 [details]
VDSM log around the time of supervdsm failure and VMs pausing

Take two: I accidentally selected type text/plain last time, but the file is XZ-compressed.
> Ian, how often are the crashes? Can you reproduce them when gdb is attached to
> supervdsmd, and thus provide the backtrace?
Dan,

The crashes were frequent, at least a couple of times per day, but never on cue (not intentionally reproducible), and never while I was watching. It's been about two weeks since I did the upgrade, and I have been cursing at it ever since, trying to find any plausible cause and channelling my deepest Google-fu to find anyone experiencing the same issue. Each of the nodes had been restarted many times since the upgrade while trying to diagnose.

I eventually gave up a few days ago and shut the whole cluster down (I actually considered a switch to OpenStack, but that prospect made my scr*tum shrivel up real tight). Taking a deep breath, I started everything back up and, on a lark, stopped all gluster volumes, disconnected the peers, added a third dummy gluster node to the stack (no bricks, just glusterd to help with quorum, even though there has never been any communication issue between the two nodes), re-probed the peers, started up the volumes, started vdsmd on the nodes, started the engine, and attached gdb to supervdsmd on one node.

And of course, now with gdb attached, it's been running all day without fail... sigh. The previous pattern was that it would never fail while I was working with it. Inevitably I would log out of the webadmin portal, and when I came back later I would find that it had crapped out shortly thereafter. I could never witness it failing in real time.

I will keep gdb attached (a watched pot/program never boils/crashes?) and hope for the best. I will report back when it finally decides to crash again, or if it doesn't. [fingers crossed]

Of course, if it doesn't crash now, it means that either something went haywire in my gluster volumes during the upgrade from 3.4 to 3.5, or some odd metadata that 3.4 handled/ignored is tripping up 3.5, and the teardown and rebuild of the gluster stack fixed it (maybe?).
Created attachment 956912 [details]
Backtrace from supervdsm segfault

Backtrace showing the crash in glfs_set_volfile_server()
Backtrace attached. It looks to me like supervdsm called into glfs_set_volfile_server() with a NULL fs argument, and then the segfault comes naturally when it tries to dereference &fs->ctx->cmd_args. That call comes from gfapi.py's glfsInit():

def glfsInit(volumeId, host, port, protocol):
    fs = _glfs_new(volumeId)
    rc = _glfs_set_volfile_server(fs, protocol, host, port)

Symptomatic cause of the segfault: fs is generated from volumeId via _glfs_new() and then used without being checked for NULL.

Root cause: still unknown. I can see that glfs_new() is being passed a valid volumeId ('gv1'). glfs_new() appears to have 4 places where it could potentially return NULL...

Your insight is greatly appreciated.
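For anyone following along: the reason a failed glfs_new() slips through in Python is that ctypes hands back None for a NULL pointer, and None is then passed straight on as a NULL argument to the next call. A minimal standalone sketch of that pattern (the argtypes/restype declarations, transport, and port below are illustrative assumptions, not vdsm's actual bindings):

import ctypes
from ctypes import c_char_p, c_int, c_void_p

# Illustrative only: bind libgfapi directly; vdsm's gfapi.py does its own binding.
api = ctypes.CDLL('libgfapi.so.0')

api.glfs_new.argtypes = [c_char_p]
api.glfs_new.restype = c_void_p          # a NULL glfs_t* comes back as None
api.glfs_set_volfile_server.argtypes = [c_void_p, c_char_p, c_char_p, c_int]
api.glfs_set_volfile_server.restype = c_int

fs = api.glfs_new('gv1')                 # byte strings would be needed on Python 3
if fs is None:
    # Without this check, passing None (NULL) to glfs_set_volfile_server()
    # leads straight to the &fs->ctx->cmd_args dereference seen in the backtrace.
    raise RuntimeError('glfs_new(gv1) failed')

# Transport 'tcp' and port 24007 are assumed values for illustration.
rc = api.glfs_set_volfile_server(fs, 'tcp', 'localhost', 24007)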
I've set up another run of gdb attached to supervdsmd, with breakpoints at each of the 4 places that glfs_new() can return NULL, to find out which one is the culprit. If you have any hints to help narrow down the cause more quickly, I'm all ears.
I can't seem to get gdb to break where I ask it to. It keeps breaking at some line just before or after the desired breakpoint, when what I want is to break at each actual "return NULL" in order to identify _which_ "return NULL" is causing the issue. Is this because of compiler optimization? What's the point of the -debuginfo packages if they aren't useful for debugging optimized code? Is there any way for me to do anything helpful now without having proper debug builds of all the relevant packages? I'm stuck at this point and don't know what more I can do without your development expertise on this stuff. Your insight is greatly appreciated.
Vdsm must check the return value of glfs_new(). This would give us better hints about where it fails and why. Please try something like the following patch:

diff --git a/vdsm/gluster/gfapi.py b/vdsm/gluster/gfapi.py
index 2aee49b..1359091 100644
--- a/vdsm/gluster/gfapi.py
+++ b/vdsm/gluster/gfapi.py
@@ -50,6 +50,8 @@ class StatVfsStruct(ctypes.Structure):
 def glfsInit(volumeId, host, port, protocol):
     fs = _glfs_new(volumeId)
+    if fs is None:
+        raise ge.GlfsInitException(rc=7, err=['glfs_new(%s) failed' % volumeId])
     rc = _glfs_set_volfile_server(fs, protocol,
Implemented the above patch on both compute nodes. I should expect to see the 'glfs_new(<volumeId>) failed' message in vdsm.log, right? So when that happens, what would you like me to report?
Ian,

Can you please attach the untruncated core dump and the syslog from around the time this crash occurs? It's unlikely for glfs_new() to return NULL; the core dump and syslog would help to debug the issue. Do you see any abnormal behaviour in glusterd or the volumes when this crash happens?

I have raised a bug to fix the segmentation fault that occurs when NULL is passed to glfs_set_volfile_server().
Darshan,

I have had 4 core dumps from supervdsm now (I don't know why there wasn't one for each segfault), and gdb says every one of them is truncated. They have all been different sizes, from 14GB down to 29MB (although the bigger ones seem to be sparse files: the 14GB file really only consumed 6.6GB, and the rest were a lot smaller than that). Any idea why these core dumps are truncated (or why gdb _thinks_ they're truncated), or how to produce a good one?
Darshan,

Vdsm must check the return value of glfs_new(), and we should also understand why glfs_new() fails. We should not feed glfs_set_volfile_server() garbage and then wonder about the consequences.

Note that with your "gluster: Temporary fix for supervdsm memory leak" patch, this segfault would take place in the helper subprocess and not in the context of supervdsm.
Darshan, can you tell why _glfs_new(volumeId) has failed and returned NULL? It would be worthwhile to log that information, too.
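A hedged sketch of what such logging could look like, building on the patch in the earlier comment. The logger name is hypothetical, and this is a fragment of gfapi.py's glfsInit() rather than a standalone script:

import logging

log = logging.getLogger("Gluster.gfapi")   # hypothetical logger name

def glfsInit(volumeId, host, port, protocol):
    fs = _glfs_new(volumeId)
    if fs is None:
        # Record which volume/host combination hit the NULL return, so the
        # failures can be correlated with glusterd activity in the logs.
        log.error("glfs_new(%s) returned NULL (host=%s, port=%s, protocol=%s)",
                  volumeId, host, port, protocol)
        raise ge.GlfsInitException(rc=7,
                                   err=['glfs_new(%s) failed' % volumeId])
    rc = _glfs_set_volfile_server(fs, protocol, host, port)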
A data point: with the above patch in place, SuperVDSM of course no longer segfaults. It happily runs normally for some period of hours; then, at some point when the (still unknown) failure condition occurs, I start getting the new GlfsInitException "glfs init failed" messages in the logs (one for each volume) every 5 minutes thereafter. I noted that if I restart glusterd, the exceptions stop temporarily, but then start again after some hours.

During the period when the exceptions are occurring, all gluster commands issued from the command line continue to work as expected, and the volumes are stable. I have not witnessed any ill effect upon oVirt from these trapped failures so far; oVirt still shows all volumes and bricks as UP. The cluster has been stable since putting in the patch to avoid the segfault.
Can glusterd be asked to provide debug logs, so the reason for glfs_new()'s failure is understood?
An update to my last comment: once the exceptions start, they are NOT consistently occurring every 5 minutes when volume detail is queried. In fact, the interval between exceptions can be anywhere from 5 to 50 minutes, but always a multiple of 5 minutes (+/- a few seconds), corresponding to when vdsm makes the volume status queries. It is still correct, though, that a restart of glusterd will allow supervdsm to run cleanly for multiple hours before the first exception.

Here's a histogram of the interval between exceptions for the last few days (all 100 current rotated vdsm logs), not counting one 13-hour interval after a glusterd restart:

Interval   Count
00:05:00   156
00:10:00    71
00:15:00    37
00:20:00    17
00:25:00    11
00:30:00     3
00:35:00     2
00:40:00     1
00:45:00     1
00:50:00     2
00:55:00     0

The glusterfs logs are extensive (lots of INFO-level "noise") but have nothing interesting corresponding to the times of vdsm's glfs_new() failures. It is unclear to me if/how actions via libgfapi are logged; it seems like glfs_set_logging() has to be called on a glfs_t, but we never get that far since glfs_new() is failing. It looks like glfs_new() would have to be instrumented with its own logging to find out which of its 4 failure scenarios is causing it to fail and return NULL.
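In case it's useful for comparison on other systems, this is roughly how the histogram above was tallied. A rough sketch only: it assumes the timestamp format shown in the vdsm logs and a simple substring match on the failure message; the exact search string and log glob may differ on your setup (and compressed rotated logs would need to be decompressed first):

#!/usr/bin/env python
# Rough sketch: bucket the time between consecutive failure log lines into
# 5-minute intervals. Assumes vdsm-style timestamps (2014-11-05 21:46:29,015).
import datetime
import glob
import re
from collections import Counter

PATTERN = 'glfs init failed'                      # assumed search string
TS_RE = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+')

timestamps = []
for path in sorted(glob.glob('/var/log/vdsm/vdsm.log*')):
    with open(path) as f:
        for line in f:
            if PATTERN in line:
                m = TS_RE.search(line)
                if m:
                    timestamps.append(datetime.datetime.strptime(
                        m.group(1), '%Y-%m-%d %H:%M:%S'))

timestamps.sort()
buckets = Counter()
for prev, cur in zip(timestamps, timestamps[1:]):
    minutes = (cur - prev).total_seconds() / 60.0
    # Round to the nearest 5-minute bucket (+/- a few seconds of jitter).
    buckets[int(round(minutes / 5.0) * 5)] += 1

for interval in sorted(buckets):
    print('00:%02d:00  %d' % (interval, buckets[interval]))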
Raghavendra/Poornima,

Can you please provide your thoughts on why glfs_new() is returning NULL?
glfs_new() being called as part of a new process and failing intermittently is unexpected. All glfs_new() does is memory allocation and hostname translation, but any of those memory allocation failures could result in an unstable system. glfs_new() doesn't fail even if the volume name passed does not exist, and it makes no socket connection to any other process. Can you check whether you have any gluster log messages in syslog?
This, or something very similar, bites me too. In conjunction with vdsmd memory leaks, whenever vdsmd restarts on my CentOS 7 hosts, all the VMs running on that host pause and cannot be resumed. Interestingly, there's a CentOS 6 host in the same cluster, and when vdsmd is restarted on that host, VMs do NOT pause, so this seems to be something related to CentOS 7 in some way, at least for me.

I can reproduce this by restarting vdsmd or just supervdsmd. It looks like restarting supervdsmd causes the gluster FUSE mount to break, as I see a lot of gluster errors when this happens (attached).

vdsm-python-4.16.7-1.gitdb83943.el7.noarch
vdsm-python-zombiereaper-4.16.7-1.gitdb83943.el7.noarch
vdsm-xmlrpc-4.16.7-1.gitdb83943.el7.noarch
vdsm-yajsonrpc-4.16.7-1.gitdb83943.el7.noarch
vdsm-4.16.7-1.gitdb83943.el7.x86_64
vdsm-cli-4.16.7-1.gitdb83943.el7.noarch
vdsm-jsonrpc-4.16.7-1.gitdb83943.el7.noarch
glusterfs-cli-3.6.1-1.el7.x86_64
glusterfs-libs-3.6.1-1.el7.x86_64
glusterfs-3.6.1-1.el7.x86_64
glusterfs-rdma-3.6.1-1.el7.x86_64
glusterfs-api-3.6.1-1.el7.x86_64
glusterfs-fuse-3.6.1-1.el7.x86_64
Created attachment 964782 [details]
gluster, vdsmd, and sanlock logs from one event

vdsmd crash/OOM kill at ~5:27
This is an automated message: This bug should be fixed in oVirt 3.5.1 RC1, moving to QA
oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.