Description of problem:

Version-Release number of selected component (if applicable):
glusterfs-geo-replication-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-fuse-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-rdma-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-devel-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-server-3.3.0rhsvirt1-2.el6_2.x86_64
vdsm-4.9.6-34.0.el6_3.x86_64
rhevm-3.1.0-16.el6ev.noarch

How reproducible:
Not sure about the reproducible steps.

Steps to Reproduce:
1. The only similarity I could notice is that all the VMs that were migrated to the other host are now paused, and "VM has been paused due to a storage error" is displayed on RHEV-M.

Actual results:
VM has been paused due to a storage error

Expected results:

Additional info:
# gluster volume info

Volume Name: distrep2
Type: Distributed-Replicate
Volume ID: af0738c8-b5a3-49da-b3e9-414b6ca2df75
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/disk2
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/disk2
Brick3: rhs-gp-srv12.lab.eng.blr.redhat.com:/disk2
Brick4: rhs-gp-srv15.lab.eng.blr.redhat.com:/disk2
Options Reconfigured:
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: enable
cluster.eager-lock: enable

/var/log/glusterfs/rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com\:_distrep2.log

[2012-09-13 11:58:38.691968] I [fuse-bridge.c:4222:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-09-13 11:58:38.692496] I [client-handshake.c:453:client_set_lk_version_cbk] 0-distrep2-client-2: Server lk version = 1
[2012-09-13 11:58:38.692656] I [fuse-bridge.c:3405:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
[2012-09-13 11:58:38.693274] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 0-distrep2-replicate-0: added root inode
[2012-09-13 11:58:38.693942] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 0-distrep2-replicate-1: added root inode
[2012-09-13 14:41:47.881010] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-0: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (01453d96-2085-4852-ab0b-60dd3e8fe354)
[2012-09-13 14:41:47.881089] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-1: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (01453d96-2085-4852-ab0b-60dd3e8fe354)
[2012-09-13 17:06:32.030180] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-09-13 17:06:32.031212] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-09-14 12:15:00.444610] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-0: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (105350ae-cc0d-4425-a126-a055f3599e75)
[2012-09-14 12:15:00.444855] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-1: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (105350ae-cc0d-4425-a126-a055f3599e75)
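One detail worth pulling out of the stale-handle warnings: the same path (/9c17fd91-.../dom_md/metadata) is reported with two different gfids on 09-13 and 09-14, which is at least consistent with the metadata file having been replaced between lookups. A small sketch (hypothetical helper, just parsing the two warning lines quoted from the log above) that extracts and compares the gfids:

```python
import re

# Two of the "Stale NFS file handle" warnings from the client log above
# (one from 09-13, one from 09-14), abbreviated to one client each.
LOG_LINES = [
    "[2012-09-13 14:41:47.881010] W [client3_1-fops.c:2630:client3_1_lookup_cbk] "
    "0-distrep2-client-0: remote operation failed: Stale NFS file handle. "
    "Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata "
    "(01453d96-2085-4852-ab0b-60dd3e8fe354)",
    "[2012-09-14 12:15:00.444610] W [client3_1-fops.c:2630:client3_1_lookup_cbk] "
    "0-distrep2-client-0: remote operation failed: Stale NFS file handle. "
    "Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata "
    "(105350ae-cc0d-4425-a126-a055f3599e75)",
]

# Match the trailing "Path: <path> (<gfid>)" part of a lookup warning.
STALE_RE = re.compile(r"Path: (?P<path>\S+) \((?P<gfid>[0-9a-f-]{36})\)")

def stale_gfids(lines):
    """Return a (path, gfid) pair for each stale-handle warning."""
    return [STALE_RE.search(line).group("path", "gfid") for line in lines]

pairs = stale_gfids(LOG_LINES)
# Same path, two different gfids across the two days.
assert pairs[0][0] == pairs[1][0]
assert pairs[0][1] != pairs[1][1]
```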
Skimming through the libvirt logs, I see something similar to an issue I saw when setting up our rhevm instance:

qemu-kvm: -drive file=/rhev/data-center/609c0436-4b5c-4639-89ed-1fc60ccaa9a1/9c17fd91-2e28-463d-b9b3-93fcd9a77679/images/8179fad6-6fe8-4f19-9fdc-76592eef53d8/86c8e2c0-3a32-4f00-ad21-1c02afc304db,if=none,id=drive-virtio-disk0,format=raw,serial=8179fad6-6fe8-4f19-9fdc-76592eef53d8,cache=none,werror=stop,rerror=stop,aio=threads: could not open disk image /rhev/data-center/609c0436-4b5c-4639-89ed-1fc60ccaa9a1/9c17fd91-2e28-463d-b9b3-93fcd9a77679/images/8179fad6-6fe8-4f19-9fdc-76592eef53d8/86c8e2c0-3a32-4f00-ad21-1c02afc304db: Permission denied
2012-09-13 09:41:42.838+0000: shutting down

I disabled SELinux on the hypervisor hosts to bypass a similar error. If I re-enable it and try a migration to that host, however, the migration fails (as opposed to migrating the VM and then stopping due to the error).

It sounds like it could be something different, but out of curiosity, is this a persistent state of the hypervisor host? In other words, can you forcibly start the paused VMs, or any VM on that host, or is the error transient during the migration? If persistent, does an 'echo 0 > /selinux/enforce' change anything?
Can you please respond to Brian's questions?
Unfortunately the setup no longer exists. However, the SELinux error in comment #6 is similar to the one at https://bugzilla.redhat.com/show_bug.cgi?id=835936, which is addressed in selinux-policy-3.7.19-155.el6_3.6.
Taking another look at the libvirtd.log, I see snippets similar to the one below for several VMs. This suggests a VM was paused due to a failed write (EACCES, "Permission denied"), but I'm not seeing anything that correlates this to a problem in gluster. FWIW, I've been doing significant rhevm testing on a dhtrep2 volume for the past couple of weeks (including migrations/self-heals) and I've yet to see a "VM paused" error as described here.

I think we'll need more information to make further progress with this bug:
- Some kind of insight into the steps to reproduce.
- Ideally, access to an affected setup in this state to determine reproducibility and persistence of the error, and potentially to debug.

--- libvirtd.log ---
qemuMonitorEmitIOError:954 : mon=0x7f49d81261d0
qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49d81261d0 refs=4
qemuProcessHandleIOError:854 : Transitioned guest VM_5 to paused state due to IO error
qemuProcessHandleIOError:864 : Preserving lock state '(null)'
qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49d81261d0 refs=3
qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49d81261d0 refs=2
virDomainFree:2313 : dom=0x97a350, (VM: name=VM_5, uuid=0e8a2034-f4b1-4d34-8859-47e35793373e)
virDomainFree:2313 : dom=0x97a350, (VM: name=VM_5, uuid=0e8a2034-f4b1-4d34-8859-47e35793373e)
qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49d81261d0 refs=3
qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49d81261d0 buf={"timestamp": {"seconds": 1347628099, "microseconds": 478179}, "event": "STOP"}^M
{"timestamp": {"seconds": 1347628099, "microseconds": 478478}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Permission denied", "errno": 13}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
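The BLOCK_IO_ERROR event in the monitor buffer above is plain JSON, so the failure details can be decoded directly rather than eyeballed. A minimal sketch (not part of any tooling here, just parsing the event quoted from this log) confirming it is a failed write with errno 13 (EACCES) and a "stop" action, which is what transitions the guest to paused:

```python
import errno
import json

# The BLOCK_IO_ERROR QMP event from the libvirtd.log snippet above.
EVENT = '''{"timestamp": {"seconds": 1347628099, "microseconds": 478478},
"event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0",
"__com.redhat_debug_info": {"message": "Permission denied", "errno": 13},
"__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}'''

event = json.loads(EVENT)
data = event["data"]
err = data["__com.redhat_debug_info"]["errno"]

# errno 13 is EACCES ("Permission denied"); the "stop" action is what
# pauses the guest, matching the "paused state due to IO error" line.
assert err == errno.EACCES
summary = (data["device"], data["operation"], data["action"],
           errno.errorcode[err])
print(summary)  # ('drive-virtio-disk0', 'write', 'stop', 'EACCES')
```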
Closing due to insufficient data. Please re-open if this can be reproduced with more data and potential access to a configuration in this state.