Bug 857456

Summary: Alert detected on RHEV-M - VM has been paused due to a storage error.
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: fuse
Version: unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: high
Reporter: Gowrishankar Rajaiyan <grajaiya>
Assignee: Brian Foster <bfoster>
QA Contact: Gowrishankar Rajaiyan <grajaiya>
Docs Contact:
CC: aavati, bfoster, csaba, grajaiya, iheim, rhs-bugs, shaines, vbellur, vinaraya
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-12 21:59:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Gowrishankar Rajaiyan 2012-09-14 13:32:57 UTC
Description of problem:


Version-Release number of selected component (if applicable):
glusterfs-geo-replication-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-fuse-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-rdma-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-devel-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-server-3.3.0rhsvirt1-2.el6_2.x86_64
vdsm-4.9.6-34.0.el6_3.x86_64
rhevm-3.1.0-16.el6ev.noarch

How reproducible: Unknown; reliable steps to reproduce have not been identified.


Steps to Reproduce:
1. The only common factor observed is that all the VMs that were migrated to the other host are now paused, and "VM has been paused due to a storage error" is displayed on RHEV-M.
  
Actual results:
VM has been paused due to a storage error

Expected results:
Migrated VMs remain running; no "VM has been paused due to a storage error" events on RHEV-M.

Additional info:

# gluster volume info
 
Volume Name: distrep2
Type: Distributed-Replicate
Volume ID: af0738c8-b5a3-49da-b3e9-414b6ca2df75
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/disk2
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/disk2
Brick3: rhs-gp-srv12.lab.eng.blr.redhat.com:/disk2
Brick4: rhs-gp-srv15.lab.eng.blr.redhat.com:/disk2
Options Reconfigured:
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: enable
cluster.eager-lock: enable

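For context, the "Options Reconfigured" values above correspond to standard gluster CLI tuning commands. A sketch of how they would have been applied (assuming the volume name distrep2 shown above; run on any server in the trusted pool):

gluster volume set distrep2 performance.quick-read disable
gluster volume set distrep2 performance.io-cache disable
gluster volume set distrep2 performance.stat-prefetch disable
gluster volume set distrep2 performance.read-ahead disable
gluster volume set distrep2 storage.linux-aio enable
gluster volume set distrep2 cluster.eager-lock enable
gluster volume info distrep2   # verify the options appear under "Options Reconfigured"
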
/var/log/glusterfs/rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com\:_distrep2.log
[2012-09-13 11:58:38.691968] I [fuse-bridge.c:4222:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-09-13 11:58:38.692496] I [client-handshake.c:453:client_set_lk_version_cbk] 0-distrep2-client-2: Server lk version = 1
[2012-09-13 11:58:38.692656] I [fuse-bridge.c:3405:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
[2012-09-13 11:58:38.693274] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 0-distrep2-replicate-0: added root inode
[2012-09-13 11:58:38.693942] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 0-distrep2-replicate-1: added root inode
[2012-09-13 14:41:47.881010] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-0: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (01453d96-2085-4852-ab0b-60dd3e8fe354)
[2012-09-13 14:41:47.881089] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-1: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (01453d96-2085-4852-ab0b-60dd3e8fe354)
[2012-09-13 17:06:32.030180] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-09-13 17:06:32.031212] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-09-14 12:15:00.444610] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-0: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (105350ae-cc0d-4425-a126-a055f3599e75)
[2012-09-14 12:15:00.444855] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-1: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (105350ae-cc0d-4425-a126-a055f3599e75)

Comment 6 Brian Foster 2012-11-16 20:12:28 UTC
Skimming through the libvirt logs, I see something similar to an issue I saw when setting up our rhevm instance:

qemu-kvm: -drive file=/rhev/data-center/609c0436-4b5c-4639-89ed-1fc60ccaa9a1/9c17fd91-2e28-463d-b9b3-93fcd9a77679/images/8179fad6-6fe8-4f19-9fdc-76592eef53d8/86c8e2c0-3a32-4f00-ad21-1c02afc304db,if=none,id=drive-virtio-disk0,format=raw,serial=8179fad6-6fe8-4f19-9fdc-76592eef53d8,cache=none,werror=stop,rerror=stop,aio=threads: could not open disk image /rhev/data-center/609c0436-4b5c-4639-89ed-1fc60ccaa9a1/9c17fd91-2e28-463d-b9b3-93fcd9a77679/images/8179fad6-6fe8-4f19-9fdc-76592eef53d8/86c8e2c0-3a32-4f00-ad21-1c02afc304db: Permission denied
2012-09-13 09:41:42.838+0000: shutting down

I disabled SELinux on the hypervisor hosts to bypass a similar error. If I re-enable it and try a migration to that host, however, the migration fails outright (as opposed to the VM migrating and then stopping due to an error).

It sounds like it could be something different, but out of curiosity, is this a persistent state of the hypervisor host? In other words, can you forcibly start the paused VMs, or any VM on that host, or is the error transient during the migration? If persistent, does an 'echo 0 > /selinux/enforce' change anything?
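
For reference, a minimal sketch of checking and temporarily relaxing SELinux enforcement on a RHEL 6 hypervisor (standard tools only; nothing here is specific to this setup):

getenforce      # report the current mode: Enforcing, Permissive, or Disabled
setenforce 0    # permissive until reboot; same effect as 'echo 0 > /selinux/enforce'
setenforce 1    # switch back to enforcing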

Comment 7 Vijay Bellur 2012-11-19 08:03:21 UTC
Can you please respond to Brian's questions?

Comment 8 Gowrishankar Rajaiyan 2012-11-19 11:20:52 UTC
Unfortunately the setup no longer exists. However, the SELinux error in comment #6 is similar to the one at https://bugzilla.redhat.com/show_bug.cgi?id=835936 and is addressed in selinux-policy-3.7.19-155.el6_3.6.
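
If this recurs, a hedged sketch of how one might confirm that the policy fix is installed and that SELinux denials were actually being hit (standard RHEL 6 package names and audit log locations assumed):

rpm -q selinux-policy selinux-policy-targeted    # compare against 3.7.19-155.el6_3.6 or later
ausearch -m avc -ts recent                       # recent AVC denials, if auditd is running
grep -i denied /var/log/audit/audit.log | tail   # fallback if ausearch is unavailable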

Comment 9 Brian Foster 2012-12-04 19:54:21 UTC
Taking another look at the libvirtd.log, I see snippets similar to the one below for several VMs. This suggests a VM was paused due to a failed write (EACCES, "Permission denied"), but I'm not seeing anything that correlates this to a problem in gluster.

FWIW, I've been doing significant rhevm testing on a dhtrep2 volume the past couple of weeks (including doing migrations/self-heals) and I've yet to see a VM paused error as described here. I think we'll need more information to make further progress with this bug:

- Some kind of insight into the steps to reproduce.
- Ideally, access to an affected setup in this state to determine reproducibility and persistence of the error, and potentially to debug.

--- libvirtd.log ---

qemuMonitorEmitIOError:954 : mon=0x7f49d81261d0
qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49d81261d0 refs=4 
qemuProcessHandleIOError:854 : Transitioned guest VM_5 to paused state due to IO error
qemuProcessHandleIOError:864 : Preserving lock state '(null)'
qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49d81261d0 refs=3
qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49d81261d0 refs=2
virDomainFree:2313 : dom=0x97a350, (VM: name=VM_5, uuid=0e8a2034-f4b1-4d34-8859-47e35793373e)
virDomainFree:2313 : dom=0x97a350, (VM: name=VM_5, uuid=0e8a2034-f4b1-4d34-8859-47e35793373e)
qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49d81261d0 refs=3 
qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49d81261d0 buf={"timestamp": {"seconds": 1347628099, "microseconds": 478179}, "event": "STOP"}^M 
{"timestamp": {"seconds": 1347628099, "microseconds": 478478}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Permission denied", "errno": 13}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}

Comment 11 Brian Foster 2012-12-12 21:59:59 UTC
Closing as INSUFFICIENT_DATA. Please re-open if this can be reproduced with more data and, ideally, access to a configuration in this state.