Bug 857456 - Alert detected on RHEV-M - VM has been paused due to a storage error.
Summary: Alert detected on RHEV-M - VM has been paused due to a storage error.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: fuse
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Brian Foster
QA Contact: Gowrishankar Rajaiyan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-09-14 13:32 UTC by Gowrishankar Rajaiyan
Modified: 2012-12-12 21:59 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-12-12 21:59:59 UTC
Embargoed:



Description Gowrishankar Rajaiyan 2012-09-14 13:32:57 UTC
Description of problem:
An alert was detected on RHEV-M: "VM has been paused due to a storage error."

Version-Release number of selected component (if applicable):
glusterfs-geo-replication-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-fuse-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-rdma-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-devel-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-3.3.0rhsvirt1-2.el6_2.x86_64
glusterfs-server-3.3.0rhsvirt1-2.el6_2.x86_64
vdsm-4.9.6-34.0.el6_3.x86_64
rhevm-3.1.0-16.el6ev.noarch

How reproducible: Not sure of the exact steps to reproduce.


Steps to Reproduce:
1. The only common factor I could notice is that all the VMs that were migrated to the other host are now paused, and "VM has been paused due to a storage error" is displayed on RHEV-M.

Actual results:
VM has been paused due to a storage error

Expected results:
VMs migrated to the other host continue running and are not paused with a storage error.

Additional info:

# gluster volume info
 
Volume Name: distrep2
Type: Distributed-Replicate
Volume ID: af0738c8-b5a3-49da-b3e9-414b6ca2df75
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/disk2
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/disk2
Brick3: rhs-gp-srv12.lab.eng.blr.redhat.com:/disk2
Brick4: rhs-gp-srv15.lab.eng.blr.redhat.com:/disk2
Options Reconfigured:
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: enable
cluster.eager-lock: enable
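
For triage, it may help to confirm all four bricks were online around the time of the pause, since a brick going down and coming back can produce the stale-handle lookups seen in the client log below. A minimal check sketch, assuming glusterfs 3.3 CLI syntax and the volume name above:

# gluster volume status distrep2
(all brick processes should show as online)
# gluster volume heal distrep2 info
(lists entries pending self-heal on each replica pair)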

/var/log/glusterfs/rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com\:_distrep2.log
[2012-09-13 11:58:38.691968] I [fuse-bridge.c:4222:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-09-13 11:58:38.692496] I [client-handshake.c:453:client_set_lk_version_cbk] 0-distrep2-client-2: Server lk version = 1
[2012-09-13 11:58:38.692656] I [fuse-bridge.c:3405:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
[2012-09-13 11:58:38.693274] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 0-distrep2-replicate-0: added root inode
[2012-09-13 11:58:38.693942] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 0-distrep2-replicate-1: added root inode
[2012-09-13 14:41:47.881010] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-0: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (01453d96-2085-4852-ab0b-60dd3e8fe354)
[2012-09-13 14:41:47.881089] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-1: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (01453d96-2085-4852-ab0b-60dd3e8fe354)
[2012-09-13 17:06:32.030180] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-09-13 17:06:32.031212] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-09-14 12:15:00.444610] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-0: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (105350ae-cc0d-4425-a126-a055f3599e75)
[2012-09-14 12:15:00.444855] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-distrep2-client-1: remote operation failed: Stale NFS file handle. Path: /9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata (105350ae-cc0d-4425-a126-a055f3599e75)
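
The "Stale NFS file handle" (ESTALE) warnings above indicate that lookups on the storage domain metadata file hit a gfid mismatch between the client's view and the bricks; note the logged gfid differs between the 09-13 and 09-14 occurrences, which may indicate the file was recreated in between. As a hedged diagnostic sketch, assuming the brick layout from the volume info above, the file's gfid xattr could be compared directly on the replica-0 brick servers (rhs-gp-srv4 and rhs-gp-srv11):

# getfattr -d -m . -e hex /disk2/9c17fd91-2e28-463d-b9b3-93fcd9a77679/dom_md/metadata

The trusted.gfid value should match on both bricks and agree with the gfid logged by the client.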

Comment 6 Brian Foster 2012-11-16 20:12:28 UTC
Skimming through the libvirt logs, I see something similar to an issue I saw when setting up our rhevm instance:

qemu-kvm: -drive file=/rhev/data-center/609c0436-4b5c-4639-89ed-1fc60ccaa9a1/9c17fd91-2e28-463d-b9b3-93fcd9a77679/images/8179fad6-6fe8-4f19-9fdc-76592eef53d8/86c8e2c0-3a32-4f00-ad21-1c02afc304db,if=none,id=drive-virtio-disk0,format=raw,serial=8179fad6-6fe8-4f19-9fdc-76592eef53d8,cache=none,werror=stop,rerror=stop,aio=threads: could not open disk image /rhev/data-center/609c0436-4b5c-4639-89ed-1fc60ccaa9a1/9c17fd91-2e28-463d-b9b3-93fcd9a77679/images/8179fad6-6fe8-4f19-9fdc-76592eef53d8/86c8e2c0-3a32-4f00-ad21-1c02afc304db: Permission denied
2012-09-13 09:41:42.838+0000: shutting down

I disabled SELinux on the hypervisor hosts to bypass a similar error. If I re-enable it and try a migration to that host, however, the migration fails outright (as opposed to the VM migrating and then stopping due to an error).

It sounds like it could be something different, but out of curiosity, is this a persistent state of the hypervisor host? In other words, can you forcibly start the paused VMs, or any VM on that host, or is the error transient during migration? If persistent, does 'echo 0 > /selinux/enforce' change anything?
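
For reference, a sketch of the check I have in mind, using standard RHEL 6 commands ('setenforce 0' is equivalent to the echo above):

# getenforce
# grep AVC /var/log/audit/audit.log | tail
# setenforce 0

If the paused VMs can only be resumed after 'setenforce 0', and audit.log shows AVC denials against qemu-kvm, that would point at SELinux policy rather than gluster.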

Comment 7 Vijay Bellur 2012-11-19 08:03:21 UTC
Can you please respond to Brian's questions?

Comment 8 Gowrishankar Rajaiyan 2012-11-19 11:20:52 UTC
Unfortunately the setup no longer exists. However, the SELinux error in comment #6 is similar to the one in https://bugzilla.redhat.com/show_bug.cgi?id=835936, which was addressed in selinux-policy-3.7.19-155.el6_3.6.
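
If a comparable setup comes up again, the installed policy could be verified before retesting, e.g.:

# rpm -q selinux-policy

which should report selinux-policy-3.7.19-155.el6_3.6 or later for the fix referenced above.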

Comment 9 Brian Foster 2012-12-04 19:54:21 UTC
Taking another look at the libvirtd.log, I see snippets similar to the one below for several VMs. This suggests a VM was paused due to a failed write (EACCES, "Permission denied"), but I'm not seeing anything that correlates this to a problem in gluster.

FWIW, I've been doing significant rhevm testing on a dhtrep2 volume for the past couple of weeks (including migrations and self-heals), and I've yet to see a VM-paused error as described here. I think we'll need more information to make further progress with this bug:

- Some kind of insight into the steps to reproduce.
- Ideally, access to an affected setup in this state, to determine reproducibility and persistence of the error and potentially to debug.

--- libvirtd.log ---

qemuMonitorEmitIOError:954 : mon=0x7f49d81261d0
qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49d81261d0 refs=4 
qemuProcessHandleIOError:854 : Transitioned guest VM_5 to paused state due to IO error
qemuProcessHandleIOError:864 : Preserving lock state '(null)'
qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49d81261d0 refs=3
qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49d81261d0 refs=2
virDomainFree:2313 : dom=0x97a350, (VM: name=VM_5, uuid=0e8a2034-f4b1-4d34-8859-47e35793373e)
virDomainFree:2313 : dom=0x97a350, (VM: name=VM_5, uuid=0e8a2034-f4b1-4d34-8859-47e35793373e)
qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49d81261d0 refs=3 
qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49d81261d0 buf={"timestamp": {"seconds": 1347628099, "microseconds": 478179}, "event": "STOP"}^M 
{"timestamp": {"seconds": 1347628099, "microseconds": 478478}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Permission denied", "errno": 13}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}

Comment 11 Brian Foster 2012-12-12 21:59:59 UTC
Closing due to insufficient data. Please reopen if this can be reproduced with more data and, potentially, access to a configuration in this state.

