Bug 1327927 - hosted-engine --vm-status sometimes returns exception ovirt_hosted_engine_ha.lib.exceptions.RequestError
Summary: hosted-engine --vm-status sometimes returns exception ovirt_hosted_engine_ha....
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 3.6.5
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
: 3.6.7
Assignee: bugs@ovirt.org
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-18 02:42 UTC by Donat
Modified: 2016-07-03 10:39 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-03 10:39:50 UTC
oVirt Team: Gluster
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
tail -f /var/log/vdsm/*log while bug (244.60 KB, text/plain)
2016-04-18 11:42 UTC, Donat
no flags Details
system logs while exception appearing (13.39 KB, text/plain)
2016-04-18 11:50 UTC, Donat
no flags Details

Description Donat 2016-04-18 02:42:19 UTC
Description of problem:

Intermittently, `hosted-engine --vm-status` fails with an exception such as:

```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 260, in _checked_communicate
    .format(message or response))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata'
```

I also ran `ls` on these files. When `hosted-engine --vm-status` is working, the output is:

```
root@b2:~# ls -laL /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/ /var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/;date
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
total 2036
drwxrwx---. 2 vdsm kvm    4096 Apr 18 04:46 .
drwxr-xr-x. 5 vdsm kvm    4096 Apr 15 02:19 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 18 05:28 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 05:28 hosted-engine.metadata

/var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/:
total 2037
drwxr-xr-x. 2 vdsm kvm    4096 Apr 15 02:19 .
drwxr-xr-x. 7 vdsm kvm    4096 Apr 15 06:11 ..
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 05:28 173e71a5-c6fa-44b0-a24d-1e936bc915b5
-rw-rw----. 1 vdsm kvm 1048576 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.lease
-rw-r--r--. 1 vdsm kvm     283 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.meta
```

But when it fails, the output is:

```
ls: cannot access /var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/: No such file or directory
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.lockspace: No such file or directory
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata: No such file or directory
total 8
drwxrwx---. 2 vdsm kvm 4096 Apr 18 04:46 .
drwxr-xr-x. 5 vdsm kvm 4096 Apr 15 02:19 ..
??????????? ? ?    ?      ?            ? hosted-engine.lockspace
??????????? ? ?    ?      ?            ? hosted-engine.metadata
```

Version-Release number of selected component (if applicable):


How reproducible:

 # watch hosted-engine --vm-status

Actual results:


Expected results:


Additional info:

I configured a 3-server system as described here: http://www.ovirt.org/blog/2016/03/up-and-running-with-ovirt-3-6/
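For reference, the polling procedure described above (repeatedly running `hosted-engine --vm-status` and `ls` on the metadata path) can be sketched as a small Python monitor. This is an illustrative sketch only, not part of the oVirt tooling; it simply reports when the metadata file from the traceback flips between visible and missing:

```python
import os
import time
from datetime import datetime

# Path taken from the traceback above; substitute your own storage-domain UUID.
METADATA = ("/rhev/data-center/mnt/glusterSD/glustermount:engine/"
            "c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata")


def metadata_visible(path=METADATA):
    """Return True if the HA metadata file is currently visible on the mount."""
    return os.path.exists(path)


def watch(path=METADATA, interval=2.0, iterations=None):
    """Poll the mount, printing a timestamped line whenever visibility changes."""
    last = None
    count = 0
    while iterations is None or count < iterations:
        state = metadata_visible(path)
        if state != last:
            print("%s metadata %s" % (datetime.now().isoformat(),
                                      "visible" if state else "MISSING"))
            last = state
        time.sleep(interval)
        count += 1
```

Running `watch()` alongside a tail of the gluster mount log should show the MISSING transitions coinciding with unmount/remount messages, which is what the comments below go on to investigate.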

Comment 1 Roy Golan 2016-04-18 06:07:02 UTC
This is more likely an underlying infrastructure issue, since the ovirt-ha-agent is clearly failing to access those files.

Your replica is probably intermittently falling back to read-only, or there is some similar I/O issue with the mount. Please share the vdsm, system, and gluster logs.

Comment 2 Donat 2016-04-18 11:19:46 UTC
Which additional log files should I share?

Just now, I ran `tail -f /var/log/glusterfs/* &` and `hosted-engine --vm-status` in a while loop (with `sleep 2` in between). The following log lines appeared; I am pasting everything from a good output, through the bad output, until it was good again:

```
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
total 2036
drwxrwx---. 2 vdsm kvm    4096 Apr 18 14:03 .
drwxr-xr-x. 5 vdsm kvm    4096 Apr 15 02:19 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 18 14:11 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 hosted-engine.metadata

/var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/:
total 2037
drwxr-xr-x. 2 vdsm kvm    4096 Apr 15 02:19 .
drwxr-xr-x. 7 vdsm kvm    4096 Apr 15 06:11 ..
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 173e71a5-c6fa-44b0-a24d-1e936bc915b5
-rw-rw----. 1 vdsm kvm 1048576 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.lease
-rw-r--r--. 1 vdsm kvm     283 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.meta
Mon Apr 18 14:11:44 MSK 2016

ls: cannot access /var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/: No such file or directory
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.lockspace: No such file or directory
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata: No such file or directory
total 8
drwxrwx---. 2 vdsm kvm 4096 Apr 18 14:03 .
drwxr-xr-x. 5 vdsm kvm 4096 Apr 15 02:19 ..
??????????? ? ?    ?      ?            ? hosted-engine.lockspace
??????????? ? ?    ?      ?            ? hosted-engine.metadata
Mon Apr 18 14:11:47 MSK 2016   

==> /var/log/glusterfs/cli.log <==
[2016-04-18 11:11:47.488794] I [cli.c:721:main] 0-cli: Started running /usr/sbin/gluster with version 3.7.10
[2016-04-18 11:11:47.488822] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to remote glusterd at glustermount
[2016-04-18 11:11:47.560607] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-04-18 11:11:47.560697] I [socket.c:2356:socket_event_handler] 0-transport: disconnecting now
[2016-04-18 11:11:47.561404] I [cli-rpc-ops.c:805:gf_cli_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2016-04-18 11:11:47.561779] I [input.c:36:cli_batch] 0-: Exiting with: 0

==> /var/log/glusterfs/rhev-data-center-mnt-glusterSD-glustermount:engine.log <==
[2016-04-18 11:11:47.211421] I [fuse-bridge.c:4997:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/glustermount:engine
[2016-04-18 11:11:47.211613] W [glusterfsd.c:1251:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f2122334dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f212399f8b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f212399f739] ) 0-: received signum (15), shutting down
[2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/glustermount:engine'.
[2016-04-18 11:11:47.627752] I [MSGID: 100030] [glusterfsd.c:2332:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.10 (args: /usr/sbin/glusterfs --volfile-server=glustermount --volfile-server=bank1-storage --volfile-server=bank2-storage --volfile-server=bank3-storage --volfile-id=engine /rhev/data-center/mnt/glusterSD/glustermount:engine)
[2016-04-18 11:11:47.638048] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-04-18 11:11:47.643595] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2016-04-18 11:11:47.644113] I [MSGID: 114020] [client.c:2106:notify] 0-engine-client-0: parent translators are ready, attempting connect on transport
[2016-04-18 11:11:47.646525] I [MSGID: 114020] [client.c:2106:notify] 0-engine-client-1: parent translators are ready, attempting connect on transport
[2016-04-18 11:11:47.647057] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-engine-client-0: changing port to 49152 (from 0)
[2016-04-18 11:11:47.648816] I [MSGID: 114020] [client.c:2106:notify] 0-engine-client-2: parent translators are ready, attempting connect on transport
[2016-04-18 11:11:47.651576] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-engine-client-1: changing port to 49152 (from 0)
[2016-04-18 11:11:47.651626] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-engine-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
Final graph:
+------------------------------------------------------------------------------+
  1: volume engine-client-0
  2:     type protocol/client  
  3:     option clnt-lk-version 1
  4:     option volfile-checksum 0
  5:     option volfile-key engine
  6:     option client-version 3.7.10
  7:     option process-uuid bank2.qwerty.ru-21602-2016/04/18-11:11:47:624805-engine-client-0-0-0
  8:     option fops-version 1298437
  9:     option ping-timeout 42
 10:     option remote-host bank1-storage
 11:     option remote-subvolume /gluster/engine/brick
 12:     option transport-type socket
 13:     option transport.address-family inet
 14:     option username 76303cd2-6dc7-49e1-a4d9-5fc7288195af
 15:     option password d24d1f70-31a1-41fc-af86-bbfa9129e575
 16:     option send-gids true 
 17: end-volume
 18:
 19: volume engine-client-1
 20:     type protocol/client  
 21:     option ping-timeout 42
 22:     option remote-host bank2-storage
 23:     option remote-subvolume /gluster/engine/brick
 24:     option transport-type socket
 25:     option transport.address-family inet
 26:     option username 76303cd2-6dc7-49e1-a4d9-5fc7288195af
 27:     option password d24d1f70-31a1-41fc-af86-bbfa9129e575
 28:     option send-gids true 
 29: end-volume
 30:
 31: volume engine-client-2
 32:     type protocol/client  
 33:     option ping-timeout 42
 34:     option remote-host bank3-storage
 35:     option remote-subvolume /gluster/engine/brick
 36:     option transport-type socket
 37:     option transport.address-family inet
 38:     option username 76303cd2-6dc7-49e1-a4d9-5fc7288195af
 39:     option password d24d1f70-31a1-41fc-af86-bbfa9129e575
 40:     option send-gids true 
 41: end-volume
 42:
 43: volume engine-replicate-0 
 44:     type cluster/replicate
 45:     option arbiter-count 1
 46:     subvolumes engine-client-0 engine-client-1 engine-client-2
 47: end-volume
 48:
 49: volume engine-dht
 50:     type cluster/distribute
 51:     subvolumes engine-replicate-0
 52: end-volume
 53:
 54: volume engine-write-behind
 55:     type performance/write-behind
 56:     subvolumes engine-dht 
 57: end-volume
 58:
 59: volume engine-read-ahead  
 60:     type performance/read-ahead
 61:     subvolumes engine-write-behind
 62: end-volume
 63:
 64: volume engine-readdir-ahead
 65:     type performance/readdir-ahead
 66:     subvolumes engine-read-ahead
 67: end-volume
 68:
 69: volume engine-io-cache
 70:     type performance/io-cache
 71:     subvolumes engine-readdir-ahead
 72: end-volume
 73:
 74: volume engine-quick-read  
 75:     type performance/quick-read
 76:     subvolumes engine-io-cache
 77: end-volume
 78:
 79: volume engine-open-behind 
 80:     type performance/open-behind
 81:     subvolumes engine-quick-read
 82: end-volume
 83:
 84: volume engine-md-cache
 85:     type performance/md-cache
 86:     subvolumes engine-open-behind
 87: end-volume
 88:
 89: volume engine
 90:     type debug/io-stats   
 91:     option log-level INFO 
 92:     option latency-measurement off
 93:     option count-fop-hits off
 94:     subvolumes engine-md-cache
 95: end-volume
 96:
 97: volume meta-autoload
 98:     type meta
 99:     subvolumes engine
100: end-volume
101:
+------------------------------------------------------------------------------+
[2016-04-18 11:11:47.653608] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-engine-client-0: Connected to engine-client-0, attached to remote volume '/gluster/engine/brick'.
[2016-04-18 11:11:47.653621] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-engine-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2016-04-18 11:11:47.653660] I [MSGID: 108005] [afr-common.c:4003:afr_notify] 0-engine-replicate-0: Subvolume 'engine-client-0' came back up; going online.
[2016-04-18 11:11:47.653825] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-engine-client-0: Server lk version = 1
[2016-04-18 11:11:47.654200] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-engine-client-2: changing port to 49152 (from 0)
[2016-04-18 11:11:47.655660] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-engine-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-04-18 11:11:47.655903] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-engine-client-1: Connected to engine-client-1, attached to remote volume '/gluster/engine/brick'.
[2016-04-18 11:11:47.655920] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-engine-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2016-04-18 11:11:47.655991] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-engine-client-1: Server lk version = 1
[2016-04-18 11:11:47.658178] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-engine-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-04-18 11:11:47.658670] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-engine-client-2: Connected to engine-client-2, attached to remote volume '/gluster/engine/brick'.
[2016-04-18 11:11:47.658688] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-engine-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2016-04-18 11:11:47.661032] I [fuse-bridge.c:5156:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-04-18 11:11:47.661099] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-engine-client-2: Server lk version = 1
[2016-04-18 11:11:47.661194] I [fuse-bridge.c:4067:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-04-18 11:11:47.662634] I [MSGID: 108031] [afr-common.c:1900:afr_local_discovery_cbk] 0-engine-replicate-0: selecting local read_child engine-client-1

/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
total 2036
drwxrwx---. 2 vdsm kvm    4096 Apr 18 14:03 .
drwxr-xr-x. 5 vdsm kvm    4096 Apr 15 02:19 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 18 14:11 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 hosted-engine.metadata

/var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/:
total 2037
drwxr-xr-x. 2 vdsm kvm    4096 Apr 15 02:19 .
drwxr-xr-x. 7 vdsm kvm    4096 Apr 15 06:11 ..
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 173e71a5-c6fa-44b0-a24d-1e936bc915b5
-rw-rw----. 1 vdsm kvm 1048576 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.lease
-rw-r--r--. 1 vdsm kvm     283 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.meta
Mon Apr 18 14:11:49 MSK 2016

Comment 3 Donat 2016-04-18 11:42:28 UTC
Created attachment 1148178 [details]
tail -f /var/log/vdsm/*log while bug

This is the `tail -f /var/log/vdsm/*log` output captured while `hosted-engine --vm-status` started returning the exception.

Comment 4 Donat 2016-04-18 11:50:04 UTC
Created attachment 1148182 [details]
system logs while exception appearing

This is the `tail -f /var/log/{secure,sanlock.log,messages,cron,maillog}` output captured while the exception appeared and disappeared.

Comment 5 Donat 2016-04-18 16:32:32 UTC
These tests (one per log file) were run sequentially, so the timestamps between logs do not match. If you wish, I can retest and attach combined logs for a single test event.
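Since sequential captures make timestamp correlation hard, one option is to merge the separately captured logs into a single timeline afterwards. A minimal Python sketch, assuming the bracketed `[YYYY-MM-DD HH:MM:SS.us]` prefix that the gluster logs above use:

```python
import re

# Gluster log lines begin with a bracketed UTC timestamp, e.g.
# "[2016-04-18 11:11:47.488794] I [cli.c:721:main] 0-cli: ..."
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")


def merge_logs(named_lines):
    """Merge (source_name, lines) pairs into one list sorted by timestamp.

    Lines without a recognizable timestamp prefix are dropped. Because the
    timestamp format is fixed-width, plain string sorting orders it correctly.
    """
    stamped = []
    for name, lines in named_lines:
        for line in lines:
            match = TS.match(line)
            if match:
                stamped.append((match.group(1), name, line.rstrip()))
    stamped.sort()
    return [(name, line) for _, name, line in stamped]
```

Logs in other formats (e.g. syslog) would need their own timestamp pattern; this sketch only covers the gluster-style prefix shown in the excerpts above.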

Comment 6 Roy Golan 2016-04-19 08:29:04 UTC
# From your gluster log
signum (15), shutting down
[2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting
'/rhev/data-center/mnt/glusterSD/glustermount:engine'.

why is this happening on your setup?

Comment 7 Roy Golan 2016-04-19 08:30:11 UTC
Sahina any idea what could cause this?

Comment 8 Donat 2016-04-19 13:41:53 UTC
(In reply to Roy Golan from comment #6)
> # From your gluster log
> signum (15), shutting down
> [2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting
> '/rhev/data-center/mnt/glusterSD/glustermount:engine'.
> 
> why is this happening on your setup?

I am not doing anything while this happens; I am only observing `while sleep 2; do hosted-engine --vm-status; done`. The error appears roughly once per minute.

Comment 9 Donat 2016-04-20 13:02:26 UTC
I mistakenly cleared the needinfo? flag from sabose; I am restoring it with this comment.

Comment 10 Sahina Bose 2016-04-20 13:14:19 UTC
Roy, this is the same question I have in bug 1298693 comment 6. If you look at the logs attached to that bug, there are periodic unmount messages in the gluster mount log. I suspect the ha-agent is remounting?
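To check this suspicion against the reporter's "roughly once per minute" observation, one could extract the FUSE unmount events from the gluster mount log and look at their spacing. A minimal sketch, assuming the same bracketed timestamp prefix as in the log excerpts above:

```python
import re

# Bracketed UTC timestamp prefix used by the gluster mount log.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")


def unmount_events(lines):
    """Return the timestamps of FUSE unmount messages in a gluster mount log."""
    events = []
    for line in lines:
        match = TS.match(line)
        if match and "unmount" in line.lower():
            events.append(match.group(1))
    return events
```

If the resulting timestamps are spaced about a minute apart, that would support the theory that something is periodically tearing down and re-creating the mount.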

Comment 11 Sahina Bose 2016-06-21 07:04:05 UTC
The related bug has been fixed. Can you check whether you still hit this issue after updating packages to oVirt 3.6.7?

Comment 12 Red Hat Bugzilla Rules Engine 2016-06-21 10:17:00 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 13 Donat 2016-06-21 11:39:29 UTC
Unfortunately, I can no longer test this. This bug can probably be closed.

Comment 14 Doron Fediuck 2016-07-03 10:39:50 UTC
Closing based on comment 13.

