Description of problem:
Intermittently `hosted-engine --vm-status` returns an exception such as:

```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 260, in _checked_communicate
    .format(message or response))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata'
```

I also ran `ls` on these files. When `hosted-engine --vm-status` is doing OK, the output is:

```
root@b2:~# ls -laL /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/ /var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/;date
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
total 2036
drwxrwx---. 2 vdsm kvm    4096 Apr 18 04:46 .
drwxr-xr-x. 5 vdsm kvm    4096 Apr 15 02:19 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 18 05:28 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 05:28 hosted-engine.metadata

/var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/:
total 2037
drwxr-xr-x. 2 vdsm kvm    4096 Apr 15 02:19 .
drwxr-xr-x. 7 vdsm kvm    4096 Apr 15 06:11 ..
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 05:28 173e71a5-c6fa-44b0-a24d-1e936bc915b5
-rw-rw----. 1 vdsm kvm 1048576 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.lease
-rw-r--r--. 1 vdsm kvm     283 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.meta
```

But when it fails, the output is:

```
ls: cannot access /var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/: No such file or directory
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.lockspace: No such file or directory
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata: No such file or directory
total 8
drwxrwx---. 2 vdsm kvm 4096 Apr 18 04:46 .
drwxr-xr-x. 5 vdsm kvm 4096 Apr 15 02:19 ..
??????????? ? ? ? ? ?            hosted-engine.lockspace
??????????? ? ? ? ? ?            hosted-engine.metadata
```

Version-Release number of selected component (if applicable):

How reproducible:
# watch hosted-engine --vm-status

Actual results:

Expected results:

Additional info:
I configured a 3-server system as described here: http://www.ovirt.org/blog/2016/03/up-and-running-with-ovirt-3-6/
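To catch the window in which the files vanish, a small polling loop can timestamp each check. This is only a sketch: `check_ha_files` is a hypothetical helper (not part of hosted-engine), and the demo runs against a temporary directory so it stays self-contained; on a host you would point it at the `ha_agent` directory from this report.

```shell
#!/bin/sh
# Hypothetical watcher: report whether the HA agent's lockspace/metadata
# files are currently visible in a directory, with a timestamp.
check_ha_files() {
    dir="$1"
    for f in hosted-engine.lockspace hosted-engine.metadata; do
        if [ ! -e "$dir/$f" ]; then
            echo "$(date '+%F %T') MISSING $f"
            return 1
        fi
    done
    echo "$(date '+%F %T') OK"
}

# Self-contained demo against a temp dir instead of the real gluster mount.
tmp=$(mktemp -d)
touch "$tmp/hosted-engine.lockspace" "$tmp/hosted-engine.metadata"
check_ha_files "$tmp"          # both files present -> "... OK"
rm "$tmp/hosted-engine.metadata"
check_ha_files "$tmp" || true  # -> "... MISSING hosted-engine.metadata"
rm -rf "$tmp"
```

On a real host one would loop it, e.g. `while sleep 2; do check_ha_files /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent; done`, and correlate the MISSING timestamps with the gluster mount log.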
This looks more like an underlying infra issue, because the ovirt-ha-agent is obviously failing to access those files. Probably your replica is intermittently falling back to read-only, or there is some similar I/O issue with the mount. Please share the vdsm, system, and gluster logs.
Which log files should I share additionally? Currently I am running `tail -f /var/log/glusterfs/* &` and `hosted-engine --vm-status` in a while loop (with `sleep 2` in between). Log lines like the following are appearing; I will paste the lines from a good output, through the bad output, until it is good again:

```
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
total 2036
drwxrwx---. 2 vdsm kvm    4096 Apr 18 14:03 .
drwxr-xr-x. 5 vdsm kvm    4096 Apr 15 02:19 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 18 14:11 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 hosted-engine.metadata

/var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/:
total 2037
drwxr-xr-x. 2 vdsm kvm    4096 Apr 15 02:19 .
drwxr-xr-x. 7 vdsm kvm    4096 Apr 15 06:11 ..
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 173e71a5-c6fa-44b0-a24d-1e936bc915b5
-rw-rw----. 1 vdsm kvm 1048576 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.lease
-rw-r--r--. 1 vdsm kvm     283 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.meta
Mon Apr 18 14:11:44 MSK 2016

ls: cannot access /var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/: No such file or directory
/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.lockspace: No such file or directory
ls: cannot access /rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/hosted-engine.metadata: No such file or directory
total 8
drwxrwx---. 2 vdsm kvm 4096 Apr 18 14:03 .
drwxr-xr-x. 5 vdsm kvm 4096 Apr 15 02:19 ..
??????????? ? ? ? ? ?            hosted-engine.lockspace
??????????? ? ? ? ? ?            hosted-engine.metadata
Mon Apr 18 14:11:47 MSK 2016

==> /var/log/glusterfs/cli.log <==
[2016-04-18 11:11:47.488794] I [cli.c:721:main] 0-cli: Started running /usr/sbin/gluster with version 3.7.10
[2016-04-18 11:11:47.488822] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to remote glusterd at glustermount
[2016-04-18 11:11:47.560607] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-04-18 11:11:47.560697] I [socket.c:2356:socket_event_handler] 0-transport: disconnecting now
[2016-04-18 11:11:47.561404] I [cli-rpc-ops.c:805:gf_cli_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2016-04-18 11:11:47.561779] I [input.c:36:cli_batch] 0-: Exiting with: 0

==> /var/log/glusterfs/rhev-data-center-mnt-glusterSD-glustermount:engine.log <==
[2016-04-18 11:11:47.211421] I [fuse-bridge.c:4997:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/glustermount:engine
[2016-04-18 11:11:47.211613] W [glusterfsd.c:1251:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f2122334dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f212399f8b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f212399f739] ) 0-: received signum (15), shutting down
[2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/glustermount:engine'.
[2016-04-18 11:11:47.627752] I [MSGID: 100030] [glusterfsd.c:2332:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.10 (args: /usr/sbin/glusterfs --volfile-server=glustermount --volfile-server=bank1-storage --volfile-server=bank2-storage --volfile-server=bank3-storage --volfile-id=engine /rhev/data-center/mnt/glusterSD/glustermount:engine)
[2016-04-18 11:11:47.638048] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-04-18 11:11:47.643595] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2016-04-18 11:11:47.644113] I [MSGID: 114020] [client.c:2106:notify] 0-engine-client-0: parent translators are ready, attempting connect on transport
[2016-04-18 11:11:47.646525] I [MSGID: 114020] [client.c:2106:notify] 0-engine-client-1: parent translators are ready, attempting connect on transport
[2016-04-18 11:11:47.647057] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-engine-client-0: changing port to 49152 (from 0)
[2016-04-18 11:11:47.648816] I [MSGID: 114020] [client.c:2106:notify] 0-engine-client-2: parent translators are ready, attempting connect on transport
[2016-04-18 11:11:47.651576] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-engine-client-1: changing port to 49152 (from 0)
[2016-04-18 11:11:47.651626] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-engine-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
Final graph:
+------------------------------------------------------------------------------+
  1: volume engine-client-0
  2:     type protocol/client
  3:     option clnt-lk-version 1
  4:     option volfile-checksum 0
  5:     option volfile-key engine
  6:     option client-version 3.7.10
  7:     option process-uuid bank2.qwerty.ru-21602-2016/04/18-11:11:47:624805-engine-client-0-0-0
  8:     option fops-version 1298437
  9:     option ping-timeout 42
 10:     option remote-host bank1-storage
 11:     option remote-subvolume /gluster/engine/brick
 12:     option transport-type socket
 13:     option transport.address-family inet
 14:     option username 76303cd2-6dc7-49e1-a4d9-5fc7288195af
 15:     option password d24d1f70-31a1-41fc-af86-bbfa9129e575
 16:     option send-gids true
 17: end-volume
 18:
 19: volume engine-client-1
 20:     type protocol/client
 21:     option ping-timeout 42
 22:     option remote-host bank2-storage
 23:     option remote-subvolume /gluster/engine/brick
 24:     option transport-type socket
 25:     option transport.address-family inet
 26:     option username 76303cd2-6dc7-49e1-a4d9-5fc7288195af
 27:     option password d24d1f70-31a1-41fc-af86-bbfa9129e575
 28:     option send-gids true
 29: end-volume
 30:
 31: volume engine-client-2
 32:     type protocol/client
 33:     option ping-timeout 42
 34:     option remote-host bank3-storage
 35:     option remote-subvolume /gluster/engine/brick
 36:     option transport-type socket
 37:     option transport.address-family inet
 38:     option username 76303cd2-6dc7-49e1-a4d9-5fc7288195af
 39:     option password d24d1f70-31a1-41fc-af86-bbfa9129e575
 40:     option send-gids true
 41: end-volume
 42:
 43: volume engine-replicate-0
 44:     type cluster/replicate
 45:     option arbiter-count 1
 46:     subvolumes engine-client-0 engine-client-1 engine-client-2
 47: end-volume
 48:
 49: volume engine-dht
 50:     type cluster/distribute
 51:     subvolumes engine-replicate-0
 52: end-volume
 53:
 54: volume engine-write-behind
 55:     type performance/write-behind
 56:     subvolumes engine-dht
 57: end-volume
 58:
 59: volume engine-read-ahead
 60:     type performance/read-ahead
 61:     subvolumes engine-write-behind
 62: end-volume
 63:
 64: volume engine-readdir-ahead
 65:     type performance/readdir-ahead
 66:     subvolumes engine-read-ahead
 67: end-volume
 68:
 69: volume engine-io-cache
 70:     type performance/io-cache
 71:     subvolumes engine-readdir-ahead
 72: end-volume
 73:
 74: volume engine-quick-read
 75:     type performance/quick-read
 76:     subvolumes engine-io-cache
 77: end-volume
 78:
 79: volume engine-open-behind
 80:     type performance/open-behind
 81:     subvolumes engine-quick-read
 82: end-volume
 83:
 84: volume engine-md-cache
 85:     type performance/md-cache
 86:     subvolumes engine-open-behind
 87: end-volume
 88:
 89: volume engine
 90:     type debug/io-stats
 91:     option log-level INFO
 92:     option latency-measurement off
 93:     option count-fop-hits off
 94:     subvolumes engine-md-cache
 95: end-volume
 96:
 97: volume meta-autoload
 98:     type meta
 99:     subvolumes engine
100: end-volume
101:
+------------------------------------------------------------------------------+
[2016-04-18 11:11:47.653608] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-engine-client-0: Connected to engine-client-0, attached to remote volume '/gluster/engine/brick'.
[2016-04-18 11:11:47.653621] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-engine-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2016-04-18 11:11:47.653660] I [MSGID: 108005] [afr-common.c:4003:afr_notify] 0-engine-replicate-0: Subvolume 'engine-client-0' came back up; going online.
[2016-04-18 11:11:47.653825] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-engine-client-0: Server lk version = 1
[2016-04-18 11:11:47.654200] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-engine-client-2: changing port to 49152 (from 0)
[2016-04-18 11:11:47.655660] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-engine-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-04-18 11:11:47.655903] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-engine-client-1: Connected to engine-client-1, attached to remote volume '/gluster/engine/brick'.
[2016-04-18 11:11:47.655920] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-engine-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2016-04-18 11:11:47.655991] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-engine-client-1: Server lk version = 1
[2016-04-18 11:11:47.658178] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-engine-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-04-18 11:11:47.658670] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-engine-client-2: Connected to engine-client-2, attached to remote volume '/gluster/engine/brick'.
[2016-04-18 11:11:47.658688] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-engine-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2016-04-18 11:11:47.661032] I [fuse-bridge.c:5156:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-04-18 11:11:47.661099] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-engine-client-2: Server lk version = 1
[2016-04-18 11:11:47.661194] I [fuse-bridge.c:4067:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-04-18 11:11:47.662634] I [MSGID: 108031] [afr-common.c:1900:afr_local_discovery_cbk] 0-engine-replicate-0: selecting local read_child engine-client-1

/rhev/data-center/mnt/glusterSD/glustermount:engine/c42dd33a-715d-4ebe-932c-2481e0f84165/ha_agent/:
total 2036
drwxrwx---. 2 vdsm kvm    4096 Apr 18 14:03 .
drwxr-xr-x. 5 vdsm kvm    4096 Apr 15 02:19 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 18 14:11 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 hosted-engine.metadata

/var/run/vdsm/storage/c42dd33a-715d-4ebe-932c-2481e0f84165/ea7f669a-d087-47eb-bffc-f3fce48bd1ee/:
total 2037
drwxr-xr-x. 2 vdsm kvm    4096 Apr 15 02:19 .
drwxr-xr-x. 7 vdsm kvm    4096 Apr 15 06:11 ..
-rw-rw----. 1 vdsm kvm 1028096 Apr 18 14:11 173e71a5-c6fa-44b0-a24d-1e936bc915b5
-rw-rw----. 1 vdsm kvm 1048576 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.lease
-rw-r--r--. 1 vdsm kvm     283 Apr 15 02:19 173e71a5-c6fa-44b0-a24d-1e936bc915b5.meta
Mon Apr 18 14:11:49 MSK 2016
```
Created attachment 1148178 [details]
tail -f /var/log/vdsm/*log while bug

This is `tail -f /var/log/vdsm/*log` output from while `hosted-engine --vm-status` started returning the exception.
Created attachment 1148182 [details]
system logs while exception appearing

This is `tail -f /var/log/{secure,sanlock.log,messages,cron,maillog}` output from while the exception appeared and disappeared.
These tests (one per log file) were run sequentially, so the timestamps between logs do not match. If you wish, I can retest and attach combined logs for a single test event.
# From your gluster log
signum (15), shutting down
[2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/glustermount:engine'.

Why is this happening on your setup?
Sahina any idea what could cause this?
(In reply to Roy Golan from comment #6)
> # From your gluster log
> signum (15), shutting down
> [2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting
> '/rhev/data-center/mnt/glusterSD/glustermount:engine'.
>
> why is this happening on your setup?

I do nothing while this is happening; I am only observing `while sleep 2; do hosted-engine --vm-status; done`. The error appears roughly once per minute.
I mistakenly cleared the needinfo? flag from sabose; with this comment I am restoring it.
Roy, this is the same question I have in bug 1298693 comment 6. If you look at the logs attached to that bug, there are periodic unmount messages in the gluster mount log. I suspect the ha-agent is remounting?
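A quick way to confirm the periodicity is to count the FUSE teardown (`fini`) lines in the mount log. This is only a sketch: the grep pattern matches the `fuse-bridge.c:...:fini` unmount lines pasted earlier, and the demo greps an inline two-line sample rather than the real log file.

```shell
#!/bin/sh
# Count FUSE unmount (fini) events in a gluster mount log; one hit per
# remount cycle. Demo uses an inline sample in place of the real log.
sample="[2016-04-18 11:11:47.211633] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/glustermount:engine'.
[2016-04-18 11:12:49.100001] I [fuse-bridge.c:5704:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/glustermount:engine'."

printf '%s\n' "$sample" | grep -c 'fuse-bridge\.c:[0-9]*:fini'   # prints 2

# Against the real log (path as in this report) it would be:
#   grep -c 'fuse-bridge\.c:[0-9]*:fini' \
#       '/var/log/glusterfs/rhev-data-center-mnt-glusterSD-glustermount:engine.log'
```

If the count grows by roughly one per minute, that matches the reporter's observation and points at something repeatedly remounting the volume rather than a one-off failure.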
The related bug has been fixed. Can you check whether you still face the issue after updating the packages to ovirt-3.6.7?
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Unfortunately, I cannot test any more. This bug can probably be closed.
Closing based on comment 13.