Bug 1622814 - kvm lock problem
Summary: kvm lock problem
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: 4.1
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1637953
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-08-28 06:07 UTC by Dmitry Melekhov
Modified: 2019-06-18 09:18 UTC (History)
9 users (show)

Fixed In Version: glusterfs-5.5 (!arbiter)
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-18 09:18:18 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)
logs from all nodes (1.61 MB, application/x-gzip)
2018-08-28 06:07 UTC, Dmitry Melekhov
no flags Details
all logs from node with VM (father) (13.38 MB, application/x-gzip)
2018-08-31 06:53 UTC, Dmitry Melekhov
no flags Details
all logs from node with brick (8.12 MB, application/x-gzip)
2018-08-31 06:55 UTC, Dmitry Melekhov
no flags Details
all logs from node with arbiter (701.59 KB, application/x-gzip)
2018-08-31 06:56 UTC, Dmitry Melekhov
no flags Details
libvirt log for balamak VM from fahter host (57.37 KB, text/plain)
2018-09-03 04:34 UTC, Dmitry Melekhov
no flags Details
messages from host with VM (1.63 MB, text/plain)
2018-09-03 05:56 UTC, Dmitry Melekhov
no flags Details
log from host which is target of VM migration (1.72 MB, text/plain)
2018-09-03 05:58 UTC, Dmitry Melekhov
no flags Details
this is log from VM migration target, really (778.39 KB, text/plain)
2018-09-03 06:01 UTC, Dmitry Melekhov
no flags Details

Description Dmitry Melekhov 2018-08-28 06:07:23 UTC
Created attachment 1479143 [details]
logs from all nodes

Description of problem:


Version-Release number of selected component (if applicable):

Gluster 4.1.2
Centos 7.5
qemu-kvm-ev 2.10.0

How reproducible:
we don't know how to reproduce this

We have 3 servers, and replicated volume with 2 bricks and 1 arbiter,
we run VMs on the same servers.


Steps to Reproduce:
We got problem during reboot,
first we rebooted host with arbiter, waited for heal,
then tried to move VM.

Actual results:

Got it log:

[2018-08-27 09:56:22.085411] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-
client-6-RECON_NO:-0 (version: 4.1.2)
[2018-08-27 09:56:22.107609] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-
client-6-RECON_NO:-0
[2018-08-27 09:56:22.107747] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-clien
t-6-RECON_NO:-0
[2018-08-27 09:58:37.905829] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:c3eb6cfc-2ef9-470a-89d1-a87170d00da5-GRAPH_ID:0-PID:30292-HOST:father-PC_NAME:p
ool-client-6-RECON_NO:-0
[2018-08-27 09:58:37.905926] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28c831d8bc550000}
[2018-08-27 09:58:37.905959] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2870a7d6bc550000}
[2018-08-27 09:58:37.905979] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2880a7d6bc550000}
[2018-08-27 09:58:37.905997] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28f031d8bc550000}
[2018-08-27 09:58:37.906016] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28b07dd5bc550000}
[2018-08-27 09:58:37.906034] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28e0a7d6bc550000}
[2018-08-27 09:58:37.906056] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28b845d8bc550000}
[2018-08-27 09:58:37.906079] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2858a7d8bc550000}
[2018-08-27 09:58:37.906098] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2868a8d7bc550000}
[2018-08-27 09:58:37.906121] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28f80bd7bc550000}
...

[2018-08-27 09:58:37.907375] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28a8cdd6bc550000}
[2018-08-27 09:58:37.907393] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2880cdd6bc550000}
[2018-08-27 09:58:37.907476] I [socket.c:3837:socket_submit_reply] 0-tcp.pool-server: not connected (priv->connected = -1)
[2018-08-27 09:58:37.907520] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88cb, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.910727] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605
ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.910814] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88ce, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.910861] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605
ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.910904] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88cf, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.910940] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605
ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.910979] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88d1, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.911012] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605
ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.911050] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88d8, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.911083] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605
ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.916217] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605
ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.916520] I [MSGID: 115013] [server-helpers.c:286:do_fd_cleanup] 0-pool-server: fd cleanup on /balamak.img


after this I/O on  /balamak.img was blocked.



Expected results:

VM moved.

Comment 1 Atin Mukherjee 2018-08-29 12:30:36 UTC
Seeing bunch of rpc failures in the log snippet. You may have to provide all the log files for us to debug this problem.

Moving this to rpc component for now.

Comment 2 Dmitry Melekhov 2018-08-29 12:46:07 UTC
Hello!

I collected all logs from /var/log/glusterfs to my desktop,
so, please, just ask.

Thank you!

Comment 3 Mohit Agrawal 2018-08-31 06:41:36 UTC
Hi,

To find out RCA only brick logs are not sufficient, is it possible to share a complete dump of /var/log/glusterfs from client and server??

Regards
Mohit Agrawal

Comment 4 Dmitry Melekhov 2018-08-31 06:53:21 UTC
Created attachment 1479974 [details]
all logs from node with VM (father)

Comment 5 Dmitry Melekhov 2018-08-31 06:55:04 UTC
Created attachment 1479975 [details]
all logs from node with brick

Comment 6 Dmitry Melekhov 2018-08-31 06:56:05 UTC
Created attachment 1479977 [details]
all logs from node with arbiter

Comment 7 Dmitry Melekhov 2018-08-31 06:56:31 UTC
Hello!

Uploaded logs from all servers.

Thank you!

Comment 8 Mohit Agrawal 2018-09-03 04:18:23 UTC
Hi,

In currently available logs I think disconnect logs are coming in brick_logs is expected. It seems on the nodes cli command "gluster v heal info" running continuously 

I have tried to search client logs related to pid "30292" as logs shared in comment 1 but i did not find any log specific to this client.


find ./ -type f | xargs grep -il "PID:30292"
./father/glusterfs/bricks/wall-pool-brick.log-20180826
./father/glusterfs/bricks/wall-pool-brick.log
./spirit/glusterfs/bricks/wall-pool-brick.log-20180826
./spirit/glusterfs/bricks/wall-pool-brick.log
./son/glusterfs/bricks/wall-pool-brick.log-20180826
./son/glusterfs/bricks/wall-pool-brick.log


I have tried to find the client logs for pid 23095 
 
find ./ -type f | xargs grep -il "PID:23095"
./father/glusterfs/bricks/wall-pool-brick.log-20180826
./father/glusterfs/bricks/wall-pool-brick.log
./spirit/glusterfs/bricks/wall-pool-brick.log-20180826
./spirit/glusterfs/bricks/wall-pool-brick.log
./son/glusterfs/glfsheal-pool.log
./son/glusterfs/bricks/wall-pool-brick.log-20180826
./son/glusterfs/bricks/wall-pool-brick.log


and I have found below messages

client_logs(glfsheal)

>>>>>>>>>>>>>>>>>

[2018-08-29 12:03:52.281959] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-8: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:03:52.282163] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-pool-client-8: changing port to 49152 (from 0)
[2018-08-29 12:03:52.286433] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-8: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:03:52.286723] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-8: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:03:52.288198] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-pool-client-8: Connected to pool-client-8, attached to remote volume '/wall/pool/brick'.
[2018-08-29 12:03:52.290949] I [MSGID: 108031] [afr-common.c:2599:afr_local_discovery_cbk] 0-pool-replicate-0: selecting local read_child pool-client-7
[2018-08-29 12:03:52.291409] I [MSGID: 109005] [dht-selfheal.c:2342:dht_selfheal_directory] 0-pool-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-08-29 12:03:52.291455] I [MSGID: 108031] [afr-common.c:2599:afr_local_discovery_cbk] 0-pool-replicate-0: selecting local read_child pool-client-7
[2018-08-29 12:03:52.291510] I [MSGID: 104041] [glfs-resolve.c:972:__glfs_active_subvol] 0-pool: switched to graph 736f6e2d-3233-3039-352d-323031382d30 (0)
[2018-08-29 12:06:22.037960] W [MSGID: 108040] [afr.c:327:afr_pending_xattrs_init] 0-pool-replicate-0: Unable to fetch afr-pending-xattr option from volfile. Falling back to using client translator names.
[2018-08-29 12:06:22.039597] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 736f6e2d-3234-3837-382d-323031382d30 (0) coming up
[2018-08-29 12:06:22.039631] I [MSGID: 114020] [client.c:2328:notify] 0-pool-client-6: parent translators are ready, attempting connect on transport
[2018-08-29 12:06:22.043608] I [MSGID: 114020] [client.c:2328:notify] 0-pool-client-7: parent translators are ready, attempting connect on transport
[2018-08-29 12:06:22.043806] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-6: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:06:22.044005] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-6: error returned while attempting to connect to host:(null), port:0


>>>>>>>>>>>>>>>>>>>>

father brick_log
>>>>>>>>>>>>>>>>>>>>
[2018-08-29 12:03:52.300269] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0 (version: 4.1.2)
[2018-08-29 12:03:56.744768] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0
[2018-08-29 12:03:56.745028] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0

>>>>>>>>>>>>>>>>>>>


son brick_log
>>>>>>>>>>>>>>>
[2018-08-29 12:03:52.279149] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-7-RECON_NO:-0 (version: 4.1.2)
[2018-08-29 12:03:56.720853] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-7-RECON_NO:-0
[2018-08-29 12:03:56.721127] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-7-RECON_NO:-0

>>>>>>>>>>>>>>>>>>



sprit brick_log
>>>>>>>>>>>>>>>>.
[2018-08-29 12:05:30.498485] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-8-RECON_NO:-0 (version: 4.1.2)
[2018-08-29 12:05:34.932341] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-8-RECON_NO:-0
[2018-08-29 12:05:34.933610] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-8-RECON_NO:-0

>>>>>>>>>>>>>>>>>>>>>

Can you please share what errors application was throwing and at the time of throwing error application was connection what type of client fuse/nfs??


Regards

Comment 9 Dmitry Melekhov 2018-09-03 04:33:27 UTC
Hello!

"gluster v heal info" is coming from zabbix monitoring system agent.

>Can you please share what errors application was throwing and at the time of throwing error application was connection what type of client fuse/nfs



we run qemu-kvm , which uses gfapi.

I'll attach log from /var/log/libvirt/qemu
for this VM and date from all hosts, if any.

Comment 10 Dmitry Melekhov 2018-09-03 04:34:58 UTC
Created attachment 1480409 [details]
libvirt log for balamak VM from fahter host

Comment 11 Dmitry Melekhov 2018-09-03 04:36:00 UTC
There are no logs for this VM on another host (son) at this date/time.

Thank you!

Comment 12 Mohit Agrawal 2018-09-03 05:49:35 UTC
Hi,

On the client logs we can see it is showing below meessage but on the server logs i am not able to figure out why brick was not able to respond to the client.
[2018-08-27 09:48:14.076276] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-pool-client-8: server 192.168.22.60:49152 has not responded in the last 10 seconds, disconnecting.

Was it happened something due to network disrupt or some issue at brick side ??

For network side we can't say anything, in brick logs I am not able to figure out reason why brick has not responded to client of ping timer query.

Can you please share system logs /var/log/messages at the same time??

Thanks
Mohit Agrawal

Comment 13 Dmitry Melekhov 2018-09-03 05:56:09 UTC
Created attachment 1480439 [details]
messages from host with VM

Yes, sure.
btw, local time is GMT+4 , so interesting part is
ug 27 13:56:32 father crmd[2427]:  notice: Initiating stop operation balamak_stop_0 on son
Aug 27 13:56:32 father crmd[2427]:  notice: Initiating stop operation balamak_stop_0 locally on father
Aug 27 13:56:32 father crmd[2427]:  notice: Initiating migrate_to operation endor_migrate_to_0 locally on father
Aug 27 13:56:32 father VirtualDomain(balamak)[24337]: INFO: Issuing graceful shutdown request for domain balamak.
Aug 27 13:56:32 father VirtualDomain(endor)[24338]: INFO: endor: Starting live migration to son (using: virsh --connect=qemu:///system --quiet migrate --live  endor qemu+ssh://son/system ).
Aug 27 13:56:34 father libvirtd: 2018-08-27 09:56:34.448+0000: 2840: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain balamak; current job is (async nested, migration out)
 owned by (2835 remoteDispatchDomainMigrateBegin3Params, 2835 remoteDispatchDomainMigrateBegin3Params) for (122s, 122s)
Aug 27 13:56:34 father libvirtd: 2018-08-27 09:56:34.448+0000: 2840: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrate
Begin3Params)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.267+0000: 2838: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (modify, none) for domain balamak; current job is (async nested, migration out
) owned by (2835 remoteDispatchDomainMigrateBegin3Params, 2835 remoteDispatchDomainMigrateBegin3Params) for (149s, 149s)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.267+0000: 2838: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrate
Begin3Params)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.939+0000: 2839: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain endor; current job is (async nested, migration out) o
wned by (2837 remoteDispatchDomainMigrateBegin3Params, 2837 remoteDispatchDomainMigrateBegin3Params) for (30s, 30s)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.939+0000: 2839: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrate
Begin3Params)
Aug 27 13:57:33 father libvirtd: 2018-08-27 09:57:33.440+0000: 2840: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain endor; current job is (async nested, migration out) o
wned by (2837 remoteDispatchDomainMigrateBegin3Params, 2837 remoteDispatchDomainMigrateBegin3Params) for (61s, 61s)
Aug 27 13:57:33 father libvirtd: 2018-08-27 09:57:33.440+0000: 2840: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrate
Begin3Params)

I'll add logs from 2 other hosts...

Comment 14 Dmitry Melekhov 2018-09-03 05:58:05 UTC
Created attachment 1480441 [details]
log from host which is target of VM migration

Comment 15 Dmitry Melekhov 2018-09-03 05:58:57 UTC
Sorry, messed up, last log which is marked as son is from spirit, which is gluster arbiter, sorry.

Comment 16 Dmitry Melekhov 2018-09-03 06:01:02 UTC
Created attachment 1480442 [details]
this is log from VM migration target, really

Comment 17 Dmitry Melekhov 2018-09-03 06:02:05 UTC
Thank you!

Comment 18 Amar Tumballi 2019-06-18 08:49:01 UTC
Dmitry, Apologies for delay. Did you get a chance to upgrade to glusterfs-6.x and try if the issue is fixed? I would request you to do try upgrading and testing. We noticed many issues when ovirt community tried to use glusterfs-5.2 version, and since then fixed many issues seen with ovirt community. Would like to hear feedback.

Comment 19 Dmitry Melekhov 2019-06-18 09:03:13 UTC
Hello!

We decided that problem is in arbiter, so we replaced arbiter with 3rd node and have no such problem.
Currently we are running 5.6.
Sorry, I don't know is there such problem with current arbiter code or not.
Thank you!

I think it is better to close this issue.

Comment 20 Amar Tumballi 2019-06-18 09:18:18 UTC
Thanks for the update Dmitry! First of all, as it is working for you, we are happy.

I will close the bug for now. Will discuss with the team about if any such issues are fixed in Arbiter since 4.x time frame.


Note You need to log in before you can comment on or make changes to this bug.