Created attachment 1479143 [details]
logs from all nodes

Description of problem:

Version-Release number of selected component (if applicable):
Gluster 4.1.2
CentOS 7.5
qemu-kvm-ev 2.10.0

How reproducible:
We don't know how to reproduce this. We have 3 servers and a replicated volume with 2 bricks and 1 arbiter; we run VMs on the same servers.

Steps to Reproduce:
The problem occurred during a reboot: first we rebooted the host with the arbiter, waited for the heal to complete, then tried to migrate a VM.

Actual results:
We got this in the log:

[2018-08-27 09:56:22.085411] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0 (version: 4.1.2)
[2018-08-27 09:56:22.107609] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0
[2018-08-27 09:56:22.107747] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0
[2018-08-27 09:58:37.905829] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:c3eb6cfc-2ef9-470a-89d1-a87170d00da5-GRAPH_ID:0-PID:30292-HOST:father-PC_NAME:pool-client-6-RECON_NO:-0
[2018-08-27 09:58:37.905926] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28c831d8bc550000}
[2018-08-27 09:58:37.905959] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2870a7d6bc550000}
[2018-08-27 09:58:37.905979] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2880a7d6bc550000}
[2018-08-27 09:58:37.905997] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28f031d8bc550000}
[2018-08-27 09:58:37.906016] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28b07dd5bc550000}
[2018-08-27 09:58:37.906034] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28e0a7d6bc550000}
[2018-08-27 09:58:37.906056] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28b845d8bc550000}
[2018-08-27 09:58:37.906079] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2858a7d8bc550000}
[2018-08-27 09:58:37.906098] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2868a8d7bc550000}
[2018-08-27 09:58:37.906121] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28f80bd7bc550000}
...
[2018-08-27 09:58:37.907375] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=28a8cdd6bc550000}
[2018-08-27 09:58:37.907393] W [inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server: releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318 held by {client=0x7ffb58035bc0, pid=30292 lk-owner=2880cdd6bc550000}
[2018-08-27 09:58:37.907476] I [socket.c:3837:socket_submit_reply] 0-tcp.pool-server: not connected (priv->connected = -1)
[2018-08-27 09:58:37.907520] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88cb, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.910727] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.910814] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88ce, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.910861] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.910904] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88cf, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.910940] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.910979] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88d1, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.911012] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.911050] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xcb88d8, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 30) to rpc-transport (tcp.pool-server)
[2018-08-27 09:58:37.911083] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.916217] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x20084) [0x7ffb64379084] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x605ba) [0x7ffb5fddf5ba] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7ffb5fd89fce] ) 0-: Reply submission failed
[2018-08-27 09:58:37.916520] I [MSGID: 115013] [server-helpers.c:286:do_fd_cleanup] 0-pool-server: fd cleanup on /balamak.img

After this, I/O on /balamak.img was blocked.

Expected results:
The VM is migrated.
Seeing a bunch of RPC failures in the log snippet. You may have to provide all the log files for us to debug this problem. Moving this to the rpc component for now.
Hello! I have collected all logs from /var/log/glusterfs on my desktop, so please just ask. Thank you!
Hi,

To find the RCA, brick logs alone are not sufficient. Is it possible to share a complete dump of /var/log/glusterfs from both the client and the servers?

Regards,
Mohit Agrawal
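To make collecting such a dump easier, here is a small shell sketch for bundling a node's log directory into a single archive that can be attached to the bug. The function name and paths are illustrative only, not part of any Gluster tooling.

```shell
# Illustrative helper (not a Gluster tool): bundle a log directory into a
# single tarball so the whole tree can be attached to the bug in one file.
collect_logs() {
    src="$1"   # directory to archive, e.g. /var/log/glusterfs
    out="$2"   # output archive, e.g. /tmp/gluster-logs-$(hostname).tar.gz
    # -C roots the archive at the directory itself instead of the
    # full absolute path, so it unpacks cleanly.
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}
```

Run on each node, e.g. `collect_logs /var/log/glusterfs "/tmp/gluster-logs-$(hostname).tar.gz"`, then attach one tarball per node.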
Created attachment 1479974 [details] all logs from node with VM (father)
Created attachment 1479975 [details] all logs from node with brick
Created attachment 1479977 [details] all logs from node with arbiter
Hello! Uploaded logs from all servers. Thank you!
Hi,

In the currently available logs, I think the disconnect messages appearing in the brick logs are expected; it seems the CLI command "gluster v heal info" is running continuously on the nodes. I tried to search the client logs for PID "30292" from the logs shared in comment 1, but I did not find any log specific to this client:

find ./ -type f | xargs grep -il "PID:30292"
./father/glusterfs/bricks/wall-pool-brick.log-20180826
./father/glusterfs/bricks/wall-pool-brick.log
./spirit/glusterfs/bricks/wall-pool-brick.log-20180826
./spirit/glusterfs/bricks/wall-pool-brick.log
./son/glusterfs/bricks/wall-pool-brick.log-20180826
./son/glusterfs/bricks/wall-pool-brick.log

I then tried to find the client logs for PID 23095:

find ./ -type f | xargs grep -il "PID:23095"
./father/glusterfs/bricks/wall-pool-brick.log-20180826
./father/glusterfs/bricks/wall-pool-brick.log
./spirit/glusterfs/bricks/wall-pool-brick.log-20180826
./spirit/glusterfs/bricks/wall-pool-brick.log
./son/glusterfs/glfsheal-pool.log
./son/glusterfs/bricks/wall-pool-brick.log-20180826
./son/glusterfs/bricks/wall-pool-brick.log

and I found the messages below.

client logs (glfsheal)
>>>>>>>>>>>>>>>>>
[2018-08-29 12:03:52.281959] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-8: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:03:52.282163] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-pool-client-8: changing port to 49152 (from 0)
[2018-08-29 12:03:52.286433] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-8: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:03:52.286723] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-8: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:03:52.288198] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-pool-client-8: Connected to pool-client-8, attached to remote volume '/wall/pool/brick'.
[2018-08-29 12:03:52.290949] I [MSGID: 108031] [afr-common.c:2599:afr_local_discovery_cbk] 0-pool-replicate-0: selecting local read_child pool-client-7
[2018-08-29 12:03:52.291409] I [MSGID: 109005] [dht-selfheal.c:2342:dht_selfheal_directory] 0-pool-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-08-29 12:03:52.291455] I [MSGID: 108031] [afr-common.c:2599:afr_local_discovery_cbk] 0-pool-replicate-0: selecting local read_child pool-client-7
[2018-08-29 12:03:52.291510] I [MSGID: 104041] [glfs-resolve.c:972:__glfs_active_subvol] 0-pool: switched to graph 736f6e2d-3233-3039-352d-323031382d30 (0)
[2018-08-29 12:06:22.037960] W [MSGID: 108040] [afr.c:327:afr_pending_xattrs_init] 0-pool-replicate-0: Unable to fetch afr-pending-xattr option from volfile. Falling back to using client translator names.
[2018-08-29 12:06:22.039597] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 736f6e2d-3234-3837-382d-323031382d30 (0) coming up
[2018-08-29 12:06:22.039631] I [MSGID: 114020] [client.c:2328:notify] 0-pool-client-6: parent translators are ready, attempting connect on transport
[2018-08-29 12:06:22.043608] I [MSGID: 114020] [client.c:2328:notify] 0-pool-client-7: parent translators are ready, attempting connect on transport
[2018-08-29 12:06:22.043806] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-6: error returned while attempting to connect to host:(null), port:0
[2018-08-29 12:06:22.044005] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-pool-client-6: error returned while attempting to connect to host:(null), port:0
>>>>>>>>>>>>>>>>>>>>

father brick log
>>>>>>>>>>>>>>>>>>>>
[2018-08-29 12:03:52.300269] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0 (version: 4.1.2)
[2018-08-29 12:03:56.744768] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0
[2018-08-29 12:03:56.745028] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-6-RECON_NO:-0
>>>>>>>>>>>>>>>>>>>

son brick log
>>>>>>>>>>>>>>>
[2018-08-29 12:03:52.279149] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-7-RECON_NO:-0 (version: 4.1.2)
[2018-08-29 12:03:56.720853] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-7-RECON_NO:-0
[2018-08-29 12:03:56.721127] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-7-RECON_NO:-0
>>>>>>>>>>>>>>>>>>

spirit brick log
>>>>>>>>>>>>>>>>
[2018-08-29 12:05:30.498485] I [MSGID: 115029] [server-handshake.c:763:server_setvolume] 0-pool-server: accepted client from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-8-RECON_NO:-0 (version: 4.1.2)
[2018-08-29 12:05:34.932341] I [MSGID: 115036] [server.c:483:server_rpc_notify] 0-pool-server: disconnecting connection from CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-8-RECON_NO:-0
[2018-08-29 12:05:34.933610] I [MSGID: 101055] [client_t.c:444:gf_client_unref] 0-pool-server: Shutting down connection CTX_ID:092a1bb9-3348-4511-95d3-8c2b65178e25-GRAPH_ID:0-PID:23095-HOST:son-PC_NAME:pool-client-8-RECON_NO:-0
>>>>>>>>>>>>>>>>>>>>>

Can you please share what errors the application was throwing, and what type of client (fuse/nfs) the application was connected through at the time of the error?

Regards
Hello! The "gluster v heal info" command comes from a Zabbix monitoring agent.

> Can you please share what errors the application was throwing, and what type of client (fuse/nfs) the application was connected through at the time of the error?

We run qemu-kvm, which uses gfapi. I'll attach the log from /var/log/libvirt/qemu for this VM and this date from all hosts, if any.
Created attachment 1480409 [details]
libvirt log for balamak VM from father host
There are no logs for this VM on the other host (son) at this date/time. Thank you!
Hi,

In the client logs we can see the message below, but from the server logs I am not able to figure out why the brick did not respond to the client:

[2018-08-27 09:48:14.076276] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-pool-client-8: server 192.168.22.60:49152 has not responded in the last 10 seconds, disconnecting.

Did something happen due to a network disruption, or was there an issue on the brick side? We cannot say anything about the network side, and in the brick logs I cannot find a reason why the brick did not respond to the client's ping-timer query. Can you please share the system logs (/var/log/messages) from the same time?

Thanks,
Mohit Agrawal
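A quick way to see whether such ping-timer expiries cluster around the reboot window is to grep all collected logs for the message and sort by timestamp. A sketch, assuming the per-node logs have been copied under one directory (the layout is an assumption; the 10-second window in the message corresponds to the volume option network.ping-timeout):

```shell
# Illustrative: collect every ping-timer expiry line under a log tree and
# print them in timestamp order (gluster log lines start with the timestamp,
# so a plain text sort is chronological).
ping_timeouts() {
    dir="$1"   # e.g. a directory holding copies of each node's logs
    grep -rh "rpc_clnt_ping_timer_expired" "$dir" 2>/dev/null | sort
}
```

For example, `ping_timeouts ./all-node-logs` would show whether every client lost the same brick at roughly the same moment, which points at the brick or the network rather than one client.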
Created attachment 1480439 [details]
messages from host with VM

Yes, sure. By the way, local time is GMT+4, so the interesting part is:

Aug 27 13:56:32 father crmd[2427]: notice: Initiating stop operation balamak_stop_0 on son
Aug 27 13:56:32 father crmd[2427]: notice: Initiating stop operation balamak_stop_0 locally on father
Aug 27 13:56:32 father crmd[2427]: notice: Initiating migrate_to operation endor_migrate_to_0 locally on father
Aug 27 13:56:32 father VirtualDomain(balamak)[24337]: INFO: Issuing graceful shutdown request for domain balamak.
Aug 27 13:56:32 father VirtualDomain(endor)[24338]: INFO: endor: Starting live migration to son (using: virsh --connect=qemu:///system --quiet migrate --live endor qemu+ssh://son/system ).
Aug 27 13:56:34 father libvirtd: 2018-08-27 09:56:34.448+0000: 2840: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain balamak; current job is (async nested, migration out) owned by (2835 remoteDispatchDomainMigrateBegin3Params, 2835 remoteDispatchDomainMigrateBegin3Params) for (122s, 122s)
Aug 27 13:56:34 father libvirtd: 2018-08-27 09:56:34.448+0000: 2840: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrateBegin3Params)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.267+0000: 2838: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (modify, none) for domain balamak; current job is (async nested, migration out) owned by (2835 remoteDispatchDomainMigrateBegin3Params, 2835 remoteDispatchDomainMigrateBegin3Params) for (149s, 149s)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.267+0000: 2838: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrateBegin3Params)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.939+0000: 2839: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain endor; current job is (async nested, migration out) owned by (2837 remoteDispatchDomainMigrateBegin3Params, 2837 remoteDispatchDomainMigrateBegin3Params) for (30s, 30s)
Aug 27 13:57:02 father libvirtd: 2018-08-27 09:57:02.939+0000: 2839: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrateBegin3Params)
Aug 27 13:57:33 father libvirtd: 2018-08-27 09:57:33.440+0000: 2840: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain endor; current job is (async nested, migration out) owned by (2837 remoteDispatchDomainMigrateBegin3Params, 2837 remoteDispatchDomainMigrateBegin3Params) for (61s, 61s)
Aug 27 13:57:33 father libvirtd: 2018-08-27 09:57:33.440+0000: 2840: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigrateBegin3Params)

I'll add logs from the 2 other hosts...
Created attachment 1480441 [details] log from host which is target of VM migration
Sorry, I messed up: the last log, which is marked as son, is actually from spirit, the gluster arbiter. Sorry.
Created attachment 1480442 [details] this is log from VM migration target, really
Thank you!
Dmitry,

Apologies for the delay. Did you get a chance to upgrade to glusterfs-6.x and check whether the issue is fixed? I would request you to try upgrading and testing. We noticed many issues when the oVirt community tried the glusterfs-5.2 release, and have since fixed many of the issues they reported. We would like to hear your feedback.
Hello! We concluded that the problem was in the arbiter, so we replaced the arbiter with a full third replica node and have not seen the problem since. We are currently running 5.6. Sorry, I don't know whether this problem still exists in the current arbiter code. Thank you! I think it is better to close this issue.
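For reference, replacing an arbiter brick with a full data brick can be sketched with the volume-reshape commands below. The volume name and brick path are placeholders, and reshaping a live VM store this way should be done with care; this is an illustration, not a tested procedure.

```shell
# Sketch only (placeholder names): drop the arbiter brick, going to plain
# replica 2, then add the same node back as a full data brick (replica 3),
# and trigger a full heal to populate the new brick.
gluster volume remove-brick pool replica 2 spirit:/wall/pool/brick force
gluster volume add-brick pool replica 3 spirit:/wall/pool/brick
gluster volume heal pool full
```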
Thanks for the update, Dmitry! First of all, as it is working for you, we are happy. I will close the bug for now, and will discuss with the team whether any such issues have been fixed in arbiter since the 4.x time frame.