Description of problem:
qemu VMs sometimes get IO errors and mark their disks read-only in the guest. A simple reset within the guest solves the issue, so it is a transient error that the gluster API (libgfapi) recovers from (a plain remount,rw is not enough though), but we cannot run VMs that go read-only every once in a while :)

Version-Release number of selected component (if applicable):
Using PVE 6 (Debian buster (10)) with:
- gluster 5.5-3
- pve-qemu-kvm 4.0.0-3

How reproducible:
It has happened about once a week on some of our VMs (a small test cluster with 6 VMs right now) over the past two weeks. I could not reproduce it manually by running a heavy IO load, so I think it is unrelated to the amount of activity; the VMs that hit the problem were not fully idle either (there are syslog messages just before the problem). I could not reproduce it by killing gluster daemons either. I could, however, reproduce it with tcpkill as described below.
The fuse `glusterfs` mount does not have the problem. I did not test whether disabling SSL fixes the problem, but if it did not, this would probably have been noticed before ;)

Steps to Reproduce:
1. Set up a gluster volume with SSL enabled and replica 2, and configure a VM to use it with gluster://<volume>/disk
2. Identify the port used by qemu with e.g. `ss -npt | grep <qemu pid>` and run `tcpkill -i <iface> host <qemu host> and port <qemu port>` -- it is wise to check that nothing else uses that ip/port combo wherever you run tcpkill (possible if the port is used as a source port somewhere else)
3. Do some IO in the VM, e.g. `echo foo | dd of=fileongluster conv=fsync` (a scripted sketch of these steps is attached at the end of this report)

Actual results:
The VM gets an IO error.

Expected results:
The VM only writes to the brick that still works, and the file goes through the normal healing mechanisms when the other brick reconnects (or whatever happens without SSL).

Additional info:
The only logs I could find were about the connection dropping on a single brick; there was nothing in the qemu logs or on the other brick. Here is what it looks like when the problem really happened (it looks different when tcpkill resets the connection, as that appears as a proper connection close, but I am fairly confident the problem is the same):

[2019-11-17 08:25:17.124430] E [socket.c:246:ssl_dump_error_stack] 0-tcp.data-server: error:1408F09C:SSL routines:ssl3_get_record:http request
[2019-11-17 08:25:17.124485] W [socket.c:1386:__socket_read_simple_msg] 0-tcp.data-server: reading from socket failed. Error (Input/output error), peer (10.0.2.XY:49117)
[2019-11-17 08:25:17.124511] I [socket.c:3712:socket_submit_outgoing_msg] 0-tcp.data-server: not connected (priv->connected = -1)
[2019-11-17 08:25:17.124516] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-data-server: disconnecting connection from CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-0
[2019-11-17 08:25:17.124545] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf534, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.124576] W [inodelk.c:609:pl_inodelk_log_cleanup] 0-data-server: releasing lock on 2b156905-db61-4c11-b1c8-8356d7ebd554 held by {client=0x7f5198039a30, pid=929457 lk-owner=2847a09f5c7f0000}
[2019-11-17 08:25:17.124607] I [MSGID: 115013] [server-helpers.c:283:do_fd_cleanup] 0-data-server: fd cleanup on /images/169076/vm-169076-disk-1.qcow2
[2019-11-17 08:25:17.124621] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.124717] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf535, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.124797] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125129] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf536, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125181] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125360] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf537, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125409] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125542] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf538, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125585] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125761] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf539, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125876] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125960] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf53a, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.126011] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.126039] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-data-server: Shutting down connection CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-0
[2019-11-17 08:25:27.618791] I [addr.c:54:compare_addr_and_update] 0-/mnt/data/gluster-data: allowed = "*", received addr = "10.0.2.XY"
[2019-11-17 08:25:27.618827] I [login.c:46:gf_auth] 0-auth/login: connecting user name: hyperv1.fdn.fr
[2019-11-17 08:25:27.618835] I [login.c:110:gf_auth] 0-auth/login: allowed user names: hyperv2.fdn.fr,hyperv1.fdn.fr
[2019-11-17 08:25:27.618843] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-data-server: accepted client from CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-1 (version: 5.5)

I will have a look at the code and try to pinpoint something more precise either tonight or next week if it hasn't been identified by then (it looks fairly straightforward), mainly to check whether this has since been fixed, as I do not have anywhere to test whether the problem is still present on the master branch. Thanks!
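For convenience, here is a minimal scripted sketch of the reproduction steps above. It is not a copy of our exact setup: the hypervisor address, qemu PID, network interface, volume name and brick paths are placeholders (some borrowed from the log above) and must be adapted, SSL certificates are assumed to be already deployed on both brick nodes, and tcpkill comes from the dsniff package.

```sh
# Step 1: replica-2 volume with SSL, used by the VM as gluster://data/<disk>
# (volume name "data" and brick paths mirror the log above; adapt as needed)
gluster volume create data replica 2 \
    hyperv1:/mnt/data/gluster-data hyperv2:/mnt/data/gluster-data
gluster volume set data server.ssl on
gluster volume set data client.ssl on
gluster volume start data

# Step 2a: on the hypervisor, find the local port qemu uses towards the brick
QEMU_PID=929457     # PID of the qemu process backing the test VM (placeholder)
QEMU_HOST=10.0.2.2  # address of the hypervisor running that qemu (placeholder)
IFACE=eth0          # interface that sees the qemu<->brick traffic (placeholder)
# Column 4 of `ss -npt` is "Local Address:Port"; qemu may have several gluster
# connections, so taking the first match is a simplification -- pick the one
# going to the brick port by hand if needed.
QEMU_PORT=$(ss -npt | grep "pid=$QEMU_PID," | awk '{print $4}' | cut -d: -f2 | head -n1)

# Step 2b: from a host that sees the traffic, reset that connection
# (check first that nothing else uses this host/port combination)
tcpkill -i "$IFACE" host "$QEMU_HOST" and port "$QEMU_PORT"

# Step 3: inside the VM, do some IO that hits the gluster-backed disk
echo foo | dd of=fileongluster conv=fsync
```

Note that tcpkill only forges RST packets, so this is close to, but not identical to, the SSL-level failure seen in the brick log; as said above, I am fairly confident the end result is the same.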
For info, we just got our 3rd node to set up a 3rd (arbiter) brick, so the volume is finally in its optimal configuration, and the problem no longer manifests. I still think this is an issue; I'll set up a test cluster with 2 bricks and try to write a patch next week as planned anyway. (I also forgot to say: the IO error was immediate on connection kill, not after a timeout.) Thanks, -- Dominique Martinet
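For completeness, here is a hedged sketch of the kind of command used to add that third brick as an arbiter; the new hostname and its brick path are assumptions, not our actual values.

```sh
# Convert the existing replica-2 "data" volume to replica 3 with an arbiter;
# "hyperv3" and its brick path are placeholders for the new node.
gluster volume add-brick data replica 3 arbiter 1 hyperv3:/mnt/data/gluster-data
# Let self-heal populate the new arbiter brick, then check the layout.
gluster volume heal data
gluster volume info data
```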
This bug has been moved to https://github.com/gluster/glusterfs/issues/865 and will be tracked there from now on. Visit the GitHub issue URL for further details.