Bug 1775522 - Qemu VM issues IO errors with gluster native driver & SSL enabled
Summary: Qemu VM issues IO errors with gluster native driver & SSL enabled
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: qemu-block
Version: 5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-22 07:40 UTC by Dominique Martinet
Modified: 2020-03-12 12:18 UTC
CC: 2 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:18:28 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Dominique Martinet 2019-11-22 07:40:04 UTC
Description of problem:
qemu VMs sometimes get IO errors and mark disks as read-only in the guest.
A simple reset within the guest solves the issue, so it is a transient error that the gluster API (libgfapi) recovers from (a plain remount,rw is not enough though), but we cannot run VMs that go read-only every once in a while :)

Version-Release number of selected component (if applicable):
Using pve 6 (debian buster (10)) versions:
 - gluster 5.5-3
 - pve-qemu-kvm 4.0.0-3

How reproducible:
Has happened about once a week on some of our VMs (a small test cluster with 6 VMs right now) over the past two weeks.
I could not reproduce it manually by running a heavy IO load, so I think it is unrelated to the activity. The VMs that hit the problem weren't fully idle either (there are syslog messages just before the problem).
I could not reproduce it by killing gluster daemons either.

I could, however, reproduce with tcpkill as described below.

The fuse `glusterfs` mount does not have the problem.
I did not test whether disabling SSL avoids the problem, but it would probably have been noticed before if it happened without SSL ;)


Steps to Reproduce:
1. Set up a gluster volume with SSL enabled and replica 2, and configure a VM to use it with gluster://<volume>/disk
2. Identify the port used by qemu with e.g. `ss -npt | grep <qemu pid>` and run `tcpkill -i <iface> host <qemu host> and port <qemu port>` -- it is wise to check that nothing else uses that ip/port combination on the host where you run tcpkill (possible if the port is reused as a source port by something else)
3. Do some IO in the VM, e.g. `echo foo | dd of=fileongluster conv=fsync` (a consolidated sketch of these steps follows below)
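
A minimal consolidated sketch of the steps above as shell commands. The volume name `data`, the brick path `/mnt/data/gluster-data` and the host names hyperv1/hyperv2 are taken from the logs further down; the interface, qemu pid/port and disk path are placeholders, and the qemu command line only illustrates the gluster:// syntax (pve generates its own):

# 1. replica-2 volume with SSL enabled on the transport
#    (SSL certificates are assumed to already be in place on all nodes)
gluster volume create data replica 2 hyperv1:/mnt/data/gluster-data hyperv2:/mnt/data/gluster-data
gluster volume set data client.ssl on
gluster volume set data server.ssl on
gluster volume start data
# VM disk accessed through the native (libgfapi) driver, e.g.
qemu-system-x86_64 ... -drive file=gluster://hyperv1/data/images/vm-disk.qcow2,format=qcow2,if=virtio

# 2. on the brick node, find the qemu connection and kill it
ss -npt | grep <qemu pid>
tcpkill -i <iface> host <qemu host> and port <qemu port>

# 3. inside the guest, force a write that reaches the disk
echo foo | dd of=fileongluster conv=fsync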

Actual results:
The VM gets an IO error and the guest marks the disk read-only.

Expected results:
The VM keeps writing to the brick that still works, and the file goes through the normal healing mechanisms when the other brick reconnects (i.e. whatever happens without SSL).
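
For reference, once the disconnected brick is back one would expect the pending writes to show up as self-heal entries and then drain, which can be checked with the usual heal command (volume name `data` taken from the logs below):
gluster volume heal data info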


Additional info:
The only logs I could find were about connection dropping on a single brick, nothing in qemu logs or the other brick.
Here is what it looks like when the problem happened for real (it is different when tcpkill resets the connection, as that looks like a proper connection close, but I am fairly confident the underlying problem is the same):

[2019-11-17 08:25:17.124430] E [socket.c:246:ssl_dump_error_stack] 0-tcp.data-server:   error:1408F09C:SSL routines:ssl3_get_record:http request
[2019-11-17 08:25:17.124485] W [socket.c:1386:__socket_read_simple_msg] 0-tcp.data-server: reading from socket failed. Error (Input/output error), peer (10.0.2.XY:49117)
[2019-11-17 08:25:17.124511] I [socket.c:3712:socket_submit_outgoing_msg] 0-tcp.data-server: not connected (priv->connected = -1)
[2019-11-17 08:25:17.124516] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-data-server: disconnecting connection from CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-0
[2019-11-17 08:25:17.124545] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf534, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.124576] W [inodelk.c:609:pl_inodelk_log_cleanup] 0-data-server: releasing lock on 2b156905-db61-4c11-b1c8-8356d7ebd554 held by {client=0x7f5198039a30, pid=929457 lk-owner=2847a09f5c7f0000}
[2019-11-17 08:25:17.124607] I [MSGID: 115013] [server-helpers.c:283:do_fd_cleanup] 0-data-server: fd cleanup on /images/169076/vm-169076-disk-1.qcow2
[2019-11-17 08:25:17.124621] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.124717] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf535, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.124797] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125129] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf536, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125181] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125360] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf537, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125409] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125542] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf538, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125585] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125761] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf539, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125876] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125960] E [rpcsvc.c:1381:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xf53a, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.126011] E [server.c:133:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913) [0x7f521e948913] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108) [0x7f521e8e8108] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3) [0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.126039] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-data-server: Shutting down connection CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-0
[2019-11-17 08:25:27.618791] I [addr.c:54:compare_addr_and_update] 0-/mnt/data/gluster-data: allowed = "*", received addr = "10.0.2.XY"
[2019-11-17 08:25:27.618827] I [login.c:46:gf_auth] 0-auth/login: connecting user name: hyperv1.fdn.fr
[2019-11-17 08:25:27.618835] I [login.c:110:gf_auth] 0-auth/login: allowed user names: hyperv2.fdn.fr,hyperv1.fdn.fr
[2019-11-17 08:25:27.618843] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-data-server: accepted client from CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-1 (version: 5.5)


I will have a look at the code either tonight or next week and try to pinpoint something more precise if it hasn't been identified yet; it looks fairly straightforward. I would also like to check whether this has already been fixed, as I do not have anywhere to test whether the problem is still present on the master branch.

Thanks!

Comment 1 Dominique Martinet 2019-11-22 15:23:36 UTC
For info, we just got our 3rd node and set up a 3rd (arbiter) brick, so the volume is finally in an optimal configuration, and the problem no longer manifests.
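
For reference, the change was the standard replica-2 to replica-3-with-arbiter conversion; a minimal sketch (the third host name is a placeholder, the brick path assumed to mirror the existing ones):
gluster volume add-brick data replica 3 arbiter 1 hyperv3:/mnt/data/gluster-data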

I still think this is an issue; I'll set up a test cluster with 2 bricks and try to write a patch next week as planned anyway.

(also forgot to say: the IO error was immediate on connection kill, not after a timeout)

Thanks,
-- 
Dominique Martinet

Comment 2 Worker Ant 2020-03-12 12:18:28 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/865 and will be tracked there from now on. Visit the GitHub issue URL for further details.

