Activating a host (ovirt-node-ng-installer-4.3.0-2019020409.el7.iso) fails because it cannot connect to the existing glusterfs 3.12 storage domains. I can connect to the glusterfs server with telnet, and there are no client restrictions on it. We could try to upgrade to glusterfs 4.1, but as it's not compatible with oVirt 4.2 this would break our old cluster...

[2019-02-04 12:47:02.969362] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5)
[2019-02-04 12:47:02.979349] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-04 12:58:10.232517] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5)
[2019-02-04 12:58:10.242924] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-04 13:02:44.511106] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.16
[2019-02-04 13:02:44.511157] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-02-04 13:02:44.512757] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f58b4ccffbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7f58b4a98e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7f58b4a98f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f58b4a9a531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7f58b4a9b0d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-02-04 12:47:02.979593 (xid=0x2)
[2019-02-04 13:02:44.512779] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:ssd5)
[2019-02-04 13:02:44.512813] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7f58b4a98e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x5641acd22361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5641acd1aceb] ) 0-: received signum (0), shutting down
[2019-02-04 13:02:44.512841] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:02:44.521264] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:02:44.521784] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f58b3b31dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5641acd1ae75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5641acd1aceb] ) 0-: received signum (15), shutting down
[2019-02-04 13:13:51.774986] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.16
[2019-02-04 13:13:51.775037] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-02-04 13:13:51.775928] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f41e4f9ffbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7f41e4d68e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7f41e4d68f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f41e4d6a531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7f41e4d6b0d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-02-04 12:58:10.243172 (xid=0x2)
[2019-02-04 13:13:51.775949] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:ssd5)
[2019-02-04 13:13:51.775982] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7f41e4d68e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x563a96061361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x563a96059ceb] ) 0-: received signum (0), shutting down
[2019-02-04 13:13:51.776008] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:13:51.776214] E [mount-common.c:232:fuse_mnt_umount] 0-glusterfs-fuse: fuse: failed to unmount /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5: No such file or directory
[2019-02-04 13:13:51.776229] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:13:51.776338] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f41e3e01dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x563a96059e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x563a96059ceb] ) 0-: received signum (15), shutting down
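For what it's worth, the failing mount can also be run by hand outside oVirt/vdsm with a dedicated log file, using the same arguments as in the log above. A minimal sketch; the log path and mount point are placeholders I chose, not taken from this setup:

  # run the fuse client manually and capture its log separately
  mkdir -p /mnt/glustertest
  /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 -l /tmp/ssd5-manual.log /mnt/glustertest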
Created attachment 1526800 [details] engine.log
Created attachment 1526801 [details] server.log
Created attachment 1526802 [details] vdsm.log
gluster volume info

Volume Name: hdd2
Type: Replicate
Volume ID: 6d491c15-adf3-4d5c-aade-46f0206f74e5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/hdd2/brick1
Brick2: *.*.*.15:/data/hdd2/brick1
Brick3: *.*.*.16:/data/hdd2/brick1
Options Reconfigured:
server.allow-insecure: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
network.ping-timeout: 30
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Volume Name: sdd8
Type: Replicate
Volume ID: 73b020ba-c63d-47c0-a102-f994f713b5fb
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd8/brick1
Brick2: *.*.*.15:/data/ssd8/brick1
Brick3: *.*.*.16:/data/ssd8/brick1
Options Reconfigured:
server.allow-insecure: on
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30

Volume Name: ssd3
Type: Replicate
Volume ID: 4b340741-a9bc-46e5-8667-d7f75ec57254
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd3/brick1
Brick2: *.*.*.15:/data/ssd3/brick1
Brick3: *.*.*.16:/data/ssd3/brick1
Options Reconfigured:
server.allow-insecure: on
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30

Volume Name: ssd4
Type: Replicate
Volume ID: edebdd8f-4cd7-490b-a7a9-3985cc016282
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd4/brick1
Brick2: *.*.*.15:/data/ssd4/brick1
Brick3: *.*.*.16:/data/ssd4/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on

Volume Name: ssd5
Type: Replicate
Volume ID: 5c5ca8e3-f9dd-4bca-a5d3-28c94d965cca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd5/brick1
Brick2: *.*.*.15:/data/ssd5/brick1
Brick3: *.*.*.16:/data/ssd5/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on

Volume Name: ssd6
Type: Replicate
Volume ID: 84a554aa-b634-4582-8cc3-75d23ebdcd91
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd6/brick1
Brick2: *.*.*.15:/data/ssd6/brick1
Brick3: *.*.*.16:/data/ssd6/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on

Volume Name: ssd9
Type: Replicate
Volume ID: 1a379dd8-7e89-4b0b-b6c6-fec45275a9a0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd9/brick1
Brick2: *.*.*.15:/data/ssd9/brick1
Brick3: *.*.*.16:/data/ssd9/brick1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
server.allow-insecure: on
All the glusterfs information I can find about "failed to fetch volume file" leads me to "server.allow-insecure on" and "rpc-auth-allow-insecure on", but I have these enabled.

https://bugzilla.redhat.com/show_bug.cgi?id=1248067
https://lists.gluster.org/pipermail/gluster-devel/2012-December/037371.html

"After doing many experiments and inputs I got from vijay on IRC, this is what finally worked.
1) volume set <volname> server.allow-insecure on
2) option rpc-auth-allow-insecure on in glusterd.vol
If either of the above is not present, I get the same "unable to get volfile" error. I have tried setting virsh (actually libvirt) to spawn the qemu process as the 'root', 'vdsm' and 'qemu' user, and it fails for all 3 if either of the above is NOT set. Only when both of the above are set does it work for me, and I am able to start a VM via virsh."
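For reference, a minimal sketch of how those two settings are applied, using ssd5 as the example volume; the restart steps are my assumption about a clean application of the options, not something confirmed to have been run here:

  # 1) per-volume option; a stop/start lets the bricks pick it up
  gluster volume set ssd5 server.allow-insecure on
  gluster volume stop ssd5
  gluster volume start ssd5

  # 2) daemon-level option: add the following line inside the
  #    "volume management" block of /etc/glusterfs/glusterd.vol
  #    on every server, then restart glusterd
  #      option rpc-auth-allow-insecure on
  systemctl restart glusterd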
Created attachment 1527521 [details] supervdsm.log
Moving the assignment for this bug to Sanju, who's investigating this on the Gluster community side. Netbulae, can you please provide mount logs and brick logs?
The mount logs are all filled with the same messages as in the first comment ("failed to fetch volume file"), and there are no brick logs on the client. The brick logs on the glusterfs server have nothing in them referring to these hosts. I can increase the debug level if needed.
Sanju, is this the same as Bug 1677160?
(In reply to Sahina Bose from comment #11)
> Sanju, is this the same as Bug 1677160?

Sahina, the error messages in the glusterd logs are the same in both bugs. Also, looking at comment 7, I can see that Bug 1677160 was opened to track it from the gluster side. So both bugs are the same.
There must be something unique to our situation. I have time Thursday to debug further.
oVirt 4.3.1 has already been released, so moving to ovirt-4.3.2.
Still happening. I thought it was because my co-worker had added "server.allow-insecure: on" to the master storage volume but didn't restart the volume, so I restarted this volume, but it didn't help.

As I was checking the oVirt event log for another issue, I see a lot of entries like this:

Mar 1, 2019, 1:53:04 PM  VDSM node10 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
Mar 1, 2019, 1:46:03 PM  VDSM node10 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-000000000043, msdUUID=ed888117-094e-46f5-b433-61eb72e242be'
Mar 1, 2019, 1:44:03 PM  VDSM node10 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
Mar 1, 2019, 1:41:55 PM  VDSM gluster5 command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "Attempt to call function: <bound method GlusterHost.storageDevicesList of <vdsm.gluster.apiwrapper.GlusterHost object at 0x7fb70c0527d0>> with arguments: () error: argument of type 'NoneType' is not iterable"}
Mar 1, 2019, 1:41:54 PM  VDSM gluster6 command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "Attempt to call function: <bound method GlusterHost.storageDevicesList of <vdsm.gluster.apiwrapper.GlusterHost object at 0x7fb6284b0750>> with arguments: () error: argument of type 'NoneType' is not iterable"}
Hi Netbulae, how can we proceed on this? Is there a possibility we can do some live debugging of the situation? I am 'amarts' on IRC; we can catch up there to discuss further.

From the latest comments I understand that adding the 'insecure' options (as per comment#5) didn't help.
Yes, no problem. I'm back in the office on Thursday from 9:00 CEST to 17:30 CEST.
Sahina - this bug can't block the glusterfs-6 release given the product is different. We should have a clone at glusterfs and change this dependency.
Changing the component, as this bug is already tracked via Bug 1677319 in oVirt.
We tried to reproduce this issue but couldn't hit the same problem. Until and unless we have a reproducer and the complete logs, especially the client, glusterd, and brick log files, it'd be hard to debug.
Upgraded to node 4.3.2-2019031310. As stated earlier, there is nothing in the brick and glusterd logs. I've set "diagnostics.client-log-level" to DEBUG but still see nothing.

glustershd.log

[2019-03-18 11:10:54.005050] D [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-hdd2-client-2: remote operation failed [Resource temporarily unavailable]
[2019-03-18 11:10:54.005079] D [MSGID: 0] [client-rpc-fops.c:1511:client3_3_inodelk_cbk] 0-stack-trace: stack-address: 0x7f1254002560, hdd2-client-2 returned -1 error: Resource temporarily unavailable [Resource temporarily unavailable]
[2019-03-18 11:10:55.545357] D [MSGID: 0] [afr-self-heald.c:218:afr_shd_index_inode] 0-hdd2-replicate-0: glusterfs.xattrop_entry_changes_gfid dir gfid for hdd2-client-0: b8f46fe6-ef56-44ac-8586-22061b0f2d5b
[2019-03-18 11:10:55.546606] D [MSGID: 0] [afr-self-heald.c:597:afr_shd_index_healer] 0-hdd2-replicate-0: finished index sweep on subvol hdd2-client-0
[2019-03-18 11:11:24.234360] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-0: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:11:24.234416] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-1: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:11:24.234425] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-2: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:12:46.240335] D [logging.c:1855:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2019-03-18 11:12:46.240388] D [logging.c:1817:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2019-03-18 11:14:09.000515] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f12ab9f7ebb] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7f12ab7c228b] (--> /lib64/libgfrpc.so.0(+0x14a31)[0x7f12ab7c2a31] (--> /lib64/libgfrpc.so.0(rpc_clnt_submit+0x4a1)[0x7f12ab7bf651] (--> /usr/lib64/glusterfs/3.12.15/xlator/protocol/client.so(+0x102e2)[0x7f129e0e02e2] ))))) 0-: 192.168.99.14:49154: ping timer event already removed
[2019-03-18 11:14:09.000619] D [MSGID: 0] [syncop-utils.c:548:syncop_is_subvol_local] 0-ssd4-client-0: subvol ssd4-client-0 is local
[2019-03-18 11:14:09.000637] D [MSGID: 0] [afr-self-heald.c:580:afr_shd_index_healer] 0-ssd4-replicate-0: starting index sweep on subvol ssd4-client-0
[2019-03-18 11:14:09.000635] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-ssd4-client-0: Ping latency is 0ms

glusterfs client log

[2019-03-18 11:03:57.442215] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.14 --volfile-server=*.*.*.15 --volfile-server=*.*.*.16 --volfile-id=/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9)
[2019-03-18 11:03:57.472426] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-18 11:06:03.636977] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f392add8dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x556b04278e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x556b04278ceb] ) 0-: received signum (15), shutting down
[2019-03-18 11:06:03.637039] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:06:03.654263] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:13:29.415760] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.14 --volfile-server=*.*.*.15 --volfile-server=*.*.*.16 --volfile-id=/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9)
[2019-03-18 11:13:29.444824] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14
[2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)
[2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)
[2019-03-18 11:29:01.002550] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7fb4beba4e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x5586d9f3c361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5586d9f34ceb] ) 0-: received signum (0), shutting down
[2019-03-18 11:29:01.002578] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:29:01.009036] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:29:01.009655] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fb4bdc3ddd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5586d9f34e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5586d9f34ceb] ) 0-: received signum (15), shutting down

supervdsm.log

MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:11:29,020::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call volumeInfo with (u'ssd9', u'*.*.*.14') {}
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:11:29,020::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.14 ssd9 --xml (cwd None)
MainProcess|jsonrpc/6::DEBUG::2019-03-18 12:11:31,460::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call ksmTune with ({u'run': 0, u'merge_across_nodes': 1},) {}
MainProcess|jsonrpc/6::DEBUG::2019-03-18 12:11:31,461::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,227::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,227::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 122 seconds (max pending: 1)
MainProcess|jsonrpc/5::ERROR::2019-03-18 12:13:29,227::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
    xmltree = _execGlusterXml(command)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
    return _getTree(rc, out, err)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
    raise ge.GlusterCmdExecFailedException(rc, out, err)
GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,229::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f3877683a10>, u'*.*.*.14:/ssd9', u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.16', 'cgroup': 'vdsm-glusterfs'}
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,230::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.16 *.*.*.14:/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9 (cwd None)
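The timed-out call above is the exact command supervdsm runs, so it can be reproduced by hand to separate a gluster CLI problem from a vdsm one. A sketch, keeping the masked addresses as placeholders:

  # the same volume-info query vdsm issues (it timed out after ~2 minutes above)
  /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.14 ssd9 --xml

  # and the same mount vdsm would attempt afterwards
  /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.16 *.*.*.14:/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9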
(In reply to Amar Tumballi from comment #16)
> Hi Netbulae, how can we proceed on this? Is there a possibility we can do
> some live debugging of the situation? I am 'amarts' on IRC; we can catch up
> there to discuss further.
>
> From the latest comments I understand that adding the 'insecure' options
> (as per comment#5) didn't help.

Yes, adding the insecure options and restarting the volumes didn't help.
Is there anything in Comment 21 that could help narrow this down?
[2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14

Why did we get a disconnect? Was the glusterd service at *.14 not running?

[2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)

The above log seems to be the culprit here.

[2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)

And the above log is the after-effect.

I have a few questions:

1. Does the mount fail every time?
2. Do you see any change in the behaviour when the primary volfile server is changed?
3. What are the gluster versions on the individual peers?

(Keeping the needinfo intact for now, but requesting Sahina to get us these details to work on.)
Redirecting to the reporter. Could you answer the questions in Comment 24?
(In reply to Atin Mukherjee from comment #24)
> [2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14
>
> Why did we get a disconnect? Was the glusterd service at *.14 not running?
>
> [2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
> [2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)
>
> The above log seems to be the culprit here.
>
> [2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)
>
> And the above log is the after-effect.
>
> I have a few questions:
>
> 1. Does the mount fail every time?

Yes. It also stays the same when we switch the primary storage domain to another one.

> 2. Do you see any change in the behaviour when the primary volfile server is changed?

No. I have different primary volfile servers across volumes to spread the load a bit more. Same effect always.

> 3. What are the gluster versions on the individual peers?

All nodes and servers are on 3.12.15.

> (Keeping the needinfo intact for now, but requesting Sahina to get us these details to work on.)
Since I'm unable to reproduce this even after multiple attempts, the only way I can make progress on this is by asking you to test different combinations. I understand this might be frustrating, but I have no other way at this point to pinpoint it. In my local setup I tried every possible option to simulate this but had no success.

As I explained in comment 24, it seems the client couldn't get the volfile from the glusterd running on the *.15 instance. However, since there's no log entry at INFO level in glusterd which could indicate the cause of this failure, can I request you to do the following if possible?

1. Run 'pkill glusterd; glusterd' on the *.15 node.
2. Attempt to mount the client.
3. Find the 'failed to fetch volume file' failure log and note its timestamp. Map this timestamp to the glusterd log and send us a snippet of the log entries around it.
4. Run the 'gluster v info' command from all the nodes and paste back the output.
5. Provide the output of 'gluster v get all cluster.op-version' from one of the nodes.
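Spelled out as shell commands, that sequence might look like the following sketch; the mount point is an assumption of mine, and the masked addresses are placeholders:

  # on the *.15 server
  pkill glusterd; glusterd

  # on the client, attempt the mount
  mkdir -p /mnt/glustertest
  mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.14 *.*.*.16:/ssd5 /mnt/glustertest

  # locate the failure and its timestamp in the client logs
  grep 'failed to fetch volume file' /var/log/glusterfs/*.log

  # on the servers
  gluster v info
  gluster v get all cluster.op-version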
(In reply to Atin Mukherjee from comment #27)

I tried the sequence you posted, but currently we have *.16 and ssd5 as the master storage domain, as I removed ssd9. Also upgraded to oVirt node 4.3.3 rc2 because of several fixes.

But I don't get the 'failed to fetch volume file' this way; it appears the backup-volfile-servers option is not working:

>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,186::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.16 ssd5 --xml (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,297::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:10:09,297::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
> res = func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
> xmltree = _execGlusterXml(command)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
> return _getTree(rc, out, err)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
> raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Connection failed. Please check if gluster daemon is operational.\n' err=''
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,298::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=1*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'}

When I leave the glusterfs process running, I also don't get it anymore in 4.3.3, but I get "'Error : Request timed out\n' err=''" and "MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n')":

>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:25:54,918::commands::219::root::(execCmd) FAILED: <err> = 'Running scope as unit run-10178.scope.\nMount failed. Please check the log file for more details.\n'; <rc> = 1
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:25:54,918::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 765 seconds (max pending: 2)
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:25:54,918::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in mount
>Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
> res = func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 143, in mount
> cgroup=cgroup)
> File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 277, in _mount
> _runcmd(cmd)
> File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
> raise MountError(rc, b";".join((out, err)))
>MountError: (1, ';Running scope as unit run-10178.scope.\nMount failed.
Please check the log file for more details.\n') >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,720::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {} >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,720::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 294 seconds (max pending: 2) >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,802::routes::110::root::(get_gateway) The gateway *.*.*.3 is duplicated for the device ovirtmgmt >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,807::routes::110::root::(get_gateway) The gateway *.*.*.3 is duplicated for the device ovirtmgmt >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,809::cmdutils::133::root::(exec_cmd) /sbin/tc qdisc show (cwd None) >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,860::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0 >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,866::cmdutils::133::root::(exec_cmd) /sbin/tc class show dev bond0 classid 1389:1388 (cwd None) >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,919::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0 >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,005::vsctl::68::root::(commit) Executing commands: /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,006::cmdutils::133::root::(exec_cmd) /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface (cwd None) >MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,038::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0 > >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,472::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/ovirt.netbulae.mgmt:_var_lib_exports_iso (cwd None) >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,502::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0 >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,503::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,529::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/*.*.*.15:_data_ovirt') >{'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,529::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/*.*.*.15:_data_ovirt (cwd None) >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,595::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0 >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,596::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,738::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {} >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,738::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/libexec/vdsm/fc-scan (cwd None) >MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,804::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,841::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm 
object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4') {'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,841::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4 (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,849::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n'; <rc> = 32 >MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,850::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount >Traceback (most recent call last): > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper > res = func(*args, **kwargs) > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount > mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount > _runcmd(cmd) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd > raise MountError(rc, b";".join((out, err))) >MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n') >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,850::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,850::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5 (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,875::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0 >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,876::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,889::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6') {'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,889::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6 (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,895::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6: mountpoint not found\n'; <rc> = 32 >MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,895::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount >Traceback (most recent call last): > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper > res = func(*args, **kwargs) > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount > mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount > _runcmd(cmd) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd > raise MountError(rc, b";".join((out, err))) >MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6: mountpoint not found\n') >MainProcess|jsonrpc/6::DEBUG::2019-04-09 
12:37:02,895::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2') {'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,896::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2 (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,902::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2: mountpoint not found\n'; <rc> = 32 >MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,902::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount >Traceback (most recent call last): > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper > res = func(*args, **kwargs) > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount > mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount > _runcmd(cmd) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd > raise MountError(rc, b";".join((out, err))) >MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2: mountpoint not found\n') >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,903::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3') {'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,903::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3 (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,910::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3: mountpoint not found\n'; <rc> = 32 >MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,910::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount >Traceback (most recent call last): > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper > res = func(*args, **kwargs) > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount > mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount > _runcmd(cmd) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd > raise MountError(rc, b";".join((out, err))) >MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3: mountpoint not found\n') >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,911::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8') {'force': True, 'lazy': True, 'freeloop': False} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,911::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8 (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,917::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8: mountpoint not 
found\n'; <rc> = 32 >MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,917::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount >Traceback (most recent call last): > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper > res = func(*args, **kwargs) > File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount > mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount > _runcmd(cmd) > File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd > raise MountError(rc, b";".join((out, err))) >MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8: mountpoint not found\n') >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,042::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {} >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,042::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/libexec/vdsm/fc-scan (cwd None) >MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,107::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None
I cannot find anything related in the glusterd log.

gluster v get all cluster.op-version
Option                                  Value
------                                  -----
cluster.op-version                      31202
The volume info is in comment 4. I get the same with the brick running. I see now:

>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:51:27,437::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call volumeInfo with (u'ssd5', u'*.*.*.16') {}
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:51:27,437::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.16 ssd5 --xml (cwd None)
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,553::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,553::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 123 seconds (max pending: 1)
>MainProcess|jsonrpc/0::ERROR::2019-04-10 09:53:27,553::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
> res = func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
> xmltree = _execGlusterXml(command)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
> return _getTree(rc, out, err)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
> raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,555::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'}
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,555::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.14 *.*.*.16:ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5 (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-10 09:54:27,765::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}

But the port is reachable from the node:

>Brick *.*.*.16:/data/ssd5/brick1 49156 0 Y 15429
>[root@node9 ~]# telnet *.*.*.16 49156
>Trying *.*.*.16...
>Connected to *.*.*.16.
>Escape character is '^]'.
And in /var/log/messages: >Apr 10 09:56:27 node9 vdsm[56358]: WARN Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=*.*.*.15:*.*.*.14', u'id': u'b036005a-d44d-4689-a8c3-13e1bbf55af7', u'connection': u'*.*.*.16:ssd5', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': >u'00000001-0001-0001-0001-000000000043', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'fbedfc27-e064-48d5-b332-daeeccdd1cf4'} at 0x7f77c4589210> timeout=60, duration=300.00 at 0x7f77c4589810> >task#=674 at 0x7f77e406f790>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012 >self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012 self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012 ret = func(*args, >**kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012 self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012 self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012 response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012 res = >method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012 result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1095, in connectStorageServer#012 >connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012 result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012 return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012 result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012 return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method#012 ret = func(*args, **kwargs)#012File: >"/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2411, in connectStorageServer#012 conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 172, in connect#012 >self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount#012 cgroup=cgroup)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012 return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>#012 **kwargs)#012File: "<string>", line 2, in mount#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012 kind, result = conn.recv() 
>Apr 10 09:56:27 node9 vdsm[56358]: WARN Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {u'masterVersion': 10, u'domainDict': {u'33f1ba00-6a16-4e58-b4c5-94426f1c4482': u'active', u'782a61af-a520-44c4-8845-74bf92888552': u'active', u'b30bab9d-9a66-44ce-ad17-2eb4ee858d8f': u'active', u'09959920-a31b-42c2-a547-e50b73602c96': u'active', u'f5d2f7c6-093f-46d6-a844-224d92db5ef9': u'active', u'515166a4-4b2f-402d-bf37-95f3c59635cb': >u'active', u'1ed0a635-67ee-4255-aad9-b70822350706': u'active', u'95b4e5d2-2974-4d5f-91e4-351f75a15435': u'active', u'84f2ff4a-2ec5-42dc-807d-bd12745f387d': u'active'}, u'storagepoolID': u'00000001-0001-0001-0001-000000000043', u'scsiKey': >u'00000001-0001-0001-0001-000000000043', u'masterSdUUID': u'09959920-a31b-42c2-a547-e50b73602c96', u'hostID': 12}, 'jsonrpc': '2.0', 'method': u'StoragePool.connect', 'id': u'95545e2f-6b72-4107-bc19-8de3e549aba0'} at 0x7f77c458d310> >timeout=60, duration=120.01 at 0x7f77c458d0d0> task#=677 at 0x7f77e406f990>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", >line 812, in __bootstrap_inner#012 self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012 self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line >195, in >run#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012 self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 >task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012 self._handler(self._ctx, >self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012 response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012 res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012 result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1091, in >connect#012 self._UUID, hostID, masterSdUUID, masterVersion, domainDict)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012 result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012 return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012 result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012 return fn(*args, **kargs)#012File: "<string>", line 2, in connectStoragePool#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", >line 50, in method#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1034, in connectStoragePool#012 spUUID, hostID, msdUUID, masterVersion, domainsMap)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1096, in _connectStoragePool#012 res = pool.connect(hostID, msdUUID, masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect#012 >self.__rebuild(msdUUID=msdUUID, 
masterVersion=masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild#012 self.setMasterDomain(msdUUID, masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1491, in setMasterDomain#012 domain = sdCache.produce(msdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce#012 domain.getRealDomain()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain#012 return self._cache._realProduce(self._sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce#012 domain = >self._findDomain(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain#012 return findMethod(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in >_findUnfetchedDomain#012 return mod.findDomain(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 146, in findDomain#012 return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 130, in findDomainPath#012 for tmpSdUUID, domainPath in fileSD.scanDomains("*"):#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 893, in scanDomains#012 for >res in misc.itmap(collectMetaFiles, mntList, oop.HELPERS_PER_DOMAIN):#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/misc.py", line 538, in itmap#012 yield respQueue.get()#012File: "/usr/lib64/python2.7/Queue.py", line 168, in >get#012 self.not_empty.wait()#012File: "/usr/lib/python2.7/site-packages/pthreading.py", line 127, in wait#012 return self.__cond.wait()#012File: "/usr/lib/python2.7/site-packages/pthread.py", line 131, in wait#012 return >_libpthread.pthread_cond_wait(self._cond, m.mutex())
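One note on the telnet test above, offered as an observation to verify rather than a confirmed diagnosis: 49156 is the brick port, while the GETSPEC call that fails in the client log is served by glusterd on the management port, which is 24007 by default. A successful telnet to the brick port therefore doesn't exercise the failing path; checking the management port from the node would:

  telnet *.*.*.16 24007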
Circling back on this: do you still see the same issues with the latest gluster 6 release? Checking since we haven't been able to reproduce this internally.
I still have it, with a bit different output:

>Oct 1 12:59:54 vh10en vdsm[9365]: WARN Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=*.*.*.15:*.*.*.14', u'id': u'b036005a-d44d-4689-a8c3-13e1bbf55af7', u'connection': u'*.*.*.16:ssd5', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': u'00000001-0001-0001-0001-000000000043', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'94712630-8f2f-4508-9950-51f9e3738c65'} at 0x7f4eb835e850> timeout=60, duration=60.00 at 0x7f4eb835efd0> task#=1 at 0x7f4edc05ced0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012 self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012 self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012 self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012 self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012 response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012 res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012 result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1102, in connectStorageServer#012 connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012 result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012 return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012 result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012 return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer#012 conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 167, in connect#012 self.validate()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 297, in validate#012 if not self.volinfo:#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 284, in volinfo#012 self._volinfo = self._get_gluster_volinfo()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 329, in _get_gluster_volinfo#012 self._volfileserver)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012 return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>#012 **kwargs)#012File: "<string>", line 2, in glusterVolumeInfo#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012 kind, result = conn.recv()
>MainProcess|jsonrpc/0::ERROR::2019-10-01 13:00:55,244::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
> res = func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
> xmltree = _execGlusterXml(command)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
> return _getTree(rc, out, err)
> File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
> raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
>MainProcess|jsonrpc/0::DEBUG::2019-10-01 13:00:55,247::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7fe92597a7d0>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'

Maybe there is something wrong with the ports that glusterfs is listening to:

tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:49153 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:16514 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:49163 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:49164 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:2223 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:54322 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN

cat /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
end-volume
Can you try executing "gluster volume info" from the hypervisor where the logs in comment 32 are from? Any errors in glusterd logs?
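For reference, something like this (paths are the gluster defaults for a standard install):

# on the hypervisor
gluster volume info
# on the gluster server
tail -n 100 /var/log/glusterfs/glusterd.log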
Took me a while to get back to this. "gluster volume info" gives nothing on the host:

Connection failed. Please check if gluster daemon is operational.

I upgraded the servers to 6.7 but it still fails the same way. I see this on the glusterfs server:

[2020-01-30 14:08:46.647837] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:29.372627] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:43.085746] I [addr.c:54:compare_addr_and_update] 0-/data/ssd5/brick1: allowed = "*", received addr = "*******"
[2020-01-30 14:10:43.085778] I [login.c:110:gf_auth] 0-auth/login: allowed user names: f5fab461-ede5-4183-a19e-305b269eea2f
[2020-01-30 14:10:43.085790] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-ssd5-server: accepted client from CTX_ID:ce82e003-2e05-4364-916b-9fb9cb039294-GRAPH_ID:0-PID:28195-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0 (version: 6.7) with subvol /data/ssd5/brick1
[2020-01-30 14:10:47.790480] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:52.965452] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-ssd5-server: disconnecting connection from CTX_ID:ce82e003-2e05-4364-916b-9fb9cb039294-GRAPH_ID:0-PID:28195-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
[2020-01-30 14:10:52.965661] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-ssd5-server: Shutting down connection CTX_ID:ce82e003-2e05-4364-916b-9fb9cb039294-GRAPH_ID:0-PID:28195-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
[2020-01-30 14:10:54.621507] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:55.837521] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:56.275138] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:11:06.381265] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:12:55.606360] I [addr.c:54:compare_addr_and_update] 0-/data/ssd5/brick1: allowed = "*", received addr = "*******"
[2020-01-30 14:12:55.606386] I [login.c:110:gf_auth] 0-auth/login: allowed user names: f5fab461-ede5-4183-a19e-305b269eea2f
[2020-01-30 14:12:55.606397] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-ssd5-server: accepted client from CTX_ID:b3722fa2-990c-48d9-abc0-420f72c54392-GRAPH_ID:0-PID:23347-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0 (version: 6.7) with subvol /data/ssd5/brick1
[2020-01-30 14:12:55.618474] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-ssd5-server: disconnecting connection from CTX_ID:b3722fa2-990c-48d9-abc0-420f72c54392-GRAPH_ID:0-PID:23347-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
[2020-01-30 14:12:55.618560] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-ssd5-server: Shutting down connection CTX_ID:b3722fa2-990c-48d9-abc0-420f72c54392-GRAPH_ID:0-PID:23347-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
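(One note on the "Connection failed" message: the gluster CLI talks to glusterd on localhost by default, and a hypervisor that is only a gluster client may not run glusterd at all. Pointing the CLI at a storage server, much as vdsm does, separates "no local daemon" from a real connectivity problem; this assumes the --remote-host flag available in these gluster versions:

# query glusterd on the storage server instead of localhost
gluster --remote-host=*.*.*.16 volume info ssd5
# and check the management port is reachable at all
telnet *.*.*.16 24007

If the remote query also times out, the problem is on the network path rather than the local CLI.)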
I also see something weird. The current op-version is 31202 and the max-op-version is 60000:

gluster volume get all cluster.op-version
Option                                  Value
------                                  -----
cluster.op-version                      31202

gluster volume get all cluster.max-op-version
Option                                  Value
------                                  -----
cluster.max-op-version                  60000

But there are still 3.12 clients connected:

----------------------------------------------
Client connections for volume ssd5
----------------------------------------------
Brick : *.*.*.14:/data/ssd5/brick1
Clients connected : 11
Hostname                 BytesRead      BytesWritten   OpVersion
--------                 ---------      ------------   ---------
*.*.*.16:49094           13182330180    1103788832     60000
*.*.*.15:49110           14505195172    1270240536     60000
*.*.*.14:49125           9780313532     387268628      60000
*.*.*.52:49113           2421230060     411692260      31202
*.*.*.48:49115           41319016       27510548       31202
*.*.*.26:49112           8168637332     2729824420     31202
*.*.*.23:49110           2601732380     249970492      31202
*.*.*.24:49096           20354222864    8527392540     31202
*.*.*.21:49087           40880996       27250864       31202
*.*.*.25:49108           18496475704    10863379708    31202
*.*.*.22:49128           10968166124    369379460      31202
----------------------------------------------
Brick : *.*.*.15:/data/ssd5/brick1
Clients connected : 11
Hostname                 BytesRead      BytesWritten   OpVersion
--------                 ---------      ------------   ---------
*.*.*.16:49085           41797760       13564520       60000
*.*.*.15:49119           63473288       15698216812    60000
*.*.*.14:49123           40819324       10113856164    60000
*.*.*.52:49079           2436482772     3960303116     31202
*.*.*.48:49082           48044660       3481611392     31202
*.*.*.23:49083           2645383096     3744524756     31202
*.*.*.26:49111           8178707752     4321726256     31202
*.*.*.24:49060           20361209920    12099313068    31202
*.*.*.25:49082           18492574172    11152990192    31202
*.*.*.21:49086           47554156       3481285424     31202
*.*.*.22:49083           10972024796    3825736592     31202
----------------------------------------------
Brick : *.*.*.16:/data/ssd5/brick1
Clients connected : 11
Hostname                 BytesRead      BytesWritten   OpVersion
--------                 ---------      ------------   ---------
*.*.*.16:49092           57680180       14216217988    60000
*.*.*.15:49117           46014068       14771256       60000
*.*.*.14:49121           2754940588     25494224       60000
*.*.*.52:49085           2423667656     466211046      31202
*.*.*.26:49095           8168987248     1485480938     31202
*.*.*.23:49082           2617245772     14075255362    31202
*.*.*.48:49079           42126712       28841506       31202
*.*.*.21:49085           41689312       28587970       31202
*.*.*.25:49081           18488929016    9410462674     31202
*.*.*.24:49059           20350640484    8561125950     31202
*.*.*.22:49082           10965532484    377943494      31202
----------------------------------------------
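(An op-version of 31202 just means the cluster op-version was never raised after the 3.12 days, which is expected while 3.12 clients are still connected. Once every client in the table above is on 6.x, the usual step is to bump it with the standard glusterd command below; do not run it while old clients still need to connect:

gluster volume set all cluster.op-version 60000
)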
I found some more interesting logs from the time of the upgrade:

[2020-01-30 13:26:06.299777] E [rpc-transport.c:297:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/6.7/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2020-01-30 13:26:06.299790] W [rpc-transport.c:301:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2020-01-30 13:26:06.299799] W [rpcsvc.c:1985:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2020-01-30 13:26:06.299806] E [MSGID: 106244] [glusterd.c:1785:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2020-01-30 13:26:06.301194] I [socket.c:904:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 10
[2020-01-30 13:26:06.302227] I [MSGID: 106513] [glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31202
[2020-01-30 13:26:06.512609] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: f311b00a-745a-47df-b268-35ac22ff9b22
[2020-01-30 13:26:06.544549] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2020-01-30 13:26:06.544636] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2020-01-30 13:26:06.544671] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2020-01-30 13:26:06.544700] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-30 13:26:06.552118] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-30 13:26:06.622036] E [MSGID: 101032] [store.c:447:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.upgrade. [No such file or directory]
[2020-01-30 13:26:06.622061] I [glusterd.c:1999:init] 0-management: Regenerating volfiles due to a max op-version mismatch or glusterd.upgrade file not being present, op_version retrieved:0, max op_version: 60000
[2020-01-30 13:26:06.552111] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2020-01-30 13:26:06.622264] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e65) [0x7f61667c0e65] -->glusterd(glusterfs_sigwaiter+0xe5) [0x55baf40401f5] -->glusterd(cleanup_and_exit+0x6b) [0x55baf404005b] ) 0-: received signum (15), shutting down
[2020-01-30 13:26:06.622307] W [mgmt-pmap.c:133:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload
[2020-01-30 13:28:05.445836] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.7 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2020-01-30 13:28:05.446957] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 12947
[2020-01-31 09:22:44.398186] I [MSGID: 106499] [glusterd-handler.c:4429:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ssd6
[2020-01-31 09:23:10.745234] E [MSGID: 106061] [glusterd-utils.c:10297:glusterd_max_opversion_use_rsp_dict] 0-management: Maximum supported op-version not set in destination dictionary
[2020-01-31 09:23:36.149157] E [MSGID: 106204] [glusterd-utils.c:13043:glusterd_get_global_options_for_all_vols] 0-management: Option with name: cluster.min-op-version does not exist [Invalid argument]
[2020-01-31 09:24:50.010331] E [MSGID: 106061] [glusterd-utils.c:10297:glusterd_max_opversion_use_rsp_dict] 0-management: Maximum supported op-version not set in destination dictionary
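(The rdma errors at the top are glusterd trying to load the rdma transport because glusterd.vol still lists "transport-type socket,rdma" while the rdma module is not installed; harmless on a tcp-only setup, and easy to confirm since the module path comes straight from the log above:

# is the rdma transport module present at all?
ls /usr/lib64/glusterfs/6.7/rpc-transport/rdma.so
# is glusterd configured to load it?
grep transport-type /etc/glusterfs/glusterd.vol
)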
I started a full audit and finally found the problem. We use jumbo frames, and the MTU was set to 1500. I think someone forgot to save the switch config of the storage switches, so it reverted after we did some failure testing after install. I increased the MTU back to 9000 and the gluster domains connect fine now.

To remove the rdma error, I modified /etc/glusterfs/glusterd.vol and removed rdma from "option transport-type socket,rdma".

And I set the tcp and ping timeouts to 10 to get rid of the tcp-user-timeout warning:

gluster volume set <VOLUME> network.ping-timeout "10"
gluster volume set <VOLUME> client.tcp-user-timeout 10
gluster volume set <VOLUME> server.tcp-user-timeout 10

The warning was:

[2020-01-30 13:26:06.552111] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout

The only thing I still worry a bit about is:

[2020-01-30 13:26:06.622061] I [glusterd.c:1999:init] 0-management: Regenerating volfiles due to a max op-version mismatch or glusterd.upgrade file not being present, op_version retrieved:0, max op_version: 60000
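(For anyone hitting the same thing: a don't-fragment ping sized to the expected MTU catches this kind of mismatch quickly. This assumes a Linux host and a 9000-byte target MTU; the interface name is a placeholder:

# 8972 = 9000 MTU - 20 (IP header) - 8 (ICMP header); -M do sets the
# don't-fragment flag, so the ping fails if any hop is still at MTU 1500
ping -c 3 -M do -s 8972 *.*.*.16
# confirm the configured MTU on the storage interface of each host
ip link show dev em1

A fragmentation-needed error from the ping points at a switch or interface that silently dropped back to 1500.)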
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days