Bug 1672318

Summary: "failed to fetch volume file" when trying to activate host in DC with glusterfs 3.12 domains
Product: [Community] GlusterFS
Reporter: Netbulae <info>
Component: glusterd
Assignee: Sanju <srakonde>
Status: CLOSED NOTABUG
QA Contact:
Severity: urgent
Docs Contact:
Priority: high
Version: 5
CC: amukherj, bugs, cshao, godas, info, msobczyk, pasik, sabose, sbonazzo, srakonde
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-06 13:47:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1677160
Bug Blocks: 1672818, 1732875
Attachments:
engine.log
server.log
vdsm.log
supervdsm.log

Description Netbulae 2019-02-04 14:28:57 UTC
Activating a host (ovirt-node-ng-installer-4.3.0-2019020409.el7.iso) fails because it cannot connect to the existing glusterfs 3.12 storage domains.

I can connect to the glusterfs server with telnet and there are no client restrictions on it.

We could try to upgrade to glusterfs 4.1, but as it is not compatible with oVirt 4.2, this would break our old cluster...

[2019-02-04 12:47:02.969362] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5)
[2019-02-04 12:47:02.979349] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-04 12:58:10.232517] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5)
[2019-02-04 12:58:10.242924] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-04 13:02:44.511106] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.16
[2019-02-04 13:02:44.511157] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-02-04 13:02:44.512757] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f58b4ccffbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7f58b4a98e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7f58b4a98f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f58b4a9a531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7f58b4a9b0d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-02-04 12:47:02.979593 (xid=0x2)
[2019-02-04 13:02:44.512779] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:ssd5)
[2019-02-04 13:02:44.512813] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7f58b4a98e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x5641acd22361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5641acd1aceb] ) 0-: received signum (0), shutting down
[2019-02-04 13:02:44.512841] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:02:44.521264] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:02:44.521784] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f58b3b31dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5641acd1ae75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5641acd1aceb] ) 0-: received signum (15), shutting down
[2019-02-04 13:13:51.774986] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.16
[2019-02-04 13:13:51.775037] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-02-04 13:13:51.775928] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f41e4f9ffbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7f41e4d68e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7f41e4d68f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f41e4d6a531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7f41e4d6b0d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-02-04 12:58:10.243172 (xid=0x2)
[2019-02-04 13:13:51.775949] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:ssd5)
[2019-02-04 13:13:51.775982] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7f41e4d68e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x563a96061361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x563a96059ceb] ) 0-: received signum (0), shutting down
[2019-02-04 13:13:51.776008] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:13:51.776214] E [mount-common.c:232:fuse_mnt_umount] 0-glusterfs-fuse: fuse: failed to unmount /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5: No such file or directory
[2019-02-04 13:13:51.776229] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:13:51.776338] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f41e3e01dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x563a96059e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x563a96059ceb] ) 0-: received signum (15), shutting down

Comment 1 Netbulae 2019-02-04 14:29:37 UTC
Created attachment 1526800 [details]
engine.log

Comment 2 Netbulae 2019-02-04 14:30:25 UTC
Created attachment 1526801 [details]
server.log

Comment 3 Netbulae 2019-02-04 14:34:39 UTC
Created attachment 1526802 [details]
vdsm.log

Comment 4 Netbulae 2019-02-06 12:23:45 UTC
gluster volume info
 
Volume Name: hdd2
Type: Replicate
Volume ID: 6d491c15-adf3-4d5c-aade-46f0206f74e5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/hdd2/brick1
Brick2: *.*.*.15:/data/hdd2/brick1
Brick3: *.*.*.16:/data/hdd2/brick1
Options Reconfigured:
server.allow-insecure: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
network.ping-timeout: 30
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
 
Volume Name: sdd8
Type: Replicate
Volume ID: 73b020ba-c63d-47c0-a102-f994f713b5fb
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd8/brick1
Brick2: *.*.*.15:/data/ssd8/brick1
Brick3: *.*.*.16:/data/ssd8/brick1
Options Reconfigured:
server.allow-insecure: on
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
 
Volume Name: ssd3
Type: Replicate
Volume ID: 4b340741-a9bc-46e5-8667-d7f75ec57254
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd3/brick1
Brick2: *.*.*.15:/data/ssd3/brick1
Brick3: *.*.*.16:/data/ssd3/brick1
Options Reconfigured:
server.allow-insecure: on
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
 
Volume Name: ssd4
Type: Replicate
Volume ID: edebdd8f-4cd7-490b-a7a9-3985cc016282
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd4/brick1
Brick2: *.*.*.15:/data/ssd4/brick1
Brick3: *.*.*.16:/data/ssd4/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
 
Volume Name: ssd5
Type: Replicate
Volume ID: 5c5ca8e3-f9dd-4bca-a5d3-28c94d965cca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd5/brick1
Brick2: *.*.*.15:/data/ssd5/brick1
Brick3: *.*.*.16:/data/ssd5/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
 
Volume Name: ssd6
Type: Replicate
Volume ID: 84a554aa-b634-4582-8cc3-75d23ebdcd91
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd6/brick1
Brick2: *.*.*.15:/data/ssd6/brick1
Brick3: *.*.*.16:/data/ssd6/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
 
Volume Name: ssd9
Type: Replicate
Volume ID: 1a379dd8-7e89-4b0b-b6c6-fec45275a9a0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd9/brick1
Brick2: *.*.*.15:/data/ssd9/brick1
Brick3: *.*.*.16:/data/ssd9/brick1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
server.allow-insecure: on

Comment 5 Netbulae 2019-02-06 12:32:02 UTC
All the glusterfs information I can find about "failed to fetch volume file" leads me to "server.allow-insecure on" and "rpc-auth-allow-insecure on", but I have these enabled.

https://bugzilla.redhat.com/show_bug.cgi?id=1248067

https://lists.gluster.org/pipermail/gluster-devel/2012-December/037371.html

After many experiments and inputs I got from Vijay on IRC, this is what finally worked:

1) volume set <volname> server.allow-insecure on
2) option rpc-auth-allow-insecure on in glusterd.vol

If either of the above is not present, I get the same "unable to get volfile" error. I have tried setting virsh (actually libvirt) to spawn the qemu process as the 'root', 'vdsm' and 'qemu' users, and it fails for all three if either of the above is NOT set.

Only when both of the above are set does it work for me, and I am able to start a VM via virsh.
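
For reference, a minimal shell sketch of applying both settings (assuming the stock /etc/glusterfs/glusterd.vol path and that restarting glusterd is acceptable; ssd5 is just an example volume name):

# per volume, on any gluster server
gluster volume set ssd5 server.allow-insecure on
# on every gluster server, inside the "volume management" block of /etc/glusterfs/glusterd.vol:
#     option rpc-auth-allow-insecure on
# then restart glusterd so the glusterd.vol change takes effect
systemctl restart glusterd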

Comment 6 Netbulae 2019-02-06 13:02:22 UTC
Created attachment 1527521 [details]
supervdsm.log

Comment 9 Sandro Bonazzola 2019-02-14 14:00:14 UTC
Moving assignment for this bug to Sanju, who is investigating this on the Gluster community side.
Netbulae, can you please provide the mount logs and brick logs?

Comment 10 Netbulae 2019-02-14 14:45:45 UTC
The mount logs are all filled with the same "failed to fetch volume file" messages as in the first comment; there are no brick logs on the client.

And the brick logs on the glusterfs server contain nothing referring to these hosts.

I can increase the debug info if needed.
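
A minimal sketch of how the log levels could be raised, assuming the standard diagnostics volume options and ssd5 as an example volume (glusterd itself can also be started with --log-level DEBUG):

gluster volume set ssd5 diagnostics.client-log-level DEBUG
gluster volume set ssd5 diagnostics.brick-log-level DEBUG
# and reverted afterwards with
gluster volume reset ssd5 diagnostics.client-log-level
gluster volume reset ssd5 diagnostics.brick-log-level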

Comment 11 Sahina Bose 2019-02-25 08:37:24 UTC
Sanju, is this the same as Bug 1677160?

Comment 12 Sanju 2019-02-25 09:51:17 UTC
(In reply to Sahina Bose from comment #11)
> Sanju, is this the same as Bug 1677160

Sahina, the error messages in the glusterd logs are the same in both bugs. Also, looking at comment 7, I can see that Bug 1677160 was opened to track this from the Gluster side. So both bugs are the same.

Comment 13 Netbulae 2019-02-26 11:06:26 UTC
There must be something unique to our situation. I have time on Thursday to debug further.

Comment 14 Gobinda Das 2019-02-27 10:52:54 UTC
oVirt 4.3.1 has already been released, so moving to oVirt 4.3.2.

Comment 15 Netbulae 2019-03-01 13:36:30 UTC
Still happening. I thought it was because my co-worker had added "server.allow-insecure: on" to the master storage volume but had not restarted the volume.

So I restarted this volume, but it didn't help.

As I was checking the oVirt event log for another issue, I saw a lot of entries like this:

Mar 1, 2019, 1:53:04 PM
VDSM node10 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

Mar 1, 2019, 1:46:03 PM
VDSM node10 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-000000000043, msdUUID=ed888117-094e-46f5-b433-61eb72e242be'

Mar 1, 2019, 1:44:03 PM
VDSM node10 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

Mar 1, 2019, 1:41:55 PM
VDSM gluster5 command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "Attempt to call function: <bound method GlusterHost.storageDevicesList of <vdsm.gluster.apiwrapper.GlusterHost object at 0x7fb70c0527d0>> with arguments: () error: argument of type 'NoneType' is not iterable"}

Mar 1, 2019, 1:41:54 PM
VDSM gluster6 command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "Attempt to call function: <bound method GlusterHost.storageDevicesList of <vdsm.gluster.apiwrapper.GlusterHost object at 0x7fb6284b0750>> with arguments: () error: argument of type 'NoneType' is not iterable"}

Comment 16 Amar Tumballi 2019-03-04 07:30:51 UTC
Hi Netbulae, how can we proceed on this? Is there a possibility we could do some live debugging of the situation? I am 'amarts' on IRC, and we can catch up there to discuss further.

From the latest comments I understand that adding the 'insecure' options (as per comment #5) didn't help.

Comment 17 Netbulae 2019-03-05 12:23:25 UTC
Yes, no problem. I'm back in the office on Thursday from 9:00 CEST to 17:30 CEST.

Comment 18 Atin Mukherjee 2019-03-06 15:27:48 UTC
Sahina, this bug can't block the glusterfs-6 release given that the product is different. We should have a clone in GlusterFS and change this dependency.

Comment 19 Sahina Bose 2019-03-07 08:26:25 UTC
Changing the component, as this bug is already tracked via Bug 1677319 in oVirt.

Comment 20 Atin Mukherjee 2019-03-12 05:06:50 UTC
We tried to reproduce this issue but couldn't hit the same problem. Unless we have a reproducer and the complete logs, especially the client, glusterd and brick log files, it will be hard to debug.

Comment 21 Netbulae 2019-03-18 11:37:04 UTC
Upgraded to node 4.3.2-2019031310.

As stated earlier, there is nothing in the brick and glusterd logs. I've set "diagnostics.client-log-level" to DEBUG but still see nothing.

glustershd.log


[2019-03-18 11:10:54.005050] D [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-hdd2-client-2: remote operation failed [Resource temporarily unavailable]
[2019-03-18 11:10:54.005079] D [MSGID: 0] [client-rpc-fops.c:1511:client3_3_inodelk_cbk] 0-stack-trace: stack-address: 0x7f1254002560, hdd2-client-2 returned -1 error: Resource temporarily unavailable [Resource temporarily unavailable]
[2019-03-18 11:10:55.545357] D [MSGID: 0] [afr-self-heald.c:218:afr_shd_index_inode] 0-hdd2-replicate-0: glusterfs.xattrop_entry_changes_gfid dir gfid for hdd2-client-0: b8f46fe6-ef56-44ac-8586-22061b0f2d5b
[2019-03-18 11:10:55.546606] D [MSGID: 0] [afr-self-heald.c:597:afr_shd_index_healer] 0-hdd2-replicate-0: finished index sweep on subvol hdd2-client-0
[2019-03-18 11:11:24.234360] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-0: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:11:24.234416] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-1: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:11:24.234425] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-2: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:12:46.240335] D [logging.c:1855:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2019-03-18 11:12:46.240388] D [logging.c:1817:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2019-03-18 11:14:09.000515] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f12ab9f7ebb] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7f12ab7c228b] (--> /lib64/libgfrpc.so.0(+0x14a31)[0x7f12ab7c2a31] (--> /lib64/libgfrpc.so.0(rpc_clnt_submit+0x4a1)[0x7f12ab7bf651] (--> /usr/lib64/glusterfs/3.12.15/xlator/protocol/client.so(+0x102e2)[0x7f129e0e02e2] ))))) 0-: 192.168.99.14:49154: ping timer event already removed
[2019-03-18 11:14:09.000619] D [MSGID: 0] [syncop-utils.c:548:syncop_is_subvol_local] 0-ssd4-client-0: subvol ssd4-client-0 is local
[2019-03-18 11:14:09.000637] D [MSGID: 0] [afr-self-heald.c:580:afr_shd_index_healer] 0-ssd4-replicate-0: starting index sweep on subvol ssd4-client-0
[2019-03-18 11:14:09.000635] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-ssd4-client-0: Ping latency is 0ms

glusterfs client log

[2019-03-18 11:03:57.442215] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.14 --volfile-server=*.*.*.15 --volfile-server=*.*.*.16 --volfile-id=/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9)
[2019-03-18 11:03:57.472426] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-18 11:06:03.636977] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f392add8dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x556b04278e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x556b04278ceb] ) 0-: received signum (15), shutting down
[2019-03-18 11:06:03.637039] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:06:03.654263] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:13:29.415760] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.14 --volfile-server=*.*.*.15 --volfile-server=*.*.*.16 --volfile-id=/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9)
[2019-03-18 11:13:29.444824] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14
[2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)
[2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)
[2019-03-18 11:29:01.002550] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7fb4beba4e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x5586d9f3c361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5586d9f34ceb] ) 0-: received signum (0), shutting down
[2019-03-18 11:29:01.002578] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:29:01.009036] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:29:01.009655] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fb4bdc3ddd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5586d9f34e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5586d9f34ceb] ) 0-: received signum (15), shutting down

supervdsm.log

MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:11:29,020::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call volumeInfo with (u'ssd9', u'*.*.*.14') {}
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:11:29,020::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.14 ssd9 --xml (cwd None)
MainProcess|jsonrpc/6::DEBUG::2019-03-18 12:11:31,460::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call ksmTune with ({u'run': 0, u'merge_across_nodes': 1},) {}
MainProcess|jsonrpc/6::DEBUG::2019-03-18 12:11:31,461::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,227::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,227::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 122 seconds (max pending: 1)
MainProcess|jsonrpc/5::ERROR::2019-03-18 12:13:29,227::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
    xmltree = _execGlusterXml(command)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
    return _getTree(rc, out, err)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
    raise ge.GlusterCmdExecFailedException(rc, out, err)
GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,229::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f3877683a10>, u'*.*.*.14:/ssd9', u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.16', 'cgroup': 'vdsm-glusterfs'}
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,230::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.16 *.*.*.14:/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9 (cwd None)

Comment 22 Netbulae 2019-03-20 11:34:26 UTC
(In reply to Amar Tumballi from comment #16)
> Hi Netbulae, How can we proceed on this? Is there a possibility we can do
> some live debug of the situation? I am 'amarts' on IRC, and we can catch up
> there to discuss further.
> 
> From latest comments I understand adding 'insecure' options (as per
> comment#5) didn't help later.

Yes, adding the insecure options and restarting the volumes didn't help.

Comment 23 Sahina Bose 2019-03-26 12:59:37 UTC
Is there anything in Comment 21 that could help narrow this?

Comment 24 Atin Mukherjee 2019-03-26 13:24:04 UTC
[2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14

Why did we get a disconnect? Was the glusterd service at *.14 not running?

[2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)

The above log seems to be the culprit here. 

[2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)

And the above log is the after effect.


I have a few questions:

1. Does the mount fail every time?
2. Do you see any change in the behaviour when the primary volfile server is changed?
3. What are the gluster versions on the individual peers?

(Keeping the needinfo intact for now, but request Sahina to get us these details to work on).

Comment 25 Sahina Bose 2019-03-29 13:46:15 UTC
Redirecting to reporter. Could you answer questions in Comment 24?

Comment 26 Netbulae 2019-04-01 09:37:41 UTC
(In reply to Atin Mukherjee from comment #24)
> [2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14
> 
> Why did we get a disconnect. Was glusterd service at *.14 not running?
> 
> [2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
> [2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (-->
> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (-->
> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (-->
> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (-->
> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced
> unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at
> 2019-03-18 11:13:29.445101 (xid=0x2)
> 
> The above log seems to be the culprit here. 
> 
> [2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk]
> 0-mgmt: failed to fetch volume file (key:/ssd9)
> 
> And the above log is the after effect.
> 
> 
> I have few questions:
> 
> 1. Does the mount fail everytime?

Yes. It also stays the same when we switch the primary storage domain to another one.

> 2. Do you see any change in the behaviour when the primary volfile server is
> changed?

No. I have a different primary volfile server across volumes to spread the load a bit more. The effect is always the same.

> 3. What are the gluster version in the individual peers?

All nodes and servers are on 3.12.15

> 
> (Keeping the needinfo intact for now, but request Sahina to get us these
> details to work on).

Comment 27 Atin Mukherjee 2019-04-01 12:14:53 UTC
Since I'm unable to reproduce this even after multiple attempts, the only way I can make progress is by asking you to test different combinations. I understand that this might be frustrating, but at this point I have no other way to pinpoint this. In my local setup I tried every possible option to simulate this, but had no success.

As I explained in comment 24, it seems the client couldn't get the volfile from the glusterd instance running on *.15. However, since there is no log entry at INFO level in glusterd that could indicate the cause of this failure, can I request you to do the following, if possible? (A rough command sketch of these steps follows the list.)

1. Run 'pkill glusterd; glusterd' on the *.15 node.
2. Attempt to mount the client.
3. Find the 'failed to fetch volume file' failure log and note its timestamp. Map this timestamp in the glusterd log and send us a snippet of the log entries around it.
4. Run the 'gluster v info' command on all the nodes and paste back the output.
5. Provide the output of 'gluster v get all cluster.op-version' from one of the nodes.
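
A rough shell sketch of these steps, assuming step 1 runs on the *.15 node and the rest on the client host, with ssd5 as an example volume and /var/log/glusterfs as the client mount-log directory:

# step 1, on the *.15 node
pkill glusterd; glusterd
# step 2, on the client host
mkdir -p /mnt/ssd5-test
mount -t glusterfs *.*.*.15:/ssd5 /mnt/ssd5-test
# step 3, locate the failure message and its timestamp in the client mount log
grep -n 'failed to fetch volume file' /var/log/glusterfs/*.log
# steps 4 and 5
gluster v info
gluster v get all cluster.op-version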

Comment 28 Netbulae 2019-04-09 10:43:38 UTC
(In reply to Atin Mukherjee from comment #27)

I tried the sequence you posted, but currently we have *.16 and ssd5 as the master storage domain, as I removed ssd9.

I also upgraded to oVirt Node 4.3.3 RC2 because of several fixes.

But I don't get the 'failed to fetch volume file' error this way; it appears the backup-volfile-servers option is not working:

>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,186::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.16 ssd5 --xml (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,297::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:10:09,297::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
>    xmltree = _execGlusterXml(command)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
>    return _getTree(rc, out, err)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
>    raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Connection failed. Please check if gluster daemon is operational.\n' err=''
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,298::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=1*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'}

When I leave the glusterfs process running, I also no longer get it in 4.3.3, but I do get "'Error : Request timed out\n' err=''" and "MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n')".


>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:25:54,918::commands::219::root::(execCmd) FAILED: <err> = 'Running scope as unit run-10178.scope.\nMount failed. Please check the log file for more details.\n'; <rc> = 1
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:25:54,918::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 765 seconds (max pending: 2)
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:25:54,918::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in mount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 143, in mount
>    cgroup=cgroup)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 277, in _mount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (1, ';Running scope as unit run-10178.scope.\nMount failed. Please check the log file for more details.\n')
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,720::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,720::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 294 seconds (max pending: 2)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,802::routes::110::root::(get_gateway) The gateway *.*.*.3 is duplicated for the device ovirtmgmt
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,807::routes::110::root::(get_gateway) The gateway *.*.*.3 is duplicated for the device ovirtmgmt
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,809::cmdutils::133::root::(exec_cmd) /sbin/tc qdisc show (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,860::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,866::cmdutils::133::root::(exec_cmd) /sbin/tc class show dev bond0 classid 1389:1388 (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,919::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,005::vsctl::68::root::(commit) Executing commands: /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,006::cmdutils::133::root::(exec_cmd) /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,038::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
>
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,472::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/ovirt.netbulae.mgmt:_var_lib_exports_iso (cwd None)
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,502::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,503::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,529::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/*.*.*.15:_data_ovirt') >{'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,529::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/*.*.*.15:_data_ovirt (cwd None)
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,595::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,596::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,738::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,738::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/libexec/vdsm/fc-scan (cwd None)
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,804::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,841::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,841::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,849::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,850::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,850::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,850::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,875::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,876::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,889::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,889::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,895::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,895::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,895::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,896::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,902::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,902::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,903::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,903::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,910::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,910::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,911::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,911::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,917::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,917::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,042::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,042::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/libexec/vdsm/fc-scan (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,107::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None
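
To separate the vdsm-side timeout from a gluster-side failure, the same mount that vdsm issues can also be attempted by hand; a sketch, using the masked addresses from the logs above and an arbitrary test mount point:

mkdir -p /mnt/ssd5-test
mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.14 *.*.*.16:/ssd5 /mnt/ssd5-test
# the client log for this mount is typically named after the mount point
tail -n 50 /var/log/glusterfs/mnt-ssd5-test.log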

Comment 29 Netbulae 2019-04-09 10:45:21 UTC
I cannot find anything related in the glusterd log.

gluster v get all cluster.op-version
Option                                  Value                                   
------                                  -----                                   
cluster.op-version                      31202
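
For completeness, the installed version and the highest op-version each peer supports can be checked alongside this; a sketch, assuming gluster 3.10 or newer on the servers:

gluster --version
gluster volume get all cluster.max-op-version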

Comment 30 Netbulae 2019-04-10 08:06:11 UTC
The volume info is in comment 4.

I see now that I get the same with the brick running:

>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:51:27,437::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call volumeInfo with (u'ssd5', u'*.*.*.16') {}
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:51:27,437::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.16 ssd5 --xml (cwd None)
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,553::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,553::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 123 seconds (max pending: 1)
>MainProcess|jsonrpc/0::ERROR::2019-04-10 09:53:27,553::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
>    xmltree = _execGlusterXml(command)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
>    return _getTree(rc, out, err)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
>    raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,555::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'}
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,555::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.14 *.*.*.16:ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5 (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-10 09:54:27,765::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}

But the port is reachable from the node:

>Brick *.*.*.16:/data/ssd5/brick1       49156     0          Y       15429

>[root@node9 ~]# telnet *.*.*.16 49156
>Trying *.*.*.16...
>Connected to *.*.*.16.
>Escape character is '^]'.
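
The GETSPEC request that fails in the client log is served by glusterd rather than by the brick process, so the more relevant reachability check would be against the glusterd management port (24007 by default); a sketch:

telnet *.*.*.16 24007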


And in /var/log/messages:

>Apr 10 09:56:27 node9 vdsm[56358]: WARN Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=*.*.*.15:*.*.*.14', u'id': u'b036005a-d44d-4689-a8c3-13e1bbf55af7', u'connection': u'*.*.*.16:ssd5', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': >u'00000001-0001-0001-0001-000000000043', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'fbedfc27-e064-48d5-b332-daeeccdd1cf4'} at 0x7f77c4589210> timeout=60, duration=300.00 at 0x7f77c4589810> >task#=674 at 0x7f77e406f790>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  >self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012  ret = func(*args, >**kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012  self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012  response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012  res = >method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012  result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1095, in connectStorageServer#012  >connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012  result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012  return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012  result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012  return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method#012  ret = func(*args, **kwargs)#012File: >"/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2411, in connectStorageServer#012  conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 172, in connect#012  >self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount#012  cgroup=cgroup)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012  return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>#012  **kwargs)#012File: "<string>", line 2, in mount#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012  kind, result = conn.recv()
>Apr 10 09:56:27 node9 vdsm[56358]: WARN Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {u'masterVersion': 10, u'domainDict': {u'33f1ba00-6a16-4e58-b4c5-94426f1c4482': u'active', u'782a61af-a520-44c4-8845-74bf92888552': u'active', u'b30bab9d-9a66-44ce-ad17-2eb4ee858d8f': u'active', u'09959920-a31b-42c2-a547-e50b73602c96': u'active', u'f5d2f7c6-093f-46d6-a844-224d92db5ef9': u'active', u'515166a4-4b2f-402d-bf37-95f3c59635cb': >u'active', u'1ed0a635-67ee-4255-aad9-b70822350706': u'active', u'95b4e5d2-2974-4d5f-91e4-351f75a15435': u'active', u'84f2ff4a-2ec5-42dc-807d-bd12745f387d': u'active'}, u'storagepoolID': u'00000001-0001-0001-0001-000000000043', u'scsiKey': >u'00000001-0001-0001-0001-000000000043', u'masterSdUUID': u'09959920-a31b-42c2-a547-e50b73602c96', u'hostID': 12}, 'jsonrpc': '2.0', 'method': u'StoragePool.connect', 'id': u'95545e2f-6b72-4107-bc19-8de3e549aba0'} at 0x7f77c458d310> >timeout=60, duration=120.01 at 0x7f77c458d0d0> task#=677 at 0x7f77e406f990>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", >line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line >195, in >run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  >task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012  self._handler(self._ctx, >self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012  response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012  res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012  result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1091, in >connect#012  self._UUID, hostID, masterSdUUID, masterVersion, domainDict)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012  result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012  return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012  result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012  return fn(*args, **kargs)#012File: "<string>", line 2, in connectStoragePool#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", >line 50, in method#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1034, in connectStoragePool#012  spUUID, hostID, msdUUID, masterVersion, domainsMap)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1096, in _connectStoragePool#012  res = pool.connect(hostID, msdUUID, masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect#012  >self.__rebuild(msdUUID=msdUUID, 
masterVersion=masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild#012  self.setMasterDomain(msdUUID, masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1491, in setMasterDomain#012  domain = sdCache.produce(msdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce#012  domain.getRealDomain()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain#012  return self._cache._realProduce(self._sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce#012  domain = >self._findDomain(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain#012  return findMethod(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in >_findUnfetchedDomain#012  return mod.findDomain(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 146, in findDomain#012  return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 130, in findDomainPath#012  for tmpSdUUID, domainPath in fileSD.scanDomains("*"):#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 893, in scanDomains#012  for >res in misc.itmap(collectMetaFiles, mntList, oop.HELPERS_PER_DOMAIN):#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/misc.py", line 538, in itmap#012  yield respQueue.get()#012File: "/usr/lib64/python2.7/Queue.py", line 168, in >get#012  self.not_empty.wait()#012File: "/usr/lib/python2.7/site-packages/pthreading.py", line 127, in wait#012  return self.__cond.wait()#012File: "/usr/lib/python2.7/site-packages/pthread.py", line 131, in wait#012  return >_libpthread.pthread_cond_wait(self._cond, m.mutex())

Comment 31 Sahina Bose 2019-09-26 14:01:50 UTC
Circling back on this - do you still see the same issues with the latest gluster 6 release? Checking, since we haven't been able to reproduce this internally.

Comment 32 Netbulae 2019-10-01 11:37:17 UTC
I still have it, with slightly different output.

>Oct  1 12:59:54 vh10en vdsm[9365]: WARN Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=*.*.*.15:*.*.*.14', u'id': u'b036005a-d44d-4689-a8c3-13e1bbf55af7', u'connection': u'*.*.*.16:ssd5', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': u'00000001-0001-0001-0001-000000000043', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'94712630-8f2f-4508-9950-51f9e3738c65'} at 0x7f4eb835e850> timeout=60, duration=60.00 at 0x7f4eb835efd0> task#=1 at 0x7f4edc05ced0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012  self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012  response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012  res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012  result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1102, in connectStorageServer#012  connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012  result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012  return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012  result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012  return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer#012  conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 167, in connect#012  self.validate()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 297, in validate#012  if not self.volinfo:#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 284, in volinfo#012  self._volinfo = self._get_gluster_volinfo()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 329, in _get_gluster_volinfo#012  self._volfileserver)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012  return callMethod()#012File: 
"/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>#012  **kwargs)#012File: "<string>", line 2, in glusterVolumeInfo#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012  kind, result = conn.recv()


>MainProcess|jsonrpc/0::ERROR::2019-10-01 13:00:55,244::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
>    xmltree = _execGlusterXml(command)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
>    return _getTree(rc, out, err)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
>    raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
>MainProcess|jsonrpc/0::DEBUG::2019-10-01 13:00:55,247::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7fe92597a7d0>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'

Maybe there is something wrong with the ports that glusterfs is listening on (a quick connectivity check is sketched below the listing):

tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49153           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49154           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:16514           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49163           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49164           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:2223            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:54322           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
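
For reference: 24007 is the glusterd management port and 49152 and up are brick ports, so the listeners above look normal. A quick check from the hypervisor that both are actually reachable (assuming nc is installed; <gluster-server> is just a placeholder for one of the storage node IPs) would be something like:

  nc -zv <gluster-server> 24007
  nc -zv <gluster-server> 49152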

cat /etc/glusterfs/glusterd.vol 
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
end-volume

Comment 33 Sahina Bose 2019-10-11 10:22:43 UTC
Can you try executing "gluster volume info" from the hypervisor where the logs in comment 32 are from?
Any errors in glusterd logs?

Comment 34 Netbulae 2020-01-30 15:17:20 UTC
It took me a while to get back to this.

"gluster volume info" gives no volume information on the host, only:

Connection failed. Please check if gluster daemon is operational.
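
Note: "gluster volume info" talks to the local glusterd by default, so if glusterd isn't running on this hypervisor the error above is expected. Querying one of the storage servers directly is probably the more telling check (<gluster-server> below is a placeholder for a storage node IP):

  gluster --remote-host=<gluster-server> volume info ssd5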

I upgraded the servers to 6.7 but it still fails the same way.

I see this on the glusterfs server:

[2020-01-30 14:08:46.647837] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:29.372627] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:43.085746] I [addr.c:54:compare_addr_and_update] 0-/data/ssd5/brick1: allowed = "*", received addr = "*******"
[2020-01-30 14:10:43.085778] I [login.c:110:gf_auth] 0-auth/login: allowed user names: f5fab461-ede5-4183-a19e-305b269eea2f
[2020-01-30 14:10:43.085790] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-ssd5-server: accepted client from CTX_ID:ce82e003-2e05-4364-916b-9fb9cb039294-GRAPH_ID:0-PID:28195-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0 (version: 6.7) with subvol /data/ssd5/brick1
[2020-01-30 14:10:47.790480] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:52.965452] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-ssd5-server: disconnecting connection from CTX_ID:ce82e003-2e05-4364-916b-9fb9cb039294-GRAPH_ID:0-PID:28195-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
[2020-01-30 14:10:52.965661] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-ssd5-server: Shutting down connection CTX_ID:ce82e003-2e05-4364-916b-9fb9cb039294-GRAPH_ID:0-PID:28195-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
[2020-01-30 14:10:54.621507] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:55.837521] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:10:56.275138] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:11:06.381265] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/features/worm.so(+0x7281) [0x7f6329b56281] -->/usr/lib64/glusterfs/6.7/xlator/features/locks.so(+0x1c259) [0x7f6329d7c259] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f63399a5254] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2020-01-30 14:12:55.606360] I [addr.c:54:compare_addr_and_update] 0-/data/ssd5/brick1: allowed = "*", received addr = "*******"
[2020-01-30 14:12:55.606386] I [login.c:110:gf_auth] 0-auth/login: allowed user names: f5fab461-ede5-4183-a19e-305b269eea2f
[2020-01-30 14:12:55.606397] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-ssd5-server: accepted client from CTX_ID:b3722fa2-990c-48d9-abc0-420f72c54392-GRAPH_ID:0-PID:23347-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0 (version: 6.7) with subvol /data/ssd5/brick1
[2020-01-30 14:12:55.618474] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-ssd5-server: disconnecting connection from CTX_ID:b3722fa2-990c-48d9-abc0-420f72c54392-GRAPH_ID:0-PID:23347-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0
[2020-01-30 14:12:55.618560] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-ssd5-server: Shutting down connection CTX_ID:b3722fa2-990c-48d9-abc0-420f72c54392-GRAPH_ID:0-PID:23347-HOST:*******-PC_NAME:ssd5-client-0-RECON_NO:-0

Comment 35 Netbulae 2020-01-31 09:41:55 UTC
I also see something weird: the current op-version is 31202 while the max-op-version is 60000.

gluster volume get all cluster.op-version
Option                                  Value                                   
------                                  -----                                   
cluster.op-version                      31202                                   

gluster volume get all cluster.max-op-version
Option                                  Value                                   
------                                  -----                                   
cluster.max-op-version                  60000  

But there are still 3.12 clients connected:

----------------------------------------------

Client connections for volume ssd5
----------------------------------------------
Brick : *.*.*.14:/data/ssd5/brick1
Clients connected : 11
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
*.*.*.16:49094                                  13182330180      1103788832           60000
*.*.*.15:49110                                  14505195172      1270240536           60000
*.*.*.14:49125                                   9780313532       387268628           60000
*.*.*.52:49113                                   2421230060       411692260           31202
*.*.*.48:49115                                     41319016        27510548           31202
*.*.*.26:49112                                   8168637332      2729824420           31202
*.*.*.23:49110                                   2601732380       249970492           31202
*.*.*.24:49096                                  20354222864      8527392540           31202
*.*.*.21:49087                                     40880996        27250864           31202
*.*.*.25:49108                                  18496475704     10863379708           31202
*.*.*.22:49128                                  10968166124       369379460           31202
----------------------------------------------
Brick : *.*.*.15:/data/ssd5/brick1
Clients connected : 11
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
*.*.*.16:49085                                     41797760        13564520           60000
*.*.*.15:49119                                     63473288     15698216812           60000
*.*.*.14:49123                                     40819324     10113856164           60000
*.*.*.52:49079                                   2436482772      3960303116           31202
*.*.*.48:49082                                     48044660      3481611392           31202
*.*.*.23:49083                                   2645383096      3744524756           31202
*.*.*.26:49111                                   8178707752      4321726256           31202
*.*.*.24:49060                                  20361209920     12099313068           31202
*.*.*.25:49082                                  18492574172     11152990192           31202
*.*.*.21:49086                                     47554156      3481285424           31202
*.*.*.22:49083                                  10972024796      3825736592           31202
----------------------------------------------
Brick : *.*.*.16:/data/ssd5/brick1
Clients connected : 11
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
*.*.*.16:49092                                     57680180     14216217988           60000
*.*.*.15:49117                                     46014068        14771256           60000
*.*.*.14:49121                                   2754940588        25494224           60000
*.*.*.52:49085                                   2423667656       466211046           31202
*.*.*.26:49095                                   8168987248      1485480938           31202
*.*.*.23:49082                                   2617245772     14075255362           31202
*.*.*.48:49079                                     42126712        28841506           31202
*.*.*.21:49085                                     41689312        28587970           31202
*.*.*.25:49081                                  18488929016      9410462674           31202
*.*.*.24:49059                                  20350640484      8561125950           31202
*.*.*.22:49082                                  10965532484       377943494           31202
----------------------------------------------

Comment 36 Netbulae 2020-02-04 11:25:04 UTC
I found some more interesting logs from the time of the upgrade:

[2020-01-30 13:26:06.299777] E [rpc-transport.c:297:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/6.7/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2020-01-30 13:26:06.299790] W [rpc-transport.c:301:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2020-01-30 13:26:06.299799] W [rpcsvc.c:1985:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2020-01-30 13:26:06.299806] E [MSGID: 106244] [glusterd.c:1785:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2020-01-30 13:26:06.301194] I [socket.c:904:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 10
[2020-01-30 13:26:06.302227] I [MSGID: 106513] [glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31202
[2020-01-30 13:26:06.512609] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: f311b00a-745a-47df-b268-35ac22ff9b22
[2020-01-30 13:26:06.544549] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2020-01-30 13:26:06.544636] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2020-01-30 13:26:06.544671] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2020-01-30 13:26:06.544700] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-30 13:26:06.552118] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-30 13:26:06.622036] E [MSGID: 101032] [store.c:447:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.upgrade. [No such file or directory]
[2020-01-30 13:26:06.622061] I [glusterd.c:1999:init] 0-management: Regenerating volfiles due to a max op-version mismatch or glusterd.upgrade file not being present, op_version retrieved:0, max op_version: 60000
[2020-01-30 13:26:06.552111] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2020-01-30 13:26:06.622264] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e65) [0x7f61667c0e65] -->glusterd(glusterfs_sigwaiter+0xe5) [0x55baf40401f5] -->glusterd(cleanup_and_exit+0x6b) [0x55baf404005b] ) 0-: received signum (15), shutting down
[2020-01-30 13:26:06.622307] W [mgmt-pmap.c:133:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload
[2020-01-30 13:28:05.445836] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.7 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2020-01-30 13:28:05.446957] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 12947


[2020-01-31 09:22:44.398186] I [MSGID: 106499] [glusterd-handler.c:4429:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ssd6
[2020-01-31 09:23:10.745234] E [MSGID: 106061] [glusterd-utils.c:10297:glusterd_max_opversion_use_rsp_dict] 0-management: Maximum supported op-version not set in destination dictionary
[2020-01-31 09:23:36.149157] E [MSGID: 106204] [glusterd-utils.c:13043:glusterd_get_global_options_for_all_vols] 0-management: Option with name: cluster.min-op-version does not exist [Invalid argument]
[2020-01-31 09:24:50.010331] E [MSGID: 106061] [glusterd-utils.c:10297:glusterd_max_opversion_use_rsp_dict] 0-management: Maximum supported op-version not set in destination dictionary

Comment 37 Netbulae 2020-02-06 13:47:28 UTC
I started a full audit and finally found the problem.

We use jumbo frames, but the MTU was set to 1500.

I think someone forgot to save the config of the storage switches, and it reverted after we did some failure testing following the install.

So I increased the MTU to 9000, and the gluster domains connect fine now.
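
For anyone hitting the same symptom: a quick way to confirm jumbo frames actually pass end-to-end is a don't-fragment ping sized to fill a 9000-byte MTU (8972 bytes of ICMP payload plus 20 bytes IPv4 header and 8 bytes ICMP header). <storage-interface> and <gluster-server> are placeholders here:

  ip link show <storage-interface>
  ping -M do -s 8972 -c 3 <gluster-server>

If any hop in the path doesn't carry jumbo frames, the ping either reports an MTU error or simply gets no replies.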

To remove the rdma error, I modified /etc/glusterfs/glusterd.vol and removed "rdma" from "option transport-type socket,rdma".
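
The transport line in that file now simply reads (everything else in the management volume block shown above is unchanged):

  option transport-type socket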

I also set the TCP and ping timeouts to 10 seconds:

  gluster volume set <VOLUME> network.ping-timeout "10"
  gluster volume set <VOLUME> client.tcp-user-timeout 10
  gluster volume set <VOLUME> server.tcp-user-timeout 10
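
The new values can be verified per volume afterwards, e.g.:

  gluster volume get <VOLUME> network.ping-timeout
  gluster volume get <VOLUME> server.tcp-user-timeout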

To get rid of the tcp-user-timeout warning:

[2020-01-30 13:26:06.552111] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout

The only thing I'm still a bit worried about is:

[2020-01-30 13:26:06.622061] I [glusterd.c:1999:init] 0-management: Regenerating volfiles due to a max op-version mismatch or glusterd.upgrade file not being present, op_version retrieved:0, max op_version: 60000
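
As far as I understand it, that message just means glusterd regenerated the volfiles during the upgrade because /var/lib/glusterd/glusterd.upgrade was missing (hence "op_version retrieved:0"), so it should be harmless. Separately, once the remaining 3.12 clients from comment 35 are upgraded, the cluster op-version could presumably be raised from 31202 with:

  gluster volume set all cluster.op-version 60000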

Comment 38 Red Hat Bugzilla 2023-09-14 04:46:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days