Bug 1672318 - "failed to fetch volume file" when trying to activate host in DC with glusterfs 3.12 domains [NEEDINFO]
Summary: "failed to fetch volume file" when trying to activate host in DC with gluster...
Keywords:
Status: NEW
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 5
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On: 1677160
Blocks: Gluster_5_Affecting_oVirt_4.3 1732875 glusterfs-6.0
 
Reported: 2019-02-04 14:28 UTC by Netbulae
Modified: 2019-10-11 10:22 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
sabose: needinfo? (info)


Attachments
engine.log (17.03 MB, text/plain), 2019-02-04 14:29 UTC, Netbulae
server.log (50.48 KB, text/plain), 2019-02-04 14:30 UTC, Netbulae
vdsm.log (5.32 MB, text/plain), 2019-02-04 14:34 UTC, Netbulae
supervdsm.log (5.70 MB, text/plain), 2019-02-06 13:02 UTC, Netbulae

Description Netbulae 2019-02-04 14:28:57 UTC
Activating a host (ovirt-node-ng-installer-4.3.0-2019020409.el7.iso) fails because it cannot connect to the existing glusterfs 3.12 storage domains.

I can connect to the glusterfs server with telnet and there are no client restrictions on it.

We could try to upgrade to glusterfs 4.1, but as it's not compatible with oVirt 4.2, this would break our old cluster...
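Telnet only proves the TCP port accepts connections. A minimal sketch for checking the glusterd management port (24007 by default) on each volfile server from the host; the host arguments are placeholders, since the real addresses are masked in this report:

```shell
#!/bin/bash
# Probe the glusterd management port on a list of hosts.
# Hostnames passed on the command line stand in for the masked *.14/*.15/*.16 servers.
check_glusterd_port() {
    local host=$1 port=${2:-24007}
    # /dev/tcp is a bash pseudo-device; timeout bounds the connection attempt
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} open"
    else
        echo "${host}:${port} closed"
    fi
}

for h in "$@"; do
    check_glusterd_port "$h"
done
```

Note that a port reported "open" here can still fail the GETSPEC handshake at the RPC layer, which is what the client logs below show.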

[2019-02-04 12:47:02.969362] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5)
[2019-02-04 12:47:02.979349] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-04 12:58:10.232517] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.16 --volfile-server=*.*.*.15 --volfile-server=*.*.*.14 --volfile-id=ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5)
[2019-02-04 12:58:10.242924] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-04 13:02:44.511106] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.16
[2019-02-04 13:02:44.511157] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-02-04 13:02:44.512757] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f58b4ccffbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7f58b4a98e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7f58b4a98f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f58b4a9a531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7f58b4a9b0d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-02-04 12:47:02.979593 (xid=0x2)
[2019-02-04 13:02:44.512779] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:ssd5)
[2019-02-04 13:02:44.512813] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7f58b4a98e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x5641acd22361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5641acd1aceb] ) 0-: received signum (0), shutting down
[2019-02-04 13:02:44.512841] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:02:44.521264] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:02:44.521784] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f58b3b31dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5641acd1ae75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5641acd1aceb] ) 0-: received signum (15), shutting down
[2019-02-04 13:13:51.774986] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.16
[2019-02-04 13:13:51.775037] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-02-04 13:13:51.775928] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f41e4f9ffbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7f41e4d68e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7f41e4d68f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f41e4d6a531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7f41e4d6b0d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-02-04 12:58:10.243172 (xid=0x2)
[2019-02-04 13:13:51.775949] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:ssd5)
[2019-02-04 13:13:51.775982] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7f41e4d68e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x563a96061361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x563a96059ceb] ) 0-: received signum (0), shutting down
[2019-02-04 13:13:51.776008] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:13:51.776214] E [mount-common.c:232:fuse_mnt_umount] 0-glusterfs-fuse: fuse: failed to unmount /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5: No such file or directory
[2019-02-04 13:13:51.776229] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5'.
[2019-02-04 13:13:51.776338] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f41e3e01dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x563a96059e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x563a96059ceb] ) 0-: received signum (15), shutting down

Comment 1 Netbulae 2019-02-04 14:29:37 UTC
Created attachment 1526800 [details]
engine.log

Comment 2 Netbulae 2019-02-04 14:30:25 UTC
Created attachment 1526801 [details]
server.log

Comment 3 Netbulae 2019-02-04 14:34:39 UTC
Created attachment 1526802 [details]
vdsm.log

Comment 4 Netbulae 2019-02-06 12:23:45 UTC
gluster volume info
 
Volume Name: hdd2
Type: Replicate
Volume ID: 6d491c15-adf3-4d5c-aade-46f0206f74e5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/hdd2/brick1
Brick2: *.*.*.15:/data/hdd2/brick1
Brick3: *.*.*.16:/data/hdd2/brick1
Options Reconfigured:
server.allow-insecure: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
network.ping-timeout: 30
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
 
Volume Name: sdd8
Type: Replicate
Volume ID: 73b020ba-c63d-47c0-a102-f994f713b5fb
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd8/brick1
Brick2: *.*.*.15:/data/ssd8/brick1
Brick3: *.*.*.16:/data/ssd8/brick1
Options Reconfigured:
server.allow-insecure: on
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
 
Volume Name: ssd3
Type: Replicate
Volume ID: 4b340741-a9bc-46e5-8667-d7f75ec57254
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd3/brick1
Brick2: *.*.*.15:/data/ssd3/brick1
Brick3: *.*.*.16:/data/ssd3/brick1
Options Reconfigured:
server.allow-insecure: on
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
 
Volume Name: ssd4
Type: Replicate
Volume ID: edebdd8f-4cd7-490b-a7a9-3985cc016282
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd4/brick1
Brick2: *.*.*.15:/data/ssd4/brick1
Brick3: *.*.*.16:/data/ssd4/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
 
Volume Name: ssd5
Type: Replicate
Volume ID: 5c5ca8e3-f9dd-4bca-a5d3-28c94d965cca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd5/brick1
Brick2: *.*.*.15:/data/ssd5/brick1
Brick3: *.*.*.16:/data/ssd5/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
 
Volume Name: ssd6
Type: Replicate
Volume ID: 84a554aa-b634-4582-8cc3-75d23ebdcd91
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd6/brick1
Brick2: *.*.*.15:/data/ssd6/brick1
Brick3: *.*.*.16:/data/ssd6/brick1
Options Reconfigured:
cluster.granular-entry-heal: enable
features.shard-block-size: 512MB
server.event-threads: 1
network.ping-timeout: 10
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
 
Volume Name: ssd9
Type: Replicate
Volume ID: 1a379dd8-7e89-4b0b-b6c6-fec45275a9a0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: *.*.*.14:/data/ssd9/brick1
Brick2: *.*.*.15:/data/ssd9/brick1
Brick3: *.*.*.16:/data/ssd9/brick1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
network.ping-timeout: 30
server.allow-insecure: on

Comment 5 Netbulae 2019-02-06 12:32:02 UTC
All the glusterfs information I can find about "failed to fetch volume file" leads me to "server.allow-insecure on" and "rpc-auth-allow-insecure on", but I have these enabled.

https://bugzilla.redhat.com/show_bug.cgi?id=1248067

https://lists.gluster.org/pipermail/gluster-devel/2012-December/037371.html

After doing many experiments and following inputs I got from Vijay on IRC, this is what finally worked:

1) volume set <volname> server.allow-insecure on
2) option rpc-auth-allow-insecure on in glusterd.vol

If either of the above is not present, I get the same "unable to get volfile" error. I have tried setting virsh (actually libvirt) to spawn the qemu process as the 'root', 'vdsm' and 'qemu' user, and it fails for all three if either of the above is NOT set.

Only when both of the above are set does it work for me, and I am able to start a VM via virsh.
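For reference, a sketch of the two settings the quoted thread describes. The volume name is an example, the commands run on the gluster servers, and the volume must be restarted for the per-volume option to take effect:

```shell
# 1) Per-volume option (for a hypothetical volume "ssd5"):
gluster volume set ssd5 server.allow-insecure on
gluster volume stop ssd5 && gluster volume start ssd5

# 2) glusterd-level option: add the following line to the
#    "volume management" section of /etc/glusterfs/glusterd.vol,
#    then restart glusterd:
#
#        option rpc-auth-allow-insecure on
#
systemctl restart glusterd
```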

Comment 6 Netbulae 2019-02-06 13:02:22 UTC
Created attachment 1527521 [details]
supervdsm.log

Comment 9 Sandro Bonazzola 2019-02-14 14:00:14 UTC
Moving assignment for this bug to Sanju who's investigating this on Gluster community side.
Netbulae can you please provide mount logs and bricks logs?

Comment 10 Netbulae 2019-02-14 14:45:45 UTC
The mount logs are all filled with the same "failed to fetch volume file" errors as in the first comment; there are no brick logs on the client.

And the brick logs on the glusterfs servers contain nothing referring to these hosts.

I can increase the debug level if needed.

Comment 11 Sahina Bose 2019-02-25 08:37:24 UTC
Sanju, is this the same as Bug 1677160

Comment 12 Sanju 2019-02-25 09:51:17 UTC
(In reply to Sahina Bose from comment #11)
> Sanju, is this the same as Bug 1677160

Sahina, the error messages in the glusterd logs are the same in both bugs. Also, looking at comment 7, I can see that Bug 1677160 was opened to track it from the gluster side. So both bugs are the same.

Comment 13 Netbulae 2019-02-26 11:06:26 UTC
There must be something unique to our situation. I have time Thursday to debug further.

Comment 14 Gobinda Das 2019-02-27 10:52:54 UTC
oVirt 4.3.1 has already been released, so moving to ovirt-4.3.2.

Comment 15 Netbulae 2019-03-01 13:36:30 UTC
Still happening. I thought it was because my co-worker had added "server.allow-insecure: on" to the master storage volume but didn't restart the volume.

So I restarted this volume, but it didn't help.

As I was checking the oVirt event log for another issue, I see a lot of entries like this:

Mar 1, 2019, 1:53:04 PM
VDSM node10 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

Mar 1, 2019, 1:46:03 PM
VDSM node10 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-000000000043, msdUUID=ed888117-094e-46f5-b433-61eb72e242be'

Mar 1, 2019, 1:44:03 PM
VDSM node10 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

Mar 1, 2019, 1:41:55 PM
VDSM gluster5 command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "Attempt to call function: <bound method GlusterHost.storageDevicesList of <vdsm.gluster.apiwrapper.GlusterHost object at 0x7fb70c0527d0>> with arguments: () error: argument of type 'NoneType' is not iterable"}

Mar 1, 2019, 1:41:54 PM
VDSM gluster6 command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "Attempt to call function: <bound method GlusterHost.storageDevicesList of <vdsm.gluster.apiwrapper.GlusterHost object at 0x7fb6284b0750>> with arguments: () error: argument of type 'NoneType' is not iterable"}

Comment 16 Amar Tumballi 2019-03-04 07:30:51 UTC
Hi Netbulae, how can we proceed on this? Is there a possibility we can do some live debugging of the situation? I am 'amarts' on IRC, and we can catch up there to discuss further.

From the latest comments I understand that adding the 'insecure' options (as per comment 5) didn't help.

Comment 17 Netbulae 2019-03-05 12:23:25 UTC
Yes no problem, I'm back in the office on Thursday from 9:00 CEST to 17:30 CEST.

Comment 18 Atin Mukherjee 2019-03-06 15:27:48 UTC
Sahina - this bug can't block the glusterfs-6 release given the product is different. We should have a clone at glusterfs and change this dependency.

Comment 19 Sahina Bose 2019-03-07 08:26:25 UTC
Changing the component - as this bug is already tracked via Bug 1677319 in oVirt

Comment 20 Atin Mukherjee 2019-03-12 05:06:50 UTC
We tried to reproduce this issue but couldn't hit the same problem. Unless we have a reproducer and the complete logs, especially the client, glusterd, and brick log files, it will be hard to debug.

Comment 21 Netbulae 2019-03-18 11:37:04 UTC
Upgraded to node 4.3.2-2019031310.

As stated earlier, there is nothing in the brick and glusterd logs. I've set "diagnostics.client-log-level" to DEBUG but still see nothing.
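For anyone retracing this, a sketch of raising the relevant debug levels; the volume name is an example, and since the client here runs 5.3 while the servers run 3.12.15, the option names should be double-checked against the server version:

```shell
# Client-side (FUSE mount) logging for a hypothetical volume "ssd5":
gluster volume set ssd5 diagnostics.client-log-level DEBUG

# Brick-side logging on the servers:
gluster volume set ssd5 diagnostics.brick-log-level DEBUG

# glusterd itself can be started at debug level (after stopping the service):
glusterd --log-level DEBUG
```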

glustershd.log


[2019-03-18 11:10:54.005050] D [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-hdd2-client-2: remote operation failed [Resource temporarily unavailable]
[2019-03-18 11:10:54.005079] D [MSGID: 0] [client-rpc-fops.c:1511:client3_3_inodelk_cbk] 0-stack-trace: stack-address: 0x7f1254002560, hdd2-client-2 returned -1 error: Resource temporarily unavailable [Resource temporarily unavailable]
[2019-03-18 11:10:55.545357] D [MSGID: 0] [afr-self-heald.c:218:afr_shd_index_inode] 0-hdd2-replicate-0: glusterfs.xattrop_entry_changes_gfid dir gfid for hdd2-client-0: b8f46fe6-ef56-44ac-8586-22061b0f2d5b
[2019-03-18 11:10:55.546606] D [MSGID: 0] [afr-self-heald.c:597:afr_shd_index_healer] 0-hdd2-replicate-0: finished index sweep on subvol hdd2-client-0
[2019-03-18 11:11:24.234360] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-0: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:11:24.234416] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-1: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:11:24.234425] D [rpc-clnt-ping.c:336:rpc_clnt_start_ping] 0-hdd2-client-2: returning as transport is already disconnected OR there are no frames (0 || 0)
[2019-03-18 11:12:46.240335] D [logging.c:1855:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2019-03-18 11:12:46.240388] D [logging.c:1817:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2019-03-18 11:14:09.000515] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f12ab9f7ebb] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7f12ab7c228b] (--> /lib64/libgfrpc.so.0(+0x14a31)[0x7f12ab7c2a31] (--> /lib64/libgfrpc.so.0(rpc_clnt_submit+0x4a1)[0x7f12ab7bf651] (--> /usr/lib64/glusterfs/3.12.15/xlator/protocol/client.so(+0x102e2)[0x7f129e0e02e2] ))))) 0-: 192.168.99.14:49154: ping timer event already removed
[2019-03-18 11:14:09.000619] D [MSGID: 0] [syncop-utils.c:548:syncop_is_subvol_local] 0-ssd4-client-0: subvol ssd4-client-0 is local
[2019-03-18 11:14:09.000637] D [MSGID: 0] [afr-self-heald.c:580:afr_shd_index_healer] 0-ssd4-replicate-0: starting index sweep on subvol ssd4-client-0
[2019-03-18 11:14:09.000635] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-ssd4-client-0: Ping latency is 0ms

glusterfs client log:

[2019-03-18 11:03:57.442215] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.14 --volfile-server=*.*.*.15 --volfile-server=*.*.*.16 --volfile-id=/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9)
[2019-03-18 11:03:57.472426] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-18 11:06:03.636977] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f392add8dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x556b04278e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x556b04278ceb] ) 0-: received signum (15), shutting down
[2019-03-18 11:06:03.637039] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:06:03.654263] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:13:29.415760] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=*.*.*.14 --volfile-server=*.*.*.15 --volfile-server=*.*.*.16 --volfile-id=/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9)
[2019-03-18 11:13:29.444824] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14
[2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)
[2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)
[2019-03-18 11:29:01.002550] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xce32) [0x7fb4beba4e32] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x841) [0x5586d9f3c361] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5586d9f34ceb] ) 0-: received signum (0), shutting down
[2019-03-18 11:29:01.002578] I [fuse-bridge.c:5914:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:29:01.009036] I [fuse-bridge.c:5919:fini] 0-fuse: Closing fuse connection to '/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9'.
[2019-03-18 11:29:01.009655] W [glusterfsd.c:1500:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fb4bdc3ddd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5586d9f34e75] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5586d9f34ceb] ) 0-: received signum (15), shutting down

supervdsm.log

MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:11:29,020::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call volumeInfo with (u'ssd9', u'*.*.*.14') {}
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:11:29,020::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.14 ssd9 --xml (cwd None)
MainProcess|jsonrpc/6::DEBUG::2019-03-18 12:11:31,460::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call ksmTune with ({u'run': 0, u'merge_across_nodes': 1},) {}
MainProcess|jsonrpc/6::DEBUG::2019-03-18 12:11:31,461::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,227::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,227::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 122 seconds (max pending: 1)
MainProcess|jsonrpc/5::ERROR::2019-03-18 12:13:29,227::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
    xmltree = _execGlusterXml(command)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
    return _getTree(rc, out, err)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
    raise ge.GlusterCmdExecFailedException(rc, out, err)
GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,229::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f3877683a10>, u'*.*.*.14:/ssd9', u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.16', 'cgroup': 'vdsm-glusterfs'}
MainProcess|jsonrpc/5::DEBUG::2019-03-18 12:13:29,230::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.16 *.*.*.14:/ssd9 /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd9 (cwd None)
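The failing supervdsm call above can be replayed by hand to separate the CLI timeout from the mount itself; the server address below is a placeholder, since the real one is masked in the logs:

```shell
# Replay the volumeInfo query that supervdsm timed out on and time it.
# SERVER is a placeholder for the masked *.14 address.
time gluster --mode=script volume info --remote-host=SERVER ssd9 --xml
```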

Comment 22 Netbulae 2019-03-20 11:34:26 UTC
(In reply to Amar Tumballi from comment #16)
> Hi Netbulae, How can we proceed on this? Is there a possibility we can do
> some live debug of the situation? I am 'amarts' on IRC, and we can catch up
> there to discuss further.
> 
> From latest comments I understand adding 'insecure' options (as per
> comment#5) didn't help later.

Yes, adding the insecure options and restarting the volumes didn't help.

Comment 23 Sahina Bose 2019-03-26 12:59:37 UTC
Is there anything in comment 21 that could help narrow this down?

Comment 24 Atin Mukherjee 2019-03-26 13:24:04 UTC
[2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14

Why did we get a disconnect? Was the glusterd service at *.14 not running?

[2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
[2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (--> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2019-03-18 11:13:29.445101 (xid=0x2)

The above log seems to be the culprit here. 

[2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/ssd9)

And the above log is the after effect.


I have a few questions:

1. Does the mount fail every time?
2. Do you see any change in behaviour when the primary volfile server is changed?
3. What are the gluster versions on the individual peers?

(Keeping the needinfo intact for now, but requesting Sahina to get us these details to work on.)

Comment 25 Sahina Bose 2019-03-29 13:46:15 UTC
Redirecting to reporter. Could you answer questions in Comment 24?

Comment 26 Netbulae 2019-04-01 09:37:41 UTC
(In reply to Atin Mukherjee from comment #24)
> [2019-03-18 11:29:01.000279] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: disconnected from remote-host: *.*.*.14
> 
> Why did we get a disconnect. Was glusterd service at *.14 not running?
> 
> [2019-03-18 11:29:01.000330] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: connecting to next volfile server *.*.*.15
> [2019-03-18 11:29:01.002495] E [rpc-clnt.c:346:saved_frames_unwind] (-->
> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb4beddbfbb] (-->
> /lib64/libgfrpc.so.0(+0xce11)[0x7fb4beba4e11] (-->
> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fb4beba4f2e] (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fb4beba6531] (-->
> /lib64/libgfrpc.so.0(+0xf0d8)[0x7fb4beba70d8] ))))) 0-glusterfs: forced
> unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at
> 2019-03-18 11:13:29.445101 (xid=0x2)
> 
> The above log seems to be the culprit here. 
> 
> [2019-03-18 11:29:01.002517] E [glusterfsd-mgmt.c:2136:mgmt_getspec_cbk]
> 0-mgmt: failed to fetch volume file (key:/ssd9)
> 
> And the above log is the after effect.
> 
> 
> I have few questions:
> 
> 1. Does the mount fail everytime?

Yes. It also stays the same when we switch the primary storage domain to another one.

> 2. Do you see any change in the behaviour when the primary volfile server is
> changed?

No. I have a different primary volfile server across volumes to spread the load a bit more. Same effect always.

> 3. What are the gluster version in the individual peers?

All nodes and servers are on 3.12.15

> 
> (Keeping the needinfo intact for now, but request Sahina to get us these
> details to work on).

Comment 27 Atin Mukherjee 2019-04-01 12:14:53 UTC
Since I'm unable to reproduce this even after multiple attempts, the only way I can make progress is to ask you to test different combinations. I understand this might be frustrating, but I have no other way at this point to pinpoint the issue. In my local setup I tried every possible option to simulate this, but had no success.

As I explained in comment 24, it seems the client couldn't get the volfile from the glusterd running on the *.15 instance. However, since there's no log entry at INFO level in glusterd that could indicate the cause of this failure, can I request you to do the following, if possible?

1. Run 'pkill glusterd; glusterd' on the *.15 node.
2. Attempt to mount the client.
3. Find the 'failed to fetch volume file' failure in the client log and note its timestamp. Map this timestamp in the glusterd log and send us a snippet of the log entries around it.
4. Run the 'gluster v info' command on all the nodes and paste back the output.
5. Provide the output of 'gluster v get all cluster.op-version' from one of the nodes.
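Steps 3 to 5 can be sketched as follows; the log paths are the glusterfs defaults and the grep context width is arbitrary:

```shell
# Extract the "[YYYY-MM-DD HH:MM:SS.ffffff]" timestamp from a gluster log line.
extract_ts() {
    awk -F'[][]' '{print $2; exit}'
}

# Step 3: find the client-side failure in the mount log, then show the
# glusterd log entries around the same moment.
snippet_around_failure() {
    local mountlog=$1 glusterdlog=${2:-/var/log/glusterfs/glusterd.log}
    local ts
    ts=$(grep -m1 'failed to fetch volume file' "$mountlog" | extract_ts)
    grep -C 20 "${ts%.*}" "$glusterdlog"   # match to whole-second precision
}

# Steps 4 and 5, run on each node:
#   gluster v info
#   gluster v get all cluster.op-version
```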

Comment 28 Netbulae 2019-04-09 10:43:38 UTC
(In reply to Atin Mukherjee from comment #27)

I tried the sequence you posted, but currently we have *.16 and ssd5 as the master storage domain, as I removed ssd9.

I also upgraded to oVirt node 4.3.3 rc2 because of several fixes.

But I don't get the 'failed to fetch volume file' error this way; it appears the backup-volfile-servers option is not working:

>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,186::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.16 ssd5 --xml (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,297::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:10:09,297::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
>    xmltree = _execGlusterXml(command)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
>    return _getTree(rc, out, err)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
>    raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Connection failed. Please check if gluster daemon is operational.\n' err=''
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:10:09,298::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=1*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'}

When I leave the glusterfs process running, I no longer get that error in 4.3.3 either; instead I get "'Error : Request timed out\n' err=''" and "MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n')"


>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:25:54,918::commands::219::root::(execCmd) FAILED: <err> = 'Running scope as unit run-10178.scope.\nMount failed. Please check the log file for more details.\n'; <rc> = 1
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:25:54,918::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 765 seconds (max pending: 2)
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:25:54,918::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in mount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 143, in mount
>    cgroup=cgroup)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 277, in _mount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (1, ';Running scope as unit run-10178.scope.\nMount failed. Please check the log file for more details.\n')
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,720::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,720::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 294 seconds (max pending: 2)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,802::routes::110::root::(get_gateway) The gateway *.*.*.3 is duplicated for the device ovirtmgmt
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,807::routes::110::root::(get_gateway) The gateway *.*.*.3 is duplicated for the device ovirtmgmt
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,809::cmdutils::133::root::(exec_cmd) /sbin/tc qdisc show (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,860::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,866::cmdutils::133::root::(exec_cmd) /sbin/tc class show dev bond0 classid 1389:1388 (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:49,919::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,005::vsctl::68::root::(commit) Executing commands: /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,006::cmdutils::133::root::(exec_cmd) /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-09 12:30:50,038::cmdutils::141::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
>
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,472::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/ovirt.netbulae.mgmt:_var_lib_exports_iso (cwd None)
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,502::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,503::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,529::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/*.*.*.15:_data_ovirt') >{'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,529::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/*.*.*.15:_data_ovirt (cwd None)
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,595::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,596::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,738::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,738::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/libexec/vdsm/fc-scan (cwd None)
>MainProcess|jsonrpc/2::DEBUG::2019-04-09 12:37:02,804::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,841::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,841::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,849::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,850::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.15:ssd4: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,850::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,850::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,875::commands::219::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,876::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return umount with None
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,889::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,889::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,895::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,895::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:ssd6: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,895::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,896::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,902::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,902::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_hdd2: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,903::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,903::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,910::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,910::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_ssd3: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,911::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call umount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'/rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8') {'force': True, 'lazy': True, 'freeloop': False}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,911::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8 (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:02,917::commands::219::root::(execCmd) FAILED: <err> = 'umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8: mountpoint not found\n'; <rc> = 32
>MainProcess|jsonrpc/6::ERROR::2019-04-09 12:37:02,917::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in umount
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 147, in umount
>    mount._umount(fs_file, force=force, lazy=lazy, freeloop=freeloop)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 296, in _umount
>    _runcmd(cmd)
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 305, in _runcmd
>    raise MountError(rc, b";".join((out, err)))
>MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/*.*.*.14:_sdd8: mountpoint not found\n')
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,042::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,042::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/libexec/vdsm/fc-scan (cwd None)
>MainProcess|jsonrpc/6::DEBUG::2019-04-09 12:37:03,107::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None

Comment 29 Netbulae 2019-04-09 10:45:21 UTC
I cannot find anything related in the glusterd log.

gluster v get all cluster.op-version
Option                                  Value                                   
------                                  -----                                   
cluster.op-version                      31202

Comment 30 Netbulae 2019-04-10 08:06:11 UTC
The volume info is in comment 4.

I now see that I get the same error with the brick running:

>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:51:27,437::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call volumeInfo with (u'ssd5', u'*.*.*.16') {}
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:51:27,437::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/sbin/gluster --mode=script volume info --remote-host=*.*.*.16 ssd5 --xml (cwd None)
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,553::commands::219::root::(execCmd) FAILED: <err> = ''; <rc> = 1
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,553::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 123 seconds (max pending: 1)
>MainProcess|jsonrpc/0::ERROR::2019-04-10 09:53:27,553::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
>    xmltree = _execGlusterXml(command)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
>    return _getTree(rc, out, err)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
>    raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,555::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'}
>MainProcess|jsonrpc/0::DEBUG::2019-04-10 09:53:27,555::commands::198::root::(execCmd) /usr/bin/taskset --cpu-list 0-63 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o backup-volfile-servers=*.*.*.15:*.*.*.14 *.*.*.16:ssd5 /rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5 (cwd None)
>MainProcess|jsonrpc/1::DEBUG::2019-04-10 09:54:27,765::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7f10804e2990>,) {}

But the port is reachable from the node:

>Brick *.*.*.16:/data/ssd5/brick1       49156     0          Y       15429

>[root@node9 ~]# telnet *.*.*.16 49156
>Trying *.*.*.16...
>Connected to *.*.*.16.
>Escape character is '^]'.
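The manual telnet check above can be expressed as a small reachability probe. This is my own sketch, not a vdsm helper; it just confirms, as telnet does, that a TCP connection to the brick port succeeds, which says nothing about whether glusterd's RPC layer answers afterwards.

```python
import socket

# Telnet-equivalent TCP probe (illustrative, not vdsm code): True if a TCP
# connection to host:port succeeds within `timeout` seconds, mirroring the
# manual `telnet *.*.*.16 49156` check above.
def tcp_reachable(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful probe here plus the 'Request timed out' CLI failure is consistent with the transport being fine but the management RPC stalling.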


And in /var/log/messages:

>Apr 10 09:56:27 node9 vdsm[56358]: WARN Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=*.*.*.15:*.*.*.14', u'id': u'b036005a-d44d-4689-a8c3-13e1bbf55af7', u'connection': u'*.*.*.16:ssd5', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': >u'00000001-0001-0001-0001-000000000043', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'fbedfc27-e064-48d5-b332-daeeccdd1cf4'} at 0x7f77c4589210> timeout=60, duration=300.00 at 0x7f77c4589810> >task#=674 at 0x7f77e406f790>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  >self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012  ret = func(*args, >**kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012  self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012  response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012  res = >method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012  result = fn(*methodArgs)#012File: 
"/usr/lib/python2.7/site-packages/vdsm/API.py", line 1095, in connectStorageServer#012  >connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012  result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012  return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012  result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012  return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method#012  ret = func(*args, **kwargs)#012File: >"/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2411, in connectStorageServer#012  conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 172, in connect#012  >self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount#012  cgroup=cgroup)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012  return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>#012  **kwargs)#012File: "<string>", line 2, in mount#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012  kind, result = conn.recv()
>Apr 10 09:56:27 node9 vdsm[56358]: WARN Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {u'masterVersion': 10, u'domainDict': {u'33f1ba00-6a16-4e58-b4c5-94426f1c4482': u'active', u'782a61af-a520-44c4-8845-74bf92888552': u'active', u'b30bab9d-9a66-44ce-ad17-2eb4ee858d8f': u'active', u'09959920-a31b-42c2-a547-e50b73602c96': u'active', u'f5d2f7c6-093f-46d6-a844-224d92db5ef9': u'active', u'515166a4-4b2f-402d-bf37-95f3c59635cb': >u'active', u'1ed0a635-67ee-4255-aad9-b70822350706': u'active', u'95b4e5d2-2974-4d5f-91e4-351f75a15435': u'active', u'84f2ff4a-2ec5-42dc-807d-bd12745f387d': u'active'}, u'storagepoolID': u'00000001-0001-0001-0001-000000000043', u'scsiKey': >u'00000001-0001-0001-0001-000000000043', u'masterSdUUID': u'09959920-a31b-42c2-a547-e50b73602c96', u'hostID': 12}, 'jsonrpc': '2.0', 'method': u'StoragePool.connect', 'id': u'95545e2f-6b72-4107-bc19-8de3e549aba0'} at 0x7f77c458d310> >timeout=60, duration=120.01 at 0x7f77c458d0d0> task#=677 at 0x7f77e406f990>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", >line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line >195, in >run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  >task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012  self._handler(self._ctx, >self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012  
response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012  res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012  result = fn(*methodArgs)#012File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1091, in >connect#012  self._UUID, hostID, masterSdUUID, masterVersion, domainDict)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012  result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012  return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012  result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012  return fn(*args, **kargs)#012File: "<string>", line 2, in connectStoragePool#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", >line 50, in method#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1034, in connectStoragePool#012  spUUID, hostID, msdUUID, masterVersion, domainsMap)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1096, in _connectStoragePool#012  res = pool.connect(hostID, msdUUID, masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect#012  >self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild#012  self.setMasterDomain(msdUUID, masterVersion)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1491, in setMasterDomain#012  domain = sdCache.produce(msdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce#012  domain.getRealDomain()#012File: 
"/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain#012  return self._cache._realProduce(self._sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce#012  domain = >self._findDomain(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain#012  return findMethod(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 169, in >_findUnfetchedDomain#012  return mod.findDomain(sdUUID)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 146, in findDomain#012  return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 130, in findDomainPath#012  for tmpSdUUID, domainPath in fileSD.scanDomains("*"):#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 893, in scanDomains#012  for >res in misc.itmap(collectMetaFiles, mntList, oop.HELPERS_PER_DOMAIN):#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/misc.py", line 538, in itmap#012  yield respQueue.get()#012File: "/usr/lib64/python2.7/Queue.py", line 168, in >get#012  self.not_empty.wait()#012File: "/usr/lib/python2.7/site-packages/pthreading.py", line 127, in wait#012  return self.__cond.wait()#012File: "/usr/lib/python2.7/site-packages/pthread.py", line 131, in wait#012  return >_libpthread.pthread_cond_wait(self._cond, m.mutex())
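What the 'Worker blocked' warnings above show is that vdsm's executor only logs when a task exceeds its 60-second timeout (duration=300.00 here); the underlying mount/volumeInfo call keeps running in supervdsm. As a contrast, a hard timeout on the child process would look like the sketch below. This is illustrative only and is not how vdsm is implemented.

```python
import subprocess

# Sketch (not vdsm code): enforce a hard timeout on a child process instead
# of only warning that a worker is blocked, as seen in the log above.
def run_with_timeout(cmd, timeout):
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # caller decides how to report/kill the hang
```

With vdsm's soft-timeout design, a glusterd that accepts TCP connections but never answers the RPC will pin a jsonrpc worker for as long as the mount command runs, which matches the 300-second durations logged here.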

Comment 31 Sahina Bose 2019-09-26 14:01:50 UTC
Circling back on this: do you still see the same issues with the latest gluster 6 release? Checking since we haven't been able to reproduce this internally.

Comment 32 Netbulae 2019-10-01 11:37:17 UTC
I still have it, with slightly different output:

>Oct  1 12:59:54 vh10en vdsm[9365]: WARN Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=*.*.*.15:*.*.*.14', u'id': u'b036005a-d44d-4689-a8c3-13e1bbf55af7', u'connection': u'*.*.*.16:ssd5', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': u'00000001-0001-0001-0001-000000000043', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'94712630-8f2f-4508-9950-51f9e3738c65'} at 0x7f4eb835e850> timeout=60, duration=60.00 at 0x7f4eb835efd0> task#=1 at 0x7f4edc05ced0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__#012  self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest#012  response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request#012  res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod#012  result = fn(*methodArgs)#012File: 
"/usr/lib/python2.7/site-packages/vdsm/API.py", line 1102, in connectStorageServer#012  connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper#012  result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012  return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012  result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012  return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer#012  conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 167, in connect#012  self.validate()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 297, in validate#012  if not self.volinfo:#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 284, in volinfo#012  self._volinfo = self._get_gluster_volinfo()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 329, in _get_gluster_volinfo#012  self._volfileserver)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__#012  return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>#012  **kwargs)#012File: "<string>", line 2, in glusterVolumeInfo#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012  kind, result = conn.recv()


>MainProcess|jsonrpc/0::ERROR::2019-10-01 13:00:55,244::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in volumeInfo
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
>    res = func(*args, **kwargs)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 529, in volumeInfo
>    xmltree = _execGlusterXml(command)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 131, in _execGlusterXml
>    return _getTree(rc, out, err)
>  File "/usr/lib/python2.7/site-packages/vdsm/gluster/cli.py", line 112, in _getTree
>    raise ge.GlusterCmdExecFailedException(rc, out, err)
>GlusterCmdExecFailedException: Command execution failed: rc=1 out='Error : Request timed out\n' err=''
>MainProcess|jsonrpc/0::DEBUG::2019-10-01 13:00:55,247::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call mount with (<vdsm.supervdsm_server._SuperVdsm object at 0x7fe92597a7d0>, u'*.*.*.16:ssd5', u'/rhev/data-center/mnt/glusterSD/*.*.*.16:ssd5') {'vfstype': u'glusterfs', 'mntOpts': u'backup-volfile-servers=*.*.*.15:*.*.*.14', 'cgroup': 'vdsm-glusterfs'

Maybe there is something wrong with the ports that glusterfs is listening on:

tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49153           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49154           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:16514           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49163           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49164           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:2223            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:54322           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN

cat /etc/glusterfs/glusterd.vol 
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
end-volume
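To make the active versus commented-out options in the glusterd.vol fragment above explicit (e.g. base-port 49152 is commented out, so glusterd falls back to its built-in default), a throwaway parser helps. This is a hypothetical helper of my own, not a gluster tool.

```python
# Hypothetical parser (not a gluster utility): collect the active "option"
# lines from a glusterd.vol fragment, ignoring commented-out ones, so it is
# obvious which settings actually apply.
def parse_volfile_options(text):
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("option "):
            _, key, value = line.split(None, 2)
            opts[key] = value
    return opts
```

Run against the fragment above, this would show ping-timeout=0 and rpc-auth-allow-insecure=on active, while lock-timer, transport.address-family and base-port are all commented out.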

Comment 33 Sahina Bose 2019-10-11 10:22:43 UTC
Can you try executing "gluster volume info" from the hypervisor where the logs in comment 32 are from?
Any errors in glusterd logs?
