Bug 1481619 - When a node hosting gluster-block devices is rebooted, further discovery fails on that node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: cns-3.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: CNS 3.6
Assignee: Mohamed Ashiq
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On: 1479777
Blocks: 1445448
Reported: 2017-08-15 08:12 UTC by krishnaram Karthick
Modified: 2017-10-11 06:58 UTC

Fixed In Version: cns-deploy-5.0.0-21
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-11 06:58:29 UTC
Embargoed:


Links
System ID: Red Hat Product Errata RHEA-2017:2877
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: rhgs-server-container bug fix and enhancement update
Last Updated: 2017-10-11 11:11:39 UTC

Description krishnaram Karthick 2017-08-15 08:12:26 UTC
Description of problem:

On a CNS setup with 3 worker nodes, 10 app pods were created, each using a gluster-block device as its PVC. After one of the nodes hosting a gluster pod was rebooted, no targets were seen on that node and target discovery against it failed.

Please note that this issue was not seen with just 1 app pod and 1 gluster-block device

[root@dhcp46-248 ~]# iscsiadm -m discovery -t st -p 10.70.46.248
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: encountered connection failure
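
"Connection refused" usually means that nothing is listening on the portal port rather than that traffic is being filtered. A minimal sketch of a plain TCP check against the iSCSI port, assuming the default port 3260 that appears in the discovery records and firewall rules in this report; the function name `portal_reachable` is hypothetical:

```python
import socket


def portal_reachable(host: str, port: int = 3260, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `portal_reachable("10.70.46.248")` from any node would be expected to return False while the problem is present, and True for the two healthy nodes.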

The dmesg output of 10.70.46.248 shows session recovery timeouts (the complete output will be attached):

[  108.273653] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  132.602560]  connection33:0: detected conn error (1020)
[  132.602724]  connection7:0: detected conn error (1020)
[  132.609926]  connection10:0: detected conn error (1020)
[  132.661266]  connection16:0: detected conn error (1020)
[  146.436854]  session13: session recovery timed out after 120 secs
[  146.436864]  session7: session recovery timed out after 120 secs
[  146.436868]  session8: session recovery timed out after 120 secs
[  146.436872]  session9: session recovery timed out after 120 secs
[  146.436906]  session10: session recovery timed out after 120 secs
[  146.436910]  session12: session recovery timed out after 120 secs
[  146.436914]  session11: session recovery timed out after 120 secs
[  147.973445]  session33: session recovery timed out after 120 secs
[  147.973453]  session35: session recovery timed out after 120 secs
[  147.973455]  session34: session recovery timed out after 120 secs
[  147.973458]  session15: session recovery timed out after 120 secs
[  147.973460]  session16: session recovery timed out after 120 secs
[  147.973462]  session14: session recovery timed out after 120 secs
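
To get a quick overview of which sessions are affected, the kernel messages can be summarized. This is only a sketch that matches the two message shapes visible in the excerpt above; the function name `summarize_iscsi_dmesg` is hypothetical and other message variants are not handled:

```python
import re

# Patterns taken from the dmesg lines quoted in this report.
TIMEOUT_RE = re.compile(r"session(\d+): session recovery timed out after (\d+) secs")
CONN_ERR_RE = re.compile(r"connection(\d+):\d+: detected conn error \((\d+)\)")


def summarize_iscsi_dmesg(text: str) -> dict:
    """Collect session ids that timed out and connection ids that reported errors."""
    timed_out = sorted({int(m.group(1)) for m in TIMEOUT_RE.finditer(text)})
    conn_errors = sorted({int(m.group(1)) for m in CONN_ERR_RE.finditer(text)})
    return {
        "timed_out_sessions": timed_out,
        "connections_with_errors": conn_errors,
    }
```

The input would be the text of `dmesg` captured on the rebooted node.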

All app pods are up and running after the reboot:

oc get pods
NAME                               READY     STATUS    RESTARTS   AGE
glusterblock-provisioner-1-kpbcx   1/1       Running   4          4d
glusterfs-b6s3d                    1/1       Running   0          19h
glusterfs-gmlvv                    1/1       Running   0          19h
glusterfs-xnklw                    1/1       Running   1          19h
heketi-1-3895t                     1/1       Running   0          19h
jenkins-1-1-46vbx                  1/1       Running   0          49m
jenkins-10-1-0hq63                 1/1       Running   1          48m
jenkins-2-1-l6wr6                  1/1       Running   1          49m
jenkins-3-1-qm1sp                  1/1       Running   0          48m
jenkins-4-1-mfhgq                  1/1       Running   1          48m
jenkins-5-1-wtkgh                  1/1       Running   1          48m
jenkins-6-1-nf45x                  1/1       Running   0          48m
jenkins-7-1-28jwf                  1/1       Running   0          48m
jenkins-8-1-jbq5l                  1/1       Running   0          48m
jenkins-9-1-xg0xp                  1/1       Running   0          48m
storage-project-router-1-21vfw     1/1       Running   5          5d

No targets are seen on 10.70.46.248:

 oc rsh glusterfs-xnklw (This is the gluster-pod on 10.70.46.248)
sh-4.2# 
sh-4.2# 
sh-4.2# targetcli  ls
o- /  [...]
  o- backstores  [...]
  | o- block  [Storage Objects: 0]
  | o- fileio  [Storage Objects: 0]
  | o- pscsi  [Storage Objects: 0]
  | o- ramdisk  [Storage Objects: 0]
  | o- user:glfs  [Storage Objects: 0]
  o- iscsi  [Targets: 0]
  o- loopback  [Targets: 0]

sh-4.2# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-08-15 07:11:51 UTC; 46min ago
 Main PID: 225 (gluster-blockd)
   CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode201fbd6_80eb_11e7_8f8c_00505684d1d7.slice/docker-29d41f84b60cf07d74e96de8efb1472972e2a4e80bfeca32b1cac84d5f0b5e04.scope/system.slice/gluster-blockd.service
           └─225 /usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO...

Aug 15 07:11:51 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Started Gluster...
Aug 15 07:11:51 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Starting Gluste...
Hint: Some lines were ellipsized, use -l to show in full.
sh-4.2# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2017-08-15 07:11:50 UTC; 46min ago
 Main PID: 172 (tcmu-runner)
   CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode201fbd6_80eb_11e7_8f8c_00505684d1d7.slice/docker-29d41f84b60cf07d74e96de8efb1472972e2a4e80bfeca32b1cac84d5f0b5e04.scope/system.slice/tcmu-runner.service
           └─172 /usr/bin/tcmu-runner --tcmu-log-dir=/var/log/gluster-block/

Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Starting LIO Us...
Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com tcmu-runner[172]: tcmu-runner
                                                                    : load_ou...
Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com tcmu-runner[172]: 2017-08-1...
Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Started LIO Use...
Hint: Some lines were ellipsized, use -l to show in full.
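
The 'targetcli ls' output above shows zero storage objects and zero targets even though both services are running. A small sketch that extracts those counters from captured 'targetcli ls' text, assuming the bracketed "[Storage Objects: N]" and "[Targets: N]" summaries shown above; the function names are hypothetical:

```python
import re

# Matches e.g. "o- user:glfs  [Storage Objects: 0]" or "o- iscsi ... [Targets: 3]".
COUNT_RE = re.compile(r"o- (\S+)[\s.]+\[(?:Storage Objects|Targets): (\d+)\]")


def targetcli_counts(ls_output: str) -> dict:
    """Map each backstore or fabric name to its object/target count."""
    return {m.group(1): int(m.group(2)) for m in COUNT_RE.finditer(ls_output)}


def config_is_empty(ls_output: str) -> bool:
    """True when counters were found and every one of them is zero."""
    counts = targetcli_counts(ls_output)
    return bool(counts) and all(value == 0 for value in counts.values())
```

An empty configuration on a node that is expected to export block devices is the symptom described in this report.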

Other logs/information:

[root@dhcp46-248 ~]# ls -l /var/crash/
total 0

[root@dhcp46-248 ~]# lsmod | grep 'target'
iscsi_target_mod      302966  1 
target_core_user       23936  0 
uio                    19259  1 target_core_user
target_core_mod       367918  3 iscsi_target_mod,target_core_user
crc_t10dif             12714  2 target_core_mod,sd_mod
[root@dhcp46-248 ~]# lsmod | grep 'multi'
xt_multiport           12798  1 
dm_multipath           27427  5 dm_round_robin
dm_mod                123303  95 dm_round_robin,dm_multipath,dm_log,dm_persistent_data,dm_mirror,dm_bufio,dm_thin_pool

cat /etc/sysconfig/iptables
# Generated by iptables-save v1.4.21 on Wed Jul  5 13:40:34 2017
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:OS_FIREWALL_ALLOW - [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j OS_FIREWALL_ALLOW
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 10250 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 443 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 4789 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24007 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24008 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2222 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m multiport --dports 49152:49664 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 111 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24006 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 3260 -j ACCEPT
COMMIT
# Completed on Wed Jul  5 13:40:34 2017
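
The rules above include an ACCEPT entry for TCP port 3260, which suggests the firewall is not what blocks discovery. A rough sketch for checking a saved rules file for a given port; it only looks at "-A ... -p tcp ... --dport/--dports ... -j ACCEPT" lines of the shape shown above, ignores chain ordering and comma-separated multiport lists, and the function name `tcp_port_accepted` is hypothetical:

```python
import re

# Matches "--dport 3260" as well as a range such as "--dports 49152:49664".
DPORT_RE = re.compile(r"--dports? (\d+)(?::(\d+))?")


def tcp_port_accepted(rules_text: str, port: int) -> bool:
    """Return True if some TCP ACCEPT rule covers the given destination port."""
    for line in rules_text.splitlines():
        line = line.strip()
        if not (line.startswith("-A") and "-p tcp" in line and line.endswith("-j ACCEPT")):
            continue
        match = DPORT_RE.search(line)
        if not match:
            continue
        low = int(match.group(1))
        high = int(match.group(2)) if match.group(2) else low
        if low <= port <= high:
            return True
    return False
```

Passing the contents of /etc/sysconfig/iptables and the port 3260 would be expected to return True for the rules listed above.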

block-hosting volume:
gluster vol list
heketidbstorage
vol_e13fd33c564cd64cd7d7849ccdf1d704
sh-4.2# gluster vol status vol_e13fd33c564cd64cd7d7849ccdf1d704
Status of volume: vol_e13fd33c564cd64cd7d7849ccdf1d704
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.248:/var/lib/heketi/mounts/v
g_0beb90cbab247e75a6e4a3a101ec1bfb/brick_90
d48523fba33a8d490e270dbf74c150/brick        49152     0          Y       2382
Brick 10.70.47.72:/var/lib/heketi/mounts/vg
_17681cb20e6a51e1f62a4e48e58dede2/brick_ae9
7d4c5eac71ea7b73f1738d575d5e3/brick         49152     0          Y       2801
Brick 10.70.47.49:/var/lib/heketi/mounts/vg
_856a20c7776264fb613f75a6f3593191/brick_d92
0187a1411a055f4dace3f20ca9780/brick         49152     0          Y       2909
Self-heal Daemon on localhost               N/A       N/A        Y       3685
Self-heal Daemon on dhcp47-49.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       3814
Self-heal Daemon on 10.70.46.248            N/A       N/A        Y       2391

Task Status of Volume vol_e13fd33c564cd64cd7d7849ccdf1d704
------------------------------------------------------------------------------
There are no active volume tasks

sh-4.2# gluster vol heal vol_e13fd33c564cd64cd7d7849ccdf1d704 info
Brick 10.70.46.248:/var/lib/heketi/mounts/vg_0beb90cbab247e75a6e4a3a101ec1bfb/brick_90d48523fba33a8d490e270dbf74c150/brick
Status: Connected
Number of entries: 0

Brick 10.70.47.72:/var/lib/heketi/mounts/vg_17681cb20e6a51e1f62a4e48e58dede2/brick_ae97d4c5eac71ea7b73f1738d575d5e3/brick
Status: Connected
Number of entries: 0

Brick 10.70.47.49:/var/lib/heketi/mounts/vg_856a20c7776264fb613f75a6f3593191/brick_d920187a1411a055f4dace3f20ca9780/brick
Status: Connected
Number of entries: 0

iSCSI discovery on the other 2 nodes works (output pasted from just one node):

[root@dhcp47-49 ~]# iscsiadm -m discovery -t st -p 10.70.47.49
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:e908e9c0-66ab-4cbb-a360-3e9e0cbcf9d8
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:e908e9c0-66ab-4cbb-a360-3e9e0cbcf9d8
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:e908e9c0-66ab-4cbb-a360-3e9e0cbcf9d8
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:755c65a5-c286-4ea0-901f-0e282a66db0f
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:755c65a5-c286-4ea0-901f-0e282a66db0f
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:755c65a5-c286-4ea0-901f-0e282a66db0f
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:60bada90-6392-4af9-866a-ee3388c9e4e1
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:60bada90-6392-4af9-866a-ee3388c9e4e1
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:60bada90-6392-4af9-866a-ee3388c9e4e1
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:18cdd32f-dbe9-4a12-ac7c-e5207867d0da
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:18cdd32f-dbe9-4a12-ac7c-e5207867d0da
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:18cdd32f-dbe9-4a12-ac7c-e5207867d0da
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:b90ee999-2860-4723-92c8-7a63dab56c95
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:b90ee999-2860-4723-92c8-7a63dab56c95
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:b90ee999-2860-4723-92c8-7a63dab56c95
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:5d0a0d4b-9035-4b71-8270-07c4acc60e9a
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:5d0a0d4b-9035-4b71-8270-07c4acc60e9a
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:5d0a0d4b-9035-4b71-8270-07c4acc60e9a
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:f2346d41-b484-46d8-a2de-629c5412996f
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:f2346d41-b484-46d8-a2de-629c5412996f
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:f2346d41-b484-46d8-a2de-629c5412996f
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:8f48dca4-9ee8-4dda-8163-e00791b7d062
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:8f48dca4-9ee8-4dda-8163-e00791b7d062
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:8f48dca4-9ee8-4dda-8163-e00791b7d062
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:d666317b-93b3-4156-984d-98a31406d87f
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:d666317b-93b3-4156-984d-98a31406d87f
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:d666317b-93b3-4156-984d-98a31406d87f
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:a85853e1-0e40-489d-a2e1-2e291ece9ed6
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:a85853e1-0e40-489d-a2e1-2e291ece9ed6
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:a85853e1-0e40-489d-a2e1-2e291ece9ed6
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:7b41ddd3-1bb8-433f-9d33-db880495acd9
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:7b41ddd3-1bb8-433f-9d33-db880495acd9
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:7b41ddd3-1bb8-433f-9d33-db880495acd9
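
Each discovery record above has the form "portal,tpgt iqn", with three portals listed per target. A sketch that groups the portals by target and reports targets that do not advertise a given portal; the function names are hypothetical and only records of this two-field shape are parsed:

```python
from collections import defaultdict


def portals_by_target(discovery_output: str) -> dict:
    """Group 'ip:port' portals by target IQN from SendTargets discovery output."""
    targets = defaultdict(list)
    for line in discovery_output.splitlines():
        parts = line.split()
        # Skip anything that is not a "portal,tpgt iqn" record (e.g. error lines).
        if len(parts) != 2 or "," not in parts[0]:
            continue
        portal, _tpgt = parts[0].rsplit(",", 1)
        targets[parts[1]].append(portal)
    return dict(targets)


def targets_missing_portal(discovery_output: str, expected_portal: str) -> list:
    """Return the IQNs whose record set does not include expected_portal."""
    grouped = portals_by_target(discovery_output)
    return sorted(iqn for iqn, portals in grouped.items() if expected_portal not in portals)
```

Note that the records returned by a healthy node still list 10.70.46.248:3260 as a portal, so this check only shows what is advertised, not whether that portal is actually answering.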

Discovery against the rebooted node (10.70.46.248) fails from this node as well:

[root@dhcp47-49 ~]# iscsiadm -m discovery -t st -p 10.70.46.248
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: encountered connection failure

The block-hosting volume is up and there are no pending heals; the volume status and heal info output is identical to that shown above.

Version-Release number of selected component (if applicable):
sh-4.2# rpm -qa | grep 'gluster'
glusterfs-fuse-3.8.4-40.el7rhgs.x86_64
glusterfs-server-3.8.4-40.el7rhgs.x86_64
gluster-block-0.2.1-6.el7rhgs.x86_64
glusterfs-libs-3.8.4-40.el7rhgs.x86_64
glusterfs-3.8.4-40.el7rhgs.x86_64
glusterfs-api-3.8.4-40.el7rhgs.x86_64
glusterfs-cli-3.8.4-40.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-40.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-40.el7rhgs.x86_64

How reproducible:
The test case was run only once, but the steps are simple, so it should be reproducible.

Steps to Reproduce:
1. Create a 3-node CNS setup.
2. Create 10 app pods, each using a gluster-block device as its PVC.
3. Run I/O from the app pods.
4. Reboot one of the nodes hosting a gluster pod.
5. Wait for all gluster-block related services to come up, then check whether iSCSI discovery works on all nodes and whether 'targetcli ls' lists all devices.

Actual results:
Targets are not discoverable on the node that was rebooted.

Expected results:
Targets should be discoverable on all nodes, including the rebooted one.

Additional info:
sosreports, gluster-block logs, and dmesg output will be attached shortly.

Comment 8 krishnaram Karthick 2017-08-16 05:48:46 UTC
The fix is to make the file '/etc/target/saveconfig.json' persistent when building the container image, and it has to be made in CNS.

Changing the component and release to cns-deployment and cns-3.6 respectively.
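
To see whether a given gluster pod still has its saved target configuration, the file named above can be inspected directly. This is a sketch under an assumption that is not shown in this report: that the file uses the JSON layout written by targetcli, with top-level "storage_objects" and "targets" lists and a "wwn" key per target. The function name `saved_config_summary` is hypothetical:

```python
import json


def saved_config_summary(path: str = "/etc/target/saveconfig.json") -> dict:
    """Count the storage objects and targets recorded in a saved target config."""
    with open(path) as fh:
        cfg = json.load(fh)
    # Assumed layout: top-level lists named "storage_objects" and "targets".
    targets = cfg.get("targets", [])
    return {
        "storage_objects": len(cfg.get("storage_objects", [])),
        "targets": len(targets),
        "wwns": sorted(t.get("wwn", "") for t in targets),
    }
```

If the file is not persisted across a container restart, a summary with zero targets (or a missing file) after a reboot would match the empty 'targetcli ls' output in the description.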

Comment 9 Humble Chirammal 2017-08-16 06:24:11 UTC
"Please note that this issue was not seen with just 1 app pod and 1 gluster-block device"

This is confusing. If it is about the saved target configuration, and there are no changes in the way the block device is created, the behaviour should be the same for one block device as well. Can you please confirm this?

Comment 10 krishnaram Karthick 2017-08-16 07:01:29 UTC
(In reply to Humble Chirammal from comment #9)
> "Please note that this issue was not seen with just 1 app pod and 1
> gluster-block device"
> 
> This is confusing. If it is about the saved target configuration, and there
> are no changes in the way the block device is created, the behaviour should
> be the same for one block device as well. Can you please confirm this?

When I mentioned that the issue was not seen with 1 app pod, I had not checked whether discovery was working. I had only checked that the app pods were up and able to access the device. The intent was to test with a few app pods; the run with 1 app pod was just a quick stability check before the actual test. So yes, discovery was not validated with 1 app pod. Please ignore that statement.

Comment 17 krishnaram Karthick 2017-09-01 15:36:13 UTC
Verified the fix in build - cns-deploy-5.0.0-24.el7rhgs.x86_64

/etc/target/saveconfig.json file is now persisted. 

Steps followed to verify:

1) Got into the gluster pod and ran 'cp /etc/target/saveconfig.json /var/log/glusterfs/'.
2) Rebooted the node where the gluster pod runs.
3) After the node rebooted and the pod came up, ran 'diff /etc/target/saveconfig.json /var/log/glusterfs/saveconfig.json'. There was no diff.

Moving the bug to verified.
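
The copy-then-diff check in the verification steps can also be scripted. A minimal sketch using the Python standard library, where an empty result means the file survived the reboot unchanged; the function name `config_diff` is hypothetical and the two paths are the ones used in the steps above:

```python
import difflib
from pathlib import Path


def config_diff(backup_path: str, live_path: str) -> list:
    """Return unified-diff lines between the pre-reboot copy and the live file."""
    before = Path(backup_path).read_text().splitlines(keepends=True)
    after = Path(live_path).read_text().splitlines(keepends=True)
    return list(
        difflib.unified_diff(before, after, fromfile=str(backup_path), tofile=str(live_path))
    )
```

For the verification above this would be called as `config_diff("/var/log/glusterfs/saveconfig.json", "/etc/target/saveconfig.json")` inside the gluster pod after the reboot.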

Comment 18 errata-xmlrpc 2017-10-11 06:58:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2877

