Description of problem:

On a CNS setup with 3 worker nodes, 10 app pods were created with gluster-block devices as PVCs. After one of the nodes hosting a gluster pod was rebooted, no targets were seen on that node and target discovery against it fails.

Please note that this issue was not seen with just 1 app pod and 1 gluster-block device.

[root@dhcp46-248 ~]# iscsiadm -m discovery -t st -p 10.70.46.248
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: encountered connection failure

From the dmesg of 10.70.46.248, I see there is a timeout. (The complete output shall be attached.)

[  108.273653] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  132.602560]  connection33:0: detected conn error (1020)
[  132.602724]  connection7:0: detected conn error (1020)
[  132.609926]  connection10:0: detected conn error (1020)
[  132.661266]  connection16:0: detected conn error (1020)
[  146.436854]  session13: session recovery timed out after 120 secs
[  146.436864]  session7: session recovery timed out after 120 secs
[  146.436868]  session8: session recovery timed out after 120 secs
[  146.436872]  session9: session recovery timed out after 120 secs
[  146.436906]  session10: session recovery timed out after 120 secs
[  146.436910]  session12: session recovery timed out after 120 secs
[  146.436914]  session11: session recovery timed out after 120 secs
[  147.973445]  session33: session recovery timed out after 120 secs
[  147.973453]  session35: session recovery timed out after 120 secs
[  147.973455]  session34: session recovery timed out after 120 secs
[  147.973458]  session15: session recovery timed out after 120 secs
[  147.973460]  session16: session recovery timed out after 120 secs
[  147.973462]  session14: session recovery timed out after 120 secs

All app pods are up and running after the reboot:

oc get pods
NAME                               READY     STATUS    RESTARTS   AGE
glusterblock-provisioner-1-kpbcx   1/1       Running   4          4d
glusterfs-b6s3d                    1/1       Running   0          19h
glusterfs-gmlvv                    1/1       Running   0          19h
glusterfs-xnklw                    1/1       Running   1          19h
heketi-1-3895t                     1/1       Running   0          19h
jenkins-1-1-46vbx                  1/1       Running   0          49m
jenkins-10-1-0hq63                 1/1       Running   1          48m
jenkins-2-1-l6wr6                  1/1       Running   1          49m
jenkins-3-1-qm1sp                  1/1       Running   0          48m
jenkins-4-1-mfhgq                  1/1       Running   1          48m
jenkins-5-1-wtkgh                  1/1       Running   1          48m
jenkins-6-1-nf45x                  1/1       Running   0          48m
jenkins-7-1-28jwf                  1/1       Running   0          48m
jenkins-8-1-jbq5l                  1/1       Running   0          48m
jenkins-9-1-xg0xp                  1/1       Running   0          48m
storage-project-router-1-21vfw     1/1       Running   5          5d

Targets are not seen on 10.70.46.248:

oc rsh glusterfs-xnklw    (This is the gluster pod on 10.70.46.248)
sh-4.2# 
sh-4.2# 
sh-4.2# targetcli ls
o- / [...]
  o- backstores [...]
  | o- block [Storage Objects: 0]
  | o- fileio [Storage Objects: 0]
  | o- pscsi [Storage Objects: 0]
  | o- ramdisk [Storage Objects: 0]
  | o- user:glfs [Storage Objects: 0]
  o- iscsi [Targets: 0]
  o- loopback [Targets: 0]

sh-4.2# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-08-15 07:11:51 UTC; 46min ago
 Main PID: 225 (gluster-blockd)
   CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode201fbd6_80eb_11e7_8f8c_00505684d1d7.slice/docker-29d41f84b60cf07d74e96de8efb1472972e2a4e80bfeca32b1cac84d5f0b5e04.scope/system.slice/gluster-blockd.service
           └─225 /usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO...

Aug 15 07:11:51 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Started Gluster...
Aug 15 07:11:51 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Starting Gluste...
Hint: Some lines were ellipsized, use -l to show in full.

sh-4.2# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2017-08-15 07:11:50 UTC; 46min ago
 Main PID: 172 (tcmu-runner)
   CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode201fbd6_80eb_11e7_8f8c_00505684d1d7.slice/docker-29d41f84b60cf07d74e96de8efb1472972e2a4e80bfeca32b1cac84d5f0b5e04.scope/system.slice/tcmu-runner.service
           └─172 /usr/bin/tcmu-runner --tcmu-log-dir=/var/log/gluster-block/

Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Starting LIO Us...
Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com tcmu-runner[172]: tcmu-runner : load_ou...
Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com tcmu-runner[172]: 2017-08-1...
Aug 15 07:11:50 dhcp46-248.lab.eng.blr.redhat.com systemd[1]: Started LIO Use...
Hint: Some lines were ellipsized, use -l to show in full.

Other logs/information:

[root@dhcp46-248 ~]# ls -l /var/crash/
total 0

[root@dhcp46-248 ~]# lsmod | grep 'target'
iscsi_target_mod      302966  1
target_core_user       23936  0
uio                    19259  1 target_core_user
target_core_mod       367918  3 iscsi_target_mod,target_core_user
crc_t10dif             12714  2 target_core_mod,sd_mod

[root@dhcp46-248 ~]# lsmod | grep 'multi'
xt_multiport           12798  1
dm_multipath           27427  5 dm_round_robin
dm_mod                123303  95 dm_round_robin,dm_multipath,dm_log,dm_persistent_data,dm_mirror,dm_bufio,dm_thin_pool

cat /etc/sysconfig/iptables
# Generated by iptables-save v1.4.21 on Wed Jul  5 13:40:34 2017
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:OS_FIREWALL_ALLOW - [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j OS_FIREWALL_ALLOW
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 10250 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 443 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 4789 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24007 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24008 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2222 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m multiport --dports 49152:49664 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 111 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24006 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 3260 -j ACCEPT
COMMIT
# Completed on Wed Jul  5 13:40:34 2017

Block-hosting volume:

sh-4.2# gluster vol list
heketidbstorage
vol_e13fd33c564cd64cd7d7849ccdf1d704

sh-4.2# gluster vol status vol_e13fd33c564cd64cd7d7849ccdf1d704
Status of volume: vol_e13fd33c564cd64cd7d7849ccdf1d704
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.248:/var/lib/heketi/mounts/v
g_0beb90cbab247e75a6e4a3a101ec1bfb/brick_90
d48523fba33a8d490e270dbf74c150/brick        49152     0          Y       2382
Brick 10.70.47.72:/var/lib/heketi/mounts/vg
_17681cb20e6a51e1f62a4e48e58dede2/brick_ae9
7d4c5eac71ea7b73f1738d575d5e3/brick         49152     0          Y       2801
Brick 10.70.47.49:/var/lib/heketi/mounts/vg
_856a20c7776264fb613f75a6f3593191/brick_d92
0187a1411a055f4dace3f20ca9780/brick         49152     0          Y       2909
Self-heal Daemon on localhost               N/A       N/A        Y       3685
Self-heal Daemon on dhcp47-49.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       3814
Self-heal Daemon on 10.70.46.248            N/A       N/A        Y       2391

Task Status of Volume vol_e13fd33c564cd64cd7d7849ccdf1d704
------------------------------------------------------------------------------
There are no active volume tasks

sh-4.2# gluster vol heal vol_e13fd33c564cd64cd7d7849ccdf1d704 info
Brick 10.70.46.248:/var/lib/heketi/mounts/vg_0beb90cbab247e75a6e4a3a101ec1bfb/brick_90d48523fba33a8d490e270dbf74c150/brick
Status: Connected
Number of entries: 0

Brick 10.70.47.72:/var/lib/heketi/mounts/vg_17681cb20e6a51e1f62a4e48e58dede2/brick_ae97d4c5eac71ea7b73f1738d575d5e3/brick
Status: Connected
Number of entries: 0

Brick 10.70.47.49:/var/lib/heketi/mounts/vg_856a20c7776264fb613f75a6f3593191/brick_d920187a1411a055f4dace3f20ca9780/brick
Status: Connected
Number of entries: 0

iSCSI discovery against the other 2 nodes works (pasting the output from just one node):

[root@dhcp47-49 ~]# iscsiadm -m discovery -t st -p 10.70.47.49
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:e908e9c0-66ab-4cbb-a360-3e9e0cbcf9d8
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:e908e9c0-66ab-4cbb-a360-3e9e0cbcf9d8
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:e908e9c0-66ab-4cbb-a360-3e9e0cbcf9d8
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:755c65a5-c286-4ea0-901f-0e282a66db0f
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:755c65a5-c286-4ea0-901f-0e282a66db0f
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:755c65a5-c286-4ea0-901f-0e282a66db0f
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:60bada90-6392-4af9-866a-ee3388c9e4e1
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:60bada90-6392-4af9-866a-ee3388c9e4e1
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:60bada90-6392-4af9-866a-ee3388c9e4e1
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:18cdd32f-dbe9-4a12-ac7c-e5207867d0da
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:18cdd32f-dbe9-4a12-ac7c-e5207867d0da
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:18cdd32f-dbe9-4a12-ac7c-e5207867d0da
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:b90ee999-2860-4723-92c8-7a63dab56c95
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:b90ee999-2860-4723-92c8-7a63dab56c95
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:b90ee999-2860-4723-92c8-7a63dab56c95
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:5d0a0d4b-9035-4b71-8270-07c4acc60e9a
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:5d0a0d4b-9035-4b71-8270-07c4acc60e9a
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:5d0a0d4b-9035-4b71-8270-07c4acc60e9a
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:f2346d41-b484-46d8-a2de-629c5412996f
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:f2346d41-b484-46d8-a2de-629c5412996f
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:f2346d41-b484-46d8-a2de-629c5412996f
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:8f48dca4-9ee8-4dda-8163-e00791b7d062
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:8f48dca4-9ee8-4dda-8163-e00791b7d062
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:8f48dca4-9ee8-4dda-8163-e00791b7d062
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:d666317b-93b3-4156-984d-98a31406d87f
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:d666317b-93b3-4156-984d-98a31406d87f
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:d666317b-93b3-4156-984d-98a31406d87f
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:a85853e1-0e40-489d-a2e1-2e291ece9ed6
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:a85853e1-0e40-489d-a2e1-2e291ece9ed6
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:a85853e1-0e40-489d-a2e1-2e291ece9ed6
10.70.46.248:3260,1 iqn.2016-12.org.gluster-block:7b41ddd3-1bb8-433f-9d33-db880495acd9
10.70.47.49:3260,2 iqn.2016-12.org.gluster-block:7b41ddd3-1bb8-433f-9d33-db880495acd9
10.70.47.72:3260,3 iqn.2016-12.org.gluster-block:7b41ddd3-1bb8-433f-9d33-db880495acd9

[root@dhcp47-49 ~]# iscsiadm -m discovery -t st -p 10.70.46.248
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: cannot make connection to 10.70.46.248: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: encountered connection failure

The block-hosting volume is up and there are no pending heals (see the gluster vol status and gluster vol heal info output above).

Version-Release number of selected component (if applicable):

sh-4.2# rpm -qa | grep 'gluster'
glusterfs-fuse-3.8.4-40.el7rhgs.x86_64
glusterfs-server-3.8.4-40.el7rhgs.x86_64
gluster-block-0.2.1-6.el7rhgs.x86_64
glusterfs-libs-3.8.4-40.el7rhgs.x86_64
glusterfs-3.8.4-40.el7rhgs.x86_64
glusterfs-api-3.8.4-40.el7rhgs.x86_64
glusterfs-cli-3.8.4-40.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-40.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-40.el7rhgs.x86_64

How reproducible:
The test case was run only once, but the steps are quite simple, so it should be reproducible.

Steps to Reproduce:
1. Create a 3-node CNS setup.
2. Create 10 app pods that use gluster-block devices as PVCs.
3. Run I/O from the app pods.
4. Reboot one of the nodes hosting a gluster pod.
5. Wait for all gluster-block related services to come up, then check whether iSCSI discovery works on all nodes and whether targetcli ls lists all devices.

Actual results:
Targets are not discoverable from the node that was rebooted.

Expected results:
Targets should be discovered.

Additional info:
sosreports, gluster-block logs, and dmesg output shall be attached shortly.
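
The discovery check in step 5 can be scripted. The following is a minimal sketch, not part of the original report: the helper names count_targets and check_portal are hypothetical, it assumes iscsiadm is installed on the node where it runs, and it assumes SendTargets records have the form "<ip>:<port>,<tpgt> <iqn>" as in the discovery output above.

```shell
# Count distinct target IQNs in SendTargets output read from stdin.
# Each record is expected to look like "<ip>:<port>,<tpgt> <iqn>".
count_targets() {
    awk 'NF == 2 && $2 ~ /^iqn\./ {
             if (!($2 in seen)) { seen[$2] = 1; n++ }
         }
         END { print n + 0 }'
}

# Run SendTargets discovery against one portal and print a one-line summary.
# Returns non-zero when discovery fails (for example "Connection refused").
check_portal() {
    portal=$1
    if out=$(iscsiadm -m discovery -t st -p "$portal" 2>/dev/null); then
        echo "$portal: OK, $(printf '%s\n' "$out" | count_targets) distinct target(s)"
    else
        echo "$portal: discovery FAILED"
        return 1
    fi
}
```

For the setup in this report, the helper could be called once per portal, for example: for p in 10.70.46.248 10.70.47.49 10.70.47.72; do check_portal "$p"; done. The discovery output above lists 11 distinct IQNs, so a healthy portal would be expected to report 11 targets, while the rebooted node would be expected to report a failure.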
The fix is to make the file '/etc/target/saveconfig.json' persistent when building the container image, and the change has to be made in CNS. Changing the component and release to cns-deployment and cns-3.6 respectively.
"Please note that this issue was not seen with just 1 app pod and 1 gluster-block device"

This is confusing. If the problem is about the saved target configuration, and there are no changes in the way the block device is created, the behaviour should be the same for one block device as well. Can you please confirm this?
(In reply to Humble Chirammal from comment #9)
> "Please note that this issue was not seen with just 1 app pod and 1
> gluster-block device"
>
> This is confusing, if its abt the saved target configuration and if there
> are no changes in the way block device is created, it should same behaviour
> for one block device as well. Can you please confirm this ?

When I mentioned that the issue was not seen with 1 app pod, I had not checked whether discovery was working. I had only checked that the app pods were up and able to access the device. The intent was to test with a few app pods; the test with 1 app pod was just a quick stability check before the actual test. So yes, discovery was not validated with 1 app pod. Please ignore that statement.
Verified the fix in build cns-deploy-5.0.0-24.el7rhgs.x86_64. The /etc/target/saveconfig.json file is now persisted.

Steps followed to verify:
1) Got into the gluster pod and ran 'cp /etc/target/saveconfig.json /var/log/glusterfs/'.
2) Rebooted the node where the gluster pod runs.
3) After the node rebooted and the pod came up, ran 'diff /etc/target/saveconfig.json /var/log/glusterfs/saveconfig.json'. There was no diff.

Moving the bug to verified.
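
The verification steps above could be wrapped in two small helpers, sketched here under stated assumptions: the function names snapshot_config and config_unchanged are hypothetical, the default paths are the ones used in the steps, and the backup directory is assumed to survive the reboot (as /var/log/glusterfs did in this verification). This is an illustration, not the procedure used by QE.

```shell
# Paths default to the ones used in the verification steps; both can be
# overridden through the environment, e.g. for a dry run on scratch files.
CONFIG=${CONFIG:-/etc/target/saveconfig.json}
BACKUP=${BACKUP:-/var/log/glusterfs/saveconfig.json}

# Before the reboot: keep a copy of the saved target configuration.
snapshot_config() {
    cp -p "$CONFIG" "$BACKUP"
}

# After the reboot: succeed only if the configuration still exists,
# is non-empty, and is byte-identical to the copy taken earlier.
config_unchanged() {
    if [ ! -s "$CONFIG" ]; then
        echo "missing or empty: $CONFIG"
        return 1
    fi
    if cmp -s "$CONFIG" "$BACKUP"; then
        echo "saveconfig.json persisted"
    else
        echo "saveconfig.json differs from the pre-reboot copy"
        return 1
    fi
}
```

Inside the gluster pod, snapshot_config would be run before rebooting the node and config_unchanged after the pod comes back up. The non-empty check is an addition to the steps above.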
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2877