Bug 1477455
| Summary: | app pod with block vol as pvc fails to restart as iSCSI login fails | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss> |
| Component: | gluster-block | Assignee: | Prasanna Kumar Kalever <prasanna.kalever> |
| Status: | CLOSED ERRATA | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.3 | CC: | agrover, akhakhar, amukherj, annair, hchiramm, jarrpa, kramdoss, madam, mchristi, mliyazud, mzywusko, pkarampu, pprakash, prasanna.kalever, rcyriac, rhs-bugs, rreddy, rtalur, sanandpa, sankarshan, storage-qa-internal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tcmu-runner-1.2.0-14.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1476285 | Environment: | |
| Last Closed: | 2017-09-21 04:20:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1417151, 1476285 | | |
| Attachments: | | | |
Comment 3
Humble Chirammal
2017-08-02 08:00:16 UTC
The originally reported steps to reproduce were missing a few steps. Although this has been discussed among me, Humble, and Prasanna, I am updating the bz for future reference.

Steps to Reproduce:
1. Create 10 app pods with gluster-block as pvc
2. Run IOs
3. Create an 11th app pod with block size exceeding the block-hosting volume (i.e., on a 500 GB block-hosting volume, tried to create a 600 GB block device); this failed as expected
4. Deleted the 11th pod and pvc
5. Created a 12th app pod with block size equal to that of the block-hosting volume (i.e., on a 500 GB block-hosting volume, tried to create a 500 GB block device); this worked as expected
6. Deleted the 12th app pod and pvc
7. Restart the app pod (I had deleted the app pod and the dc had respun a new pod)
8. Check if app pods are running

As an additional comment, regardless of the discovery part, tcmu-runner status, restart, etc. were hung on this setup. Do we know why this service was hung?

This issue was also hit on the test setup used for testing the EFK stack. One of the elasticsearch pods using a gluster-block device crashed with the same error seen in this issue.

Events:
```
FirstSeen  LastSeen  Count  From                                         SubObjectPath  Type     Reason       Message
---------  --------  -----  ----                                         -------------  ----     ------       -------
10m        19s       6      kubelet, dhcp46-100.lab.eng.blr.redhat.com                  Warning  FailedSync   Error syncing pod
8m         19s       5      kubelet, dhcp46-100.lab.eng.blr.redhat.com                  Warning  FailedMount  Unable to mount volumes for pod "logging-es-data-master-p1h7hxxi-1-399sq_logging(0adba18a-77fe-11e7-9fc8-005056a51dd3)": timeout expired waiting for volumes to attach/mount for pod "logging"/"logging-es-data-master-p1h7hxxi-1-399sq". list of unattached/unmounted volumes=[elasticsearch-storage]
```

Only 6 gluster-block devices were created and IOs were run on them continuously. Other than that, no specific failure was induced.

```
oc get pods
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-m4r99                       1/1       Running   0          1d
logging-curator-ops-1-qfvdz                   1/1       Running   1          1d
logging-es-data-master-dj9ey1jh-1-6835j       1/1       Running   0          1d
logging-es-data-master-p1h7hxxi-1-399sq       0/1       Error     0          1d
logging-es-data-master-uaaqo2e8-1-phcg1       0/1       Error     0          1d
logging-es-ops-data-master-7u5z9m9j-1-n95j1   1/1       Running   0          1d
logging-es-ops-data-master-lt229hum-1-z0z16   0/1       Error     0          1d
logging-es-ops-data-master-vw10nnol-1-bz5rx   1/1       Running   1          1d
logging-fluentd-pg3g7                         1/1       Running   1          1d
logging-fluentd-r6w3t                         1/1       Running   0          1d
logging-fluentd-sk6hx                         1/1       Running   0          1d
logging-fluentd-tk335                         1/1       Running   1          1d
logging-kibana-3-6jft3                        2/2       Running   0          1d
logging-kibana-ops-2-07x4d                    2/2       Running   17         1d
```

Once this issue is hit, app pods using gluster-block devices seem to fail one by one. However, discovery now seems to be failing with a different error:

```
[root@dhcp46-100 ~]# iscsiadm -m discovery -t st -p 10.70.46.100
iscsiadm: cannot make connection to 10.70.46.100: Connection refused
iscsiadm: cannot make connection to 10.70.46.100: Connection refused
iscsiadm: cannot make connection to 10.70.46.100: Connection refused
iscsiadm: cannot make connection to 10.70.46.100: Connection refused
iscsiadm: cannot make connection to 10.70.46.100: Connection refused
iscsiadm: cannot make connection to 10.70.46.100: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: encountered connection failure
```

Ramky and I had a look at the setup, and it looks like multipathing was not configured. At the moment we are not sure whether that could be a reason.
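As a side note for anyone triaging a similar state: the commands below are a minimal, hedged sketch of how the target-side services and the initiator-side discovery/login could be re-checked. The portal IP is taken from the output above; the target IQN is a placeholder and the gluster-blockd service name is an assumption, not confirmed on this setup.

```sh
# On the target node: confirm the block target stack is running
systemctl status tcmu-runner gluster-blockd

# On the initiator node: retry discovery against the portal, then log in to one target
iscsiadm -m discovery -t st -p 10.70.46.100
iscsiadm -m node -T <target-iqn> -p 10.70.46.100 --login   # <target-iqn> is a placeholder
iscsiadm -m session -P 1                                   # list active sessions
```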
Ramky will give it another try after configuring multipath.

===============================================================================
The following steps need to be performed to configure multipathing on all the initiator nodes:

```
# mpathconf --enable
# cat >> /etc/multipath.conf
# LIO iSCSI
devices {
        device {
                vendor "LIO-ORG"
                user_friendly_names "yes" # names like mpatha
                path_grouping_policy "failover" # one path per group
                path_selector "round-robin 0"
                failback immediate
                path_checker "tur"
                prio "const"
                no_path_retry 120
                rr_weight "uniform"
        }
}
Ctrl^C
# systemctl restart multipathd
```
================================================================================
These steps are to be applied on every initiator node.

For the above pod, see this:

```
[root@dhcp43-165 ~]# cd /var/lib/origin/openshift.local.volumes/pods/428bc478-9484-11e7-86b3-5254001958c3/volumes/kubernetes.io~iscsi/pvc-413e172b-9484-11e7-86b3-5254001958c3
[root@dhcp43-165 pvc-413e172b-9484-11e7-86b3-5254001958c3]# ls
ls: cannot open directory .: Input/output error
[root@dhcp43-165 pvc-413e172b-9484-11e7-86b3-5254001958c3]#
```

(In reply to Humble Chirammal from comment #38)
> For above pod see this
>
> [root@dhcp43-165 ~]# cd
> /var/lib/origin/openshift.local.volumes/pods/428bc478-9484-11e7-86b3-
> 5254001958c3/volumes/kubernetes.io~iscsi/pvc-413e172b-9484-11e7-86b3-
> 5254001958c3
> [root@dhcp43-165 pvc-413e172b-9484-11e7-86b3-5254001958c3]# ls
> ls: cannot open directory .: Input/output error
> [root@dhcp43-165 pvc-413e172b-9484-11e7-86b3-5254001958c3]#

For this particular device, the I/O error count is 0x13 and eh_timeout is 10. Ideally this node would have been configured for multipath, but at present it is operating on a single path and multipath is not configured.

Created attachment 1324933 [details]
iscsi messages from problematic node.
ISCSI messages from "dhcp43-165"
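For completeness, here is a minimal sketch (assuming device-mapper-multipath is installed) of how the multipath configuration listed in the earlier comment could be verified on an initiator node:

```sh
mpathconf --enable            # create /etc/multipath.conf if missing and enable multipathd
systemctl status multipathd   # confirm the daemon is running
multipath -ll                 # list multipath maps; the LIO-ORG devices should appear here
```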
Apart from the iSCSI errors attached, I/O errors are reported on this device:

```
--snip--
Sep 12 11:28:41 dhcp43-165 kernel: sd 8:0:0:0: rejecting I/O to offline device
Sep 12 11:28:41 dhcp43-165 kernel: sd 8:0:0:0: rejecting I/O to offline device
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065162
Sep 12 11:28:41 dhcp43-165 kernel: sd 8:0:0:0: rejecting I/O to offline device
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065169
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065177
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065181
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065185
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065189
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065157
Sep 12 11:28:41 dhcp43-165 kernel: blk_update_request: I/O error, dev sdg, sector 1065157
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040ca ("xlog_iodone") error 5 numblks 64
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): Log I/O Error Detected. Shutting down filesystem
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040d1 ("xlog_iodone") error 5 numblks 64
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040d9 ("xlog_iodone") error 5 numblks 64
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040dd ("xlog_iodone") error 5 numblks 64
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040e1 ("xlog_iodone") error 5 numblks 64
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040e5 ("xlog_iodone") error 5 numblks 64
Sep 12 11:28:41 dhcp43-165 kernel: XFS (sdg): metadata I/O error: block 0x1040c5 ("xlog_iodone") error 5 numblks 64
--snip--
```

So it is rejecting I/O, and it seems to me we have run into the same issue we encountered in previous iterations.

Comments >= 39 refer to "sdg", which got the same errors.

(In reply to Humble Chirammal from comment #48)
> Comment >=39 refers to "sdg" that too got same errors.

The I/O errors and issue pattern look the same to me as in the last iteration of this issue and the dependent bug's (1480192) analysis.

(In reply to Humble Chirammal from comment #49)
> (In reply to Humble Chirammal from comment #48)
> > Comment >=39 refers to "sdg" that too got same errors.
>
> The i/o errors and issue pattern look same to me against the last iteration
> of this issue and the dependent (1480192) bug analysis.

As per https://bugzilla.redhat.com/show_bug.cgi?id=1480192#c19, that issue is seen without gluster in the picture, whereas in this particular iteration the I/O errors happened because of ping-timeouts in the gluster rpc layer. So they are not the same, even though the end result is I/O errors in dmesg.
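When a device starts rejecting I/O like "sdg" in the logs above, its state can be cross-checked from the initiator side. This is a hedged sketch using standard sysfs and iscsiadm paths; the device name sdg is taken from the logs above:

```sh
cat /sys/block/sdg/device/state                                   # "running" vs "offline" for the SCSI device
iscsiadm -m session -P 3 | grep -B 20 "Attached scsi disk sdg"    # map the disk back to its iSCSI session/target
dmesg | grep -E "sdg|XFS" | tail -n 50                            # recent kernel errors for the device and filesystem
```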
(In reply to Pranith Kumar K from comment #50)
> (In reply to Humble Chirammal from comment #49)
> > (In reply to Humble Chirammal from comment #48)
> > > Comment >=39 refers to "sdg" that too got same errors.
> >
> > The i/o errors and issue pattern look same to me against the last iteration
> > of this issue and the dependent (1480192) bug analysis.
>
> As per https://bugzilla.redhat.com/show_bug.cgi?id=1480192#c19, the issue is
> seen without gluster in picture,

Well, that is why the entire 'software iscsi' stack (target and initiator) was in doubt from the start, and we were/are trying to isolate where the culprit hides and what the triggering point is.

> where as this particular iteration I/O
> errors happened because of ping-timeouts in gluster rpc layer. So they are
> not same even though end result is I/O errors in dmesg.

My assumption is this:

The "unmount" of the share fails on the initiators due to the I/O error, which caused "file exists" on the next container respawn.

Whether all the mounted shares/targets reject I/O at unmount time, or started to reject I/O at some point (maybe after serving I/O for some time, or after a change on the target side), needs to be analysed.

(In reply to Humble Chirammal from comment #52)
> (In reply to Pranith Kumar K from comment #50)
> > (In reply to Humble Chirammal from comment #49)
> > > (In reply to Humble Chirammal from comment #48)
> > > > Comment >=39 refers to "sdg" that too got same errors.
> > >
> > > The i/o errors and issue pattern look same to me against the last iteration
> > > of this issue and the dependent (1480192) bug analysis.
> >
> > As per https://bugzilla.redhat.com/show_bug.cgi?id=1480192#c19, the issue is
> > seen without gluster in picture,
>
> Well, thats why the entire 'software iscsi' stack ( target and initiator)
> was in doubt from start and we were/are trying to isolate where the culprit
> hide or the triggering point.

I think we are going in circles. This iteration of the reproducer is not what happened when krk raised this bz. I think that part of the analysis is conclusive.

> where as this particular iteration I/O
> errors happened because of ping-timeouts in gluster rpc layer. So they are
> not same even though end result is I/O errors in dmesg.
>
> My assumption is this:
>
> The "unmount" of the share fails in initiators due to the I/O error which
> caused "file exists" in next container respawn.

Could you help to confirm this? That would give a good data point.

> If all the mounted shares/targets rejects I/O at unmount time or started to
> reject I/O at some point ( may be after serving i/o for some time or after a
> change in target side) this need to be analysed.

[Side note] As mentioned in a previous comment, we should have configured multipath on these systems; it is missing in this setup (at least on the server the latest comments about the sdg and sda devices refer to) compared to previous iterations.

It looks like commands do get stuck, but handler_glfs gets a reply and sends it to LIO. It looks like the problem is just that something is deleting the target from LIO. We see these errors in LIO:

```
Sep 12 11:26:58 dhcp42-226 kernel: Unable to locate Target IQN: iqn.2016-12.org.gluster-block:df906719-ba77-45ad-814e-01a2589360ae in Storage Node
Sep 12 11:26:58 dhcp42-226 kernel: iSCSI Login negotiation failed.
```

There is no runner or LIO code that would do this that I know of. However, it does not make sense that we would see those errors while a login from another node still works, unless you are creating/deleting targets dynamically.

Created attachment 1325590 [details]
kernel logs; naming kernel_logs_$host_$devname
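To narrow down the "Unable to locate Target IQN" errors above, one approach is to compare the IQN the initiator asks for against what is actually configured on the target node. A rough sketch follows; the volume and block names are placeholders, not taken from this setup:

```sh
# On the target node (dhcp42-226 in the log above)
targetcli ls /iscsi                                    # iSCSI targets currently known to LIO
gluster-block list <block-hosting-volume>              # block devices gluster-block has created on the volume
gluster-block info <block-hosting-volume>/<blockname>  # exported IQN, portals, and status for one block
```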
Ran the same test mentioned in comment 80; no app pod crash was seen. Moving the bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773