Bug 1476285 - [Tracker Bug (gluster-block)] app pod with block vol as pvc fails to restart as iSCSI login fails
Summary: [Tracker Bug (gluster-block)] app pod with block vol as pvc fails to restart as iSCSI login fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: CNS-deployment
Version: cns-3.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: CNS 3.6
Assignee: Humble Chirammal
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On: 1477455
Blocks: 1445448
 
Reported: 2017-07-28 13:55 UTC by krishnaram Karthick
Modified: 2018-12-13 15:20 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1477455
Environment:
Last Closed: 2017-10-11 06:58:29 UTC
Embargoed:


Attachments
Target status (24.98 KB, text/plain)
2017-07-31 08:53 UTC, Humble Chirammal


Links
System ID: Red Hat Product Errata RHEA-2017:2877
Priority: normal
Status: SHIPPED_LIVE
Summary: rhgs-server-container bug fix and enhancement update
Last Updated: 2017-10-11 11:11:39 UTC

Description krishnaram Karthick 2017-07-28 13:55:59 UTC
Description of problem:
On a 3-node CNS setup, 10 app pods were created with PVCs backed by gluster-block devices. Although no issues were seen during creation of the app pods, after an hour or so one of the app pods crashed repeatedly, and the events below show the reason.

  1h    1h      3       kubelet, dhcp47-49.lab.eng.blr.redhat.com       spec.containers{mongodb}        Normal  Pulled          Container image "registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:48e323b31f38ca23bf6c566756c08e7b485d19e5cbee3507b7dd6cbf3b1a9ece" already present on machine
  1h    1h      4       kubelet, dhcp47-49.lab.eng.blr.redhat.com       spec.containers{mongodb}        Normal  Created         Created container
  1h    1h      3       kubelet, dhcp47-49.lab.eng.blr.redhat.com       spec.containers{mongodb}        Warning Failed          Error: failed to start container "mongodb": Error response from daemon: {"message":"mkdir /var/lib/origin/openshift.local.volumes/pods/2d92ccf4-736f-11e7-a04e-00505684d1d7/volumes/kubernetes.io~iscsi/pvc-27b69e09-736f-11e7-a04e-00505684d1d7: file exists"}
  1h    12m     248     kubelet, dhcp47-49.lab.eng.blr.redhat.com       spec.containers{mongodb}        Warning BackOff         Back-off restarting failed container
  1h    2m      311     kubelet, dhcp47-49.lab.eng.blr.redhat.com                                       Warning FailedSync      Error syncing pod
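(For reference, a rough sketch of how the leftover volume directory could be inspected on the node; the paths are taken from the "file exists" event above, and the commands are illustrative rather than part of the original triage:)

# On the node, check whether the volume directory survived the previous pod instance:
ls -ld /var/lib/origin/openshift.local.volumes/pods/2d92ccf4-736f-11e7-a04e-00505684d1d7/volumes/kubernetes.io~iscsi/pvc-27b69e09-736f-11e7-a04e-00505684d1d7

# Check whether anything is still mounted underneath it:
mount | grep pvc-27b69e09-736f-11e7-a04e-00505684d1d7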

After a while, iSCSI login failed for all three target servers, and as a result any app pod that was restarted failed to access its PVs.

  1h	22m	44	kubelet, dhcp47-49.lab.eng.blr.redhat.com		Warning	FailedSync	Error syncing pod
  1h	11m	30	kubelet, dhcp47-49.lab.eng.blr.redhat.com		Warning	FailedMount	MountVolume.SetUp failed for volume "kubernetes.io/iscsi/95762a9d-738a-11e7-a04e-00505684d1d7-pvc-fb4967c1-736f-11e7-a04e-00505684d1d7" (spec.Name: "pvc-fb4967c1-736f-11e7-a04e-00505684d1d7") pod "95762a9d-738a-11e7-a04e-00505684d1d7" (UID: "95762a9d-738a-11e7-a04e-00505684d1d7") with: failed to get any path for iscsi disk, last err seen:
iscsi: failed to sendtargets to portal 10.70.47.72:3260 output: iscsiadm: Login response timeout. Waited 30 seconds and did not get response PDU.
iscsiadm: discovery login to 10.70.47.72 failed, giving up 2
iscsiadm: Could not perform SendTargets discovery: encountered non-retryable iSCSI login failure
, err exit status 19
  1h	2m	53	kubelet, dhcp47-49.lab.eng.blr.redhat.com		Warning	FailedMount	Unable to mount volumes for pod "mongodb-9-1-wm0tq_storage-project(95762a9d-738a-11e7-a04e-00505684d1d7)": timeout expired waiting for volumes to attach/mount for pod "storage-project"/"mongodb-9-1-wm0tq". list of unattached/unmounted volumes=[mongodb-9-data]
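(Initiator-side state for a failure like this can be checked roughly as follows; the portal address is the one from the event above, and the commands are illustrative:)

# List current iSCSI sessions on the initiator node (none are expected if all logins failed):
iscsiadm -m session

# Confirm the target portal is reachable at all:
ping -c 3 10.70.47.72

# Retry discovery against the portal from the error message:
iscsiadm -m discovery -t sendtargets -p 10.70.47.72:3260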

I'm not sure whether the two issues are inter-related, but this is a serious issue: block volumes are no longer accessible.

A manual iSCSI discovery login also fails:

iscsiadm -m discovery -t st -p 10.70.47.49
iscsiadm: Login response timeout. Waited 30 seconds and did not get response PDU.
iscsiadm: discovery login to 10.70.47.49 failed, giving up 2
iscsiadm: Could not perform SendTargets discovery: encountered non-retryable iSCSI login failure
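(Since discovery times out rather than being refused, the target side is worth checking too. A rough sketch, assuming shell access to the gluster-block target node; the exact service names on this setup are my assumption:)

# Confirm the gluster-block target services are alive:
systemctl status gluster-blockd tcmu-runner

# Dump the configured LIO targets and portals:
targetcli ls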


Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-12.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create 10 app pods with gluster-block-backed PVCs
2. Run I/O
3. Restart an app pod (I had deleted the app pod and the dc had respun a new pod; see the example commands after this list)
4. Check if the app pods are running
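(Step 3 above, roughly; the pod name is a placeholder, not from the original report:)

# Delete the app pod; the deployment config respins a replacement:
oc delete pod <app-pod-name>

# Watch the replacement pod come up (step 4):
oc get pods -w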

Actual results:
app pod fails to start

Expected results:
The app pod should be able to access its PVC and come up without any issues.

Additional info:

Comment 4 Humble Chirammal 2017-07-31 08:47:00 UTC
    IQN:		iqn.2016-12.org.gluster-block:05480682-b3b5-42ae-8136-5591ad8c55cf
    IQN:		iqn.2016-12.org.gluster-block:65a334bd-8a86-49f6-b3fa-cff462183df6
    IQN:		iqn.2016-12.org.gluster-block:37cb276e-98f1-47c9-a58d-65ccb3f23e5b
    IQN:		iqn.2016-12.org.gluster-block:5526c757-42cd-4e3d-ae8c-561dda73bdf0
    IQN:		iqn.2016-12.org.gluster-block:4265ef0f-f269-4f62-a8e2-8538b5eae2a9
    IQN:		iqn.2016-12.org.gluster-block:cd3664fe-e97e-4a73-83a4-1dcbc91f2519
    IQN:		iqn.2016-12.org.gluster-block:7e9d060c-6117-474a-b2b2-e2aeb45d3e37
    IQN:		iqn.2016-12.org.gluster-block:b8af7876-d86b-4891-95c0-5e6b452eb900
    IQN:		iqn.2016-12.org.gluster-block:6dd3aa4e-be14-49e8-b3d1-4d7b6fa8052f
    IQN:		iqn.2016-12.org.gluster-block:955537fa-8fb1-4c17-95df-3806e9fd4f21

The above 10 targets exist and are active in the background (see attachment). However, they fail to get discovered from the initiator.
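(A direct login to one of these targets can be attempted to confirm the failure happens at login rather than only at discovery. The IQN is the first one from the list above, the portal is one of the target servers from the description, and the commands are illustrative; since discovery never succeeded, the node record may have to be created manually first:)

# Create the node record by hand, since discovery is failing:
iscsiadm -m node -o new -T iqn.2016-12.org.gluster-block:05480682-b3b5-42ae-8136-5591ad8c55cf -p 10.70.47.49:3260

# Attempt an explicit login (expected to time out in this state):
iscsiadm -m node -T iqn.2016-12.org.gluster-block:05480682-b3b5-42ae-8136-5591ad8c55cf -p 10.70.47.49:3260 --login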

Comment 5 Humble Chirammal 2017-07-31 08:53:19 UTC
Created attachment 1306901 [details]
Target status

Comment 7 krishnaram Karthick 2017-08-02 08:07:14 UTC
The steps after which the issue was hit were missing a few intermediate steps. Although this has already been discussed among me, Humble, and Prasanna, I'm updating the bz for future reference.

Steps to Reproduce:
1. Create 10 app pods with gluster-block-backed PVCs
2. Run I/O
3. Create an 11th app pod with a block device size exceeding the block-hosting volume's capacity (i.e., on a 500GB block-hosting volume, try to create a 600GB block device); this failed as expected (see the sketch after this list)
4. Delete the 11th pod and its PVC
5. Create a 12th app pod with a block device size equal to the block-hosting volume's capacity (i.e., on a 500GB block-hosting volume, try to create a 500GB block device); this worked as expected
6. Delete the 12th app pod and its PVC
7. Restart an app pod (I had deleted the app pod and the dc had respun a new pod)
8. Check if the app pods are running
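(Step 3, roughly, expressed as a PVC request; the storage class name "glusterblock" is my assumption for this setup, and the claim name is a placeholder:)

# Request a block PVC larger than the 500GB block-hosting volume (expected to fail):
oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-oversized
spec:
  storageClassName: glusterblock
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 600Gi
EOF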

Comment 11 Michael Adam 2017-09-13 13:52:17 UTC
The tracked bug is ON_QA, hence moving this one as well.

Comment 12 Michael Adam 2017-09-13 13:57:16 UTC
We need to build new images with the updated tcmu-runner, hence POST and not ON_QA.

Comment 13 Humble Chirammal 2017-09-13 18:12:08 UTC
The tracker bug moved back to the ASSIGNED state based on "failed qa" with the latest tcmu-runner build. I am changing this bug's state as well.

Comment 14 Humble Chirammal 2017-09-18 09:24:27 UTC
Changing to ON_QA, depending on the tracker bug.

Comment 15 krishnaram Karthick 2017-09-18 11:37:58 UTC
Verification of https://bugzilla.redhat.com/show_bug.cgi?id=1477455#c82 was indeed done on a CNS configuration.

verified in cns-deploy-5.0.0-43.el7rhgs.x86_64

Moving the bug to verified.

Comment 16 errata-xmlrpc 2017-10-11 06:58:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2877

