Description of problem:

On a 3-node CNS setup, 10 app pods were created with PVCs backed by gluster-block devices. No issues were seen during creation of the app pods, but after an hour or so one of the app pods crashed consistently, and these events show the reason:

1h 1h 3 kubelet, dhcp47-49.lab.eng.blr.redhat.com spec.containers{mongodb} Normal Pulled Container image "registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:48e323b31f38ca23bf6c566756c08e7b485d19e5cbee3507b7dd6cbf3b1a9ece" already present on machine
1h 1h 4 kubelet, dhcp47-49.lab.eng.blr.redhat.com spec.containers{mongodb} Normal Created Created container
1h 1h 3 kubelet, dhcp47-49.lab.eng.blr.redhat.com spec.containers{mongodb} Warning Failed Error: failed to start container "mongodb": Error response from daemon: {"message":"mkdir /var/lib/origin/openshift.local.volumes/pods/2d92ccf4-736f-11e7-a04e-00505684d1d7/volumes/kubernetes.io~iscsi/pvc-27b69e09-736f-11e7-a04e-00505684d1d7: file exists"}
1h 12m 248 kubelet, dhcp47-49.lab.eng.blr.redhat.com spec.containers{mongodb} Warning BackOff Back-off restarting failed container
1h 2m 311 kubelet, dhcp47-49.lab.eng.blr.redhat.com Warning FailedSync Error syncing pod

After a while, iSCSI login failed for all 3 target servers and, as a result, any app pod that was restarted failed to access its PVs:

1h 22m 44 kubelet, dhcp47-49.lab.eng.blr.redhat.com Warning FailedSync Error syncing pod
1h 11m 30 kubelet, dhcp47-49.lab.eng.blr.redhat.com Warning FailedMount MountVolume.SetUp failed for volume "kubernetes.io/iscsi/95762a9d-738a-11e7-a04e-00505684d1d7-pvc-fb4967c1-736f-11e7-a04e-00505684d1d7" (spec.Name: "pvc-fb4967c1-736f-11e7-a04e-00505684d1d7") pod "95762a9d-738a-11e7-a04e-00505684d1d7" (UID: "95762a9d-738a-11e7-a04e-00505684d1d7") with: failed to get any path for iscsi disk, last err seen:
iscsi: failed to sendtargets to portal 10.70.47.72:3260 output: iscsiadm: Login response timeout. Waited 30 seconds and did not get response PDU.
iscsiadm: discovery login to 10.70.47.72 failed, giving up 2
iscsiadm: Could not perform SendTargets discovery: encountered non-retryable iSCSI login failure
, err exit status 19
1h 2m 53 kubelet, dhcp47-49.lab.eng.blr.redhat.com Warning FailedMount Unable to mount volumes for pod "mongodb-9-1-wm0tq_storage-project(95762a9d-738a-11e7-a04e-00505684d1d7)": timeout expired waiting for volumes to attach/mount for pod "storage-project"/"mongodb-9-1-wm0tq". list of unattached/unmounted volumes=[mongodb-9-data]

I'm not sure whether the two issues are inter-related, but this is a serious issue: the block volumes are no longer accessible. Manually attempting an iSCSI discovery also fails:

iscsiadm -m discovery -t st -p 10.70.47.49
iscsiadm: Login response timeout. Waited 30 seconds and did not get response PDU.
iscsiadm: discovery login to 10.70.47.49 failed, giving up 2
iscsiadm: Could not perform SendTargets discovery: encountered non-retryable iSCSI login failure

Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-12.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create 10 app pods with gluster-block-backed PVCs
2. Run I/O
3. Restart an app pod (I had deleted the app pod and the dc respun a new pod)
4. Check if the app pods are running

Actual results:
The app pod fails to start.

Expected results:
The app pod should be able to access its PVC and come up without any issues.

Additional info:
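For reference, a minimal sketch of checks that can be run on the affected initiator node to confirm the two symptoms above. The pod UID, volume path, and portal IP are taken from the events; everything else is illustrative, not the exact commands captured during the failure.

# Check whether a stale per-pod iSCSI volume directory is left behind;
# the "mkdir ... file exists" error suggests a leftover directory from a
# previous pod instance (pod UID is the one from the event log).
ls -l /var/lib/origin/openshift.local.volumes/pods/2d92ccf4-736f-11e7-a04e-00505684d1d7/volumes/kubernetes.io~iscsi/

# List current iSCSI sessions on the initiator; with the targets
# unreachable this is expected to show stale sessions or none at all.
iscsiadm -m session

# Re-run SendTargets discovery against one of the portals; in this setup
# it times out after 30 seconds with a non-retryable login failure.
iscsiadm -m discovery -t st -p 10.70.47.49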
IQN: iqn.2016-12.org.gluster-block:05480682-b3b5-42ae-8136-5591ad8c55cf
IQN: iqn.2016-12.org.gluster-block:65a334bd-8a86-49f6-b3fa-cff462183df6
IQN: iqn.2016-12.org.gluster-block:37cb276e-98f1-47c9-a58d-65ccb3f23e5b
IQN: iqn.2016-12.org.gluster-block:5526c757-42cd-4e3d-ae8c-561dda73bdf0
IQN: iqn.2016-12.org.gluster-block:4265ef0f-f269-4f62-a8e2-8538b5eae2a9
IQN: iqn.2016-12.org.gluster-block:cd3664fe-e97e-4a73-83a4-1dcbc91f2519
IQN: iqn.2016-12.org.gluster-block:7e9d060c-6117-474a-b2b2-e2aeb45d3e37
IQN: iqn.2016-12.org.gluster-block:b8af7876-d86b-4891-95c0-5e6b452eb900
IQN: iqn.2016-12.org.gluster-block:6dd3aa4e-be14-49e8-b3d1-4d7b6fa8052f
IQN: iqn.2016-12.org.gluster-block:955537fa-8fb1-4c17-95df-3806e9fd4f21

The above 10 targets exist and are active in the background (see attachment). However, they fail to be discovered from the initiator.
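For completeness, a rough sketch of how the target-side state can be confirmed on one of the target servers (inside the gluster pod in a CNS setup). The volume and block device names below are placeholders, not the actual names from this setup.

# Confirm the LIO/TCMU targets are still configured and exported.
targetcli ls

# List and inspect the block devices hosted on the block-hosting volume;
# "vol_xxxx" and "blockvol_xxxx" are placeholders for the real names.
gluster-block list vol_xxxx
gluster-block info vol_xxxx/blockvol_xxxx

# tcmu-runner handles the backstore I/O for these targets; if it has hung
# or crashed, initiator logins can time out exactly as seen here.
systemctl status tcmu-runner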
Created attachment 1306901 [details] Target status
The earlier "Steps to Reproduce" were missing a few steps. Although this has already been discussed among me, Humble, and Prasanna, I am updating the BZ for future reference.

Steps to Reproduce:
1. Create 10 app pods with gluster-block-backed PVCs
2. Run I/O
3. Create an 11th app pod with a block size exceeding that of the block-hosting volume (i.e., on a 500GB block-hosting volume, try to create a 600GB block device). This failed, as expected. (See the illustrative commands after these steps.)
4. Delete the 11th app pod and its PVC
5. Create a 12th app pod with a block size equal to that of the block-hosting volume (i.e., on a 500GB block-hosting volume, create a 500GB block device). This worked, as expected.
6. Delete the 12th app pod and its PVC
7. Restart an app pod (I had deleted the app pod and the dc respun a new pod)
8. Check if the app pods are running
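As a rough illustration of steps 3 and 4 above, assuming a gluster-block storage class: the storage class name, claim name, and pod name below are placeholders, not the exact objects used in this run.

# Request a 600Gi PVC from a gluster-block storage class whose
# block-hosting volume is only 500GB; provisioning is expected to fail.
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim11
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: glusterfs-block
  resources:
    requests:
      storage: 600Gi
EOF

# The claim should stay Pending / fail provisioning (step 3).
oc get pvc claim11

# Clean up the pod and its claim before retrying (steps 4 and 6).
oc delete pod mongodb-11-1-xxxxx
oc delete pvc claim11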
The tracker bug is ON_QA, hence moving this one as well.
We need to build new images with the updated tcmu-runner, hence POST and not ON_QA.
The tracker bug moved to the ASSIGNED state based on "failed QA" with the latest tcmu-runner build. I am changing this bug's state as well.
Changing to ON_QA, following the tracker bug.
Indeed, verification of https://bugzilla.redhat.com/show_bug.cgi?id=1477455#c82 was done on a CNS configuration.

Verified in cns-deploy-5.0.0-43.el7rhgs.x86_64. Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2877