1480192 – App pod fails to start with error 'mkdir /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: file exists'

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1480192 - App pod fails to start with error 'mkdir /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: file exists'

Summary: App pod fails to start with error 'mkdir /var/lib/origin/openshift.local.volu...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	iscsi-initiator-utils
Sub Component:
Version:	7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Chris Leech
QA Contact:	Filip Suba
Docs Contact:
URL:
Whiteboard:
Depends On:	1593197
Blocks:	1550279
TreeView+	depends on / blocked

Reported:	2017-08-10 11:26 UTC by krishnaram Karthick
Modified:	2023-09-15 00:03 UTC (History)
CC List:	25 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1593197 (view as bug list)
Environment:
Last Closed:	2020-01-16 15:08:21 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
kubelet log for pod mongo-2 (4.03 MB, text/plain) 2017-08-10 16:35 UTC, Humble Chirammal	no flags	Details
dmesg (1015.62 KB, text/plain) 2017-08-11 12:04 UTC, Jan Safranek	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1477455	0	unspecified	CLOSED	app pod with block vol as pvc fails to restart as iSCSI login fails	2021-02-22 00:41:40 UTC

Internal Links: 1477455

Description krishnaram Karthick 2017-08-10 11:26:18 UTC

Description of problem:

One of the app pod fails to start and has gone into CrashLoopBackOff status. The persistent storage for this app pod is provided by gluster-block device. App pod fails to start with this error message 'mkdir /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: file exists'

From the logs of app pod, it looks like for some reason the app pod has restarted after which the pod tries to do a mkdir and fails there. App pod logs shall be attached shortly.

oc get pods -o wide
NAME                               READY     STATUS             RESTARTS   AGE       IP            NODE
glusterblock-provisioner-1-wvk3n   1/1       Running            0          2d        10.131.0.5    dhcp46-13.lab.eng.blr.redhat.com
glusterfs-7fww2                    1/1       Running            1          2d        10.70.47.69   dhcp47-69.lab.eng.blr.redhat.com
glusterfs-9x8vg                    1/1       Running            1          2d        10.70.47.23   dhcp47-23.lab.eng.blr.redhat.com
glusterfs-br2qq                    1/1       Running            4          2d        10.70.46.11   dhcp46-11.lab.eng.blr.redhat.com
heketi-1-d47vc                     1/1       Running            0          1d        10.128.2.45   dhcp47-23.lab.eng.blr.redhat.com
mongo-1-1-khvhw                    1/1       Running            0          1d        10.131.0.36   dhcp46-13.lab.eng.blr.redhat.com
mongo-10-1-lzzdr                   1/1       Running            0          1d        10.129.0.49   dhcp47-69.lab.eng.blr.redhat.com
mongo-11-1-brbn0                   1/1       Running            0          4h        10.129.0.51   dhcp47-69.lab.eng.blr.redhat.com
mongo-2-1-89z37                    0/1       CrashLoopBackOff   298        1d        10.129.0.46   dhcp47-69.lab.eng.blr.redhat.com
mongo-3-1-ph9f2                    1/1       Running            1          1d        10.130.0.47   dhcp46-11.lab.eng.blr.redhat.com
mongo-4-1-ljgjq                    1/1       Running            0          1d        10.128.2.49   dhcp47-23.lab.eng.blr.redhat.com
mongo-5-1-hwv62                    1/1       Running            0          1d        10.130.0.48   dhcp46-11.lab.eng.blr.redhat.com
mongo-6-1-pmpnt                    1/1       Running            0          1d        10.129.0.48   dhcp47-69.lab.eng.blr.redhat.com
mongo-7-1-300l1                    1/1       Running            0          1d        10.130.0.49   dhcp46-11.lab.eng.blr.redhat.com
mongo-8-1-m630j                    1/1       Running            0          1d        10.128.2.50   dhcp47-23.lab.eng.blr.redhat.com
mongo-9-1-lbr8w                    1/1       Running            0          9h        10.130.0.51   dhcp46-11.lab.eng.blr.redhat.com
storage-project-router-2-56dv2     1/1       Running            0          2d        10.70.46.13   dhcp46-13.lab.eng.blr.redhat.com


oc describe pods/mongo-2-1-89z37
Name:                   mongo-2-1-89z37
Namespace:              storage-project
Security Policy:        privileged
Node:                   dhcp47-69.lab.eng.blr.redhat.com/10.70.47.69
Start Time:             Wed, 09 Aug 2017 15:16:24 +0530
Labels:                 app=mongodb-persistent
                        deployment=mongo-2-1
                        deploymentconfig=mongo-2
                        name=mongo-2
Annotations:            kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"storage-project","name":"mongo-2-1","uid":"96b5b242-7ce7-
11e7-8464-005...
                        openshift.io/deployment-config.latest-version=1
                        openshift.io/deployment-config.name=mongo-2
                        openshift.io/deployment.name=mongo-2-1
                        openshift.io/generated-by=OpenShiftNewApp
                        openshift.io/scc=privileged
Status:                 Running
IP:                     10.129.0.46
Controllers:            ReplicationController/mongo-2-1
Containers:
  mongodb:
    Container ID:       docker://568eb3afa2c5464cbdc3dfd49f9fef6ce7b5556781197845f5c0179d90b0a845
    Image:              registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:902e0ffb919b8f627ba07321aa51539bcb5a11c1e500c5217b82ea2e465c7d2b
    Image ID:           docker-pullable://registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:902e0ffb919b8f627ba07321aa51539bcb5a11c1e500c5217b82ea2e465c7d2b
    Port:               27017/TCP
    State:              Waiting
      Reason:           CrashLoopBackOff
    Last State:         Terminated
      Reason:           ContainerCannotRun
      Message:          mkdir /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: file exists
      Exit Code:        128
      Started:          Thu, 10 Aug 2017 16:32:39 +0530
      Finished:         Thu, 10 Aug 2017 16:32:39 +0530
    Ready:              False
    Restart Count:      297


Version-Release number of selected component (if applicable):
oc version
oc v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
Frequently

Steps to Reproduce:
1. create 10 app pods with gluster-block as backend for PVC
2. check if all the pods are up after couple of hours

Actual results:
one of the app pod crashes

Expected results:
no app pod failure should be seen

Logs shall be attached shortly

Comment 3 Matthew Wong 2017-08-10 15:47:41 UTC

What's the output from `stat /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97`? I think the only place Docker will call mkdir is https://github.com/moby/moby/blob/1.12.x/volume/volume.go#L120 (mkdirall -> mkdir) so something must be strange about that gluster-block mountpoint.

Comment 4 Humble Chirammal 2017-08-10 16:32:06 UTC

I kind of looked into this error in kubelet and I do suspect this is an issue with Docker. The interesting thing which I cannot understand here is, in next restart ( after a crash or after liveness probe failed), why the 'same' mount point is retried to recreate. 

1) The mount path says the POD UUID is same in next restart -> I am not 100% sure this is expected to be same "UUID" in a restart attempt 

2) If its expected to same UUID, the mount point would have been wiped before trying next "mkdir" or when it try to recreate the pod.

Karthick, can u attach the logs which I captured specifically on this pod lifecycle?

Comment 5 Humble Chirammal 2017-08-10 16:35:52 UTC

Created attachment 1311830 [details]
kubelet log for pod mongo-2

Comment 6 Humble Chirammal 2017-08-10 16:39:13 UTC

Comment#5 is the complete log of crashing pod where one of the instance looks like this :

-snip- from above log:

Aug  9 15:18:44 localhost atomic-openshift-node: I0809 15:18:44.843326    2918 prober.go:106] Liveness probe for "mongo-2-1-89z37_storage-project(9ba2d5b4-7ce7-11e7-8464-005056a56b97):mongodb" failed (failure): dial tcp 10.129.0.46:27017: getsockopt: connection refused
Aug  9 15:18:45 localhost atomic-openshift-node: I0809 15:18:45.034949    2918 prober.go:106] Readiness probe for "mongo-2-1-89z37_storage-project(9ba2d5b4-7ce7-11e7-8464-005056a56b97):mongodb" failed (failure): MongoDB shell version: 3.2.10
Aug  9 15:18:45 localhost journal: 2017-08-09T09:48:45.779+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/var/lib/mongodb/data 64-bit host=mongo-2-1-89z37
Aug  9 15:18:55 localhost atomic-openshift-node: I0809 15:18:55.279488    2918 roundrobin.go:284] LoadBalancerRR: Setting endpoints for storage-project/mongo-2:mongo to [10.129.0.46:27017]
Aug  9 15:35:02 localhost atomic-openshift-node: I0809 15:35:02.486605    2918 kubelet.go:1847] SyncLoop (PLEG): "mongo-2-1-89z37_storage-project(9ba2d5b4-7ce7-11e7-8464-005056a56b97)", event: &pleg.PodLifecycleEvent{ID:"9ba2d5b4-7ce7-11e7-8464-005056a56b97", Type:"ContainerDied", Data:"748e2fd1fca5033438ca87bba9e71c46a2237faf114d3435caae91a361884f7c"}
Aug  9 15:35:02 localhost atomic-openshift-node: I0809 15:35:02.543773    2918 roundrobin.go:305] LoadBalancerRR: Removing endpoints for storage-project/mongo-2:mongo
Aug  9 15:35:02 localhost atomic-openshift-node: I0809 15:35:02.791449    2918 kuberuntime_manager.go:458] Container {Name:mongodb Image:registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:902e0ffb919b8f627ba07321aa51539bcb5a11c1e500c5217b82ea2e465c7d2b Command:[] Args:[] WorkingDir: Ports:[{Name: HostPort:0 ContainerPort:27017 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:MONGODB_USER Value:user0PO ValueFrom:nil} {Name:MONGODB_PASSWORD Value:RFhLCWFGdBqVHVVw ValueFrom:nil} {Name:MONGODB_DATABASE Value:sampledb ValueFrom:nil} {Name:MONGODB_ADMIN_PASSWORD Value:4kXvmjg4qqehJ2SU ValueFrom:nil}] Resources:{Limits:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:mongo-2-data ReadOnly:false MountPath:/var/lib/mongodb/data SubPath:} {Name:default-token-8zw5f ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:nil,TCPSocket:&TCPSocketAction{Port:27017,},},InitialDelaySeconds:30,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[/bin/sh -i -c mongo 127.0.0.1:27017/$MONGODB_DATABASE -u $MONGODB_USER -p $MONGODB_PASSWORD --eval="quit()"],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:3,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*false,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Aug  9 15:35:02 localhost atomic-openshift-node: I0809 15:35:02.791618    2918 kuberuntime_manager.go:742] checking backoff for container "mongodb" in pod "mongo-2-1-89z37_storage-project(9ba2d5b4-7ce7-11e7-8464-005056a56b97)"
Aug  9 15:35:04 localhost atomic-openshift-node: I0809 15:35:04.842896    2918 kubelet.go:1880] SyncLoop (container unhealthy): "mongo-2-1-89z37_storage-project(9ba2d5b4-7ce7-11e7-8464-005056a56b97)"
Aug  9 15:35:05 localhost atomic-openshift-node: E0809 15:35:05.459662    2918 pod_workers.go:182] Error syncing pod 9ba2d5b4-7ce7-11e7-8464-005056a56b97 ("mongo-2-1-89z37_storage-project(9ba2d5b4-7ce7-11e7-8464-005056a56b97)"), skipping: failed to "StartContainer" for "mongodb" with RunContainerError: "failed to start container \"c8275d8fb46d7079a94cbefa6eb65b3526ac8cc8c0b4725b020ab6bee670445a\": Error response from daemon: {\"message\":\"mkdir /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: file exists\"}"

--/snip--

This is where it started this issue and from on the error is same and till now, it has restart almost "298" times and never got out of this loop

Comment 7 Humble Chirammal 2017-08-10 16:44:00 UTC

Huamin mentioned this thread to me http://post-office.corp.redhat.com/archives/openshift-sme/2017-July/msg01366.html where it happened after restart of "docker" deamon. Its different plugin ( gluster)  compared this plugin ( iscsi).

afaict, no docker restarted happend in this case and its just pod restart.

Comment 8 Jan Safranek 2017-08-11 12:02:55 UTC

(In reply to Matthew Wong from comment #3)
> What's the output from `stat
> /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-
> 005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-
> 005056a56b97`? I think the only place Docker will call mkdir is
> https://github.com/moby/moby/blob/1.12.x/volume/volume.go#L120 (mkdirall ->
> mkdir) so something must be strange about that gluster-block mountpoint.

The directory exists and /dev/mapper/mpathd is mounted there. What's weird that any access to the directory ends with Input/output error:

$ ls -la /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97
ls: cannot access /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: Input/output error


And any direct access to the block device ends with error too:

$ ls -la /dev/mapper/mpathd
lrwxrwxrwx. 1 root root 8 10. srp 17.08 /dev/mapper/mpathd -> ../dm-21

$ dd if=/dev/dm-21 of=/dev/zero 
dd: error reading ‘/dev/dm-21’: Input/output error
1024+0 records in
1024+0 records out
524288 bytes (524 kB) copied, 0,0212096 s, 24,7 MB/s

dmesg is full of errors like:

[182285.740791] iSCSI Login negotiation failed.
[182302.677688] iSCSI Login timeout on Network Portal 10.70.47.69:3260
[182302.680205] Unexpected ret: -104 send data 48
[182302.682427] tx_data returned -32, expecting 48.
[182302.684398] iSCSI Login negotiation failed.
[182302.686350] Unable to locate Target IQN: iqn.2016-12.org.gluster-block:3749cd04-9c8e-45c0-a5d8-3935d64160e3 in Storage Node
[182302.688294] Unexpected ret: -104 send data 48
[182302.690238] tx_data returned -32, expecting 48.
[182302.692329] iSCSI Login negotiation failed.
[182302.693671] Unable to locate Target IQN: iqn.2016-12.org.gluster-block:a0e52e94-4b09-462a-aa04-064592552633 in Storage Node
[182302.694704] Unexpected ret: -104 send data 48
[182302.695695] tx_data returned -32, expecting 48.
[182302.696670] iSCSI Login negotiation failed.
[182302.697662] Unable to locate Target IQN: iqn.2016-12.org.gluster-block:60e14200-f87c-4eb1-8f9c-b29e7d5d214a in Storage Node
[182302.698637] Unexpected ret: -104 send data 48
[182302.699626] tx_data returned -32, expecting 48.
[182302.700627] iSCSI Login negotiation failed.
[182302.701630] Unable to locate Target IQN: iqn.2016-12.org.gluster-block:c196ba23-9058-4e08-914d-803e4d4fed7c in Storage Node
[182302.702621] Unexpected ret: -104 send data 48
[182302.703583] tx_data returned -32, expecting 48.
[182302.704581] iSCSI Login negotiation failed.


Docker error code seems to be quite confusing:

$ docker run -ti -v /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97:/mnt/t busybox sh

/usr/bin/docker-current: Error response from daemon: mkdir /var/lib/origin/openshift.local.volumes/pods/9ba2d5b4-7ce7-11e7-8464-005056a56b97/volumes/kubernetes.io~iscsi/pvc-969b4f5b-7ce7-11e7-8464-005056a56b97: file exists.

But it must be related to Input/output error.

Summary: something is wrong with iSCSI

Comment 9 Jan Safranek 2017-08-11 12:04:29 UTC

Created attachment 1312105 [details]
dmesg

Comment 11 Jan Safranek 2017-08-11 12:18:01 UTC

While there are lot of errors in dmesg, none of them look related to iqn.2016-12.org.gluster-block:46859033-193d-4450-8085-c66b1c7cb164, which is behind the pvc-969b4f5b-7ce7-11e7-8464-005056a56b97.

On the other hand, there are some errors related to dm-21, which IS the faulty device:

[182077.757369] blk_update_request: I/O error, dev dm-21, sector 1024
[182077.758635] blk_update_request: I/O error, dev dm-21, sector 1024
[182077.760616] Buffer I/O error on dev dm-21, logical block 1024, async page read
[182077.762493] blk_update_request: I/O error, dev dm-21, sector 1025
[182077.764182] Buffer I/O error on dev dm-21, logical block 1025, async page read
[182077.765176] blk_update_request: I/O error, dev dm-21, sector 1026
[182077.766142] Buffer I/O error on dev dm-21, logical block 1026, async page read
[182077.767120] blk_update_request: I/O error, dev dm-21, sector 1027
[182077.768054] Buffer I/O error on dev dm-21, logical block 1027, async page read
[182077.769039] blk_update_request: I/O error, dev dm-21, sector 1028
[182077.770043] Buffer I/O error on dev dm-21, logical block 1028, async page read
[182077.771033] blk_update_request: I/O error, dev dm-21, sector 1029
[182077.771983] Buffer I/O error on dev dm-21, logical block 1029, async page read
[182077.772967] blk_update_request: I/O error, dev dm-21, sector 1030
[182077.773941] Buffer I/O error on dev dm-21, logical block 1030, async page read
[182077.774922] blk_update_request: I/O error, dev dm-21, sector 1031
[182077.775892] Buffer I/O error on dev dm-21, logical block 1031, async page read

From timestamps I can see it's related to dd I tried in comment #8.

Comment 12 Jan Safranek 2017-08-11 12:33:18 UTC

Looking at multipath output, dm-21 seems to be faulty:

mpathd (3600140546859033193d44508085c66b1) dm-21 LIO-ORG ,TCMU device     
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 33:0:0:0 sdf 8:80  failed undef unknown
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 34:0:0:0 sdg 8:96  failed undef unknown
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 35:0:0:0 sdh 8:112 failed undef unknown

Comment 13 Humble Chirammal 2017-08-11 13:26:52 UTC

(In reply to Jan Safranek from comment #12)
> Looking at multipath output, dm-21 seems to be faulty:
> 
> mpathd (3600140546859033193d44508085c66b1) dm-21 LIO-ORG ,TCMU device     
> size=5.0G features='0' hwhandler='0' wp=rw
> |-+- policy='round-robin 0' prio=0 status=enabled
> | `- 33:0:0:0 sdf 8:80  failed undef unknown
> |-+- policy='round-robin 0' prio=0 status=enabled
> | `- 34:0:0:0 sdg 8:96  failed undef unknown
> `-+- policy='round-robin 0' prio=0 status=enabled
>   `- 35:0:0:0 sdh 8:112 failed undef unknown

True and its path device state is "transport-offline" and the i/o error count looks "0xc" for sd{h,g} and 0x16 for sdf. 

iic, If we send some inquiry to this target and according to the return we can confirm  where the issue is.

Comment 14 Qian Cai 2017-08-11 14:03:30 UTC

I saw those iscsi device (using software iscsi target) corruption through k8s all the time for a long time. Every a few weeks it happens, and only way to recover is to re-format the iscsi disks.

Comment 18 Humble Chirammal 2017-08-11 14:21:10 UTC

(In reply to CAI Qian from comment #14)
> I saw those iscsi device (using software iscsi target) corruption through
> k8s all the time for a long time. Every a few weeks it happens, and only way
> to recover is to re-format the iscsi disks.

Do we have any bz reported on this ?

Comment 19 Qian Cai 2017-08-11 14:38:43 UTC

No, unfortunately. The reproducer is simple though.

First, create an iscsi target either from a fileio or block.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/ch24.html#osm-target-setup

Then, create a pod to access the iscsi storage,
http://lacrosse.corp.redhat.com/~qcai/ci/iscsi-nfs.yaml

Let it running for a few weeks, and at some point, there are tons of "iSCSI Login negotiation failed" and "I/O errors" floating in the node's dmesg indicated that the iscsi target disks are corrupted.

Comment 27 Sahina Bose 2017-08-29 07:08:33 UTC

Could you help us track this down? 
The underlying block device seems to be corrupted - we need some help in root causing this.

Comment 28 Chris Leech 2017-08-30 23:26:17 UTC

I'm looking through the TCP trace, and focusing in on the TCP resets as an indication of something going wrong it looks like the target side isn't responding to iSCSI login attempts.

The first few resets don't have the rest of the connection in the trace, but then the rest follow a pattern of establishing a TCP connection, sending the iSCSI login PDU, then the target ACKs acceptance of the login command but never responds.  After about 15 seconds the initiator times out the attempt and terminates the connection with a RST.

Normally I'd be looking for a disruption to the network layer, but that at least looks like the network is functioning but the target isn't processing the login request.

Continuing to look ...

Comment 29 Mike Christie 2017-09-01 14:31:17 UTC

Do you have more of the target side log? The dmesg start too early.

If the target was not able to execute a IO (it sent it to whatever layer is below it and never got a response) then the inititor side EH will fire and that will lead to the iscsi session being dropped. The iscsi target will then fails to respond to relogin requests (tcp connection is extablished but the login response pdu  is not send by the target so the initiator ends up dropping the connection and retrying) like Chris noted until the outstanding command has completed. You will see the errors in dmesg until the command completes.

Comment 30 Humble Chirammal 2017-09-05 10:29:21 UTC

Thanks Mike!
Karthick, Can you please provide the logs or test setup details to Mike as requested in c#29?

Comment 31 krishnaram Karthick 2017-09-05 13:20:14 UTC

(In reply to Humble Chirammal from comment #30)
> Thanks Mike!
> Karthick, Can you please provide the logs or test setup details to Mike as
> requested in c#29?

I believe ramky is working on this. Moving the needinfo to Ramky.

Comment 32 Ramakrishna Reddy Yekulla 2017-09-06 18:54:47 UTC

sorry for the delay in the getting the setup, this is due to lack of a stable cns builds in the today. Trying to get a stable CNS environment up. Will work on the reproducer and provide the access as soon as possible.

Comment 35 Ramakrishna Reddy Yekulla 2017-09-08 11:20:46 UTC

This is the present state of the Machine as on 16:50 IST
--------------------------------------------------------

[root@dhcp42-238 ~]# oc get pvc
NAME         STATUS    VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS    AGE
mongodb-1    Bound     pvc-d051652f-9483-11e7-86b3-5254001958c3   5Gi        RWO           gluster-block   26m
mongodb-10   Bound     pvc-4df412d0-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   23m
mongodb-11   Bound     pvc-544dd9c9-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   22m
mongodb-12   Bound     pvc-5a9a2673-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   22m
mongodb-13   Bound     pvc-60d67ae4-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   22m
mongodb-14   Bound     pvc-671c9306-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   22m
mongodb-15   Bound     pvc-6d6db90d-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   22m
mongodb-2    Bound     pvc-52d4908f-9483-11e7-86b3-5254001958c3   5Gi        RWO           gluster-block   30m
mongodb-3    Bound     pvc-ffd12d45-9483-11e7-86b3-5254001958c3   5Gi        RWO           gluster-block   25m
mongodb-4    Bound     pvc-280239b4-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   24m
mongodb-5    Bound     pvc-2e461a5a-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   23m
mongodb-6    Bound     pvc-34abbd3f-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   23m
mongodb-7    Bound     pvc-3af694b1-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   23m
mongodb-8    Bound     pvc-413e172b-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   23m
mongodb-9    Bound     pvc-478a7f37-9484-11e7-86b3-5254001958c3   1G         RWO           gluster-block   23m

----------------

[root@dhcp42-238 ~]# oc get pods -o wide
NAME                                  READY     STATUS    RESTARTS   AGE       IP             NODE
glusterblock-provisioner-dc-1-plpxc   1/1       Running   0          18h       10.129.2.2     dhcp42-65.lab.eng.blr.redhat.com
glusterfs-82mj4                       1/1       Running   0          19h       10.70.42.226   dhcp42-226.lab.eng.blr.redhat.com
glusterfs-90ds9                       1/1       Running   0          19h       10.70.42.65    dhcp42-65.lab.eng.blr.redhat.com
glusterfs-qz5pg                       1/1       Running   0          19h       10.70.42.248   dhcp42-248.lab.eng.blr.redhat.com
heketi-1-g1cv6                        1/1       Running   0          18h       10.130.0.7     dhcp43-51.lab.eng.blr.redhat.com
mongodb-1-1-6c087                     1/1       Running   0          22m       10.131.0.2     dhcp42-226.lab.eng.blr.redhat.com
mongodb-10-1-fqv16                    1/1       Running   0          8m        10.130.0.14    dhcp43-51.lab.eng.blr.redhat.com
mongodb-11-1-7mhch                    1/1       Running   0          10m       10.128.2.12    dhcp43-165.lab.eng.blr.redhat.com
mongodb-12-1-cjp22                    1/1       Running   0          18m       10.129.0.5     dhcp42-248.lab.eng.blr.redhat.com
mongodb-13-1-qqtqn                    1/1       Running   0          18m       10.128.2.11    dhcp43-165.lab.eng.blr.redhat.com
mongodb-14-1-vmm2q                    1/1       Running   0          17m       10.130.0.13    dhcp43-51.lab.eng.blr.redhat.com
mongodb-15-1-b1870                    1/1       Running   0          17m       10.129.2.6     dhcp42-65.lab.eng.blr.redhat.com
mongodb-2-1-smqx5                     1/1       Running   0          25m       10.130.0.9     dhcp43-51.lab.eng.blr.redhat.com
mongodb-3-1-gz2wb                     1/1       Running   0          20m       10.129.2.3     dhcp42-65.lab.eng.blr.redhat.com
mongodb-4-1-qd4xb                     1/1       Running   0          10m       10.128.0.8     dhcp43-50.lab.eng.blr.redhat.com
mongodb-5-1-7mf6t                     1/1       Running   0          19m       10.128.0.5     dhcp43-50.lab.eng.blr.redhat.com
mongodb-6-1-cm5bk                     1/1       Running   0          19m       10.129.0.3     dhcp42-248.lab.eng.blr.redhat.com
mongodb-7-1-qtnk6                     1/1       Running   0          19m       10.129.2.4     dhcp42-65.lab.eng.blr.redhat.com
mongodb-8-1-l8304                     1/1       Running   0          19m       10.128.2.9     dhcp43-165.lab.eng.blr.redhat.com
mongodb-9-1-r5b74                     1/1       Running   0          18m       10.131.0.4     dhcp42-226.lab.eng.blr.redhat.com
storage-project-router-1-qgk4j        1/1       Running   0          20h       10.70.43.165   dhcp43-165.lab.eng.blr.redhat.com

Comment 36 Ramakrishna Reddy Yekulla 2017-09-08 11:23:32 UTC


In the above setup, I'm running 15 mongodb apps. 4 of which have 3GB capacity and remaining 11 have 1Gi. The pending task is getting I/O on the workloads.

Comment 37 Ramakrishna Reddy Yekulla 2017-09-08 11:30:00 UTC

(In reply to Ramakrishna Reddy Yekulla from comment #36)
 Correction three 5GB claims and eleven 1GB claim . 
> In the above setup, I'm running 15 mongodb apps. 3 of which have 5GB
> capacity and remaining 11 have 1Gi. The pending task is getting I/O on the
> workloads.

Comment 38 Ramakrishna Reddy Yekulla 2017-09-08 15:58:32 UTC

=============================================================================
The status as of 20:45, more than 4hrs in running, none of the pods 
have gone into CrashLoopBackOff.I'll move the setup to follow the 
instructions given in the https://bugzilla.redhat.com/show_bug.cgi?id=1477455
==============================================================================
[root@dhcp42-238 ~]# oc get pods -o wide
NAME                                  READY     STATUS    RESTARTS   AGE       IP             NODE
glusterblock-provisioner-dc-1-plpxc   1/1       Running   0          22h       10.129.2.2     dhcp42-65.lab.eng.blr.redhat.com
glusterfs-82mj4                       1/1       Running   0          23h       10.70.42.226   dhcp42-226.lab.eng.blr.redhat.com
glusterfs-90ds9                       1/1       Running   0          23h       10.70.42.65    dhcp42-65.lab.eng.blr.redhat.com
glusterfs-qz5pg                       1/1       Running   0          23h       10.70.42.248   dhcp42-248.lab.eng.blr.redhat.com
heketi-1-g1cv6                        1/1       Running   0          22h       10.130.0.7     dhcp43-51.lab.eng.blr.redhat.com
mongodb-1-1-6c087                     1/1       Running   0          4h        10.131.0.2     dhcp42-226.lab.eng.blr.redhat.com
mongodb-10-1-fqv16                    1/1       Running   0          4h        10.130.0.14    dhcp43-51.lab.eng.blr.redhat.com
mongodb-11-1-7mhch                    1/1       Running   0          4h        10.128.2.12    dhcp43-165.lab.eng.blr.redhat.com
mongodb-12-1-cjp22                    1/1       Running   0          4h        10.129.0.5     dhcp42-248.lab.eng.blr.redhat.com
mongodb-13-1-qqtqn                    1/1       Running   0          4h        10.128.2.11    dhcp43-165.lab.eng.blr.redhat.com
mongodb-14-1-vmm2q                    1/1       Running   0          4h        10.130.0.13    dhcp43-51.lab.eng.blr.redhat.com
mongodb-15-1-b1870                    1/1       Running   0          4h        10.129.2.6     dhcp42-65.lab.eng.blr.redhat.com
mongodb-2-1-smqx5                     1/1       Running   0          4h        10.130.0.9     dhcp43-51.lab.eng.blr.redhat.com
mongodb-3-1-gz2wb                     1/1       Running   0          4h        10.129.2.3     dhcp42-65.lab.eng.blr.redhat.com
mongodb-4-1-qd4xb                     1/1       Running   0          4h        10.128.0.8     dhcp43-50.lab.eng.blr.redhat.com
mongodb-5-1-7mf6t                     1/1       Running   0          4h        10.128.0.5     dhcp43-50.lab.eng.blr.redhat.com
mongodb-6-1-cm5bk                     1/1       Running   0          4h        10.129.0.3     dhcp42-248.lab.eng.blr.redhat.com
mongodb-7-1-qtnk6                     1/1       Running   0          4h        10.129.2.4     dhcp42-65.lab.eng.blr.redhat.com
mongodb-8-1-l8304                     1/1       Running   0          4h        10.128.2.9     dhcp43-165.lab.eng.blr.redhat.com
mongodb-9-1-r5b74                     1/1       Running   0          4h        10.131.0.4     dhcp42-226.lab.eng.blr.redhat.com
storage-project-router-1-qgk4j        1/1       Running   0          1d        10.70.43.165   dhcp43-165.lab.eng.blr.redhat.com

Comment 41 Bipin Kunal 2018-04-17 07:24:15 UTC

--------------------------------------------------------------------
On Tue, Apr 17, 2018 at 10:54 AM, Jaspreet Kaur <jkaur> wrote:
Hello All,

Can anyone assist on this?



On Mon, Apr 16, 2018 at 3:08 PM, Krishnaram Karthick Ramdoss <kramdoss> wrote:
Hi Bipin,

I don't remember there was a workaround for this issue. 

@Humble, I remember you were actively involved in chasing this bug, is there a workaround or can we suggest one?

Thanks,
Karthick

On Mon, Apr 16, 2018 at 2:53 PM, Bipin Kunal <bkunal> wrote:
Hello Karthik/Cris,
   A customer seems to be hitting the similar issue in CNS 3.6 environment as indicated by the bug[1]. 
   Do we have any workaround for this issue? We wanted to try out workaround first, If workaround doesn't work we can open a new bug for debugging.

   If we don't have workaround how should we go ahead? Should we open a new bug for debugging or do you want to continue on the same bug?

   You can reply directly to the Bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1480192#c39 

-- 


BIPIN KUNAL

SSME-RED HAT STORAGE

Red Hat India Pvt Ltd

M: +91-902-884-3548    



---------------------------------------------------------------------------

@Humble,
   Can you help us here?

-Bipin Kunal

Comment 65 Red Hat Bugzilla 2023-09-15 00:03:22 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

Note You need to log in before you can comment on or make changes to this bug.

amukherj
aos-bugs
aos-storage-staff
bkunal
cleech
emahoney
fsuba
hchen
hchiramm
jmagrini
jsafrane
jstrunk
knarra
kramdoss
loberman
mchristi
mfiedler
prasanna.kalever
qcai
rcyriac
rhandlin
rreddy
sabose
takirby
xiubli