Bug 1669560
| Summary: | [3.11] mounting fails with multipath iscsi when one path is down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Niels de Vos <ndevos> |
| Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
| Status: | CLOSED ERRATA | QA Contact: | Chao Yang <chaoyang> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | aos-bugs, aos-storage-staff, hchiramm, jsafrane, ndevos, rgeorge, rhs-bugs, sankarshan, sponnaga |
| Target Milestone: | --- | Keywords: | Regression, ZStream |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1669403 | | |
| | 1680012 | Environment: | |
| Last Closed: | 2019-04-11 05:38:26 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1669403, 1680012 | | |
Comment 1
Humble Chirammal
2019-01-27 06:55:23 UTC
I can't reproduce it with v3.11.86. Kubelet tries to log into the shut-down node several times and times out, but it eventually gives up and continues with just 2 paths. My pod is running in ~45 seconds. Can you please retry with a current OCP version and leave your cluster running for investigation? Find me on IRC in #aos during CET (UTC+1) business hours.

I admit that the iscsi volume plugin may have issues with this:
> IMO, attach disk fails when we collect the portal-host map for the mentioned target from the host buses.
> It fails because of an existing session entry in sysfs while collecting the map. Attach disk failing soon after we fail to collect the session's address is problematic in certain scenarios.
> One other thought is that the teardown of the mount and the wiping of the session entries should have happened during detach or pod teardown. Maybe it was attempted and failed, or there is a race between the teardown and the setup.
> Some more details on the questions above, plus logs or a reproducer, should confirm this thought.
But the plugin should never set up and tear down a single volume in parallel. It does things in sequence. How did you reach such a state? A simple shutdown of a gluster node plus deletion of the pod was not enough for me.
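To illustrate the multipath behavior described above (kubelet times out on the dead portal but continues with the remaining paths), here is a minimal Go sketch. This is not the actual kubelet plugin code; `loginToPortal` is a hypothetical stand-in for the iscsiadm login step, and every portal address except 10.0.77.71 is made up.

```go
// Sketch only: a single unreachable portal is logged and skipped; the
// mount fails only when no portal responds at all.
package main

import (
	"fmt"
	"log"
)

// loginToPortal stands in for the "iscsiadm ... --portal <p> --login" step.
// A real implementation would shell out and time out on dead portals.
func loginToPortal(portal string) error {
	return nil
}

// attachDisk tries to log in on every portal and returns the ones that answered.
func attachDisk(portals []string) ([]string, error) {
	var up []string
	for _, p := range portals {
		if err := loginToPortal(p); err != nil {
			// One dead path is not fatal; record it and keep going.
			log.Printf("portal %s unreachable: %v", p, err)
			continue
		}
		up = append(up, p)
	}
	if len(up) == 0 {
		return nil, fmt.Errorf("failed to log in to any of %d portals", len(portals))
	}
	return up, nil
}

func main() {
	// 10.0.77.71 is from this report; the other addresses are placeholders.
	paths, err := attachDisk([]string{"10.0.77.71:3260", "10.0.77.72:3260", "10.0.77.73:3260"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("usable paths:", paths)
}
```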
Upstream PR: https://github.com/kubernetes/kubernetes/pull/74306

(In reply to Jan Safranek from comment #7)
> But the plugin should never set up and tear down a single volume in parallel. It does things in sequence. How did you reach such a state?

Yeah, it looks like a race to me, as mentioned above.

Origin 3.11 PR: https://github.com/openshift/origin/pull/22137

> After 10.0.77.71 is down, "TargetPortal" in the PV info is still 10.0.77.71. Do we need to update the PV info?
This is OK. The PV object contains the long-term properties of the volume, not what is actually in use by a pod (and there is no Kubernetes API to get that). When 10.0.77.71 comes back to life, it may still be used when a new pod is started.
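For reference, a sketch of how such a multipath iSCSI PV looks when built with the Kubernetes API types (k8s.io/api/core/v1). TargetPortal, Portals, IQN, Lun, and FSType are real API fields; the IQN and the 10.0.77.72/.73 portal addresses are placeholders, since the report only names 10.0.77.71.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePV builds a multipath iSCSI PersistentVolume. TargetPortal and
// Portals are static spec fields: they describe where the volume can be
// reached, not which paths kubelet is currently using, which is why the PV
// still shows 10.0.77.71 after that portal goes down.
func examplePV() *v1.PersistentVolume {
	return &v1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "iscsi-multipath-pv"},
		Spec: v1.PersistentVolumeSpec{
			Capacity: v1.ResourceList{
				v1.ResourceStorage: resource.MustParse("1Gi"),
			},
			AccessModes: []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			PersistentVolumeSource: v1.PersistentVolumeSource{
				ISCSI: &v1.ISCSIPersistentVolumeSource{
					TargetPortal: "10.0.77.71:3260",
					// Additional paths; these addresses are placeholders.
					Portals: []string{"10.0.77.72:3260", "10.0.77.73:3260"},
					IQN:     "iqn.2016-12.org.gluster-block:example", // placeholder IQN
					Lun:     0,
					FSType:  "xfs",
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", examplePV().Spec.ISCSI)
}
```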
Based on the above comments, updating the bug status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636