Bug 1548384
| Summary: | [starter-us-east-2] Pod bringup fails with Multi-attach / Network Plugin error |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Networking |
| Networking sub component: | openshift-sdn |
| Reporter: | Aditya Konarde <akonarde> |
| Assignee: | Ben Bennett <bbennett> |
| QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED WORKSFORME |
| Severity: | high |
| Priority: | unspecified |
| CC: | akonarde, aos-bugs, bbennett, eparis, jchevret, jokerman, mmccomas, mmclane, pbergene, sross |
| Version: | 3.7.0 |
| Target Milestone: | --- |
| Target Release: | 3.9.z |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Docs Contact: | |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2018-06-15 18:27:55 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
Description

Aditya Konarde 2018-02-23 11:24:03 UTC

Created attachment 1399845 [details]
YAML for pod that fails to start

Attached YAML for the Jenkins pod that fails to start.

Error in events:

```
1h 1h 1 jenkins-1-fxthx Pod Warning FailedAttachVolume attachdetach Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another
```

---

Here is a reproducer with two running pods. The Jenkins pod cannot talk to the content-repository pod through its service on the internal network. I can, however, reach the content-repository pod through its route, which is tied to that same service.

```
Namespace: ldimaggi-osiotest2-jenkins

po/jenkins-1-wq85z            Node: ip-172-31-67-10.us-east-2.compute.internal   PodIP: 10.129.98.130
po/content-repository-1-4f9zx Node: ip-172-31-67-252.us-east-2.compute.internal  PodIP: 10.129.0.84
svc/content-repository        172.30.210.244  <none>  80/TCP

# oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z
sh-4.2$ curl -v http://content-repository
* About to connect() to content-repository port 80 (#0)
*   Trying 172.30.210.244...
* No route to host
* Failed connect to content-repository:80; No route to host
* Closing connection 0
curl: (7) Failed connect to content-repository:80; No route to host
```

---

This is the bug that we eventually decided was because of idling, right?

---

Yes. Assigning to Ben Bennett. Ben, Solly Ross, and the openshift.io team are going to have to figure out how their custom idler/unidler managed to deadlock the node.

---

Around 12:15 PM UTC on Wed 7 we noticed hundreds of pods getting evicted. Looking at the logs, there appear to have been about 40 nodes that got into a NotReady state and began evicting pods. Attached above are the node logs from one of them. Around 12:17 it appears the node started failing to connect to internal resources such as the API. This seems highly similar to the sort of network issues we're also seeing with pods and services.

---

Can't reproduce this, and I haven't seen it since. The idler is changing completely in 3.11, so it should not be able to do anything like this.
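The Multi-Attach error in the original description is characteristic of a PersistentVolumeClaim with the ReadWriteOnce access mode: such a volume can be attached to only one node at a time, so when a pod is rescheduled to a different node before the old attachment is detached, bringup fails with exactly this event. A minimal, hypothetical claim illustrating the relevant field (names are placeholders, not taken from the attached YAML):

```yaml
# Hypothetical PVC sketch -- not the actual claim from attachment 1399845.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins
spec:
  accessModes:
    - ReadWriteOnce  # attachable to a single node at a time; rescheduling
                     # the pod across nodes can trigger Multi-Attach errors
  resources:
    requests:
      storage: 1Gi
```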
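For the service-connectivity half of this report, a useful first check is whether the service still has endpoints, since idling removes a service's endpoints until the unidler restores the pods; a ClusterIP with no backends can surface to callers as "No route to host". A sketch using standard `oc` commands against the namespace and service from the reproducer (assumes access to the affected cluster; inspecting the annotations for idling markers is an assumption based on how the OpenShift idling feature works, not something confirmed in this bug):

```shell
# Does the ClusterIP service have anything to route to?
# An empty ENDPOINTS column means the SDN/kube-proxy has no backend pods.
oc -n ldimaggi-osiotest2-jenkins get endpoints content-repository

# Idled services record their state as annotations on the endpoints object;
# dumping the annotations shows whether the idler has touched this service.
oc -n ldimaggi-osiotest2-jenkins get endpoints content-repository \
  -o jsonpath='{.metadata.annotations}'
```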
Here is a reproducer with two running pods. Jenkins pod cannot talk to the content-repository pod through it's service on the internal network. I can however reach the content-repository pod through it's route which is tied to that same service. Namespace: ldimaggi-osiotest2-jenkins po/jenkins-1-wq85z Node: ip-172-31-67-10.us-east-2.compute.internal PodIP: 10.129.98.130 po/content-repository-1-4f9zx Node: ip-172-31-67-252.us-east-2.compute.internal PodIP: 10.129.0.84 svc/content-repository 172.30.210.244 <none> 80/TCP # oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z sh-4.2$ curl -v http://content-repository * About to connect() to content-repository port 80 (#0) * Trying 172.30.210.244... * No route to host * Failed connect to content-repository:80; No route to host * Closing connection 0 curl: (7) Failed connect to content-repository:80; No route to host This is the bug that we eventually decided was because of idling, right? Yes assigning to Ben Bennett. Ben, Solly Ross, and the openshift.io team are going to have to figure out how their custom idle/unidler managed to deadlock the node. Around 12:15PM UTC on Wed 7 we noticed 100s of pods getting evicted. Looking at logs there appear to have been about 40 nodes that got into a NotReady state and began evicting pods. Attached above is the node logs from one of them. Around 12:17 it appears the node started failing to connect to internal resources such as the api. This seems highly similar to the sort of network issues we're also seeing with pods & services. Can't reproduce this and I haven't seen it since. The idler is changing completely in 3.11 so should not be able to do anything like this. |