Description of problem:
Jenkins pods fail to start with Multi-Attach error or Network plugin error.

We can see Multi-Attach errors similar to:

33m   33m   1   jenkins-1-fxthx   Pod   Warning   FailedAttachVolume   attachdetach   Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another

And NetworkPlugin errors:

2m   2m   1   jenkins-slave-pt8tt-77tsn   Pod   Warning   FailedCreatePodSandBox   kubelet, ip-172-31-68-243.us-east-2.compute.internal   Failed create pod sandbox: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "jenkins-slave-pt8tt-77tsn_akonarde-jenkins" network: CNI request failed with status 400: 'failed to Statfs "/proc/78228/ns/net": no such file or directory

Version-Release number of selected component (if applicable):
v3.6.0

How reproducible:
Always

Steps to Reproduce:
1. Log in to OpenShift.io
2. Start a build pipeline (this tries to start pods on starter-us-east-2)
3. Check failed bringups for the Jenkins pods

Actual results:
Failures in pod startup, as listed above

Expected results:
Pods should start without failures and be able to communicate through the Jenkins service

Additional info:
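For step 3, the failed bringups and the events above can be pulled with standard oc commands. This is only a sketch; the namespace and pod names in angle brackets are placeholders for whichever tenant project the pipeline runs in, not values taken from this report.

  # Pods and the nodes they were scheduled to (placeholder namespace)
  oc get pods -n <jenkins-namespace> -o wide

  # Recent warning events; FailedAttachVolume / FailedCreatePodSandBox show up here
  oc get events -n <jenkins-namespace>

  # Event history plus volume and network status for one failing pod
  oc describe pod <jenkins-pod> -n <jenkins-namespace>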
Created attachment 1399845 [details]
YAML for pod that fails to start

Attached YAML for Jenkins pod that fails to start.

Error in events:

1h   1h   1   jenkins-1-fxthx   Pod   Warning   FailedAttachVolume   attachdetach   Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another
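A Multi-Attach error means the PV is still reported as attached to a different node. One hedged way to confirm where it is attached, assuming the PV is an AWS EBS volume (ReadWriteOnce) as expected on starter-us-east-2; the node name is a placeholder, not one from this report:

  # Show the PV backing the claim, its access mode and the backing volume ID
  oc get pv pvc-5133bd2f-1874-11e8-a099-02d7377a4b17 -o yaml

  # List the volumes a node still reports as attached in its status
  oc get node <node-name> -o jsonpath='{.status.volumesAttached}'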
Here is a reproducer with two running pods. The Jenkins pod cannot talk to the content-repository pod through its service on the internal network. I can, however, reach the content-repository pod through its route, which is tied to that same service.

Namespace: ldimaggi-osiotest2-jenkins

po/jenkins-1-wq85z
  Node:  ip-172-31-67-10.us-east-2.compute.internal
  PodIP: 10.129.98.130

po/content-repository-1-4f9zx
  Node:  ip-172-31-67-252.us-east-2.compute.internal
  PodIP: 10.129.0.84

svc/content-repository   172.30.210.244   <none>   80/TCP

# oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z
sh-4.2$ curl -v http://content-repository
* About to connect() to content-repository port 80 (#0)
*   Trying 172.30.210.244...
* No route to host
* Failed connect to content-repository:80; No route to host
* Closing connection 0
curl: (7) Failed connect to content-repository:80; No route to host
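For the record, a couple of checks that help tell an idled/empty-endpoints service apart from a pure SDN failure. Names are taken from the reproducer above; the annotation key mentioned is the one used by oc idle, assuming the custom idler sets the same one.

  # Does the service have endpoints at all, or did the idler remove them?
  oc get endpoints content-repository -n ldimaggi-osiotest2-jenkins

  # Idled services carry idling annotations (e.g. idling.alpha.openshift.io/idled-at)
  # on the endpoints object
  oc get endpoints content-repository -n ldimaggi-osiotest2-jenkins -o yaml

  # Bypass the service VIP and hit the pod IP directly from the Jenkins pod
  oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z curl -sv http://10.129.0.84/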
This is the bug that we eventually decided was because of idling, right?
Yes, assigning to Ben Bennett. Ben, Solly Ross, and the openshift.io team are going to have to figure out how their custom idler/unidler managed to deadlock the node.
Around 12:15 PM UTC on Wed 7 we noticed hundreds of pods getting evicted. Looking at the logs, there appear to have been about 40 nodes that went into a NotReady state and began evicting pods. Attached above are the node logs from one of them. Around 12:17 it appears the node started failing to connect to internal resources such as the API. This seems highly similar to the sort of network issues we're also seeing with pods and services.
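A rough way to quantify that kind of event after the fact; these are generic oc queries plus the node journal, the date is a placeholder, and the service unit name assumes an OCP 3.x node (origin-node on Origin):

  # Nodes that are (or were left) NotReady
  oc get nodes | grep -w NotReady

  # Count of Evicted pods left behind across all namespaces
  oc get pods --all-namespaces | grep -c Evicted

  # Node-side view: conditions at the time it went NotReady, and kubelet logs
  oc describe node <node-name>
  journalctl -u atomic-openshift-node --since "<date> 12:00" --until "<date> 12:30"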
I can't reproduce this and I haven't seen it since. The idler is changing completely in 3.11, so it should not be able to do anything like this.