Bug 1548384 - [starter-us-east-2] Pod bringup fails with Multi-attach/ Network Plugin error
Summary: [starter-us-east-2] Pod bringup fails with Multi-attach/ Network Plugin error
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-23 11:24 UTC by Aditya Konarde
Modified: 2020-05-20 19:57 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-15 18:27:55 UTC
Target Upstream Version:
Embargoed:


Attachments
YAML for pod that fails to start (6.28 KB, text/plain)
2018-02-23 11:57 UTC, Aditya Konarde

Description Aditya Konarde 2018-02-23 11:24:03 UTC
Description of problem:
Jenkins pods fail to start with a Multi-Attach error or a NetworkPlugin error.

We can see Multi-attach errors similar to:
33m        33m         1         jenkins-1-fxthx         Pod                                  Warning   FailedAttachVolume       attachdetach                                           Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another

And NetworkPlugin errors:
2m        2m        1         jenkins-slave-pt8tt-77tsn   Pod                                              Warning   FailedCreatePodSandBox   kubelet, ip-172-31-68-243.us-east-2.compute.internal   Failed create pod sandbox: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "jenkins-slave-pt8tt-77tsn_akonarde-jenkins" network: CNI request failed with status 400: 'failed to Statfs "/proc/78228/ns/net": no such file or directory
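For context on this second error: "failed to Statfs /proc/<pid>/ns/net: no such file or directory" typically means CNI was asked to set up a pod sandbox whose network namespace had already gone away (the infra container exited or was cleaned up before network setup ran). A minimal sketch of the node-side check, assuming an OCP 3.x node where the node process runs as the atomic-openshift-node unit:

oc describe pod jenkins-slave-pt8tt-77tsn -n akonarde-jenkins
# On ip-172-31-68-243.us-east-2.compute.internal (requires node access):
journalctl -u atomic-openshift-node --since "1 hour ago" | grep -iE 'cni|sandbox'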



Version-Release number of selected component (if applicable):
v3.6.0

How reproducible:
Always

Steps to Reproduce:
1. Log in to OpenShift.io
2. Start a build pipeline (This tries to start pods on starter-us-east-2)
3. Check for failed bringups of the Jenkins pods (see the sketch below)
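
A minimal sketch of how step 3 can be checked from the CLI; the namespace here is an assumption based on the <username>-jenkins pattern visible in the events above:

# Jenkins pods and their current phase
oc get pods -n akonarde-jenkins
# Warning events behind the failed bringups
oc get events -n akonarde-jenkins | grep -iE 'FailedAttachVolume|FailedCreatePodSandBox'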

Actual results:
Failures in Pod startup, as listed above

Expected results:
Pods should start without failures and be able to communicate through the Jenkins service

Additional info:

Comment 1 Aditya Konarde 2018-02-23 11:57:57 UTC
Created attachment 1399845 [details]
YAML for pod that fails to start

Attached YAML for Jenkins pod that fails to start

Error in events:

1h        1h        1         jenkins-1-fxthx              Pod                                                Warning   FailedAttachVolume            attachdetach                                           Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another
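
Triage note: a Multi-Attach error like this is typically reported when a pod using a ReadWriteOnce PVC is scheduled onto a different node while the volume is still attached to its previous node. A minimal sketch of how to confirm where the volume is attached (the namespace is an assumption, and jq is assumed to be available on the admin host):

# Which PVC/PV is involved, and its access mode
oc get pvc -n akonarde-jenkins
oc describe pv pvc-5133bd2f-1874-11e8-a099-02d7377a4b17
# Which node(s) currently report the volume as attached (cluster-admin)
oc get nodes -o json | jq '.items[] | {name: .metadata.name, attached: .status.volumesAttached}'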

Comment 2 jchevret 2018-02-23 13:45:09 UTC
Here is a reproducer with two running pods. The Jenkins pod cannot talk to the content-repository pod through its service on the internal network. I can, however, reach the content-repository pod through its route, which is tied to that same service.


Namespace: ldimaggi-osiotest2-jenkins

po/jenkins-1-wq85z
Node: ip-172-31-67-10.us-east-2.compute.internal
PodIP: 10.129.98.130

po/content-repository-1-4f9zx
Node: ip-172-31-67-252.us-east-2.compute.internal
PodIP: 10.129.0.84

svc/content-repository   172.30.210.244   <none>                         80/TCP


# oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z

sh-4.2$ curl -v http://content-repository 
* About to connect() to content-repository port 80 (#0)
*   Trying 172.30.210.244...
* No route to host
* Failed connect to content-repository:80; No route to host
* Closing connection 0
curl: (7) Failed connect to content-repository:80; No route to host
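
One more data point that may help: "No route to host" against a service IP is often what you see when the service has no active endpoints (for example because it was idled and never unidled), rather than an actual routing fault. A minimal sketch of the checks for this namespace (the annotation key assumes the standard oc idle mechanism; <target-port> is a placeholder to fill in from the endpoints output):

oc get endpoints content-repository -n ldimaggi-osiotest2-jenkins
oc get endpoints content-repository -n ldimaggi-osiotest2-jenkins -o yaml | grep idling.alpha.openshift.io
# Bypass the service and hit the pod IP directly from the Jenkins pod:
oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z curl -sv --max-time 5 http://10.129.0.84:<target-port>/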

Comment 4 Dan Winship 2018-02-26 15:15:08 UTC
This is the bug that we eventually decided was caused by idling, right?

Comment 5 Eric Paris 2018-02-26 16:22:58 UTC
Yes, assigning to Ben Bennett. Ben, Solly Ross, and the openshift.io team are going to have to figure out how their custom idler/unidler managed to deadlock the node.
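
For anyone following along, a rough way to see which services are currently marked idle (and therefore depend on the unidler to bring their endpoints back) is to look for the annotations that oc idle writes onto the Endpoints objects. This is a minimal sketch assuming cluster-admin access and jq; it only covers the standard idling.alpha.openshift.io annotations, not anything specific to the openshift.io custom idler:

# Endpoints that still carry an idled-at annotation, printed as namespace/name plus timestamp
oc get endpoints --all-namespaces -o json | jq -r '
  .items[]
  | select(.metadata.annotations["idling.alpha.openshift.io/idled-at"] != null)
  | "\(.metadata.namespace)/\(.metadata.name) idled at \(.metadata.annotations["idling.alpha.openshift.io/idled-at"])"'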

Comment 7 jchevret 2018-03-07 13:13:03 UTC
Around 12:15 PM UTC on Wednesday the 7th, we noticed hundreds of pods getting evicted. Looking at the logs, about 40 nodes appear to have gone into a NotReady state and begun evicting pods.

Attached above are the node logs from one of them. Around 12:17 it appears the node started failing to connect to internal resources such as the API. This seems highly similar to the sort of network issues we're also seeing with pods and services.
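
To help narrow down which nodes flipped and when, this is the rough set of checks I would run (the unit name assumes an OCP 3.x node; the timestamps match the window above):

oc get nodes | grep NotReady
oc get events --all-namespaces | grep -iE 'NodeNotReady|Evicted'
# On an affected node, pull the node log around the incident window:
journalctl -u atomic-openshift-node --since "2018-03-07 12:10" --until "2018-03-07 12:30"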

Comment 8 Ben Bennett 2018-06-15 18:27:55 UTC
I can't reproduce this and haven't seen it since. The idler is changing completely in 3.11, so it should not be able to do anything like this.

