Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1548384

Summary: [starter-us-east-2] Pod bringup fails with Multi-Attach / NetworkPlugin error
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Version: 3.7.0
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Reporter: Aditya Konarde <akonarde>
Assignee: Ben Bennett <bbennett>
QA Contact: zhaozhanqi <zzhao>
CC: akonarde, aos-bugs, bbennett, eparis, jchevret, jokerman, mmccomas, mmclane, pbergene, sross
Last Closed: 2018-06-15 18:27:55 UTC
Type: Bug
Attachments:
  YAML for pod that fails to start (flags: none)

Description Aditya Konarde 2018-02-23 11:24:03 UTC
Description of problem:
Jenkins pods fail to start with a Multi-Attach error or a NetworkPlugin error.

We can see Multi-Attach errors similar to:
33m        33m         1         jenkins-1-fxthx         Pod                                  Warning   FailedAttachVolume       attachdetach                                           Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another
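For reference, a Multi-Attach failure like this generally means the attach/detach controller still records the (ReadWriteOnce) volume as attached to the pod's previous node. A rough way to confirm, assuming cluster-admin access (the <project> placeholder is hypothetical, since the namespace of jenkins-1-fxthx is not shown above, and .status.volumesAttached assumes in-tree AWS EBS volumes):

# Where did the scheduler place the replacement pod?
oc -n <project> get pod jenkins-1-fxthx -o wide

# Which nodes still report the volume as attached?
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.volumesAttached[*].name}{"\n"}{end}' | grep 5133bd2f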

And NetworkPlugin errors:
2m        2m        1         jenkins-slave-pt8tt-77tsn   Pod                                              Warning   FailedCreatePodSandBox   kubelet, ip-172-31-68-243.us-east-2.compute.internal   Failed create pod sandbox: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "jenkins-slave-pt8tt-77tsn_akonarde-jenkins" network: CNI request failed with status 400: 'failed to Statfs "/proc/78228/ns/net": no such file or directory
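The Statfs failure means the network namespace path handed to the CNI plugin was already gone, i.e. the sandbox (pause) process exited before openshift-sdn could wire it up. A rough check on the node itself, assuming the docker runtime used by 3.x installs (the pid comes from the event above):

# Does the netns path from the error still exist?
ls -l /proc/78228/ns/net

# Is the pod's pause container still around, or did it die before CNI setup?
docker ps -a | grep jenkins-slave-pt8tt-77tsn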



Version-Release number of selected component (if applicable):
v3.6.0

How reproducible:
Always

Steps to Reproduce:
1. Log in to OpenShift.io
2. Start a build pipeline (This tries to start pods on starter-us-east-2)
3. Check for failed bringups of the Jenkins pods (one way to pull the relevant events is sketched below)
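A minimal sketch for step 3, assuming oc is logged in to the starter-us-east-2 cluster (the <project> placeholder stands for the pipeline's Jenkins namespace):

# Pods stuck outside Running/Completed
oc -n <project> get pods | grep -Ev 'Running|Completed'

# Warning events for the failed bringups
oc -n <project> get events | grep -E 'FailedAttachVolume|FailedCreatePodSandBox'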

Actual results:
Failures in pod startup, as listed above

Expected results:
Pods should start without failures and be able to communicate through the Jenkins service

Additional info:

Comment 1 Aditya Konarde 2018-02-23 11:57:57 UTC
Created attachment 1399845 [details]
YAML for pod that fails to start

Attached YAML for Jenkins pod that fails to start

Error in events:

1h        1h        1         jenkins-1-fxthx              Pod                                                Warning   FailedAttachVolume            attachdetach                                           Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another

Comment 2 jchevret 2018-02-23 13:45:09 UTC
Here is a reproducer with two running pods. The Jenkins pod cannot talk to the content-repository pod through its service on the internal network. I can, however, reach the content-repository pod through its route, which is tied to that same service.


Namespace: ldimaggi-osiotest2-jenkins

po/jenkins-1-wq85z
Node: ip-172-31-67-10.us-east-2.compute.internal
PodIP: 10.129.98.130

po/content-repository-1-4f9zx
Node: ip-172-31-67-252.us-east-2.compute.internal
PodIP: 10.129.0.84

svc/content-repository   172.30.210.244   <none>                         80/TCP


# oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z

sh-4.2$ curl -v http://content-repository 
* About to connect() to content-repository port 80 (#0)
*   Trying 172.30.210.244...
* No route to host
* Failed connect to content-repository:80; No route to host
* Closing connection 0
curl: (7) Failed connect to content-repository:80; No route to host
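When a service VIP is unreachable like this, a useful follow-up is to confirm the service still has endpoints and to bypass it entirely. A minimal sketch using the names and pod IP from the listing above (an empty endpoints list would point at idling or a selector problem rather than the SDN):

# Does the service resolve to any backing pods?
oc -n ldimaggi-osiotest2-jenkins get endpoints content-repository

# Bypass the service VIP: can the Jenkins pod reach the backing pod directly?
oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z curl -sv http://10.129.0.84/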

Comment 4 Dan Winship 2018-02-26 15:15:08 UTC
This is the bug that we eventually decided was because of idling, right?

Comment 5 Eric Paris 2018-02-26 16:22:58 UTC
Yes, assigning to Ben Bennett. Ben, Solly Ross, and the openshift.io team are going to have to figure out how their custom idler/unidler managed to deadlock the node.
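For anyone triaging similar reports: idled services carry idling annotations on their endpoints, so a quick way to tell whether the idler is involved is to look for them. A rough check (the annotation prefix below is the one oc idle uses in 3.x; treat it as an assumption for a custom idler):

# Endpoints in the project that the idler has touched
oc -n <project> get endpoints -o yaml | grep -B2 'idling.alpha.openshift.io'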

Comment 7 jchevret 2018-03-07 13:13:03 UTC
Around 12:15 PM UTC on Wednesday the 7th, we noticed hundreds of pods getting evicted. Looking at the logs, it appears about 40 nodes went into a NotReady state and began evicting pods.

Attached above are the node logs from one of them. Around 12:17 the node apparently started failing to connect to internal resources such as the API. This looks highly similar to the sort of network issues we are also seeing with pods and services.
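To gauge the blast radius of an event like this, a rough sketch (the atomic-openshift-node unit name assumes an RPM-based 3.x install):

# How many nodes are currently NotReady?
oc get nodes | grep NotReady

# On an affected node, pull the node service logs around the incident window
journalctl -u atomic-openshift-node --since '12:10' --until '12:30'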

Comment 8 Ben Bennett 2018-06-15 18:27:55 UTC
Can't reproduce this and I haven't seen it since. The idler is changing completely in 3.11, so it should not be able to do anything like this.