Description of problem:
Jenkins pods fail to start with Multi-Attach error or Network plugin error.

We can see Multi-Attach errors similar to:

33m   33m   1   jenkins-1-fxthx   Pod   Warning   FailedAttachVolume   attachdetach   Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another

And NetworkPlugin errors:

2m   2m   1   jenkins-slave-pt8tt-77tsn   Pod   Warning   FailedCreatePodSandBox   kubelet, ip-172-31-68-243.us-east-2.compute.internal   Failed create pod sandbox: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "jenkins-slave-pt8tt-77tsn_akonarde-jenkins" network: CNI request failed with status 400: 'failed to Statfs "/proc/78228/ns/net": no such file or directory

Version-Release number of selected component (if applicable):
v3.6.0

How reproducible:
Always

Steps to Reproduce:
1. Log in to OpenShift.io
2. Start a build pipeline (this tries to start pods on starter-us-east-2)
3. Check failed bringups for the Jenkins pods

Actual results:
Failures in pod startup, as listed above

Expected results:
Pods should start without failures and be able to communicate through the Jenkins service

Additional info:
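For step 3, the failed bringups and the events above can be pulled with standard oc commands. This is only a sketch; the namespace and pod names in angle brackets are placeholders for whichever tenant project the pipeline runs in, not values taken from this report.

  # Pods and the nodes they were scheduled to (placeholder namespace)
  oc get pods -n <jenkins-namespace> -o wide

  # Recent warning events; FailedAttachVolume / FailedCreatePodSandBox show up here
  oc get events -n <jenkins-namespace>

  # Event history plus volume and network status for one failing pod
  oc describe pod <jenkins-pod> -n <jenkins-namespace>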
Created attachment 1399845 [details]
YAML for pod that fails to start

Attached YAML for Jenkins pod that fails to start.

Error in events:

1h   1h   1   jenkins-1-fxthx   Pod   Warning   FailedAttachVolume   attachdetach   Multi-Attach error for volume "pvc-5133bd2f-1874-11e8-a099-02d7377a4b17" Volume is already exclusively attached to one node and can't be attached to another
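A Multi-Attach error means the PV is still reported as attached to a different node. One hedged way to confirm where it is attached, assuming the PV is an AWS EBS volume (ReadWriteOnce) as expected on starter-us-east-2; the node name is a placeholder, not one from this report:

  # Show the PV backing the claim, its access mode and the backing volume ID
  oc get pv pvc-5133bd2f-1874-11e8-a099-02d7377a4b17 -o yaml

  # List the volumes a node still reports as attached in its status
  oc get node <node-name> -o jsonpath='{.status.volumesAttached}'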
Here is a reproducer with two running pods. The Jenkins pod cannot talk to the content-repository pod through its service on the internal network. I can, however, reach the content-repository pod through its route, which is tied to that same service.

Namespace: ldimaggi-osiotest2-jenkins

po/jenkins-1-wq85z
  Node:  ip-172-31-67-10.us-east-2.compute.internal
  PodIP: 10.129.98.130

po/content-repository-1-4f9zx
  Node:  ip-172-31-67-252.us-east-2.compute.internal
  PodIP: 10.129.0.84

svc/content-repository   172.30.210.244   <none>   80/TCP

# oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z
sh-4.2$ curl -v http://content-repository
* About to connect() to content-repository port 80 (#0)
*   Trying 172.30.210.244...
* No route to host
* Failed connect to content-repository:80; No route to host
* Closing connection 0
curl: (7) Failed connect to content-repository:80; No route to host
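For the record, a couple of checks that help tell an idled/empty-endpoints service apart from a pure SDN failure. Names are taken from the reproducer above; the annotation key mentioned is the one used by oc idle, assuming the custom idler sets the same one.

  # Does the service have endpoints at all, or did the idler remove them?
  oc get endpoints content-repository -n ldimaggi-osiotest2-jenkins

  # Idled services carry idling annotations (e.g. idling.alpha.openshift.io/idled-at)
  # on the endpoints object
  oc get endpoints content-repository -n ldimaggi-osiotest2-jenkins -o yaml

  # Bypass the service VIP and hit the pod IP directly from the Jenkins pod
  oc -n ldimaggi-osiotest2-jenkins rsh po/jenkins-1-wq85z curl -sv http://10.129.0.84/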
This is the bug that we eventually decided was because of idling, right?
Yes, assigning to Ben Bennett. Ben, Solly Ross, and the openshift.io team are going to have to figure out how their custom idler/unidler managed to deadlock the node.
Around 12:15 PM UTC on Wed 7 we noticed hundreds of pods getting evicted. Looking at the logs, there appear to have been about 40 nodes that went into a NotReady state and began evicting pods. Attached above are the node logs from one of them. Around 12:17 it appears the node started failing to connect to internal resources such as the API. This seems highly similar to the sort of network issues we're also seeing with pods and services.
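A rough way to quantify that kind of event after the fact; these are generic oc queries plus the node journal, the date is a placeholder, and the service unit name assumes an OCP 3.x node (origin-node on Origin):

  # Nodes that are (or were left) NotReady
  oc get nodes | grep -w NotReady

  # Count of Evicted pods left behind across all namespaces
  oc get pods --all-namespaces | grep -c Evicted

  # Node-side view: conditions at the time it went NotReady, and kubelet logs
  oc describe node <node-name>
  journalctl -u atomic-openshift-node --since "<date> 12:00" --until "<date> 12:30"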
I can't reproduce this and I haven't seen it since. The idler is changing completely in 3.11, so it should not be able to do anything like this.