Bug 1600651
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | After upgrade from 3.9 to 3.10 sdn daemonset pods stopped updating iptables on some nodes | | |
| Product: | OpenShift Container Platform | Reporter: | ihorvath |
| Component: | Networking | Assignee: | Ricardo Carrillo Cruz <ricarril> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, hongli, pasik, ricarril, xtian |
| Version: | 3.10.0 | Flags: | ihorvath: needinfo- |
| Target Milestone: | --- | | |
| Target Release: | 3.10.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-12 15:36:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
(Casey Callendrello, comment #1)

A few questions:

1. What is the timing relative to the logs? Are these logs of the pods once they're already in the stuck state?
2. Are the stuck pods running 3.9 or 3.10?
3. I noticed that the stuck pods seem to hang without any logs for 5-6 minutes at a time. That doesn't seem right. I'd like to know whether the stuck pods are idle or burning CPU.

(Ben Bennett, comment #2)

4. What service was it that you were trying to access?

(ihorvath, in reply to Casey Callendrello from comment #1)

> 1. What is the timing relative to the logs? Are these logs of the pods once
> they're already in the stuck state?

Yes, the logs are from the already stuck pods.

> 2. Are the stuck pods running 3.9 or 3.10?

They are all running ose-node:v3.10.9, or at least that's what `oc describe` tells me about every one of them.

> 3. I noticed that the stuck pods seem to hang without any logs for 5-6
> minutes at a time. That doesn't seem right. I'd like to know if the stuck
> pod is idle, or burning CPU

Docker says the container consumes about a core, more or less, but it's not updating iptables, so we're not sure what it is doing right now. `docker logs -f` is also not writing any new lines, so we don't get any more information from there.

```
CONTAINER                                                               CPU %    MEM USAGE / LIMIT      MEM %  NET I/O    BLOCK I/O        PIDS
k8s_sdn_sdn-9jn7p_openshift-sdn_7ef19599-85fc-11e8-b268-02ec8e61afcf_0  111.71%  1.333 GiB / 157.2 GiB  0.85%  0 B / 0 B  721 kB / 90.1 kB  66
```

(In reply to Ben Bennett from comment #2)

> 4. What service was it that you were trying to access?

The service name is zagg-service.

I was wrong about the logging: now that I let a pod keep logging, new lines do eventually appear, so it is trying to do something. Is the big pause expected? Like 5-10 minutes?

```
E0716 15:46:32.604121   78850 service.go:341] Service port "topdeals/jws-app:" doesn't exists
I0716 15:53:07.708777   78850 vnids.go:162] Dissociate netid 4701779 from namespace "rinngok196"
```

Can you please confirm if this is still an issue?
I left the team that was dealing with this over a year ago, so I do not know whether this is still an issue. I imagine that if it were, they would have opened another ticket by now, so feel free to close this.
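The 5-10 minute pauses discussed above can be spotted mechanically by parsing the klog-style timestamps in the sdn pod's output and flagging long silent stretches. Below is a minimal sketch of that idea; the 5-minute threshold and the assumed year (2018, the ticket's timeframe) are assumptions, and the sample lines are the two log lines quoted in the ticket.

```python
import re
from datetime import datetime, timedelta

# klog prefixes look like "E0716 15:46:32.604121 ...": severity letter,
# month+day with no year, then wall-clock time with microseconds.
KLOG_RE = re.compile(r"^[IWEF](\d{2})(\d{2}) (\d{2}:\d{2}:\d{2}\.\d+)")

def parse_klog_time(line, year=2018):
    """Extract a datetime from a klog-formatted line, or None if it doesn't match."""
    m = KLOG_RE.match(line)
    if not m:
        return None
    month, day, clock = m.groups()
    return datetime.strptime(f"{year}-{month}-{day} {clock}", "%Y-%m-%d %H:%M:%S.%f")

def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous_ts, next_ts) pairs where the log went quiet for >= threshold."""
    stamps = [t for t in (parse_klog_time(l) for l in lines) if t is not None]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a >= threshold]

# The two lines quoted in the ticket, roughly 6.5 minutes apart:
sample = [
    'E0716 15:46:32.604121   78850 service.go:341] Service port "topdeals/jws-app:" doesn\'t exists',
    'I0716 15:53:07.708777   78850 vnids.go:162] Dissociate netid 4701779 from namespace "rinngok196"',
]
gaps = find_gaps(sample)
```

Run against the full `oc logs` output of a suspect pod, this would distinguish "slow but alive" (occasional long gaps) from "fully wedged" (no lines at all after some point).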
Created attachment 1458478 [details]
output of oc logs from 3 stuck daemonset sdn containers

Description of problem:
After the upgrade, we noticed that some services were not directing traffic to the pods that should handle a request. Upon further investigation it turned out that 7 out of 100 nodes had long-running sdn pods that were no longer updating iptables. On those hosts iptables listed the wrong IPs for the pods, so the request would always return with a "Host not found" error.

Version-Release number of selected component (if applicable):
oc v3.10.9
kubernetes v1.10.0+b81c8f8

How reproducible:
After killing the pods they came back up correct, so we are not sure how to make them stuck again. If we see sdn pods getting stuck in 3.10 we will update this ticket.

Additional info:
Per bbennett's instructions, we are attaching all logs from 3 of these stuck pods.
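The stale state described above boils down to a set difference: the pod IPs that a stuck node's iptables rules still point at, versus the IPs the service's current endpoints actually report. A minimal sketch of that comparison follows; the IP addresses are hypothetical, standing in for what `iptables-save` on a stuck node and `oc get endpoints zagg-service` might show.

```python
def stale_backends(iptables_ips, endpoint_ips):
    """Compare the backend IPs programmed into iptables against the
    service's live endpoints. Returns rules pointing at dead pods
    ("stale") and live pods iptables never learned about ("missing")."""
    iptables_ips, endpoint_ips = set(iptables_ips), set(endpoint_ips)
    return {
        "stale": sorted(iptables_ips - endpoint_ips),
        "missing": sorted(endpoint_ips - iptables_ips),
    }

# Hypothetical data: the rules frozen on a stuck node vs. current endpoints.
frozen_rules = ["10.128.2.14", "10.128.4.7"]
current_endpoints = ["10.128.4.7", "10.130.1.22"]
diff = stale_backends(frozen_rules, current_endpoints)
```

On a healthy node both sets match and the diff is empty; a non-empty "stale" list is exactly the symptom reported here, where requests were DNAT'd to pod IPs that no longer existed.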