Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1600651

Summary: After upgrade from 3.9 to 3.10 sdn daemonset pods stopped updating iptables on some nodes
Product: OpenShift Container Platform
Component: Networking
Sub component: openshift-sdn
Version: 3.10.0
Target Release: 3.10.z
Reporter: ihorvath
Assignee: Ricardo Carrillo Cruz <ricarril>
QA Contact: zhaozhanqi <zzhao>
CC: aos-bugs, bbennett, hongli, pasik, ricarril, xtian
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Flags: ihorvath: needinfo-
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-11-12 15:36:10 UTC
Attachments:
- output of oc logs from 3 stuck daemonset sdn containers (flags: none)

Description ihorvath 2018-07-12 17:21:31 UTC
Created attachment 1458478 [details]
output of oc logs from 3 stuck daemonset sdn containers

Description of problem:
After the upgrade, we noticed that some services were not directing traffic to the pods that should handle the request. On further investigation it turned out that 7 out of 100 nodes had long-running sdn pods that were no longer updating iptables. On those hosts iptables listed stale IPs for the pods, so requests would always fail with a "Host not found" error.
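A quick way to check a suspect node for this kind of drift (a sketch, not something from this ticket: the service name and namespace are placeholders, and it assumes kube-proxy's usual per-endpoint KUBE-SEP DNAT chains) is to diff the endpoint IPs the API reports against the DNAT targets actually programmed in the node's nat table:

```shell
#!/bin/sh
# Sketch: compare a service's endpoint IPs (per the API server) with the
# DNAT targets programmed in iptables on this node. Names are illustrative.
SVC=my-service          # assumption: service under test
NS=default              # assumption: its namespace

# IPs the API server currently lists as endpoints for the service
oc get endpoints "$SVC" -n "$NS" \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' \
  | sort > /tmp/api_ips

# IPs the node's nat table actually DNATs to (all KUBE-SEP chains)
iptables-save -t nat \
  | grep KUBE-SEP | grep -oE 'to-destination [0-9.]+' \
  | awk '{print $2}' | sort -u > /tmp/ipt_ips

# Any line printed here is an endpoint the API knows about but the
# node's iptables rules do not -- i.e. the rules are stale.
comm -23 /tmp/api_ips /tmp/ipt_ips
```

On a healthy node the final `comm` prints nothing; on the 7 stuck nodes it should print the pod IPs that the sdn pod never programmed.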

Version-Release number of selected component (if applicable):
oc v3.10.9
kubernetes v1.10.0+b81c8f8

How reproducible:
After killing the pods they came back up correctly, so we are not sure how to make them stuck again. If we see sdn pods getting stuck in 3.10 we will update this ticket.


Additional info:
Per bbennett's instructions we are attaching all logs from 3 of these stuck pods.

Comment 1 Casey Callendrello 2018-07-13 16:45:23 UTC
A few questions:

1. What is the timing relative to the logs? Are these logs of the pod once they're already in the stuck state?

2. Are the stuck pods running 3.9 or 3.10?

3. I noticed that the stuck pods seem to hang without emitting any logs for 5-6 minutes at a time. That doesn't seem right. I'd like to know whether the stuck pod is idle or burning CPU.

Comment 2 Ben Bennett 2018-07-13 17:04:46 UTC
4. What service was it that you were trying to access?

Comment 3 ihorvath 2018-07-16 15:43:50 UTC
(In reply to Casey Callendrello from comment #1)
> A few questions:
> 
> 1. What is the timing relative to the logs? Are these logs of the pod once
> they're already in the stuck state?

Yes, the logs are from the already stuck pods.
> 
> 2. Are the stuck pods running 3.9 or 3.10?

They are all running ose-node:v3.10.9, or at least that's what oc describe tells me for every one of them.

> 
> 3. I noticed that the stuck pods seem to hang without any logs for 5-6
> minutes at a time. That doesn't seem right. I'd like to know if the stuck
> pod is idle, or burning CPU

So docker says it consumes about a core, more or less, but it's not updating iptables, so we're not sure what it is doing right now. docker logs -f is also not writing any new lines, so we don't get any more info from there.

CONTAINER                                                                        CPU %               MEM USAGE / LIMIT       MEM %               NET I/O             BLOCK I/O           PIDS
k8s_sdn_sdn-9jn7p_openshift-sdn_7ef19599-85fc-11e8-b268-02ec8e61afcf_0           111.71%             1.333 GiB / 157.2 GiB   0.85%               0 B / 0 B           721 kB / 90.1 kB    66

Comment 4 ihorvath 2018-07-16 15:45:27 UTC
(In reply to Ben Bennett from comment #2)
> 4. What service was it that you were trying to access?

Service name is zagg-service.

Comment 5 ihorvath 2018-07-16 15:57:40 UTC
I was wrong about the logging: now that I let a pod keep logging, new lines eventually appear, so it is trying to do something. Is the big pause expected? Something like 5-10 minutes?

E0716 15:46:32.604121   78850 service.go:341] Service port "topdeals/jws-app:" doesn't exists 
I0716 15:53:07.708777   78850 vnids.go:162] Dissociate netid 4701779 from namespace "rinngok196"
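Those pauses can be quantified from the log timestamps alone (a sketch; the pod name is illustrative, and the awk assumes klog's layout shown above, where the second field is an HH:MM:SS.ffffff timestamp):

```shell
# Sketch: flag pauses longer than 2 minutes between consecutive klog lines.
# klog lines look like: "E0716 15:46:32.604121 78850 service.go:341] ..."
oc logs sdn-9jn7p -n openshift-sdn | awk '{
  split($2, t, /[:.]/)                    # -> hours, minutes, seconds, micros
  secs = t[1]*3600 + t[2]*60 + t[3]       # seconds since midnight
  if (prev && secs - prev > 120)
    printf "gap of %d s before: %s\n", secs - prev, $0
  prev = secs
}'
```

Run against the two lines quoted above, this reports a gap of 395 s (about 6.5 minutes), which would confirm whether the silence matches a periodic resync rather than a hang.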

Comment 6 Ricardo Carrillo Cruz 2019-05-07 08:46:40 UTC
Can you please confirm if this is still an issue?

Comment 7 ihorvath 2019-11-12 15:26:59 UTC
I left the team that was dealing with this over a year ago, so I do not know whether this is still an issue. I imagine that if it were, they would have opened another ticket by now, so feel free to close this.