Description of problem:

Upon upgrading OCP to version v3.11.404, service endpoint propagation started to take too much time (in some cases around 2.5 hours). Service idling is properly disabled and iptablesSyncPeriod is set to 2h. The initial problem was addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1963160; however, after applying the errata for that BZ, if the ovs and sdn pods are deleted from a node running v3.11.465, the node takes some minutes to reach some of the services.

Version-Release number of selected component (if applicable):
v3.11.465

How reproducible:

Steps to Reproduce:
1. Delete the ovs and sdn pods from a node upgraded to v3.11.404
2.
3.

Actual results:
It takes some minutes for services to be reachable from this node.

Expected results:
The node should reach the services quickly.

Additional info:
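For reference, a rough way to reproduce and time this (a sketch only; the pod names, namespace, service, and port below are placeholders to be adjusted for the actual cluster):

  # find and delete the sdn and ovs pods running on the affected node;
  # the DaemonSets will recreate them
  oc get pods -n openshift-sdn -o wide | grep <node-name>
  oc delete pod <sdn-pod-name> <ovs-pod-name> -n openshift-sdn

  # from the affected node, watch how long a service ClusterIP stays unreachable
  SVC_IP=$(oc get svc <service> -n <namespace> -o jsonpath='{.spec.clusterIP}')
  until curl -s --connect-timeout 2 http://$SVC_IP:<port>/ >/dev/null; do date; sleep 10; done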
> if ovs and sdn pods from a node running on v3.11.465 version are deleted,
> the node takes some minutes to reach some of the services.

I assume that by "takes some minutes to reach some of the services" you mean that it takes a long time for the node to have the expected set of iptables rules, NOT that the iptables rules are there but the TCP traffic is slow?

Also, why are you deleting the ovs and sdn pods? You should never do that. If you randomly kill the OVS pod on a running node, we do not make any claims about how long it will take the node to recover. If you are trying to test "how long does it take a node to fully program itself at startup", the right way to do that is to reboot the node, not to just randomly kill infrastructure pods on it.

But also, the fact that it takes a long time to set up all of the service IPs at startup is a totally different bug from it taking a long time to propagate changes to a running node...

> Here [0] are the following logs provided by customer:

I don't have access to that folder.

Can you please clarify: is there still any problem with propagation of service changes to running nodes? And why is the customer killing the sdn and ovs pods and then expecting things to still work well?
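(As an aside, one way to distinguish "rules missing" from "traffic slow" is to check on the node whether kube-proxy has actually programmed rules for the affected service. A minimal sketch, with the namespace, service name, and ClusterIP as placeholders:)

  # kube-proxy tags its NAT rules with "namespace/servicename:port" comments
  iptables-save -t nat | grep '<namespace>/<service>'
  # or grep for the service ClusterIP directly
  iptables-save -t nat | grep <service-cluster-ip>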
OK, I see; I was misled by the "Caches are synced for service config controller" message before, which was just indicating when it had received all of the data internally, not when the proxy had actually *processed* all of the data.

In the .286 log, the last "sdn proxy: add ..." message appears 18 seconds after startup. In the .465 log, the last "sdn proxy: add ..." message appears 21 *minutes* after startup, and that's not even the last one (eg, the log cuts off before demo.verification.svc appears).

I'm pretty sure this is due to an iptables locking fix introduced in 3.11.344, which is mostly unnoticeable in ordinary operation, but at startup time it could end up interfering with the initial bulk creation of services pretty badly.
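(For anyone wanting to check the same thing in their own sdn logs, a rough sketch; this assumes the usual klog line format where the date and time are the first two fields, and uses a placeholder pod name:)

  # timestamp of the first log line (roughly pod startup)
  oc logs <sdn-pod> -n openshift-sdn | head -1 | awk '{print $1, $2}'
  # timestamp of the last "sdn proxy: add" line
  oc logs <sdn-pod> -n openshift-sdn | grep 'sdn proxy: add' | tail -1 | awk '{print $1, $2}'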
Created about 3000 services on this build:

# oc version
oc v3.11.500
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-13-214.ec2.internal:8443
openshift v3.11.499
kubernetes v1.11.0+d4cacc0

# oc get svc -n z1 | wc -l
3006

and then deleted the sdn pod so that it would be recreated. In the newly created sdn pod, it takes 6 seconds from startup to the last "sdn proxy: add".

The logs are in the attachment.

@Dan Winship Could you help confirm whether this is enough to verify this bug?
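(In case anyone wants to repeat this scale setup, a minimal sketch of creating a large number of ClusterIP services; the namespace, service names, ports, and count are illustrative, not necessarily what was used here:)

  oc new-project z1
  for i in $(seq 1 3000); do
    oc create service clusterip svc-$i --tcp=80:8080 -n z1
  done

  # then delete the node's sdn pod and watch for the last "sdn proxy: add"
  # in the recreated pod's log
  oc delete pod <sdn-pod-on-node> -n openshift-sdn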
(In reply to zhaozhanqi from comment #14)
> and then deleted the sdn pod so that it would be recreated. In the newly
> created sdn pod, it takes 6 seconds from startup to the last "sdn proxy: add".
>
> The logs are in the attachment.
>
> @Dan Winship Could you help confirm whether this is enough to verify this bug?

Yeah, it would take several minutes to complete if the fix wasn't working. Maybe just do some spot checks of a handful of the services to make sure they actually work?
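(A spot check could be as simple as the sketch below, run against a few services that actually have endpoints behind them; the service name, namespace, and port are placeholders:)

  SVC_IP=$(oc get svc <service> -n <namespace> -o jsonpath='{.spec.clusterIP}')
  # confirm the node has NAT rules for the ClusterIP, then try to connect to it
  iptables-save -t nat | grep -c "$SVC_IP"
  curl -sk --connect-timeout 3 https://$SVC_IP:<port>/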
Yes, I checked and all the services are working well, so I will move this bug to verified. Thanks, Dan.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 3.11.z security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3193