Description of problem:
SYN packets can generate high load

Version-Release number of selected component (if applicable):
4.9.25
4.10.10

How reproducible:
100%

Steps to Reproduce:
1. Create an SNO with SR-IOV and PAO
2. Create a CPU partitioning PerformanceProfile
3. Run the synflood binary from a VM against an SR-IOV Pod NIC

Actual results:
The load on the SNO keeps rising for as long as the flood runs. This causes almost all liveness and readiness probes to time out. ksoftirqd/0 occupies 100% CPU:

11 root -12 0 0 0 0 R 99.7 0.0 2:41.03 ksoftirqd/0

While the synflood binary is running:
$ uptime
15:34:50 up 23 min, 2 users, load average: 15.40, 5.97, 4.37

After the synflood binary stops:
$ uptime
15:39:13 up 27 min, 2 users, load average: 0.61, 2.71, 3.39

Expected results:

Additional info:
Nokia is testing whether OCP can survive SYN flooding. synflood is a binary that continuously sends SYN packets to a target.
This could be related to the RPS mask that is configured for all devices. We decided that an RPS mask is not needed for physical devices in https://github.com/openshift/cluster-node-tuning-operator/pull/377 and https://github.com/openshift/cluster-node-tuning-operator/pull/371. The symptoms seem to match what you are observing. I suggest you try without RPS and see if the situation improves.
Hi Martin, Thank you so much for your quick reply! As a temporary solution (disabling RPS on physical devices), could we apply https://bugzilla.redhat.com/show_bug.cgi?id=2093267#c2 ? Best Regards, Chen
That was the first rough testing version, but it should still work.
Disabling RPS completely (including veth devices) might affect latency of guaranteed pods. Disabling RPS for only physical devices should be ok even for RAN deployments.
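To make the distinction between physical and veth devices concrete, here is a minimal Python sketch that reads the per-queue RPS mask from sysfs and clears it only on devices that look physical. It is not the workaround from bug 2093267 comment 2 and not what the Node Tuning Operator does; the function names are hypothetical, and treating "has a `device` entry under /sys/class/net/<dev>" as "physical" is an assumption that should be checked on the target node.

```python
import os


def is_physical(sysfs_net, dev):
    # Assumption: hardware-backed NICs expose a "device" entry in sysfs,
    # while virtual devices such as veth do not.
    return os.path.exists(os.path.join(sysfs_net, dev, "device"))


def rps_masks(sysfs_net="/sys/class/net"):
    """Return {(dev, queue): mask} for every rx queue that has an rps_cpus file."""
    masks = {}
    for dev in sorted(os.listdir(sysfs_net)):
        queues = os.path.join(sysfs_net, dev, "queues")
        if not os.path.isdir(queues):
            continue
        for queue in sorted(os.listdir(queues)):
            path = os.path.join(queues, queue, "rps_cpus")
            if queue.startswith("rx-") and os.path.isfile(path):
                with open(path) as f:
                    masks[(dev, queue)] = f.read().strip()
    return masks


def clear_physical_rps(sysfs_net="/sys/class/net", dry_run=True):
    """Write an empty mask ("0") to rps_cpus on physical devices only.

    Returns the list of rps_cpus paths that had (or would have had) a
    non-zero mask cleared. veth and other virtual devices are left alone.
    """
    changed = []
    for (dev, queue), mask in rps_masks(sysfs_net).items():
        if not is_physical(sysfs_net, dev):
            continue
        # Masks are comma-separated hex groups; all zeros means RPS is off.
        if set(mask.replace(",", "")) <= {"0"}:
            continue
        path = os.path.join(sysfs_net, dev, "queues", queue, "rps_cpus")
        if not dry_run:
            with open(path, "w") as f:
                f.write("0")
        changed.append(path)
    return changed
```

Calling `clear_physical_rps()` with the default `dry_run=True` only reports which queues currently carry a non-zero mask, which is a safe first step before changing anything. Writing to sysfs requires root, and the change does not persist across reboots or across a later reapplication of the tuning profile.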
Thank you so much Martin! Do we have a better temporary solution than https://bugzilla.redhat.com/show_bug.cgi?id=2093267#c2 ? Best Regards, Chen
Nope, please test with https://bugzilla.redhat.com/show_bug.cgi?id=2093267#c2
*** This bug has been marked as a duplicate of bug 2100544 ***