Bug 1302500

Summary: [RFE] Include fanout support for netsniff-ng in RHEL 7
Product: Red Hat Enterprise Linux 7
Reporter: Sam Roza <sroza>
Component: netsniff-ng
Assignee: Eric Garver <egarver>
Status: CLOSED ERRATA
QA Contact: xmu
Severity: high
Docs Contact:
Priority: high
Version: 7.3
CC: aloughla, atragler, egarver, jiji, rkhan, sroza
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-04 06:13:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1354445, 1354448, 1360213    
Bug Blocks: 1203710    

Description Sam Roza 2016-01-28 00:39:57 UTC
1. Proposed title of this feature request  
  
  Include fanout support for netsniff-ng in RHEL 7

2. Who is the customer behind the request?  
Account: Citrix/Bytemobile Acct#617148
  
TAM customer: yes  
SRM customer: no
Strategic: yes  
  
3. What is the nature and description of the request?  

Customer requests that we add fanout support to our implementation of netsniff-ng. 
  
4. Why does the customer need this? (List the business requirements here)  

Our use of netsniff-ng is driven by two specific use cases, keeping in mind the nature of our product, a TCP and content optimization proxy:

1. Provide a highly scalable packet capture (pcap) environment for customer support. We have seen many instances of packet loss, which become rather challenging to diagnose in our system because of the number of hops (load balancer, switches in chassis, and blades). tcpdump just does not scale at the relatively modest rates we are looking at (2-3 Gbps): it drops packets itself and hence cannot be used to reliably prove or disprove whether a blade running RHEL sent or received packets. We would like to build a solution around netsniff-ng.

2. The second and more important use case is what we call network analytics research, to improve our TCP optimization algorithms. We want a lightweight but reliable solution with which we can capture packet data up to the TCP headers in a production environment without adversely affecting the performance of the blade. tcpdump, running in a single thread, just does not provide that.

Technical use case aside, the business impact of the first requirement should be clear: continuing to support existing business. The second is more research-oriented and forward-looking, and I doubt anyone could provide numbers for the market it potentially addresses. I hope we can make progress with the technical use case.
  
5. How would the customer like to achieve this? (List the functional requirements here)  

Backport the necessary components for netsniff-ng to allow the use of fanout.
  
6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.  
  
Use the fanout option in a netsniff-ng environment. The customer will assist with testing and test cases if necessary.

7. Is there already an existing RFE upstream or in Red Hat Bugzilla? 

The option may exist upstream already.

8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  

This would be a RHEL 7.2 or 7.3 request. 
  
9. Is the sales team involved in this request and do they have any additional input?  

Not at the present time.
  
10. List any affected packages or components. 

netsniff-ng 
  
11. Would the customer be able to assist in testing this functionality if implemented? 

Yes. Customer is readily available to assist with testing.

Comment 11 xmu 2016-06-13 09:28:03 UTC
(In reply to Sam Roza from comment #0)

> 11. Would the customer be able to assist in testing this functionality if
> implemented? 
> 
> Yes. Customer is readily available to assist with testing.

Hi all,
When will the customer finish the test?
And would the customer share the test plan with us? If so, it will save us time in testing.
Can anybody help answer my questions?
Thanks.

Comment 16 xmu 2016-07-08 10:29:24 UTC
Verified netsniff-ng package on netsniff-ng-0.5.8-8.el7.

Test cases 1, 2 and 4 passed; they were tested on kernel 3.10.0-461.el7.
Test case 3 failed on kernel 3.10.0-461.el7 and passed on the upstream kernel 4.7.0-rc4, so it is a kernel bug, and I will open a bug for this case.

Please see the details:
                  
server:# rpm -q netsniff-ng                                                     
netsniff-ng-0.5.8-8.el7.x86_64                                                  
                                                                                
[Test case 1]: the highly scalable packet capture (pcap) environment for customer support.
Topo: client(host) ---- server(a vm on the host)
                                                                                
client:#hping3 -i u1 -c 100 -I br_wan -p 4000 -S 10.73.131.140                  
                                                                                
server:# tcpdump -i eth0 "port 4000 and tcp[tcpflags] == 0x02" >/dev/null
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C34 packets captured                                                           
100 packets received by filter                                                  
66 packets dropped by kernel >>>>> 66% of packets were dropped by the kernel
                                                                                
server:# netsniff-ng -i eth0 -s -f "port 4000 and tcp[tcpflags] == 0x02"
Running! Hang up with ^C!                                                       
                                                                                
           0  packets incoming (100 unread on exit)                             
         100  packets passed filter                                             
           0  packets failed filter (out of space)                              
      0.0000% packet droprate >>>>> none dropped
           6  sec, 249975 usec in total                                         
                                                                                
[Test case 2]: fanout feature (-C, -K)
Topo: client(host) ---- server(host)
Note: we need to test the CPU option on a physical machine.

server: start three netsniff-ng instances,
#netsniff-ng -C 1 -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
#netsniff-ng -C 1 -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
#netsniff-ng -C 1 -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"

results:
# netsniff-ng -C 1 -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
Running! Hang up with ^C!

           0  packets incoming (34 unread on exit)
          34  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
          33  sec, 408535 usec in total

# netsniff-ng -C 1 -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
Running! Hang up with ^C!

           0  packets incoming (33 unread on exit)
          33  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
          26  sec, 252446 usec in total

# netsniff-ng -C 1 -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
Running! Hang up with ^C!

           0  packets incoming (33 unread on exit)
          33  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
          24  sec, 281861 usec in total

Also tested other options:
---subtest a: option="hash cpu rnd"
server# nc -k -l 4000 &
server# netsniff-ng -C 1 -K $option -s -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
server# netsniff-ng -C 1 -K $option -s -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
client#for i in {1..100}; do echo "hello world" | nc 10.73.131.140 4000; done
result:
all the option tests pass: the summed counts equal the number of connections.

---subtest b: option="roll"
client# for i in $(seq 10); do hping3 -i u1 -c 1000000 -I br_wan -p 4000 -S 10.73.130.153; done
server# netsniff-ng -C 1 -K $option -s -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
server# netsniff-ng -C 1 -K $option -s -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
result:
at first all the flow goes to one instance; under stress it rolls over to the next.

---subtest c: option="qm"
server#netsniff-ng -C 1 -K qm -s -i eno1 -f "port 4000 and tcp[tcpflags] == 0x02"
result: the command prints "Cannot set fanout ring mode!". I checked, and our newest kernel does not support qm yet.

[Test case 3]: fanout feature with -C -K -L
steps:
client# ping -c 2 -s 2000 10.73.131.128
server# netsniff-ng -C 1 -K hash -L defrag -s -i eth0 icmp
results:
test passed with the upstream kernel 4.7.0-rc4 and netsniff-ng-0.5.8-8.
# uname -r
4.7.0-rc4
# rpm -q netsniff-ng
netsniff-ng-0.5.8-8.el7.x86_64

# netsniff-ng -C 1 -K hash -s -i eth0 icmp
Running! Hang up with ^C!

           8  packets incoming (0 unread on exit)
           8  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
           6  sec, 352121 usec in total
# netsniff-ng -C 1 -K hash -L defrag -s -i eth0 icmp
Running! Hang up with ^C!

           4  packets incoming (0 unread on exit)
           4  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
           9  sec, 868458 usec in total

!!! But the test failed on kernel 3.10.0-439.el7, so it appears to be a kernel bug.
# uname -r
3.10.0-439.el7.x86_64
# netsniff-ng -C 1 -K hash -L defrag -s -i eth0 icmp
Running! Hang up with ^C!

           4  packets incoming (4 unread on exit)
           8  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
           6  sec, 471818 usec in total
[root@bootp-73-131-140 ~]# netsniff-ng -C 1 -K hash -s -i eth0 icmp
Running! Hang up with ^C!

           4  packets incoming (4 unread on exit)
           8  packets passed filter
           0  packets failed filter (out of space)
      0.0000% packet droprate
           6  sec, 700179 usec in total


[Test case 4]: fanout feature without -C

server: start 3 netsniff-ng instances without the -C option.
# netsniff-ng -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
# netsniff-ng -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"
# netsniff-ng -K lb -i eth0 -f "port 4000 and tcp[tcpflags] == 0x02"

The 3 results are the same (without a fanout group, each instance captured all the packets):
# netsniff-ng -K lb -i eth0 -s -f "port 4000 and tcp[tcpflags] == 0x02"
Running! Hang up with ^C!

           0  packets incoming (100 unread on exit)
         100  packets passed filter   >>> all packets
           0  packets failed filter (out of space)
      0.0000% packet droprate
           9  sec, 123036 usec in total

Comment 17 Eric Garver 2016-07-08 13:01:23 UTC
(In reply to xmu from comment #16)
> Verified netsniff-ng package on netsniff-ng-0.5.8-8.el7.
> 
> Test cases 1,2,4 passed, and they were tested on kernel 3.10.0-461.el7.
> Test case 3 failed on kernel 3.10.0-461.el7, and passed on upstream kernel
> 4.7.0-rc4. so it's a kernel bug and I will open a bug for this case.

Yes. Please file a BZ for the defrag failure.

Comment 18 xmu 2016-07-12 06:08:08 UTC
I filed two kernel bugs.
Bug 1354445 - netsniff-ng -f can not dump file exactly.
Bug 1354448 - netsniff-ng fanout -L defrag problem

Comment 19 xmu 2016-07-13 03:27:43 UTC
(In reply to xmu from comment #16)
> Verified netsniff-ng package on netsniff-ng-0.5.8-8.el7.
Actually, the netsniff-ng-0.5.8-8.el7 package tests pass on 4.7.0-rc4, and the
results on kernel 3.10.0-461.el7 are only partially correct.
Take test 1 for example:
           0  packets incoming (100 unread on exit)   > wrong                    
         100  packets passed filter    > right

Since the netsniff-ng package's test results are not exact on the RHEL 7 kernel, and this is due to the kernel bugs,
I set this bug to depend on bz1354445 and bz1354448.
I will retest this package after the kernel bugs are verified.

Comment 20 xmu 2016-09-22 02:34:44 UTC
Verified on netsniff-ng-0.5.8-10.el7.x86_64 and kernel 3.10.0-506.el7.x86_64.

# rpm -q netsniff-ng
netsniff-ng-0.5.8-10.el7.x86_64
# uname -r
3.10.0-506.el7.x86_64

Retested the cases from Comment 16. All passed, and the statistics are correct.

I also tested the performance of netsniff-ng. The test results are as follows:
bound the netsniff-ng instances to separate cores;
set all IRQ smp_affinity to cpu0, and the NIC queue smp_affinity to cpu1.
# netsniff-ng -C 1 -K lb -L roll -b $cpuid -s -i ens2f0 -o $cpuid.pcap
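The IRQ-affinity setup described above can be sketched as follows. This is a hedged sketch: the mask arithmetic is standard /proc/irq behavior, but the IRQ numbers are placeholders, not the ones from this test machine.

```shell
# smp_affinity files take a hex CPU bitmap: cpu0 -> 1, cpu1 -> 2, cpu2 -> 4, ...
cpu_mask() { printf '%x\n' $((1 << $1)); }

cpu_mask 0   # mask for cpu0, used for all other IRQs
cpu_mask 1   # mask for cpu1, used for the NIC queue IRQs

# Applying the masks requires root; <IRQ>/<NIC_QUEUE_IRQ> are placeholders:
#   echo "$(cpu_mask 0)" > /proc/irq/<IRQ>/smp_affinity
#   echo "$(cpu_mask 1)" > /proc/irq/<NIC_QUEUE_IRQ>/smp_affinity
# Then pin each capture instance to its own core with -b:
#   netsniff-ng -C 1 -K lb -L roll -b 2 -s -i ens2f0 -o cpu2.pcap
```

Keeping interrupt handling and the capture instances on disjoint cores is what makes the per-instance CPU numbers below meaningful.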

1 netsniff-ng instance -- on cpu2.
It always drops packets due to CPU overload.

2 netsniff-ng instances -- on cpu2 and cpu3.
Throughput was 900M~920M, and the max CPU usage of netsniff-ng was 70%.
Sometimes the packet droprate of each instance was 0%;
sometimes the packet droprate of each instance was 0.2~2%.

3 netsniff-ng instances -- on cpu2, cpu3 and cpu4.
Throughput was 880M~890M, and the max CPU usage of netsniff-ng was 40%.
The packet droprate of each instance was always 0%.

I checked the "ethtool -S ens2f0" and "netstat -s", didn't find abnormal about the dropped packets.

I also tested the other fanout disciplines, hash and rnd; the results were similar.

For this performance issue: when the CPU usage of netsniff-ng is less than 100% (60%~70%), it occasionally drops packets.
I opened bz1377222 to track this issue.
Despite all this, "fanout" is really a solid improvement, so I set VERIFIED.

Comment 22 errata-xmlrpc 2016-11-04 06:13:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2419.html