Bug 1310169
| Summary: | [Beaker 21.1] KVM Guest task EWD while uploading logs | | |
| --- | --- | --- | --- |
| Product: | [Retired] Beaker | Reporter: | PaulB <pbunyan> |
| Component: | lab controller | Assignee: | beaker-dev-list |
| Status: | CLOSED NOTABUG | QA Contact: | tools-bugs <tools-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 21 | CC: | asavkov, bpeck, dcallagh, jburke, jstancek, mjia, pbunyan, rjoost, rvrbovsk |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-03-10 16:29:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
Description
PaulB
2016-02-19 16:06:17 UTC
All, bpeck had suggested I trying running the restraint jobs with ipv6 disabled. I have kicked off jobs with ipv6 disabled on both target hosts. I will follow up when testing completes... Thx for the suggestion, bpeck. Best, -pbunyan Paul, When this is happening with IPv6 enabled can you run ping6 against the lab controller and the beaker server from the SUT? Thanks (In reply to Bill Peck from comment #6) > Paul, > > When this is happening with IPv6 enabled can you run ping6 against the lab > controller and the beaker server from the SUT? > > Thanks All, Tests were run using restraint plugin for "ping6 $BEAKER_LAB_CONTROLLER" and the issue was _not_ reproducible: https://beaker.engineering.redhat.com/jobs/1251058 https://beaker.engineering.redhat.com/jobs/1251061 Then tests were run using restraint plugin for "ping6 -i 60 $BEAKER_LAB_CONTROLLER" and the issue *was* reproducible: https://beaker.engineering.redhat.com/jobs/1252053 Best, -pbunyan this is not a bug with restraint. restraint is not able to communicate via IPv6 which is a kernel bug most likely in the kvm bridging module. All, We spoke about this issue our team meeting... There was speculation on the following BZs: [] Bug 1293709 - [RHEL-6.2][KVM connectathon ipv6] BUG: soft lockup - CPU#0 stuck for 63s! 
[rpciod/0:1260]
- Bug 1065257 - ipv6: bridge: system stops responding to neighbor solicitation
- Bug 1167003 - ipv6 connectivity breaks down after upgrading to kernel 2.6.32-504.1.3 from 2.6.32-431.29.2
- Bug 1228673 - IPv6 neighbor solicitations don't pass bridge

Testing was rerun on the target hosts using:
- distro: RHEL-6.7 Server x86_64
- kernel: 2.6.32-525.el6
- task: /kernel/RHEL6/ltp-lite 20160126-6ltp-lite (+killtimeoverride+)

https://beaker.engineering.redhat.com/jobs/1257386

jstancek dug in when he saw the job was stuck and emailed me the following data:

========================================================
Paul,

I noticed that your test job with the -525 kernel got stuck, so I logged in and tried a couple of things:
https://beaker.engineering.redhat.com/recipes/2550558
https://beaker.engineering.redhat.com/recipes/2550559

I found that:
- ping6 to the link-scope address of the LC works
- ping6 to the global address of the LC doesn't
- tcpdump (promisc mode) has no effect on the issue, either on host or guest
- echo 0 > /sys/devices/virtual/net/br1/bridge/multicast_snooping on the host makes the issue go away; this is a strong hint that there is something wrong with the bridge on the KVM host
- http://lenovo-kbuilder.lab.bos.redhat.com/~jstancek/bugs/rhel6_ipv6_issue/ - I've captured tcpdump traces from host and guest, and what I noticed is that some neighbour solicitation (NS) packets from the LC make it to the host but don't make it to the guest. Specifically, if the NS source address is link scope (fe80::5054:ff:fe38:5099), then the guest will receive this packet and also reply (ping6 over a link-scope address works). But if the NS source address is a global address (2620:52:0:1040:5054:ff:fe38:5099), then the guest will never see this packet.
- the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1167003#c17 still seems to work, which is quite confusing.
Host pings guest:

[root@tyan-gt24-08 ~]# ping6 -I br1 2620:52:0:1040:5056:ff:fe00:1d
PING 2620:52:0:1040:5056:ff:fe00:1d(2620:52:0:1040:5056:ff:fe00:1d) from 2620:52:0:1040:2e0:81ff:fe47:4af1 br1: 56 data bytes
From 2620:52:0:1040:2e0:81ff:fe47:4af1 icmp_seq=2 Destination unreachable: Address unreachable
From 2620:52:0:1040:2e0:81ff:fe47:4af1 icmp_seq=3 Destination unreachable: Address unreachable
From 2620:52:0:1040:2e0:81ff:fe47:4af1 icmp_seq=4 Destination unreachable: Address unreachable
^C
--- 2620:52:0:1040:5056:ff:fe00:1d ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4300ms

[root@tyan-gt24-08 ~]# echo 0 > /sys/devices/virtual/net/br1/bridge/multicast_snooping
[root@tyan-gt24-08 ~]# ping6 -I br1 2620:52:0:1040:5056:ff:fe00:1d
PING 2620:52:0:1040:5056:ff:fe00:1d(2620:52:0:1040:5056:ff:fe00:1d) from 2620:52:0:1040:2e0:81ff:fe47:4af1 br1: 56 data bytes
64 bytes from 2620:52:0:1040:5056:ff:fe00:1d: icmp_seq=1 ttl=64 time=1.54 ms
64 bytes from 2620:52:0:1040:5056:ff:fe00:1d: icmp_seq=2 ttl=64 time=0.204 ms
64 bytes from 2620:52:0:1040:5056:ff:fe00:1d: icmp_seq=3 ttl=64 time=0.120 ms
^C
--- 2620:52:0:1040:5056:ff:fe00:1d ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2573ms
rtt min/avg/max/mdev = 0.120/0.623/1.547/0.654 ms

- the same reproducer didn't work for me with 2.6.32-627.el6.x86_64
- it still looks very similar to these BZs:
  Bug 1293709 - [RHEL-6.2][KVM connectathon ipv6] BUG: soft lockup - CPU#0 stuck for 63s! [rpciod/0:1260]
  Bug 1065257 - ipv6: bridge: system stops responding to neighbor solicitation
  Bug 1167003 - ipv6 connectivity breaks down after upgrading to kernel 2.6.32-504.1.3 from 2.6.32-431.29.2
- I think we should try your jobs with the latest RHEL6 kernel

==============================

Or any kernel after -585.el7.
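(Editorial aside: Jan's workaround, disabling multicast snooping on the host bridge, can be wrapped in a small helper for re-testing. A minimal sketch, assuming a bridge named br1 as in the transcript; the function names are illustrative, not part of any Beaker tooling.)

```python
from pathlib import Path

def snooping_knob(bridge: str) -> Path:
    """sysfs file that controls IGMP/MLD snooping for a Linux bridge."""
    return Path("/sys/devices/virtual/net") / bridge / "bridge" / "multicast_snooping"

def disable_snooping(bridge: str = "br1") -> bool:
    """Apply the workaround from the comment above (echo 0 > ...).

    Returns False if the bridge does not exist on this machine.
    Writing the knob requires root.
    """
    knob = snooping_knob(bridge)
    if not knob.exists():
        return False
    knob.write_text("0\n")
    return True
```

After running this (or the equivalent echo), re-running `ping6 -I br1 <guest>` should succeed if snooping was the culprit, matching the before/after transcript above.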
I'm thinking this is:
Bug 1228673 - IPv6 neighbor solicitations don't pass bridge

The patch [1] for this BZ says: "When more than a multicast address is present in a MLDv2 report, all but the first address is ignored,..."

And I found the following MLD report sent by the guest:

No.  Time  Source                 Destination  Protocol  Length  Info
514  29    fe80::5056:ff:fe00:1d  ff02::16     ICMPv6    110     Multicast Listener Report Message v2

Frame 514: 110 bytes on wire (880 bits), 110 bytes captured (880 bits)
Ethernet II, Src: 52:56:00:00:00:1d (52:56:00:00:00:1d), Dst: IPv6mcast_00:00:00:16 (33:33:00:00:00:16)
Internet Protocol Version 6, Src: fe80::5056:ff:fe00:1d (fe80::5056:ff:fe00:1d), Dst: ff02::16 (ff02::16)
Internet Control Message Protocol v6
Type: Multicast Listener Report Message v2 (143)
Code: 0
Checksum: 0x205f [correct]
Reserved: 0000
Number of Multicast Address Records: 2
Multicast Address Record Exclude: ff02::202
Multicast Address Record Exclude: ff02::1:ff00:1d

If I'm reading the patch description correctly, this means that the 2nd address in the report above is going to be ignored by the bridge. That looks bad, because "ff02::1:ff00:0/104" is the solicited-node multicast range, i.e. the addresses that should be the destination of neighbour solicitation packets.

# uname -r
2.6.32-584.el6.x86_64
# bridge mdb | grep 1d
# (ping to guest is broken)

With the patch it is ...
# uname -r
2.6.32-585.el6.x86_64
# bridge mdb | grep 1d
dev br1 port vnet0 grp ff02::1:ff00:1d temp
dev br1 port eth1 grp ff02::1:ff29:691d temp

guest# ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:56:00:00:00:1d brd ff:ff:ff:ff:ff:ff
inet 10.16.70.205/21 brd 10.16.71.255 scope global eth0
inet6 2620:52:0:1040:5056:ff:fe00:1d/64 scope global dynamic
valid_lft 2591996sec preferred_lft 604796sec
inet6 fe80::5056:ff:fe00:1d/64 scope link
valid_lft forever preferred_lft forever

[1] http://marc.info/?l=linux-netdev&m=143230806620464&w=2

Regards,
Jan
========================================================

Thank you, Jan!
pbunyan - kicks off testing with the latest RHEL6 kernel...
Best,
-pbunyan

(In reply to Bill Peck from comment #8)
> this is not a bug with restraint. restraint is not able to communicate via
> IPv6 which is a kernel bug most likely in the kvm bridging module.

Apologies for piling on... I agree with bpeck, this is not an issue with the harness. This is an IPv6 bridge issue. I have opened the following BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1316634
Best,
-pbunyan
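(Editorial aside: Jan's analysis above turns on the fact that the second MLD record, ff02::1:ff00:1d, is exactly the solicited-node multicast group for the guest's addresses. The RFC 4291 mapping — the low 24 bits of the unicast address appended to ff02::1:ff00:0/104 — can be checked directly. A sketch using the guest addresses from Jan's capture; the function name is illustrative.)

```python
import ipaddress

def solicited_node(addr: str) -> str:
    """Solicited-node multicast address (RFC 4291, section 2.7.1):
    the low 24 bits of the unicast address ORed into ff02::1:ff00:0/104."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))

# Both guest addresses from the capture map to the record the bridge dropped:
print(solicited_node("2620:52:0:1040:5056:ff:fe00:1d"))  # ff02::1:ff00:1d
print(solicited_node("fe80::5056:ff:fe00:1d"))           # ff02::1:ff00:1d
```

Both results equal the "Multicast Address Record Exclude: ff02::1:ff00:1d" entry in the capture, which is why a bridge that ignores every record after the first stops forwarding neighbour solicitations to the guest.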