Bug 1310169 - [Beaker 21.1] KVM Guest task EWD while uploading logs
Status: CLOSED NOTABUG
Product: Beaker
Classification: Community
Component: lab controller
Version: 21
Assigned To: beaker-dev-list
QA Contact: tools-bugs
Type: Bug
Doc Type: Bug Fix
Reported: 2016-02-19 11:06 EST by PaulB
Modified: 2016-03-10 11:29 EST
Last Closed: 2016-03-10 11:29:35 EST

Attachments: None
Description PaulB 2016-02-19 11:06:17 EST
Description of problem:
During kernel testing, KVM Guest task EWD (ltp-lite) while uploading logs 
after successful completion of task.

Version-Release number of selected component (if applicable):
 Beaker: 21.1
 restraint: 0.1.23-2.el6eso     
 distro: RHEL-6.6 Server x86_64
 kernel: 2.6.32-504.el6

How reproducible:
 unknown

Steps to Reproduce:
1. Install target host listed in comment #1 with RHEL-6.6 Server x86_64
2. Create two KVM Guests on the target host
   One i386Guest
    guestargs="--ram=1024 --vcpus=1 --file-size=20 --hvm --kvm --nonsparse"
   One x86_64Guest
    guestargs="--ram=2048 --vcpus=2 --file-size=20 --hvm --kvm --nonsparse"
3. Run Kernel Tier1 testing (KT1) on Guests

Actual results:
https://beaker.engineering.redhat.com/jobs/1233043
https://beaker.engineering.redhat.com/recipes/2497243
https://beaker.engineering.redhat.com/recipes/2497243#task38282408

Expected results:
 Successful completion of uploading task logs

Additional info:
The x86_64Guest had successfully completed the 
/kernel/RHEL6/ltp-lite 20160126-2 task.
The task timed out and EWD while uploading the logs.
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2016/02/12330/1233043/2497243/38282408/taskoutput.log
---<-snip->---
*** Current Time: Fri Feb 19 00:44:24 2016 Localwatchdog at: Fri Feb 19 02:23:15 2016
INFO: ltp-pan reported all tests PASS
---<-snip->---
RHELKT1LITE.FILTERED Passed: 
---<-snip->---
Uploading RHELKT1LITE.FILTERED.run.log *** Current Time: Fri Feb 19 00:45:23 2016 Localwatchdog at: Fri Feb 19 02:23:15 2016
*** Current Time: Fri Feb 19 00:46:23 2016 Localwatchdog at: Fri Feb 19 02:23:15 2016
---<-snip->---
*** Current Time: Fri Feb 19 02:22:23 2016 Localwatchdog at: Fri Feb 19 02:23:15 2016
/usr/share/restraint/plugins/run_task_plugins /usr/share/restraint/plugins/run_plugins
** /10_localwatchdog WARN Score:0
Uploading ps-lwd.log 
---<-snip->---

-End
Comment 4 PaulB 2016-02-26 14:37:21 EST
All,
bpeck suggested that I try running the restraint jobs with IPv6 disabled.
I have kicked off jobs with IPv6 disabled on both target hosts.
I will follow up when testing completes...

Thx for the suggestion, bpeck.

Best,
-pbunyan
Comment 6 Bill Peck 2016-03-03 14:56:48 EST
Paul,

When this is happening with IPv6 enabled can you run ping6 against the lab controller and the beaker server from the SUT?

Thanks
Comment 7 PaulB 2016-03-10 09:03:52 EST
(In reply to Bill Peck from comment #6)
> Paul,
> 
> When this is happening with IPv6 enabled can you run ping6 against the lab
> controller and the beaker server from the SUT?
> 
> Thanks

All,
Tests were run using restraint plugin for "ping6 $BEAKER_LAB_CONTROLLER"
and the issue was _not_ reproducible:
 https://beaker.engineering.redhat.com/jobs/1251058
 https://beaker.engineering.redhat.com/jobs/1251061

Then tests were run using restraint plugin for "ping6 -i 60 $BEAKER_LAB_CONTROLLER" and the issue *was* reproducible:
 https://beaker.engineering.redhat.com/jobs/1252053

Best,
-pbunyan
Comment 8 Bill Peck 2016-03-10 09:13:09 EST
This is not a bug with restraint. Restraint is not able to communicate via IPv6, which is most likely a kernel bug in the KVM bridging module.
Comment 9 PaulB 2016-03-10 09:19:36 EST
All,
We spoke about this issue at our team meeting...

There was speculation on the following BZs:
[] Bug 1293709 - [RHEL-6.2][KVM connectathon ipv6] BUG: soft lockup - CPU#0 
    stuck for 63s! [rpciod/0:1260] 
[] Bug 1065257 - ipv6: bridge: system stops responding to neighbor solicitation 
[] Bug 1167003 - ipv6 connectivity breaks down after upgrading to kernel 
    2.6.32-504.1.3 from 2.6.32-431.29.2 
[] Bug 1228673 - IPv6 neighbor solicitations don't pass bridge


Testing was rerun on the target hosts using:
 distro:  RHEL-6.7 Server x86_64
 kernel: 2.6.32-525.el6
 task: /kernel/RHEL6/ltp-lite 20160126-6 (+killtimeoverride+)
 https://beaker.engineering.redhat.com/jobs/1257386


jstancek dug in when he saw the job was stuck and emailed me the following data:
========================================================
Paul,

I noticed that your test job with -525 kernel got stuck,
so I logged in and tried a couple of things:
  https://beaker.engineering.redhat.com/recipes/2550558
  https://beaker.engineering.redhat.com/recipes/2550559

I found that:

- ping6 to link scope address of LC works
- ping6 to global address of LC doesn't
- tcpdump (promisc mode) has no effect on the issue, either on host or guest
- echo 0 > /sys/devices/virtual/net/br1/bridge/multicast_snooping
   on the host makes the issue go away; this is a strong hint that there
   is something wrong with the bridge on the KVM host
- http://lenovo-kbuilder.lab.bos.redhat.com/~jstancek/bugs/rhel6_ipv6_issue/
  I've captured tcpdump traces from the host and guest,
  and what I noticed is that some neighbour solicitation (NS) packets from the LC
  make it to the host, but don't make it to the guest.
  Specifically, if the NS source address is link scope (fe80::5054:ff:fe38:5099),
  then the guest will receive this packet and also reply (ping6 over the link scope addr works).
  But if the NS source address is the global address (2620:52:0:1040:5054:ff:fe38:5099),
  then the guest will never see this packet.
- the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1167003#c17
  still seems to work, which is quite confusing.

Host pings guest:
[root@tyan-gt24-08 ~]# ping6 -I br1 2620:52:0:1040:5056:ff:fe00:1d
PING 2620:52:0:1040:5056:ff:fe00:1d(2620:52:0:1040:5056:ff:fe00:1d) from 2620:52:0:1040:2e0:81ff:fe47:4af1 br1: 56 data bytes
From 2620:52:0:1040:2e0:81ff:fe47:4af1 icmp_seq=2 Destination unreachable: Address unreachable
From 2620:52:0:1040:2e0:81ff:fe47:4af1 icmp_seq=3 Destination unreachable: Address unreachable
From 2620:52:0:1040:2e0:81ff:fe47:4af1 icmp_seq=4 Destination unreachable: Address unreachable
^C
--- 2620:52:0:1040:5056:ff:fe00:1d ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4300ms

[root@tyan-gt24-08 ~]# echo 0 > /sys/devices/virtual/net/br1/bridge/multicast_snooping
[root@tyan-gt24-08 ~]# ping6 -I br1 2620:52:0:1040:5056:ff:fe00:1d
PING 2620:52:0:1040:5056:ff:fe00:1d(2620:52:0:1040:5056:ff:fe00:1d) from 2620:52:0:1040:2e0:81ff:fe47:4af1 br1: 56 data bytes
64 bytes from 2620:52:0:1040:5056:ff:fe00:1d: icmp_seq=1 ttl=64 time=1.54 ms
64 bytes from 2620:52:0:1040:5056:ff:fe00:1d: icmp_seq=2 ttl=64 time=0.204 ms
64 bytes from 2620:52:0:1040:5056:ff:fe00:1d: icmp_seq=3 ttl=64 time=0.120 ms
^C
--- 2620:52:0:1040:5056:ff:fe00:1d ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2573ms
rtt min/avg/max/mdev = 0.120/0.623/1.547/0.654 ms

- same reproducer didn't work for me with 2.6.32-627.el6.x86_64
- it still looks very similar to these BZs:
  Bug 1293709 - [RHEL-6.2][KVM connectathon ipv6] BUG: soft lockup - CPU#0 stuck for 63s! [rpciod/0:1260] 
  Bug 1065257 - ipv6: bridge: system stops responding to neighbor solicitation 
  Bug 1167003 - ipv6 connectivity breaks down after upgrading to kernel 2.6.32-504.1.3 from 2.6.32-431.29.2
- I think we should try your jobs with latest RHEL6 kernel

==============================
Or any kernel after -585.el6. I'm thinking this is:
  Bug 1228673 - IPv6 neighbor solicitations don't pass bridge

The patch [1] for this BZ says:
"When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored,..."

And I found following MLD report sent by guest:

No.     Time           Source                Destination           Protocol Length Info
    514 29             fe80::5056:ff:fe00:1d ff02::16              ICMPv6   110    Multicast Listener Report Message v2

Frame 514: 110 bytes on wire (880 bits), 110 bytes captured (880 bits)
Ethernet II, Src: 52:56:00:00:00:1d (52:56:00:00:00:1d), Dst: IPv6mcast_00:00:00:16 (33:33:00:00:00:16)
Internet Protocol Version 6, Src: fe80::5056:ff:fe00:1d (fe80::5056:ff:fe00:1d), Dst: ff02::16 (ff02::16)
Internet Control Message Protocol v6
    Type: Multicast Listener Report Message v2 (143)
    Code: 0
    Checksum: 0x205f [correct]
    Reserved: 0000
    Number of Multicast Address Records: 2
    Multicast Address Record Exclude: ff02::202
    Multicast Address Record Exclude: ff02::1:ff00:1d
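
For reference, the MLDv2 report dissected above can be reconstructed and decoded with a short Python sketch, showing that a correct bridge snoop should see *both* multicast address records. Byte offsets follow the MLDv2 wire format (RFC 3810); the checksum value is copied from the capture rather than recomputed, and the record type (MODE_IS_EXCLUDE = 2) is assumed from the "Exclude" label in the dissection.

```python
import ipaddress
import struct

def parse_mldv2_records(payload: bytes):
    """Parse the multicast address records from an MLDv2 report body
    (ICMPv6 type 143). Returns a list of (record_type, address) tuples."""
    # Header: type(1) code(1) checksum(2) reserved(2) record-count(2)
    n = struct.unpack_from("!H", payload, 6)[0]
    off, records = 8, []
    for _ in range(n):
        # Record: type(1) aux-data-len(1) source-count(2) mcast-address(16)
        rtype, aux_len, n_src = struct.unpack_from("!BBH", payload, off)
        addr = ipaddress.IPv6Address(payload[off + 4:off + 20])
        records.append((rtype, str(addr)))
        off += 20 + aux_len * 4 + n_src * 16
    return records

# Hypothetical reconstruction of the guest's report shown above:
# type=143, code=0, checksum from capture (not recomputed), 2 records.
body = struct.pack("!BBHHH", 143, 0, 0x205F, 0, 2)
for grp in ("ff02::202", "ff02::1:ff00:1d"):
    body += struct.pack("!BBH", 2, 0, 0) + ipaddress.IPv6Address(grp).packed

print(parse_mldv2_records(body))  # [(2, 'ff02::202'), (2, 'ff02::1:ff00:1d')]
```

The pre-patch bridge code stops after the first record, so ff02::1:ff00:1d never makes it into the snooping table.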

If I'm reading the patch description correctly, this means that the
2nd address in the report above is going to be ignored by the bridge.

Which looks bad, because "ff02::1:ff00:0/104" is the solicited-node
multicast prefix. That is the address that should be the destination
for neighbour solicitation packets.
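
For reference, the solicited-node group the bridge must forward can be derived from the guest's unicast address (RFC 4291: the ff02::1:ff00:0/104 prefix plus the low 24 bits of the address); a minimal Python sketch:

```python
import ipaddress

def solicited_node(addr: str) -> str:
    """Solicited-node multicast address (RFC 4291) for an IPv6 unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF   # low 24 bits of the address
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))   # solicited-node prefix
    return str(ipaddress.IPv6Address(base | low24))

# Both of the guest's addresses map to the same group seen in `bridge mdb` below:
print(solicited_node("2620:52:0:1040:5056:ff:fe00:1d"))  # ff02::1:ff00:1d
print(solicited_node("fe80::5056:ff:fe00:1d"))           # ff02::1:ff00:1d
```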

# uname -r
2.6.32-584.el6.x86_64
# bridge mdb | grep 1d
# (ping to guest is broken)

With the patch it is:

# uname -r
2.6.32-585.el6.x86_64
# bridge mdb | grep 1d
dev br1 port vnet0 grp ff02::1:ff00:1d temp
dev br1 port eth1 grp ff02::1:ff29:691d temp

guest# ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:56:00:00:00:1d brd ff:ff:ff:ff:ff:ff
    inet 10.16.70.205/21 brd 10.16.71.255 scope global eth0
    inet6 2620:52:0:1040:5056:ff:fe00:1d/64 scope global dynamic 
       valid_lft 2591996sec preferred_lft 604796sec
    inet6 fe80::5056:ff:fe00:1d/64 scope link 
       valid_lft forever preferred_lft forever


[1] http://marc.info/?l=linux-netdev&m=143230806620464&w=2

Regards,
Jan
========================================================

Thank you, Jan!

pbunyan - kicks off testing with latest RHEL6 kernel...

Best,
-pbunyan
Comment 10 PaulB 2016-03-10 11:29:35 EST
(In reply to Bill Peck from comment #8)
> this is not a bug with restraint.  restraint is not able to communicate via
> IPv6 which is a kernel bug most likely in the kvm bridging module.

Apologies for piling on...
I agree with bpeck, this is not an issue with the harness.
This is an ipv6 bridge issue. I have opened the following BZ:
 https://bugzilla.redhat.com/show_bug.cgi?id=1316634

Best, 
-pbunyan
