Bug 1549642
Summary: | Race condition between host up at engine eyes and SuperVdsm.ServerCallback::add_sourceroute on DHCP configured hosts
Product: | [oVirt] ovirt-engine | Reporter: | Simone Tiraboschi <stirabos>
Component: | BLL.Network | Assignee: | Petr Horáček <phoracek>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Liran Rotenberg <lrotenbe>
Severity: | high
Priority: | unspecified
Version: | 4.2.2 | CC: | bugs, cshao, huzhao, jiaczhan, khakimi, lveyde, mavital, nsednev, qiyuan, stirabos, weiwang, ycui, yzhao
Target Milestone: | ovirt-4.2.2 | Keywords: | Triaged
Target Release: | --- | Flags: | rule-engine: ovirt-4.2+, rule-engine: exception+
Hardware: | Unspecified
OS: | Unspecified
Fixed In Version: | ovirt-engine-4.2.2.4 | Doc Type: | If docs needed, set a value
Story Points: | ---
Last Closed: | 2018-03-29 10:54:39 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | Category: | ---
oVirt Team: | Network
Cloudforms Team: | ---
Bug Blocks: | 1455169, 1522737, 1534212
Description
Simone Tiraboschi
2018-02-27 14:46:55 UTC
Created attachment 1401390 [details]
vdsm logs
Created attachment 1401391 [details]
supervdsm logs
Created attachment 1401392 [details]
hosted-engine-setup logs
Created attachment 1401393 [details]
host-deploy-ansible logs
Created attachment 1401394 [details]
host-deploy logs
Created attachment 1401395 [details]
engine logs
I think it's due to the handling of DHCP responses. VDSM registers add_sourceroute to react to DHCP responses: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/network/initializer.py#L67
If the host already has a DHCP lease, the deployment continues and the host comes up with that address. Then, once the DHCP response arrives (and if the DHCP server is slow enough, the engine may already see the host as up by that time), add_sourceroute kicks in, and this is exactly what happened here.
Indeed, the DHCP exchange completed exactly when we lost the host:
Feb 27 11:12:23 10-37-160-177 dhclient[30837]: bound to 10.37.160.177 -- renewal in 1140207 seconds.
But the DHCP client process had started about 46 seconds earlier:
Feb 27 11:11:37 10-37-160-177 dhclient[30837]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 6 (xid=0x54bcf2d5)
Feb 27 11:11:43 10-37-160-177 dhclient[30837]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 10 (xid=0x54bcf2d5)
Feb 27 11:11:45 10-37-160-177 dhclient[30837]: DHCPREQUEST on ovirtmgmt to 255.255.255.255 port 67 (xid=0x54bcf2d5)
Feb 27 11:11:45 10-37-160-177 dhclient[30837]: DHCPOFFER from 10.37.160.253
Feb 27 11:11:45 10-37-160-177 dhclient[30837]: DHCPACK from 10.37.160.253 (xid=0x54bcf2d5)
Feb 27 11:12:23 10-37-160-177 dhclient[30837]: bound to 10.37.160.177 -- renewal in 1140207 seconds.
Created attachment 1401416 [details]
/var/log/messages from the host
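The timing in the dhclient excerpt above can be checked with a short sketch. It only parses three of the quoted log lines and subtracts their timestamps. The year is an assumption (syslog lines carry none; 2018 is taken from the bug's filing date), the helper name syslog_time is hypothetical, and the "%b" month parsing assumes an English/C locale.

```python
from datetime import datetime


def syslog_time(line, year=2018):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    The year is not part of the log line; 2018 is an assumption
    based on the date this bug was filed.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Lines copied from the dhclient excerpt in the description.
first_discover = ("Feb 27 11:11:37 10-37-160-177 dhclient[30837]: "
                  "DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 "
                  "interval 6 (xid=0x54bcf2d5)")
ack = ("Feb 27 11:11:45 10-37-160-177 dhclient[30837]: "
       "DHCPACK from 10.37.160.253 (xid=0x54bcf2d5)")
bound = ("Feb 27 11:12:23 10-37-160-177 dhclient[30837]: "
         "bound to 10.37.160.177 -- renewal in 1140207 seconds.")

# Seconds from the first DHCPDISCOVER to the 'bound' message.
total_seconds = (syslog_time(bound) - syslog_time(first_discover)).total_seconds()
# Seconds from the DHCPACK to the 'bound' message.
seconds_after_ack = (syslog_time(bound) - syslog_time(ack)).total_seconds()

print(total_seconds, seconds_after_ack)
```

Most of the delay sits between the DHCPACK and the "bound" message, which is the window in which the host can already be reported as up.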
How common is this? How severe is it for you? Can you explain how hosted engine setup modifies host routing?
(In reply to Dan Kenigsberg from comment #9)
> How common is this?
Probably not that common, but we definitely hit this more than once.
> How severe is it for you?
The nasty point is that the error the user gets is almost random, depending on when we lose the host in the middle of the deployment, and this can cause a lot of noise. I'd say high.
> Can you explain how hosted engine setup modifies host routing?
The local bootstrap VM runs over the default libvirt NATed network. We start it with:
virsh net-start default
and the routing rules look like:
[root@c74he20180108h1 ~]# ip route
default via 192.168.1.1 dev eth0 proto static metric 100
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.114 metric 100
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
[root@c74he20180108h1 ~]# ip rule list
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
Then ovirt-hosted-engine-setup adds the host to the engine, so the ovirtmgmt bridge gets created and SuperVdsm.ServerCallback::add_sourceroute gets called, which disrupts the libvirt default NATed network.
We detect it and we have a kind of workaround that fixes it by running:
ip route add 192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 table 170238129
where 170238129 is the id of the ovirtmgmt table. 'src 192.168.122.1' is enough to win over the rules added by add_sourceroute, so the engine VM is still able to talk with the engine.
The problem arises if, as in this bug, add_sourceroute gets called after our workaround and disrupts it.
add_sourceroute runs:
ip -4 route add 0.0.0.0/0 via 10.37.160.254 dev ovirtmgmt table 170238129
ip -4 route add 10.37.160.0/24 via 10.37.160.177 dev ovirtmgmt table 170238129
ip rule add from 10.37.160.0/24 table 170238129
ip rule add from all to 10.37.160.0/24 dev ovirtmgmt table 170238129
> ip rule add from 10.37.160.0/24 table 170238129
> ip rule add from all to 10.37.160.0/24 dev ovirtmgmt table 170238129
These are the rules that cause traffic to be routed based on the special table (170238129).
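As a side note on the table number: in this report, 170238129 equals the 32-bit integer value of the host address 10.37.160.177 seen in the dhclient log. The sketch below only verifies that arithmetic with the Python standard library. It does not call any VDSM code, and the helper name address_as_int is hypothetical; treating this as the way the table id is derived is an assumption drawn from the numbers in this bug.

```python
import ipaddress


def address_as_int(addr):
    """Return the 32-bit integer value of a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(addr))


# 10*2**24 + 37*2**16 + 160*2**8 + 177
table_id = address_as_int("10.37.160.177")
print(table_id)
```

If that assumption holds, the table id for a DHCP-configured host can be predicted from its leased address.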
If you need to bypass it, you just need to know which IP/subnet you need to 'protect' and add a high priority rule (a low pref/priority number) on the host so it will be matched before the ones sourceroute adds:
ip rule add from 10.37.160.0/24 priority 100 table main
ip rule add from all to 10.37.160.0/24 dev ovirtmgmt priority 100 table main
Can you try it?
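Why a low priority number wins can be illustrated with a small, simplified model of policy routing: rules are scanned in increasing priority order, and the first rule whose selector matches and whose table contains a route to the destination decides. This is a sketch, not kernel or VDSM code. Only the "from" selector is modeled (the "to ... dev" rule is left out), the table contents, the priority 32765 for the rule added without an explicit priority, and the engine VM address 192.168.122.10 are all assumptions chosen for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int  # lower number is evaluated first
    src: str       # source prefix selector; "0.0.0.0/0" means "from all"
    table: str     # name of the routing table to consult


def lookup(rules, tables, src, dst):
    """Return (table, device) chosen for a packet from src to dst, or None."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in sorted(rules, key=lambda r: r.priority):
        if src_ip not in ipaddress.ip_network(rule.src):
            continue  # selector does not match, try the next rule
        candidates = [
            (ipaddress.ip_network(prefix), dev)
            for prefix, dev in tables.get(rule.table, [])
            if dst_ip in ipaddress.ip_network(prefix)
        ]
        if candidates:
            # Longest prefix match inside the selected table.
            _, dev = max(candidates, key=lambda c: c[0].prefixlen)
            return rule.table, dev
    return None


# Assumed table contents, built from the addresses quoted in this bug.
tables = {
    "main": [("0.0.0.0/0", "ovirtmgmt"),
             ("10.37.160.0/24", "ovirtmgmt"),
             ("192.168.122.0/24", "virbr0")],
    "170238129": [("0.0.0.0/0", "ovirtmgmt"),
                  ("10.37.160.0/24", "ovirtmgmt")],
}

rules = [
    Rule(32766, "0.0.0.0/0", "main"),
    # Rule added by add_sourceroute; the priority value is an assumption.
    Rule(32765, "10.37.160.0/24", "170238129"),
]

# Host (10.37.160.177) talking to a hypothetical engine VM on virbr0.
before = lookup(rules, tables, "10.37.160.177", "192.168.122.10")

# Suggested bypass: a priority 100 rule pointing at the main table.
rules.append(Rule(100, "10.37.160.0/24", "main"))
after = lookup(rules, tables, "10.37.160.177", "192.168.122.10")

print(before, after)
```

In this model the source-routing rule sends the traffic out through ovirtmgmt until the priority 100 rule is added, after which the main table and its virbr0 route are consulted first.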
(In reply to Edward Haas from comment #11)
> Can you try it?
Sure, thanks. Do you think https://gerrit.ovirt.org/88295/ is enough?
(In reply to Simone Tiraboschi from comment #12)
> (In reply to Edward Haas from comment #11)
> > Can you try it?
>
> Sure, thanks. Do you think https://gerrit.ovirt.org/88295/ is enough?
I don't think that change is valid: the command is not "ip route" but "ip rule", and the arguments are different if I'm not mistaken.
Created attachment 1402414 [details]
before add_sourceroute
Created attachment 1402415 [details]
after add_sourceroute
*** Bug 1552027 has been marked as a duplicate of this bug. ***
*** Bug 1551289 has been marked as a duplicate of this bug. ***
Works for me on these components:
ovirt-hosted-engine-ha-2.2.7-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.13-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-861.el7.x86_64 #1 SMP Wed Mar 14 10:21:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Deployment worked fine over iSCSI storage.
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.