Bug 1377161
Summary: | ovirt-ha-agent should talk to vdsm via loopback address or other socket family. | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
Component: | ovirt-hosted-engine-ha | Assignee: | Martin Sivák <msivak> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Artyom <alukiano> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.0.3 | CC: | amarchuk, dfediuck, fromani, gklein, gveitmic, lsurette, mkalinin, msivak, rgolan, ykaul, ylavi |
Target Milestone: | ovirt-4.0.4 | Keywords: | TestOnly, Triaged |
Target Release: | 4.0.4 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-10-17 13:09:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1358530 | ||
Bug Blocks: |
Description
Germano Veit Michel
2016-09-19 02:31:08 UTC
- Given there is no network, then the sanlock lease should expire, because we write the timestamp to the whiteboard, so I'd expect sanlock to free (kill) the resource (VM). Can you output `sanlock log_dump` when this happens?

- We clearly see a ping fails, how come we didn't lower the score? Can you run this with debug log level?

(In reply to Roy Golan from comment #2)
> - Given there is no network then the sanlock should expire, cause we write
> the timestamp to the whiteboard, so I'd expect sanlock to free(kill) the
> resource(VM)
> can you output `sanlock log_dump` when this happens?

sanlock may be on FC, or via another interface other than the mgmt, no?

> - We clearly see a ping fails, how come we didn't lower the score? can you
> run this with debug log level?

Hi Roy,

First of all, thank you for the prompt response.

It's fibre channel storage, so sanlock will not expire. And there is also the case where storage might be accessed via a different NIC and network.

It did not lower the score because the agent is just looping (restarting and dying).

I can easily reproduce this in my env. Just put the ovirtmgmt network administratively down. Provided you reach your NFS/iSCSI storage via a different network, you will see the exact same problem.

To me, the main problem is this: the IP address the agent connects to might be attached to an interface that can go down by unplugging a cable.

$ cat sos_commands/process/lsof_-b_M_-n_-l | grep ovirt-ha | grep TCP
ovirt-ha- 19064       36 14u IPv4 39445 0t0 TCP 192.168.7.22:45752->192.168.7.22:54321 (ESTABLISHED)
ovirt-ha- 19064 24267 36 14u IPv4 39445 0t0 TCP 192.168.7.22:45752->192.168.7.22:54321 (ESTABLISHED)

Unplugging a cable should not make vdsm and ha-agent stop talking.

If you need more info, please let me know.

Cheers,
Germano

This change should fix it:

Changing 'socket.gethostname()' to 'localhost' in the vdsm lib the agent is using.

https://gerrit.ovirt.org/#/c/63308/2/lib/vdsm/jsonrpcvdscli.py

Please confirm.
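The difference the proposed one-line change makes can be sketched outside vdsm. This is an illustrative snippet, not vdsm code; only port 54321 comes from the bug, the rest is a minimal demonstration of how the two connect targets resolve:

```python
import socket

# socket.gethostname() resolves to whatever address the host's name maps
# to -- typically an address bound to a physical NIC (192.168.7.22 in the
# lsof output above) -- so the TCP session dies with the cable.
# "localhost" resolves to the loopback device, which stays up no matter
# which interfaces go down.
for host in (socket.gethostname(), "localhost"):
    try:
        addrs = sorted({ai[4][0] for ai in
                        socket.getaddrinfo(host, 54321,
                                           proto=socket.IPPROTO_TCP)})
        print(host, "->", addrs)
    except socket.gaierror as exc:
        print(host, "-> resolution failed:", exc)
```

On the reporter's host the first line would show the NIC address, while "localhost" shows only 127.0.0.1 and/or ::1.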
After applying that fix it seems to fall back to using the IPv6 loopback:

ovirt-ha- 1503      36 8u IPv6 78992 0t0 TCP [::1]:54156->[::1]:54321 (ESTABLISHED)
ovirt-ha- 1503 1912 36 8u IPv6 78992 0t0 TCP [::1]:54156->[::1]:54321 (ESTABLISHED)

Looks good.

(In reply to Germano Veit Michel from comment #6)
> This change should fix it:
>
> Changing 'socket.gethostname()' to 'localhost' in the vdsm lib the agent is
> using.
>
> https://gerrit.ovirt.org/#/c/63308/2/lib/vdsm/jsonrpcvdscli.py
>
> Please confirm.

Yes, it prevented other problems with the client, and it will resolve the problem of the interface going down.

One more problem (that may be masqueraded by this fix) is that we keep looping and not lowering the score, even though the ping failed, which doesn't look right. Martin?

Guys, I think the same issue will happen to MOM and the vdsClient tools. We use the default address the vdsm library provides to us, and this should be fixed there, as Germano found out.

Francesco: can you please confirm this? It might affect any vdsm client, not just HE (what about supervdsm, btw?)

(In reply to Martin Sivák from comment #9)
> Guys, I think the same issue will happen to MOM and vdsClient tools. We use
> the default address the vdsm library provides to us and this should be fixed
> there as Germano found out.
>
> Francesco: can you please confirm this? It might affect any vdsm client, not
> just HE (what about supervdsm btw?)

An added bonus is that we can probably skip SSL in this case, but I assume it requires some further changes. It would be nice to move to UNIX sockets.

> One more problem (that may be masqueraded by this fix) is that we keep
> looping and not lowering the score, even though the ping failed which
> doesn't look right. Martin?
The ping monitor tries to reach the network gateway; it is not related to the vdsm socket and should still fail. We keep looping because we never get to the actual score calculation (and we no longer publish any updates). Other nodes should stop seeing updates from this host and act accordingly, though.

On the other hand, nobody will kill the local VM if sanlock is still fine.
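The interplay Martin describes can be modelled as a toy score function: the gateway ping monitor is independent of the vdsm socket, so once the agent survives the NIC going down (thanks to loopback), a failed gateway ping can still lower the score. The base score (3400) and gateway penalty (1600) are assumed values for illustration, not taken from the agent's source:

```python
BASE_SCORE = 3400       # assumed healthy-host score
GATEWAY_PENALTY = 1600  # assumed penalty when the gateway ping fails

def host_score(gateway_reachable: bool) -> int:
    """Return the host score after applying the gateway-ping penalty."""
    penalty = 0 if gateway_reachable else GATEWAY_PENALTY
    return BASE_SCORE - penalty

print(host_score(True))   # healthy host
print(host_score(False))  # gateway unreachable
```

With these assumed numbers, a host that cannot ping its gateway ends up at 1800, matching the drop observed during verification below.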
Ah, I see the localhost patch was already merged to the necessary branches. This requires no code change to hosted engine and is therefore test only.

(In reply to Martin Sivák from comment #9)
> Guys, I think the same issue will happen to MOM and vdsClient tools. We use
> the default address the vdsm library provides to us and this should be fixed
> there as Germano found out.
>
> Francesco: can you please confirm this? It might affect any vdsm client, not
> just HE (what about supervdsm btw?)

jsonrpcvdscli is indeed the recommended way to talk with Vdsm, so yes, it will affect each and every client. Vdsm talks to Supervdsm using a UNIX domain socket, so this part should be fine.

Verified on vdsm-4.18.13-1.el7ev.x86_64 and ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch

I do not have the possibility to check it on FC, so I checked it on NFS.

1) Configure two interfaces on the host and use one as the management network
2) Bring down the management interface: "ip link set dev ovirtmgmt down"
3) Check that vdsm still succeeds in talking with the ovirt-ha-agent
4) Block the connection to the gateway via iptables and check that ovirt-ha-agent drops the score of the host to 1800
5) Check lsof:

# lsof | grep ovirt-ha | grep TCP
ovirt-ha- 119122        vdsm  8u IPv6 1342661 0t0 TCP localhost:52498->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122        vdsm 12u IPv6 1362041 0t0 TCP localhost:52964->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 119531 vdsm  8u IPv6 1342661 0t0 TCP localhost:52498->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 119531 vdsm 12u IPv6 1362041 0t0 TCP localhost:52964->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 120865 vdsm  8u IPv6 1342661 0t0 TCP localhost:52498->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 120865 vdsm 12u IPv6 1362041 0t0 TCP localhost:52964->localhost:54321 (ESTABLISHED)
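The lsof check in step 5 can be approximated programmatically: open a connection the way the patched client does (to "localhost") and confirm the kernel picked a loopback address for it. This is a self-contained sketch using an ephemeral port; it does not assume vdsm's port 54321 is listening:

```python
import socket

# Stand-in for the verification's lsof step: listen on the loopback
# interface, connect to "localhost" the way the patched agent would, and
# verify the connection's local address is a loopback address -- i.e. one
# that survives a physical NIC going down.
server = socket.create_server(("localhost", 0))  # ephemeral port
port = server.getsockname()[1]

client = socket.create_connection(("localhost", port))
local_ip = client.getsockname()[0]
print("client bound to", local_ip)
assert local_ip in ("127.0.0.1", "::1"), "connection is not on loopback"

client.close()
server.close()
```

If the client had connected to the machine's hostname instead, the local address would be the NIC-bound IP, exactly the pre-fix behaviour reported above.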