Bug 1730776 - Hosted-engine fails liveliness check after network switch
Summary: Hosted-engine fails liveliness check after network switch
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: future
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@ovirt.org
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-17 15:08 UTC by crl.langlois@gmail.com
Modified: 2023-09-14 05:31 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-06 13:07:23 UTC
oVirt Team: Network
Embargoed:



Description crl.langlois@gmail.com 2019-07-17 15:08:47 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 crl.langlois@gmail.com 2019-07-17 15:33:12 UTC
After switching the network IP addresses on all nodes (from 10.8.236.x to 10.16.248.x), the hosted engine keeps failing the liveliness check.

All nodes have their names resolved correctly. I can also ssh to the engine using its hostname.

But in ovn-controller.log the old IP address of the engine is still used.
ovn-controller.log snippet:

2019-07-17T15:28:08.625Z|00792|reconnect|INFO|ssl:10.8.236.244:6642: connection attempt timed out
2019-07-17T15:28:08.625Z|00793|reconnect|INFO|ssl:10.8.236.244:6642: waiting 8 seconds before reconnect
2019-07-17T15:28:16.633Z|00794|reconnect|INFO|ssl:10.8.236.244:6642: connecting..

Info of the ovn-controller configuration in the Open vSwitch DB:

[root@ovhost1 ~]# ovs-vsctl list Open_vSwitch
_uuid               : 585f81f4-edc4-4ba9-b68e-719795007411
bridges             : [7e653df6-2e43-4963-a538-fc6c0b7aa5a2]
cur_cfg             : 1
datapath_types      : [netdev, system]
db_version          : "7.15.1"
external_ids        : {hostname="ovhost1", ovn-bridge-mappings="", ovn-encap-ip="10.8.236.162", ovn-encap-type=geneve, ovn-remote="ssl:10.8.236.244:6642", rundir="/var/run/openvswitch", system-id="27f10a45-1543-458f-8fb0-b5a01a87b045"}
iface_types         : [geneve, gre, internal, lisp, patch, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 1
other_config        : {}
ovs_version         : "2.9.0"
ssl                 : []
statistics          : {}
system_type         : centos
system_version      : "7"


The ovn-encap-ip="10.8.236.162" and ovn-remote="ssl:10.8.236.244:6642" values are still on the old network.
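
To read back just those two keys directly (a quick check using standard ovs-vsctl syntax):

ovs-vsctl get Open_vSwitch . external-ids:ovn-remote
ovs-vsctl get Open_vSwitch . external-ids:ovn-encap-ip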


Current routing table (output of route):
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    0      0        0 ovirtmgmt
10.16.248.0     0.0.0.0         255.255.255.0   U     0      0        0 ovirtmgmt
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eno1
link-local      0.0.0.0         255.255.0.0     U     1003   0        0 eno2
link-local      0.0.0.0         255.255.0.0     U     1024   0        0 ovirtmgmt



Current IP address of the host:

ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a0:36:9f:28:17:9e brd ff:ff:ff:ff:ff:ff
    inet 10.16.248.65/24 brd 10.16.248.255 scope global dynamic ovirtmgmt
       valid_lft 82130sec preferred_lft 82130sec
    inet6 fe80::a236:9fff:fe28:179e/64 scope link 
       valid_lft forever preferred_lft forever


How do we get ovn-controller to update these IP addresses to the new network?

Thanks

Comment 2 Michael Burman 2019-07-23 06:00:44 UTC
Thank you for the report.

Please try to restart ovn-controller + ovirt-provider-ovn.
If that doesn't help, you will need to reinstall the host in the cluster, ensuring the OVN network provider is enabled for the cluster.
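
For example (assuming the usual systemd unit names; ovn-controller runs on the host, ovirt-provider-ovn on the engine VM):

# On the host:
systemctl restart ovn-controller
# On the engine VM:
systemctl restart ovirt-provider-ovn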

Comment 3 crl.langlois@gmail.com 2019-07-23 13:46:22 UTC
Thank you for your response.

Since creating this bug I have managed to fix the ovn-controller issue, but I do not suspect it to be the cause of my current issue.

I fixed the ovn-controller issue with this command:
ovs-vsctl set Open_vSwitch . external-ids:ovn-remote=ssl:10.8.248.74:6642
That fixed the connection issue.
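
A similar command should in principle also update the stale encapsulation IP noted in comment 1 (just a sketch; I am assuming the new encap IP should be the host's new ovirtmgmt address, 10.16.248.65 in my case):

ovs-vsctl set Open_vSwitch . external-ids:ovn-encap-ip=10.16.248.65   # assumption: use the host's new ovirtmgmt IP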

Even with this fix I still have the liveliness check failure. It is hard to see what is failing because I am not sure what this check is trying to achieve.
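
If the check is essentially an HTTP probe of the engine's health page (which is only my assumption, I could not confirm it), a manual test from the host would be something like:

curl http://<engine-fqdn>/ovirt-engine/services/health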

The current situation is as follows:

3 hosted-engine hosts (originally on the 10.8.236.x network)
1 host switched to the 10.16.248.x network
All hosts can be resolved by forward and reverse lookup.

If I boot the engine on one of the hosts that is on the 236.x network, the engine comes up with a good status (the engine gets assigned a 236.x address when booting on one of those hosts).
If I boot the engine on the host that was moved to the 248.x network, the engine comes up but fails the liveliness check (on this host the engine gets assigned a 248.x address).

At this point, when the engine is up but the status is bad, I can ssh to it without any issue. The engine service starts okay but fails after a couple of minutes. The only thing I have is this in the service status:
2019-07-23 09:08:41,177-0400 ovirt-engine: ERROR run:554 Error: process terminated with status code 1


I am not sure where to look at this point.

Here is what I also did:

When I boot the engine on the 236.x network, I can access the WebUI without issue. I then reinstalled the host that was moved to the 248.x network. The installation finished successfully and activation also worked as expected.

I have searched for information on what this liveliness check does but did not find what I was looking for.


The ultimate goal is to move all the hosts to the 248.x network.

Regards

Comment 4 crl.langlois@gmail.com 2019-07-23 15:25:03 UTC
In /etc/httpd/logs/error_log I always get these messages:

[Tue Jul 23 11:21:52.430555 2019] [proxy:error] [pid 3189] AH00959: ap_proxy_connect_backend disabling worker for (127.0.0.1) for 5s
[Tue Jul 23 11:21:52.430562 2019] [proxy_ajp:error] [pid 3189] [client 10.16.248.65:35154] AH00896: failed to make connection to backend: 127.0.0.1

10.16.248.65 is the address of the host that was moved to the new network.
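
On the engine VM I could also check whether the backend that httpd proxies to is actually listening (a sketch; I am assuming 8702 is the engine's AJP backend port, the exact value should be verified in the httpd proxy configuration under /etc/httpd/conf.d/):

# On the engine VM:
systemctl status ovirt-engine
ss -tlnp | grep 8702    # assumption: 8702 is the AJP backend port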

And in the access_log I have:

Comment 5 Dominik Holler 2019-07-30 15:20:57 UTC
Carl, can you please share the relevant /var/log/ovirt-hosted-engine-ha/agent.log and /var/log/ovirt-hosted-engine-ha/broker.log?
How should the Engine VM get its IP address?
Does the Engine VM use this IP address?
To which IP address does the host resolve the FQDN of the Engine VM?
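
For example (with engine.example.com standing in for your real Engine FQDN):

# On the host - which IP the Engine FQDN resolves to:
getent hosts engine.example.com
# On the Engine VM - which IP addresses it actually uses:
ip addr show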

Comment 6 Dominik Holler 2019-08-06 13:07:23 UTC
Closing this bug because the required information is missing. You are welcome to reopen the bug if new information becomes available.

Comment 7 crl.langlois@gmail.com 2019-08-12 18:14:15 UTC
Sorry for the delay, I was on vacation for the past 2 weeks. The current situation is that we have noticed that the liveliness check failure was due to our LDAP provider. If we remove this provider and keep only the internal provider, the engine goes live without issue. I am not sure why this prevents the engine from going live. For the moment, not having this provider is not a big issue, but I wonder about this strange behavior.

Comment 8 Dominik Holler 2019-08-12 18:17:35 UTC
(In reply to crl.langlois from comment #7)
> Sorry for the delay, I was on vacation for the past 2 weeks. The current
> situation is that we have noticed that the liveliness check failure was due
> to our LDAP provider. If we remove this provider and keep only the internal
> provider, the engine goes live without issue. I am not sure why this
> prevents the engine from going live. For the moment, not having this
> provider is not a big issue, but I wonder about this strange behavior.

Thanks for the feedback.
Martin, does this ring a bell?

Comment 9 Martin Perina 2019-08-12 18:39:50 UTC
We really need detailed engine logs collected with the sos logcollector to be able to investigate this issue. Is it possible to attach them?

Comment 10 Red Hat Bugzilla 2023-09-14 05:31:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

