Description of problem:
After switching the overall network IP addresses on all nodes (switching addresses from 10.8.236.x to 10.16.248.x), the hosted-engine keeps failing the liveliness check. All nodes have their names resolved correctly. I can also ssh to the engine with its hostname. But in ovn-controller.log the old IP address of the engine is used.

ovn-controller.log snippet:

2019-07-17T15:28:08.625Z|00792|reconnect|INFO|ssl:10.8.236.244:6642: connection attempt timed out
2019-07-17T15:28:08.625Z|00793|reconnect|INFO|ssl:10.8.236.244:6642: waiting 8 seconds before reconnect
2019-07-17T15:28:16.633Z|00794|reconnect|INFO|ssl:10.8.236.244:6642: connecting..

Info of the ovn-controller DB:

[root@ovhost1 ~]# ovs-vsctl list Open_vSwitch
_uuid               : 585f81f4-edc4-4ba9-b68e-719795007411
bridges             : [7e653df6-2e43-4963-a538-fc6c0b7aa5a2]
cur_cfg             : 1
datapath_types      : [netdev, system]
db_version          : "7.15.1"
external_ids        : {hostname="ovhost1", ovn-bridge-mappings="", ovn-encap-ip="10.8.236.162", ovn-encap-type=geneve, ovn-remote="ssl:10.8.236.244:6642", rundir="/var/run/openvswitch", system-id="27f10a45-1543-458f-8fb0-b5a01a87b045"}
iface_types         : [geneve, gre, internal, lisp, patch, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 1
other_config        : {}
ovs_version         : "2.9.0"
ssl                 : []
statistics          : {}
system_type         : centos
system_version      : "7"

The ovn-encap-ip="10.8.236.162" and ovn-remote="ssl:10.8.236.244:6642" are on the old network.

Current routing table:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    0      0        0 ovirtmgmt
10.16.248.0     0.0.0.0         255.255.255.0   U     0      0        0 ovirtmgmt
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eno1
link-local      0.0.0.0         255.255.0.0     U     1003   0        0 eno2
link-local      0.0.0.0         255.255.0.0     U     1024   0        0 ovirtmgmt

Current IP address of the host:

ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a0:36:9f:28:17:9e brd ff:ff:ff:ff:ff:ff
    inet 10.16.248.65/24 brd 10.16.248.255 scope global dynamic ovirtmgmt
       valid_lft 82130sec preferred_lft 82130sec
    inet6 fe80::a236:9fff:fe28:179e/64 scope link
       valid_lft forever preferred_lft forever

How do we get the ovn-controller to update the IP address to the new network?

Thanks
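For reference, a minimal sketch of reading just the two stale keys shown above (standard ovs-vsctl usage, nothing oVirt-specific):

# run on the host; prints the current values of the two keys on the old network
ovs-vsctl get Open_vSwitch . external_ids:ovn-remote
ovs-vsctl get Open_vSwitch . external_ids:ovn-encap-ip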
Thank you for the report. Please try to restart ovn-controller + ovirt-provider-ovn. If it doesn't help, you will need to re-install the host in the cluster, ensuring that the OVN network provider is enabled in the cluster.
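A minimal sketch of that restart, assuming both services are managed by systemd (ovn-controller on each host, ovirt-provider-ovn on the engine VM):

# on each affected host
systemctl restart ovn-controller

# on the engine VM
systemctl restart ovirt-provider-ovn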
Thank you for your response. Since creating the bug I have managed to fix the ovn-controller issue, but I do not suspect it to be my current issue. I fixed the ovn-controller issue with this command:

ovs-vsctl set Open_vSwitch . external-ids:ovn-remote=ssl:10.8.248.74:6642

That fixed the connection issue. Even with this fix I still have the liveliness check failure. It is hard to see what is failing because I am not sure what this check is trying to achieve.

The current status of the situation is like this:
- 3 hosted-engine hosts (originally in the 10.8.236.x network); 1 host was switched to the 10.16.248.x network.
- All hosts can be resolved with forward/reverse lookup.
- If I boot the engine on one of the hosts that is in the 236.x network, the engine goes up with a good status (the engine gets assigned a 236.x address when booting on one of these hosts).
- If I boot the engine on the host that was moved to the 248.x network, the engine goes up but fails the liveliness check (on this host the engine gets assigned a 248.x address).
- At this point, when the engine is up but the status is bad, I can ssh to it without any issue. The engine service starts okay but fails after a couple of minutes. The only thing I have is this in the service status:

2019-07-23 09:08:41,177-0400 ovirt-engine: ERROR run:554 Error: process terminated with status code 1

I am not sure where to look at this point. Here is what I did also: when I boot the engine in the 236.x network I can access the WebUI without issue. I then re-installed the host that was moved to the 248.x network; the installation finished with success and activation worked as expected also.

I have searched for info on what this liveliness check does but did not find what I was looking for. The ultimate goal is to move all the hosts to the 248.x network.

Regards
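For the record, a sketch of updating both stale values from the original description, not just ovn-remote (the addresses below are placeholders; substitute the actual new engine and host addresses):

ovs-vsctl set Open_vSwitch . external-ids:ovn-remote=ssl:<new-engine-ip>:6642
ovs-vsctl set Open_vSwitch . external-ids:ovn-encap-ip=<new-host-ip>
systemctl restart ovn-controller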
In the /etc/httpd/logs/error_log I always get these messages:

[Tue Jul 23 11:21:52.430555 2019] [proxy:error] [pid 3189] AH00959: ap_proxy_connect_backend disabling worker for (127.0.0.1) for 5s
[Tue Jul 23 11:21:52.430562 2019] [proxy_ajp:error] [pid 3189] [client 10.16.248.65:35154] AH00896: failed to make connection to backend: 127.0.0.1

10.16.248.65 is the address of the host that was moved to the new network. And in the access_log I have
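Those AH00896 errors mean httpd cannot reach the backend it proxies to on the loopback address. A hedged diagnostic sketch, assuming the standard setup where the ovirt-engine service provides that backend on the engine VM:

# run on the engine VM
systemctl status ovirt-engine
# check what is actually listening on loopback
ss -tlnp | grep 127.0.0.1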
Carl, can you please share the relevant /var/log/ovirt-hosted-engine-ha/agent.log and /var/log/ovirt-hosted-engine-ha/broker.log? How should the Engine VM get its IP address? Does the Engine VM use this IP address? To which IP address does the host resolve the FQDN of the Engine VM?
Closing this bug because the required information is missing. You are welcome to reopen the bug if new information becomes available.
Sorry for the delay, I was on vacation for the past 2 weeks. The current situation is that we have noticed that the liveliness check failure was due to our LDAP provider. If we remove this provider and only have the internal provider, the engine goes live without issue. I am not sure why this prevents the engine from going live. For the moment, not having this provider is not a big issue, but I wonder about this strange behavior.
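In case it helps narrow this down, a sketch of checking whether the LDAP profile still points at the old network, assuming the provider was configured with ovirt-engine-extension-aaa-ldap in the usual locations (verify the paths on your installation):

# run on the engine VM; look for hard-coded old-network addresses
grep -rn "10.8.236." /etc/ovirt-engine/aaa/ /etc/ovirt-engine/extensions.d/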
(In reply to crl.langlois from comment #7)
> Sorry for the delay, I was on vacation for the past 2 weeks. The current
> situation is that we have noticed that the liveliness check failure was due
> to our LDAP provider. If we remove this provider and only have the internal
> provider, the engine goes live without issue. I am not sure why this
> prevents the engine from going live. For the moment, not having this
> provider is not a big issue, but I wonder about this strange behavior.

Thanks for the feedback. Martin, does this ring a bell?
We really need detailed engine logs, collected using sos logcollector, to be able to investigate this issue. Is it possible to attach them?
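If it helps, a sketch of collecting them, assuming the ovirt-log-collector package is installed on the engine VM:

# run on the engine VM; produces an archive to attach to this bug
ovirt-log-collector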
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days