Bug 1109722
| Summary: | Hosts do not change status properly during network outage. |
|---|---|
| Product: | Red Hat Enterprise Virtualization Manager |
| Component: | ovirt-engine |
| Status: | CLOSED NOTABUG |
| Severity: | high |
| Priority: | high |
| Version: | 3.4.0 |
| Target Milestone: | --- |
| Target Release: | 3.5.0 |
| Hardware: | All |
| OS: | Linux |
| Whiteboard: | infra |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2014-07-27 09:46:13 UTC |
| oVirt Team: | Infra |
| Reporter: | Roman Hodain <rhodain> |
| Assignee: | Eli Mesika <emesika> |
| Keywords: | Triaged |
| CC: | acathrow, amureini, gklein, iheim, laravot, lpeer, ofrenkel, oourfali, pstehlik, Rhev-m-bugs, rhodain, yeylon |
Description
Roman Hodain
2014-06-16 08:42:12 UTC
Please attach engine.log. I assume that this causes a timeout in the requests from the engine to VDSM, and the timeout as defined in the engine is what you see:

```shell
/share/ovirt-engine/bin/engine-config.sh --get vdsTimeout
vdsTimeout: 180 version: general
```

If this is the case, I don't think this is a bug. Usually, on a real network error, some error returns faster than the timeout.

I have gone over the code, and it seems that we are setting the "Connecting" status at the right time.

Please try to reproduce, and at the moment that you have hosts in UP rather than in Connecting, capture and attach the result of the following query:

```sql
select vds_name, status from vds_static a, vds_dynamic b where a.vds_id = b.vds_id;
```

What I would like to check is whether this is a UI issue, where the UI is for some reason not updated with the new status while the database has already been updated.

(In reply to Eli Mesika from comment #6)

> I have gone over the code, and it seems that we are setting the "Connecting"
> status at the right time.
>
> Please try to reproduce, and at the moment that you have hosts in UP rather
> than in Connecting, capture and attach the result of the following query:
>
> select vds_name, status from vds_static a, vds_dynamic b where
> a.vds_id = b.vds_id;
>
> What I would like to check is whether this is a UI issue, where the UI is
> for some reason not updated with the new status while the database has
> already been updated.

I have reproduced the issue.
The hypervisor and the manager are time-synced. I have run the following commands on the hypervisor:

```shell
[root@vm-155 ~]# date; iptables -I INPUT -p tcp --dport 54321 -j REJECT
Thu Jul 24 04:09:28 EDT 2014
[root@vm-155 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source    destination
REJECT     tcp  --  anywhere  anywhere     tcp dpt:54321 reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target     prot opt source    destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source    destination
```

The engine.log is also attached, as is the status of the hypervisor from the DB.

Created attachment 920462 [details]
Engine logs
Created attachment 920464 [details]
Hypervisor statuses in time
How come all the vds_name, vds_id values are the same?

10.34.57.155 | cf443954-bf2f-47ee-a8a9-7d078d5b5b11

(In reply to Eli Mesika from comment #10)

> How come all the vds_name, vds_id values are the same?
>
> 10.34.57.155 | cf443954-bf2f-47ee-a8a9-7d078d5b5b11

Sorry for the confusion. There is just one hypervisor in the environment. The file was generated by the following command:

```shell
while true; do
    /usr/share/ovirt-engine/dbscripts/engine-psql.sh -q -t \
        -c "select CURRENT_TIMESTAMP(0), vds_name, a.vds_id, status from vds_static a, vds_dynamic b where a.vds_id = b.vds_id" \
        | grep -v '^$' >> /tmp/vds_status.out
    sleep 1
done
```

I started that before the test and stopped it after the test.

After checking the scenario, it seems that we get an immediate notification when the engine=>hypervisor connection is blocked. In the BZ case, the hypervisor=>VDSM connection was blocked using iptables; the engine will not notice that before the 180-second timeout. My conclusion is that this behavior was not changed (I also ran 'git blame' on the relevant code and saw no changes in behavior).

So, bottom line, it seems that the engine never listened for hypervisor=>VDSM disconnects, so the only way was (and is) to wait out the timeout on the VDSM operation being executed and get the exception.

After discussing the results with Oved, it was decided to close this as NOTABUG.
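The conclusion above hinges on how a blocked TCP connection fails: a REJECT rule answers with an ICMP port-unreachable, so a fresh connection attempt fails immediately, whereas silently discarded traffic (or a request already in flight on an established connection, as in this BZ) only surfaces as an error when the caller's own timeout expires, which is what the 180-second vdsTimeout governs on the engine side. A minimal, hypothetical sketch of the two connect-time failure modes (the `probe` helper and its return strings are illustrative, not part of oVirt or VDSM):

```python
import socket

def probe(host, port, timeout=3):
    """Attempt a TCP connection and classify the failure mode.

    A REJECT iptables rule answers with ICMP port-unreachable, so the
    connect fails immediately; a DROP rule (or an unresponsive peer)
    silently discards packets, so the caller only learns of the outage
    when its own timeout expires, analogous to the engine waiting out
    vdsTimeout (180 s by default).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "connected"
    except ConnectionRefusedError:
        return "refused immediately"   # REJECT-style failure
    except socket.timeout:
        return "timed out"             # DROP-style / silent failure
    finally:
        s.close()
```

Against a port answered by a REJECT rule the call returns at once; against silently dropped traffic it returns only after `timeout` seconds, which mirrors the engine having nothing to react to until the VDSM operation's timeout fires.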