| Field | Value |
|---|---|
| Summary | VMDB cluster instances being deleted in advance of hosts being disconnected and later reconnected to same named clusters |
| Product | Red Hat CloudForms Management Engine |
| Component | Providers |
| Version | 5.4.0 |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | high |
| Reporter | Thomas Hennessy <thenness> |
| Assignee | Adam Grare <agrare> |
| QA Contact | Dave Johnson <dajohnso> |
| Docs Contact | |
| CC | carnott, cpelland, gblomqui, jdeubel, jfrey, jhardy, jocarter, mfeifer, obarenbo, simaishi, thenness |
| Target Milestone | GA |
| Target Release | 5.7.0 |
| Keywords | ZStream |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Clones | 1333467, 1333468, 1346957 (view as bug list) |
| Environment | |
| Last Closed | 2016-08-24 17:13:08 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Bug Depends On | |
| Bug Blocks | 1333467, 1333468, 1346957 |
| Attachments | |
It looks like as part of the EMS Refresh run at 2016-04-19T20:11:28, 0 Clusters were returned from the VimBroker:

[2016-04-19T20:11:34.320522 #11233:4cde98] INFO -- : MIQ(VcRefresher.get_vc_data) EMS: [chrcnc02-vcsv-01], id: [20000000000001] Retrieving Cluster inventory...Complete - Count: [0]

The next EMS Refresh run at 2016-04-19T20:16:28 (and every subsequent refresh in the attached logs) returned 16 clusters:

[2016-04-19T20:16:54.080356 #11233:4cde98] INFO -- : MIQ(VcRefresher.get_vc_data) EMS: [chrcnc02-vcsv-01], id: [20000000000001] Retrieving Cluster inventory...Complete - Count: [16]

In addition to clusters, in that first refresh Hosts returned 0 (down from 87), VMs returned 31 (down from 1155), Folders returned 9 (down from 18), and Resource Pools returned 0 (down from 24). Is it possible to get the vim log covering the EMS refresh from 04-19 20:11:34 to see whether there are any errors or warnings there?

Adam, yes, that one refresh does show the returned counts for Cluster and Host as zero, and then they bounce back to non-zero values, which is why I provided *that* log. Also, there are two *VERY SIGNIFICANT* log lines that occur in the context of an active EMS refresh: at [2016-04-19T20:12:08.209311 #11233:4cde98] the clusters are deleted in the EMS refresh, and the next log line for this same pid, [2016-04-19T20:12:13.703938 #11233:4cde98], details the deletion of all hosts. As for the vim.log: as you know, vim logging is effectively disabled by default because it is so verbose, but if any error were encountered it should be reflected in the evm.log. So, no vim.log (or, more accurately, a vim.log with a size of 0 bytes). I'll be happy to provide you a link to the full set of logs provided by the customer, but IMHO what has been provided is both all there is and likely all that is needed, although certainly not all that one might wish for. Tom Hennessy

Tom, there are also 31 VMs updated and a large number (roughly 1124) deleted at 2016-04-19T20:12:27.079065. If no hashes are returned from the VimBroker, the save-inventory methods will remove the associated database entries, so the disconnects/deletes are a symptom of the VimBroker not returning any data for those items. I'll take a look at the full logs to try to get any more information, however unlikely that is. Thanks, Adam

Are you suggesting that the vSphere logs might be useful in diagnosing what is happening here, if the vSphere logs from that period are still available? Tom Hennessy

Initially I'd like to look at the full set of logs from the customer that you mentioned.
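Adam's point about empty hashes is the crux of the rest of this thread, so a minimal Ruby sketch may help illustrate it. This is not the ManageIQ code; the class and method names below are invented for illustration. It shows how a save-inventory pass that treats the refresh payload as authoritative will remove every record the broker failed to report, and why a record re-created on the next refresh comes back without its tags.

```ruby
# Illustrative only: a toy "save inventory" pass, not ManageIQ source.
# Records present in the VMDB but absent from the refresh payload are removed.

Cluster = Struct.new(:ems_ref, :name, :tags)

class FakeVmdb
  attr_reader :clusters

  def initialize(clusters)
    @clusters = clusters
  end

  # payload is the array of cluster hashes returned by the broker for this refresh.
  # An empty payload therefore deletes every existing cluster row.
  def save_clusters_inventory(payload)
    seen_refs = payload.map { |h| h[:ems_ref] }

    @clusters.each do |cluster|
      next if seen_refs.include?(cluster.ems_ref)
      puts "Deleting cluster #{cluster.name} (tags lost: #{cluster.tags.join(', ')})"
    end
    @clusters.select! { |c| seen_refs.include?(c.ems_ref) }

    payload.each do |h|
      next if @clusters.any? { |c| c.ems_ref == h[:ems_ref] }
      # A re-created cluster starts with no tags, even if its name matches a deleted one.
      @clusters << Cluster.new(h[:ems_ref], h[:name], [])
    end
  end
end

vmdb = FakeVmdb.new([Cluster.new("domain-c1", "Prod-Cluster", ["prov_scope/all"])])

vmdb.save_clusters_inventory([])                                               # broker returned 0 clusters
vmdb.save_clusters_inventory([{ ems_ref: "domain-c1", name: "Prod-Cluster" }]) # next refresh
puts vmdb.clusters.inspect                                                     # same name, no tags
```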
Adam, the more I think about your statements above, the more alarmed I become. It has never been the role of the EMS refresh worker to remove instances from the VMDB. The role of the EMS refresh worker is to create new instances and to update the information about existing VMDB instances so that the data reflected for those elements faithfully represents the information in the external management system (aka the provider) at the time the refresh happens, but never to delete instances, as it now appears to be doing, and for good reason: in the logs provided by the customer there is no record of the cluster or hosts being removed in the audit log. What happens to the C&U data that has been so carefully milled and warehoused for months if the instances are removed by some automated agent? What happens to those reports that a CFME administrator runs that showed usage info last week and show nothing now? Who answers that question?

This is a bug and was never intended behavior for the EMS refresh worker. VMDB instances should only be removed by some positive action through the UI or an externally initiated web services request, which can be tracked and leaves both policy and audit footprints and breadcrumbs behind. Tom Hennessy

Adam, the full set of materials from the customer related to this case can be found at http://file.rdu.redhat.com/~thenness/TWC/sf%2001620979/. Tom Hennessy

Tom, to correct my statement: the VMs and Hosts are being disconnected, not deleted. The clusters are being deleted. Adam

Hi Tom, I reviewed the logs and there are some interesting exceptions in the vim.log that lead to clearing the @hostSystems cache, though it is unclear whether they are related to the main issue:

[2016-04-08T08:15:38.757576 #896:15d37128] WARN -- : MiqVimUpdate::updateProps (): undefined method `delete_at' for nil:NilClass
[2016-04-08T08:15:38.757878 #896:15d37128] WARN -- : Clearing cache for (): @hostSystems

If the customer is willing, we would like to enable debug logging in the VimBroker so that we can see the state of the broker cache over time; this should help us get a better idea of why the broker is occasionally returning 0 Hosts and Clusters. In order to debug the delete_at NilClass issue we will have to apply a patch that prints more information about the managed object that is triggering the exception. We can do this at a later time if you want to focus on the 0 Hosts and Clusters issue first; just let me know and I will send you the patch. Thanks, Adam

Update from the customer where this problem has happened: "To add to this list of items that are getting rediscovered by CloudForms and lose their tags would be VMs and templates." Tom Hennessy
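The WARN lines quoted above show the broker dropping its @hostSystems cache after an exception in updateProps. As a rough illustration of why that pattern can surface as a refresh that sees zero hosts, here is a small Ruby sketch; the class and method names are hypothetical and this is not the VimBroker's actual implementation. An update that cannot be applied clears the whole cache, and a consumer that reads the cache before it has been rebuilt gets an empty collection.

```ruby
# Hypothetical cache holder, loosely modeled on the behavior described in the
# WARN messages above; this is not the VimBroker's real code.
class InventoryCache
  def initialize
    @host_systems = nil
  end

  # Simulates a full (slow) rebuild of the cache from the provider.
  def rebuild!
    @host_systems = {
      "host-10" => { "name" => "esx01" },
      "host-11" => { "name" => "esx02" }
    }
  end

  # Apply an incremental property update; if anything goes wrong, the whole
  # cache is dropped rather than left in an inconsistent state.
  def apply_update(mo_ref, change)
    raise "no cached entry for #{mo_ref}" unless @host_systems && @host_systems.key?(mo_ref)
    @host_systems[mo_ref].merge!(change)
  rescue StandardError => err
    warn "updateProps: #{err.message}"
    warn "Clearing cache for: @hostSystems"
    @host_systems = nil
  end

  # A refresh that runs before the cache has been rebuilt sees zero hosts.
  def host_systems
    @host_systems || {}
  end
end

cache = InventoryCache.new
cache.rebuild!
cache.apply_update("host-99", "name" => "esx99")                 # unknown ref -> exception -> cache cleared
puts "Hosts visible to a refresh right now: #{cache.host_systems.size}"  # => 0 until the next rebuild
```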
Greg, yes, the case was in "Waiting on Red Hat" when you examined it. The text provided today from the customer is as follows:

======================
The problem was first noticed after the following updates were applied to each environment, and it has been seen in all environments (3) with these updates since:
- Upgraded to latest errata cfme-5.4.4.2-1
- Hotfix for Event Catcher (see Red Hat support case 01593702)
- Hotfix for SmartState on RHEL 7 VMs (see Red Hat support case 01606013)
- Updated VixDiskLib to 5.5.4 (VMware-vix-disklib-5.5.4-2454786)
=========================

Yesterday we were chasing other issues related to oddities with EMS refresh, for which a totally new BZ was opened and associated with this case. That BZ (1332996) has the potential to become a sev 1 unless the customer can wrap his hands around what is happening (EMS refresh returns a new Host instance for "localhost" and the EMS refresh worker enters an SQL query loop from which it never exits). During that chasing we did have the VIM broker trace active, and I will attach it to this case for your reference.

In the interest of full disclosure, I believe that what we have at this customer is the concurrent change of many of the VMware infrastructure elements in a fashion uncoordinated with the CFME team. I do not see VMware events for new Host instances being created, but I do see property changes against existing VMs that suggest the host name of a VM is being changed to "localhost" while apparently being transformed into an ESX instance.

The level of code the customer is running offers us some insight into the virtual environment at each startup. Below are the log lines generated by two startups of the same appliance where most of the EMS refreshes are running:

===========
[----] I, [2016-05-03T07:57:57.121592 #2218:c6fe94] INFO -- : MIQ(MiqRegion.atStartup) Region: [20], name: [Region 20], Under Management: VMs: [2695], Hosts: [358], Sockets: [922]
[----] I, [2016-05-03T07:57:57.383212 #2218:c6fe94] INFO -- : MIQ(MiqRegion.atStartup) Region: [20], name: [Region 20], Not Under Management: VMs: [6], Hosts: [63], Sockets: [142]

then less than four hours later

[----] I, [2016-05-03T11:13:50.406487 #6854:11bbe98] INFO -- : MIQ(MiqRegion.atStartup) Region: [20], name: [Region 20], Under Management: VMs: [1659], Hosts: [271], Sockets: [678]
[----] I, [2016-05-03T11:13:50.529860 #6854:11bbe98] INFO -- : MIQ(MiqRegion.atStartup) Region: [20], name: [Region 20], Not Under Management: VMs: [1044], Hosts: [150], Sockets: [386]
=============

So we lost 80 hosts and 250 sockets in less than four hours!?

If it is helpful I can make all of the diagnostic info from last night's testing (evm debug logs, rails debug logs, and vim debug logs) available for you and/or Adam to review. I think the customer is ready to implement any code that might shed more light on the goofiness they are encountering, including what this case was opened initially to report as a bug.

The full set of current logs from the appliance doing EMS refresh, with evm debug, rails debug, and vim debug logging active, can be found at http://file.rdu.redhat.com/~thenness/TWC/sf%2001620979/2016-05-03-debug-logging/

Thanks Tom. It looks like the inventory refresh starts to request inventory before the initial broker cache has finished being built. The hotfix they are running only included the fix for events hanging, not the handling for truncated inventory returned from VMware; for that they need the patch from this PR: https://github.com/ManageIQ/manageiq/pull/7540. I will put together a hotfix that includes truncated inventory handling.

Adam, thanks for the update. It appears that the customer will be willing to apply your proposed diagnostic code and any other code that might help get to the bottom of this problem. Please prepare a set of materials, including implementation instructions, that we can provide to the customer tomorrow. He is currently out of the office for the rest of the day. Tom Hennessy

Latest update from the customer: "We have not seen any issues with tags disappearing from Clusters, Hosts, or VMs. However, we have not restarted the environment, and that was typically when we saw the issue." I would consider the patch still under evaluation. Tom Hennessy
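For readers following along, the general shape of "truncated inventory handling" can be sketched in a few lines of Ruby. This is not the code in the PR linked above; the threshold, method names, and data shapes are made up for illustration. The idea is simply that a refresh refuses to run its destructive save phase when the provider suddenly reports far fewer objects than the VMDB currently holds.

```ruby
# Illustrative guard only, not the actual patch from the PR referenced above.
# If the collector returns a drastically smaller inventory than what is already
# recorded, treat the payload as suspect and skip the destructive save phase.

SUSPECT_DROP_RATIO = 0.5 # made-up threshold for this sketch

def truncated_inventory?(existing_count, payload_count)
  return false if existing_count.zero?
  payload_count < existing_count * SUSPECT_DROP_RATIO
end

def save_inventory(kind, existing_records, payload)
  if truncated_inventory?(existing_records.size, payload.size)
    warn "Refusing to save #{kind}: payload has #{payload.size} items " \
         "but the VMDB has #{existing_records.size}; treating this refresh as truncated"
    return existing_records
  end

  # Normal (destructive) reconciliation: keep only what the payload reports.
  existing_records.select { |r| payload.include?(r) } + (payload - existing_records)
end

hosts_in_vmdb = %w[esx01 esx02 esx03 esx04]
puts save_inventory("hosts", hosts_in_vmdb, []).inspect                           # truncated -> VMDB untouched
puts save_inventory("hosts", hosts_in_vmdb, %w[esx01 esx02 esx03 esx05]).inspect  # normal save
```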
Created attachment 1149587 [details]
Contains emsrefresh activity for the provider showing the cluster delete followed by host disconnects (many VM disconnects also follow but are not material for this case).

Description of problem:
The customer has a set of provisioning scripts that are sensitive to tags on clusters. Cluster entries are being logged as deleted immediately before a set of hosts is disconnected (not deleted). The hosts reconnect later (with the same host id) and are reconnected to new cluster instances named the same as the ones deleted, but the tags are lost in the process, so provisioning is being impacted.

I do not see any UI-initiated cluster removals in the production logs, nor are there any indications in the audit.log file that CFME actions initiated these activities; CFME appears only to be responding in the context of a full refresh of the provider on which both the clusters and hosts reside. I am also concerned: if a cluster has only one host and that host becomes disconnected, does that result in the associated cluster instance being removed?

Version-Release number of selected component (if applicable): 5.4.4.2

How reproducible: not sure

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Two appliances are in use in this environment, and each of them has at one time or another held the EMS inventory role for the provider encountering this behavior. I am attaching the emsrefresh worker log for the process, which exposes the delete of the clusters followed by the disconnecting of a large number of hosts. Full logs of both appliances are available if needed.
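To make the disconnect-versus-delete distinction in this description concrete, here is a purely illustrative Ruby sketch; these are not ManageIQ models or APIs. It shows why tags keyed to a record's database id survive a disconnect/reconnect of the same row but are lost when a cluster row is deleted and a new row with the same name is created.

```ruby
# Purely illustrative; these are not ManageIQ models or APIs.
Record = Struct.new(:id, :name, :ems_id)

class TagStore
  def initialize
    # Taggings are keyed by the record's database id, not by its name.
    @taggings = Hash.new { |h, k| h[k] = [] }
  end

  def assign(record, tag)
    @taggings[record.id] << tag
  end

  def tags_for(record)
    @taggings[record.id]
  end
end

tags = TagStore.new

# Disconnect/reconnect: the host row keeps its id, so its tags survive.
host = Record.new(42, "esx01.example.com", 1)
tags.assign(host, "prov_scope/all")
host.ems_id = nil                              # disconnected from the provider
host.ems_id = 1                                # later reconnected (same row, same id)
puts tags.tags_for(host).inspect               # => ["prov_scope/all"]

# Delete/recreate: the new cluster row gets a new id, so the old tags are orphaned.
cluster = Record.new(7, "Prod-Cluster", 1)
tags.assign(cluster, "prov_scope/all")
recreated = Record.new(8, "Prod-Cluster", 1)   # same name, new id
puts tags.tags_for(recreated).inspect          # => []
```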