Bug 1415725 - oVirt hyperconverged gluster replica 3 arbiter 1: After successful glusterfs recovery (total fallout through "clusterssh yum update) new external NFS data(master) SD can't get out of "locked" state. No node gets up because no working data(master) SD. [NEEDINFO]
Summary: oVirt hyperconverged gluster replica 3 arbiter 1: After successful glusterfs ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.21
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
: ---
Assignee: Dan Kenigsberg
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-01-23 14:39 UTC by mike.lupe
Modified: 2017-02-20 12:21 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-20 12:21:49 UTC
oVirt Team: Infra
mperina: needinfo? (didi)
mperina: needinfo? (mike.lupe)


Attachments (Terms of Use)
vdsm logs of slp-ovirtnode-02 (2nd node in 3-node cluster) (2.96 MB, application/zip)
2017-01-23 14:39 UTC, mike.lupe
no flags Details
Hosted engine VM logs (934.75 KB, application/zip)
2017-01-23 14:43 UTC, mike.lupe
no flags Details
oVirt Webui hosts view (155.85 KB, image/png)
2017-01-23 17:57 UTC, mike.lupe
no flags Details
oVirt Webui original GlusterFS data (master) SD - maintenanced (118.23 KB, image/png)
2017-01-23 18:01 UTC, mike.lupe
no flags Details
oVirt Webui view: data2_ssd (new master) NFS SD - "Locked" (119.64 KB, image/png)
2017-01-23 18:03 UTC, mike.lupe
no flags Details
HE engine.log SQL errors (405.61 KB, application/x-gzip)
2017-02-03 19:33 UTC, mike.lupe
no flags Details
engine-ha agent.log & broker.log slp-ovirtnode-01 (1.39 MB, application/x-gzip)
2017-02-04 09:16 UTC, mike.lupe
no flags Details

Description mike.lupe 2017-01-23 14:39:39 UTC
Created attachment 1243609 [details]
vdsm logs of slp-ovirtnode-02 (2nd node in 3-node cluster)

Description of problem:
I had a total fallout when doing a "clusterssh yum update -y" on all 4 oVirt nodes. I know, please don't ask me why I did that ++*""*°@#°#@.

It took me 2 days to get glusterfs back into sync without any errors.

After that time-consuming gluster recovery, all volumes came up. But - in Webui all the storage domains jumped back and forth from "online" (green) to offline (red), in an interval of about 20 seconds. Because this didn't stop, I took the glustered data(master) SD to maintenance. This automatically activated an existing 2nd data SD ("data2_ssd" - NFS on external storage-01 server) as the new master SD. This SD remains in "locked" state since then (3 days ago). No chance to get back original gluster data SD, as no hosts are "up". Can't activate hosts, because they moan about the missing data(master) storage domain, impossible for them to connect to the locked master SD.

The 3 nodes (glusterfs hyperconverged) are in this state for the past 3 days:
slp-ovirtnode-01 (HE MV running on)	"Activating"
slp-ovirtnode-02			"Activating"
slp-ovirtnode-03			"Maintenance"

Storage domains:
data		"Master Data Gluster Storage Domain"	Data	GlusterFS	V3	Maintenance	2712 GB	2315 GB	"Master Data Gluster Storage Domain"

data2_ssd	"SSD slp-storage-01 SSD disk1"	Data (Master)	NFS	V3	Locked	[N/A]	[N/A]	"SSD slp-storage-01 SSD disk1"

esd	"NFS Export Storage Domain on srp-storage-01"	Export	NFS	V1	Unknown	[N/A]	[N/A]	"NFS Export Storage Domain on srp-storage-01"

hosted_storage		Data	GlusterFS	V3	Unknown	49 GB	40 GB	

iso	"ISO Gluster Storage Domain"	ISO	GlusterFS	V1	Inactive	31 GB	16 GB	"ISO Gluster Storage Domain"


Version-Release number of selected component (if applicable):

OS:			CentOS 7.3
oVirt Version:		4.0.6
vdsm version		vdsm-4.18.21-1.el7.centos.x86_64
# of nodes:		3
Data(master) Storage:	"data" - GlusterFS replica 3 arbiter 1
Data Storage 2		"data2_ssd" - NFS on external storage server (became data master after setting "data(master) to maintenance)


How reproducible:


Steps to Reproduce:
1. I don't recommend to take out the plug to reproduce this. Or do a "yum update" on all hosts at the same time. :-|
2.
3.

Actual results:
data2_ssd data(master) SD remains in "locked" state. Not a single node can get to "up", because SD locked. Can't destroy data(master) SD because the original glustered data(master) SD is in maintenance. Can't put that original glustered data(master) SD up, because no node is "up". Egg-chicken etc.

Expected results:
The external NFS data(master) SD that took over from the maintenanced original data(master) should unlock.

Additional info:

vdsm.log exceptions pointed out by mlipchuck:

1) 
jsonrpc.Executor/2::ERROR::2017-01-23 09:12:22,814::sdc::146::Storage.StorageDomainCache::(_findDomain) domain 0e957331-9688-47a9-8352-08e7ddf97d9f not found
Traceback (most recent call last):
File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'0e957331-9688-47a9-8352-08e7ddf97d9f',)

2)
jsonrpc.Executor/5::DEBUG::2017-01-23 09:47:18,556::lvm::288::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  WARNING: Not using lvmetad because config setting use_lvmetad=0.\n  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).\n  Volume group "0e957331-9688-47a9-8352-08e7ddf97d9f" not found\n  Cannot process volume group 0e957331-9688-47a9-8352-08e7ddf97d9f\n'; <rc> = 5

Comment 1 mike.lupe 2017-01-23 14:43:23 UTC
Created attachment 1243610 [details]
Hosted engine VM logs

The hosted-engine VM resides on the 1st cluster node slp-ovirtnode-01

Comment 2 mike.lupe 2017-01-23 17:57:13 UTC
Created attachment 1243740 [details]
oVirt Webui hosts view

Hosts view in Webui. Status of node 1 and node 2 jumps from "Activating" to "Not Operational" and back etc. Interval 3-4 mins(?).

Comment 3 mike.lupe 2017-01-23 18:01:14 UTC
Created attachment 1243742 [details]
oVirt Webui original GlusterFS data (master) SD - maintenanced

This SD "data" hasn't changed status since ~2 days. Before I set it to maintenance (after the successful glusterfs recovery), all storage domains came up, but started switching back and forward from "up" green to "down" red. That's why I set it to maintenance, and then "data2_ssd" SD became master.

Comment 4 mike.lupe 2017-01-23 18:03:56 UTC
Created attachment 1243743 [details]
oVirt Webui view: data2_ssd (new master) NFS SD - "Locked"

"Locked" for the past 2 days. Can't destroy it, because "data" SD is in maintenance and can't take over.

Comment 5 mike.lupe 2017-01-23 20:30:06 UTC
Typo in the description and comment of storage domain "esd" - it's also sLp-storage-01 and not sRp-storage-01.

esd	"NFS Export Storage Domain on slp-storage-01"	Export	NFS	V1	Unknown	[N/A]	[N/A]	"NFS Export Storage Domain on slp-storage-01"

Comment 6 Daniel Erez 2017-01-24 13:43:20 UTC
Hi Mike,

According to the vdsm log [1], there was a connection error to the storage.
Can you please verify that mount succeeds for all hosts (just to ensure that the connection issue is mitigated).

[1]
jsonrpc.Executor/6::ERROR::2017-01-23 09:12:22,131::hsm::2403::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2400, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 456, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 238, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 230, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 229, in mount
    timeout=timeout, cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (32, ';mount.nfs: Connection timed out\n')

Comment 7 mike.lupe 2017-01-24 20:04:53 UTC
Hi Daniel,

I'll try that with a "normal" user. If I remember well, when trying with vdsm user from one of the nodes, it timed out. I'm not sure if the "soft" (or smthng) parameter had a problem.

What user do you need me to try with again? And how _exactly_ do you want me to manually mount that storage-01's nfs export? Can you provide the exact params  needed? thanks

Comment 8 Daniel Erez 2017-01-24 20:22:22 UTC
You can test it using the 'nfs-check program' in /vdsm/contrib.
See full details at: http://www.ovirt.org/documentation/how-to/troubleshooting/troubleshooting-nfs-storage-issues/

Comment 9 mike.lupe 2017-01-25 08:02:45 UTC
I previously already did mostly of the recommended things from that URL, besides disabling firewall on storage, what I now did in one of the following steps.

On slp-storage-01 providing the current data(master) NFS export:
-----------------
setsebool -P virt_use_nfs 1
systemctl status rpcbind:  Active: active (running)
chmod -R 0755 /srv/ssd_data/ovirt/exports/data2
chown -R 36:36 /srv/ssd_data/ovirt/exports/data2 (-> gives owner vdsm, group kvm)

/etc/exports:
# 2nd SSD data storage for ovirt
/srv/ssd_data/ovirt/exports/data2 *(rw,anonuid=36,anongid=36,all_squash)


1) nfs-check OK (_with_ firewald activated on slp-storage-01)

On slp-ovirtnode-01
-------------------

python nfs-check.py slp-storage-01.corp.domain.tld:/srv/ssd_data/ovirt/exports/data2

"Current hostname: slp-ovirtnode-01.corp.domain.tld - IP addr 555.555.555.131
Trying to /bin/mount -t nfs slp-storage-01.corp.domain.tld:/srv/ssd_data/ovirt/exports/data2...
Executing NFS tests..
Removing vdsmTest file..
Status of tests [OK]
Disconnecting from NFS Server..
Done!"

2) showmount NOT OK (with firewall still on on storage-01):

"showmount -e slp-storage-01" moaned "RPC bind no route to host"

--> I then stopped firewalld and showmount ran ok showing all exports

3) Restarting engine
As nothing moved in Webui for ~15 minutes, I restarted HE VM. 

On node1
--------
hosted-engine --set-maintenance --mode=global

On each node
------------
systemctl stop ovirt-ha-agent

In HE VM
--------
shutdown -h now 

On slp-ovirtnode-01 (wait first for HE status down):
-------------------
hosted-engine --vm-start
systemctl start ovirt-ha-agent
hosted-engine --set-maintenance --mode=none

On each node
------------
systemctl start ovirt-ha-agent

4) HE VM came up and suddenly 2nd node slp-ovirtnode-02 went "up" aswell 
Even though I started HE on maintenanced node1, it did launch on node2. And then this node2 went aswell to status "up" (?)

5) Actual state nodes and SD master:
slp-ovirtnode-01	"maintenance" (doesn't come upwhen trying to activate)
slp-ovirtnode-02	"up" and HE VM on it running
slp-ovirtnode-03	"Non Operational" (doesn't come up)

Storage state (still):
data		Data (orig. master)	GlusterFS	V3	Maintenance	2712 GB	2315 GB
data2_ssd	Data (Master)		NFS			V3	Locked		[N/A]	[N/A]		

How can I force unlock of current master data2_ssd out of that twilight zone?

Comment 10 mike.lupe 2017-01-25 08:07:51 UTC
ADDITION (hosted-storage SD came up aswell):

4) HE VM came up and suddenly hosted-storage SD and 2nd node slp-ovirtnode-02 went "up" aswell 
Even though I started HE on maintenanced node1, it did launch on node2. 
And then this node2 went aswell to status "up" (?).
hosted_storage SD went "Active" aswell.

5) Current state of nodes and SD master:
slp-ovirtnode-01	"maintenance" (doesn't come up when trying to activate)
slp-ovirtnode-02	"up" and HE VM on it running
slp-ovirtnode-03	"Non Operational" (doesn't come up)

Storage:
data	Data (orig. master)	GlusterFS V3 Maintenance	2712 GB	2315 GB
data2_ssd	Data (Master)	NFS	  V3 Locked	[N/A]	[N/A]		
hosted_storage	Data	        GlusterFS V3 Active	49 GB	40 GB
esd & iso SDs are "Inactive"
ovirt-image-repository "Unattached"

Comment 11 Daniel Erez 2017-01-25 09:43:07 UTC
(In reply to mike.lupe from comment #10)
> ADDITION (hosted-storage SD came up aswell):
> 
> 4) HE VM came up and suddenly hosted-storage SD and 2nd node
> slp-ovirtnode-02 went "up" aswell 
> Even though I started HE on maintenanced node1, it did launch on node2. 
> And then this node2 went aswell to status "up" (?).
> hosted_storage SD went "Active" aswell.
> 
> 5) Current state of nodes and SD master:
> slp-ovirtnode-01	"maintenance" (doesn't come up when trying to activate)
> slp-ovirtnode-02	"up" and HE VM on it running
> slp-ovirtnode-03	"Non Operational" (doesn't come up)
> 
> Storage:
> data	Data (orig. master)	GlusterFS V3 Maintenance	2712 GB	2315 GB
> data2_ssd	Data (Master)	NFS	  V3 Locked	[N/A]	[N/A]		
> hosted_storage	Data	        GlusterFS V3 Active	49 GB	40 GB
> esd & iso SDs are "Inactive"
> ovirt-image-repository "Unattached"

* Have you verified there isn't a firewall issue in 'slp-ovirtnode-01' and 'slp-ovirtnode-03' as well perhaps?

* There should be a 'Non Operational' reason in the UI and logs. Can you find any?

Comment 12 mike.lupe 2017-01-27 09:18:04 UTC
I have finally again found the time to do some research, besides my "real" work, sorry for the delay.

Temporary disabled firewall (iptables + firewalld, did I mess up again here?) on all nodes. Hosted Engine comes up (ssh-able), but permanently switches from one node to another, after remaining some minutes (~15m) on "failed liveliness check". This happens since yesterday ongoing:

--== Host 1 status ==--
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
--== Host 2 status ==--
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
--== Host 3 status ==--
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}

This research is very,very time-consuming for me as well, and even if I feel honored with giving feedback, doing research and interacting with you experts etc, I'm at the point of simply need to manually unlock that external NFS data (master) SD residing on the storage-01 server, which I suspect is still locked (can't get to HE Webui to simple-check anymore). Or switching master to the original glusterfs hyperconverged SD (which still should be in "maintenance").

Fortunately I do have the most important data being hosted on some of the cluster's VMs backupped. But not my laboratory research work (eval application VMs e.g.).

Can you give me a hint to force that cluster to a working state, especially with the NFS SD unlocking? Thanks

Comment 13 mike.lupe 2017-01-28 17:50:09 UTC
On all 3 nodes, this entry is continous in /var/log/messages:

"journal: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof"

I checked http://www.ovirt.org/develop/release-management/features/infra/pki/ but can't tell what to do.

Comment 14 mike.lupe 2017-01-28 19:17:24 UTC
On all 3 nodes, this entry is continous in /var/log/messages:

"journal: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof"

I checked http://www.ovirt.org/develop/release-management/features/infra/pki/ but can't tell what to do.

Comment 15 Daniel Erez 2017-01-30 12:59:35 UTC
@Martin - iiuc, seems there's an issue with hosted-engine: https://bugzilla.redhat.com/show_bug.cgi?id=1415725#c12
Could you help here perhaps?

Comment 16 Martin Sivák 2017-01-30 15:42:55 UTC
> Temporary disabled firewall (iptables + firewalld, did I mess up again
> here?) on all nodes. 

> Hosted Engine comes up (ssh-able), but permanently
> switches from one node to another, after remaining some minutes (~15m) on
> "failed liveliness check". This happens since yesterday ongoing:

Liveness check is trying to do an HTTP request to the engine VM. Are you sure your IP/DNS is still the same it was when installed?

Can you ping or wget the main engine webpage manually from the host where it is running and from other hosts? What about your machine, can you reach the webadmin using browser?

Comment 17 mike.lupe 2017-01-30 18:06:59 UTC
First of all, Martin Sivák: Your recording on youtube "oVirt and Gluster, hyper-converged! by Martin Sivak" was my first close-up to ovirt, and still a bookmarked reference for me, even if it might be a bit outdated. Thanks for that one.

Now - my original DNS server was on an ovirt machine, so when everything went down through my epic "yum update" on all nodes, my backup DNS server (MS) took over. Unfortunately there was an outdated entry for the (taking over) NFS data (master) SD, which remained locked (and I guess still is locked). I had to update the entry for that NFS server, but everything is ok since ~5 days. slp-ovirtnode-01 to 03 never changed DNS entry.

Btw, while "yum update" (CentOS 7.3.x) went through, I saw vdsm and selinux being updated. That seriously made me fill my pants. 

Anyway, I guess more important is: I realized the day before yesterday, that ssh-ing between the nodes with root requested accepting a new key (shame on me, not using pubkey pair auth).

I thought, reaccepting new keys with loging between all hosts should resolve the "journal: vdsm vds.dispatcher ERROR SSL error..." entry in all node's "/var/log/messages". It didn't.

I surely messed up stuff.

Comment 18 mike.lupe 2017-01-30 18:13:59 UTC
Addition: HE isn't coming up anymore right now, so no response.

hosted-engine --vm-status:

Failed to connect to broker, the number of errors has exceeded the limit (1)
Cannot connect to the HA daemon, please check the logs.
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 174, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 105, in print_status
    cluster_stats = self._get_cluster_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 90, in _get_cluster_stats
    cluster_stats = ha_cli.get_all_stats(client.HAClient.
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 102, in get_all_stats
    with broker.connection(self._retries, self._wait):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 99, in connection
    self.connect(retries, wait)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 78, in connect
    raise BrokerConnectionError(error_msg)
ovirt_hosted_engine_ha.lib.exceptions.BrokerConnectionError: Failed to connect to broker, the number of errors has exceeded the limit (1)

Comment 19 Martin Sivák 2017-01-31 08:45:12 UTC
Hi Mike, thanks for trying out ovirt then.

> 
> Now - my original DNS server was on an ovirt machine, so when everything
> went down through my epic "yum update" on all nodes, my backup DNS server
> (MS) took over. Unfortunately there was an outdated entry for the (taking
> over) NFS data (master) SD, which remained locked (and I guess still is
> locked). I had to update the entry for that NFS server, but everything is ok
> since ~5 days. slp-ovirtnode-01 to 03 never changed DNS entry.

Ok, if the DNS is reporting the right mappings then you should be fine.

> Btw, while "yum update" (CentOS 7.3.x) went through, I saw vdsm and selinux
> being updated. That seriously made me fill my pants. 

That is normally not an issue, VDSM is backwards compatible.
 
> Anyway, I guess more important is: I realized the day before yesterday, that
> ssh-ing between the nodes with root requested accepting a new key (shame on
> me, not using pubkey pair auth).

This is not really an issue for the host tooling. The engine might require you to "Reinstall" the hosts though (it uses pubkey pair authentication for certain tasks).

> I thought, reaccepting new keys with loging between all hosts should resolve
> the "journal: vdsm vds.dispatcher ERROR SSL error..." entry in all node's
> "/var/log/messages". It didn't.

This might be totally unrelated too.

> I surely messed up stuff.

It happens. The first thing now is to make sure the vdsmd, sanlock, ovirt-ha-agent and ovirt-ha-broker services are up and running (systemctl start all of them).

Can you attach /var/log/ovirt-hosted-engine-ha/{agent.log,broker.log} to this bugzilla too?

Comment 20 mike.lupe 2017-02-01 21:01:13 UTC
Thanks for your time. I'll have time for detailed analysis tomorrow. 

But in short: All services running, the HE trying to come up, switching nodes continously.

Last entries agent.log:
MainThread::INFO::2017-02-01 21:56:46,182::ovf_store::103::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:030d9cc8-f284-4d9b-a671-39ed5a15d9a4, volUUID:e1e0b60c-b7b8-4da5-90b9-7f18da04e6f7
MainThread::INFO::2017-02-01 21:56:46,249::ovf_store::103::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:2bcab093-9eb9-4a34-865b-e19093f963fb, volUUID:6d148155-e1a8-4c5a-96d0-df2025d14702
MainThread::INFO::2017-02-01 21:56:46,366::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2017-02-01 21:56:46,389::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/slp-ovirtnode-01.corp.domain.tld:_engine/50934cd7-6cb2-44df-bdf4-c9cd5953a67b/images/2bcab093-9eb9-4a34-865b-e19093f963fb/6d148155-e1a8-4c5a-96d0-df2025d14702
MainThread::INFO::2017-02-01 21:56:46,435::config::226::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Found an OVF for HE VM, trying to convert
MainThread::INFO::2017-02-01 21:56:46,438::config::231::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Got vm.conf from OVF_STORE
MainThread::INFO::2017-02-01 21:56:48,920::hosted_engine::461::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 3400)
MainThread::INFO::2017-02-01 21:56:48,920::hosted_engine::466::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host slp-ovirtnode-03.corp.domain.tld (id: 3, score: 3000)

Latest entries broker.log:
21:59:25,521::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404907::INFO::2017-02-01 21:59:25,522::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404908::INFO::2017-02-01 21:59:25,523::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404908::INFO::2017-02-01 21:59:25,524::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404909::INFO::2017-02-01 21:59:25,524::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404909::INFO::2017-02-01 21:59:25,525::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404910::INFO::2017-02-01 21:59:25,526::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404910::INFO::2017-02-01 21:59:25,527::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404911::INFO::2017-02-01 21:59:25,527::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404906::INFO::2017-02-01 21:59:25,731::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404911::INFO::2017-02-01 21:59:25,746::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed

Comment 21 mike.lupe 2017-02-01 21:11:10 UTC
Why am I tempted to update to 4.1? Because I'm hoping that this update will re-trigger some "stuff"?

Comment 22 Martin Sivák 2017-02-02 12:37:31 UTC
(In reply to mike.lupe from comment #21)
> Why am I tempted to update to 4.1? Because I'm hoping that this update will
> re-trigger some "stuff"?

It won't really.. we have fixed some bugs, but nothing too important.

The log is too short, all I see there is that there is a host with score 3400 and another with score 3000. 

There is a way to get an overview of the hosted engine cluster: execute `hosted-engine --vm-status` on any hosted engine host and it should tell you the status. You can even execute it on all of your hosts and compare the output, it should be consistent.

Comment 23 mike.lupe 2017-02-03 19:30:42 UTC
May I be a bit rude and suggest that the nodes are doing well, but it's the HE who's broken? I managed to ssh into HE between the shutting down and switching to another node and engine.log is full of "uncategorized SQLException for SQL [select * from  getvmbyvmguid(?, ?, ?)]; SQL state [null]".

See attached 20160203_engine.log

help :)

Comment 24 mike.lupe 2017-02-03 19:33:16 UTC
Created attachment 1247564 [details]
HE engine.log SQL errors

HE engine.log SQL errors. HE shutting down and going up on another node permanently.

Comment 25 Martin Sivák 2017-02-03 23:07:37 UTC
I am not asking for certain logs or commands just for fun, we actually have important pieces of information "hidden" there. I can't help when I am not sure whether the hosted engine synchronization whiteboard is fine for example. The fact that hosts are not noticing the maintenance mode seems to indicate that it might be not.

We would also have seen that the reason for reboot was the failed liveliness check from the hosted engine logs (we knew that, but not the reason for that) and that the hosts and the hosted engine tooling were doing fine (or not). And that would have led us to check the VM itself (the worst scenario...).

You can prevent the automatic kill & restart by putting the cluster to global maintenance, but I see you already know that. Does the hosted-engine --vm-status tell you you are in global maintenance when executed from all hosts? (as it should if the hosted engine synchronization works properly)

If the database is broken or the underlying storage lies to the nodes then I can't personally help you (I maintain the synchronization logic code only), but I think you should not be seeing postgresql page errors so you are probably right...

Comment 26 mike.lupe 2017-02-04 09:16:10 UTC
Created attachment 1247644 [details]
engine-ha agent.log & broker.log slp-ovirtnode-01

hosted-engine --vmstatus correctly shows on all 3 nodes "!! Cluster is in GLOBAL MAINTENANCE mode !!"

broker.log shows still permanently switching between "Connection established" and "Connection closed"

Comment 27 mike.lupe 2017-02-04 09:35:44 UTC
Martin, thanks for your clarifications.

If I should prepare for pulling the plug and reinstalling (because HE backup was quite old), I hope you'll have some hints for me on best practices for:

- pulling out the VMs that haven't been backupped/pushed yet to the external storage domain
- in that context, which and how to allow users to manage the kvm/qemu engine (if needed)
- Any other stuff I'll need for being able to backup up the maximum for a later rebuild of my environment

Oh man, I hope we'll be able to avoid reinstalling. thanks

Comment 28 Martin Sivák 2017-02-04 10:54:52 UTC
Ok, I see the restart cycles in the log and I also see that the state machine is now indeed in global maintenance and so the VM should stay up (the agent won't touch it).

We would have to look further if it is still restarting, but I am pretty certain that it should not be and the log seems to agree with me.

It should be possible to do whatever maintenance is needed on the VM now, unfortunately I am not too familiar with postgresql recovery procedures.

> If I should prepare for pulling the plug and reinstalling (because HE backup
> was quite old), I hope you'll have some hints for me on best practices for:
> 
> - pulling out the VMs that haven't been backupped/pushed yet to the external
> storage domain

We have an internal backup of all VMs in the form of OVF config files on the shared storage, unfortunately those do contain the basic data only (no quota, no user permissions and so on).

> - in that context, which and how to allow users to manage the kvm/qemu
> engine (if needed)

You can connect to local kvm directly using virsh (user: vdsm@ovirt pass: shibboleth) or you can go one level higher and use vdsClient for basic manipulation through vdsm.

You can even add the host to a different engine and the VMs should show up as external unmanaged VMs there. It might be possible to export them then, but I have never tried that.

> - Any other stuff I'll need for being able to backup up the maximum for a
> later rebuild of my environment
> 
> Oh man, I hope we'll be able to avoid reinstalling. thanks

I will defer to my colleagues here as I really do not know enough about that. There is the engine-backup tool, but I am really not sure what would happen in your case.

And the last idea: you can email the users list too, because this seems to be turning into a support case and that is easier to manage there.

Martin: do we have any procedures for restoring/fixing a possibly corrupted engine database (see the attached engine log for the postgresql tracebacks)?

Comment 29 mike.lupe 2017-02-08 08:41:00 UTC
Thanks again for your infos Martin.

I would be thankful for some hints from the others about restoring/fixing the engine db. I'm aware it could be better to jump on the users list. But I'm honestly afraid it would take another 3 weeks to get to the level of "understanding" where we're here. 

Thanks a lot

Comment 30 Martin Perina 2017-02-13 15:25:58 UTC
> Martin: do we have any procedures for restoring/fixing a possibly corrupted
> engine database (see the attached engine log for the postgresql tracebacks)?

AFAIK you could restore engine DB from backup, it depends how often you performed backup and if you backed up only db or everything (config files and engine/DWH database). There's feature page about engine-backup although it's slightly outdated:

http://www.ovirt.org/develop/release-management/features/engine/engine-backup/

Anyway if you have performed full backup (files + dbs), then you should be able to restore them fully. If you have only db backup, then some thing might not work properly (it depends how old the backup is).

Comment 31 mike.lupe 2017-02-15 08:45:00 UTC
I'll take a look at it, thanks. My last "manual" backup is some months old. But I did some automatic backup when updating. Where do I find that backup set?

And what I don't get as well is: Do I really have to restore on a "clean machine"? That means, I'll lose the actual state of HE VM, meaning I'll have to backup all customized things (NFS shares, mounts etc.)? And a clean machine means killing and redeploying HE on all nodes?

Comment 32 Martin Perina 2017-02-15 13:00:55 UTC
(In reply to mike.lupe from comment #31)
> I'll take a look at it, thanks. My last "manual" backup is some months old.
> But I did some automatic backup when updating. Where do I find that backup
> set?

Hmm, it's only a backup database, but not a backup of /etc/ovirt-engine files. The location of the backup is mentioned during engine-backup execution, so I think it should be mentioned also in relevant log of engine-setup execution in /var/log/ovirt-engine/setup directory

> 
> And what I don't get as well is: Do I really have to restore on a "clean
> machine"? That means, I'll lose the actual state of HE VM, meaning I'll have
> to backup all customized things (NFS shares, mounts etc.)? And a clean
> machine means killing and redeploying HE on all nodes?

Comment 33 Martin Perina 2017-02-17 13:47:20 UTC
There are a lot of informations inside this bug, but I haven't been able to find any specific issue which should be fixed on oVirt side. So is there any item which you think that should be fixed in oVirt? If not, can we close the bug?

Comment 34 mike.lupe 2017-02-18 18:54:37 UTC
Indeed, unfortunately there where too many layers and levels of challenges. I ended with empty gluster bricks, why ever. 

The only thing I can say is that maintaining a hyperconverged ovirt cluster can be very tricky and there's almost no room for major user errors, like the one I triggered (clusterssh yum update on all hosts). 

I would suggest closing this issue.

thanks

Comment 35 Martin Perina 2017-02-20 12:21:49 UTC
Closing as NOTABUG, feel free to reopen and please provide concrete issue


Note You need to log in before you can comment on or make changes to this bug.