Created attachment 1243609 [details]
vdsm logs of slp-ovirtnode-02 (2nd node in 3-node cluster)

Description of problem:
I had a total fallout when doing a "clusterssh yum update -y" on all 4 oVirt nodes. I know, please don't ask me why I did that. It took me 2 days to get glusterfs back into sync without any errors.

After that time-consuming gluster recovery, all volumes came up. But in the Webui, all the storage domains jumped back and forth between "online" (green) and offline (red), at an interval of about 20 seconds. Because this didn't stop, I put the glustered data (master) SD into maintenance. This automatically activated an existing 2nd data SD ("data2_ssd" - NFS on the external storage-01 server) as the new master SD. That SD has remained in "locked" state ever since (3 days ago).

There is no way to get the original gluster data SD back, as no hosts are "up". I can't activate the hosts, because they complain about the missing data (master) storage domain; it is impossible for them to connect to the locked master SD.

The 3 nodes (glusterfs hyperconverged) have been in this state for the past 3 days:
slp-ovirtnode-01 (HE VM running on it)  "Activating"
slp-ovirtnode-02                        "Activating"
slp-ovirtnode-03                        "Maintenance"

Storage domains:
data            "Master Data Gluster Storage Domain"           Data           GlusterFS  V3  Maintenance  2712 GB  2315 GB
data2_ssd       "SSD slp-storage-01 SSD disk1"                 Data (Master)  NFS        V3  Locked       [N/A]    [N/A]
esd             "NFS Export Storage Domain on srp-storage-01"  Export         NFS        V1  Unknown      [N/A]    [N/A]
hosted_storage                                                 Data           GlusterFS  V3  Unknown      49 GB    40 GB
iso             "ISO Gluster Storage Domain"                   ISO            GlusterFS  V1  Inactive     31 GB    16 GB

Version-Release number of selected component (if applicable):
OS: CentOS 7.3
oVirt version: 4.0.6
vdsm version: vdsm-4.18.21-1.el7.centos.x86_64
# of nodes: 3
Data (master) storage: "data" - GlusterFS replica 3 arbiter 1
Data storage 2: "data2_ssd" - NFS on external storage server (became data master after setting "data" (master) to maintenance)

How reproducible:

Steps to Reproduce:
1. I don't recommend pulling the plug to reproduce this. Or doing a "yum update" on all hosts at the same time. :-|
2.
3.

Actual results:
The data2_ssd data (master) SD remains in "locked" state. Not a single node can get to "up", because the SD is locked. I can't destroy the data (master) SD because the original glustered data (master) SD is in maintenance. I can't bring that original glustered data (master) SD up, because no node is "up". Chicken-and-egg, etc.

Expected results:
The external NFS data (master) SD that took over from the maintenanced original data (master) should unlock.
Additional info:
vdsm.log exceptions pointed out by mlipchuck:

1)
jsonrpc.Executor/2::ERROR::2017-01-23 09:12:22,814::sdc::146::Storage.StorageDomainCache::(_findDomain) domain 0e957331-9688-47a9-8352-08e7ddf97d9f not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'0e957331-9688-47a9-8352-08e7ddf97d9f',)

2)
jsonrpc.Executor/5::DEBUG::2017-01-23 09:47:18,556::lvm::288::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' WARNING: Not using lvmetad because config setting use_lvmetad=0.\n WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).\n Volume group "0e957331-9688-47a9-8352-08e7ddf97d9f" not found\n Cannot process volume group 0e957331-9688-47a9-8352-08e7ddf97d9f\n'; <rc> = 5
Created attachment 1243610 [details]
Hosted engine VM logs

The hosted-engine VM resides on the 1st cluster node, slp-ovirtnode-01.
Created attachment 1243740 [details]
oVirt Webui hosts view

Hosts view in the Webui. The status of node 1 and node 2 jumps from "Activating" to "Not Operational" and back, at an interval of roughly 3-4 minutes(?).
Created attachment 1243742 [details]
oVirt Webui original GlusterFS data (master) SD - maintenanced

This SD "data" hasn't changed status for ~2 days. Before I set it to maintenance (after the successful glusterfs recovery), all storage domains came up, but kept switching back and forth between "up" (green) and "down" (red). That's why I set it to maintenance, and then the "data2_ssd" SD became master.
Created attachment 1243743 [details]
oVirt Webui view: data2_ssd (new master) NFS SD - "Locked"

"Locked" for the past 2 days. I can't destroy it, because the "data" SD is in maintenance and can't take over.
Typo in the description and comment of the storage domain "esd" - it's also sLp-storage-01, not sRp-storage-01:

esd  "NFS Export Storage Domain on slp-storage-01"  Export  NFS  V1  Unknown  [N/A]  [N/A]
Hi Mike,

According to the vdsm log [1], there was a connection error to the storage.
Can you please verify that the mount succeeds on all hosts (just to ensure that the connection issue is mitigated)?

[1]
jsonrpc.Executor/6::ERROR::2017-01-23 09:12:22,131::hsm::2403::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2400, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 456, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 238, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 230, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 229, in mount
    timeout=timeout, cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (32, ';mount.nfs: Connection timed out\n')
Hi Daniel,

I'll try that with a "normal" user. If I remember correctly, when trying with the vdsm user from one of the nodes, it timed out. I'm not sure whether the "soft" (or similar) mount parameter was the problem.

Which user do you need me to try with again? And how _exactly_ do you want me to manually mount that storage-01 NFS export? Can you provide the exact parameters needed?

Thanks
You can test it using the 'nfs-check' program in vdsm's contrib directory. See full details at:
http://www.ovirt.org/documentation/how-to/troubleshooting/troubleshooting-nfs-storage-issues/
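For reference, a manual mount test from one of the nodes might look roughly like this (a sketch only - the server name and export path are placeholders to be replaced with the actual data2_ssd export, and the mount options are an assumption based on the NFSv3 defaults vdsm typically uses):

# run as root on one of the nodes
mkdir -p /mnt/nfstest
mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 \
    slp-storage-01.corp.domain.tld:/srv/ssd_data/ovirt/exports/data2 /mnt/nfstest
sudo -u vdsm touch /mnt/nfstest/vdsm_write_test   # write must succeed as the vdsm user
rm -f /mnt/nfstest/vdsm_write_test
umount /mnt/nfstest

If the mount or the write as vdsm times out or fails, that points at the same connection/permission problem vdsm is hitting.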
I had previously already done most of the recommended things from that URL, apart from disabling the firewall on the storage server, which I have now done in one of the steps below.

On slp-storage-01, providing the current data (master) NFS export:
-----------------
setsebool -P virt_use_nfs 1

systemctl status rpcbind: Active: active (running)

chmod -R 0755 /srv/ssd_data/ovirt/exports/data2
chown -R 36:36 /srv/ssd_data/ovirt/exports/data2 (-> gives owner vdsm, group kvm)

/etc/exports:
# 2nd SSD data storage for ovirt
/srv/ssd_data/ovirt/exports/data2 *(rw,anonuid=36,anongid=36,all_squash)

1) nfs-check OK (_with_ firewalld activated on slp-storage-01)

On slp-ovirtnode-01:
-------------------
python nfs-check.py slp-storage-01.corp.domain.tld:/srv/ssd_data/ovirt/exports/data2

"Current hostname: slp-ovirtnode-01.corp.domain.tld - IP addr 555.555.555.131
Trying to /bin/mount -t nfs slp-storage-01.corp.domain.tld:/srv/ssd_data/ovirt/exports/data2...
Executing NFS tests..
Removing vdsmTest file..
Status of tests [OK]
Disconnecting from NFS Server..
Done!"

2) showmount NOT OK (with the firewall still on on storage-01):
"showmount -e slp-storage-01" complained "RPC bind no route to host"
--> I then stopped firewalld and showmount ran OK, showing all exports.

3) Restarting the engine
As nothing moved in the Webui for ~15 minutes, I restarted the HE VM.

On node1:
--------
hosted-engine --set-maintenance --mode=global

On each node:
------------
systemctl stop ovirt-ha-agent

In the HE VM:
--------
shutdown -h now

On slp-ovirtnode-01 (after first waiting for HE status "down"):
-------------------
hosted-engine --vm-start
systemctl start ovirt-ha-agent
hosted-engine --set-maintenance --mode=none

On each node:
------------
systemctl start ovirt-ha-agent

4) The HE VM came up and suddenly the 2nd node slp-ovirtnode-02 went "up" as well.
Even though I started HE on the maintenanced node1, it launched on node2. And then this node2 also went to status "up" (?).

5) Current state of nodes and SD master:
slp-ovirtnode-01 "Maintenance" (doesn't come up when trying to activate)
slp-ovirtnode-02 "Up", with the HE VM running on it
slp-ovirtnode-03 "Non Operational" (doesn't come up)

Storage state (still):
data       Data (orig. master)  GlusterFS  V3  Maintenance  2712 GB  2315 GB
data2_ssd  Data (Master)        NFS        V3  Locked       [N/A]    [N/A]

How can I force the current master data2_ssd to unlock and get out of that twilight zone?
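(Side note on step 2 above: instead of stopping firewalld entirely on the storage server, opening just the NFS-related services might be enough. A sketch, assuming firewalld with its stock service definitions - adjust if your setup differs:)

# on slp-storage-01
firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --permanent --add-service=mountd
firewall-cmd --reload
showmount -e slp-storage-01   # should now list the exports with the firewall still running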
ADDITION (the hosted_storage SD came up as well):

4) The HE VM came up and suddenly the hosted_storage SD and the 2nd node slp-ovirtnode-02 went "up" as well.
Even though I started HE on the maintenanced node1, it launched on node2. And then this node2 also went to status "up" (?).
The hosted_storage SD went "Active" as well.

5) Current state of nodes and SD master:
slp-ovirtnode-01 "Maintenance" (doesn't come up when trying to activate)
slp-ovirtnode-02 "Up", with the HE VM running on it
slp-ovirtnode-03 "Non Operational" (doesn't come up)

Storage:
data            Data (orig. master)  GlusterFS  V3  Maintenance  2712 GB  2315 GB
data2_ssd       Data (Master)        NFS        V3  Locked       [N/A]    [N/A]
hosted_storage  Data                 GlusterFS  V3  Active       49 GB    40 GB
The esd & iso SDs are "Inactive", ovirt-image-repository is "Unattached".
(In reply to mike.lupe from comment #10)
> ADDITION (the hosted_storage SD came up as well):
>
> 4) The HE VM came up and suddenly the hosted_storage SD and the 2nd node
> slp-ovirtnode-02 went "up" as well.
> Even though I started HE on the maintenanced node1, it launched on node2.
> And then this node2 also went to status "up" (?).
> The hosted_storage SD went "Active" as well.
>
> 5) Current state of nodes and SD master:
> slp-ovirtnode-01 "Maintenance" (doesn't come up when trying to activate)
> slp-ovirtnode-02 "Up", with the HE VM running on it
> slp-ovirtnode-03 "Non Operational" (doesn't come up)
>
> Storage:
> data            Data (orig. master)  GlusterFS  V3  Maintenance  2712 GB  2315 GB
> data2_ssd       Data (Master)        NFS        V3  Locked       [N/A]    [N/A]
> hosted_storage  Data                 GlusterFS  V3  Active       49 GB    40 GB
> The esd & iso SDs are "Inactive", ovirt-image-repository is "Unattached".

* Have you verified there isn't a firewall issue on 'slp-ovirtnode-01' and 'slp-ovirtnode-03' as well, perhaps?
* There should be a 'Non Operational' reason in the UI and the logs. Can you find any?
I have finally found time to do some research again, besides my "real" work - sorry for the delay.

I temporarily disabled the firewall (iptables + firewalld - did I mess up again here?) on all nodes. The Hosted Engine comes up (ssh-able), but permanently switches from one node to another after remaining for some minutes (~15m) on "failed liveliness check". This has been going on since yesterday:

--== Host 1 status ==--
Engine status: {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

--== Host 2 status ==--
Engine status: {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

--== Host 3 status ==--
Engine status: {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}

This research is very, very time-consuming for me as well, and even though I feel honored to give feedback, do research and interact with you experts, I'm at the point where I simply need to manually unlock that external NFS data (master) SD residing on the storage-01 server, which I suspect is still locked (I can't get to the HE Webui to simply check anymore). Or to switch the master back to the original glusterfs hyperconverged SD (which should still be in "maintenance").

Fortunately, the most important data hosted on some of the cluster's VMs is backed up - but not my laboratory research work (eval application VMs, for example).

Can you give me a hint on how to force that cluster back into a working state, especially with unlocking the NFS SD?

Thanks
On all 3 nodes, this entry appears continuously in /var/log/messages:

"journal: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof"

I checked http://www.ovirt.org/develop/release-management/features/infra/pki/ but can't tell what to do.
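To narrow down the SSL error, a quick sanity check of the vdsm certificates referenced on that PKI page might look like this (a sketch only - the paths are the usual vdsm defaults on EL7 and the hostname is one of the nodes above; adjust if your layout differs):

# on each node: confirm the vdsm certificate is present and not expired
openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -noout -subject -enddate
openssl x509 -in /etc/pki/vdsm/certs/cacert.pem -noout -subject -enddate

# from another node: verify the TLS handshake against vdsm on port 54321
openssl s_client -connect slp-ovirtnode-01.corp.domain.tld:54321 \
    -cert /etc/pki/vdsm/certs/vdsmcert.pem -key /etc/pki/vdsm/keys/vdsmkey.pem </dev/null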
@Martin - iiuc, it seems there's an issue with hosted-engine:
https://bugzilla.redhat.com/show_bug.cgi?id=1415725#c12

Could you help here perhaps?
> I temporarily disabled the firewall (iptables + firewalld - did I mess up
> again here?) on all nodes. The Hosted Engine comes up (ssh-able), but
> permanently switches from one node to another after remaining for some
> minutes (~15m) on "failed liveliness check". This has been going on since
> yesterday:

The liveness check tries to do an HTTP request to the engine VM.

Are you sure your IP/DNS is still the same as when it was installed? Can you ping or wget the main engine webpage manually, both from the host where the VM is running and from the other hosts? And what about your own machine - can you reach the webadmin using a browser?
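A minimal way to run those checks from a host (a sketch - the engine FQDN is a placeholder, replace it with your actual hosted-engine address; the health URL is the endpoint the HA broker's liveness check is believed to poll):

# from the host running the HE VM, and again from the other hosts
ping -c3 engine.corp.domain.tld
curl -k -I https://engine.corp.domain.tld/ovirt-engine/
# roughly what the liveness check does:
curl -k https://engine.corp.domain.tld/ovirt-engine/services/health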
First of all, Martin Sivák: your recording on YouTube, "oVirt and Gluster, hyper-converged! by Martin Sivak", was my first close-up look at oVirt and is still a bookmarked reference for me, even if it might be a bit outdated. Thanks for that one.

Now - my original DNS server was on an oVirt machine, so when everything went down through my epic "yum update" on all nodes, my backup DNS server (MS) took over. Unfortunately, it had an outdated entry for the (taking over) NFS data (master) SD, which remained locked (and I guess still is locked). I had to update the entry for that NFS server, but everything has been OK since ~5 days ago. slp-ovirtnode-01 to 03 never changed DNS entries.

Btw, while the "yum update" (CentOS 7.3.x) went through, I saw vdsm and selinux being updated. That seriously scared me.

Anyway, I guess more important is: I realized the day before yesterday that ssh-ing between the nodes as root requested accepting a new key (shame on me for not using pubkey pair auth). I thought that re-accepting new keys by logging in between all hosts would resolve the "journal: vdsm vds.dispatcher ERROR SSL error..." entry in all nodes' /var/log/messages. It didn't.

I surely messed up stuff.
Addition: the HE isn't coming up anymore right now, so no response.

hosted-engine --vm-status:

Failed to connect to broker, the number of errors has exceeded the limit (1)
Cannot connect to the HA daemon, please check the logs.
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 174, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 105, in print_status
    cluster_stats = self._get_cluster_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 90, in _get_cluster_stats
    cluster_stats = ha_cli.get_all_stats(client.HAClient.
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 102, in get_all_stats
    with broker.connection(self._retries, self._wait):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 99, in connection
    self.connect(retries, wait)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 78, in connect
    raise BrokerConnectionError(error_msg)
ovirt_hosted_engine_ha.lib.exceptions.BrokerConnectionError: Failed to connect to broker, the number of errors has exceeded the limit (1)
Hi Mike, thanks for trying out oVirt then.

> Now - my original DNS server was on an oVirt machine, so when everything
> went down through my epic "yum update" on all nodes, my backup DNS server
> (MS) took over. Unfortunately, it had an outdated entry for the (taking
> over) NFS data (master) SD, which remained locked (and I guess still is
> locked). I had to update the entry for that NFS server, but everything has
> been OK since ~5 days ago. slp-ovirtnode-01 to 03 never changed DNS entries.

Ok, if the DNS is reporting the right mappings then you should be fine.

> Btw, while the "yum update" (CentOS 7.3.x) went through, I saw vdsm and
> selinux being updated. That seriously scared me.

That is normally not an issue, VDSM is backwards compatible.

> Anyway, I guess more important is: I realized the day before yesterday that
> ssh-ing between the nodes as root requested accepting a new key (shame on
> me for not using pubkey pair auth).

This is not really an issue for the host tooling. The engine might require you to "Reinstall" the hosts though (it uses pubkey pair authentication for certain tasks).

> I thought that re-accepting new keys by logging in between all hosts would
> resolve the "journal: vdsm vds.dispatcher ERROR SSL error..." entry in all
> nodes' /var/log/messages. It didn't.

This might be totally unrelated too.

> I surely messed up stuff.

It happens.

The first thing now is to make sure the vdsmd, sanlock, ovirt-ha-agent and ovirt-ha-broker services are up and running (systemctl start all of them).

Can you attach /var/log/ovirt-hosted-engine-ha/{agent.log,broker.log} to this bugzilla too?
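A short sketch of the service check described above, run on each node (the service names are the standard ones on EL7; this just starts anything not already active):

for svc in vdsmd sanlock ovirt-ha-broker ovirt-ha-agent; do
    systemctl is-active $svc || systemctl start $svc
done
systemctl status vdsmd sanlock ovirt-ha-broker ovirt-ha-agent --no-pager -l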
Thanks for your time. I'll have time for a detailed analysis tomorrow. But in short: all services are running, and the HE is trying to come up, continuously switching nodes.

Last entries in agent.log:

MainThread::INFO::2017-02-01 21:56:46,182::ovf_store::103::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:030d9cc8-f284-4d9b-a671-39ed5a15d9a4, volUUID:e1e0b60c-b7b8-4da5-90b9-7f18da04e6f7
MainThread::INFO::2017-02-01 21:56:46,249::ovf_store::103::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:2bcab093-9eb9-4a34-865b-e19093f963fb, volUUID:6d148155-e1a8-4c5a-96d0-df2025d14702
MainThread::INFO::2017-02-01 21:56:46,366::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2017-02-01 21:56:46,389::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/slp-ovirtnode-01.corp.domain.tld:_engine/50934cd7-6cb2-44df-bdf4-c9cd5953a67b/images/2bcab093-9eb9-4a34-865b-e19093f963fb/6d148155-e1a8-4c5a-96d0-df2025d14702
MainThread::INFO::2017-02-01 21:56:46,435::config::226::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Found an OVF for HE VM, trying to convert
MainThread::INFO::2017-02-01 21:56:46,438::config::231::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Got vm.conf from OVF_STORE
MainThread::INFO::2017-02-01 21:56:48,920::hosted_engine::461::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 3400)
MainThread::INFO::2017-02-01 21:56:48,920::hosted_engine::466::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host slp-ovirtnode-03.corp.domain.tld (id: 3, score: 3000)

Latest entries in broker.log:

21:59:25,521::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404907::INFO::2017-02-01 21:59:25,522::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404908::INFO::2017-02-01 21:59:25,523::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404908::INFO::2017-02-01 21:59:25,524::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404909::INFO::2017-02-01 21:59:25,524::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404909::INFO::2017-02-01 21:59:25,525::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404910::INFO::2017-02-01 21:59:25,526::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404910::INFO::2017-02-01 21:59:25,527::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404911::INFO::2017-02-01 21:59:25,527::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-1404906::INFO::2017-02-01 21:59:25,731::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-1404911::INFO::2017-02-01 21:59:25,746::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Why am I tempted to update to 4.1? Because I'm hoping that this update will re-trigger some "stuff"?
(In reply to mike.lupe from comment #21)
> Why am I tempted to update to 4.1? Because I'm hoping that this update will
> re-trigger some "stuff"?

It won't really.. we have fixed some bugs, but nothing too important.

The log is too short; all I see there is that there is a host with score 3400 and another with score 3000.

There is a way to get an overview of the hosted engine cluster: execute hosted-engine --vm-status on any hosted engine host and it should tell you the status. You can even execute it on all of your hosts and compare the output, it should be consistent.
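One way to collect and compare that output from all three nodes in one go (a sketch; assumes root ssh access between the nodes, which may need the re-accepted host keys mentioned earlier):

for h in slp-ovirtnode-01 slp-ovirtnode-02 slp-ovirtnode-03; do
    echo "== $h =="
    ssh root@$h hosted-engine --vm-status
done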
May I be a bit rude and suggest that the nodes are doing well, and that it's the HE that's broken? I managed to ssh into the HE between it shutting down and switching to another node, and engine.log is full of:

"uncategorized SQLException for SQL [select * from getvmbyvmguid(?, ?, ?)]; SQL state [null]"

See the attached 20160203_engine.log.

help :)
Created attachment 1247564 [details]
HE engine.log SQL errors

HE engine.log SQL errors. The HE keeps shutting down and coming up on another node permanently.
I am not asking for certain logs or commands just for fun - we actually have important pieces of information "hidden" there. I can't help when I am not sure whether the hosted engine synchronization whiteboard is fine, for example. The fact that the hosts are not noticing the maintenance mode seems to indicate that it might not be.

We would also have seen that the reason for the reboot was the failed liveliness check from the hosted engine logs (we knew that, but not the reason for it) and that the hosts and the hosted engine tooling were doing fine (or not). And that would have led us to check the VM itself (the worst scenario...).

You can prevent the automatic kill & restart by putting the cluster into global maintenance, but I see you already know that. Does hosted-engine --vm-status tell you that you are in global maintenance when executed from all hosts? (as it should, if the hosted engine synchronization works properly)

If the database is broken or the underlying storage lies to the nodes, then I can't personally help you (I maintain the synchronization logic code only), but I think you should not be seeing postgresql page errors, so you are probably right...
Created attachment 1247644 [details]
engine-ha agent.log & broker.log slp-ovirtnode-01

hosted-engine --vm-status correctly shows "!! Cluster is in GLOBAL MAINTENANCE mode !!" on all 3 nodes.

broker.log still permanently alternates between "Connection established" and "Connection closed".
Martin, thanks for your clarifications. If I should prepare for pulling the plug and reinstalling (because my HE backup is quite old), I hope you'll have some hints for me on best practices for:

- pulling out the VMs that haven't been backed up / pushed to the external storage domain yet
- in that context, which tools to use and how to manage the kvm/qemu engine directly (if needed)
- any other stuff I'll need to be able to back up as much as possible for a later rebuild of my environment

Oh man, I hope we'll be able to avoid reinstalling. Thanks
Ok, I see the restart cycles in the log, and I also see that the state machine is now indeed in global maintenance, so the VM should stay up (the agent won't touch it). We would have to look further if it is still restarting, but I am pretty certain that it should not be, and the log seems to agree with me.

It should be possible to do whatever maintenance is needed on the VM now; unfortunately I am not too familiar with postgresql recovery procedures.

> If I should prepare for pulling the plug and reinstalling (because my HE
> backup is quite old), I hope you'll have some hints for me on best practices
> for:
>
> - pulling out the VMs that haven't been backed up / pushed to the external
> storage domain yet

We have an internal backup of all VMs in the form of OVF config files on the shared storage; unfortunately those contain the basic data only (no quota, no user permissions and so on).

> - in that context, which tools to use and how to manage the kvm/qemu
> engine directly (if needed)

You can connect to the local kvm directly using virsh (user: vdsm@ovirt, pass: shibboleth) or you can go one level higher and use vdsClient for basic manipulation through vdsm. You can even add the host to a different engine and the VMs should show up as external unmanaged VMs there. It might be possible to export them then, but I have never tried that.

> - any other stuff I'll need to be able to back up as much as possible for a
> later rebuild of my environment
>
> Oh man, I hope we'll be able to avoid reinstalling. Thanks

I will defer to my colleagues here, as I really do not know enough about that. There is the engine-backup tool, but I am really not sure what would happen in your case.

And the last idea: you can email the users list too, because this seems to be turning into a support case and that is easier to manage there.

Martin: do we have any procedures for restoring/fixing a possibly corrupted engine database (see the attached engine log for the postgresql tracebacks)?
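For the virsh / vdsClient route described above, a minimal sketch run on a node might look like this (the vdsm@ovirt SASL credentials are the ones quoted in the comment; vdsClient syntax can differ slightly between vdsm versions):

# libvirt-level view of the VMs on this host; virsh prompts for the vdsm@ovirt user and password
virsh -c qemu:///system list --all
# save a VM's libvirt definition for later reference (vm-name is a placeholder)
virsh -c qemu:///system dumpxml vm-name > /root/vm-name.xml
# vdsm-level view of the VMs on this host
vdsClient -s 0 list table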
Thanks again for your infos, Martin. I would be thankful for some hints from the others about restoring/fixing the engine db. I'm aware it might be better to jump on the users list, but I'm honestly afraid it would take another 3 weeks to get to the level of "understanding" we have reached here. Thanks a lot
> Martin: do we have any procedures for restoring/fixing a possibly corrupted
> engine database (see the attached engine log for the postgresql tracebacks)?

AFAIK you could restore the engine DB from a backup; it depends on how often you performed backups and whether you backed up only the db or everything (config files and the engine/DWH databases). There's a feature page about engine-backup, although it's slightly outdated:
http://www.ovirt.org/develop/release-management/features/engine/engine-backup/

Anyway, if you have performed a full backup (files + dbs), then you should be able to restore it fully. If you have only a db backup, then some things might not work properly (it depends on how old the backup is).
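For reference, typical engine-backup usage looks roughly like this (a sketch, run inside the engine VM; file names are placeholders and the exact restore options can vary between oVirt versions, so check the feature page above before relying on it):

# full backup of configuration files + databases
engine-backup --mode=backup --scope=all --file=/root/engine-full.backup --log=/root/engine-backup.log

# restore on a freshly deployed engine; --provision-db recreates the database before restoring
engine-backup --mode=restore --file=/root/engine-full.backup --log=/root/engine-restore.log \
    --provision-db --restore-permissions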
I'll take a look at it, thanks. My last "manual" backup is some months old, but some automatic backup was done when updating. Where do I find that backup set?

And what I don't get as well is: do I really have to restore on a "clean machine"? That means I'll lose the actual state of the HE VM, meaning I'll have to back up all the customized things (NFS shares, mounts etc.)? And does a clean machine mean killing and redeploying the HE on all nodes?
(In reply to mike.lupe from comment #31)
> I'll take a look at it, thanks. My last "manual" backup is some months old,
> but some automatic backup was done when updating. Where do I find that
> backup set?

Hmm, that's only a backup of the database, not a backup of the /etc/ovirt-engine files. The location of the backup is mentioned during engine-backup execution, so I think it should also be mentioned in the relevant log of the engine-setup execution in the /var/log/ovirt-engine/setup directory.

> And what I don't get as well is: do I really have to restore on a "clean
> machine"? That means I'll lose the actual state of the HE VM, meaning I'll
> have to back up all the customized things (NFS shares, mounts etc.)? And
> does a clean machine mean killing and redeploying the HE on all nodes?
There is a lot of information in this bug, but I haven't been able to find any specific issue which should be fixed on the oVirt side. So is there any item which you think should be fixed in oVirt? If not, can we close the bug?
Indeed, unfortunately there were too many layers and levels of challenges. I ended up with empty gluster bricks, for whatever reason. The only thing I can say is that maintaining a hyperconverged oVirt cluster can be very tricky, and there's almost no room for major user errors like the one I triggered (clusterssh yum update on all hosts). I would suggest closing this issue. Thanks
Closing as NOTABUG. Feel free to reopen, and please provide a concrete issue if you do.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days