Created attachment 1332879 [details]
engine and vdsm logs

Description of problem:
In our automation, we see several failures where starting a VM fails.

engine.log:
2017-09-30 14:12:46,236+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-22) [vms_syncAction_b646c6ef-d06d-45ad] EVENT_ID: USER_STARTED_VM(153), VM vm_TestCase18894_REST_GLUS_3014123420 was started by admin@internal-authz (Host: host_mixed_2).
2017-09-30 14:12:53,520+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-1) [] VM '7c03609e-da72-41d4-90a5-65efe9a264f2' was reported as Down on VDS '2e5274d9-65d1-4d1c-b5cf-4c5d07c53690'(host_mixed_2)
2017-09-30 14:12:53,523+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-1) [] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{hostId='2e5274d9-65d1-4d1c-b5cf-4c5d07c53690', vmId='7c03609e-da72-41d4-90a5-65efe9a264f2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 74698ad4
2017-09-30 14:12:54,734+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-1) [] FINISH, DestroyVDSCommand, log id: 74698ad4
2017-09-30 14:12:54,734+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-1) [] VM '7c03609e-da72-41d4-90a5-65efe9a264f2'(vm_TestCase18894_REST_GLUS_3014123420) moved from 'WaitForLaunch' --> 'Down'
2017-09-30 14:12:54,747+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-1) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_TestCase18894_REST_GLUS_3014123420 is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: [2017-09-30 11:12:50.057812] E [MSGID: 108006] [afr-common.c:5001:__afr_handle_child_down_event] 0-GE_12_volume02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
2017-09-30 14:12:54,748+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-1) [] add VM '7c03609e-da72-41d4-90a5-65efe9a264f2'(vm_TestCase18894_REST_GLUS_3014123420) to rerun treatment
2017-09-30 14:12:54,751+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-1) [] Rerun VM '7c03609e-da72-41d4-90a5-65efe9a264f2'. Called from VDS 'host_mixed_2'
2017-09-30 14:12:54,768+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-162) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Failed to run VM vm_TestCase18894_REST_GLUS_3014123420 on Host host_mixed_2.
vdsm.log:
2017-09-30 14:12:53,475+0300 ERROR (vm/7c03609e) [virt.vm] (vmId='7c03609e-da72-41d4-90a5-65efe9a264f2') The vm start process failed (vm:866)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 799, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2576, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: [2017-09-30 11:12:50.057812] E [MSGID: 108006] [afr-common.c:5001:__afr_handle_child_down_event] 0-GE_12_volume02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
2017-09-30T11:12:51.992205Z qemu-kvm: -drive file=gluster://gluster01.scl.lab.tlv.redhat.com/GE_12_volume02/bcc4bcc2-3bdf-4f64-b12e-20dd532a9876/images/6b2fb326-366e-4db6-83d8-3edce3c63a03/f9a12dfb-9278-4853-b296-3dacff38bb18,file.debug=4,format=qcow2,if=none,id=drive-virtio-disk0,serial=6b2fb326-366e-4db6-83d8-3edce3c63a03,cache=none,werror=stop,rerror=stop,aio=threads: Could not read L1 table: Input/output error
2017-09-30 14:12:53,476+0300 INFO (vm/7c03609e) [virt.vm] (vmId='7c03609e-da72-41d4-90a5-65efe9a264f2') Changed state to Down: internal error: qemu unexpectedly closed the monitor: [2017-09-30 11:12:50.057812] E [MSGID: 108006] [afr-common.c:5001:__afr_handle_child_down_event] 0-GE_12_volume02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
2017-09-30T11:12:51.992205Z qemu-kvm: -drive file=gluster://gluster01.scl.lab.tlv.redhat.com/GE_12_volume02/bcc4bcc2-3bdf-4f64-b12e-20dd532a9876/images/6b2fb326-366e-4db6-83d8-3edce3c63a03/f9a12dfb-9278-4853-b296-3dacff38bb18,file.debug=4,format=qcow2,if=none,id=drive-virtio-disk0,serial=6b2fb326-366e-4db6-83d8-3edce3c63a03,cache=none,werror=stop,rerror=stop,aio=threads: Could not read L1 table: Input/output error (code=1) (vm:1553)
2017-09-30 14:12:53,520+0300 INFO (vm/7c03609e) [virt.vm] (vmId='7c03609e-da72-41d4-90a5-65efe9a264f2') Stopping connection (guestagent:429)
2017-09-30 14:12:53,529+0300 INFO (jsonrpc/3) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.35.69.107,35318 (api:46)
2017-09-30 14:12:53,529+0300 INFO (jsonrpc/3) [virt.vm] (vmId='7c03609e-da72-41d4-90a5-65efe9a264f2') Release VM resources (vm:4762)
2017-09-30 14:12:53,530+0300 WARN (jsonrpc/3) [virt.vm] (vmId='7c03609e-da72-41d4-90a5-65efe9a264f2') trying to set state to Powering down when already Down (vm:542)
2017-09-30 14:12:53,530+0300 INFO (jsonrpc/3) [virt.vm] (vmId='7c03609e-da72-41d4-90a5-65efe9a264f2') Stopping connection (guestagent:429)

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170925172449.git9d850a9.el7.centos.noarch
vdsm-4.20.3-95.git0813890.el7.centos.x86_64

How reproducible:
From time to time

Steps to Reproduce:
1. Start VM
2.
3.

Actual results:

Expected results:

Additional info:
Moving to virt for initial analysis. If there's a storage issue here, we'll need you guys to point us in the right direction. Thanks!
When there is a qemu failure, the qemu log is often needed. Please attach it as well.
Created attachment 1333134 [details]
qemu log
There's no other issue here beyond the L1 table read failure, which is gluster related. Sahina?
It does look gluster related - the volume has lost quorum according to the logs:

[2017-09-30 11:13:04.830383] E [MSGID: 108006] [afr-common.c:5001:__afr_handle_child_down_event] 0-GE_12_volume02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.

Was there a network issue at this time? Could you also attach glusterd logs and brick logs from the 3 nodes?
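For reference, a minimal sketch of collecting those (assuming the default glusterfs log locations - glusterd.log and the per-brick logs under /var/log/glusterfs/; adjust the paths if the nodes log elsewhere):

# run on each of the 3 gluster nodes
ls /var/log/glusterfs/glusterd.log /var/log/glusterfs/bricks/*.log
tar czf /tmp/$(hostname -s)-gluster-logs.tar.gz \
    /var/log/glusterfs/glusterd.log /var/log/glusterfs/bricks/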
(In reply to Sahina Bose from comment #5)
> It does look gluster related - the volume has lost quorum according to the logs:
> [2017-09-30 11:13:04.830383] E [MSGID: 108006]
> [afr-common.c:5001:__afr_handle_child_down_event]
> 0-GE_12_volume02-replicate-0: All subvolumes are down. Going offline until
> atleast one of them comes back up.
>
> Was there a network issue at this time? Could you also attach glusterd logs
> and brick logs from the 3 nodes?

There wasn't any network issue during that time. I don't have the requested logs, but I will attach them when I see this error again in our automation.
Adding back the needinfo for the requested logs.
This bug was discussed on a mail thread:

On Mon, Oct 23, 2017 at 2:16 PM, Krutika Dhananjay <kdhananj> wrote:
I checked the logs in the gluster storage pool machines and the machines that are mounting the volume. The only thing that stands out at the moment is frequent connect/disconnect logs between the bricks and the mount process.
Adding Raghavendra G and Milind.
Are there any network issues known between the servers and the client machines?
-Krutika

On Mon, Oct 23, 2017 at 11:49 AM, Raz Tamir <ratamir> wrote:
On Oct 21, 2017 15:22, "Raz Tamir" <ratamir> wrote:
Take a look at this: https://bugzilla.redhat.com/show_bug.cgi?id=1497511 - seems like the same issue.
Sahina, this bug has a needinfo for me. All the requested information can be found in this thread. If anything is still missing, let me know.

On Oct 21, 2017 12:00, "Elad Ben Aharon" <ebenahar> wrote:
Hi Ravi,
Thanks for looking into this. We're using glusterfs-3.8.15-2.el7.x86_64.
I tried the workaround; it doesn't solve the issue.

[root@gluster01 ~]# gluster volume set storage_local_ge2_volume_0 stat-prefetch off
volume set: success

[2017-10-21 08:52:22.358182] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-storage_local_ge2_volume_0-server: Shutting down connection storage-ge2-vdsm1.scl.lab.tlv.redhat.com-1319-2017/10/21-08:52:21:185720-storage_local_ge2_volume_0-client-0-0-0
[2017-10-21 08:52:25.157006] I [MSGID: 115029] [server-handshake.c:692:server_setvolume] 0-storage_local_ge2_volume_0-server: accepted client from storage-ge2-vdsm1.scl.lab.tlv.redhat.com-18117-2017/10/21-08:52:25:92950-storage_local_ge2_volume_0-client-0-0-0 (version: 3.12.1)
[2017-10-21 08:52:25.191877] E [MSGID: 113107] [posix.c:1084:posix_seek] 0-storage_local_ge2_volume_0-posix: seek failed on fd 19 length 196768 [No such device or address]
[2017-10-21 08:52:25.191980] E [MSGID: 115089] [server-rpc-fops.c:2007:server_seek_cbk] 0-storage_local_ge2_volume_0-server: 22: SEEK-2 (13be5b8f-8855-4dd2-b05b-c01507053bcf) ==> (No such device or address) [No such device or address]
[2017-10-21 08:52:25.197340] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-storage_local_ge2_volume_0-server: disconnecting connection from storage-ge2-vdsm1.scl.lab.tlv.redhat.com-18117-2017/10/21-08:52:25:92950-storage_local_ge2_volume_0-client-0-0-0
[2017-10-21 08:52:25.197470] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-storage_local_ge2_volume_0-server: Shutting down connection storage-ge2-vdsm1.scl.lab.tlv.redhat.com-18117-2017/10/21-08:52:25:92950-storage_local_ge2_volume_0-client-0-0-0

On Fri, Oct 20, 2017 at 12:08 PM, Ravishankar N <ravishankar> wrote:
On 10/19/2017 01:56 PM, Sahina Bose wrote:
The message indicates a quorum loss. Adding Krutika and Ravi to take a look.

On Thu, Oct 19, 2017 at 1:44 PM, Denis Chaplygin <dchaplyg> wrote:
Hello!
It looks like HW storage issues or maybe storage misconfiguration.
I found the following log lines on gluster01.scl.lab.tlv.redhat.com in gluster_volumes-storage_local_ge2_volume_0.log:

[2017-10-19 08:03:03.469689] I [MSGID: 115029] [server-handshake.c:692:server_setvolume] 0-storage_local_ge2_volume_0-server: accepted client from storage-ge2-vdsm1.scl.lab.tlv.redhat.com-6360-2017/10/19-08:03:03:385573-storage_local_ge2_volume_0-client-0-0-0 (version: 3.12.1)
[2017-10-19 08:03:03.490477] E [MSGID: 113107] [posix.c:1084:posix_seek] 0-storage_local_ge2_volume_0-posix: seek failed on fd 18 length 196768 [No such device or address]
[2017-10-19 08:03:03.490608] E [MSGID: 115089] [server-rpc-fops.c:2007:server_seek_cbk] 0-storage_local_ge2_volume_0-server: 20: SEEK-2 (7878ea49-99a8-4408-965e-1e21ae088770) ==> (No such device or address) [No such device or address]
[2017-10-19 08:03:03.495805] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-storage_local_ge2_volume_0-server: disconnecting connection from storage-ge2-vdsm1.scl.lab.tlv.redhat.com-6360-2017/10/19-08:03:03:385573-storage_local_ge2_volume_0-client-0-0-0

These messages seem to be similar to what is observed in https://bugzilla.redhat.com/show_bug.cgi?id=1482994. What version of gluster/rhgs is this? Does the work-around mentioned in the BZ doc text (i.e. gluster volume set <volname> stat-prefetch off) solve the problem?
Regards,
Ravi

On Wed, Oct 18, 2017 at 1:08 PM, Elad Ben Aharon <ebenahar> wrote:
+ Sahina and Denis

On Wed, Oct 18, 2017 at 11:57 AM, Elad Ben Aharon <ebenahar> wrote:
Hi Ala,
We're unable to start a VM with a disk on Gluster. The volume is replica 3.

storage-ge2-vdsm1.scl.lab.tlv.redhat.com:
2017-10-18 11:52:40,011+0300 DEBUG (monitor/b203db5) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a516008d6|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' b203db55-5dcf-40de-bb8b-b7a1d991c6a9 (cwd None) (commands:70)
2017-10-18 11:52:40,256+0300 DEBUG (monitor/b203db5) [storage.Misc.excCmd] SUCCESS: <err> = ''; <rc> = 0 (commands:94)
2017-10-18 11:52:40,629+0300 ERROR (vm/fede7e95) [virt.vm] (vmId='fede7e95-9ad9-4282-8cf3-240cfe17a010') The vm start process failed (vm:894)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 823, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2650, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: [2017-10-18 08:52:38.212261] E [MSGID: 108006] [afr-common.c:5001:__afr_handle_child_down_event] 0-storage_local_ge2_volume_0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.

Setup details:
storage-ge-02.scl.lab.tlv.redhat.com admin/123456
storage-ge2-vdsm{1..3}.scl.lab.tlv.redhat.com

Can you please take a look?
Thanks

On Wed, Oct 18, 2017 at 11:40 AM, Elad Ben Aharon <ebenahar> wrote:
Looking

On Wed, Oct 18, 2017 at 11:29 AM, Tareq Alayan <talayan> wrote:
On Wed, Oct 18, 2017 at 8:25 AM, Artyom Lukianov <alukiano> wrote:
I have a problem with the gluster storage under different automation runs:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.1v4-ge-runner-tier3/57 - two VMs that have disks on gluster storage dropped to the paused state. You can see ticket https://projects.engineering.redhat.com/browse/RHEVM-3638 for more information.
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.2-ge-runner-tier2/30 - VMs that have disks on gluster storage failed to start. You can see ticket https://projects.engineering.redhat.com/browse/RHEVM-3632 for more information.
So I have two questions:
Did anyone encounter other issues with our gluster storage?
Can we disable gluster storage in our automation runs until we have a solution to the problem?

I don't think it is a good idea to disable gluster. I suggest you reproduce it and work with one of the guys from the storage team to pinpoint the issue. WDYT? Elad, can you help with that?
--Tareq

Best Regards
I checked the logs in the gluster storage pool machines and the machines that are mounting the volume. The only thing that stands out at the moment is frequent connect/disconnect logs between the bricks and mount process. Adding Raghavendra G and Milind to cc list. Are there any network issues known between the servers and the client machines? Is this issue being hit consistently? -Krutika
(In reply to Krutika Dhananjay from comment #9)
> I checked the logs in the gluster storage pool machines and the machines
> that are mounting the volume. The only thing that stands out at the moment
> is frequent connect/disconnect logs between the bricks and mount process.
>
> Adding Raghavendra G and Milind to cc list.
>
> Are there any network issues known between the servers and the client
> machines?

No, we didn't experience network issues during our automation executions at the mentioned time.

> Is this issue being hit consistently?

Yes, we have been seeing this issue for the last 3-4 weeks.

> -Krutika
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
*** Bug 1525098 has been marked as a duplicate of this bug. ***
Hi Milind, We see this bug more and more. Can you please take a look?
ping Milind. Can you respond?
(In reply to Sahina Bose from comment #20)
> ping Milind. Can you respond?

I need access to the live systems for further investigation before I can comment any further.
Created attachment 1379450 [details]
logs
Is there any action needed from gluster side? I think Milind has analysed and provided the recommendations?
(In reply to Sahina Bose from comment #39)
> Is there any action needed from gluster side? I think Milind has analysed
> and provided the recommendations?

I do not recall changing the gluster server options to get to this state. See comment #38 for the existing options I saw about a month ago.
About Milind's recommendations, I think comment #37 answers that point.
Sahina, is there anything QE needs to change on their setup? How can we make progress on this BZ?
From the conversation above, I think there are 2 things:
1. Ensure that client and server gluster packages are compatible - ensure same version
2. Set transport.socket.listen-backlog in /etc/glusterfs/glusterd.vol to something higher than 151, in case SYN flooding messages are seen in /var/log/messages

Is the QE still facing issues on setup?
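For reference, a minimal sketch of point 2, assuming the stock glusterd.vol layout; the backlog value below is only an example:

# /etc/glusterfs/glusterd.vol - add the option inside the existing 'volume management' block
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    ...
    option transport.socket.listen-backlog 1024
end-volume

# check whether the SYN flooding symptom is present
grep -i "SYN flooding" /var/log/messages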
Could you please ensure the points mentioned in Comment 42 are addressed and let us know if you continue to face the issue?
(In reply to Sahina Bose from comment #42)
> From the conversation above, I think there are 2 things:
> 1. Ensure that client and server gluster packages are compatible - ensure
> same version

About the same version on client and server gluster: currently it's not the same (see details below). Sahina, why is this an issue? We had to upgrade the server to 3.12.2-1 - Kevin, do you recall why?
- Gluster storage cluster gluster01/02/03.scl.lab.tlv.redhat.com is build 3.12.2-1.
- Gluster client (VDSM hosts) is 3.8.4-54 -> this is the official gluster build for the latest 4.2 downstream (4.2.2.4-0.1 -> http://bob.eng.lab.tlv.redhat.com/builds/4.2/rhv-4.2.2-6/el7Everything/x86_64/)

> 2. Set transport.socket.listen-backlog in /etc/glusterfs/glusterd.vol to
> something higher than 151, in case SYN flooding messages are seen in
> /var/log/messages

I set 'option transport.socket.listen-backlog 152' in /etc/glusterfs/glusterd.vol on all 3 Gluster storage cluster hosts gluster01/02/03.scl.lab.tlv.redhat.com. Should I do an additional action, like a gluster restart, to load the new config?

> Is the QE still facing issues on setup?

I'll run the tests again and update.
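For reference, a minimal sketch of how the version check and the reload could be done, assuming RPM-based hosts with a systemd-managed glusterd (package names may differ per build):

# on the gluster servers
rpm -q glusterfs-server
gluster --version

# on the VDSM hosts (clients)
rpm -q glusterfs glusterfs-fuse

# glusterd.vol is only read when glusterd starts, so after editing it:
systemctl restart glusterd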
> Is the QE still facing issues on setup?

I reran all the tests where we saw this issue and it does not reproduce on the current release, 4.2.2.4-0.1, but I have a feeling we'll see it again, as it has already disappeared and reappeared since this bug was opened.
Keeping this open but moving this to 4.2.3. Please provide the setup if the issue reproduces again.
Did you run into the issue again? If not, we will close this bz and re-open it if you encounter it again.
(In reply to Sahina Bose from comment #47)
> Did you run into the issue again?

I tested on the latest 4.2.2.6-0.1 and the issue is not seen.

> If not, we will close this bz and re-open it if you encounter it again.

Agreed. If we see it again, we will re-open this issue.
Closing based on previous comment