Description of problem:

If virt-sparsify fails like this [1][2]:

2020-11-09 01:05:42,031-0500 INFO (jsonrpc/4) [api.host] FINISH getAllVmIoTunePolicies return={'status': {'code': 0, 'message': 'Done'}, 'io_tune_policies_dict': {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54)
2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmIoTunePolicies' in bridge with {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}} (__init__:360)
2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED: <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n"; <rc> = 1 (commands:98)

Then it's hard to get more information about why it failed. vdsm should then:

1. Perhaps run it again with various debugging flags/env as instructed by the error message.
2. Run libguestfs-test-tool (and make sure it can authenticate against libvirt - I just tried it manually on some host and it prompted "Please enter your authentication name:").

This should probably apply to all invocations of libguestfs tools, not only virt-sparsify. From a casual search in the code I see we also call e.g. virt-alignment-scan and virt-sysprep.

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/565/
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/565/artifact/exported-artifacts/test_logs/basic-suite-master/lago-basic-suite-master-host-1/_var_log/vdsm/vdsm.log
To add a bit of extra information that was discussed over email: when the error message matches (contains the substring) "guestfs_launch", you should have VDSM run libguestfs-test-tool and log the full output. This is the only good way to diagnose these kinds of errors. Usually they are really caused by the kernel/qemu being broken in some way, but to find out exactly how, this tool will tell you.

Just because libguestfs-test-tool failed in one way in one place does not mean that is the cause of the error. You need to run it from VDSM in exactly the same context (user ID, permissions, etc.) as the original failure. Once you have the full libguestfs-test-tool log, post it somewhere for us to look at.

It may also be useful to log the versions (e.g. rpm -q) of:
- kernel
- qemu
- supermin
- libguestfs

On some architectures like POWER these errors can also be caused by out-of-date firmware (SLOF), so logging firmware versions can be useful too.
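The advice above could be sketched on the VDSM side roughly like this (Python, since vdsm is written in Python). The helper name, the injectable runner, and the exact rpm package names are illustrative assumptions, not actual vdsm code:

```python
import subprocess

# Packages whose versions are worth logging, per the advice above.
# The exact qemu package name varies by distro (e.g. qemu-kvm on RHEL).
DIAG_PACKAGES = ["kernel", "qemu-kvm", "supermin", "libguestfs"]


def collect_guestfs_diagnostics(run=subprocess.run):
    """Run libguestfs-test-tool and 'rpm -q' for the relevant packages,
    returning a {command string: output} dict suitable for logging.

    The 'run' parameter is injectable so the function can be exercised
    in tests without the real tools installed. Crucially this runs in
    the same process context (user ID, permissions) as the failed
    libguestfs invocation, as requested above.
    """
    commands = [["libguestfs-test-tool"]]
    commands += [["rpm", "-q", pkg] for pkg in DIAG_PACKAGES]
    results = {}
    for cmd in commands:
        try:
            proc = run(cmd, capture_output=True, text=True, timeout=900)
            out = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as e:
            out = "failed to run: %s" % e
        results[" ".join(cmd)] = out
    return results
```

The dict could then be dumped into the vdsm log (or a dedicated log file) whenever a libguestfs tool fails with "guestfs_launch" in its error output.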
This fits well with our intention to write the output from libguestfs tools to a different log file in order to ease debugging. It is worth thinking about separating this out from VDSM into an ansible playbook so the log would be available on the engine side (now that we have a better infrastructure on the engine side for executing async tasks on the hosts through ansible).
(In reply to Yedidyah Bar David from comment #0)
> Description of problem:
>
> If virt-sparsify fails like this [1][2]:
>
> 2020-11-09 01:05:42,031-0500 INFO (jsonrpc/4) [api.host] FINISH
> getAllVmIoTunePolicies return={'status': {'code': 0, 'message':
> 'Done'}, 'io_tune_policies_dict':
> {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [],
> 'current_values': [{'name': 'vda', 'path':
> '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-
> a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-
> 4d16-bd37-50a9d4e14a80',
> 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0,
> 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0,
> 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54)
> 2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
> Return 'Host.getAllVmIoTunePolicies' in bridge with
> {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [],
> 'current_values': [{'name': 'vda', 'path':
> '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-
> a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-
> 4d16-bd37-50a9d4e14a80',
> 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0,
> 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0,
> 'read_iops_sec': 0}}]}} (__init__:360)
> 2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED:
> <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch
> failed.\nThis usually means the libguestfs appliance failed to start
> or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand
> run the command again. For further information, read:\n
> http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can
> also run 'libguestfs-test-tool' and post the *complete* output\ninto a
> bug report or message to the libguestfs mailing list.\n\nIf reporting
> bugs, run virt-sparsify with debugging enabled and include the
> \ncomplete output:\n\n virt-sparsify -v -x [...]\n"; <rc> = 1
> (commands:98)
>
> Then it's hard to get more information about why it failed.

Did you see this in a real environment? I think this issue happens only in OST, when we run on machines with very little memory. I don't think this is a real issue, just a side effect of using a badly configured environment for OST.

> vdsm should then:
>
> 1. Perhaps run it again with various debugging flags/env as instructed by
> the error message.

Vdsm cannot depend on text from the error message, only on a public API, like a special return code, or machine-readable output (e.g. json).

> 2. Run libguestfs-test-tool (and make sure it can authenticate against
> libvirt - now tried it manually on some host and it prompted "Please enter
> your authentication name:").

I think this should be done by libguestfs. It should provide a --debug option or environment variable that will run the test tool and give more info about the failure.

> This should probably apply to all invocations of libguestfs tools, not only
> virt-sparsify. From a casual search in the code I see we call also e.g.
> virt-alignment-scan, virt-sysprep.

Correct, this is an issue in libguestfs, not in vdsm.

> [1]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
> 565/
>
> [2]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
> 565/artifact/exported-artifacts/test_logs/basic-suite-master/lago-basic-
> suite-master-host-1/_var_log/vdsm/vdsm.log

These are logs from OST, not from a real environment. I think we are wasting our time on OST issues instead of adding real value to users of the product.
(In reply to Arik from comment #2)
> This fits well with our intention to write the output from libguestfs tools
> to a different log file in order to ease debugging.
> Worth thinking about separating this out from VDSM to an ansible playbook so
> the log would be available on the engine side (now that we a better
> infrastructure on the engine side to execute async tasks on the hosts
> through ansible)

virt-sparsify and the like are storage operations on disks. They must be performed by vdsm, where we have full control and can perform the operation in a safe and efficient way. Outsourcing storage operations to ansible is a bad idea; see the recent trouble with exporting OVAs. If that operation had been managed by vdsm we would not have had all that trouble.
Sure, I meant once the fixes for exporting OVAs are in (already posted) and we have a good enough ansible-based mechanism.
(In reply to Nir Soffer from comment #3)
> Did you see this in real environment?

No.

> I think this issue happens only in
> OST, when we run on machines with very little memory. I don't think this
> is a real issue, just a side effect of using badly configured enviroment
> for OST.

How can you be certain?

What do you suggest? Just ignore failures from OST? Perhaps stop using it altogether? Spend lots of time analyzing them before deciding they are due to a "bad env" and can be ignored? Allocate more memory? Something else?

> > vdsm should then:
> >
> > 1. Perhaps run it again with various debugging flags/env as instructed by
> > the error message.
>
> Vdsm cannot depend on text from the error message,

Indeed, but I didn't suggest that.

> only on public API, like
> a special return code, or machine readable output (e.g. json).

The error message has a link to their public API. Did you check it?

http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

> > 2. Run libguestfs-test-tool (and make sure it can authenticate against
> > libvirt - now tried it manually on some host and it prompted "Please enter
> > your authentication name:").
>
> I think this should be done by libguestfs. It should provide a --debug option
> or enviroment variable that will run the test tool and give more info about
> the failure.

So you simply suggest closing the current bug and opening one on libguestfs? Fine with me.

Personally I am not sure it matters that much.
(In reply to Yedidyah Bar David from comment #6)
> > I think this issue happens only in
> > OST, when we run on machines with very little memory. I don't think this
> > is a real issue, just a side effect of using badly configured enviroment
> > for OST.
>
> How can you be certain?
>
> What do you suggest?
>
> Just ignore failures from OST? Perhaps stop using it altogether?
>
> Spend lots of time analyzing them before deciding they are due to "bad env"
> and can be ignored?
>
> Allocate more memory?

The owner of this test should debug this issue. If they don't have time to debug this in OST, we need to remove the test since it is not stable.

> > > vdsm should then:
> > >
> > > 1. Perhaps run it again with various debugging flags/env as instructed by
> > > the error message.
> >
> > Vdsm cannot depend on text from the error message,
>
> Indeed, but I didn't suggest that.
>
> > only on public API, like
> > a special return code, or machine readable output (e.g. json).
>
> The error message has a link to their public api. Did you check it?
>
> http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

How do you know when running the libguestfs test tool is needed? What I see here is:

"as instructed by the error message."

Which implies depending on text in the error message. Public API means the tool fails with a special error code meaning that running the test tool is required, or the tool returns json output with a similar error code/string which is public API.

> > > 2. Run libguestfs-test-tool (and make sure it can authenticate against
> > > libvirt - now tried it manually on some host and it prompted "Please enter
> > > your authentication name:").
> >
> > I think this should be done by libguestfs. It should provide a --debug option
> > or enviroment variable that will run the test tool and give more info about
> > the failure.
>
> So you simply suggest to close current bug and open one on libguestfs? Fine
> with me.
>
> Personally I am not sure it matters that much. They already have their
> current API,
> which can definitely be used as-is right now usefully. If you suggest to
> enhance it,
> fine - but nothing prevents you from using what they have right now, and
> adapt to
> the refined one once it's available.

Solving this in vdsm is wrong. What about the other users of libguestfs? They would have to duplicate the code from vdsm in different languages to get the same behavior. If libguestfs handles this, all users will get it for free.
(In reply to Nir Soffer from comment #7)
> (In reply to Yedidyah Bar David from comment #6)
> > > I think this issue happens only in
> > > OST, when we run on machines with very little memory. I don't think this
> > > is a real issue, just a side effect of using badly configured enviroment
> > > for OST.
> >
> > How can you be certain?
> >
> > What do you suggest?
> >
> > Just ignore failures from OST? Perhaps stop using it altogether?
> >
> > Spend lots of time analyzing them before deciding they are due to "bad env"
> > and can be ignored?
> >
> > Allocate more memory?
>
> The owner of this test should debug this issue. If they don't have time to
> debug this in OST, we need to remove the test since it is not stable.

To debug the issue we need logs from the tool first. How are we supposed to get the logs if you suggest that storing the logs is a bad idea?

> > > > vdsm should then:
> > > >
> > > > 1. Perhaps run it again with various debugging flags/env as instructed by
> > > > the error message.
> > >
> > > Vdsm cannot depend on text from the error message,
> >
> > Indeed, but I didn't suggest that.

What's the benefit over running the tool with the debug flags (-v -x) in the first place, instead of trying to do the job twice?

> > > only on public API, like
> > > a special return code, or machine readable output (e.g. json).
> >
> > The error message has a link to their public api. Did you check it?
> >
> > http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs
>
> How do you know when running libguestfs test tool is needed? What I see
> here is:
>
> "as instructed by the error message."
>
> Which imply depending on text in the error message.
>
> Public API means the tool will fail with special error code meaning that
> running the test tool is required, or the tool return json ouput with
> similar error code/string which is public API.
>
> > > > 2. Run libguestfs-test-tool (and make sure it can authenticate against
> > > > libvirt - now tried it manually on some host and it prompted "Please enter
> > > > your authentication name:").
> > >
> > > I think this should be done by libguestfs. It should provide a --debug option
> > > or enviroment variable that will run the test tool and give more info about
> > > the failure.
> >
> > So you simply suggest to close current bug and open one on libguestfs? Fine
> > with me.
> >
> > Personally I am not sure it matters that much. They already have their
> > current API,
> > which can definitely be used as-is right now usefully. If you suggest to
> > enhance it,
> > fine - but nothing prevents you from using what they have right now, and
> > adapt to
> > the refined one once it's available.
>
> Solving this in vdsm is wrong. What about the other users of libguestfs? they
> will have to duplicate the code in vdsm in different languages to get the
> same
> behavior? If libguestfs will handle this, all users will get this for free.

What exactly should be fixed in libguestfs, and how?

I think you are misinterpreting the suggestion in the error message. Running libguestfs-test-tool is useful when debugging the issue manually, as it can help you identify whether it is a general environment issue or an issue with the virt-* tool itself. It does not give any additional value over running the virt-* tool with debug messages (-v -x).

If you are suggesting that libguestfs tools should have some `--log-file /path/to/log` option, then I agree it would be helpful, but it's not something one could not solve by piping the output to a file. And it's not something that would be added to all virt-* tools in a reasonable time-frame. Further, it is only half of the issue. We still need to manage where and how to store the logs. This is something that VDSM needs to do and not something that the tool should manage on its own.
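The "pipe the output to a file" alternative mentioned above is straightforward. A minimal Python sketch (the helper name is hypothetical, not actual vdsm code):

```python
import subprocess


def run_with_log(cmd, log_path):
    """Run 'cmd', appending both stdout and stderr to 'log_path',
    and return the exit code. This is the plain output-redirection
    approach that makes a tool-specific --log-file option unnecessary.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

For example, vdsm could call something like run_with_log(["virt-sysprep", "-v", "-x", ...], log_path) with a per-invocation file under a directory it manages.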
(In reply to Tomáš Golembiovský from comment #8)
...
> > The owner of this test should debug this issue. If they don't have time to
> > debug this in OST, we need to remove the test since it is not stable.
>
> To the debug the issue we need logs from the tool first. How are we supposed
> to get the logs if you suggest that storing the logs is a bad idea?

We can add a configuration option to collect more info in the virt-xxx logs (e.g. -v -x), and keep the info in the vdsm logs. This should be good enough for debugging the issue in OST.

> What's the benefit over running the tool with debug flags (-v -x) in the
> first place instead of trying to do the job twice?

This is what libguestfs suggests to do in the error message. I don't know if this is better than using debug messages.

...
> > Solving this in vdsm is wrong. What about the other users of libguestfs? they
> > will have to duplicate the code in vdsm in different languages to get the
> > same
> > behavior? If libguestfs will handle this, all users will get this for free.
>
> What exactly should be fixed in libguestfs and how?

Returning proper error info when the operation fails. Currently the error does not contain enough information to understand the problem.

When launching the guestfs special VM, what was the issue that caused the failure? If using libvirt, there is probably an error message from libvirt. If using the direct backend, there is error output from qemu.

> I think you are miss-interpreting the suggestion in the error message.
> Running libguestfs-test-tool is useful when debugging the issue manually as
> it can help you identify if it is general environment issue or the issue
> with a virt-* tool itself. It does not give any additional value over
> running the virt-* tool with debug messages (-v -x).

Running with debug messages should not be required to get error messages.

> If you are suggesting libguestfs tools should have some `--log-file
> /path/to/log` option then I agree it would be helpful, but it's not
> something one could not solve by piping the output to the file. And it's not
> something that would be added to all virt-* tools in a reasonable
> time-frame. Further, it is only half of the issue. We still need to manage
> where and how to store the logs. This is something that VDSM needs to do and
> not something that the tool should manage on its own.

We don't have enough data yet showing that storing virt-xxx logs is needed. We have a random failure in OST. Let's debug this instead of wasting time on unneeded infrastructure.

Can we reproduce the failures in a local OST environment, or does it happen only in the CI environment? We need to understand the problem better before we think about the solution.
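The configuration option described above could amount to optionally adding -v -x (plus the LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE environment variables the error message suggests) when building the command. A sketch; the function name and flag placement are assumptions, not actual vdsm code:

```python
import os


def guestfs_command(base_cmd, debug=False):
    """Build the argv and environment for a libguestfs tool invocation,
    optionally enabling verbose/trace debugging. Returns (cmd, env).
    """
    cmd = list(base_cmd)
    env = dict(os.environ)
    if debug:
        cmd[1:1] = ["-v", "-x"]  # insert right after the tool name
        env["LIBGUESTFS_DEBUG"] = "1"
        env["LIBGUESTFS_TRACE"] = "1"
    return cmd, env
```

The debug flag would come from vdsm configuration, so operators could turn on the extra output only while chasing a failure.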
(In reply to Nir Soffer from comment #9)
> We don't have enough data yet that storing virt-xxx logs is needed. We have
> random failure in OST. Lets debug this instead of waste time on unneeded
> infrastructure.

OK. How?

> Can we reproduce the failures in local OST environment or it happens only
> in the CI environment?

I have no idea.

> We need to understand the problem better before we
> think about the solution.

I can find 5 reports to infra@ with "test_sparsify_disk1" [1]. I only looked at 2 of them. One is the current bug, the other is [2].

The problem is: among hundreds (or thousands?) of runs in CI, 5 failed as in [1], and we have no idea why they did.

The solution I suggested (following advice from Richard): run the tool, inside vdsm, the way its documentation suggests running it if you want to be able to understand why it failed, should it fail. If in comment 0 it seemed as if I think we should run it twice, then I am sorry for that. I am all for running it just once, if this does not incur some significant overhead. I didn't test this. If it does incur significant overhead, and it's risk-free to try again, then I still stand by comment 0 - try first as now, and if it fails, try again as suggested. Other solutions are more than welcome.

To clarify, again: the problem this bug is about is not A="virt-sparsify fails". The problem is B="We do not know why". I assume that if we are able to know why, we'll also be able to solve A. I do not see how we can solve A without solving B first.

[1] https://lists.ovirt.org/archives/search?mlist=infra%40ovirt.org&q=%22test_sparsify_disk1%22
[2] https://lists.ovirt.org/archives/list/devel@ovirt.org/message/NYWXHE3WZJ5FQVYJPRWN2VHBBXGHMYNT/
(In reply to Nir Soffer from comment #9)
> (In reply to Tomáš Golembiovský from comment #8)
> ...
> > > The owner of this test should debug this issue. If they don't have time to
> > > debug this in OST, we need to remove the test since it is not stable.
> >
> > To the debug the issue we need logs from the tool first. How are we supposed
> > to get the logs if you suggest that storing the logs is a bad idea?
>
> We can add a configuration option to collect more info in virt-xxx logs
> (e.g. -v -x), and keep the info in vdsm logs. This should be good enough
> for debugging the issue in OST.

Given the size of the output it would make the logs horribly difficult to read.

> > What's the benefit over running the tool with debug flags (-v -x) in the
> > first place instead of trying to do the job twice?
>
> This is what libguestfs suggests to do in the error message. I don't know
> if this is better than using debug messages.

Ok, I see. As I mentioned, the rationale behind that suggestion applies when you're running the tool manually; it basically says "get debug logs before you approach us on the mailing list".

> ...
> > > Solving this in vdsm is wrong. What about the other users of libguestfs? they
> > > will have to duplicate the code in vdsm in different languages to get the
> > > same
> > > behavior? If libguestfs will handle this, all users will get this for free.
> >
> > What exactly should be fixed in libguestfs and how?
>
> Returning proper error info when operation fails. Currently
> the error does not contain enough information to understand the problem.
>
> When launching the guestfs special vm, what was the issue that caused
> the failure? If using libvirt, there are probably an error message from
> libvirt. If using direct backend, there is error output from qemu.

Libguestfs is a complex tool that relies on a large number of external command-line tools. Yes, for libvirt the report could be better because libguestfs uses the libvirt API, but for many other tools it does not know the exact issue, because those tools don't have a clear API. Hence the generic suggestion libguestfs reports in debug logs: "See earlier errors".

Believe me, I have been bitten by this problem many times, over and over again, so I understand what you are saying here. But I also know that this does not have an easy solution in libguestfs, and that it would be foolish for oVirt to just wait for a miracle to happen. We need to improve things on our end if we want to make our lives easier.

> > I think you are miss-interpreting the suggestion in the error message.
> > Running libguestfs-test-tool is useful when debugging the issue manually as
> > it can help you identify if it is general environment issue or the issue
> > with a virt-* tool itself. It does not give any additional value over
> > running the virt-* tool with debug messages (-v -x).
>
> Running with debug messages should not be required to get error messages.

Ideally it should not, but our situation is not ideal, as I mentioned above.

> > If you are suggesting libguestfs tools should have some `--log-file
> > /path/to/log` option then I agree it would be helpful, but it's not
> > something one could not solve by piping the output to the file. And it's not
> > something that would be added to all virt-* tools in a reasonable
> > time-frame. Further, it is only half of the issue. We still need to manage
> > where and how to store the logs. This is something that VDSM needs to do and
> > not something that the tool should manage on its own.
>
> We don't have enough data yet that storing virt-xxx logs is needed. We have
> random failure in OST. Lets debug this instead of waste time on unneeded
> infrastructure.

I see, this is another source of our misunderstanding here. The few random failures in OST are not the only reason we need the logs. Just in these two months we had bug 1887434 and bug 1860492 for virt-sysprep, and bug 1792905 and bug 1882582 for virt-sparsify, and probably a few more that I cannot find now.

The fact is we have been needing the logs for years. Those logs are an important source of information every time something fails in the workflow. It's not that we need the logs only when the tool exits with an error; we also need them in situations when the tool exits with success. It may be that there is a bug in the libguestfs tool, it may be that there is nothing wrong with the libguestfs tool at all and it's an oVirt issue, or it may be that there is an issue in the guest or in the user's environment. Not all of these cases result in a non-zero exit code. But to figure out what the issue is, we need the logs. So we either need to ask users to run things manually (which is not trivial) or try hard to reproduce the issue (only to find out that we cannot).

> Can we reproduce the failures in local OST environment or it happens only
> in the CI environment? We need to understand the problem better before we
> think about the solution.
(In reply to Yedidyah Bar David from comment #10)
> [...] If in comment 0 it
> seemed as if I think we should run it twice, then I am sorry for that. I am
> all for running it just once, if this does not incur some significant
> overhead.

There is a slight overhead (1-2 seconds) when running the tools with debug messages, but that seems negligible to me. In reality you will always be limited by the penalty incurred by managing the job in oVirt (for small disks/VMs) or by the disk size of the VM (for large disks/VMs).
Verified with:
vdsm-4.40.50.4-1.el8ev.x86_64
ovirt-engine-4.4.5.4-0.6.el8ev.noarch

Steps:
1. Create a VM with a disk
2. Sparsify the disk in the engine UI
3. Create a sealed template from the VM in the engine UI

Results:
Logs for virt-sparsify and virt-sysprep are created under /var/log/vdsm/commands:

[root@ocelot05 commands]# ll
total 4084
-rw-r-----. 1 vdsm kvm   75598 Feb  9 14:46 virtcmd-virt-sparsify-20210209T144601-xqrg.log
-rw-r-----. 1 vdsm kvm 4103210 Feb  9 14:54 virtcmd-virt-sysprep-858231cd-41ba-4547-82b7-b9ea10938212-20210209T145441-uogx.log
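For illustration, a per-invocation log file name in the shape shown above (tool name, timestamp, random suffix) can be generated like this. This is a sketch inferred from the observed file names, not the actual vdsm implementation:

```python
import random
import string
from datetime import datetime


def command_log_path(tool, log_dir="/var/log/vdsm/commands"):
    """Build a unique log file path such as
    virtcmd-virt-sparsify-20210209T144601-xqrg.log, so concurrent
    invocations of the same tool never share a log file.
    """
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    suffix = "".join(random.choices(string.ascii_lowercase, k=4))
    return "%s/virtcmd-%s-%s-%s.log" % (log_dir, tool, stamp, suffix)
```

The random suffix disambiguates two invocations started within the same second; the timestamp keeps the directory listing sorted chronologically per tool.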
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.