Bug 1895843 - vdsm does not collect enough libguestfs debug information
Summary: vdsm does not collect enough libguestfs debug information
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ovirt-4.4.5
Assignee: Tomáš Golembiovský
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-09 09:28 UTC by Yedidyah Bar David
Modified: 2021-11-04 19:28 UTC (History)
7 users

Fixed In Version: vdsm-4.40.50.1-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 15:16:19 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
pm-rhel: planning_ack+
ahadas: devel_ack+
pm-rhel: testing_ack+




Links
System        ID      Branch  Status     Summary                                           Last Updated
oVirt gerrit  112148  master  MERGED     virt: utils: add run_logging()                    2021-02-07 08:41:36 UTC
oVirt gerrit  112149  master  MERGED     virt: create logging runner for virt-* commands   2021-02-07 08:41:37 UTC
oVirt gerrit  112150  master  ABANDONED  virt: log virt-sparsify command                   2021-02-07 08:41:37 UTC

Description Yedidyah Bar David 2020-11-09 09:28:02 UTC
Description of problem:

If virt-sparsify fails like this [1][2]:

2020-11-09 01:05:42,031-0500 INFO  (jsonrpc/4) [api.host] FINISH
getAllVmIoTunePolicies return={'status': {'code': 0, 'message':
'Done'}, 'io_tune_policies_dict':
{'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [],
'current_values': [{'name': 'vda', 'path':
'/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80',
'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0,
'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0,
'read_iops_sec': 0}}]}}} from=::1,34002 (api:54)
2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
Return 'Host.getAllVmIoTunePolicies' in bridge with
{'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [],
'current_values': [{'name': 'vda', 'path':
'/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80',
'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0,
'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0,
'read_iops_sec': 0}}]}} (__init__:360)
2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED:
<err> = b"virt-sparsify: error: libguestfs error: guestfs_launch
failed.\nThis usually means the libguestfs appliance failed to start
or crashed.\nDo:\n  export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand
run the command again.  For further information, read:\n
http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can
also run 'libguestfs-test-tool' and post the *complete* output\ninto a
bug report or message to the libguestfs mailing list.\n\nIf reporting
bugs, run virt-sparsify with debugging enabled and include the
\ncomplete output:\n\n  virt-sparsify -v -x [...]\n"; <rc> = 1
(commands:98)

Then it's hard to get more information about why it failed.

vdsm should then:

1. Perhaps run it again with various debugging flags/env, as instructed by the error message.

2. Run libguestfs-test-tool (and make sure it can authenticate against libvirt; I just tried it manually on some host and it prompted "Please enter your authentication name:").

This should probably apply to all invocations of libguestfs tools, not only virt-sparsify. From a casual search in the code I see we also call e.g. virt-alignment-scan and virt-sysprep. A rough sketch of point 1 appears at the end of this comment.

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/565/

[2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/565/artifact/exported-artifacts/test_logs/basic-suite-master/lago-basic-suite-master-host-1/_var_log/vdsm/vdsm.log
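
For illustration only (this is not existing vdsm code; the helper name and the retry policy are made up), a minimal sketch of what point 1 could look like when invoking virt-sparsify from Python:

    # Hedged sketch: rerun a failed virt-sparsify with the debugging
    # flags/environment that the libguestfs error message recommends.
    # Illustrative only, not an actual vdsm helper.
    import os
    import subprocess

    def sparsify_with_debug_retry(src_disk, dst_disk):
        cmd = ["virt-sparsify", src_disk, dst_disk]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # First attempt failed: retry with libguestfs debugging enabled so
        # the output explains why guestfs_launch (or anything else) failed.
        debug_env = dict(os.environ,
                         LIBGUESTFS_DEBUG="1", LIBGUESTFS_TRACE="1")
        debug_cmd = ["virt-sparsify", "-v", "-x", src_disk, dst_disk]
        return subprocess.run(debug_cmd, capture_output=True, text=True,
                              env=debug_env)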

Comment 1 Richard W.M. Jones 2020-11-09 10:04:15 UTC
To add a bit of extra information which was discussed on email:

When the error message matches (contains the substring) "guestfs_launch"
you should have VDSM run libguestfs-test-tool and log the full output.
This is the only good way to diagnose these kinds of errors.  Usually they
are really caused by kernel/qemu being broken in some way; this tool will
tell you exactly how.

Just because libguestfs-test-tool failed in one way in one place does
not mean that is the cause of the error.  You need to run it from VDSM
in exactly the same context (user ID, permissions, etc) as the original
failure.

Once you have the full libguestfs-test-tool log, post it somewhere
for us to look at.

It may also be useful to log the versions (e.g. rpm -q) of:

 - kernel
 - qemu
 - supermin
 - libguestfs

On some architectures like POWER these errors can also be caused by
out of date firmware (SLOF), so logging firmware versions can be
useful too.
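
For illustration only, a minimal sketch of the diagnostics described above; the helper name is made up, and the exact package names (e.g. qemu-kvm) are assumptions that may differ per distribution:

    # Hedged sketch: when a virt-* command fails with a guestfs_launch error,
    # run libguestfs-test-tool in the same context and log its full output,
    # plus the versions of the packages most often involved. Illustrative only.
    import logging
    import subprocess

    log = logging.getLogger("virt.guestfs-diagnostics")

    def collect_guestfs_diagnostics(stderr_text):
        if "guestfs_launch" not in stderr_text:
            return
        # Run the test tool under the same user/permissions as the failed
        # command (i.e. from within vdsm itself) and log everything.
        test = subprocess.run(["libguestfs-test-tool"],
                              capture_output=True, text=True)
        log.error("libguestfs-test-tool rc=%d\nstdout:\n%s\nstderr:\n%s",
                  test.returncode, test.stdout, test.stderr)
        # Package names are assumptions (qemu-kvm is the RHEL/CentOS name).
        for pkg in ("kernel", "qemu-kvm", "supermin", "libguestfs"):
            ver = subprocess.run(["rpm", "-q", pkg],
                                 capture_output=True, text=True)
            log.error("%s", ver.stdout.strip() or ver.stderr.strip())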

Comment 2 Arik 2020-11-10 07:51:03 UTC
This fits well with our intention to write the output from libguestfs tools to a different log file in order to ease debugging.
It is worth thinking about separating this out from VDSM into an ansible playbook so the log would be available on the engine side (now that we have better infrastructure on the engine side to execute async tasks on the hosts through ansible).

Comment 3 Nir Soffer 2020-11-12 14:41:09 UTC
(In reply to Yedidyah Bar David from comment #0)
> Description of problem:
> 
> If virt-sparsify fails like this [1][2]:
> 
> 2020-11-09 01:05:42,031-0500 INFO  (jsonrpc/4) [api.host] FINISH
> getAllVmIoTunePolicies return={'status': {'code': 0, 'message':
> 'Done'}, 'io_tune_policies_dict':
> {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [],
> 'current_values': [{'name': 'vda', 'path':
> '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-
> a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-
> 4d16-bd37-50a9d4e14a80',
> 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0,
> 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0,
> 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54)
> 2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
> Return 'Host.getAllVmIoTunePolicies' in bridge with
> {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [],
> 'current_values': [{'name': 'vda', 'path':
> '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-
> a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-
> 4d16-bd37-50a9d4e14a80',
> 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0,
> 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0,
> 'read_iops_sec': 0}}]}} (__init__:360)
> 2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED:
> <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch
> failed.\nThis usually means the libguestfs appliance failed to start
> or crashed.\nDo:\n  export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand
> run the command again.  For further information, read:\n
> http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can
> also run 'libguestfs-test-tool' and post the *complete* output\ninto a
> bug report or message to the libguestfs mailing list.\n\nIf reporting
> bugs, run virt-sparsify with debugging enabled and include the
> \ncomplete output:\n\n  virt-sparsify -v -x [...]\n"; <rc> = 1
> (commands:98)
> 
> Then it's hard to get more information about why it failed.

Did you see this in a real environment? I think this issue happens only in
OST, when we run on machines with very little memory. I don't think this
is a real issue, just a side effect of the badly configured environment
used for OST.

> vdsm should then:
> 
> 1. Perhaps run it again with various debugging flags/env as instructed by
> the error message.

Vdsm cannot depend on text from the error message, only on a public API, like
a special return code or machine-readable output (e.g. JSON).
 
> 2. Run libguestfs-test-tool (and make sure it can authenticate against
> libvirt - now tried it manually on some host and it prompted "Please enter
> your authentication name:").

I think this should be done by libguestfs. It should provide a --debug option
or environment variable that runs the test tool and gives more info about
the failure.

> This should probably apply to all invocations of libguestfs tools, not only
> virt-sparsify. From a casual search in the code I see we call also e.g.
> virt-alignment-scan, virt-sysprep.

Correct, this is an issue in libguestfs, not in vdsm.

> [1]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
> 565/
> 
> [2]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
> 565/artifact/exported-artifacts/test_logs/basic-suite-master/lago-basic-
> suite-master-host-1/_var_log/vdsm/vdsm.log

These are logs from OST, not from a real environment. I think we are wasting
our time on OST issues instead of adding real value for users of the product.

Comment 4 Nir Soffer 2020-11-12 14:46:00 UTC
(In reply to Arik from comment #2)
> This fits well with our intention to write the output from libguestfs tools
> to a different log file in order to ease debugging.
> Worth thinking about separating this out from VDSM to an ansible playbook so
> the log would be available on the engine side (now that we a better
> infrastructure on the engine side to execute async tasks on the hosts
> through ansible)

virt-sparsify and the like are storage operations on disks. They must be
performed by vdsm, where we have full control and can perform the operation
in a safe and efficient way.

Outsourcing storage operations to ansible is a bad idea; see the recent trouble
with exporting OVAs. If that operation were managed by vdsm, we would not have
had all that trouble.

Comment 5 Arik 2020-11-12 14:59:00 UTC
Sure, I meant once the fixes for exporting OVAs are in (already posted) and we have a good enough ansible-based mechanism.

Comment 6 Yedidyah Bar David 2020-11-12 15:15:49 UTC
(In reply to Nir Soffer from comment #3)
> Did you see this in real environment?

No.

> I think this issue happens only in
> OST, when we run on machines with very little memory. I don't think this
> is a real issue, just a side effect of using badly configured enviroment
> for OST.

How can you be certain?

What do you suggest?

Just ignore failures from OST? Perhaps stop using it altogether?

Spend lots of time analyzing them before deciding they are due to "bad env" and can be ignored?

Allocate more memory?

Something else?

> 
> > vdsm should then:
> > 
> > 1. Perhaps run it again with various debugging flags/env as instructed by
> > the error message.
> 
> Vdsm cannot depend on text from the error message,

Indeed, but I didn't suggest that.

> only on public API, like
> a special return code, or machine readable output (e.g. json).

The error message has a link to their public API. Did you check it?

http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

>  
> > 2. Run libguestfs-test-tool (and make sure it can authenticate against
> > libvirt - now tried it manually on some host and it prompted "Please enter
> > your authentication name:").
> 
> I think this should be done by libguestfs. It should provide a --debug option
> or enviroment variable that will run the test tool and give more info about
> the failure.

So you simply suggest closing the current bug and opening one on libguestfs? Fine with me.

Personally I am not sure it matters that much. They already have their current API,
which can definitely be used as-is right now. If you suggest enhancing it,
fine - but nothing prevents you from using what they have right now, and adapting to
the refined one once it's available.

Comment 7 Nir Soffer 2020-11-12 16:41:50 UTC
(In reply to Yedidyah Bar David from comment #6)
> > I think this issue happens only in
> > OST, when we run on machines with very little memory. I don't think this
> > is a real issue, just a side effect of using badly configured enviroment
> > for OST.
> 
> How can you be certain?
> 
> What do you suggest?
> 
> Just ignore failures from OST? Perhaps stop using it altogether?
> 
> Spend lots of time analyzing them before deciding they are due to "bad env"
> and can be ignored?
> 
> Allocate more memory?

The owner of this test should debug this issue. If they don't have time to 
debug this in OST, we need to remove the test since it is not stable.

> > > vdsm should then:
> > > 
> > > 1. Perhaps run it again with various debugging flags/env as instructed by
> > > the error message.
> > 
> > Vdsm cannot depend on text from the error message,
> 
> Indeed, but I didn't suggest that.
> 
> > only on public API, like
> > a special return code, or machine readable output (e.g. json).
> 
> The error message has a link to their public api. Did you check it?
> 
> http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

How do you know when running the libguestfs test tool is needed? What I see
here is:

    "as instructed by the error message."

which implies depending on text in the error message.

A public API means the tool fails with a special error code meaning that
running the test tool is required, or the tool returns JSON output with a
similar error code/string that is part of the public API.

> > > 2. Run libguestfs-test-tool (and make sure it can authenticate against
> > > libvirt - now tried it manually on some host and it prompted "Please enter
> > > your authentication name:").
> > 
> > I think this should be done by libguestfs. It should provide a --debug option
> > or enviroment variable that will run the test tool and give more info about
> > the failure.
> 
> So you simply suggest to close current bug and open one on libguestfs? Fine
> with me.
> 
> Personally I am not sure it matters that much. They already have their
> current API,
> which can definitely be used as-is right now usefully. If you suggest to
> enhance it,
> fine - but nothing prevents you from using what they have right now, and
> adapt to
> the refined one once it's available.

Solving this in vdsm is wrong. What about the other users of libguestfs? They
would have to duplicate the vdsm code in different languages to get the same
behavior. If libguestfs handles this, all users get it for free.

Comment 8 Tomáš Golembiovský 2020-11-13 10:38:07 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Yedidyah Bar David from comment #6)
> > > I think this issue happens only in
> > > OST, when we run on machines with very little memory. I don't think this
> > > is a real issue, just a side effect of using badly configured enviroment
> > > for OST.
> > 
> > How can you be certain?
> > 
> > What do you suggest?
> > 
> > Just ignore failures from OST? Perhaps stop using it altogether?
> > 
> > Spend lots of time analyzing them before deciding they are due to "bad env"
> > and can be ignored?
> > 
> > Allocate more memory?
> 
> The owner of this test should debug this issue. If they don't have time to 
> debug this in OST, we need to remove the test since it is not stable.

To debug the issue we need logs from the tool first. How are we supposed to get the logs if you suggest that storing the logs is a bad idea?

> 
> > > > vdsm should then:
> > > > 
> > > > 1. Perhaps run it again with various debugging flags/env as instructed by
> > > > the error message.
> > > 
> > > Vdsm cannot depend on text from the error message,
> > 
> > Indeed, but I didn't suggest that.

What's the benefit over running the tool with debug flags (-v -x) in the first place instead of trying to do the job twice?

> > 
> > > only on public API, like
> > > a special return code, or machine readable output (e.g. json).
> > 
> > The error message has a link to their public api. Did you check it?
> > 
> > http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs
> 
> How do you know when running libguestfs test tool is needed? What I see
> here is:
> 
>     "as instructed by the error message."
> 
> Which imply depending on text in the error message.
> 
> Public API means the tool will fail with special error code meaning that
> running the test tool is required, or the tool return json ouput with
> similar error code/string which is public API.
> 
> > > > 2. Run libguestfs-test-tool (and make sure it can authenticate against
> > > > libvirt - now tried it manually on some host and it prompted "Please enter
> > > > your authentication name:").
> > > 
> > > I think this should be done by libguestfs. It should provide a --debug option
> > > or enviroment variable that will run the test tool and give more info about
> > > the failure.
> > 
> > So you simply suggest to close current bug and open one on libguestfs? Fine
> > with me.
> > 
> > Personally I am not sure it matters that much. They already have their
> > current API,
> > which can definitely be used as-is right now usefully. If you suggest to
> > enhance it,
> > fine - but nothing prevents you from using what they have right now, and
> > adapt to
> > the refined one once it's available.
> 
> Solving this in vdsm is wrong. What about the other users of libguestfs? they
> will have to duplicate the code in vdsm in different languages to get the
> same
> behavior? If libguestfs will handle this, all users will get this for free.

What exactly should be fixed in libguestfs and how?

I think you are misinterpreting the suggestion in the error message. Running libguestfs-test-tool is useful when debugging the issue manually, as it can help you identify whether it is a general environment issue or an issue with the virt-* tool itself. It does not give any additional value over running the virt-* tool with debug messages (-v -x).

If you are suggesting libguestfs tools should have some `--log-file /path/to/log` option, then I agree it would be helpful, but it's nothing one could not solve by piping the output to a file. And it's not something that would be added to all virt-* tools in a reasonable time frame. Further, it is only half of the issue: we still need to manage where and how to store the logs. That is something VDSM needs to do, not something the tool should manage on its own.
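
For illustration only, a minimal sketch of the "pipe the output to a file" approach under VDSM's control; the directory and naming scheme are assumptions, not the actual implementation:

    # Hedged sketch: run a virt-* tool with debug flags and store its debug
    # output (stderr) in a dedicated per-command log file instead of mixing
    # it into vdsm.log. Directory and naming scheme are illustrative; the
    # directory is assumed to already exist.
    import os
    import subprocess
    import time

    LOG_DIR = "/var/log/vdsm/commands"  # assumed location for per-command logs

    def run_virt_command_logged(cmd):
        timestamp = time.strftime("%Y%m%dT%H%M%S")
        log_path = os.path.join(LOG_DIR, "%s-%s.log" % (cmd[0], timestamp))
        debug_cmd = [cmd[0], "-v", "-x"] + cmd[1:]
        with open(log_path, "w") as log_file:
            # The tool's stdout is still captured for the caller; the verbose
            # debug stream goes to the dedicated log file.
            result = subprocess.run(debug_cmd, stdout=subprocess.PIPE,
                                    stderr=log_file, text=True)
        return result, log_path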

Comment 9 Nir Soffer 2020-11-15 14:04:38 UTC
(In reply to Tomáš Golembiovský from comment #8)
...
> > The owner of this test should debug this issue. If they don't have time to 
> > debug this in OST, we need to remove the test since it is not stable.
> 
> To the debug the issue we need logs from the tool first. How are we supposed
> to get the logs if you suggest that storing the logs is a bad idea?

We can add a configuration option to collect more info in virt-xxx logs
(e.g. -v -x), and keep the info in vdsm logs. This should be good enough
for debugging the issue in OST.
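
For illustration only, a sketch of what such a configuration option could look like; the [virt] section, option name, and config path are assumptions, not existing vdsm settings:

    # Hedged sketch: gate the extra debug flags (-v -x) behind a config option
    # so verbose virt-* logging can be enabled while debugging OST failures.
    # The [virt] section and guestfs_debug option are hypothetical.
    from configparser import ConfigParser

    def guestfs_debug_enabled(conf_path="/etc/vdsm/vdsm.conf"):
        cfg = ConfigParser()
        cfg.read(conf_path)
        return cfg.getboolean("virt", "guestfs_debug", fallback=False)

    def build_virt_command(base_cmd):
        # Insert the debug flags only when the operator enabled them.
        if guestfs_debug_enabled():
            return [base_cmd[0], "-v", "-x"] + base_cmd[1:]
        return base_cmd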

> What's the benefit over running the tool with debug flags (-v -x) in the
> first place instead of trying to do the job twice?

This is what libguestfs suggests doing in the error message. I don't know
whether this is better than using debug messages.

...
> > Solving this in vdsm is wrong. What about the other users of libguestfs? they
> > will have to duplicate the code in vdsm in different languages to get the
> > same
> > behavior? If libguestfs will handle this, all users will get this for free.
> 
> What exactly should be fixed in libguestfs and how?

Returning proper error info when the operation fails. Currently
the error does not contain enough information to understand the problem.

When launching the guestfs appliance VM, what was the issue that caused
the failure? If using libvirt, there is probably an error message from
libvirt; if using the direct backend, there is error output from qemu.

> I think you are miss-interpreting the suggestion in the error message.
> Running libguestfs-test-tool is useful when debugging the issue manually as
> it can help you identify if it is general environment issue or the issue
> with a virt-* tool itself. It does not give any additional value over
> running the virt-* tool with debug messages (-v -x).

Running with debug messages should not be required to get error messages.

> If you are suggesting libguestfs tools should have some `--log-file
> /path/to/log` option then I agree it would be helpful, but it's not
> something one could not solve by piping the output to the file. And it's not
> something that would be added to all virt-* tools in a reasonable
> time-frame. Further, it is only half of the issue. We still need to manage
> where and how to store the logs. This is something that VDSM needs to do and
> not something that the tool should manage on its own.

We don't have enough data yet to show that storing virt-xxx logs is needed. We
have a random failure in OST. Let's debug this instead of wasting time on
unneeded infrastructure.

Can we reproduce the failures in a local OST environment, or does it happen
only in the CI environment? We need to understand the problem better before we
think about the solution.

Comment 10 Yedidyah Bar David 2020-11-15 15:16:54 UTC
(In reply to Nir Soffer from comment #9)
> We don't have enough data yet that storing virt-xxx logs is needed. We have
> random failure in OST. Lets debug this instead of waste time on unneeded
> infrastructure.

OK. How?

> 
> Can we reproduce the failures in local OST environment or it happens only
> in the CI environment?

I have no idea.

> We need to understand the problem better before we
> think about the solution.

I can find 5 reports to infra@ with "test_sparsify_disk1" [1]. I only looked at 2 of them: one is the current bug, the other is [2].

The problem is: Among hundreds (or thousands?) of runs in CI, 5 failed as in [1], and we have no idea why they did.

The solution I suggested (following Richard's advice): run the tool, inside vdsm, the way its documentation suggests running it, so that if it fails we are able to understand why. If comment 0 read as if I think we should run it twice, then I am sorry for that. I am all for running it just once, if that does not incur significant overhead; I didn't test this. If it does incur significant overhead, and it's riskless to try again, then I still stand by comment 0: try first as now, and if it fails, try again as suggested.

Other solutions are more than welcome.

To clarify, again: the problem this bug is about is not A="virt-sparsify fails". The problem is B="we do not know why". I assume that once we know why, we will also be able to solve A. I do not see how we can solve A without solving B first.

[1] https://lists.ovirt.org/archives/search?mlist=infra%40ovirt.org&q=%22test_sparsify_disk1%22

[2] https://lists.ovirt.org/archives/list/devel@ovirt.org/message/NYWXHE3WZJ5FQVYJPRWN2VHBBXGHMYNT/

Comment 11 Tomáš Golembiovský 2020-11-18 13:10:48 UTC
(In reply to Nir Soffer from comment #9)
> (In reply to Tomáš Golembiovský from comment #8)
> ...
> > > The owner of this test should debug this issue. If they don't have time to 
> > > debug this in OST, we need to remove the test since it is not stable.
> > 
> > To the debug the issue we need logs from the tool first. How are we supposed
> > to get the logs if you suggest that storing the logs is a bad idea?
> 
> We can add a configuration option to collect more info in virt-xxx logs
> (e.g. -v -x), and keep the info in vdsm logs. This should be good enough
> for debugging the issue in OST.

Given the size of the output, it would make the logs horribly difficult to read.

> 
> > What's the benefit over running the tool with debug flags (-v -x) in the
> > first place instead of trying to do the job twice?
> 
> This is what libguestfs suggests to do in the error message. I don't know
> if this is better than using debug messages.

OK, I see. As I mentioned, the rationale behind it applies when you're running the tool manually, and it basically says "get debug logs before you approach us on the mailing list".


> 
> ...
> > > Solving this in vdsm is wrong. What about the other users of libguestfs? they
> > > will have to duplicate the code in vdsm in different languages to get the
> > > same
> > > behavior? If libguestfs will handle this, all users will get this for free.
> > 
> > What exactly should be fixed in libguestfs and how?
> 
> Returning proper error info when operation fails. Currently
> the error does not contain enough information to understand the problem.
> 
> When launching the guestfs special vm, what was the issue that caused 
> the failure? If using libvirt, there are probably an error message from
> libvirt. If using direct backend, there is error output from qemu.

Libguestfs is a complex tool that relies on a large number of external command-line tools. Yes, for libvirt the report could be better because libguestfs uses the libvirt API, but for many other tools it does not know the exact issue, because those tools don't have a clear API. Hence the generic suggestion libguestfs reports in debug logs: "See earlier errors".

Believe me, I have been bitten by this problem many times, over and over again, so I understand what you are saying here. But I also know that this does not have an easy solution in libguestfs, and that it would be foolish for oVirt to just wait for a miracle to happen. We need to improve things on our end if we want to make our lives easier.


> 
> > I think you are miss-interpreting the suggestion in the error message.
> > Running libguestfs-test-tool is useful when debugging the issue manually as
> > it can help you identify if it is general environment issue or the issue
> > with a virt-* tool itself. It does not give any additional value over
> > running the virt-* tool with debug messages (-v -x).
> 
> Running with debug messages should not be required to get error messages.

Ideally it should not, but our situation is not ideal as I mentioned above.

> 
> > If you are suggesting libguestfs tools should have some `--log-file
> > /path/to/log` option then I agree it would be helpful, but it's not
> > something one could not solve by piping the output to the file. And it's not
> > something that would be added to all virt-* tools in a reasonable
> > time-frame. Further, it is only half of the issue. We still need to manage
> > where and how to store the logs. This is something that VDSM needs to do and
> > not something that the tool should manage on its own.
> 
> We don't have enough data yet that storing virt-xxx logs is needed. We have
> random failure in OST. Lets debug this instead of waste time on unneeded
> infrastructure.

I see, this is another source of our misunderstanding here. The few random failures in OST are not the only reason we need the logs.
Just in the last two months we had bug 1887434 and bug 1860492 for virt-sysprep, bug 1792905 and bug 1882582 for virt-sparsify, and probably a few more that I cannot find now. The fact is we have needed the logs for years. Those logs are an important source of information every time something fails in the workflow. It's not that we need the logs only when the tool exits with an error; we also need them in situations where the tools exit with success. It may be that there is a bug in the libguestfs tool, it may be that there is nothing wrong with the libguestfs tool at all and it's an oVirt issue, or it may be that there is an issue in the guest or in the user's environment. Not all of these cases result in a non-zero exit code, but to figure out what the issue is we require the logs. So we either need to ask users to run things manually (which is not trivial) or try hard to reproduce the issue (only to find out that we cannot).


> Can we reproduce the failures in local OST environment or it happens only
> in the CI environment? We need to understand the problem better before we
> think about the solution.

Comment 12 Tomáš Golembiovský 2020-11-18 13:14:01 UTC
(In reply to Yedidyah Bar David from comment #10)
 
> [...] If in comment 0 it
> seemed as if I think we should run it twice, then I am sorry for that. I am
> all for running it just once, if this does not incur some significant
> overhead.

There is a slight overhead (1-2 seconds) when running the tools with debug messages, but that seems negligible to me. In reality you will always be limited by the penalty incurred by managing the job in oVirt (for small disks/VMs) or by the disk size of the VM (for large disks/VMs).

Comment 13 Qin Yuan 2021-02-09 13:59:09 UTC
Verified with:
vdsm-4.40.50.4-1.el8ev.x86_64
ovirt-engine-4.4.5.4-0.6.el8ev.noarch

Steps:
1. Create a VM with disk
2. Sparsify the disk on engine UI
3. Create a sealed template from the VM on engine UI

Results:
Logs for virt-sparsify and virt-sysprep are created under /var/log/vdsm/commands:

[root@ocelot05 commands]# ll
total 4084
-rw-r-----. 1 vdsm kvm   75598 Feb  9 14:46 virtcmd-virt-sparsify-20210209T144601-xqrg.log
-rw-r-----. 1 vdsm kvm 4103210 Feb  9 14:54 virtcmd-virt-sysprep-858231cd-41ba-4547-82b7-b9ea10938212-20210209T145441-uogx.log
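
The file names above follow a pattern of prefix, tool name, optional entity id, timestamp, and random suffix. For illustration only, a sketch of how such names could be generated; the actual vdsm implementation may differ:

    # Hedged sketch: build per-command log names matching the pattern observed
    # above (virtcmd-<tool>[-<id>]-<timestamp>-<suffix>.log). Illustrative
    # only; the real vdsm code may construct these differently.
    import os
    import random
    import string
    import time

    COMMANDS_LOG_DIR = "/var/log/vdsm/commands"

    def command_log_path(tool, entity_id=None):
        timestamp = time.strftime("%Y%m%dT%H%M%S")
        suffix = "".join(random.choices(string.ascii_lowercase, k=4))
        parts = ["virtcmd", tool]
        if entity_id:
            parts.append(entity_id)
        parts.append(timestamp)
        return os.path.join(COMMANDS_LOG_DIR,
                            "-".join(parts) + "-" + suffix + ".log")

    # e.g. command_log_path("virt-sparsify")
    # -> /var/log/vdsm/commands/virtcmd-virt-sparsify-<timestamp>-<suffix>.log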

Comment 14 Sandro Bonazzola 2021-03-18 15:16:19 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

