Bug 1295178
| Field | Value |
|---|---|
| Summary | Engine setup fails due to pending RestoreAllSnapshots tasks (type 224) |
| Product | [oVirt] ovirt-engine |
| Component | Setup.Engine |
| Version | 3.6.1.3 |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | unspecified |
| Reporter | Nikolai Sednev <nsednev> |
| Assignee | Yedidyah Bar David <didi> |
| QA Contact | Gonza <grafuls> |
| CC | bugs, derez, didi, lveyde, mlipchuk, nsednev, oourfali, rmartins, sbonazzo, stirabos, tnisan, ylavi |
| Flags | rule-engine: planning_ack?, rule-engine: devel_ack?, pstehlik: testing_ack+ |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| URL | https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88WVhFLTlfSHVodUU/view?usp=sharing |
| Whiteboard | integration |
| Doc Type | Bug Fix |
| Type | Bug |
| oVirt Team | Integration |
| Last Closed | 2016-01-13 12:49:46 UTC |
| Bug Depends On | 1290528 |
| Bug Blocks | 1294361 |
| Attachments | logs from the engine setup failure.tar.gz (attachment 1111119) |
Description (Nikolai Sednev, 2016-01-03 09:42:58 UTC)
Created attachment 1111119: logs from the engine setup failure.tar.gz
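For context on the failing check: engine-setup refuses to proceed while the engine database still holds pending async tasks, and here those tasks were of type 224. A minimal sketch of how to look for them, assuming the default database name ("engine") and the async_tasks table that taskcleaner.sh operates on; exact column names may differ between versions:

```bash
# Hypothetical check, run on the engine machine (not taken from the bug report):
# list pending async tasks of type 224 (RestoreAllSnapshots in VdcActionType.java).
su - postgres -c "psql engine -c \
  \"SELECT task_id, action_type, status FROM async_tasks WHERE action_type = 224;\""
```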
likely a duplicate of bug 1290528

Adding the sosreport from the engine: https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88WVhFLTlfSHVodUU/view?usp=sharing

Comment 4 (Yedidyah Bar David): Oved, verified patch [1], see description there. But this will not fix the "root problem"; it will just make engine-setup fail more nicely. What can cause these 224 lines to appear there? Since Nikolai opened this bug, I see two more lines on his system, and he claims he did nothing related to snapshots (224 is RestoreAllSnapshots in VdcActionType.java). What is that? Can we safely make taskcleaner.sh clean them? Perhaps another bug/RFE is needed for that.

(In reply to comment #4) I have no idea. Worth asking the storage team. Derez?

QE contact for this week is Maor, he will have a look.

Comment 8 (Daniel Erez, in reply to comment #4): The RestoreAllSnapshots command handles undo and commit snapshot operations. You can check whether there is a hanging task on vdsm (using 'vdsClient getAllTasks'), or, from the webadmin, see whether any snapshot is in status locked. Cleaning the tasks with taskcleaner.sh would affect the locked snapshot and fail the undo/commit operation (and it might damage the associated VM).
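To make the two checks suggested above concrete, a hedged sketch; it assumes an SSL-enabled VDSM (hence '-s') and the default "engine" database, and the snapshots query is a hypothetical illustration that should be verified against your schema:

```bash
# On each host: check for hanging VDSM tasks (command quoted from the thread).
vdsClient -s 0 getAllTasks

# On the engine: look for snapshots stuck in LOCKED status.
su - postgres -c "psql engine -c \
  \"SELECT snapshot_id, vm_id, status FROM snapshots WHERE status = 'LOCKED';\""
```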
(In reply to comment #8) Ran 'vdsClient -s 0 getAllTasks' on all of Nikolai's hosts and the output is empty for all of them. How to continue? Changing the subject for now, instead of merely closing as a duplicate of bug 1290528; the new subject assumes bug 1290528 is solved. Tal - please take over. Nikolai can provide access to the machines.

Liron, you're this week's QE contact, can you have a look please?

I've eventually succeeded in upgrading the engine, by migrating its VM to another host and then powering the engine's VM off and on. I think the engine's VM was running on a host with over 81% memory load, which badly affected the engine's VM and caused these problems. It looks more like a performance issue. I no longer have this problem on my setup.

Comment 12 (Yaniv Dary): This is by design. It is risky to clean this task and we should wait and fail on timeout.

Comment 13 (Yedidyah Bar David, in reply to comment #12): Isn't it possible that, due to some bug or other issue, such a task will remain hanging? I looked at Nikolai's setup over more than a day and some were not cleared. Also, we currently do not time out AFAIK, but retry endlessly (if the user says 'yes').

Comment 14 (Yaniv Dary, in reply to comment #13): Then we need a different bug on storage to cover that.

Comment 15 (in reply to comment #14): Very well. Nikolai - can you please open one? Thanks.

*** This bug has been marked as a duplicate of bug 1290528 ***

Comment 16 (Nikolai Sednev): It's a bit problematic to open a new bug on storage, as I can't provide a reproduction: this happened during a very complicated upgrade on our upgrade setup, and there could have been many triggers, since a snapshot operation was running in the background alongside other tasks. Should the logs provided within this bug be sufficient for opening a new one?

Comment 17 (Yedidyah Bar David, in reply to comment #16): I have no idea, I guess you'll have to simply try :-) If the storage people need more information that you can't supply, they can always close it as insufficient_data. Do you not have any other logs? engine/vdsm?

(In reply to comment #17) No other logs are available except those attached to this bug.
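For anyone triaging a similar failure, the engine/vdsm logs the thread asks about live in the usual locations for a 3.6-era installation (standard default paths, not taken from the bug):

```bash
ls /var/log/ovirt-engine/setup/     # engine-setup logs
ls /var/log/ovirt-engine/engine.log # engine log
ls /var/log/vdsm/vdsm.log           # vdsm log, on each host
```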