Bug 1539363 - [RFE] Add a mechanism for canceling an image download automatically when there's no activity after a configurable timeout is reached
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.2.2
: ---
Assignee: Idan Shaby
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-28 09:42 UTC by Avihai
Modified: 2018-04-05 04:29 UTC (History)
7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-03-29 10:56:07 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2?
rule-engine: ovirt-4.3?
rule-engine: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
engine, image_proxy, vdsm, daemon logs (3.57 MB, application/x-gzip)
2018-01-28 09:42 UTC, Avihai
no flags Details
rep1 logs including server log (3.03 MB, application/x-gzip)
2018-01-30 12:47 UTC, Avihai
no flags Details
relevant log in latest reproduction (3.47 MB, application/x-gzip)
2018-01-30 13:00 UTC, Avihai
no flags Details
engine, image_proxy, vdsm, daemon logs (2.29 MB, application/x-gzip)
2018-01-30 15:17 UTC, Avihai
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 87155 0 master MERGED core: stop image transfer when there is no activity 2020-04-08 03:11:43 UTC
oVirt gerrit 87924 0 ovirt-engine-4.2 MERGED core: rename UploadImageUiInactivityTimeoutInSeconds 2020-04-08 03:11:43 UTC
oVirt gerrit 87925 0 ovirt-engine-4.2 MERGED tools: add TransferImageClientInactivityTimeoutInSeconds to engine-config 2020-04-08 03:11:43 UTC
oVirt gerrit 87926 0 ovirt-engine-4.2 MERGED core: remove unused ctor from TransferImageParameters 2020-04-08 03:11:43 UTC
oVirt gerrit 87927 0 ovirt-engine-4.2 MERGED core: stop image transfer when there is no activity 2020-04-08 03:11:43 UTC
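
Gerrit 87925 above adds TransferImageClientInactivityTimeoutInSeconds to engine-config. On an installed engine it could presumably be inspected and tuned as sketched below; the 300-second value is an illustrative example, not a recommendation from this bug:

```shell
# Illustrative sketch: inspect and tune the client-inactivity timeout
# added by gerrit 87925. The option name comes from the patch subject;
# the value (in seconds) is an example only.
engine-config -g TransferImageClientInactivityTimeoutInSeconds   # show current value
engine-config -s TransferImageClientInactivityTimeoutInSeconds=300
systemctl restart ovirt-engine   # engine-config changes take effect after a restart
```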

Description Avihai 2018-01-28 09:42:52 UTC
Created attachment 1387136 [details]
engine, image_proxy, vdsm, daemon logs

Description of problem:
During an automation run of TestCase18268, which extends 4 disks from 1G to 2G and then starts an image download, an engine restart occurred right after the image download began (nothing in the messages log explains why this happened).

The issue here is that:
- The 4 disks remained in 'locked' status and cannot be unlocked.
- The disk downloads were NOT stopped; they appear to still be running but are not actually downloading anything.


In the engine log, we see 'Trying to release a shared lock' before the restart:
2018-01-25 20:45:12,008+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [12e1c718] Running command: TransferImageStatusComma
nd internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_DISK with role type USER
2018-01-25 20:45:12,060+02 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [12e1c718] Trying to release a shared lock for key: '87bca4be-9479-48d1-9
98e-25a49b95acf0DISK' , but lock does not exist

2018-01-25 20:45:22,635+02 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [66406a2] Trying to release a shared lock for key: '71e0bb17-55af-4bff-8d1f-431506730650DISK' , but lock does not exist

2018-01-25 20:45:22,767+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [7966d716] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[87bca4be-9479-48d1-998e-25a49b95acf0=DISK]'}'
2018-01-25 20:49:08,890+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 57) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
2018-01-25 20:49:08,931+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 57) [] The file '/etc/ovirt-engine/engine.conf' doesn't exist or isn't readable. Will return an empty set of properties.
Version-Release number of selected component (if applicable):
4.2.1.3-0.1.el7

How reproducible:
Happened once so far 

Steps to Reproduce (from TestCase18268):
1. Create extended floating disk permutations with data content.
2. Download the images.
3. Right after that, restart the engine - NOT part of the test, it just happened.

Actual results:
- The 4 disks remained in 'locked' status and cannot be unlocked.
- The disk downloads were NOT stopped; they appear to still be running but are not actually downloading anything.


Expected results:
- The 4 disks should be unlocked, or the user should be allowed to force-unlock them from the UI.
- The disk downloads should be stopped with a clear message.


Additional info:
image-proxy log:
(Thread-298) INFO 2018-01-25 20:44:57,799 auth:185:auth2:(add_signed_ticket) Adding new ticket: <Ticket id=u'64aaf3a5-09bc-41d1-a9d5-247fd8825f88', url=u'https://storage-ge8-vdsm3.scl.lab.tlv.redhat.com:54322' timeout=3597.2006919384003 at 0x7f6ad00d4950>
(Thread-298) INFO 2018-01-25 20:44:57,799 web:102:web:(log_finish) FINISH [10.35.162.7] PUT /tickets/: [200] 0 (0.00s)
(Thread-299) INFO 2018-01-25 20:44:57,897 web:95:web:(log_start) START [10.35.162.7] PUT /tickets/
(Thread-299) INFO 2018-01-25 20:44:57,899 auth:185:auth2:(add_signed_ticket) Adding new ticket: <Ticket id=u'5d2f2eb0-bb60-4749-93fc-51dd2044c173', url=u'https://storage-ge8-vdsm3.scl.lab.tlv.redhat.com:54322' timeout=3597.0999529361725 at 0x7f6ad00d4e50>
(Thread-299) INFO 2018-01-25 20:44:57,900 web:102:web:(log_finish) FINISH [10.35.162.7] PUT /tickets/: [200] 0 (0.00s)
(Thread-300) INFO 2018-01-25 20:44:58,694 web:95:web:(log_start) START [10.35.70.109] GET /images/64aaf3a5-09bc-41d1-a9d5-247fd8825f88
(Thread-300) INFO 2018-01-25 20:44:58,697 connectionpool:735:requests.packages.urllib3.connectionpool:(_new_conn) Starting new HTTPS connection (1): storage-ge8-vdsm3.scl.lab.tlv.redhat.com
(Thread-301) INFO 2018-01-25 20:44:58,772 web:95:web:(log_start) START [10.35.70.109] GET /images/5d2f2eb0-bb60-4749-93fc-51dd2044c173
(Thread-301) INFO 2018-01-25 20:44:58,775 connectionpool:735:requests.packages.urllib3.connectionpool:(_new_conn) Starting new HTTPS connection (1): storage-ge8-vdsm3.scl.lab.tlv.redhat.com
(Thread-300) INFO 2018-01-25 20:44:58,937 web:102:web:(log_finish) FINISH [10.35.70.109] GET /images/64aaf3a5-09bc-41d1-a9d5-247fd8825f88: [206] 67108864 (0.25s)
(Thread-301) INFO 2018-01-25 20:44:58,958 web:102:web:(log_finish) FINISH [10.35.70.109] GET /images/5d2f2eb0-bb60-4749-93fc-51dd2044c173: [206] 67108864 (0.19s)
(Thread-302) INFO 2018-01-25 20:45:28,094 web:95:web:(log_start) START [10.35.70.109] GET /images/6d51b0ba-1538-4bbe-a762-9332f2c37a24
(Thread-302) INFO 2018-01-25 20:45:28,099 connectionpool:735:requests.packages.urllib3.connectionpool:(_new_conn) Starting new HTTPS connection (1): storage-ge8-vdsm2.scl.lab.tlv.redhat.com
(Thread-303) INFO 2018-01-25 20:45:28,145 web:95:web:(log_start) START [10.35.70.109] GET /images/704d6d75-6978-4a92-831f-bee8b41708e3
(Thread-303) INFO 2018-01-25 20:45:28,148 connectionpool:735:requests.packages.urllib3.connectionpool:(_new_conn) Starting new HTTPS connection (1): storage-ge8-vdsm2.scl.lab.tlv.redhat.com
(Thread-302) INFO 2018-01-25 20:45:28,173 web:102:web:(log_finish) FINISH [10.35.70.109] GET /images/6d51b0ba-1538-4bbe-a762-9332f2c37a24: [206] 67108864 (0.08s)
(Thread-303) INFO 2018-01-25 20:45:28,229 web:102:web:(log_finish) FINISH [10.35.70.109] GET /images/704d6d75-6978-4a92-831f-bee8b41708e3: [206] 67108864 (0.09s)
(Thread-304) INFO 2018-01-25 20:45:28,385 web:95:web:(log_start) START [10.35.70.109] GET /images/5d2f2eb0-bb60-4749-93fc-51dd2044c173
(Thread-304) INFO 2018-01-25 20:45:28,388 connectionpool:735:requests.packages.urllib3.connectionpool:(_new_conn) Starting new HTTPS connection (1): storage-ge8-vdsm3.scl.lab.tlv.redhat.com
(Thread-304) INFO 2018-01-25 20:45:28,458 web:102:web:(log_finish) FINISH [10.35.70.109] GET /images/5d2f2eb0-bb60-4749-93fc-51dd2044c173: [206] 67108864 (0.08s)
(Thread-305) INFO 2018-01-25 20:45:28,659 web:95:web:(log_start) START [10.35.70.109] GET /images/64aaf3a5-09bc-41d1-a9d5-247fd8825f88
(Thread-305) INFO 2018-01-25 20:45:28,661 connectionpool:735:requests.packages.urllib3.connectionpool:(_new_conn) Starting new HTTPS connection (1): storage-ge8-vdsm3.scl.lab.tlv.redhat.com
(Thread-305) INFO 2018-01-25 20:45:28,764 web:102:web:(log_finish) FINISH [10.35.70.109] GET /images/64aaf3a5-09bc-41d1-a9d5-247fd8825f88: [206] 67108864 (0.10s)
(MainThread) INFO 2018-01-25 18:46:49,978 image_proxy:26:root:(main) Server started, successfully notified systemd


VDSM Host storage-ge8-vdsm3.scl.lab.tlv.redhat.com :
2018-01-25 20:45:02,302+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH get_image_ticket return={'result': {u'uuid': u'64aaf3a5-09bc-41d1-a9d5-247fd8825f88', u'ops': [u'read'], u'url': u'file:///rhev/data-center/mnt/bl
ockSD/c905b259-f450-48f9-b5b7-2febedfecbe3/images/87bca4be-9479-48d1-998e-25a49b95acf0/ad588ff1-2c87-4dc5-94ab-2eaa887e3a82', u'expires': 5710669, u'transferred': 7340032, u'timeout': 295, u'active': True, u'siz
e': 1073741824}} from=::ffff:10.35.162.7,53192, flow_id=36400b86-e679-4dc2-948c-e13c6f115e7e, task_id=4eee4499-ee7f-4cbc-8784-e61c17a86b5b (api:52)
2018-01-25 20:45:02,304+0200 DEBUG (jsonrpc/1) [storage.TaskManager.Task] (Task='4eee4499-ee7f-4cbc-8784-e61c17a86b5b') finished: {'result': {u'uuid': u'64aaf3a5-09bc-41d1-a9d5-247fd8825f88', u'ops': [u'read'], 
u'url': u'file:///rhev/data-center/mnt/blockSD/c905b259-f450-48f9-b5b7-2febedfecbe3/images/87bca4be-9479-48d1-998e-25a49b95acf0/ad588ff1-2c87-4dc5-94ab-2eaa887e3a82', u'expires': 5710669, u'transferred': 7340032
, u'timeout': 295, u'active': True, u'size': 1073741824}} (task:1201)


VDSM host 2:
2018-01-25 20:45:28,148 INFO    (Thread-394) [images] Reading 67108864 bytes at offset 67108864 from /rhev/data-center/mnt/blockSD/c905b259-f450-48f9-b5b7-2febedfecbe3/images/4c0c663a-40cc-41c7-9932-15d69574f992
/89eb6ec4-8033-45e3-bc29-da30365976d6 for ticket 6d51b0ba-1538-4bbe-a762-9332f2c37a24
2018-01-25 20:45:28,150 INFO    (Thread-394) [web] FINISH [10.35.162.7] GET /images/6d51b0ba-1538-4bbe-a762-9332f2c37a24: [206] 67108864 (0.01s)
2018-01-25 20:45:28,210 INFO    (Thread-395) [web] START [10.35.162.7] GET /images/704d6d75-6978-4a92-831f-bee8b41708e3
2018-01-25 20:45:28,212 INFO    (Thread-395) [images] Reading 67108864 bytes at offset 67108864 from /rhev/data-center/mnt/blockSD/c905b259-f450-48f9-b5b7-2febedfecbe3/images/7d32291e-fe7d-40ee-90a7-aad11d83774b/d5957984-34f4-4eef-b376-5ab71c70f06d for ticket 704d6d75-6978-4a92-831f-bee8b41708e3
2018-01-25 20:45:28,213 INFO    (Thread-395) [web] FINISH [10.35.162.7] GET /images/704d6d75-6978-4a92-831f-bee8b41708e3: [206] 67108864 (0.00s)

Comment 1 Daniel Erez 2018-01-29 10:30:50 UTC
Can you please check it again with 4.2.1.3? There's been an issue with CoCo resolved recently by http://gerrit.ovirt.org/86667

Comment 2 Avihai 2018-01-29 13:15:22 UTC
(In reply to Daniel Erez from comment #1)
> Can you please check it again with 4.2.1.3? There's been an issue with CoCo
> resolved recently by http://gerrit.ovirt.org/86667

This issue occurred on 4.2.1.3-0.1.el7 .

Comment 3 Daniel Erez 2018-01-29 14:27:46 UTC
* Can you please also attach server.log, maybe it includes a hint for the engine restart.
* Is it reproducible? I.e. can you reproduce it with a manual server restart?

Comment 4 Avihai 2018-01-29 16:18:06 UTC
(In reply to Daniel Erez from comment #3)
> * Can you please also attach server.log, maybe it includes a hint for the
> engine restart.
> * Is it reproducible? I.e. can you reproduce it with a manual server restart?

Env is rebuilt so server log cannot be extracted.
I'll rerun the same tests & try to reproduce.

Comment 5 Avihai 2018-01-30 12:27:48 UTC
(In reply to Avihai from comment #4)
> (In reply to Daniel Erez from comment #3)
> > * Can you please also attach server.log, maybe it includes a hint for the
> > engine restart.
> > * Is it reproducible? I.e. can you reproduce it with a manual server restart?
> 
> Env is rebuilt so server log cannot be extracted.
> I'll rerun the same tests & try to reproduce.

The issue reproduced again; this time the disks were not locked, but only 1 disk finished downloading and the other 3 disks are stuck in 'Downloading via API'.

Logs including server.log are attached.

More importantly, I noticed that the engine host itself rebooted, but I do not have any messages log from before the boot, so I do not know what caused it.

Latest entries from the engine log:
2018-01-29 20:43:39,817+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [9c3a74e8-59f0-4104-aa65-3a4df57b115a] START, GetImageTic
ketVDSCommand(HostName = host_mixed_1, GetImageTicketVDSCommandParameters:{hostId='189df3df-46b1-4a9c-9a9d-c85c682b80f7', ticketId='61230092-eeed-43e2-bec7-b9d6a827431e', timeout='null'}), log id: 21aa2eff
2018-01-29 20:43:39,852+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [9c3a74e8-59f0-4104-aa65-3a4df57b115a] FINISH, GetImageTi
cketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@60ac5a4a, log id: 21aa2eff

# HERE WE HAVE ~4 minutes of silence before the engine logs again:

2018-01-29 20:47:22,075+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 57) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

Comment 6 Avihai 2018-01-30 12:47:12 UTC
Created attachment 1388366 [details]
rep1 logs including server log

Comment 7 Avihai 2018-01-30 13:00:15 UTC
Created attachment 1388388 [details]
relevant log in latest reproduction

Comment 8 Avihai 2018-01-30 13:01:52 UTC
Comment on attachment 1388366 [details]
rep1 logs including server log

non relevant logs

Comment 9 Idan Shaby 2018-01-30 13:49:20 UTC
Why do we have ~ 4 minutes of silence? Did you reboot the engine there? Why did it take more than a few seconds? Is that what you were trying to check?

I started a download of a 1G image from the sdk, rebooted the engine, and the download continued in the background. Meanwhile, the engine resumed running and everything seemed to be OK.

BTW - it would help if you could clear the logs before reproducing and send only the relevant information, along with clear steps to reproduce (the time you started to download, the time you rebooted the engine, etc).
How big is the file you're trying to download?

Comment 10 Avihai 2018-01-30 14:12:53 UTC
(In reply to Idan Shaby from comment #9)
> Why do we have ~ 4 minutes of silence? Did you reboot the engine there?
> Why did it take more than a few seconds? Is that what you were trying to check?

I did not reboot the engine; the engine's Linux machine itself rebooted.
Why? I do not know, as no messages logs are available after the boot itself, and I do not see any other indication that something went wrong.

> I started a download of a 1G image from the sdk, rebooted the engine and the
> download continued in the background. In the meanwhile, the engine resumed
> to run and everything seemed to be ok.
Did you restart the engine, or reboot the engine's Linux machine itself?
Try downloading and then rebooting the machine the engine resides on, as this is closer to the scenario here.
 
> BTW - it would help if you could clear the logs before reproducing and send
> only the relevant information, along with clear steps to reproduce (the time
> you started to download, the time you rebooted the engine, etc).
I do not know what the relevant information is, as the root cause is not clear; if there were a simple ERROR/exception, then I would agree with you 100%.

The scenario is described in the bug description; the only change is what I just found out: instead of an engine restart, we are talking about a reboot of the engine's machine, and not an orderly reboot but more of a crash.

I also gave a timestamp of when the issue starts, from the engine log in comment #5.

Now that I know the machine crashed and rebooted (not just that the engine restarted), I have updated the headline.

> How big is the file you're trying to download?
4 files were downloaded simultaneously, each 1G in size (a cow/raw/virtio/virtio-scsi matrix).

Comment 11 Avihai 2018-01-30 15:16:17 UTC
I also reproduced the issue manually with a simple scenario:

1) Start a file download via the UI/SDK.
2) Reboot the engine machine (reboot -f).

After the engine is up, the state of the disk is still 'Transferring via browser', but actually nothing progresses in the download itself.


In the engine log you can see that the transfer is initiated just before the reboot:
ting transfer session for image transfer command '2caec4ad-fda3-411a-93fa-e7376c59c817'
2018-01-30 16:30:17,442+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [9e95803a-9553-4671-9b78-48e5e04b8e0e] EVENT_ID: TRANSFER_IMAGE_INITIATED(1,031), Image Download with disk bla was initiated by admin@internal-authz.
2018-01-30 16:30:17,816+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-95) [9e95803a-9553-4671-9b78-48e5e04b8e0e] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='aa06a29a-bdd4-4911-a123-6cfdf60fb50e', ticketId='f27d322b-e6ef-4f65-9864-d399d225131e', timeout='null'}), log id: 17302b46
2018-01-30 16:30:17,850+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-95) [9e95803a-9553-4671-9b78-48e5e04b8e0e] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@4bc2192, log id: 17302b46
2018-01-30 16:30:21,917+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [9e95803a-9553-4671-9b78-48e5e04b8e0e] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='aa06a29a-bdd4-4911-a123-6cfdf60fb50e', ticketId='f27d322b-e6ef-4f65-9864-d399d225131e', timeout='null'}), log id: 78218085
2018-01-30 16:30:21,945+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [9e95803a-9553-4671-9b78-48e5e04b8e0e] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@483c38e, log id: 78218085
2018-01-30 16:30:30,032+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [9e95803a-9553-4671-9b78-48e5e04b8e0e] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='aa06a29a-bdd4-4911-a123-6cfdf60fb50e', ticketId='f27d322b-e6ef-4f65-9864-d399d225131e', timeout='null'}), log id: 6dd05235
2018-01-30 16:30:30,078+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [9e95803a-9553-4671-9b78-48e5e04b8e0e] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@4130786, log id: 6dd05235
2018-01-30 16:30:40,154+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [9e95803a-9553-4671-9b78-48e5e04b8e0e] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='aa06a29a-bdd4-4911-a123-6cfdf60fb50e', ticketId='f27d322b-e6ef-4f65-9864-d399d225131e', timeout='null'}), log id: 4d56fcd0
2018-01-30 16:30:40,206+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [9e95803a-9553-4671-9b78-48e5e04b8e0e] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@3861c7c, log id: 4d56fcd0
2018-01-30 16:30:50,253+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [9e95803a-9553-4671-9b78-48e5e04b8e0e] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='aa06a29a-bdd4-4911-a123-6cfdf60fb50e', ticketId='f27d322b-e6ef-4f65-9864-d399d225131e', timeout='null'}), log id: 7ad4881b
2018-01-30 16:30:50,325+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [9e95803a-9553-4671-9b78-48e5e04b8e0e] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@2f93172, log id: 7ad4881b
2018-01-30 16:33:41,408+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 57) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
2018-01-30 16:33:41,456+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 57) [] The file '/etc/ovirt-engine/engine.conf' doesn't exist or isn't readable. Will return an empty set of properties.

In the proxy log you can see that the last thing the proxy did was before the reboot (2018-01-30 16:30:17); after the reboot nothing happens, but the status in the UI still remains 'Transferring via browser':

(Thread-1  ) INFO 2018-01-30 16:30:17,348 web:95:web:(log_start) START [10.35.162.7] PUT /tickets/
(Thread-1  ) INFO 2018-01-30 16:30:17,369 auth:185:auth2:(add_signed_ticket) Adding new ticket: <Ticket id=u'f27d322b-e6ef-4f65-9864-d399d225131e', url=u'https://storage-ge8-vdsm2.scl.lab.tlv.redhat.com:54322' timeout=3597.6306970119476 at 0x7fae7ea6f210>
(Thread-1  ) INFO 2018-01-30 16:30:17,369 web:102:web:(log_finish) FINISH [10.35.162.7] PUT /tickets/: [200] 0 (0.02s)
(MainThread) INFO 2018-01-30 14:31:50,289 image_proxy:26:root:(main) Server started, successfully notified systemd

In the VDSM host's daemon.log you can see that the transfer is still alive (PUT/GET requests still occur every few seconds, but no data is written or read):
2018-01-30 16:30:17,181 INFO    (ticket.server) [tickets] Adding ticket <Ticket active=False expires=4388870 filename=u'bla.qcow2' ops=[u'read'] size=1073741824 transferred=0 url=u'file:///rhev/data-center/mnt/b
lockSD/5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/images/d3812c79-e772-46b2-b2fb-17951dae84b4/2e7a99ca-5697-47b1-b16b-627a0397e0e3' uuid=u'f27d322b-e6ef-4f65-9864-d399d225131e' at 0x7f80ac127810>
2018-01-30 16:30:17,182 INFO    (ticket.server) [web] FINISH [/] PUT /tickets/f27d322b-e6ef-4f65-9864-d399d225131e: [200] 0 (0.00s)
2018-01-30 16:30:17,829 INFO    (ticket.server) [web] START [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:30:17,830 INFO    (ticket.server) [tickets] Retrieving ticket f27d322b-e6ef-4f65-9864-d399d225131e
................................................................................
2018-01-30 16:34:24,544 INFO    (ticket.server) [tickets] Extending ticket f27d322b-e6ef-4f65-9864-d399d225131e, new expiration in 4389117
2018-01-30 16:34:24,545 INFO    (ticket.server) [web] FINISH [/] PATCH /tickets/f27d322b-e6ef-4f65-9864-d399d225131e: [200] 0 (0.01s)
2018-01-30 16:34:24,755 INFO    (ticket.server) [web] START [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:34:24,756 INFO    (ticket.server) [tickets] Retrieving ticket f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:34:24,756 INFO    (ticket.server) [web] FINISH [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e: [200] 346 (0.00s)
2018-01-30 16:34:34,969 INFO    (ticket.server) [web] START [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:34:34,970 INFO    (ticket.server) [tickets] Retrieving ticket f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:34:34,971 INFO    (ticket.server) [web] FINISH [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e: [200] 346 (0.00s)
2018-01-30 16:34:45,127 INFO    (ticket.server) [web] START [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:34:45,128 INFO    (ticket.server) [tickets] Retrieving ticket f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 16:34:45,129 INFO    (ticket.server) [web] FINISH [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e: [200] 346 (0.00s)

............................................................................
2018-01-30 17:09:06,734 INFO    (ticket.server) [tickets] Retrieving ticket f27d322b-e6ef-4f65-9864-d399d225131e
2018-01-30 17:09:06,736 INFO    (ticket.server) [web] FINISH [/] GET /tickets/f27d322b-e6ef-4f65-9864-d399d225131e: [200] 346 (0.01s)

Comment 12 Avihai 2018-01-30 15:17:05 UTC
Created attachment 1388500 [details]
engine , image_proxy ,vdsm,deamon logs

Comment 13 Idan Shaby 2018-01-31 08:21:39 UTC
Thanks for the information!
I tried to kill the engine and proxy processes right after starting a download via the webadmin, and saw that the download paused.
When I brought the two processes up again, the disk was stuck in "Received 13 of 1024 MB", and the engine continued to monitor the status of the download.
However, I could easily cancel the download from the webadmin, so I am not sure that there's anything wrong with this scenario.
Daniel, what do you think?
Also, did you manage to download anything before you rebooted the engine's machine? I wonder why you got the status "Transferring via Browser" rather than "Received x of y MB".

Comment 14 Avihai 2018-01-31 09:01:14 UTC
(In reply to Idan Shaby from comment #13)
> Thanks for the information!
> I tried to kill the engine and proxy processes right after starting a
> download via the webadmin, and saw that the download has paused.
After a reboot, I would also expect to get into a paused state, or, even better, if the transfer cannot be resumed, to cancel it altogether.

> When I had the two processes up again, the disk was stuck in "Received 13 of
> 1024 MB", and the engine continued to monitor the status of the download.
> However, I could easily cancel the download from the webadmin, so I am not
> sure that there's anything wrong with this scenario.
I would expect that when a reboot/power outage of the Linux machine the engine sits on occurs, the download/upload will be stopped/paused on the engine (proxy) and ALSO on the VDSM host (daemon), and not remain in the same state.

If the transfer can not be resumed, cancel it altogether.

What also bothers me is that the daemon on the VDSM host keeps running the download task without actually doing anything (see the daemon logs from my latest comment).


> Daniel, what do you think?
> Also, did you manage to download anything before you rebooted the engine's
> machine? 
No. The reboot was done right at the start, before any actual download occurred.

> I wonder why you got the status "Transferring via Browser" rather
> than "Received x of y MB".
I think it's because at that point I hadn't even reached the "Received x of y MB" state.

Comment 15 Idan Shaby 2018-01-31 10:01:20 UTC
You'll get the same results if you just close the browser that downloads the file. No need to reboot the machine.
Let's wait for Daniel's thoughts about this.

(In reply to Avihai from comment #14)
> (In reply to Idan Shaby from comment #13)
> > Thanks for the information!
> > I tried to kill the engine and proxy processes right after starting a
> > download via the webadmin, and saw that the download has paused.
> After reboot, I would also expect to also get into paused or even better if
> the transfer can not be resumed, cancel it altogether.
> 
> > When I had the two processes up again, the disk was stuck in "Received 13 of
> > 1024 MB", and the engine continued to monitor the status of the download.
> > However, I could easily cancel the download from the webadmin, so I am not
> > sure that there's anything wrong with this scenario.
> I would expect that when reboot/power outage of the Linux machine the engine
> sits on occurs the download/upload will be stopped/pause on the engine
> (proxy) and ALSO on the VDSM host(daemon) & not remain in the same state.
> 
> If the transfer can not be resumed, cancel it altogether.
> 
> What also bothers me is that the daemon on the VDSM host keeps running the
> download task without actually doing anything.( see daemon logs from my
> latest comment)
> 
> 
> > Daniel, what do you think?
> > Also, did you manage to download anything before you rebooted the engine's
> > machine? 
> No . this reboot was done right at the start before any actual download
> occurred.
> 
> I wonder why you got the status "Transferring via Browser" rather
> > than "Received x of y MB".
> I think because at this time I did even reach the "Received x of y MB" state.

Makes sense, then.

Comment 16 Daniel Erez 2018-01-31 10:04:08 UTC
(In reply to Idan Shaby from comment #13)
> Thanks for the information!
> I tried to kill the engine and proxy processes right after starting a
> download via the webadmin, and saw that the download has paused.
> When I had the two processes up again, the disk was stuck in "Received 13 of
> 1024 MB", and the engine continued to monitor the status of the download.
> However, I could easily cancel the download from the webadmin, so I am not
> sure that there's anything wrong with this scenario.
> Daniel, what do you think?

So did the disk move to paused status after the engine restart, or did it just get stuck at its last progress? Is the behavior the same when using the SDK?
Since the download is fully handled by the client (sdk/webadmin), I'm not sure we have an indication to cancel the download, other than adding some timeout.
@Idan - maybe we can detect download inactivity after some amount of time, similar to upload ('UploadImageUiInactivityTimeoutInSeconds').

> Also, did you manage to download anything before you rebooted the engine's
> machine? I wonder why you got the status "Transferring via Browser" rather
> than "Received x of y MB".
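
Daniel's suggestion can be sketched in Python. The engine already polls GetImageTicketVDSCommand, and the ticket carries a 'transferred' byte counter (visible in the VDSM log above), so inactivity can be detected by noticing that the counter stops advancing. The class below is an illustrative sketch under those assumptions, not the actual ovirt-engine implementation:

```python
class InactivityWatchdog:
    """Illustrative sketch of download-inactivity detection (not the actual
    ovirt-engine code). Feed it the 'transferred' byte counter from each
    ticket poll; it reports a stall once the counter has not advanced for
    `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout          # seconds of allowed inactivity
        self.last_transferred = None    # byte counter at the last observed change
        self.last_change = None         # timestamp of the last observed change

    def update(self, transferred, now):
        # Record progress; any change in the byte counter resets the clock.
        if self.last_transferred is None or transferred != self.last_transferred:
            self.last_transferred = transferred
            self.last_change = now

    def is_stalled(self, now):
        # Stalled once `timeout` seconds pass with no change in `transferred`.
        return self.last_change is not None and now - self.last_change >= self.timeout
```

In the real engine the poll interval and timeout would come from configuration; the patches linked above add TransferImageClientInactivityTimeoutInSeconds for exactly this role.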

Comment 17 Idan Shaby 2018-01-31 12:41:02 UTC
I guess we can, for both webadmin and sdk scenarios.
Changing this bug to an RFE.
Tal, can you please retarget, if needed?

Comment 18 Tal Nisan 2018-01-31 13:09:48 UTC
Let's consider it for 4.3, although I doubt we'll get to it.

Comment 19 Avihai 2018-01-31 13:20:37 UTC
(In reply to Idan Shaby from comment #17)
> I guess we can, for both webadmin and sdk scenarios.
> Changing this bug to an RFE.

IMHO, this is a bug not an RFE.

Why?
As we see in the logs, the engine (proxy) and the hosts (daemons) keep extending/retrieving tickets without any data transfer.
This takes unnecessary CPU and memory that could be used where it's really needed.

Think of a customer who starts multiple downloads when a power failure occurs: all the downloads are still "in progress" after the machine boots up, consuming resources (memory/CPU/logging) on the engine and VDSM hosts until a manual cancellation that may never come, because the customer forgot about them.

This is not 'nice to have' but a true issue -> bug.

Please reconsider.

Comment 20 Daniel Erez 2018-01-31 15:59:41 UTC
(In reply to Avihai from comment #19)
> (In reply to Idan Shaby from comment #17)
> > I guess we can, for both webadmin and sdk scenarios.
> > Changing this bug to an RFE.
> 
> IMHO, this is a bug not an RFE.
> 
> Why?
> As we see in the logs the Engine(proxy) and the hosts(daemons) keep their
> activity extending/retrieving tickets without any data transfer.
> This take unnecessary CPU & memory that could be used somewhere that's
> really needed.
> 
> Think what the customer make multiple downloads & a power failure occurs but
> all the downloads are still "in progress" after it boot's up and takes the
> resources(memory/CPU/logging) of Engine& VDSM hosts until manual cancelation
> that may not come by the customer as he forgot about it.

It shouldn't consume more resources than any other paused transfer. As cancelling the transfers should mitigate the issue, it doesn't seem crucial.
I'll move it to 4.2.3 for reconsideration.

> 
> This is not 'nice to have' but a true issue -> bug.
> 
> Please reconsider.

Comment 21 Avihai 2018-01-31 16:10:55 UTC
(In reply to Daniel Erez from comment #20)
> (In reply to Avihai from comment #19)
> > (In reply to Idan Shaby from comment #17)
> > > I guess we can, for both webadmin and sdk scenarios.
> > > Changing this bug to an RFE.
> > 
> > IMHO, this is a bug not an RFE.
> > 
> > Why?
> > As we see in the logs the Engine(proxy) and the hosts(daemons) keep their
> > activity extending/retrieving tickets without any data transfer.
> > This take unnecessary CPU & memory that could be used somewhere that's
> > really needed.
> > 
> > Think what the customer make multiple downloads & a power failure occurs but
> > all the downloads are still "in progress" after it boot's up and takes the
> > resources(memory/CPU/logging) of Engine& VDSM hosts until manual cancelation
> > that may not come by the customer as he forgot about it.
> 
> It shouldn't consume more resources than any other paused transfer. As
> cancelling the transfers should mitigate the issue it doesn't seem crucial.
> I'll move it to 4.2.3 for reconsideration.
> 
In this case, the 'pause' is not really a transfer pause, as we can't resume; so why keep the whole proxy-daemon ticket mechanism running for a resume that will never occur?

Comment 22 Idan Shaby 2018-02-04 14:47:38 UTC
As Daniel mentioned in comment 16, the download is fully handled by the client, and therefore we don't have any indication to cancel it.
However, we can add a keep-alive-like interval to the download flow so that if there's no activity by the time the timeout is reached, the download is automatically paused. That should cover this case.
Also, as mentioned before, the user can always cancel the download if they like.

Comment 23 Idan Shaby 2018-02-05 10:52:37 UTC
Avihai, after rethinking it: as opposed to upload, we can indeed assume that if there was no activity for a while, the client (browser) must have been restarted, and there is no way to resume the download anyway.
So canceling the download (as opposed to pausing an upload) makes more sense here.
Thanks!
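
The behavior comment 23 settles on, pausing stalled uploads (resumable) but canceling stalled downloads (not resumable), can be sketched as below; the function and field names are illustrative, not taken from the ovirt-engine code:

```python
def action_on_inactivity(samples, transfer_type, timeout):
    """Illustrative sketch (names are not from the ovirt-engine code).

    samples: successive (timestamp, transferred_bytes) pairs taken from
    ticket polls. Returns None while the transfer shows progress; once no
    bytes have moved for `timeout` seconds, returns 'cancel' for downloads
    (a restarted browser cannot resume) or 'pause' for uploads (the client
    can reconnect and resume from the last offset)."""
    last_bytes = None
    last_change_ts = None
    for ts, transferred in samples:
        if last_bytes is None or transferred != last_bytes:
            last_bytes, last_change_ts = transferred, ts  # progress: reset the clock
        elif ts - last_change_ts >= timeout:
            return "cancel" if transfer_type == "download" else "pause"
    return None
```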

Comment 26 Avihai 2018-03-06 09:13:23 UTC
The rhv-4.2.2-4 release (containing ovirt-engine-4.2.2.2) is blocked; moving back to MODIFIED.

Please move back to 'ON QA' when next valid build is available.

Comment 27 Avihai 2018-03-06 14:42:37 UTC
The rhv-4.2.2-4 release was released again about ~20 hours ago and is no longer blocked; moving back to ON_QA.

Comment 28 Avihai 2018-03-07 14:51:46 UTC
Verified on ovirt-engine-4.2.2.2.

Comment 29 Sandro Bonazzola 2018-03-29 10:56:07 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

