Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1432067

Summary: After killing vdsmd during Amend locked forever + flooding of ERROR: "failed to poll the command entity"
Product: [oVirt] ovirt-engine Reporter: Avihai <aefrat>
Component: BLL.StorageAssignee: Maor <mlipchuk>
Status: CLOSED CURRENTRELEASE QA Contact: Avihai <aefrat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.1.4CC: aefrat, bugs, lveyde, tnisan
Target Milestone: ovirt-4.1.2Flags: rule-engine: ovirt-4.1+
Target Release: 4.1.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-23 08:21:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1445831    
Bug Blocks:    
Attachments:
Description Flags
engine & vdsm logs
none
db_step_query
none
db_job_query
none
older_engine_log none

Description Avihai 2017-03-14 12:48:10 UTC
Created attachment 1262927 [details]
engine & vdsm logs

Description of problem:
After killing vdsmd during Amend to disk ,
flooding of ERROR failed to poll the command entity
every ~10 sec we get this error on engine log .

I do not see any apparent issue but the flooding of the log , maybe its something worse please investigate.

It looks like we are polling a job every 10 sec forever even if it was killed.



Version-Release number of selected component (if applicable):
Engine:
ovirt-engine-4.1.1.4-0.1.el7.noarch

VDSM:
4.19.6-1

How reproducible:
100%

Steps to Reproduce:
1.Amend disk 
2.kill vdsmd 

Actual results:
Error should not be generated each 10sec but only after vdsm is killed and than stop .

Expected results:
flooding of ERROR "failed to poll the command entity"
every ~10 sec we get this error on engine log 


Additional info:
First occurance was right after vdsmd kill :
2017-03-13 11:13:40,439+02 ERROR [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler1) [6233de6b] Command AmendVolume id: '1e3393ad-b416-4066-bf44-525c6e0dacaf': failed to poll the command entity

After that each 10 seconds same ERROR appear .

Comment 1 Maor 2017-03-14 14:58:24 UTC
I see you have a disturbing sql errors in the "long" engine log indicating [1].
Are those related to this bug?
Do you have any earlier engine logs that indicate what happened with job id a57f7ec6-4085-48d9-b88b-7fdb55243146 ?


Regarding the amend error, I noticed the following error:

2017-03-13 14:22:09,309+02 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler10) [39025250] Command AmendVolume id: '59569ae6-a8a2-4990-a9b3-80538e5ff75b': couldn't get the status of job 'd84ddaee-cb01-4df9-89a3-da0d1dbacdf8'

do you still have the setup running?
If so, would you mind to share the out put of the following sql commands:
"SELECT * FROM job"
"SELECT * FROM step"




[1]
2017-03-13 03:08:02,951+02 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler6) [disks_update_2983e1ae-ddfe-4dd3] Exception: org.springframework.dao.DataIntegrity
ViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_
step_job"
  Detail: Key (job_id)=(a57f7ec6-4085-48d9-b88b-7fdb55243146) is not present in table "job".
  Where: SQL statement "INSERT INTO step (
        step_id,
        parent_step_id,
        job_id,
        step_type,
        description,
        step_number,
        status,
        progress,
        start_time,
        end_time,
        correlation_id,
        external_id,
        external_system_type,
        is_external
        )
    VALUES (
        v_step_id,
        v_parent_step_id,
        v_job_id,
        v_step_type,
        v_description,
        v_step_number,
        v_status,
        v_progress,
        v_start_time,
        v_end_time,
        v_correlation_id,
        v_external_id,
        v_external_system_type,
        v_is_external
        )"

Comment 2 Maor 2017-03-14 15:16:08 UTC
(In reply to Maor from comment #1)
> I see you have a disturbing sql errors in the "long" engine log indicating
> [1].
> Are those related to this bug?
> Do you have any earlier engine logs that indicate what happened with job id
> a57f7ec6-4085-48d9-b88b-7fdb55243146 ?
> 
> 
> Regarding the amend error, I noticed the following error:
> 
> 2017-03-13 14:22:09,309+02 INFO 
> [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler10)
> [39025250] Command AmendVolume id: '59569ae6-a8a2-4990-a9b3-80538e5ff75b':
> couldn't get the status of job 'd84ddaee-cb01-4df9-89a3-da0d1dbacdf8'
> 
> do you still have the setup running?
> If so, would you mind to share the out put of the following sql commands:
> "SELECT * FROM job"
> "SELECT * FROM step"

Also the host output of :
  "vdsm-client Host getJobs"

I will try to reproduce it myself meanwhile

> 
> 
> 
> 
> [1]
> 2017-03-13 03:08:02,951+02 ERROR
> [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler6)
> [disks_update_2983e1ae-ddfe-4dd3] Exception:
> org.springframework.dao.DataIntegrity
> ViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?,
> ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table
> "step" violates foreign key constraint "fk_
> step_job"
>   Detail: Key (job_id)=(a57f7ec6-4085-48d9-b88b-7fdb55243146) is not present
> in table "job".
>   Where: SQL statement "INSERT INTO step (
>         step_id,
>         parent_step_id,
>         job_id,
>         step_type,
>         description,
>         step_number,
>         status,
>         progress,
>         start_time,
>         end_time,
>         correlation_id,
>         external_id,
>         external_system_type,
>         is_external
>         )
>     VALUES (
>         v_step_id,
>         v_parent_step_id,
>         v_job_id,
>         v_step_type,
>         v_description,
>         v_step_number,
>         v_status,
>         v_progress,
>         v_start_time,
>         v_end_time,
>         v_correlation_id,
>         v_external_id,
>         v_external_system_type,
>         v_is_external
>         )"

Comment 3 Avihai 2017-03-14 15:47:01 UTC
(In reply to Maor from comment #1)
> I see you have a disturbing sql errors in the "long" engine log indicating
> [1].
> Are those related to this bug?
> Do you have any earlier engine logs that indicate what happened with job id
> a57f7ec6-4085-48d9-b88b-7fdb55243146 ?

No they are not related AFAIK , but you can see when it started at engine.log-20170312.gz which I attached .
> 
> 
> Regarding the amend error, I noticed the following error:
> 
> 2017-03-13 14:22:09,309+02 INFO 
> [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler10)
> [39025250] Command AmendVolume id: '59569ae6-a8a2-4990-a9b3-80538e5ff75b':
> couldn't get the status of job 'd84ddaee-cb01-4df9-89a3-da0d1dbacdf8'
> 
> do you still have the setup running?
> If so, would you mind to share the out put of the following sql commands:
> "SELECT * FROM job"
> "SELECT * FROM step"

Added 2 files with the requested data :
db_job_query.txt
db_step_query.txt

> 
> 
> [1]
> 2017-03-13 03:08:02,951+02 ERROR
> [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler6)
> [disks_update_2983e1ae-ddfe-4dd3] Exception:
> org.springframework.dao.DataIntegrity
> ViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?,
> ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table
> "step" violates foreign key constraint "fk_
> step_job"
>   Detail: Key (job_id)=(a57f7ec6-4085-48d9-b88b-7fdb55243146) is not present
> in table "job".
>   Where: SQL statement "INSERT INTO step (
>         step_id,
>         parent_step_id,
>         job_id,
>         step_type,
>         description,
>         step_number,
>         status,
>         progress,
>         start_time,
>         end_time,
>         correlation_id,
>         external_id,
>         external_system_type,
>         is_external
>         )
>     VALUES (
>         v_step_id,
>         v_parent_step_id,
>         v_job_id,
>         v_step_type,
>         v_description,
>         v_step_number,
>         v_status,
>         v_progress,
>         v_start_time,
>         v_end_time,
>         v_correlation_id,
>         v_external_id,
>         v_external_system_type,
>         v_is_external
>         )"

Comment 4 Avihai 2017-03-14 15:48:05 UTC
Created attachment 1263004 [details]
db_step_query

Comment 5 Avihai 2017-03-14 15:48:36 UTC
Created attachment 1263005 [details]
db_job_query

Comment 6 Avihai 2017-03-14 15:49:19 UTC
Created attachment 1263006 [details]
older_engine_log

Comment 7 Maor 2017-03-20 10:40:48 UTC
I've added a fence job, so once the job is not found in the VDSM it will fail the operation.
We might see several times the call for the fence job although it should be solved once the Host will be up again

Comment 8 Maor 2017-04-05 08:47:59 UTC
*** Bug 1431619 has been marked as a duplicate of this bug. ***

Comment 9 Avihai 2017-04-27 05:21:56 UTC
verified on 4.1.2