Bug 1432067
| Summary: | After killing vdsmd during Amend locked forever + flooding of ERROR: "failed to poll the command entity" | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Avihai <aefrat> | ||||||||||
| Component: | BLL.Storage | Assignee: | Maor <mlipchuk> | ||||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Avihai <aefrat> | ||||||||||
| Severity: | high | Docs Contact: | |||||||||||
| Priority: | unspecified | ||||||||||||
| Version: | 4.1.1.4 | CC: | aefrat, bugs, lveyde, tnisan | ||||||||||
| Target Milestone: | ovirt-4.1.2 | Flags: | rule-engine:
ovirt-4.1+
|
||||||||||
| Target Release: | 4.1.2 | ||||||||||||
| Hardware: | Unspecified | ||||||||||||
| OS: | Unspecified | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2017-05-23 08:21:45 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Bug Depends On: | 1445831 | ||||||||||||
| Bug Blocks: | |||||||||||||
| Attachments: |
|
||||||||||||
I see you have a disturbing sql errors in the "long" engine log indicating [1].
Are those related to this bug?
Do you have any earlier engine logs that indicate what happened with job id a57f7ec6-4085-48d9-b88b-7fdb55243146 ?
Regarding the amend error, I noticed the following error:
2017-03-13 14:22:09,309+02 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler10) [39025250] Command AmendVolume id: '59569ae6-a8a2-4990-a9b3-80538e5ff75b': couldn't get the status of job 'd84ddaee-cb01-4df9-89a3-da0d1dbacdf8'
do you still have the setup running?
If so, would you mind to share the out put of the following sql commands:
"SELECT * FROM job"
"SELECT * FROM step"
[1]
2017-03-13 03:08:02,951+02 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler6) [disks_update_2983e1ae-ddfe-4dd3] Exception: org.springframework.dao.DataIntegrity
ViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_
step_job"
Detail: Key (job_id)=(a57f7ec6-4085-48d9-b88b-7fdb55243146) is not present in table "job".
Where: SQL statement "INSERT INTO step (
step_id,
parent_step_id,
job_id,
step_type,
description,
step_number,
status,
progress,
start_time,
end_time,
correlation_id,
external_id,
external_system_type,
is_external
)
VALUES (
v_step_id,
v_parent_step_id,
v_job_id,
v_step_type,
v_description,
v_step_number,
v_status,
v_progress,
v_start_time,
v_end_time,
v_correlation_id,
v_external_id,
v_external_system_type,
v_is_external
)"
(In reply to Maor from comment #1) > I see you have a disturbing sql errors in the "long" engine log indicating > [1]. > Are those related to this bug? > Do you have any earlier engine logs that indicate what happened with job id > a57f7ec6-4085-48d9-b88b-7fdb55243146 ? > > > Regarding the amend error, I noticed the following error: > > 2017-03-13 14:22:09,309+02 INFO > [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler10) > [39025250] Command AmendVolume id: '59569ae6-a8a2-4990-a9b3-80538e5ff75b': > couldn't get the status of job 'd84ddaee-cb01-4df9-89a3-da0d1dbacdf8' > > do you still have the setup running? > If so, would you mind to share the out put of the following sql commands: > "SELECT * FROM job" > "SELECT * FROM step" Also the host output of : "vdsm-client Host getJobs" I will try to reproduce it myself meanwhile > > > > > [1] > 2017-03-13 03:08:02,951+02 ERROR > [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler6) > [disks_update_2983e1ae-ddfe-4dd3] Exception: > org.springframework.dao.DataIntegrity > ViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?, > ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table > "step" violates foreign key constraint "fk_ > step_job" > Detail: Key (job_id)=(a57f7ec6-4085-48d9-b88b-7fdb55243146) is not present > in table "job". > Where: SQL statement "INSERT INTO step ( > step_id, > parent_step_id, > job_id, > step_type, > description, > step_number, > status, > progress, > start_time, > end_time, > correlation_id, > external_id, > external_system_type, > is_external > ) > VALUES ( > v_step_id, > v_parent_step_id, > v_job_id, > v_step_type, > v_description, > v_step_number, > v_status, > v_progress, > v_start_time, > v_end_time, > v_correlation_id, > v_external_id, > v_external_system_type, > v_is_external > )" (In reply to Maor from comment #1) > I see you have a disturbing sql errors in the "long" engine log indicating > [1]. > Are those related to this bug? > Do you have any earlier engine logs that indicate what happened with job id > a57f7ec6-4085-48d9-b88b-7fdb55243146 ? No they are not related AFAIK , but you can see when it started at engine.log-20170312.gz which I attached . > > > Regarding the amend error, I noticed the following error: > > 2017-03-13 14:22:09,309+02 INFO > [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler10) > [39025250] Command AmendVolume id: '59569ae6-a8a2-4990-a9b3-80538e5ff75b': > couldn't get the status of job 'd84ddaee-cb01-4df9-89a3-da0d1dbacdf8' > > do you still have the setup running? > If so, would you mind to share the out put of the following sql commands: > "SELECT * FROM job" > "SELECT * FROM step" Added 2 files with the requested data : db_job_query.txt db_step_query.txt > > > [1] > 2017-03-13 03:08:02,951+02 ERROR > [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler6) > [disks_update_2983e1ae-ddfe-4dd3] Exception: > org.springframework.dao.DataIntegrity > ViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?, > ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table > "step" violates foreign key constraint "fk_ > step_job" > Detail: Key (job_id)=(a57f7ec6-4085-48d9-b88b-7fdb55243146) is not present > in table "job". > Where: SQL statement "INSERT INTO step ( > step_id, > parent_step_id, > job_id, > step_type, > description, > step_number, > status, > progress, > start_time, > end_time, > correlation_id, > external_id, > external_system_type, > is_external > ) > VALUES ( > v_step_id, > v_parent_step_id, > v_job_id, > v_step_type, > v_description, > v_step_number, > v_status, > v_progress, > v_start_time, > v_end_time, > v_correlation_id, > v_external_id, > v_external_system_type, > v_is_external > )" Created attachment 1263004 [details]
db_step_query
Created attachment 1263005 [details]
db_job_query
Created attachment 1263006 [details]
older_engine_log
I've added a fence job, so once the job is not found in the VDSM it will fail the operation. We might see several times the call for the fence job although it should be solved once the Host will be up again *** Bug 1431619 has been marked as a duplicate of this bug. *** verified on 4.1.2 |
Created attachment 1262927 [details] engine & vdsm logs Description of problem: After killing vdsmd during Amend to disk , flooding of ERROR failed to poll the command entity every ~10 sec we get this error on engine log . I do not see any apparent issue but the flooding of the log , maybe its something worse please investigate. It looks like we are polling a job every 10 sec forever even if it was killed. Version-Release number of selected component (if applicable): Engine: ovirt-engine-4.1.1.4-0.1.el7.noarch VDSM: 4.19.6-1 How reproducible: 100% Steps to Reproduce: 1.Amend disk 2.kill vdsmd Actual results: Error should not be generated each 10sec but only after vdsm is killed and than stop . Expected results: flooding of ERROR "failed to poll the command entity" every ~10 sec we get this error on engine log Additional info: First occurance was right after vdsmd kill : 2017-03-13 11:13:40,439+02 ERROR [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler1) [6233de6b] Command AmendVolume id: '1e3393ad-b416-4066-bf44-525c6e0dacaf': failed to poll the command entity After that each 10 seconds same ERROR appear .