Bug 1939234

Summary: 16.1 Introspection Times Out after node scaled up
Product: Red Hat OpenStack
Component: openstack-tripleo-common
Version: 16.1 (Train)
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: David Rosenfeld <drosenfe>
Assignee: Steve Baker <sbaker>
QA Contact: David Rosenfeld <drosenfe>
CC: mburns, pweeks, sbaker, slinaber
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-12-15 22:11:59 UTC

Description David Rosenfeld 2021-03-15 19:18:35 UTC
Description of problem: The Phase 3 regression compute-replacement job failed because introspection timed out for a scaled-up node. In this test, compute-2
is scaled up. The following appears in the logs when introspection is performed for compute-2 after it has been scaled up:

Waiting for introspection to finish...
Introspection of node attempt failed:f4118ceb-d221-41e5-95d3-c1357e13fd6e.
Retrying 1 nodes that failed introspection. Attempt 1 of 3 
Introspection of node attempt failed:f4118ceb-d221-41e5-95d3-c1357e13fd6e.
Retrying 1 nodes that failed introspection. Attempt 2 of 3 
Introspection of node attempt failed:f4118ceb-d221-41e5-95d3-c1357e13fd6e.
Retrying 1 nodes that failed introspection. Attempt 3 of 3 
Introspection of node attempt failed:f4118ceb-d221-41e5-95d3-c1357e13fd6e.
Retry limit reached with 1 nodes still failing introspection


STDERR:

Waiting for messages on queue 'tripleo' with no timeout.
Introspection completed with errors: Retry limit reached with 1 nodes still failing introspection
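
For anyone triaging this later, a rough sketch of how the failed state can be examined from the undercloud once the retries are exhausted (standard 16.1 CLI; these commands are illustrative, not output captured from this run):

  source ~/stackrc
  # list introspection attempts recorded by ironic-inspector, with their finished/error state
  openstack baremetal introspection list
  # check the provision and power state of the newly added node
  openstack baremetal node list --fields uuid name provision_state power_state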


Version-Release number of selected component (if applicable): RHOS-16.1-RHEL-8-20210311.n.1


How reproducible: Has been seen several times when this job is executed: DFG-df-rfe-16.1-virsh-3cont_3db_3msg_2net_2comp_3ceph-blacklist-2networker-compute-replacement


Steps to Reproduce:
1. Execute Jenkins job: DFG-df-rfe-16.1-virsh-3cont_3db_3msg_2net_2comp_3ceph-blacklist-2networker-compute-replacement
2. The job fails during the scale-up stage because introspection of the newly added node times out

Actual results: The scale-up stage fails because introspection of the newly added node times out


Expected results: Introspection does not time out after scale-up is performed.


Additional info:

Comment 1 David Rosenfeld 2021-03-15 19:19:17 UTC
Logs for the failing test (see ir-cloud-config-scale-up.log): https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3db_3msg_2net_2comp_3ceph-blacklist-2networker-compute-replacement/32/

ir-cloud-config-scale-up.log contains the same introspection retry output and STDERR as quoted in the description above.

Comment 2 Steve Baker 2021-03-16 19:50:17 UTC
We're going to need to see the inspector logs for this failure; it's not obvious whether those are available from the CI job.

Comment 3 David Rosenfeld 2021-03-23 20:22:09 UTC
Can the exact path to the inspector logs be provided? I will collect the logs once the path is known.

Comment 4 Steve Baker 2021-03-23 21:22:18 UTC
It will be /var/log/ironic/deploy/<node uuid>.tar.gz or something similar

Comment 5 Steve Baker 2021-03-23 21:24:25 UTC
whoops, inspector logs are /var/log/ironic-inspector/ramdisk/<file including datestamp>.tar.gz or similar
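
A rough sketch of how to grab them (assuming SSH access to the undercloud; on a containerized 16.1 undercloud the host-side path may be under /var/log/containers/ instead, which is an assumption on my part):

  # look for the ramdisk log tarball for the failing node; both paths checked since the exact one depends on the deployment
  ls -l /var/log/ironic-inspector/ramdisk/ /var/log/containers/ironic-inspector/ramdisk/ 2>/dev/null
  # unpack the newest tarball and look through the ramdisk-side logs for the failure
  mkdir -p /tmp/ramdisk-logs && tar -xzf <path-to-tarball>.tar.gz -C /tmp/ramdisk-logs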

Comment 6 David Rosenfeld 2021-03-24 17:52:03 UTC
The Jenkins instance was replaced since this BZ was written. This is the link to the failing job on the archived Jenkins instance:

https://rhos-ci-jenkins-history.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3db_3msg_2net_2comp_3ceph-blacklist-2networker-compute-replacement/32/

This is the link to the /var/log directory on the undercloud:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-df-rfe-16.1-virsh-3cont_3db_3msg_2net_2comp_3ceph-blacklist-2networker-compute-replacement/32/undercloud-0/var/log/

There is no ironic-inspector log directory.

Comment 8 Steve Baker 2021-04-13 19:45:03 UTC
I think we need access to an environment after the inspection times out. We can re-trigger the inspection manually and observe the issue.
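
Something along these lines is what I have in mind, as a rough sketch (run on the undercloud as the stack user; the UUID is the failing node from the log above, and the exact sequence may need adjusting for the environment):

  source ~/stackrc
  # check where the node ended up after the failed attempts
  openstack baremetal node show f4118ceb-d221-41e5-95d3-c1357e13fd6e -f value -c provision_state -c last_error
  # move it back to manageable and re-run introspection while watching it
  openstack baremetal node manage f4118ceb-d221-41e5-95d3-c1357e13fd6e
  openstack overcloud node introspect f4118ceb-d221-41e5-95d3-c1357e13fd6e --provide
  # inspector's own view of the run
  openstack baremetal introspection status f4118ceb-d221-41e5-95d3-c1357e13fd6e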

Comment 9 David Rosenfeld 2021-04-28 13:56:13 UTC
The introspection timeout has been seen (and is still being seen) many times in the CI environment. I haven't been able to recreate it on a server that I can give you access to. Will keep trying. Note: not sure if it's related, but I am also seeing a problem in the CI environment during the scale-up job where ipmitool returns:

Set Chassis Power Control to Up/On failed: Unspecified error

I have filed a Jira for that:
https://projects.engineering.redhat.com/browse/RHOSINFRA-3971

The scale-up regression test is failing with either the ipmitool error or the introspection timeout.
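
For completeness, the ipmitool error quoted above comes from a power-on attempt; a hedged example of exercising the same operation by hand against the node's BMC (address and credentials are placeholders):

  # check the current power state, then retry the power-on that fails in CI
  ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status
  ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power on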

Comment 10 Steve Baker 2021-12-15 22:11:59 UTC
Let's close this for now until we get better data to diagnose it.

Comment 11 Red Hat Bugzilla 2023-09-15 01:03:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days