Bug 1602782
| Summary: | Trace when deploying a heat template... | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eduard Barrera <ebarrera> | |
| Component: | openstack-heat | Assignee: | Thomas Hervé <therve> | |
| Status: | CLOSED ERRATA | QA Contact: | Ronnie Rasouli <rrasouli> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 10.0 (Newton) | CC: | ebarrera, ipetrova, mburns, mowens, pcaruana, ramishra, sbaker, shardy, srevivo, therve, zbitter | |
| Target Milestone: | z9 | Keywords: | Reopened, Triaged, ZStream | |
| Target Release: | 10.0 (Newton) | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | openstack-heat-7.0.6-4.el7ost | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1611903 1611904 (view as bug list) | Environment: | ||
| Last Closed: | 2018-09-17 16:57:52 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1611903, 1611904 | |||
|
Description
Eduard Barrera
2018-07-18 13:32:35 UTC
It appears that there is a resource whose current_template_id is set to None. Since that should never happen, it fails when we try to load that template from the database. I have no idea how this could have occurred, but it is probably unrelated to the new template per se. What version of Heat is being used? The logs appear to show that it's something earlier than 7.0.6-1. I believe 7.0.6-2 is the latest. Error appears here the first time: 2018-07-03 12:02:51.649 579777 ERROR oslo_messaging.rpc.server TypeError: object of type 'NoneType' has no len() I think this happens when a stack isn't stored properly. Could we get the following requests run on the overcloud database: > select count(*) from stack where raw_template_id is NULL; If the number isn't too big, then: > select * from stack where raw_template_id is NULL; Thanks. The bottom part of the stack trace is tricking you ;) If you read all the way up, the part where it's failing is here: https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resource.py?h=newton-eol#n324 So although it's failing during the loading of the stack, the id that is None isn't coming from the stack table in the DB, it's coming from the current_template_id of the resource. So the requests to check are: > select count(*) from resource where current_template_id is NULL; > select * from resource where current_template_id is NULL; (In reply to Zane Bitter from comment #5) > select count(*) from resource where current_template_id is NULL; > select * from resource where current_template_id is NULL; Hi, below the result MariaDB [heat]> select count(*) from resource where current_template_id is NULL; +----------+ | count(*) | +----------+ | 4 | +----------+ 1 row in set (0.00 sec) MariaDB [heat]> select * from resource where current_template_id is NULL; +------+--------------------------------------+---------------+--------------------------+---------------------+---------------------+--------+----------+----------------------------------------------+--------------------------------------+---------------+-----------------+-----------+------------+--------------------------------------+----------+----------+-------------+---------------------+---------------------------+--------------------------------------+ | id | uuid | nova_instance | name | created_at | updated_at | action | status | status_reason | stack_id | rsrc_metadata | properties_data | engine_id | atomic_key | needed_by | requires | replaces | replaced_by | current_template_id | properties_data_encrypted | root_stack_id | +------+--------------------------------------+---------------+--------------------------+---------------------+---------------------+--------+----------+----------------------------------------------+--------------------------------------+---------------+-----------------+-----------+------------+--------------------------------------+----------+----------+-------------+---------------------+---------------------------+--------------------------------------+ | 4176 | be1acb43-0474-4234-b633-5d9e157bf8f1 | NULL | vpe_hss_oam_sub_200 | 2018-07-03 10:01:50 | 2018-07-03 10:01:49 | CHECK | COMPLETE | CHECK not supported for OS::Neutron::Subnet | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | {} | null | NULL | 9 | [4179, 3, 4182, 6, 4823, 4703] | [] | NULL | NULL | NULL | 0 | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | | 4179 | 35c95e56-2975-4e4b-b070-5ee308f9bc5e | NULL | vpea_subport_hss_oam_200 | 2018-07-03 10:01:50 | 2018-07-03 10:01:49 | CHECK | COMPLETE | CHECK not supported for OS::Neutron::Port | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | {} | null | NULL | 6 | [504] | [] | NULL | NULL | NULL | 0 | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | | 4182 | dccbf886-8c32-45cf-ab8b-b2f9117494b9 | NULL | vpeb_subport_hss_oam_200 | 2018-07-03 10:01:50 | 2018-07-03 10:01:49 | CHECK | FAILED | state changed by resource_mark_unhealthy api | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | {} | null | NULL | 8 | [507] | [] | NULL | NULL | NULL | 0 | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | | 4185 | 7a3e286b-28a6-4863-ae9d-efe076bb9516 | NULL | vpe_bbn_sub_207 | 2018-07-03 10:01:50 | 2018-07-03 10:01:49 | CHECK | FAILED | state changed by resource_mark_unhealthy api | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | {} | null | NULL | 11 | [3, 6, 4167, 4179, 4182, 4823, 4703] | [] | NULL | NULL | NULL | 0 | dbcae8ee-6af7-4cde-8757-7a11af2a47e2 | +------+--------------------------------------+---------------+--------------------------+---------------------+---------------------+--------+----------+----------------------------------------------+--------------------------------------+---------------+-----------------+-----------+------------+--------------------------------------+----------+----------+-------------+---------------------+---------------------------+--------------------------------------+ 4 rows in set (0.00 sec) OK, that confirms that current_template_id being NULL is the proximate cause of the traceback. The 4 resources in question are not present in ..._no_vol_6.yaml and are commented out in ..._no_vol_7.yaml. So the info we're getting about what happened is not completely correct. In particular, there is zero possibility of resources that do not appear in the template showing up in the database, hence the assertion in the description that this is 100% reproducible by simply deploying the ..._no_vol_7.yaml is not compatible with the evidence. Now that we know the stack ID, we can piece together the history from the logs. The stack VNF-RDG-VPE-stack_v1 (dbcae8ee-6af7-4cde-8757-7a11af2a47e2) is created on 2018-05-01 and completes successfully. At this time the stack does *not* contain any of the problem resources (possibly this is the ..._no_vol_6.yaml template). The stack is updated many times on 2018-05-10 with varying degress of success. The first time our suspect resources appear is on 2018-07-02. This could be the ..._no_vol_7.yaml template without the resources commented out. Some time on that day, updates start to fail due to two missing networks e10fa625-f38c-4dcb-b764-85292db7c061 and 097174d9-487d-427e-9d17-015815e2c85b. For some reason, before each stack update there is a stack check run, even if the stack was cleanly in a STACK_FAILED state. This continues into 2018-07-03. At this point in the log there's a confusing part where someone (presumably in a different tenant) attempts to create, and then deletes, another stack with the same name (but different UUID) as ours. The last update not to show the bug occurs at 2018-07-03 11:58:40 - once again we see two errors in property translation that cause the stack (though not any particular resource) to fail: "NotFound: Unable to find network with name or id 'e10fa625-f38c-4dcb-b764-85292db7c061'" and "NotFound: Unable to find network with name or id '097174d9-487d-427e-9d17-015815e2c85b'". At this point, the resources are still in the template. After this there is another stack check performed at 12:01:49. The first appearance in an update graph of the resource IDs we now see in the database is during the following update at 12:02:49 - the update that begins failing with the stack trace in the bug report (which appears 4 times). Our suspect resources are scheduled to be cleaned up because they've been removed from the template (presumably this is the point at which they were commented out, because the network they depended on had been deleted). This is a red flag, because for them to have been around to need cleaning up in the first place we ought to see them either in the graph for the previous update or being created on the fly as replacement resources. There were two replacement resources created in the previous update, but the IDs were 4170 and 4173. This suggests that the 'stack check' created the resources in the DB, and did so with an empty current_template_id. That's consistent with how the resources appear in the database - with the status CHECK_COMPLETE (except for the two that have been explicitly marked unhealthy), and no physical resource IDs which means they've never been CREATEd. It's also not entirely shocking given that 'stack check' knows nothing about convergence (https://storyboard.openstack.org/#!/story/1727142) and current_template_id wasn't used in pre-convergence stacks. It would take a lot of work to figure out what sequence of events is required to get into the state where the resources are either missing from the database or cannot be found by the stack check operation, such that it creates new entries when storing the CHECK_COMPLETE states. For our purposes, it probably doesn't matter. The solution will likely be along the lines of ensuring that 'stack check' stores the correct current_template_id when creating new resource entries in the DB. I'd also recommend against running 'stack check' before every update. For fixing this particular stack, since the resources are being deleted anyway I imagine all we need to do is delete them from the DB: DELETE FROM resource WHERE current_template_id is NULL; @ Zane, after executing: DELETE FROM resource WHERE current_template_id is NULL; And re-running the update, we got the following report: Right now I don't see any error in the log, but the stack persist in Update Progress for indefinite time. Below the last log, after that no update. 7fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] Stack UPDATE IN_PROGRESS (VNF-RDG-VPE-stack_v1): Stack UPDATE started 2018-07-23 20:22:11.710 171291 INFO heat.engine.stack [req-e032f8be-0777-4443-a4ee-d21da8802489 acf77c8dca0334d9037dfec444e7f613acbea75bd32ce5faaa71c9737fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] convergence_dependencies: (Dependencies([((12, False), (12, True)), ((12, False), (9, False)), ((12, True), (18, True)), ((12, True), (72, True)), ((12, True), (75, True)), ((18, False), (12, False)), ((18, False), (18, True)), ((18, False), (9, False)), ((18, True), (60, True)), ((18, True), (66, True)), ((21, False), (21, True)), ((21, False), (3, False)), ((21, True), (24, True)), ((21, True), (27, True)), ((24, False), (21, False)), ((24, False), (24, True)), ((24, True), (27, True)), ((24, True), (72, True)), ((24, True), (75, True)), ((27, False), (21, False)), ((27, False), (24, False)), ((27, False), (27, True)), ((27, True), (60, True)), ((27, True), (66, True)), ((3, False), (3, True)), ((3, True), (21, True)), ((3, True), (30, True)), ((3, True), (51, True)), ((3, True), (60, True)), ((3, True), (63, True)), ((3, True), (72, True)), ((3, True), (78, True)), ((30, False), (3, False)), ((30, False), (30, True)), ((30, False), (6, False)), ((33, False), (504, False)), ((39, False), (507, False)), ((42, False), (42, True)), ((42, False), (6, False)), ((42, True), (45, True)), ((42, True), (48, True)), ((45, False), (42, False)), ((45, False), (45, True)), ((45, True), (48, True)), ((45, True), (63, True)), ((45, True), (69, True)), ((48, False), (42, False)), ((48, False), (45, False)), ((48, False), (48, True)), ((48, True), (60, True)), ((48, True), (66, True)), ((504, False), (6, False)), ((507, False), (3, False)), ((51, False), (3, False)), ((51, False), (51, True)), ((51, True), (54, True)), ((51, True), (57, True)), ((54, False), (51, False)), ((54, False), (54, True)), ((54, True), (57, True)), ((54, True), (78, True)), ((54, True), (81, True)), ((57, False), (51, False)), ((57, False), (54, False)), ((57, False), (57, True)), ((57, True), (60, True)), ((57, True), (66, True)), ((6, False), (6, True)), ((6, True), (30, True)), ((6, True), (42, True)), ((6, True), (60, True)), ((6, True), (63, True)), ((6, True), (72, True)), ((6, True), (78, True)), ((6, True), (9, True)), ((60, False), (18, False)), ((60, False), (27, False)), ((60, False), (3, False)), ((60, False), (33, False)), ((60, False), (39, False)), ((60, False), (48, False)), ((60, False), (57, False)), ((60, False), (6, False)), ((60, False), (60, True)), ((60, True), (66, True)), ((63, False), (3, False)), ((63, False), (45, False)), ((63, False), (6, False)), ((63, False), (63, True)), ((63, True), (69, True)), ((66, False), (18, False)), ((66, False), (27, False)), ((66, False), (33, False)), ((66, False), (39, False)), ((66, False), (48, False)), ((66, False), (57, False)), ((66, False), (60, False)), ((66, False), (66, True)), ((69, False), (45, False)), ((69, False), (63, False)), ((69, False), (69, True)), ((72, False), (12, False)), ((72, False), (24, False)), ((72, False), (3, False)), ((72, False), (6, False)), ((72, False), (72, True)), ((72, True), (75, True)), ((75, False), (12, False)), ((75, False), (24, False)), ((75, False), (72, False)), ((75, False), (75, True)), ((78, False), (3, False)), ((78, False), (54, False)), ((78, False), (6, False)), ((78, False), (78, True)), ((78, True), (81, True)), ((81, False), (54, False)), ((81, False), (78, False)), ((81, False), (81, True)), ((9, False), (6, False)), ((9, False), (9, True)), ((9, True), (12, True)), ((9, True), (18, True))]),) 2018-07-23 20:22:11.857 171291 INFO heat.engine.stack [req-e032f8be-0777-4443-a4ee-d21da8802489 acf77c8dca0334d9037dfec444e7f613acbea75bd32ce5faaa71c9737fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] Triggering resource 81 for update 2018-07-23 20:22:11.872 171291 INFO heat.engine.stack [req-e032f8be-0777-4443-a4ee-d21da8802489 acf77c8dca0334d9037dfec444e7f613acbea75bd32ce5faaa71c9737fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] Triggering resource 30 for update 2018-07-23 20:22:11.875 171291 INFO heat.engine.stack [req-e032f8be-0777-4443-a4ee-d21da8802489 acf77c8dca0334d9037dfec444e7f613acbea75bd32ce5faaa71c9737fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] Triggering resource 66 for update 2018-07-23 20:22:11.877 171291 INFO heat.engine.stack [req-e032f8be-0777-4443-a4ee-d21da8802489 acf77c8dca0334d9037dfec444e7f613acbea75bd32ce5faaa71c9737fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] Triggering resource 69 for update 2018-07-23 20:22:11.879 171291 INFO heat.engine.stack [req-e032f8be-0777-4443-a4ee-d21da8802489 acf77c8dca0334d9037dfec444e7f613acbea75bd32ce5faaa71c9737fb0ffce c3e6e62addf64a29b0d3440f57e1ffcb - - -] Triggering resource 75 for update Logs show that the error condition has been successfully cleared - the resources 4176, 4179, 4182, and 4185 were removed from the traversal graph, and the failure mode reverted to network not found errors just as the stack was experiencing prior to the error reported in this bug, albeit with different network IDs: Unable to find network with name or id '1d3b2b90-eed5-4217-b3d1-2886f4518a4c' Unable to find network with name or id '905b55bc-996c-4680-b4c0-a2be8e7bb978' After several updates with an identical result (interspersed, against recommendations, with stack checks), heat-engine was restarted (twice) and now the appears to be failing to process any resources as all - messages are sent to RabbitMQ to trigger the first 5 resources to be processed, but nothing appears to be picking them up (or if something is, it's failing silently). Debugging this would have been a lot easier with the log message added by this patch: https://review.openstack.org/533015 Hi there, If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -. Thanks, Alex Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2716 |