Bug 1819310

Summary: [RFE] Do not completely fail Ansible Tower refresh when a template fails to parse
Product: Red Hat CloudForms Management Engine Reporter: Felix Dewaleyne <fdewaley>
Component: ProvidersAssignee: Tina Fitzgerald <tfitzger>
Status: CLOSED DUPLICATE QA Contact: Nandini Chandra <nachandr>
Severity: medium Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
Priority: high    
Version: 5.10.13CC: agrare, dmetzger, duhlmann, fdewaley, jfrey, jhardy, jhenner, jkeselma, nachandr, nansari, obarenbo, phoffmann, simaishi, tfitzger, wfitzger
Target Milestone: GAKeywords: FutureFeature, RFE, ZStream
Target Release: 5.11.6Flags: simaishi: cfme-5.11.z?
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: 5.11.6.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-04 10:06:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: Bug
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Ansible Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1822318    

Description Felix Dewaleyne 2020-03-31 16:34:52 UTC
Description of problem:
if a variable has two backslashes (\) it breaks the refresh worker during the refresh

Version-Release number of selected component (if applicable):
cfme-5.10.13.1

How reproducible:
all the time

Steps to Reproduce:
1.set up ansible tower with a playbook with a variable that has '\\' in it
2.add the tower to cloudforms
3.refresh it

Actual results:
[----] I, [2020-03-30T03:18:19.434134 #21377:f44f5c]  INFO -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::EventCatcher::Runner#filtered?) EMS [sample.cloudforms.com] as [spirit21\SUA_VCA_SR_CFME] Skipping caught event [UserLog
inSessionEvent] chainId [34600414]
[----] I, [2020-03-30T03:18:19.927604 #23256:f44f5c]  INFO -- : Exception in realtime_block :ems_refresh - Timings: {:collect_inventory_for_targets=>0.6473894119262695, :parse_inventory=>283.44240641593933, :parse_targeted_inventory=>283
.4424464702606, :ems_refresh=>284.0899860858917}
[----] E, [2020-03-30T03:18:19.927951 #23256:f44f5c] ERROR -- : MIQ(ManageIQ::Providers::AnsibleTower::AutomationManager::Refresher#refresh) EMS: [Ansible Tower Automation Manager], id: [21000000000008] Refresh failed
[----] E, [2020-03-30T03:18:19.928195 #23256:f44f5c] ERROR -- : [Psych::SyntaxError]: (<unknown>): found unknown escape character while parsing a quoted scalar at line 4 column 13  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2020-03-30T03:18:19.928303 #23256:f44f5c] ERROR -- : /usr/share/ruby/psych.rb:377:in `parse'
/usr/share/ruby/psych.rb:377:in `parse_stream'
/usr/share/ruby/psych.rb:325:in `parse'
/usr/share/ruby/psych.rb:291:in `safe_load'
/opt/rh/cfme-gemset/gems/ansible_tower_client-0.20.1/lib/ansible_tower_client/base_model.rb:158:in `hashify'
/opt/rh/cfme-gemset/gems/ansible_tower_client-0.20.1/lib/ansible_tower_client/base_models/job_template.rb:34:in `extra_vars_hash'
--
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:241:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'


Expected results:
refresh passes and stores values correctly

Additional info:
the variable identified causing the issue was
~~~
entry_path: "test\/21\\test"
~~~

Comment 8 Jerry Keselman 2020-04-15 14:03:21 UTC
Felix,

Please forgive my ignorance, because I have not used Ansible ever, I'm dealing with the MIQ side of things and working my way back.  In your two pictures, was this input to Ansible without any integration with CloudForms?  And is the first picture the input to Ansible Tower, while the second is the output? Interpretted output?  Inquiring minds....

Comment 9 Felix Dewaleyne 2020-04-15 14:29:35 UTC
that's Tower 3.6.1 with Ansible 2.9.1

the problem is that this introduces a behavioural difference between tower and tower through miq.

Comment 11 Jerry Keselman 2020-04-15 15:21:40 UTC
And by the way - just because it happens to work - doesn't mean its *required*.  I'm looking at the Ansible documentation, and they refer to the YAML spec for what is valid.  The YAML spec says '\/' is not valid.  If it works on Ansible Tower, should we break the spec in MIQ so that they're the same?  I have my doubts.  Still waiting for your response.  Thanks much!

Comment 12 phospi 2020-04-15 15:49:17 UTC
Hi Jerry,

thanks for your time dealing with this issue. I don't care what the yaml spec says. The yaml interpreter that is called inside of cloudforms/manageiq throws an error. This is absolutely ok and I don't expect you to change the yaml spec or anything like that.

The current situation is that the yaml interpreter that is used inside the cloudforms sync job throws an error and the ansible tower sync has a *hard* break. It never finishes. And every element that *could* have been imported because it conatins no wrong formatting at all is not imported because of this. The ansible sync is never finished. I expect you to acknowledge the fact that ansible seems to interpret yaml differently then cloudforms does. And I expect you, as a software developer, to properly catch the error and accomplish the task of syncing ansible tower with cloudforms in a best effort manner. This means that cloudforms catches the yaml error, finds out which part of ansible tower is broken (In my case it is a *single* job template of hundreds), write an appropriate warning message in the evm.log or whichever you like. Then you skip the broken part and carry on with the sync instead of just letting it crash.

Kind Regards,
Philipp

Comment 14 Felix Dewaleyne 2020-04-16 10:07:50 UTC
I progressed a little and https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html indicates that it isn't how you're supposed to escape /. you should be using '.

which leaves one aspect : is it feasable to have the refresh worker continue to the next playbook and continue sync, skipping import of the playbooks that do raise errors (keeping those errors in the logs for good measure)

Comment 21 Tina Fitzgerald 2020-05-14 14:22:50 UTC
Hi Drew,

Can you look into this issue?

Thanks,
Tina

Comment 22 drew uhlmann 2020-05-14 15:26:14 UTC
Looks like dup of https://bugzilla.redhat.com/show_bug.cgi?id=1824846

Comment 23 Felix Dewaleyne 2020-05-14 16:00:58 UTC
(In reply to drew uhlmann from comment #22)
> Looks like dup of https://bugzilla.redhat.com/show_bug.cgi?id=1824846

that's the parent ticket but because it was headed into the handling of parsing instead of handling of replication I split it to this one for clarity...

Comment 24 Tina Fitzgerald 2020-05-15 17:49:47 UTC
Hi Felix,

I agree that it makes sense to keep both tickets because they describe different scenarios with the same root cause.

This merged back ported PR should resolve both issues:
https://github.com/ManageIQ/manageiq-providers-ansible_tower/pull/218

Thanks,
Tina

Comment 25 Jaroslav Henner 2020-06-02 12:14:47 UTC
For verification it has to be noted that prior the fixed-in-version I have seen the provider not refreshed (no objects in CFME), though it was reported as refreshed. I don't like this behavior. I think what the CFME should do is to do some kind of "best effort" -- discover as many objects as possible but not report provider as refreshed when there was a problem.

Comment 26 Felix Dewaleyne 2020-06-02 13:58:42 UTC
Thanks Tina - and I agree with Jaroslav. sync what can be synced, show that the sync completed with error at the very least. I believe this is already how the rhev provider is being handled.

Comment 31 Jaroslav Henner 2020-06-03 16:13:29 UTC
I spoke with the developers: Tina, Adam Grare and dmetzger. Tina said this in deed is a dup. So I am going to close this as a DUP. It should be fixed in same PR anyway.

The fact that providers may be not really be fully refreshed but report "Last refresh: Success Less Then A Minute Ago" could be resolved by some third state: "Completed with warnings" or so. I will fill a RFE for this on Github.

@Felix Dewaleyne Please clarify what did you mean by the "replication" in https://bugzilla.redhat.com/show_bug.cgi?id=1819310#c23. This is the only remaining piece which I don't understand.

Comment 32 Felix Dewaleyne 2020-06-03 16:32:04 UTC
replication was a term used by the customer I believe. My understanding is that they meant to have the playbook syncronized into cloudforms. 

does that help?

if you end up raising that with github, let me know the url of the post so I can share it with the customer.

Comment 33 Nandini Chandra 2020-06-03 17:32:25 UTC
Verified on 5.11.6

Reproducer
1)Add Tower to CFME
2)Make sure the Tower set up has a job template to which a variable with '\\' is being passed.

On 5.11.6 (with the fix) , I have verified the following after the Tower provider was added:
1)UI shows provider refresh has completed
2)There are no errors in evm.log
3)I'm able to see all the objects under the provider. I also see the job template in question being listed in the UI. The job template shouldn't get listed in the UI since it's not usable. Clicking on the template doesn't take you to the template details page. I'll report a BZ to address this issue.

I also see this warning logged in evm.log. This change was incorporated through https://bugzilla.redhat.com/show_bug.cgi?id=1824846
----] W, [2020-06-03T12:27:42.321964 #1073065:2b0930dc05b4] WARN -- : MIQ(ManageIQ::Providers::AnsibleTower::Inventory::Parser::AutomationManager#configuration_scripts) Failed to parse job_template ID [7]: <unknown>): found unknown escape character while parsing a quoted scalar at line 2 column 10

Comment 34 Jaroslav Henner 2020-06-04 10:06:09 UTC

*** This bug has been marked as a duplicate of bug 1824846 ***