When adding a Hawkular middleware provider that uses PostgreSQL to store inventory information, the inventory refresh fails in CloudForms and displays the error: "undefined method `path' for nil:NilClass". The evm.log file contains the following:

> ManageIQ::Providers::Hawkular::MiddlewareManager: [hawkular-prod-01]
> [----] E, [2016-11-01T13:31:06.087098 #2209:935148] ERROR -- : MIQ(ManageIQ::Providers::Hawkular::MiddlewareManager::Refresher#refresh) EMS: [hawkular-prod-01], id: [1] Refresh failed
> [----] E, [2016-11-01T13:31:06.087361 #2209:935148] ERROR -- : [NoMethodError]: undefined method `path' for nil:NilClass Method:[rescue in block in refresh]
> [----] E, [2016-11-01T13:31:06.087834 #2209:935148] ERROR -- : /var/www/miq/vmdb/app/models/manageiq/providers/hawkular/middleware_manager.rb:91:in `block in os_resource_for'
> /var/www/miq/vmdb/app/models/ext_management_system.rb:360:in `with_provider_connection'
> /var/www/miq/vmdb/app/models/manageiq/providers/hawkular/middleware_manager.rb:89:in `os_resource_for'
> /var/www/miq/vmdb/app/models/manageiq/providers/hawkular/middleware_manager.rb:85:in `machine_id'
> /var/www/miq/vmdb/app/models/manageiq/providers/hawkular/middleware_manager/refresh_parser.rb:33:in `block (2 levels) in fetch_middleware_servers'
Some additional information discovered during testing:
- The inventory tables in postgres are successfully created when the container is started with the postgres jdbc parameters, so the container is able to connect to the postgresql server.
- Running the same container without connecting to postgresql (using the built-in hsqldb), the inventory refresh works fine.
I can't comment on the MiQ side of the picture, Ruby is like Greek to me :) While I don't know the semantics of the "refresh", could the reason for this be that the Hawkular provider is looking for something in inventory that is not there yet? Postgres is slower than the embedded database, so this might just be a timing issue coupled with improper response handling on the Hawkular provider side. Could you attach the hawkular-services logs so that I can check that the error is not coming from the inventory side?
This will happen if this line is reached https://git.io/vXs7U, or in other words if there is no resource type called "Operating System" under the feed in the h-inventory. There are a couple of scenarios I am aware of in which this can be true: either the h-services were started with -Dhawkular.agent.enabled=false, or the refresh is called before the data is in the h-inventory, as Lukas mentioned above, or the operating system metrics are not being collected because collection is somehow turned off in the agent config in the standalone.xml, or there is some other reason unknown to me. Anyway, the code should be more robust imho, but I am not sure how to handle this. It's the trade-off between the fail-fast strategy and the "robust self-healing smartness" :] What are the implications if the os field is empty? Perhaps Juca knows.
> What are the implications if the os field is empty? Perhaps Juca knows.
There's no problem at all if the field is empty. It is *expected* to be filled for Linux machines, especially the ones with systemd, such as newer Ubuntu, Fedora, ... It's *not expected* to be filled for containers, for instance, or for Windows machines. I think I talked with Mazz back then, and there was a guarantee that "Operating System" would always exist. If that changed, then the "API" changed and this code obviously needs to be revisited.
> I think I talked with Mazz back then, and there was a guarantee that "Operating System" would always exist.
s/always/eventually/ and you are right :) The agent always discovers it, but it can take time for it to appear in inventory...
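To illustrate the "eventually" semantics, here is a minimal Ruby sketch of a bounded wait: a caller polls for a value that may not exist yet and gives up after a fixed number of attempts instead of assuming it is already there. The helper name `wait_for_resource` and its parameters are hypothetical and are not part of the MiQ provider or the Hawkular client; the block stands in for whatever inventory lookup is being polled.

```ruby
# Hypothetical helper: calls the block until it returns a non-nil value
# or the number of attempts is exhausted. Returns nil when giving up.
def wait_for_resource(attempts: 5, delay: 1.0)
  attempts.times do |i|
    result = yield
    return result unless result.nil?

    # Do not sleep after the final attempt.
    sleep(delay) if i < attempts - 1
  end
  nil
end

# Simulated lookup: the resource "appears" on the third poll.
polls = 0
os = wait_for_resource(attempts: 5, delay: 0) do
  polls += 1
  polls >= 3 ? 'operating-system-resource' : nil
end
puts "found #{os.inspect} after #{polls} polls"
```

Whether a refresh should wait at all, or simply leave the field empty and pick it up on the next refresh, is a design choice this sketch does not settle.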
Created attachment 1216708: Hawkular Services log file using postgres jdbc config
I should also mention that this is postgresql version 9.2.15, which is the current version available in the RHEL 7 yum repo.
Seeing this in the HS logs:

4:28:34,815 WARN [org.hawkular.inventory.rest] (default task-17) RestEasy exception, : java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: cached plan must not change result type
    at org.umlg.sqlg.structure.SqlgEdge.load(SqlgEdge.java:235)
    at org.umlg.sqlg.structure.SqlgElement.property(SqlgElement.java:215)
    at org.umlg.sqlg.structure.SqlgEdge.property(SqlgEdge.java:54)
    at org.hawkular.inventory.impl.tinkerpop.TinkerpopBackend.relate(TinkerpopBackend.java:717)
    at org.hawkular.inventory.impl.tinkerpop.TinkerpopBackend$3.relateToParent(TinkerpopBackend.java:924)
    at org.hawkular.inventory.impl.tinkerpop.TinkerpopBackend$3.defaultAction(TinkerpopBackend.java:842)
    at org.hawkular.inventory.impl.tinkerpop.TinkerpopBackend$3.defaultAction(TinkerpopBackend.java:839)
    at org.hawkular.inventory.api.model.StructuredData$Visitor$Simple.visitString(StructuredData.java:461)
    at org.hawkular.inventory.api.model.StructuredData.accept(StructuredData.java:97)
    at org.hawkular.inventory.impl.tinkerpop.TinkerpopBackend$3.visitMap(TinkerpopBackend.java:913)
    at org.hawkular.inventory.impl.tinkerpop.TinkerpopBackend$3.visitMap(TinkerpopBackend.java:839)
    ...

So the issue is twofold here.
1) Inventory on Postgres suffers from concurrent update of schema and querying (I've already started working on a fix yesterday)
2) MiQ provider should not assume the Operating System resource will always be there - it will *eventually* appear in inventory but there is no guarantee it will be there at the time of refresh.
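For the second point, a minimal sketch of what a nil-tolerant lookup could look like on the Ruby side. This is not the actual provider code or the fix that was merged: `FakeInventory`, `ResourceType`, `Resource`, the `'/os-type-path'` value and the `'Machine Id'` property name are all stand-ins assumed for illustration. Only the method names `os_resource_for` and `machine_id` are taken from the stack trace in the report.

```ruby
# Hypothetical stand-ins for inventory objects; not the real Hawkular client API.
ResourceType = Struct.new(:id, :path)
Resource     = Struct.new(:id, :properties)

# Minimal in-memory inventory, used only to make the sketch runnable.
class FakeInventory
  def initialize(types, resources)
    @types = types         # feed id => array of ResourceType
    @resources = resources # resource type path => array of Resource
  end

  def resource_types(feed)
    @types.fetch(feed, [])
  end

  def resources_for_type(path)
    @resources.fetch(path, [])
  end
end

# Returns the OS resource for a feed, or nil if it is not in inventory yet.
def os_resource_for(inventory, feed)
  os_type = inventory.resource_types(feed).find { |t| t.id == 'Operating System' }
  # Guard against the reported failure: calling `path` on nil.
  return nil if os_type.nil?

  inventory.resources_for_type(os_type.path).first
end

# Returns the machine id, or nil when no OS resource is available.
# 'Machine Id' is an assumed property name.
def machine_id(inventory, feed)
  os = os_resource_for(inventory, feed)
  os && os.properties['Machine Id']
end

not_ready = FakeInventory.new({}, {})
ready = FakeInventory.new(
  { 'feed-1' => [ResourceType.new('Operating System', '/os-type-path')] },
  { '/os-type-path' => [Resource.new('os-1', { 'Machine Id' => 'abc123' })] }
)
p machine_id(not_ready, 'feed-1')
p machine_id(ready, 'feed-1')
```

With a guard like this the refresh would continue with an empty os field when the resource has not been discovered yet, which matches the earlier comment that an empty field is not a problem.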
Created attachment 1217127: HS Log
Created attachment 1217129: CFME evm.log
Note that PR https://github.com/hawkular/hawkular-inventory/pull/309 contains a fix for the postgres issues in inventory and is pending review.
Reassigning back to Heiko, because this is hopefully handled on the inventory side, while the MiQ side still needs attention.
MiQ PR: https://github.com/ManageIQ/manageiq/pull/12477
This seems to be working in the DR8 (0.19.0.Final) build. Moving to ON_QA to verify that the original issue is resolved.
The Ruby side is merged to MiQ master but, as far as I can see, not backported to Euwe, so the check for the "null pointer exception" is not yet in CF.
@Paul Gier: you probably got this from the comments, but just to be sure: this issue is non-deterministic; it happens only if you add the provider and run the refresh in MiQ fast enough. Hmm, actually, changing the platform enabled setting from "true" to "false" in the standalone.xml of the monitored server should also trigger it.
Setting to POST, as it is in master on the ruby-side already