Description of problem:
Cartridge scale up/down or app creation occasionally fails and reports errors like "error adding app record perl1s-**.example.com". On my broker there are 170 failures in 12346 operations, about 1.2%, which I do not think is acceptable. I traced the failure into nsupdate_plugin.rb: 'system nsupdatecmd' returned an error, and nsupdate reported "dns_dispatch_getudp (v4): permission denied" or "dns_dispatch_getudp (v6): permission denied".

Version-Release number of selected component (if applicable):
puddle-2014-05-29.3

How reproducible:
1.2%

Steps to Reproduce:
1. Scale a cartridge up to a large number of gears (e.g. 60 or more), or keep creating applications.

Actual results:
1) Messages like "error adding/deleting app record perl1s-**.example.com" are reported by rhc.

2) The error reported in production.log:

2014-06-18 22:36:26.566 [ERROR] error adding app record perl1s-hanli1dom.example.com (pid:21724)
2014-06-18 22:36:26.566 [ERROR] /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-dns-nsupdate-1.16.2.1/lib/openshift/nsupdate_plugin.rb:202:in `register_application'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/gear.rb:134:in `register_dns'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/pending_ops_models/register_dns_op.rb:8:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/pending_app_op_group.rb:104:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/pending_app_op_group.rb:94:in `each'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/pending_app_op_group.rb:94:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/application.rb:1735:in `run_jobs'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/application.rb:840:in `block in add_cartridges'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/lock.rb:62:in `run_in_app_lock'

3) nsupdate standard error:

dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v6): permission denied

Expected results:
DNS records can always be updated.

Additional info:
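For reference, the broker's nsupdate plugin shells out to nsupdate, so the failing operation should be roughly equivalent to the manual dynamic update below. The server name, key file path, and record data here are placeholders for illustration, not values from this deployment.

# Roughly what the plugin's `system nsupdatecmd` call does when registering an
# app DNS record; all values below are illustrative placeholders.
nsupdate -k /var/named/example.com.key <<'EOF'
server broker.example.com
update add perl1s-hanli1dom.example.com 60 CNAME node1.example.com.
send
EOF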
Can you provide some more information about the environment this was seen in? Are the BIND instance and the broker on the same host? Are the hosts configured for both IPv4 and IPv6 (it looks like they are, from the nsupdate error output)? If IPv6 is configured, does this happen when only using IPv4? It looks like nsupdate is attempting both IPv6 and IPv4; are there AAAA records for the nameserver? What is the nsupdate plugin config using for the nameserver?
With bug 1118396 shipped in OSE 2.1.4, hopefully we will at least see some helpful feedback when this occurs. Given that we only ever see it sporadically, though, I'm wondering whether it's a problem with the networking stack, sporadic UDP failures, the nameserver dropping the ball, or something else entirely. I've seen this even in a standard test rig with BIND and the broker on the same host. Two approaches to consider:

1. Have the DNS plugin retry failed operations in case it was a fluke (perhaps depending on the error returned). This would probably help, but is not satisfying because it doesn't explain why it happens.

2. Create a script that loads the broker env and performs a lot of these operations to reproduce the problem, and run a packet trace while it does (see the sketch below). This may require trying different access patterns or deployment topologies to reproduce. Such a script could also be useful for customers to run when troubleshooting their specific deployments.
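A rough sketch of what such a reproducer could look like, assuming tcpdump is available on the broker host and using placeholder server, key, and record values rather than any particular deployment's configuration:

# Capture DNS traffic while repeatedly adding and deleting records, roughly
# mimicking what the broker does for app create/delete. Failed iterations are
# logged so they can be correlated against the packet capture.
tcpdump -i any -w /tmp/nsupdate.pcap udp port 53 &
TCPDUMP_PID=$!

for i in $(seq 1 1000); do
    nsupdate -k /var/named/example.com.key <<EOF || echo "update $i failed" >> /tmp/nsupdate-failures.log
server broker.example.com
update add test${i}-loaddom.example.com 60 CNAME node1.example.com.
send
update delete test${i}-loaddom.example.com CNAME
send
EOF
done

kill "$TCPDUMP_PID"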
As a note, I had a try yesterday: 7 out of 1000 app create/delete operations reported "adding/deleting DNS application record" errors in my OpenStack environment. However, when I added/deleted DNS records using nsupdate directly (without triggering app create/delete) 3000 times, no error was reported by BIND.
Just to provide some possibly useful information: I hit this issue about 90% of the time in my environment, together with bug https://bugzilla.redhat.com/show_bug.cgi?id=1158019.
The customer has decided not to use this environment for a while. It was recommended to close this bug until then.
The customer has come back and provided more logs. Searching through these logs, I cannot find any evidence of the failure in the tcpdumps; I only see the error in var/log/openshift/broker/production.log.
Possible resolution from SELinux:

setsebool -P allow_ypbind 1

#============= dhcpc_t ==============
allow dhcpc_t port_t:udp_socket name_bind;
allow dhcpc_t random_device_t:chr_file getattr;
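If denials like those are actually being recorded, one way to capture them as a local policy module (rather than, or in addition to, flipping the Boolean) would be something along these lines. The module name is arbitrary and the grep pattern is only a guess at the relevant processes:

# Build and load a local SELinux policy module from recorded AVC denials.
# Review the generated nsupdate_local.te before installing it.
grep -E 'comm="(nsupdate|dhclient)"' /var/log/audit/audit.log | audit2allow -M nsupdate_local
semodule -i nsupdate_local.pp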
Could be a simple fix for an annoying issue. Can QE check if this addresses the issue?
The issue behaves as if, when the randomness runs dry, the process tries to seed more entropy and then gets a denial. This is one possible explanation for why it happens only "occasionally". I discovered this when backporting a dynamic DNS script from RHEL 7 to RHEL 6.
What info is needed?
Jason Pyeron, I don't understand how your suggestion in comment 18 relates to intermittent DNS update failures. You may be seeing a different problem. What is the error that you see? Are you seeing denials in /var/log/audit/audit.log?

We enable the allow_ypbind Boolean in order to allow the broker to send updates to the nameserver:

# sesearch -A -s httpd_t -t port_t -c udp_socket -C
Found 4 semantic av rules:
   allow httpd_t port_type : udp_socket { recv_msg send_msg } ;
DT allow httpd_t port_type : udp_socket { recv_msg send_msg } ; [ allow_ypbind ]
DT allow httpd_t port_t : udp_socket name_bind ; [ allow_ypbind ]
ET allow httpd_t port_t : udp_socket name_bind ; [ httpd_verify_dns ]

(It looks like the httpd_verify_dns Boolean would work as well, and it would be more targeted, but for whatever reason, we use the allow_ypbind Boolean in OpenShift Enterprise.)

Without this Boolean enabled, we would see the following SELinux denial on the broker host when creating or deleting an OpenShift application:

avc: denied { name_bind } for pid=12340 comm="nsupdate" src=3610 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:port_t:s0 tclass=udp_socket

(On the user end, rhc would report "Error adding DNS application record" or "Error removing DNS application record".) Also, the failure rate would be 100%, not 1.2%.

The allow_ypbind Boolean does not provide access to random_device_t:

# sesearch -A -t random_device_t -b allow_ypbind
#

The only thing that should be running with the dhcpc_t type label is dhclient, so I do not understand how that is relevant.

Are you seeing errors similar to those described in the initial report? Are they intermittent? If so, please provide more detail about what you are seeing here. If you are seeing something different, please open a separate Bugzilla report to cover the issue. Thanks!
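For anyone hitting this, a quick way to rule SELinux in or out on the broker host is to check the Booleans and recent denials directly. This is only a sketch using standard RHEL 6 tool names and paths; adjust as needed:

# Confirm the Booleans the broker relies on for DNS updates are enabled.
getsebool allow_ypbind httpd_verify_dns
# Look for recent AVC denials from nsupdate.
ausearch -m avc -c nsupdate --start recent
# Re-enable the Boolean persistently if it is off.
setsebool -P allow_ypbind on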
Given that the two customer cases are no longer hitting this issue, we don't have much information to try and reproduce this. We'll close this for now due to lack of information. If this comes up again and we can get more information to try and reproduce it, then please reopen!
Adding this back to my backlog to re-investigate. Re comment 24: as I recall (from my admittedly bad memory), the observed problem was identical. Re comment 18: I would likely find, after investigation, that the ypbind portion of the solution has no relevance. Start of speculation without merit: without any review of the problem or logs, I would proffer that the intermittence is due to a cache/buffer full/not-full pattern; do something new, need more randomness than is available, and so on. End of speculation without merit.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days