Description of problem:
Cartridge scale up/down or app creation occasionally fails and reports errors like "error adding app record perl1s-**.example.com". On my broker there are 170 failures in 12346 operations, about 1.2%, which I do not think is acceptable. I traced the failure into nsupdate_plugin.rb: 'system nsupdatecmd' returned an error, and nsupdate reported "dns_dispatch_getudp (v4): permission denied" or "dns_dispatch_getudp (v6): permission denied".

Version-Release number of selected component (if applicable):
puddle-2014-05-29.3

How reproducible:
1.2%

Steps to Reproduce:
1. Scale a cartridge up to a large number of gears (e.g. 60 or more), or keep creating applications.

Actual results:
1) Messages like "error adding/deleting app record perl1s-**.example.com" are reported by rhc.

2) The error reported in production.log:

2014-06-18 22:36:26.566 [ERROR] error adding app record perl1s-hanli1dom.example.com (pid:21724)
2014-06-18 22:36:26.566 [ERROR] /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-dns-nsupdate-1.16.2.1/lib/openshift/nsupdate_plugin.rb:202:in `register_application'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/gear.rb:134:in `register_dns'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/pending_ops_models/register_dns_op.rb:8:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/pending_app_op_group.rb:104:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/pending_app_op_group.rb:94:in `each'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/pending_app_op_group.rb:94:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/application.rb:1735:in `run_jobs'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/application.rb:840:in `block in add_cartridges'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.23.10.1/app/models/lock.rb:62:in `run_in_app_lock'

3) nsupdate standard error:

dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v4): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v6): permission denied
dns_dispatch_getudp (v6): permission denied

Expected results:
DNS records can always be updated.

Additional info:
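For reference, the broker's nsupdate plugin shells out to nsupdate, so the failing operation should be roughly equivalent to the manual dynamic update below. The server name, key file path, and record data here are placeholders for illustration, not values from this deployment.

# Roughly what the plugin's `system nsupdatecmd` call does when registering an
# app DNS record; all values below are illustrative placeholders.
nsupdate -k /var/named/example.com.key <<'EOF'
server broker.example.com
update add perl1s-hanli1dom.example.com 60 CNAME node1.example.com.
send
EOF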
Can you provide some more information about the environment this was seen in? Are the BIND instance and the broker on the same host? Are the hosts configured for both IPv4 and IPv6 (it looks like they are, from the nsupdate error output)? If IPv6 is configured, does this happen when only using IPv4? It looks like nsupdate is attempting both IPv6 and IPv4; are there AAAA records for the nameserver? What is the nsupdate plugin config using for the nameserver?
With bug 1118396 shipped in OSE 2.1.4, hopefully we will at least see some helpful feedback when this occurs. Given that we only ever see it sporadically, though, I'm wondering whether it's a problem with the networking stack, sporadic UDP failures, the nameserver dropping the ball, or something else entirely. I've seen this even in a standard test rig with BIND and the broker on the same host. Two approaches to consider:

1. Have the DNS plugin retry failed operations in case it was a fluke (perhaps depending on the error returned). This would probably help, but is not satisfying because it doesn't explain why it happens.

2. Create a script that loads the broker env and performs a lot of these operations to reproduce the problem, and run a packet trace while it does (see the sketch below). This may require trying different access patterns or deployment topologies to reproduce. Such a script could also be useful for customers to run when troubleshooting their specific deployments.
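A rough sketch of what such a reproducer could look like, assuming tcpdump is available on the broker host and using placeholder server, key, and record values rather than any particular deployment's configuration:

# Capture DNS traffic while repeatedly adding and deleting records, roughly
# mimicking what the broker does for app create/delete. Failed iterations are
# logged so they can be correlated against the packet capture.
tcpdump -i any -w /tmp/nsupdate.pcap udp port 53 &
TCPDUMP_PID=$!

for i in $(seq 1 1000); do
    nsupdate -k /var/named/example.com.key <<EOF || echo "update $i failed" >> /tmp/nsupdate-failures.log
server broker.example.com
update add test${i}-loaddom.example.com 60 CNAME node1.example.com.
send
update delete test${i}-loaddom.example.com CNAME
send
EOF
done

kill "$TCPDUMP_PID"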
As a note, I had a try yesterday: 7 out of 1000 app create/delete operations reported "adding/deleting DNS application record" errors in my OpenStack environment. However, when I added/deleted DNS records using nsupdate directly (without triggering app create/delete) 3000 times, no error was reported by BIND.
Just to provide some possibly useful information: I hit this issue about 90% of the time in my environment, together with bug https://bugzilla.redhat.com/show_bug.cgi?id=1158019.
The customer has decided not to use this environment for a while. It was recommended to close this bug until then.
The customer has come back and provided more logs. Searching through these logs, I cannot find any evidence of the failure in the tcpdumps; I only see the error in var/log/openshift/broker/production.log.
Possible resolution from SELinux:

setsebool -P allow_ypbind 1

#============= dhcpc_t ==============
allow dhcpc_t port_t:udp_socket name_bind;
allow dhcpc_t random_device_t:chr_file getattr;
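If denials like those are actually being recorded, one way to capture them as a local policy module (rather than, or in addition to, flipping the Boolean) would be something along these lines. The module name is arbitrary and the grep pattern is only a guess at the relevant processes:

# Build and load a local SELinux policy module from recorded AVC denials.
# Review the generated nsupdate_local.te before installing it.
grep -E 'comm="(nsupdate|dhclient)"' /var/log/audit/audit.log | audit2allow -M nsupdate_local
semodule -i nsupdate_local.pp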
Could be a simple fix for an annoying issue. Can QE check if this addresses the issue?
The issue behaves as if, when the randomness runs dry, the process tries to seed more entropy and then gets a denial. This is one possible explanation for why it happens only "occasionally". I discovered this when backporting a dynamic DNS script from RHEL 7 to RHEL 6.
What info is needed?
Jason Pyeron, I don't understand how your suggestion in comment 18 relates to intermittent DNS update failures. You may be seeing a different problem. What is the error that you see? Are you seeing denials in /var/log/audit/audit.log?

We enable the allow_ypbind Boolean in order to allow the broker to send updates to the nameserver:

# sesearch -A -s httpd_t -t port_t -c udp_socket -C
Found 4 semantic av rules:
   allow httpd_t port_type : udp_socket { recv_msg send_msg } ;
DT allow httpd_t port_type : udp_socket { recv_msg send_msg } ; [ allow_ypbind ]
DT allow httpd_t port_t : udp_socket name_bind ; [ allow_ypbind ]
ET allow httpd_t port_t : udp_socket name_bind ; [ httpd_verify_dns ]

(It looks like the httpd_verify_dns Boolean would work as well, and it would be more targeted, but for whatever reason, we use the allow_ypbind Boolean in OpenShift Enterprise.)

Without this Boolean enabled, we would see the following SELinux denial on the broker host when creating or deleting an OpenShift application:

avc: denied { name_bind } for pid=12340 comm="nsupdate" src=3610 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:port_t:s0 tclass=udp_socket

(On the user end, rhc would report "Error adding DNS application record" or "Error removing DNS application record".) Also, the failure rate would be 100%, not 1.2%.

The allow_ypbind Boolean does not provide access to random_device_t:

# sesearch -A -t random_device_t -b allow_ypbind
#

The only thing that should be running with the dhcpc_t type label is dhclient, so I do not understand how that is relevant.

Are you seeing errors similar to those described in the initial report? Are they intermittent? If so, please provide more detail about what you are seeing here. If you are seeing something different, please open a separate Bugzilla report to cover the issue. Thanks!
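For anyone hitting this, a quick way to rule SELinux in or out on the broker host is to check the Booleans and recent denials directly. This is only a sketch using standard RHEL 6 tool names and paths; adjust as needed:

# Confirm the Booleans the broker relies on for DNS updates are enabled.
getsebool allow_ypbind httpd_verify_dns
# Look for recent AVC denials from nsupdate.
ausearch -m avc -c nsupdate --start recent
# Re-enable the Boolean persistently if it is off.
setsebool -P allow_ypbind on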
Given that the two customer cases are no longer hitting this issue, we don't have much information to try and reproduce this. We'll close this for now due to lack of information. If this comes up again and we can get more information to try and reproduce it, then please reopen!
Adding this back to my backlog to re-investigate. Re comment 24: as I recall (from my admittedly bad memory), the observed problem was identical. Re comment 18: I would likely find, after investigation, that the ypbind portion of the solution has no relevance. Start of speculation without merit: without any review of the problem or logs, I would proffer that the intermittence is due to a cache/buffer full/not-full pattern; do something new, need more randomness than is available, and so on. End of speculation without merit.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days