Bug 963723 - mod_cluster: proxy DNS lookup failure with IPv6 on Solaris
Status: VERIFIED
Product: JBoss Enterprise Web Server 2
Classification: JBoss
Component: mod_cluster
Version: 2.0.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER02
Target Release: 2.1.0
Assigned To: Jean-frederic Clere
QA Contact: Michal Karm Babacek
Depends On:
Blocks:
Reported: 2013-05-16 09:19 EDT by Michal Karm Babacek
Modified: 2015-09-30 14:03 EDT (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously in JBoss Enterprise Web Server, Java returned IPv6 addresses with a zone value, similar to <literal>2001:db8:0:f101::1%2</literal>. Subsequently, when returning a node address, the <parameter>modcluster</parameter> subsystem sent the IPv6 information as it existed in Java. On Solaris, <methodname>apr_sockaddr_info_get()</methodname> did not support the returned format and failed to resolve the IP as a host name. As a result, httpd mod_cluster did not work as expected with IPv6 node addresses. This issue is fixed in JBoss Enterprise Web Server 2.1.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker MODCLUSTER-339 Critical Closed "proxy: DNS lookup failure" with IPv6 on Solaris 2016-07-31 15:12 EDT

Description Michal Karm Babacek 2013-05-16 09:19:04 EDT
https://issues.jboss.org/browse/MODCLUSTER-339
Comment 1 JBoss JIRA Server 2013-05-16 10:39:55 EDT
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-339

As a comparison, here is a healthy debug log from a mod_cluster IPv6 test on RHEL [^error_log-mod_cluster-RHEL].
Comment 2 JBoss JIRA Server 2013-05-16 11:22:33 EDT
Jean-Frederic Clere <jfclere@jboss.org> made a comment on jira MODCLUSTER-339

%5B2620%3A52%3A0%3A105f%3A0%3A0%3Affff%3A50%252%5D
that is [ ... %2], which is not a valid address.
What is configured on the AS7 side?
Comment 3 JBoss JIRA Server 2013-05-16 12:10:15 EDT
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-339

{code}
<interfaces>
    <interface name="management">
        <inet-address value="2620:52:0:105f::ffff:50"/>
    </interface>
    <interface name="public">
        <inet-address value="2620:52:0:105f::ffff:50"/>
    </interface>
    <interface name="unsecure">
        <inet-address value="${jboss.bind.address.unsecure:127.0.0.1}"/>
    </interface>
</interfaces>
<socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
    <socket-binding name="management-native" interface="management" port="${jboss.management.native.port:9999}"/>
    <socket-binding name="management-http" interface="management" port="${jboss.management.http.port:9990}"/>
    <socket-binding name="management-https" interface="management" port="${jboss.management.https.port:9443}"/>
    <socket-binding name="ajp" port="8009"/>
    <socket-binding name="http" port="8080"/>
    <socket-binding name="https" port="8443"/>
    <socket-binding name="jgroups-mping" port="0" multicast-address="ff01::3" multicast-port="45700"/>
    <socket-binding name="jgroups-tcp" port="7600"/>
    <socket-binding name="jgroups-tcp-fd" port="57600"/>
    <socket-binding name="jgroups-udp" port="55200" multicast-address="ff01::3" multicast-port="45688"/>
    <socket-binding name="jgroups-udp-fd" port="54200"/>
    <socket-binding name="modcluster" port="0" multicast-address="ff01::7" multicast-port="23964"/>
    <socket-binding name="remoting" port="4447"/>
    <socket-binding name="txn-recovery-environment" port="4712"/>
    <socket-binding name="txn-status-manager" port="4713"/>
    <outbound-socket-binding name="mail-smtp">
    <remote-destination host="localhost" port="25"/>
    </outbound-socket-binding>
</socket-binding-group>
{code}
Comment 4 JBoss JIRA Server 2013-05-16 12:46:18 EDT
Jean-Frederic Clere <jfclere@jboss.org> made a comment on jira MODCLUSTER-339

OK, it seems EAP/AS adds the %2 in the URL, which causes a problem on Solaris. That needs to be fixed.
Comment 5 JBoss JIRA Server 2013-05-17 07:31:16 EDT
Jean-Frederic Clere <jfclere@jboss.org> made a comment on jira MODCLUSTER-339

It looks like apr behaves differently on Solaris and Linux:
    rv = apr_sockaddr_info_get(&sa, "2001:db8:0:f101::1%2", APR_UNSPEC, 80, 0, p);
works on Linux but not on Solaris. It seems Solaris doesn't like the %.
Comment 6 JBoss JIRA Server 2013-05-24 08:46:48 EDT
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-339

h3. Thinking aloud
I do not understand why we should put the zone there at all. What should httpd, as a server, do with it?
I tried to look up some httpd tests with IPv6, and I found only this one, which does not use a zone id:
[httpd-2.2.23/srclib/apr/test/testsock.c:314|https://gist.github.com/Karm/5642351#file-testsock-c-L314]
 
Furthermore, I examined the functions in {{httpd-2.2.23/srclib/apr/network_io/unix/sockaddr.c}} leading to {{getaddrinfo(hostname, servname, &hints, &ai_list);}}

Solaris POSIX mumbo-jumbo reveals a nice doc for [getaddrinfo()|http://docs.oracle.com/cd/E23823_01/html/816-5170/getaddrinfo-3socket.html#scrolltoc]

{quote}
The {{nodename}} can also be an IPv6 zone-id in the form:
{code}
<address>%<zone-id>
{code}
The address is the literal IPv6 link-local address or host name of the destination. The zone-id is the interface ID of the IPv6 link used to send the packet. The zone-id can either be a numeric value, indicating a literal zone value, or an interface name such as hme0.
{quote}

OK, we should be able to put %num there; still, why should httpd be interested in the worker's interface zone id? It is not going to bind to it...
I guess there is even room for a nasty error where, given that the zone id has priority over the actual address, httpd would try to use a specific interface just because it was given an unnecessary zone id... Dunno :-(

h3. Toss % out
How about stripping the %num from the CONFIG message on the native side? As I stated above, it's IMHO useless there anyhow.

{code:title=RHEL with zone %666|borderStyle=solid|borderColor=#ccc| titleBGColor=#F7D6C1}
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(655): add_balancer_node: Create balancer balancer://qacluster
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180%666]:8009
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(549): proxy: initialized single connection worker 1 in child 10070 for (2620:52:0:102f:221:5eff:fe96:8180%666)
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180%666]:8009 1 (status): 129
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
{code}

OK, RHEL can handle it, SOLARIS can't. On the other hand:

{code:title=RHEL without any zone in the message|borderStyle=solid|borderColor=#ccc| titleBGColor=#F7D6C1}
[Fri May 24 06:37:47 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180]:8009
[Fri May 24 06:37:47 2013] [debug] mod_proxy_cluster.c(549): proxy: initialized single connection worker 1 in child 9967 for (2620:52:0:102f:221:5eff:fe96:8180)
[Fri May 24 06:37:47 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180]:8009 1 (status): 129
{code}
Omitting the zone from the CONFIG message seems to be doing no harm.

Solaris up and running :-)
{code:title=SOLARIS without any zone in the message|borderStyle=solid|borderColor=#ccc| titleBGColor=#F7D6C1}
[Fri May 24 08:25:15 2013] [debug] mod_manager.c(1923): manager_trans CONFIG (/)
[Fri May 24 08:25:15 2013] [debug] mod_manager.c(2598): manager_handler CONFIG (/) processing: "JVMRoute=FakeNode&Host=%5B2620%3A52%3A0%3A105f%3A%3Affff%3A60%5D&Maxattempts=1&Port=8009&Type=ajp&ping=100\r\n"
[Fri May 24 08:25:15 2013] [debug] mod_manager.c(2647): manager_handler CONFIG  OK
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(655): add_balancer_node: Create balancer balancer://qacluster
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(655): add_balancer_node: Create balancer balancer://qacluster
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(532): proxy: initialized worker 1 in child 19207 for (2620:52:0:105f::ffff:60) min=0 max=25 smax=25
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009 1 (status): 1
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2011): proxy: ajp: has acquired connection for (2620:52:0:105f::ffff:60)
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2067): proxy: connecting ajp://[2620:52:0:105f::ffff:60]:8009/ to 2620:52:0:105f::ffff:60:8009
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(532): proxy: initialized worker 1 in child 19208 for (2620:52:0:105f::ffff:60) min=0 max=25 smax=25
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009 1 (status): 1
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2193): proxy: connected / to 2620:52:0:105f::ffff:60:8009
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2444): proxy: ajp: fam 26 socket created to connect to 2620:52:0:105f::ffff:60
{code}

Without *%something* in the Host attribute of the CONFIG message, there is no nasty *DNS lookup failure* and everything seems to be cool (not yet thoroughly tested though).

The aforementioned log was produced with this fake message:

{code}
{ echo "CONFIG / HTTP/1.0"; echo "Content-length: 108"; echo ""; echo "JVMRoute=FakeNode&Host=%5B2620%3A52%3A0%3A105f%3A%3Affff%3A60%5D&Maxattempts=1&Port=8009&Type=ajp&ping=100"; sleep 1; } | telnet 2620:52:0:105f::ffff:60 6666
{code}
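For reference, the hard-coded Content-length in that one-liner can be derived from the body itself rather than counted by hand; a small shell sketch (assuming a POSIX shell, and that the body is terminated by a CRLF on the wire, which telnet normally produces by translating the trailing newline):

```shell
# Rebuild the fake CONFIG message, computing Content-length instead of
# hard-coding 108 (106 body bytes + 2 for the terminating \r\n).
BODY='JVMRoute=FakeNode&Host=%5B2620%3A52%3A0%3A105f%3A%3Affff%3A60%5D&Maxattempts=1&Port=8009&Type=ajp&ping=100'
LEN=$(( ${#BODY} + 2 ))
printf 'CONFIG / HTTP/1.0\r\nContent-length: %d\r\n\r\n%s\r\n' "$LEN" "$BODY"
```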

What do you think about it?
Comment 7 JBoss JIRA Server 2013-05-24 15:22:52 EDT
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-339

Regarding the idea of removing the zone id, how about this: [https://github.com/modcluster/mod_cluster/pull/20/] ?
Comment 8 Jean-frederic Clere 2013-05-28 11:41:30 EDT
Not in EAP6.1 = not EWS 2.0.1 = unknown bug.
Comment 9 Michal Karm Babacek 2013-05-28 12:01:33 EDT
(In reply to Jean-frederic Clere from comment #8)
> Not in EAP6.1 = not EWS 2.0.1 = unknown bug.

??? Um...the bug is present both in EAP6.1 and EWS 2.0.1 :-)
Comment 10 Jean-frederic Clere 2013-05-29 02:38:02 EDT
Yes, the bug is in both EWS 2.0.1 and EAP 6.1, as we share the same sources.
Comment 11 Michal Karm Babacek 2013-05-29 04:45:26 EDT
(In reply to Jean-frederic Clere from comment #10)
> Yes the bug is both in EWS 2.0.1 and EAP 6.1 has we share the same sources.

I am lost. Why did you send this comment then?

Jean-frederic Clere 2013-05-28 11:41:30 EDT
Not in EAP6.1 = not EWS 2.0.1 = unknown bug.
Flags: devel_ack-

?
Comment 12 Jean-frederic Clere 2013-05-29 05:50:02 EDT
Not fixed in EAP6.1 = not fixed in EWS 2.0.1 = known bug.
Comment 16 Libor Fuka 2013-05-30 09:40:50 EDT
I set requires_doc_text to ?
Comment 17 Libor Fuka 2013-05-30 09:42:48 EDT
It needs to be in the release notes.
Comment 18 Jean-frederic Clere 2013-05-30 09:55:15 EDT
Cause:

Java returns IPv6 addresses with a zone, like "2001:db8:0:f101::1%2". When returning a node address, the modcluster subsystem sends the IPv6 address as it sees it in Java. Solaris apr_sockaddr_info_get() doesn't support that format and tries (and fails) to resolve the IP as a hostname.


Consequence:

httpd mod_cluster won't work with nodes that have IPv6 addresses.

Fix: Use an IPv4 address for nodes when httpd runs on Solaris.

Result:
Comment 19 Jean-frederic Clere 2013-05-30 10:04:56 EDT
The work-around is to use address="hostname" in the connector in the web subsystem.
Comment 20 JBoss JIRA Server 2013-05-30 11:28:07 EDT
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-339

[~jfclere] I have been investigating further and you might find these notes useful:
h4. IPv6 works if we remove % and zone id
The "fix", or rather a workaround, in [/pull/20/|https://github.com/modcluster/mod_cluster/pull/20/] really made IPv6 work on Solaris 11 SPARC64. I tested with the attached [^mod_manager.so] (built from [/pull/20/|https://github.com/modcluster/mod_cluster/pull/20/] sources for sparc64, *apxs* from httpd-2.2.23). Here is the debug log from the successful test: [^error_log_pull20].

h4. Actual apr_sockaddr_info_get source code
I wondered what the actual difference is between Solaris's and Fedora's {{apr_sockaddr_info_get}}, but I am bewildered by all these macros. So I ran the preprocessor, so that I could compare the actual C code that is to be compiled on Fedora and Solaris.
{noformat}
/tmp/native/httpd/httpd-2.2.23/srclib/apr
gcc -E -P -g -Wall -Wmissing-prototypes -Wstrict-prototypes -Wmissing-declarations -m64 -DSSL_EXPERIMENTAL -DSSL_ENGINE -DHAVE_CONFIG_H -DSOLARIS2=11 -D_POSIX_PTHREAD_SEMANTICS -D_REENTRANT -I./include -I/tmp/native/httpd/httpd-2.2.23/srclib/apr/include/arch/unix -I./include/arch/unix -I/tmp/native/httpd/httpd-2.2.23/srclib/apr/include/arch/unix -I/tmp/native/httpd/httpd-2.2.23/srclib/apr/include -o network_io/unix/sockaddr.lo -c network_io/unix/sockaddr.c
{noformat}

One may find the resulting files attached as [^sockaddr.lo_fedora18_x86_64], [^sockaddr.lo_solaris11_sparc64].
I took a look at differences in
 * {{static apr_status_t find_addresses(apr_sockaddr_t **sa, const char *hostname, apr_int32_t family, apr_port_t port, apr_int32_t flags, apr_pool_t *p)}}
 * {{call_resolver(apr_sockaddr_t **sa, const char *hostname, apr_int32_t family, apr_port_t port, apr_int32_t flags, apr_pool_t *p)}}

but it all boils down to the system's:
 {{getaddrinfo(hostname, servname, &hints, &ai_list);}}
that, as far as I was able to look up, [supports %zoneid syntax|http://docs.oracle.com/cd/E23823_01/html/816-5170/getaddrinfo-3socket.html#scrolltoc]...
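To make the platform question concrete, here is a minimal probe of that underlying call (not APR itself, just the {{getaddrinfo()}} it boils down to; the helper name is invented for this sketch). On Linux the plain numeric literal parses; the zoned form is exactly the platform-dependent case under discussion, so no particular outcome is claimed for it:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>
#include <stdio.h>

/* Returns 1 if getaddrinfo() accepts the string as a numeric host,
 * 0 otherwise. AI_NUMERICHOST forbids any DNS lookup, so this tests
 * only the parser -- the part that differs between platforms here. */
static int resolves(const char *host)
{
    struct addrinfo hints, *ai = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_flags  = AI_NUMERICHOST;
    int rc = getaddrinfo(host, NULL, &hints, &ai);
    if (ai)
        freeaddrinfo(ai);
    return rc == 0;
}

int main(void)
{
    printf("plain literal: %d\n", resolves("2001:db8:0:f101::1"));
    /* Accepted on Linux, reportedly rejected on Solaris: */
    printf("zoned literal: %d\n", resolves("2001:db8:0:f101::1%2"));
    return 0;
}
```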

So, I can't really see how {{apr_sockaddr_info_get}} could fail us. There is not much code in it:
Solaris 11 SPARC64:
{code}
apr_status_t apr_sockaddr_info_get(apr_sockaddr_t **sa,
                                                const char *hostname,
                                                apr_int32_t family, apr_port_t port,
                                                apr_int32_t flags, apr_pool_t *p)
{
    apr_int32_t masked;
    *sa = 0L;
    if ((masked = flags & (0x01 | 0x02))) {
        if (!hostname ||
            family != 0 ||
            masked == (0x01 | 0x02)) {
            return 22;
        }
    }
    return find_addresses(sa, hostname, family, port, flags, p);
}
{code}
the only difference from Fedora build being on line 7, {{*sa = ((void *)0);}}.

uh...
Comment 21 Misha H. Ali 2013-06-02 21:37:02 EDT
Thank you, Jean-Frederic. Documenting this as a known issue for 2.0.1.
Comment 22 JBoss JIRA Server 2013-08-29 04:29:52 EDT
Jean-Frederic Clere <jfclere@jboss.org> updated the status of jira MODCLUSTER-339 to Resolved
Comment 23 JBoss JIRA Server 2013-09-19 11:50:46 EDT
Michal Babacek <mbabacek@redhat.com> updated the status of jira MODCLUSTER-339 to Closed
Comment 24 JBoss JIRA Server 2013-09-19 11:50:46 EDT
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-339

Verified with mod_cluster 1.2.6 :-)
Comment 25 Michal Karm Babacek 2014-06-11 11:39:36 EDT
Let's switch it to ON_QA...
Comment 26 Michal Karm Babacek 2014-06-11 11:42:40 EDT
...and let's flip it to VERIFIED :-)

Solaris 10 sparc
[Wed Jun 11 06:14:34 2014] [debug] mod_manager.c(2623): manager_handler CONFIG (/) processing: "JVMRoute=jboss-eap-6.3&Host=%5B2620%3A52%3A0%3A105f%3A0%3A0%3Affff%3Af8%252%5D&Maxattempts=1&Port=8009&StickySessionForce=No&Type=ajp&ping=10"
[Wed Jun 11 06:14:34 2014] [debug] mod_manager.c(2672): manager_handler CONFIG  OK

EWS 2.1.0.ER2
Comment 27 Mandar Joshi 2014-08-08 07:55:26 EDT
Changed Doc Type to Bug Fix.

Modified Doc Text:
Updated EWS version to 2.1.0.
Added sentence: This issue is fixed in EWS 2.1.0.
