Bug 963720

Summary:	mod_cluster: proxy DNS lookup failure with IPv6 on Solaris
Product:	[JBoss] JBoss Enterprise Application Platform 6	Reporter:	Michal Karm Babacek <mbabacek>
Component:	mod_cluster	Assignee:	Jean-frederic Clere <jclere>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Michal Karm Babacek <mbabacek>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	6.1.0	CC:	jclere, rdickens, smumford
Target Milestone:	ER2
Target Release:	EAP 6.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	In previous versions of EAP 6 it was found that attempting to use IPv6 addresses within a Solaris system would result in a DNS lookup failure. The source of this issue was traced to the IPv6 zone-id string of IPv6 addresses. Since this information is of no use to httpd, the string is no longer used and mod_cluster now operates as expected on Solaris systems.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-12-15 16:21:27 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---

Description Michal Karm Babacek 2013-05-16 13:16:46 UTC

https://issues.jboss.org/browse/MODCLUSTER-339

Comment 1 JBoss JIRA Server 2013-05-16 14:39:54 UTC

Michal Babacek <mbabacek> made a comment on jira MODCLUSTER-339

As a comparison, here is a healthy debug log from a mod_cluster IPv6 test on RHEL [^error_log-mod_cluster-RHEL].

Comment 2 JBoss JIRA Server 2013-05-16 15:22:33 UTC

Jean-Frederic Clere <jfclere> made a comment on jira MODCLUSTER-339

%5B2620%3A52%3A0%3A105f%3A0%3A0%3Affff%3A50%252%5D
That decodes to [ ... %2], which is not a valid address.
What is configured on the AS7 side?
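
For reference, the Host value above is percent-encoded, and {{%25}} is itself an encoded {{%}}, so the decoded address really does end in {{%2}}. A minimal C sketch of the decoding, using a hypothetical helper {{percent_decode}} that is not part of mod_cluster:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: numeric value of one hex digit.
 * Only call this after isxdigit() has accepted the character. */
static int hex_value(char c)
{
    static const char digits[] = "0123456789abcdef";
    const char *p = strchr(digits, tolower((unsigned char)c));
    return (int)(p - digits);
}

/* Decode %XX escapes from src into dst (dst_size bytes, NUL-terminated).
 * Each escape is decoded exactly once, so "%252" becomes "%2".
 * Returns dst, or NULL if dst is too small. */
static char *percent_decode(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0) {
        return NULL;
    }
    while (*src != '\0') {
        char c = *src;
        if (c == '%' && isxdigit((unsigned char)src[1])
                     && isxdigit((unsigned char)src[2])) {
            c = (char)(hex_value(src[1]) * 16 + hex_value(src[2]));
            src += 3;
        } else {
            src++;
        }
        if (n + 1 >= dst_size) {
            return NULL;    /* no room left for this byte plus the NUL */
        }
        dst[n++] = c;
    }
    dst[n] = '\0';
    return dst;
}
```

Decoding the value quoted above with this helper yields {{[2620:52:0:105f:0:0:ffff:50%2]}}.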

Comment 3 JBoss JIRA Server 2013-05-16 16:10:14 UTC

Michal Babacek <mbabacek> made a comment on jira MODCLUSTER-339

{code}
<interfaces>
    <interface name="management">
        <inet-address value="2620:52:0:105f::ffff:50"/>
    </interface>
    <interface name="public">
        <inet-address value="2620:52:0:105f::ffff:50"/>
    </interface>
    <interface name="unsecure">
        <inet-address value="${jboss.bind.address.unsecure:127.0.0.1}"/>
    </interface>
</interfaces>
<socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
    <socket-binding name="management-native" interface="management" port="${jboss.management.native.port:9999}"/>
    <socket-binding name="management-http" interface="management" port="${jboss.management.http.port:9990}"/>
    <socket-binding name="management-https" interface="management" port="${jboss.management.https.port:9443}"/>
    <socket-binding name="ajp" port="8009"/>
    <socket-binding name="http" port="8080"/>
    <socket-binding name="https" port="8443"/>
    <socket-binding name="jgroups-mping" port="0" multicast-address="ff01::3" multicast-port="45700"/>
    <socket-binding name="jgroups-tcp" port="7600"/>
    <socket-binding name="jgroups-tcp-fd" port="57600"/>
    <socket-binding name="jgroups-udp" port="55200" multicast-address="ff01::3" multicast-port="45688"/>
    <socket-binding name="jgroups-udp-fd" port="54200"/>
    <socket-binding name="modcluster" port="0" multicast-address="ff01::7" multicast-port="23964"/>
    <socket-binding name="remoting" port="4447"/>
    <socket-binding name="txn-recovery-environment" port="4712"/>
    <socket-binding name="txn-status-manager" port="4713"/>
    <outbound-socket-binding name="mail-smtp">
        <remote-destination host="localhost" port="25"/>
    </outbound-socket-binding>
</socket-binding-group>
{code}

Comment 4 JBoss JIRA Server 2013-05-16 16:46:17 UTC

Jean-Frederic Clere <jfclere> made a comment on jira MODCLUSTER-339

OK, it seems EAP/AS adds the %2 to the URL, which causes a problem on Solaris. That needs to be fixed.

Comment 5 JBoss JIRA Server 2013-05-17 11:31:15 UTC

Jean-Frederic Clere <jfclere> made a comment on jira MODCLUSTER-339

It looks like apr behaves differently on Solaris and Linux:
    rv = apr_sockaddr_info_get(&sa, "2001:db8:0:f101::1%2", APR_UNSPEC, 80, 0, p);
works on Linux but not on Solaris. It seems Solaris doesn't like the %.
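
One way to check this outside httpd is to call the system resolver directly. The sketch below is a stand-in under stated assumptions: it uses plain POSIX {{getaddrinfo()}} with {{AI_NUMERICHOST}} instead of APR, and {{probe_numeric_host}} is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical probe: returns the getaddrinfo() result code (0 on success)
 * for a numeric host literal. AI_NUMERICHOST keeps DNS out of the picture,
 * so a failure points at the parsing of the literal itself. */
static int probe_numeric_host(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    assert(host != NULL);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* counterpart of APR_UNSPEC above */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc == 0) {
        freeaddrinfo(res);
    }
    return rc;
}
```

Calling it with {{"2001:db8:0:f101::1%2"}} on each platform and passing a non-zero result to {{gai_strerror()}} should show whether the zone id is what the resolver rejects.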

Comment 6 Jean-frederic Clere 2013-05-22 08:13:12 UTC

Cause: 

Solaris doesn't support IPv6 zone (%n) in apr_sockaddr_info_get()


Consequence: 

mod_cluster can't work with nodes with IPv6 addresses on Solaris

Workaround (if any): 

None

Result: 
.

Comment 7 JBoss JIRA Server 2013-05-24 12:46:46 UTC

Michal Babacek <mbabacek> made a comment on jira MODCLUSTER-339

h3. Thinking aloud
I do not understand why we should put a zone there at all. What should httpd, as a server, do with it?
I tried to look up some httpd tests with IPv6 and found only this one, which does not use a zone id:
[httpd-2.2.23/srclib/apr/test/testsock.c:314|https://gist.github.com/Karm/5642351#file-testsock-c-L314]
 
Furthermore, I examined the functions in {{httpd-2.2.23/srclib/apr/network_io/unix/sockaddr.c}} leading to {{getaddrinfo(hostname, servname, &hints, &ai_list);}}

Some Solaris POSIX mumbo jumbo reveals a nice doc for [getaddrinfo()|http://docs.oracle.com/cd/E23823_01/html/816-5170/getaddrinfo-3socket.html#scrolltoc]

{quote}
The {{nodename}} can also be an IPv6 zone-id in the form:
{code}
<address>%<zone-id>
{code}
The address is the literal IPv6 link-local address or host name of the destination. The zone-id is the interface ID of the IPv6 link used to send the packet. The zone-id can either be a numeric value, indicating a literal zone value, or an interface name such as hme0.
{quote}

OK, we should be able to put %num there. Still, why should httpd be interested in the worker's interface zone id? It is not going to bind to it...
I guess there is even room for a nasty error where, given that the zone id has priority over the actual address, httpd will try to use a specific interface just because it was given an unnecessary zone id... Dunno :-(
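
To see what a resolver does with the zone id, one can inspect {{sin6_scope_id}} in the address it returns. This is a sketch using plain {{getaddrinfo()}}; {{ipv6_scope_id}} is a hypothetical helper, and the snippet makes no assumption about how a given platform treats a zone on a non-link-local address.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

/* Hypothetical helper: resolves a numeric IPv6 literal and returns the
 * sin6_scope_id the resolver derived from its zone id (0 when no zone is
 * given), or -1 if getaddrinfo() rejects the literal. */
static long ipv6_scope_id(const char *literal)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    struct sockaddr_in6 sin6;

    assert(literal != NULL);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET6;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;    /* parse only, no DNS lookup */

    if (getaddrinfo(literal, NULL, &hints, &res) != 0) {
        return -1;
    }
    /* Copy out of the result to avoid alignment assumptions on ai_addr. */
    memcpy(&sin6, res->ai_addr, sizeof sin6);
    freeaddrinfo(res);
    return (long)sin6.sin6_scope_id;
}
```

A non-zero return value for a zoned literal means the connection would be tied to that scope, which is the situation described above.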

h3. Toss % out
How about stripping the %num from the CONFIG message on the native side? As I stated above, it's IMHO useless there anyhow.
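
A minimal sketch of the stripping idea in C. This is an illustration only, not the code used on the native side: {{strip_zone_id}} is a hypothetical helper, and it assumes the Host value has already been percent-decoded.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: removes an IPv6 zone id from host, in place.
 * The zone id is taken to be everything from '%' up to the closing ']',
 * or up to the end of the string when there is no bracket.
 * Returns host for convenience. */
static char *strip_zone_id(char *host)
{
    char *zone;
    char *end;

    assert(host != NULL);
    zone = strchr(host, '%');
    if (zone == NULL) {
        return host;                        /* nothing to strip */
    }
    end = strchr(zone, ']');
    if (end == NULL) {
        *zone = '\0';                       /* bare literal: cut at '%' */
    } else {
        memmove(zone, end, strlen(end) + 1);   /* keep "]" and the NUL */
    }
    return host;
}
```

With this helper, {{[2620:52:0:105f:0:0:ffff:50%2]}} becomes {{[2620:52:0:105f:0:0:ffff:50]}}, and a host without a zone id is returned unchanged.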

{code:title=RHEL with zone %666|borderStyle=solid|borderColor=#ccc| titleBGColor=#F7D6C1}
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(655): add_balancer_node: Create balancer balancer://qacluster
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180%666]:8009
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(549): proxy: initialized single connection worker 1 in child 10070 for (2620:52:0:102f:221:5eff:fe96:8180%666)
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180%666]:8009 1 (status): 129
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 06:44:25 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
{code}

OK, RHEL can handle it; Solaris can't. On the other hand:

{code:title=RHEL without any zone in the message|borderStyle=solid|borderColor=#ccc| titleBGColor=#F7D6C1}
[Fri May 24 06:37:47 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180]:8009
[Fri May 24 06:37:47 2013] [debug] mod_proxy_cluster.c(549): proxy: initialized single connection worker 1 in child 9967 for (2620:52:0:102f:221:5eff:fe96:8180)
[Fri May 24 06:37:47 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:102f:221:5eff:fe96:8180]:8009 1 (status): 129
{code}
Omitting the zone from the CONFIG message seems to be doing no harm.

Solaris up and running: :-)
{code:title=SOLARIS without any zone in the message|borderStyle=solid|borderColor=#ccc| titleBGColor=#F7D6C1}
[Fri May 24 08:25:15 2013] [debug] mod_manager.c(1923): manager_trans CONFIG (/)
[Fri May 24 08:25:15 2013] [debug] mod_manager.c(2598): manager_handler CONFIG (/) processing: "JVMRoute=FakeNode&Host=%5B2620%3A52%3A0%3A105f%3A%3Affff%3A60%5D&Maxattempts=1&Port=8009&Type=ajp&ping=100\r\n"
[Fri May 24 08:25:15 2013] [debug] mod_manager.c(2647): manager_handler CONFIG  OK
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(655): add_balancer_node: Create balancer balancer://qacluster
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(655): add_balancer_node: Create balancer balancer://qacluster
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(532): proxy: initialized worker 1 in child 19207 for (2620:52:0:105f::ffff:60) min=0 max=25 smax=25
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009 1 (status): 1
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2011): proxy: ajp: has acquired connection for (2620:52:0:105f::ffff:60)
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2067): proxy: connecting ajp://[2620:52:0:105f::ffff:60]:8009/ to 2620:52:0:105f::ffff:60:8009
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(426): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(532): proxy: initialized worker 1 in child 19208 for (2620:52:0:105f::ffff:60) min=0 max=25 smax=25
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(601): Created: worker for ajp://[2620:52:0:105f::ffff:60]:8009 1 (status): 1
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1010): update_workers_node starting
[Fri May 24 08:25:15 2013] [debug] mod_proxy_cluster.c(1025): update_workers_node done
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2193): proxy: connected / to 2620:52:0:105f::ffff:60:8009
[Fri May 24 08:25:15 2013] [debug] proxy_util.c(2444): proxy: ajp: fam 26 socket created to connect to 2620:52:0:105f::ffff:60
{code}

Without *%something* in the Host attribute of the CONFIG message, there is no nasty *DNS lookup failure* and everything seems to be cool (not yet thoroughly tested though).

The aforementioned log was produced with this fake message:

{code}
{ echo "CONFIG / HTTP/1.0"; echo "Content-length: 108"; echo ""; echo "JVMRoute=FakeNode&Host=%5B2620%3A52%3A0%3A105f%3A%3Affff%3A60%5D&Maxattempts=1&Port=8009&Type=ajp&ping=100"; sleep 1; } | telnet 2620:52:0:105f::ffff:60 6666
{code}
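
The same request can be assembled in C. The sketch assumes, based on the numbers above (a 106-character body, {{Content-length: 108}}, and the trailing {{\r\n}} in the log), that the CRLF after the body is counted in Content-length. {{build_config_request}} is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a CONFIG request like the fake message above.
 * The body is followed by CRLF, and that CRLF is included in Content-length.
 * Returns the Content-length value, or -1 if buf is too small. */
static int build_config_request(char *buf, size_t size, const char *body)
{
    int content_length;
    int written;

    assert(buf != NULL && body != NULL);
    content_length = (int)strlen(body) + 2;     /* body plus "\r\n" */
    written = snprintf(buf, size,
                       "CONFIG / HTTP/1.0\r\n"
                       "Content-length: %d\r\n"
                       "\r\n"
                       "%s\r\n",
                       content_length, body);
    if (written < 0 || (size_t)written >= size) {
        return -1;                              /* output was truncated */
    }
    return content_length;
}
```

For the body used in the fake message, the helper reports a Content-length of 108, matching the value typed by hand above.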

What do you think about it?

Comment 8 JBoss JIRA Server 2013-05-24 19:22:51 UTC

Michal Babacek <mbabacek> made a comment on jira MODCLUSTER-339

Regarding the idea of removing the zone id, how about this: [https://github.com/modcluster/mod_cluster/pull/20/] ?

Comment 9 JBoss JIRA Server 2013-05-30 15:28:05 UTC

Michal Babacek <mbabacek> made a comment on jira MODCLUSTER-339

[~jfclere] I have been investigating further and you might find these notes useful:
h4. IPv6 works if we remove % and zone id
The "fix", or rather a workaround, in [/pull/20/|https://github.com/modcluster/mod_cluster/pull/20/] really made IPv6 work on Solaris 11 SPARC64. I tested with the attached [^mod_manager.so] (built from [/pull/20/|https://github.com/modcluster/mod_cluster/pull/20/] sources for sparc64, *apxs* from httpd-2.2.23). Here is the debug log from the successful test: [^error_log_pull20].

h4. Actual apr_sockaddr_info_get source code
I wondered what the actual difference is between Solaris's and Fedora's {{apr_sockaddr_info_get}}, but I am bewildered by all these macros. So I ran the preprocessor in order to compare the actual C code that gets compiled on Fedora and on Solaris.
{noformat}
/tmp/native/httpd/httpd-2.2.23/srclib/apr
gcc -E -P -g -Wall -Wmissing-prototypes -Wstrict-prototypes -Wmissing-declarations -m64 -DSSL_EXPERIMENTAL -DSSL_ENGINE -DHAVE_CONFIG_H -DSOLARIS2=11 -D_POSIX_PTHREAD_SEMANTICS -D_REENTRANT -I./include -I/tmp/native/httpd/httpd-2.2.23/srclib/apr/include/arch/unix -I./include/arch/unix -I/tmp/native/httpd/httpd-2.2.23/srclib/apr/include/arch/unix -I/tmp/native/httpd/httpd-2.2.23/srclib/apr/include -o network_io/unix/sockaddr.lo -c network_io/unix/sockaddr.c
{noformat}

One may find resulting files attached as [^sockaddr.lo_fedora18_x86_64], [^sockaddr.lo_solaris11_sparc64].
I took a look at differences in
 * {{static apr_status_t find_addresses(apr_sockaddr_t **sa, const char *hostname, apr_int32_t family, apr_port_t port, apr_int32_t flags, apr_pool_t *p)}}
 * {{call_resolver(apr_sockaddr_t **sa, const char *hostname, apr_int32_t family, apr_port_t port, apr_int32_t flags, apr_pool_t *p)}}

but it all boils down to the system's:
 {{getaddrinfo(hostname, servname, &hints, &ai_list);}}
that, as far as I was able to look up, [supports %zoneid syntax|http://docs.oracle.com/cd/E23823_01/html/816-5170/getaddrinfo-3socket.html#scrolltoc]...

So, I can't really see how {{apr_sockaddr_info_get}} could fail us. There is not much code in it:
Solaris 11 SPARC64:
{code}
apr_status_t apr_sockaddr_info_get(apr_sockaddr_t **sa,
                                                const char *hostname,
                                                apr_int32_t family, apr_port_t port,
                                                apr_int32_t flags, apr_pool_t *p)
{
    apr_int32_t masked;
    *sa = 0L;
    if ((masked = flags & (0x01 | 0x02))) {
        if (!hostname ||
            family != 0 ||
            masked == (0x01 | 0x02)) {
            return 22;
        }
    }
    return find_addresses(sa, hostname, family, port, flags, p);
}
{code}
The only difference from the Fedora build is on line 7, which there reads {{*sa = ((void *)0);}}.

uh...

Comment 10 JBoss JIRA Server 2013-08-29 08:29:48 UTC

Jean-Frederic Clere <jfclere> updated the status of jira MODCLUSTER-339 to Resolved

Comment 11 JBoss JIRA Server 2013-09-19 15:50:43 UTC

Michal Babacek <mbabacek> updated the status of jira MODCLUSTER-339 to Closed

Comment 12 JBoss JIRA Server 2013-09-19 15:50:43 UTC

Michal Babacek <mbabacek> made a comment on jira MODCLUSTER-339

Verified with mod_cluster 1.2.6 :-)

Comment 13 Michal Karm Babacek 2013-09-19 16:25:04 UTC

Splendid :-) mod_cluster 1.2.6 works on Solaris with IPv6 like a charm. Only httpd does not: [BZ 1009987]

Comment 14 Scott Mumford 2013-11-20 21:35:43 UTC

So what did we wind up doing to resolve this? None of the linked tickets explicitly state what the fix was (that I could see, anyway).

Did we remove the problematic zone? Or find a way to make Solaris play nice with it?

Need to know how we fixed this for the Release Note.

Comment 15 Michal Karm Babacek 2013-11-20 23:09:10 UTC

Resolution was this: https://github.com/modcluster/mod_cluster/pull/20/
In my own words: An unnecessary zone string is removed from the received message.

Comment 16 Scott Mumford 2013-11-21 00:11:14 UTC

Thanks for that Michal.

I still can't see where that's stated in the link. I guess I need to get better at reading pull requests.

Have added Doc Text and marked for inclusion in EAP 6.2 Release Notes.