Escalated to Bugzilla from IssueTracker
Event posted on 03-30-2010 11:36pm CST by smayhew

Okay, I logged in and reproduced this manually before I saw your notes about the scripts. Here's the sequence of events:

node1 # ip addr add 192.168.122.22/24 dev eth0
node1 # arping -c 1 -U -I eth0 192.168.122.22
node1 # service ipsec start
node2 # service ipsec start
node1 # ipsec auto --up conn3
node2 # ipsec auto --up conn3
node1 # ip addr del 192.168.122.22/24 dev eth0
node2 # ip addr add 192.168.122.22/24 dev eth0
node2 # arping -c 1 -U -I eth0 192.168.122.22
node2 # ipsec auto --ready
node2 # ipsec auto --up conn2
node1 # ipsec auto --ready
node1 # ipsec auto --up conn2
// wait for pluto to crash

I then went and modified your reproducer script and it worked exactly once. I think it might be a problem with the order in which processes are being fired off in the background. It might be easier to just do everything in a sequential fashion, with a few seconds in between each step. I'll keep playing around with it to see if I can come up with anything.

This event sent from IssueTracker by wezhang [Support Engineering Group] issue 597033
Event posted on 06-29-2010 02:42pm CST by wezhang

This is caused by processing queued events for a connection that has already gone down (conn3 here).

The following are the connection IP addresses we defined in mesh.conf:

for conn3: 192.168.122.33 (node2), 192.168.122.22 (floating IP)
for conn2: 192.168.122.56 (node1), 192.168.122.22 (floating IP)

This bug occurs only after the following two events happen in sequence:

1. the floating IP is moved from node1 to node2
2. ipsec whack --listen is run on node2

'ipsec whack --listen' triggers the whack_process function. The source code responsible for the '--listen' request looks like:

void whack_process(int whackfd, struct whack_message msg)
{
    const struct osw_conf_options *oco = osw_init_options();
    ...
    if (msg.whack_listen)
    {
        fflush(stderr);
        fflush(stdout);
        close_peerlog();    /* close any open per-peer logs */
        openswan_log("listening for IKE messages");
        listening = TRUE;
        daily_log_reset();
        reset_adns_restart_count();
        set_myFQDN();
        find_ifaces();
        load_preshared_secrets(NULL_FD);
        load_groups();
    }
    ...
}

The call path for the interface check is then:

whack_process -> find_ifaces -> free_dead_ifaces();

The following code is from free_dead_ifaces:

static void free_dead_ifaces(void)
{
    struct iface_port *p;
    bool some_dead = FALSE
        , some_new = FALSE;

    for (p = interfaces; p != NULL; p = p->next)
    {
        if (p->change == IFN_DELETE)
        {
            openswan_log("shutting down interface %s/%s %s:%d"
                , p->ip_dev->id_vname
                , p->ip_dev->id_rname
                , ip_str(&p->ip_addr), p->port);
            some_dead = TRUE;
        }
        else if (p->change == IFN_ADD)
        {
            some_new = TRUE;
        }
    }
    ...
    /* this must be done after the release_dead_interfaces
     * in case some to the newly unoriented connections can
     * become oriented here.
     */
    if (some_dead || some_new)
        check_orientations();
}

Because 192.168.122.22 (the floating IP for conn2) has already been moved to this node, some_new = TRUE, which triggers the following piece of code in check_orientations():

void check_orientations(void)
{
    ...
    /* Check that no oriented connection has become double-oriented.
     * In other words, the far side must not match one of our new interfaces.
     */
    {
        struct iface_port *i;

        for (i = interfaces; i != NULL; i = i->next)
        {
            if (i->change == IFN_ADD)    ---> the floating IP for conn2 was added on this node, so this condition is true
            {
                struct host_pair *hp;

                for (hp = host_pairs; hp != NULL; hp = hp->next)
                {
                    if (sameaddr(&hp->him.addr, &i->ip_addr)
                        && (kern_interface!=NO_KERNEL || hp->him.host_port == pluto_port))
                    {
                        /* bad news: the whole chain of connections
                         * hanging off this host pair has both sides
                         * matching an interface.
                         * We'll get rid of them, using orient and
                         * connect_to_host_pair.  But we'll be lazy
                         * and not ditch the host_pair itself (the
                         * cost of leaving it is slight and cannot
                         * be induced by a foe).
                         */
                        struct connection *c = hp->connections;

                        hp->connections = NULL;
                        while (c != NULL)
                        {
                            struct connection *nxt = c->hp_next;

                            c->interface = NULL;
                            (void)orient(c);          --+---> reorienting conn3 here triggers this issue
                            connect_to_host_pair(c);  --/
                            c = nxt;
                        }
                    }
                }
            }
        }
    }
}

Now let's look at the orient() and connect_to_host_pair() functions:

bool orient(struct connection *c)
{
    struct spd_route *sr;
    ...
    for (;;)
    {
        /* check if this interface matches this end */
        if (sameaddr(&sr->this.host_addr, &p->ip_addr)
            && (kern_interface != NO_KERNEL || sr->this.host_port == pluto_port))
        {
            if (oriented(*c))
            {
                if (c->interface->ip_dev == p->ip_dev)
                    loglog(RC_LOG_SERIOUS
                        , "both sides of \"%s\" are our interface %s!"
                        , c->name, p->ip_dev->id_rname);
                else
                    loglog(RC_LOG_SERIOUS, "two interfaces match \"%s\" (%s, %s)"
                        , c->name, c->interface->ip_dev->id_rname, p->ip_dev->id_rname);
                c->interface = NULL;    /* withdraw orientation */
                return FALSE;    --> for conn3, both the left and right end IPs are on node2, so ->interface is set to NULL
            }
            c->interface = p;
        }
        ...
}

void connect_to_host_pair(struct connection *c)
{
    if (oriented(*c))
    {
        ...
    else
    {
        /* since this connection isn't oriented, we place it
         * in the unoriented_connections list instead.
         */
        c->host_pair = NULL;    --> host_pair becomes NULL
        c->hp_next = unoriented_connections;
        unoriented_connections = c;
    }
}

Because host_pair is now NULL, any already-queued event that needs to dereference ->host_pair (for example in host_pair_enqueue_pending/pending_check_timeout/unpend) crashes the program.

A similar situation happens on node1, but there it is caused by an interface being removed (the floating IP for conn3).

This event sent from IssueTracker by wezhang [Support Engineering Group] issue 597033
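
To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not openswan code and not the attached patch: the two structs are heavily simplified stand-ins, and model_unorient() and handle_queued_event() are hypothetical names introduced only for illustration. It models a connection losing its host_pair and shows the kind of NULL check an already-queued event handler would need before touching ->host_pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for pluto's structures. */
struct host_pair {
    int pending_count;              /* stands in for the pending-event list */
};

struct connection {
    const char *name;
    bool oriented;                  /* stands in for oriented(*c) */
    struct host_pair *host_pair;
};

/* Models the unoriented branch of connect_to_host_pair():
 * the connection ends up with no host_pair. */
static void model_unorient(struct connection *c)
{
    assert(c != NULL);
    c->oriented = false;
    c->host_pair = NULL;
}

/* Hypothetical guarded handler for an event that was queued before
 * the connection was unoriented.  Returns true if the event was
 * processed, false if it was dropped because the connection no
 * longer has a host_pair.  Without the check, the dereference
 * below would be the crash described in this report. */
static bool handle_queued_event(struct connection *c)
{
    if (c == NULL || c->host_pair == NULL)
        return false;
    c->host_pair->pending_count--;
    return true;
}
```

In this model, an event handled while the connection is still oriented is processed normally, while the same event handled after model_unorient() is dropped instead of dereferencing NULL. Whether dropping the event is the right behaviour for pluto, as opposed to, say, cancelling queued events when the connection is unoriented, is a design question this sketch does not settle.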
Created attachment 427849 [details] proposed patch
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This issue applies to current openswan, so it also affects RHEL6. We are going to check this fix and get back to you. Sorry this one slipped through.
Paul, are you using the same patch that is attached here or a different patch?
I have not had time to look into the correctness of the patch yet. We have no patch at the moment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0211.html