Bug 1419177
Summary: | rabbitmq cluster not forming with bigswitch virtual switch deployment on openstack controller nodes | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | bigswitch <rhosp-bugs-internal>
Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko>
Status: | CLOSED ERRATA | QA Contact: | bigswitch <rhosp-bugs-internal>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 8.0 (Liberty) | CC: | apevec, fdinitto, jeckersb, jjoyce, lemenkov, lhh, rhosp-bugs-internal, srevivo, ushkalim
Target Milestone: | async | Keywords: | OtherQA, Triaged, ZStream
Target Release: | 8.0 (Liberty) | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | rabbitmq-server-3.3.5-33.el7ost | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-06-20 12:48:04 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
(In reply to bigswitch from comment #0)
> Created attachment 1247567 [details]
> sos report from all the controllers
>
> Description of problem:
> When installing the BigSwitch virtual switch onto the controller nodes, the deployment fails and the RabbitMQ server does not form a cluster.
>
> Version-Release number of selected component (if applicable):
> RHOSP 8
>
> How reproducible:
> RHOSP 8 with the BigSwitch virtual switch on the OpenStack controller nodes.
>
> Steps to Reproduce:
> 1. RHOSP 8 deployment with the BigSwitch virtual switch onto the controller nodes.
>
> Additional info:
> sos report attached from all 3 controllers.

The tarball attached contains only data from one node. Could you please reupload the other tarballs?

Created attachment 1250907 [details]
controller0
Created attachment 1250908 [details]
controller1
added sos report from all controllers

Ok, a few observations. Your cluster got hit by netsplit issues, so please consider upgrading to resource-agents-3.9.5-86.el7, where netsplits are handled better (see bug #1397393 for further details).

=ERROR REPORT==== 3-Feb-2017::17:53:23 ===
** Generic server <0.2070.0> terminating
** Last message in was {'$gen_cast',{gm_deaths,[<20755.2101.0>]}}
** When Server state == {state,
    {amqqueue, {resource,<<"/">>,queue,<<"metering.sample">>},
     false,false,none,
     [{<<"x-ha-policy">>,longstr,<<"all">>}],
     <0.2067.0>,[],[],
     [{vhost,<<"/">>},
      {name,<<"ha-all">>},
      {pattern,<<"^(?!amq\\.).*">>},
      {'apply-to',<<"all">>},
      {definition,[{<<"ha-mode">>,<<"all">>}]},
      {priority,0}],
     [],[]},
    <0.2071.0>,
    {state,
     {dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
     erlang},
    #Fun<rabbit_mirror_queue_master.5.69179775>,
    #Fun<rabbit_mirror_queue_master.6.69179775>}
** Reason for termination ==
** {{case_clause,{ok,<20750.2036.0>,[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2,
      [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,354}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

This is GH#714.

=ERROR REPORT==== 3-Feb-2017::17:53:36 ===
** Generic server <0.1960.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.10102>,process,<20755.1906.0>,noconnection}
** When Server state == {state,
    {1,<0.1960.0>},
    {{0,<20750.1899.0>},#Ref<0.0.0.10101>},
    {{0,<20755.1906.0>},#Ref<0.0.0.10102>},
    {resource,<<"/">>,queue,
     <<"heat-engine-listener_fanout_87c95e6a2a2242529d00fbef9880d75e">>},
    rabbit_mirror_queue_slave,
    {2,
     [{{0,<20750.1899.0>},
       {view_member,{0,<20750.1899.0>},[],{0,<20755.1906.0>},{1,<0.1960.0>}}},
      {{0,<20755.1906.0>},
       {view_member,{0,<20755.1906.0>},[],{1,<0.1960.0>},{0,<20750.1899.0>}}},
      {{1,<0.1960.0>},
       {view_member,{1,<0.1960.0>},[],{0,<20750.1899.0>},{0,<20755.1906.0>}}}]},
    0,
    [{{0,<20750.1899.0>},{member,{[],[]},0,0}},
     {{0,<20755.1906.0>},
      {member,{[{3,{delete_and_terminate,normal}}],[]},3,2}},
     {{1,<0.1960.0>},{member,{[],[]},0,0}}],
    [<0.1956.0>],
    {[],[]},
    [],0,undefined,
    #Fun<rabbit_misc.execute_mnesia_transaction.1>,
    {true,{shutdown,ring_shutdown}}}
** Reason for termination ==
** {function_clause,[{orddict,fetch,
                      [{1,<0.1960.0>},[]],
                      [{file,"orddict.erl"},{line,72}]},
                     {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1223}]},
                     {gm,change_view,2,[{file,"src/gm.erl"},{line,1395}]},
                     {gm,handle_info,2,[{file,"src/gm.erl"},{line,729}]},
                     {gen_server2,handle_msg,2,
                      [{file,"src/gen_server2.erl"},{line,1022}]},
                     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}

This is GH#368.

Ok, I can't find any other issues so far. I'll update this ticket shortly.

GH#368 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/724f531

GH#714 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/18210ba

Both issues are addressed in rabbitmq-server-3.3.5-33.el7ost.

Is there a procedure to upgrade rabbitmq-server during deployment or is it part of RHOSP10?

(In reply to bigswitch from comment #11)
> Is there a procedure to upgrade rabbitmq-server during deployment or is it
> part of RHOSP10?

We're shipping rabbitmq-server-3.6.3 within RHOS 9 and more recent RHOS versions, which already contains all the necessary patches. As for RHOS 8, the package (rabbitmq-server-3.3.5-33.el7ost) will be available in the z-stream relatively soon.
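For reference, a minimal sketch of checking and pulling in the fixed build on a RHOS 8 controller once the z-stream update is published (this assumes the repository carrying rabbitmq-server-3.3.5-33.el7ost is already enabled; repo names and channels vary per deployment):

    # current build on this controller
    rpm -q rabbitmq-server

    # pull in the fixed build once it is available in the enabled RHOS 8 repos
    yum update rabbitmq-server

    # after the restart, all three controllers should appear under running_nodes
    rabbitmqctl cluster_status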
We are still seeing that the cluster is up, but the deployment is failing with RHOSP 9. Attached is the sosreport.

The rabbitmq version is:

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-5.el7ost.noarch

Created attachment 1257393 [details]
sosreport from controller 2
Created attachment 1257394 [details]
sosreport from controller 1
Created attachment 1257395 [details]
sosreport from controller 0
(In reply to bigswitch from comment #13)
> We are still seeing that the cluster is up, but the deployment is failing with RHOSP 9.
> Attached is the sosreport.
> The rabbitmq version is:
> [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
> rabbitmq-server-3.6.3-5.el7ost.noarch

I see that nodes 0 and 1 cannot join the rest of the cluster initially. Still, they finally joined it a little later. Node 0 failed to join at 24-Feb-2017::00:21:22 and rejoined at 24-Feb-2017::00:21:41 (20 seconds later). Node 1 stopped operation due to the "pause minority" option at 24-Feb-2017::00:16:55 and rejoined at 24-Feb-2017::00:21:42.

See what happened from node-1's point of view:

===================================
=INFO REPORT==== 24-Feb-2017::00:16:55 ===
rabbit on node 'rabbit@overcloud-controller-0' down
...
=ERROR REPORT==== 24-Feb-2017::00:16:55 ===
Partial partition detected:
* We saw DOWN from rabbit@overcloud-controller-0
* We can still see rabbit@overcloud-controller-2 which can see rabbit@overcloud-controller-0
* pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-0' up
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:20:44 ===
rabbit on node 'rabbit@overcloud-controller-2' down
...
=INFO REPORT==== 24-Feb-2017::00:21:00 ===
Error description:
{error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}}
...
=INFO REPORT==== 24-Feb-2017::00:21:19 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:21:42 ===
rabbit on node 'rabbit@overcloud-controller-0' up
===================================

This

Sorry, accidentally pressed send.

This shows that there was something at "24-Feb-2017::00:16:55". I looked at node-0 logs:

Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30073]: ERROR: [findif] failed
...
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5498] device (p1p1): enslaved to non-master-type device ivs; ignoring
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5531] manager: (ext4000): new Generic device (/org/freedesktop/NetworkManager/Devices/15)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5546] manager: (storagemgmt3997): new Generic device (/org/freedesktop/NetworkManager/Devices/16)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5561] manager: (api3999): new Generic device (/org/freedesktop/NetworkManager/Devices/17)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5578] manager: (storage3998): new Generic device (/org/freedesktop/NetworkManager/Devices/18)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5699] device (ext4000): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5707] device (storagemgmt3997): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5708] device (api3999): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924208.5709] device (storage3998): link connected
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
...
Feb 24 00:16:49 overcloud-controller-0.localdomain lrmd[27617]: notice: ip-10.8.84.70_monitor_10000:30043:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Feb 24 00:16:49 overcloud-controller-0.localdomain crmd[27620]: notice: overcloud-controller-0-ip-10.8.84.70_monitor_10000:90 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]: [TOTEM ] A processor failed, forming new configuration.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]: [TOTEM ] The network interface is down.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]: [TOTEM ] adding new UDPU member {172.17.0.15}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]: [TOTEM ] adding new UDPU member {172.17.0.14}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]: [TOTEM ] adding new UDPU member {172.17.0.16}
Feb 24 00:16:50 overcloud-controller-0.localdomain bash[29987]: ifup ivs port ext4000
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn> [1487924210.6024] dhcp4 (em1): request timed out
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6024] dhcp4 (em1): state changed unknown -> timeout
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6130] dhcp4 (em1): canceled DHCP transaction, DHCP client pid 6779
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6130] dhcp4 (em1): state changed timeout -> done
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6133] device (em1): state change: ip-config -> failed (reason 'ip-config-unavailable') [70 120 5]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6135] manager: NetworkManager state is now DISCONNECTED
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn> [1487924210.6138] device (em1): Activation: failed for connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6145] device (em1): state change: failed -> disconnected (reason 'none') [120 30 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6175] policy: auto-activating connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6186] device (em1): Activation: starting connection 'System em1' (1dad842d-1912-ef5a-a43a-bc238fb267e7)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6188] device (em1): state change: disconnected -> prepare (reason 'none') [30 40 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6189] manager: NetworkManager state is now CONNECTING
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6194] device (em1): state change: prepare -> config (reason 'none') [40 50 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6201] device (em1): state change: config -> ip-config (reason 'none') [50 70 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6205] dhcp4 (em1): activation: beginning transaction (timeout in 45 seconds)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info> [1487924210.6267] dhcp4 (em1): dhclient started with pid 30228
Feb 24 00:16:50 overcloud-controller-0.localdomain dhclient[30228]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 3 (xid=0x763a0705)

I bet something went wrong with networking. I would advise you to upgrade resource-agents from your current resource-agents-3.9.5-82.el7.x86_64 to resource-agents-3.9.5-86.el7.x86_64, where we improved recovery time. In the case mentioned above RabbitMQ will recover 1 minute faster - down from 5 to 4 minutes. But this won't fix the networking instability I see in the logs. Still, 4 minutes is better than 5.
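As a side note, a quick diagnostic sketch for confirming the pieces discussed above on each controller (the config path /etc/rabbitmq/rabbitmq.config and the VIP resource name ip-10.8.84.70 are taken from this deployment; adjust as needed):

    # pause_minority is what made node 1 stop itself during the partial partition
    grep cluster_partition_handling /etc/rabbitmq/rabbitmq.config

    # resource-agents-3.9.5-86.el7 or newer carries the improved recovery behaviour
    rpm -q resource-agents

    # the IPaddr2 "Unable to find nic or netmask" errors point at the VIP resource;
    # check its state and failure counts from pacemaker's side
    pcs status resources
    pcs resource failcount show ip-10.8.84.70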
Can somebody from bigswitch please verify the build mentioned in comment #19 and let us know ASAP if that helps?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1542

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
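For the OtherQA verification requested above, a minimal sketch of what to check on each controller after applying the erratum (assuming the standard director-deployed, pacemaker-managed rabbitmq cluster):

    # expect rabbitmq-server-3.3.5-33.el7ost or later
    rpm -q rabbitmq-server

    # running_nodes should list all three controllers and the partitions list should be empty
    rabbitmqctl cluster_status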
Created attachment 1247567 [details]
sos report from all the controllers

Description of problem:
When installing the BigSwitch virtual switch onto the controller nodes, the deployment fails and the RabbitMQ server does not form a cluster.

Version-Release number of selected component (if applicable):
RHOSP 8

How reproducible:
RHOSP 8 with the BigSwitch virtual switch on the OpenStack controller nodes.

Steps to Reproduce:
1. RHOSP 8 deployment with the BigSwitch virtual switch onto the controller nodes.

Actual results:

Expected results:

Additional info:
sos report attached from all 3 controllers.