Bug 1469780 - Instance boots with two IPs although only one net-id is supplied [NEEDINFO]
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 11.0 (Ocata)
Hardware: Unspecified  OS: Unspecified
Priority: high  Severity: high
Target Milestone: ---
Target Release: 11.0 (Ocata)
Assigned To: Eoghan Glynn
QA Contact: Joe H. Rahme
Whiteboard: scale_lab, aos-scalability-36
Keywords: Triaged, ZStream
Depends On:
Blocks:
Reported: 2017-07-11 16:05 EDT by Sai Sindhur Malleni
Modified: 2018-02-20 13:11 EST (History)
CC List: 19 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
ipetrova: needinfo? (alifshit)


Attachments: None
Description Sai Sindhur Malleni 2017-07-11 16:05:58 EDT
Description of problem: An instance is created using the command openstack server create --image 2017-06-21-ocp-3.6.121-atomic --flavor m1.small --nic net-id=85caaf07-bf6f-415a-a616-44e0325baad8 svt-i-1. The instance is expected to have only one IP annd one NIC. However, the instance boots with two IPs and two NICs(eth0 and eth1), however eth1 wasn;t even setup (had no entry in /etc/sysconfig/network-scripts/)

Running "openstack server list", we see the VM with two IPs on the same subnet:

| 743427ab-af57-4704-a9d1-1d02f5c9ed3a | svt-ng-23  | ACTIVE  | private=172.16.1.225, 172.16.1.222 | 2017-06-28-ocp-3.6.126.1-2-rhel |

The instance id is 743427ab-af57-4704-a9d1-1d02f5c9ed3a
Both neutron ports are active. The port ids of the neutron ports are 
172.16.1.225- 9cbec257-81ac-4b4a-b0b2-bbaa0c13f26c
172.16.1.222- 380883d7-e9c2-4bb0-9c11-a96ae94a0d01

The instance is only reachable via the IP 172.16.1.225.
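
As an aside, affected servers can be spotted mechanically from the Networks column of "openstack server list" output; the helper below is a hypothetical sketch (not part of this bug's tooling) that flags any network carrying more than one IP:

```python
# Hypothetical helper: parse the Networks column of "openstack server list"
# (e.g. "private=172.16.1.225, 172.16.1.222") and report networks that
# carry more than one IP address, like the affected instance above.
def multi_ip_networks(networks_field):
    nets = {}
    for chunk in networks_field.split(";"):
        name, _, ips = chunk.strip().partition("=")
        nets[name] = [ip.strip() for ip in ips.split(",")]
    return {name: ips for name, ips in nets.items() if len(ips) > 1}

print(multi_ip_networks("private=172.16.1.225, 172.16.1.222"))
# {'private': ['172.16.1.225', '172.16.1.222']}
```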

Grepping with the instance ID and port ids in the neutron and nova logs on the 3 controllers we have the following: https://gist.github.com/smalleni/3ae0792440d0276f62f9d865ce0001ad

On doing an ip a on the instance we see https://gist.github.com/smalleni/2214f9ca06e22f4e4ad31a9f2cf5702f



Version-Release number of selected component (if applicable):
11 GA

How reproducible:
Happened on two instances in this environment

Steps to Reproduce:
1. Boot an instance, passing only one net-id
2.
3.

Actual results:
Instance boots with two neutron ports and two IPs

Expected results:
Instance boots with only one IP

Additional info:
Comment 1 Jakub Libosvar 2017-07-13 08:42:28 EDT
Can you please attach Nova and Neutron logs from all controllers?
Comment 2 Sai Sindhur Malleni 2017-07-13 09:10:44 EDT
I grepped for instance id, and the two port ids in nova/neutron logs and pasted it here- https://gist.github.com/smalleni/3ae0792440d0276f62f9d865ce0001ad

Do you need the full logs? There's a lot of them.
Comment 3 Jakub Libosvar 2017-07-13 12:15:59 EDT
From the provided logs it seems Neutron is asked twice for the port. We have the following ports:
9cbec257-81ac-4b4a-b0b2-bbaa0c13f26c created at: 2017-07-10T22:52:46Z
380883d7-e9c2-4bb0-9c11-a96ae94a0d01 created at: 2017-07-10T22:53:22Z

I can see for each port a POST request:

2017-07-10 22:52:47.453 [c0] 129603 INFO neutron.wsgi [req-26a883fa-f597-4cc5-beac-1334c9ad61a1 accde8c8d00f4ae0a765fa7df5a20d7e 6934d26fc8f64fbdb12a80de7850b958 - - -] 192.168.25.16 - - [10/Jul/2017 22:52:47] "POST /v2.0/ports.json HTTP/1.1" 201 1047 1.111623
2017-07-10 22:53:23.607 [c2] 490233 INFO neutron.wsgi [req-fa3834ea-d3e6-4012-b213-ca991aa9cae2 accde8c8d00f4ae0a765fa7df5a20d7e 6934d26fc8f64fbdb12a80de7850b958 - - -] 192.168.25.16 - - [10/Jul/2017 22:53:23] "POST /v2.0/ports.json HTTP/1.1" 201 1047 1.049874


So either nova or haproxy sends the request twice.
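
A rough way to confirm the double-POST theory mechanically, assuming the wsgi log format shown above, is to pull out the port-create POSTs and flag two requests from the same project landing close together (this is a hypothetical sketch, not an existing tool):

```python
import re
from datetime import datetime

# Hypothetical sketch: find "POST /v2.0/ports.json" lines in neutron wsgi
# logs and flag consecutive requests from the same project within a short
# window, as with the two port creations 36 seconds apart above.
LOG = re.compile(r"^(\S+ \S+).*?\[req-(\S+) (\S+) (\S+).*POST /v2\.0/ports\.json")

def duplicate_posts(lines, window_s=60):
    hits = []
    for line in lines:
        m = LOG.match(line)
        if m:
            ts = datetime.strptime(m.group(1)[:19], "%Y-%m-%d %H:%M:%S")
            hits.append((ts, m.group(2), m.group(4)))  # (time, req-id, project)
    return [(r1, r2)
            for (t1, r1, p1), (t2, r2, p2) in zip(hits, hits[1:])
            if p1 == p2 and (t2 - t1).total_seconds() <= window_s]
```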
Comment 4 Artom Lifshitz 2017-07-13 14:26:13 EDT
I took a look at the env; debug wasn't universally enabled, so it was of limited use. I'll try to use the sosreports from bz 1390199 to figure it out. Comment 5 in that bz [1] suggests that this happens if an instance is rescheduled - this would make it the same as upstream bug 1609526 [2]. If that's indeed what's happening we'll move the bug to openstack-nova and take care of the fix.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1390199#c5
[2] https://bugs.launchpad.net/nova/+bug/1609526
Comment 5 Jakub Libosvar 2017-07-24 07:54:15 EDT
I'm flipping this to nova to triage as the symptoms are the same as in upstream bug from comment 4 - https://bugs.launchpad.net/nova/+bug/1639230

But that bug should already be fixed in Newton, while this is OSP 11 (Ocata).
Comment 11 Artom Lifshitz 2018-02-06 14:24:09 EST
This can't be bug 1639230 [1], because that's been fixed in Ocata (OSP11) and backported to Newton (OSP10).

I took a look at the sosreports in case 01927760. Based on comment #6, I looked for instance 35bed9a2-071b-472d-a3e8-cfb6c823842a in order to trace its creation from nova API accepting the boot request to the instance running on the compute. I found the following line in the compute log (slightly formatted for legibility):

  $ grep 2>/dev/null -I 35bed9a2-071b-472d-a3e8-cfb6c823842a.*build -R .

    ./pollux-tds-compute-001.localdomain/var/log/nova/nova-compute.log:
    2017-09-08 12:17:39.716 21691 INFO nova.compute.manager
    [req-d532789b-4b1e-4b55-b885-391a055d1c6a
    8c7fb2e59d8a4e138fd0453e287dafe6
    1a4003cd5dd5425799a578968c2b7c1a - - -]
    [instance: 35bed9a2-071b-472d-a3e8-cfb6c823842a]
    Took 4.79 seconds to build instance.
So it looks like the request ID that we need to trace is req-d532789b-4b1e-4b55-b885-391a055d1c6a. Now the confusion starts. In the conductor logs, that request ID is associated with the following instance IDs:

    72fdcf4e-7bd9-4b46-97dd-225c27ff31c4
    0bb50738-1d96-43c3-a09c-5656d5859ab0

However, in the compute logs:

* Instance 72fdcf4e-7bd9-4b46-97dd-225c27ff31c4 does not appear for that
  request ID

* Instance 0bb50738-1d96-43c3-a09c-5656d5859ab0 does appear, thankfully

* Other instance IDs appear for that request ID:
  0992cd78-757e-4fb9-baf6-b48a9608ec6c
  35bed9a2-071b-472d-a3e8-cfb6c823842a
  4179df9e-4402-47d8-b786-36fd02dfbc8e
  9b109be0-4169-45d1-b8e9-cb313e341ef3
  c1370b06-0953-462f-b8f0-b5bf16f2bbeb
  caa7b249-245c-48d6-974e-5beb10c2454d
  d30313e5-f8ff-4926-b22d-eebec7f68d58

I'm guessing the most reasonable explanation for all this is that:

1. The boot request that created instance 35bed9a2-071b-472d-a3e8-cfb6c823842a
   was for multiple instances

2. Some sosreports from other computes and controllers are missing. 

For point 1, can I ask you to do some testing and rule out whether this bug needs a multi-VM boot request to be reproduced? In other words, can this bug be reproduced by booting a single VM at a time?

For point 2, I would like to have sosreports from all nodes.

If either point 1 or 2 is wrong - in other words, if the new instance IDs in nova-compute.log are not from a different controller whose sosreports I don't have - then I would think about blaming rabbitmq, perhaps because the compute service is getting instance build requests replayed more than once for a given instance.
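
The request-ID cross-referencing above can be sketched as a small log pass that maps each request ID to the instance UUIDs it touches; a request ID fanning out to many instances would support the multi-VM boot theory in point 1 (a hypothetical illustration - the log format is assumed from the excerpts in this comment):

```python
import re
from collections import defaultdict

# Hypothetical sketch: group instance UUIDs by nova request ID in a
# compute log; one request ID mapping to several instances suggests a
# multi-instance boot request.
REQ = re.compile(r"\[(req-[0-9a-f-]+) .*?\[instance: ([0-9a-f-]+)\]")

def instances_per_request(lines):
    seen = defaultdict(set)
    for line in lines:
        m = REQ.search(line)
        if m:
            seen[m.group(1)].add(m.group(2))
    return seen
```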

Before we go there however, I would like to check points 1 and 2 with you.

Cheers!

[1] https://bugs.launchpad.net/nova/+bug/1639230
