Juniper EVPN BGP options – eBGP-only design

In another part of his never-ending EVPN/BGP saga, Ivan Pepelnjak once again argued with Juniper fanboys about the sanity of iBGP-over-eBGP and eBGP-over-eBGP designs and all that fun stuff. I’ve already written my opinion on that topic in my previous post and in numerous comments to Ivan’s posts (TL;DR: the iBGP-over-eBGP design has its advantages, just implement it wisely – don’t place RRs on spine switches).

But there is one thing that worries me. In almost every one of his posts, Ivan mentions some mythical Junos limitations that supposedly prevent Juniper from supporting an eBGP-only (single-session) design. So let’s find out what these limitations are.

Juniper offers a freely available version of vQFX for Vagrant, and there are a few lab topologies available on GitHub. I will be using the full-2qfx-4srv-evpnvxlan topology in this post.
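
If you want to follow along, bringing the lab up looks roughly like this (a sketch; I’m assuming the usual Juniper/vqfx10k-vagrant repository layout, and note that booting the RE/PFE VM pairs takes a while):

git clone https://github.com/Juniper/vqfx10k-vagrant.git
cd vqfx10k-vagrant/full-2qfx-4srv-evpnvxlan
vagrant up          # boots both vQFX RE/PFE pairs and the four servers
vagrant ssh vqfx1   # log in to the first switch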

This topology comes with an Ansible playbook that configures the vQFX switches for iBGP-over-OSPF EVPN. It’s a standard Juniper configuration, shown here for reference:

protocols {
     ospf {
         area 0.0.0.0 {
             interface lo0.0 {
                 passive;
             }
             interface xe-0/0/0.0;
         }
     }
     evpn {
         encapsulation vxlan;
         multicast-mode ingress-replication;
         default-gateway no-gateway-community;
         extended-vni-list all;
     }
     bgp {
         group evpn_overlay {
             type internal;
             local-address 9.9.9.1;
             family evpn {
                 signaling;
             }
             neighbor 9.9.9.2;
         }
     }
}
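
The only other piece that matters for what follows is the shared overlay AS under routing-options (you can see it change in the diffs later in this post):

routing-options {
    autonomous-system 64500;
}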

Converting it to Ivan’s beloved iBGP-over-eBGP design is pretty simple (you can quickly do the same – just use “load patch terminal”, as shown after the diff, and don’t forget to change the config on vQFX2 as well):

vagrant@vqfx1# show | compare
[edit]
+  policy-options {
+      policy-statement lo0 {
+          term 1 {
+              from {
+                  protocol direct;
+                  interface lo0.0;
+              }
+              then accept;
+          }
+      }
+  }
[edit protocols]
-   ospf {
-       area 0.0.0.0 {
-           interface lo0.0 {
-               passive;
-           }
-           interface xe-0/0/0.0;
-       }
-   }
[edit protocols bgp]
     group evpn_overlay { ... }
+    group underlay {
+        family inet {
+            unicast;
+        }
+        export lo0;
+        neighbor 10.0.0.2 {
+            peer-as 64502;
+            local-as 64501;
+        }
+    }
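
Applying a diff this way is quick: enter configuration mode, paste the patch, and commit. The workflow looks roughly like this (the ^D is a literal Ctrl+D on a new line):

vagrant@vqfx1# load patch terminal
[Type ^D at a new line to end input]
... paste the diff above ...
^D
load complete
vagrant@vqfx1# commit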

This is also a pretty standard Juniper config, described many times in a lot of places. If you didn’t read the whole of Ivan’s saga, BEWARE: there is a local-as knob in this config, which means COMPLEXITY. But let’s not go down that road this time.

Instead, let’s implement another complicated design option – eBGP-over-eBGP (an overlay eBGP EVPN session between loopbacks). Here is the diff from the previous config on vQFX1:

vagrant@vqfx1# show | compare
[edit routing-options]
-   autonomous-system 64500;
+   autonomous-system 64501;
[edit protocols bgp group evpn_overlay]
-    type internal;
+    multihop;
[edit protocols bgp group evpn_overlay neighbor 9.9.9.2]
+     peer-as 64502;
[edit protocols bgp group underlay neighbor 10.0.0.2]
-      local-as 64501;

And here is the whole BGP config for vQFX1:

vagrant@vqfx1# show protocols bgp
group evpn_overlay {
    multihop;
    local-address 9.9.9.1;
    family evpn {
        signaling;
    }
    neighbor 9.9.9.2 {
        peer-as 64502;
    }
}
group underlay {
    family inet {
        unicast;
    }
    export lo0;
    neighbor 10.0.0.2 {
        peer-as 64502;
    }
}
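
The changes on vQFX2 mirror these. A sketch of the expected diff there (assuming vQFX2 takes AS 64502 and peers back at 9.9.9.1 and 10.0.0.1):

[edit routing-options]
-   autonomous-system 64500;
+   autonomous-system 64502;
[edit protocols bgp group evpn_overlay]
-    type internal;
+    multihop;
[edit protocols bgp group evpn_overlay neighbor 9.9.9.1]
+     peer-as 64501;
[edit protocols bgp group underlay neighbor 10.0.0.1]
-      local-as 64502;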

We use a different AS per vQFX in this design (configured under routing-options) and the multihop option for the evpn_overlay BGP group. But unfortunately that’s not all: in an eBGP overlay design the EVPN auto-RT feature doesn’t work by default. There are several ways to make it work, but the simplest one is to disable auto-RT and manually configure a single Route Target for all VNIs (don’t forget to apply a similar change to vQFX2 – a sketch follows the diff):

vagrant@vqfx1# show | compare
[edit switch-options vrf-target]
-   target:64500:9991;
+   target:64500:1;
-   auto;
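
The change on vQFX2 is the same. A sketch, assuming its auto-derived target there is target:64500:9992:

[edit switch-options vrf-target]
-   target:64500:9992;
+   target:64500:1;
-   auto;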

After that change, the servers can ping each other and EVPN successfully distributes the routes:

vagrant@vqfx1# run show evpn database
Instance: default-switch
VLAN  DomainId  MAC address        Active source                  Timestamp        IP address
     10000      00:01:94:00:01:01  05:00:00:fb:f5:00:00:27:10:00  Feb 26 16:13:09  10.10.1.254
     10000      00:01:95:00:01:01  05:00:00:fb:f6:00:00:27:10:00  Feb 26 17:35:54  10.10.1.254
     10000      02:05:86:71:8f:00  9.9.9.2                        Feb 26 17:35:54  10.10.1.252
     10000      02:05:86:71:ae:00  irb.10000                      Feb 26 16:13:10  10.10.1.251
     10000      08:00:27:8a:6a:1d  9.9.9.2                        Feb 26 17:36:41  10.10.1.20
     10000      08:00:27:c6:d6:7b  xe-0/0/1.0                     Feb 26 17:36:41  10.10.1.10
     20000      00:01:94:00:01:02  05:00:00:fb:f5:00:00:4e:20:00  Feb 26 16:13:08  10.10.2.254
     20000      00:01:95:00:01:02  05:00:00:fb:f6:00:00:4e:20:00  Feb 26 17:35:54  10.10.2.254
     20000      02:05:86:71:8f:00  9.9.9.2                        Feb 26 17:35:54  10.10.2.252
     20000      02:05:86:71:ae:00  irb.20000                      Feb 26 16:13:09  10.10.2.251
     20000      08:00:27:5e:a0:c0  9.9.9.2                        Feb 26 17:36:56  10.10.2.20
     20000      08:00:27:68:fe:f8  xe-0/0/2.0                     Feb 26 17:37:43  10.10.2.10

So, to recap what we have achieved so far: iBGP-over-OSPF works like a breeze without any nerd knobs; to implement iBGP-over-eBGP we have to use local-as; and to make the Cisco-style eBGP-over-eBGP design work, we need a multihop eBGP session and a suitable way to assign RTs to routes.

Now it’s time for the main topic of this post – let’s try the eBGP-only design option. First, let’s try a simple configuration:

vagrant@vqfx1# show protocols bgp
group evpn {
    family inet {
        unicast;
    }
    family evpn {
        signaling;
    }
    export lo0;
    neighbor 10.0.0.2 {
        peer-as 64502;
    }
}

vagrant@vqfx1# show policy-options
policy-statement lo0 {
    from {
        protocol direct;
        interface lo0.0;
    }
    then accept;
}

And at first sight it seems to be working! Kinda… At least the routes are propagating, but the hosts can’t ping each other. Let’s take a closer look at one of the routes from vQFX1:

vagrant@vqfx2# run show route table bgp.evpn.0 evpn-mac-address 08:00:27:c6:d6:7b detail                
bgp.evpn.0: 32 destinations, 32 routes (32 active, 0 holddown, 0 hidden)
2:9991:1::10000::08:00:27:c6:d6:7b/304 MAC/IP (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 9991:1
                Next hop type: Router, Next hop index: 1752
                Address: 0xc65ee24
                Next-hop reference count: 38
                Source: 10.0.0.1
                Next hop: 10.0.0.1 via xe-0/0/0.0, selected

This route is advertised with a next hop of the link between the vQFX switches. That makes sense, because by default Junos always rewrites the next hop for routes advertised over an eBGP session. But here it is a problem, because the VTEP uses the IP address of the loopback interface. It’s not obvious how to change this eBGP behaviour, but here is a working configuration:

[edit policy-options]
+   policy-statement evpn_nexthop {
+       from protocol evpn;
+       then {
+           next-hop 9.9.9.1;
+       }
+   }
[edit protocols bgp group evpn]
+     multihop {
+         no-nexthop-change;
+     }
-     export lo0;
+     export [ lo0 evpn_nexthop ];
[edit protocols bgp group evpn neighbor 10.0.0.2]
+     accept-remote-nexthop;

We need to configure and apply a policy that explicitly sets the next hop for EVPN routes to the loopback interface address. But, confusingly, this policy doesn’t take effect until the multihop no-nexthop-change option is applied (as far as I understand, without this knob Junos always changes the next hop of advertised eBGP routes to the IP address of the source interface). And to receive such routes (with the changed next hop), the accept-remote-nexthop option is necessary.
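
Before looking at the receiving side, you can sanity-check what vQFX1 now advertises; the routes sent to the peer should already carry 9.9.9.1 as the next hop (just the command – output omitted here):

vagrant@vqfx1# run show route advertising-protocol bgp 10.0.0.2 table bgp.evpn.0 detail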

After that, we can see the EVPN route with the correct next hop:

vagrant@vqfx2# run show route table bgp.evpn.0 evpn-mac-address 08:00:27:c6:d6:7b detail    
bgp.evpn.0: 32 destinations, 32 routes (32 active, 0 holddown, 0 hidden)
2:9991:1::10000::08:00:27:c6:d6:7b/304 MAC/IP (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 9991:1
                Next hop type: Indirect, Next hop index: 0
                Address: 0xc65ee24
                Next-hop reference count: 32
                Source: 10.0.0.1
                Protocol next hop: 9.9.9.1
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0

Just to recap, here is the full working configuration for the eBGP-only design:

vagrant@vqfx1# show protocols bgp
group evpn {
    multihop {
        no-nexthop-change;
    }
    family inet {
        unicast;
    }
    family evpn {
        signaling;
    }
    export [ lo0 evpn_nexthop ];
    neighbor 10.0.0.2 {
        accept-remote-nexthop;
        peer-as 64502;
    }
}

vagrant@vqfx1# show policy-options
policy-statement evpn_nexthop {
    from protocol evpn;
    then {
        next-hop 9.9.9.1;
    }
}
policy-statement lo0 {
    from {
        protocol direct;
        interface lo0.0;
    }
    then accept;
}

vagrant@vqfx1# show switch-options
vtep-source-interface lo0.0;
route-distinguisher 9991:1;
vrf-target target:64500:1;

vagrant@vqfx1# show protocols evpn    
encapsulation vxlan;
multicast-mode ingress-replication;
default-gateway no-gateway-community;
extended-vni-list all;
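
Beyond the ping test, the usual commands are enough to double-check the end state – the BGP session, the remote VTEP learned via the loopback, and the EVPN database (commands only, output omitted):

vagrant@vqfx1# run show bgp summary
vagrant@vqfx1# run show ethernet-switching vxlan-tunnel-end-point remote
vagrant@vqfx1# run show evpn database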

So, I think the myth that Junos has some limitation preventing it from supporting the eBGP-only option is finally busted. Of course, the configuration is convoluted, with a couple of real nerd knobs, but it works. If you’re going to tell me about other vendors that provide sane defaults for this type of configuration, keep a couple of things in mind. First of all, EVPN in Junos is not only about the datacenter; EVPN-MPLS also plays a big role in this world, and defaults that work great for the DC can be really unsuitable for Inter-AS sessions. And, generally speaking, the sanity of any solution and its default values is a highly debatable thing. But I’m really confident that the best product is the one that supports all the options.

From all of the above, I want to draw only one conclusion. Please don’t blame Junos developers for simply offering more options. There are a lot of other things they should be ashamed of, but not this one. As for the marketing and educational departments and their promotion of the iBGP design with RRs on spines – that’s another story…

12 thoughts on “Juniper EVPN BGP options – eBGP-only design”

      1. Why would you do that?
        VXLAN-EVPN in the DC follows the exact same rules as the MPLS-VPN that ISPs have been running for years.

        Your PEs need to terminate VPNs; your control plane and underlay routing should stay as separate as possible.

        1. Just to be precise, I understand the idea of a virtual RR, but really, in the DC, what advantage does that give you when all you have are a few dozen switches?

          1. And what if you have a few hundred switches? A few thousand? Not all DCs are created equal. I think it’s more useful to estimate the expected size of the routing information and compare it to the scalability limits of your switches.

        2. I don’t quite understand your question. I’m proposing exactly what you wrote – don’t place the RR on spine switches (i.e. spines don’t participate in the overlay at all). In every SP network that I have worked with, there have always been separate RR boxes (be it a couple of ancient 7200s or a bunch of modern vRRs).

          1. Hi Alex,
            I don’t understand your point about not running the RR for EVPN on the Spine:
            “don’t place RR on Spine switches (i.e. Spines don’t participate in overlay at all). In every SP network that I worked with, there are always separate RR boxes”.
            These separate RR boxes don’t participate in the overlay at all either, so the same argument would also hold for the Spine.

            I would never place the RR on a Border Leaf:
            1) as mentioned by Andre, Leafs are PEs, and SPs never run the RR on a PE, so we can trust the SPs.
            2) Border Leafs are subject to a lot of configuration changes and handle a lot of stuff (routing to external devices/networks, …), so their CPU is already extensively used – which can hurt the stability of the whole Fabric. Spines have under-utilized CPUs (they are not part of the Overlay).
            3) with the RR on a Border Leaf, all EVPN BGP sessions from remote Leaf switches will be one hop away, and any failure on that path would kill your Fabric (an error on the Border Leaf uplinks, congestion, …).
            4) because a Border Leaf has more functions/features enabled, it is more likely to hit a bug, so upgrades will be needed, which again goes against Fabric stability, whereas the Spine is poor in features and will be much more stable.

            So please share the benefits of not having the RR on the Spine and having it on the Border Leaf instead.

            Thanks, Fabian.

          2. 1. If you trust SP experience, please show me one SP that has RR functionality on its MPLS P-routers (and why do they even have plain P-routers, aka a BGP-free core?).

            2. If you’re worried about such things (either you have a huge scale or too-simple devices in the border leaf role) – use vRRs.

            3. Sorry, I didn’t catch your point here. If you lose a border leaf, you have a more serious problem than just missing one of your RRs.

            4. >Spine is poor in feature and will much more stable.
            So let’s change that and make them RRs, adding new features and bugs at the spine layer.

            And also think about the possibility of using different vendors for the leaf and spine layers.

  1. Indeed, a vRR would be a good solution when you have a very large scale, and for MPLS SPs, since they don’t build a Spine&Leaf architecture, the placement of the RR is not so obvious.

    Point 2) is not only about scale; based on my own experience (so this is a biased experience 🙂 ), the most stable network device is the one you don’t “touch”.

    I would like to hear from you the pros of having the RR on a Border Leaf.
    And I’m interested if you have some public documentation to share.

    I know that Cisco is schizophrenic on this topic (some docs say the Spine doesn’t support RR, other docs say it is good to have a vRR; in ACI the RR is on the APIC – which acts as a Controller, not just an RR, …), but others like Juniper and Arista are clear: RR on the Spine.

    Thanks. Fabian.

    1. My main point is about RRs on Spines – if there is a possibility to avoid that, it’s better to take it. Keep spines as simple as possible; it will help you in the future to scale, upgrade, or change gear.

      So if the RR is not on the Spines – where else should it be? There are just 3 options – Leafs, Border Leafs, or vRRs.
      Border Leafs, in general, are more capable devices than regular leafs. On the other hand, your concerns can also be true under certain circumstances. But I think too much attention is paid to such a simple thing as an RR – just deploy 3-4 of them for redundancy, in different places in your fabric.

      Public documentation and examples from vendors are always very simplistic – just 2 spines and a few leafs in every case. These are just “Hello world!” examples. Real life is a little more complicated. By the way, show me the vendor who will propose the option of using different vendors for the leaf and spine layers.

  2. Indeed, why would a vendor propose a model from a competitor? That would be insane (loss of revenue :P), particularly in an EVPN-VXLAN Fabric (where automation and tooling are very important).

    You are a bit harsh – public documentation is not always so simplistic, look here:
    – Yves@Cisco (who is probably one of the most experienced EVPN-VXLAN engineers at Cisco): http://yves-louis.com/DCI/?p=2013
    – Juniper, with a scale of more than 200 Leaf switches, “supervised” by the co-author of RFC 7432, Isaac Aldrin: https://www.juniper.net/documentation/en_US/release-independent/solutions/topics/concept/solution-cloud-data-center-components.html

    So “Show me the money!” 😉
    I will stop here – it was fun, thanks.
    Fabian.
