February 2011 ~ CCNP, CCSP and CCIE Security Version 4 Training Institute in India- Delhi

Voice Toll-Fraud Caveats – No VoIP Traffic by Default!

This isn’t exactly the latest news, and it doesn’t affect the CCIE Voice Lab exam (although it very well may affect the new CCNP Voice exams). However, I am hearing more and more about people upgrading their voice routers to newer 15.x IOS code without realizing that existing (working) VoIP calls are being broken by new, intelligent feature default configurations.
Last July, Cisco decided (wisely, IMHO) to create a new style of Toll-Fraud prevention to keep would-be dishonest people from defrauding a company by placing calls through their misconfigured voice gateway(s), at the company’s expense. This new mechanism works by preventing unintended TDM (FXO/CAS/PRI) and VoIP (H.323 & SIP) calls from being placed through a given company’s voice gateway(s), by simply blocking all unknown traffic. Beginning in IOS 15.1(2)T, Cisco added a new application to the default IOS stack of apps that compares every source IP address with an explicitly configured list in the IOS running config; if the address does not match any listed address or subnet, the VoIP traffic is denied. Also, the new default for all POTS voice-ports is to not allow secondary dial-tone, making direct-inward-dial the default for CAS/PRI, and PLAR necessary for FXO.

We can trust our VoIP sources with a few very easy commands.
If we wanted to trust only our CUCM Publisher and Subscriber servers on our GradedLabs Voice Racks, we would add:

voice service voip
  ip address trusted list
    ipv4 177.1.10.10 255.255.255.255
    ipv4 177.1.10.20 255.255.255.255

Or possibly if we wanted to trust the entire subnet that our servers were on, we would add:

voice service voip
  ip address trusted list
    ipv4 177.1.10.0 255.255.255.0

We also have the ability to go back to pre-15.1(2)T behavior by simply doing either this:

voice service voip
  ip address trusted list
    ipv4 0.0.0.0 0.0.0.0

Or this:

voice service voip
  no ip address trusted authenticate
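To make the matching behaviour of the trusted list concrete, here is a minimal Python sketch of the comparison described above. It is only a conceptual model, not how IOS implements the check; the names `TRUSTED_LIST`, `build_networks` and `is_trusted` are made up for illustration.

```python
import ipaddress

# Hypothetical model of a trusted list: (address, mask) pairs written the
# same way they follow the "ipv4" keyword in the configurations above.
TRUSTED_LIST = [
    ("177.1.10.10", "255.255.255.255"),
    ("177.1.10.20", "255.255.255.255"),
]


def build_networks(entries):
    """Convert (address, mask) pairs into ip_network objects."""
    return [ipaddress.ip_network(f"{addr}/{mask}") for addr, mask in entries]


def is_trusted(source_ip, networks):
    """Return True if source_ip falls inside any trusted entry."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in net for net in networks)


if __name__ == "__main__":
    hosts_only = build_networks(TRUSTED_LIST)
    print(is_trusted("177.1.10.10", hosts_only))   # listed host
    print(is_trusted("177.1.10.30", hosts_only))   # unlisted host

    # The catch-all entry matches every source, which is why it restores
    # the old behaviour in this model.
    catch_all = build_networks([("0.0.0.0", "0.0.0.0")])
    print(is_trusted("8.8.8.8", catch_all))
```

In this model a host entry matches exactly one address, a subnet entry matches every address inside it, and the `0.0.0.0 0.0.0.0` entry matches everything.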

Also, we have the ability to configure the router for pre-15.1(2)T behavior as it relates to inbound POTS calls.
For inbound ISDN calls we would add:

voice service pots
  no direct-inward-dial isdn

And for inbound FXO calls we would add:

voice-port 0/0
 secondary dialtone

One nice thing is that when booting an IOS router with this toll-fraud functionality, a message is displayed on boot-up, letting us know about it – essentially warning us that we need to configure something if we wish VoIP calls to work.
A link to Cisco’s tech note describing this new functionality can be found here.
In summary, when upgrading a previously working H.323 or SIP VoIP gateway to IOS 15.1(2)T or later, until the proper configuration changes have been added to allow the proper VoIP source traffic into your voice gateway, all VoIP calls will cease to function properly. In general, this shouldn’t break FXO/CAS/PRI for most configurations out there – as most folks are likely to have their routers configured properly to handle inbound POTS traffic (i.e. PLAR on their FXO ports and DID on their CAS/PRI port – or so we should hope) – I suppose YMMV depending on each unique configuration.
Let me know if you think this is a good thing that Cisco has done.

The Basics of EIGRP

12:43:00 PM Eigrp, Tutorials

About the Protocol

The algorithm used for this advanced Distance Vector protocol is the Diffusing Update Algorithm.
The metric is based upon Bandwidth and Delay values.
For updates, EIGRP uses Update and Query packets that are sent to a multicast address.
Split horizon and DUAL form the basis of loop prevention for EIGRP.
EIGRP is a classless routing protocol that is capable of Variable Length Subnet Masking.
Automatic summarization is on by default, but summarization and filtering can be accomplished anywhere inside the network.

Neighbor Adjacencies

EIGRP forms “neighbor relationships” as a key part of its operation. Hello packets are used to help maintain the relationship. A hold time dictates the assumption that a neighbor is no longer accessible and causes the removal of topology information learned from that neighbor. This hold timer value is reset when any packet is received from the neighbor, not just a Hello packet.

EIGRP uses the network type in order to dictate default Hello and Hold Time values:

For all point-to-point types – the default Hello is 5 seconds and the default Hold is 15 seconds
For all links with a bandwidth over T1 speed (1.544 Mbps) – the defaults are also 5 and 15 seconds respectively
For all multipoint links with a bandwidth of T1 speed or less – the default Hello is 60 seconds and the default Hold is 180 seconds

Interestingly, these values are carried in the Hello packets themselves and do not need to match in order for an adjacency to form (unlike OSPF).

Reliable Transport

By default, EIGRP sends updates and other information to multicast 224.0.0.10 and the associated multicast MAC address of 01-00-5E-00-00-0A.
For multicast packets that need to be reliably delivered, EIGRP waits for an RTO (retransmission timeout) before beginning a recovery action. This RTO value is based on the SRTT (smoothed round-trip time) for the neighbor. These values can be seen in the output of the show ip eigrp neighbor command.
If the router sends out a reliable packet and does not receive an Acknowledgement from a neighbor, the router tells that neighbor to stop listening to multicast until it is told to listen again. The local router then begins unicasting the update information. Once the router begins unicasting, it will retry up to 16 times or until the Hold timer expires, whichever is greater. It will then reset the neighbor and declare a Retransmission Limit Exceeded error.
Note that not all EIGRP packets follow this reliable routine – just Updates and Queries. Hellos and acknowledgements are examples of packets that are not sent reliably.

A VLAN Hopping Attack

12:34:00 PM Tutorials

In our recent Implement Layer 2 Technologies series, we examined Q-in-Q tunneling in great detail. In this discussion, I mentioned a big caution about the Service Provider cloud with 802.1Q trunks in use for switch to switch trunking. This caution involved the use of an untagged native VLAN.
You see, this configuration could lead to what is known as the VLAN hopping attack. Here is how it works:

A computer criminal at a customer site wants to send frames into a VLAN that they are not part of.
The evil-doer double tags the frame (Q-in-Q) with the outer tag matching the native VLAN in use at the provider edge switch.
The provider edge switch strips off the outer tag (because it matches the native VLAN), and sends this frame across the trunk.
The next switch in the path examines the frame and reads the inner VLAN tag and forwards the frame accordingly. Yikes!
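The four steps above can be sketched at the byte level. The following Python snippet is a simplified illustration that builds a double-tagged frame and mimics the tag stripping done by the edge switch. The helper names, MAC addresses and VLAN numbers in the usage example (native VLAN 1, target VLAN 20) are assumptions chosen for the example, not values from the series.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def vlan_tag(vlan_id, priority=0):
    """Build a 4-byte 802.1Q tag: TPID followed by priority bits and VLAN ID."""
    tci = (priority << 13) | (vlan_id & 0x0FFF)
    return struct.pack("!HH", TPID_8021Q, tci)


def double_tagged_frame(dst, src, outer_vlan, inner_vlan, ethertype, payload):
    """Assemble a frame that carries two stacked 802.1Q tags."""
    return (dst + src + vlan_tag(outer_vlan) + vlan_tag(inner_vlan)
            + struct.pack("!H", ethertype) + payload)


def strip_outer_tag(frame):
    """Drop the first tag after the MAC addresses, as the edge switch does
    when that tag matches its native VLAN."""
    return frame[:12] + frame[16:]


def first_vlan(frame):
    """Return the VLAN ID of the first tag, or None if the frame is untagged."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return tci & 0x0FFF if tpid == TPID_8021Q else None


if __name__ == "__main__":
    dst = bytes.fromhex("ffffffffffff")
    src = bytes.fromhex("001122334455")
    frame = double_tagged_frame(dst, src, 1, 20, 0x0800, b"\x00" * 46)
    print(first_vlan(frame))                   # what the edge switch sees
    print(first_vlan(strip_outer_tag(frame)))  # what the next switch sees
```

The edge switch sees the outer (native) VLAN, while the next switch sees only the inner tag and forwards into the target VLAN.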

Notice the nature of this attack is unidirectional. The attacker can send traffic into the VLAN, but traffic will not return. Admittedly, this is still NOT something we want taking place!
What are solutions for the Service Provider?

Use ISL trunks in the cloud. Yuck.
Use a Native VLAN that is outside of the range permitted for the customer. Yuck.
Tag the native VLAN in the cloud. Awesome.

The Trouble with European MGCP Gateways and Mobile Connect Inbound Calling Party Matching

12:32:00 PM Tutorials

The Cisco Unified Communications feature called Mobile Connect (also familiarly referred to as Single Number Reach) is truly a great feature of Unified Communications Manager. It provides many efficiencies, both in letting us be reachable just about anywhere and in letting us be easily identified when we place calls from our mobile phones into the CUCM cluster to our colleagues. As admins, we know that if we want our users to place calls from their mobile phones to colleagues inside the CUCM cluster, we need to match all or at least part of their inbound calling party number (CLID) to their CUCM Remote Destination. But what happens when the CLID digits the carrier sends inbound to our IOS voice gateways differ significantly from our Remote Destinations in CUCM, especially if we have truly embraced Cisco’s push toward true Globalization in v7.0, v8.0 & v8.5?

The fact is that many, if not most, European carriers (as well as many more all over the world) send CLID in through an ISDN PRI into the enterprise gateway with a preceding “0” as a courtesy digit for easy recognition and ease in dialing back out, since this “0” is very commonly used as a carrier-recognized national dialing prefix. In the US and Canada, the equivalent would be dialing a “1” prior to the national number. Now if a carrier in the US sent CLID into a gateway with a “1” preceding any 10 digit number, this would work fine since the US/Canada country code also happens to be “1”. However, the “0” preceding a variable length number is not valid in a true E.164 number format (e.g. if you dialed the phone number from outside of the country in question, you would omit that preceding 0 from your dialed digits).
So what are we to do to get our inbound CLID to match our RD’s?
That is exactly what we will explore here today in video format, as you watch a very small excerpt from the new video-based solutions to one of the many new labs we will be releasing in the very near future (some just next week) as we completely replace our CCIE Voice Volume I & II workbooks with completely new lab scenarios and solutions.
Click here for the 25 minute video discussion on “The Trouble with European MGCP Gateways and Mobile Connect Inbound Calling Party Matching”.
(BTW, if the video starts off with a bit of an echo, just hit CTRL-R to refresh the stream. And then stay tuned to this blog for some very exciting announcements about new formats for video solutions in the very near future)
Happy Labbing,

Catalyst Switch Port Security Basics

12:29:00 PM Tutorials

Catalyst switch port security is so often recommended. This is because of a couple of important points:

There are many attacks that are simple to carry out at Layer 2
There tends to be a gross lack of security at Layer 2
Port Security can guard against so many different types of attacks such as MAC flooding, MAC spoofing, and rogue DHCP servers and APs, just to name a few

I find when it comes to port security, however, many students cannot seem to remember two main points:

What in the world is Sticky Learning and how does it work?
What is the difference between the different violation modes and how can I remember them?

Sticky Learning

Sticky learning is a convenient way to set static MAC address mappings for MAC addresses that you allow on your network. What you do is confirm that the correct devices are connected. You then turn on sticky learning and the port security feature itself, for example:

switchport port-security maximum 2
switchport port-security mac-address sticky
switchport port-security

Now what happens is the 2 MAC addresses for the two devices you trust (perhaps an IP Phone and a PC) are dynamically learned by the switch. The switch now automatically writes static port security entries in the running configuration for those two devices. All you have to do is save the running configuration, and poof, you are now configured with the powerful static MAC port security feature.
Please note that it is easy to forget to actually turn on port security after setting the parameters. This is what the third line is doing in the configuration above. Always use your show port-security commands to confirm you remembered this important step of the process!

Violation Modes

The violation modes are Shutdown, Protect, and Restrict. Shutdown is the default and the most severe. If there is a violation, the port is error-disabled and notifications are sent (SNMP traps can be used and violation counters are incremented, etc.). With Restrict mode, the bad MAC cannot communicate on the port, but the port does not error-disable. There are notifications sent. With the Protect mode, the bad MAC cannot communicate and there is no error-disabling, but the problem is, there are no notifications sent. Cisco does not recommend this mode as a result.
How can you remember these easily? Just think of the alphabet. P, then R, then S gives you the levels of severity in increasing order.
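If it helps to see the three modes side by side, here is a small Python toy model of a secure port. It is only an illustration of the behaviour described above, not a description of how the switch is implemented; the `SecurePort` class and its attribute names are invented for the example.

```python
class SecurePort:
    """Toy model of a switch port with port security enabled."""

    def __init__(self, maximum=1, violation="shutdown"):
        if violation not in ("protect", "restrict", "shutdown"):
            raise ValueError("unknown violation mode")
        self.maximum = maximum
        self.violation = violation
        self.secure_macs = []      # addresses learned so far (sticky)
        self.violation_count = 0   # stands in for counters and notifications
        self.err_disabled = False

    def receive(self, mac):
        """Return True if a frame from this source MAC is forwarded."""
        if self.err_disabled:
            return False
        if mac in self.secure_macs:
            return True
        if len(self.secure_macs) < self.maximum:
            self.secure_macs.append(mac)   # sticky learning
            return True
        # Violation: the offending frame is dropped in every mode.
        if self.violation in ("restrict", "shutdown"):
            self.violation_count += 1      # Protect stays silent
        if self.violation == "shutdown":
            self.err_disabled = True       # port goes down for everyone
        return False


if __name__ == "__main__":
    port = SecurePort(maximum=2, violation="restrict")
    for mac in ("aaaa.aaaa.aaaa", "bbbb.bbbb.bbbb", "cccc.cccc.cccc"):
        print(mac, port.receive(mac))
    print(port.violation_count, port.err_disabled)
```

With `maximum=2` the third MAC is dropped; only the counter and the error-disable state differ between the modes.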

Where do you find these features documented should you still forget?
Cisco.com – Support – Configure – Products – Switches – LAN Switches – Access – 3560 Series - Configuration Guides – Software Configuration Guides – Latest Release – Configuring Port-Based Traffic Control

The EIGRP Composite Metric – Part 1

12:26:00 PM Eigrp, Tutorials

EIGRP for IP: Basic Operation and Configuration by Russ White and Alvaro Retana
I was able to grab an Amazon Kindle version for about $9, and EIGRP has always been one of my favorite protocols.
The text dives right in to none other than the composite metric of EIGRP and it brought a smile to my face as I thought about all of the misconceptions I had regarding this topic from early on in my Cisco studies. Let us review some key points regarding this metric and hopefully put some of your own misconceptions to rest.

While we are taught since CCNA days that the EIGRP metric consists of 5 possible components (BW, Delay, Load, Reliability, and MTU), we realize when we look at the actual formula for the metric computation that MTU is actually not part of the metric. Why have we been taught this then? Cisco indicates that MTU is used as a tie-breaker in a situation that might require it. To review the actual formula that is used to compute the metric, click here.
Notice from the formula that the K (constant values) impact which components of the metric are actually considered. By default K1 is set to 1 and K3 is set to 1 to ensure that Bandwidth and Delay are utilized in the calculation. If you wanted to make Bandwidth twice as significant in the calculation, you could set K1 to 2, as an example. The metric weights command is used for this manipulation. Note that it starts with a TOS parameter that should always be set to 0. Cisco never did fully implement this functionality.
The Bandwidth that affects the metric is taken from the bandwidth command used in interface configuration mode. Obviously, if you do not provide this value, the Cisco router will select a default based on the interface type.
The Delay value that affects the metric is taken from the delay command used in interface configuration mode. Its default depends on the interface hardware type, e.g. it is lower for Ethernet but higher for Serial interfaces. Note how the Delay parameter allows you to influence EIGRP pathing decisions without the manipulation of the Bandwidth value. This is nice since other mechanisms could be relying heavily on the bandwidth setting, e.g. EIGRP bandwidth pacing or absolute QoS reservation values for CBWFQ.
The actual metric value for a prefix is derived from the SUM of the delay values in the path, and the LOWEST bandwidth value along the path. This is yet another reason to use the more predictable Delay manipulations to change EIGRP path preference.
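As a rough illustration of the points above, the following Python sketch computes the classic composite metric from the lowest bandwidth and the summed delay along a path, using the widely documented formula with the default K values. The function name and the example interface values (a T1 at 1544 Kbps with 20000 usec delay, FastEthernet at 100000 Kbps with 100 usec) are assumptions for the example, and the integer rounding is simplified compared to what a router does internally.

```python
def eigrp_metric(path_bandwidths_kbps, path_delays_usec,
                 k1=1, k2=0, k3=1, k4=0, k5=0, load=1, reliability=255):
    """Classic EIGRP composite metric for one path (simplified rounding)."""
    # Bandwidth term: 10^7 divided by the LOWEST bandwidth on the path.
    bw = 10**7 // min(path_bandwidths_kbps)
    # Delay term: SUM of the interface delays, in tens of microseconds.
    delay = sum(path_delays_usec) // 10
    metric = k1 * bw + (k2 * bw) // (256 - load) + k3 * delay
    # The reliability term only applies when K5 is non-zero.
    if k5 != 0:
        metric = metric * k5 // (reliability + k4)
    return 256 * metric


if __name__ == "__main__":
    # Single T1 serial link.
    print(eigrp_metric([1544], [20000]))
    # FastEthernet followed by the same T1: only the delay sum changes,
    # because the T1 is still the lowest bandwidth on the path.
    print(eigrp_metric([100000, 1544], [100, 20000]))
    # Making bandwidth twice as significant by setting K1 to 2.
    print(eigrp_metric([1544], [20000], k1=2))
```

Adding the FastEthernet hop raises the metric only through the extra delay, which shows why delay is the more predictable knob.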

In the next post on the EIGRP metric, we will examine this at the actual command line, and discuss EIGRP load balancing options. Thanks for reading!

Optimum Bandwidth Allocation for VoIP Traffic

12:23:00 PM Tutorials

Abstract

This publication discusses the spectrum of problems associated with transporting Constant Bit Rate (CBR) circuits over packet networks, specifically focusing on VoIP services. It provides guidance on practical calculation for voice bandwidth allocation in IP networks, including the maximum bandwidth proportion allocation and LLQ queue settings. Lastly, the publication discusses the benefits and drawbacks of transporting CBR flows over packet switched networks and demonstrates some effectiveness criteria.

Introduction

Historically, the main design goal of Packet Switched Networks (PSNs) was optimum bandwidth utilization for low-speed links. Compared to their counterpart, circuit-switched networks (CSNs, such as SONET/SDH networks), PSNs use statistical as opposed to deterministic (synchronous) multiplexing. This feature allows PSNs to be very effective for bursty traffic sources, i.e. those that send traffic sporadically. Indeed, with many sources this allows the transmission channel to be optimally utilized by sending traffic only when necessary. Because statistical multiplexing introduces link contention, it is only possible if every node in the network implements packet queueing. One good historical example is ARPANET, whose theoretical foundation was developed in Kleinrock’s work on distributed queueing systems.

In PSNs, it is common for the traffic from multiple sources to be scheduled for sending out the same link at the same moment. In such a case of contention for the shared resource, excess packets are buffered, delayed and possibly dropped. In addition to this, packets could be re-ordered, i.e. packets sent earlier may arrive behind packets that have been sent after them. The latter is normally a result of packets taking different paths in the PSN due to routing decisions. Such behavior is OK with bursty, delay insensitive data traffic, but completely inconsistent with the behavior of constant bit rate (CBR), delay/jitter sensitive traffic sources, such as emulated TDM traffic. Indeed, transporting CBR flows over PSNs poses significant challenges. Firstly, emulating a circuit service requires that every node should not buffer the CBR packets (i.e. should not introduce delay or packet drops) and be “flow-aware” to avoid re-ordering. The other challenge is the “packet overhead” tax imposed on emulated CBR circuits. By definition, CBR sources produce relatively small bursts of data at regular periodic intervals. The more frequent the intervals, the smaller the bursts typically are. In turn, PSNs apply a header to every transmitted burst of information to implement network addressing and routing, with the header size often being comparable to the CBR payload. This significantly decreases link utilization efficiency when transporting CBR traffic.

Emulating CBR services over PSN

At first, it may seem that changing the queueing discipline in every node will resolve the buffering problem. Obviously, if we distinguish CBR flow packets and service them ahead of all other packets using a priority queue, they would never get buffered. This assumes that link speed is fast enough that serialization delay is negligible in the context of the given CBR flow. Such delay may vary depending on the CBR source: for example, voice flows typically produce one codec sample every 10ms, and based on this, serialization delay at every node should not exceed 10ms, or preferably be less than that (otherwise, the next produced packet will “catch up” with the previous one). The serialization problem on slow links could be solved using fragmentation and interleaving mechanisms, e.g. as demonstrated in [6]. Despite priority queueing and fragmentation, the situation becomes more complicated with multiple CBR flows transported over the same PSN. The reason is that there is now contention among the CBR flows, since all of them should be serviced on a priority basis. This creates queueing issues and intolerable delays. There is only one way to reduce resource contention in PSNs: over-provisioning.
Following the work [2], let’s review how the minimum over-provisioning rate could be calculated. We first define the CBR flows as those that cannot tolerate any delay of their packets in the link queue. Assume there are r equally behaving traffic flows contending for the same link. Pick a designated flow out of the set. When we randomly “look” at the link, the probability that we see a packet from the designated flow is 1/r, since we assume that all flows are serviced equally by the connected PSN node. Then, the probability that the selected packet does NOT belong to our flow is 1-1/r. If the link can accept at maximum t packets per millisecond, then during an interval of k milliseconds the probability that our designated flow may send a packet over the link without blocking is P=1-(1-1/r)^(tk), where (1-1/r)^(tk) is the probability of NOT seeing our designated flow’s packet among the tk packets. The value P is the probability of any given packet NOT being delayed due to contention. It is important to understand that every delayed packet will cause the flow behavior to deviate from CBR. Following [2], we define u=tk as the “over-provisioning ratio”, where u=1 means the channel can send only one packet in the time it takes a flow to generate one packet, i.e. channel rate = flow rate. When u=2 the link is capable of sending twice as many packets during a unit of time as a single flow sends during the same interval. With the new variable, the formula becomes P=1-(1-1/r)^u. Fixing the value of P in this equation we obtain:

u=ln(1-P)/ln(1-1/r). (*)

which is the minimum over-provisioning ratio needed to achieve the desired probability P of successfully transmitting the designated flow’s packet when r equal flows are contending for the link. For example, with P=99.9% and r=10 we end up having u=ln(0.001)/ln(0.9)=65.6. That is, in order to guarantee that 99.9% of packets are not delayed for 10 CBR flows, we need to have at least 66 times more bandwidth than a single flow requires. Lowering P to 99% results in an over-provisioning coefficient of u=44. It is interesting to look at the r/u ratio, which demonstrates what portion of the minimally over-provisioned link bandwidth would be occupied by the “sensitive” flows when they are all transmitted in parallel. If we take the ratio r/u=ln((1-1/r)^r)/ln(1-P), then for large r we can replace (1-1/r)^r with 1/e and the link utilization ratio is approximated by:

r/u=-1/ln(1-P). (**)

For P=99% we get the ratio of 21%, for P=99.9% the ratio is 14% and for P=90% the ratio becomes 43%. In practice, this means that for a moderately large number of concurrent CBR flows, e.g. over 30, you may allocate no more than the specified percentage of the link’s bandwidth to CBR traffic based on the target QoS requirement.
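The figures quoted for formulas (*) and (**) are easy to reproduce. The short Python sketch below simply evaluates the two expressions; the function names are mine, and natural logarithms are used as in the text.

```python
import math


def overprovision_ratio(p, r):
    """Formula (*): minimum u for on-time probability p with r equal flows."""
    return math.log(1 - p) / math.log(1 - 1 / r)


def max_cbr_share(p):
    """Formula (**): asymptotic share r/u of the link usable by CBR flows."""
    return -1 / math.log(1 - p)


if __name__ == "__main__":
    for p in (0.999, 0.99):
        u = overprovision_ratio(p, r=10)
        print(f"P={p}: u={u:.1f}, rounded up to {math.ceil(u)}")
    for p in (0.99, 0.999, 0.90):
        print(f"P={p}: CBR share of the link = {max_cbr_share(p):.1%}")
```

Note how slowly the usable share grows as P is relaxed: dropping the target from 99.9% to 90% only raises the CBR share from roughly 14% to roughly 43%.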
Now that we are done with buffering delays, what about packet reordering and the packet overhead tax? The reordering problem could be solved at the routing level in PSNs: if every routing node is aware of the flow state, it may ensure that all packets belonging to the same flow are sent across the same path. This is typically implemented by deep packet inspection (which by the way violates the end-to-end principle as stated in RFC 1958) and classifying the packets based on higher-level information. Such implementations are, however, rather effective, as inspection and flow classification are typically performed in the forwarding path using hardware acceleration. The overhead tax problem has two solutions. The first one is, again, over-provisioning. By using high-capacity, lightly utilized links we may ignore the bandwidth wasted on overhead. The second solution requires adding some state to network nodes: by performing flow inspection at both ends of a single link we may strip the header information and replace it with a small flow ID. The other end of the link will reconstruct the original headers by matching the flow ID to the locally stored state information. This solution violates the end-to-end principle and scales poorly as the number of flows grows. Typically it is used on slow-speed links. For example, VoIP services utilize IP/RTP/UDP and possibly TCP header compression to reduce the packet overhead tax.

Practical Example: VoIP Bandwidth Consumption

Let’s say we have a 2Mbps link and we want to know how to provision priority-queue settings for G.729 calls. Firstly, we need to know the per-flow bandwidth consumption. You may find enough information on this topic in [3]. Assuming we are using header compression over Frame-Relay, the per-flow bandwidth is 11.6Kbps. With a maximum link capacity of roughly 2000Kbps, the theoretical maximum over-subscription rate is 2000/11.6=172. We can find the maximum number of flows allowed under the condition of P as

r=1/(1-exp(ln(1-P)/u)) = 1/(1-(1-P)^(1/u)) (***)

setting u=170 and P=0.99. This yields the theoretical limit of 37 concurrent flows. The total bandwidth for that many flows is 37*11.6=429Kbps or about 21% of the link capacity, as predicted by the asymptotic formula (**) above. The remaining bandwidth could be used by other non-CBR applications, as it should be expected from a PSN exhibiting high link utilization efficiency.
Knowing the aggregate bandwidth and maximum number of flows provides us the parameters for admission control tools (e.g. policer rate, RSVP bandwidth and so forth). However, what is left to define are the burst settings and the queue depth for LLQ. The maximum theoretical burst size equals the maximum number of flows multiplied by the voice frame size. From [3] we readily obtain that a compressed G.729 frame size for a Frame-Relay connection is 29 bytes. This gives us a burst of 1073 bytes, which we could round up to 1100 bytes for safety. The maximum queue depth could be defined as the number of flows minus one, since in the worst case at least one priority packet would be scheduled for serializing while the others are held in the priority queue for processing. This means the queue depth would be at maximum 36 packets. The resulting IOS configuration would look like:

policy-map TEST
 class TEST
    priority 430 1100
    queue-limit 36 packets
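The numbers that end up in the configuration above can be derived in a few lines. This Python sketch applies formula (***) and the burst and queue-depth rules from the text; the function names and the dictionary keys are illustrative only, and rounding the results (430 Kbps, 1100 bytes) is left to the reader as in the article.

```python
import math


def max_flows(p, u):
    """Formula (***): largest whole r meeting on-time probability p at ratio u."""
    return math.floor(1 / (1 - (1 - p) ** (1 / u)))


def llq_parameters(p, u, flow_kbps, frame_bytes):
    """Derive priority bandwidth, burst and queue depth for the LLQ class."""
    flows = max_flows(p, u)
    return {
        "flows": flows,
        "bandwidth_kbps": flows * flow_kbps,   # aggregate priority rate
        "burst_bytes": flows * frame_bytes,    # one frame from every flow
        "queue_limit": flows - 1,              # one packet is being serialized
    }


if __name__ == "__main__":
    # Values from the text: P=0.99, u=170, 11.6 Kbps and 29 bytes per frame.
    for name, value in llq_parameters(0.99, 170, 11.6, 29).items():
        print(name, value)
```

Re-running the function with a different codec or target probability gives the matching `priority` and `queue-limit` values without redoing the arithmetic by hand.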

Circuits vs Packets

It is interesting to compare voice call capacity for a digital TDM circuit vs the same circuit being used for packet mode transport. Talking of an E1 circuit, we can transport as many as 30 calls, if one channel is used for associated signaling (e.g. ISDN). Compare this to the 37 G.729 VoIP calls we may obtain if the same circuit is channelized and runs IP: about a 20% increase in call capacity. However, it is important to point out that the quality of G.729 calls is degraded as compared to digital 64Kbps bearer channels, not to mention that other services could not be delivered over a compressed emulated voice channel. It might be fairer to compare the digital E1 to the packetized E1 circuit carrying G.711 VoIP calls. In this case, the bit rate for a single call running over Frame-Relay encapsulation with IP/RTP/UDP header compression would be (160+2+7)*50*8=67600bps or 67.6Kbps. The maximum over-provisioning rate is 29 in such a case, which ends up with only six (6) VoIP calls allowed for the packetized link with P=99% in-time delivery! Therefore, if you try providing digital call quality over a packet network you end up with an extremely inefficient implementation. Finally, consider an intermediate case: G.729 calls without IP/RTP/UDP header compression. This case assumes that complexity belongs to the network edge, as any transit links are not required to implement the header compression procedure. We end up with the following: an uncompressed G.729 call over Frame-Relay generates (20+40+7)*50*8=26.8Kbps, which results in an over-provisioning coefficient of u=74 and r=16 flows, slightly over half of the number that an E1 could carry.
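The three cases compared above can be checked with a short script. The following Python sketch is only a re-calculation under the same assumptions as the text (a link of roughly 2000 Kbps, 50 packets per second, 7 bytes of Frame-Relay overhead, P=99%); the helper names are invented for the example.

```python
import math

LINK_KBPS = 2000   # approximate E1 capacity used in the text
P = 0.99           # target probability of on-time delivery


def call_kbps(payload_bytes, header_bytes, l2_bytes=7, pps=50):
    """Per-call bit rate in Kbps at a fixed packet rate."""
    return (payload_bytes + header_bytes + l2_bytes) * pps * 8 / 1000


def calls_supported(kbps):
    """Take u as the whole number of calls that fit the link, then apply
    r = 1/(1-(1-P)^(1/u)) and round down."""
    u = math.floor(LINK_KBPS / kbps)
    return math.floor(1 / (1 - (1 - P) ** (1 / u)))


if __name__ == "__main__":
    cases = {
        "G.729 with header compression": call_kbps(20, 2),
        "G.711 with header compression": call_kbps(160, 2),
        "G.729 without header compression": call_kbps(20, 40),
    }
    for name, kbps in cases.items():
        print(f"{name}: {kbps} Kbps per call, {calls_supported(kbps)} calls")
```

The script reproduces the 37, 6 and 16 call figures, against 30 calls on the native E1.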
Using the asymptotic formula (**) we see that for P=99% no more than 21% of the packetized link could be used for CBR services. This implies that the packet compression scheme should reduce the bandwidth of pure CBR flow by more than 5 times to be effectively compared with circuit-switched transport. Based on this, we conclude that PSNs could be more efficient for CBR transportation compared to “native” circuit-networks only if they utilize advanced processing features such as payload/header compression yielding compression coefficient over 5 times. However, we should keep in mind that such compression is not possible for all CBR services, e.g. relaying T1/E1 over IP has to maintain the full bandwidth of the original TDM channels, which is extremely inefficient in terms of resource utilization. Furthermore, the advanced CODEC features require complex equipment at the network edge and possibly additional complexity in the other parts of the network, e.g. in order to implement link header compression.
It could be argued that the remaining 79% of the packetized link could be used for data transmission, but the same is possible with circuit switched networks, provided that packet routers are attached to the edges. All the data packet switching routers need to do is dynamically request transmission circuits from the CSN based on traffic demands and use them for packet transmissions. This approach has been implemented, among others, in GMPLS ([5]).

Conclusions

The above logic demonstrates that PSNs were not really designed to be good at emulating true CBR services. This is natural, as the original intent of PSNs was to maximize the use of scarce link bandwidth. Transporting CBR services not only requires complex queueing disciplines but also ultimately over-provisioning the link bandwidth, thus somewhat defeating the main purpose of PSNs. Indeed, if all that a PSN is used for is CBR service emulation, the under-utilization remains very significant. Some cases, like VoIP, allow for effective payload transformation and significant bandwidth reduction, which allows for more efficient use of network resources. On the other hand, such payload transformation requires introducing extra complexity to networking equipment. All this is in addition to the fact that packet-switching equipment is inherently more complex and expensive compared to circuit-switching equipment, especially for very high-speed links. Indeed, packet switching logic requires complex dynamic lookups, large buffer memory and an internal interconnection fabric. Memory requirements and dynamic state grow proportionally to the link speed, making high-speed packet-switching routers extremely expensive not only in hardware but also in software, due to advanced control plane requirements and proliferating services. More on this subject could be found in [4]. It is worth mentioning that the packet-switching inefficiency in the network core was recognized long ago, and there have been attempts at integrating circuit-switching core networks with packet-switching networks, the most notable being GMPLS ([5]). However, so far, industry inertia has kept any of the proposed integration solutions from becoming viable.
Despite all these arguments, VoIP implementations have been highly successful so far, most likely owing to the effectiveness of VoIP codecs. Of course, no one can yet say that VoIP over the Internet provides quality comparable to digital phone lines, but at least it is cheap, and that’s what the market is looking for. VoIP has been highly successful in enterprises so far, mainly because enterprise campus networks are mostly high-speed, based on Ethernet switching technology, and demonstrate a very low general link utilization ratio, within 1-3% of available bandwidth. In such over-provisioned conditions, deploying VoIP should not pose major QoS challenges.

Understanding BGP Convergence

12:20:00 PM Tutorials

Introduction

BGP (see [0]) is the de-facto protocol used for Inter-AS connectivity nowadays. Even though it is commonly accepted that the BGP protocol design is far from ideal and there have been attempts to develop a better replacement for BGP, none of them has been successful. To further add to BGP’s widespread adoption, the MP-BGP extension allows BGP to transport almost any kind of control-plane information, e.g. to provide auto-discovery functions or control-plane interworking for MPLS/BGP VPNs. However, despite BGP’s success, the problems with the protocol design did not disappear. One of them is slow convergence, which is a serious limiting factor for many modern applications. In this publication, we are going to discuss some techniques that could be used to improve BGP convergence for Intra-AS deployments.

BGP-Only Convergence Process

BGP is a path-vector protocol, in other words a distance-vector protocol featuring a complex metric. In the absence of any policies, BGP operates as if routes had a metric equal to the length of the AS_PATH attribute. BGP routing policies may override this simple monotonic metric and potentially create divergence conditions in non-trivial BGP topologies (see [7],[8],[9]). While this may be a serious problem at a large scale, we are not going to discuss these pathological cases, but rather talk about convergence in general. Like any distance-vector protocol, the BGP routing process accepts multiple incoming routing updates and advertises only the best routes to its peers. BGP does not utilize periodic updates, and thus route invalidation is not based on expiring any sort of soft-state information (e.g. prefix-related timers as in RIP). Instead, BGP uses an explicit withdrawal section in the triggered UPDATE message to signal to neighbors the loss of a particular path. In addition to explicit withdrawals, BGP also supports implicit signaling, where newer information for the same prefix from the same peer replaces the previously learned information.
Let’s have a look at the BGP UPDATE message below. As you can see, the UPDATE message may contain both withdrawn prefixes and new routing information. While withdrawn prefixes are listed simply as a collection of NLRIs, new information is grouped around a set of BGP attributes shared by the group of announced prefixes. In other words, every BGP UPDATE message carries new information for exactly one set of path attributes, so all prefixes announced in it share those attributes (the same AS_PATH, at the very least). Therefore, every new collection of attributes requires a separate UPDATE message to be sent. This fact is important, as the BGP process tries to pack as many prefixes per UPDATE message as possible when replicating routing information.
BGP-Convergence-FIG0
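The packing rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name pack_updates, the tuple used to represent an attribute set, and the per-message prefix limit are all hypothetical (a real implementation is limited by the message size in bytes, not by a prefix count).

```python
from collections import defaultdict


def pack_updates(routes, max_prefixes_per_update=3):
    """Group prefixes sharing an identical attribute set into UPDATE batches.

    routes: iterable of (prefix, attrs), where attrs is a hashable tuple
            standing in for the full set of path attributes.
    Returns a list of (attrs, [prefixes]); one entry per simulated UPDATE.
    """
    by_attrs = defaultdict(list)
    for prefix, attrs in routes:
        by_attrs[attrs].append(prefix)

    updates = []
    for attrs, prefixes in by_attrs.items():
        # A new attribute set always starts a new UPDATE; a large group is
        # split once the (hypothetical) per-message limit is reached.
        for i in range(0, len(prefixes), max_prefixes_per_update):
            updates.append((attrs, prefixes[i:i + max_prefixes_per_update]))
    return updates


routes = [
    ("20.0.0.0/8", ("100", "igp")),
    ("30.0.0.0/8", ("100", "igp")),
    ("40.0.0.0/8", ("100 200", "igp")),
]
for attrs, prefixes in pack_updates(routes):
    print(attrs, prefixes)
```

The first two prefixes share one attribute set and travel in one message, while the third needs a message of its own. The fewer distinct attribute sets a table contains, the better the packing.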

Look at the sample topology below. Let’s assume that R1’s session to R7 just came up and follow the way that prefix 20.0.0.0/8 takes to propagate through AS 300. In the course of this discussion we skip the complexities associated with BGP policy application and thus ignore the existence of the BGP Adj-RIB-In space used for processing the prefixes learned from a peer prior to running the best-path selection process.
BGP-Convergence-FIG1

Upon session establishment and exchange of the BGP OPEN messages, R1 enters the “BGP Read-Only Mode”. This means that R1 will not start the BGP best-path selection process until it either receives all prefixes from R7 or reaches the BGP read-only mode timeout. The timeout is defined using the BGP process command bgp update-delay. The reason to hold the BGP best-path selection process is to ensure that the peer has supplied us with all of its routing information. This minimizes the number of best-path selection runs, simplifies update generation and ensures better prefix-per-message packing, thus improving transport efficiency.
The BGP process determines the end of the UPDATE message flow in one of two ways: receiving a BGP KEEPALIVE message or receiving a BGP End-of-RIB message. The latter is normally used for BGP graceful restart (see [13]), but could also be used to explicitly signal the end of the BGP UPDATE exchange. Even if the BGP process does not support the End-of-RIB marker, Cisco’s BGP implementation always sends a KEEPALIVE message when it finishes sending updates to a peer. Clearly, the best-path selection delay is longer when peers have to exchange larger routing tables, or when the underlying TCP transport and router ingress queue settings make the exchange slower. To address this, we’ll briefly cover TCP transport optimization later.
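The decision to leave read-only mode can be written down as a small predicate. The sketch below is a simplified model, not the actual IOS logic: the function name and its arguments are hypothetical, and the timeout is simply whatever value is passed in for the update-delay.

```python
def leaves_read_only_mode(elapsed_seconds, update_delay_seconds, peers_done):
    """Return True once best-path selection may start.

    elapsed_seconds:      time since the router entered read-only mode
    update_delay_seconds: the configured read-only timeout (bgp update-delay)
    peers_done:           dict peer -> True once that peer has signaled the end
                          of its initial updates (KEEPALIVE or End-of-RIB)
    """
    # The timeout always wins, even if a slow peer is still sending.
    if elapsed_seconds >= update_delay_seconds:
        return True
    # Otherwise wait until every peer has finished its initial update flow.
    return bool(peers_done) and all(peers_done.values())


print(leaves_read_only_mode(10, 120, {"R7": False}))   # still waiting for R7
print(leaves_read_only_mode(10, 120, {"R7": True}))    # R7 signaled completion
print(leaves_read_only_mode(120, 120, {"R7": False}))  # timeout reached
```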
When R1’s BGP process leaves read-only mode, it starts best-path selection by running the BGP Router process. This process walks over the new information and compares it with the local BGP RIB contents, selecting the best path for every prefix. This takes time proportional to the amount of new information learned. Luckily, the computations are not very CPU-intensive, just like with any distance-vector protocol. As soon as the best-path process is finished, BGP has to upload all routes to the RIB before advertising them to the peers. This is a requirement of distance-vector protocols: the routing information must be active in the RIB before it is propagated further. The RIB update will in turn trigger the FIB information upload to the router’s line cards, if the platform supports distributed forwarding. Both RIB and FIB updates are time-consuming and take time proportional to the number of prefixes being updated.
After the information has been committed to the RIB, R1 needs to replicate the best paths to every peer that should receive them. The replication process can be the most memory- and CPU-intensive step, as the BGP process has to perform a full BGP table walk for every peer and construct the output for the corresponding BGP Adj-RIB-Out. This may require additional transient memory in the course of the update batch calculation. However, the update generation process is highly optimized in Cisco’s BGP implementation by means of dynamic update groups. The essence of dynamic update groups is that the BGP process dynamically finds all neighbors sharing the same output policies, then elects the peer with the lowest IP address as the group leader and generates the update batch only for the group leader. All other members of the same group receive the same updates. In our case, R1 has to generate two update sets: one for R5 and another for the pair of route reflectors, RR1 and RR2. BGP update groups become very effective on route reflectors, which often have hundreds of peers sharing the same policies. You may see the update groups using the command show ip bgp replication for IPv4 sessions.
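To make the grouping idea concrete, here is a minimal Python sketch. It is an illustration only: the peer addresses, the policy labels and the function name build_update_groups are hypothetical, and a single string stands in for the full outbound policy that a real implementation would compare.

```python
import ipaddress
from collections import defaultdict


def build_update_groups(peers):
    """Group peers by outbound policy and pick a leader per group.

    peers: dict peer_ip (str) -> outbound policy key (any hashable value).
    Returns dict policy -> {"leader": ip, "members": [ips]}.
    """
    groups = defaultdict(list)
    for ip, policy in peers.items():
        groups[policy].append(ip)

    result = {}
    for policy, members in groups.items():
        # Sort numerically, so that 10.0.0.9 comes before 10.0.0.11.
        members.sort(key=ipaddress.ip_address)
        result[policy] = {"leader": members[0], "members": members}
    return result


peers = {
    "10.0.0.11": "rr-policy",   # hypothetical address for RR1
    "10.0.0.9": "rr-policy",    # hypothetical address for RR2
    "10.0.0.5": "edge-policy",  # hypothetical address for R5
}
for policy, group in build_update_groups(peers).items():
    print(policy, "leader:", group["leader"], "members:", group["members"])
```

The update batch is then built once per group rather than once per peer, which is where the savings on a route reflector with hundreds of clients come from.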
R1 starts sending updates to R5 and to RR1 and RR2. This will take some time, depending on the BGP TCP transport settings and the BGP table size. However, before R1 starts sending any updates to a peer or update group, it checks whether the Advertisement Interval timer is running for this peer. A BGP speaker starts this timer on a per-peer basis every time it’s done sending the full batch of updates to the peer. If the subsequent batch is ready to be sent and the timer is still running, the update is delayed until the timer expires. This is a dampening mechanism to prevent unstable peers from flooding the network with updates. The command to define this timer is neighbor X.X.X.X advertisement-interval XX. The default values are 30 seconds for eBGP and 5 seconds for iBGP/eiBGP sessions (intra-AS). This timer really starts playing its role only for “Down-Up” or “Up-Down” convergence, as any rapidly flapping changes are delayed by advertisement-interval seconds. This becomes especially important for inter-AS route propagation, where the default advertisement-interval is 30 seconds.

The process repeats itself on RR1 and RR2, starting with the incoming UPDATE packet reception, followed by best-path selection and update generation. If for some reason the prefix 20.0.0.0/8 vanished from AS 100 soon after it had been advertised, it may take as long as “Number_of_Hops x Advertisement_Interval” for the withdrawal to reach R3 and R4, as every hop may delay the fast subsequent update. As we can see, the main limiting factors of BGP convergence are BGP table size, transport-level settings and advertisement delay. The best-path selection time is proportional to the table size, as is the time required for update batching.
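The “Number_of_Hops x Advertisement_Interval” bound is easy to turn into a quick estimate. The sketch below generalizes it slightly by summing one interval per hop, since eBGP and iBGP hops use different defaults. The function name and the hop list are assumptions made for illustration; the 30 and 5 second values are the defaults quoted above.

```python
def worst_case_withdrawal_delay(advertisement_intervals):
    """Upper bound, in seconds, on how long a fast follow-up UPDATE can be
    held back along a path.

    advertisement_intervals: one advertisement-interval value per BGP hop
    that has to re-advertise the change.
    """
    return sum(advertisement_intervals)


# Hypothetical path: one eBGP hop (30 s) followed by two iBGP hops (5 s each).
hops = [30, 5, 5]
print(worst_case_withdrawal_delay(hops))  # 40
```

With equal intervals on every hop the sum reduces to the product given in the text. Processing and transport time come on top of this figure.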
Let’s look at a slightly different scenario to demonstrate how BGP multi-path may potentially improve convergence. Firstly, observing the topology presented on FIG 1, we may notice that AS 300 has two connections to AS 100. Thus, we might expect to see two paths to every route from AS 100 on every router in AS 300. But this is not always possible when any topology other than a BGP full mesh is used inside the AS. In our example, R1 and R2 advertise routing information to the route-reflectors RR1 and RR2. Per the distance-vector behavior, the reflectors will only re-advertise the best path to AS 100 prefixes, and since both RRs elect paths consistently, they will advertise the same path to R3, R4 and R2. Both R3 and R4 will receive the prefix 10.0.0.0/24 from each of the RRs and use the path via R1. R2 will receive the best path via R1 as well, but will prefer using its eBGP connection. On the contrary, if R1, R2, R3 and R4 were connected in a full mesh, then every router would see exits via both R1 and R2 and would be able to use BGP multi-path if configured. Let’s review what happens in the topology on FIG1 when R1 loses its connection to AS 100.

Depending on the failure detection mechanism, be it BGP keepalives or BFD, it will take some time for R1 to realize the connection is no longer valid. We’ll discuss the options for fast failure detection later in this publication.
After realizing that R7 is gone, R1 deletes all paths via R7. Since RR1 and RR2 never advertised the path via R2 back to R1, R1 has no alternate paths to AS 100. Realizing this, R1 prepares a batch of UPDATE messages for RR1, RR2 and R5, containing the withdrawals for AS 100 prefixes. As soon as RR1 and RR2 are done receiving and processing the withdrawals, they elect the new best path via R2 and advertise withdrawals/updates to R1, R2, R3, R4.
R3 and R4 now have the new path via R2, and R2 loses the “backup” path via R1 it knew about from the RRs. The main workhorses of the re-convergence process in this case are the route-reflectors. The convergence time is the sum of the peering session failure detection, update advertisement and BGP best-path recalculation times in the RRs.

If BGP speakers were able to utilize multiple paths at the same time, it would be possible to reduce the severity of a network failure. Indeed, if load-balancing is in use, then a failure of an exit point will only affect flows going across this exit point (50% in our case) and only those flows will have to wait for the re-convergence time. Even better, it is theoretically possible to do “fast” re-route in the case where multiple equal-cost (equivalent and thus loop-less) paths are available in BGP. Such a switchover could be performed in the forwarding engine as soon as the failure is signaled. However, there are two major problems with a re-route mechanism of this type:

As we have seen, the use of route-reflectors (or confederations) has a significant effect on redundancy by hiding alternate paths. Using a full mesh is not an option, so a mechanism is needed that allows propagation of multiple alternate paths in an RR/Confederation environment. It is interesting to point out that such a mechanism is already available in BGP/MPLS VPN scenarios, where multiple points of attachment for CE sites can utilize different RD values to differentiate the same routes advertised from different connection points. However, a generic solution is required, allowing for advertising multiple alternate paths with IPv4 or any other address-family.
Failure detection and propagation by means of BGP mechanics is slow, and depends on the number of affected prefixes. Therefore, the more severe the damage, the slower it is propagated in BGP. Some other, non-BGP mechanism needs to be used to report network failures and trigger BGP re-convergence.

In the following sections we are going to review various technologies developed to accelerate BGP convergence, enabling far better reaction times compared to “pure BGP based” failure detection and repair.

Tuning BGP Transport

Tuning the BGP transport mechanism is a very important factor for improving BGP performance in cases where a purely BGP-based re-convergence process is in use. TCP is the underlying transport used for propagating BGP UPDATE messages, and optimizing TCP performance directly benefits BGP. If you take the full Internet routing table, which is above 300k prefixes (Y2010), then simply transporting the prefixes alone will consume over 10 Megabytes, not counting the path attributes and other information. Tuning TCP transport performance includes the following:

Enabling TCP Path MTU discovery for every neighbor, to allow TCP to select the optimum MSS. Notice that this requires that no firewall blocks the ICMP unreachable messages used during the discovery process.
Tuning the router’s ingress queue size to allow for successful absorption of a large number of TCP ACK messages. When a router starts replicating BGP UPDATEs to its peers, every peer normally responds with a TCP ACK message to every second segment sent (TCP Delayed ACK). The more peers a router has, the higher the pressure on the ingress queue.

Very detailed information on tuning BGP transport could be found in [10] Chapter 3. We, therefore, skip an in-depth discussion of this topic here.

BGP Fast Peering Session Deactivation

When using BGP-only convergence mechanics, detecting a link failure is normally based on the BGP KEEPALIVE timers, which are 60/180 seconds by default. TCP keepalives could be used for the same purpose, but since BGP already has similar mechanics they are not of much help. It is possible to tune the BGP keepalive timers to be as low as 1/3 seconds, but the risk of peering session flapping becomes significant with such settings. Such instability is dangerous since there is no built-in session dampening mechanism in the BGP session establishment process. Therefore, some other mechanism should be preferred: either BFD or fast BGP peering session deactivation. The latter option is on by default for eBGP sessions, and tracks the outgoing interface associated with the BGP session. As soon as the interface (or the next-hop for multihop eBGP) is reported as down, the BGP session is deactivated. Interface flapping can be effectively dampened using IP Event Dampening in Cisco IOS (see [14]) and hence is less dangerous than BGP peering session flapping. The command to disable fast peering session deactivation is no bgp fast-external-fallover. Notice that this feature is off by default for iBGP sessions, as those are supposed to be routed and restored using the underlying IGP mechanics.
Using BFD is the best option on multipoint interfaces, such as Ethernet, that do not support fast link-down detection, e.g. by means of Ethernet OAM. BFD is especially attractive on platforms that implement it in hardware. The command to activate BFD fallover is neighbor X.X.X.X fall-over bfd. In the following sections, we’ll discuss the use of the IGP for fast reporting of link failures.

BGP and IGP Interaction

BGP prefixes typically rely on recursive next-hop resolution. That is, the next-hops associated with BGP prefixes are normally not directly connected, but rather resolved via the IGP. The core of BGP and IGP interaction used to be implemented in the BGP Scanner process. This process runs periodically and, among other work, performs a full BGP table walk and validates the BGP next-hop values. The validation consists of resolving the next-hop recursively through the router’s RIB and possibly changing the forwarding information in response to IGP events. For example, if R1 crashes on FIG1, it will take 180 seconds for the RRs to notice the failure based on BGP KEEPALIVE messages. However, the IGP will probably converge faster and report R1’s address as unreachable. This event will be detected during the next BGP Scanner run, and all paths via R1 will be invalidated by all BGP speakers in AS 300. The default BGP Scanner interval is 60 seconds, and it can be changed using the command bgp scan-time. Notice that setting this value too low may put an extra burden on the router’s CPU if you have large BGP tables, since the scanner process has to perform a full table walk every time it executes.
The periodic behavior of the BGP Scanner is still too slow to respond effectively to IGP events. IGP protocols can be tuned to react to a network change within hundreds of milliseconds (see [6]) and it would be desirable to make BGP aware of such changes as quickly as possible. This can be done with the help of the BGP Next-Hop Tracking (NHT) feature. The idea is to make the BGP process register the next-hop values with the RIB “watcher” process and request a “call-back” every time information about the prefix corresponding to the next-hop changes. Typically, the number of registered next-hop values equals the number of exits from the local AS, or the number of PEs in an MPLS/BGP VPN environment, so next-hop tracking does not impose heavy memory/CPU requirements. There are normally two types of events: an IGP prefix becoming unreachable and an IGP prefix metric change. The first event is more important and is reported faster than a metric change. Overall, the report of an event is delayed for the duration of the bgp nexthop trigger delay XX interval, which is 5 seconds by default. This allows more consecutive events to be received from the IGP and processed together, and effectively implements event aggregation. This delay is helpful in various “fate sharing” scenarios where a facility failure affects multiple links in the network, and BGP needs to ensure that all IGP nodes have reported this failure and the IGP has fully converged. Normally, you should set the NHT delay to be slightly above the time it takes the IGP to fully converge upon a change in the network. In a fast-tuned IGP network, you can set this delay to as low as 0 seconds, so that every IGP event is reported immediately, though this requires careful underlying IGP tuning to avoid oscillations. See [6] for more information on tuning the IGP protocol settings, but in short, you need to tune the SPF delay value in the IGP to be conservative enough to capture all changes that could be caused by a failure in the network.
Setting the SPF delay too low may result in excessive BGP next-hop recalculations and massive best-path process runs.
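The effect of the trigger delay on event aggregation can be modeled with a short Python sketch. This is a deliberately simplified model and not the actual IOS algorithm: it assumes the first event of a batch starts the timer and every event arriving before the timer fires joins the same batch. The function name, the event tuples and the timestamps are hypothetical.

```python
def aggregate_nht_events(events, trigger_delay):
    """Batch IGP next-hop events the way a trigger delay would.

    events:        list of (timestamp_seconds, next_hop)
    trigger_delay: seconds to wait after the first event of a batch
    Returns a list of (fire_time, [next_hops]), one entry per BGP reaction.
    """
    batches = []
    deadline = None
    current = set()
    for ts, next_hop in sorted(events):
        # An event arriving after the pending timer fired starts a new batch.
        if deadline is not None and ts > deadline:
            batches.append((deadline, sorted(current)))
            current = set()
            deadline = None
        if deadline is None:
            deadline = ts + trigger_delay
        current.add(next_hop)
    if deadline is not None:
        batches.append((deadline, sorted(current)))
    return batches


# Two next-hops fail 200 ms apart (fate sharing), one flaps again later.
events = [(0.0, "10.0.1.1"), (0.2, "10.0.2.2"), (7.0, "10.0.1.1")]
print(aggregate_nht_events(events, trigger_delay=5))
print(aggregate_nht_events(events, trigger_delay=0))
```

With a 5 second delay the two near-simultaneous failures are handled in a single best-path run, whereas with a delay of 0 every event triggers its own run. This is why a zero delay only makes sense when the IGP itself aggregates the underlying events.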
As a reaction to an IGP next-hop change, the BGP process has to start the BGP Router sub-process to re-calculate the best paths. This affects every prefix whose next-hop changed as a result of the IGP event, and could take a significant amount of time, depending on the number of prefixes associated with this next-hop. For example, if an AS has two connections to the Internet and receives full BGP tables over both connections, then a single exit failure will force a full-table walk over more than 300k prefixes. After this happens, BGP has to upload the new forwarding information to the RIB/FIB, with the overall delay being proportional to the table size. To put it in other words, BGP convergence is non-deterministic in response to an IGP event, i.e. there is no well-defined finite time for the process to complete. However, if the IGP change did not result in any effects on the BGP next-hop, e.g. if the IGP was able to repair the path upon a link failure and the path has the same cost, then BGP does not need to be informed at all and convergence is handled at the IGP level.
The last, less visible contributor to faster convergence is the Hierarchical FIB. Look at the figure below: it shows how the FIB can be organized as either “flat” or “hierarchical”. In the “flat” case, BGP prefixes have their forwarding information directly associated, e.g. the outgoing interface, MAC rewrite, MPLS label information and so on. In such a case, any change to a BGP next-hop may require updating a lot of prefixes sharing the same next-hop, which is a time-consuming process. Even if the next-hop value remains the same and only the output interface changes, the FIB update process still needs to walk over all BGP prefixes and reprogram the forwarding information. In the case of a “hierarchical” FIB, any IGP change that does not affect BGP prefixes, e.g. an output interface change, only requires walking over the IGP prefixes, which are not as numerous as the BGP ones. Therefore, hierarchical FIB organization significantly reduces FIB update latency in cases where only IGP information needs to be changed. The use of the hierarchical FIB is automatic and does not require any special commands. All major networking equipment vendors support this feature.
BGP-Convergence-FIG2
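The difference between the two FIB layouts can be illustrated with a toy model in Python. This is a sketch under obvious simplifications: the class names, the dictionaries standing in for forwarding tables and the generated prefixes are all hypothetical, and real FIBs hold adjacency and rewrite data rather than interface names.

```python
class FlatFib:
    """Every BGP prefix carries its own copy of the forwarding information."""

    def __init__(self):
        self.entries = {}  # bgp prefix -> (next_hop, out_interface)

    def install(self, prefix, next_hop, interface):
        self.entries[prefix] = (next_hop, interface)

    def igp_interface_change(self, next_hop, new_interface):
        """Rewrite every prefix using next_hop; return how many were touched."""
        touched = 0
        for prefix, (nh, _old) in list(self.entries.items()):
            if nh == next_hop:
                self.entries[prefix] = (nh, new_interface)
                touched += 1
        return touched


class HierarchicalFib:
    """BGP prefixes point at a shared IGP entry for their next-hop."""

    def __init__(self):
        self.bgp = {}  # bgp prefix -> next_hop (pointer to the IGP level)
        self.igp = {}  # next_hop -> out_interface

    def install(self, prefix, next_hop, interface):
        self.bgp[prefix] = next_hop
        self.igp[next_hop] = interface

    def igp_interface_change(self, next_hop, new_interface):
        """Only the shared IGP entry is rewritten; return entries touched."""
        self.igp[next_hop] = new_interface
        return 1

    def lookup(self, prefix):
        next_hop = self.bgp[prefix]
        return next_hop, self.igp[next_hop]


def demo(prefix_count=1000):
    flat, hier = FlatFib(), HierarchicalFib()
    for i in range(prefix_count):
        prefix = f"20.{i // 256}.{i % 256}.0/24"
        flat.install(prefix, "10.0.1.1", "Serial1/0")
        hier.install(prefix, "10.0.1.1", "Serial1/0")
    # The IGP re-routes the next-hop 10.0.1.1 over another interface.
    return (flat.igp_interface_change("10.0.1.1", "Serial1/2"),
            hier.igp_interface_change("10.0.1.1", "Serial1/2"))


print(demo())  # (1000, 1)
```

The flat layout touches every BGP prefix, while the hierarchical layout touches a single shared entry regardless of how many prefixes resolve through it.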

The last thing to discuss in relation to BGP NHT is IGP route summarization. Summarization hides detailed information and may conceal changes occurring in the network. In such a case, the BGP process will not be notified of the IGP event and will have to detect the failure and re-converge using BGP-only mechanics. Look at the figure below: because of summarization, R1 will not be notified of R2’s failure and the BGP process at R1 will have to wait until the BGP session times out. Aside from avoiding summarization for the prefixes used for iBGP peering, an alternate solution could be using multi-hop BFD [15]. Additionally, there is some work in progress to allow the separation of routing and reachability information natively in IGP protocols.
BGP-Convergence-FIG3

You can see now how NHT may allow BGP to react quickly to events inside its own AS, provided that the underlying IGP is properly tuned for fast convergence. This fast convergence process effectively covers core link and node failures, as well as edge link and node failures, provided that all of these can be detected by the IGP. You may want to look at [1] for detailed convergence breakdowns. Pay special attention to the fact that an edge link failure requires special handling. If your edge BGP speaker is changing the next-hop value to self for the routes received from another autonomous system, then the IGP will only be able to detect failures for paths going to the BGP speaker’s own IP address. However, if the edge link fails, the convergence will follow the BGP path, using BGP withdrawal message propagation through the AS. The best approach in this case is to leave the eBGP next-hop IP address unmodified and advertise the edge link into the IGP using the passive interface feature or redistribution. This will allow the IGP to respond to a link-down condition by quickly propagating the new LSA and synchronously triggering BGP re-convergence on all BGP speakers in the system by informing them of the failed next-hop. In topologies with large BGP tables this takes significantly less time than the BGP-based convergence process. And lastly, despite all the benefits that BGP NHT may provide for recovering from Intra-AS failures, Inter-AS convergence is still purely BGP-driven, based on BGP’s distance-vector behavior.

BGP PIC and Multiple-Path Propagation

Even though BGP NHT enables fast reaction to IGP events, the convergence time is still not deterministic, because it depends on the number of prefixes BGP needs to process for best-path selection. Previously, we discussed how multiple equal-cost BGP paths could be used for redundancy and fast failover at the forwarding engine level, without involving any BGP best-path selection. What if the paths are unequal: is it possible to use them for backup? In fact, since BGP treats the local AS as a single hop, all BGP speakers select the same path consistently, and changing from one path to another synchronously among all speakers should not create any permanent routing loops. Thus, even in scenarios where equal-cost BGP multi-path is not possible, the secondary paths may still be used for fast failover, provided that a signaling mechanism to detect the primary path failure exists. We already know that BGP NHT can be used to detect a failure and propagate this information quickly to all BGP speakers, triggering a local switchover. This switchover does not require any BGP table walks or best-path re-election, but is simply a matter of changing the forwarding information, provided that a hierarchical FIB is in use. Therefore, this process does not depend on the number of BGP prefixes, and is thus known as the Prefix Independent Convergence (PIC) process. You may think of this process as a BGP equivalent of IGP-based Fast Re-Route, though in the IGP failure detection is local to the router and in BGP failure detection is local to the AS. BGP PIC can be used any time there are multiple paths to the destination prefix, such as on R1 in the example below, where the target prefix is reachable via multiple paths:
We have already stated the problem with multiple paths: only one best path is advertised by BGP speakers, and a BGP speaker will only accept one path for a given prefix from a given peer. If a BGP speaker receives multiple paths for the same prefix within the same session, it simply uses the newest advertisement. A special extension to BGP known as “Add Paths” (see [3] and [16]) allows a BGP speaker to propagate and accept multiple paths for the same prefix. The “Add Paths” capability allows peering BGP speakers to negotiate whether they support advertising/receiving multiple paths per prefix and to actually advertise such paths. A special 4-byte path-identifier is added to NLRIs to differentiate multiple paths for the same prefix sent across a peering session. Notice that BGP still considers all paths as comparable from the viewpoint of the best-path selection process: all paths are stored in the BGP RIB and only one is selected as the best path. The additional NLRI identifier is only used when prefixes are sent across a peering session, to prevent implicit withdrawals by the receiving peer. These identifiers are generated locally and independently for every peering session that supports such capability.
BGP-Convergence-FIG4
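To show what the extra identifier looks like on the wire, here is a small Python sketch that encodes an IPv4 NLRI with and without a path identifier. It assumes the layout used by the Add-Paths specification, where the 4-byte identifier precedes the usual length/prefix pair. The function name encode_nlri and the identifier values are hypothetical.

```python
import ipaddress
import struct


def encode_nlri(prefix, path_id=None):
    """Encode an IPv4 prefix as a BGP NLRI field.

    Without path_id: <length (1 byte)> <prefix bytes>.
    With path_id:    <path identifier (4 bytes)> <length> <prefix bytes>.
    """
    net = ipaddress.ip_network(prefix)
    # Only as many prefix bytes as the mask length requires are sent.
    nbytes = (net.prefixlen + 7) // 8
    body = bytes([net.prefixlen]) + net.network_address.packed[:nbytes]
    if path_id is None:
        return body
    return struct.pack("!I", path_id) + body


print(encode_nlri("20.0.99.0/24").hex())             # 18140063
print(encode_nlri("20.0.99.0/24", path_id=1).hex())  # 0000000118140063
print(encode_nlri("20.0.99.0/24", path_id=2).hex())  # 0000000218140063
```

Because the two advertisements for 20.0.99.0/24 differ in their identifiers, the receiver can store both instead of treating the second one as an implicit replacement of the first.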

In addition to propagating backup paths, the “Add Paths” capability could be used for other purposes, e.g. overcoming the BGP divergence problems described in [9]. Alternatively, if backup paths are required but the “Add Paths” feature is not implemented, one of your options could be using a full mesh of BGP speakers, as in the figure below. In this case, multiple exit point information is preserved, which allows for implementing BGP PIC functionality.
BGP-Convergence-FIG5

Pay attention to the fact that BGP PIC is possible even without the “Add Paths” capability in RR scenarios, provided that RRs propagate the alternate paths to the edge nodes. This may require IGP metric manipulation to ensure different exit points are selected by the RRs or using other techniques, such as different RD values for multi-homed site attachment points.

Practical Scenario: BGP PIC + BGP NHT

In this hands-on scenario we are going to illustrate the use of IGP tuning, BGP NHT configuration and BGP PIC, and demonstrate how they work together. First, look at the topology diagram: R9 is advertising a prefix, and R5 and R6 receive this prefix via the RRs. In a normal BGP environment, provided that the RRs elect the same path, R5 and R6 would have just one path for R9’s prefix. However, we tune the scenario by disabling the connections between R1 and R4 and between R2 and R3, so R3 has a better cost to exit via R1 and R4 has a better cost via R2. This will make the RRs elect different best paths and propagate them to their clients.
BGP-Convergence-FIG6
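The effect of this tuning can be mimicked with a short Python sketch. It assumes that the two exit paths tie on everything except the IGP cost to the next-hop, so each RR simply picks the exit with the lowest cost. The function names and the cost values are hypothetical and are not taken from the lab.

```python
def rr_best_exit(igp_cost_to_exit):
    """Pick the exit with the lowest IGP cost.

    Ties are broken by the lowest exit name, standing in for the later
    tie-breakers of the real best-path algorithm.
    """
    return min(igp_cost_to_exit, key=lambda e: (igp_cost_to_exit[e], e))


def exits_seen_by_clients(rr_costs):
    """Return the distinct exits that the RRs reflect to their clients."""
    return sorted({rr_best_exit(costs) for costs in rr_costs.values()})


# Both RRs see equal costs: they elect the same exit, clients learn one path.
symmetric = {"R3": {"R1": 10, "R2": 10}, "R4": {"R1": 10, "R2": 10}}
# Links R1-R4 and R2-R3 disabled: each RR is now closer to a different exit.
tuned = {"R3": {"R1": 10, "R2": 20}, "R4": {"R1": 20, "R2": 10}}

print(exits_seen_by_clients(symmetric))  # ['R1']
print(exits_seen_by_clients(tuned))      # ['R1', 'R2']
```

Only in the tuned case do the clients learn two different exits, which is the precondition for having a backup path to install.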

The following is the key piece of configuration for enabling fast backup path failover, to be applied to every router in AS 100. As you can see, the SPF/LSA throttling timers are tuned very aggressively to allow for the fastest reaction to IGP events. The BGP nexthop trigger delay is set to 0 seconds, thus fully relying on the IGP to aggregate the underlying events. In any production environment, you should NOT use these values, but pick your own, matching your IGP scale and convergence rate.

router ospf 100
 timers throttle spf 1 100 5000
 timers throttle lsa all 0 100 5000
 timers lsa arrival 50
!
router bgp 100
 bgp nexthop trigger delay 0
 bgp additional-paths install
 no bgp recursion host

The command bgp additional-paths install, when executed in a non-BGP-multipath environment, allows for installing backup paths in addition to the best one elected by BGP. This, of course, requires that the additional paths have been advertised by the BGP Route Reflectors. At the moment of writing, Cisco IOS does not support the “Add Paths” capability, so you need to make sure the BGP RRs elect different best paths in order for the edge routers to be able to use additional paths. The command no bgp recursion host requires a special explanation of its own. By default, when a BGP prefix loses its next-hop, the CEF process will attempt to look up the next longest-matching prefix for the next-hop to provide a fallback. When additional repair paths are present, this functionality is not required and will, in fact, slow down convergence. This is why it’s automatically disabled when you type the command bgp additional-paths install, and thus typing it with the “no” prefix is not really required.
Now that we have our scenario set up, we are going to demonstrate that, at least in the current implementation, the Cisco IOS BGP process does not exchange or detect the “Add Path” capability. Here is the debugging output from a peering session establishment, which shows that no “Add Path Capability” (code 69, per the RFC draft) is exchanged during session establishment.

R5#debug ip bgp 10.0.3.3
BGP debugging is on for neighbor 10.0.3.3 for address family: IPv4 Unicast
R5#clear ip bgp 10.0.3.3

BGP: 10.0.3.3 active rcv OPEN, version 4, holdtime 180 seconds
BGP: 10.0.3.3 active rcv OPEN w/ OPTION parameter len: 29
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 6
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 1, length 4
BGP: 10.0.3.3 active OPEN has MP_EXT CAP for afi/safi: 1/1
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 2
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 128, length 0
BGP: 10.0.3.3 active OPEN has ROUTE-REFRESH capability(old) for all address-families
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 2
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 2, length 0
BGP: 10.0.3.3 active OPEN has ROUTE-REFRESH capability(new) for all address-families
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 3
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 131, length 1
BGP: 10.0.3.3 active OPEN has MULTISESSION capability, without grouping
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 6
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 65, length 4
BGP: 10.0.3.3 active OPEN has 4-byte ASN CAP for: 100
BGP: nbr global 10.0.3.3 neighbor does not have IPv4 MDT topology activated
BGP: 10.0.3.3 active rcvd OPEN w/ remote AS 100, 4-byte remote AS 100
BGP: 10.0.3.3 active went from OpenSent to OpenConfirm
BGP: 10.0.3.3 active went from OpenConfirm to Established

This means that we need to rely on the BGP RRs to advertise multiple different paths in order for the edge nodes to leverage the backup path capability.

R5#debug ip bgp updates
BGP updates debugging is on for address family: IPv4 Unicast
R5#debug ip bgp addpath
BGP additional-path related events debugging is on
R5#clear ip bgp 10.0.3.3

BGP(0): 10.0.3.3 rcvd UPDATE w/ attr: nexthop 20.0.17.7, origin i, localpref 100, metric 0, originator 10.0.1.1, clusterlist 10.0.3.3, merged path 200, AS_PATH
BGP(0): 10.0.3.3 rcvd 20.0.99.0/24
BGP(0): 10.0.3.3 rcvd NEW PATH UPDATE (bp/be - Deny)w/ prefix: 20.0.99.0/24, label 1048577, bp=N, be=N
BGP(0): 10.0.3.3 rcvd UPDATE w/ prefix: 20.0.99.0/24, - DO BESTPATH
BGP(0): Calculating bestpath for 20.0.99.0/24

Here you can see that the RR with IP address 10.0.3.3 sends us an update that has better information than the one we currently know. However, until you enable bgp additional-paths install, there is just one path installed for the prefix:

R5#show ip route repair-paths 20.0.99.0
Routing entry for 20.0.99.0/24
  Known via "bgp 100", distance 200, metric 0
  Tag 200, type internal
  Last update from 20.0.17.7 00:02:31 ago
  Routing Descriptor Blocks:
  * 20.0.17.7, from 10.0.3.3, 00:02:31 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 200
      MPLS label: none

But as soon as the bgp additional-paths install option has been enabled, the output of the same command looks different:

R5#show ip route repair-paths 20.0.99.0
Routing entry for 20.0.99.0/24
  Known via "bgp 100", distance 200, metric 0
  Tag 200, type internal
  Last update from 20.0.17.7 00:00:03 ago
  Routing Descriptor Blocks:
  * 20.0.17.7, from 10.0.3.3, 00:00:03 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 200
      MPLS label: none
    [RPR]20.0.28.8, from 10.0.4.4, 00:00:03 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 200
      MPLS label: none

You may also see the second path in the BGP table with the “b” (backup) flag:

R5#show ip bgp 20.0.99.0
BGP routing table entry for 20.0.99.0/24, version 39
Paths: (2 available, best #1, table default)
  Additional-path
  Not advertised to any peer
  200
    20.0.17.7 (metric 192) from 10.0.3.3 (10.0.3.3)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Originator: 10.0.1.1, Cluster list: 10.0.3.3
  200
    20.0.28.8 (metric 192) from 10.0.4.4 (10.0.4.4)
      Origin IGP, metric 0, localpref 100, valid, internal, backup/repair
      Originator: 10.0.2.2, Cluster list: 10.0.4.4

And if you check the CEF entry for this prefix, you will notice there are multiple next-hops and output interfaces that could be used for primary/backup paths:

R5#show ip cef 20.0.99.0 detail
20.0.99.0/24, epoch 0, flags rib only nolabel, rib defined all labels
  recursive via 20.0.17.7
    recursive via 20.0.17.0/24
      nexthop 10.0.35.3 Serial1/0
  recursive via 20.0.28.8, repair
    recursive via 20.0.28.0/24
      nexthop 10.0.35.3 Serial1/0
      nexthop 10.0.45.4 Serial1/2

Notice that in order to use the PIC functionality, BGP multipath should be turned off; otherwise, equal-cost paths will be used for load-sharing, not for primary/backup behavior. You may opt for equal-cost multipath if the network topology allows it, as it offers better resource utilization, and the CEF switching layer also allows for fast path failover with equal-cost load-balancing.
Now for debugging the fast failover process. We want to shut down R1's connection to R7 and see the fast backup path switchover at R5. There are a few caveats here, because we have a very simplified topology. Firstly, we only have one prefix advertised into BGP on R9. Propagating this prefix through BGP is almost instant, since BGP best-path selection is done quickly and the advertisement delay does not apply to a single event. Thus, if we shut down R1's connection to R7, which is used as the primary path, R1 will detect the link failure and shut down the session. Immediately after this, the BGP process will flood an UPDATE with the prefix removal, and this message would reach R5 and R6 even before OSPF finishes its SPF computations. The reason, of course, is that only a single prefix is propagated via BGP and the advertisement interval does not delay a single event.
It may seem that disabling BGP fast external fallover on R1 would take BGP out of the equation. However, we still have BGP NHT enabled on R1: as soon as we shut down the link, the RIB process reports the next-hop failure to BGP and an UPDATE message is sent right away. Thus, we also need to disable NHT on R1, using the command no bgp nexthop trigger enable. If we think further, we'll notice that we also need to disable NHT on R3 and R4, so that they cannot generate their own UPDATEs to R5 ahead of the OSPF notification. Therefore, prior to running the experiment we disable BGP NHT on R1, R3 and R4 and disable fast external fallover on R1. This allows the event from R1 to propagate via OSPF ahead of the BGP UPDATE message and trigger the fast switchover on R5. Below is the output of the debugging commands enabled on R5 after we shut down R1's connection to R7.

R5#debug ip ospf spf
OSPF spf events debugging is on
OSPF spf intra events debugging is on
OSPF spf inter events debugging is on
OSPF spf external events debugging is on

R5#debug ip bgp addpath
BGP additional-path related events debugging is on

R5 receives the LSA at 26.223, then BGP starts the path switchover at 26.295. It took 72ms to run SPF, update the RIB, inform BGP of the event and then change the paths.

14:00:26.223: OSPF: Detect change in topology Base with MTID-0, in LSA type 1, LSID 10.0.1.1 from 10.0.1.1 area 0
14:00:26.223: OSPF: Schedule SPF in area 0, topology Base with MTID 0
      Change in LS ID 10.0.1.1, LSA type R, spf-type Full
….
14:00:26.295: BGP(0): Calculating bestpath for 20.0.99.0/24, New bestpath is 20.0.28.8 :path_count:- 2/0, best-path =20.0.28.8, bestpath runtime :- 4 ms(or 3847 usec) for net 20.0.99.0
14:00:26.299: BGP(0): Calculating backuppath::Backup-Path for 20.0.99.0/24:BUMP-VERSION-BACKUP-DELETE:, backup path runtime :- 0 ms (or 193 usec)

14:00:32.439: BGP(0): 10.0.3.3 rcvd UPDATE w/ prefix: 20.0.99.0/24, - DO BESTPATH
14:00:32.443: BGP(0): Calculating bestpath for 20.0.99.0/24,   bestpath is 20.0.28.8 :path_count:- 2/0, best-path =20.0.28.8, bestpath runtime :- 0 ms(or 222 usec) for net 20.0.99.0
14:00:32.443: BGP(0): Calculating backuppath::Backup-Path for 20.0.99.0/24, backup path runtime :- 0 ms (or 133 usec)

In the debugging output above, you can see that the BGP process in R5 switched to backup path even before it received the UPDATE message from R3, signaling the change of the best-path in the RR. Notice that the update does not have any path identifiers in the NLRI, as the RR has only a single best-path. Let’s see how much time it actually took to run SPF, as compared to overall detection/failover process:

R5#show ip ospf statistics

            OSPF Router with ID (10.0.5.5) (Process ID 100)

  Area 0: SPF algorithm executed 15 times

  Summary OSPF SPF statistic

  SPF calculation time
Delta T   Intra D-Intra Summ D-Summ Ext D-Ext Total Reason
00:28:00   44 0 0 4 0 4 56 R
…..

As you can see, the total SPF runtime was 56ms. Therefore, the remaining 16ms of the 72ms measured above were spent on updating the RIB and triggering the next-hop change event. Of course, all these numbers have only relative meaning, as we are using Dynamips for this simulation, but you may use a similar methodology when validating real-world designs.
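As a quick sanity check of the numbers above, the following sketch recomputes the failover budget from the debug timestamps. The variable names are illustrative only; the values are copied from the R5 output shown earlier.

```python
# Timestamps (seconds within the minute 14:00) copied from the R5 debug output.
lsa_received = 26.223      # OSPF detects the topology change
bgp_switchover = 26.295    # BGP calculates the new bestpath

# Total SPF runtime in milliseconds, from "show ip ospf statistics".
spf_runtime_ms = 56

# Round to whole milliseconds to hide floating-point noise.
total_ms = round((bgp_switchover - lsa_received) * 1000)
remaining_ms = total_ms - spf_runtime_ms

print(f"detection to switchover: {total_ms} ms")
print(f"SPF runtime:             {spf_runtime_ms} ms")
print(f"RIB update and NHT:      {remaining_ms} ms")
```

The same subtraction can be applied to timestamps captured on real hardware, where the absolute values will of course differ from a Dynamips run.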

Considerations for Implementing BGP Add Paths

Even though the Add Paths feature is not yet implemented, it is worth considering the drawbacks of this approach. One drawback is that the amount of information that needs to be sent and stored is now multiplied by the number of additional paths. Previously, the most stressed routers in a BGP AS were the route reflectors, which had to carry the largest BGP tables. With the Add-Path functionality, every non-RR speaker now receives all the information that the RR stores in its BGP table. This puts extra requirements on the edge speakers and should be accounted for when planning to use this feature. Furthermore, additional paths will utilize extra memory on the forwarding engines, as PIC-enabled prefixes now have multiple alternate paths. However, since the number of prefixes remains the same, TCAM fast lookup memory resources will not be wasted, and thus it is mostly dynamic RAM that is affected.

Summary

Achieving fast BGP convergence is not easy, because BGP is a complicated routing protocol running overlay on top of an IGP process. We found out that tuning purely BGP-based convergence requires the following general steps:

Tuning BGP TCP Transport and router ingress queues to achieve faster routing information propagation.
Proper organization of outbound policies for achieving optimum update group construction.
Tuning BGP Advertisement Interval if needed to respond to fast “Down->Up” conditions.
Activating BGP fast external fallover and possibly BFD for fast external peering session deactivation.

As we noticed previously, pure-BGP based convergence is the only thing available for Inter-AS scenarios. However, for fastest convergence inside a single AS, understanding and tuning BGP and IGP interaction can make BGP converge almost as fast as the underlying IGP. This allows for fast recovery in response to intra-AS link and node failures, as well as to edge link failures. Optimizing BGP and IGP interaction requires the following:

Tuning the underlying IGP for fast convergence. It is possible to tune the IGP to converge in under one second even in a large network.
Enabling BGP Next-Hop Tracking process for all BGP speakers and tuning the BGP NHT delay in accordance with IGP response time.
Applying IGP summarization carefully to avoid hiding BGP NHT information.
Leveraging IGP for propagation of external peering link failures, in addition to relying on BGP peering session deactivation.
Using the Add-Path Functionality in critical BGP speakers (e.g. RRs) to allow for propagation of redundant paths if supported by implementation.
Using BGP PIC or fast backup switchover in environments that allow for multiple paths to be propagated, e.g. multihomed MPLS VPN sites using different RD values.

We’ve also briefly covered some caveats resulting from the future use of “Add-Path” functionality, such as excessive usage of memory resources on route processors and line cards and the extra toll on the BGP best-path process due to the growth of alternate paths. There are a few things that were left out of the scope of this paper. We did not concentrate on the detailed mechanics of BGP fast peering session deactivation, e.g. for multihop sessions, and we did not cover the MP-BGP specific features. Some MP-BGP extensions such as the additional import scan interval and edge control plane interworking have their effects on end-to-end convergence, but this is a topic for another discussion.

Rolling Back a Configuration

Have you ever been on your GradedLabs rack of equipment and wanted to test a particular feature or set of configurations without keeping these changes on the rack? Perhaps you are right in the middle of solving a Volume 2 lab and cannot have that configuration impacted.
Thanks to the very handy config replace command, you can roll back almost instantly to your previously saved configuration after experimenting. Here is a demonstration of just how simple this is. Enjoy, and let us give thanks for all there is to learn on blog.ine.com!

I also want to thank my good friend Keith Barker for first showing me this one.

Rack29R1#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
Rack29R1(config)#hostname TEST
TEST(config)#interface fastethernet0/0
TEST(config-if)#ip address 1.2.3.4 255.0.0.0
TEST(config-if)#no shut
TEST(config-if)#end
TEST#
Nov 25 09:09:58.856: %LINK-3-UPDOWN: Interface FastEthernet0/0, changed state to
Nov 25 09:09:59.173: %SYS-5-CONFIG_I: Configured from console by console
TEST#configure terminal
Nov 25 09:10:01.404: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/0, changed state to up
TEST#config replace nvram:startup-config force
Total number of passes: 1
Rollback Done
Rack29R1#
Nov 25 09:10:08.644: Rollback:Acquired Configuration lock.
Nov 25 09:10:17.827: %LINK-5-CHANGED: Interface FastEthernet0/0, changed state to administratively down
Nov 25 09:10:18.829: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/0, changed state to down
Rack29R1#
Nov 25 09:10:22.727: %PARSER-3-CONFIGNOTLOCKED: Unlock requested by process '3'. Configuration not locked.
Rack29R1#

Rack29R1#configure terminal
Enter configuration commands, one per line.  End with CNTL/Z.
Rack29R1(config)#hostname TEST
TEST(config)#interface fastethernet0/0
TEST(config-if)#ip address 1.2.3.4 255.0.0.0
TEST(config-if)#no shut
TEST(config-if)#end
TEST#
TEST#config replace nvram:startup-config force
Total number of passes: 1
Rollback Done
Rack29R1#
Rack29R1#show run interface fa0/0
Building configuration...
Current configuration : 83 bytes
!
interface FastEthernet0/0

 no ip address
 shutdown
 duplex auto
 speed auto
end

Rack29R1#

Performing Access-List Computation and Route Summarization Using ACL Manager

Problem Statement

A popular task in CCIE-level scenarios requires creating an access-list matching a set of prefixes using the minimum number of access-list entries. Typically, such scenarios are relatively easy, so figuring out a combination of subnet prefix and wildcard mask is more or less intuitive. However, a good question is whether there exists a generic algorithm for constructing such “minimal” access-lists. To give you a better feel for the problem, let’s start with an example. Look at the following access-list matching nine different subnets:

ip access-list standard TEST
 permit 138.0.0.0
 permit 170.0.0.0
 permit 177.0.0.0
 permit 185.0.0.0
 permit 204.0.0.0
 permit 205.0.0.0
 permit 206.0.0.0
 permit 207.0.0.0
 permit 234.0.0.0

Could this SAME filtering logic be implemented with a smaller number of ACL Entries (ACEs)? In fact, if you try the common approach and write all non-zero octets in binary form you will get the following:

10001010 138
10101010 170
10110001 177
10111001 185
11001100 204
11001101 205
11001110 206
11001111 207
11101010 234

Effectively, the only “common” bit is the highest-order one and thus a single-line matching ALL entries would look like:

ip access-list standard TEST
 permit 128.0.0.0 127.0.0.0
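The single-line result can be reproduced with a few bitwise operations. This is only a sketch of the "write the octets in binary and look for common bits" approach described above, not anything the switch itself runs; the variable names are made up for illustration.

```python
octets = [138, 170, 177, 185, 204, 205, 206, 207, 234]

# Print each first octet in binary next to its decimal value.
for value in octets:
    print(f"{value:08b} {value}")

# Bits that are 1 in every octet, and bits that are 1 in at least one octet.
all_ones = 0xFF
any_ones = 0x00
for value in octets:
    all_ones &= value
    any_ones |= value

# A bit position is "common" when every octet agrees on it (all 1 or all 0).
common_bits = all_ones | (~any_ones & 0xFF)
wildcard = ~common_bits & 0xFF   # ACL wildcard: 1 means "do not care"
base = all_ones                  # value carried by the common bits

print(f"permit {base}.0.0.0 {wildcard}.0.0.0")
print(f"first-octet values matched: {2 ** bin(wildcard).count('1')}")
```

With only the highest-order bit in common, the computed wildcard leaves seven bits free, which is exactly the single entry shown above.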

However, the obvious problem is that this access-list would match 128 prefixes as opposed to the original nine prefixes. This is clearly too much of an overlap in address space. What if we try using more access-list entries, as opposed to just one? For example, let’s group octets like this:

10001010 138
10101010 170
11101010 234
--
10110001 177
10111001 185
--
11001100 204
11001110 206
11001111 207
11001101 205

Denoting the “don’t care” bit using “x”, the three above groups could be represented as:

1xx01010 (covers 4 addresses)
1011x001 (covers 2 addresses)
110011xx (covers 4 addresses)

which effectively translates into the following access-list:

ip access-list standard TEST
 permit 138.0.0.0 96.0.0.0
 permit 177.0.0.0 8.0.0.0
 permit 204.0.0.0 3.0.0.0

This looks much better: in just 3 lines we covered 10 addresses. The one unwanted address is 202.0.0.0, so by adding a single statement “deny 202.0.0.0” ahead of the permit entries we end up with a four-element access-list matching exactly the 9 original entries. This is great, but how would one figure out that “clever” grouping of network octets that results in optimum packing? In fact, this job has already been done for you in Catalyst switches.
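To check a hand-made grouping like the one above, the patterns can be converted to value/wildcard pairs and tested against all 256 possible first octets. The helper names below are hypothetical, and the check only looks at the first octet because the remaining octets are zero throughout this example.

```python
def ace_from_pattern(pattern):
    """Convert an 8-character pattern of 1, 0 and x into (value, wildcard)."""
    value = int(pattern.replace("x", "0"), 2)
    wildcard = int("".join("1" if bit == "x" else "0" for bit in pattern), 2)
    return value, wildcard


def matches(octet, value, wildcard):
    """True when octet equals value on every bit the wildcard cares about."""
    care = ~wildcard & 0xFF
    return (octet & care) == (value & care)


original = {138, 170, 177, 185, 204, 205, 206, 207, 234}
patterns = ["1xx01010", "1011x001", "110011xx"]
aces = [ace_from_pattern(p) for p in patterns]

# Enumerate every first-octet value that at least one entry would permit.
covered = {o for o in range(256) if any(matches(o, v, w) for v, w in aces)}
extra = sorted(covered - original)
missing = sorted(original - covered)

for value, wildcard in aces:
    print(f"permit {value}.0.0.0 {wildcard}.0.0.0")
print("covered:", len(covered), "extra:", extra, "missing:", missing)
```

The `extra` list is what tells you which deny statement is needed in front of the permits.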

Using Catalyst IOS ACL Manager for fun and profit

The minimization problem stated above is not simply a CCIE task. In fact, it is an instance of a very important problem from the field of discrete mathematics: minimizing the number of constructive elements of a boolean function. In general, this problem is NP-hard, so no efficient exact algorithm is known, only more or less effective heuristics. One notable application of boolean function minimization is optimizing the use of TCAM memory in hardware-switching platforms. Ternary Content-Addressable Memory (TCAM) is a fast lookup mechanism used for prefix or access-list matching. Every element of TCAM is essentially a bit vector with the values of 1, 0 or “x”, the do-not-care bit. When you compare a given binary vector with the TCAM contents, the hardware performs a parallel lookup over all elements and returns all “matching” vectors. TCAM hardware is complex and expensive, so packing as many values as possible into a single TCAM vector is extremely desirable. This is what the hardware ACL manager does when you create and apply an access-list, be it for the purpose of traffic filtering or QoS classification. The manager takes all entries and tries to compress them into as few ACEs as possible with the given configuration. It also attempts to merge the results with the values already stored in TCAM, so you can imagine how complex the process is. Not to mention that the resulting answer is “best-effort” and not guaranteed to be absolutely optimal, as is always the case with heuristics.
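The ternary matching rule itself is simple to model in software. The sketch below is an assumption-laden toy: the `TcamEntry` type, the `tcam_lookup` function and the table contents are invented for illustration, and a sequential loop stands in for what real hardware does in parallel.

```python
from typing import NamedTuple


class TcamEntry(NamedTuple):
    value: int   # bit pattern to compare against
    care: int    # 1 = bit must match, 0 = "x" (do not care)
    result: str  # label standing in for the action tied to the entry


def tcam_lookup(table, key):
    """Return the results of all entries that match key.

    Real TCAM compares every entry at once in hardware; this loop only
    models the matching rule, not the performance.
    """
    return [e.result for e in table if (key & e.care) == (e.value & e.care)]


table = [
    TcamEntry(0b11001100, 0b11111100, "permit 204-207"),  # 110011xx
    TcamEntry(0b10110001, 0b11110111, "permit 177/185"),  # 1011x001
    TcamEntry(0b00000000, 0b00000000, "default deny"),    # xxxxxxxx
]

print(tcam_lookup(table, 206))
print(tcam_lookup(table, 185))
print(tcam_lookup(table, 10))
```

Note that the `care` field here uses the TCAM convention (1 means "must match"), which is the inverse of an ACL wildcard mask.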
Keeping this in mind, one may assume that we can get a nearly optimal ACL by creating the “suboptimal” version manually, feeding it to the ACL manager and then peeking at the TCAM contents to see what the results are. One challenge is that TCAM hardware inspection commands differ between platforms, not to mention that the output may look really cryptic. However, with regard to the CCIE lab, it appears that we can use the ACL manager of the 3560 switches in a relatively simple manner. Here is what the algorithm may look like:

Create the original, suboptimal access-list in the switch. Re-using our previous example:

ip access-list standard TEST
 permit 138.0.0.0
 permit 170.0.0.0
 permit 177.0.0.0
 permit 185.0.0.0
 permit 204.0.0.0
 permit 205.0.0.0
 permit 206.0.0.0
 permit 207.0.0.0
 permit 234.0.0.0

Ensure no other access-lists are applied to the switch interfaces at the moment. This includes QoS features, VLAN and port-based ACLs. The reason is to avoid any extra output from TCAM memory and prevent potential additional merges.
Select or create an active IP interface: a port or SVI that is in up/up state and has an IP address assigned.
Apply the access-list created above to the interface:
```
3560:
interface FastEthernet 0/1
 no switchport
 ip address 10.0.0.1 255.255.255.0
 ip access-group TEST in
```

Dump the contents of TCAM memory and match just the L3 Source address/Mask output:

SW1#show platform tcam table acl detail | inc l3Source
  l3Source:                     00.00.00.00         00.00.00.00
  l3Source:                     00.00.00.00         00.00.00.00
  l3Source:                     8A.00.00.00         DF.FF.FF.FF
  l3Source:                     B1.00.00.00         F7.FF.FF.FF
  l3Source:                     CC.00.00.00         FC.FF.FF.FF
  l3Source:                     EA.00.00.00         FF.FF.FF.FF

Only look for the non-zero outputs. In our case, we end up with FOUR non-empty elements, where the first numeric column stands for the source address and the second one for the source mask. Effectively, this translates into the octets and mask in an access-list. However, bear in mind that the mask in this output has the “reverse” meaning: a bit value of “1” means “care” while a bit value of “0” means “do not care”. Therefore, in order to get the actual ACL wildcard mask octet you need to subtract this value from 255 or 0xFF. Here is what we get:

Subnet: 0x8A = 138, Mask: 0xFF-0xDF = 0x20 = 32; ACE: permit 138.0.0.0 32.0.0.0
Subnet: 0xB1 = 177, Mask: 0xFF-0xF7 = 0x08 = 08; ACE: permit 177.0.0.0 8.0.0.0
Subnet: 0xCC = 204, Mask: 0xFF-0xFC = 0x03 = 03; ACE: permit 204.0.0.0 3.0.0.0
Subnet: 0xEA = 234, Mask: 0xFF-0xFF = 0x00 = 00; ACE: permit 234.0.0.0 0.0.0.0

The final ACL would look like:

ip access-list standard TEST
 permit 138.0.0.0 32.0.0.0
 permit 177.0.0.0 8.0.0.0
 permit 204.0.0.0 3.0.0.0
 permit 234.0.0.0 0.0.0.0

This list has four elements, just like the one we created before, but this time it features only permit entries. And the resulting ACL is as optimum as the heuristic permits – maybe not the best one, but at least the best shot that the algorithm may give.
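The hex-to-ACE conversion above is mechanical enough to script. The following is a minimal sketch that assumes the dump has already been reduced to address/mask pairs in the dotted-hex form shown earlier; the function names are hypothetical.

```python
def parse_dotted_hex(text):
    """Turn a TCAM-style string such as '8A.00.00.00' into four integers."""
    return [int(part, 16) for part in text.split(".")]


def tcam_to_ace(value_hex, mask_hex):
    """Build a standard ACE; the wildcard is 0xFF minus each TCAM mask octet."""
    value = parse_dotted_hex(value_hex)
    wildcard = [0xFF - m for m in parse_dotted_hex(mask_hex)]
    addr = ".".join(str(o) for o in value)
    wc = ".".join(str(o) for o in wildcard)
    return f"permit {addr} {wc}"


def first_octets(value_hex, mask_hex):
    """First-octet values matched by one TCAM row (1 in the mask = care)."""
    value = parse_dotted_hex(value_hex)[0]
    care = parse_dotted_hex(mask_hex)[0]
    return {o for o in range(256) if (o & care) == (value & care)}


# The four non-zero rows from the TCAM dump above.
tcam_rows = [
    ("8A.00.00.00", "DF.FF.FF.FF"),
    ("B1.00.00.00", "F7.FF.FF.FF"),
    ("CC.00.00.00", "FC.FF.FF.FF"),
    ("EA.00.00.00", "FF.FF.FF.FF"),
]

aces = [tcam_to_ace(v, m) for v, m in tcam_rows]
covered = set().union(*(first_octets(v, m) for v, m in tcam_rows))

print("ip access-list standard TEST")
for ace in aces:
    print(" " + ace)
print("first octets matched:", sorted(covered))
```

Comparing `covered` with the original nine first octets confirms that the compressed list introduces no extra matches.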

Using this simple algorithm, one may construct optimum access-lists for practically any combination of ACL entries. In fact, this is why you don’t care MUCH about the ACLs you create in the hardware platforms – they are optimized anyways. Of course, there is some extent of optimization you should apply on your own, as ACL manager can not do all the magic for you, but these techniques are outside the scope of this document. We only want to show you a way for constructing minimized access-lists.

What about route summarization?

Minimizing access-list looks very similar to the procedure we know as route summarization. Although normally our manual summaries result in address space overlap, what if we want to summarize prefixes so that they don’t overlap any unnecessary address space? Look at the example presented in A Simple IPv4 Prefix Summarization Procedure. The example calls for summarizing the following prefixes:

192.168.100.0/22
192.168.101.0/24
192.168.99.0/24
192.168.102.0/24
192.168.105.0/24
192.168.98.0/23

And the resulting summary is 192.168.96.0/20. The only problem is that this summary covers as many as 2^12 = 4096 addresses, while the original prefixes covered only 2^10 + 2^9 + 2^8 = 1792, or about 44% of the summary! Looking for a minimum set of summary prefixes covering the SAME address space, we may rewrite the original prefixes as access-list entries:

no ip access-list standard TEST
ip access-list standard TEST
permit 192.168.100.0 0.0.3.255
permit 192.168.101.0 0.0.0.255
permit 192.168.99.0 0.0.0.255
permit 192.168.102.0 0.0.0.255
permit 192.168.103.0 0.0.0.255
permit 192.168.105.0 0.0.0.255
permit 192.168.98.0 0.0.1.255

and feed them to the ACL manager in the same manner as we did before. What we get as a result is the following:

l3Source:                     C0.A8.64.00         FF.FF.FC.00
l3Source:                     C0.A8.69.00         FF.FF.FF.00
l3Source:                     C0.A8.62.00         FF.FF.FE.00

or in decimal form (you may notice that the TCAM masks directly translate into the subnet masks):

192.168.100.0 255.255.252.0
192.168.105.0 255.255.255.0
192.168.98.0 255.255.254.0

These prefixes cover exactly the same address space as the original six prefixes! Not as short as a single /20 prefix, but much more efficient in terms of address space usage. This looks good, but let’s give the ACL manager a more challenging task. For the next demonstration, we dump over 600 prefixes from the BGP table of an Internet router. Using a simple script we convert them to an access-list here: Converted Access List and feed it to the switch using the same technique as before. The resulting TCAM elements are saved here: TCAM Contents, and as we can see, we saved about 38% off the initial set of prefixes. And no extra address space overlap!
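For contiguous prefixes, the same exact-cover result can be cross-checked off-box. The sketch below uses Python's standard `ipaddress` module; it is not the switch's heuristic and not the script mentioned above, just an independent way to convert prefixes to ACEs and to collapse them without adding address space. The `to_ace` helper is a hypothetical name.

```python
import ipaddress

prefixes = [
    "192.168.100.0/22", "192.168.101.0/24", "192.168.99.0/24",
    "192.168.102.0/24", "192.168.105.0/24", "192.168.98.0/23",
]
networks = [ipaddress.ip_network(p) for p in prefixes]


def to_ace(network):
    """Render a prefix as a standard ACE; the host mask is the wildcard."""
    return f"permit {network.network_address} {network.hostmask}"


print("ip access-list standard TEST")
for net in networks:
    print(" " + to_ace(net))

# collapse_addresses drops contained prefixes and merges adjacent siblings,
# so the result covers exactly the same addresses as the input.
exact_cover = list(ipaddress.collapse_addresses(networks))
covered_addresses = sum(n.num_addresses for n in exact_cover)

single_summary = ipaddress.ip_network("192.168.96.0/20")
ratio = covered_addresses / single_summary.num_addresses

for net in exact_cover:
    print(net.network_address, net.netmask)
print(f"{covered_addresses} of {single_summary.num_addresses} addresses "
      f"({ratio:.2%} of the /20)")
```

Unlike the ACL manager, this only produces contiguous masks, so it cannot find groupings such as 138.0.0.0 32.0.0.0 from the earlier example.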

Summary

We demonstrated how the ACL manager in Catalyst switches could be used to compute optimum access-lists and summary addresses. Even though this little trick may look intriguing, it mainly remains an amusement for CCIE candidates. Indeed, various access-lists are automatically optimized when configured, and real-world route summarization is normally simple enough to be performed using pen and paper. Not to mention that inter-domain summarization is normally very limited due to the fact that large-scale Internet connectivity is meshed and non-hierarchical. In fact, this is one of the reasons we are having so many routing problems in today’s Internet.

Scaling Virtual Private LAN Services (VPLS)

We examined Cisco’s implementation of Virtual Private LAN Services (VPLS) in some detail. One blog post that I promised our students would cover how large enterprises or Internet Service Providers can enhance the scalability of this solution.
First, let us review the issues that influence its scalability. We covered these in the course, but they are certainly worth repeating here.
Remember that VPLS looks just like an Ethernet switch to the customers. As such, this solution can suffer from the same issues that could hinder a Layer 2 core infrastructure. These are:

Control-plane scalability – classic VPLS calls for a full-mesh of pseudo-wires connecting the edge sites. This certainly does not scale as the number of edge sites grows – from both operational and control-plane viewpoints.
Network stability as the network grows – Spanning Tree Protocol-based (STP) infrastructures tend not to scale as well as Multiprotocol Label Switching (MPLS) solutions.
Ability to recover from outages – as the VPLS network grows, it could become much more susceptible to major customer connectivity issues as a result of a failure.
Multicast and broadcast radiation to all sites – remembering that the VPLS network acts as a Layer 2 switch reminds us that multicast and broadcast traffic can be flooded to all customers across the network.
Multicast scalability – multicast traffic has to be replicated on ingress PE devices, which significantly reduces forwarding efficiency.
IGP peering scalability issues – all routers attached to the cloud tend to be in the same broadcast domain and thus IGP peer, which results in full-mesh of adjacencies and excessive flooding when using link-state routing protocols.
STP loops – it is certainly possible that a customer creating an STP loop could impact other customers of the ISP. STP may be blocked across the MPLS cloud, but it is normally used for multi-homed deployments to prevent forwarding loops.
Load-balancing – the use of MPLS encapsulation hides the VPLS encapsulated flows from the core network and thus prevents the effective use of ECMP flow-based load-balancing.

The solution for some of these issues is Hierarchical VPLS (H-VPLS). H-VPLS only interconnects the core MPLS network provider edge devices with a full mesh of pseudo-wires, thus reducing the complexity of the full-mesh. The customer provider edge equipment is then connected hierarchically using pseudo-wires to these core devices. The topology now looks like a reduced full-mesh with tree-like connections at the edge of the network. Notice the customer edge equipment no longer connects to each other.
Using H-VPLS, the problem of control-plane scalability can be significantly alleviated. The network is partitioned into as many edge domains as is required. These edge domains are then very efficiently connected using an MPLS core. The MPLS core for the provider allows for the simultaneous usage of L3 MPLS VPNs for those additional customers that desire it. In addition, the MPLS core may limit the extent of the STP domain in non multi-homed scenarios, and therefore, improves performance and limits instabilities.
It is worth mentioning that the main problem with VPLS deployments is not the control-plane but rather data-plane scalability – mainly because of the MAC address table explosion and excessive broadcast/multicast traffic flooding. In addition to improving control-plane scalability by means of H-VPLS pseudo-wire hierarchies, additional techniques can be used to alleviate the issues of the data-plane. The first technique is MAC-in-MAC address stacking, which reduces the number of MAC addresses to be exposed in the core. The second is known as multicast forwarding optimization. This allows for the use of special point-to-multipoint pseudo-wires in the SP core to improve multicast traffic replication.
To summarize, H-VPLS offers the following benefits compared to traditional VPLS solutions:

Lowers the number of pseudo-wire connections that must be full-meshed and thus improves control-plane scalability
Reduces the burden on core devices presented by frame replication and forwarding by adding a hierarchical “aggregation” layer
Reduces the size of MAC address tables on the provider equipment when combined with MAC-in-MAC address stacking
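To put the first benefit in numbers, here is a rough sketch comparing pseudo-wire counts. The site counts are made-up values for illustration, and the H-VPLS figure assumes each edge device has a single spoke pseudo-wire to one core PE unless `uplinks_per_edge` says otherwise.

```python
def full_mesh_pseudowires(pe_count):
    """Pseudo-wires needed to fully mesh pe_count devices: n(n-1)/2."""
    return pe_count * (pe_count - 1) // 2


def hvpls_pseudowires(core_pe_count, edge_count, uplinks_per_edge=1):
    """Full mesh among core PEs plus spoke pseudo-wires from each edge device."""
    return full_mesh_pseudowires(core_pe_count) + edge_count * uplinks_per_edge


# Hypothetical network: 100 edge sites, either fully meshed (classic VPLS)
# or split into 10 core PEs with 90 edge devices hanging off them (H-VPLS).
flat = full_mesh_pseudowires(100)
hier = hvpls_pseudowires(core_pe_count=10, edge_count=90)
dual_homed = hvpls_pseudowires(core_pe_count=10, edge_count=90, uplinks_per_edge=2)

print("classic VPLS full mesh:", flat)
print("H-VPLS, single-homed edge:", hier)
print("H-VPLS, dual-homed edge:", dual_homed)
```

The full mesh grows quadratically with the number of sites, while the hierarchical design grows quadratically only in the (much smaller) number of core PEs.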

To complete this blog post, we should mention that there is an alternative to H-VPLS control plane scaling that has been championed by Cisco’s rival, Juniper Networks. Juniper’s approach utilizes BGP for VPLS pseudo-wire signaling, as opposed to using point-to-point LDP sessions. Scaling BGP by means of route-reflectors is well-known and thus Juniper’s approach automatically has a scalable control-plane.
I hope you enjoyed this post. Be sure to watch the blog for more exciting Design-related presentations.
