iljitsch.com


These are all posts about BGP, including those originally published on BGPexpert.com.

IPv6 deployment

When I talk about IPv6 I like to bring up two pictures of a web site, one seen over IPv4 and the other over IPv6. Obviously, the two pictures are identical. Because of this invisibility, it's hard to know what kind of deployment progress there is with IPv6. A few years ago I decided to visit all the web sites of all AMS-IX members and see which ones I could reach over IPv6. The results weren't all that impressive back then, but things have started to change over the past year. In April 2004 the web sites of four members were reachable over IPv6 (with one other having an unreachable address) and in March 2005 this was nine out of 213.

For many organizations, making their web site available over IPv6 is a serious commitment, so the number of AMS-IX members that run IPv6 is even higher. According to the AMS-IX member list in March 2005, for 213 members with 343 ports there were 59 IPv6 addresses present on the exchange.

However, the AMS-IX membership isn't exactly representative of the net as a whole. I also had a look at a self-proclaimed list of the top 100 English language web sites but out of those not a single one was reachable over IPv6.

One (http://www.alibaba.com/) suffered from the "doubleclick syndrome" and didn't reply to AAAA DNS queries, which introduces a 10 second delay when visiting this site with an IPv6-enabled WWW browser. This is the reason why many people are disabling IPv6 in Firefox.

My conclusion: IPv6 deployment is happening, but it has a very long way to go.

Permalink - posted 2005-04-17

'BGP' translated into Japanese

FYI - my book is now also available in Japanese. The translation is published by Ohmsha.

Update: this is what it looks like.

So apparently Japanese for "gazelle" (as on the cover of the English version) is "mule"...

Permalink - posted 2004-08-21

IPv6 DNS delegation progress

Since the announcement was made on the 35th anniversary of the first moon landing, I think it's appropriate to modify Neil Armstrong's famous quote slightly:

"One small step for mankind... but a huge step for ICANN."

Yesterday, ICANN (the Internet Corporation for Assigned Names and Numbers) announced that IPv6 nameservers for .jp and .kr were added to the root zone. Now obviously this is a necessary step for making it possible to resolve domain names using IPv6, but it's really not a big deal as IPv6 delegations have been around lower down the delegation chain for a very long time, and no reasonable DNS implementation has a problem with this. Also, ns.ripe.net has had an IPv6 address for some time now, and this nameserver is secondary for country domains such as .nl and .it, not to mention the reverse mapping of huge slabs of IP address space. The difference between that and the new .jp and .kr delegations is that those also have an AAAA glue record, while ns.ripe.net only has an A glue record.

But the real issue here is having IPv6 addresses for the root servers themselves. There has been considerable experimenting with actually running root service over IPv6 transport. However, in order for an IPv6-only host to be able to get at the root servers, those IPv6 addresses must be published in two places. Number one is the named.root "hints" file, which DNS servers use to find the root servers on startup. It's fairly trivial to add IPv6 addresses to this file. This would allow a DNS server to perform the first query over IPv6. Unfortunately, that first query is for the full set of root servers. Since there are 13 of those, the response to this query comes fairly close to the maximum response size of 512 bytes. Adding IPv6 addresses for all root servers isn't possible without going over this limit, which has the potential to cause several problems. I'm very interested to see how ICANN is going to handle this, and especially when they're going to handle this.
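To make the 512-byte problem concrete, here is a rough size estimate for that first query's response. This is only a sketch: it assumes the servers are named a.root-servers.net through m.root-servers.net, that the response carries one NS record and one A record per server, and that standard DNS name compression is applied everywhere it can be. The function names are made up for illustration.

```python
# Rough wire-size estimate for the response to "give me the root servers".
# Assumptions (not from any particular implementation): names of the form
# <letter>.root-servers.net, full name compression, one NS + one A per server.

HEADER = 12                    # fixed DNS header
QUESTION = 1 + 2 + 2           # root name (single zero byte), QTYPE, QCLASS
RR_FIXED = 2 + 2 + 4 + 2       # TYPE, CLASS, TTL, RDLENGTH
POINTER = 2                    # a compression pointer is two bytes
UDP_LIMIT = 512                # classic maximum DNS response size over UDP


def priming_response_size(servers=13, aaaa_count=0):
    """Estimated response size in bytes with aaaa_count AAAA records added."""
    # First NS target is spelled out in full: 1+"a", 12+"root-servers", 3+"net", 0
    first_name = 1 + 1 + 1 + len("root-servers") + 1 + len("net") + 1
    # Later NS targets: one label (length byte + letter) plus a pointer
    later_name = 1 + 1 + POINTER
    # The owner name of each NS record is the root: a single zero byte
    ns = (1 + RR_FIXED + first_name) + (servers - 1) * (1 + RR_FIXED + later_name)
    a = servers * (POINTER + RR_FIXED + 4)          # 4-byte IPv4 address
    aaaa = aaaa_count * (POINTER + RR_FIXED + 16)   # 16-byte IPv6 address
    return HEADER + QUESTION + ns + a + aaaa


def max_aaaa_that_fit(servers=13):
    """How many AAAA records can be added before exceeding 512 bytes."""
    spare = UDP_LIMIT - priming_response_size(servers)
    return spare // (POINTER + RR_FIXED + 16)


print(priming_response_size())               # IPv4 only
print(priming_response_size(aaaa_count=13))  # AAAA for every root server
print(max_aaaa_that_fit())
```

Under these assumptions the IPv4-only response comes to 436 bytes, adding an AAAA record for all 13 servers brings it to 800 bytes, and only two AAAA records fit under the limit.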

As a bonus, at long last E.F.F.3.IP6.ARPA has been delegated so that 6bone addresses (3ffe::/16) can now enjoy proper reverse service in the DNS.

Permalink - posted 2004-07-21

TCP hype to rest: the real story

I hate to admit it, but I got infected by the hype surrounding the TCP "vulnerability". As it turns out, all of this was pretty much yesterday's news from the start. In a news.com article we can read Paul Watson complain that "it's crazy". No argument there. So here is the real story.

RFC 793 (TCP) clearly indicates that established TCP sessions must be torn down when an RST packet is received with a sequence number that falls within the current window.

Quick detour: TCP is responsible for making sure that all data from the sending application is received once, only once and in the correct order by the receiving application. In order to do this, it numbers every byte of data using a sequence number. At any time, TCP has a specific "window" in the sequence number space for bytes that it is prepared to receive. Bytes that fall before the window have already been received, so if those come in again they are ignored. Bytes that fall beyond the window are too far in the future and are also ignored. A packet or segment of data that starts with the first byte of the current window is processed immediately, and any data that falls further within the window is buffered and will be processed later as this data is received out of order.
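The window logic from this detour can be sketched in a few lines. This is a simplification, not how any real stack is written: it only looks at the first byte of a segment, ignores segments that straddle the window edge, and uses made-up names (classify_segment, rcv_nxt, rcv_wnd) loosely modelled on the variables in RFC 793.

```python
SEQ_MOD = 2**32  # sequence numbers are 32 bits and wrap around


def classify_segment(seq, rcv_nxt, rcv_wnd):
    """Say what happens to a segment whose first byte has sequence number seq.

    rcv_nxt is the next byte the receiver expects (the left window edge),
    rcv_wnd is the size of the receive window in bytes.
    """
    # Distance from the left window edge, computed modulo 2**32 so that
    # wrap-around of the sequence space is handled correctly.
    offset = (seq - rcv_nxt) % SEQ_MOD
    if offset == 0:
        return "process"      # exactly the next expected byte
    if offset < rcv_wnd:
        return "buffer"       # in the window, but out of order
    if offset >= SEQ_MOD // 2:
        return "duplicate"    # before the window: already received
    return "future"           # beyond the window: too far ahead


print(classify_segment(1000, 1000, 16384))   # process
print(classify_segment(1500, 1000, 16384))   # buffer
print(classify_segment(500, 1000, 16384))    # duplicate
print(classify_segment(20000, 1000, 16384))  # future
```

Everything that ends up as "process" or "buffer" counts as "within the window", which is the test that matters for the RST discussion.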

RST packets are supposed to be generated in order to reset stale sessions that can for instance occur after one side reboots. So when the other side sends a packet, the system that just booted doesn't know this session and sends back an RST packet in reply to the data packet, copying the sequence number in the process. The sender of the original packet now receives the RST with a sequence number that obviously falls within the window, so the session is torn down.

The important part here is that according to RFC 793 the sequence number doesn't have to be an exact match: as long as it's within the window, it'll be accepted. For some reason, many people assumed there would have to be an exact match. Since the sequence number is 32 bits in size, this means 4.3 billion possibilities. So if an attacker wants to reset a TCP session, assuming he already knows the correct IP addresses and port numbers, he would have to send up to 4.3 billion packets. But in reality the number of packets necessary to reset a session must be divided by the window size.

So what would be the window size for a typical TCP session, or, more importantly, a TCP session used for BGP? Well, if a system implements the RFC 1323 TCP high performance extensions, the window size can be almost a gigabyte. This is where Paul Watson's claim that resetting a TCP session can be done with "as few as four packets" comes in. However, this is pure nonsense as such huge windows are never necessary. Even when moving data across the globe at 10 Gbps a window less than half that size is more than sufficient. BGP isn't exactly in the business of moving data across the globe at high speed over TCP, so it only requires a very modest window size. These are some of the initial packets for two BGP sessions between a Cisco router and a FreeBSD box running Zebra:

09:42:06.537772 IP 213.156.3.169.11164 > 213.156.3.173.179: S 3863598700:3863598700(0) win 16384 <mss 1460>
09:44:35.984407 IP 213.156.3.173.54660 > 213.156.3.169.179: S 1733311468:1733311468(0) win 65535 <mss 1460,nop,wscale 0,nop,nop,timestamp 3272949461 0>
09:44:36.003219 IP 213.156.3.169.179 > 213.156.3.173.54660: S 947956194:947956194(0) ack 1733311469 win 16384 <mss 1460>

The packet from the Cisco router doesn't even have the window scale option. The FreeBSD machine has the option but doesn't bother to actually use it. This means that the maximum window for the subsequent session is limited to 65535 bytes. However, routers almost universally use around 16000 bytes. Bottom line: in order to reset a BGP session using a TCP RST, it is necessary to send between 65 and 268 thousand packets. This takes several minutes at DSL speeds. So if IP addresses and port numbers are known, resetting a BGP session isn't too hard for an attacker. However, if the attacker must also guess the ephemeral port, then the time it takes to reset a session becomes too long to make this an interesting attack vector.
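The arithmetic behind those numbers is simple enough to write down. A minimal sketch, assuming the attacker steps through the 32-bit sequence space one window at a time, sends bare 40-byte packets (IP plus TCP header, no options) and has a 256 kbps DSL upstream; the packet size and the link speed are my assumptions for illustration, not measurements.

```python
SEQ_SPACE = 2**32  # number of possible 32-bit sequence numbers


def rst_packets_needed(window_bytes):
    """Worst-case number of RSTs so that at least one lands in the window."""
    return -(-SEQ_SPACE // window_bytes)  # ceiling division


def seconds_to_send(packets, packet_bytes=40, link_bps=256_000):
    """Time to push that many packets through a link of link_bps bits/second."""
    return packets * packet_bytes * 8 / link_bps


for window in (2**30, 65535, 16000):
    packets = rst_packets_needed(window)
    minutes = seconds_to_send(packets) / 60
    print(f"window {window:>10}: {packets:>7} packets, {minutes:.1f} minutes")
```

With a window of 2^30 bytes this gives the famous four packets, with 65535 bytes it gives 65538 packets and with 16000 bytes 268436 packets, the last of which takes between five and six minutes at the assumed speed.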

In theory it's even better than this, at least on Cisco routers, because those rate limit the handling of RST packets. Unfortunately, RFC 793 also suggests that TCP sessions should be terminated if an in-window SYN rather than an RST packet is received. SYN packets are normally used to open sessions, so the rationale for terminating sessions when unexpected SYNs come along is murky at best.

Last but not least, there was a bug in several TCP implementations that allows any "left of window" RSTs to be acted upon without further checks. This bug was fixed in 1998 in FreeBSD. From the FreeBSD 4.9 /usr/src/sys/netinet/tcp_input.c file:

 * First check the RST flag and sequence number since reset segments
 * are exempt from the timestamp and connection count tests. This
 * fixes a bug introduced by the Stevens, vol. 2, p. 960 bugfix
 * below which allowed reset segments in half the sequence space
 * to fall though and be processed (which gives forged reset
 * segments with a random sequence number a 50 percent chance of
 * killing a connection).

This bug was still present in NetBSD until a few days ago. This led me to believe that the real issue here was that this same bug was also present in the TCP code from one or more major router vendors, and that this was what Paul Watson was talking about. But that didn't turn out to be the case, so essentially the whole story boils down to "router vendors implement TCP according to RFC 793". Big news indeed.

Permalink - posted 2004-04-24

BGP TCP vulnerability - Update

It is becoming clear that there are indeed systems that are vulnerable to having TCP sessions reset within only four tries, assuming IP addresses and port numbers are known. Unfortunately, there is little information about which systems have this vulnerability. However, judging from the secrecy at Cisco and Juniper, it is far from inconceivable that they are vulnerable. (If my suspicions are correct, the hole was fixed in FreeBSD in 1998 (!!!), though.)

The details will probably be public on Thursday, and we can expect exploits very soon after that. Since the required number of packets to take advantage of the vulnerability is very low, having MD5 in place on BGP sessions is almost certainly a good idea, as it is unlikely a router will receive so many packets that the CPU is overloaded. Also, filtering BGP RSTs as outlined below/above where possible will make sure your routers won't terminate TCP sessions. However, your sessions may still be vulnerable depending on the status of your BGP neighbor's router.

So set up MD5 passwords on important BGP sessions (such as the ones to transit networks) as soon as possible.

Cisco has published an advisory on just the problem with RSTs and the window. If I interpret this correctly, fixed IOS versions are already available, even to Cisco users without support contracts. (Note that non-IOS products are also affected, and it's a good idea to upgrade anyway as per Cisco's recently uncovered SNMP vulnerability.)

Permalink - posted 2004-04-22

Update - BGP/TCP countermeasures

In an article in Wired (Flaw Could Cripple Entire Net) Paul Watson is said to claim that he can reset TCP sessions "with as few as four attempts". I have a hard time believing this, but we'll have to see Thursday, when all will be revealed. An attacker still needs to know the IP addresses for both sides and the appropriate TTL (simple) and the port numbers used on both sides. The latter may or may not be trivial: routers typically start using ephemeral ports at a fixed number after booting, so for the first few BGP sessions this should be relatively easy to guess. However, on a router that has been running for a while and has lots of BGP sessions (which is common on internet exchanges), these port numbers are well randomized within the range used by the system, which is typically at least 4000 ports.

So if Paul Watson is correct it may be possible to reset sessions with fewer than a hundred to a few thousand packets. This takes only moments. If he isn't, but the router has few BGP sessions, port numbers are easily guessable and the default window size of a little less than 16k makes it possible to reset sessions with about 250 thousand packets per port combination, which is in line with reports that people were able to do this in the lab within about half an hour. This is short enough to incur flap dampening difficulties if it happens repeatedly.

So what can we do?

  1. Use the BGP TTL hack / GTSM. Attackers that are several hops away can't spoof a TTL of 255 or 254 so sessions are protected without wasting much CPU time. However, GTSM isn't widely available yet.

  2. Use the RFC 2385 BGP TCP MD5 option. This is widely (but not universally) implemented, and should work well against this type of attack. Unfortunately, implementing MD5 passwords is a significant amount of work and in many cases sessions break. Cisco routers don't reset the session when a password is applied in recent versions, but older IOSes and many other vendors still do. And the change must happen on both ends at the same time for sessions to remain up.

    The MD5 option is also a double edged sword because it opens the door to CPU exhaustion based denial of service attacks. In theory the crypto should only be done when the packet passes all regular TCP checks, but in reality this isn't the case so making the CPU burn cycles on MD5 hashing should be easier to do for an attacker than sending a successful RST. Based on the information I have right now, I believe the upsides of having MD5 on peering sessions with relatively small peers over exchanges don't outweigh the downsides, as the work and the MD5 DoS risks are the same for small peers, but the damage when a session breaks is fairly negligible.

    For very large peers and especially transit connections, the situation is different: the instability caused by session resets can be significant, so MD5 is a good idea here.

  3. If a router actually starts receiving lots of spoofed RSTs, the input queues fill up and legitimate BGP packets may be dropped. So it helps to increase the input queue, reset the BGP hold time to the default 180 seconds (I normally advise lowering it to detect outages sooner), but lower the keepalive time to arrive at a better real-to-spoofed BGP packet ratio. On a Cisco:

    !
    interface gigabit3/0
     hold-queue 2048 in
    !
    router bgp 12345
     timers bgp 1 180
    !
    
  4. Don't be part of the problem: make sure your customers can't pollute the net with packets with spoofed source addresses. Use anti-spoofing filters or Unicast RPF (uRPF) for this.

  5. Last but not least, you can filter out TCP RST packets to/from the BGP port (179). Filtering all RSTs is a very bad idea as they are necessary to make sure that when two hosts are communicating, and one loses its state (for instance, it reboots), the other doesn't keep sending traffic at high speed until eventually the TCP session times out. However, for BGP this isn't much of an issue because TCP doesn't generate all that much traffic and an expired hold time will take care of one-sided sessions soon enough.

    Lines in a Cisco access list for filtering BGP TCP RSTs look like this:

    access-list 123 deny   tcp any any eq bgp rst log-input
    access-list 123 deny   tcp any eq bgp any rst log-input
    
    (Note that some legitimate hits are possible if a BGP session is only configured on one router.)

    These must be applied on input on interfaces that may receive spoofed RSTs (i.e., external connections, but also customer facing connections if those don't have proper anti-spoofing filters). The log-input keyword makes sure the interface and sometimes the MAC address of the system that sent the offending packet are logged. This is very useful on shared/switched media interfaces such as internet exchanges. Don't worry about overwhelming the router with logging information too much, as this is rate limited. (However, having log-input in place when a full-fledged DoS attack is in progress isn't advisable either.)

(See below (when reading this on the main page) or above (in the archives) for information about the MD5 and TTL protection mechanisms.)

Permalink - posted 2004-04-21
