IPv6 documentation prefix and IPv6 site/host list

There is now an official IPv6 prefix set aside for documentation purposes: 2001:0DB8::/32. (Leading zero courtesy of APNIC.)

The how and why are documented on a page at APNIC. Note that there is also a prefix set aside for documentation purposes in IPv4: 192.0.2.0/24. See RFC 3330 for more information and other special IPv4 prefixes.
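
For scripts that need to recognize example addresses, a check against these two prefixes could look like the sketch below. It only uses Python's standard ipaddress module; the function name is_documentation is made up for this illustration.

```python
import ipaddress

# The two documentation prefixes mentioned above.
DOC_PREFIXES = [
    ipaddress.ip_network("2001:db8::/32"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def is_documentation(address: str) -> bool:
    """Return True if the address falls inside a documentation prefix."""
    addr = ipaddress.ip_address(address)
    # Only compare against prefixes of the same IP version.
    return any(addr.version == net.version and addr in net for net in DOC_PREFIXES)


if __name__ == "__main__":
    for candidate in ("2001:db8::1", "192.0.2.55", "2001:db9::1", "192.0.3.1"):
        print(candidate, is_documentation(candidate))
```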

At prik.net there is now a list of IPv6-enabled hosts or sites. I have no idea how complete the list is, but it has more than 3000 entries so it's better than the manually maintained stuff in some other places. If the link doesn't work, this is probably because your browser doesn't understand compressed content. In that case, use the uncompressed version. The compression ratio is about 1 : 6.

Permalink - posted 2004-01-04

New minimum allocation size at RIPE

The RIPE NCC has changed its policy regarding the initial allocation that new LIRs receive. The rule that efficient use of at least a /22 must be demonstrated is now off the table, and the minimum allocation is now a /21 rather than a /20. See the announcement. RIPE also maintains a list of minimum allocation and assignment sizes for their address blocks (linked from the announcement), but this is pretty much useless because filtering on allocation size is too restrictive while filtering on assignment size isn't restrictive enough for many address blocks. So be very careful when implementing prefix length filtering.

Without the jargon, please!

Right. Most of us get our IP addresses from our ISPs, and ISPs usually have one or more blocks of IP address space of their own. Having their own address space is important for ISPs because this allows them to be independent from their ISPs by allowing them to change ISPs without having to change addresses. (Obviously this is useful to end-users as well, but this changed policy applies to ISPs.) Until now, ISPs that wanted to get address space of their own needed to show that they and/or their customers would start using 1024 addresses (a /22) immediately. In this case, they would get a block of 4096 addresses (a /20). The advantage of having such a large block is that everyone in the world is prepared to store a pointer to it in their routers, making the addresses globally usable without limitations.

This matters because some networks only accept routing information for address blocks that are no smaller than the smallest blocks that RIPE and the other Regional Internet Registries (ARIN, APNIC and LACNIC) give out to ISPs. Smaller address blocks aren't entirely useless, but they may not be globally reachable without having to depend on the ISP the addresses came from, which of course limits ISP independence.

Since RIPE is now giving out blocks of 2048 addresses (/21) from some of their address blocks, networks are expected (and pretty much forced) to accept these blocks. This is good news for small ISPs that want their own independent block: they no longer have to jump through hoops trying to show they need 1024 addresses, or make do with only semi-independent addresses.
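
The address counts used in this explanation follow directly from the prefix length: every bit that is not part of the prefix doubles the number of addresses. A minimal sketch of that arithmetic (the function name is just for illustration):

```python
def addresses_in_prefix(prefix_length: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    if not 0 <= prefix_length <= 32:
        raise ValueError("IPv4 prefix length must be between 0 and 32")
    # 32 address bits in total; the bits outside the prefix are free to vary.
    return 2 ** (32 - prefix_length)


if __name__ == "__main__":
    for length in (20, 21, 22):
        print(f"/{length}: {addresses_in_prefix(length)} addresses")
```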

Note that the other RIRs haven't changed their policies (or at least there are no announcements to be found). ARIN's policy, for instance, is even more restrictive than the old RIPE policy: multihomed networks must show efficient use of a /21 to get a /20, and single-homed ISPs must even show efficient use of a full /20 to get a /20. So for now the good news only applies to ISPs in the RIPE region, which is roughly Europe, the Middle East, Africa north of the Sahara and the former Soviet Union. For more info, see the RIR policy comparison matrix.

Permalink - posted 2004-01-10

Clearing the DF bit

As I wrote a few weeks ago in an article under the name "no ip unreachables", path MTU discovery doesn't work all that well across the internet in practice. Since then, I've noticed that people end up on this site looking for ways to clear the don't fragment bit in the IP header. So here is an example of how to do this on a Cisco router:

!
route-map nodf permit 10
 set ip df 0
!
interface FastEthernet2/0
 ip policy route-map nodf
!

Note that the "ip policy route-map nodf" command must be applied on the interface receiving the packets for which the DF bit must be cleared, and not the interface with the reduced MTU itself, where the packets are subsequently transmitted. See a page at Cisco for additional strategies.
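
To make it clear what clearing the bit amounts to at the packet level, here is a rough Python sketch. This is not how the router implements it, just an illustration under the usual IPv4 header layout: the DF flag is the 0x4000 bit of the 16-bit flags/fragment offset field at byte offset 6, and the header checksum at offset 10 must be recomputed after the change. The function names are made up for this example.

```python
import struct

DF_FLAG = 0x4000  # "don't fragment" bit in the flags/fragment offset field


def ipv4_checksum(header: bytes) -> int:
    """Internet checksum over an IPv4 header (checksum field set to zero)."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def clear_df(header: bytes) -> bytes:
    """Return a copy of the header with DF cleared and the checksum redone."""
    hdr = bytearray(header)
    (flags_frag,) = struct.unpack_from("!H", hdr, 6)
    struct.pack_into("!H", hdr, 6, flags_frag & ~DF_FLAG & 0xFFFF)
    struct.pack_into("!H", hdr, 10, 0)  # zero the checksum before recomputing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


if __name__ == "__main__":
    # Hypothetical 20-byte header: TCP from 192.0.2.1 to 192.0.2.2 with DF set.
    sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, DF_FLAG, 64, 6, 0,
                         bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
    print(clear_df(sample).hex())
```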

Permalink - posted 2004-01-12

Apple Safari IPv6 hack

IPv6-enabled operating systems such as Windows, Linux and FreeBSD all come with a web browser that also supports IPv6 and prefers IPv6 when both IPv4 and IPv6 are available. Things are slightly different with Safari, Apple's browser application. Initially, Safari only supported literal IPv6 addresses and some corner case DNS names. In the current version, Safari will do IPv6 if no IPv4 address is available, but it won't prefer IPv6 over IPv4 or fall back to IPv6 when IPv4 doesn't work.

However, Nicholas Humfrey has come up with a trick. By enabling Safari's debug mode and switching off one of the two HTTP loaders that are normally used, Safari can be made to prefer IPv6 when it's available. See the Mac OS X hints article that Nick posted for the details.

Permalink - posted 2004-01-28

BGP on Cisco 2500

It has been a while since the last news posting. My apologies for that. Here is something to hold you over until I can find some real news:

When perusing the HTTP referrer log, I noticed that a lot of people are finding this site in search of "bgp+2500" or something similar. So... is it possible to run BGP on a Cisco 2500 router?

The short answer is "yes". The IOS images for the 2500 support BGP, including BGP for IPv6. (They do not, however, support OSPF for IPv6 even though OSPF for IPv4 is supported and the "ipv6 router ospf" command may exist. Same thing for IS-IS: the command exists, but the protocol isn't present in any of the 2500 images.)

The slightly longer answer is that a 2500 is of limited use for inter-domain routing. Actually, way back when I got started with BGP, I used a 2514 with 16 MB RAM and it could hold the entire 35000 or so entry global routing table. From two upstreams even, if I remember correctly. However, in the meantime the global routing table has gotten four times as big, and the 2500's memory limit is still 16 MB. The fact that the 2500 series sports a 68030 CPU doesn't really help either. All of this means that you can only run BGP on a 2500 if you don't send it more than a few thousand routes. You also shouldn't send it a full feed and have the 2500 filter out the unwanted routes, as this will tax the CPU too much and make for many-minute convergence times.

Note that even 35000 routes wouldn't work anymore today, as modern IOS images need a lot more memory for their internal housekeeping. On bigger routers it gets even worse because unlike the 2500, those can't run their software from flash, so it must be copied to RAM. Additionally, the switching path of choice is CEF these days, which takes a lot of memory. (Without CEF you'll be using fast switching, which uses just as much memory but only when needed, so it's not only the CPU that melts down but you also run out of memory when a slammer-like worm hits.)

So you may be able to get away with a full table on a Cisco with 128 MB RAM (or you may not), but 256 MB gives you much more elbow room. Unfortunately, Cisco still makes boxes that can run BGP but won't take enough memory to do so properly. A good example is the 3550 series of multilayer switches.

Permalink - posted 2004-03-31

BGP TTL "hack"

At the NANOG 26 meeting in October 2002, Dave Meyer presented a very simple proposal to protect BGP sessions against attacks: set the TTL to 255 on outgoing packets, and check whether the TTL in received packets is equal to 255. Since routers always lower the Time To Live (or Hop Limit in IPv6) when forwarding a packet, and routers discard packets with a TTL of 0, there is no way for anyone who isn't attached to the subnet in question to inject packets with a TTL of 255 into a subnet. RFC 3682 was published in February and describes the details of the "Generalized TTL Security Mechanism (GTSM)".

Cisco has now included GTSM in IOS release 12.3(7)T, as explained in the feature guide. It seems there are some interesting caveats. First of all, Cisco states that enabling the feature using the neighbor ... ttl-security command will only enable the check for incoming packets and not change any behavior as to outgoing packets. So this must mean they always use a TTL of 255 for outgoing packets now. However, older IOS versions set the TTL for BGP packets to 1 (in the absence of any ebgp-multihop settings). If this is still the case, then detecting whether a neighbor is directly connected won't be very reliable right now. Then again, Cisco says the feature must be configured on both ends of an eBGP session (no support for iBGP as of yet), which seems to contradict this.

Another thing is that they look for a TTL of 254 or higher. This suggests that for incoming packets with a TTL of 255, they first decrease the TTL and then go on to process the TCP segment. Again, this is not how things work in older IOS versions, as it's perfectly possible to set the TTL to 0 and still interact with a Cisco router on the local subnet. So unless something changed in this regard as well, accepting a TTL of 254 means that there can still be a router in between!
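
The concern about a threshold of 254 can be made concrete with a small sketch. It assumes the receiver compares the TTL exactly as it arrives on the wire, without decrementing it first; whether the new IOS code works that way is precisely the open question. The function names are hypothetical.

```python
MAX_TTL = 255


def ttl_on_arrival(initial_ttl: int, routers_crossed: int) -> int:
    """TTL seen by the receiver; every forwarding router lowers it by one."""
    return max(initial_ttl - routers_crossed, 0)


def gtsm_accept(received_ttl: int, min_ttl: int = MAX_TTL) -> bool:
    """Accept a packet only if its TTL is at least min_ttl."""
    return received_ttl >= min_ttl


if __name__ == "__main__":
    for hops in (0, 1, 2):
        ttl = ttl_on_arrival(MAX_TTL, hops)
        print(f"routers crossed: {hops}, TTL {ttl}, "
              f"strict check: {gtsm_accept(ttl)}, "
              f"254-or-higher check: {gtsm_accept(ttl, min_ttl=254)}")
```

Under that assumption a packet that crossed one router fails the strict 255 check but passes the 254-or-higher check, while a packet that crossed two routers fails both.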

Note that this mechanism only offers protection against attacks on port 179 of a router from "far away". Anyone on the local subnets still gets to do whatever they please and the content of the BGP sessions isn't protected any better than before.

Permalink - posted 2004-04-08

TCP vulnerability puts BGP at risk

Rumors have been floating around for days, as the referrer log for this site shows large numbers of people looking for "BGP hack" and "BGP MD5". But the cat is out of the bag now:

NISCC Vulnerability Advisory 236929

Please see the page linked above for detailed information. The short version is that TCP sequence numbers turn out much easier to guess than assumed until now, which makes long-lived TCP sessions vulnerable to reset attacks. Since BGP sessions can remain for days, weeks or even months, and other pertinent information is relatively easy to find, BGP is the protocol most affected by this vulnerability.

Fortunately, the BGP TCP MD5 option protects against exactly this problem. Enable it if at all possible. Most, if not all, routers support it. The option is enabled on Cisco routers as follows:

!
router bgp 12345
 neighbor 192.168.0.1 password use-upto-80-characters
!

However, this will break any running BGP sessions so coordinate the change closely with the remote AS.
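
For the curious, this is roughly what the option computes for every segment, as I read RFC 2385: an MD5 digest over the TCP pseudo-header, the base TCP header without options and with the checksum taken as zero, the segment data, and finally the password. The sketch below is IPv4-only and is in no way a router's implementation; the function name and argument layout are made up for illustration.

```python
import hashlib
import socket
import struct

TCP_PROTOCOL = 6


def tcp_md5_digest(src_ip: str, dst_ip: str, tcp_header: bytes, payload: bytes,
                   segment_length: int, password: bytes) -> bytes:
    """Sketch of the RFC 2385 digest for one IPv4 TCP segment.

    tcp_header is the 20-byte base header (options excluded); segment_length
    is the length of the real segment, header with options plus data.
    """
    if len(tcp_header) != 20:
        raise ValueError("expected the 20-byte base TCP header without options")
    # The digest is computed as if the checksum field (bytes 16-17) were zero.
    header = tcp_header[:16] + b"\x00\x00" + tcp_header[18:]
    # Pseudo-header: source, destination, zero-padded protocol, segment length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, TCP_PROTOCOL, segment_length))
    return hashlib.md5(pseudo + header + payload + password).digest()


if __name__ == "__main__":
    dummy_header = bytes(20)  # placeholder header, not a real segment
    digest = tcp_md5_digest("192.0.2.1", "192.0.2.2", dummy_header, b"", 40,
                            b"use-upto-80-characters")
    print(digest.hex())
```

Because the password goes into the hash on both sides, a segment from someone who doesn't know it (such as a spoofed RST) produces the wrong digest and is dropped.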

Since this mechanism operates at the TCP level, host-based routers such as Zebra or Quagga running on BSD or Linux typically don't support this option. However, there is some rudimentary support in both OSes, see the SANS advisory.

Note: The "BGP TTL hack" or GTSM (see below or above) also offers protection against the TCP vulnerability, without adding the MD5 crypto overhead. And good anti-spoofing filters do the same, but the problem there is that the other AS also needs to implement them, something that can't be assumed.

It seems the actual risks aren't as bad as the reports seem to indicate at first glance. I'll post a more detailed analysis later, but from discussions on NANOG it seems the only new aspect is that previously people didn't realize that the RST packet could have any sequence number that falls inside the receive window on the potential victim, which is often around 16k. This means the attacker only has to guess the first 18 bits of the sequence number rather than the full 32 bits. However, she also needs to guess both port numbers, which makes the number of possible combinations an attacker must try around a billion, which amounts to a DoS attack of 10000 packets per second for more than a day.
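
The back-of-the-envelope numbers can be reproduced with a few lines of Python. The window of 16384 bytes and the rate of 10000 packets per second come from the paragraph above; the 4000 candidate ports are my assumption for a typical ephemeral port range, so treat the outcome as an order of magnitude and nothing more.

```python
SEQUENCE_SPACE = 2 ** 32


def rst_guesses(window: int, port_candidates: int = 1) -> int:
    """Worst-case number of RSTs: one per window-sized step, per candidate port."""
    per_port = -(-SEQUENCE_SPACE // window)  # ceiling division
    return per_port * port_candidates


def attack_hours(packets: int, packets_per_second: int) -> float:
    """Time needed to send the given number of packets, in hours."""
    return packets / packets_per_second / 3600


if __name__ == "__main__":
    per_port = rst_guesses(16384)
    total = rst_guesses(16384, port_candidates=4000)
    print(f"guesses per port combination: {per_port} (2^18)")
    print(f"guesses over 4000 ports: {total}")
    print(f"hours at 10000 packets per second: {attack_hours(total, 10_000):.1f}")
```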

Permalink - posted 2004-04-20

Update - BGP/TCP countermeasures

In an article in Wired (Flaw Could Cripple Entire Net) Paul Watson is said to claim that he can reset TCP sessions "with as few as four attempts". I have a hard time believing this, but we'll have to see Thursday, when all will be revealed. An attacker still needs to know the IP addresses for both sides and the appropriate TTL (simple) and the port numbers used on both sides. The latter may or may not be trivial: routers typically start using ephemeral ports at a fixed number after booting, so for the first few BGP sessions this should be relatively easy to guess. However, on a router that has been running for a while and has lots of BGP sessions (which is common on internet exchanges), these port numbers are well randomized within the range used by the system, which is typically at least 4000 ports.

So if Paul Watson is correct, it may be possible to reset sessions with anywhere from fewer than a hundred to a few thousand packets. This takes only moments. If he isn't, but the router has few BGP sessions, port numbers are easily guessable and the default window size of a little less than 16k makes it possible to reset sessions with about 250 thousand packets per port combination, which is in line with reports that people were able to do this in the lab within about half an hour. This is short enough to incur flap dampening difficulties if it happens repeatedly.

So what can we do?

Use the BGP TTL hack / GTSM. Attackers that are several hops away can't spoof a TTL of 255 or 254 so sessions are protected without wasting much CPU time. However, GTSM isn't widely available yet.
Use the RFC 2385 BGP TCP MD5 option. This is widely (but not universally) implemented, and should work well against this type of attack. Unfortunately, implementing MD5 passwords is a significant amount of work and in many cases sessions break. Cisco routers don't reset the session when a password is applied in recent versions, but older IOSes and many other vendors still do. And the change must happen on both ends at the same time for sessions to remain up.
The MD5 option is also a double edged sword because it opens the door to CPU exhaustion based denial of service attacks. In theory the crypto should only be done when the packet passes all regular TCP checks, but in reality this isn't the case so making the CPU burn cycles on MD5 hashing should be easier to do for an attacker than sending a successful RST. Based on the information I have right now, I believe the upsides of having MD5 on peering sessions with relatively small peers over exchanges don't outweigh the downsides, as the work and the MD5 DoS risks are the same for small peers, but the damage when a session breaks is fairly negligible.
For very large peers and especially transit connections, the situation is different: the instability caused by session resets can be significant, so MD5 is a good idea here.
If a router actually starts receiving lots of spoofed RSTs, the input queues fill up and legitimate BGP packets may be dropped. So it helps to increase the input queue, reset the BGP hold time to the default 180 seconds (I normally advise lowering it to detect outages sooner), but lower the keepalive time to arrive at a better real-to-spoofed BGP packet ratio. On a Cisco:
```
!
interface gigabit3/0
 hold-queue 2048 in
!
router bgp 12345
 timers bgp 1 180
!
```
Don't be part of the problem: make sure your customers can't pollute the net with packets with spoofed source addresses. Use anti-spoofing filters or Unicast RPF (uRPF) for this.
Last but not least, you can filter out TCP RST packets to/from the BGP port (179). Filtering all RSTs is a very bad idea as they are necessary to make sure that when two hosts are communicating, and one loses its state (for instance, it reboots), the other doesn't keep sending traffic at high speed until eventually the TCP session times out. However, for BGP this isn't much of an issue because TCP doesn't generate all that much traffic and an expired hold time will take care of one-sided sessions soon enough.
Lines in a Cisco access list for filtering BGP TCP RSTs look like this:
```
access-list 123 deny   tcp any any eq bgp rst log-input
access-list 123 deny   tcp any eq bgp any rst log-input
```
(Note that some legitimate hits are possible if a BGP session is only configured on one router.)
These must be applied on input on interfaces that may receive spoofed RSTs (i.e., external connections, but also customer facing connections if those don't have proper anti-spoofing filters). The log-input keyword makes sure the interface and sometimes the MAC address of the system that sent the offending packet are logged. This is very useful on shared/switched media interfaces such as internet exchanges. Don't worry too much about overwhelming the router with logging information, as this is rate limited. (However, having log-input in place when a full-fledged DoS attack is in progress isn't advisable either.)
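
In case the access list syntax above is unfamiliar: the two deny lines match any TCP packet with the RST flag set that has the BGP port as either its destination or its source port. Expressed as a tiny Python predicate (purely illustrative, the function name is made up):

```python
BGP_PORT = 179


def acl_denies(src_port: int, dst_port: int, rst: bool) -> bool:
    """Mimic the two deny lines: drop TCP RSTs to or from the BGP port."""
    return rst and (dst_port == BGP_PORT or src_port == BGP_PORT)


if __name__ == "__main__":
    print(acl_denies(11164, 179, rst=True))   # RST towards the BGP port
    print(acl_denies(179, 54660, rst=True))   # RST coming from the BGP port
    print(acl_denies(11164, 179, rst=False))  # regular BGP traffic
    print(acl_denies(1025, 80, rst=True))     # RST for an unrelated session
```

Only the first two packets are dropped by these lines; what happens to everything else depends on the rest of the access list.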

(See below (when reading this on the main page) or above (in the archives) for information about the MD5 and TTL protection mechanisms.)

Permalink - posted 2004-04-21

IPv6 MD5 and Apple BGP

The referrer log for this site can be interesting reading at times. It seems several people have landed here when looking for information about the BGP TCP MD5 option and IPv6. Despite the fact that RFC 2385 doesn't mention IPv6, it's possible to have an MD5 password on IPv6 BGP sessions. At least, it is in recent IOS versions. I have a very old one that allows this to be configured but it doesn't compute the checksum correctly. I assume this is particular to this specific version (from 1999), though.

Someone else seemed interested in BGP and Apple. Good news: under Panther, Zebra 0.94 compiles without trouble, so it's possible to run BGP on a Mac. Jaguar/Zebra 0.93b didn't work for me.

Permalink - posted 2004-04-21

BGP TCP vulnerability - Update

It is becoming clear that there are indeed systems that are vulnerable to having TCP sessions reset within only four tries, assuming IP addresses and port numbers are known. Unfortunately, there is little information about which systems have this vulnerability. However, judging from the secrecy at Cisco and Juniper, it is far from inconceivable that they are vulnerable. (If my suspicions are correct, the hole was fixed in FreeBSD in 1998 (!!!), though.)

The details will probably be public on Thursday, and we can expect exploits very soon after that. Since the required number of packets to take advantage of the vulnerability is very low, having MD5 in place on BGP sessions is almost certainly a good idea, as it is unlikely a router will receive so many packets that the CPU is overloaded. Also, filtering BGP RSTs as outlined below/above where possible will make sure your routers won't terminate TCP sessions. However, your sessions may still be vulnerable depending on the status of your BGP neighbor's router.

So set up MD5 passwords on important BGP sessions (such as the ones to transit networks) as soon as possible.

Cisco has published an advisory on just the problem with RSTs and the window. If I interpret this correctly, fixed IOS versions are already available, even to Cisco users without support contracts. (Note that non-IOS products are also affected, and it's a good idea to upgrade anyway because of Cisco's recently uncovered SNMP vulnerability.)

Permalink - posted 2004-04-22

TCP hype to rest: the real story

I hate to admit it, but I got infected by the hype surrounding the TCP "vulnerability". As it turns out, all of this was pretty much yesterday's news from the start. In a news.com article we can read Paul Watson complain that "it's crazy". No argument there. So here is the real story.

RFC 793 (TCP) clearly indicates that established TCP sessions must be torn down when an RST packet is received with a sequence number that falls within the current window.

Quick detour: TCP is responsible for making sure that all data from the sending application is received once, only once and in the correct order by the receiving application. In order to do this, it numbers every byte of data using a sequence number. At any time, TCP has a specific "window" in the sequence number space for bytes that it is prepared to receive. Bytes that fall before the window have already been received, so if those come in again they are ignored. Bytes that fall beyond the window are too far in the future and are also ignored. A packet or segment of data that starts with the first byte of the current window is processed immediately, and any data that falls further within the window is buffered and will be processed later as this data is received out of order.
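
In code, the window test described above comes down to modular arithmetic on 32-bit sequence numbers. The sketch below is a simplification (real TCP stacks also look at segment length and overlap) and the names are hypothetical; rcv_nxt stands for the first byte of the current window.

```python
SEQ_MOD = 2 ** 32


def classify(seq: int, rcv_nxt: int, window: int) -> str:
    """Classify a segment's starting sequence number against the receive window."""
    offset = (seq - rcv_nxt) % SEQ_MOD  # distance from the window start, with wraparound
    if offset == 0:
        return "process"    # starts at the first byte of the window
    if offset < window:
        return "buffer"     # inside the window but out of order
    if offset >= SEQ_MOD // 2:
        return "duplicate"  # before the window: already received
    return "beyond"         # past the window: too far in the future


if __name__ == "__main__":
    rcv_nxt, window = 1000, 16384
    for seq in (1000, 1500, 900, 1000 + 16384):
        print(seq, classify(seq, rcv_nxt, window))
```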

RST packets are supposed to be generated in order to reset stale sessions that can for instance occur after one side reboots. So when the other side sends a packet, the system that just booted doesn't know this session and sends back an RST packet in reply to the data packet, copying the sequence number in the process. The sender of the original packet now receives the RST with a sequence number that obviously falls within the window, so the session is torn down.

The important part here is that according to RFC 793 the sequence number doesn't have to be an exact match: as long as it's within the window, it'll be accepted. For some reason, many people assumed there would have to be an exact match. Since the sequence number is 32 bits in size, this means 4.3 billion possibilities. So if an attacker wants to reset a TCP session, assuming he already knows the correct IP addresses and port numbers, he would have to send up to 4.3 billion packets. But in reality the number of packets necessary to reset a session must be divided by the window size.

So what would be the window size for a typical TCP session, or, more importantly, a TCP session used for BGP? Well, if a system implements the RFC 1323 TCP high performance extensions, the window size can be almost a gigabyte. This is where Paul Watson's claim that resetting a TCP session can be done with "as few as four packets" comes in. However, this is pure nonsense as such huge windows are never necessary. Even when moving data across the globe at 10 Gbps a window less than half that size is more than sufficient. BGP isn't exactly in the business of moving data across the globe at high speed over TCP, so it only requires a very modest window size. These are some of the initial packets for two BGP sessions between a Cisco router and a FreeBSD box running Zebra:

09:42:06.537772 IP 213.156.3.169.11164 > 213.156.3.173.179: S 3863598700:3863598700(0) win 16384 <mss 1460>
09:44:35.984407 IP 213.156.3.173.54660 > 213.156.3.169.179: S 1733311468:1733311468(0) win 65535 <mss 1460,nop,wscale 0,nop,nop,timestamp 3272949461 0>
09:44:36.003219 IP 213.156.3.169.179 > 213.156.3.173.54660: S 947956194:947956194(0) ack 1733311469 win 16384 <mss 1460>

The packet from the Cisco router doesn't even have the window scale option. The FreeBSD machine has the option but doesn't bother to actually use it. This means that the maximum window for the subsequent session is limited to 65535 bytes. However, routers almost universally use around 16000 bytes. Bottom line: in order to reset a BGP session using a TCP RST, it is necessary to send between 65 and 268 thousand packets. This takes several minutes at DSL speeds. So if IP addresses and port numbers are known, resetting a BGP session isn't too hard for an attacker. However, if the attacker must also guess the ephemeral port, then the time it takes to reset a session becomes too long to make this an interesting attack vector.
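
The packet counts in this article are simply the 32-bit sequence space divided by the window size. Here is a sketch of that arithmetic, including the largest window that RFC 1323 window scaling allows. The 40-byte packet size and the 256 kbps upstream rate in the timing estimate are my own assumptions for "DSL speeds", so take the resulting minutes as a rough indication only.

```python
SEQUENCE_SPACE = 2 ** 32


def max_window(scale_shift: int = 0) -> int:
    """Largest advertisable window: the 16-bit field shifted by the window scale."""
    if not 0 <= scale_shift <= 14:
        raise ValueError("RFC 1323 limits the window scale shift to 0..14")
    return 65535 << scale_shift


def packets_to_cover(window: int) -> int:
    """Approximate number of RSTs needed to hit any window of this size."""
    return SEQUENCE_SPACE // window


def seconds_at_rate(packets: int, bits_per_second: int, packet_bytes: int = 40) -> float:
    """Time to send the packets, assuming minimum-size IP+TCP packets."""
    return packets * packet_bytes * 8 / bits_per_second


if __name__ == "__main__":
    for label, window in (("unscaled maximum", max_window()),
                          ("typical router", 16000),
                          ("maximum with scaling", max_window(14))):
        count = packets_to_cover(window)
        minutes = seconds_at_rate(count, 256_000) / 60
        print(f"{label}: window {window}, about {count} packets, {minutes:.1f} minutes")
```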

In theory it's even better than this, at least on Cisco routers, because those rate limit the handling of RST packets. Unfortunately, RFC 793 also suggests that TCP sessions should be terminated if an in-window SYN rather than a RST packet is received. SYN packets are normally used to open sessions, so the rationale for terminating sessions when unexpected SYNs come along is murky at best.

Last but not least, there was a bug in several TCP implementations that allowed any "left of window" RSTs to be acted upon without further checks. This bug was fixed in 1998 in FreeBSD. From the FreeBSD 4.9 /usr/src/sys/netinet/tcp_input.c file:
* First check the RST flag and sequence number since reset segments
* are exempt from the timestamp and connection count tests. This
* fixes a bug introduced by the Stevens, vol. 2, p. 960 bugfix
* below which allowed reset segments in half the sequence space
* to fall though and be processed (which gives forged reset
* segments with a random sequence number a 50 percent chance of
* killing a connection).

This bug was still present in NetBSD until a few days ago. This led me to believe that the real issue here was that this same bug was also present in the TCP code from one or more major router vendors, and that this was what Paul Watson was talking about. But that didn't turn out to be the case, so essentially the whole story boils down to "router vendors implement TCP according to RFC 793". Big news indeed.

Permalink - posted 2004-04-24

IPv6 DNS delegation progress

Since the announcement was made on the 35th anniversary of the first moon landing, I think it's appropriate to modify Neil Armstrong's famous quote slightly:

"One small step for mankind... but a huge step for ICANN."

Yesterday, ICANN (the Internet Corporation for Assigned Names and Numbers) announced that IPv6 nameservers for .jp and .kr were added to the root zone. Now obviously this is a necessary step for making it possible to resolve domain names using IPv6, but it's really not a big deal as IPv6 delegations have been around lower down the delegation chain for a very long time, and no reasonable DNS implementation has a problem with this. Also, ns.ripe.net has had an IPv6 address for some time now, and this nameserver is secondary for country domains such as .nl and .it, not to mention the reverse mapping of huge slabs of IP address space. The difference between that and the new .jp and .kr delegations is that those also have an AAAA glue record, while ns.ripe.net only has an A glue record.

But the real issue here is having IPv6 addresses for the root servers themselves. There has been considerable experimenting with actually running root service over IPv6 transport. However, in order for an IPv6-only host to be able to get at the root servers, those IPv6 addresses must be published in two places. Number one is the named.root "hints" file, which DNS servers use to find the root servers on startup. It's fairly trivial to add IPv6 addresses to this file. This would allow a DNS server to perform the first query over IPv6. Unfortunately, that first query is for the full set of root servers. Since there are 13 of those, the response to this query comes fairly close to the maximum response size of 512 bytes. Adding IPv6 addresses for all root servers isn't possible without going over this limit, which has the potential to cause several problems. I'm very interested to see how ICANN is going to handle this, and especially when they're going to handle this.
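
To see why the 512 byte limit is a problem, the size of that response can be estimated by adding up the DNS wire format fields. The sketch below makes several assumptions that are not spelled out above: server names of the form a.root-servers.net, optimal name compression, one A record per server and no EDNS0. The function name is hypothetical.

```python
HEADER = 12                # fixed DNS header
QUESTION = 1 + 2 + 2       # root name (single zero byte), QTYPE, QCLASS
RR_FIXED = 2 + 2 + 4 + 2   # TYPE, CLASS, TTL, RDLENGTH
POINTER = 2                # compression pointer to an earlier name


def priming_response_size(servers: int = 13, aaaa_records: int = 0) -> int:
    """Estimated size in bytes of the root NS response with glue records."""
    # First NS record spells out the server name in full:
    # length-prefixed labels plus the terminating root label.
    full_name = len("a.root-servers.net") + 2
    first_ns = 1 + RR_FIXED + full_name
    # Later NS records: a one-letter label plus a pointer to "root-servers.net".
    other_ns = 1 + RR_FIXED + (2 + POINTER)
    a_record = POINTER + RR_FIXED + 4
    aaaa_record = POINTER + RR_FIXED + 16
    return (HEADER + QUESTION + first_ns + (servers - 1) * other_ns
            + servers * a_record + aaaa_records * aaaa_record)


if __name__ == "__main__":
    print("IPv4 only:", priming_response_size())
    for count in (1, 2, 3, 13):
        size = priming_response_size(aaaa_records=count)
        print(f"with {count} AAAA records: {size} bytes, fits in 512: {size <= 512}")
```

Under these assumptions each AAAA record costs 28 bytes, so the room left below 512 bytes runs out after a couple of IPv6 addresses.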

As a bonus, at long last E.F.F.3.IP6.ARPA has been delegated so that 6bone addresses (3ffe::/16) can now enjoy proper reverse service in the DNS.
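
For those wondering where a name like E.F.F.3.IP6.ARPA comes from: the reverse name of an IPv6 address is made up of its 32 hexadecimal digits in reverse order, one per label. Python's standard ipaddress module can show this; the address 3ffe::1 below is just an arbitrary example from the 6bone range.

```python
import ipaddress


def reverse_name(address: str) -> str:
    """Return the reverse-mapping DNS name for an IP address."""
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    # The name ends in e.f.f.3.ip6.arpa, the zone that has now been delegated.
    print(reverse_name("3ffe::1"))
```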

Permalink - posted 2004-07-21

'BGP' translated into Japanese

FYI - my book is now also available in Japanese. The translation is published by Ohmsha.

Update: this is what it looks like.

So apparently Japanese for "gazelle" (as on the cover of the English version) is "mule"...

Permalink - posted 2004-08-21