After testing SP1 on a couple of Windows 2003 servers, I recently rolled it out to a handful of production servers. Everything seemed fine for the first 20 hours, and then I noticed that incoming traffic through one of the interfaces on two machines was failing. In fact, outgoing traffic was failing too, yet the interfaces were still connected and showing as 'Up'.
Traffic through the internal interface was still fine, and this was only happening on two of the servers that had been upgraded. Thinking it might be an issue with the new Windows Firewall, I disabled it completely. Fifteen hours later the same problem occurred. Running diagnostics on all of the cards, cabling and switches showed it wasn't a hardware issue.
As the interfaces that were failing were teamed, I upgraded all of the drivers to the latest versions, broke the teams and recreated all of them. Twenty-four hours later, exactly the same problem occurred.
Then I started examining the differences between the machines that were experiencing this problem and the other Windows 2003 SP1 servers. The most obvious difference was that the two servers that were failing were both multi-homed and were therefore connecting to multiple networks.
Knowing there were differences in the way Windows 2000 and 2003 handle multi-homing, I suspected this was the most likely cause. It must be said that it had never been a problem before SP1, although it probably should have been.
Anyhow, I removed the gateway from the internal card and checked that its route had gone from the routing table, so that Windows had only one default gateway. As the issue only appeared after a random period of time, the only way to tell whether it was cured was to leave it and see what happened. Four days later there hasn't been a problem.
So, if you experience a similar problem, check if the machines are multi-homed and if so, remove one of the gateways. Odds are, it will cure the problem.
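If you have more than a couple of servers to check, a small script can do the looking for you. What follows is only a rough sketch: it assumes you have captured the text of `route print` from each machine, that the active IPv4 routes appear as five-column rows (destination, netmask, gateway, interface, metric), and the function name and sample addresses are made up for illustration.

```python
def default_routes(route_print_output):
    """Return (gateway, interface, metric) for each active IPv4 default route."""
    routes = []
    for line in route_print_output.splitlines():
        fields = line.split()
        # Assumption: active-route rows have exactly five columns, and a
        # default route is the row whose destination and netmask are 0.0.0.0.
        if len(fields) == 5 and fields[0] == "0.0.0.0" and fields[1] == "0.0.0.0":
            routes.append((fields[2], fields[3], fields[4]))
    return routes


# Illustrative sample only; these addresses are not from the servers above.
SAMPLE = """\
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0         10.0.0.1       10.0.0.25     10
          0.0.0.0          0.0.0.0        192.0.2.1      192.0.2.25     10
         10.0.0.0        255.0.0.0        10.0.0.25       10.0.0.25     10
"""

found = default_routes(SAMPLE)
if len(found) > 1:
    print("More than one default gateway:", [gw for gw, _, _ in found])
```

Any machine that reports more than one default route is a candidate for the fix described above.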
The following link also has some helpful information regarding multi-homing under Windows 2003.
The most frustrating thing about the issues we were having was that our internal monitoring system believed all services etc. to be running, simply because it was testing them through the internal interface, even though it had been specifically configured to test via the external interface.
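One way to double-check which path a probe takes is to bind the test connection to a specific local address before connecting. This is a minimal sketch, not how our monitoring system works: the function name is hypothetical, and binding the source address only fixes the address the probe comes from. Whether the operating system also sends it out of the matching interface still depends on its routing table.

```python
import socket


def reachable_from(local_ip, host, port, timeout=5.0):
    """Try a TCP connection to host:port with the socket bound to local_ip."""
    try:
        # source_address binds the outgoing socket to local_ip (port 0 = any).
        conn = socket.create_connection(
            (host, port), timeout=timeout, source_address=(local_ip, 0)
        )
    except OSError:
        # Covers refused connections, timeouts and unusable source addresses.
        return False
    conn.close()
    return True


# Hypothetical usage: probe a server's external address from the monitoring
# host's external-facing address (both addresses are placeholders).
# reachable_from("192.0.2.50", "192.0.2.25", 80)
```

Running the same check once per local address makes it obvious when a service answers on one network but not the other.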
The one good thing that has happened as a result of this issue is that it's led me to review how we monitor our servers and networks, and I've discovered NetCrunch from AdRem Software (http://www.adremsoft.com/netcrunch/index.php). If you're in the market for a network management and monitoring system, then you really need to check this out. It is a little pricey, but when compared to other equivalents, you should be pleasantly surprised. I sure am, and I will be recommending it to replace our existing system in the near future.