MSNBOT must die!

If you suffered any problems accessing any of the sites, the databases, the CPAN mirror, etc. from the CPAN Testers server last night, please direct your wrath at Microsoft. Last night the msnbot took out the CPAN Testers server with a dedicated denial-of-service attack. As a consequence, measures are now being put in place to ban the msnbot completely from accessing at least the Reports site, and probably all the sites on the server.

Microsoft in their incompetent wisdom decided to unleash 20-30 bots every few seconds. I know this because I can see the IP addresses in the logs. The ones spotted within a few minutes of rebooting the server this morning to clear the processes were:

65.55.207.50
65.55.207.23
65.55.207.93
65.55.207.25
65.55.207.48
65.55.207.46
65.55.207.72
65.55.207.26
65.55.106.234
65.55.107.179
65.55.207.100
65.55.207.121
65.55.207.30
65.55.207.69
65.55.207.28
65.55.107.180
65.55.207.27
65.55.207.47
65.55.207.21
65.55.207.51
65.55.207.54

It seems their bots completely ignore the rules specified in robots.txt, despite my setting it up as per the guidelines on their own site. Worst of all, they don't talk to each other to see that they are accessing the same domain. Most sensible bots, such as those of Google or Majestic 12, will only let one bot at a time crawl a site, as most sensible companies acknowledge that a DoS attack is not good policy. As a consequence, I'll now be denying access to anything with an IP matching /^65\.55\.(106|107|207)/. If you discover you fall into that pattern, and are a real person, please let me know.
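
As a rough illustration of the deny rule above, here is a minimal Python sketch that applies the same pattern to a client address. The function name `is_blocked` is hypothetical, and this is not the server's actual configuration, which this post does not show.

```python
import re

# The pattern quoted above. A valid IPv4 octet never exceeds 255, so a
# prefix match on the third octet cannot spill over into other octets.
BLOCKED = re.compile(r"^65\.55\.(106|107|207)")


def is_blocked(ip: str) -> bool:
    """Return True if the address starts with one of the banned prefixes.

    Hypothetical helper for illustration only.
    """
    return BLOCKED.match(ip) is not None


if __name__ == "__main__":
    for ip in ("65.55.207.50", "65.55.106.234", "65.55.108.1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Matching on the address rather than the User-Agent header means the rule still applies to a client that does not identify itself, at the cost of also catching any real person in those ranges.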

If anyone from Microsoft ends up reading this, though likely you'll have to do it in person and not via a bot, I now consider you to be no better than a script kiddie trying to bring down a government computer. DoS attacks usually get people charged and arrested. If CPAN Testers were a legal entity, then I might have been able to follow this through. Instead I'm locking the doors, and no longer letting you through.

13 Comments


Hmm. I banned MSN a long time ago from perldesignpatterns.com. If you don't respect the robots.txt there and attempt to crawl the site, you'll wind up downloading the same content over and over in vast numbers of permutations. Guess who was ignoring robots.txt...

-scott

Does anyone know whether this was deliberate or the result of some inept techie trying to get more cataloguing?

Never attribute to malice what can be easily explained by stupidity.

We had similar problems over at the Open Watcom project about a year ago; see "msnbot turning evil".

Hi,
I am a Program Manager on the Bing team at Microsoft. Thanks for bringing this issue to our attention. I have sent an email to barbie@cpan.org, as we need additional information to be able to track down the problem. If you have not received the email, please contact us through the Bing webmaster center at bwmc@microsoft.com.

Any reason we shouldn't just block all requests with a User-Agent string matching /msnbot/i? I sure don't care if my site is missing from their index.
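
A minimal sketch of that check in Python, assuming the User-Agent header is available as a plain string. The helper name `is_msnbot` is hypothetical. Since the header is supplied by the client, this only catches bots that identify themselves.

```python
import re

# Case-insensitive search, equivalent to the /msnbot/i pattern above.
MSNBOT = re.compile(r"msnbot", re.IGNORECASE)


def is_msnbot(user_agent: str) -> bool:
    """Return True if the User-Agent string mentions msnbot.

    Hypothetical helper; an empty or missing header counts as no match.
    """
    return MSNBOT.search(user_agent or "") is not None


if __name__ == "__main__":
    ua = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
    print(is_msnbot(ua))
```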

Perhaps this is not malicious, but it sure looks suspicious. I run a webserver and I am seeing requests from these bots for filenames of the form:

2375C96D0FABF01BA2E94E6ECBFDB72F_00000.temp0014.htm

There have NEVER been any files available on my webserver with filenames that look like that.

I am going to block these networks from my webserver.

The whole 65.52.0.0/14 block is Microsoft territory, could just block that:

richter-3:~$ whois 65.55.207.50

OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US

NetRange: 65.52.0.0 - 65.55.255.255
CIDR: 65.52.0.0/14
NetName: MICROSOFT-1BLK
NetHandle: NET-65-52-0-0-1
Parent: NET-65-0-0-0-0
NetType: Direct Assignment
NameServer: NS1.MSFT.NET
NameServer: NS5.MSFT.NET
NameServer: NS2.MSFT.NET
NameServer: NS3.MSFT.NET
NameServer: NS4.MSFT.NET
Comment:
RegDate: 2001-02-14
Updated: 2004-12-09
[...]

Perhaps I am paranoid, but I kind of doubt it is a coincidence that these Microsoft bots are hurting my server's performance with requests for non-existent files, with names that appear to be randomly generated, when the primary purpose of my server is to serve custom Mozilla builds and Mozilla add-ons.

Also, besides the 65.52 range, I am seeing the same thing from MSN bots in:

OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US

NetRange: 207.46.0.0 - 207.46.255.255
CIDR: 207.46.0.0/16
NetName: MICROSOFT-GLOBAL-NET
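
For anyone who wants to test addresses against the two CIDR blocks quoted from whois in these comments, here is a minimal Python sketch using the standard `ipaddress` module. The names `LISTED_NETS` and `in_listed_ranges` are hypothetical, and the ranges are only the ones shown above, not a complete list of Microsoft address space.

```python
import ipaddress

# The two blocks quoted in the whois output above (as registered at the time).
LISTED_NETS = [
    ipaddress.ip_network("65.52.0.0/14"),
    ipaddress.ip_network("207.46.0.0/16"),
]


def in_listed_ranges(ip: str) -> bool:
    """Return True if the address lies inside either quoted block.

    Hypothetical helper for illustration only.
    """
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in LISTED_NETS)


if __name__ == "__main__":
    for ip in ("65.55.207.50", "207.46.12.34", "65.56.0.1"):
        print(ip, in_listed_ranges(ip))
```

Blocking a whole /14 is much broader than the three third-octet prefixes in the original post, so it is more likely to catch non-bot traffic as well.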

I should have mentioned that the attempts to load files with these randomly generated names are all within the /mozilla/firefox file hierarchy on the server.

My suspicion is:

1. This IS malicious.

2. This is NOT intentional on Microsoft's part.

I think somehow someone has found a way to hack into their server indexing system to cause this.

Hi Bill,

Did those bots actually identify themselves as msnbot in the User-agent request header?

If so, instead of banning the bot IPs, I would suggest adding a Crawl-delay directive to your robots.txt file, and then contacting Microsoft if, after that, you still see an excessive number of requests per second coming from msnbot.

See: http://www.bing.com/community/blogs/webmaster/archive/2009/08/10/crawl-delay-and-the-bing-crawler-msnbot.aspx

I have also witnessed this DoS from msnbot on one of my web servers (PHP with a database). Despite a high crawl-delay (gradually raised over the past months) and blocking the /cgi-bin directory, I still see some bots that continue their regular crawling without taking the directives into account:

Apache log file:
65.55.207.21 - - [21/Jan/2010:00:17:01 +0100] "GET /cgi-bin/search.php?... HTTP/1.0" 200 7352 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.207.21 - - [21/Jan/2010:00:17:31 +0100] "GET ... HTTP/1.0" 200 6133 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

Only 30 seconds between the two hits, while the robots.txt says:
User-agent: *
Disallow: /search.php
Disallow: /search.php?action=show
Disallow: /*?action=show
Disallow: /cgi-bin

User-agent: msnbot
Crawl-delay: 2048
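
The comparison can be reproduced with a short Python sketch that feeds the robots.txt shown here to the standard-library `urllib.robotparser` and measures the gap between the two logged hits. The variable names are hypothetical, and the parser reflects Python's reading of the file, which is not necessarily how msnbot itself interpreted it.

```python
from datetime import datetime
from urllib.robotparser import RobotFileParser

# The robots.txt exactly as quoted in this comment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search.php
Disallow: /search.php?action=show
Disallow: /*?action=show
Disallow: /cgi-bin

User-agent: msnbot
Crawl-delay: 2048
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl delay (in seconds) that applies to msnbot: 2048
delay = parser.crawl_delay("msnbot")

# Timestamps of the two hits from the Apache log lines above.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
first = datetime.strptime("21/Jan/2010:00:17:01 +0100", LOG_TIME_FORMAT)
second = datetime.strptime("21/Jan/2010:00:17:31 +0100", LOG_TIME_FORMAT)
gap = (second - first).total_seconds()  # 30.0

# True only if the bot waited at least the requested delay.
honoured = gap >= delay

# How this parser applies the Disallow rules to msnbot versus another bot.
msnbot_may_fetch_cgi = parser.can_fetch("msnbot", "/cgi-bin/search.php")
other_may_fetch_cgi = parser.can_fetch("somebot", "/cgi-bin/search.php")

if __name__ == "__main__":
    print(f"delay={delay}s gap={gap}s honoured={honoured}")
    print(f"msnbot may fetch /cgi-bin: {msnbot_may_fetch_cgi}")
    print(f"other bots may fetch /cgi-bin: {other_may_fetch_cgi}")
```

One thing this brings out: under the usual robots.txt group rules, a crawler that has its own `User-agent` group uses only that group, so the Disallow lines under `User-agent: *` would not apply to msnbot as the file is written. Repeating them under `User-agent: msnbot` may be needed for the /cgi-bin block to cover it. That does not explain the ignored Crawl-delay.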


And of course this specific bot had retrieved the directive from the robots.txt file a few days earlier:

grep "65.55.207.21.*robots.txt" /var/log/apache2/...-access.log
65.55.207.21 - - [18/Jan/2010:10:58:46 +0100] "GET /robots.txt HTTP/1.0" 200 238 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"


N.B.: if it weren't for my employer's web site, I would have blocked the MSN bot completely a long time ago!
(But seeing how widespread this problem has become, I'm still reconsidering it.)


About CPAN Testers

This is the new account for incidental and summary updates to what's happening with the CPAN Testers. For all the latest news and views please see our blog.