GOOD BOTS AND BAD BOTS
FREQUENTLY ASKED QUESTIONS

MACHINE AGENTS

What is a machine agent?

Machine agents are software used to retrieve information from, and/or upload information onto, webservers with little or no user intervention. When the software requires significant user intervention in order to retrieve or upload information, we call it a user agent.

Are machine agents new?

No, machine agents are not new. They are also called bots, robots, spiders, crawlers, spambots, or scrapers in different contexts based on the particular tasks they perform.

Are machine agents bad?

Whether machine agents are good or bad depends on the goals of the webmaster and/or application. For most applications though, there are many more machine agents that may be considered bad and relatively few that may be considered good.

Can bad machine agents be identified?

We can detect, identify and record the capabilities of most machine agents as they hit our websites, but we leave it up to webmasters to determine which machine agents are good and which are bad for their particular applications.

How are machine agents detected?

We use many different techniques to detect machine agent activity on our websites. Some techniques are very simple: for example, any computer that requests robots.txt from our servers is registered as a machine agent. Other techniques are more advanced: for example, we monitor request patterns and request rates to detect stealth machine agents.
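
As a rough illustration of the two kinds of technique mentioned above, the following Python sketch flags a client either when it requests robots.txt or when it sends too many requests within a sliding time window. It is a toy example only, not the detection software described in this FAQ; the class name and the default thresholds (20 requests in 10 seconds) are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class SimpleBotDetector:
    """Toy detector: robots.txt requests and high request rates (illustrative only)."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        # ASSUMPTION: both thresholds are arbitrary values picked for the example.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.flagged = set()               # ips registered as machine agents

    def observe(self, ip, path, timestamp):
        """Record one request and return True if the ip is now flagged."""
        # Simple technique: anything that asks for robots.txt is a machine agent.
        if path == "/robots.txt":
            self.flagged.add(ip)

        # More advanced technique: count requests inside a sliding time window.
        times = self._recent[ip]
        times.append(timestamp)
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        if len(times) > self.max_requests:
            self.flagged.add(ip)

        return ip in self.flagged
```

Real stealth agents vary their timing, so a fixed threshold like this would only catch the crudest cases.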

How can spoofed user agent strings be detected?

Spoofed user agent strings (which machine agents are supposed to use to identify themselves) can be detected in at least two ways. Search engine companies have proposed a method which, unfortunately, is not generic enough and doesn't always work. We developed another method which should always work but, unfortunately, is not widely supported by search engine companies and may also interfere with the robots.txt exclusion protocol. The best method today is to check the X-VERIFIED header field (described below) in our database.

Can I use machine agent detection techniques on my websites?

You can in principle, but we highly recommend that you get the data from us or from another provider instead. The reason is that many of the techniques will cause the popular search engines to penalize your websites or even remove them completely from their databases. We are able to use these techniques freely on our websites and pass the data on to you safely only because we care more about word-of-mouth than SERPs ranking.

Is there any way to block machine agents from my websites without being penalized by legitimate search engines?

Yes, certainly. The best way to do this is to acquire a list of machine agents that someone has already detected using the techniques described above, and then use this list to allow or deny access to your websites. This is a perfectly safe approach that will not affect your SERPs ranking or overload your server significantly.

What should I look for in a machine agent?

Our system calculates and stores a capabilities value for each machine agent that we identify. This "botcaps" value, as we call it, encodes answers to the following questions for each machine agent:

(1) does it have an identifier or user agent string?
(2) has it been delisted from our database?
(3) is it being used to look for exploits and vulnerabilities?
(4) does it support cookies?
(5) does it have a webpage?
(6) is it connecting from an IP address published on its webpage?
(7) does it read the robots.txt file?
(8) does it follow the robots.txt protocol?
(9) is it capable of executing javascript?
(10) is it being used to scrape web content?
(11) is it being used to send spam?
(12) is it being used to track or check resources like images and links?

You can use these various pieces of information to help you decide if a particular machine agent is reputable enough to be allowed access to your websites or not.

How should the bot capabilities (botcaps) value be decoded or interpreted?

(1) If the machine agent does not have an identifier or a user agent string, bit 1 in the botcaps value will be set and the X-NOID header field will be present in the database record.

(2) If the machine agent has been delisted from our database, bit 2 in the botcaps value will be set and the X-DELETED header field will be present in the database record.

(3) If the machine agent is being used to look for exploits and vulnerabilities, or if it requested a page that doesn't exist on our servers, bit 3 in the botcaps value will be set and the X-CRACKER header field will be present in the database record.

(4) If the machine agent ignores or does not correctly return cookies, bit 4 in the botcaps value will be set and the X-ICOOKIES header field will be present in the database record.

(5) If the machine agent has a valid url in its user agent string, bit 6 in the botcaps value will be set and the X-PUBLISHED header field will be present in the database record.

(6) If the machine agent has a user agent string that is demonstrably not spoofed, bit 5 in the botcaps value will be set and the X-VERIFIED header field will be present in the database record.

(7) If the machine agent reads the robots.txt file, bit 7 in the botcaps value will be set and the X-RROBOTSTXT header field will be present in the database record.

(8) If the machine agent does not respect the robots.txt file, bit 8 in the botcaps value will be set and the X-IROBOTSTXT header field will be present in the database record.

(9) If the machine agent is unable to execute javascript, bit 9 in the botcaps value will be set and the X-IJAVASCRIPT header field will be present in the database record.

(10) If the machine agent is being used to scrape web content, bit 10 in the botcaps value will be set and the X-SCRAPER header field will be present in the database record.

(11) If the machine agent is being used to send spam, bit 11 in the botcaps value will be set and the X-SPAMMER header field will be present in the database record.

(12) If the machine agent is being used to track or check resources like images and links, bit 12 in the botcaps value will be set and the X-TRACKER header field will be present in the database record.
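
The bit assignments above can be turned into a small decoder. The Python sketch below maps each bit number to the header name given in the list. It assumes that the botcaps value is an integer and that bit 1 is the least significant bit (so bit n has the value 2^(n-1)); neither detail is stated above, so treat both as assumptions. The names BOTCAPS_FLAGS and decode_botcaps are made up for this example.

```python
# Bit numbers and header names exactly as given in the list above.
BOTCAPS_FLAGS = {
    1: "X-NOID",
    2: "X-DELETED",
    3: "X-CRACKER",
    4: "X-ICOOKIES",
    5: "X-VERIFIED",
    6: "X-PUBLISHED",
    7: "X-RROBOTSTXT",
    8: "X-IROBOTSTXT",
    9: "X-IJAVASCRIPT",
    10: "X-SCRAPER",
    11: "X-SPAMMER",
    12: "X-TRACKER",
}


def decode_botcaps(value):
    """Return the header names whose bits are set in a botcaps integer."""
    # ASSUMPTION: bit 1 is the least significant bit, i.e. mask 1 << (bit - 1).
    return [
        name
        for bit, name in sorted(BOTCAPS_FLAGS.items())
        if value & (1 << (bit - 1))
    ]
```

Under these assumptions, a botcaps value of 80 (bits 5 and 7) would decode to X-VERIFIED and X-RROBOTSTXT.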

OUR SERVICES

What can botslist products and services do for you?

We help webmasters to detect and identify robots and other machine agents that visit their websites. This in turn helps webmasters to:

(a) block automated requests from compromised internet servers

(b) maintain accurate and reliable web statistics

(c) manage bandwidth costs and improve webserver performance

(d) implement legitimate content delivery for search engine optimization

(e) protect their content and other intellectual property from unauthorized scrapers

How does it work?

DATA COLLECTION

(1) We run sophisticated machine agent detection software on our websites, including the main website that you are currently on. Every hour of the day, each website uploads its newly detected machine agents to a central server.

(2) We operate a powerful crawler that searches the internet for access logs, parses the log files to identify machine agents, and uploads the results to our central server for analysis.

(3) Wherever and whenever possible, we ask webmasters to use HTTP 3xx response codes to redirect machine agents they have detected to our central server to be added to our database.

DATA PROCESSING

(1) Our central server processes the data and adds the machine agents to our database.

(2) Another server grabs the data from our database and sends it to your web server at the interval specified in your account information.

(3) Your web server script or module accepts the data, verifies that it originates from our servers and updates your access control files and databases.

REALTIME WEBSERVICE 

By requesting the page http://rbl.botslist.ca/whois?addr=XXXXXXX from our realtime bot lookup server, low-traffic websites can easily find out if an ip address is active in our database. If the ip address is not active in our database, the server will respond with 404 Not Found. If the ip address is active, the server will respond with 200 OK along with detailed information for the ip address.
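
A lookup along these lines could be sketched in Python with the standard library, as shown below. The sketch assumes that the ip address is passed as plain text in place of XXXXXXX and that the response body is UTF-8 text; the function names and the opener parameter (a hook for trying the code without network access) are hypothetical.

```python
import urllib.error
import urllib.parse
import urllib.request

LOOKUP_BASE = "http://rbl.botslist.ca/whois"  # endpoint quoted in the text above


def build_lookup_url(ip_address):
    """Build the lookup URL, passing the ip address in the 'addr' parameter."""
    return LOOKUP_BASE + "?" + urllib.parse.urlencode({"addr": ip_address})


def lookup_ip(ip_address, timeout=5.0, opener=urllib.request.urlopen):
    """Return the detail text if the ip is active in the database, else None."""
    try:
        with opener(build_lookup_url(ip_address), timeout=timeout) as response:
            # 200 OK: the address is active and the body carries the details.
            # ASSUMPTION: the body is text that can be decoded as UTF-8.
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # 404 Not Found: the address is not active in the database.
            return None
        raise  # any other status is unexpected, so let the caller see it
```

Because every lookup is a separate HTTP request, this approach only suits low-traffic websites, as noted above.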

How much does it cost?

There is no charge for receiving the full database every quarter when you purchase some of our products. A small charge will apply if you want to receive the full database more frequently or without purchasing any of our products.

We determine the price per update (PPU), and you choose how many updates you want and how frequently you want them to be posted to your server.

For example, 12 updates at a monthly interval and PPU of $10 will cost you $120 per year. Similarly, 52 updates at a weekly interval and PPU of $10 will cost you $520 per year. The PPU may change from time to time but will never apply retroactively to updates that you have already paid for.

What software is needed in order to use the service?

If none of our products is suitable for your needs and you want to develop your own solution, you need to know that we will send the data to your web server with a POST request, so your server must be able to handle the request. This will most likely require a server-side script written in one of the popular web scripting languages like PHP or Python. Your script should:

(1) accept the request
(2) verify that it originates from our domain
(3) parse the botslist data and update your access configuration file or database

For step (2), your script should do a forward DNS lookup on www.botslist.ca to get a list of ip addresses and terminate the connection if it does not come from one of the addresses in this list. Your script should also verify that the POST request contains the security header name and value specified in your account. It is important to follow this procedure in order to prevent your web server from accepting data from malicious computers pretending to be a botslist server.
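
The two checks in this step could look roughly like the Python sketch below. It is a minimal illustration, not a finished implementation: the function names are made up, the header name and value stand for whatever is specified in your account, and the resolver parameter is a hypothetical hook that lets the check be tried without a live DNS lookup.

```python
import hmac
import socket

BOTSLIST_HOST = "www.botslist.ca"  # host named in the text above


def resolve_addresses(hostname):
    """Forward DNS lookup: return the set of ip addresses for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def is_trusted_request(remote_addr, headers, header_name, header_value,
                       resolver=resolve_addresses):
    """Return True only if the request passes both checks described above."""
    # Check 1: the connection must come from one of the resolved addresses.
    if remote_addr not in resolver(BOTSLIST_HOST):
        return False
    # Check 2: the security header must be present with the expected value.
    # Header names are case-insensitive, so compare them in lower case.
    received = {k.lower(): v for k, v in headers.items()}.get(header_name.lower())
    if received is None:
        return False
    # compare_digest avoids leaking the value through timing differences.
    return hmac.compare_digest(received.encode(), header_value.encode())
```

A script would call is_trusted_request with the remote address of the connection and the request headers, and stop processing as soon as it returns False.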

For step (3), your script will need to parse the botslist data in order to extract the information that you are interested in. The detail data file contains the ip address on one line, followed by header fields on subsequent lines, followed by two newlines (\r\n). The summary data file contains the ip address, a single space and the botcaps value on each line. The sql data file contains sql queries for updating an sql database with the information from the detail data file.
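
As a sketch of how the summary and detail formats might be parsed in Python, see below. Two details are not specified above and are assumed here: that the botcaps value is written as a decimal integer, and that each header field uses the HTTP-style "Name: value" form. The function names are made up for this example.

```python
def parse_summary(text):
    """Parse the summary file: one 'ip botcaps' pair per line."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ip, botcaps = line.split(" ", 1)
        # ASSUMPTION: the botcaps value is written as a decimal integer.
        records[ip] = int(botcaps)
    return records


def parse_detail(text):
    """Parse the detail file into {ip: {header_name: header_value}}."""
    records = {}
    # Normalise line endings so "\r\n\r\n" and "\n\n" both end a record.
    for block in text.replace("\r\n", "\n").split("\n\n"):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if not lines:
            continue
        ip, header_lines = lines[0].strip(), lines[1:]
        headers = {}
        for header in header_lines:
            # ASSUMPTION: header fields use the HTTP-style "Name: value" form.
            name, _, value = header.partition(":")
            headers[name.strip()] = value.strip()
        records[ip] = headers
    return records
```

The resulting dictionaries can then be written to an access configuration file or a database, whichever your server uses.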

What should my web server do when it denies access to a suspected machine agent?

It depends on how you detected the suspicious activity.

(1) If you are denying access based on information contained in our database, we recommend that you redirect the request to the botslisted page on our server. This page explains why access is being denied and also allows visitors to remove themselves from our database if they turn out to be genuine users instead of machine agents.

(2) If you suspect machine agent activity based on information that is NOT from our database, we recommend that you redirect the request to any page that does not exist on our server, for example, "/badbots". If the redirection is followed, it will cause the suspected machine agent to be added to our database automatically.
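
The two cases above amount to choosing a redirect target. The Python sketch below returns a status line and a Location header in the form a WSGI-style handler could use. The function name is made up, the choice of 302 among the HTTP 3xx codes is an assumption, and both example URLs are hypothetical: only the "/badbots" path appears in the text above, and the address of the botslisted page is not given here.

```python
def deny_redirect(listed_in_botslist, botslisted_url, nonexistent_url):
    """Return (status, headers) redirecting a denied request as described above.

    Both URLs are supplied by the caller; this sketch does not know the real
    address of the botslisted page.
    """
    # Case (1): listed in the database, so send the visitor to the botslisted page.
    # Case (2): suspected locally, so send it to a page that does not exist.
    location = botslisted_url if listed_in_botslist else nonexistent_url
    # ASSUMPTION: 302 is used here; any suitable HTTP 3xx code could be chosen.
    return "302 Found", [("Location", location)]


# HYPOTHETICAL values for illustration only.
BOTSLISTED_URL = "http://www.botslist.ca/botslisted"
NONEXISTENT_URL = "http://www.botslist.ca/badbots"
```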

I refreshed my browser and now I'm listed as a machine agent, or I just visited your website and got registered in your database. Why?

Our detection software is very unforgiving by design. The problem is that different browsers behave differently for any given action, such as a refresh. If a browser does something that a machine agent might do, such as refusing to follow a redirection or not returning our cookies correctly, it will get listed. If this was an error, the user will have a chance to correct it. Our system is designed to err on the side of safety because the bad guys have become really sophisticated over the last few years.

Can I list or delist a third-party ip address in your database?

Sorry, you can't. We provide no way for anyone to cause a third party's ip address to be listed or delisted unless that party connects to a botslist server. This is necessary in order to avoid abuse and legal complications.

There is also no backdoor in our system to allow anyone, including our staff, to delist an ip address from our database. Delisting can only be done by someone connecting from the listed ip address.


botslist Copyright 2012. Botslist and Voicenette Communications. All Rights Reserved.