MACHINE AGENTS
What is a machine agent?
Machine agents are software used to retrieve
information from, and/or upload information onto, webservers
with little or no user intervention. When the software requires
significant user intervention to retrieve or upload information,
we call it a user agent.
Are machine agents new?
No, machine agents are not new. They are also
called bots, robots, spiders, crawlers, spambots, or scrapers
in different contexts based on the particular tasks they perform.
Are machine agents bad?
Whether machine agents are good or bad depends
on the goals of the webmaster and/or application. For most
applications though, there are many more machine agents that
may be considered bad and relatively few that may be considered
good.
Can bad machine agents be identified?
We can detect, identify and record the capabilities
of most machine agents as they hit our websites, but we leave
it up to webmasters to determine which machine agents are
good and which are bad for their particular applications.
How are machine agents detected?
We use many different techniques to detect
machine agent activity on our websites. Some techniques are
very simple: for example, any computer that requests robots.txt
from our servers is registered as a machine agent. Other techniques
are more advanced: for example, we monitor request patterns
and request rates to detect stealth machine agents.
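The two techniques mentioned above can be sketched roughly as follows.
This is only an illustration, not our production detector: the class
name, the window length and the request limit are all hypothetical
values chosen for the example.

```python
from collections import defaultdict, deque

# Hypothetical thresholds; the real values are not published.
WINDOW_SECONDS = 10.0
MAX_REQUESTS_PER_WINDOW = 20


class AgentDetector:
    """Flags an IP address as a machine agent using two simple signals."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_REQUESTS_PER_WINDOW):
        self.window = window
        self.limit = limit
        self.recent = defaultdict(deque)  # ip -> timestamps inside the window
        self.flagged = set()

    def observe(self, ip, path, timestamp):
        """Record one request and return True if the IP is now flagged."""
        # Simple technique: anything that asks for robots.txt is flagged.
        if path == "/robots.txt":
            self.flagged.add(ip)
        # Rate technique: count requests inside a sliding time window.
        times = self.recent[ip]
        times.append(timestamp)
        while times and timestamp - times[0] > self.window:
            times.popleft()
        if len(times) > self.limit:
            self.flagged.add(ip)
        return ip in self.flagged
```

A real detector would combine many more signals; the sliding window
here only shows how a request rate can be measured per address.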
How can spoofed user agent strings be detected?
Machine agents are supposed to identify themselves
with a user agent string, but that string can be spoofed.
Spoofing can be detected in at least two ways. Search engine
companies have proposed a method which unfortunately is not
generic enough and doesn't always work. We developed another
method which should always work but is unfortunately not widely
supported by search engine companies and may also interfere
with the robots.txt exclusion protocol. The best method today
is to check for the X-VERIFIED header field (described below)
in our database.
Can I use machine agent detection techniques
on my websites?
You can in principle, but we highly recommend
that you get the data from us or from another provider instead.
The reason is that many of the techniques will cause the popular
search engines to penalize your websites or even remove them
completely from their databases. We are able to use these
techniques freely on our websites and pass the data on to
you safely only because we care more about word of mouth than
about SERP ranking.
Is there any way to block machine agents from
my websites without being penalized by legitimate search engines?
Yes, certainly. The best way to do this is
to acquire a list of machine agents that someone has already
detected using the techniques described above, and then use
this list to allow or deny access to your websites. This is
a perfectly safe approach that will not affect your SERP
ranking or add significant load to your server.
What should I look for in a machine agent?
Our system calculates and stores a capabilities
value for each machine agent that we identify. This "botcaps"
value, as we call it, encodes answers to the following questions
for each machine agent:
(1) does it have an identifier or user agent string?
(2) has it been delisted from our database?
(3) is it being used to look for exploits and vulnerabilities?
(4) does it support cookies?
(5) does it have a webpage?
(6) is it connecting from an IP address published on its webpage?
(7) does it read the robots.txt file?
(8) does it follow the robots.txt protocol?
(9) is it capable of executing JavaScript?
(10) is it being used to scrape web content?
(11) is it being used to send spam?
(12) is it being used to track or check resources like images
and links?
You can use these various pieces of information to help you
decide if a particular machine agent is reputable enough to
be allowed access to your websites or not.
How should the bot capabilities (botcaps) value
be decoded or interpreted?
(1) If the machine agent does not have an
identifier or a user agent string, bit 1 in the botcaps value
will be set and the X-NOID header field will be present in
the database record.
(2) If the machine agent has been delisted from our database,
bit 2 in the botcaps value will be set and the X-DELETED header
field will be present in the database record.
(3) If the machine agent is being used to look for exploits
and vulnerabilities, or if it requested a page that doesn't
exist on our servers, bit 3 in the botcaps value will be set
and the X-CRACKER header field will be present in the database
record.
(4) If the machine agent ignores or does not correctly return
cookies, bit 4 in the botcaps value will be set and the X-ICOOKIES
header field will be present in the database record.
(5) If the machine agent has a valid URL in its user agent
string, bit 6 in the botcaps value will be set and the X-PUBLISHED
header field will be present in the database record.
(6) If the machine agent has a user agent string that is demonstrably
not spoofed, bit 5 in the botcaps value will be set and the
X-VERIFIED header field will be present in the database record.
(7) If the machine agent reads the robots.txt file, bit 7
in the botcaps value will be set and the X-RROBOTSTXT header
field will be present in the database record.
(8) If the machine agent does not respect the robots.txt file, bit
8 in the botcaps value will be set and the X-IROBOTSTXT header
field will be present in the database record.
(9) If the machine agent is unable to execute JavaScript,
bit 9 in the botcaps value will be set and the X-IJAVASCRIPT
header field will be present in the database record.
(10) If the machine agent is being used to scrape web content,
bit 10 in the botcaps value will be set and the X-SCRAPER
header field will be present in the database record.
(11) If the machine agent is being used to send spam, bit
11 in the botcaps value will be set and the X-SPAMMER header
field will be present in the database record.
(12) If the machine agent is being used to track or check
resources like images and links, bit 12 in the botcaps value
will be set and the X-TRACKER header field will be present
in the database record.
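The list above can be turned into a small decoder. The following is a
minimal sketch, assuming that "bit n" counts from 1 at the least
significant bit of an integer botcaps value (so bit 1 has the value 1
and bit 12 has the value 2048). The names BOTCAPS_FLAGS and
decode_botcaps are made up for the example. The bit positions are
copied exactly as listed, including bit 6 for X-PUBLISHED and bit 5
for X-VERIFIED.

```python
# Bit positions exactly as listed in the text above.
BOTCAPS_FLAGS = {
    1: "X-NOID",
    2: "X-DELETED",
    3: "X-CRACKER",
    4: "X-ICOOKIES",
    5: "X-VERIFIED",
    6: "X-PUBLISHED",
    7: "X-RROBOTSTXT",
    8: "X-IROBOTSTXT",
    9: "X-IJAVASCRIPT",
    10: "X-SCRAPER",
    11: "X-SPAMMER",
    12: "X-TRACKER",
}


def decode_botcaps(value):
    """Return the header field names whose bits are set in a botcaps value."""
    # Assumption: "bit n" is 1-based, counted from the least significant bit.
    return [name for bit, name in sorted(BOTCAPS_FLAGS.items())
            if value & (1 << (bit - 1))]


print(decode_botcaps(0b1000000001))  # bits 1 and 10 set
```

Under the stated assumption, the example call reports X-NOID and
X-SCRAPER for a value with bits 1 and 10 set.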
OUR SERVICES
What can botslist products and services do for
you?
We help webmasters detect and identify robots
and other machine agents that visit their websites. This in
turn helps webmasters to:
(a) block automated requests from compromised internet servers
(b) maintain accurate and reliable web statistics
(c) manage bandwidth costs and improve webserver performance
(d) implement legitimate content delivery for search engine
optimization
(e) protect their content and other intellectual property from
unauthorized scrapers
How does it work?
DATA COLLECTION
(1) We run sophisticated machine agent detection software on
our websites, including the main website that you are currently
on. Every hour of the day, each website uploads its newly detected
machine agents to a central server.
(2) We operate a powerful crawler that searches the internet
for access logs, parses the log files to identify machine agents,
and uploads the results to our central server for analysis.
(3) Wherever and whenever possible, we ask webmasters to use
HTTP 3xx response codes to redirect machine agents they have
detected to our central server to be added to our database.
DATA PROCESSING
(1) Our central server processes the data and adds the machine
agents to our database.
(2) Another server grabs the data from our database and sends
it to your web server at the interval specified in your account
information.
(3) Your web server script or module accepts the data, verifies
that it originates from our servers and updates your access
control files and databases.
REALTIME WEBSERVICE
By requesting the page http://rbl.botslist.ca/whois?addr=XXXXXXX
from our realtime bot lookup server, low-traffic websites can
easily find out whether an IP address is active in our database.
If the IP address is not active in our database, the server
responds with 404 Not Found. If the IP address is active, the
server responds with 200 OK along with detailed information
for the IP address.
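A lookup against the realtime webservice might look like the sketch
below. It assumes only what is stated above: the URL pattern, a 404
for inactive addresses and a 200 with a body for active ones. The
function names and the opener parameter (which allows testing without
network access) are hypothetical, and because the layout of the
detailed information is not described here, the body is returned
unparsed.

```python
import urllib.error
import urllib.parse
import urllib.request

LOOKUP_URL = "http://rbl.botslist.ca/whois"  # endpoint quoted in the text


def build_lookup_url(ip):
    """Build the lookup URL for one IP address."""
    return LOOKUP_URL + "?" + urllib.parse.urlencode({"addr": ip})


def lookup(ip, opener=urllib.request.urlopen, timeout=5.0):
    """Return the response body if the IP is active, or None on a 404."""
    try:
        with opener(build_lookup_url(ip), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # not active in the database
        raise  # any other status is treated as an unexpected failure
```

Because every lookup is a separate HTTP request, this approach suits
low-traffic websites, as noted above.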
How much does it cost?
There is no charge for receiving the full database
every quarter when you purchase some of our products. A small
charge will apply if you want to receive the full database more
frequently or without purchasing any of our products.
We set the price per update (PPU), and you choose how many
updates you want and how frequently you want them posted to
your server.
For example, 12 updates at a monthly interval and PPU of $10
will cost you $120 per year. Similarly, 52 updates at a weekly
interval and PPU of $10 will cost you $520 per year. The PPU
may change from time to time but will never apply retroactively
to updates that you have already paid for.
What software is needed in order to use the service?
If none of our products suits your needs and you
want to develop your own solution, note that we send the data
to your web server in a POST request, so your server must be
able to handle that request. This will most likely require a
server-side script written in a popular web scripting language
such as PHP or Python.
Your script should
(1) accept the request
(2) verify that it originates from our domain
(3) parse the botslist data and update your access configuration
file or database
For step (2), your script should do a forward DNS lookup on www.botslist.ca
to get a list of IP addresses and terminate the connection if
it does not come from one of the addresses in this list.
Your script should also verify that the POST request contains
the security header name and value specified in your account.
It is important to follow this procedure in order to prevent
your web server from accepting data from malicious computers
pretending to be a botslist server.
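The origin check for step (2) could be sketched like this. It is a
minimal illustration, assuming the request headers are available as a
dict with lower-cased names; the function names and the resolver
parameter are hypothetical, and the header name and value must come
from your own account settings.

```python
import hmac
import socket

BOTSLIST_HOST = "www.botslist.ca"  # host named in the text


def resolve_addresses(host):
    """Forward DNS lookup: return every IP address the host resolves to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def is_trusted_request(remote_ip, headers, header_name, header_value,
                       resolver=resolve_addresses):
    """Check the sender's address and the account security header."""
    # First check: the connection must come from an address of the host.
    if remote_ip not in resolver(BOTSLIST_HOST):
        return False
    # Second check: the POST must carry the account's header name and value.
    # Assumption: `headers` is a dict keyed by lower-cased header names.
    received = headers.get(header_name.lower())
    if received is None:
        return False
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(received.encode(), header_value.encode())
```

If the function returns False, the script should drop the connection
without processing the POST body.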
For step (3), your script will need to parse the botslist data
in order to extract the information that you are interested
in. The detail data file contains the IP address on one line,
followed by header fields on subsequent lines, followed by two
newlines (\r\n). The summary data file contains the IP address,
a single space and the botcaps value on each line. The SQL
data file contains SQL queries for updating an SQL database
with the information from the detail data file.
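Parsers for the summary and detail formats might look like the sketch
below. It assumes that the botcaps value is written as a decimal
integer and that each detail record ends with a blank line; neither
detail is spelled out above, so check them against real data. Header
field lines are kept as raw strings because their exact layout is not
described here. The function names are hypothetical.

```python
def parse_summary(text):
    """Map each IP address to its botcaps value (one pair per line)."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        ip, value = line.split(" ", 1)
        result[ip] = int(value)  # assumption: botcaps is a decimal integer
    return result


def parse_detail(text):
    """Return a list of (ip, header_lines) records from the detail format."""
    records = []
    # Normalise line endings, then split on the blank line between records.
    for block in text.replace("\r\n", "\n").split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if lines:
            records.append((lines[0].strip(), lines[1:]))
    return records
```

The summary format is enough for a simple allow or deny decision; the
detail format is only needed when the individual header fields matter.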
What should my web server do when it denies access
to a suspected machine agent?
It depends on how you detected the suspicious
activity.
(1) If you are denying access based on information contained
in our database, we recommend that you redirect the request
to the botslisted page on our server. This page explains why
access is being denied and also allows visitors to remove
themselves from our database if they turn out to be genuine
users instead of machine agents.
(2) If you suspect machine agent activity based on information
that is NOT from our database, we recommend that you redirect
the request to any page that does not exist on our server, for
example, "/badbots". If the redirection is followed, it will
cause the suspected machine agent to be added to our database
automatically.
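The choice between the two redirects can be sketched as a small
helper. This is an illustration only: the function name is
hypothetical, both target URLs are supplied by the caller because the
exact addresses are not given here, and 302 is simply one of the 3xx
redirect codes chosen for the example.

```python
def denial_redirect(listed_in_database, botslisted_url, unlisted_url):
    """Pick the redirect target for a denied request.

    Both URLs are passed in by the caller; none is hard-coded because
    the exact addresses are not stated in the text.
    """
    target = botslisted_url if listed_in_database else unlisted_url
    # 302 is used for illustration; it is one of the 3xx redirect codes.
    return "302 Found", [("Location", target), ("Content-Length", "0")]
```

The returned status line and header list have the shape expected by a
WSGI start_response call, so the helper can be dropped into a small
WSGI application.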
I refreshed my browser and now I'm listed as a
machine agent; or I just visited your website and got registered
in your database, why?
Our detection software is very unforgiving by
design. The problem is that different browsers behave differently
for any given action, such as a refresh. If a browser does
something that a machine agent might do, such as refusing to
follow a redirection or not returning our cookies correctly,
it will get listed. If this was an error, the user will have
a chance to correct it. Our system is designed to err on the
side of safety because the bad guys have gotten really sophisticated
over the last few years.
Can I list or delist a third party IP address in
your database?
Sorry, you can't. We provide no way for anyone
to cause a third party's IP address to be listed or delisted
unless that party connects to a botslist server. This is necessary
in order to avoid abuse and legal complications.
There is also no backdoor in our system to allow anyone, including
our staff, to delist an IP address from our database. Delisting
can only be done by someone connecting from the listed IP address.