Difference between revisions of "Robots txt"

From TNG_Wiki
(added Security Related Links)
 

Latest revision as of 13:36, 4 February 2015

The following explanation of the Robots Exclusion Protocol is extracted from the Robots txt.org site. Link provided by BruceM on User2 list 6/30/2009 10:34 PM

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:


User-agent: *
Disallow: /


The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit.

So don't try to use /robots.txt to hide information. Note that malware or email harvesting bots will ignore the directives of the robots.txt file.
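
The check that a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration and is not taken from the Robots txt.org site: the helper name is_allowed and the robot name "ExampleBot" are assumptions made for the example, and the rules are parsed from a string rather than fetched from a live server.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    is_allowed is a hypothetical helper written for this illustration.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a made-up robot name used only for illustration.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html"))
```

With the rules shown, the call reports that the page may not be fetched. Nothing in the protocol enforces that answer: the robot itself has to run a check like this and choose to honor it, which is why robots.txt offers no protection against robots that skip the check.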


Related links

Robots txt.org site

The following provide additional security measures:

Controlling Site Access

Protecting Resources

Checking your site for Malware