This section provides some simple suggestions that will make your
documents more accessible to search engines.
- Define the document language
- In the global context of the Web it is important to know which
human language a page was written in. This is discussed in the section
on
language information.
- Specify language variants of this document
- If you have prepared translations of this document into other
languages, you should use the
LINK element to reference these. This
allows an indexing engine to offer users search results in the user's
preferred language, regardless of how the query was written. For
instance, the following links offer French and German alternatives to
a search engine:
<LINK rel="alternate"
type="text/html"
href="mydoc-fr.html" hreflang="fr"
lang="fr" title="La vie souterraine">
<LINK rel="alternate"
type="text/html"
href="mydoc-de.html" hreflang="de"
lang="de" title="Das Leben im Untergrund">
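As a rough illustration of how an indexing engine might read these links, the following Python sketch collects each alternate by its hreflang value. The class and function names are hypothetical, and real engines may use these attributes quite differently.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect LINK elements with rel="alternate" and an hreflang attribute."""

    def __init__(self):
        super().__init__()
        self.alternates = {}  # maps language code -> href

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "alternate" in rel and attr.get("hreflang") and attr.get("href"):
            self.alternates[attr["hreflang"].lower()] = attr["href"]


def alternates_for(html_text):
    """Return a dict of language code -> URI found in html_text."""
    parser = AlternateLinkParser()
    parser.feed(html_text)
    return parser.alternates


SAMPLE = """
<LINK rel="alternate" type="text/html"
      href="mydoc-fr.html" hreflang="fr"
      lang="fr" title="La vie souterraine">
<LINK rel="alternate" type="text/html"
      href="mydoc-de.html" hreflang="de"
      lang="de" title="Das Leben im Untergrund">
"""

print(alternates_for(SAMPLE))
```

A search front end could then look up the user's preferred language in the returned mapping before falling back to the original document.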
- Provide keywords and descriptions
- Some indexing engines look for
META elements that define a
comma-separated list of keywords/phrases, or that give a short
description. Search engines may present these keywords as the result
of a search. The value of the
name attribute sought by a search
engine is not defined by this specification. Consider these examples:
<META name="keywords" content="vacation,Greece,sunshine">
<META name="description" content="Idyllic European vacations">
- Indicate the beginning of a collection
- Collections of word processing documents or presentations are
frequently translated into collections of HTML documents. It is
helpful for search results to reference the beginning of the
collection in addition to the page hit by the search. You may help
search engines by using the
LINK element with rel="start"
along with the
title attribute, as in:
<LINK rel="start"
type="text/html"
href="page1.html"
title="General Theory of Relativity">
- Provide robots with indexing instructions
- People may be surprised to find that an indexing robot has indexed
their site, including a sensitive part that the robot should not have
been permitted to visit. Many Web robots offer
facilities for Web site administrators and content providers to limit
what the robot does. This is achieved through two mechanisms: a
"robots.txt" file and the
META element in HTML documents,
described below.
The robots.txt file
When a Robot visits a Web site, say http://www.foobar.com/, it first
checks for http://www.foobar.com/robots.txt. If it can find this
document, it will analyze its contents to see if it is allowed to
retrieve the document. You can customize the robots.txt file to apply
only to specific robots, and to disallow access to specific directories
or files.
Here is a sample robots.txt file that prevents all robots from
visiting the entire site:
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
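One way to check the effect of such a file is with the robots.txt parser in Python's standard library. This is only a sketch: the robot name "ExampleBot" is made up, and urllib.robotparser implements its own reading of the conventions, which may differ in detail from any particular robot.

```python
from urllib.robotparser import RobotFileParser

# The sample file from above, including its trailing comments.
ROBOTS_TXT = """\
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical robot name used only for illustration.
allowed = parser.can_fetch("ExampleBot", "http://www.foobar.com/index.html")
print(allowed)  # False: every path starts with "/"
```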
The Robot will simply look for a "/robots.txt" URI on your site,
where a site is defined as an HTTP server running on a particular host
and port number. Here are some sample locations for robots.txt:
| Site URI | URI for robots.txt |
| http://www.w3.org/ | http://www.w3.org/robots.txt |
| http://www.w3.org:80/ | http://www.w3.org:80/robots.txt |
| http://www.w3.org:1234/ | http://www.w3.org:1234/robots.txt |
| http://w3.org/ | http://w3.org/robots.txt |
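The mapping in the table can be sketched in a few lines of Python. The function name robots_txt_uri is hypothetical. The sketch keeps the scheme, host, and port exactly as given and replaces everything after them with "/robots.txt".

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_uri(page_uri):
    """Return the /robots.txt URI for the site that serves page_uri."""
    parts = urlsplit(page_uri)
    # Keep scheme and host[:port] as given; replace the path and drop
    # any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for site in ("http://www.w3.org/",
             "http://www.w3.org:80/",
             "http://www.w3.org:1234/",
             "http://w3.org/"):
    print(site, "->", robots_txt_uri(site))
```

Because the host and port are left untouched, http://www.w3.org/ and http://w3.org/ yield different robots.txt URIs, matching the definition of a site given above.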
There can only be a single "/robots.txt" on a site. Specifically, you
should not put "robots.txt" files in user directories, because a robot
will never look at them. If you want your users to be able to create
their own "robots.txt", you will need to merge them all into a single
"/robots.txt". If you don't want to do this, your users might want to use
the Robots META Tag instead.
Some tips: URIs are case-sensitive, and the "/robots.txt" string must be
all lower-case. Blank lines are not permitted within a single record in
the "robots.txt" file.
There must be exactly one "User-agent" field per record. The robot
should be liberal in interpreting this field. A case-insensitive
substring match of the name without version information is recommended.
If the value is "*", the record describes the default access policy
for any robot that has not matched any of the other records. It is not
allowed to have multiple such records in the "/robots.txt" file.
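The matching rule for "User-agent" might be sketched as follows. The representation of records as (user-agent value, disallow list) pairs and the name select_record are assumptions made for illustration. The sketch also assumes that the field value is looked for inside the robot's name, which is one plausible reading of "substring match".

```python
def select_record(records, robot_name):
    """Pick the record that applies to robot_name, or None if none does.

    records is a list of (user_agent_value, disallow_list) pairs.
    """
    # Drop version information such as "/3.0" and ignore case.
    name = robot_name.split("/")[0].lower()
    default = None
    for agent, disallows in records:
        if agent == "*":
            # Remember the default policy, but keep looking for a
            # record that names this robot specifically.
            default = (agent, disallows)
        elif agent and agent.lower() in name:
            return (agent, disallows)
    return default


records = [("*", ["/"]), ("webcrawler", [])]
print(select_record(records, "WebCrawler/3.0"))  # ('webcrawler', [])
print(select_record(records, "OtherBot/1.0"))    # ('*', ['/'])
```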
The "Disallow" field specifies a partial URI that is not to be
visited. This can be a full path, or a partial path; any URI that starts
with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
An empty value for "Disallow" indicates that all URIs can be
retrieved. At least one "Disallow" field must be present in the
robots.txt file.
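The prefix rule for "Disallow" can be illustrated with a short Python sketch. The function name is_allowed is hypothetical, and the sketch covers only the matching described above.

```python
def is_allowed(path, disallow_values):
    """Return True if path may be retrieved under the given Disallow values."""
    for prefix in disallow_values:
        # An empty value disallows nothing, so skip it.
        if prefix and path.startswith(prefix):
            return False
    return True


print(is_allowed("/help.html", ["/help"]))         # False
print(is_allowed("/help/index.html", ["/help"]))   # False
print(is_allowed("/help/index.html", ["/help/"]))  # False
print(is_allowed("/help.html", ["/help/"]))        # True
print(is_allowed("/anything.html", [""]))          # True
```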
Robots and the META element
The
META element allows HTML authors to tell
visiting robots whether a document may be indexed, or used to harvest
more links. No server administrator action is required.
In the following example, a robot should neither index this document
nor analyze it for links:
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
The list of terms in the content is ALL, INDEX,
NOFOLLOW, NOINDEX.
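A robot might interpret such a META element roughly as in the Python sketch below. The class and function names are hypothetical. The sketch assumes that indexing and link following are both permitted unless NOINDEX or NOFOLLOW appears, so ALL and INDEX simply leave those defaults in place. That default is an assumption and is not stated in this section.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Read <META name="ROBOTS" content="..."> and derive two flags."""

    def __init__(self):
        super().__init__()
        self.may_index = True   # assumed default when no directive is given
        self.may_follow = True  # assumed default when no directive is given

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        terms = {term.strip().upper() for term in content.split(",")}
        if "NOINDEX" in terms:
            self.may_index = False
        if "NOFOLLOW" in terms:
            self.may_follow = False


def robots_flags(html_text):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.may_index, parser.may_follow


print(robots_flags('<META name="ROBOTS" content="NOINDEX, NOFOLLOW">'))
```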
Note. As of early 1997, only a few robots implement
this, but this is expected to change as more public attention is given
to controlling indexing robots.
Copyright © 1999 World
Wide Web Consortium, (Massachusetts
Institute of Technology, Institut
National de Recherche en Informatique et en Automatique,
Keio University). All Rights
Reserved. http://www.w3.org/Consortium/Legal