URL
Several HTML elements, most notably the A
element, may contain an attribute which takes a URL as value.
URLs, Uniform Resource Locators, are addresses of Web documents.
More generally, URLs can be used on the Web to refer to
"objects" on the Web or in other information systems.
The general syntax of absolute URLs is the following:
scheme://host:port/path/filename
where
- scheme
- specifies the information system (technically speaking,
the protocol) to be used to access the resource;
possible values include the following:
http | a Web document (to be accessed using Hypertext Transfer Protocol, HTTP)
ftp | a resource to be retrieved using FTP (File Transfer Protocol), usually a file in a so-called FTP server
file | a file on a particular computer; a file URL is hardly useful on the Web
gopher | a file in a Gopher server
mailto | electronic mail address
news | a newsgroup or an article in Usenet news
telnet | for starting an interactive session via the Telnet protocol (which is part of TCP/IP)
- host
- is the Internet host name in the domain notation, e.g. www.hut.fi (or sometimes a numerical IP address); notice that typically, but not necessarily, Web servers have domain names starting with www
- :port
- is the port number part, which can usually be omitted
since it has a reasonable default; that is, omit it,
unless it is a part of a URL which you got somewhere (or
you really know what you are doing)
- path
- is a directory path within the host
- filename
- is a file name within the directory.
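The decomposition described above can be sketched with Python's standard urllib.parse module. This is only an illustration: the helper name describe_url and the sample URL are made up for this example, and the split of the path into directory and file name is an assumption based on the last slash.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split an absolute URL into the parts named in the text above."""
    parts = urlsplit(url)
    # Assumption: everything after the last "/" in the path is the file name.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "path": directory + "/",
        "filename": filename,
    }


# Hypothetical URL, used only to exercise the function.
info = describe_url("http://www.hut.fi:80/home/docs/index.html")
for name, value in info.items():
    print(name, "=", value)
```

When the port is left out of the URL, the port entry is None, which matches the advice to rely on the default.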
Warning: Although many browsers allow you to omit the part http:// when specifying the URL of a document to be visited, you must not omit it when writing a normal URL into an HTML document. (Otherwise browsers will try to interpret it as a relative URL.)
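The effect described in the warning can be reproduced with urljoin from Python's standard library, which resolves a reference against a base URL in much the same way a browser does. The base URL below is a hypothetical document address chosen for the example.

```python
from urllib.parse import urljoin

# Hypothetical address of the HTML document that contains the link.
base = "http://example.org/docs/page.html"

# Without the scheme, the reference is treated as a relative URL.
resolved_relative = urljoin(base, "www.hut.fi/index.html")

# With the scheme, the reference is absolute and the base is ignored.
resolved_absolute = urljoin(base, "http://www.hut.fi/index.html")

print(resolved_relative)
print(resolved_absolute)
```

The first result points into the docs directory of example.org, which is almost certainly not what the author of the link intended.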
Actually, this pattern is mainly for Web documents, i.e. http URLs. For other URLs, simplifications and special interpretations are applied. For example, a mailto URL is just of the form mailto:address where address is a normal Internet E-mail address like Jukka.Korpela@hut.fi (as specified in RFC 822). Please notice that appending anything to the E-mail address in a mailto URL is nonstandard and may result in lost mail without anyone noticing! (See also the discussion of mailto: URLs in the description of the A element.)
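A simple check for the plain mailto:address form might look like the following sketch. The function name is_plain_mailto is invented for this example, and the test for an @ sign is only a rough stand-in for real RFC 822 address validation.

```python
from urllib.parse import urlsplit


def is_plain_mailto(url):
    """Return True if url is mailto:address with nothing appended."""
    parts = urlsplit(url)
    return (
        parts.scheme == "mailto"
        and "@" in parts.path      # rough check only, not full RFC 822 parsing
        and not parts.query        # nothing appended after "?"
        and not parts.fragment     # nothing appended after "#"
    )


print(is_plain_mailto("mailto:Jukka.Korpela@hut.fi"))
print(is_plain_mailto("mailto:Jukka.Korpela@hut.fi?subject=Hello"))
```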
An http URL can also be followed by a fragment identifier. The complete reference then consists of an absolute URL, the # sign and a name (which refers to a location within the document specified by the absolute URL). See the description of the A element for more information.
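As a small illustration, Python's urldefrag separates the absolute URL from the name after the # sign. The URL and the fragment name intro are hypothetical.

```python
from urllib.parse import urldefrag

# Hypothetical reference to a named location within a document.
reference = "http://www.hut.fi/docs/guide.html#intro"

# urldefrag returns the URL without the fragment, and the fragment itself.
document_url, fragment = urldefrag(reference)

print(document_url)
print(fragment)
```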
It is safest to enclose URLs in quotes
when writing them as attribute values in HTML.
For an overview of URLs, see W3C
material on addressing.
As regards the technical specifications of the syntax of URLs, see RFC 1738 (absolute URLs) and RFC 1808 (relative URLs).
In particular, the specifications say
that within a URL only a limited set of characters can be
used as such:
- alphanumeric characters (A to Z, a to z, 0 to 9)
- the characters $-_.+!*'(),
- the characters ;/?:@=&# provided that they are used in the special meaning reserved for them in the RFCs mentioned above.
Other characters must be encoded. (The characters ;/?:@=&# must also be encoded, if they are not used in the special meaning.) This encoding (which is defined by URL specifications, not HTML specifications) consists of using the percent sign followed by two hexadecimal digits, representing the character code. For example, tilde (~) should be presented as %7E and space as %20. (Violating the rules is much more likely to cause problems in the latter case than in the former.)
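The encoding rules above can be sketched as a small Python function. This is a hand-written illustration and not a library routine: the names encode_component, UNRESERVED and RESERVED are invented here, and encoding non-ASCII characters through their UTF-8 bytes is an assumption that the text above does not address. (Python's own urllib.parse.quote follows later specifications and leaves the tilde unencoded, so it is not used here.)

```python
import string

# Characters that may be used as such, per the list in the text above.
UNRESERVED = set(string.ascii_letters + string.digits + "$-_.+!*'(),")
# Characters allowed only when used in their special (reserved) meaning.
RESERVED = set(";/?:@=&#")


def encode_component(text, keep_reserved=False):
    """Percent-encode every character that may not appear as such."""
    allowed = UNRESERVED | RESERVED if keep_reserved else UNRESERVED
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
        else:
            # Assumption: non-ASCII characters are encoded byte by byte as UTF-8.
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(out)


print(encode_component("~jkorpela"))
print(encode_component("my file.html"))
print(encode_component("a&b"))
print(encode_component("a&b", keep_reserved=True))
```

Pass keep_reserved=True only for text where the characters ;/?:@=&# are meant in their special meaning, for example a complete query string.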
When a URL occurs as an attribute value in HTML, there is another
complication caused by the & character which
may have special use in query form
submissions. In principle, that character should be escaped as &amp;amp; or as &amp;#38;
(there is a
footnote in the HTML 2.0
specification about this) and browsers should process it so that
the actual URL passed to the processing CGI
script has that notation replaced by a plain & character. (Notice that it must not be percent-encoded, since it is used here in its special meaning. This is a confusing issue, and CGI scripts should really be written so that semicolon ; and not ampersand & is used as field separator.)
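The escaping step can be illustrated with Python's standard html module. The query URL below is hypothetical. The unescape call stands in for what a browser is expected to do before it passes the URL on.

```python
from html import escape, unescape

# Hypothetical URL of a query submission with two fields.
url = "http://example.org/cgi-bin/search?name=korpela&lang=en"

# Escape & (and quotes) before writing the URL as an attribute value.
attribute_value = escape(url, quote=True)
anchor = '<A HREF="{}">search</A>'.format(attribute_value)

# A browser reverses the escaping before requesting the URL.
restored = unescape(attribute_value)

print(anchor)
print(restored)
```

Both notations mentioned above are reversed the same way: unescape turns the named form and the numeric form back into a plain & character.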