Consider a trivial CGI script called echo.cgi:
#!/usr/bin/perl -T
use strict;
use warnings;
use CGI::Simple;
my $cgi = CGI::Simple->new;
my $name = $cgi->param('name');
my $name_escaped = $cgi->escapeHTML($name);
print "Content-Type: text/html\n\n";
print <<EOF;
<!DOCTYPE html>
<title>Greetings</title>
<p>Hello, $name_escaped!
EOF
Run echo.cgi?name=Philip and it works fine. Since we use $cgi->escapeHTML, there's no danger of cross-site scripting (XSS) attacks: try echo.cgi?name=<script>alert('oops')</script> and the script will be harmlessly displayed as plain text.
That's okay for ASCII, but we live in a Unicode world. Try echo.cgi?name=金, and you'll probably get "Hello, é‡‘!": the web browser encodes the character "金" as UTF-8 when sending to the server, but defaults to decoding the response as Windows-1252 (a superset of ISO-8859-1). The script needs to output Content-Type: text/html; charset=UTF-8 to tell the web browser exactly what the character encoding is.
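To see where the garbled greeting comes from, here is a minimal sketch using only the core Encode module. It encodes "金" as UTF-8, the way the browser sends it, then decodes the same bytes as Windows-1252, the way the browser reads an unlabelled response. The variable names are illustrative and not part of echo.cgi.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "金" is U+91D1; as UTF-8 it becomes the three bytes 0xE9 0x87 0x91.
my $bytes = encode('UTF-8', "\x{91D1}");

# With no charset declared, the browser reads those bytes as Windows-1252.
my $misread = decode('cp1252', $bytes);

# Print the resulting code points rather than the characters themselves.
print join(' ', map { sprintf 'U+%04X', ord } split //, $misread), "\n";
# prints: U+00E9 U+2021 U+2018

# The fix described in the text: declare the charset in the header.
my $header = "Content-Type: text/html; charset=UTF-8\n\n";
```

One character goes in and three unrelated characters come out, which is why the header has to name the encoding explicitly.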
But UTF-8 isn't the only encoding in the world, so let's let the user choose whatever output encoding they prefer:
#!/usr/bin/perl -T
use strict;
use warnings;
use CGI::Simple;
use Encode;
my $cgi = CGI::Simple->new;
my $name = decode('UTF-8', $cgi->param('name'));
my $name_escaped = $cgi->escapeHTML($name);
# Get the specified encoding (default UTF-8)
my $encoding = $cgi->param('enc') || 'UTF-8';
# Make sure the specified encoding is valid and exists
if (not ($encoding =~ /^[a-zA-Z0-9-]+$/
         and find_encoding($encoding))) {
    # A blank line must separate the CGI headers from the body
    print <<EOF;
Status: 500
Content-Type: text/plain

Invalid output encoding.
EOF
    exit;
}
print "Content-Type: text/html; charset=$encoding\n\n";
my $response = <<EOF;
<!DOCTYPE html>
<title>Greetings</title>
<p>Hello, $name_escaped!
EOF
print encode($encoding, $response);
Now you can run echo.cgi?name=金;enc=EUC-KR and the output will be encoded into the byte sequence 0xD1 0xD1 (the EUC-KR representation of "金"). The web browser decodes it as EUC-KR, everything is fine, and you've encoded the character with one byte fewer than in UTF-8. Perfect!
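The size difference is easy to check. This small sketch, which assumes the EUC-KR support that ships with the core Encode distribution, prints the byte length and hex bytes of "金" in both encodings:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{91D1}";    # the character "金"

for my $enc ('UTF-8', 'EUC-KR') {
    my $bytes = encode($enc, $char);
    my $hex   = join ' ', map { sprintf '0x%02X', ord } split //, $bytes;
    printf "%-6s %d bytes: %s\n", $enc, length($bytes), $hex;
}
```

The EUC-KR line should show two bytes, against three for UTF-8.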
If you now run echo.cgi?name=%14%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E%2F%25%C3%81%C3%8A%C3%88%C2%88%1B%3F%3F%C3%B8%C3%8B%1B%C2%89%14%07%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E;enc=CP1047 in a browser like Firefox 3.5, it will execute a script from the URL. We've evidently introduced an XSS vulnerability, even though our CGI script is correctly decoding and escaping and encoding all its text.
The name bytes 0x14 0xC3 0x8B … are decoded as UTF-8 into the characters U+0014 U+00CB …. Any dangerous characters are escaped in escapeHTML, e.g. "<" (U+003C) becomes "&lt;", but we don't have any of those characters here so nothing gets escaped. Now the text is encoded as CP1047, an EBCDIC encoding that is very different to ASCII. U+0014 U+00CB … encodes into the bytes 0x3C 0x73 ….
Those bytes are sent to the browser. But Firefox doesn't support the CP1047 encoding, so it falls back on its default of Windows-1252. 0x3C 0x73 then decodes into the characters "<s", the start of an unescaped script tag.
(Internet Explorer does support CP1047, so it will decode 0x3C 0x73 into the harmless characters U+0014 U+00CB instead. Web browsers are not required to support (or to ignore) these encodings, and are not doing anything wrong here. The only bug is on the server, which uses encodings that are dangerous when not supported by clients.)
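The steps above can be replayed outside a web server. The following is a sketch only: escape_html is a hypothetical stand-in for $cgi->escapeHTML (so the example needs nothing beyond core Perl), and the percent-decoding is done by hand instead of by CGI::Simple. It assumes Encode's cp1047 table matches the mapping described in the text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for $cgi->escapeHTML: escapes the usual markup characters.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# The "name" parameter from the attack URL, still percent-encoded.
my $param = '%14%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E%2F%25%C3%81'
          . '%C3%8A%C3%88%C2%88%1B%3F%3F%C3%B8%C3%8B%1B%C2%89%14%07'
          . '%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E';

# Undo the percent-encoding to recover the raw request bytes.
(my $raw = $param) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

my $name    = decode('UTF-8', $raw);        # U+0014 U+00CB ...
my $escaped = escape_html($name);           # no markup characters to escape
my $sent    = encode('cp1047', $escaped);   # the bytes that go on the wire

print "escaping changed the text: ", ($escaped eq $name ? "no" : "yes"), "\n";
print "bytes as an ASCII-based browser reads them: $sent\n";
```

Every stage behaves correctly in isolation, yet the final byte string begins with a script tag as soon as it is read as anything ASCII-compatible.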
Run echo.cgi?name=숍訊昱穿刷奄剔㏆穽侘㈊섞昌侄從쒜;enc=ISO-2022-KR in Google Chrome 2.0, and the same story applies. ISO-2022-KR encodes the first character "숍" into the bytes 0x3C 0x73 (preceded by 0x0E to shift into multi-byte Korean mode). Chrome doesn't support ISO-2022-KR, so it will decode as Windows-1252 and execute the script.
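The same check works for the ISO-2022-KR case. This sketch encodes only the first character of the payload and dumps the bytes, assuming the ISO-2022-KR encoder bundled with Encode. The output also contains escape and shift sequences around the two payload bytes, so the code searches for the shift-out byte followed by "<s" instead of comparing the whole string.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $first = "\x{C20D}";    # the character "숍"
my $bytes = encode('iso-2022-kr', $first);

# Dump every byte, including the designator and shift sequences.
print join(' ', map { sprintf '0x%02X', ord } split //, $bytes), "\n";

# 0x0E (shift out) followed by 0x3C 0x73, i.e. "<s" to an ASCII-based decoder.
my $has_tag_start = index($bytes, "\x0E<s") >= 0;
print "contains '<s' after shift-out: ", ($has_tag_start ? "yes" : "no"), "\n";
```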
Just use UTF-8, always. It saves a whole lot of bother. Use gzip compression if you're concerned about bandwidth usage of UTF-8 for non-English languages.
If you really want to support multiple encodings, restrict it to a short whitelist of acceptable encodings (perhaps UTF-8, Windows-1252, ISO-8859-1, Shift_JIS, GB2312, Big5, EUC-JP, EUC-KR, …), and absolutely avoid any encodings where markup characters (<, ", ', …) can be encoded as different bytes than in their ASCII encoding, or where those bytes can occur in the encoding of any other character. This means avoid UTF-7, all EBCDIC encodings (CP1047, CP037, …), ISO-2022-* (ISO-2022-KR, ISO-2022-JP, ISO-2022-CN, …), JOHAB, SCSU, BOCU-1, and possibly others.
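A whitelist like this could be sketched as follows. The helper names pick_encoding and keeps_ascii_markup, and the exact contents of %ALLOWED, are hypothetical. The second helper tests only the first condition (markup characters keep their ASCII bytes); it does not detect encodings where those bytes also appear inside other characters, so it is a sanity check on the list, not a replacement for it.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical whitelist: lower-cased label => charset name to send in the header.
my %ALLOWED = map { ( lc($_) => $_ ) }
    qw(UTF-8 Windows-1252 ISO-8859-1 Shift_JIS EUC-JP EUC-KR);

# Returns the charset to use, or undef if the request is not on the whitelist.
sub pick_encoding {
    my ($requested) = @_;
    return 'UTF-8' unless defined $requested && length $requested;
    return $ALLOWED{ lc($requested) };
}

# True if the encoding leaves the markup characters as their ASCII bytes.
# Necessary but not sufficient: it says nothing about multi-byte sequences.
sub keeps_ascii_markup {
    my ($enc) = @_;
    my $markup = q{<>"'&};
    my $bytes  = eval { encode($enc, $markup) };    # unknown names make encode die
    return ( defined($bytes) && $bytes eq $markup ) ? 1 : 0;
}

for my $enc (qw(UTF-8 EUC-KR cp1047 iso-2022-kr)) {
    printf "%-12s allowed=%-3s ascii-markup=%s\n",
        $enc,
        ( defined pick_encoding($enc) ? 'yes' : 'no' ),
        ( keeps_ascii_markup($enc)    ? 'yes' : 'no' );
}
```

In echo.cgi, the call to find_encoding would then give way to a lookup such as pick_encoding($cgi->param('enc')), with the error response sent whenever it returns undef.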
Yahoo Search was vulnerable to the ISO-2022-KR attack (only affecting Chrome users); it was reported to them on 2009-06-29 and fixed the next day.
Ultraseek ("Enterprise Search Engine", hosted on many sites) is vulnerable to an EBCDIC attack.