XSS vulnerabilities with unusual character encodings

Pre-introduction

This article assumes some understanding of bytes, characters, Unicode, and encodings. (See the absolute minimum …)

Introduction

Consider a trivial CGI script called echo.cgi:

#!/usr/bin/perl -T
use strict;
use warnings;
use CGI::Simple;
my $cgi = CGI::Simple->new;

my $name = $cgi->param('name');
my $name_escaped = $cgi->escapeHTML($name);

print "Content-Type: text/html\n\n";

print <<EOF;
<!DOCTYPE html>
<title>Greetings</title>
<p>Hello, $name_escaped!
EOF

Run echo.cgi?name=Philip and it works fine. Since we use $cgi->escapeHTML there's no danger of cross-site scripting (XSS) attacks: try echo.cgi?name=<script>alert('oops')</script> and the script will be harmlessly displayed as plain text.

That's okay for ASCII, but we live in a Unicode world. Try echo.cgi?name=金, and you'll probably get "Hello, é‡‘!" – the web browser encodes the character "金" as UTF-8 when sending to the server, but defaults to decoding the response as Windows-1252 (a superset of ISO-8859-1). The script needs to output Content-Type: text/html; charset=UTF-8 to tell the web browser exactly what the character encoding is.
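The mix-up can be reproduced without a browser. This is only a small sketch using the core Encode module: it encodes "金" (U+91D1) as UTF-8 and then deliberately decodes those bytes as Windows-1252 (called cp1252 in Encode), which is what the browser does when no charset is given.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

# U+91D1 is the character "金" from the example above.
my $char = "\x{91D1}";

# What the browser sends: the character encoded as UTF-8 bytes.
my $bytes = encode('UTF-8', $char);

# What the browser shows if it decodes those bytes as Windows-1252:
# every byte becomes a separate character.
my $misread = decode('cp1252', $bytes);

my $hex = join(' ', map { sprintf('0x%02X', ord($_)) } split(//, $bytes));
print "bytes:   $hex\n";
print "misread: $misread\n";
```

The three UTF-8 bytes come out as the three characters seen in the broken greeting.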

But UTF-8 isn't the only encoding in the world, so let's let the user choose whatever output encoding they prefer:

#!/usr/bin/perl -T
use strict;
use warnings;
use CGI::Simple;
use Encode;
my $cgi = CGI::Simple->new;

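# Force scalar context so repeated name parameters can't leak into decode's arguments
my $name = decode('UTF-8', scalar($cgi->param('name')) // '');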
my $name_escaped = $cgi->escapeHTML($name);

# Get the specified encoding (default UTF-8)
my $encoding = $cgi->param('enc') || 'UTF-8';

# Make sure the specified encoding is valid and exists
if (not ($encoding =~ /^[a-zA-Z0-9-]+$/
         and find_encoding($encoding))) {
    print <<EOF;
Status: 500
Content-Type: text/plain

Invalid output encoding.
EOF
    exit;
}

print "Content-Type: text/html; charset=$encoding\n\n";

my $response = <<EOF;
<!DOCTYPE html>
<title>Greetings</title>
<p>Hello, $name_escaped!
EOF

print encode($encoding, $response);

Now you can run echo.cgi?name=金;enc=EUC-KR and the output will be encoded into the byte sequence 0xD1 0xD1 (the EUC-KR representation of "金"). The web browser decodes it as EUC-KR, and everything is fine, and you've encoded the character with one byte fewer than in UTF-8. Perfect!

The problem

If you now run echo.cgi?name=%14%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E%2F%25%C3%81%C3%8A%C3%88%C2%88%1B%3F%3F%C3%B8%C3%8B%1B%C2%89%14%07%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E;enc=CP1047 in a browser like Firefox 3.5, it will execute a script from the URL. We've evidently introduced an XSS vulnerability, even though our CGI script is correctly decoding and escaping and encoding all its text.

The name bytes 0x14 0xC3 0x8B … are decoded as UTF-8 into the characters U+0014 U+00CB …. Any dangerous characters are escaped in escapeHTML, e.g. "<" (U+003C) becomes "&lt;", but we don't have any of those characters here so nothing gets escaped. Now the text is encoded as CP1047 – an EBCDIC encoding, very different to ASCII. U+0014 U+00CB … encodes into the bytes 0x3C 0x73 ….

Those bytes are sent to the browser. But Firefox doesn't support the CP1047 encoding, so it falls back on its default of Windows-1252. 0x3C 0x73 then decodes into the characters "<s" – the start of an unescaped script tag.
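The whole chain can be replayed offline. The sketch below makes a few assumptions. escape_html is a hypothetical stand-in for $cgi->escapeHTML, so that nothing beyond the core Encode module is needed. The attack text is built by decoding the desired ASCII bytes as CP1047, which is one way an attacker might construct it. The browser's fallback is modelled by decoding as cp1252.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for $cgi->escapeHTML: escapes the usual
# markup characters and nothing else.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                  '"' => '&quot;', "'" => '&#39;');
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Attacker side: find the characters whose CP1047 encoding is the
# ASCII byte sequence of a script tag.
my $payload = "<script>alert('oops')</script>";
my $name    = decode('cp1047', $payload);

# Server side: escaping finds nothing to escape...
my $escaped = escape_html($name);

# ...and encoding to CP1047 reproduces the payload bytes.
my $sent = encode('cp1047', $escaped);

# Browser side, assuming a fallback to Windows-1252.
my $seen = decode('cp1252', $sent);

printf "first characters: U+%04X U+%04X\n",
    ord(substr($name, 0, 1)), ord(substr($name, 1, 1));
print "escaping changed the text: ", ($escaped eq $name ? "no" : "yes"), "\n";
print "browser sees: $seen\n";
```

The escaping step is working as designed. It just runs on characters that only turn into markup after the final encode.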

(Internet Explorer does support CP1047, so it will decode 0x3C 0x73 into the harmless characters U+0014 U+00CB instead. Web browsers are not required to support (or to ignore) these encodings, and are not doing anything wrong here – the only bug is on the server, using encodings that are dangerous when not supported by clients.)

The problem again

Run echo.cgi?name=숍訊昱穿刷奄剔㏆穽侘㈊섞昌侄從쒜;enc=ISO-2022-KR in Google Chrome 2.0, and the same story applies. ISO-2022-KR encodes the first character "숍" into the bytes 0x3C 0x73 (preceded by 0x0E to shift into multi-byte Korean mode). Chrome doesn't support ISO-2022-KR, so it will decode as Windows-1252 and execute the script.
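To see the bytes involved, here is a small sketch that encodes just the first character with the core Encode module and dumps the result in hex. Decoding as cp1252 is only my stand-in for what Chrome does with an encoding it does not recognise.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+C20D is "숍", the first character of the attack string above.
my $char  = "\x{C20D}";
my $bytes = encode('ISO-2022-KR', $char);

# Dump every byte in hex; look for 3C 73 right after the shift byte 0E.
my $hex = join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));
print "$hex\n";

# A browser that falls back on Windows-1252 reads the two bytes after
# the shift byte as plain ASCII.
my ($pair) = $bytes =~ /\x0E(..)/s;
print decode('cp1252', $pair), "\n";
```

The same pattern continues for the rest of the string: each Korean character contributes two ASCII-range bytes of the attacker's choosing.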

The solution

Just use UTF-8, always. It saves a whole lot of bother. Use gzip compression if you're concerned about bandwidth usage of UTF-8 for non-English languages.

If you really want to support multiple encodings, restrict it to a short whitelist of acceptable encodings (perhaps UTF-8, Windows-1252, ISO-8859-1, Shift_JIS, GB2312, Big5, EUC-JP, EUC-KR, …), and absolutely avoid any encodings where markup characters (<, ", ', …) can be encoded as different bytes than in their ASCII encoding, or where those bytes can occur in the encoding of any other character. This means avoid UTF-7, all EBCDIC encodings (CP1047, CP037, …), ISO-2022-* (ISO-2022-KR, ISO-2022-JP, ISO-2022-CN, …), JOHAB, SCSU, BOCU-1, and possibly others.
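A whitelist check might look something like the sketch below. The function name choose_encoding, the %ALLOWED table and the particular encodings in it are my own illustrative choices, not a vetted list. The idea is to look the requested name up in a fixed table instead of asking Encode whether the encoding merely exists.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical whitelist: lower-cased name => label to send in the
# charset parameter.
my %ALLOWED = map { (lc($_) => $_) }
    qw(UTF-8 Windows-1252 ISO-8859-1 Shift_JIS EUC-JP EUC-KR);

# Return the whitelisted label for the requested encoding, or UTF-8 if
# the request is missing, not on the list, or unknown to Encode.
sub choose_encoding {
    my ($requested) = @_;
    return 'UTF-8' unless defined $requested;
    my $label = $ALLOWED{ lc $requested };
    return 'UTF-8' unless defined $label && find_encoding($label);
    return $label;
}

print choose_encoding('euc-kr'), "\n";       # on the list
print choose_encoding('CP1047'), "\n";       # EBCDIC, not on the list
print choose_encoding('ISO-2022-KR'), "\n";  # not on the list
```

In the script above, the result of choose_encoding($cgi->param('enc')) would be used both in the Content-Type header and in the call to encode, so the two can never disagree.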

In practice

Yahoo Search was vulnerable to the ISO-2022-KR attack (only affecting Chrome users). It was reported to them on 2009-06-29 and fixed the next day.

Ultraseek ("Enterprise Search Engine", hosted on many sites) is vulnerable to an EBCDIC attack.