Chris Weber: Unicode attacks and test cases – Visual Spoofing, IDN homograph attacks, and the Mixed Script Confusables

Let’s face it, playing tricks that mess with people’s perception can be fun. With Unicode, there are lots of fun tricks to be had. What’s to stop someone from believing the following is what it appears to be:

www.аmazon.com

Looks like amazon.com of course, but it’s not. The first ‘a’ is the Cyrillic small letter a (U+0430), not the Latin small letter ‘a’ (U+0061). They look identical, but they come from two different scripts. Confused? Good. Now hover your mouse over the link above, but don’t click it, because I don’t know where it goes and it probably isn’t nice. In your browser’s status bar you should see the Punycode-encoded version of the domain name:

http://www.xn--mazon-3ve.com/

Because DNS does not support Unicode (only a subset of ASCII characters are allowed), we have IDN (Internationalized Domain Name) standards which define how domain names with Unicode characters should be encoded.  Punycode is the name of the encoding mechanism.
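To make the encoding step concrete, here is a minimal sketch using Python’s built-in "idna" codec. Note the assumptions: that codec follows the older IDNA 2003 rules, and the helper names to_ascii and to_unicode are mine, not part of any standard.

```python
# Convert a hostname between its Unicode form and its ASCII (Punycode) form.
# Assumption: Python's built-in "idna" codec (IDNA 2003 rules) is good enough
# for a demonstration; real software may need a newer IDNA implementation.

def to_ascii(hostname: str) -> str:
    """Return the ASCII-compatible (xn--) form of a Unicode hostname."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Return the Unicode form of an ASCII (xn--) hostname."""
    return hostname.encode("ascii").decode("idna")


spoofed = "www.\u0430mazon.com"   # U+0430 CYRILLIC SMALL LETTER A, then "mazon"
genuine = "www.amazon.com"        # plain ASCII

print(to_ascii(spoofed))   # www.xn--mazon-3ve.com
print(to_ascii(genuine))   # www.amazon.com (pure ASCII is left alone)
print(to_unicode("www.xn--mazon-3ve.com") == spoofed)
```

The two hostnames look the same on screen, but only the spoofed one picks up the xn-- prefix, which is exactly what the status bar is showing you.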

The above is often referred to as an IDN homograph attack. Aside from spoofing with lookalike characters from completely different alphabets, we can do a bunch of spoofing just within our own alphabet. For example, certain fonts make combinations of characters hard to tell apart. The letters ‘r’ and ‘n’ together can look like the letter ‘m’ (rn == m), zeros can look like ‘O’, and the number 1 can look like a lower case ‘l’. So you wind up with lots of clever visual attacks:

  • www.rnu11ets.com looks a lot like www.mullets.com

I originally listed this same text in several different fonts, because in some fonts you can’t tell the two words apart. The visual appearance of a character has a lot to do with the font used to display the glyph, not just the alphabet.

The Confusables

These types of visual attacks are attributed to what’s known as ‘the confusables’ and have been documented in Unicode’s Technical Report 36 (TR36) and TR39. The confusables is a name given to characters and strings that essentially look like each other. The Unicode Consortium has defined three main classes of confusable strings:

  1. Single-script
  2. Mixed-script
  3. Whole-script

I want to investigate each one in turn. Because I’m simplifying things here, I may not be accurate in my use of the terms script, alphabet, letter, and so on. Linguistics people get it better than I do but for the rest of us, the term ‘script’ refers to:

A collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Russian is written with a subset of the Cyrillic script; Ukrainian is written with a different subset. The Japanese writing system uses several scripts.

Single-script confusables

These occur when letters from the same alphabet, or script, are used to give the same visual appearance.  This definition should be extended to say that these occur when letters from either the same script, inherited script, or common script, are used together.   For example, the following two combinations of Latin letters look identical:

  • so̷s
  • søs

If you take these apart, there’s a big difference. While the letter ‘s’ is the same in each, the ‘o̷’ and ‘ø’ are different. The first uses the Basic Latin ‘o’ followed by a combining diacritical mark named COMBINING SHORT SOLIDUS OVERLAY, whose script is considered inherited. To put it a different way, we have two Unicode code points here, which together give the effect of a single character or letter. The second uses the single character LATIN SMALL LETTER O WITH STROKE. Let’s take these apart and look at the Unicode code point values for each.

  • so̷s == \u0073\u006F\u0337\u0073
  • søs == \u0073\u00F8\u0073

As you can see, the first ‘o̷’ is formed from two Unicode code points, U+006F and U+0337. If you copy and paste that word into a text editor that supports Unicode (e.g. Notepad) and press backspace, you’ll see the first backspace removes the combining diacritical mark, and the second removes the ‘o’. Continuing with the example, the second ‘ø’ is made of a single Unicode code point, U+00F8, which is part of the Latin-1 Supplement Unicode block. At a lower level, because we’re using different code points and bytes to achieve the same visual effect, we have a case of the confusables.
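If you’d rather not trust your eyes, a few lines of Python will take the two words apart for you. This is only a sketch using the standard unicodedata module; the describe helper is my own name. It also checks one thing the text above doesn’t mention: NFC normalization does not merge the two spellings, because ‘ø’ has no canonical decomposition.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


combining = "so\u0337s"     # 'o' followed by COMBINING SHORT SOLIDUS OVERLAY
precomposed = "s\u00f8s"    # LATIN SMALL LETTER O WITH STROKE

for label, word in (("combining", combining), ("precomposed", precomposed)):
    print(f"{label}: {len(word)} code points")
    for line in describe(word):
        print("   ", line)

# Canonical normalization leaves both strings as they are, so they still differ.
same_after_nfc = (
    unicodedata.normalize("NFC", combining)
    == unicodedata.normalize("NFC", precomposed)
)
print("equal after NFC:", same_after_nfc)
```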

Let’s take a closer look at what qualifies as a single-script confusable for the Latin capital letter ‘A’ – taken from the confusables table at http://unicode.org/reports/tr39/data/confusables.txt.

FF21 ; 0041 ; SA # ( Ａ → A ) FULLWIDTH LATIN CAPITAL LETTER A → LATIN CAPITAL LETTER A
1D400 ; 0041 ; SA # ( 𝐀 → A ) MATHEMATICAL BOLD CAPITAL A → LATIN CAPITAL LETTER A # {nfkc:119809}
1D434 ; 0041 ; SA # ( 𝐴 → A ) MATHEMATICAL ITALIC CAPITAL A → LATIN CAPITAL LETTER A # {nfkc:119861}

Update: I just realized that some of the characters broke WordPress so I’ve converted them all to NCR. In the above you can see three characters that all look visually similar to the Latin capital letter ‘A’. The first number is the code point for the confusable, the second number, 0041, is the code point for ‘A’, the third field (SA) is the confusable type, and the following stuff is some descriptive text.
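Those table rows are easy to read by machine as well. Here’s a rough sketch of a parser for the three-field layout shown above (source ; target ; type, with everything after ‘#’ treated as a comment). The Confusable tuple and parse_confusable function are names I made up for illustration, and the layout of the current confusables.txt may differ from the rows quoted here.

```python
from typing import NamedTuple, Optional


class Confusable(NamedTuple):
    source: str   # the lookalike character(s)
    target: str   # what it can be confused with
    kind: str     # type field, e.g. "SA"


def _decode(field: str) -> str:
    """Turn space-separated hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_confusable(line: str) -> Optional[Confusable]:
    """Parse one 'source ; target ; type # comment' row, or return None."""
    data = line.split("#", 1)[0].strip()          # drop the trailing comment
    fields = [part.strip() for part in data.split(";")]
    if len(fields) < 3:
        return None                               # blank or comment-only line
    return Confusable(_decode(fields[0]), _decode(fields[1]), fields[2])


SAMPLE = [
    "FF21 ; 0041 ; SA # FULLWIDTH LATIN CAPITAL LETTER A",
    "1D400 ; 0041 ; SA # MATHEMATICAL BOLD CAPITAL A",
    "1D434 ; 0041 ; SA # MATHEMATICAL ITALIC CAPITAL A",
]

entries = [parse_confusable(row) for row in SAMPLE]
for entry in entries:
    print(f"U+{ord(entry.source):04X} -> {entry.target!r} ({entry.kind})")
```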

The ‘Mathematical’ characters are considered single-script confusables because they are assigned the Common script class rather than a script of their own.

Other scripts exist which have their own characters confusable with the Latin ‘a’, but those are considered mixed-script, which I’ll go over in another post. For now I’ll leave you with a list of test cases for single-script confusables. Some are more obvious than others, and it all depends on the font – I’ve set Lucida Sans Unicode, which is supported on most Macs and Windows machines.

  • Microsoft → Micros𝗈ft
  • Apple → Ap𝗉le
  • Google → Google
  • IBM → IBM
  • Oracle → O𝗿𝗮cle
  • Intel → Int𝗲𝗹
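One cheap way to catch test cases like these is compatibility normalization. The sketch below is an assumption on my part, not something TR39 prescribes: it uses NFKC from Python’s unicodedata module to fold ‘Mathematical’ and fullwidth letters back to plain Latin, and flags any string that changes. I’m also assuming the odd letter in the Microsoft example is U+1D5C8 MATHEMATICAL SANS-SERIF SMALL O.

```python
import unicodedata


def nfkc_changes(text: str) -> bool:
    """True if NFKC normalization alters the text (a hint that lookalikes are present)."""
    return unicodedata.normalize("NFKC", text) != text


sample = "Micros\U0001D5C8ft"   # assumed: MATHEMATICAL SANS-SERIF SMALL O in place of 'o'

print(unicodedata.normalize("NFKC", sample))   # Microsoft
print(nfkc_changes(sample))                    # True
print(nfkc_changes("Microsoft"))               # False
```

This only helps with characters that have a compatibility mapping. It does nothing for lookalikes such as the ‘o̷’ and ‘ø’ pair from earlier.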

Mixed-script confusables

These occur when letters from one alphabet or script are used to give the same visual appearance as letters from a completely different script. For example, the following words contain a mix of Latin and Cyrillic letters which are indistinguishable from their Latin-only counterparts:

  • Spооfing with hоmogrаphs

If you look at the letters, you’ll see that the ‘oo’ in ‘Spoofing’ is made up of two Cyrillic small letters ‘o’, and the ‘a’ in ‘homographs’ is Cyrillic as well.  Let’s take some of the words apart and look at the Unicode code point values for each.

  • Spoofing == \u0053\u0070\u006F\u006F\u0066\u0069\u006E\u0067
  • Spооfing == \u0053\u0070\u043E\u043E\u0066\u0069\u006E\u0067

The first version of ‘Spoofing’ uses all ASCII Latin letters, but the second mixes in the Cyrillic letters ‘oo’. Now if the word ‘Spoofing’ was being filtered, you could probably bypass the filter using this case of mixed-script confusables.

In fact, the confusables can be used to bypass profanity filters, ad filters, or just about any system that wants to blacklist words but still accepts Unicode.
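Here’s a small sketch of that bypass, plus a crude way to spot it. Everything in it is hypothetical: the BLOCKED list, the naive_filter function, and especially scripts_used, which guesses a letter’s script from the first word of its Unicode character name. That is a rough stand-in for the real Script property, which Python’s standard library doesn’t expose.

```python
import unicodedata

BLOCKED = {"spoofing"}   # hypothetical blacklist


def naive_filter(word: str) -> bool:
    """True if the word is blocked by a simple exact-match blacklist."""
    return word.lower() in BLOCKED


def scripts_used(text: str) -> set[str]:
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


plain = "Spoofing"                   # all ASCII Latin
mixed = "Sp\u043e\u043efing"         # two CYRILLIC SMALL LETTER O in the middle

print(naive_filter(plain))           # True: caught by the blacklist
print(naive_filter(mixed))           # False: slips straight through
print(sorted(scripts_used(mixed)))   # ['CYRILLIC', 'LATIN']
```

A word that reports more than one script is worth a second look before it gets past the filter.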

As a test case, most browsers and other software shouldn’t allow the end-user to be fooled by the following IDN homograph attacks. These domain names contain mixed-script confusables, and should be represented in their lovely Punycode encoding for the user to realize they may not be what they appear to be.

www.microsоft.com is http://www.xn--microsft-sbh.com/
www.Αpple.com is http://www.xn--pple-zld.com/
www.faϲebook.com is http://www.xn--faebook-6pf.com/

I’ll take them apart another time, when I plan to look closer at IDN, IRIs, and the rules around them.

Whole-script confusables

It’s starting to make sense now. Let’s look at the Unicode TR39 definition of a whole-script confusable:

X and Y are whole-script confusables if they are mixed-script confusables, and each of them is a single script string. Example: “scope” in Latin and “ѕсоре” in Cyrillic.

If we look at the code points, we’ll see the clear difference between the two scripts being used:

  • scope == \u0073\u0063\u006F\u0070\u0065
  • ѕсоре == \u0455\u0441\u043E\u0440\u0435

The first version of ‘scope’ uses all Latin letters, but the second uses all Cyrillic letters. We call it a whole-script confusable because each word is made entirely of a single script; we’re not mixing scripts within the same string.
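Because each string sticks to one script, a check that only looks for mixed scripts won’t flag either of them. The sketch below shows that, then maps the Cyrillic letters onto their Latin lookalikes before comparing. The five-entry CYRILLIC_TO_LATIN table is hand-picked for this one example and is nowhere near the full confusables data; the skeleton name is borrowed loosely from TR39, and scripts_used is the same rough name-based guess as before.

```python
import unicodedata

# Hand-picked subset for this example only; the real table is far larger.
CYRILLIC_TO_LATIN = {
    "\u0455": "s",   # CYRILLIC SMALL LETTER DZE
    "\u0441": "c",   # CYRILLIC SMALL LETTER ES
    "\u043e": "o",   # CYRILLIC SMALL LETTER O
    "\u0440": "p",   # CYRILLIC SMALL LETTER ER
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
}


def scripts_used(text: str) -> set[str]:
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def skeleton(text: str) -> str:
    """Replace known lookalikes with the Latin letter they imitate."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


latin = "scope"
cyrillic = "\u0455\u0441\u043e\u0440\u0435"

print(scripts_used(latin), scripts_used(cyrillic))   # one script each
print(latin == cyrillic)                             # False
print(skeleton(latin) == skeleton(cyrillic))         # True
```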

The confusables can be used to bypass profanity filters, ad filters, or just about any system that wants to blacklist words but still accepts Unicode.

As a test case, most browsers and other software shouldn’t allow the end-user to be fooled by the following IDN homograph attacks. These domain names contain whole-script confusables, and should be represented in their lovely Punycode encoding for the user to realize they may not be what they appear to be.

www.аЬс.com is http://www.xn--80a8a6a.com/
www.ігѕ.com is http://www.xn--c1a2eb.com/

source: http://www.lookout.net


3 Comments

  1. Optional says:

    When I mouse over your demo Amazon link in Firefox it doesn’t show the punycode version in the status bar. Still looks like a genuine Amazon link :(

    Firefox’s numerous ‘little security issues’ bother me more and more each day.

  2. IDN Phishing and Spoofing | Domain Name Scams says:

    […] IDN homograph attacks, visual spoofing, or IDN’s in general, I highly recommend checking out IDNNews.com. I’m not sure how many of you remember lraq.com (LRAQ.com that is) and how many people were […]

  3. Gervase Markham says:

    Optional: Firefox has robust defences against this sort of attack.
    http://www.mozilla.org/projects/security/tld-idn-policy-list.html

    If the Amazon domain actually resolved, I’m pretty sure the URL bar would show the punycode form – because .com is not whitelisted. For whitelisted domains, their anti-spoofing policies at registration time should mean you aren’t able to register such a domain.
