Phishing - Browser-based Defences

Version 0.3 - 2005-02-09

The core problem of phishing can be expressed as follows: a user has been phished when they think they are communicating with an entity they know and have a relationship with, when in fact they are communicating with a malicious attacker. Once the attacker has the victim fooled, he can attempt to extract valuable information such as usernames, passwords, or credit card details.

The browser, as the vehicle overwhelmingly used for high value transactions, is at the forefront of the phishing war. This paper discusses browser-based approaches to mitigating the problem, and suggests three changes - "Domain Hashing", "New Site" and "Phish Finder".

Other approaches to phishing are possible - for example, one could argue that prevention of class 6 attacks (see below) is entirely the problem of the domain name registrars. However, such approaches are outside the scope of this paper.

Phishing URL Classes

Phishing URLs can be classified into the following six classes:

1. An IP address, e.g. http://192.168.1.1/
   This relies on the user ignoring the URL bar completely, or being confused by its complexity.
2. A completely different domain, e.g. https://www.randomdomain.com/
   This relies on the user just not looking at the domain at all.
3. A plausible-sounding but fake domain, e.g. https://www.paypal-secure.com
   This relies on the user not knowing their exact destination.
4. A visible-to-the-eye letter substitution, e.g. https://www.paypa1.com
   This relies on the user not looking too closely at individual letters.
5. An HTTP login, e.g. https://www.paypal.com.../@https://www.evil.com/
   Some browsers already warn about this.
6. An invisible letter substitution (punycode attack), e.g. https://www.xn--pypal-4ve.com
   This sort is currently almost undetectable.
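
Three of these classes (the IP address, the HTTP login and the punycode domain) leave a trace in the structure of the URL itself, so a browser can recognise them mechanically. The following is a minimal sketch in Python, assuming the standard library's URL parser; the function name structural_classes is hypothetical. Note that a punycode label only shows that the domain is an IDN, not that it is malicious.

```python
import ipaddress
from urllib.parse import urlsplit


def structural_classes(url: str) -> set[int]:
    """Return the class numbers (1, 5 or 6) whose pattern is visible in the URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    found = set()

    # Class 1: the host is a literal IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        found.add(1)
    except ValueError:
        pass

    # Class 5: the URL carries an HTTP username before the real host.
    if parts.username is not None:
        found.add(5)

    # Class 6: at least one label of the host is punycode-encoded (an IDN).
    if any(label.startswith("xn--") for label in host.split(".")):
        found.add(6)

    return found
```

Classes 2, 3 and 4 produce an empty result here, because nothing in the URL's structure distinguishes them from a legitimate site; that is why they need the user's attention or some knowledge of where the user has been before.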

Solution Requirements

Phishing is, at heart, a problem of correct identification. So good solutions will be those which make the user more confident of where they are, and make it easier for them to notice when they are somewhere they haven't been before. I believe that browser-based anti-phishing measures need to have the following characteristics.

Solutions MUST be based on SSL. The browser needs to know for certain where it is, before the user has any hope of knowing. This means SSL - the industry-standard way of a browser being certain that the site sending it data is the same site it thinks it is requesting data from. Any fix which does not involve SSL is a band-aid. In fact, providing anti-phishing UI on non-SSL sites may worsen the situation: if the UI is visible only when the connection is secure, its presence is itself a good indicator, and its complete absence means "be suspicious".

Solutions should be general and coordinated. It's no good having one UI widget which goes blue in the presence of a type 3 attack, and another which changes shape for a possible type 5. All forms of phishing should be considered, and a consistent and organised response made.

Solutions should ideally be self-contained. Solutions which require the browser to go to a third-party server to help make the decision are hamstrung if that server goes down - OCSP has this problem. Even if the server is up and responsive, page load can be delayed because further network round-trips are necessary. It also requires that the third-party server be maintained.

Solutions should avoid discrimination against persons or groups. We shouldn't discriminate against punycode domain names as a class, because that's discriminating against users of non-Latin alphabets. We also shouldn't discriminate against unpopular or low-traffic sites - that's unfair on small businesses.

Solutions should be aligned with the user's desire to do minimum work. Protecting the user is our job; passing it off to them and making them work to protect themselves means both that we've failed, and that it won't get done. Users have a high opinion of their own ability to avoid being fooled - no-one thinks "wow, I'm a real mug, I'd better do something to protect myself from phishing".

Solutions should have simple UI. The more complex a UI, the more a user's eye will gloss over it. A browser which has twenty widgets and icons to indicate different things about the site will not help because the user will simply ignore the lot. As a rule of thumb, if we can't explain to the user what to look for in a single sentence, the UI is too complicated.

Possible Solutions

Here are some solutions or solution components which have been suggested recently, and my thoughts on them.

Turn off IDN

The latest round of punycode-based homograph attacks has led people to suggest switching off IDN, either personally or in browser security releases.

This solution is inherently discriminatory - IDN was introduced to try and level the playing field in domain names with regard to their alphabet. It's also an admission of defeat, and it doesn't solve the problem - it only prevents class 6 phishing attacks.

Show raw names

It has been suggested that the browser should (instead or additionally) show raw domain names to users in the punycode (class 6) case.

This is not much better than showing them a hex string which is a hash of the domain - to most users, it will seem just like a random set of letters and numbers. It would also hamper uptake of IDN, because IDN owners would not want to be second-class citizens.

Phishing blacklist

It has been suggested that browsers should contact a remote server which holds a phishing domain blacklist.

While a good idea in theory, the experience of those who attempt to maintain spam blacklists should cause us to be cautious about saying that such a service is easy to build and keep up in the face of attack. Accepting only valid reports, removing malicious reports and fending off DOS attacks would make this a serious undertaking.

Unless the browser downloads the entire blacklist regularly (and most phishing sites have a lifetime measured in days), it would also have to send each URL accessed to the remote server, which has serious privacy implications.

User-selected logo or 'petname'

It has been suggested that users should select a logo or friendly name for sites that they frequently visit, and that this name or logo should then be displayed in the UI on future visits, to reassure them that they are where they think.

While this scheme may work fairly well for users who bother to take the action required, in practice I suspect very few would choose to do so. As I explained in "Solution Requirements", few users are going to put in the time to maintain such a list outside the confines of a prompted lab study. It requires them to put up-front effort into protecting themselves from a problem they can't imagine themselves encountering.

Domain letter colouring

It has been suggested that the display of the domain name in the URL bar should be annotated, perhaps using coloured backgrounds, to indicate the lexical class of the letter concerned, so that "odd" letters may be more easily picked out. This suggestion has gained currency because it is mentioned in the Unicode Consortium advice on dealing with IDN spoofing issues, and by members of the IDN community, so I will deal with it at greater length.

My most important objection is that schemes of this sort will inevitably be over-complex. It would be impossible to train a significant proportion of users as to the meanings of each of the colours or changes, and to explain exactly what defined a suspicious circumstance, as there are many legitimate uses for domains with a mix of e.g. Latin and another alphabet (as the Consortium advice notes). If you can't explain it in a single sentence, it's too complex.

Any scheme involving coloured annotations presents problems for the 5% of the world who are colour-blind. It's a longstanding UI principle that no information should ever be conveyed to the user by means of colour alone. Also, some Unicode codepoints are used in several different languages, and so would have no obvious single colour.

The last objection is that it would also have the significant side-effect of making the URL bar often very ugly for those large parts of the world who use some or many non-ASCII characters. This is not a trivial consideration - users who dislike the ugliness will either try and turn it off or will switch to another product.

Using bookmarks or history

It has been suggested that bookmarks and history could be used to see whether the user had previously visited a site. If a user has never visited a site before, that's an indication it could be fraudulent.

The idea here is good, but this would not work well with current history implementations, which store history for a limited time only. Extending that time, and/or making the keeping of history compulsory, runs up against privacy issues. There's a fundamental tension between the anti-phishing code's need to know where the user has been, and a user's legitimate desire to conceal that information.

Measurements of lexical proximity

It has been suggested that browsers should attempt to determine whether a particular domain was in "close lexical proximity" to a high value domain.

While this may be possible for the limited case of homographic characters, the problems in the more general case are finding the list of high value domains, and defining "close lexical proximity". For a dyslexic person, http://www.ibm.com and http://www.bmi.com are in close lexical proximity and could easily be confused. But no-one would suggest BMI Music should surrender their domain to IBM Computers. In addition, the number of different scripts and writing forms makes it very difficult indeed to design a consistent metric.
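
To make the difficulty concrete, here is a minimal sketch using Levenshtein edit distance as the proximity metric. This metric is an assumption chosen for illustration; none of the proposals discussed here names a specific one, and the function name edit_distance is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]
```

Under this metric www.paypa1.com is at distance 1 from www.paypal.com, but www.ibm.com and www.bmi.com are only at distance 2 from each other. Any threshold loose enough to catch two-character tricks would therefore also flag a pair of entirely legitimate domains.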

It also requires the browser to have a list of high-value domains. Who would make and maintain such a list, and decide on the entry criteria?

My Suggestions

I have three suggestions for browser changes. The first two help the user to realise on their own that they may be on a phishing site. The last helps the browser to warn the user, even if the user does not suspect anything. (I cannot claim that these ideas are entirely my own; I formulated them having considered the ideas of many other people, including the ones above.)

Domain Hashing

It is a fundamental of human language that we will want to choose names which sound or look fairly similar to each other. Any system which allowed no such similarities would prevent many legitimate uses. However, we need to make it easier for the user to notice the difference between two close strings.

The standard way to make two things more easily comparable is to hash them. So, the browser should display, alongside the domain name, a representation of a hash of the domain. This "amplifies" the difference between similar- or identical-looking domain names.

However, such a hash should not be displayed, for example, as a set of hexadecimal digits. These are hard for the human eye to scan and remember, because they are letters and numbers without obvious meaning. Humans find arbitrary strings of letters hard to remember. I originally wanted to use a colour to represent the hash, but this produces problems for the colourblind.

Instead, I suggest the hash (or the first N bits of it) should be displayed as two digits of a new 64-symbol set of glyphs, or "alphabet". The alphabet would be carefully chosen to have the characteristic that no glyph is a letter in any language, and no two are similar in appearance.

The 64 symbols would be chosen from "Geometric Shapes", "Miscellaneous Symbols", "Dingbats" and other similar sections of the Unicode standard. The advantage of using Unicode symbols is that the glyphs are widely available, and their forms defined and known. Because these codepoints are not letters in an alphabet, terms like "bold", "italic", "serif" and "script" mean nothing to them, so their glyphs keep their distinctive shapes in different fonts.

The same hash algorithm and set of glyphs would be used across all browsers, and the glyph for a particular domain name would be printed on advertising media and marketing materials for that domain. The hash would be part of the security UI of the browser - in the Firefox case, the domain indicator in the status bar. The user would check both the domain name and the glyphs to make sure they were in the right place.

Example: www.paypal.com (♠◊)

So, in order for a phisher to have a plausible URL, they would need to find one which both looked like "www.paypal.com" and also hashed to the same glyphs. This reduces the number of possibilities by a factor of (64 * 64 =) 4096. Even better odds could be gained by using more than two glyphs; greater simplicity could be gained by using just a single one. Exactly how many to use would be a matter of discussion. Using more glyphs is more secure, but also increases UI complexity.
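
The mapping from domain to glyphs can be sketched as follows. This is only an illustration under stated assumptions: the paper does not name a hash algorithm, so SHA-256 is assumed here, and the alphabet is a placeholder made of the first 64 codepoints of the "Geometric Shapes" block rather than a hand-picked set. The names GLYPHS and domain_glyphs are hypothetical, and the output will not match the glyphs shown in the example above.

```python
import hashlib

# Placeholder alphabet: the first 64 codepoints of the Unicode
# "Geometric Shapes" block (U+25A0..U+25DF). A real alphabet would be
# hand-picked so that no two glyphs look alike; this range is only an
# assumption made for illustration.
GLYPHS = [chr(code) for code in range(0x25A0, 0x25A0 + 64)]


def domain_glyphs(domain: str, count: int = 2) -> str:
    """Map a domain name to `count` glyphs, using 6 bits of hash per glyph."""
    # SHA-256 is an assumed choice; the scheme only needs all browsers to agree.
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    bits = int.from_bytes(digest, "big")
    total_bits = len(digest) * 8
    glyphs = []
    for position in range(count):
        # Take successive 6-bit groups from the top of the digest.
        shift = total_bits - 6 * (position + 1)
        glyphs.append(GLYPHS[(bits >> shift) & 0x3F])
    return "".join(glyphs)
```

Six bits select one of 64 glyphs, so two glyphs consume the first 12 bits of the hash, which is where the factor of 4096 comes from. A real scheme would also have to fix one canonical form of the domain name (for example, always the Unicode form or always the punycode form) before hashing.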

To summarise: this proposal means that lexically close or homographic domains look significantly different, aiding the user in determining exactly where they are. This scheme has the advantage of the logo/petname proposal - a domain is associated with an image - without the need for the user to continuously do work to define relationships.

Update 2005-03-14: I'm no longer so keen on "Domain Hashing", for several reasons - primarily because I think IDN spoofing should be tackled mostly at the Nameprep and registrar level, and we are working towards doing that instead.

New Site

The key and unchangeable characteristic of a phish is as follows: the user is somewhere they haven't been before, but they think they are somewhere they have been before. (If they've been to this phishy place before, they've already been phished; if the phish is faking a site they've never visited, they won't have a relationship or a high-value login.)

If the browser kept complete and total history, it could say definitively "you've been here before", or "you haven't been here before". It could then display the words "new site" (or an equivalent icon) in the security area, which would hopefully alert the user to the problem.

However, this runs into a problem with privacy. Many users do not like to retain history, and normal browser implementations throw it away after a configurable number of days (often 30). These two factors make it much less useful for this purpose than it could be.

My suggested solution is a special history, which takes each SSL domain visited, appends a user-specific random string, and stores a hash of the result. Because it just stores domains, the size of the store becomes manageable, and because it's hashed, it's free from casual prying. The store has the property that it would be possible for an attacker with access to the box to know if a particular SSL domain had been visited, but not to get a list of all SSL domains visited, unless they hashed all the domains in the world for that particular user. Hopefully, this reduces the privacy issue.
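
A minimal sketch of such a store, assuming SHA-256 as the hash and a 16-byte random string per user; neither choice comes from the proposal itself, and the class and method names are hypothetical.

```python
import hashlib
import secrets
from typing import Optional


class HashedSslHistory:
    """Remembers which SSL domains were visited without storing their names."""

    def __init__(self, salt: Optional[bytes] = None) -> None:
        # The user-specific random string, generated once per profile.
        self.salt = salt if salt is not None else secrets.token_bytes(16)
        self._seen = set()

    def _digest(self, domain: str) -> bytes:
        # Domain first, then the random string appended, as described above.
        data = domain.lower().encode("utf-8") + self.salt
        return hashlib.sha256(data).digest()

    def is_new_site(self, domain: str) -> bool:
        """True if this SSL domain has never been recorded."""
        return self._digest(domain) not in self._seen

    def record_visit(self, domain: str) -> None:
        """Store only the salted hash of the domain, never the domain itself."""
        self._seen.add(self._digest(domain))
```

Someone with access to both the store and the random string can still test whether one particular domain is present, but recovering the full list means hashing every candidate domain with that user's string, which matches the property described above.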

It would be possible to disable this hashed SSL history, but it would be a separate preference to normal history, and would not be cleared by default by any "Sanitise" functions. It should be considered part of the browser's internal workings, and not user-accessible.

So, if this store indicated that an SSL domain had not been previously visited, the UI might change as follows. This is just a suggestion - the details could easily be changed.

[Figure: demonstration of the new UI, showing a purple border and the words "new site"]

Phish Finder

The third change is a set of heuristics to help the browser determine that a site may be suspicious. The heuristics would use the following as input data. These factors are carefully chosen to be non-discriminatory.

Whether the domain name consists solely of ASCII plus at least one ASCII homograph, or (if an accurate heuristic with no false positives is possible) contains "mixed scripts"
Whether the domain appears in existing bookmarks, history or password store
Hashed SSL domain history, as outlined above
Whether the link came from an email or other external source
Whether the link text has a different hostname to the URL
Use of a numeric IP address rather than a domain name
Whether the link has an unnecessary HTTP username/password pair

Once the page has loaded, we could add:

Whether the page has many cross-domain links (e.g. if they are stealing the spoofed site's graphics directly)
Whether the page has the same favicon but not the same domain as a previously-visited page

If we also take the time to contact a remote server, we could add:

Age of the domain (if it were, say, less than two weeks old)
Whether the domain was on a phishing blacklist (regularly downloaded)

When the browser suspects a site of being phishy, a yellow security information bar would appear at the top of the content window - as currently happens for blocked popups and attempted software installs. A "more info" button would explain why the browser thought the site was suspicious. This UI is independent of the exact algorithm used, which gives maximum flexibility to improve the algorithm and sources of input data as time goes on.
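
One way the pre-load inputs might be combined is a simple weighted score, with the triggered reasons kept for the "more info" button. The sketch below is only that: the weights, the threshold, the wording of the reasons and every name in it are hypothetical, since the paper deliberately leaves the algorithm open.

```python
from dataclasses import dataclass


@dataclass
class LinkSignals:
    """Pre-load inputs from the first list above; field names are illustrative."""
    ascii_homograph_or_mixed_script: bool = False
    known_from_bookmarks_history_or_passwords: bool = False
    in_hashed_ssl_history: bool = False
    from_external_source: bool = False
    link_text_host_differs: bool = False
    numeric_ip_host: bool = False
    unnecessary_userinfo: bool = False


# Hypothetical weights: (field name, weight, reason shown under "more info").
SUSPICIOUS = [
    ("ascii_homograph_or_mixed_script", 2, "domain name contains look-alike characters"),
    ("from_external_source", 1, "link came from an email or other external source"),
    ("link_text_host_differs", 2, "link text names a different host from the URL"),
    ("numeric_ip_host", 2, "URL uses a numeric IP address"),
    ("unnecessary_userinfo", 2, "URL carries an unnecessary username/password"),
]
UNFAMILIAR_WEIGHT = 1  # hypothetical
THRESHOLD = 3          # hypothetical


def assess(signals: LinkSignals) -> tuple[bool, list[str]]:
    """Return (show the warning bar?, reasons for the "more info" button)."""
    score = 0
    reasons = []
    for field, weight, reason in SUSPICIOUS:
        if getattr(signals, field):
            score += weight
            reasons.append(reason)
    familiar = (signals.known_from_bookmarks_history_or_passwords
                or signals.in_hashed_ssl_history)
    if not familiar:
        score += UNFAMILIAR_WEIGHT
        reasons.append("site has not been visited before")
    return score >= THRESHOLD, reasons
```

Because the caller only sees a yes/no decision and a list of reasons, the scoring can be retuned or replaced without touching the UI, which is the flexibility the paragraph above describes.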

Conclusion

Many approaches to countering phishing have been suggested, but they often have practical problems. I believe my three suggestions could form the basis of a coordinated, non-discriminatory approach to the issue.

Credits

Many thanks to Ian G, Robert O'Callahan, Nelson Bolyard and the other participants of the Mozilla security group for their ideas and input on this paper, and thanks to SpoofGuard for some Phish Finder heuristic ideas.