Talk:Website fingerprinting
Hm.. interesting idea. Try this:
Which sites to store?
The Alexa.com top 1 million sites, plus selected categories in DMOZ.org such as banks.
Which pages to store?
The main page and the login pages.
How to find the login pages?
Check the link text of each <a> link off the main page for a set of keywords, e.g. login, signin, sign-in. Index each page that matches, up to a limit of 500.
If no matching link text is found, then check each link's destination page for login forms. They can be identified by input tags of the password type and other keywords. If this doesn't work, then try training a Bayesian text classifier.
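The two checks above could be sketched roughly like this, using only the Python standard library. The keyword tuple, the helper names (`find_login_links`, `has_login_form`) and the reading of "limit of 500" as a per-site cap are all assumptions for illustration, not part of WhatWeb or any existing tool.

```python
from html.parser import HTMLParser

# Assumed keyword set; the text above only names login, signin, sign-in.
LOGIN_KEYWORDS = ("login", "log in", "signin", "sign-in", "sign in")
MAX_PAGES = 500  # assumed to be a per-site cap on indexed pages


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only gather text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class PasswordFieldFinder(HTMLParser):
    """Set .found when an <input type="password"> is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def find_login_links(html, limit=MAX_PAGES):
    """Return hrefs whose link text contains a login keyword, capped at limit."""
    parser = LinkCollector()
    parser.feed(html)
    matches = [href for href, text in parser.links
               if any(keyword in text.lower() for keyword in LOGIN_KEYWORDS)]
    return matches[:limit]


def has_login_form(html):
    """Fallback check: does the page contain a password input?"""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found
```

You would run `find_login_links` on the main page first, and only fetch each link's destination and call `has_login_form` on it when that comes back empty.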
How to make a fingerprint?
- a hash of the sequence of HTML tags, e.g. html,head,title,/title,/head
- a hash of the HTML with the URLs removed, e.g. strip the values of <a href=""> attributes, style declarations, etc.
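Here is a minimal sketch of both fingerprint ideas. SHA-256, the function names and the exact stripping rules (href values, style attributes, style blocks) are my assumptions; the list above does not pin down a hash function or a full list of things to remove.

```python
import hashlib
import re
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the order of opening and closing tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_fingerprint(html):
    """Hash the comma-joined tag sequence, e.g. html,head,title,/title,/head."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return hashlib.sha256(",".join(parser.tags).encode("utf-8")).hexdigest()


def stripped_fingerprint(html):
    """Hash the HTML after blanking href/style values and dropping <style> blocks."""
    # Assumption: these two rules stand in for the "etc." in the list above.
    html = re.sub(r"(?is)<style\b.*?</style>", "", html)
    html = re.sub(r"""(?i)\b(href|style)\s*=\s*("[^"]*"|'[^']*')""", r'\1=""', html)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()
```

The tag-only hash survives changes to text and URLs, so a phishing copy with swapped links should still match. The stripped-HTML hash is stricter and changes whenever visible text changes.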
How do I know if the fingerprint method is good?
Collect an archive of mirrored phishing websites and test.
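One way to score such a test, as a rough sketch: count how many archived phishing pages produce a fingerprint that is already in the trusted set. The function name `detection_rate` and its arguments are hypothetical. `fingerprint` is whatever fingerprint function you are evaluating, and you would want to measure false positives on legitimate pages separately.

```python
def detection_rate(phish_pages, trusted_fingerprints, fingerprint):
    """Fraction of archived phishing pages whose fingerprint matches a trusted page.

    phish_pages: iterable of HTML strings from the phishing archive
    trusted_fingerprints: set of fingerprints collected from trusted sites
    fingerprint: callable mapping an HTML string to a fingerprint
    """
    pages = list(phish_pages)
    if not pages:
        return 0.0
    hits = sum(1 for html in pages if fingerprint(html) in trusted_fingerprints)
    return hits / len(pages)
```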
How do I implement it?
Collect the fingerprints and identify login pages using WhatWeb (http://www.morningstarsecurity.com/research/whatweb). I wrote WhatWeb, BTW. You'll need to write a couple of custom plugins; maybe I will help you... maybe not.
Make a web browser plugin.
It sends the URL + fingerprint to the server.
Make a server
It receives URL + page fingerprint pairs and responds with one of:
- url found, fingerprint matches. all good in da hood
- url found, fingerprint doesn't match. maybe they redesigned their website, maybe it's a MITM attack. better to have the central server fetch the actual URL itself to verify.
- url not found, fingerprint not found. whatever... don't lose your trust in small businesses
- url not found, fingerprint found. maybe it's a phishing site.
- url not found, fingerprint found for >1 trusted websites. probably a false positive for a CMS default login page.
Meh.. that's about it. I wouldn't bother the user unless you get condition 2 or 4. In those cases, pop something up and let them choose what to do, e.g. redirect to the trusted site with the same fingerprint.
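The five server responses can be written down as a small lookup, sketched here under the assumption that the server keeps two in-memory indexes: `by_url` (trusted URL to fingerprint) and `by_fingerprint` (fingerprint to the set of trusted URLs that have it). The `Verdict` names and both function names are made up for illustration.

```python
from enum import Enum


class Verdict(Enum):
    OK = 1                  # url found, fingerprint matches
    CHANGED = 2             # url found, fingerprint doesn't match
    UNKNOWN = 3             # url not found, fingerprint not found
    POSSIBLE_PHISH = 4      # url not found, fingerprint found
    LIKELY_CMS_DEFAULT = 5  # url not found, fingerprint found for >1 trusted sites


def classify(url, fingerprint, by_url, by_fingerprint):
    """Map a (url, fingerprint) report to one of the five conditions."""
    if url in by_url:
        return Verdict.OK if by_url[url] == fingerprint else Verdict.CHANGED
    owners = by_fingerprint.get(fingerprint, set())
    if not owners:
        return Verdict.UNKNOWN
    if len(owners) > 1:
        return Verdict.LIKELY_CMS_DEFAULT
    return Verdict.POSSIBLE_PHISH


def should_alert(verdict):
    """Only conditions 2 and 4 are worth bothering the user about."""
    return verdict in (Verdict.CHANGED, Verdict.POSSIBLE_PHISH)
```

For condition 4, the single entry in `by_fingerprint[fingerprint]` is the trusted site the popup could offer to redirect to.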
What's the greatest challenge involved in this?
Who's gonna pay for the bandwidth? It could work for a single corp, an antivirus vendor or Google.
Who wrote this up?
Andrew Horton / urbanadventurer