Irresponsible use of the Levenshtein metric

This week I posted about Auctia on Reddit's r/wow. After a pleasant response from r/woweconomy, I hoped for a nice bunch of upvotes and some good publicity. What I got instead was an accusation that my website was an Ethereum scam.

It started a day earlier, when I posted my Auctia update on wykop.pl, a Polish website similar to Reddit. User bbackbone noticed that when he tried to visit Auctia.io, presumably from Brave Browser, he got this nice warning screen:

I was really puzzled at first and thought that maybe bbackbone had some malware on his computer, but after a short investigation it turned out to be real. I traced the issue to MetaMask's eth-phishing-detect and filed a new issue asking to be whitelisted, as Auctia has absolutely nothing to do with cryptocurrencies, not even a cryptocurrency donation link. Their online tool revealed that my domain was blacklisted because: “This domain was blocked for its similarity to auctus.org, a historical phishing target.” The reason given got me slightly irritated, but I didn't think much would come of it.

The next day I posted about Auctia on r/wow, and the very first comment I got was a copy-paste of a similar warning message, this time explicitly naming “MetaMask” as the source of the blacklisting. Now anyone who saw a similar warning or read the comment would ignore or downvote the thread. The reputation of my website was scratched from the very beginning. The only thing I could do was reply to the comment and edit the main post to explain that this was a false positive, but the damage was done: the post didn't get much traction during the first few hours. My irritation grew as I dug deeper into the source code of the eth-phishing-detect project to see how their detection methods work.

Let's have a look at their project repository first. The issue list is full of people asking to be whitelisted, plus some asking to blacklist a certain domain, and the merge requests look similar. That was to be expected. But their main code has been pretty much unchanged for the last 3 years. It turns out that their one and only method of detecting whether a website is a scam is calculating the Levenshtein distance between domain names. At first I was baffled by what I was seeing. For people who don't know what the Levenshtein metric is: it measures how far apart two words, sentences or texts in general are, by counting the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. BUT it's not really good at doing so with short words or sentences, because then everything is similar to everything else. With their threshold set at a mere '3', it generates a lot of false positives. This is painfully visible in their whitelist, which consists of more than 600 entries. Their blacklist has about 10k entries.
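To make the problem concrete, here is a minimal sketch of the Levenshtein distance in Python. It is my own illustration, not code taken from eth-phishing-detect. The function name and the choice to compare only the name part of the domain (without the TLD) are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Assumption for illustration: only the domain names are compared, not the TLDs.
print(levenshtein("auctia", "auctus"))  # 2
print(levenshtein("cat", "dog"))        # 3
```

Two substitutions are enough to turn "auctia" into "auctus", so the pair falls inside a threshold of 3. "cat" and "dog" share no letters at all, yet their distance is also only 3, which is why a fixed threshold says so little about short names.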

While there is nothing wrong with having a blacklist-whitelist registry for scammy domains, adding the Levenshtein metric to it turns it into a messy false-positive generator. The advertised “Ethereum Phishing detector” has little to do with Ethereum, or with detection for that matter, because it only checks whether a name is similar to a blacklisted website, completely ignoring its contents.

This project could be part of a larger and more sophisticated phishing detection mechanism, but for some unknown reason some developers have decided to include it in their products and block websites based solely on that weak Levenshtein assumption. The lack of distinction between “having a similar name” and “being actually blacklisted”, combined with very aggressive warning screens, is very misleading.
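A rough sketch of what that distinction could look like follows. Everything in it is hypothetical: the `Verdict` names, the `classify` function, the example lists, and the naive way of dropping the TLD are my own assumptions for illustration, not how eth-phishing-detect or any browser actually works.

```python
from enum import Enum


class Verdict(Enum):
    WHITELISTED = "whitelisted"
    BLACKLISTED = "blacklisted"    # explicitly reported domain
    SIMILAR_NAME = "similar name"  # only a weak hint, not proof
    UNKNOWN = "unknown"


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def name_only(domain: str) -> str:
    # Naive assumption: everything before the last dot is the name.
    return domain.rsplit(".", 1)[0]


def classify(domain, whitelist, blacklist, targets, tolerance=3):
    """Return a verdict that keeps 'similar name' apart from 'blacklisted'."""
    if domain in whitelist:
        return Verdict.WHITELISTED
    if domain in blacklist:
        return Verdict.BLACKLISTED
    for target in targets:
        if levenshtein(name_only(domain), name_only(target)) <= tolerance:
            return Verdict.SIMILAR_NAME
    return Verdict.UNKNOWN


# Hypothetical example data.
blacklist = {"evil-wallet.example"}
targets = ["auctus.org"]
print(classify("auctia.io", set(), blacklist, targets))            # Verdict.SIMILAR_NAME
print(classify("evil-wallet.example", set(), blacklist, targets))  # Verdict.BLACKLISTED
```

With separate verdicts, a product could show a full blocking screen only for `BLACKLISTED` and, at most, a mild notice for `SIMILAR_NAME`.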

It has now been two days since I filed my plea to have Auctia whitelisted, and nothing has changed.