The Danger of False Positives

The IT security team where I work is intent on protecting the company from emails that might cause damage, so they’ve been working on new spam filters, and they’ve decided to put a notification at the top of each email that comes from outside the company. However, we use a lot of third-party online tools, and notifications from these tools – which I need to see – are now flagged as being from outside the company.

Technically it’s true: these are emails from outside the company. However, they’re from companies we partner with, and I do need to see them. From a user perspective they are being flagged when they don’t need to be. It’s a false positive, and it occurs because IT has defined the work environment as only what exists on the company’s own servers.

There is a limit to the accuracy of any test, and part of that limit shows up as false negatives and false positives. So what do those terms mean?
Think of a pregnancy test:
– A false positive would mean the test confirmed a pregnancy that does not exist.
– A false negative would mean the test showed no pregnancy when one does exist.
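
The same four outcomes apply to any yes/no test. As a minimal sketch (the function and parameter names are my own, purely for illustration), here is how each result can be labelled once you know the truth:

```python
def outcome(test_says_positive: bool, condition_present: bool) -> str:
    """Label one test result by comparing it with the real situation."""
    if test_says_positive and condition_present:
        return "true positive"
    if test_says_positive and not condition_present:
        # The test raised an alarm for something that is not there.
        return "false positive"
    if not test_says_positive and condition_present:
        # The test missed something that is really there.
        return "false negative"
    return "true negative"


# Pregnancy test example: the test says "pregnant" but there is no pregnancy.
print(outcome(test_says_positive=True, condition_present=False))
```

Swapping the two arguments in the example gives the false negative case: the test says “not pregnant” when a pregnancy does exist.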

One challenge of creating algorithms that use external data as input is evaluating the risk of false positives and false negatives. In law there’s an axiom that it’s better that ten guilty people go free than that one innocent person be imprisoned. So legal systems work to protect the innocent with rules of evidence and by putting the burden of proof on the prosecution, knowing that in some cases guilty people will go free. In drawing the line far on the side of false negatives (the guilty person is not convicted), the law acknowledges that it is, at least in theory, really important to avoid false positives (an innocent person is convicted).

In my email example above the line has been drawn: if it’s not a company email – easily identifiable by the email address – then it’s external and the notification is added. But our work environment online is no longer the walled garden we once had. Almost all of the systems I use are from external companies, companies that have gone through significant technical and risk assessment before being allowed to connect to the network. They are systems where I work, but from a strict IT perspective they are outside the company.
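
To make the rule concrete, here is a rough sketch of how I imagine it works. I have not seen the actual filter, so everything here is an assumption: the domains are placeholders, and the second function is only my illustration of a wider definition of “the work environment”, not something our IT team has implemented.

```python
# Hypothetical values, for illustration only.
COMPANY_DOMAIN = "example.com"
VETTED_PARTNER_DOMAINS = {"tickets.example.net", "hr-tool.example.org"}


def sender_domain(address: str) -> str:
    """Return the part of an email address after the last '@', lower-cased."""
    return address.rsplit("@", 1)[-1].strip().lower()


def flag_strict(address: str) -> bool:
    """The rule as I understand it: anything not from the company is flagged."""
    return sender_domain(address) != COMPANY_DOMAIN


def flag_with_partners(address: str) -> bool:
    """A wider rule: vetted partner tools count as part of the work environment."""
    domain = sender_domain(address)
    return domain != COMPANY_DOMAIN and domain not in VETTED_PARTNER_DOMAINS


notification = "noreply@tickets.example.net"
print(flag_strict(notification))         # flagged under the strict rule
print(flag_with_partners(notification))  # not flagged under the wider rule
```

Under the strict rule, every notification from a partner tool is flagged. From my side of the inbox, each of those flags is a false positive.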

I understand the IT perspective on this, but the volume of notifications has taught me to ignore them. It’s a bit like the boy who cried wolf – the ultimate false positive.

Ego Surfing

It’s a term coined almost twenty years ago, referring to the act of searching for your own name, pseudonym, or handle online to see what information appears.

We often attach a negative association to displays of ego, and references to ego surfing on the internet are generally negative or sarcastic.

But ego surfing can be a smart thing to do.

Just as companies manage their online presence and their online reputation, so should you. I think this should be an ongoing activity, but I’m sure people think of it more when they’re job seeking.

If you’re a random, unfamous person like me, the occasional search on major search engines will be enough. Here’s how I do it:

  1. Use a browser I don’t use very often
  2. Log out of any accounts, particularly Google
  3. Clear browsing history and cookies
  4. Search for my name, and the name of the blogs I write
  5. Search for the key topics I write about in the hope that my name/blog appears connected to those topics.

It’s important to use a “clean” browser to do this as Google will give you adjusted results based on your location, browsing history and login.

If you find content that shouldn’t be publicly available, you have a few options to remove it. WikiHow provides a list of actions you can take. In some cases Google will remove content that they index if it could lead to identity theft (although they won’t remove your date of birth). In some situations EU residents can ask to be “forgotten” by Google when information is dated and has a negative reputational impact.

There are therefore two very good reasons for searching your own name: to check that your name isn’t associated with negative information, and to make sure that the content you are publishing is building your reputation in your field of expertise.

The algorithms used by search engines prioritise content that is useful, rewarding sources that provide useful content and ranking content that gets clicked on more highly. Most people won’t click on “next page” of a Google search, so it’s really important that your content is on the first page of results. In fact there’s a joke in Search Engine Optimisation about hiding anything you don’t want anyone to find on page two of Google results.

If you find that your prized content is not ranking highly in search results, the thing to do is create more quality, useful content, generate more links to that content, and wait. If you’re a public figure and find you are turning up in search results connected to negative events, the way to change that is to start doing a lot of good things. Media will report on the good things, and that’s what will appear connected to your name in a very short time.

Algorithms can have inherent bias, but they mostly reward content that is useful, often clicked and newsworthy.

The sculpture in the header image of this post was set alight, and burnt in a matter of minutes. So much for an ego.


Image: Art: Ego    |    Michael & Sandy   |   CC BY-NC-ND 2.0

It’s Not Google, It’s Us.


Mashable published an article under the title “Google Translate Might have a Gender Problem”, along with the evidence of the problem: a series of tweets. The complaint was that Google Translate translates the Turkish phrase “o bir doktor” as “he is a doctor” when in fact the Turkish doesn’t give any gender information.

How did this happen? English uses gendered pronouns, he and she, but not all languages do. Turkish uses one pronoun, “o”, regardless of gender. This means that to translate a text from Turkish to English a translator must decide whether to translate “o” as he or she. A human translator will look for evidence within the document to determine which pronoun to use in the translation.

Google Translate works in a different way: it’s essentially a big data project which uses existing translations on the internet and a statistical analysis of the proximity of words in phrases.

So the Google Translate engine has seen multiple instances where “o bir doktor” in Turkish was translated as “he is a doctor” in English. Or, where there are few matching translations between the two languages, it has seen that word sequence at a high frequency in English text. In fact another Google tool, ngrams, illustrates how much more commonly we think of doctors as male. Ngrams compares data from books rather than internet sites, but it does reflect how our culture assigns gender to the occupation of doctor.
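
A toy version of that statistical choice looks something like this. It is only a sketch of the idea, not how Google Translate is actually built, and the list of “observed” translations is invented for the example:

```python
from collections import Counter

# Invented sample of English translations seen for "o bir doktor".
observed_translations = [
    "he is a doctor",
    "he is a doctor",
    "she is a doctor",
    "he is a doctor",
]


def most_frequent_translation(candidates: list[str]) -> str:
    """Pick whichever translation appears most often in the sample."""
    return Counter(candidates).most_common(1)[0][0]


print(most_frequent_translation(observed_translations))
```

Nothing in that function knows anything about gender. It simply returns whatever the existing text says most often, which is the point.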

Doctor is associated with maleness in published text in English; the same pattern exists for engineers and soldiers. Unsurprisingly “he is a nurse” is far rarer in our books than “she is a nurse”.

Yes, there is misogyny on the internet. But Google Translate hasn’t created this; it has come out of our misogynist culture. Could we stop blaming Google for something that is far broader? Just stop it. In this case Google Translate is just a mirror.

Image: Stop  |  Kenny Louie  |  CC BY 2.0