The 100% correct way to validate email addresses

David Gilbertson
6 min read · Sep 4, 2016


Congratulations. From this day forward, you will no longer squander your time trying to work out the perfect regex to validate email addresses. You will also never again run the risk of rejecting what is, in fact, a strange, valid email address.

The trick is to first define what we mean by ‘valid’.

We are developers, we are technical folk, so it’s no surprise that the prevailing wisdom is to check that an address matches the official criteria. Some examples of the diversity of those criteria are listed here:

https://en.wikipedia.org/wiki/Email_address#Valid_email_addresses

But I say pish! to prevailing wisdom, so…

Everything you know is wrong

Instead of the above approach that largely ignores reality, I believe there are two questions we need to ask:

  1. Did the user understand that they were supposed to type an email address into this field?
  2. Did the user correctly type their email address into this field?

If you have a well laid-out form with a label that says “email”, and the user enters an ‘@’ symbol somewhere, then it’s safe to say they understood that they were supposed to be entering an email address. Easy.
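
As a minimal sketch of that first check (Python is used here purely for illustration, and the function name is hypothetical rather than taken from any library), the whole test for intent can be a single membership check:

```python
def looks_like_email_attempt(value: str) -> bool:
    """Return True if the input contains an '@' anywhere.

    This only signals that the user understood the field wants an
    email address; it says nothing about whether the address is correct.
    """
    return "@" in value


print(looks_like_email_attempt("davidgilbertson@example.com"))  # True
print(looks_like_email_attempt("david gilbertson"))             # False
```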

Next, we want to do some validation to ascertain if they correctly entered their email address.

Not possible.

It’s important that you agree with me on this point: it’s not possible.

I know what you’re thinking. “But it helps, right?” That’s like saying that opening and closing your fridge really quickly conserves energy and helps fight climate change. Sure, it helps, if we want to be slaves to the word ‘help’. But most people would agree you have a promising career in a straitjacket if you’re unnecessarily rattling your pickle jars for the benefit of the polar bears.

Let’s explore

Let’s imagine that my email address is davidgilbertson@example.com. That’s 27 stabs at the keyboard that could go awry. Any mistype will definitely result in an incorrect email address but only maybe result in an invalid email address.
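
To make the "incorrect but not invalid" point concrete, here is a small sketch. The pattern below is an assumption: a deliberately simple, hypothetical syntax check of the kind many forms use, not an official or recommended one. The three typos are made up for illustration.

```python
import re

# Hypothetical, deliberately simple pattern: something@something.something
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

intended = "davidgilbertson@example.com"
print(len(intended))  # 27 characters, so 27 chances to mistype

typos = [
    "davidgilbertsom@example.com",  # 'm' instead of 'n'
    "davidhilbertson@example.com",  # 'h' instead of 'g'
    "davidgilbertson@exanple.com",  # 'n' instead of 'm'
]

for address in typos:
    is_wrong = address != intended
    passes = SIMPLE_EMAIL.match(address) is not None
    print(address, "| wrong:", is_wrong, "| passes syntax check:", passes)
```

Every one of these addresses is wrong, and every one sails through the check. A stricter pattern would not change that, because each typo is still a perfectly well-formed address.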

[epiphany]

Even if the sun shone through my window and I was visited by a particularly savage sneeze (I suffer from Autosomal Dominant Compelling Helio-Ophthalmic Outburst Syndrome*) and I typed out #!$%&'*+-/=?^_`{}|~@example.com by mistake, I would still pass the most thorough email ‘validation’ techniques. (The flip side is that I could fail and be told my address isn’t valid when it is! On a whim I just emailed the person at #!$%&'*+-/=?^_`{}|~@example.com and she said she gets super pissed off when told that her email address isn’t valid. She regrets buying the example.com domain, too, but won’t give it up, just like the guy that’s got milk.com. We got chatting and it turns out she only lives a few blocks from me and also collects vintage cameras; we’re playing golf next week. I think maybe she’s the one. I should probably close these brackets and get on with the story.)

So what are the odds that any one typo would result in an invalid email address? We will build a statistical model! Let’s look at, say, the ‘g’. I am more likely to mistype with a letter on the visible keyboard with no shift key required (I apply a weighting to non-modified keys in the model). Of all the tappable keys on a physical keyboard, there are six characters that, while not completely invalid, are only valid in certain cases: []\;, and space. That’s 6 out of 48, a 12.5% chance.

But an off-by-one error is more likely. For example hitting the neighbouring ‘h’ key instead of ‘g’. So from a list of 117 million email addresses I have calculated the frequency of occurrence of each character and for each, noted which keys lie closest on the keyboard, and factored in the likelihood that a mis-stroke will create an invalid email address. (I know hacking LinkedIn just to make a point about email validation is a bit extreme, but it is important to back up one’s opinions with data).

For example, ‘e’ is considered a low risk of invalidating, because all surrounding keys would still result in a valid email address. But ‘p’ has [ and ; within striking distance! So although it’s less common than ‘e’, it carries a higher risk of resulting in an invalid email address if missed.
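
A toy version of that per-key comparison might look like the sketch below. It is not the model described in this post: the neighbour lists are hand-written assumptions for a standard QWERTY layout, only two keys are covered, and every neighbour is weighted equally.

```python
# Unshifted characters that are valid only in certain cases: [ ] \ ; , and space.
RISKY = set("[]\\;, ")

# Hypothetical, hand-written QWERTY neighbours for two keys only.
NEIGHBOURS = {
    "e": ["w", "r", "s", "d", "3", "4"],
    "p": ["o", "l", ";", "[", "0", "-"],
}


def risk_of_invalidating(key: str) -> float:
    """Fraction of neighbouring keys that could yield an invalid address."""
    nearby = NEIGHBOURS[key]
    return sum(1 for k in nearby if k in RISKY) / len(nearby)


print(risk_of_invalidating("e"))  # 0.0: no risky neighbours
print(risk_of_invalidating("p"))  # 2 of its 6 neighbours are risky
```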

I also consider the relative dexterity of the fingers. We all know that the pinky is the retarded cousin of the finger family, so that is factored in as well.

A graphical representation of the model showing the strike zone around the P, accounting for the shortcomings of the pinky.

Now, let’s say Silkie (fox) sits on the shift key and I hit the wrong key on the keyboard. I’m in danger of getting one of 6 bad keys: @:"<> and space. And again, those bad keys are only invalid in certain circumstances. And since it’s more likely that the shift key would be down only for the letters on either side of the @ symbol, an ‘l’ on either side of the @ is considered particularly dangerous.

The above is all for a single key, but if I mistype a second key, it is possible that I turn an invalid email address back into a valid one (e.g. adding a \ next to a \). This is, of course, factored into the model.

It goes without saying that I’ve gone to a similar level of effort to account for soft keyboards.

Remember too that if I mistype the @ symbol, the error will be caught by step one above where I actually check for the existence of an @ as a proxy for a user’s intent to enter an email address.

I also built in some general common sense: people with aol email addresses are sloppy typists. Daryls tend to poke at the keys with only their index fingers like they’re afraid each key will burn them. People with ‘z’ in their name use mechanical keyboards and rarely make mistakes. Your basic human axioms.

I also factored in the fact that any dot before the @ in gmail addresses is ignored and that ‘f’ and ‘h’ are pretty much the same letter when you think about it.

The result

So with all of that taken into account, I ran the 117 million email addresses through the model. And the odds that an incorrect email address will be caught by email validation are …

0.00000000000000000000000000000000000000625%

I’m afraid I don’t have time to type out the algorithm that totally exists and is indisputably perfect, so you’ll have to take my word for it that this number is not in any way made up.

The upshot

There is no point in trying to work out if an email address is ‘valid’. A user is far more likely to enter a wrong and valid email address than they are to enter an invalid one.

Therefore, you are better off spending your time doing literally any other thing than trying to validate email addresses.

The 100% correct way

Send your users an activation email. (That’s a bold full-stop for effect.)
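
For what it's worth, the activation flow could be sketched roughly like this. Everything here is an assumption made for illustration: the function names are hypothetical, the dictionaries stand in for a real database, and actually sending the email (with a link containing the token) is left out.

```python
import secrets

# In-memory stand-ins for a database of sign-ups (illustration only).
pending = {}       # token -> email address awaiting confirmation
confirmed = set()  # email addresses whose owner clicked the link


def start_signup(email: str) -> str:
    """Create an unguessable token; the caller emails a link containing it."""
    token = secrets.token_urlsafe(32)
    pending[token] = email
    return token


def confirm(token: str) -> bool:
    """Confirm the address if the token is known. Tokens are single-use."""
    email = pending.pop(token, None)
    if email is None:
        return False
    confirmed.add(email)
    return True


token = start_signup("davidgilbertson@example.com")
print(confirm(token))  # True: the link was clicked
print(confirm(token))  # False: the token has already been used
```

If the user mistyped their address, the email never reaches them, the token is never presented, and the address is never confirmed. No regex required.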

I have published a follow-up to this post that looks at how to help prevent your users from entering a wrong email address in the first place. With real life code! Go. Read. https://medium.com/@david.gilbertson/how-to-reduce-incorrect-email-addresses-df3b70cb15a9#.3y39aii0e

If you thought this post was pointless and silly, and would like more of the same, go check out my podcast, David reads Wikipedia. It is just what you think it is.
