John Dalesandro

Comprehensive Regex for URL Detection and Spam Filtering

Regular expressions are one of the most common tools used to identify URLs or links in user-generated content. Whether you’re moderating comments, filtering forum posts, or protecting contact forms, a reliable URL-matching regular expression (regex) can help detect spam before it reaches your users.

In this post, we’ll break down a regex for detecting URLs, explain how it works, discuss its role in spam detection, and highlight its limitations. We’ll finish with practical examples of where it succeeds and where it falls short.

A Word of Caution

This regex is a monster. Regular expressions are powerful but notoriously difficult to read, and this one is especially long. To help, I’ve formatted and commented it for clarity.

It’s important to note that this is not a strict URL validator, as it doesn’t fully comply with the URI specification (RFC 3986). Instead, it’s tailored for spam detection, where flexibility matters more than strict accuracy. Spammers often disguise links, so this regex leans toward catching more rather than less.

Practical Use

I use a modified version of this regex in my Analytical Spam Filter Plugin for WordPress. The plugin scans comments, identifies and counts URLs, and flags a comment as spam if too many links are detected. PHP’s built-in parse_url() and filter_var() handle a single, already-isolated URL, but neither finds URLs embedded in free text, so regex offers a practical solution.

The plugin version is slightly simplified to improve performance and reduce false positives, but it’s not as comprehensive as the full regex presented here.

This pattern was inspired by the article “An Improved Liberal, Accurate Regex Pattern for Matching URLs.” I built my own version to avoid maintaining a hard-coded TLD list (a perfectly valid approach) and as a personal challenge, given the complexity of handling internationalized domains and name variations.


The Regex

Here’s the full regex pattern we’ll analyze:

~
# HOSTNAME Subroutine
(?(DEFINE)
  (?<HOSTNAME>
    (?:
      (?:
        # Internationalized hostnames
        [\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?
        |
        # Punycode hostnames
        xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?
        |
        # ASCII hostnames
        [A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?
      )
      (?:\.
        (?:
          # Internationalized hostnames
          [\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?
          |
          # Punycode hostnames
          xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?
          |
          # ASCII hostnames
          [A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?
        )
      )+
      \.?
    )
  )
)

# Ensure no word characters before (boundary)
(?<![\p{L}\p{N}\p{M}_])
(?:
  (?:
    # With explicit scheme / protocol, e.g., http, https, ftp, ftps
    (?:https?|ftps?)://
    # Optional user information, e.g., username:[password]@
    (?:
      # Username
      [\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]+
      # Optional password
      (?::[\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]*)?
      @
    )?
    # Host / domain, e.g., localhost, domain name, IPv4 address, IPv6 address (in brackets)
    (?:
      localhost
      |
      # Internationalized / punycode / ASCII hostnames
      (?&HOSTNAME)
      |
      # IPv4
      (?:
        (?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)
        (?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}
      )
      |
      # IPv6
      \[[0-9A-Fa-f:]+\]
    )
  )
  |
    # Without explicit scheme / protocol
    # Internationalized / punycode / ASCII hostnames
    (?&HOSTNAME)
  )
  # Optional port
  (?::\d{1,5})?
  # Optional path, query, fragment/anchor
  (?:
    [/?#]
    (?:
      (?:%[0-9A-Fa-f]{2})
      |
      [A-Za-z0-9\-._\~]
      |
      [\p{L}\p{N}\p{M}\p{S}\p{P}]
    )*
  )?
# Ensure no word characters after (boundary)
(?![\p{L}\p{N}\p{M}_])
~isugx

Here’s the same regex pattern in a single line, with all formatting, comments, and the x mode modifier removed:

~(?(DEFINE)(?<HOSTNAME>(?:(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\.(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+\.?)))(?<![\p{L}\p{N}\p{M}_])(?:(?:(?:https?|ftps?)://(?:[\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]+(?::[\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]*)?@)?(?:localhost|(?&HOSTNAME)|(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|\[[0-9A-Fa-f:]+\]))|(?&HOSTNAME))(?::\d{1,5})?(?:[/?#](?:(?:%[0-9A-Fa-f]{2})|[A-Za-z0-9\-._\~]|[\p{L}\p{N}\p{M}\p{S}\p{P}])*)?(?![\p{L}\p{N}\p{M}_])~isug

Here’s the same regex pattern demonstrated in PHP using preg_match_all in a single line, with all formatting, comments, and the x and g mode modifiers removed. I’ve also escaped a few additional characters, perhaps excessively.

This code scans a string ($text_to_search) for every URL that matches the regex, collects them into an array ($matches), and returns a match count ($count).

$count = preg_match_all( '~(?(DEFINE)(?<HOSTNAME>(?:(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\.(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+\.?)))(?<![\p{L}\p{N}\p{M}_])(?:(?:(?:https?|ftps?)://(?:[\p{L}\p{N}\p{M}\-\._\~!$&\'()*+,;=%]+(?::[\p{L}\p{N}\p{M}\-\._\~!$&\'()*+,;=%]*)?@)?(?:localhost|(?&HOSTNAME)|(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|\[[0-9A-Fa-f:]+\]))|(?&HOSTNAME))(?::\d{1,5})?(?:[/?#](?:(?:%[0-9A-Fa-f]{2})|[A-Za-z0-9-._\~]|[\p{L}\p{N}\p{M}\p{S}\p{P}])*)?(?![\p{L}\p{N}\p{M}_])~isu', $text_to_search, $matches );
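The call above can be wrapped in a small helper that also handles the failure case. This is a minimal sketch, not the plugin's code: the function name extract_urls and the short stand-in pattern are assumptions made for illustration, and the full pattern would be passed in instead.

```php
<?php
// Hypothetical helper (not part of the plugin): return every substring that matches $pattern.
// $pattern must be a complete PCRE pattern, including delimiters and modifiers.
function extract_urls(string $pattern, string $text): array
{
    $count = preg_match_all($pattern, $text, $matches);

    if ($count === false) {
        // preg_match_all() returns false on failure, e.g. invalid UTF-8 with the u modifier.
        throw new RuntimeException('Regex failed with error code ' . preg_last_error());
    }

    // With the default flags, $matches[0] lists the full text of each match.
    return $matches[0];
}

// Short stand-in pattern for illustration only; substitute the full pattern shown above.
$stand_in = '~https?://[^\s<>"]+~iu';

$urls = extract_urls($stand_in, 'See https://example.com/a and http://example.org today');
print_r($urls);
```

Checking for false matters because a failed match would otherwise look the same as a text with no links.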

Step-by-Step Breakdown

Mode Modifiers / Flags

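Regex mode modifiers (sometimes called flags) are special switches that change the way the regex engine interprets your pattern. For this pattern, the modifiers isugx are used: i makes matching case-insensitive, s lets the dot match newlines, u enables Unicode (UTF-8) handling so that properties such as \p{L} work on whole characters, g requests every match rather than only the first, and x allows the whitespace and # comments used in the formatted version.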
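The effect of the i, u, and x modifiers can be seen in isolation with a few short PHP checks. The patterns below are simplified stand-ins chosen for illustration, not pieces of the full regex. PHP has no g modifier; preg_match_all() plays that role.

```php
<?php
// i: letters match regardless of case.
var_dump(preg_match('~https?://~', 'HTTP://EXAMPLE.COM'));   // int(0)
var_dump(preg_match('~https?://~i', 'HTTP://EXAMPLE.COM'));  // int(1)

// u: pattern and subject are treated as UTF-8, so \p{L} sees whole letters.
$city = "m\u{00FC}nchen"; // "muenchen" with a u-umlaut, written as an escape
var_dump(preg_match('~^\p{L}+$~u', $city));                  // int(1)

// x: whitespace and #-comments inside the pattern are ignored.
$verbose = '~
    https?   # scheme name
    ://      # separator
~ix';
var_dump(preg_match($verbose, 'HTTPS://example.com'));       // int(1)
```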

HOSTNAME Subroutine

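The (?(DEFINE) … ) block defines a reusable subpattern named HOSTNAME, which is later called with (?&HOSTNAME).

It matches three kinds of labels: internationalized labels built from Unicode letters, numbers, and combining marks; Punycode labels that begin with xn--; and plain ASCII labels.

It enforces a few structural rules. Each label must start and end with an alphanumeric character, with hyphens allowed only in the interior. Internationalized and ASCII labels are capped at 63 characters. A hostname needs at least two labels separated by dots, and a single trailing dot is permitted. The internationalized and Punycode branches also use negative lookaheads aimed at stray double hyphens near the start of a label.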
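The DEFINE technique is easier to see on a smaller scale. The sketch below is an ASCII-only simplification, not the full HOSTNAME subroutine: the group name LABEL and the ^ and $ anchors are assumptions added for the demonstration.

```php
<?php
// Simplified, ASCII-only sketch of the DEFINE technique; not the full HOSTNAME pattern.
$pattern = '~
    (?(DEFINE)
        # A label: starts and ends with a letter or digit, hyphens only inside.
        (?<LABEL> [a-z0-9] (?: [a-z0-9-]{0,61} [a-z0-9] )? )
    )
    # Two or more labels joined by dots, anchored for this demonstration.
    ^ (?&LABEL) (?: \. (?&LABEL) )+ $
~ix';

var_dump(preg_match($pattern, 'example.com'));        // int(1)
var_dump(preg_match($pattern, 'sub.example.co.uk'));  // int(1)
var_dump(preg_match($pattern, 'localhost'));          // int(0): no dot
var_dump(preg_match($pattern, '-bad.example'));       // int(0): leading hyphen
var_dump(preg_match($pattern, 'bad-.example'));       // int(0): trailing hyphen
```

Nothing inside the DEFINE block matches on its own. It only takes part in the match when called by name, which is what lets the full regex reuse HOSTNAME in both the scheme and scheme-less branches.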

Word Boundaries

Before and after the URL, the regex uses lookarounds:

(?<![\p{L}\p{N}\p{M}_])

(?![\p{L}\p{N}\p{M}_])

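These create custom boundaries before and after a URL. Unlike the standard \b (word boundary), they spell out exactly which neighboring characters block a match: Unicode letters, numbers, combining marks, and the underscore.

In many regex engines, \b is ASCII-centric unless Unicode matching is enabled, so the explicit character classes keep the boundary behavior predictable across scripts.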
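To see the two lookarounds at work, the sketch below wraps them around a fixed host name. The literal example\.com stands in for the URL portion of the pattern and is an assumption made for illustration.

```php
<?php
// The two lookarounds from the pattern, wrapped around a literal stand-in host.
$bounded = '~(?<![\p{L}\p{N}\p{M}_])example\.com(?![\p{L}\p{N}\p{M}_])~iu';

var_dump(preg_match($bounded, 'visit example.com today'));       // int(1)
var_dump(preg_match($bounded, 'visit (example.com)'));           // int(1): punctuation is fine
var_dump(preg_match($bounded, 'visit myexample.com today'));     // int(0): letter before
var_dump(preg_match($bounded, 'visit _example.com today'));      // int(0): underscore before
var_dump(preg_match($bounded, 'visit example.community'));       // int(0): letter after
var_dump(preg_match($bounded, "visit \u{00E9}example.com"));     // int(0): accented letter before
```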

Scheme Handling

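The regex supports URLs with an explicit scheme (http, https, ftp, or ftps, followed by ://) as well as scheme-less URLs that begin directly with a hostname, e.g., example.com.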

User Info

Optionally matches username:password@. Usernames may include letters, numbers, and common URL-safe symbols.

Hosts

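Hosts can be localhost, an internationalized, Punycode, or ASCII hostname, an IPv4 address, or an IPv6 address in square brackets. The IPv6 branch is deliberately loose: it accepts any run of hexadecimal digits and colons between the brackets. Without an explicit scheme, only the HOSTNAME subroutine is tried, so localhost and bracketed IPv6 addresses are matched only after a scheme.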
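The IPv4 branch can be tested on its own. The sketch below reuses the octet expression from the pattern and adds ^ and $ anchors, which are an assumption for the demonstration and not part of the full regex.

```php
<?php
// One octet, 0 to 255, copied from the IPv4 branch of the pattern.
$octet = '(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)';

// Four octets separated by dots; anchors added for this test only.
$ipv4 = '~^' . $octet . '(?:\.' . $octet . '){3}$~';

var_dump(preg_match($ipv4, '192.168.0.1'));      // int(1)
var_dump(preg_match($ipv4, '255.255.255.255'));  // int(1)
var_dump(preg_match($ipv4, '256.1.1.1'));        // int(0): first octet out of range
var_dump(preg_match($ipv4, '1.2.3'));            // int(0): only three octets
```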

Port

Optional numeric ports of up to 5 digits are matched, e.g., :80, :443, :65535. Validation is loose: only the digit count is checked, not the 0 to 65535 range, so :99999 is also accepted.
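If a stricter port check is needed, it can be done after the match rather than inside the regex. The helper below is a hypothetical sketch: the name is_valid_port and the choice to accept 1 through 65535 (excluding 0) are assumptions, not part of the original pattern.

```php
<?php
// Hypothetical helper: true when a port string is 1 to 5 digits and within 1..65535.
// Excluding port 0 is an assumption made for this sketch.
function is_valid_port(string $port): bool
{
    // \z anchors at the very end, so a trailing newline is not accepted.
    if (preg_match('~^\d{1,5}\z~', $port) !== 1) {
        return false;
    }

    $number = (int) $port;

    return $number >= 1 && $number <= 65535;
}

var_dump(is_valid_port('443'));    // bool(true)
var_dump(is_valid_port('65535'));  // bool(true)
var_dump(is_valid_port('99999'));  // bool(false)
```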

Path, Query, Fragment

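An optional suffix beginning with /, ?, or #. It supports percent-encoded bytes (e.g., %20), the unreserved ASCII characters (letters, digits, -, ., _, and ~), and Unicode letters, numbers, marks, symbols, and punctuation.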

This makes it flexible for internationalized paths.
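The suffix group can also be exercised in isolation. In this sketch the group is copied from the pattern and anchored with ^ and $, which is an assumption for testing only.

```php
<?php
// The path/query/fragment group from the pattern, anchored with ^ and $ for this test only.
$suffix = '~^[/?#](?:%[0-9A-Fa-f]{2}|[A-Za-z0-9\-._\~]|[\p{L}\p{N}\p{M}\p{S}\p{P}])*$~u';

var_dump(preg_match($suffix, "/caf\u{00E9}?q=1#top")); // int(1): Unicode letter, query, fragment
var_dump(preg_match($suffix, '/a%20b'));                // int(1): percent-encoded space
var_dump(preg_match($suffix, '/a b'));                  // int(0): a literal space is not allowed
var_dump(preg_match($suffix, '/%zz'));                  // int(1): % also counts as punctuation
```

The last case shows how permissive the group is: because % is itself a punctuation character, a malformed escape such as %zz is still accepted through the \p{P} alternative.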

Use for Spam Detection

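This regex is particularly useful in spam detection systems, where it can extract and count the links in a submission so that content with too many links can be flagged.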
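A link-count threshold of this kind might look like the sketch below. The function name has_too_many_links, the $max_links parameter, and the short stand-in pattern are all assumptions for illustration, not the plugin's actual settings or code.

```php
<?php
// Hypothetical helper: flag a text when it contains more than $max_links matches.
// $pattern and $max_links are illustrative parameters, not the plugin settings.
function has_too_many_links(string $text, string $pattern, int $max_links): bool
{
    $count = preg_match_all($pattern, $text);

    if ($count === false) {
        // Assumption: treat a regex failure as suspicious rather than letting it through.
        return true;
    }

    return $count > $max_links;
}

// Short stand-in pattern for illustration only; substitute the full pattern.
$stand_in = '~https?://\S+~iu';

var_dump(has_too_many_links('Nice post! http://a.example', $stand_in, 2)); // bool(false)
var_dump(has_too_many_links('http://a.example http://b.example http://c.example', $stand_in, 2)); // bool(true)
```

Whether a regex failure should count as spam is a policy choice. Failing closed, as here, sends broken input to moderation instead of publishing it.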

Gaps and Limitations

The main gaps follow from the design choices described above. The pattern is not an RFC-compliant validator, ports are checked for digit count rather than range, and links that use obfuscation or a scheme other than http, https, ftp, or ftps will slip through.

Summary

This regex is a balanced tool for URL detection. It handles Unicode domains, Punycode, IPs, and scheme-less URLs, while minimizing false positives. For spam filtering, it’s strong enough to count and extract legitimate-looking links, which is often all you need to flag spam submissions.

It’s not perfect: obfuscated links and exotic schemes can bypass it. As a first-pass filter in spam detection, though, it works well. Combine it with heuristics, blocklists, and DNS checks for the best results.

This regex is a practical front-line spam detection tool. It’s precise enough to catch what matters, loose enough not to miss too much, and structured clearly enough to maintain.