Comprehensive Regex for URL Detection and Spam Filtering
Regular expressions are one of the most common tools used to identify URLs or links in user-generated content. Whether you’re moderating comments, filtering forum posts, or protecting contact forms, a reliable URL-matching regular expression (regex) can help detect spam before it reaches your users.
In this post, we’ll break down a regex for detecting URLs, explain how it works, discuss its role in spam detection, and highlight its limitations. We’ll finish with practical examples of where it succeeds and where it falls short.
A Word of Caution
This regex is a monster. Regular expressions are powerful but notoriously difficult to read, and this one is especially long. To help, I’ve formatted and commented it for clarity.
It’s important to note that this is not a strict URL validator as it doesn’t fully comply with RFC specifications. Instead, it’s tailored for spam detection, where flexibility matters more than strict accuracy. Spammers often disguise links, so this regex leans toward catching more rather than less.
Practical Use
I use a modified version of this regex in my Analytical Spam Filter Plugin for WordPress. The plugin scans comments, identifies and counts URLs, and flags a comment as spam if too many links are detected. PHP’s built-in parse_url() and filter_var() only handle a string that is already a single URL; they can’t locate URLs inside free text, so regex offers a practical solution.
The plugin version is slightly simplified to improve performance and reduce false positives, but it’s not as comprehensive as the full regex presented here.
This pattern was also inspired by the article An Improved Liberal, Accurate Regex Pattern for Matching URLs. I built my own version to avoid maintaining a hard-coded TLD list (a perfectly valid approach) and as a personal challenge, given the complexity of handling internationalized domains and name variations.

The Regex
Here’s the full regex pattern we’ll analyze:
~
# HOSTNAME Subroutine
(?(DEFINE)
    (?<HOSTNAME>
        (?:
            (?:
                # Internationalized hostnames
                [\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?
                |
                # Punycode hostnames
                xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?
                |
                # ASCII hostnames
                [A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?
            )
            (?:\.
                (?:
                    # Internationalized hostnames
                    [\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?
                    |
                    # Punycode hostnames
                    xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?
                    |
                    # ASCII hostnames
                    [A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?
                )
            )+
            \.?
        )
    )
)
# Ensure no word characters before (boundary)
(?<![\p{L}\p{N}\p{M}_])
(?:
    (?:
        # With explicit scheme / protocol, e.g., http, https, ftp, ftps
        (?:https?|ftps?)://
        # Optional user information, e.g., username:[password]@
        (?:
            # Username
            [\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]+
            # Optional password
            (?::[\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]*)?
            @
        )?
        # Host / domain, e.g., localhost, domain name, IPv4 address, IPv6 address (in brackets)
        (?:
            localhost
            |
            # Internationalized / punycode / ASCII hostnames
            (?&HOSTNAME)
            |
            # IPv4
            (?:
                (?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)
                (?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}
            )
            |
            # IPv6
            \[[0-9A-Fa-f:]+\]
        )
    )
    |
    # Without explicit scheme / protocol
    # Internationalized / punycode / ASCII hostnames
    (?&HOSTNAME)
)
# Optional port
(?::\d{1,5})?
# Optional path, query, fragment/anchor
(?:
    [/?#]
    (?:
        (?:%[0-9A-Fa-f]{2})
        |
        [A-Za-z0-9\-._\~]
        |
        [\p{L}\p{N}\p{M}\p{S}\p{P}]
    )*
)?
# Ensure no word characters after (boundary)
(?![\p{L}\p{N}\p{M}_])
~isugx
Here’s the same regex pattern in a single line, with all formatting, comments, and the x mode modifier removed:
~(?(DEFINE)(?<HOSTNAME>(?:(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\.(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+\.?)))(?<![\p{L}\p{N}\p{M}_])(?:(?:(?:https?|ftps?)://(?:[\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]+(?::[\p{L}\p{N}\p{M}\-._\~!$&'()*+,;=%]*)?@)?(?:localhost|(?&HOSTNAME)|(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|\[[0-9A-Fa-f:]+\]))|(?&HOSTNAME))(?::\d{1,5})?(?:[/?#](?:(?:%[0-9A-Fa-f]{2})|[A-Za-z0-9\-._\~]|[\p{L}\p{N}\p{M}\p{S}\p{P}])*)?(?![\p{L}\p{N}\p{M}_])~isug
Here’s the same regex pattern demonstrated in PHP using preg_match_all in a single line, with all formatting, comments, and the x and g mode modifiers removed. I’ve also escaped a few additional characters, perhaps excessively.
This code scans a string ($text_to_search) for every URL that matches the regex, collects them into an array ($matches), and returns a match count ($count).
$count = preg_match_all( '~(?(DEFINE)(?<HOSTNAME>(?:(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\.(?:[\p{L}\p{N}\p{M}](?![\p{L}\p{N}\p{M}-]{1,2}--)(?:[\p{L}\p{N}\p{M}-]{0,61}[\p{L}\p{N}\p{M}])?|xn--(?![A-Za-z0-9]{2}--)(?:[A-Za-z0-9-]{1,59}[A-Za-z0-9])?|[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+\.?)))(?<![\p{L}\p{N}\p{M}_])(?:(?:(?:https?|ftps?)://(?:[\p{L}\p{N}\p{M}\-\._\~!$&\'()*+,;=%]+(?::[\p{L}\p{N}\p{M}\-\._\~!$&\'()*+,;=%]*)?@)?(?:localhost|(?&HOSTNAME)|(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|\[[0-9A-Fa-f:]+\]))|(?&HOSTNAME))(?::\d{1,5})?(?:[/?#](?:(?:%[0-9A-Fa-f]{2})|[A-Za-z0-9-._\~]|[\p{L}\p{N}\p{M}\p{S}\p{P}])*)?(?![\p{L}\p{N}\p{M}_])~isu', $text_to_search, $matches );
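One practical detail: preg_match_all returns false, not a count, when PCRE reports an error, for instance when the subject is not valid UTF-8 under the u modifier. Here is a minimal sketch of a defensive wrapper. The function name count_url_matches and the short stand-in pattern are assumptions for illustration only; the full pattern above would be passed in its place.

```php
<?php
/**
 * Count pattern matches in $text, or return null if PCRE reports an error.
 * Hypothetical helper; not part of the plugin described in this post.
 */
function count_url_matches(string $pattern, string $text): ?int
{
    $count = preg_match_all($pattern, $text, $matches);

    // false signals a PCRE error (bad UTF-8, backtrack limit, ...), not "no matches".
    if ($count === false) {
        return null;
    }

    return $count;
}

// Simplified stand-in pattern for the demo only; far less thorough than the full regex.
$stand_in = '~(?<![\p{L}\p{N}\p{M}_])https?://[^\s]+~iu';

var_dump(count_url_matches($stand_in, 'See https://example.com and http://example.net')); // int(2)
var_dump(count_url_matches($stand_in, "broken \xFF bytes"));                              // NULL
```

When null comes back, preg_last_error() tells you which error occurred, so a spam filter can decide whether to skip the check or treat the input as suspicious.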
Step-by-Step Breakdown
Mode Modifiers / Flags
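Regex mode modifiers (sometimes called flags) are special switches that change the way the regex engine interprets your pattern. For this pattern, the modifiers isugx are used.
- i: Case-insensitive (so both xn-- and XN-- will match).
- s: Dot matches newlines. The pattern contains no unescaped dot, so this modifier has no practical effect here.
- u: Unicode enabled (so \p{L} and other classes work).
- g: Global match (finds all URLs in text). It is a flag in some regex testers and engines (JavaScript, for example); PHP has no g modifier, which is why preg_match_all is used there instead.
- x: Free-spacing mode (allows spaces and comments in the regex).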
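In PHP you don't have to collapse the pattern to one line, because the x modifier works there too. As a rough sketch, a nowdoc string keeps the commented layout without any quote escaping. The fragment below is only the IPv4 portion wrapped in the boundary lookarounds, and the variable names are assumptions.

```php
<?php
// Only the IPv4 fragment of the pattern, kept in free-spacing form.
// Caution: the delimiter character must not appear unescaped anywhere in the
// string, including inside comments, or PHP treats it as the end of the pattern.
$ipv4_pattern = <<<'REGEX'
~
    (?<![\p{L}\p{N}\p{M}_])                          # nothing word-like before
    (?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)              # first octet, 0 to 255
    (?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}     # three more octets
    (?![\p{L}\p{N}\p{M}_])                           # nothing word-like after
~ux
REGEX;

// No "g" modifier here: preg_match_all() itself finds every match.
$count = preg_match_all($ipv4_pattern, 'Hosts: 10.0.0.1, 192.168.1.20 and 300.1.1.1', $matches);

print_r($matches[0]); // 10.0.0.1 and 192.168.1.20; 300.1.1.1 is rejected
```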
HOSTNAME Subroutine
The (?(DEFINE) … ) block defines a reusable subpattern named HOSTNAME, which is later called with (?&HOSTNAME).
It matches:
- Internationalized domain names (IDN) using Unicode characters.
- Punycode labels starting with xn-- (with safeguards against malformed values).
- ASCII hostnames with alphanumerics and internal hyphens.
It enforces:
- Labels cannot begin or end with a hyphen.
- Labels are limited to 63 characters (the Punycode branch, as written, allows up to 64 once the xn-- prefix is counted).
- Domains must have at least two labels separated by dots.
- An optional trailing dot, which is valid in DNS.
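To see the DEFINE mechanism in isolation, here is a reduced sketch that defines a single ASCII-only label and reuses it, anchored so it tests a whole string. The group name LABEL, the constant HOSTNAME_SKETCH, and the helper looks_like_hostname are assumptions; the Unicode and Punycode branches are left out.

```php
<?php
// Reduced illustration of (?(DEFINE)...) with one reusable ASCII label.
const HOSTNAME_SKETCH = '~
    (?(DEFINE)
        (?<LABEL> [A-Za-z0-9] (?: [A-Za-z0-9-]{0,61} [A-Za-z0-9] )? )
    )
    \A (?&LABEL) (?: \. (?&LABEL) )+ \.? \z
~x';

function looks_like_hostname(string $candidate): bool
{
    return preg_match(HOSTNAME_SKETCH, $candidate) === 1;
}

var_dump(looks_like_hostname('example.com'));                  // bool(true)
var_dump(looks_like_hostname('example.com.'));                 // bool(true), trailing dot allowed
var_dump(looks_like_hostname('-bad.example'));                 // bool(false), leading hyphen
var_dump(looks_like_hostname(str_repeat('a', 64) . '.com'));   // bool(false), label too long
```

Nothing inside the DEFINE block matches on its own; it only takes effect where (?&LABEL) is called, which is what lets the full pattern reuse HOSTNAME in both the scheme and scheme-less branches.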
Word Boundaries
Before and after the URL, the regex uses lookarounds:
(?<![\p{L}\p{N}\p{M}_])
…
(?![\p{L}\p{N}\p{M}_])
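These create custom boundaries before and after a URL. Unlike the standard \b (word boundary), they:
- Name Unicode letters, numbers, and combining marks explicitly, regardless of how the engine defines word characters.
- Only require that no word character sits immediately outside the URL, so a match may end in punctuation such as / (\b needs a word character on one side).
- Prevent partial matches inside words.
Like \b, they treat the underscore as a word character, so a URL glued to an underscore is not matched. In many engines \b is ASCII-centric unless a Unicode mode is enabled.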
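The practical difference is easiest to see on a URL that ends in punctuation. The comparison below is a sketch using two deliberately simplified stand-in patterns (both assumptions, not the full regex) that differ only in how they mark boundaries.

```php
<?php
// Stand-in URL bodies for comparison only; both are far simpler than the full pattern.
$with_b    = '~\bhttps?://[a-z0-9.-]+(?:/\S*)?\b~iu';
$with_look = '~(?<![\p{L}\p{N}\p{M}_])https?://[a-z0-9.-]+(?:/\S*)?(?![\p{L}\p{N}\p{M}_])~iu';

$text = 'Go to https://example.com/path/ now';

preg_match($with_b, $text, $m_b);
preg_match($with_look, $text, $m_look);

echo $m_b[0], "\n";     // https://example.com/path   (trailing slash dropped)
echo $m_look[0], "\n";  // https://example.com/path/

// Both approaches refuse a URL glued to an underscore, since "_" counts as a word character.
var_dump(preg_match($with_look, 'id_https://example.com')); // int(0)
```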
Scheme Handling
The regex supports:
- Explicit schemes: http://, https://, ftp://, ftps://
- Scheme-less URLs: example.com, example.net
User Info
Optionally matches username:password@. Usernames may include letters, numbers, and common URL-safe symbols.
Hosts
Hosts can be:
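- localhost
- Hostnames (example.com, example.net).
- IPv4 addresses (0.0.0.0 to 255.255.255.255) with strict validation.
- IPv6 addresses ([...]), though loosely validated as the pattern accepts any 0-9A-Fa-f: combination.
Note that HOSTNAME is tried before the IPv4 branch and accepts all-digit labels, so an out-of-range address such as 999.999.999.999 can still match as a hostname.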
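To make the strict-versus-loose contrast concrete, here is a small sketch that anchors the IPv4 and IPv6 branches on their own. The helper is_valid_ipv6_literal is a hypothetical follow-up check built on PHP's filter_var; it is not something the regex does itself.

```php
<?php
// The IPv4 and IPv6 branches copied from the pattern, anchored to test whole strings.
$ipv4 = '~\A(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}\z~';
$ipv6 = '~\A\[[0-9A-Fa-f:]+\]\z~';

var_dump(preg_match($ipv4, '255.255.255.255')); // int(1)
var_dump(preg_match($ipv4, '256.1.1.1'));       // int(0)
var_dump(preg_match($ipv6, '[::1]'));           // int(1)
var_dump(preg_match($ipv6, '[:::::]'));         // int(1), although this is not a valid address

// Hypothetical stricter check on the bracket contents using PHP's built-in validator.
function is_valid_ipv6_literal(string $bracketed): bool
{
    $inner = trim($bracketed, '[]');
    return filter_var($inner, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(is_valid_ipv6_literal('[::1]'));   // bool(true)
var_dump(is_valid_ipv6_literal('[:::::]')); // bool(false)
```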
Port
Optional numeric ports are matched up to 5 digits, e.g., :80, :443, :65535. This is loosely validated as it will accept any combination, e.g., :99999.
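If the port range matters for your use case, it can be checked after the match. This is a minimal sketch; the function name port_in_range is an assumption, and it expects the digits without the leading colon.

```php
<?php
// Hypothetical post-match check: the regex accepts any 1 to 5 digits, so verify the range.
function port_in_range(string $port_digits): bool
{
    // Mirror the regex first: digits only, at most five of them.
    if (!ctype_digit($port_digits) || strlen($port_digits) > 5) {
        return false;
    }

    return (int) $port_digits <= 65535;
}

var_dump(port_in_range('443'));   // bool(true)
var_dump(port_in_range('65535')); // bool(true)
var_dump(port_in_range('99999')); // bool(false)
```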
Path, Query, Fragment
An optional suffix beginning with /, ?, or #. It supports:
- Percent-encoded sequences (%20).
- Safe URL characters (-, ., _, and ~).
). - Unicode categories: letters, numbers, marks, symbols, punctuation.
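This makes it flexible for internationalized paths. Because punctuation is accepted, sentence punctuation directly after a URL (such as a closing period) becomes part of the match.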
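Here is a short sketch of the suffix portion running on its own. The sample string is made up, and the rtrim cleanup at the end is just one possible way to tidy a match for display; the regex itself does no such trimming.

```php
<?php
// The path/query/fragment portion of the pattern on its own.
$suffix = '~[/?#](?:%[0-9A-Fa-f]{2}|[A-Za-z0-9\-._\~]|[\p{L}\p{N}\p{M}\p{S}\p{P}])*~u';

preg_match($suffix, 'example.com/caf%C3%A9/日本語?q=1#top.', $m);

echo $m[0], "\n";                    // /caf%C3%A9/日本語?q=1#top.
echo rtrim($m[0], '.,;:!?'), "\n";   // /caf%C3%A9/日本語?q=1#top
```

For link counting the trailing period is harmless, since the URL is still found exactly once.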
Use for Spam Detection
This regex is particularly useful in spam detection systems:
- Counting links: Flag comments with too many URLs.
- Extracting domains: Hostnames pulled from the matched URLs can be compared against blocklists.
- Blocking suspicious patterns: For example, IP-based links are often spam signals.
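The three ideas above could be combined roughly as follows. This is a sketch, not the plugin's code: the function analyze_links, its array keys, the threshold, and the blocklist entry are all assumptions, and the demo passes a simplified stand-in pattern where the full regex would go.

```php
<?php
/**
 * Hypothetical scoring helper: counts links, collects hosts, and notes IP-based links.
 * $url_pattern is whichever URL regex you use.
 */
function analyze_links(string $url_pattern, string $text, array $blocklist, int $max_links): array
{
    $count = preg_match_all($url_pattern, $text, $matches);
    if ($count === false) {
        // PCRE error: treat as "no links found" in this sketch.
        $count   = 0;
        $matches = [[]];
    }

    $hosts = [];
    foreach ($matches[0] as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if (is_string($host)) {
            $hosts[] = strtolower($host);
        }
    }

    return [
        'too_many_links' => $count > $max_links,
        'blocklisted'    => array_values(array_intersect($hosts, $blocklist)),
        'ip_links'       => array_values(array_filter(
            $hosts,
            fn (string $h): bool => filter_var(trim($h, '[]'), FILTER_VALIDATE_IP) !== false
        )),
    ];
}

// Simplified stand-in pattern for the demo only.
$stand_in = '~(?<![\p{L}\p{N}\p{M}_])https?://[^\s<>"]+~iu';
$comment  = 'Cheap pills at http://spam.example/buy and http://203.0.113.7/x or https://ok.example';

$result = analyze_links($stand_in, $comment, ['spam.example'], 2);
print_r($result);
```

One caveat: parse_url only reports a host when the URL carries a scheme, so scheme-less matches such as example.com would need a scheme prepended before this step.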
Gaps and Limitations
- IPv6: \[[0-9A-Fa-f:]+\] matches any sequence of colons and hex digits, not strictly valid IPv6.
- Obfuscations: Variants like example(dot)com won’t match, and for hxxp://example.com only the hostname is picked up, through the scheme-less branch.
- Missing schemes: mailto: and data: are not covered.
- Underscores: Domains with underscores are rejected, though some DNS labels permit them.
- Domain length: Total maximum length (253 characters) is not enforced.
Summary
This regex is a balanced tool for URL detection. It handles Unicode domains, Punycode, IPs, and scheme-less URLs, while minimizing false positives. For spam filtering, it’s strong enough to count and extract legitimate-looking links, which is often all you need to flag spam submissions.
It’s not perfect: obfuscations and exotic schemes can bypass it. As a first-pass filter in spam detection, though, it works well. Combine it with heuristics, blocklists, and DNS checks for the best results.
This regex is a practical front-line spam detection tool. It’s precise enough to catch what matters, loose enough not to miss too much, and structured clearly enough to maintain.