Sunday, 2 November 2008

Detecting URLs in a Block of Text

Filed under: Regex Examples — Jan Goyvaerts @ 7:57

In his blog post The Problem with URLs, Jeff Atwood points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.

The short answer is that it can’t be done. Pretty much any character is valid in URLs. The very simplistic \bhttp://\S+ not only fails to differentiate between punctuation that’s part of the URL and punctuation used to quote the URL, it also fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I’ve encountered quite a few web sites that use them over the years. It also forgets other protocols, such as https.

In RegexBuddy’s library, you’ll find this regex if you look up “URL: Find in full text”:

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|] (case insensitive)

Like every other regex for extracting URLs, it’s not perfect. Its key benefit is that it uses a separate character class for the last character of the URL, one that allows fewer punctuation characters than the class used for the rest of the URL. This excludes punctuation that is unlikely to occur at the end of a URL and more likely to belong to the sentence the URL is quoted in. It does not allow parentheses at all.
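To see the effect of the separate final character class, here is a minimal Python sketch that runs the regex above over a sentence. The sample text and the helper name find_urls are made up for illustration; only the pattern comes from the article.

```python
import re

# First regex from the article ("URL: Find in full text"), case insensitive.
URL_RE = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)

def find_urls(text):
    """Return every URL matched in text, in order of appearance."""
    # finditer is used rather than findall because the pattern contains a
    # capturing group; findall would return only the protocol part.
    return [m.group(0) for m in URL_RE.finditer(text)]

# Hypothetical sample sentence with sentence punctuation right after each URL.
sample = "See http://www.regexguru.com/2008/11/, or ftp://example.com/file.txt."
print(find_urls(sample))
```

The comma after the first URL and the full stop after the second are left out of the matches, because neither character is allowed in the final position.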

In EditPad Pro’s syntax coloring schemes, which are fully editable and entirely based on regular expressions, you’ll often find this regex:

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
(case insensitive)

The main difference from the previous regex is that this one also matches URLs such as www.regexguru.com that lack the protocol. People often type URLs that way in their documents and messages, because most browsers accept them that way too.
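A short Python sketch of this second regex, assuming a made-up sample sentence, shows that a URL is picked up with or without a protocol:

```python
import re

# Second regex from the article: the protocol is optional when the URL
# starts with "www." or "ftp.". Compiled case-insensitively.
URL_RE = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

# Hypothetical sample text; the pattern has no capturing groups, so
# findall returns the full text of each match.
sample = "Visit www.regexguru.com, or https://www.regexbuddy.com/library.html!"
matches = URL_RE.findall(sample)
print(matches)
```

The trailing comma and exclamation mark are again excluded by the stricter character class for the last character.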

EditPad’s built-in “clickable URLs” syntax highlighting uses this regex:

\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
   | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"?
|'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'?
(free-spacing, case insensitive)

This long regex adds three alternatives to the previous regex. It adds the ability to match email addresses, with or without mailto:, and it matches URLs between single or double quotes. When the URL is quoted, it allows all characters in the URL, except line breaks and the delimiting quote. This way, any URL with weird punctuation can be highlighted correctly by placing it between a pair of quote characters. Because this regex is used to highlight text as you type, the closing quotes are optional: the highlighting runs to the end of the line until you type the closing quote. Remove the question marks after the quote characters if you want to use this regex to extract URLs.
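As a rough sketch of the extraction variant, the Python code below uses the regex with the question marks after the closing quotes removed. The helper extract_links, its quote-stripping step, and the sample text are my own assumptions for illustration, not something EditPad does.

```python
import re

# Extraction variant of the EditPad regex: the "?" after each closing quote
# is removed, so a quoted URL only matches when its closing quote is present.
EXTRACT_RE = re.compile(
    r"""
    \b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
       | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
    |"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"
    |'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'
    """,
    re.IGNORECASE | re.VERBOSE,  # free-spacing, case insensitive
)

def extract_links(text):
    """Return matched URLs and email addresses, without delimiting quotes."""
    links = []
    for m in EXTRACT_RE.finditer(text):
        link = m.group(0)
        # Quoted alternatives include the quotes in the match; drop them.
        if link[0] in "\"'":
            link = link[1:-1]
        links.append(link)
    return links

# Hypothetical sample: a quoted URL containing spaces and parentheses,
# followed by a bare email address.
sample = 'Open "http://example.com/my file (v2).txt" or mail jan@example.com today.'
print(extract_links(sample))
```

The quoted URL is returned whole, spaces and parentheses included, because everything up to the closing quote is accepted.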

So how about Jeff’s problem?

“I couldn’t come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.”

That’s not too hard, if we add the restriction that we only allow unnested pairs of parentheses in URLs. Using the second regex in this article as the starting point, add an alternative for a pair of parentheses to both character classes in that regex:

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
(free-spacing, case insensitive)

This regex allows the same set of characters in the middle of the URL, mixed with zero or more sequences of those characters between parentheses. It allows the URL to end with the same reduced set of characters, or a final run between parentheses. Because we require the opening parenthesis to be in the URL, we don’t have to do anything complicated to check if any closing parentheses we encounter are part of the URL or not.
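The following Python sketch tries this regex on the two situations Jeff describes. The sample strings and the helper first_url are hypothetical; the pattern is the one shown above.

```python
import re

# Regex from the article that allows unnested pairs of parentheses in URLs.
PAREN_URL_RE = re.compile(
    r"""
    \b(?:(?:https?|ftp|file)://|www\.|ftp\.)
      (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
      (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
    """,
    re.IGNORECASE | re.VERBOSE,  # free-spacing, case insensitive
)

def first_url(text):
    """Return the first URL found in text, or None if there is none."""
    m = PAREN_URL_RE.search(text)
    return m.group(0) if m else None

# A URL that legitimately ends in a parenthesized part.
wiki = "http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)"
# A URL that the writer enclosed in parentheses.
enclosed = "(see http://www.example.com/page)"
# Both at once: a parenthesized URL that itself ends in parentheses.
both = "(http://en.wikipedia.org/wiki/Rust_(programming_language))"

for text in (wiki, enclosed, both):
    print(first_url(text))
```

In the second and third strings the final closing parenthesis stays out of the match, since no opening parenthesis inside the URL pairs with it.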

Note that in order to allow any number of pairs of parentheses in the middle of the URL, I moved the star from the character class to the group that now contains it. I did not add another star to the group. A double-star combination like (a|b*)* is a sure-fire recipe for catastrophic backtracking.

All the regexes in this article will be included in RegexBuddy’s library with the next free minor update. Current version is 3.2.0.

7 Comments

  1. [...] The original author is Jan Goyvaerts (Regex Guru), and the original page is Detecting URLs in a Block of Text, [...]

    Pingback by [译]从文本中析取有效URL链接 | 我爱正则表达式 — Friday, 7 November 2008 @ 15:11

  2. thank you, I was using this:
    (^|[>[:space:]\n])([[:alnum:]]+)://([^[:space:]]*)([[:alnum:]#?/&=])([<[:space:]\n]|$)
    from
    archives.neohapsis.com/archives/php/2000-05/0007.html

    but I was having an issue with URLS inside a quote:
    w/w/w-w/w.php?w=1hum_1vol/QUOTE
    It was taking the [/QUOTE] as part of the URL and then my “dangling quotation clean up” messed up the posts. This seems to fix it when I did this:
    /(\b(?:(?:https?|ftp|file|[A-Za-z]+):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$]))/i

    So that people can put in other protocols.

    Comment by revaaron — Thursday, 13 May 2010 @ 20:21

  3. Hey,

    The regex works great. The only problem I’ve seen is that it doesn’t detect URLs with HTML entity ampersands (&amp;) very well. It doesn’t seem to like the semicolon.

    I changed the regex to this which seems to do the job:

    /(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])/ix

    Comment by Joe Holdcroft — Friday, 24 September 2010 @ 18:12

    Technically, spaces are NOT valid in a URL (they must be encoded as %20). Just because some people use them does not make it correct (the forgiving browsers automatically perform the translation, similar to how they handle conversion of Unicode to punycode). For anyone interested, I’ve put together a regex solution for plucking URLs from HTML text which correctly handles most of the tough cases cited in Jeff Atwood’s blog post. It handles bracketed URLs and HTML entities as well. See: URL Linkification (HTTP/FTP). It’s not simple, but neither is the problem!

    Comment by ridgerunner — Tuesday, 29 March 2011 @ 1:40

  5. RFC 3986 allows spaces in URLs.

    Comment by Jan Goyvaerts — Wednesday, 30 March 2011 @ 9:18

  6. I got the first version to work perfectly, but I am trying to handle the results. Can you tell me how to handle multiple matches? I’m sure it’s out there somewhere, I just can’t find it. Thanks in advance!

    Comment by jonthenoob — Friday, 13 May 2011 @ 6:30

    What about the backslash? According to the web site Backslash in web authoring, the backslash should be supported.

    The following link, Extract urls using Java, supports the backslash using the following pattern:

    ((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)

    What do you think about it?

    Comment by David — Monday, 20 June 2011 @ 17:41
