Regular expression to match URLs with optional "www" and protocol
I'm trying to write a regular expression.
Some background: I'm trying to check whether the REQUEST_URI of my website's URL contains another URL, like this:
- http://mywebsite.com/google.com/search=xyz
However, the URL won't always contain 'http' or 'www', so the pattern should also match strings like:
- http://mywebsite.com/yahoo.org/search=xyz
- http://mywebsite.com/www.yahoo.org/search=xyz
- http://mywebsite.com/msn.co.uk
- http://mywebsite.com/http://msn.co.uk
There are plenty of regular expressions out there for matching URLs, but none that I have found treat both the 'http' and the 'www' as optional.
I'm wondering if the pattern could be something like:
^([a-z]).(com|ca|org|etc)(.)
Another option I considered was simply matching any string that has a dot (.) in it, since the other REQUEST_URIs in my application typically don't contain dots.
Does this make sense to anyone? I'd really appreciate some help with this; it's been blocking my project for weeks.
Many thanks,
Tim
Recommended answer
I suggest a simple approach that builds on what you said: match anything with a dot in it, but take the forward slashes into account too. That way you capture everything and don't miss unusual URLs. So something like:
^((?:https?://)?[^./]+(?:\.[^./]+)+(?:/.*)?)$
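It reads as:
- an optional http:// or https://
- one or more characters that are not a dot or forward slash
- one or more groups of a dot followed by characters that are not a dot or forward slash
- an optional forward slash and anything after it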
The whole thing is captured in the first group.
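The breakdown above can be written out as a commented pattern. This is only a minimal Python sketch, not part of the original answer: the names `URL_LIKE` and `is_url_like` are made up for illustration, and it assumes the dot that starts each group is meant as an escaped, literal dot (`\.`).

```python
import re

# Hypothetical name; the pattern is spelled out in verbose mode so that
# each piece lines up with the description in the answer.
URL_LIKE = re.compile(
    r"""
    ^(                  # group 1: captures the whole match
        (?:https?://)?  # optional http:// or https://
        [^./]+          # one or more characters that are not a dot or slash
        (?:\.[^./]+)+   # one or more groups: a literal dot, then non-dot, non-slash characters
        (?:/.*)?        # optional forward slash and anything after it
    )$
    """,
    re.VERBOSE,
)


def is_url_like(text: str) -> bool:
    """Return True if the whole string has the something.something shape."""
    return URL_LIKE.match(text) is not None


if __name__ == "__main__":
    for sample in ["nic.uk", "https://example.com/test/?a=bcd", "no-dots-here"]:
        print(sample, is_url_like(sample))
```

Verbose mode ignores whitespace and `#` comments outside character classes, so the compiled pattern behaves the same as the one-line version.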
It would match, for example:
nic.uk
nic.uk/
http://nic.uk
http://nic.uk/
https://example.com/test/?a=bcd
Verifying that they are valid URLs is another story! It would also match:
index.php
It would not match:
directory/index.php
The minimal match is basically something.something, with no forward slash in it unless the slash comes at least one character past the dot. So just be sure not to use that format for anything else.
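To tie this back to the question, here is a rough Python sketch of applying the pattern to a request path. The function name `extract_embedded_url` is hypothetical, and the sketch assumes the path arrives with a single leading slash (as in `/google.com/search=xyz`), which has to be stripped first because the pattern is anchored with `^` and its first character class excludes slashes.

```python
import re
from typing import Optional

# Same pattern as in the answer, with the dot escaped so it is literal.
PATTERN = re.compile(r"^((?:https?://)?[^./]+(?:\.[^./]+)+(?:/.*)?)$")


def extract_embedded_url(request_uri: str) -> Optional[str]:
    """Return the URL-like part of a request path, or None if there is none.

    Assumption: request_uri starts with one leading slash, which is removed
    before matching.
    """
    candidate = request_uri[1:] if request_uri.startswith("/") else request_uri
    match = PATTERN.match(candidate)
    return match.group(1) if match else None


if __name__ == "__main__":
    for uri in ["/google.com/search=xyz", "/http://msn.co.uk", "/about/contact"]:
        print(uri, "->", extract_embedded_url(uri))
```

Note that a path such as `/index.php` would still be reported as URL-like, which is the caveat described above: ordinary routes in the application should avoid the something.something shape in their first segment.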