Commit d2e0f0cd authored Sep 06, 2013 by yihua.huang
#25 use URL api in UrlUtils.canonicalizeUrl()
parent 363fd38c
Changes: 2
Showing 2 changed files with 23 additions and 36 deletions (+23 -36)
UrlUtils.java  ...e/src/main/java/us/codecraft/webmagic/utils/UrlUtils.java  +20 -32
UrlUtilsTest.java  ...c/test/java/us/codecraft/webmagic/utils/UrlUtilsTest.java  +3 -4
webmagic-core/src/main/java/us/codecraft/webmagic/utils/UrlUtils.java
@@ -2,6 +2,8 @@ package us.codecraft.webmagic.utils;
 import org.apache.commons.lang3.StringUtils;
+import java.net.MalformedURLException;
+import java.net.URL;
 import java.nio.charset.Charset;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
@@ -18,45 +20,31 @@ public class UrlUtils {
     /**
      * canonicalizeUrl
      *
+     * Borrowed from Jsoup.
+     *
      * @param url
      * @param refer
      * @return canonicalizeUrl
      */
     public static String canonicalizeUrl(String url, String refer) {
-        if (StringUtils.isBlank(url) || StringUtils.isBlank(refer)) {
-            return url;
-        }
-        if (url.startsWith("http") || url.startsWith("ftp") || url.startsWith("mailto") || url.startsWith("javascript:")) {
-            return url;
-        }
-        if (StringUtils.startsWith(url, "/")) {
-            String host = getHost(refer);
-            return host + url;
-        } else if (!StringUtils.startsWith(url, ".")) {
-            refer = reversePath(refer, 1);
-            return refer + "/" + url;
-        } else {
-            Matcher matcher = relativePathPattern.matcher(url);
-            if (matcher.find()) {
-                int reverseDepth = matcher.group(1).length();
-                refer = reversePath(refer, reverseDepth);
-                String substring = StringUtils.substring(url, matcher.end());
-                return refer + "/" + substring;
-            } else {
-                refer = reversePath(refer, 1);
-                return refer + "/" + url;
-            }
-        }
-    }
-    public static String reversePath(String url, int depth) {
-        int i = StringUtils.lastOrdinalIndexOf(url, "/", depth);
-        if (i < 10) {
-            url = getHost(url);
-        } else {
-            url = StringUtils.substring(url, 0, i);
+        URL base;
+        try {
+            try {
+                base = new URL(refer);
+            } catch (MalformedURLException e) {
+                // the base is unsuitable, but the attribute may be abs on its own, so try that
+                URL abs = new URL(refer);
+                return abs.toExternalForm();
+            }
+            // workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
+            if (url.startsWith("?"))
+                url = base.getPath() + url;
+            URL abs = new URL(base, url);
+            return abs.toExternalForm();
+        } catch (MalformedURLException e) {
+            return "";
         }
-        return url;
     }
     public static String getHost(String url) {
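The new method body delegates resolution to `java.net.URL` instead of hand-rolled path arithmetic. The following is a minimal standalone sketch of that approach, not the project's code: the class name `CanonicalizeSketch` and method name `resolve` are hypothetical. One deliberate difference from the diff is labeled in the code: when the base cannot be parsed, this sketch retries with the link itself, which is what the inline comment ("the attribute may be abs on its own") describes, whereas the diff passes `refer` again.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class CanonicalizeSketch {

    // Hypothetical stand-in for UrlUtils.canonicalizeUrl, for illustration only.
    static String resolve(String url, String refer) {
        try {
            URL base;
            try {
                base = new URL(refer);
            } catch (MalformedURLException e) {
                // Assumption: the base is unusable, so try the link on its own.
                // (The diff above passes refer here instead of url.)
                return new URL(url).toExternalForm();
            }
            // Same workaround as in the diff: prepend the base path to a
            // query-only link before resolving it.
            if (url.startsWith("?")) {
                url = base.getPath() + url;
            }
            return new URL(base, url).toExternalForm();
        } catch (MalformedURLException e) {
            // Neither the base nor the link could be parsed.
            return "";
        }
    }

    public static void main(String[] args) {
        String refer = "http://www.dianping.com/sh/ss/com";
        System.out.println(resolve("../aa", refer));    // http://www.dianping.com/sh/aa
        System.out.println(resolve("/aa", refer));      // http://www.dianping.com/aa
        System.out.println(resolve("?page=2", refer));  // http://www.dianping.com/sh/ss/com?page=2
        System.out.println(resolve("http://example.com/x", "not a url")); // http://example.com/x
    }
}
```

With the fallback written this way, an absolute link still resolves when the referring page URL is malformed; if both are unparseable the method returns an empty string, as in the diff.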
webmagic-core/src/test/java/us/codecraft/webmagic/utils/UrlUtilsTest.java
@@ -19,13 +19,12 @@ public class UrlUtilsTest {
        fixrelativeurl = UrlUtils.canonicalizeUrl("../aa", "http://www.dianping.com/sh/ss/com");
        Assert.assertEquals("http://www.dianping.com/sh/aa", fixrelativeurl);
        fixrelativeurl = UrlUtils.canonicalizeUrl("..../aa", "http://www.dianping.com/sh/ss/com");
        Assert.assertEquals("http://www.dianping.com/aa", fixrelativeurl);
        fixrelativeurl = UrlUtils.canonicalizeUrl(".../aa", "http://www.dianping.com/sh/ss/com");
        Assert.assertEquals("http://www.dianping.com/aa", fixrelativeurl);
        fixrelativeurl = UrlUtils.canonicalizeUrl("..aa", "http://www.dianping.com/sh/ss/com");
        Assert.assertEquals("http://www.dianping.com/sh/ss/..aa", fixrelativeurl);
        fixrelativeurl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com/");
        Assert.assertEquals("http://www.dianping.com/sh/aa", fixrelativeurl);
        fixrelativeurl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com");
        Assert.assertEquals("http://www.dianping.com/aa", fixrelativeurl);
    }
    @Test
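The test inputs mix real dot-segments (`../`) with look-alikes (`..aa`, `.../aa`, `..../aa`). Under `java.net.URL` resolution only `.` and `..` are treated as dot-segments; any other run of dots is an ordinary path segment and is kept literally. The sketch below prints how each input resolves against the base used in the test; the class name `DotSegmentSketch` and method name `resolve` are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DotSegmentSketch {

    // Resolve a relative link against a base URL with java.net.URL.
    static String resolve(String base, String link) {
        try {
            return new URL(new URL(base), link).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("unparseable URL: " + base + " + " + link, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.dianping.com/sh/ss/com";
        // "../" climbs one directory; "..aa", ".../" and "..../" are plain segments.
        String[] links = {"../aa", "../../aa", "..aa", ".../aa", "..../aa"};
        for (String link : links) {
            System.out.println(link + " -> " + resolve(base, link));
        }
    }
}
```

Running it shows that `../../aa` climbs to `http://www.dianping.com/aa`, while `..aa` stays under `/sh/ss/`, matching the assertions above for those two inputs.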