Part 1
Write a program that takes a URL from the command line, checks that it is a valid URL, opens the web page at that URL, grabs the URLs on that page, and prints them in lexicographic order. For example, if you run the program from the terminal shell as python test1.py http://scholar.google.com, the output should look like:
http://support.google.com/bin/answer.py?answer=23852
http://www.google.com/chrome/
http://www.google.com/imghp?hl=en&
http://www.google.com/intl/en/about.html
http://www.google.com/intl/en/options/
http://www.google.com/intl/en/privacy.html
http://www.google.com/webhp?hl=en&
http://www.mozilla.com/firefox/
while python test1.py scholar.google.com would report an invalid URL.
The program should:
· import both sys and parse_url,
· check that there are two command-line arguments,
· check that the URL in sys.argv[1] begins with http:// (look up the string method startswith),
· call get_links from the parse_url module to get the set of URLs, and
· print the sorted URLs, one per line of output (a sketch follows this list).
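A minimal sketch of test1.py following these steps, assuming parse_url is the course-provided module described above and that its get_links function takes the URL string as its argument and returns a set of URL strings (handling of unreachable pages is omitted):

import sys
import parse_url

def main():
    # Expect exactly two command-line arguments: the script name and the URL.
    if len(sys.argv) != 2:
        print("usage: python test1.py URL")
        return
    url = sys.argv[1]
    # Treat the URL as valid only if it begins with http://
    if not url.startswith("http://"):
        print("invalid URL:", url)
        return
    # get_links is assumed to return the set of URLs found on the page.
    links = parse_url.get_links(url)
    # Print the URLs in lexicographic order, one per line.
    for link in sorted(links):
        print(link)

if __name__ == "__main__":
    main()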
Part 2
Modify the code so that it calls get_links in turn for each URL returned by the first call to get_links. Save all of the resulting links, together with the original links, in a single set, and print the URLs in order. For example, if you start with http://scholar.google.com, your code should output the eight URLs shown above together with all of the URLs found on those eight pages. At the end, output the number of different URLs found.
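A sketch of this Part 2 version, under the same assumption that parse_url.get_links(url) returns a set of URL strings; it follows links only one level deep, as the assignment asks:

import sys
import parse_url

def main():
    if len(sys.argv) != 2 or not sys.argv[1].startswith("http://"):
        print("usage: python test1.py http://...")
        return
    start_url = sys.argv[1]
    # Links found on the starting page.
    first_level = parse_url.get_links(start_url)
    # Keep every URL seen, original links included, in a single set.
    all_links = set(first_level)
    # Call get_links in turn for each URL returned by the first call.
    for url in first_level:
        all_links |= parse_url.get_links(url)
    # Print all URLs in lexicographic order, then the count of distinct URLs.
    for link in sorted(all_links):
        print(link)
    print(len(all_links), "different URLs found")

if __name__ == "__main__":
    main()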