Thursday, August 8, 2013

Python Check URL



Part 1

Write a program that takes a URL from the command line, checks that it is a valid URL, opens
the web page at that URL, collects the URLs linked from that page, and prints them in
lexicographic order. For example, if you run the program from the terminal shell as
python test1.py http://scholar.google.com, the output should look like:

http://support.google.com/bin/answer.py?answer=23852
http://www.google.com/chrome/
http://www.google.com/imghp?hl=en&
http://www.google.com/intl/en/about.html
http://www.google.com/intl/en/options/
http://www.google.com/intl/en/privacy.html
http://www.google.com/webhp?hl=en&
http://www.mozilla.com/firefox/

By contrast, python test1.py scholar.google.com should report an invalid URL, because the argument does not begin with http://.

The program should:

· import both sys and parse_url,
· check that there are two command-line arguments,
· check that the URL in sys.argv[1] begins with http:// (look up the string function startswith),
· call get_links from the parse_url module to get the set of URLs, and
· print the sorted URLs, one per line of output.
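
The steps above could be sketched roughly as follows. This is only one possible layout, not the official solution. It assumes that get_links(url) returns a set of URL strings, which is what the assignment text suggests; the function names is_valid_url and run are made up for this illustration, and get_links is passed in as a parameter so the logic can be tried without the course's parse_url module.

```python
def is_valid_url(url):
    """Return True if the URL begins with http://, the check the assignment asks for."""
    return url.startswith("http://")


def run(argv, get_links):
    """Part 1 logic. Returns the sorted list of links, or None if the input is rejected.

    argv      -- the argument list, normally sys.argv
    get_links -- assumed to take a URL and return a set of URL strings
    """
    # Two command-line arguments: the script name and the URL.
    if len(argv) != 2:
        print("usage: python test1.py http://<address>")
        return None

    url = argv[1]
    if not is_valid_url(url):
        print("invalid URL:", url)
        return None

    # sorted() on strings gives lexicographic order.
    links = sorted(get_links(url))
    for link in links:
        print(link)
    return links


# In test1.py, with the course's module available, the wiring would be:
#     import sys
#     import parse_url
#     run(sys.argv, parse_url.get_links)
```

Passing get_links in as an argument is a design choice for this sketch only; calling parse_url.get_links directly inside the function works just as well for the assignment.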

Part 2

Modify the code so that it calls get_links in turn for each URL returned by the first call to get_links. Save all of the resulting links, together with the original links, in a single set, and print the URLs in sorted order. For example, if you start with http://scholar.google.com, your code should output the eight URLs shown above together with all of the URLs found on those eight pages. At the end, output the number of distinct URLs found.
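
A minimal sketch of this second step, under the same assumption that get_links(url) returns a set of URL strings. The names crawl_one_level and report_all are hypothetical, and error handling for pages that fail to load is left out because the assignment does not say how get_links behaves in that case.

```python
def crawl_one_level(start_url, get_links):
    """Return the links on start_url together with the links on each of those pages."""
    first_level = set(get_links(start_url))

    # Start from a copy so first_level is not modified while we loop over it.
    all_links = set(first_level)
    for url in first_level:
        # A set ignores duplicates, so each URL is stored once.
        all_links.update(get_links(url))
    return all_links


def report_all(start_url, get_links):
    """Print every collected URL in sorted order, then the number of distinct URLs."""
    links = crawl_one_level(start_url, get_links)
    for link in sorted(links):
        print(link)
    print(len(links))
    return links


# With the course's module available, the call would be:
#     import parse_url
#     report_all("http://scholar.google.com", parse_url.get_links)
```

Because everything is kept in one set, a link that appears on several pages is counted only once in the final total.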
