Archive a Portion of a Website Using GNU Wget

The mirroring of a thousand files begins with the first bit…

I needed to back up a portion of an “online content source” and thought GNU’s wget might be a good choice for the task. It self-compiled into digital existence roughly 34.1337 microseconds after the Web’s Big Bang, so there’s not much HTTP- or FTP-based work you can’t do with it.

Some of its features (from gnu.org):

  • Can resume aborted downloads, using REST and RANGE
  • Can use filename wild cards and recursively mirror directories
  • NLS-based message files for many different languages
  • Optionally converts absolute links in downloaded documents to relative, so that downloaded documents may link to each other locally
  • Runs on most UNIX-like operating systems as well as Microsoft Windows
  • Supports HTTP proxies
  • Supports HTTP cookies
  • Supports persistent HTTP connections
  • Unattended / background operation
  • Uses local file timestamps to determine whether documents need to be re-downloaded when mirroring
  • GNU Wget is distributed under the GNU General Public License.

The endpoints of the content source were a little tricky to mirror, and I had to experiment with the utility’s options to achieve the desired result. The advanced options I used are reproduced below.

First, acquire wget. My workstation runs Ubuntu; just use the package manager for your distro or install the utility from source.

sudo apt-get update
sudo apt-get install wget
wget --version

Basic (deceptively simple) syntax:

wget [option] ... [URL] ...

Here are the two wget option sets I ultimately used:

wget -r -nc -np -R "index.html" -e robots=off "URL"
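
For illustration only, here is the same option set pointed at a placeholder URL (example.com and the /docs/reports/ path are assumptions, not the source I actually mirrored):

wget -r -nc -np -R "index.html" -e robots=off "https://example.com/docs/reports/"

Wget recreates the remote hierarchy locally under a directory named after the host, so the files end up beneath ./example.com/docs/reports/.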

This next option set downloads files from URLs that don’t have an explicit filename in the URL (the URL redirects to the file) but where you want the redirected filename used for the saved file. The URLs are contained in a local file called “missing.txt”.

wget --content-disposition --trust-server-names -r -nc -np -R "index.html*" -e robots=off -i ./missing.txt
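
As a sketch, missing.txt is just a plain list of URLs, one per line. The entries below are placeholders standing in for endpoints that redirect to a file:

https://example.com/download?id=1001
https://example.com/download?id=1002
https://example.com/download?id=1003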

And finally, descriptions for the options (from gnu.org).

"-r"
"--recursive"

Turn on recursive retrieving. The default maximum depth is 5 (this can be changed with the ‘--level=depth’ parameter).
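
If the default depth isn’t what you want, it can be set explicitly. For example, to stop recursion three levels below the starting point (placeholder URL):

wget -r --level=3 -np "https://example.com/docs/"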

"-nc"
"--no-clobber"

If a file is downloaded more than once in the same directory, Wget’s behavior depends on a few options, including ‘-nc’. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.

When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named ‘file.1’. If that file is downloaded yet again, the third copy will be named ‘file.2’, and so on. (This is also the behavior with ‘-nd’, even if ‘-r’ or ‘-p’ are in effect.) When ‘-nc’ is specified, this behavior is suppressed, and Wget will refuse to download newer copies of ‘file’. Therefore, “no-clobber” is actually a misnomer in this mode—it’s not clobbering that’s prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that’s prevented.

When running Wget with ‘-r’ or ‘-p’, but without ‘-N’, ‘-nd’, or ‘-nc’, re-downloading a file will result in the new copy simply overwriting the old. Adding ‘-nc’ will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.

When running Wget with ‘-N’, with or without ‘-r’ or ‘-p’, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file (see Time-Stamping). ‘-nc’ may not be specified at the same time as ‘-N’.

A combination with ‘-O’/‘--output-document’ is only accepted if the given output file does not exist.

Note that when ‘-nc’ is specified, files with the suffixes ‘.html’ or ‘.htm’ will be loaded from the local disk and parsed as if they had been retrieved from the Web.
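
In practice this makes a mirroring run safe to repeat: files already on disk are left alone rather than re-fetched or renamed. A sketch with a placeholder URL:

wget -r -nc -np "https://example.com/docs/"   # first run retrieves everything
wget -r -nc -np "https://example.com/docs/"   # rerun: existing files are skipped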

"-np"
"--no-parent"

Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.
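
For example, when the starting point is a subdirectory, ‘-np’ keeps the crawl from wandering up into the rest of the site (placeholder URL):

wget -r -np "https://example.com/docs/reports/2019/"

Only content at or below /docs/reports/2019/ is retrieved; links pointing to /docs/ or the site root are not followed.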

"-R rejList"
"--reject rejList"
"reject = rejList"
"--reject-regex urlregex"
"reject-regex = urlregex"

The ‘--reject’ option works the same way as ‘--accept’, only its logic is the reverse; Wget will download all files except the ones matching the suffixes (or patterns) in the list.

So, if you want to download a whole page except for the cumbersome MPEGs and .AU files, you can use ‘wget -R mpg,mpeg,au’. Analogously, to download all files except the ones beginning with ‘bjork’, use ‘wget -R "bjork*"’. The quotes are to prevent expansion by the shell.
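
A sketch combining both forms against a placeholder URL (the suffix list and the regex are illustrative):

wget -r -np -R "mpg,mpeg,au" --reject-regex ".*/private/.*" "https://example.com/media/"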

"-e command"
"--execute command"

Execute command as if it were a part of .wgetrc (see Startup File). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. If you need to specify more than one wgetrc command, use multiple instances of ‘-e’.

Here, the command is being used to ignore the site’s robot exclusion directive, i.e. robots=off. “Your download resistance is futile.”
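
Any .wgetrc command can be passed this way, and ‘-e’ can be repeated. An illustrative invocation that ignores robots.txt and also switches proxy use off (placeholder URL):

wget -r -np -e robots=off -e use_proxy=off "https://example.com/docs/"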

"--content-disposition"

If this is set to on, experimental (not fully-functional) support for Content-Disposition headers is enabled. This can currently result in extra round-trips to the server for a HEAD request, and is known to suffer from a few bugs, which is why it is not currently enabled by default.

This option is useful for some file-downloading CGI programs that use Content-Disposition headers to describe what the name of a downloaded file should be.

When combined with ‘--metalink-over-http’ and ‘--trust-server-names’, a ‘Content-Type: application/metalink4+xml’ file is named using the Content-Disposition filename field, if available.
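
As a sketch, a download endpoint that answers with a header such as ‘Content-Disposition: attachment; filename=report-2019.pdf’ will be saved under that filename instead of something like ‘download?id=1001’ (the URL and filename are placeholders):

wget --content-disposition "https://example.com/download?id=1001"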

"--trust-server-names"

If this is set, on a redirect, the local file name will be based on the redirection URL. By default the local file name is based on the original URL. When doing recursive retrieving this can be helpful because in many web sites redirected URLs correspond to an underlying file structure, while link URLs do not.
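
For instance, if a placeholder URL such as https://example.com/latest redirects to https://example.com/files/report-2019.pdf, the default is to save the file as ‘latest’, while this option saves it as ‘report-2019.pdf’:

wget --trust-server-names "https://example.com/latest"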

"-i file"
"--input-file=file"

Read URLs from a local or external file. If ‘-’ is specified as file, URLs are read from the standard input. (Use ‘./-’ to read from a file literally named ‘-’.)

If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If ‘--force-html’ is not specified, then file should consist of a series of URLs, one per line.

However, if you specify ‘--force-html’, the document will be regarded as ‘html’. In that case you may have problems with relative links, which you can solve either by adding <base href="url"> to the documents or by specifying ‘--base=url’ on the command line.

If the file is an external one, the document will be automatically treated as ‘html’ if the Content-Type matches ‘text/html’. Furthermore, the file’s location will be implicitly used as base href if none was specified.
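
Two hedged usage sketches, one reading from a local file and one piping URLs in on standard input (file names and URLs are placeholders):

wget -nc -i ./missing.txt
grep '^https://example.com/' all-urls.txt | wget -nc -i -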

