First stable release of the GNU Wget2 web content downloader

After three and a half years of development, the GNU Wget2 project has published its first stable release: a completely redesigned version of GNU Wget, the program for automating recursive downloads of web content. GNU Wget2 was designed and rewritten from scratch and is notable for moving the web client's core functionality into the libwget library, which can be used in standalone applications. The utility is distributed under the GPLv3+ license, and the library under the LGPLv3+.

Instead of gradually reworking the existing code base, the developers decided to start over from scratch and maintain Wget2 as a separate branch in which to implement restructuring ideas, extend functionality, and make compatibility-breaking changes. Apart from the dropped support for the FTP protocol and the WARC format, wget2 can act as a transparent replacement for the classic wget utility in most situations.

That being said, wget2 has some documented behavioral differences, provides about 30 additional options, and drops support for a few dozen options, including "--ask-password", "--header", "--exclude-directories", "--ftp*", "--warc*", "--limit-rate", "--relative", and "--unlink".
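For most jobs the switch is literally a rename; for dropped options a workaround is needed. A brief sketch (the URLs are placeholders, and trickle is the external bandwidth limiter mentioned in the feature list below):

```shell
# A typical recursive download works unchanged under wget2
# (the URL is a placeholder):
wget2 --recursive --level=2 https://example.org/docs/

# "--limit-rate" was dropped; an external limiter such as trickle
# can cap the bandwidth instead, here to roughly 200 KB/s:
trickle -d 200 wget2 https://example.org/big.iso
```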

Key innovations include:

  • Moving the core functionality into the libwget library.
  • Transition to a multithreaded architecture.
  • Ability to open multiple connections in parallel and download in multiple streams. In particular, a single file can be downloaded in parallel, split into blocks, using the "--chunk-size" option.
  • Support for HTTP/2 protocol.
  • Use of the If-Modified-Since HTTP header to download only data that has changed.
  • Transition to external bandwidth limiters such as trickle.
  • Support for the Accept-Encoding header, compressed data transfer, and brotli, zstd, lzip, gzip, deflate, lzma, and bzip2 compression algorithms.
  • Support for TLS 1.3, OCSP (Online Certificate Status Protocol) for checking revoked certificates, HSTS (HTTP Strict Transport Security) for forced redirection to HTTPS, and HPKP (HTTP Public Key Pinning) for certificate binding.
  • Ability to use GnuTLS, WolfSSL and OpenSSL as backends for TLS.
  • Support for TCP fast open mode (TCP FastOpen).
  • Built-in support for the Metalink format.
  • Support for internationalized domain names (IDNA2008).
  • Ability to work through several proxy servers simultaneously (one stream is downloaded through one proxy, another through a second).
  • Built-in support for news feeds in Atom and RSS formats (for example, for crawling and downloading links). RSS/Atom data can be loaded from a local file or over the network.
  • Support for extracting URLs from sitemap files, plus parsers that extract links from CSS and XML files.
  • Support for the 'include' directive in configuration files and distribution of settings across multiple files (/etc/wget/conf.d/*.conf).
  • Built-in DNS query caching mechanism.
  • Ability to transcode content, converting the document to a different encoding.
  • Honoring the "robots.txt" file during recursive downloads.
  • Safe write mode with fsync() call after saving data.
  • Ability to resume interrupted TLS sessions, as well as cache and save TLS session parameters to a file.
  • A "--input-file -" mode to read the URLs to download from standard input.
  • Cookie scope checking against the Public Suffix List, isolating sites hosted under the same second-level domain (for example, "a.github.io" and "b.github.io") from each other.
  • Support for downloading live streams in ICEcast/SHOUTcast format.
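Two of the items above, chunked parallel downloading and the stdin input mode, look like this on the command line (URLs and file names are placeholders):

```shell
# Split one large file into 2 MB blocks and fetch the blocks
# over parallel connections:
wget2 --chunk-size=2M https://example.org/dvd.iso

# "--input-file -" reads the URLs to download from standard input,
# so another tool can feed them in:
grep '^https://' urls.txt | wget2 --input-file -
```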

Source: opennet.ru
