WARC – new ISO file format to store billions of online resources
A Web page that is here today may not be here tomorrow. A new ISO standard, ISO 28500:2009, Information and documentation – WARC file format, will help ensure that the vast and often valuable information posted on the Web is not lost when a page changes or disappears.
ISO 28500 provides a file format known as WARC (Web ARChive), which offers a convention for concatenating multiple data objects into one long file. The format can be used to build applications for harvesting, managing, accessing and exchanging content.
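To make the idea of concatenating data objects concrete, here is a minimal sketch in Python of how records might be serialized and joined into one file. It assumes the WARC/1.0 record layout (a version line, named header fields, a blank line, the content block, then two line breaks). The helper name make_record and the example.org URLs are hypothetical, and real tools typically add further fields, such as digests, and per-record compression.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

CRLF = b"\r\n"


def make_record(record_type: str, payload: bytes,
                content_type: str = "application/octet-stream",
                target_uri: Optional[str] = None) -> bytes:
    """Serialize one WARC/1.0-style record (hypothetical helper, simplified)."""
    headers = [
        ("WARC-Type", record_type),
        # Every record carries its own globally unique identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Type", content_type),
        # Content-Length lets a reader skip over the block without parsing it.
        ("Content-Length", str(len(payload))),
    ]
    if target_uri is not None:
        headers.append(("WARC-Target-URI", target_uri))

    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF

    # Blank line ends the header; two CRLFs end the record.
    return head + CRLF + payload + CRLF + CRLF


# A WARC file is simply one record after another.
records = [
    make_record("resource", b"<html><body>Hello</body></html>",
                "text/html", "http://example.org/"),
    make_record("resource", b"plain text object",
                "text/plain", "http://example.org/notes.txt"),
]
archive = b"".join(records)
```

Because each record states its own length, a reader can walk through such a file record by record, or jump straight to a known byte offset, without loading the whole archive.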
“For a long time, keeping track of the staggering number of Web sites and pages posed a difficult challenge for digital curators and archivists, and resulted in countless lost data,” says Clément Oury, member of the working group that developed the standard.
“With WARC, ISO 28500 takes Internet archiving to the next level by enabling the effective management, structure and storage of billions of resources collected from the Web and elsewhere. Its standardization offers a guarantee of durability, and will help Web archiving become part of the mainstream activities of heritage institutions and other branches, by for example, fostering the development of new tools and ensuring interoperability between collections,” explains Mr. Oury.
The WARC format is an extension of the ARC file format, which has been used by the Internet Archive since 1996, and by numerous heritage institutions to store “Web crawls” – which represent extracts of entire Web pages and their links.
The motivation to extend the ARC format arose from the discussions and experiences of these organizations within the International Internet Preservation Consortium (IIPC) – whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC members were finding it increasingly difficult to store and manage the growing volume of information coming from the Internet.
The WARC format differs from ARC in that it offers new possibilities, notably the recording of HTTP request headers and of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of the records. WARC files are intended to store every type of digital content, whether retrieved by HTTP or another protocol.
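These capabilities correspond to record types defined by the format: request for HTTP request headers, metadata for arbitrary metadata, revisit for duplicates, conversion for migrated records and continuation for segmented records, with each record identified by its WARC-Record-ID field. The sketch below, in Python, reads such records back from a byte stream. The function name iter_records, the sample records and their identifiers are hypothetical, and the parser is deliberately simplified: it assumes uncompressed input and ignores header line folding.

```python
import io


def iter_records(stream):
    """Yield (version, headers, block) for each record in a WARC-style stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines that separate consecutive records
        version = line.strip().decode("utf-8")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line marks the end of the header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Read exactly Content-Length bytes, whatever the block contains.
        block = stream.read(int(headers["Content-Length"]))
        yield version, headers, block


def _sample(record_type, record_id, block):
    # Hypothetical sample record; the date and identifiers are made up.
    return (
        b"WARC/1.0\r\n"
        + f"WARC-Type: {record_type}\r\n".encode("utf-8")
        + f"WARC-Record-ID: <urn:uuid:{record_id}>\r\n".encode("utf-8")
        + b"WARC-Date: 2009-06-01T00:00:00Z\r\n"
        + f"Content-Length: {len(block)}\r\n".encode("utf-8")
        + b"\r\n" + block + b"\r\n\r\n"
    )


SAMPLE = (
    _sample("request", "11111111-1111-1111-1111-111111111111",
            b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n")
    + _sample("metadata", "22222222-2222-2222-2222-222222222222",
              b"note: captured for illustration\r\n")
)

for version, headers, block in iter_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Record-ID"], len(block))
```

Reading the block by its declared length, rather than scanning for a delimiter, is what lets a record hold any kind of content, including HTTP messages that themselves contain blank lines.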
“Several applications are already WARC compliant,” adds Mr. Oury, “such as the Heritrix crawler for harvesting, the WARC tools for data management and exchange, the Wayback Machine, NutchWAX and other search tools for access.”
ISO 28500:2009, Information and documentation – WARC file format, was developed by ISO technical committee ISO/TC 46, Information and documentation, subcommittee SC 4, Technical interoperability. The standard is available from ISO national member institutes. It may also be obtained directly from the ISO Central Secretariat, price 118 Swiss francs, through the ISO Store or by contacting the Marketing, Communication and Information department.
(Source: http://www.iso.org/iso/pressrelease.htm?refid=Ref1255)