Portico is a preservation service that works with libraries and publishers to ensure that scholarly content is available for future scholars. When it was formed in 2005, the initial workflows were built around a concept of a publication that had remained relatively unchanged for centuries. As books and journals moved to digital formats, they mostly simulated the bound print world, retaining and adapting many of its artifacts and processes. While traditional publications have not changed significantly and still play a central role in scholarly communication, scholars now also have the option to use a rapidly growing variety of platforms and tools to share their work in new and sometimes complex forms. The result is that the relative uniformity of traditional electronic publications, which enables scalable preservation, can no longer be assumed. Those who seek to preserve scholarship are confronted with publications that incorporate an ever-expanding variety of embedded media formats and viewers, data visualizations, version management, complex interdependent networks of supporting materials such as software and data, reader-contributed content (annotations, comments), interactive features, and nonlinear forms of navigation. Preservation services that focus on scholarly content continue to evolve in order to support these changes, and new services have arisen to meet demand in specific areas, such as software preservation. But as content becomes more complex and more plentiful, it is challenging for preservation services to keep pace with increasingly diverse publication formats and preserve them at scale. If authors and their publishers do not plan for the longevity of their work, the most innovative scholarship today may, within years rather than decades, lose the characteristics that make it unique and valuable. Preservation copies, in turn, may be missing important components of the work.
It is in this context that NYU Libraries proposed a project that would bring together open access publishers concerned with the long-term survival of their most innovative projects with digital preservation services that specialize in scholarly publications. The project, titled
Five publishers—Michigan Publishing, Stanford University Press, University of Minnesota Press, University of British Columbia Press, and New York University Press—shared 20 innovative examples of enhanced digital scholarly publications to be analyzed for preservability. While analysis showed that it was
Drawing from the research described above, this section summarizes some of the patterns identified and aligns them with ideas for how publishers and preservation services can evolve together to accomplish scalable preservation of new forms of scholarship.
Without the constraints of the bound form and with the falling cost of storage, sharing of additional resources to accompany a text has become commonplace. This has been buoyed by funder mandates on data sharing and cultural changes that call for researchers to show the evidence underlying their work. In new forms of scholarship that integrate different kinds of materials, what were once called “supplements” may now be referenced and embedded throughout the publication in addition to sitting alongside it. This prompted several of the platforms involved in this research to call these materials publication “resources” rather than “supplements.” As an integral part of the publication, these resources are vital to preserve in order to fully understand the intellectual contribution of the scholarship.
More than half of the publications analyzed had hundreds of resources. Resource file formats ranged from PDFs, audio, and video to executable programs and full databases. In three platforms, each of these resources included rich descriptive metadata, a dedicated landing page, and in some cases a unique persistent identifier that makes it possible to cite the resource directly. One-third of the publications had resources occupying more than a gigabyte (GB) of disk space, much larger than a typical PDF publication.
For traditional publications in Portico, supplements that are supplied for preservation are packaged and archived with the ebook or ejournal, and most have little or no metadata beyond the notes within the publication text. In these new forms of publication, Portico may need to reflect this rich expression of the associated resources with independent landing pages, improved support for a wider variety of file formats, and sometimes DOIs that can resolve to the archived version if the publisher copy is no longer available. All of this must be done while ensuring that the resources remain linked to the text so that they can be presented as part of it through Portico’s access platform. Increasingly, publications are a web of connected resources in which the text itself connects many pieces rather than a bound format that can be easily contained as a distinct object.
As preservation services work to support this evolving concept of supplements, publishers can increase the success of this effort in a number of ways. These are laid out in detail in the
Traditional publishing has strict schedules and workflows that lead up to a publication date, after which the publication does not change except through established and formal channels such as addendums, editions, and retraction notices. For new forms of scholarship, the concept of “version of record” can be elusive and may need to be considered case by case. Some current platforms enable version updates after initial publication without a formal addendum or change to the DOI. Others support annotations and comments from readers that appear on the publication and, for some publishers, are considered an important aspect of the published work. Other publications exist in a perpetual draft state, where they are designed to iteratively change to incorporate feedback or new data, never reaching a final, official publication date.
This kind of
Living documents may warrant an addendum to the usual conversation that occurs between the publisher and Portico when planning the preservation workflow. To support preservation, publishers may need to define which version or versions should be preserved. The preservation system will need to recognize when something has changed, using agreed-upon criteria, so that the distinction between versions can be recorded in the descriptive metadata. If a publisher believes user-contributed content is important to preserve, it will need to be clear whether a third party has the rights to preserve it. Changes to, or negotiations regarding, terms of use/service may be necessary to support preservation of this content.
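One simple criterion for recognizing that something has changed is a comparison of file checksums between two deliveries of the same publication. The sketch below illustrates that idea under stated assumptions and does not describe Portico's actual system: the function names `build_manifest` and `diff_manifests`, and the assumption that each delivery arrives as a directory of files, are hypothetical.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old, new):
    """Report files added, removed, or modified between two manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(name for name in shared if old[name] != new[name]),
    }
```

A non-empty result from `diff_manifests` only signals that bytes differ. Whether that difference amounts to a new version worth preserving and describing remains a decision for the publisher and the preservation service.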
One of the most frequent features of new forms of scholarship is the embedding of a variety of resources that would typically not be found in traditional publications. Examples include audio, video, and complex data visualizations that are seamlessly integrated into the body of the text, just as figure graphics are embedded in traditional publications. The EPUB format, and web pages generally, support simple methods for embedding audiovisual material using HTML tags for image, video, and audio. If used as intended, these can be managed by preservation workflows. In the samples analyzed during this research, however, almost all material that was not text or image was embedded into publications using an inline frame, or
A multitude of preservation challenges result from the use of iframes. The content presented in them may not be part of the publishing platform or even controlled by the publisher. This makes the content of iframes less likely to be included in platform exports and vulnerable to
Portico is exploring new ways to identify which resources should be included in the archived copy and how to ensure they are embedded at the appropriate position in the publication. This includes adding web page archiving tools to some preservation workflows. To help avoid omission of resources that are integral to the work, preservation services may need support from publishers and their platform designers. Workflows that standardize the expression of common embedded features within the publishing platform could be leveraged to design compatible workflows for preservation. Where possible, publishers could ensure they have a copy of all embedded resources, whether managed by the publisher platform or not, and the appropriate rights to preserve them. These copies could be included in preservation packages and also provide a useful backup for publishers in the case that an embedded resource becomes unavailable on the web even before the publication reaches an archive. Where copies or rights cannot be obtained, meaningful captions that include a description and an original link could help future readers find the content even if it is no longer connected to the publication. In some cases, publishers may wish to participate in a service that allows for static captures of web pages that are displayed or linked in the publication so that the content is preserved before it has a chance to change or disappear. An example of this type of service is
For some new forms of scholarship, enormous effort has gone into creating a specific presentation or
For Portico, the primary mode of preservation has been to separate the intellectual components of the publication from the platform and arrange them into a standard package that can be updated and adapted through time to work with modern technology. If the experience is an important aspect of the publication that cannot be easily separated from the platform, then preservation approaches that record this experience should be considered.
One commonly used option for preserving the experience of a website is a web crawler, a tool that visits the web pages from the outside and records a copy of them. For websites that are compatible with this approach, the process can be automated with little configuration. Many websites, however, have features that require the crawler to be customized in order to record them. Some websites, such as those whose content can only be discovered via a search bar, cannot be preserved using this method. Another option is to recreate the website’s server on a virtual machine and then preserve that virtual machine. This approach is rarely used for website preservation because it depends on access to the resources needed to recreate the server (code, data, software, licenses, documentation, expertise, etc.) and on whether the website can be configured to function without access to URLs outside of the web server. If the website centers on a visualization hosted by ArcGIS Online, for example, it will only work for as long as that visualization is available. There may also be some uncertainty about the technology and expertise required to run the server in the long term, but this is evolving as the tools and infrastructure to support this approach are making progress through efforts such as
For web publications that integrate a lot of dynamic features and technologies, it can be difficult to apply website archiving methods and maintain high-quality preservation at scale. Each of the web platforms analyzed during this research required several weeks of effort to create a web archiving process that met the preservation requirements. Even with configuration tailored to the platform, unanticipated variations between publications and features that cannot be preserved using these methods can lead to an incomplete archival copy. Because the publications in this research were complex and spanned many web pages, the web archived content also required significantly more disk space than the exported content, which may have implications for the preservation cost at scale. If these challenges can be navigated, however, these approaches can be highly effective for platforms that favor them and may be vital to ensuring that the most innovative publications created today can be experienced in the future.
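Before configuring a crawl, it can help to know which embedded resources on a page are served from outside the publishing platform, since those are the ones most likely to be missed or to disappear. The following is a minimal sketch using only Python's standard library. The class name `EmbedInventory`, the function `external_embeds`, the choice of tags, and the example URLs are all hypothetical and are not part of any tool described in this article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Tags that embed a resource, and the attribute that carries its address.
EMBED_ATTRS = {
    "iframe": "src", "video": "src", "audio": "src", "source": "src",
    "img": "src", "embed": "src", "object": "data",
}


class EmbedInventory(HTMLParser):
    """Collect (tag, absolute URL) pairs for embedded resources in one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        attr_name = EMBED_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative addresses against the page's own URL.
            self.embeds.append((tag, urljoin(self.page_url, value)))


def external_embeds(html, page_url):
    """Return the embeds whose host differs from the host of the page."""
    parser = EmbedInventory(page_url)
    parser.feed(html)
    page_host = urlparse(page_url).netloc
    return [(tag, url) for tag, url in parser.embeds
            if urlparse(url).netloc != page_host]
```

For a page at `https://press.example.edu/book/ch1.html` that contains a local image and an iframe pointing at `https://maps.example.org/viz?id=1`, `external_embeds` would return only the iframe entry. A list like this could inform which resources need copies, rights clearance, or special crawler configuration. It inspects static HTML only, so embeds injected by scripts at load time would not appear.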
To respond to the need to preserve the experience for some publications, Portico has initiated a web archiving pilot project that uses a crawler and will continue to evaluate scalable options for preserving websites that cannot be crawled. While preservation services will always attempt to evolve and work with the content as presented, one way to improve the chance that these experience-focused approaches will be successful at high quality and scale is for publishers and preservation services to work together to ensure the platforms will favor these techniques. Numerous suggestions for how to make this possible are documented in the
The previous section laid out ideas for how publishers and preservation services can work together to tackle different kinds of challenges. These are addressed in more detail through the
While the
Every instance in which publishers and preservation services can collaborate to build tools and methods that can plug into platforms or be reused by others will contribute to an information infrastructure that favors the longevity of scholarship. Rather than asking publishers to stifle the innovation and creativity of their authors to simplify the preservation task, we are looking to collaborate and innovate to develop approaches to creating new forms of scholarship that will be available to future scholars.