Pilot 7: Piratical distribution as one form of impact indicator & reaching unexpected audiences


The goal of the pilot study was to conduct a quantitative, statistical and econometric analysis of large-scale datasets on the supply of and demand for scholarly works on various illegal platforms such as Sci-Hub and Library Genesis. The pilot modelled which work-specific factors (e.g. price, legal availability) may explain the availability of and illegal demand for individual works.

The data was analysed using advanced statistical modelling methods and geospatial models to explain the impact of the legal (un)availability of scientific works. The results shed light on potential failures or shortcomings in the legal access channels (markets and libraries) that drive people to use the black markets. We also assessed how important the illegal traffic of scientific works is.

Key Results

Using datasets provided to us by one of the administrators of a prominent shadow library in 2012 and in 2015, we mapped both the supply of and the demand for academic monographs, textbooks and other learning materials via piratical shadow libraries. Our primary findings suggest that scholarly book piracy is a ubiquitous global phenomenon, with no apparent end in sight. If that is indeed true, we must ask what the consequences might be for the status quo in scholarly publishing.

The study draws on a number of data sources to analyse the supply of and demand for pirated scholarly publications. The analysis of supply is based on the catalogue of Library Genesis. The shadow library publishes its catalogue, with basic bibliographic metadata, as daily database dumps. Usage data is based on two sets of access log data provided to us by the administrators of one of the mirror services that distribute the titles in Library Genesis. The datasets were detailed enough to link the download of catalogue items to geographic locations. To conduct extra research into legal availability we occasionally queried other data sources, such as amazon.com for price and legal availability, and WorldCat for library availability.

In 2012 the Library Genesis catalogue contained 836,479 records. Three years later, in 2015, the catalogue had grown by more than half, to 1,317,424 records, and by the time of writing in 2018, Library Genesis hosts more than 2,237,940 documents, almost all of them scholarly publications. In addition, there is an extensive collection of literary works, comics, and of course the 100 million journal articles archived through Sci-Hub.
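The growth implied by these catalogue counts can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the three record counts quoted in the text, and the variable and function names are our own, not part of the study.

```python
# Catalogue sizes as quoted in the text (number of records per year).
catalogue = {2012: 836_479, 2015: 1_317_424, 2018: 2_237_940}


def percent_growth(start: int, end: int) -> float:
    """Relative growth from start to end, expressed in percent."""
    return (end - start) / start * 100


growth_2012_2015 = percent_growth(catalogue[2012], catalogue[2015])
growth_2015_2018 = percent_growth(catalogue[2015], catalogue[2018])
growth_2012_2018 = percent_growth(catalogue[2012], catalogue[2018])

print(f"2012 to 2015: {growth_2012_2015:.1f}% growth")
print(f"2015 to 2018: {growth_2015_2018:.1f}% growth")
print(f"2012 to 2018: {growth_2012_2018:.1f}% growth")
```

Under these figures the catalogue grew by roughly 57% between 2012 and 2015, and the 2018 catalogue is more than two and a half times the size of the 2012 one.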

Post-Soviet republics, which in 2012 were heavy users of this particular mirror, seem to have migrated to other services, and the traffic from these countries has declined. The countries and regions that account for the bulk of the usage (US, India, China, Europe) show average growth. On the other hand, there has been staggering growth in Latin America, which in 2012 was hardly using (this particular mirror of) Library Genesis at all, but by 2015 had discovered LibGen and become one of the most intensive users of the library. Our results show that the biggest per capita users are the high-income North American and European countries. In fact, just a handful of countries, the United States (11.66%), India (8.58%), Germany (5.23%), the UK (4.10%), Iran (3.68%), China (3.67%), Italy (3.30%), Canada (2.36%), Indonesia (2.29%), Spain (2.28%), Turkey (2.24%), and Brazil (2.11%), account for more than half of all the downloads.
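The "more than half" claim follows directly from the listed shares. As a minimal check, the sketch below sums the twelve percentages exactly as quoted in the text; the dictionary and variable names are illustrative only.

```python
# Download shares (percent of all downloads) as listed in the text.
shares = {
    "United States": 11.66,
    "India": 8.58,
    "Germany": 5.23,
    "UK": 4.10,
    "Iran": 3.68,
    "China": 3.67,
    "Italy": 3.30,
    "Canada": 2.36,
    "Indonesia": 2.29,
    "Spain": 2.28,
    "Turkey": 2.24,
    "Brazil": 2.11,
}

# Combined share of the listed countries.
total_share = sum(shares.values())

print(f"{len(shares)} countries account for {total_share:.2f}% of downloads")
```

The twelve countries together account for about 51.5% of the downloads, so every other country combined contributes slightly less than half.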

The ten top Dewey content categories suggest a strong science and technology focus of the library: these two categories enjoy the highest demand, and their works also have the highest download volume per title (26.8). The social sciences, on the other hand, see the second-lowest download volume per title (13), even though this section is big, both in terms of supply and in terms of the number of titles downloaded.

What kind of impact might shadow libraries have on the current system of scholarly publishing? It seems that the scholarly publishing industry has understood that it is close to impossible to fight scholarly piracy efficiently. Gigapedia, the predecessor of LibGen, was relatively easy to shut down, as it relied on a centralized database and a centralized document repository. LibGen and Sci-Hub are much more difficult to eliminate, as they are both radically decentralized and already exist in multiple copies all over the internet. That might also explain why there is only one court case against these services (Elsevier Inc. et al. v. Sci-Hub et al., Case No. 1:15-cv-04282-RWS, 2015).

Under such conditions academic publishers have to ask themselves whether copyright and exclusivity-based business models are sustainable. For a number of reasons, the answer might still be in the affirmative. Both the US and the EU have mandated open access publishing for their publicly funded research, creating a lucrative revenue stream for publishers in the form of article processing fees, which are not threatened by piracy. The fact that the scholarly pirates (the scholars themselves) are not those who must pay for the materials (the ones paying are the academic institutions, libraries, and in some cases government agencies) may ultimately mitigate the negative effects of piracy that arise where illegal consumption substitutes for sales. One illegally downloaded scholarly monograph, already priced for the library market, does not diminish sales to individuals, but may generate a purchase by the library at the request of the researcher who had a free sample copy through the shadow library. The net effect may well be positive for publishers.

Ultimately, the increasing use of shadow libraries has major consequences for the current systems of producing and interpreting academic indicators. Our pilot shows that shadow libraries are now an integral part of the systems of scholarly communication. They are part of the everyday routine of scholars in both developed and developing countries. However, there is no reliable, systematic insight into the use of these resources. Consequently, our academic indicators give only an incomplete and biased picture of the circulation of scholarly works. The copyright-infringing nature of shadow libraries allows only ad hoc and fragmented insight into the circulation of works through them. Their modus operandi, on the other hand, is certain to introduce an unknown level of bias into our currently accepted set of indicators. For example, since Sci-Hub uses leaked or shared academic credentials to provide access to paywalled materials, the traffic as measured at the point of access, at the library through which the unauthorized access takes place, will not provide an accurate picture of who uses the library resources and for what reasons. Since it is not reliably known to what extent Sci-Hub serves subsequent requests for an article from its own archive as opposed to getting it again through a library, its impact on library usage metrics is also unknown. It is certain, however, that any newly published article behind a paywall is requested at least once through a library, and that inflates library usage statistics.

On the other hand, the high usage numbers from developed countries suggest that at least some of the shadow library traffic is generated by users who could otherwise have had legal access through their institutions. That applies both to articles, for which users go to Sci-Hub rather than their own institutional repositories, and to books, for which users visit LibGen rather than their library's print or e-book collections. Our statistical models (not reported here) seem to suggest that in North America and Western Europe the high usage cannot be explained by serious access limitations. Instead, we suspect that in these territories the convenient one-click access that shadow libraries provide to full digital copies plays a role. This of course means that the official usage statistics of those resources that are also available through the shadow libraries will underreport the actual demand for key library resources. Shadow libraries do not just introduce noise into the current indicators that measure the circulation and use of scholarly resources. Given their size and the intensity and growth of their use, the omission of the traffic through these libraries threatens to falsify these indicators.

Read the full evaluation report here.

Contact: Balazs Bodo, University of Amsterdam, B.Bodo@uva.nl