Wednesday, October 24. 2012

Comparing Carrots and Lettuce

"The inexorable rise of open access scientific publishing". Our (Gargouri, Lariviere, Gingras, Carr & Harnad) estimate (for publication years 2005-2010, measured in 2011, based on articles published in the c. 12,000 journals indexed by Thomson-Reuters ISI) is 35% total OA in the UK, 10% above the worldwide total OA average of 25%. This is the sum of both Green and Gold OA. Our sample yields a Gold OA estimate much lower than Laakso & Björk's: our estimate of about 25% OA worldwide is composed of 22.5% Green plus 2.5% Gold. And the growth rate of neither Gold nor (unmandated) Green is exponential.

There are a number of reasons why neither "carrots vs. lettuce" nor "UK vs. non-UK produce" nor the L&B estimates vs. the Gargouri et al estimates can be compared or combined in a straightforward way. Please take the following as coming from a fervent supporter of OA, not an ill-wisher, but one who has been disappointed across the long years by far too many failures to seize the day -- amidst surges of "tipping-point" euphoria -- to be ready once again to tout triumph.

First, note that the hubbub is yet again about Gold OA (publishing), even though all estimates agree that there is far less Gold OA than Green OA (self-archiving), and even though it is Green OA that can be fast-forwarded to 100%: all it takes is effective Green OA mandates (I will return to this point at the end).

So Stephen Curry asks why there is a discrepancy between our (Gargouri et al) estimates of Gold OA -- in the UK and worldwide -- and Laakso & Björk's (17%). Here are some of the multiple reasons (several of them already pointed out by Richard van Noorden in his comments):

1. Thomson-Reuters ISI Subset: Our estimates are based solely on articles in the Thomson-Reuters ISI database of c. 12,000 journals. This database is more selective than the SCOPUS database on which L&B's sample is based.
The more selective journals have higher quality standards and are hence the ones that both authors and users prefer. (Without getting into the controversy about journal citation impact factors, another recent L&B study has shown that the higher the journal's impact factor, the less likely that the journal is Gold OA. -- But let me add that this is now likely to change, because of the perverse effects of the Finch Report and the RCUK OA Policy: thanks to the UK's announced readiness to divert UK research funds to double-paying subscription journal publishers for hybrid Gold OA, most journals, including the top journals, will soon be offering hybrid Gold OA -- a very pricey way to add the UK's 6% of worldwide research output to the worldwide Gold OA total. The very same effect could be achieved free of extra cost if RCUK instead adopted a compliance-verification mechanism for its existing Green OA mandates.)

2. Embargoed "Gold OA": L&B included in their Gold OA estimates "OA" that was embargoed for a year. That is not OA, and it certainly should not be credited to the total OA for any given year -- from which it is absent -- but to the next year. By that time, the Green OA embargoes of most journals have already expired. So, again, any OA purchased in this pricey way -- rather than provided via a few extra cost-free keystrokes by the author, for Green -- is more of a head-shaker than an occasion for heady triumph.

3. 1% Annual Growth: The 1% annual growth of Gold OA is not much headway either, if you plot the growth curves and project the date at which they would reach 100%! (The headier Gold OA growth percentages are not Gold OA growth as a percentage of all articles published, but Gold OA growth as a percentage of the preceding year's Gold OA articles.)

4. Green Achromatopsia: The relevant data for comparing Gold OA -- both its proportion and its growth rate -- with Green come from a source L&B do not study, namely, institutions with (effective) Green OA mandates.
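The arithmetic behind point 3 is worth making explicit. Here is a minimal sketch (my own illustration, using the post's round figures: Gold OA at about 2.5% of annual articles, growing at about 1 percentage point per year) of how long the two kinds of growth figure would take to reach 100%:

```python
# Illustrative sketch (not the authors' actual model): project when Gold OA
# would reach 100% of annual articles under the ~1-percentage-point-per-year
# linear growth both studies report, versus compounding growth quoted as a
# percentage of the previous year's Gold articles (the "headier" figure).

def years_to_full_oa_linear(current_pct, points_per_year):
    """Years until 100% if Gold OA grows linearly in percentage points."""
    return (100.0 - current_pct) / points_per_year

def years_to_full_oa_compound(current_pct, annual_rate):
    """Years until 100% if Gold OA grows by a fixed fraction of itself."""
    years, pct = 0, current_pct
    while pct < 100.0:
        pct *= (1.0 + annual_rate)
        years += 1
    return years

print(years_to_full_oa_linear(2.5, 1.0))     # 97.5 years at 1 point/year
print(years_to_full_oa_compound(2.5, 0.20))  # 21 years *if* 20%/year compounding held
```

At one percentage point per year, 100% Gold OA is roughly a century away; quoting growth as a percentage of the previous year's (still tiny) Gold output makes the same data look far headier.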
Here the proportions within two years of mandate adoption (60%+) and the subsequent growth rate toward 100% eclipse not only the worldwide Gold OA proportions and growth rate, but also the larger but still unimpressive worldwide proportions and growth rate for unmandated Green OA (which is still mostly all there is).

5. Mandate Effectiveness: Note also that RCUK's prior Green OA mandate was not an effective one (because it had no compliance-verification mechanism), even though it may have increased UK OA (35%) by 10% over the global average (25%).

Stephen Curry: "A cheaper green route is also available, whereby the author usually deposits an unformatted version of the paper in a university repository without incurring a publisher's charge, but it remains to be seen if this will be adopted in practice. Universities and research institutions are only now beginning to work out how to implement the new policy (recently clarified by the RCUK)."

Well, actually RCUK has had Green OA mandates for over half a decade now. But RCUK has failed to draw the obvious conclusion from its pioneering experiment -- which is that the RCUK mandates require an effective compliance-verification mechanism (of the kind that the effective university mandates have -- indeed, the universities themselves need to be recruited as the compliance-verifiers). Instead, taking its cue from the Finch Report -- which in turn took its cue from the publisher lobby -- RCUK is doing a U-turn from its existing Green OA mandate and electing to double-pay publishers for Gold instead. A much more constructive strategy would be for RCUK to build on its belated, grudging concession (that although Gold is RCUK's preference, RCUK fundees may still choose Green) by adopting an effective Green OA compliance-verification mechanism. That (rather than the obsession with how to spend "block grants" for Gold) is what the fundees' institutions should be recruited to do for RCUK.

6.
Discipline Differences: The main difference between the Gargouri, Lariviere, Gingras, Carr & Harnad estimates of average percent Gold in the ISI sample (2.5%) and the Laakso & Björk estimates (10.3% for 2010) probably arises because L&B's sample included all ISI articles per year for 12 years (2000-2011), whereas ours was a sample of 1300 articles per year, per discipline, separately, for each of 14 disciplines, for 6 years (2005-2010: a total of about 100,000 articles).

7. Biomedicine Preponderance? Our sample was much smaller than L&B's because L&B were just counting total Gold articles, using DOAJ, whereas we were sending out a robot to look for Green OA versions on the Web for each of the 100,000 articles in our sample. It may be this equal sampling across disciplines that leads to our lower estimates of Gold: L&B's higher estimate may reflect the fact that certain disciplines are both more Gold and publish more articles (in our sample, Biomed was 7.9% Gold). Note that both studies agree on the annual growth rate of Gold (about 1%).

8. Growth Spurts? Our projection does not assume a linear year-to-year growth rate (1%); it detects one. There have so far been no detectable annual growth spurts (of either Gold or Green). I agree, however, that Finch/RCUK could herald one forthcoming annual spurt of 6% Gold (the UK's share of world research output) -- but that would be a rather pricey (and, I suspect, unscalable and unsustainable) one-off growth spurt.

9. RCUK Compliance-Verification Mechanism for Green OA Deposits: I certainly hope Stephen Curry is right that I am overstating the ambiguity of the RCUK policy! But I was not at all reassured at the LSHTM meeting on Open Access by Ben Ryan's rather vague remarks about monitoring RCUK mandate compliance, especially compliance with Green. After all, that (and not the failure to prefer and fund Gold) was the main weakness of the prior RCUK OA mandate.

Stevan Harnad

Saturday, April 2.
2011

"The Sole Methodologically Sound Study of the Open Access Citation Advantage(!)"

It is true that downloads of research findings are important. They are being measured, and the evidence of the open-access download advantage is growing. See: S. Hitchcock (2011) "The effect of open access and downloads ('hits') on citation impact: a bibliography of studies".

But the reason the open-access citation advantage is especially important is that refereed research is conducted and published so that it can be accessed, used, applied and built upon in further research: research is done by researchers, for uptake by researchers, for the benefit of the public that funds the research. Both research progress and researchers' careers and funding depend on research uptake and impact.

The greatest growth potential for open access today is through open-access self-archiving mandates adopted by the universal providers of research: the researchers' universities, institutions and funders (e.g., Harvard and MIT). See the ROARMAP registry of open-access mandates. Universities adopt open access mandates in order to maximize their research impact. The large body of evidence, in field after field, that open access increases citation impact helps motivate universities to mandate open-access self-archiving of their research output, to make it accessible to all its potential users -- rather than just those whose universities can afford subscription access -- so that all can apply, build upon and cite it. (Universities can only afford subscription access to a fraction of research journals.)

The Davis study lacks the statistical power to show what it purports to show, which is that the open access citation advantage is not causal, but merely an artifact of authors self-selectively self-archiving their better (hence more citable) papers. Davis's sample size was smaller than many of the studies reporting the open access citation advantage.
Davis found no citation advantage for randomized open access. But that does not demonstrate that open access is a self-selection artifact -- in that study or any other -- because Davis did not replicate the widely reported self-archiving advantage either, and that advantage is often based on far larger samples. So the Davis study is merely a small non-replication of a widely reported outcome. (There are a few other non-replications; but most of the studies to date replicate the citation advantage, especially those based on bigger samples.)

Davis says he does not see why the inferences he attempts to make from his results -- that the reported open access citation advantage is an artifact, eliminated by randomization; that there is hence no citation advantage, which implies that there is no research access problem for researchers; and that researchers should just content themselves with the open access download advantage among lay users and forget about any citation advantage -- are not welcomed by researchers. These inferences are not welcomed because they are based on flawed methodology and insufficient statistical power, and yet they are being widely touted -- particularly by the publishing industry lobby (see the spin FASEB is already trying to put on the Davis study: "Paid access to journal articles not a significant barrier for scientists"!) -- as being the sole methodologically sound test of the open access citation advantage: "Ignore the many positive studies. They are all methodologically flawed. The definitive finding, from the sole methodologically sound study, is null. So there's no access problem, researchers have all the access they need -- and hence there's no need to mandate open access self-archiving."

No, this string of inferences is not a "blow to open access" -- but it would be if it were taken seriously. What would be useful and opportune at this point would be a meta-analysis.
Stevan Harnad
American Scientist Open Access Forum
EnablingOpenScholarship

The Sound of One Hand Clapping

Suppose many studies report that cancer incidence is correlated with smoking, and you want to demonstrate in a methodologically sounder way that this correlation is not caused by smoking itself but is just an artifact of the fact that the same people who self-select to smoke are also the ones who are more prone to cancer. So you test a small sample of people randomly assigned to smoke or not, and you find no difference in their cancer rates. How can you know that your sample was big enough to detect the repeatedly reported correlation at all, unless you test whether it is big enough to show that cancer incidence is significantly higher for self-selected smoking than for randomized smoking?

Many studies have reported a statistically significant increase in citations for articles whose authors make them OA by self-archiving them. To show that this citation advantage is not caused by OA but is just a self-selection artifact (because authors selectively self-archive their better, more citable papers), you first have to replicate the advantage itself, for the self-archived OA articles in your sample, and then show that that advantage is absent for the articles made OA at random. But Davis showed only that the citation advantage was absent altogether in his sample. The most likely reason for that is that the sample was much too small (36 journals, 712 articles randomly OA, 65 self-archived OA, 2533 non-OA).

In a recent study (Gargouri et al 2010) we controlled for self-selection using mandated (obligatory) OA rather than random OA. The far larger sample (1984 journals, 3055 articles mandatorily OA, 3664 self-archived OA, 20,982 non-OA) revealed a statistically significant citation advantage of about the same size for both self-selected and mandated OA.
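The sample-size point can be illustrated with a toy Monte Carlo power check. Everything below is an assumption of mine for illustration only (not Davis's data): a genuine 20% citation advantage, realistically overdispersed citation counts, and group sizes matching the 65 self-archived vs. 2533 non-OA articles.

```python
# Toy power simulation (illustrative assumptions, not Davis's data).
# Model: citation counts are overdispersed -- each article's Poisson rate
# is itself drawn from an exponential distribution -- and OA articles have
# a genuine 20% advantage in mean citations (6.0 vs 5.0). Question: how
# often would a simple two-sample z-test detect that real advantage with
# only 65 OA articles against 2533 non-OA controls?
import math
import random

random.seed(0)  # reproducible runs

def overdispersed_citations(mean):
    """One article's citation count: Poisson with an exponentially drawn rate."""
    lam = random.expovariate(1.0 / mean)
    # Knuth's Poisson sampler
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def significant(a, b, z_crit=1.96):
    """True if mean(a) exceeds mean(b) at roughly p < .05 (two-sample z-test)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb) > z_crit

def power(n_oa=65, n_non=2533, trials=120):
    """Fraction of simulated studies in which the real advantage is detected."""
    hits = sum(
        significant([overdispersed_citations(6.0) for _ in range(n_oa)],
                    [overdispersed_citations(5.0) for _ in range(n_non)])
        for _ in range(trials)
    )
    return hits / trials

print(power())  # well under 50%: a real advantage would usually go undetected
```

Under these (assumed) conditions a genuinely existing citation advantage is missed more often than it is found, which is exactly why a null result from a small sample, with no self-archiving control, cannot establish that the advantage is an artifact.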
If and when Davis's requisite self-selected self-archiving control is ever tested, the outcome will either be (1) the usual significant OA citation advantage in the self-archiving control condition that most other published studies have reported -- in which case the absence of the citation advantage in Davis's randomized condition would indeed be evidence that the citation advantage had been a self-selection artifact, successfully eliminated by the randomization -- or (more likely, I should think) (2) no significant citation advantage in the self-archiving control condition either, in which case the Davis study will prove to have been just one non-replication of the usual significant OA citation advantage (perhaps because of Davis's small sample size, the fields, or the fact that most of the non-OA articles become OA on the journal's website after a year). (There have been a few other non-replications; but most studies replicate the OA citation advantage, especially the ones based on larger samples.)

Until that requisite self-selected self-archiving control is done, this is just the sound of one hand clapping. Readers can be trusted to draw their own conclusions as to whether Davis's study, tirelessly touted as the only methodologically sound one to date, is that -- or an exercise in advocacy.

Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research (2010) PLOS ONE 5(10) (authors: Gargouri, Y., Hajjem, C., Lariviere, V., Gingras, Y., Brody, T., Carr, L. and Harnad, S.)

Thursday, March 31. 2011

On Methodology and Advocacy: Davis's Randomization Study of the OA Advantage

Open access, readership, citations: a randomized controlled trial of scientific journal publishing. doi:10.1096/fj.11-183988

Sorry to disappoint!
Nothing new to cut-and-paste or reply to: still no self-selected self-archiving control, hence no basis for the conclusions drawn (to the effect that the widely reported OA citation advantage is merely an artifact of a self-selection bias toward self-archiving the better, hence more citable, articles -- a bias that the randomization eliminates). The methodological flaw, still uncorrected, has been pointed out before.

If and when the requisite self-selected self-archiving control is ever tested, the outcome will either be (1) the usual significant OA citation advantage in the self-archiving control condition that most other published studies have reported -- in which case the absence of the citation advantage in Davis's randomized condition would indeed be evidence that the citation advantage had been a self-selection artifact, successfully eliminated by the randomization -- or (more likely, I should think) (2) no significant citation advantage in the self-archiving control condition either, in which case the Davis study will prove to have been just a non-replication of the usual significant OA citation advantage (perhaps because of Davis's small sample size, the fields, or the fact that most of the non-OA articles become OA on the journal's website after a year).

Until the requisite self-selected self-archiving control is done, this is just the sound of one hand clapping. Readers can be trusted to draw their own conclusions as to whether this study, tirelessly touted as the only methodologically sound one to date, is that -- or an exercise in advocacy.

Stevan Harnad
American Scientist Open Access Forum
EnablingOpenScholarship

Wednesday, October 20. 2010

Correlation, Causation, and the Weight of Evidence

Jennifer Howard ("Is there an Open-Access Advantage?," Chronicle of Higher Education, October 19 2010) seems to have missed the point of our article.
It is undisputed that study after study has found that Open Access (OA) is correlated with a higher probability of citation. The question our study addressed was whether making an article OA causes the higher probability of citation, or whether the higher probability causes the article to be made OA. The latter is the "author self-selection bias" hypothesis, according to which the only reason OA articles are cited more is that authors do not make all articles OA: only the better ones, the ones that are also more likely to be cited.

The Davis et al study tested this by making articles -- 247 articles, from 11 biology journals -- OA randomly, instead of letting the authors choose whether or not to do it, self-selectively, and they found no increased citations for the OA articles one year after publication (although they did find increased downloads). But almost no one finds that OA articles are cited more a year after publication: the OA citation advantage only becomes statistically detectable after citations have accumulated for 2-3 years.

Even more important, Davis et al did not test the obvious and essential control condition in their randomized OA experiment: they did not test whether there was a statistically detectable OA advantage for self-selected OA in the same journals and time-window. You cannot show that an effect is an artifact of self-selection unless you show that with self-selection the effect is there, whereas with randomization it is not. All Davis et al showed was that there is no detectable OA advantage at all in their one-year sample (247 articles from 11 biology journals); randomness and self-selection have nothing to do with it.

Davis et al released their results prematurely. We are waiting to hear what Davis finds after 2-3 years, when he completes his doctoral dissertation.
But if all he reports is that he has found no OA advantage at all in that sample of 11 biology journals, and that interval, rather than an OA advantage for the self-selected subset and no OA advantage for the randomized subset, then again all we will have is a failure to replicate the positive effect that has now been reported by many other investigators, in field after field, often with far larger samples than Davis et al's.

Meanwhile, our study was similar to Davis et al's, except that it drew on a much bigger sample, across many fields, and a much larger time window -- and, most important, we did have a self-selective matched-control subset, which did show the usual OA advantage. Instead of comparing self-selective OA with randomized OA, however, we compared it with mandated OA -- which amounts to much the same thing, because the point of the self-selection hypothesis is that the author picks and chooses what to make OA, whereas if the OA is mandatory (required), the author is not picking and choosing, just as the author is not picking and choosing when the OA is imposed randomly.

Davis's results are welcome and interesting, and include some good theoretical insights, but insofar as the OA citation advantage is concerned, the empirical findings turn out to be just a failure to replicate the OA citation advantage in that particular sample and time-span -- exactly as predicted above. The original 2008 sample of 247 OA and 1372 non-OA articles in 11 journals one year after publication has now been extended to 712 OA and 2533 non-OA articles in 36 journals two years after publication. The result is a significant download advantage for OA articles but no significant citation advantage. And our finding is that the mandated OA advantage is just as big as the self-selective OA advantage.
As we discussed in our article, if someone really clings to the self-selection hypothesis, there are some remaining points of uncertainty in our study that self-selectionists can still hope will eventually bear them out:

Compliance with the mandates was not 100%, but 60-70%. So the self-selection hypothesis has a chance of being resurrected if one argues that it is now no longer a case of positive selection for the stronger articles, but a refusal to comply with the mandate for the weaker ones. One would have expected, however, that if this were true, the OA advantage would at least be weaker for mandated OA than for unmandated OA, since the percentage of total output that is self-archived under a mandate is almost three times the 5-25% that is self-archived self-selectively. Yet the OA advantage is undiminished with 60-70% mandate compliance in 2002-2006. We have since extended the window by three more years, to 2009; the compliance rate rises by another 10%, but the mandated OA advantage remains undiminished. Self-selectionists need not concede until the percentage reaches 100%, but their hypothesis gets more and more far-fetched...

The other way of saving the self-selection hypothesis despite our findings is to argue that there was a "self-selection" bias in terms of which institutions do and do not mandate OA: maybe it is the better ones that self-select to do so. There may be a plausible case to be made that one of our four mandated institutions -- CERN -- is an elite institution. (It is also physics-only.) But, as we reported, we re-did our analysis removing CERN, and we got the same outcome. Even if the objection of eliteness is extended to Southampton ECS, removing that second institution did not change the outcome either.
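The compliance argument above can be made concrete with a deliberately crude toy model. All of the assumptions here are mine, purely for illustration: uniform article "quality", citations proportional to quality, self-selective authors archiving their best ~15% of papers, and a 65%-compliant mandate whose non-compliers are the weakest papers.

```python
# Toy model of the pure self-selection hypothesis (illustrative assumptions):
# article "quality" is Uniform(0, 1), expected citations are proportional to
# quality, and the control group is a population-average sample of non-OA
# articles. Self-selective authors archive only their top ~15% of papers;
# a mandate with 65% compliance (non-compliers being the weakest papers)
# sweeps in the top 65%.

def mean_quality_of_top(fraction):
    """Mean of Uniform(0, 1) restricted to its top `fraction`."""
    return 1.0 - fraction / 2.0

POPULATION_MEAN = 0.5  # mean quality of the population-average control group

# Predicted citation "advantage" (ratio vs. the control group) if
# self-selection alone, with no causal effect of OA, drove the numbers:
self_selected_advantage = mean_quality_of_top(0.15) / POPULATION_MEAN  # 1.85
mandated_advantage = mean_quality_of_top(0.65) / POPULATION_MEAN       # 1.35

print(self_selected_advantage, mandated_advantage)
```

On this toy reading, pure self-selection predicts a markedly smaller mandated advantage (here 1.35x vs. 1.85x); finding the two advantages to be the same size, as we did, is exactly what the hypothesis struggles to explain.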
We leave it to the reader to decide whether it is plausible to count our remaining two mandating institutions -- University of Minho in Portugal and Queensland University of Technology in Australia -- as elite institutions, compared to other universities. It is a historical fact, however, that these four institutions were the first in the world to elect to mandate OA.

One can only speculate on the reasons why some might still wish to cling to the self-selection bias hypothesis in the face of all the evidence to date. It seems almost a matter of common sense that making articles more accessible to users also makes them more usable and citable -- especially in a world where most researchers are familiar with the frustration of arriving at a link to an article that they would like to read (but to which their institution does not subscribe), so they are asked to drop it into the shopping cart and pay $30 at the check-out counter. The straightforward causal relationship is the default hypothesis, based on both plausibility and the cumulative weight of the evidence. Hence the burden of providing counter-evidence to refute it is now on the advocates of the alternative.

Davis, PN, Lewenstein, BV, Simon, DH, Booth, JG, & Connolly, MJL (2008) Open access publishing, article downloads, and citations: randomised controlled trial. British Medical Journal 337: a568

Gargouri, Y., Hajjem, C., Lariviere, V., Gingras, Y., Brody, T., Carr, L. and Harnad, S. (2010) Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research. PLOS ONE 5(10): e13636

Harnad, S. (2008) Davis et al's 1-year Study of Self-Selection Bias: No Self-Archiving Control, No OA Effect, No Conclusion. Open Access Archivangelism July 31 2008