No Statistical Panacea, Hierarchical or Otherwise


Everyone in academia knows how painful the peer-review publication process can be. It’s a lot like democracy, in that it’s the worst system ever invented, except for all the others. The peer-review process does a fair job of promoting good science overall, but it’s far from perfect. Sure, anyone can point out a hundred flaws in the system, but I’m just going to focus on one aspect that has been bothering me in particular and that has the potential to change: complicated statistical demands.

I have found that reviewers frequently require the data to be reanalyzed with a particularly popular or pet method. In my opinion, reviewers need to ask whether the statistical techniques are appropriate to answer the questions of interest. Do the data meet the assumptions necessary for the model? If there are violations, are they likely to lead to biased inference? Let’s be honest, no model is perfect and there are always potential violations of assumptions. Data are samples from reality, and a statistical model creates a representation of this reality from the data. The old adage, “All models are wrong, but some are useful,” is important for reviewers to remember. The questions are: does the model answer the question of interest (not the question the reviewer wished had been asked), is the question interesting, and were the data collected in a manner appropriate for the model?

Last year I had a manuscript rejected primarily because I did not use a hierarchical model to account for detection probability when analyzing count data. In my opinion, the reviewer was overly rigid in requiring a specific type of analysis. The worst part is that it seemed like the reviewer didn’t have extensive experience with the method. The reviewer actually wrote,

They present estimates based on raw counts, rather than corrected totals. Given that they cite one manuscript, McKenny et al., that estimated detection probability and abundance for the same taxa evaluated in the current manuscript, I was puzzled by this decision. The literature base on this issue is large and growing rapidly, and I cannot recommend publication of a paper that uses naïve count data from a descriptive study to support management implications. (emphasis mine)

I’m not alone in having publications rejected on this account. I personally know of numerous manuscripts shot down for this very reason. After such a hardline statement, this reviewer goes on to say,

The sampling design had the necessary temporal and spatial replication of sample plots to use estimators for unmarked animals. The R package ‘unmarked’ provides code to analyze these data following Royle, J. A. 2004. N-mixture models for estimating population size from spatially replicated counts. Biometrics 60:108-115.

That would seem reasonable, except that our study had 8 samples in each of 2 years at ~50 sites in 6 different habitats. That would suggest there was sufficient spatial and temporal replication to use Royle’s N-mixture model. HOWEVER, the N-mixture model assumes that the temporal replicates all occur under population closure (no changes in abundance through births, deaths, immigration, or emigration). Clearly, 2 years of count data are going to violate this assumption. The N-mixture model would be an inappropriate choice for these data. Even if the 2 years were analyzed separately, the assumption would still be violated because the data were collected biweekly from May through October each year (and eggs hatch from June through September).
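For readers unfamiliar with the approach the reviewer wanted, here is a minimal sketch of Royle’s N-mixture model fit with pcount() from the ‘unmarked’ package. The data are simulated placeholders rather than the salamander counts from the manuscript, and the simulation makes the closure assumption explicit: latent abundance at each site is held fixed across every repeat visit, which is exactly what two years of biweekly counts cannot satisfy.

```r
## A minimal sketch of what the reviewer was demanding: Royle's N-mixture
## model fit with pcount() in the 'unmarked' package. The data below are
## simulated placeholders, not the salamander counts from the manuscript.
library(unmarked)

set.seed(1)
M <- 50                                                      # sites
J <- 8                                                       # repeat visits within one closed season
habitat <- factor(sample(LETTERS[1:6], M, replace = TRUE))   # 6 habitat types
lambda  <- exp(0.5 + 0.3 * as.numeric(habitat))              # expected abundance by habitat
N <- rpois(M, lambda)                                        # latent abundance at each site
p <- 0.3                                                     # per-visit detection probability

## Closure is baked into this simulation: N is held fixed across all J visits.
## Counts spanning May-October of two years cannot satisfy that assumption.
y <- matrix(rbinom(M * J, size = rep(N, J), prob = p), nrow = M)

umf <- unmarkedFramePCount(y = y, siteCovs = data.frame(habitat = habitat))

## Double right-hand-side formula: ~detection ~abundance
fm <- pcount(~1 ~ habitat, data = umf, K = 100)              # K = upper bound on latent N
summary(fm)
predict(fm, type = "state")                                  # detection-corrected abundance by site
```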

Recently, Dail and Madsen (2011; Biometrics) developed a generalized form of the N-mixture model that works for open populations. This model might work for these data, but in my experience the Dail-Madsen model requires a huge number of spatial replicates. All of these hierarchical models accounting for detection tend to be quite sensitive to spatial replication (more than temporal), low detection probability (common with terrestrial salamanders, which were the focus of the study), and variation in detection not well modeled with covariates. Additionally, the Dail-Madsen model was published only a few months before my submission and hadn’t come out when I analyzed the data, and the reviewer did not mention it. Given the lack of time for people to become aware of the model and the lack of rigorous testing of it, it would seem insane to require it for publication. To be fair, I believe Marc Kery did have a variation of the N-mixture model that allowed for population change (Kery et al. 2009).
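For comparison, here is a rough sketch of the Dail-Madsen generalization fit with pcountOpen() from ‘unmarked’, again using simulated placeholder data rather than the real study. The two years enter as primary periods with 8 secondary visits each, so closure is assumed only within a year; even then, within-season hatching would strain the assumption, and in my experience ~50 sites is on the low end for this model.

```r
## A rough sketch of the open-population (Dail-Madsen) generalization via
## pcountOpen() in 'unmarked', again with simulated placeholder data. The two
## years are treated as primary periods (closure assumed only within a year),
## with 8 secondary visits each.
library(unmarked)

set.seed(2)
M <- 50; nyear <- 2; J <- 8
habitat <- factor(sample(LETTERS[1:6], M, replace = TRUE))
N1 <- rpois(M, 8)                                  # year-1 abundance
N2 <- rpois(M, 0.8 * N1 + 2)                       # year-2 abundance (population is open)
p  <- 0.3

## Observed counts: an M x (J * nyear) matrix, year-1 columns then year-2 columns
y <- cbind(matrix(rbinom(M * J, size = rep(N1, J), prob = p), nrow = M),
           matrix(rbinom(M * J, size = rep(N2, J), prob = p), nrow = M))

umf <- unmarkedFramePCO(y = y, siteCovs = data.frame(habitat = habitat),
                        numPrimary = nyear)

## Formulas in order: initial abundance, recruitment (gamma), survival (omega), detection
fm <- pcountOpen(~habitat, ~1, ~1, ~1, data = umf, K = 80)
summary(fm)
```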

So if I can’t use the N-mixture model because of extreme violations of model assumptions, and the data are insufficient for the Dail-Madsen model, what was I supposed to do with this study? The associate editor rejected the paper without a chance for rebuttal. It was a decent management journal, but certainly not Science, or even Ecology or Conservation Biology. The data had been collected in 1999-2000, before most of these hierarchical detection models had been invented. They’ve unfortunately been sitting in a drawer for too long. Had they been published in 2001-2002, no one would have questioned this and it would have gotten favorable reviews. The data were collected quite well (I didn’t collect them, so it’s not bragging) and the results are extremely clear. I’m not saying that detection isn’t important to think about, but in this case even highly biased detection wouldn’t change the story, just the magnitude of the already very large effect. There has recently been good discussion of the importance of accounting for detection and how well these models actually parse abundance/occupancy and detection, so I won’t rehash it too much here. See Brian McGill’s posts on Statistical Machismo and the plethora of thoughtful comments here and here.

Based on this one reviewer’s hardline comments and the associate editor’s decision to reject it outright, it seems like they are suggesting that these data reside in a drawer forever (if they can’t be used with an N-mixture or Dail-Madsen model). With that mindset, all papers using count data published before ~2002-2004 should be ignored, and most data collected before then should be thrown out to create more server space. This would be a real shame for long-term datasets, of which there are too few in ecology! This “hierarchical detection model or no publication” mindset seems like a hypercritical perspective and review. I’m still working on the reanalysis and revision to send to another journal. We’ll see what happens with it in the future, and if it ever gets published I’ll post a paper summary on this blog. If I don’t use a hierarchical detection model, then I am lumping abundance processes with detection processes, and that should be acknowledged. It adds uncertainty to the inference about abundance, but given the magnitude of the differences among habitats and knowledge of the system, it’s hard to imagine it changing the management implications of the study at all.
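To make that last point concrete, here is what a simple analysis of the raw counts might look like, reusing the simulated objects from the sketches above (so it is an illustration, not the analysis in the manuscript). The habitat effects from a model like this describe relative counts, which confound abundance with detection; that is the caveat to state openly rather than withholding the data from publication.

```r
## A Poisson GLM of the raw counts on habitat, reusing the simulated y and
## habitat objects from the sketches above (illustration only). The fitted
## habitat coefficients describe expected counts, i.e., abundance multiplied
## by detection probability -- the lumping that should be acknowledged when
## no hierarchical detection model is used.
counts.long <- data.frame(count   = as.vector(y),
                          habitat = rep(habitat, times = ncol(y)))
fm.naive <- glm(count ~ habitat, family = poisson, data = counts.long)
summary(fm.naive)
exp(coef(fm.naive))   # multiplicative habitat effects on expected counts
```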

My point in all of this is that there is no statistical panacea. I think hierarchical models are great, and in fact I spend most of my days running various forms of these models. However, I don’t think they solve all problems, and they aren’t the right tool for every job. I think most current studies where there is even a slight chance of detection bias should be designed to account for it, but that doesn’t mean that all studies are worthless if they don’t use these models. These models are WAY more difficult to fit than most people realize and don’t always work. Hopefully, as science and statistics move forward in ever more complicated ways, more reviewers will start to realize that there is no perfect model or method. They just need to ask whether the methods employed are adequate to answer the question and whether the inference from the statistical models accurately reflects the data and model assumptions. Just because a technique is new and sexy doesn’t mean that everyone needs to use it in every study.

About Daniel Hocking

I am a post-doctoral researcher at UMass-Amherst. I am interested in the use of statistical models in ecology and population biology.

Posted on February 13, 2013, in Academia. 9 Comments.

  1. an anonymous coward

    Thanks for the post. It’s rare that someone speaks about this publicly.

    Like possibly many others, I have experienced rejections based on statistical arguments that are either over the top or simply wrong. One that hurt particularly came after a Major Revision and a Minor Revision, when the subject editor and two reviewers were satisfied. The EiC sent it to a third reviewer, who just killed our paper by a) suggesting we use GLMMs while b) stating that such models would not be suitable due to the high number of collinear variables. In a concluding remark, he just said our work had no purpose. Paper killed, without a chance to respond even to the strange suggestion of unsuitable models. The EiC wrote: please do not hesitate to send other manuscripts to our journal. It took more than half a year of thorough review, and three versions of the manuscript, with a lot of additional work requested by the (original) reviewers, to get to this point.

    The paper got published, and even in a good journal.
    But it took us more than two years in total, as other journals also rejected (but invited for resubmission) on the grounds of suggested modeling techniques which were unsuitable for our data.

    And there is no place on the web that I know of where this sort of thing is collected.

    • Thanks for sharing your experience, Senior A.A. Coward. As a postdoc, maybe I’m not the one who should be shouting about this through cyberspace, but it seems like such a common problem that I had to say something. I just think reviewers need to ask themselves whether the analysis they are about to demand is really critical or just their favorite, sexy tool at the moment. Then they have to be clear in the comments to the authors and editor whether something they suggest is really necessary or just a preference that could improve the manuscript. In my reviews now, I try to be clear when I think that a different analysis might yield more interesting information but isn’t critical for answering the question (and hence publication), and when the method is inappropriate and needs alteration before publication (which, in my limited experience, is fairly rare). As an author, I appreciate the former type of comment because, if nothing else, it gives me something to think about for future studies even if I don’t incorporate it in the current manuscript.

  2. Great and very thoughtful post, Daniel!

  3. It seems like many researchers think “advanced stats” = good science. New, fancy statistical methods are interesting and fun, but that doesn’t mean they work well in practice. Being required to implement recent, not-well-tested statistical methods is bad.

    Simple is good:

    http://www.esajournals.org/doi/abs/10.1890/0012-9658%282007%2988%5B56:SACIED%5D2.0.CO;2

  4. Science is a human culture, and as such it is subject to fashion, blood feuds, and other social biases. The funny thing about pet analyses is that most of their cheerleaders, sorry, advocates, have limited statistical knowledge and just try to show off by demanding complex analyses for simple predictions. Several important questions have been answered with an ordinary chi-squared test.

  1. Pingback: 2013 in review | Daniel J. Hocking
