
Academic Social Networking

Online social networking was basically unheard of just a decade ago; now it's integrated into the fabric of American society. Facebook pages are advertised on the national news and Twitter hashtags pop up everywhere. And it's not just American society either, as evidenced by the use of Twitter for social organization during the Arab Spring.

From what I've observed, the use of social networking varies widely in academia. This isn't surprising given the high demands already placed on academics and the slow turnover rate among faculty (older faculty are less likely to adopt new tools of questionable utility, though there are numerous exceptions, of course).

I have found blogging to be useful for improving my quick writing skills, thinking through new ideas, getting feedback on ideas and computer code, and making new contacts. My blog is networked through r-bloggers and the International Network of Next Generation Ecologists (INNGE) supported Ecobloggers, which both help create a community of users. I have found a tremendous amount of useful R code and statistical advice on other people’s blogs.

However, I've found social networks like Facebook and LinkedIn to be of limited use so far. I'm sure there is a place for work-related Facebook pages, but I've just found that I have better outlets for that kind of content. Google+ is similar, although I do have an account and use it on occasion for science-related posts and reading. The only mainstream social network that I find really useful is Twitter. I get feedback on computing and statistics questions quickly, find links to articles and ideas I wouldn't otherwise come across, meet and interact with new people (even in person at the ESA meeting tweetup), and share my thoughts and research with a larger audience.

I've signed up for, but don't regularly use, a number of other social networks and sites where I can post my academic persona. I think these could be useful but haven't made great use of them yet (listed below). Even things like Mendeley and Stack Exchange have social components and user rating/badge systems. I think it's important to manage one's own online image both personally and professionally. I don't know if I do the best job, but I've at least had some fun exploring different options. I am generally careful not to post anything online that I wouldn't want a hiring committee or my grandmother to find.


Here's what I've used for managing my online professional presence:

Teaching Scientific Computing: Peer Review

This post is going out on a bit of a limb because I am not familiar with the pedagogical literature relating to teaching scientific computing. As such, I can only speak from my very limited experience. I've taken a couple of short courses on scientific computing, but the only formal full-semester course I've taken was Introduction to C Programming for Engineers 15 years ago. In that course, the instructor spent 50 minutes 3 days a week writing code on the chalkboard in front of us and we were expected to learn. Homework was to write increasingly large programs throughout the semester. If they didn't work we got a 0%, if they produced the wrong output we got a 50%, and if they worked properly we got a 100%. Obviously it was a terrible course (although a number of my statistics courses that involved programming were not very different, so this might be more common than I'd like to believe). Besides some of the conspicuous instructional problems, I was just thinking that scientific programming courses could learn from pedagogy in the humanities. The University of New Hampshire requires undergraduates to take a number of writing intensive courses. To qualify as writing intensive, a course must meet 3 criteria:

  1. Students in the course should do substantial writing that enhances learning and demonstrates knowledge of the subject or the discipline. Writing should be an integral part of the course and should account for a significant part (approximately 50% or more) of the final grade.
  2. Writing should be assigned in such a manner as to require students to write regularly throughout the course. Major assignments should integrate the process of writing (prewriting, drafting, revision, editing). Students should be able to receive constructive feedback of some kind (peer response, workshop, professor, TA, etc.) during the drafting/revising process to help improve their writing.
  3. The course should include both formal (graded) and informal (heuristic) writing.  There should be papers written outside of class which are handed in for formal evaluation as well as informal assignments designed to promote learning, such as invention activities, in-class essays, reaction papers, journals, reading summaries, or other appropriate exercises.

I think these criteria could be applied, or at least adapted, for scientific computing courses. The 1st one is easy. The 2nd and 3rd are what I think computing courses could really take advantage of. From what I've seen, there is often not a lot of time spent on informal feedback from instructors and peers to help with revision. In programming, especially with flexible languages like R, there are often many solutions to the same problem. Useful assignments could be to critique the programs of peers, find ways to improve code efficiency, and provide alternative solutions to sections of code (see the sketch below). This could include critiques of the commenting and README files.
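For example, here's a minimal sketch of the kind of exercise I have in mind (the data and variable names are entirely hypothetical). A student submits a loop, and a peer reviewer suggests a vectorized alternative and verifies the two agree:

# Hypothetical peer-review exercise: two solutions to the same task,
# computing the mean count per species from a data frame of counts
counts <- data.frame(species = rep(c("a", "b", "c"), each = 10),
                     count = rpois(30, lambda = 5))

# Student's submission: an explicit loop over species
means_loop <- numeric(0)
for (sp in unique(counts$species)) {
  means_loop[sp] <- mean(counts$count[counts$species == sp])
}

# Peer's suggested alternative: a vectorized one-liner
means_vec <- tapply(counts$count, counts$species, mean)

# Part of the critique: check that the alternatives give the same answer
all.equal(as.numeric(means_loop), as.numeric(means_vec))

Neither version is wrong; discussing the trade-offs (readability, speed, idiom) is exactly the kind of informal feedback that writing-intensive courses build in.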

In introductory courses there is often an emphasis on covering content. Some people will balk at the idea of spending time learning alternatives to simple options when there is clearly 1 best solution and so much material to cover just to get students writing even simple scripts. However, in my opinion it's better to learn a few things well than many things superficially. By evaluating, revising, and developing alternatives to code written by peers, students will learn to program better. There is a reason that informal assessment, peer review, and revision are a required part of writing intensive courses. Those same reasons apply to scientific computing courses. Just as review and revision make us better writers, they will make us better programmers.

Open Access Publishing

As those who know me IRL or follow me on Twitter (@djhocking) are aware, I am an advocate for open science. This includes data sharing, open-source software, open access to analysis code, and open-access publishing. This is my first post on the subject. I actually started this post back on 02 April 2013, but then my daughter surprised us by arriving a couple of weeks early, so I am just reviving the post now. My thoughts were stimulated by an article in Nature: Cost of Publishing. In the article the author notes,

an average revenue per article of roughly $5,000. Analysts estimate profit margins at 20–30% for the industry, so the average cost to the publisher of producing an article is likely to be around $3,500–4,000.

The author notes that costs vary widely and are difficult to estimate. “Diane Sullenberger, executive editor for Proceedings of the National Academy of Sciences in Washington DC, says that the journal would need to charge about $3,700 per paper to cover costs if it went open-access.” These values align well with the typical $3000 cost of electing for open access in traditional journals. Nature Publishing suggests that it would be much more expensive for it to publish open-access. These traditional journals provide copy-editing and sometimes promotional activities. However, newer journals without a tradition of paper printing, copy-editing, and typesetting are able to publish open-access articles at less expense. One complaint from scientists is that they provide the reviewing, primary editing, and formatting for free and don't see where the expense comes from.

For example, most of PLoS ONE’s editors are working scientists, and the journal does not perform functions such as copy-editing. Some journals, including Nature, also generate additional content for readers, such as editorials, commentary articles and journalism.

Some of the expense is reliable, long-term server space. Publishers such as PLOS required considerable initial capital investment through grants and venture capital. High-volume publishing then helps keep finances in the black. PLOS ONE charges $1350 per article but is generally very good about reducing/waiving fees if grants are not available to pay for publishing. In addition to the wave of new open access journals, there is increasing interest in preprint servers (more here about preprints). The Nature article points out that,

Many researchers in fields such as mathematics, high-energy physics and computer science … post pre- and post-reviewed versions of their work on servers such as arXiv — an operation that costs some $800,000 a year to keep going, or about $10 per article. Under a scheme of free open-access ‘Episciences’ journals proposed by some mathematicians this January, researchers would organize their own system of community peer review and host research on arXiv, making it open for all at minimal cost (see Nature http://doi.org/kwg; 2013).

One other major benefit of open access publishing is that even if per-article costs remained the same, there would be value in the time researchers save in accessing and reading papers that are not behind paywalls. Despite the many benefits of open-access publishing, the Nature article points out that,

a total conversion will be slow in coming, because scientists still have every economic incentive to submit their papers to high-prestige subscription journals. The subscriptions tend to be paid for by campus libraries, and few individual scientists see the costs directly. From their perspective, publication is effectively free.

Open Access is Coming, Though

The US Federal Government will be requiring open access to articles resulting from publicly funded research (see the White House Open Access memo).

A review cascade can also help shorten publishing and review times, as well as encourage open-access publishing. Nature now has a review cascade.

OA Journals for Ecology and the Environment
Here are some open access journals for research on ecology, conservation biology, and the environment. Most of my focus is on English-language journals for ecology, but even for that discipline this is in no way an exhaustive list. New OA journals seem to be popping up everywhere these days. It will certainly be interesting to see the future of scientific publishing. More information on OA journals is available through the Directory of Open Access Journals (DOAJ). Let me know if you have experience with any of these journals/publishers or if you know of other good options for ecology and conservation.

[UPDATE: This list is now being updated at http://danieljhocking.wordpress.com/links/oa-journals/]

BMC Ecology

  • Publisher: BioMed Central
  • Indexed: Yes
  • Year Established: 2001
  • Eigenfactor:
  • OA Cost: $USD 1955

Elementa

  • Publisher: BioOne?
  • Indexed: Not yet?
  • Year Established: 2012
  • Eigenfactor:
  • OA Cost: $USD 1,450
  • Judge Importance: No
  • Acceptance Rate: Yes
  • Publish Reviews: No
  • License: CC-BY 3.0 Unported

Ecosphere

  • Publisher: Ecological Society of America
  • Indexed: Not as of 02 March 2013
  • Year Established: 2010
  • Eigenfactor:
  • Impact Factor: Not calculated yet
  • OA Cost: $USD 1250/1500 (members/non-members)
  • Judge Importance: Yes
  • Acceptance Rate:
  • Publish Reviews: No

Herpetological Conservation and Biology

  • Publisher:
  • Indexed: Yes
  • Year Established: 2006
  • Eigenfactor:
  • Impact Factor: 0.76 (2012: 5yr)
  • OA Cost: No
  • Judge Importance: Yes
  • Acceptance Rate: 60%
  • Publish Reviews: Yes

Ideas in Ecology and Evolution

  • Publisher: Queen’s University
  • Indexed: Partly (not by ISI)
  • Year Established: 2008
  • Eigenfactor:
  • Impact Factor:
  • OA Cost: $50 – 200 (Canadian)
  • Judge Importance: Yes
  • Acceptance Rate:
  • Publish Reviews: Yes

International Journal of Ecology

  • Publisher: Hindawi Publishing Company
  • Indexed: Yes
  • Year Established: 2007
  • Eigenfactor:
  • Impact Factor:
  • OA Cost: $USD 600
  • Judge Importance: Yes
  • Acceptance Rate:
  • Publish Reviews: Yes

Journal of Biodiversity and Ecological Sciences

  • Publisher:
  • Indexed:
  • Year Established: 2011
  • Eigenfactor:
  • Impact Factor:
  • OA Cost:
  • Judge Importance:
  • Acceptance Rate:
  • Publish Reviews: Yes

Natural Resources

The Open Ecology Journal

  • Publisher: Bentham Open
  • Indexed: Yes
  • Year Established: 2008
  • Eigenfactor:
  • Impact Factor: ~1.86
  • OA Cost: $600-900
  • Judge Importance:
  • Acceptance Rate:
  • Publish Reviews: Yes
  • License: Creative Commons Attribution non-commercial License 3.0

Open Journal of Ecology

  • Publisher: Scientific Research Publishing
  • Indexed: Yes
  • Year Established: 2011
  • Eigenfactor:
  • Impact Factor:
  • OA Cost: $USD 500 +50 per page over 10
  • Judge Importance:
  • Acceptance Rate:
  • Publish Reviews: ?
  • License: Creative Commons Attribution License

PeerJ (how PeerJ is changing everything)

  • Publisher: PeerJ
  • Indexed: Yes
  • Year Established: 2012
  • Eigenfactor:
  • Impact Factor:
  • OA Cost: Lifetime Membership ($USD 99 per author)
  • Judge Importance: Yes
  • Acceptance Rate:
  • Publish Reviews: No (can upload as non-peer reviewed PrePrint)

PLoS Biology

  • Publisher: Public Library of Science
  • Indexed: Yes
  • Year Established: 2003
  • Eigenfactor:
  • Impact Factor:
  • OA Cost: $USD 2900 (reduced for many countries or at request)
  • Judge Importance: Yes
  • Acceptance Rate:
  • Publish Reviews: Yes

PLoS ONE

  • Publisher: Public Library of Science
  • Indexed: Yes
  • Year Established: 2006
  • Eigenfactor:
  • Impact Factor:
  • OA Cost: $USD 1350 (reduced for many countries or at request)
  • Judge Importance: No
  • Acceptance Rate: 69%
  • Publish Reviews: No

OA Options of more traditional journals
Acta Oecologica – Published by Elsevier, whose relationship to the OA movement has been controversial (here, here, here, current info). OA option: $USD 2500

Regular Expressions to Increase MS Word Efficiency

Regular Expressions (Photo credit: Jeff Kubina)

Just a quick post today. I was formatting an article for submission to the journal Biological Conservation. In the instructions for the authors, I came across the line “Use decimal points (not commas); use a space for thousands (10 000 and above).”

For me that means numbers like 1,565 need to become 1565 (smaller than 10,000) and 136,000 becomes 136 000.

Without regular expressions, the options are to search the document for commas (hundreds in the document) or go through the entire manuscript line by line and hope you don't miss anything. Regular expressions allow you to match patterns in documents/files/code. They can help you find files on your computer, scrape web sites, or, in this case, find and replace strings in a Microsoft Word document.

For my example above, I used the find and replace feature in MS Word (you may need to go into the advanced options and check “use wildcards”). To replace the comma with a space for values over 10,000, I searched to find

([0-9])([0-9]),([0-9])([0-9])([0-9])

which means find a digit between 0 and 9, followed by another digit, then a comma, followed by three more digits. This will work for numbers of 10,000 and above, including 100,000. You may need a different search for numbers over 1 million, but I knew I didn't have any in this document.

I then replaced each string that matched that with

 \1\2 \3\4\5

which means: replace the matched string with the first character (digit 0-9), then the second character, then a space, then the next three digits.
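If you'd rather script the change (or sanity-check it outside of Word), the same transformation is easy with standard regular expressions. Here's a minimal sketch in R using hypothetical example strings; note that standard regex syntax differs a bit from Word's wildcards:

# Hypothetical manuscript text as a character vector
x <- c("a population of 1,565 animals", "an area of 136,000 ha")

# Numbers of 10,000 and above: replace the comma with a space
x <- gsub("([0-9]{2,}),([0-9]{3})", "\\1 \\2", x)

# Numbers under 10,000: drop the comma entirely
# (perl = TRUE for the \b word-boundary anchors)
x <- gsub("\\b([0-9]),([0-9]{3})\\b", "\\1\\2", x, perl = TRUE)

x
# [1] "a population of 1565 animals" "an area of 136 000 ha"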

With wildcards like * and nearly unlimited combinations, once you get comfortable with regular expressions, you can locate and modify documents with ease. See here or here for more of the basics of regular expressions.

Now to get that manuscript submitted. . .

No Statistical Panacea, Hierarchical or Otherwise

Everyone in academia knows how painful the peer-review publication process can be. It's a lot like democracy, in that it's the worst system ever invented, except for all the others. The peer-review process does a fair job of promoting good science overall, but it's far from perfect. Sure, anyone can point out a hundred flaws in the system, but I'm just going to focus on one aspect that has been bothering me in particular and has the potential to change: complicated statistical demands.

I have found that reviewers frequently require the data to be reanalyzed with a particularly popular or pet method. In my opinion, reviewers need to ask whether the statistical techniques are appropriate to answer the questions of interest. Do the data meet the assumptions necessary for the model? If there are violations, are they likely to lead to biased inference? Let's be honest, no model is perfect and there are always potential violations of assumptions. Data are samples from reality, and a statistical model creates a representation of this reality from the data. The old adage, “All models are wrong, but some are useful,” is important for reviewers to remember. The questions are: does the model answer the question of interest (not the question the reviewer wished had been asked), is the question interesting, and were the data collected in a manner appropriate to the model?

Last year I had a manuscript rejected primarily because I did not use a hierarchical model to account for detection probability when analyzing count data. In my opinion, the reviewer was far too rigid in requiring a specific type of analysis. The worst part is that it seemed like the reviewer didn't have extensive experience with the method. The reviewer actually wrote,

They present estimates based on raw counts, rather than corrected totals. Given that they cite one manuscript, McKenny et al., that estimated detection probability and abundance for the same taxa evaluated in the current manuscript, I was puzzled by this decision. The literature base on this issue is large and growing rapidly, and I cannot recommend publication of a paper that uses naïve count data from a descriptive study to support management implications. (emphasis mine)

I’m not alone in having publications rejected on this account. I personally know of numerous manuscripts shot down for this very reason. After such a hardline statement, this reviewer goes on to say,

The sampling design had the necessary temporal and spatial replication of sample plots to use estimators for unmarked animals. The R package ‘unmarked’ provides code to analyze these data following Royle, J. A. 2004. N-mixture models for estimating population size from spatially replicated counts. Biometrics 60:108-115.

That would seem reasonable except that our study had 8 samples in each of 2 years at ~50 sites in 6 different habitats. That would suggest there was sufficient spatial and temporal replication to use Royle's N-mixture model. HOWEVER, the N-mixture model assumes that all of the temporal replication occurs within a period of population closure (no changes in abundance through births, deaths, immigration, or emigration). Clearly, 2 years of count data are going to violate this assumption, so the N-mixture model would be an inappropriate choice for these data. Even if the 2 years were analyzed separately, it would still violate this assumption because the data were collected biweekly from May – October each year (and eggs hatch in June – September).
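For readers who haven't fit one, here's a minimal sketch of the analysis the reviewer demanded, using pcount from the unmarked package (the data and covariate names here are hypothetical). The key point is that the columns of the count matrix are assumed to be repeat visits to a closed population:

library(unmarked)

# Hypothetical data: 50 sites x 8 within-season visits,
# which pcount assumes come from a closed population
set.seed(1)
y <- matrix(rpois(50 * 8, lambda = 2), nrow = 50, ncol = 8)
habitat <- factor(sample(c("forest", "field"), 50, replace = TRUE))

umf <- unmarkedFramePCount(y = y,
                           siteCovs = data.frame(habitat = habitat))

# Double formula: detection model first, then abundance model;
# K is the upper bound for the latent abundance at each site
fm <- pcount(~ 1 ~ habitat, data = umf, K = 100)
summary(fm)

Feeding it 2 years of biweekly counts as if they were closed-season replicates would violate the model's core assumption.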

Recently, Dail and Madsen (2011; Biometrics) developed a generalized form of the N-mixture model that works for open populations. This model might work for these data, but in my experience the Dail-Madsen model requires a huge number of spatial replicates. All of these hierarchical models accounting for detection tend to be quite sensitive to limited spatial replication (more so than temporal), low detection probability (common with terrestrial salamanders, which were the focus of the study), and variation in detection not well modeled with covariates. Additionally, the Dail-Madsen model was published only a few months before my submission and hadn't come out when I analyzed the data, plus the reviewer did not mention it. Given the lack of time for people to become aware of the model, and the lack of rigorous testing of it, it would seem insane to require its use for publication. To be fair, I believe Marc Kery did have a variation of the N-mixture model that allowed for population change (Kery et al. 2009).
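For what it's worth, the Dail-Madsen model is now implemented in unmarked as pcountOpen. A minimal sketch with hypothetical data (50 sites, 2 years as primary periods, 8 visits each) might look like this:

library(unmarked)

# Hypothetical counts: 50 sites x 16 visits, organized as
# 2 primary periods (years) of 8 secondary visits each
set.seed(2)
y <- matrix(rpois(50 * 16, lambda = 2), nrow = 50, ncol = 16)

umf <- unmarkedFramePCO(y = y, numPrimary = 2)

# Dail-Madsen: initial abundance (lambda), recruitment (gamma),
# apparent survival (omega), and detection (p); intercept-only here.
# Fitting can be very slow and data-hungry, as noted above.
fm_open <- pcountOpen(lambdaformula = ~ 1, gammaformula = ~ 1,
                      omegaformula = ~ 1, pformula = ~ 1,
                      data = umf, K = 50)
summary(fm_open)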

So if I can't use the N-mixture model because of extreme violations of model assumptions, and the data are insufficient for the Dail-Madsen model, what was I supposed to do with this study? The associate editor rejected the paper without a chance for rebuttal. It was a decent management journal, but certainly not Science, or even Ecology or Conservation Biology. The data had been collected in 1999-2000, before most of these hierarchical detection models had been invented. They've unfortunately been sitting in a drawer for too long; had they been published in 2001-2002, no one would have questioned this and the paper would have gotten favorable reviews. The data were collected quite well (I didn't collect them, so it's not bragging) and the results are extremely clear. I'm not saying that detection isn't important to think about, but in this case even highly biased detection wouldn't change the story, just the magnitude of the already very large effect. There has recently been good discussion of the importance of accounting for detection and how well these models actually parse abundance/occupancy and detection, so I won't rehash it too much here. See Brian McGill's posts on Statistical Machismo and the plethora of thoughtful comments here and here.

Based on this one reviewer's hardline comments and the associate editor's decision to reject the paper outright, it seems like they are suggesting that these data reside in a drawer forever (if they can't be used with an N-mixture or Dail-Madsen model). With that mindset, all papers using count data published before ~2002-2004 should be ignored, and most data collected before then should be thrown out to create more server space. This would be a real shame for long-term datasets, of which there are too few in ecology! This idea of hierarchical detection model or no publication seems like a hypercritical perspective and review. I'm still working on the reanalysis and revision to send to another journal. We'll see what happens with it in the future, and if it ever gets published I'll post a paper summary on this blog. If I don't use a hierarchical detection model, then I am lumping abundance processes with detection processes, and that should be acknowledged. It adds uncertainty to the inference about abundance, but given the magnitude of the differences among habitats and knowledge of the system, it's hard to imagine it changing the management implications of the study at all.

My point in all of this is that there is no statistical panacea. I think hierarchical models are great, and in fact I spend most of my days running various forms of these models. However, I don't think they solve all problems, and they aren't the right tool for every job. I think most current studies where there is even a slight chance of detection bias should be designed to account for it, but that doesn't mean that all studies are worthless if they don't use these models. These models are WAY more difficult to fit than most people realize and don't always work. Hopefully, as science and statistics move forward in ever more complicated ways, more reviewers will start to realize that there is no perfect model or method. Reviewers should simply ask whether the methods employed are adequate to answer the question and whether the inference from the statistical models accurately reflects the data and model assumptions. Just because a technique is new and sexy doesn't mean that everyone needs to use it in every study.
