Today’s article is a guest blog post by Menno Henselmans. Menno is a fellow TNation writer, a fellow bodybuilding aficionado, and a fellow scientific thinker. His blog can be found HERE and you can learn more about him HERE. I like guys like Menno because it’s not easy trying to teach an unscientific world how to think scientifically. In fact, it’s exhausting, so I admire anyone who chooses this route. The article Menno has written is probably one of the most “unsexy” articles you’ll read this week; however, it’s a very important topic and one that all scientific thinkers should understand. The statistics side of research is extremely intimidating, but it can’t be ignored.
by Menno Henselmans
No, I’m not talking about a 400 lb bench press. I’m talking about statistical power. Statistical power is a greatly important concept for anyone who wants to interpret scientific research, because statistical tests form the basis of all scientific results. If you have no knowledge of statistics, you can’t properly interpret research. According to many great scientists, statistical illiteracy is the 21st century’s equivalent of the inability to read and write.
Before I continue, here’s a little test for you: Suppose a brand new study was just published in The Journal of Strength and Conditioning Research. The researchers compared two resistance training groups that were identical with the exception that one used dumbbells and the other used kettlebells. Unfortunately, the researchers had little funding, so each group had only 5 people in it. At the end of the study, the kettlebell training group had gained significantly more muscle mass than the dumbbell group. Does this support that kettlebells are more effective than dumbbells for bodybuilding purposes? I’ll give you the answer after explaining what statistical power is.
What is statistical power?
Formally defined, the power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false. It’s identical to the probability of not making a type II error (failing to reject a false null hypothesis). For those unfamiliar with the statistical jargon, here’s a more intuitive explanation: statistical power is the ability to detect an effect when there is one. It is the sensitivity of a test (sensitivity in this case is both an intuitive and the proper statistical terminology).
Here’s an example in lay terms. You’re conducting a study on the efficacy of supplement X. There are 2 aspects to whether supplement X is effective. First, there’s the physical reality of its effectiveness, unknown to you, and secondly, there’s the outcome of your statistical test. Let’s suppose that supplement X is the holy grail of muscle building supps and transforms you into the Incredible Hulk in a matter of weeks. If that’s also what your test shows, everyone’s happy. However, if supplement X actually works, but your test says it doesn’t, you’ve failed to detect an effect even though there actually was one. That’s a type II error and it’s caused by a lack of statistical power. A more sensitive test would have found an effect. (Lack of) statistical power is an important reason why different studies can show different things.
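To make the type II error concrete, here is a minimal simulation sketch in Python. Nothing in it comes from a real study: the function name simulate_power, the normally distributed outcomes with a known standard deviation of 1, and the simple two-sided z-test are all assumptions chosen to keep the example short. It runs many imaginary supplement X studies in which the supplement truly works and counts how often the test notices.

```python
import random
from statistics import NormalDist, mean


def simulate_power(n_per_group, true_effect, alpha=0.05, n_studies=2000, seed=0):
    """Estimate power by simulating many two-group studies.

    Assumptions (illustrative only): outcomes are normal with a known
    standard deviation of 1, and each study uses a two-sided z-test.
    true_effect is the real difference between groups in SD units.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    detected = 0
    for _ in range(n_studies):
        placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        supplement = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        z = (mean(supplement) - mean(placebo)) / se
        if abs(z) > z_crit:
            detected += 1  # the test correctly rejected the null hypothesis
    # Fraction of studies that found the effect = estimated power.
    # One minus this value is the estimated type II error rate.
    return detected / n_studies


small = simulate_power(n_per_group=5, true_effect=0.5)
large = simulate_power(n_per_group=50, true_effect=0.5)
print(f"power with 5 per group:  {small:.2f}")
print(f"power with 50 per group: {large:.2f}")
```

Under these assumptions, a true effect of half a standard deviation should be detected in only about 1 in 8 studies with 5 people per group, versus roughly 7 in 10 studies with 50 people per group. The supplement is exactly as effective in both cases; only the sensitivity of the test differs.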
What determines the statistical power of a test? A full power analysis takes into account many things, including the design of the experiment, but there are 3 main factors.

First, there’s the statistical significance criterion, alpha, but you can basically forget about this, because the convention of using a significance level of 0.05 has become so deeply ingrained in the scientific literature it may as well have been enforced by law.

Secondly, statistical power increases with sample size. The larger the sample size of a study, the easier it is to detect effects. In our example, if the group receiving supplement X and the placebo group both consisted of thousands of people, it’s most likely that any difference between the groups at the end of the study was caused by supplement X, because the large sample size averages out most random variation and ensures that the observed values are very close to the actual population means (provided the sampling was random and the study was well controlled). A sample size of only a few participants has far less statistical power, because the effect of supplement X is obscured by random variation or noise. In such a small sample, just one individual with, say, a heart condition that makes him or her unresponsive to the effects of supplement X may be enough to fool the test into thinking supplement X doesn’t work.

Thirdly, there’s the effect size, which is, as the name suggests, the magnitude of the effect you’re looking for. In our example, the effect size of supplement X is extremely high, because I said it turns you into the Incredible Hulk. In that case, you only really need a sample size of 1 or 2 to know supplement X is the bomb. If supplement X were only mildly effective, you’d need a far larger sample to detect its effects.
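The three factors above can be sketched in a few lines of Python. This is a rough normal approximation for comparing two equally sized groups, not a full power analysis: the function name approx_power is made up for this example, effect_size is assumed to be the true difference between groups expressed in standard deviations, and for very small samples a proper t-test based calculation would give somewhat lower power.

```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Assumptions (illustrative only): equal group sizes, normally
    distributed outcomes, and a z-test with known standard deviation.
    effect_size is the true group difference in SD units.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # significance threshold
    # How far the true effect shifts the test statistic away from zero.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either threshold when the effect is real.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Vary one factor at a time to see its influence on power.
for n in (5, 20, 100):
    print(f"n per group = {n:>3}, effect 0.5 SD: power = {approx_power(n, 0.5):.2f}")
for d in (0.2, 0.5, 1.5):
    print(f"n per group =  10, effect {d} SD: power = {approx_power(10, d):.2f}")
```

Power rises with the number of participants and with the size of the effect. Loosening alpha would raise it too, which is exactly why the 0.05 convention is held fixed. With an effect size of zero the function returns alpha itself, since the only "detections" left are false positives.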
Back to the question I asked you at the beginning. Do the results of that study support that kettlebells are more effective than dumbbells for bodybuilding purposes? Many people are inclined to say no, because the study had a small sample size. However, if the sample size was small, this lowered the study’s statistical power. Yet an effect was found. This suggests the study got its statistical power from another factor. Assuming there were no other relevant factors, this means the effect size (or the difference between the effect sizes of dumbbells and kettlebells) must have been large. Therefore, this study would actually provide very strong support for the use of kettlebells. It found a significant difference despite its small sample size.
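The reasoning in the kettlebell example can be checked with a back-of-the-envelope calculation. The sketch below uses the same simplifying assumptions as a basic z-test (equal groups, known standard deviation); the function name min_significant_difference is hypothetical, and the actual study in the thought experiment is, of course, made up. It asks how large the observed difference between two groups must be, in standard deviations, before it counts as significant.

```python
from statistics import NormalDist


def min_significant_difference(n_per_group, alpha=0.05):
    """Smallest observed group difference (in SD units) that reaches
    significance in a two-sided z-test with equal group sizes.

    Assumption (illustrative only): the standard deviation is known.
    A t-test on very small groups would demand an even larger difference.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * (2 / n_per_group) ** 0.5


for n in (5, 50, 500):
    d = min_significant_difference(n)
    print(f"{n:>3} per group: observed difference must exceed {d:.2f} SD")
```

With 5 people per group, the observed difference has to be about 1.24 standard deviations before the test calls it significant, whereas with 500 per group a difference of roughly 0.12 standard deviations is enough. A significant result from a tiny study therefore means the difference it measured was big.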
Here’s another example. Mitchell et al. recently published a study titled “Resistance exercise load does not determine training-mediated hypertrophic gains in young men” (1). They compared 3 resistance training groups: one group performed 3 sets with 30% of 1 RM, one group performed 1 set with 80% of 1 RM and one group performed 3 sets with 80% of 1 RM. At the end of the study, there were no significant differences in muscle mass or isometric strength increases between the groups. Isotonic strength increases and anabolic cell signaling were greater in the 80% groups than in the 30% group, but they did not differ between the 1 and 3 set protocols. Popular media like ScienceDaily stated, “Lifting less weight more times is just as effective at building muscle as training with heavy weights, a finding by McMaster researchers that turns conventional wisdom on its head” and this study has gone viral as the herald that neither volume nor intensity is relevant for hypertrophy. Supposedly, all that matters is working to the point of fatigue.
This is what you get when people with no background in statistics, who have not familiarized themselves with the literature and (I can only imagine) do not actually lift themselves, start giving people advice on bodybuilding. This study evidently had far too little statistical power to make such bold claims. How do I know? First, the study found no correlation between phosphorylation of any signaling protein and hypertrophy, yet p70S6 kinase and mTOR have been known to regulate protein synthesis, cell growth and cell size since the ‘90s (2,3). Knowing this, the finding from this study that phosphorylation of p70S6 kinase was only increased in the 80% groups and not in the 30% group suggests that the higher intensity did in fact result in more muscle growth.
But that’s not what they found. There was no significant difference in the amount of hypertrophy between the groups. Yet the percentage increases in muscle volume differed by more than a factor of 2 between groups (e.g. 3.2% for 80% x 1 compared to 7.2% for 80% x 3)! If a study can’t detect a two-fold difference in hypertrophy, it’s safe to say it lacked sufficient power for its research question. Similarly, the lack of a significant difference between the 1 and 3 set protocols indicates the study was underpowered. Previous meta-analytic research has shown that multiple sets do in fact result in greater hypertrophy than a single set (4).
All of these discrepancies with findings in the literature can be understood in terms of statistical power. Meta-analyses have more statistical power than single studies because of the larger sample size. Signaling protein activity is more deterministic and therefore easier to detect than increases in muscle volume. For this study in particular, several other factors also influenced its statistical power, such as the study duration. The longer the duration, the more pronounced the differences can be expected to become, increasing the design’s statistical power. More importantly, the design itself was far from optimal. The researchers put each participant in 2 groups, one for each leg, presumably to compensate for the mediocre sample size of 18. Since unilateral training induces a cross-training effect in the other limb (5), this introduces noise into the study by making it impossible to distinguish which effects are due to cross-training and which effects are due to the leg’s own training. Noise decreases a study’s statistical power. Finally, the use of magnetic resonance imaging to measure muscle volume is prone to error and, depending on initial muscle size, “a measured volume change of at least 6-17% is required to demonstrate a significant difference” (6). The volume changes in this study did not reach that threshold, or only barely did.
Altogether, this study was considerably underpowered to answer its research questions and therefore its findings must be interpreted with caution, if at all, especially in cases where they deviate from more robust findings in the literature.
Statistical power, or sensitivity, is the ability to detect the effect you’re looking for, if it actually exists; it increases with sample size and effect size. To properly interpret the findings from a study, it is necessary to keep in mind whether the study had sufficient power to answer its research questions, or yours. Lack of statistical power can either empower or destroy a study’s meaningfulness and this is not always intuitive. Any literate broscientist can read a study, but only a statistically literate scientist can interpret one correctly. After reading this article, you now belong to the latter category. As an added bonus, you may now pester Bret with questions about the statistical power of the studies in his research review!
1 Resistance exercise load does not determine training-mediated hypertrophic gains in young men. Mitchell CJ, Churchward-Venne TA, West DD, Burd NA, Breen L, Baker SK, Phillips SM. J Appl Physiol. 2012 Apr 19.
2 p70 S6 kinase and actin dynamics: A perspective. Ip CK, Wong AS. Spermatogenesis. 2012 Jan 1;2(1):44-52.
3 Invited review: intracellular signaling in contracting skeletal muscle. Sakamoto K, Goodyear LJ. J Appl Physiol. 2002 Jul;93(1):369-83.
4 Single vs. multiple sets of resistance exercise for muscle hypertrophy: a meta-analysis. Krieger JW. J Strength Cond Res. 2010 Apr;24(4):1150-9.
5 Cross education: possible mechanisms for the contralateral effects of unilateral resistance training. Lee M, Carroll TJ. Sports Med. 2007;37(1):1-14.
6 Radiologic measurement of extraocular muscle volumes in patients with Graves’ orbitopathy: a review and guideline. Bijlsma WR, Mourits MP. Orbit. 2006 Jun;25(2):83-91.
For More Information…
If this stuff interests you, here are some links you can check out:
Effect Size according to Wikipedia
It’s the Effect Size Stupid by Dr. Robert Coe
Effect Size FAQ’s (this is a great website)
Statistical Power according to Wikipedia
SportsSci.Org (this is Will Hopkins’ website which has helped thousands of Sports Science students over the years)