Today’s article is a guest blogpost by Menno Henselmans. Menno is a fellow TNation writer, a fellow bodybuilding aficionado, and a fellow scientific thinker. His blog can be found HERE and you can learn more about him HERE. I like guys like Menno because it’s not easy trying to teach an unscientific world how to think scientifically. In fact, it’s exhausting, so I admire anyone who chooses this route. The article Menno has written is probably one of the most “unsexy” articles you’ll read this week, but it’s a very important topic and one that all scientific thinkers should understand. The statistics side of research is extremely intimidating, but it can’t be ignored.

**Powerful Stats**

by Menno Henselmans

No, I’m not talking about a 400 lb bench press. I’m talking about *statistical power*. Statistical power is a critically important concept for anyone who wants to interpret scientific research, because statistical tests form the basis of all scientific results. If you have no knowledge of statistics, you can’t properly interpret research. According to many great scientists, statistical illiteracy is the 21st century’s equivalent of the inability to read and write.

Before I continue, here’s a little test for you: suppose a brand new study was just published in *The Journal of Strength and Conditioning Research*. The researchers compared two resistance training groups that were identical except that one used dumbbells and the other used kettlebells. Unfortunately, the researchers had little funding, so each group had only 5 people in it. At the end of the study, the kettlebell group had gained significantly more muscle mass than the dumbbell group. Does this support that kettlebells are more effective than dumbbells for bodybuilding purposes? I’ll give you the answer after explaining what statistical power is.

**What is statistical power?**

Formally defined, the power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false. It’s identical to the probability of not making a type II error (failing to reject a *false* null hypothesis). For those unfamiliar with the statistical jargon, here’s a more intuitive explanation: statistical power is the ability to detect an effect when there is one. It is the *sensitivity* of a test (sensitivity in this case is both an intuitive term and the proper statistical terminology).

Here’s an example in lay terms. You’re conducting a study on the efficacy of supplement X. There are 2 aspects to whether supplement X is effective. First, there’s the physical reality of its effectiveness, unknown to you, and secondly, there’s the outcome of your statistical test. Let’s suppose that supplement X is the holy grail of muscle building supps and transforms you into the incredible hulk in a matter of weeks. If that’s also what your test shows, everyone’s happy. However, if supplement X actually works, but your test says it doesn’t, you’ve failed to detect an effect even though there actually was one. That’s a type II error and it’s caused by a lack of statistical power. A more sensitive test would have found an effect. (Lack of) statistical power is an important reason why different studies can show different things.

What determines the statistical power of a test? A full power analysis takes into account many things, including the design of the experiment, but there are 3 main factors.

First, there’s the statistical significance criterion, alpha, but you can basically forget about this, because the convention of using a significance level of 0.05 has become so deeply ingrained in the scientific literature it may as well have been enforced by law.

Secondly, statistical power is positively determined by sample size. The larger the sample size of a study, the easier it is to detect effects. In our example, if the group receiving supplement X and the placebo group both consisted of thousands of people, it’s most likely that any difference between the groups at the end of the study was caused by supplement X, because the large sample size cancels out all random effects and ensures that the observed values are very close to the actual population means (provided the sampling was random and the study was well controlled). A sample size of only a few participants has far less statistical power, because the effect of supplement X is obscured by random variation or noise. In such a small sample, just one individual with, say, a heart condition that makes him or her unsusceptible to the effects of supplement X may be enough to fool the test into thinking supplement X doesn’t work.

Thirdly, there’s the *effect size*, which is, as the name suggests, the magnitude of the effect you’re looking for. In our example, the effect size of supplement X is extremely high, because I said it turns you into the incredible hulk. In that case, you only really need a sample size of 1 or 2 to know supplement X is the bomb. If supplement X were only mildly effective, you’d need a far larger sample to detect its effects.
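The interplay between sample size and effect size can be made concrete with a quick simulation. Below is a minimal sketch in Python (the function name and all numbers are my own, purely illustrative choices): it repeatedly simulates a two-group supplement study and counts how often a t-test detects the true effect.

```python
import numpy as np
from scipy import stats

def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a two-sample t-test by simulation.

    effect_size is Cohen's d: the true group difference in SD units.
    Returns the fraction of simulated studies that reach p < alpha.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)          # placebo group
        treated = rng.normal(effect_size, 1.0, n_per_group)  # supplement X group
        _, p = stats.ttest_ind(treated, control)
        if p < alpha:
            hits += 1
    return hits / n_sims

# A modest effect with only 5 subjects per group...
print(simulated_power(5, 0.5))
# ...a huge "incredible hulk" effect with the same tiny sample...
print(simulated_power(5, 3.0))
# ...and the modest effect again, but with 100 per group.
print(simulated_power(100, 0.5))
```

With only 5 people per group, the modest effect slips through the net most of the time, while the huge effect is caught on almost every run; with 100 per group, even the modest effect is detected reliably.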

**Applications**

Back to the question I asked you at the beginning. Do the results of that study support that kettlebells are more effective than dumbbells for bodybuilding purposes? Many people are inclined to say no, because the study had a small sample size. However, if the sample size was small, this lowered the study’s statistical power. Yet an effect was found. This suggests the study got its statistical power from another factor. Assuming there were no other relevant factors, this means the effect size (or the difference between the effect sizes of dumbbells and kettlebells) must have been *large*. Therefore, this study would actually provide very strong support for the use of kettlebells. It found a significant difference *despite* its small sample size.

Here’s another example. Mitchell et al. recently published a study titled “Resistance exercise load does not determine training-mediated hypertrophic gains in young men” (1). They compared 3 resistance training groups: one group performed 3 sets with 30% of 1 RM, one group performed 1 set with 80% of 1 RM and one group performed 3 sets with 80% of 1 RM. At the end of the study, there were no significant differences in muscle mass or isometric strength increases between the groups. Isotonic strength increases and anabolic cell signaling were greater in the 80% groups than in the 30% group, but they did not differ between the 1 and 3 set protocols. Popular media like *ScienceDaily* stated, “Lifting less weight more times is just as effective at building muscle as training with heavy weights, a finding by McMaster researchers that turns conventional wisdom on its head” and this study has gone viral as the herald that neither volume nor intensity is relevant for hypertrophy. Supposedly, all that matters is working to the point of fatigue.

This is what you get when people with no background in statistics, who have not familiarized themselves with the literature and (I can only imagine) do not actually lift themselves start giving people advice on bodybuilding. This study evidently had far too little statistical power to make such bold claims. How do I know? First, the study found no correlation between phosphorylation of any signaling protein and hypertrophy, yet p70 S6 kinase and mTOR have been known to regulate protein synthesis, cell growth and cell size since the ’90s (2,3). Knowing this, the finding from this study that phosphorylation of p70 S6 kinase was only increased in the 80% groups and not in the 30% group suggests that the higher intensity did in fact result in more muscle growth.

But that’s not what they found. There was no significant difference in the amount of hypertrophy between the groups. Yet the percentage increases in muscle volume differed by more than a factor of 2 between groups (e.g. 3.2% for 80% x 1 compared to 7.2% for 80% x 3)! If a study can’t detect a two-fold increase in hypertrophy, it’s safe to say it lacked sufficient power for its research question. Similarly, the lack of a significant difference between the 1 and 3 set protocols indicates the study was underpowered. Previous meta-analytic research has shown that multiple sets do in fact result in greater hypertrophy than a single set (4).

All of these discrepancies with findings in the literature can be understood in terms of statistical power. Meta-analyses have more statistical power than single studies because of the larger sample size. Signaling protein activity is more deterministic and therefore easier to detect than increases in muscle volume. For this study in particular, several other factors also influenced its statistical power, such as the study duration. The longer the duration, the more pronounced the differences can be expected to become, increasing the design’s statistical power. More importantly, the design itself was far from optimal. The researchers put each participant in *2* groups: one for each leg, presumably to compensate for the mediocre sample size of 18. Since unilateral training induces a cross-training effect in the other limb (5), this introduces noise into the study by making it impossible to distinguish which effects are due to cross-training and which effects are due to the leg’s own training. Noise decreases a study’s statistical power. Finally, the use of magnetic resonance imaging to measure muscle volume is prone to error and “a measured volume change of at least 6-17% is required to demonstrate a significant difference” (6), depending on initial muscle size, a threshold that was not, or only barely, reached in this study.
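The point that noise erodes power can also be sketched numerically. In this hypothetical simulation (all SDs, group sizes and effect values are invented purely for illustration), the true effect is identical in both scenarios; only the measurement error differs.

```python
import numpy as np
from scipy import stats

def power_with_noise(measurement_sd, n_per_group=20, true_diff=1.0,
                     biological_sd=1.0, n_sims=2000, seed=1):
    """Power of a two-sample t-test when measurement error is stacked
    on top of true between-subject variation (all values hypothetical)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, biological_sd, n_per_group)
        b = rng.normal(true_diff, biological_sd, n_per_group)
        # measurement error (think imprecise volume estimates) is added on top
        # of the same true effect in both groups
        a += rng.normal(0.0, measurement_sd, n_per_group)
        b += rng.normal(0.0, measurement_sd, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / n_sims

print(power_with_noise(0.0))  # precise measurement
print(power_with_noise(2.0))  # noisy measurement, same true effect
```

The true difference never changes, yet the noisy version finds it far less often; this is exactly what imprecise MRI volume estimates or cross-training contamination do to a design.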

Altogether, this study was considerably underpowered to answer its research questions and therefore its findings must be interpreted with caution, if at all, especially in cases where they deviate from more robust findings in the literature.

**Conclusions**

Statistical power, or sensitivity, is the ability to detect the effect you’re looking for, if it actually exists; it increases with sample size and effect size. To properly interpret the findings of a study, it is necessary to keep in mind whether the study had sufficient power to answer its (or your) research questions. Lack of statistical power can either empower or destroy a study’s meaningfulness, and this is not always intuitive. Any literate broscientist can read a study, but only a statistically literate scientist can interpret one correctly. After reading this article, you now belong to the latter category. As an added bonus, you may now pester Bret with questions about the statistical power of the studies in his research review!

**References**

1 Resistance exercise load does not determine training-mediated hypertrophic gains in young men. Mitchell CJ, Churchward-Venne TA, West DD, Burd NA, Breen L, Baker SK, Phillips SM. J Appl Physiol. 2012 Apr 19.

2 p70 S6 kinase and actin dynamics: A perspective. Ip CK, Wong AS. Spermatogenesis. 2012 Jan 1;2(1):44-52.

3 Invited review: intracellular signaling in contracting skeletal muscle. Sakamoto K, Goodyear LJ. J Appl Physiol. 2002 Jul;93(1):369-83.

4 Single vs. multiple sets of resistance exercise for muscle hypertrophy: a meta-analysis. Krieger JW. J Strength Cond Res. 2010 Apr;24(4):1150-9.

5 Cross education: possible mechanisms for the contralateral effects of unilateral resistance training. Lee M, Carroll TJ. Sports Med. 2007;37(1):1-14.

6 Radiologic measurement of extraocular muscle volumes in patients with Graves’ orbitopathy: a review and guideline. Bijlsma WR, Mourits MP. Orbit. 2006 Jun;25(2):83-91.

**For More Information…**

If this stuff interests you, here are some links you can check out:

Effect Size according to Wikipedia

It’s the Effect Size Stupid by Dr. Robert Coe

Effect Size FAQs (this is a great website)

Statistical Power according to Wikipedia

SportsSci.Org (this is Will Hopkins’ website which has helped thousands of Sports Science students over the years)

Sick article. Would love to see more posts like this!

Well indeed. My kind of article. Three additional things.

1. Yes, meta-analysis has more power, but it has a series of its own problems. A meta-analysis is a vehicle for restating other studies which are selected according to somewhat arbitrary criteria, subjected to statistical processes which boil them down into individual effect sizes from different measures, and finally aggregated into analyses which differ between study areas. What does that mean? Generally, it means a confusing process is used to combine analyses which are somewhat dissimilar. And if all the studies are rubbish going in, the meta-analysis of that rubbish will be even worse.

2. Overpowered studies exist, with a similar, parallel set of problems to the above. An overpowered study usually arises when you have several hundred observations of the variable of interest. This means that you can conclude that there is a significant difference between things which are actually very similar. This is less of a problem in the sports sciences than it is in other applied sciences, but it will happen. People with access to big data sets beware.

3. Trust people who report their actual, calculated effect size in papers. This is a big tick in the “not a muppet” column.

Excellent post, and one of the reasons why I’m pursuing a formal education at all. Nice going Menno!

Bret, it would really be awesome if you and Chris included some sort of analysis of statistical power for each study in your RR. But a s**t-load of extra work!

That immediately came to mind when I read that last sentence from Menno! We review 50 studies every month. Granted some are reviews, columns, etc. but that’s a lot of extra work. Definitely something to think about though.

Wow, this has really opened my eyes! I read a lot of articles when studying for my MSc, but I often find myself taking the ‘word’ of the researchers without delving into the statistics myself; this shall no longer be the case! I’ve since found some researchers using ‘less credible’ post-hoc tests in order to push through a significant finding! Keep these coming!

@ James: I agree with all 3 completely. Good points.

@ Jacob/Bret: A full post hoc power analysis would be a very ambitious project and I don’t think it’s worth the substantial time investment for each study. However, a ‘common sense’ (contradiction in terms?) approach as I used in this article is good to keep in mind, especially for studies that don’t find something that you would expect to see.

If you want to get into this, I recommend reading this: http://research-repository.st-andrews.ac.uk/bitstream/10023/679/5/Thomas-Retrospectivepoweranalysis-postprint.pdf

Holy shit, Menno. I clicked on your blog link and the first word I saw was “Bayesian.” At that point, it took me all of 1.34 seconds to have you added to my blog feed.

Bret, thanks for introducing me to Menno!!

Haha nice. Bayesian thinking is the key to success in everything.

This is terrific! More, more, more please! This is why I love AARR and Bret’s research review services. More example articles pointing out the deficiencies of study execution and design are very welcomed.

Thanks guys!

This just sent me into a cold sweat remembering my stats classes at ECU!!!!!

Great article nevertheless

I will never read another article or study the same way again. This post is disconcerting, as we are now challenged to read more carefully, but at the same time it is comforting to know that there is an explanation for studies that contradict each other.

Some time ago I saw a TED video presentation in which the notion of “number needed to treat” was introduced. The presenter claimed that, after careful evaluation of peer-reviewed quality studies, most of the interventions recommended by conventional wisdom did not stand up. It is scary that so much research is done by people who want to sell us something.

On that note, this is a question for Bret. There is a notion that the “one set to failure” system was devised by Arthur Jones to enable him to sell Nautilus equipment. Do you care to comment?

Hi Menno, interesting article, thanks for writing, glad to see some people interested in statistics. I’ve got a few thoughts based on what you wrote:

1. I wonder if this statement is a little harsh:

“Altogether, this study was considerably underpowered to answer its research questions and therefore its findings must be interpreted with caution, if at all, especially in cases where they deviate from more robust findings in the literature.”

given that not all comparisons in the paper were underpowered. Sure, the multi vs single set data didn’t turn out great and looks underpowered, but the intensity comparison (80%1RMx3 vs 30%1RMx3) is adequately powered given the means (6.8+/-1.8% vs 7.2+/-1.9%). Sometimes there just isn’t a difference. The original paper was written with this in mind and the authors do acknowledge these concepts in the discussion of the paper. Maybe this didn’t transfer into the press release; not sure, I haven’t read them.

2. Also there were two measures of hypertrophy, the MRI data as you mentioned but also myosin ATPase staining for type I and II fibre area which found a similar pattern to the MRI data.

3. Is it possible that the differential effects on p70s6k can be explained by a different temporal relationship between the signalling proteins? Their previous paper (Burd et al, 2010) certainly sets a foundation to suggest the acute response to these exercise intensities could be different. I’ve worked with many signalling proteins, and timing in the post-exercise interval can often explain the different results that exist in the literature.

Anyone have some thoughts?

It’s not specific comparisons, but the statistical testing in general that suffers from being underpowered. That some tests did come up significant suggests those effect sizes were higher or there was less noise. The problem with looking at other results is then that we can’t really interpret them. As such, 7.2 may not seem ‘significantly’ higher than 6.8, but due to the lack of power we really can’t say if these are the true values.

Signalling proteins are indeed hard to interpret. For example, in my article on leucine I mentioned research that showed leucine alters the time course and cellular activity of several anabolic processes, but the amount of muscle built ends up the same.

Nice site by the way.

Hey Menno, checked out your site as well, nice work.

The advantage of power calculations is that when doing them post-hoc, you can break it down into comparisons and determine how sample size relates to the data obtained for whatever you want. I do this frequently with all my variables in my studies as it gives me a little extra confidence in what I’m looking at, and it helps if you plan a future study based on the result of a previous one. Effect sizes and variability often differ across variables, and it’s not reasonable to expect them to be the same, so it is quite possible to have adequate power for one measure and inadequate for another. Same goes for groups when you have multiple treatments.

The problem with small differences is that you can always argue to some degree for a greater number of participants. If we did a new study with that comparison (80%1RMx3 vs 30%1RMx3) today with a between groups design, and using estimates of the error and means from the present data, we’d probably need at least 100-200 subjects per group to get it done. That, combined with the measurement errors you mentioned for the MRIs, would lead me to believe that it wouldn’t be a financially worthwhile study, and that the 6% difference, from a scientific and financial standpoint may be negligible.
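That “at least” estimate can be sanity-checked with the standard normal-approximation sample-size formula, using the means and SDs quoted earlier in the thread (6.8 +/- 1.8% vs 7.2 +/- 1.9%). Treat this as a back-of-the-envelope sketch under those assumptions, not a full power analysis.

```python
import math
from scipy.stats import norm

def n_per_group(mean_diff, sd1, sd2, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sample comparison
    (normal approximation, equal group sizes)."""
    sd_pooled = math.sqrt((sd1**2 + sd2**2) / 2)
    d = mean_diff / sd_pooled                      # Cohen's d
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 for alpha=.05, power=.80
    return math.ceil(2 * (z / d) ** 2)

# Hypertrophy of 7.2 +/- 1.9% vs 6.8 +/- 1.8%, as quoted above:
print(n_per_group(7.2 - 6.8, 1.9, 1.8))
```

This lands at several hundred subjects per group for 80% power, which only reinforces the point: chasing a 0.4 percentage-point difference with that much variability would take a very large, and probably not financially worthwhile, study.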

As for whether they are ‘true’ values or not, this would have more to do with the underlying construct of the test used to obtain the data, and the power in this case would relate to the statistic that compares the values between the groups, it doesn’t determine the actual values themselves.

Now in the gym, maybe the 6% difference between the two rates of hypertrophy would be meaningful, but researchers have always wrestled with what represents a physiological relevant or statistically significant change. Are they always the same thing? Maybe not.

Very good article, Menno.

1. When we do studies, we should always do a power calculation before the study. It is clearly unethical and a complete waste of subjects’ time when the results were big but didn’t reach significance.

2. I think this is the other way around. Look at real outcomes and then explain molecular signaling. And what Dan said. These molecular pathways are just too complicated. It could easily be that the signaling times are different for heavy and light weight. Or there are some other pathways interacting. Only God knows.

3. I am not sure the cross transfer is a big problem here. And the researchers talk about this in the discussion. And since the legs were randomized, I don’t think it is a big problem. And a within-subject design is sometimes better when there can be big differences due to motivation levels and genetics in the exercise field, and it is hard to get people to do 10-12 week exercise studies. But they should have done a power analysis.

4. I am not sure an article on ocular muscles with Graves’ orbitopathy is enough to invalidate the validity of MRI. It is supposed to be the gold standard.

Again, great to see people analyzing the whole study rather than just reading the abstract.

I fully agree with 1 and 2. How problematic point 3 was is of course debatable, but it’s clearly not optimal. I understand the motivation behind it, but in this case I think the researchers overextended. As for 4, it definitely doesn’t invalidate it and in many cases MRI is indeed the best we got, but it’s again imperfect to detect small differences, which is what this study was intended to do.

1. I think most do power calculations, but when performing studies with novel treatments, where you don’t have previous values to guide your calculation, you have to estimate an effect size and variability (often assumed 30% var), which is a guessing game.

3. Cross-education of strength (or any measure) is always a concern in a design like this. The authors didn’t find a correlation of strength changes between limbs (some support for a lack of cross), and the fact that there were differences in strength between the training groups suggests that any transfer between limbs wasn’t enough to kill the difference between the 80 and 30 groups. I would definitely be more concerned about this if there was no difference between 80 & 30.

Also much of the cross-education literature uses a design where one limb trains and the other remains sedentary, which would likely enhance this finding. In this study both limbs trained which may be an important difference.

Given recent literature looking at myokines and systemic exercise responses, I often wonder if there could be a cross-education of hypertrophy from secreted proteins of one muscle inducing hypertrophy in another, inactive muscle. I’m sure someone has looked at the possibility perhaps in the context of unilateral limb injury or immobilization, I’ll have to look it up.

All very true remarks, Dan. I actually don’t know of any studies of cross-hypertrophy either. I think cross-education is a purely neural phenomenon, but I wouldn’t be surprised if it wasn’t. Any cross-hypertrophy would be very small in magnitude though, as the cross-education effect averages 7.8% of strength gains in the other limb, the majority of which is arguably neural in nature, according to the review I cited.

Hey Dan,

Since they are comparing to a standard protocol, we have plenty of studies to know what a reasonable effect size would be. And if you don’t report it, and most don’t (outside clinical trials), then there is no way for the reader to know if this was another wasted study. I feel we should use either effect sizes or confidence intervals rather than p values.

Good points Dan about cross over effect.

Don’t let Phillips read this ha ha. Phillips has been doing studies all along to show there is no effect of systemic hormones on non-exercised muscle. I wrote an article for Alan’s newsletter about this a month back.

And Menno, I don’t think the study about MRI you quoted says anything about the precision of MRI in skeletal muscles. It is measuring the tiny ocular muscles in people with some weird disease.

Hey Dudes, think this thread has seen enough of my thoughts so I’ll just post three quick ones then I’m out. Thanks for the good discussion!

1) For power calcs, you could have picked multiple effects to estimate sample size: 1) training overall (ended up significant), 2) by volume (3 sets vs 1) or 3) by intensity (80% vs 30%). Previous literature could have been considered for 1 & 2, but I’m not confident the literature could have guided the intensity one. They would have had to estimate that one, and of course you could check their power by determining what you think would be an appropriate difference, running a power calc, then reconciling this with the N in their paper.

2) Very familiar with the Phillips lab, I’m a student at Mac and my advisor collaborates with him often. Love the questions they’re raising. When looking at systemic factors, I’m more interested in proteins that aren’t the usual hormones that we go after. I’m currently trying to convince my boss to let me complete proteome analysis (expensive) on serum immediately post-exercise and reconcile that against the acute transcriptome response of the muscle. I think it might give us some interesting info on other potential proteins that could play a systemic role.

3) MRI of the large muscles of the leg would be slightly different, and there are a few papers that demonstrate the error on fixed volumes and reproducibility for repeated scans which would be very important in this repeated design. The error rates are slightly lower but not so much so that it would change Menno’s argument in my opinion.

Hi Dan

1. This is more of an equivalence study where they are trying to show there may not be much difference between 30% and 80%. I don’t even know why they squeezed in the 1 set vs 3 set question. So the sample size becomes more important because most differences in the exercise field don’t show up as significant due to low sample sizes.

2. I know you are from McMaster. It would be interesting from a mechanistic perspective. I am not sure it would matter much since at the end of the day they all should show up as muscle growth, which we don’t see much in the first place.

3. I don’t know and haven’t read those. What I am saying is that a review on ocular muscles is not a good study to question the MRI results of this study.

Great discussion and an important article. So great job Menno!

The blog post on Menno’s site doesn’t link here; I had to search for this. FYI

[…] they also did not find any effect of 0.9 vs. 1.2 g/kg/d of protein, which suggests their study was statistically underpowered to research this […]

[…] if we assume that the lack of statistical significance was caused by insufficient statistical power, the study was still biased. The group receiving yohimbine was 8.2 kg (18 lb) heavier than the […]

[…] As you can see in the graph below, there was still a trend for lower performance in The AM + caffeine group compared to the PM group, even though the PM group didn’t consume caffeine, but this difference was not statistically significant. This may well have been due to the small sample (N = 12) and resulting insufficient statistical power to detect the performance decrement. […]

[…] team of Massey et al. They found that training the bench press with a greater ROM did not result in statistically significantly more strength gains in men. When they replicated the study in women, the results became […]

[…] 1.6 g/kg but research has missed this. For this to occur, every relevant study must have had a type II error. Statistical power of any study should generally be 0.8 in order to be published, but let’s […]