Image courtesy of Stanford School of Medicine
I'm sitting in Six Sigma Black Belt training this week, learning all about two-sample t-tests, ANOVA, and other statistical analysis techniques. One thing I noticed is that these techniques are all built on sampling. Basically, you collect data from a sample, not from the whole population. An example from a hospital would be randomly picking 10 patients from a census of 100 and looking at the infection rate among them.
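To make that hospital example concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the census of 100 patients, the 8 flagged infections, and the helper names `infection_rate` and `draw_sample` are my own assumptions, not real data or anyone's official method.

```python
import random


def infection_rate(patients):
    """Fraction of the given patients flagged as infected."""
    return sum(p["infected"] for p in patients) / len(patients)


def draw_sample(census, k, seed=None):
    """Simple random sample of k patients, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(census, k)


# Hypothetical census: 100 patients, 8 of them flagged as infected.
census = [{"id": i, "infected": i < 8} for i in range(100)]

population_rate = infection_rate(census)   # uses all 100 patients
sample = draw_sample(census, 10, seed=1)   # uses only 10 of them
sample_rate = infection_rate(sample)

print(f"population rate: {population_rate:.2f}")
print(f"sample rate (n=10): {sample_rate:.2f}")
```

Note that with only 10 patients, the sample rate can only land on multiples of 0.10, so it can never exactly match a population rate of 0.08. That gap is the price of sampling, and it's what tools like the t-test are meant to account for.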
Obviously, data from a sample is not as complete as data from the whole population, but often it's complete enough to support statistically reliable conclusions. The benefit of sampling, of course, is that we don't have to go through the time and expense of collecting data for the entire population. However, thanks to the powerful database software available to us in healthcare and pretty much any industry nowadays, we can easily pull all the data for all the patients in our system, at virtually no marginal cost. This raises the question: why bother with sampling if we already have the population data?
I guess we wouldn't, unless there were some added value in sampling beyond the data that we gather. If we're sampling by just pulling data out of a database, then there's probably not much value beyond the data. But if we're sampling by directly observing a process, then there's a lot of additional value: we see the process with our own eyes, we get direct feedback from those involved with the process, we often get to hear the voice of the customer (the patient) firsthand, and we get the opportunity to collect data that we would never have known was relevant just by looking at a database.
So, basically, it's not a question of "to sample or not to sample" but "to go & see or to not go & see."