:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions
.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

**Source code:** :source:`Lib/statistics.py`

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.
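
   The sensitivity of the mean to outliers can be seen in a small comparison
   with :func:`median`. This is only a sketch: the ``readings`` list is
   made-up illustrative data.

   .. code-block:: python

      from statistics import mean, median

      # Hypothetical data: five typical readings and one extreme outlier.
      readings = [10, 11, 12, 13, 14, 1000]

      # The single outlier drags the mean far above the bulk of the data.
      average = mean(readings)    # 1060 / 6, roughly 176.67

      # The median stays between the two middle values, 12 and 13.
      middle = median(readings)   # 12.5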


.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data. When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited to discrete data when you don't mind that the median may not
   be an actual data point.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`


.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.
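
   The relationship between :func:`median`, :func:`median_low` and
   :func:`median_high` can be sketched by sorting the data and picking out the
   two middle values. The helper ``middle_values`` below is hypothetical and
   written only for illustration; it is not part of the module and is not
   claimed to match its implementation.

   .. code-block:: python

      from statistics import median, median_low, median_high

      def middle_values(data):
          """Return the (low, high) middle values of *data* after sorting.

          Hypothetical helper for illustration. For an odd number of
          points, both values are the same middle element.
          """
          ordered = sorted(data)
          n = len(ordered)
          if n == 0:
              raise ValueError("no data points")
          return ordered[(n - 1) // 2], ordered[n // 2]

      points = [7, 1, 5, 3]
      low, high = middle_values(points)   # (3, 5)

      # The low and high medians are these two values; the plain median
      # is their average when the number of points is even.
      agrees = (
          low == median_low(points)
          and high == median_high(points)
          and (low + high) / 2 == median(points)
      )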


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5-1.5, 2
   is the midpoint of 1.5-2.5, 3 is the midpoint of 2.5-3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5-4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.
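
   The interpolation can be sketched with the usual grouped-median formula:
   take the lower limit of the class containing the middle value, then move
   into that class in proportion to how many points are still needed to reach
   the halfway mark. The function ``grouped_median_sketch`` below is a
   hypothetical illustration under that assumption. It reproduces the results
   shown above but is not claimed to be the module's exact code.

   .. code-block:: python

      from statistics import median_grouped

      def grouped_median_sketch(data, interval=1):
          """Estimate the median of grouped data by linear interpolation.

          Hypothetical helper for illustration only.
          """
          ordered = sorted(data)
          n = len(ordered)
          if n == 0:
              raise ValueError("no data points")
          x = ordered[n // 2]         # midpoint of the class holding the median
          lower = x - interval / 2    # lower limit of that class
          below = ordered.index(x)    # number of points in the lower classes
          freq = ordered.count(x)     # number of points in the median class
          return lower + interval * (n / 2 - below) / freq

      # Class 3.5-4.5 holds the median: 3.5 + 1 * (5 - 4) / 5
      estimate = grouped_median_sketch([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])   # 3.7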

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <http://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://projects.gnome.org/gnumeric/doc/gnumeric-function-SSMEDIAN.shtml>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'
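
   To see whether data has exactly one most common value before relying on
   ``mode``, the candidates can be listed with :class:`collections.Counter`.
   The helper ``all_modes`` below is hypothetical and not part of this module;
   it is a sketch of one way to inspect ties.

   .. code-block:: python

      from collections import Counter

      def all_modes(data):
          """Return every value that shares the highest count.

          Hypothetical helper for illustration only.
          """
          counts = Counter(data)
          if not counts:
              return []
          highest = max(counts.values())
          return [value for value, count in counts.items() if count == highest]

      unique = all_modes([1, 1, 2, 3, 3, 3, 3, 4])   # one candidate: 3
      tied = all_modes([1, 1, 2, 2, 3])              # two candidates: 1 and 2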


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.
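
   As a rough illustration of the definition, the population variance is the
   mean of the squared deviations from *mu*. The function
   ``population_variance_sketch`` below is a hypothetical, naive version
   written for clarity. It is not the module's implementation and makes no
   attempt at the same numerical care.

   .. code-block:: python

      from fractions import Fraction as F
      from statistics import mean

      def population_variance_sketch(data, mu=None):
          """Average squared deviation from *mu* (the mean by default).

          Hypothetical helper for illustration only.
          """
          values = list(data)
          if not values:
              raise ValueError("no data points")
          if mu is None:
              mu = mean(values)
          return sum((x - mu) ** 2 for x in values) / len(values)

      data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      result = population_variance_sketch(data)                            # 1.25
      exact = population_variance_sketch([F(1, 4), F(5, 4), F(1, 2)])      # 13/72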


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827
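
   The relationship to :func:`variance` can be checked directly. The sketch
   below compares the two with a tolerance rather than exact equality, on the
   assumption that the last few bits of a float result may differ between the
   two routes.

   .. code-block:: python

      import math
      from statistics import stdev, variance

      data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

      # The standard deviation is the square root of the sample variance.
      same = math.isclose(stdev(data), math.sqrt(variance(data)))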


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.
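
   Bessel's correction amounts to dividing the sum of squared deviations by
   N-1 instead of N, so for the same data the sample variance equals the
   population variance scaled by N/(N-1). The sketch below checks this with
   :class:`~fractions.Fraction` values so that the arithmetic is exact; the
   names ``sample`` and ``rescaled`` are illustrative only.

   .. code-block:: python

      from fractions import Fraction as F
      from statistics import pvariance, variance

      sample = [F(1, 6), F(1, 2), F(5, 3)]
      n = len(sample)

      # pvariance divides by n; rescaling by n / (n - 1) divides by n - 1.
      rescaled = pvariance(sample) * F(n, n - 1)   # Fraction(67, 108)

      matches = rescaled == variance(sample)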

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
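
   A caller can catch the exception to supply a fallback for empty data. The
   helper ``mean_or_default`` below is hypothetical and shown only as a sketch
   of the pattern.

   .. code-block:: python

      from statistics import StatisticsError, mean

      def mean_or_default(data, default=None):
          """Return mean(data), or *default* when *data* is empty.

          Hypothetical helper for illustration only.
          """
          try:
              return mean(data)
          except StatisticsError:
              return default

      fallback = mean_or_default([])          # None
      typical = mean_or_default([1, 2, 3])    # equal to 2

   Because :exc:`StatisticsError` is a subclass of :exc:`ValueError`, an
   ``except ValueError`` clause catches it as well.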

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;