:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions
.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

**Source code:** :source:`Lib/statistics.py`

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

.. note::

   Unless explicitly noted otherwise, these functions support :class:`int`,
   :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
   Behaviour with other types (whether in the numeric tower or not) is
   currently unsupported. Mixed types are also undefined and
   implementation-dependent. If your input data consists of mixed types,
   you may be able to use :func:`map` to ensure a consistent result, e.g.
   ``map(float, input_data)``.

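The conversion suggested in the note can be sketched as follows. This is only
an illustration: the ``readings`` list is made-up data, and converting to
:class:`float` is just one possible choice of common type.

.. code-block:: python

   from fractions import Fraction
   from statistics import mean

   # Hypothetical input mixing int, float and Fraction values.
   readings = [1, 2.5, Fraction(7, 2)]

   # Coerce every value to float so that the result type is predictable.
   as_floats = list(map(float, readings))
   average = mean(as_floats)

   print(as_floats)
   print(type(average).__name__)
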
Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


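   The sensitivity to outliers mentioned in the note can be illustrated with a
   short sketch. The ``incomes`` values are invented for the example:

   .. code-block:: python

      from statistics import mean, median

      # Hypothetical data: five similar values and one extreme outlier.
      incomes = [30, 32, 35, 36, 38, 1000]

      print(mean(incomes))    # pulled far above most of the data points
      print(median(incomes))  # 35.5, stays close to the bulk of the data
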
.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data. When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suitable when your data is discrete and you don't mind that the
   median may not be an actual data point.

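   The "mean of middle two" rule described above can be sketched as follows.
   The helper name ``median_sketch`` is hypothetical, and this is not
   necessarily how the module itself implements :func:`median`:

   .. code-block:: python

      def median_sketch(data):
          """Return the median using the "mean of middle two" rule."""
          values = sorted(data)
          n = len(values)
          if n == 0:
              raise ValueError("no median for empty data")
          middle = n // 2
          if n % 2 == 1:
              # Odd number of points: the middle one is the median.
              return values[middle]
          # Even number of points: average the two middle values.
          return (values[middle - 1] + values[middle]) / 2

      print(median_sketch([5, 1, 3]))     # 3
      print(median_sketch([7, 1, 5, 3]))  # 4.0
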
   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`


.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


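   A short sketch comparing the three median variants on the same even-sized
   data set. The ``rolls`` list is made-up example data:

   .. code-block:: python

      from statistics import median, median_high, median_low

      rolls = [2, 3, 4, 6]  # hypothetical discrete data

      print(median_low(rolls))   # 3, a member of the data
      print(median_high(rolls))  # 4, a member of the data
      print(median(rolls))       # 3.5, interpolated and not in the data
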
.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5-1.5, 2
   is the midpoint of 1.5-2.5, 3 is the midpoint of 2.5-3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5-4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

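   The interpolation in the example above can be reproduced by hand with the
   usual grouped-median formula ``L + interval * (n/2 - cf) / f``, where ``L``
   is the lower limit of the median class, ``cf`` the number of values below
   that class and ``f`` the number of values in it. The sketch below is an
   illustration under that assumption; the helper name
   ``grouped_median_sketch`` is hypothetical and the module's own code may
   differ in detail:

   .. code-block:: python

      def grouped_median_sketch(data, interval=1):
          """Estimate the median of grouped data by linear interpolation."""
          values = sorted(data)
          n = len(values)
          if n == 0:
              raise ValueError("no median for empty data")
          x = values[n // 2]              # midpoint of the median class
          lower = x - interval / 2        # lower limit of that class
          below = sum(1 for v in values if v < x)  # values below the class
          count = values.count(x)         # values inside the class
          return lower + interval * (n / 2 - below) / count

      # Median class is 3.5-4.5: four values lie below it and five inside it,
      # so the estimate is 3.5 + 1 * (5 - 4) / 5.
      print(grouped_median_sketch([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
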
   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval will naturally change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <http://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'


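   The "exactly one most common value" rule described above can be sketched
   with :class:`collections.Counter`. The helper ``unique_mode_sketch`` is
   hypothetical, and it raises a plain :exc:`ValueError` rather than
   :exc:`StatisticsError` to keep the example self-contained:

   .. code-block:: python

      from collections import Counter

      def unique_mode_sketch(data):
          """Return the single most common value, or raise if there is none."""
          counts = Counter(data)
          if not counts:
              raise ValueError("no mode for empty data")
          ranked = counts.most_common()
          top_value, top_count = ranked[0]
          # Two values sharing the highest count means no unique mode.
          if len(ranked) > 1 and ranked[1][1] == top_count:
              raise ValueError("no unique mode")
          return top_value

      print(unique_mode_sketch([1, 1, 2, 3, 3, 3, 3, 4]))    # 3
      print(unique_mode_sketch(["red", "blue", "red"]))      # red
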
.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


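   The square-root relationship stated above can be checked with a small
   sketch. :func:`math.isclose` is used because floating-point results may
   differ in the last digit:

   .. code-block:: python

      import math
      from statistics import pstdev, pvariance

      data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

      # pstdev(data) should match the square root of pvariance(data).
      print(math.isclose(pstdev(data), math.sqrt(pvariance(data))))
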
.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.


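   As a worked illustration, the population variance is the mean of the
   squared deviations from the mean. The helper ``pvariance_sketch`` below is
   hypothetical and follows the textbook definition directly; it is not
   claimed to match the module's implementation or its numerical precautions:

   .. code-block:: python

      from fractions import Fraction

      def pvariance_sketch(data, mu=None):
          """Mean of squared deviations from *mu* (textbook definition)."""
          values = list(data)
          n = len(values)
          if n == 0:
              raise ValueError("at least one data point is required")
          if mu is None:
              mu = sum(values) / n
          return sum((x - mu) ** 2 for x in values) / n

      # Mean is 2/3; squared deviations sum to 78/144; divided by 3 gives 13/72.
      print(pvariance_sketch([Fraction(1, 4), Fraction(5, 4), Fraction(1, 2)]))
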
.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ, you should pass it to
      the :func:`pvariance` function as the *mu* parameter to get the variance
      of a sample.

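   Bessel's correction amounts to dividing the sum of squared deviations by
   ``n - 1`` instead of ``n``. The sketch below checks that relationship using
   exact :class:`fractions.Fraction` arithmetic; the ``sample`` values are the
   ones from the example above:

   .. code-block:: python

      from fractions import Fraction
      from statistics import pvariance, variance

      sample = [Fraction(1, 6), Fraction(1, 2), Fraction(5, 3)]
      n = len(sample)

      biased = pvariance(sample)   # sum of squared deviations divided by n
      unbiased = variance(sample)  # same sum divided by n - 1

      # Rescaling the biased value by n / (n - 1) gives the sample variance.
      print(unbiased == biased * n / (n - 1))
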
Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.

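   Because it subclasses :exc:`ValueError`, the exception can be caught under
   either name. A minimal sketch, using the empty-data case documented for
   :func:`mean`:

   .. code-block:: python

      from statistics import StatisticsError, mean

      caught = None
      try:
          mean([])  # empty data is documented to raise StatisticsError
      except StatisticsError as exc:
          caught = exc

      # The same error would also be caught by "except ValueError".
      print(isinstance(caught, ValueError))
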
..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;