:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions
.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

**Source code:** :source:`Lib/statistics.py`

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.
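
As a small illustration of the note above, the following sketch checks that
shuffling the input does not change the result. The sample values and the
fixed random seed are arbitrary choices made for this example only.

```python
import random
from statistics import median, median_low, median_high

# Arbitrary sample values chosen for this illustration.
ordered = [1, 3, 5, 7, 9, 11]

# Shuffle a copy with a fixed seed so the run is repeatable.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

# Each function gives the same result for the sorted and shuffled copies.
for func in (median, median_low, median_high):
    assert func(shuffled) == func(ordered)

print(median(shuffled))  # 6.0
```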

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.
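
The sensitivity of the mean to outliers can be seen with a short sketch. The
readings below are invented for illustration; only ``mean`` and ``median``
come from this module.

```python
from statistics import mean, median

# Hypothetical readings, invented for this illustration.
readings = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = readings + [1000.0]  # one extreme value added

print(mean(readings), median(readings))            # 12.0 12.0
print(median(with_outlier))                        # 12.5
print(mean(with_outlier) - mean(readings) > 100)   # True
```

A single extreme value moves the mean by more than 100 units here, while the
median shifts by only 0.5.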


.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data. When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suitable when your data are discrete and you don't mind that the
   median may not be an actual data point.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`


.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.
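
To compare the three median functions side by side, the sketch below runs
them on the same even-length data set used in the examples above.

```python
from statistics import median, median_low, median_high

data = [1, 3, 5, 7]  # even number of points; the middle values are 3 and 5

print(median(data))       # 4.0, the average of 3 and 5
print(median_low(data))   # 3
print(median_high(data))  # 5

# The low and high medians are always members of the data set,
# while the interpolated median need not be.
assert median_low(data) in data and median_high(data) in data
assert median(data) not in data
```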


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5-1.5, 2
   is the midpoint of 1.5-2.5, 3 is the midpoint of 2.5-3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5-4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <http://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://projects.gnome.org/gnumeric/doc/gnumeric-function-SSMEDIAN.shtml>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
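
The interpolation described for :func:`median_grouped` can be sketched with the
usual textbook formula for the median of grouped data. The helper
``grouped_median_sketch`` below is hypothetical and written only for this
illustration; it is not presented as the module's implementation, although it
agrees with the documented examples.

```python
import math
from statistics import median_grouped

def grouped_median_sketch(data, interval=1):
    """Hypothetical helper: interpolate the median within its class."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("no data")
    x = values[n // 2]         # midpoint of the class holding the median
    lower = x - interval / 2   # lower limit of that class
    below = values.index(x)    # number of points in the lower classes
    freq = values.count(x)     # number of points in the median class
    return lower + interval * (n / 2 - below) / freq

sample = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(grouped_median_sketch(sample))
assert math.isclose(grouped_median_sketch(sample), median_grouped(sample))
```

For ``sample``, the median class is 3.5-4.5, four points lie below it and five
lie inside it, so the estimate is 3.5 + (5 - 4) / 5.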


.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module that also
   applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'
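
One way to check beforehand whether data have exactly one most common value is
to count occurrences with :class:`collections.Counter`. The helper
``has_unique_mode`` below is a hypothetical name used only for this sketch.

```python
from collections import Counter
from statistics import StatisticsError, mode

def has_unique_mode(data):
    """Hypothetical helper: True if exactly one value has the top count."""
    counts = Counter(data).most_common(2)
    if not counts:
        return False
    return len(counts) == 1 or counts[0][1] > counts[1][1]

colours = ["red", "blue", "blue", "red", "green", "red", "red"]
if has_unique_mode(colours):
    print(mode(colours))  # red

# Empty input raises StatisticsError.
try:
    mode([])
except StatisticsError as exc:
    print("no mode:", exc)
```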


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.
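
The definition of population variance as the mean squared deviation from *mu*
can be written out directly. The helper ``pvariance_sketch`` below is
hypothetical and intended only to make the formula concrete; the module's own
function may compute the result differently.

```python
import math
from statistics import mean, pstdev, pvariance

def pvariance_sketch(data, mu=None):
    """Hypothetical helper: mean squared deviation of data from mu."""
    values = list(data)
    if mu is None:
        mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(pvariance_sketch(data))  # 1.25

# The sketch agrees with pvariance, and pstdev is its square root.
assert math.isclose(pvariance_sketch(data), pvariance(data))
assert math.isclose(pstdev(data), math.sqrt(pvariance(data)))
```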


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.
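
Bessel's correction amounts to dividing by N-1 instead of N, so the sample and
population variances of the same data differ by a factor of N/(N-1). The sketch
below checks that relationship numerically on the data from the examples above.

```python
import math
from statistics import pvariance, stdev, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)

# Sample variance equals the population variance rescaled by n / (n - 1).
assert math.isclose(variance(data), pvariance(data) * n / (n - 1))

# The sample standard deviation is the square root of the sample variance.
assert math.isclose(stdev(data), math.sqrt(variance(data)))

print(variance(data) > pvariance(data))  # True
```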

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
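
A minimal sketch of handling the exception follows. The ``safe_mean`` helper
and its *default* parameter are hypothetical names introduced for this example.

```python
from statistics import StatisticsError, mean, variance

def safe_mean(data, default=None):
    """Hypothetical helper: return default instead of raising on empty data."""
    try:
        return mean(data)
    except StatisticsError:
        return default

print(safe_mean([]))               # None
print(safe_mean([1.0, 2.0, 3.0]))  # 2.0

# Because StatisticsError subclasses ValueError, a broader handler also works.
try:
    variance([1.0])  # fewer than two values
except ValueError as exc:
    print(type(exc).__name__)  # StatisticsError
```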

..
   # These modelines must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;