:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

.. note::

   Unless explicitly noted otherwise, these functions support :class:`int`,
   :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
   Behaviour with other types (whether in the numeric tower or not) is
   currently unsupported. Mixed types are also undefined and
   implementation-dependent. If your input data consists of mixed types,
   you may be able to use :func:`map` to ensure a consistent result, e.g.
   ``map(float, input_data)``.

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.

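As a rough illustration of the definition above, the arithmetic mean can be
sketched as the sum of the data divided by the number of data points. The
helper name ``naive_mean`` is hypothetical and is not part of the module; the
real :func:`mean` may take more care over precision and mixed types. The last
lines show how a single outlier shifts the result, as the note describes.

```python
from fractions import Fraction
from statistics import StatisticsError, mean


def naive_mean(data):
    """Sum of the data divided by the number of data points (sketch only)."""
    values = list(data)  # accept iterators as well as sequences
    if not values:
        raise StatisticsError("naive_mean requires at least one data point")
    return sum(values) / len(values)


# Agrees with statistics.mean() on these exactly representable inputs.
assert naive_mean([-1.0, 2.5, 3.25, 5.75]) == mean([-1.0, 2.5, 3.25, 5.75])
assert naive_mean([Fraction(1, 2), Fraction(3, 2)]) == Fraction(1)

# A single outlier drags the mean far from the typical value.
typical = [1, 2, 3, 4, 4]
with_outlier = typical + [1000]
print(naive_mean(typical))       # 2.8
print(naive_mean(with_outlier))  # 169.0
```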

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data. When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suitable when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`

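The "mean of middle two" method described above can be sketched in a few
lines. The function name ``sketch_median`` is hypothetical, and this is an
illustration of the technique rather than the module's own source.

```python
from statistics import StatisticsError


def sketch_median(data):
    """Median via the "mean of middle two" method (illustrative only)."""
    values = sorted(data)  # the input does not need to be pre-sorted
    n = len(values)
    if n == 0:
        raise StatisticsError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                      # odd: the middle data point
    return (values[mid - 1] + values[mid]) / 2  # even: average the middle pair


print(sketch_median([1, 3, 5]))     # 3
print(sketch_median([7, 1, 5, 3]))  # 4.0
```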

.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.

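The low and high medians differ only in which of the two middle positions
they pick when the number of data points is even. A minimal sketch, using the
hypothetical names ``sketch_median_low`` and ``sketch_median_high``, might
look like this:

```python
from statistics import StatisticsError


def _sorted_nonempty(data):
    """Sort the data and reject empty input (helper for the sketches below)."""
    values = sorted(data)
    if not values:
        raise StatisticsError("no median for empty data")
    return values


def sketch_median_low(data):
    values = _sorted_nonempty(data)
    # Odd length: the middle index. Even length: the lower of the middle two.
    return values[(len(values) - 1) // 2]


def sketch_median_high(data):
    values = _sorted_nonempty(data)
    # Odd length: the middle index. Even length: the upper of the middle two.
    return values[len(values) // 2]


print(sketch_median_low([1, 3, 5, 7]))   # 3
print(sketch_median_high([1, 3, 5, 7]))  # 5
```

Both results are always members of the data set, so no arithmetic is
performed on the values themselves.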

.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5-1.5, 2
   is the midpoint of 1.5-2.5, 3 is the midpoint of 2.5-3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5-4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <https://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.

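The interpolation used for grouped data can be sketched with the usual
textbook formula ``L + interval * (n/2 - cf) / f``, where ``L`` is the lower
limit of the class containing the median, ``cf`` is the number of values
below that class and ``f`` is the number of values in it. The name
``sketch_median_grouped`` is hypothetical, and the sketch assumes this
formula rather than quoting the module's source; it reproduces the examples
shown above.

```python
from statistics import StatisticsError


def sketch_median_grouped(data, interval=1):
    """Interpolated median of grouped data (illustrative sketch)."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise StatisticsError("no median for empty data")
    x = values[n // 2]                       # midpoint of the median class
    lower = x - interval / 2                 # L: lower limit of that class
    below = sum(1 for v in values if v < x)  # cf: values below the class
    count = values.count(x)                  # f: values inside the class
    return lower + interval * (n / 2 - below) / count


print(sketch_median_grouped([52, 52, 53, 54]))              # 52.5
print(sketch_median_grouped([1, 3, 3, 5, 7], interval=1))   # 3.25
print(sketch_median_grouped([1, 3, 3, 5, 7], interval=2))   # 3.5
```

For ``[1, 2, 2, 3, 4, 4, 4, 4, 4, 5]`` the median class is 3.5-4.5 with four
values below it and five inside it, so the estimate is
``3.5 + 1 * (5 - 4) / 5``, which is 3.7 up to floating-point rounding.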

.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic which also applies
   to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

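The behaviour described above (a single most common value, with an error on
empty input or ties) can be sketched with :class:`collections.Counter`. The
name ``sketch_mode`` is hypothetical, and the sketch follows the text of this
section only; how :func:`mode` itself treats ties may differ between Python
versions.

```python
from collections import Counter
from statistics import StatisticsError


def sketch_mode(data):
    """Return the single most common value, or raise if there is none."""
    top = Counter(data).most_common(2)  # up to two (value, count) pairs
    if not top:
        raise StatisticsError("no mode for empty data")
    if len(top) == 2 and top[0][1] == top[1][1]:
        # The two most frequent values are equally common: no unique mode.
        raise StatisticsError("no unique mode")
    return top[0][0]


print(sketch_mode([1, 1, 2, 3, 3, 3, 3, 4]))                  # 3
print(sketch_mode(["red", "blue", "blue", "red", "red"]))     # red
```

Because the values are only counted, never added or compared by magnitude,
the same code works for nominal data such as strings, provided the values are
hashable.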

.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.

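As a sketch of the definition above, the population variance is the mean of
the squared deviations from *mu*, and the population standard deviation is
its square root. The names ``sketch_pvariance`` and ``sketch_pstdev`` are
hypothetical; this naive float version is for illustration only and makes no
claim about how the module computes its results.

```python
import math
from statistics import StatisticsError


def sketch_pvariance(data, mu=None):
    """Mean squared deviation from mu, dividing by N (illustrative only)."""
    values = list(data)
    n = len(values)
    if n == 0:
        raise StatisticsError("pvariance requires at least one data point")
    if mu is None:
        mu = sum(values) / n  # compute the mean when the caller omits it
    return sum((x - mu) ** 2 for x in values) / n


def sketch_pstdev(data, mu=None):
    """Square root of the population variance."""
    return math.sqrt(sketch_pvariance(data, mu))


data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(sketch_pvariance(data))         # 1.25
print(sketch_pvariance(data, 1.375))  # 1.25, reusing a precomputed mean
```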

.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827

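The relationship stated above, that the sample standard deviation is the
square root of the sample variance, can be sketched as follows. The sample
variance here divides the sum of squared deviations by ``n - 1`` rather than
``n``. The names ``sketch_variance`` and ``sketch_stdev`` are hypothetical,
and the code is a naive float illustration, not the module's implementation.

```python
import math
from statistics import StatisticsError, stdev


def sketch_variance(data, xbar=None):
    """Sum of squared deviations divided by n - 1 (illustrative only)."""
    values = list(data)
    n = len(values)
    if n < 2:
        raise StatisticsError("variance requires at least two data points")
    if xbar is None:
        xbar = sum(values) / n  # compute the mean when the caller omits it
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sketch_stdev(data, xbar=None):
    """Square root of the sample variance."""
    return math.sqrt(sketch_variance(data, xbar))


sample = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
# The sketch agrees with statistics.stdev() to within floating-point rounding.
assert math.isclose(sketch_stdev(sample), stdev(sample))
print(sketch_variance(sample))  # 1.16875
```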

.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;