blob: 04b731dcc44400df3f187e0335d02063f452ea0f [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
29 currently unsupported. Mixed types are also undefined and
30 implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, e.g.
32 ``map(float, input_data)``.
33
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
Raymond Hettingerfc06a192019-03-12 00:43:27 -070040======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070041:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070043:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100044:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`median` Median (middle value) of data.
46:func:`median_low` Low median of data.
47:func:`median_high` High median of data.
48:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070049:func:`mode` Single mode (most common value) of discrete or nominal data.
50:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070051:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070052======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070053
Georg Brandleb2aeec2013-10-21 08:57:26 +020054Measures of spread
55------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070056
Georg Brandleb2aeec2013-10-21 08:57:26 +020057These functions calculate a measure of how much the population or sample
58tends to deviate from the typical or average values.
59
60======================= =============================================
61:func:`pstdev` Population standard deviation of data.
62:func:`pvariance` Population variance of data.
63:func:`stdev` Sample standard deviation of data.
64:func:`variance` Sample variance of data.
65======================= =============================================
66
67
68Function details
69----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070070
Georg Brandle051b552013-11-04 07:30:50 +010071Note: The functions do not require the data given to them to be sorted.
72However, for reading convenience, most of the examples show sorted sequences.
73
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074.. function:: mean(data)
75
Raymond Hettinger6da90782016-11-21 16:31:02 -080076 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070077
Georg Brandleb2aeec2013-10-21 08:57:26 +020078 The arithmetic mean is the sum of the data divided by the number of data
79 points. It is commonly called "the average", although it is only one of many
80 different mathematical averages. It is a measure of the central location of
81 the data.
82
83 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070084
85 Some examples of use:
86
87 .. doctest::
88
89 >>> mean([1, 2, 3, 4, 4])
90 2.8
91 >>> mean([-1.0, 2.5, 3.25, 5.75])
92 2.625
93
94 >>> from fractions import Fraction as F
95 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
96 Fraction(13, 21)
97
98 >>> from decimal import Decimal as D
99 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
100 Decimal('0.5625')
101
102 .. note::
103
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200104 The mean is strongly affected by outliers and is not a robust estimator
Georg Brandleb2aeec2013-10-21 08:57:26 +0200105 for central location: the mean is not necessarily a typical example of the
106 data points. For more robust, although less efficient, measures of
107 central location, see :func:`median` and :func:`mode`. (In this case,
108 "efficient" refers to statistical efficiency rather than computational
109 efficiency.)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700110
Georg Brandleb2aeec2013-10-21 08:57:26 +0200111 The sample mean gives an unbiased estimate of the true population mean,
112 which means that, taken on average over all the possible samples,
113 ``mean(sample)`` converges on the true mean of the entire population. If
114 *data* represents the entire population rather than a sample, then
115 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700116
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700117
Raymond Hettinger47d99872019-02-21 15:06:29 -0800118.. function:: fmean(data)
119
120 Convert *data* to floats and compute the arithmetic mean.
121
122 This runs faster than the :func:`mean` function and it always returns a
123 :class:`float`. The result is highly accurate but not as perfect as
124 :func:`mean`. If the input dataset is empty, raises a
125 :exc:`StatisticsError`.
126
127 .. doctest::
128
129 >>> fmean([3.5, 4.0, 5.25])
130 4.25
131
132 .. versionadded:: 3.8
133
134
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700135.. function:: geometric_mean(data)
136
137 Convert *data* to floats and compute the geometric mean.
138
139 Raises a :exc:`StatisticsError` if the input dataset is empty,
140 if it contains a zero, or if it contains a negative value.
141
142 No special efforts are made to achieve exact results.
143 (However, this may change in the future.)
144
145 .. doctest::
146
147 >>> round(geometric_mean([54, 24, 36]), 9)
148 36.0
149
150 .. versionadded:: 3.8
151
152
Steven D'Aprano22873182016-08-24 02:34:25 +1000153.. function:: harmonic_mean(data)
154
155 Return the harmonic mean of *data*, a sequence or iterator of
156 real-valued numbers.
157
158 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500159 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000160 data. For example, the harmonic mean of three values *a*, *b* and *c*
161 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
162
163 The harmonic mean is a type of average, a measure of the central
164 location of the data. It is often appropriate when averaging quantities
165 which are rates or ratios, for example speeds. For example:
166
167 Suppose an investor purchases an equal value of shares in each of
168 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
169 What is the average P/E ratio for the investor's portfolio?
170
171 .. doctest::
172
173 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
174 3.6
175
176 Using the arithmetic mean would give an average of about 5.167, which
177 is too high.
178
Zachary Warec019bd32016-08-23 13:23:31 -0500179 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000180 is less than zero.
181
Zachary Warec019bd32016-08-23 13:23:31 -0500182 .. versionadded:: 3.6
183
Steven D'Aprano22873182016-08-24 02:34:25 +1000184
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700185.. function:: median(data)
186
Georg Brandleb2aeec2013-10-21 08:57:26 +0200187 Return the median (middle value) of numeric data, using the common "mean of
188 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800189 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700190
Georg Brandleb2aeec2013-10-21 08:57:26 +0200191 The median is a robust measure of central location, and is less affected by
192 the presence of outliers in your data. When the number of data points is
193 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700194
195 .. doctest::
196
197 >>> median([1, 3, 5])
198 3
199
Georg Brandleb2aeec2013-10-21 08:57:26 +0200200 When the number of data points is even, the median is interpolated by taking
201 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700202
203 .. doctest::
204
205 >>> median([1, 3, 5, 7])
206 4.0
207
Georg Brandleb2aeec2013-10-21 08:57:26 +0200208 This is suited for when your data is discrete, and you don't mind that the
209 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700210
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300211 If your data is ordinal (supports order operations) but not numeric (doesn't
212 support addition), you should use :func:`median_low` or :func:`median_high`
213 instead.
214
Berker Peksag9c1dba22014-09-28 00:00:58 +0300215 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700216
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700217
218.. function:: median_low(data)
219
Georg Brandleb2aeec2013-10-21 08:57:26 +0200220 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800221 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700222
Georg Brandleb2aeec2013-10-21 08:57:26 +0200223 The low median is always a member of the data set. When the number of data
224 points is odd, the middle value is returned. When it is even, the smaller of
225 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700226
227 .. doctest::
228
229 >>> median_low([1, 3, 5])
230 3
231 >>> median_low([1, 3, 5, 7])
232 3
233
Georg Brandleb2aeec2013-10-21 08:57:26 +0200234 Use the low median when your data are discrete and you prefer the median to
235 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700236
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700237
238.. function:: median_high(data)
239
Georg Brandleb2aeec2013-10-21 08:57:26 +0200240 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800241 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700242
Georg Brandleb2aeec2013-10-21 08:57:26 +0200243 The high median is always a member of the data set. When the number of data
244 points is odd, the middle value is returned. When it is even, the larger of
245 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700246
247 .. doctest::
248
249 >>> median_high([1, 3, 5])
250 3
251 >>> median_high([1, 3, 5, 7])
252 5
253
Georg Brandleb2aeec2013-10-21 08:57:26 +0200254 Use the high median when your data are discrete and you prefer the median to
255 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700256
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700257
Georg Brandleb2aeec2013-10-21 08:57:26 +0200258.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700259
Georg Brandleb2aeec2013-10-21 08:57:26 +0200260 Return the median of grouped continuous data, calculated as the 50th
261 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800262 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700263
264 .. doctest::
265
266 >>> median_grouped([52, 52, 53, 54])
267 52.5
268
Georg Brandleb2aeec2013-10-21 08:57:26 +0200269 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200270 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
271 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
272 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200273 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700274
275 .. doctest::
276
277 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
278 3.7
279
Georg Brandleb2aeec2013-10-21 08:57:26 +0200280 Optional argument *interval* represents the class interval, and defaults
281 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700282
283 .. doctest::
284
285 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
286 3.25
287 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
288 3.5
289
290 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200291 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700292
293 .. impl-detail::
294
Georg Brandleb2aeec2013-10-21 08:57:26 +0200295 Under some circumstances, :func:`median_grouped` may coerce data points to
296 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700297
298 .. seealso::
299
Georg Brandleb2aeec2013-10-21 08:57:26 +0200300 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
301 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700302
Georg Brandleb2aeec2013-10-21 08:57:26 +0200303 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100304 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200305 function in the Gnome Gnumeric spreadsheet, including `this discussion
306 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700307
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700308
309.. function:: mode(data)
310
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700311 Return the single most common data point from discrete or nominal *data*.
312 The mode (when it exists) is the most typical value and serves as a
313 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200314
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700315 If there are multiple modes, returns the first one encountered in the *data*.
Raymond Hettingere43e7ed2019-08-08 01:23:05 -0700316 If the smallest or largest of multiple modes is desired instead, use
317 ``min(multimode(data))`` or ``max(multimode(data))``. If the input *data* is
318 empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700319
320 ``mode`` assumes discrete data, and returns a single value. This is the
321 standard treatment of the mode as commonly taught in schools:
322
323 .. doctest::
324
325 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
326 3
327
328 The mode is unique in that it is the only statistic which also applies
329 to nominal (non-numeric) data:
330
331 .. doctest::
332
333 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
334 'red'
335
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700336 .. versionchanged:: 3.8
337 Now handles multimodal datasets by returning the first mode encountered.
338 Formerly, it raised :exc:`StatisticsError` when more than one mode was
339 found.
340
341
342.. function:: multimode(data)
343
344 Return a list of the most frequently occurring values in the order they
345 were first encountered in the *data*. Will return more than one result if
346 there are multiple modes or an empty list if the *data* is empty:
347
348 .. doctest::
349
350 >>> multimode('aabbbbccddddeeffffgg')
351 ['b', 'd', 'f']
352 >>> multimode('')
353 []
354
355 .. versionadded:: 3.8
356
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700357
Georg Brandleb2aeec2013-10-21 08:57:26 +0200358.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700359
Georg Brandleb2aeec2013-10-21 08:57:26 +0200360 Return the population standard deviation (the square root of the population
361 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700362
363 .. doctest::
364
365 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
366 0.986893273527251
367
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700368
Georg Brandleb2aeec2013-10-21 08:57:26 +0200369.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700370
Georg Brandleb2aeec2013-10-21 08:57:26 +0200371 Return the population variance of *data*, a non-empty iterable of real-valued
372 numbers. Variance, or second moment about the mean, is a measure of the
373 variability (spread or dispersion) of data. A large variance indicates that
374 the data is spread out; a small variance indicates it is clustered closely
375 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700376
Georg Brandleb2aeec2013-10-21 08:57:26 +0200377 If the optional second argument *mu* is given, it should be the mean of
378 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700379 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700380
Georg Brandleb2aeec2013-10-21 08:57:26 +0200381 Use this function to calculate the variance from the entire population. To
382 estimate the variance from a sample, the :func:`variance` function is usually
383 a better choice.
384
385 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700386
387 Examples:
388
389 .. doctest::
390
391 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
392 >>> pvariance(data)
393 1.25
394
Georg Brandleb2aeec2013-10-21 08:57:26 +0200395 If you have already calculated the mean of your data, you can pass it as the
396 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700397
398 .. doctest::
399
400 >>> mu = mean(data)
401 >>> pvariance(data, mu)
402 1.25
403
Georg Brandleb2aeec2013-10-21 08:57:26 +0200404 This function does not attempt to verify that you have passed the actual mean
405 as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
406 results.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700407
408 Decimals and Fractions are supported:
409
410 .. doctest::
411
412 >>> from decimal import Decimal as D
413 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
414 Decimal('24.815')
415
416 >>> from fractions import Fraction as F
417 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
418 Fraction(13, 72)
419
420 .. note::
421
Georg Brandleb2aeec2013-10-21 08:57:26 +0200422 When called with the entire population, this gives the population variance
423 σ². When called on a sample instead, this is the biased sample variance
424 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700425
Georg Brandleb2aeec2013-10-21 08:57:26 +0200426 If you somehow know the true population mean μ, you may use this function
427 to calculate the variance of a sample, giving the known population mean as
428 the second argument. Provided the data points are representative
429 (e.g. independent and identically distributed), the result will be an
430 unbiased estimate of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700431
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700432
Georg Brandleb2aeec2013-10-21 08:57:26 +0200433.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700434
Georg Brandleb2aeec2013-10-21 08:57:26 +0200435 Return the sample standard deviation (the square root of the sample
436 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700437
438 .. doctest::
439
440 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
441 1.0810874155219827
442
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700443
Georg Brandleb2aeec2013-10-21 08:57:26 +0200444.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700445
Georg Brandleb2aeec2013-10-21 08:57:26 +0200446 Return the sample variance of *data*, an iterable of at least two real-valued
447 numbers. Variance, or second moment about the mean, is a measure of the
448 variability (spread or dispersion) of data. A large variance indicates that
449 the data is spread out; a small variance indicates it is clustered closely
450 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700451
Georg Brandleb2aeec2013-10-21 08:57:26 +0200452 If the optional second argument *xbar* is given, it should be the mean of
453 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700454 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700455
Georg Brandleb2aeec2013-10-21 08:57:26 +0200456 Use this function when your data is a sample from a population. To calculate
457 the variance from the entire population, see :func:`pvariance`.
458
459 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700460
461 Examples:
462
463 .. doctest::
464
465 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
466 >>> variance(data)
467 1.3720238095238095
468
Georg Brandleb2aeec2013-10-21 08:57:26 +0200469 If you have already calculated the mean of your data, you can pass it as the
470 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700471
472 .. doctest::
473
474 >>> m = mean(data)
475 >>> variance(data, m)
476 1.3720238095238095
477
Georg Brandleb2aeec2013-10-21 08:57:26 +0200478 This function does not attempt to verify that you have passed the actual mean
479 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700480 impossible results.
481
482 Decimal and Fraction values are supported:
483
484 .. doctest::
485
486 >>> from decimal import Decimal as D
487 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
488 Decimal('31.01875')
489
490 >>> from fractions import Fraction as F
491 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
492 Fraction(67, 108)
493
494 .. note::
495
Georg Brandleb2aeec2013-10-21 08:57:26 +0200496 This is the sample variance s² with Bessel's correction, also known as
497 variance with N-1 degrees of freedom. Provided that the data points are
498 representative (e.g. independent and identically distributed), the result
499 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700500
Georg Brandleb2aeec2013-10-21 08:57:26 +0200501 If you somehow know the actual population mean μ you should pass it to the
502 :func:`pvariance` function as the *mu* parameter to get the variance of a
503 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700504
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700505.. function:: quantiles(dist, *, n=4, method='exclusive')
506
507 Divide *dist* into *n* continuous intervals with equal probability.
508 Returns a list of ``n - 1`` cut points separating the intervals.
509
510 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
511 *n* to 100 for percentiles which gives the 99 cuts points that separate
512 *dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
513 is not least 1.
514
515 The *dist* can be any iterable containing sample data or it can be an
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700516 instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
517 results, the number of data points in *dist* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700518 Raises :exc:`StatisticsError` if there are not at least two data points.
519
520 For sample data, the cut points are linearly interpolated from the
521 two nearest data points. For example, if a cut point falls one-third
522 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700523 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700524
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700525 The *method* for computing quantiles can be varied depending on
526 whether the data in *dist* includes or excludes the lowest and
527 highest possible values from the population.
528
529 The default *method* is "exclusive" and is used for data sampled from
530 a population that can have more extreme values than found in the
531 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700532 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
533 sample values, the method sorts them and assigns the following
534 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700535
536 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700537 data or for samples that are known to include the most extreme values
538 from the population. The minimum value in *dist* is treated as the 0th
539 percentile and the maximum value is treated as the 100th percentile.
540 The portion of the population falling below the *i-th* of *m* sorted
541 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
542 values, the method sorts them and assigns the following percentiles:
543 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700544
545 If *dist* is an instance of a class that defines an
546 :meth:`~inv_cdf` method, setting *method* has no effect.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700547
548 .. doctest::
549
550 # Decile cut points for empirically sampled data
551 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
552 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
553 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
554 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
555 ... 103, 107, 101, 81, 109, 104]
556 >>> [round(q, 1) for q in quantiles(data, n=10)]
557 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
558
Min ho Kimc4cacc82019-07-31 08:16:13 +1000559 >>> # Quartile cut points for the standard normal distribution
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700560 >>> Z = NormalDist()
561 >>> [round(q, 4) for q in quantiles(Z, n=4)]
562 [-0.6745, 0.0, 0.6745]
563
564 .. versionadded:: 3.8
565
566
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700567Exceptions
568----------
569
570A single exception is defined:
571
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400572.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700573
Benjamin Peterson44c30652013-10-20 17:52:09 -0400574 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700575
Raymond Hettinger11c79532019-02-23 14:44:07 -0800576
577:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700578---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800579
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800580:class:`NormalDist` is a tool for creating and manipulating normal
581distributions of a `random variable
582<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
583composite class that treats the mean and standard deviation of data
584measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800585
586Normal distributions arise from the `Central Limit Theorem
587<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800588of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800589
590.. class:: NormalDist(mu=0.0, sigma=1.0)
591
592 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800593 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800594 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800595 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800596
597 If *sigma* is negative, raises :exc:`StatisticsError`.
598
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800599 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800600
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800601 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800602 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
603 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800604
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800605 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800606
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800607 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800608 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
609 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800610
611 .. attribute:: variance
612
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800613 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800614 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
615 distribution. Equal to the square of the standard deviation.
616
617 .. classmethod:: NormalDist.from_samples(data)
618
Raymond Hettingercc353a02019-03-10 23:43:33 -0700619 Makes a normal distribution instance computed from sample data. The
620 *data* can be any :term:`iterable` and should consist of values that
621 can be converted to type :class:`float`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800622
623 If *data* does not contain at least two elements, raises
624 :exc:`StatisticsError` because it takes at least one point to estimate
625 a central value and at least two points to estimate dispersion.
626
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700627 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800628
629 Generates *n* random samples for a given mean and standard deviation.
630 Returns a :class:`list` of :class:`float` values.
631
632 If *seed* is given, creates a new instance of the underlying random
633 number generator. This is useful for creating reproducible results,
634 even in a multi-threading context.
635
636 .. method:: NormalDist.pdf(x)
637
638 Using a `probability density function (pdf)
639 <https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800640 compute the relative likelihood that a random variable *X* will be near
Raymond Hettinger11c79532019-02-23 14:44:07 -0800641 the given value *x*. Mathematically, it is the ratio ``P(x <= X <
642 x+dx) / dx``.
643
Raymond Hettingercc353a02019-03-10 23:43:33 -0700644 The relative likelihood is computed as the probability of a sample
645 occurring in a narrow range divided by the width of the range (hence
646 the word "density"). Since the likelihood is relative to other points,
647 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800648
649 .. method:: NormalDist.cdf(x)
650
651 Using a `cumulative distribution function (cdf)
652 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800653 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800654 equal to *x*. Mathematically, it is written ``P(X <= x)``.
655
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700656 .. method:: NormalDist.inv_cdf(p)
657
658 Compute the inverse cumulative distribution function, also known as the
659 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
660 or the `percent-point
661 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
662 function. Mathematically, it is written ``x : P(X <= x) = p``.
663
664 Finds the value *x* of the random variable *X* such that the
665 probability of the variable being less than or equal to that value
666 equals the given probability *p*.
667
Raymond Hettinger318d5372019-03-06 22:59:40 -0800668 .. method:: NormalDist.overlap(other)
669
Raymond Hettinger83717992019-08-25 00:57:26 -0700670 Returns a value between 0.0 and 1.0 giving the overlapping area for
671 the two probability density functions.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800672
Raymond Hettinger11c79532019-02-23 14:44:07 -0800673 Instances of :class:`NormalDist` support addition, subtraction,
674 multiplication and division by a constant. These operations
675 are used for translation and scaling. For example:
676
677 .. doctest::
678
679 >>> temperature_february = NormalDist(5, 2.5) # Celsius
680 >>> temperature_february * (9/5) + 32 # Fahrenheit
681 NormalDist(mu=41.0, sigma=4.5)
682
Raymond Hettingercc353a02019-03-10 23:43:33 -0700683 Dividing a constant by an instance of :class:`NormalDist` is not supported
684 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800685
686 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800687 variables, it is possible to `add and subtract two independent normally
688 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800689 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
690 represented as instances of :class:`NormalDist`. For example:
691
692 .. doctest::
693
694 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
695 >>> drug_effects = NormalDist(0.4, 0.15)
696 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700697 >>> round(combined.mean, 1)
698 3.1
699 >>> round(combined.stdev, 1)
700 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800701
702 .. versionadded:: 3.8
703
704
705:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700706^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800707
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800708:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800709
710For example, given `historical data for SAT exams
711<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800712are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700713determine the percentage of students with test scores between 1100 and
7141200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800715
716.. doctest::
717
718 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800719 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700720 >>> round(fraction * 100.0, 1)
721 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800722
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700723Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
724<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
725
726.. doctest::
727
728 >>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
729 [928, 1060, 1192]
730 >>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
731 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
732
Raymond Hettinger11c79532019-02-23 14:44:07 -0800733To estimate the distribution for a model than isn't easy to solve
734analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700735Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800736
737.. doctest::
738
Raymond Hettingercc353a02019-03-10 23:43:33 -0700739 >>> def model(x, y, z):
740 ... return (3*x + 7*x*y - 5*y) / (11 * z)
741 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800742 >>> n = 100_000
Raymond Hettinger83717992019-08-25 00:57:26 -0700743 >>> seed = 86753099035768
744 >>> X = NormalDist(10, 2.5).samples(n, seed=seed)
745 >>> Y = NormalDist(15, 1.75).samples(n, seed=seed)
746 >>> Z = NormalDist(50, 1.25).samples(n, seed=seed)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700747 >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
Raymond Hettinger83717992019-08-25 00:57:26 -0700748 NormalDist(mu=1.8661894803304777, sigma=0.65238717376862)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800749
750Normal distributions commonly arise in machine learning problems.
751
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800752Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800753<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
754The challenge is to predict a person's gender from measurements of normally
755distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800756
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800757We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800758measurements are assumed to be normally distributed, so we summarize the data
759with :class:`NormalDist`:
760
761.. doctest::
762
763 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
764 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
765 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
766 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
767 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
768 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
769
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800770Next, we encounter a new person whose feature measurements are known but whose
771gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800772
773.. doctest::
774
775 >>> ht = 6.0 # height
776 >>> wt = 130 # weight
777 >>> fs = 8 # foot size
778
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800779Starting with a 50% `prior probability
780<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
781we compute the posterior as the prior times the product of likelihoods for the
782feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800783
784.. doctest::
785
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800786 >>> prior_male = 0.5
787 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800788 >>> posterior_male = (prior_male * height_male.pdf(ht) *
789 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
790
791 >>> posterior_female = (prior_female * height_female.pdf(ht) *
792 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
793
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800794The final prediction goes to the largest posterior. This is known as the
795`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800796<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
797
798.. doctest::
799
800 >>> 'male' if posterior_male > posterior_female else 'female'
801 'female'
802
803
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700804..
805 # This modelines must appear within the last ten lines of the file.
806 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;