blob: a906a591e62cef1bf251e766f41652f21d4818d2 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
29 currently unsupported. Mixed types are also undefined and
30 implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, e.g.
32 ``map(float, input_data)``.
33
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
Raymond Hettingerfc06a192019-03-12 00:43:27 -070040======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070041:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070043:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100044:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`median` Median (middle value) of data.
46:func:`median_low` Low median of data.
47:func:`median_high` High median of data.
48:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070049:func:`mode` Single mode (most common value) of discrete or nominal data.
50:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070051:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070052======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070053
Georg Brandleb2aeec2013-10-21 08:57:26 +020054Measures of spread
55------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070056
Georg Brandleb2aeec2013-10-21 08:57:26 +020057These functions calculate a measure of how much the population or sample
58tends to deviate from the typical or average values.
59
60======================= =============================================
61:func:`pstdev` Population standard deviation of data.
62:func:`pvariance` Population variance of data.
63:func:`stdev` Sample standard deviation of data.
64:func:`variance` Sample variance of data.
65======================= =============================================
66
67
68Function details
69----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070070
Georg Brandle051b552013-11-04 07:30:50 +010071Note: The functions do not require the data given to them to be sorted.
72However, for reading convenience, most of the examples show sorted sequences.
73
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074.. function:: mean(data)
75
Raymond Hettinger6da90782016-11-21 16:31:02 -080076 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070077
Georg Brandleb2aeec2013-10-21 08:57:26 +020078 The arithmetic mean is the sum of the data divided by the number of data
79 points. It is commonly called "the average", although it is only one of many
80 different mathematical averages. It is a measure of the central location of
81 the data.
82
83 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070084
85 Some examples of use:
86
87 .. doctest::
88
89 >>> mean([1, 2, 3, 4, 4])
90 2.8
91 >>> mean([-1.0, 2.5, 3.25, 5.75])
92 2.625
93
94 >>> from fractions import Fraction as F
95 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
96 Fraction(13, 21)
97
98 >>> from decimal import Decimal as D
99 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
100 Decimal('0.5625')
101
102 .. note::
103
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200104 The mean is strongly affected by outliers and is not a robust estimator
Georg Brandleb2aeec2013-10-21 08:57:26 +0200105 for central location: the mean is not necessarily a typical example of the
106 data points. For more robust, although less efficient, measures of
107 central location, see :func:`median` and :func:`mode`. (In this case,
108 "efficient" refers to statistical efficiency rather than computational
109 efficiency.)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700110
Georg Brandleb2aeec2013-10-21 08:57:26 +0200111 The sample mean gives an unbiased estimate of the true population mean,
112 which means that, taken on average over all the possible samples,
113 ``mean(sample)`` converges on the true mean of the entire population. If
114 *data* represents the entire population rather than a sample, then
115 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700116
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700117
Raymond Hettinger47d99872019-02-21 15:06:29 -0800118.. function:: fmean(data)
119
120 Convert *data* to floats and compute the arithmetic mean.
121
122 This runs faster than the :func:`mean` function and it always returns a
123 :class:`float`. The result is highly accurate but not as perfect as
124 :func:`mean`. If the input dataset is empty, raises a
125 :exc:`StatisticsError`.
126
127 .. doctest::
128
129 >>> fmean([3.5, 4.0, 5.25])
130 4.25
131
132 .. versionadded:: 3.8
133
134
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700135.. function:: geometric_mean(data)
136
137 Convert *data* to floats and compute the geometric mean.
138
139 Raises a :exc:`StatisticsError` if the input dataset is empty,
140 if it contains a zero, or if it contains a negative value.
141
142 No special efforts are made to achieve exact results.
143 (However, this may change in the future.)
144
145 .. doctest::
146
147 >>> round(geometric_mean([54, 24, 36]), 9)
148 36.0
149
150 .. versionadded:: 3.8
151
152
Steven D'Aprano22873182016-08-24 02:34:25 +1000153.. function:: harmonic_mean(data)
154
155 Return the harmonic mean of *data*, a sequence or iterator of
156 real-valued numbers.
157
158 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500159 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000160 data. For example, the harmonic mean of three values *a*, *b* and *c*
161 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
162
163 The harmonic mean is a type of average, a measure of the central
164 location of the data. It is often appropriate when averaging quantities
165 which are rates or ratios, for example speeds. For example:
166
167 Suppose an investor purchases an equal value of shares in each of
168 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
169 What is the average P/E ratio for the investor's portfolio?
170
171 .. doctest::
172
173 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
174 3.6
175
176 Using the arithmetic mean would give an average of about 5.167, which
177 is too high.
178
Zachary Warec019bd32016-08-23 13:23:31 -0500179 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000180 is less than zero.
181
Zachary Warec019bd32016-08-23 13:23:31 -0500182 .. versionadded:: 3.6
183
Steven D'Aprano22873182016-08-24 02:34:25 +1000184
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700185.. function:: median(data)
186
Georg Brandleb2aeec2013-10-21 08:57:26 +0200187 Return the median (middle value) of numeric data, using the common "mean of
188 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800189 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700190
Georg Brandleb2aeec2013-10-21 08:57:26 +0200191 The median is a robust measure of central location, and is less affected by
192 the presence of outliers in your data. When the number of data points is
193 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700194
195 .. doctest::
196
197 >>> median([1, 3, 5])
198 3
199
Georg Brandleb2aeec2013-10-21 08:57:26 +0200200 When the number of data points is even, the median is interpolated by taking
201 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700202
203 .. doctest::
204
205 >>> median([1, 3, 5, 7])
206 4.0
207
Georg Brandleb2aeec2013-10-21 08:57:26 +0200208 This is suited for when your data is discrete, and you don't mind that the
209 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700210
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300211 If your data is ordinal (supports order operations) but not numeric (doesn't
212 support addition), you should use :func:`median_low` or :func:`median_high`
213 instead.
214
Berker Peksag9c1dba22014-09-28 00:00:58 +0300215 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700216
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700217
218.. function:: median_low(data)
219
Georg Brandleb2aeec2013-10-21 08:57:26 +0200220 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800221 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700222
Georg Brandleb2aeec2013-10-21 08:57:26 +0200223 The low median is always a member of the data set. When the number of data
224 points is odd, the middle value is returned. When it is even, the smaller of
225 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700226
227 .. doctest::
228
229 >>> median_low([1, 3, 5])
230 3
231 >>> median_low([1, 3, 5, 7])
232 3
233
Georg Brandleb2aeec2013-10-21 08:57:26 +0200234 Use the low median when your data are discrete and you prefer the median to
235 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700236
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700237
238.. function:: median_high(data)
239
Georg Brandleb2aeec2013-10-21 08:57:26 +0200240 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800241 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700242
Georg Brandleb2aeec2013-10-21 08:57:26 +0200243 The high median is always a member of the data set. When the number of data
244 points is odd, the middle value is returned. When it is even, the larger of
245 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700246
247 .. doctest::
248
249 >>> median_high([1, 3, 5])
250 3
251 >>> median_high([1, 3, 5, 7])
252 5
253
Georg Brandleb2aeec2013-10-21 08:57:26 +0200254 Use the high median when your data are discrete and you prefer the median to
255 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700256
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700257
Georg Brandleb2aeec2013-10-21 08:57:26 +0200258.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700259
Georg Brandleb2aeec2013-10-21 08:57:26 +0200260 Return the median of grouped continuous data, calculated as the 50th
261 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800262 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700263
264 .. doctest::
265
266 >>> median_grouped([52, 52, 53, 54])
267 52.5
268
Georg Brandleb2aeec2013-10-21 08:57:26 +0200269 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200270 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
271 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
272 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200273 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700274
275 .. doctest::
276
277 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
278 3.7
279
Georg Brandleb2aeec2013-10-21 08:57:26 +0200280 Optional argument *interval* represents the class interval, and defaults
281 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700282
283 .. doctest::
284
285 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
286 3.25
287 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
288 3.5
289
290 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200291 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700292
293 .. impl-detail::
294
Georg Brandleb2aeec2013-10-21 08:57:26 +0200295 Under some circumstances, :func:`median_grouped` may coerce data points to
296 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700297
298 .. seealso::
299
Georg Brandleb2aeec2013-10-21 08:57:26 +0200300 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
301 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700302
Georg Brandleb2aeec2013-10-21 08:57:26 +0200303 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100304 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200305 function in the Gnome Gnumeric spreadsheet, including `this discussion
306 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700307
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700308
309.. function:: mode(data)
310
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700311 Return the single most common data point from discrete or nominal *data*.
312 The mode (when it exists) is the most typical value and serves as a
313 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200314
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700315 If there are multiple modes, returns the first one encountered in the *data*.
316 If *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700317
318 ``mode`` assumes discrete data, and returns a single value. This is the
319 standard treatment of the mode as commonly taught in schools:
320
321 .. doctest::
322
323 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
324 3
325
326 The mode is unique in that it is the only statistic which also applies
327 to nominal (non-numeric) data:
328
329 .. doctest::
330
331 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
332 'red'
333
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700334 .. versionchanged:: 3.8
335 Now handles multimodal datasets by returning the first mode encountered.
336 Formerly, it raised :exc:`StatisticsError` when more than one mode was
337 found.
338
339
340.. function:: multimode(data)
341
342 Return a list of the most frequently occurring values in the order they
343 were first encountered in the *data*. Will return more than one result if
344 there are multiple modes or an empty list if the *data* is empty:
345
346 .. doctest::
347
348 >>> multimode('aabbbbccddddeeffffgg')
349 ['b', 'd', 'f']
350 >>> multimode('')
351 []
352
353 .. versionadded:: 3.8
354
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700355
Georg Brandleb2aeec2013-10-21 08:57:26 +0200356.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700357
Georg Brandleb2aeec2013-10-21 08:57:26 +0200358 Return the population standard deviation (the square root of the population
359 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700360
361 .. doctest::
362
363 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
364 0.986893273527251
365
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700366
Georg Brandleb2aeec2013-10-21 08:57:26 +0200367.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700368
Georg Brandleb2aeec2013-10-21 08:57:26 +0200369 Return the population variance of *data*, a non-empty iterable of real-valued
370 numbers. Variance, or second moment about the mean, is a measure of the
371 variability (spread or dispersion) of data. A large variance indicates that
372 the data is spread out; a small variance indicates it is clustered closely
373 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700374
Georg Brandleb2aeec2013-10-21 08:57:26 +0200375 If the optional second argument *mu* is given, it should be the mean of
376 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700377 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700378
Georg Brandleb2aeec2013-10-21 08:57:26 +0200379 Use this function to calculate the variance from the entire population. To
380 estimate the variance from a sample, the :func:`variance` function is usually
381 a better choice.
382
383 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700384
385 Examples:
386
387 .. doctest::
388
389 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
390 >>> pvariance(data)
391 1.25
392
Georg Brandleb2aeec2013-10-21 08:57:26 +0200393 If you have already calculated the mean of your data, you can pass it as the
394 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700395
396 .. doctest::
397
398 >>> mu = mean(data)
399 >>> pvariance(data, mu)
400 1.25
401
Georg Brandleb2aeec2013-10-21 08:57:26 +0200402 This function does not attempt to verify that you have passed the actual mean
403 as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
404 results.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700405
406 Decimals and Fractions are supported:
407
408 .. doctest::
409
410 >>> from decimal import Decimal as D
411 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
412 Decimal('24.815')
413
414 >>> from fractions import Fraction as F
415 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
416 Fraction(13, 72)
417
418 .. note::
419
Georg Brandleb2aeec2013-10-21 08:57:26 +0200420 When called with the entire population, this gives the population variance
421 σ². When called on a sample instead, this is the biased sample variance
422 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700423
Georg Brandleb2aeec2013-10-21 08:57:26 +0200424 If you somehow know the true population mean μ, you may use this function
425 to calculate the variance of a sample, giving the known population mean as
426 the second argument. Provided the data points are representative
427 (e.g. independent and identically distributed), the result will be an
428 unbiased estimate of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700429
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700430
Georg Brandleb2aeec2013-10-21 08:57:26 +0200431.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700432
Georg Brandleb2aeec2013-10-21 08:57:26 +0200433 Return the sample standard deviation (the square root of the sample
434 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700435
436 .. doctest::
437
438 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
439 1.0810874155219827
440
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700441
Georg Brandleb2aeec2013-10-21 08:57:26 +0200442.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700443
Georg Brandleb2aeec2013-10-21 08:57:26 +0200444 Return the sample variance of *data*, an iterable of at least two real-valued
445 numbers. Variance, or second moment about the mean, is a measure of the
446 variability (spread or dispersion) of data. A large variance indicates that
447 the data is spread out; a small variance indicates it is clustered closely
448 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700449
Georg Brandleb2aeec2013-10-21 08:57:26 +0200450 If the optional second argument *xbar* is given, it should be the mean of
451 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700452 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700453
Georg Brandleb2aeec2013-10-21 08:57:26 +0200454 Use this function when your data is a sample from a population. To calculate
455 the variance from the entire population, see :func:`pvariance`.
456
457 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700458
459 Examples:
460
461 .. doctest::
462
463 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
464 >>> variance(data)
465 1.3720238095238095
466
Georg Brandleb2aeec2013-10-21 08:57:26 +0200467 If you have already calculated the mean of your data, you can pass it as the
468 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700469
470 .. doctest::
471
472 >>> m = mean(data)
473 >>> variance(data, m)
474 1.3720238095238095
475
Georg Brandleb2aeec2013-10-21 08:57:26 +0200476 This function does not attempt to verify that you have passed the actual mean
477 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700478 impossible results.
479
480 Decimal and Fraction values are supported:
481
482 .. doctest::
483
484 >>> from decimal import Decimal as D
485 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
486 Decimal('31.01875')
487
488 >>> from fractions import Fraction as F
489 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
490 Fraction(67, 108)
491
492 .. note::
493
Georg Brandleb2aeec2013-10-21 08:57:26 +0200494 This is the sample variance s² with Bessel's correction, also known as
495 variance with N-1 degrees of freedom. Provided that the data points are
496 representative (e.g. independent and identically distributed), the result
497 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700498
Georg Brandleb2aeec2013-10-21 08:57:26 +0200499 If you somehow know the actual population mean μ you should pass it to the
500 :func:`pvariance` function as the *mu* parameter to get the variance of a
501 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700502
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700503.. function:: quantiles(dist, *, n=4, method='exclusive')
504
505 Divide *dist* into *n* continuous intervals with equal probability.
506 Returns a list of ``n - 1`` cut points separating the intervals.
507
508 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
509 *n* to 100 for percentiles which gives the 99 cuts points that separate
510 *dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
511 is not least 1.
512
513 The *dist* can be any iterable containing sample data or it can be an
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700514 instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
515 results, the number of data points in *dist* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700516 Raises :exc:`StatisticsError` if there are not at least two data points.
517
518 For sample data, the cut points are linearly interpolated from the
519 two nearest data points. For example, if a cut point falls one-third
520 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700521 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700522
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700523 The *method* for computing quantiles can be varied depending on
524 whether the data in *dist* includes or excludes the lowest and
525 highest possible values from the population.
526
527 The default *method* is "exclusive" and is used for data sampled from
528 a population that can have more extreme values than found in the
529 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700530 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
531 sample values, the method sorts them and assigns the following
532 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700533
534 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700535 data or for samples that are known to include the most extreme values
536 from the population. The minimum value in *dist* is treated as the 0th
537 percentile and the maximum value is treated as the 100th percentile.
538 The portion of the population falling below the *i-th* of *m* sorted
539 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
540 values, the method sorts them and assigns the following percentiles:
541 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700542
543 If *dist* is an instance of a class that defines an
544 :meth:`~inv_cdf` method, setting *method* has no effect.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700545
546 .. doctest::
547
548 # Decile cut points for empirically sampled data
549 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
550 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
551 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
552 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
553 ... 103, 107, 101, 81, 109, 104]
554 >>> [round(q, 1) for q in quantiles(data, n=10)]
555 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
556
557 >>> # Quartile cut points for the standard normal distibution
558 >>> Z = NormalDist()
559 >>> [round(q, 4) for q in quantiles(Z, n=4)]
560 [-0.6745, 0.0, 0.6745]
561
562 .. versionadded:: 3.8
563
564
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700565Exceptions
566----------
567
568A single exception is defined:
569
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400570.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700571
Benjamin Peterson44c30652013-10-20 17:52:09 -0400572 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700573
Raymond Hettinger11c79532019-02-23 14:44:07 -0800574
575:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700576---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800577
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800578:class:`NormalDist` is a tool for creating and manipulating normal
579distributions of a `random variable
580<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
581composite class that treats the mean and standard deviation of data
582measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800583
584Normal distributions arise from the `Central Limit Theorem
585<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800586of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800587
588.. class:: NormalDist(mu=0.0, sigma=1.0)
589
590 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800591 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800592 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800593 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800594
595 If *sigma* is negative, raises :exc:`StatisticsError`.
596
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800597 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800598
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800599 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800600 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
601 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800602
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800603 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800604
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800605 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800606 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
607 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800608
609 .. attribute:: variance
610
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800611 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800612 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
613 distribution. Equal to the square of the standard deviation.
614
615 .. classmethod:: NormalDist.from_samples(data)
616
Raymond Hettingercc353a02019-03-10 23:43:33 -0700617 Makes a normal distribution instance computed from sample data. The
618 *data* can be any :term:`iterable` and should consist of values that
619 can be converted to type :class:`float`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800620
621 If *data* does not contain at least two elements, raises
622 :exc:`StatisticsError` because it takes at least one point to estimate
623 a central value and at least two points to estimate dispersion.
624
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700625 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800626
627 Generates *n* random samples for a given mean and standard deviation.
628 Returns a :class:`list` of :class:`float` values.
629
630 If *seed* is given, creates a new instance of the underlying random
631 number generator. This is useful for creating reproducible results,
632 even in a multi-threading context.
633
634 .. method:: NormalDist.pdf(x)
635
636 Using a `probability density function (pdf)
637 <https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800638 compute the relative likelihood that a random variable *X* will be near
Raymond Hettinger11c79532019-02-23 14:44:07 -0800639 the given value *x*. Mathematically, it is the ratio ``P(x <= X <
640 x+dx) / dx``.
641
Raymond Hettingercc353a02019-03-10 23:43:33 -0700642 The relative likelihood is computed as the probability of a sample
643 occurring in a narrow range divided by the width of the range (hence
644 the word "density"). Since the likelihood is relative to other points,
645 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800646
647 .. method:: NormalDist.cdf(x)
648
649 Using a `cumulative distribution function (cdf)
650 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800651 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800652 equal to *x*. Mathematically, it is written ``P(X <= x)``.
653
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700654 .. method:: NormalDist.inv_cdf(p)
655
656 Compute the inverse cumulative distribution function, also known as the
657 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
658 or the `percent-point
659 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
660 function. Mathematically, it is written ``x : P(X <= x) = p``.
661
662 Finds the value *x* of the random variable *X* such that the
663 probability of the variable being less than or equal to that value
664 equals the given probability *p*.
665
Raymond Hettinger318d5372019-03-06 22:59:40 -0800666 .. method:: NormalDist.overlap(other)
667
668 Compute the `overlapping coefficient (OVL)
669 <http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger14bab7a2019-03-07 08:54:31 -0800670 between two normal distributions, giving a measure of agreement.
671 Returns a value between 0.0 and 1.0 giving `the overlapping area for
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700672 the two probability density functions
Raymond Hettinger14bab7a2019-03-07 08:54:31 -0800673 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800674
Raymond Hettinger11c79532019-02-23 14:44:07 -0800675 Instances of :class:`NormalDist` support addition, subtraction,
676 multiplication and division by a constant. These operations
677 are used for translation and scaling. For example:
678
679 .. doctest::
680
681 >>> temperature_february = NormalDist(5, 2.5) # Celsius
682 >>> temperature_february * (9/5) + 32 # Fahrenheit
683 NormalDist(mu=41.0, sigma=4.5)
684
Raymond Hettingercc353a02019-03-10 23:43:33 -0700685 Dividing a constant by an instance of :class:`NormalDist` is not supported
686 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800687
688 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800689 variables, it is possible to `add and subtract two independent normally
690 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800691 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
692 represented as instances of :class:`NormalDist`. For example:
693
694 .. doctest::
695
696 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
697 >>> drug_effects = NormalDist(0.4, 0.15)
698 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700699 >>> round(combined.mean, 1)
700 3.1
701 >>> round(combined.stdev, 1)
702 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800703
704 .. versionadded:: 3.8
705
706
707:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700708^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800709
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800710:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800711
712For example, given `historical data for SAT exams
713<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800714are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700715determine the percentage of students with test scores between 1100 and
7161200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800717
718.. doctest::
719
720 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800721 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700722 >>> round(fraction * 100.0, 1)
723 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800724
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700725Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
726<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
727
728.. doctest::
729
730 >>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
731 [928, 1060, 1192]
732 >>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
733 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
734
Raymond Hettinger318d5372019-03-06 22:59:40 -0800735What percentage of men and women will have the same height in `two normally
736distributed populations with known means and standard deviations
737<http://www.usablestats.com/lessons/normal>`_?
738
739 >>> men = NormalDist(70, 4)
740 >>> women = NormalDist(65, 3.5)
741 >>> ovl = men.overlap(women)
742 >>> round(ovl * 100.0, 1)
743 50.3
744
Raymond Hettinger11c79532019-02-23 14:44:07 -0800745To estimate the distribution for a model than isn't easy to solve
746analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700747Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800748
749.. doctest::
750
Raymond Hettingercc353a02019-03-10 23:43:33 -0700751 >>> def model(x, y, z):
752 ... return (3*x + 7*x*y - 5*y) / (11 * z)
753 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800754 >>> n = 100_000
Raymond Hettingercc353a02019-03-10 23:43:33 -0700755 >>> X = NormalDist(10, 2.5).samples(n)
756 >>> Y = NormalDist(15, 1.75).samples(n)
757 >>> Z = NormalDist(5, 1.25).samples(n)
758 >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
759 NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800760
761Normal distributions commonly arise in machine learning problems.
762
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800763Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800764<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
765The challenge is to predict a person's gender from measurements of normally
766distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800767
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800768We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800769measurements are assumed to be normally distributed, so we summarize the data
770with :class:`NormalDist`:
771
772.. doctest::
773
774 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
775 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
776 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
777 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
778 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
779 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
780
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800781Next, we encounter a new person whose feature measurements are known but whose
782gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800783
784.. doctest::
785
786 >>> ht = 6.0 # height
787 >>> wt = 130 # weight
788 >>> fs = 8 # foot size
789
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800790Starting with a 50% `prior probability
791<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
792we compute the posterior as the prior times the product of likelihoods for the
793feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800794
795.. doctest::
796
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800797 >>> prior_male = 0.5
798 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800799 >>> posterior_male = (prior_male * height_male.pdf(ht) *
800 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
801
802 >>> posterior_female = (prior_female * height_female.pdf(ht) *
803 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
804
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800805The final prediction goes to the largest posterior. This is known as the
806`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800807<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
808
809.. doctest::
810
811 >>> 'male' if posterior_male > posterior_female else 'female'
812 'female'
813
814
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700815..
816 # This modelines must appear within the last ten lines of the file.
817 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;