blob: 6b6d3154a28810069abcfcc08b1fb83bcf4d6ba1 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
Sanchit Khuranaf8a63162019-11-26 03:47:59 +05305 :synopsis: Mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070022numeric (:class:`~numbers.Real`-valued) data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070023
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070024The module is not intended to be a competitor to third-party libraries such
25as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
26proprietary full-featured statistics packages aimed at professional
27statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
28graphing and scientific calculators.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100029
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070030Unless explicitly noted, these functions support :class:`int`,
31:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
32Behaviour with other types (whether in the numeric tower or not) is
33currently unsupported. Collections with a mix of types are also undefined
34and implementation-dependent. If your input data consists of mixed types,
35you may be able to use :func:`map` to ensure a consistent result, for
36example: ``map(float, input_data)``.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100037
Larry Hastingsf5e987b2013-10-19 11:50:09 -070038Averages and measures of central location
39-----------------------------------------
40
41These functions calculate an average or typical value from a population
42or sample.
43
Raymond Hettingerfc06a192019-03-12 00:43:27 -070044======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080046:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070047:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100048:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070049:func:`median` Median (middle value) of data.
50:func:`median_low` Low median of data.
51:func:`median_high` High median of data.
52:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070053:func:`mode` Single mode (most common value) of discrete or nominal data.
54:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070055:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070056======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070057
Georg Brandleb2aeec2013-10-21 08:57:26 +020058Measures of spread
59------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070060
Georg Brandleb2aeec2013-10-21 08:57:26 +020061These functions calculate a measure of how much the population or sample
62tends to deviate from the typical or average values.
63
64======================= =============================================
65:func:`pstdev` Population standard deviation of data.
66:func:`pvariance` Population variance of data.
67:func:`stdev` Sample standard deviation of data.
68:func:`variance` Sample variance of data.
69======================= =============================================
70
71
72Function details
73----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074
Georg Brandle051b552013-11-04 07:30:50 +010075Note: The functions do not require the data given to them to be sorted.
76However, for reading convenience, most of the examples show sorted sequences.
77
Larry Hastingsf5e987b2013-10-19 11:50:09 -070078.. function:: mean(data)
79
Raymond Hettinger733b9a32019-11-11 23:35:06 -080080 Return the sample arithmetic mean of *data* which can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070081
Georg Brandleb2aeec2013-10-21 08:57:26 +020082 The arithmetic mean is the sum of the data divided by the number of data
83 points. It is commonly called "the average", although it is only one of many
84 different mathematical averages. It is a measure of the central location of
85 the data.
86
87 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070088
89 Some examples of use:
90
91 .. doctest::
92
93 >>> mean([1, 2, 3, 4, 4])
94 2.8
95 >>> mean([-1.0, 2.5, 3.25, 5.75])
96 2.625
97
98 >>> from fractions import Fraction as F
99 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
100 Fraction(13, 21)
101
102 >>> from decimal import Decimal as D
103 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
104 Decimal('0.5625')
105
106 .. note::
107
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200108 The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettingere4810b22019-09-05 00:18:47 -0700109 for central location: the mean is not necessarily a typical example of
110 the data points. For more robust measures of central location, see
111 :func:`median` and :func:`mode`.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700112
Georg Brandleb2aeec2013-10-21 08:57:26 +0200113 The sample mean gives an unbiased estimate of the true population mean,
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700114 so that when taken on average over all the possible samples,
Georg Brandleb2aeec2013-10-21 08:57:26 +0200115 ``mean(sample)`` converges on the true mean of the entire population. If
116 *data* represents the entire population rather than a sample, then
117 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700118
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700119
Raymond Hettinger47d99872019-02-21 15:06:29 -0800120.. function:: fmean(data)
121
122 Convert *data* to floats and compute the arithmetic mean.
123
124 This runs faster than the :func:`mean` function and it always returns a
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800125 :class:`float`. The *data* may be a sequence or iterable. If the input
Raymond Hettingere4810b22019-09-05 00:18:47 -0700126 dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger47d99872019-02-21 15:06:29 -0800127
128 .. doctest::
129
130 >>> fmean([3.5, 4.0, 5.25])
131 4.25
132
133 .. versionadded:: 3.8
134
135
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700136.. function:: geometric_mean(data)
137
138 Convert *data* to floats and compute the geometric mean.
139
Raymond Hettingere4810b22019-09-05 00:18:47 -0700140 The geometric mean indicates the central tendency or typical value of the
141 *data* using the product of the values (as opposed to the arithmetic mean
142 which uses their sum).
143
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700144 Raises a :exc:`StatisticsError` if the input dataset is empty,
145 if it contains a zero, or if it contains a negative value.
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800146 The *data* may be a sequence or iterable.
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700147
148 No special efforts are made to achieve exact results.
149 (However, this may change in the future.)
150
151 .. doctest::
152
Raymond Hettingere4810b22019-09-05 00:18:47 -0700153 >>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700154 36.0
155
156 .. versionadded:: 3.8
157
158
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800159.. function:: harmonic_mean(data, weights=None)
Steven D'Aprano22873182016-08-24 02:34:25 +1000160
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800161 Return the harmonic mean of *data*, a sequence or iterable of
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800162 real-valued numbers. If *weights* is omitted or *None*, then
163 equal weighting is assumed.
Steven D'Aprano22873182016-08-24 02:34:25 +1000164
Raymond Hettinger30a8b282021-02-07 16:44:42 -0800165 The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
166 reciprocals of the data. For example, the harmonic mean of three values *a*,
167 *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the
168 values is zero, the result will be zero.
Steven D'Aprano22873182016-08-24 02:34:25 +1000169
170 The harmonic mean is a type of average, a measure of the central
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700171 location of the data. It is often appropriate when averaging
Raymond Hettinger30a8b282021-02-07 16:44:42 -0800172 ratios or rates, for example speeds.
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700173
174 Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
175 What is the average speed?
176
177 .. doctest::
178
179 >>> harmonic_mean([40, 60])
180 48.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000181
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800182 Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
183 speeds-up to 60 km/hr for the remaining 30 km of the journey. What
184 is the average speed?
Steven D'Aprano22873182016-08-24 02:34:25 +1000185
186 .. doctest::
187
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800188 >>> harmonic_mean([40, 60], weights=[5, 30])
189 56.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000190
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800191 :exc:`StatisticsError` is raised if *data* is empty, any element
192 is less than zero, or if the weighted sum isn't positive.
Steven D'Aprano22873182016-08-24 02:34:25 +1000193
Raymond Hettinger7f460492019-11-06 21:50:44 -0800194 The current algorithm has an early-out when it encounters a zero
195 in the input. This means that the subsequent inputs are not tested
196 for validity. (This behavior may change in the future.)
197
Zachary Warec019bd32016-08-23 13:23:31 -0500198 .. versionadded:: 3.6
199
Zackery Spytz66136762021-01-03 05:35:26 -0700200 .. versionchanged:: 3.10
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800201 Added support for *weights*.
Steven D'Aprano22873182016-08-24 02:34:25 +1000202
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700203.. function:: median(data)
204
Georg Brandleb2aeec2013-10-21 08:57:26 +0200205 Return the median (middle value) of numeric data, using the common "mean of
206 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800207 *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700208
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700209 The median is a robust measure of central location and is less affected by
210 the presence of outliers. When the number of data points is odd, the
211 middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700212
213 .. doctest::
214
215 >>> median([1, 3, 5])
216 3
217
Georg Brandleb2aeec2013-10-21 08:57:26 +0200218 When the number of data points is even, the median is interpolated by taking
219 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700220
221 .. doctest::
222
223 >>> median([1, 3, 5, 7])
224 4.0
225
Georg Brandleb2aeec2013-10-21 08:57:26 +0200226 This is suited for when your data is discrete, and you don't mind that the
227 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700228
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700229 If the data is ordinal (supports order operations) but not numeric (doesn't
230 support addition), consider using :func:`median_low` or :func:`median_high`
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300231 instead.
232
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700233.. function:: median_low(data)
234
Georg Brandleb2aeec2013-10-21 08:57:26 +0200235 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800236 :exc:`StatisticsError` is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700237
Georg Brandleb2aeec2013-10-21 08:57:26 +0200238 The low median is always a member of the data set. When the number of data
239 points is odd, the middle value is returned. When it is even, the smaller of
240 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700241
242 .. doctest::
243
244 >>> median_low([1, 3, 5])
245 3
246 >>> median_low([1, 3, 5, 7])
247 3
248
Georg Brandleb2aeec2013-10-21 08:57:26 +0200249 Use the low median when your data are discrete and you prefer the median to
250 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700251
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700252
253.. function:: median_high(data)
254
Georg Brandleb2aeec2013-10-21 08:57:26 +0200255 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800256 is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700257
Georg Brandleb2aeec2013-10-21 08:57:26 +0200258 The high median is always a member of the data set. When the number of data
259 points is odd, the middle value is returned. When it is even, the larger of
260 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700261
262 .. doctest::
263
264 >>> median_high([1, 3, 5])
265 3
266 >>> median_high([1, 3, 5, 7])
267 5
268
Georg Brandleb2aeec2013-10-21 08:57:26 +0200269 Use the high median when your data are discrete and you prefer the median to
270 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700271
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700272
Georg Brandleb2aeec2013-10-21 08:57:26 +0200273.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700274
Georg Brandleb2aeec2013-10-21 08:57:26 +0200275 Return the median of grouped continuous data, calculated as the 50th
276 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800277 is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700278
279 .. doctest::
280
281 >>> median_grouped([52, 52, 53, 54])
282 52.5
283
Georg Brandleb2aeec2013-10-21 08:57:26 +0200284 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200285 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
286 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
287 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200288 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700289
290 .. doctest::
291
292 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
293 3.7
294
Georg Brandleb2aeec2013-10-21 08:57:26 +0200295 Optional argument *interval* represents the class interval, and defaults
296 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700297
298 .. doctest::
299
300 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
301 3.25
302 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
303 3.5
304
305 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200306 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700307
308 .. impl-detail::
309
Georg Brandleb2aeec2013-10-21 08:57:26 +0200310 Under some circumstances, :func:`median_grouped` may coerce data points to
311 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700312
313 .. seealso::
314
Georg Brandleb2aeec2013-10-21 08:57:26 +0200315 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
316 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700317
Georg Brandleb2aeec2013-10-21 08:57:26 +0200318 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100319 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200320 function in the Gnome Gnumeric spreadsheet, including `this discussion
321 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700322
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700323
324.. function:: mode(data)
325
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700326 Return the single most common data point from discrete or nominal *data*.
327 The mode (when it exists) is the most typical value and serves as a
328 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200329
Raymond Hettingere4810b22019-09-05 00:18:47 -0700330 If there are multiple modes with the same frequency, returns the first one
331 encountered in the *data*. If the smallest or largest of those is
332 desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
333 If the input *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700334
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700335 ``mode`` assumes discrete data and returns a single value. This is the
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700336 standard treatment of the mode as commonly taught in schools:
337
338 .. doctest::
339
340 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
341 3
342
Raymond Hettingere4810b22019-09-05 00:18:47 -0700343 The mode is unique in that it is the only statistic in this package that
344 also applies to nominal (non-numeric) data:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700345
346 .. doctest::
347
348 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
349 'red'
350
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700351 .. versionchanged:: 3.8
352 Now handles multimodal datasets by returning the first mode encountered.
353 Formerly, it raised :exc:`StatisticsError` when more than one mode was
354 found.
355
356
357.. function:: multimode(data)
358
359 Return a list of the most frequently occurring values in the order they
360 were first encountered in the *data*. Will return more than one result if
361 there are multiple modes or an empty list if the *data* is empty:
362
363 .. doctest::
364
365 >>> multimode('aabbbbccddddeeffffgg')
366 ['b', 'd', 'f']
367 >>> multimode('')
368 []
369
370 .. versionadded:: 3.8
371
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700372
Georg Brandleb2aeec2013-10-21 08:57:26 +0200373.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700374
Georg Brandleb2aeec2013-10-21 08:57:26 +0200375 Return the population standard deviation (the square root of the population
376 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700377
378 .. doctest::
379
380 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
381 0.986893273527251
382
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700383
Georg Brandleb2aeec2013-10-21 08:57:26 +0200384.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700385
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800386 Return the population variance of *data*, a non-empty sequence or iterable
Raymond Hettingere4810b22019-09-05 00:18:47 -0700387 of real-valued numbers. Variance, or second moment about the mean, is a
388 measure of the variability (spread or dispersion) of data. A large
389 variance indicates that the data is spread out; a small variance indicates
390 it is clustered closely around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700391
Raymond Hettingere4810b22019-09-05 00:18:47 -0700392 If the optional second argument *mu* is given, it is typically the mean of
393 the *data*. It can also be used to compute the second moment around a
394 point that is not the mean. If it is missing or ``None`` (the default),
395 the arithmetic mean is automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700396
Georg Brandleb2aeec2013-10-21 08:57:26 +0200397 Use this function to calculate the variance from the entire population. To
398 estimate the variance from a sample, the :func:`variance` function is usually
399 a better choice.
400
401 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700402
403 Examples:
404
405 .. doctest::
406
407 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
408 >>> pvariance(data)
409 1.25
410
Georg Brandleb2aeec2013-10-21 08:57:26 +0200411 If you have already calculated the mean of your data, you can pass it as the
412 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700413
414 .. doctest::
415
416 >>> mu = mean(data)
417 >>> pvariance(data, mu)
418 1.25
419
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700420 Decimals and Fractions are supported:
421
422 .. doctest::
423
424 >>> from decimal import Decimal as D
425 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
426 Decimal('24.815')
427
428 >>> from fractions import Fraction as F
429 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
430 Fraction(13, 72)
431
432 .. note::
433
Georg Brandleb2aeec2013-10-21 08:57:26 +0200434 When called with the entire population, this gives the population variance
435 σ². When called on a sample instead, this is the biased sample variance
436 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700437
Raymond Hettingere4810b22019-09-05 00:18:47 -0700438 If you somehow know the true population mean μ, you may use this
439 function to calculate the variance of a sample, giving the known
440 population mean as the second argument. Provided the data points are a
441 random sample of the population, the result will be an unbiased estimate
442 of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700443
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700444
Georg Brandleb2aeec2013-10-21 08:57:26 +0200445.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700446
Georg Brandleb2aeec2013-10-21 08:57:26 +0200447 Return the sample standard deviation (the square root of the sample
448 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700449
450 .. doctest::
451
452 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
453 1.0810874155219827
454
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700455
Georg Brandleb2aeec2013-10-21 08:57:26 +0200456.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700457
Georg Brandleb2aeec2013-10-21 08:57:26 +0200458 Return the sample variance of *data*, an iterable of at least two real-valued
459 numbers. Variance, or second moment about the mean, is a measure of the
460 variability (spread or dispersion) of data. A large variance indicates that
461 the data is spread out; a small variance indicates it is clustered closely
462 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700463
Georg Brandleb2aeec2013-10-21 08:57:26 +0200464 If the optional second argument *xbar* is given, it should be the mean of
465 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700466 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700467
Georg Brandleb2aeec2013-10-21 08:57:26 +0200468 Use this function when your data is a sample from a population. To calculate
469 the variance from the entire population, see :func:`pvariance`.
470
471 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700472
473 Examples:
474
475 .. doctest::
476
477 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
478 >>> variance(data)
479 1.3720238095238095
480
Georg Brandleb2aeec2013-10-21 08:57:26 +0200481 If you have already calculated the mean of your data, you can pass it as the
482 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700483
484 .. doctest::
485
486 >>> m = mean(data)
487 >>> variance(data, m)
488 1.3720238095238095
489
Georg Brandleb2aeec2013-10-21 08:57:26 +0200490 This function does not attempt to verify that you have passed the actual mean
491 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700492 impossible results.
493
494 Decimal and Fraction values are supported:
495
496 .. doctest::
497
498 >>> from decimal import Decimal as D
499 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
500 Decimal('31.01875')
501
502 >>> from fractions import Fraction as F
503 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
504 Fraction(67, 108)
505
506 .. note::
507
Georg Brandleb2aeec2013-10-21 08:57:26 +0200508 This is the sample variance s² with Bessel's correction, also known as
509 variance with N-1 degrees of freedom. Provided that the data points are
510 representative (e.g. independent and identically distributed), the result
511 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700512
Georg Brandleb2aeec2013-10-21 08:57:26 +0200513 If you somehow know the actual population mean μ you should pass it to the
514 :func:`pvariance` function as the *mu* parameter to get the variance of a
515 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700516
Raymond Hettingere4810b22019-09-05 00:18:47 -0700517.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700518
Raymond Hettingere4810b22019-09-05 00:18:47 -0700519 Divide *data* into *n* continuous intervals with equal probability.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700520 Returns a list of ``n - 1`` cut points separating the intervals.
521
522 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
523 *n* to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700524 *data* into 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700525 is not least 1.
526
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700527 The *data* can be any iterable containing sample data. For meaningful
Raymond Hettingere4810b22019-09-05 00:18:47 -0700528 results, the number of data points in *data* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700529 Raises :exc:`StatisticsError` if there are not at least two data points.
530
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700531 The cut points are linearly interpolated from the
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700532 two nearest data points. For example, if a cut point falls one-third
533 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700534 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700535
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700536 The *method* for computing quantiles can be varied depending on
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700537 whether the *data* includes or excludes the lowest and
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700538 highest possible values from the population.
539
540 The default *method* is "exclusive" and is used for data sampled from
541 a population that can have more extreme values than found in the
542 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700543 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
544 sample values, the method sorts them and assigns the following
545 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700546
547 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700548 data or for samples that are known to include the most extreme values
Raymond Hettingere4810b22019-09-05 00:18:47 -0700549 from the population. The minimum value in *data* is treated as the 0th
Raymond Hettingerb530a442019-07-21 16:32:00 -0700550 percentile and the maximum value is treated as the 100th percentile.
551 The portion of the population falling below the *i-th* of *m* sorted
552 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
553 values, the method sorts them and assigns the following percentiles:
554 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700555
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700556 .. doctest::
557
558 # Decile cut points for empirically sampled data
559 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
560 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
561 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
562 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
563 ... 103, 107, 101, 81, 109, 104]
564 >>> [round(q, 1) for q in quantiles(data, n=10)]
565 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
566
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700567 .. versionadded:: 3.8
568
569
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700570Exceptions
571----------
572
573A single exception is defined:
574
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400575.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700576
Benjamin Peterson44c30652013-10-20 17:52:09 -0400577 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700578
Raymond Hettinger11c79532019-02-23 14:44:07 -0800579
580:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700581---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800582
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800583:class:`NormalDist` is a tool for creating and manipulating normal
584distributions of a `random variable
585<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700586class that treats the mean and standard deviation of data
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800587measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800588
589Normal distributions arise from the `Central Limit Theorem
590<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800591of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800592
593.. class:: NormalDist(mu=0.0, sigma=1.0)
594
595 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800596 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800597 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800598 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800599
600 If *sigma* is negative, raises :exc:`StatisticsError`.
601
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800602 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800603
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800604 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800605 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
606 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800607
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700608 .. attribute:: median
609
610 A read-only property for the `median
611 <https://en.wikipedia.org/wiki/Median>`_ of a normal
612 distribution.
613
614 .. attribute:: mode
615
616 A read-only property for the `mode
617 <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
618 distribution.
619
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800620 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800621
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800622 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800623 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
624 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800625
626 .. attribute:: variance
627
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800628 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800629 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
630 distribution. Equal to the square of the standard deviation.
631
632 .. classmethod:: NormalDist.from_samples(data)
633
Raymond Hettingere4810b22019-09-05 00:18:47 -0700634 Makes a normal distribution instance with *mu* and *sigma* parameters
635 estimated from the *data* using :func:`fmean` and :func:`stdev`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800636
Raymond Hettingere4810b22019-09-05 00:18:47 -0700637 The *data* can be any :term:`iterable` and should consist of values
638 that can be converted to type :class:`float`. If *data* does not
639 contain at least two elements, raises :exc:`StatisticsError` because it
640 takes at least one point to estimate a central value and at least two
641 points to estimate dispersion.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800642
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700643 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800644
645 Generates *n* random samples for a given mean and standard deviation.
646 Returns a :class:`list` of :class:`float` values.
647
648 If *seed* is given, creates a new instance of the underlying random
649 number generator. This is useful for creating reproducible results,
650 even in a multi-threading context.
651
652 .. method:: NormalDist.pdf(x)
653
654 Using a `probability density function (pdf)
Raymond Hettingere4810b22019-09-05 00:18:47 -0700655 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
656 the relative likelihood that a random variable *X* will be near the
657 given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
658 X < x+dx) / dx`` as *dx* approaches zero.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800659
Raymond Hettingercc353a02019-03-10 23:43:33 -0700660 The relative likelihood is computed as the probability of a sample
661 occurring in a narrow range divided by the width of the range (hence
662 the word "density"). Since the likelihood is relative to other points,
663 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800664
665 .. method:: NormalDist.cdf(x)
666
667 Using a `cumulative distribution function (cdf)
668 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800669 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800670 equal to *x*. Mathematically, it is written ``P(X <= x)``.
671
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700672 .. method:: NormalDist.inv_cdf(p)
673
674 Compute the inverse cumulative distribution function, also known as the
675 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
676 or the `percent-point
677 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
678 function. Mathematically, it is written ``x : P(X <= x) = p``.
679
680 Finds the value *x* of the random variable *X* such that the
681 probability of the variable being less than or equal to that value
682 equals the given probability *p*.
683
Raymond Hettinger318d5372019-03-06 22:59:40 -0800684 .. method:: NormalDist.overlap(other)
685
Raymond Hettingere4810b22019-09-05 00:18:47 -0700686 Measures the agreement between two normal probability distributions.
687 Returns a value between 0.0 and 1.0 giving `the overlapping area for
688 the two probability density functions
689 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800690
Raymond Hettinger8a6cbf82019-10-13 19:53:30 -0700691 .. method:: NormalDist.quantiles(n=4)
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700692
693 Divide the normal distribution into *n* continuous intervals with
694 equal probability. Returns a list of (n - 1) cut points separating
695 the intervals.
696
697 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles.
698 Set *n* to 100 for percentiles which gives the 99 cuts points that
699 separate the normal distribution into 100 equal sized groups.
700
Raymond Hettinger70f027d2020-04-16 10:25:14 -0700701 .. method:: NormalDist.zscore(x)
702
703 Compute the
704 `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
705 describing *x* in terms of the number of standard deviations
706 above or below the mean of the normal distribution:
707 ``(x - mean) / stdev``.
708
709 .. versionadded:: 3.9
710
Raymond Hettinger11c79532019-02-23 14:44:07 -0800711 Instances of :class:`NormalDist` support addition, subtraction,
712 multiplication and division by a constant. These operations
713 are used for translation and scaling. For example:
714
715 .. doctest::
716
717 >>> temperature_february = NormalDist(5, 2.5) # Celsius
718 >>> temperature_february * (9/5) + 32 # Fahrenheit
719 NormalDist(mu=41.0, sigma=4.5)
720
Raymond Hettingercc353a02019-03-10 23:43:33 -0700721 Dividing a constant by an instance of :class:`NormalDist` is not supported
722 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800723
724 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800725 variables, it is possible to `add and subtract two independent normally
726 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800727 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
728 represented as instances of :class:`NormalDist`. For example:
729
730 .. doctest::
731
732 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
733 >>> drug_effects = NormalDist(0.4, 0.15)
734 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700735 >>> round(combined.mean, 1)
736 3.1
737 >>> round(combined.stdev, 1)
738 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800739
740 .. versionadded:: 3.8
741
742
743:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700744^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800745
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800746:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800747
748For example, given `historical data for SAT exams
Raymond Hettinger01bf2192020-01-27 18:31:46 -0800749<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
750that scores are normally distributed with a mean of 1060 and a standard
751deviation of 195, determine the percentage of students with test scores
752between 1100 and 1200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800753
754.. doctest::
755
756 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800757 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700758 >>> round(fraction * 100.0, 1)
759 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800760
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700761Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
762<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
763
764.. doctest::
765
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700766 >>> list(map(round, sat.quantiles()))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700767 [928, 1060, 1192]
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700768 >>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700769 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
770
Raymond Hettinger11c79532019-02-23 14:44:07 -0800771To estimate the distribution for a model than isn't easy to solve
772analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700773Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800774
775.. doctest::
776
Raymond Hettingercc353a02019-03-10 23:43:33 -0700777 >>> def model(x, y, z):
778 ... return (3*x + 7*x*y - 5*y) / (11 * z)
779 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800780 >>> n = 100_000
Raymond Hettingere4810b22019-09-05 00:18:47 -0700781 >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
782 >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
783 >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
784 >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
785 [1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger11c79532019-02-23 14:44:07 -0800786
Raymond Hettinger10355ed2020-01-25 20:21:17 -0800787Normal distributions can be used to approximate `Binomial
788distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
789when the sample size is large and when the probability of a successful
790trial is near 50%.
791
792For example, an open source conference has 750 attendees and two rooms with a
793500 person capacity. There is a talk about Python and another about Ruby.
794In previous conferences, 65% of the attendees preferred to listen to Python
795talks. Assuming the population preferences haven't changed, what is the
Raymond Hettinger01bf2192020-01-27 18:31:46 -0800796probability that the Python room will stay within its capacity limits?
Raymond Hettinger10355ed2020-01-25 20:21:17 -0800797
798.. doctest::
799
800 >>> n = 750 # Sample size
801 >>> p = 0.65 # Preference for Python
802 >>> q = 1.0 - p # Preference for Ruby
803 >>> k = 500 # Room capacity
804
805 >>> # Approximation using the cumulative normal distribution
806 >>> from math import sqrt
807 >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
808 0.8402
809
810 >>> # Solution using the cumulative binomial distribution
811 >>> from math import comb, fsum
812 >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
813 0.8402
814
815 >>> # Approximation using a simulation
816 >>> from random import seed, choices
817 >>> seed(8675309)
818 >>> def trial():
819 ... return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
820 >>> mean(trial() <= k for i in range(10_000))
821 0.8398
822
Raymond Hettinger11c79532019-02-23 14:44:07 -0800823Normal distributions commonly arise in machine learning problems.
824
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800825Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800826<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
827The challenge is to predict a person's gender from measurements of normally
828distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800829
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800830We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800831measurements are assumed to be normally distributed, so we summarize the data
832with :class:`NormalDist`:
833
834.. doctest::
835
836 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
837 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
838 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
839 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
840 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
841 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
842
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800843Next, we encounter a new person whose feature measurements are known but whose
844gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800845
846.. doctest::
847
848 >>> ht = 6.0 # height
849 >>> wt = 130 # weight
850 >>> fs = 8 # foot size
851
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800852Starting with a 50% `prior probability
853<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
854we compute the posterior as the prior times the product of likelihoods for the
855feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800856
857.. doctest::
858
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800859 >>> prior_male = 0.5
860 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800861 >>> posterior_male = (prior_male * height_male.pdf(ht) *
862 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
863
864 >>> posterior_female = (prior_female * height_female.pdf(ht) *
865 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
866
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800867The final prediction goes to the largest posterior. This is known as the
868`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800869<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
870
871.. doctest::
872
873 >>> 'male' if posterior_male > posterior_female else 'female'
874 'female'
875
876
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700877..
878 # This modelines must appear within the last ten lines of the file.
879 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;