Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: a906a591e62cef1bf251e766f41652f21d4818d2 [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
				29	currently unsupported. Mixed types are also undefined and
				30	implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, e.g.
				32	``map(float, input_data)``.
				33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	40	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	43	:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	44	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	45	:func:`median` Median (middle value) of data.
				46	:func:`median_low` Low median of data.
				47	:func:`median_high` High median of data.
				48	:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	49	:func:`mode` Single mode (most common value) of discrete or nominal data.
				50	:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	51	:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	52	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	53
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	54	Measures of spread
				55	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	56
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	57	These functions calculate a measure of how much the population or sample
				58	tends to deviate from the typical or average values.
				59
				60	======================= =============================================
				61	:func:`pstdev` Population standard deviation of data.
				62	:func:`pvariance` Population variance of data.
				63	:func:`stdev` Sample standard deviation of data.
				64	:func:`variance` Sample variance of data.
				65	======================= =============================================
				66
				67
				68	Function details
				69	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	70
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	71	Note: The functions do not require the data given to them to be sorted.
				72	However, for reading convenience, most of the examples show sorted sequences.
				73
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74	.. function:: mean(data)
				75
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	76	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	77
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	78	The arithmetic mean is the sum of the data divided by the number of data
				79	points. It is commonly called "the average", although it is only one of many
				80	different mathematical averages. It is a measure of the central location of
				81	the data.
				82
				83	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	84
				85	Some examples of use:
				86
				87	.. doctest::
				88
				89	>>> mean([1, 2, 3, 4, 4])
				90	2.8
				91	>>> mean([-1.0, 2.5, 3.25, 5.75])
				92	2.625
				93
				94	>>> from fractions import Fraction as F
				95	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				96	Fraction(13, 21)
				97
				98	>>> from decimal import Decimal as D
				99	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				100	Decimal('0.5625')
				101
				102	.. note::
				103
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	104	The mean is strongly affected by outliers and is not a robust estimator
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	105	for central location: the mean is not necessarily a typical example of the
				106	data points. For more robust, although less efficient, measures of
				107	central location, see :func:`median` and :func:`mode`. (In this case,
				108	"efficient" refers to statistical efficiency rather than computational
				109	efficiency.)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	110
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	111	The sample mean gives an unbiased estimate of the true population mean,
				112	which means that, taken on average over all the possible samples,
				113	``mean(sample)`` converges on the true mean of the entire population. If
				114	data represents the entire population rather than a sample, then
				115	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	116
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	117
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	118	.. function:: fmean(data)
				119
				120	Convert data to floats and compute the arithmetic mean.
				121
				122	This runs faster than the :func:`mean` function and it always returns a
				123	:class:`float`. The result is highly accurate but not as perfect as
				124	:func:`mean`. If the input dataset is empty, raises a
				125	:exc:`StatisticsError`.
				126
				127	.. doctest::
				128
				129	>>> fmean([3.5, 4.0, 5.25])
				130	4.25
				131
				132	.. versionadded:: 3.8
				133
				134
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	135	.. function:: geometric_mean(data)
				136
				137	Convert data to floats and compute the geometric mean.
				138
				139	Raises a :exc:`StatisticsError` if the input dataset is empty,
				140	if it contains a zero, or if it contains a negative value.
				141
				142	No special efforts are made to achieve exact results.
				143	(However, this may change in the future.)
				144
				145	.. doctest::
				146
				147	>>> round(geometric_mean([54, 24, 36]), 9)
				148	36.0
				149
				150	.. versionadded:: 3.8
				151
				152
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	153	.. function:: harmonic_mean(data)
				154
				155	Return the harmonic mean of data, a sequence or iterator of
				156	real-valued numbers.
				157
				158	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	159	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	160	data. For example, the harmonic mean of three values a, b and c
				161	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				162
				163	The harmonic mean is a type of average, a measure of the central
				164	location of the data. It is often appropriate when averaging quantities
				165	which are rates or ratios, for example speeds. For example:
				166
				167	Suppose an investor purchases an equal value of shares in each of
				168	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				169	What is the average P/E ratio for the investor's portfolio?
				170
				171	.. doctest::
				172
				173	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				174	3.6
				175
				176	Using the arithmetic mean would give an average of about 5.167, which
				177	is too high.
				178
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	179	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	180	is less than zero.
				181
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	182	.. versionadded:: 3.6
				183
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	184
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	185	.. function:: median(data)
				186
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	187	Return the median (middle value) of numeric data, using the common "mean of
				188	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	189	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	190
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	191	The median is a robust measure of central location, and is less affected by
				192	the presence of outliers in your data. When the number of data points is
				193	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	194
				195	.. doctest::
				196
				197	>>> median([1, 3, 5])
				198	3
				199
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	200	When the number of data points is even, the median is interpolated by taking
				201	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	202
				203	.. doctest::
				204
				205	>>> median([1, 3, 5, 7])
				206	4.0
				207
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	208	This is suited for when your data is discrete, and you don't mind that the
				209	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	210
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	211	If your data is ordinal (supports order operations) but not numeric (doesn't
				212	support addition), you should use :func:`median_low` or :func:`median_high`
				213	instead.
				214
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	215	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	216
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	217
				218	.. function:: median_low(data)
				219
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	220	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	221	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	222
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	223	The low median is always a member of the data set. When the number of data
				224	points is odd, the middle value is returned. When it is even, the smaller of
				225	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	226
				227	.. doctest::
				228
				229	>>> median_low([1, 3, 5])
				230	3
				231	>>> median_low([1, 3, 5, 7])
				232	3
				233
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	234	Use the low median when your data are discrete and you prefer the median to
				235	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	236
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	237
				238	.. function:: median_high(data)
				239
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	240	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	241	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	242
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	243	The high median is always a member of the data set. When the number of data
				244	points is odd, the middle value is returned. When it is even, the larger of
				245	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	246
				247	.. doctest::
				248
				249	>>> median_high([1, 3, 5])
				250	3
				251	>>> median_high([1, 3, 5, 7])
				252	5
				253
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	254	Use the high median when your data are discrete and you prefer the median to
				255	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	256
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	257
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	258	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	259
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	260	Return the median of grouped continuous data, calculated as the 50th
				261	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	262	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	263
				264	.. doctest::
				265
				266	>>> median_grouped([52, 52, 53, 54])
				267	52.5
				268
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	269	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	270	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				271	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				272	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	273	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	274
				275	.. doctest::
				276
				277	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				278	3.7
				279
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	280	Optional argument interval represents the class interval, and defaults
				281	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	282
				283	.. doctest::
				284
				285	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				286	3.25
				287	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				288	3.5
				289
				290	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	291	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	292
				293	.. impl-detail::
				294
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	295	Under some circumstances, :func:`median_grouped` may coerce data points to
				296	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	297
				298	.. seealso::
				299
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	300	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				301	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	302
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	303	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	304	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	305	function in the Gnome Gnumeric spreadsheet, including `this discussion
				306	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	307
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	308
				309	.. function:: mode(data)
				310
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	311	Return the single most common data point from discrete or nominal data.
				312	The mode (when it exists) is the most typical value and serves as a
				313	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	314
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	315	If there are multiple modes, returns the first one encountered in the data.
				316	If data is empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	317
				318	``mode`` assumes discrete data, and returns a single value. This is the
				319	standard treatment of the mode as commonly taught in schools:
				320
				321	.. doctest::
				322
				323	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				324	3
				325
				326	The mode is unique in that it is the only statistic which also applies
				327	to nominal (non-numeric) data:
				328
				329	.. doctest::
				330
				331	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				332	'red'
				333
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	334	.. versionchanged:: 3.8
				335	Now handles multimodal datasets by returning the first mode encountered.
				336	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				337	found.
				338
				339
				340	.. function:: multimode(data)
				341
				342	Return a list of the most frequently occurring values in the order they
				343	were first encountered in the data. Will return more than one result if
				344	there are multiple modes or an empty list if the data is empty:
				345
				346	.. doctest::
				347
				348	>>> multimode('aabbbbccddddeeffffgg')
				349	['b', 'd', 'f']
				350	>>> multimode('')
				351	[]
				352
				353	.. versionadded:: 3.8
				354
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	355
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	356	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	357
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	358	Return the population standard deviation (the square root of the population
				359	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	360
				361	.. doctest::
				362
				363	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				364	0.986893273527251
				365
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	366
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	367	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	368
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	369	Return the population variance of data, a non-empty iterable of real-valued
				370	numbers. Variance, or second moment about the mean, is a measure of the
				371	variability (spread or dispersion) of data. A large variance indicates that
				372	the data is spread out; a small variance indicates it is clustered closely
				373	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	374
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	375	If the optional second argument mu is given, it should be the mean of
				376	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	377	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	378
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	379	Use this function to calculate the variance from the entire population. To
				380	estimate the variance from a sample, the :func:`variance` function is usually
				381	a better choice.
				382
				383	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	384
				385	Examples:
				386
				387	.. doctest::
				388
				389	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				390	>>> pvariance(data)
				391	1.25
				392
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	393	If you have already calculated the mean of your data, you can pass it as the
				394	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	395
				396	.. doctest::
				397
				398	>>> mu = mean(data)
				399	>>> pvariance(data, mu)
				400	1.25
				401
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	402	This function does not attempt to verify that you have passed the actual mean
				403	as mu. Using arbitrary values for mu may lead to invalid or impossible
				404	results.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	405
				406	Decimals and Fractions are supported:
				407
				408	.. doctest::
				409
				410	>>> from decimal import Decimal as D
				411	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				412	Decimal('24.815')
				413
				414	>>> from fractions import Fraction as F
				415	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				416	Fraction(13, 72)
				417
				418	.. note::
				419
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	420	When called with the entire population, this gives the population variance
				421	σ². When called on a sample instead, this is the biased sample variance
				422	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	423
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	424	If you somehow know the true population mean μ, you may use this function
				425	to calculate the variance of a sample, giving the known population mean as
				426	the second argument. Provided the data points are representative
				427	(e.g. independent and identically distributed), the result will be an
				428	unbiased estimate of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	429
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	430
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	431	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	432
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	433	Return the sample standard deviation (the square root of the sample
				434	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	435
				436	.. doctest::
				437
				438	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				439	1.0810874155219827
				440
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	441
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	442	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	443
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	444	Return the sample variance of data, an iterable of at least two real-valued
				445	numbers. Variance, or second moment about the mean, is a measure of the
				446	variability (spread or dispersion) of data. A large variance indicates that
				447	the data is spread out; a small variance indicates it is clustered closely
				448	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	449
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	450	If the optional second argument xbar is given, it should be the mean of
				451	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	452	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	453
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	454	Use this function when your data is a sample from a population. To calculate
				455	the variance from the entire population, see :func:`pvariance`.
				456
				457	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	458
				459	Examples:
				460
				461	.. doctest::
				462
				463	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				464	>>> variance(data)
				465	1.3720238095238095
				466
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	467	If you have already calculated the mean of your data, you can pass it as the
				468	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	469
				470	.. doctest::
				471
				472	>>> m = mean(data)
				473	>>> variance(data, m)
				474	1.3720238095238095
				475
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	476	This function does not attempt to verify that you have passed the actual mean
				477	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	478	impossible results.
				479
				480	Decimal and Fraction values are supported:
				481
				482	.. doctest::
				483
				484	>>> from decimal import Decimal as D
				485	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				486	Decimal('31.01875')
				487
				488	>>> from fractions import Fraction as F
				489	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				490	Fraction(67, 108)
				491
				492	.. note::
				493
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	494	This is the sample variance s² with Bessel's correction, also known as
				495	variance with N-1 degrees of freedom. Provided that the data points are
				496	representative (e.g. independent and identically distributed), the result
				497	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	498
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	499	If you somehow know the actual population mean μ you should pass it to the
				500	:func:`pvariance` function as the mu parameter to get the variance of a
				501	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	502
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	503	.. function:: quantiles(dist, *, n=4, method='exclusive')
				504
				505	Divide dist into n continuous intervals with equal probability.
				506	Returns a list of ``n - 1`` cut points separating the intervals.
				507
				508	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				509	n to 100 for percentiles which gives the 99 cuts points that separate
				510	dist in to 100 equal sized groups. Raises :exc:`StatisticsError` if n
				511	is not least 1.
				512
				513	The dist can be any iterable containing sample data or it can be an
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	514	instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
				515	results, the number of data points in dist should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	516	Raises :exc:`StatisticsError` if there are not at least two data points.
				517
				518	For sample data, the cut points are linearly interpolated from the
				519	two nearest data points. For example, if a cut point falls one-third
				520	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	521	cut-point will evaluate to ``104``.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	522
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	523	The method for computing quantiles can be varied depending on
				524	whether the data in dist includes or excludes the lowest and
				525	highest possible values from the population.
				526
				527	The default method is "exclusive" and is used for data sampled from
				528	a population that can have more extreme values than found in the
				529	samples. The portion of the population falling below the i-th of
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame^]	530	m sorted data points is computed as ``i / (m + 1)``. Given nine
				531	sample values, the method sorts them and assigns the following
				532	percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	533
				534	Setting the method to "inclusive" is used for describing population
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame^]	535	data or for samples that are known to include the most extreme values
				536	from the population. The minimum value in dist is treated as the 0th
				537	percentile and the maximum value is treated as the 100th percentile.
				538	The portion of the population falling below the i-th of m sorted
				539	data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
				540	values, the method sorts them and assigns the following percentiles:
				541	0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	542
				543	If dist is an instance of a class that defines an
				544	:meth:`~inv_cdf` method, setting method has no effect.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	545
				546	.. doctest::
				547
				548	# Decile cut points for empirically sampled data
				549	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				550	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				551	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				552	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				553	... 103, 107, 101, 81, 109, 104]
				554	>>> [round(q, 1) for q in quantiles(data, n=10)]
				555	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
				556
				557	>>> # Quartile cut points for the standard normal distibution
				558	>>> Z = NormalDist()
				559	>>> [round(q, 4) for q in quantiles(Z, n=4)]
				560	[-0.6745, 0.0, 0.6745]
				561
				562	.. versionadded:: 3.8
				563
				564
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	565	Exceptions
				566	----------
				567
				568	A single exception is defined:
				569
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	570	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	571
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	572	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	573
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	574
				575	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	576	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	577
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	578	:class:`NormalDist` is a tool for creating and manipulating normal
				579	distributions of a `random variable
				580	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
				581	composite class that treats the mean and standard deviation of data
				582	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	583
				584	Normal distributions arise from the `Central Limit Theorem
				585	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	586	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	587
				588	.. class:: NormalDist(mu=0.0, sigma=1.0)
				589
				590	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	591	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	592	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	593	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	594
				595	If sigma is negative, raises :exc:`StatisticsError`.
				596
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	597	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	598
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	599	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	600	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				601	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	602
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	603	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	604
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	605	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	606	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				607	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	608
				609	.. attribute:: variance
				610
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	611	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	612	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				613	distribution. Equal to the square of the standard deviation.
				614
				615	.. classmethod:: NormalDist.from_samples(data)
				616
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	617	Makes a normal distribution instance computed from sample data. The
				618	data can be any :term:`iterable` and should consist of values that
				619	can be converted to type :class:`float`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	620
				621	If data does not contain at least two elements, raises
				622	:exc:`StatisticsError` because it takes at least one point to estimate
				623	a central value and at least two points to estimate dispersion.
				624
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	625	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	626
				627	Generates n random samples for a given mean and standard deviation.
				628	Returns a :class:`list` of :class:`float` values.
				629
				630	If seed is given, creates a new instance of the underlying random
				631	number generator. This is useful for creating reproducible results,
				632	even in a multi-threading context.
				633
				634	.. method:: NormalDist.pdf(x)
				635
				636	Using a `probability density function (pdf)
				637	<https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	638	compute the relative likelihood that a random variable X will be near
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	639	the given value x. Mathematically, it is the ratio ``P(x <= X <
				640	x+dx) / dx``.
				641
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	642	The relative likelihood is computed as the probability of a sample
				643	occurring in a narrow range divided by the width of the range (hence
				644	the word "density"). Since the likelihood is relative to other points,
				645	its value can be greater than `1.0`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	646
				647	.. method:: NormalDist.cdf(x)
				648
				649	Using a `cumulative distribution function (cdf)
				650	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	651	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	652	equal to x. Mathematically, it is written ``P(X <= x)``.
				653
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	654	.. method:: NormalDist.inv_cdf(p)
				655
				656	Compute the inverse cumulative distribution function, also known as the
				657	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				658	or the `percent-point
				659	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				660	function. Mathematically, it is written ``x : P(X <= x) = p``.
				661
				662	Finds the value x of the random variable X such that the
				663	probability of the variable being less than or equal to that value
				664	equals the given probability p.
				665
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	666	.. method:: NormalDist.overlap(other)
				667
				668	Compute the `overlapping coefficient (OVL)
				669	<http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger	14bab7a	2019-03-07 08:54:31 -0800	[diff] [blame]	670	between two normal distributions, giving a measure of agreement.
				671	Returns a value between 0.0 and 1.0 giving `the overlapping area for
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	672	the two probability density functions
Raymond Hettinger	14bab7a	2019-03-07 08:54:31 -0800	[diff] [blame]	673	<https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	674
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	675	Instances of :class:`NormalDist` support addition, subtraction,
				676	multiplication and division by a constant. These operations
				677	are used for translation and scaling. For example:
				678
				679	.. doctest::
				680
				681	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				682	>>> temperature_february * (9/5) + 32 # Fahrenheit
				683	NormalDist(mu=41.0, sigma=4.5)
				684
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	685	Dividing a constant by an instance of :class:`NormalDist` is not supported
				686	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	687
				688	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	689	variables, it is possible to `add and subtract two independent normally
				690	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	691	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				692	represented as instances of :class:`NormalDist`. For example:
				693
				694	.. doctest::
				695
				696	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				697	>>> drug_effects = NormalDist(0.4, 0.15)
				698	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	699	>>> round(combined.mean, 1)
				700	3.1
				701	>>> round(combined.stdev, 1)
				702	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	703
				704	.. versionadded:: 3.8
				705
				706
				707	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	708	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	709
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	710	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	711
				712	For example, given `historical data for SAT exams
				713	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	714	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	715	determine the percentage of students with test scores between 1100 and
				716	1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	717
				718	.. doctest::
				719
				720	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	721	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	722	>>> round(fraction * 100.0, 1)
				723	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	724
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	725	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				726	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				727
				728	.. doctest::
				729
				730	>>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
				731	[928, 1060, 1192]
				732	>>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
				733	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				734
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	735	What percentage of men and women will have the same height in `two normally
				736	distributed populations with known means and standard deviations
				737	<http://www.usablestats.com/lessons/normal>`_?
				738
				739	>>> men = NormalDist(70, 4)
				740	>>> women = NormalDist(65, 3.5)
				741	>>> ovl = men.overlap(women)
				742	>>> round(ovl * 100.0, 1)
				743	50.3
				744
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	745	To estimate the distribution for a model than isn't easy to solve
				746	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	747	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	748
				749	.. doctest::
				750
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	751	>>> def model(x, y, z):
				752	... return (3x + 7xy - 5y) / (11 * z)
				753	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	754	>>> n = 100_000
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	755	>>> X = NormalDist(10, 2.5).samples(n)
				756	>>> Y = NormalDist(15, 1.75).samples(n)
				757	>>> Z = NormalDist(5, 1.25).samples(n)
				758	>>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
				759	NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	760
				761	Normal distributions commonly arise in machine learning problems.
				762
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	763	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	764	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				765	The challenge is to predict a person's gender from measurements of normally
				766	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	767
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	768	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	769	measurements are assumed to be normally distributed, so we summarize the data
				770	with :class:`NormalDist`:
				771
				772	.. doctest::
				773
				774	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				775	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				776	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				777	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				778	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				779	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				780
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	781	Next, we encounter a new person whose feature measurements are known but whose
				782	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	783
				784	.. doctest::
				785
				786	>>> ht = 6.0 # height
				787	>>> wt = 130 # weight
				788	>>> fs = 8 # foot size
				789
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	790	Starting with a 50% `prior probability
				791	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				792	we compute the posterior as the prior times the product of likelihoods for the
				793	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	794
				795	.. doctest::
				796
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	797	>>> prior_male = 0.5
				798	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	799	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				800	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				801
				802	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				803	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				804
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	805	The final prediction goes to the largest posterior. This is known as the
				806	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	807	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				808
				809	.. doctest::
				810
				811	>>> 'male' if posterior_male > posterior_female else 'female'
				812	'female'
				813
				814
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	815	..
				816	# This modelines must appear within the last ten lines of the file.
				817	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;