Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: 04b731dcc44400df3f187e0335d02063f452ea0f [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
				29	currently unsupported. Mixed types are also undefined and
				30	implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, e.g.
				32	``map(float, input_data)``.
				33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	40	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	43	:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	44	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	45	:func:`median` Median (middle value) of data.
				46	:func:`median_low` Low median of data.
				47	:func:`median_high` High median of data.
				48	:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	49	:func:`mode` Single mode (most common value) of discrete or nominal data.
				50	:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	51	:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	52	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	53
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	54	Measures of spread
				55	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	56
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	57	These functions calculate a measure of how much the population or sample
				58	tends to deviate from the typical or average values.
				59
				60	======================= =============================================
				61	:func:`pstdev` Population standard deviation of data.
				62	:func:`pvariance` Population variance of data.
				63	:func:`stdev` Sample standard deviation of data.
				64	:func:`variance` Sample variance of data.
				65	======================= =============================================
				66
				67
				68	Function details
				69	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	70
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	71	Note: The functions do not require the data given to them to be sorted.
				72	However, for reading convenience, most of the examples show sorted sequences.
				73
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74	.. function:: mean(data)
				75
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	76	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	77
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	78	The arithmetic mean is the sum of the data divided by the number of data
				79	points. It is commonly called "the average", although it is only one of many
				80	different mathematical averages. It is a measure of the central location of
				81	the data.
				82
				83	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	84
				85	Some examples of use:
				86
				87	.. doctest::
				88
				89	>>> mean([1, 2, 3, 4, 4])
				90	2.8
				91	>>> mean([-1.0, 2.5, 3.25, 5.75])
				92	2.625
				93
				94	>>> from fractions import Fraction as F
				95	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				96	Fraction(13, 21)
				97
				98	>>> from decimal import Decimal as D
				99	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				100	Decimal('0.5625')
				101
				102	.. note::
				103
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	104	The mean is strongly affected by outliers and is not a robust estimator
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	105	for central location: the mean is not necessarily a typical example of the
				106	data points. For more robust, although less efficient, measures of
				107	central location, see :func:`median` and :func:`mode`. (In this case,
				108	"efficient" refers to statistical efficiency rather than computational
				109	efficiency.)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	110
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	111	The sample mean gives an unbiased estimate of the true population mean,
				112	which means that, taken on average over all the possible samples,
				113	``mean(sample)`` converges on the true mean of the entire population. If
				114	data represents the entire population rather than a sample, then
				115	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	116
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	117
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	118	.. function:: fmean(data)
				119
				120	Convert data to floats and compute the arithmetic mean.
				121
				122	This runs faster than the :func:`mean` function and it always returns a
				123	:class:`float`. The result is highly accurate but not as perfect as
				124	:func:`mean`. If the input dataset is empty, raises a
				125	:exc:`StatisticsError`.
				126
				127	.. doctest::
				128
				129	>>> fmean([3.5, 4.0, 5.25])
				130	4.25
				131
				132	.. versionadded:: 3.8
				133
				134
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	135	.. function:: geometric_mean(data)
				136
				137	Convert data to floats and compute the geometric mean.
				138
				139	Raises a :exc:`StatisticsError` if the input dataset is empty,
				140	if it contains a zero, or if it contains a negative value.
				141
				142	No special efforts are made to achieve exact results.
				143	(However, this may change in the future.)
				144
				145	.. doctest::
				146
				147	>>> round(geometric_mean([54, 24, 36]), 9)
				148	36.0
				149
				150	.. versionadded:: 3.8
				151
				152
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	153	.. function:: harmonic_mean(data)
				154
				155	Return the harmonic mean of data, a sequence or iterator of
				156	real-valued numbers.
				157
				158	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	159	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	160	data. For example, the harmonic mean of three values a, b and c
				161	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				162
				163	The harmonic mean is a type of average, a measure of the central
				164	location of the data. It is often appropriate when averaging quantities
				165	which are rates or ratios, for example speeds. For example:
				166
				167	Suppose an investor purchases an equal value of shares in each of
				168	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				169	What is the average P/E ratio for the investor's portfolio?
				170
				171	.. doctest::
				172
				173	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				174	3.6
				175
				176	Using the arithmetic mean would give an average of about 5.167, which
				177	is too high.
				178
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	179	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	180	is less than zero.
				181
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	182	.. versionadded:: 3.6
				183
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	184
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	185	.. function:: median(data)
				186
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	187	Return the median (middle value) of numeric data, using the common "mean of
				188	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	189	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	190
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	191	The median is a robust measure of central location, and is less affected by
				192	the presence of outliers in your data. When the number of data points is
				193	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	194
				195	.. doctest::
				196
				197	>>> median([1, 3, 5])
				198	3
				199
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	200	When the number of data points is even, the median is interpolated by taking
				201	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	202
				203	.. doctest::
				204
				205	>>> median([1, 3, 5, 7])
				206	4.0
				207
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	208	This is suited for when your data is discrete, and you don't mind that the
				209	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	210
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	211	If your data is ordinal (supports order operations) but not numeric (doesn't
				212	support addition), you should use :func:`median_low` or :func:`median_high`
				213	instead.
				214
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	215	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	216
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	217
				218	.. function:: median_low(data)
				219
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	220	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	221	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	222
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	223	The low median is always a member of the data set. When the number of data
				224	points is odd, the middle value is returned. When it is even, the smaller of
				225	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	226
				227	.. doctest::
				228
				229	>>> median_low([1, 3, 5])
				230	3
				231	>>> median_low([1, 3, 5, 7])
				232	3
				233
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	234	Use the low median when your data are discrete and you prefer the median to
				235	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	236
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	237
				238	.. function:: median_high(data)
				239
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	240	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	241	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	242
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	243	The high median is always a member of the data set. When the number of data
				244	points is odd, the middle value is returned. When it is even, the larger of
				245	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	246
				247	.. doctest::
				248
				249	>>> median_high([1, 3, 5])
				250	3
				251	>>> median_high([1, 3, 5, 7])
				252	5
				253
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	254	Use the high median when your data are discrete and you prefer the median to
				255	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	256
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	257
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	258	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	259
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	260	Return the median of grouped continuous data, calculated as the 50th
				261	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	262	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	263
				264	.. doctest::
				265
				266	>>> median_grouped([52, 52, 53, 54])
				267	52.5
				268
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	269	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	270	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				271	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				272	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	273	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	274
				275	.. doctest::
				276
				277	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				278	3.7
				279
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	280	Optional argument interval represents the class interval, and defaults
				281	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	282
				283	.. doctest::
				284
				285	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				286	3.25
				287	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				288	3.5
				289
				290	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	291	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	292
				293	.. impl-detail::
				294
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	295	Under some circumstances, :func:`median_grouped` may coerce data points to
				296	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	297
				298	.. seealso::
				299
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	300	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				301	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	302
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	303	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	304	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	305	function in the Gnome Gnumeric spreadsheet, including `this discussion
				306	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	307
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	308
				309	.. function:: mode(data)
				310
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	311	Return the single most common data point from discrete or nominal data.
				312	The mode (when it exists) is the most typical value and serves as a
				313	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	314
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	315	If there are multiple modes, returns the first one encountered in the data.
Raymond Hettinger	e43e7ed	2019-08-08 01:23:05 -0700	[diff] [blame]	316	If the smallest or largest of multiple modes is desired instead, use
				317	``min(multimode(data))`` or ``max(multimode(data))``. If the input data is
				318	empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	319
				320	``mode`` assumes discrete data, and returns a single value. This is the
				321	standard treatment of the mode as commonly taught in schools:
				322
				323	.. doctest::
				324
				325	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				326	3
				327
				328	The mode is unique in that it is the only statistic which also applies
				329	to nominal (non-numeric) data:
				330
				331	.. doctest::
				332
				333	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				334	'red'
				335
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	336	.. versionchanged:: 3.8
				337	Now handles multimodal datasets by returning the first mode encountered.
				338	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				339	found.
				340
				341
				342	.. function:: multimode(data)
				343
				344	Return a list of the most frequently occurring values in the order they
				345	were first encountered in the data. Will return more than one result if
				346	there are multiple modes or an empty list if the data is empty:
				347
				348	.. doctest::
				349
				350	>>> multimode('aabbbbccddddeeffffgg')
				351	['b', 'd', 'f']
				352	>>> multimode('')
				353	[]
				354
				355	.. versionadded:: 3.8
				356
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	357
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	358	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	359
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	360	Return the population standard deviation (the square root of the population
				361	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	362
				363	.. doctest::
				364
				365	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				366	0.986893273527251
				367
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	368
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	369	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	370
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	371	Return the population variance of data, a non-empty iterable of real-valued
				372	numbers. Variance, or second moment about the mean, is a measure of the
				373	variability (spread or dispersion) of data. A large variance indicates that
				374	the data is spread out; a small variance indicates it is clustered closely
				375	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	376
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	377	If the optional second argument mu is given, it should be the mean of
				378	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	379	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	380
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	381	Use this function to calculate the variance from the entire population. To
				382	estimate the variance from a sample, the :func:`variance` function is usually
				383	a better choice.
				384
				385	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	386
				387	Examples:
				388
				389	.. doctest::
				390
				391	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				392	>>> pvariance(data)
				393	1.25
				394
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	395	If you have already calculated the mean of your data, you can pass it as the
				396	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	397
				398	.. doctest::
				399
				400	>>> mu = mean(data)
				401	>>> pvariance(data, mu)
				402	1.25
				403
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	404	This function does not attempt to verify that you have passed the actual mean
				405	as mu. Using arbitrary values for mu may lead to invalid or impossible
				406	results.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	407
				408	Decimals and Fractions are supported:
				409
				410	.. doctest::
				411
				412	>>> from decimal import Decimal as D
				413	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				414	Decimal('24.815')
				415
				416	>>> from fractions import Fraction as F
				417	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				418	Fraction(13, 72)
				419
				420	.. note::
				421
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	422	When called with the entire population, this gives the population variance
				423	σ². When called on a sample instead, this is the biased sample variance
				424	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	425
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	426	If you somehow know the true population mean μ, you may use this function
				427	to calculate the variance of a sample, giving the known population mean as
				428	the second argument. Provided the data points are representative
				429	(e.g. independent and identically distributed), the result will be an
				430	unbiased estimate of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	431
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	432
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	433	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	434
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	435	Return the sample standard deviation (the square root of the sample
				436	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	437
				438	.. doctest::
				439
				440	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				441	1.0810874155219827
				442
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	443
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	444	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	445
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	446	Return the sample variance of data, an iterable of at least two real-valued
				447	numbers. Variance, or second moment about the mean, is a measure of the
				448	variability (spread or dispersion) of data. A large variance indicates that
				449	the data is spread out; a small variance indicates it is clustered closely
				450	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	451
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	452	If the optional second argument xbar is given, it should be the mean of
				453	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	454	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	455
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	456	Use this function when your data is a sample from a population. To calculate
				457	the variance from the entire population, see :func:`pvariance`.
				458
				459	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	460
				461	Examples:
				462
				463	.. doctest::
				464
				465	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				466	>>> variance(data)
				467	1.3720238095238095
				468
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	469	If you have already calculated the mean of your data, you can pass it as the
				470	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	471
				472	.. doctest::
				473
				474	>>> m = mean(data)
				475	>>> variance(data, m)
				476	1.3720238095238095
				477
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	478	This function does not attempt to verify that you have passed the actual mean
				479	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	480	impossible results.
				481
				482	Decimal and Fraction values are supported:
				483
				484	.. doctest::
				485
				486	>>> from decimal import Decimal as D
				487	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				488	Decimal('31.01875')
				489
				490	>>> from fractions import Fraction as F
				491	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				492	Fraction(67, 108)
				493
				494	.. note::
				495
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	496	This is the sample variance s² with Bessel's correction, also known as
				497	variance with N-1 degrees of freedom. Provided that the data points are
				498	representative (e.g. independent and identically distributed), the result
				499	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	500
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	501	If you somehow know the actual population mean μ you should pass it to the
				502	:func:`pvariance` function as the mu parameter to get the variance of a
				503	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	504
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	505	.. function:: quantiles(dist, *, n=4, method='exclusive')
				506
				507	Divide dist into n continuous intervals with equal probability.
				508	Returns a list of ``n - 1`` cut points separating the intervals.
				509
				510	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				511	n to 100 for percentiles which gives the 99 cuts points that separate
				512	dist in to 100 equal sized groups. Raises :exc:`StatisticsError` if n
				513	is not least 1.
				514
				515	The dist can be any iterable containing sample data or it can be an
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	516	instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
				517	results, the number of data points in dist should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	518	Raises :exc:`StatisticsError` if there are not at least two data points.
				519
				520	For sample data, the cut points are linearly interpolated from the
				521	two nearest data points. For example, if a cut point falls one-third
				522	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	523	cut-point will evaluate to ``104``.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	524
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	525	The method for computing quantiles can be varied depending on
				526	whether the data in dist includes or excludes the lowest and
				527	highest possible values from the population.
				528
				529	The default method is "exclusive" and is used for data sampled from
				530	a population that can have more extreme values than found in the
				531	samples. The portion of the population falling below the i-th of
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	532	m sorted data points is computed as ``i / (m + 1)``. Given nine
				533	sample values, the method sorts them and assigns the following
				534	percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	535
				536	Setting the method to "inclusive" is used for describing population
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	537	data or for samples that are known to include the most extreme values
				538	from the population. The minimum value in dist is treated as the 0th
				539	percentile and the maximum value is treated as the 100th percentile.
				540	The portion of the population falling below the i-th of m sorted
				541	data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
				542	values, the method sorts them and assigns the following percentiles:
				543	0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	544
				545	If dist is an instance of a class that defines an
				546	:meth:`~inv_cdf` method, setting method has no effect.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	547
				548	.. doctest::
				549
				550	# Decile cut points for empirically sampled data
				551	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				552	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				553	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				554	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				555	... 103, 107, 101, 81, 109, 104]
				556	>>> [round(q, 1) for q in quantiles(data, n=10)]
				557	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
				558
Min ho Kim	c4cacc8	2019-07-31 08:16:13 +1000	[diff] [blame]	559	>>> # Quartile cut points for the standard normal distribution
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	560	>>> Z = NormalDist()
				561	>>> [round(q, 4) for q in quantiles(Z, n=4)]
				562	[-0.6745, 0.0, 0.6745]
				563
				564	.. versionadded:: 3.8
				565
				566
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	567	Exceptions
				568	----------
				569
				570	A single exception is defined:
				571
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	572	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	573
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	574	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	575
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	576
				577	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	578	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	579
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	580	:class:`NormalDist` is a tool for creating and manipulating normal
				581	distributions of a `random variable
				582	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
				583	composite class that treats the mean and standard deviation of data
				584	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	585
				586	Normal distributions arise from the `Central Limit Theorem
				587	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	588	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	589
				590	.. class:: NormalDist(mu=0.0, sigma=1.0)
				591
				592	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	593	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	594	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	595	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	596
				597	If sigma is negative, raises :exc:`StatisticsError`.
				598
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	599	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	600
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	601	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	602	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				603	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	604
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	605	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	606
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	607	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	608	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				609	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	610
				611	.. attribute:: variance
				612
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	613	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	614	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				615	distribution. Equal to the square of the standard deviation.
				616
				617	.. classmethod:: NormalDist.from_samples(data)
				618
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	619	Makes a normal distribution instance computed from sample data. The
				620	data can be any :term:`iterable` and should consist of values that
				621	can be converted to type :class:`float`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	622
				623	If data does not contain at least two elements, raises
				624	:exc:`StatisticsError` because it takes at least one point to estimate
				625	a central value and at least two points to estimate dispersion.
				626
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	627	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	628
				629	Generates n random samples for a given mean and standard deviation.
				630	Returns a :class:`list` of :class:`float` values.
				631
				632	If seed is given, creates a new instance of the underlying random
				633	number generator. This is useful for creating reproducible results,
				634	even in a multi-threading context.
				635
				636	.. method:: NormalDist.pdf(x)
				637
				638	Using a `probability density function (pdf)
				639	<https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	640	compute the relative likelihood that a random variable X will be near
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	641	the given value x. Mathematically, it is the ratio ``P(x <= X <
				642	x+dx) / dx``.
				643
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	644	The relative likelihood is computed as the probability of a sample
				645	occurring in a narrow range divided by the width of the range (hence
				646	the word "density"). Since the likelihood is relative to other points,
				647	its value can be greater than `1.0`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	648
				649	.. method:: NormalDist.cdf(x)
				650
				651	Using a `cumulative distribution function (cdf)
				652	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	653	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	654	equal to x. Mathematically, it is written ``P(X <= x)``.
				655
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	656	.. method:: NormalDist.inv_cdf(p)
				657
				658	Compute the inverse cumulative distribution function, also known as the
				659	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				660	or the `percent-point
				661	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				662	function. Mathematically, it is written ``x : P(X <= x) = p``.
				663
				664	Finds the value x of the random variable X such that the
				665	probability of the variable being less than or equal to that value
				666	equals the given probability p.
				667
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	668	.. method:: NormalDist.overlap(other)
				669
Raymond Hettinger	8371799	2019-08-25 00:57:26 -0700	[diff] [blame]	670	Returns a value between 0.0 and 1.0 giving the overlapping area for
				671	the two probability density functions.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	672
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	673	Instances of :class:`NormalDist` support addition, subtraction,
				674	multiplication and division by a constant. These operations
				675	are used for translation and scaling. For example:
				676
				677	.. doctest::
				678
				679	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				680	>>> temperature_february * (9/5) + 32 # Fahrenheit
				681	NormalDist(mu=41.0, sigma=4.5)
				682
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	683	Dividing a constant by an instance of :class:`NormalDist` is not supported
				684	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	685
				686	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	687	variables, it is possible to `add and subtract two independent normally
				688	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	689	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				690	represented as instances of :class:`NormalDist`. For example:
				691
				692	.. doctest::
				693
				694	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				695	>>> drug_effects = NormalDist(0.4, 0.15)
				696	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	697	>>> round(combined.mean, 1)
				698	3.1
				699	>>> round(combined.stdev, 1)
				700	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	701
				702	.. versionadded:: 3.8
				703
				704
				705	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	706	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	707
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	708	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	709
				710	For example, given `historical data for SAT exams
				711	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	712	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	713	determine the percentage of students with test scores between 1100 and
				714	1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	715
				716	.. doctest::
				717
				718	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	719	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	720	>>> round(fraction * 100.0, 1)
				721	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	722
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	723	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				724	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				725
				726	.. doctest::
				727
				728	>>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
				729	[928, 1060, 1192]
				730	>>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
				731	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				732
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	733	To estimate the distribution for a model than isn't easy to solve
				734	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	735	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	736
				737	.. doctest::
				738
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	739	>>> def model(x, y, z):
				740	... return (3x + 7xy - 5y) / (11 * z)
				741	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	742	>>> n = 100_000
Raymond Hettinger	8371799	2019-08-25 00:57:26 -0700	[diff] [blame]	743	>>> seed = 86753099035768
				744	>>> X = NormalDist(10, 2.5).samples(n, seed=seed)
				745	>>> Y = NormalDist(15, 1.75).samples(n, seed=seed)
				746	>>> Z = NormalDist(50, 1.25).samples(n, seed=seed)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	747	>>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
Raymond Hettinger	8371799	2019-08-25 00:57:26 -0700	[diff] [blame]	748	NormalDist(mu=1.8661894803304777, sigma=0.65238717376862)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	749
				750	Normal distributions commonly arise in machine learning problems.
				751
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	752	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	753	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				754	The challenge is to predict a person's gender from measurements of normally
				755	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	756
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	757	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	758	measurements are assumed to be normally distributed, so we summarize the data
				759	with :class:`NormalDist`:
				760
				761	.. doctest::
				762
				763	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				764	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				765	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				766	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				767	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				768	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				769
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	770	Next, we encounter a new person whose feature measurements are known but whose
				771	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	772
				773	.. doctest::
				774
				775	>>> ht = 6.0 # height
				776	>>> wt = 130 # weight
				777	>>> fs = 8 # foot size
				778
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	779	Starting with a 50% `prior probability
				780	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				781	we compute the posterior as the prior times the product of likelihoods for the
				782	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	783
				784	.. doctest::
				785
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	786	>>> prior_male = 0.5
				787	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	788	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				789	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				790
				791	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				792	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				793
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	794	The final prediction goes to the largest posterior. This is known as the
				795	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	796	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				797
				798	.. doctest::
				799
				800	>>> 'male' if posterior_male > posterior_female else 'female'
				801	'female'
				802
				803
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	804	..
				805	# This modelines must appear within the last ten lines of the file.
				806	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;