Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: bc841fda72f887fb8cd40a7d9fe8d3bdac68f68f [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
				29	currently unsupported. Mixed types are also undefined and
				30	implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, e.g.
				32	``map(float, input_data)``.
				33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	40	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	43	:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	44	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	45	:func:`median` Median (middle value) of data.
				46	:func:`median_low` Low median of data.
				47	:func:`median_high` High median of data.
				48	:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	49	:func:`mode` Single mode (most common value) of discrete or nominal data.
				50	:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	51	:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	52	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	53
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	54	Measures of spread
				55	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	56
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	57	These functions calculate a measure of how much the population or sample
				58	tends to deviate from the typical or average values.
				59
				60	======================= =============================================
				61	:func:`pstdev` Population standard deviation of data.
				62	:func:`pvariance` Population variance of data.
				63	:func:`stdev` Sample standard deviation of data.
				64	:func:`variance` Sample variance of data.
				65	======================= =============================================
				66
				67
				68	Function details
				69	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	70
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	71	Note: The functions do not require the data given to them to be sorted.
				72	However, for reading convenience, most of the examples show sorted sequences.
				73
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74	.. function:: mean(data)
				75
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	76	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	77
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	78	The arithmetic mean is the sum of the data divided by the number of data
				79	points. It is commonly called "the average", although it is only one of many
				80	different mathematical averages. It is a measure of the central location of
				81	the data.
				82
				83	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	84
				85	Some examples of use:
				86
				87	.. doctest::
				88
				89	>>> mean([1, 2, 3, 4, 4])
				90	2.8
				91	>>> mean([-1.0, 2.5, 3.25, 5.75])
				92	2.625
				93
				94	>>> from fractions import Fraction as F
				95	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				96	Fraction(13, 21)
				97
				98	>>> from decimal import Decimal as D
				99	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				100	Decimal('0.5625')
				101
				102	.. note::
				103
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	104	The mean is strongly affected by outliers and is not a robust estimator
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	105	for central location: the mean is not necessarily a typical example of the
				106	data points. For more robust, although less efficient, measures of
				107	central location, see :func:`median` and :func:`mode`. (In this case,
				108	"efficient" refers to statistical efficiency rather than computational
				109	efficiency.)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	110
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	111	The sample mean gives an unbiased estimate of the true population mean,
				112	which means that, taken on average over all the possible samples,
				113	``mean(sample)`` converges on the true mean of the entire population. If
				114	data represents the entire population rather than a sample, then
				115	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	116
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	117
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	118	.. function:: fmean(data)
				119
				120	Convert data to floats and compute the arithmetic mean.
				121
				122	This runs faster than the :func:`mean` function and it always returns a
				123	:class:`float`. The result is highly accurate but not as perfect as
				124	:func:`mean`. If the input dataset is empty, raises a
				125	:exc:`StatisticsError`.
				126
				127	.. doctest::
				128
				129	>>> fmean([3.5, 4.0, 5.25])
				130	4.25
				131
				132	.. versionadded:: 3.8
				133
				134
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	135	.. function:: geometric_mean(data)
				136
				137	Convert data to floats and compute the geometric mean.
				138
				139	Raises a :exc:`StatisticsError` if the input dataset is empty,
				140	if it contains a zero, or if it contains a negative value.
				141
				142	No special efforts are made to achieve exact results.
				143	(However, this may change in the future.)
				144
				145	.. doctest::
				146
				147	>>> round(geometric_mean([54, 24, 36]), 9)
				148	36.0
				149
				150	.. versionadded:: 3.8
				151
				152
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	153	.. function:: harmonic_mean(data)
				154
				155	Return the harmonic mean of data, a sequence or iterator of
				156	real-valued numbers.
				157
				158	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	159	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	160	data. For example, the harmonic mean of three values a, b and c
				161	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				162
				163	The harmonic mean is a type of average, a measure of the central
				164	location of the data. It is often appropriate when averaging quantities
				165	which are rates or ratios, for example speeds. For example:
				166
				167	Suppose an investor purchases an equal value of shares in each of
				168	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				169	What is the average P/E ratio for the investor's portfolio?
				170
				171	.. doctest::
				172
				173	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				174	3.6
				175
				176	Using the arithmetic mean would give an average of about 5.167, which
				177	is too high.
				178
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	179	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	180	is less than zero.
				181
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	182	.. versionadded:: 3.6
				183
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	184
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	185	.. function:: median(data)
				186
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	187	Return the median (middle value) of numeric data, using the common "mean of
				188	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	189	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	190
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	191	The median is a robust measure of central location, and is less affected by
				192	the presence of outliers in your data. When the number of data points is
				193	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	194
				195	.. doctest::
				196
				197	>>> median([1, 3, 5])
				198	3
				199
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	200	When the number of data points is even, the median is interpolated by taking
				201	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	202
				203	.. doctest::
				204
				205	>>> median([1, 3, 5, 7])
				206	4.0
				207
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	208	This is suited for when your data is discrete, and you don't mind that the
				209	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	210
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	211	If your data is ordinal (supports order operations) but not numeric (doesn't
				212	support addition), you should use :func:`median_low` or :func:`median_high`
				213	instead.
				214
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	215	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	216
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	217
				218	.. function:: median_low(data)
				219
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	220	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	221	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	222
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	223	The low median is always a member of the data set. When the number of data
				224	points is odd, the middle value is returned. When it is even, the smaller of
				225	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	226
				227	.. doctest::
				228
				229	>>> median_low([1, 3, 5])
				230	3
				231	>>> median_low([1, 3, 5, 7])
				232	3
				233
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	234	Use the low median when your data are discrete and you prefer the median to
				235	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	236
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	237
				238	.. function:: median_high(data)
				239
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	240	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	241	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	242
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	243	The high median is always a member of the data set. When the number of data
				244	points is odd, the middle value is returned. When it is even, the larger of
				245	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	246
				247	.. doctest::
				248
				249	>>> median_high([1, 3, 5])
				250	3
				251	>>> median_high([1, 3, 5, 7])
				252	5
				253
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	254	Use the high median when your data are discrete and you prefer the median to
				255	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	256
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	257
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	258	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	259
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	260	Return the median of grouped continuous data, calculated as the 50th
				261	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	262	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	263
				264	.. doctest::
				265
				266	>>> median_grouped([52, 52, 53, 54])
				267	52.5
				268
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	269	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	270	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				271	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				272	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	273	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	274
				275	.. doctest::
				276
				277	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				278	3.7
				279
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	280	Optional argument interval represents the class interval, and defaults
				281	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	282
				283	.. doctest::
				284
				285	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				286	3.25
				287	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				288	3.5
				289
				290	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	291	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	292
				293	.. impl-detail::
				294
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	295	Under some circumstances, :func:`median_grouped` may coerce data points to
				296	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	297
				298	.. seealso::
				299
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	300	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				301	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	302
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	303	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	304	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	305	function in the Gnome Gnumeric spreadsheet, including `this discussion
				306	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	307
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	308
				309	.. function:: mode(data)
				310
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	311	Return the single most common data point from discrete or nominal data.
				312	The mode (when it exists) is the most typical value and serves as a
				313	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	314
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	315	If there are multiple modes, returns the first one encountered in the data.
				316	If data is empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	317
				318	``mode`` assumes discrete data, and returns a single value. This is the
				319	standard treatment of the mode as commonly taught in schools:
				320
				321	.. doctest::
				322
				323	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				324	3
				325
				326	The mode is unique in that it is the only statistic which also applies
				327	to nominal (non-numeric) data:
				328
				329	.. doctest::
				330
				331	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				332	'red'
				333
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	334	.. versionchanged:: 3.8
				335	Now handles multimodal datasets by returning the first mode encountered.
				336	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				337	found.
				338
				339
				340	.. function:: multimode(data)
				341
				342	Return a list of the most frequently occurring values in the order they
				343	were first encountered in the data. Will return more than one result if
				344	there are multiple modes or an empty list if the data is empty:
				345
				346	.. doctest::
				347
				348	>>> multimode('aabbbbccddddeeffffgg')
				349	['b', 'd', 'f']
				350	>>> multimode('')
				351	[]
				352
				353	.. versionadded:: 3.8
				354
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	355
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	356	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	357
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	358	Return the population standard deviation (the square root of the population
				359	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	360
				361	.. doctest::
				362
				363	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				364	0.986893273527251
				365
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	366
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	367	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	368
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	369	Return the population variance of data, a non-empty iterable of real-valued
				370	numbers. Variance, or second moment about the mean, is a measure of the
				371	variability (spread or dispersion) of data. A large variance indicates that
				372	the data is spread out; a small variance indicates it is clustered closely
				373	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	374
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	375	If the optional second argument mu is given, it should be the mean of
				376	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	377	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	378
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	379	Use this function to calculate the variance from the entire population. To
				380	estimate the variance from a sample, the :func:`variance` function is usually
				381	a better choice.
				382
				383	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	384
				385	Examples:
				386
				387	.. doctest::
				388
				389	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				390	>>> pvariance(data)
				391	1.25
				392
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	393	If you have already calculated the mean of your data, you can pass it as the
				394	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	395
				396	.. doctest::
				397
				398	>>> mu = mean(data)
				399	>>> pvariance(data, mu)
				400	1.25
				401
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	402	This function does not attempt to verify that you have passed the actual mean
				403	as mu. Using arbitrary values for mu may lead to invalid or impossible
				404	results.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	405
				406	Decimals and Fractions are supported:
				407
				408	.. doctest::
				409
				410	>>> from decimal import Decimal as D
				411	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				412	Decimal('24.815')
				413
				414	>>> from fractions import Fraction as F
				415	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				416	Fraction(13, 72)
				417
				418	.. note::
				419
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	420	When called with the entire population, this gives the population variance
				421	σ². When called on a sample instead, this is the biased sample variance
				422	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	423
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	424	If you somehow know the true population mean μ, you may use this function
				425	to calculate the variance of a sample, giving the known population mean as
				426	the second argument. Provided the data points are representative
				427	(e.g. independent and identically distributed), the result will be an
				428	unbiased estimate of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	429
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	430
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	431	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	432
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	433	Return the sample standard deviation (the square root of the sample
				434	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	435
				436	.. doctest::
				437
				438	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				439	1.0810874155219827
				440
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	441
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	442	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	443
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	444	Return the sample variance of data, an iterable of at least two real-valued
				445	numbers. Variance, or second moment about the mean, is a measure of the
				446	variability (spread or dispersion) of data. A large variance indicates that
				447	the data is spread out; a small variance indicates it is clustered closely
				448	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	449
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	450	If the optional second argument xbar is given, it should be the mean of
				451	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	452	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	453
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	454	Use this function when your data is a sample from a population. To calculate
				455	the variance from the entire population, see :func:`pvariance`.
				456
				457	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	458
				459	Examples:
				460
				461	.. doctest::
				462
				463	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				464	>>> variance(data)
				465	1.3720238095238095
				466
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	467	If you have already calculated the mean of your data, you can pass it as the
				468	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	469
				470	.. doctest::
				471
				472	>>> m = mean(data)
				473	>>> variance(data, m)
				474	1.3720238095238095
				475
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	476	This function does not attempt to verify that you have passed the actual mean
				477	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	478	impossible results.
				479
				480	Decimal and Fraction values are supported:
				481
				482	.. doctest::
				483
				484	>>> from decimal import Decimal as D
				485	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				486	Decimal('31.01875')
				487
				488	>>> from fractions import Fraction as F
				489	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				490	Fraction(67, 108)
				491
				492	.. note::
				493
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	494	This is the sample variance s² with Bessel's correction, also known as
				495	variance with N-1 degrees of freedom. Provided that the data points are
				496	representative (e.g. independent and identically distributed), the result
				497	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	498
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	499	If you somehow know the actual population mean μ you should pass it to the
				500	:func:`pvariance` function as the mu parameter to get the variance of a
				501	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	502
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	503	.. function:: quantiles(dist, *, n=4, method='exclusive')
				504
				505	Divide dist into n continuous intervals with equal probability.
				506	Returns a list of ``n - 1`` cut points separating the intervals.
				507
				508	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				509	n to 100 for percentiles which gives the 99 cuts points that separate
				510	dist in to 100 equal sized groups. Raises :exc:`StatisticsError` if n
				511	is not least 1.
				512
				513	The dist can be any iterable containing sample data or it can be an
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame^]	514	instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
				515	results, the number of data points in dist should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	516	Raises :exc:`StatisticsError` if there are not at least two data points.
				517
				518	For sample data, the cut points are linearly interpolated from the
				519	two nearest data points. For example, if a cut point falls one-third
				520	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame^]	521	cut-point will evaluate to ``104``.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	522
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame^]	523	The method for computing quantiles can be varied depending on
				524	whether the data in dist includes or excludes the lowest and
				525	highest possible values from the population.
				526
				527	The default method is "exclusive" and is used for data sampled from
				528	a population that can have more extreme values than found in the
				529	samples. The portion of the population falling below the i-th of
				530	m data points is computed as ``i / (m + 1)``.
				531
				532	Setting the method to "inclusive" is used for describing population
				533	data or for samples that include the extreme points. The minimum
				534	value in dist is treated as the 0th percentile and the maximum
				535	value is treated as the 100th percentile. The portion of the
				536	population falling below the i-th of m data points is computed as
				537	``(i - 1) / (m - 1)``.
				538
				539	If dist is an instance of a class that defines an
				540	:meth:`~inv_cdf` method, setting method has no effect.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	541
				542	.. doctest::
				543
				544	# Decile cut points for empirically sampled data
				545	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				546	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				547	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				548	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				549	... 103, 107, 101, 81, 109, 104]
				550	>>> [round(q, 1) for q in quantiles(data, n=10)]
				551	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
				552
				553	>>> # Quartile cut points for the standard normal distibution
				554	>>> Z = NormalDist()
				555	>>> [round(q, 4) for q in quantiles(Z, n=4)]
				556	[-0.6745, 0.0, 0.6745]
				557
				558	.. versionadded:: 3.8
				559
				560
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	561	Exceptions
				562	----------
				563
				564	A single exception is defined:
				565
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	566	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	567
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	568	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	569
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	570
				571	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	572	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	573
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	574	:class:`NormalDist` is a tool for creating and manipulating normal
				575	distributions of a `random variable
				576	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
				577	composite class that treats the mean and standard deviation of data
				578	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	579
				580	Normal distributions arise from the `Central Limit Theorem
				581	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	582	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	583
				584	.. class:: NormalDist(mu=0.0, sigma=1.0)
				585
				586	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	587	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	588	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	589	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	590
				591	If sigma is negative, raises :exc:`StatisticsError`.
				592
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	593	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	594
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	595	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	596	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				597	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	598
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	599	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	600
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	601	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	602	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				603	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	604
				605	.. attribute:: variance
				606
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	607	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	608	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				609	distribution. Equal to the square of the standard deviation.
				610
				611	.. classmethod:: NormalDist.from_samples(data)
				612
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	613	Makes a normal distribution instance computed from sample data. The
				614	data can be any :term:`iterable` and should consist of values that
				615	can be converted to type :class:`float`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	616
				617	If data does not contain at least two elements, raises
				618	:exc:`StatisticsError` because it takes at least one point to estimate
				619	a central value and at least two points to estimate dispersion.
				620
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	621	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	622
				623	Generates n random samples for a given mean and standard deviation.
				624	Returns a :class:`list` of :class:`float` values.
				625
				626	If seed is given, creates a new instance of the underlying random
				627	number generator. This is useful for creating reproducible results,
				628	even in a multi-threading context.
				629
				630	.. method:: NormalDist.pdf(x)
				631
				632	Using a `probability density function (pdf)
				633	<https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	634	compute the relative likelihood that a random variable X will be near
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	635	the given value x. Mathematically, it is the ratio ``P(x <= X <
				636	x+dx) / dx``.
				637
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	638	The relative likelihood is computed as the probability of a sample
				639	occurring in a narrow range divided by the width of the range (hence
				640	the word "density"). Since the likelihood is relative to other points,
				641	its value can be greater than `1.0`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	642
				643	.. method:: NormalDist.cdf(x)
				644
				645	Using a `cumulative distribution function (cdf)
				646	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	647	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	648	equal to x. Mathematically, it is written ``P(X <= x)``.
				649
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	650	.. method:: NormalDist.inv_cdf(p)
				651
				652	Compute the inverse cumulative distribution function, also known as the
				653	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				654	or the `percent-point
				655	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				656	function. Mathematically, it is written ``x : P(X <= x) = p``.
				657
				658	Finds the value x of the random variable X such that the
				659	probability of the variable being less than or equal to that value
				660	equals the given probability p.
				661
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	662	.. method:: NormalDist.overlap(other)
				663
				664	Compute the `overlapping coefficient (OVL)
				665	<http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger	14bab7a	2019-03-07 08:54:31 -0800	[diff] [blame]	666	between two normal distributions, giving a measure of agreement.
				667	Returns a value between 0.0 and 1.0 giving `the overlapping area for
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	668	the two probability density functions
Raymond Hettinger	14bab7a	2019-03-07 08:54:31 -0800	[diff] [blame]	669	<https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	670
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	671	Instances of :class:`NormalDist` support addition, subtraction,
				672	multiplication and division by a constant. These operations
				673	are used for translation and scaling. For example:
				674
				675	.. doctest::
				676
				677	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				678	>>> temperature_february * (9/5) + 32 # Fahrenheit
				679	NormalDist(mu=41.0, sigma=4.5)
				680
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	681	Dividing a constant by an instance of :class:`NormalDist` is not supported
				682	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	683
				684	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	685	variables, it is possible to `add and subtract two independent normally
				686	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	687	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				688	represented as instances of :class:`NormalDist`. For example:
				689
				690	.. doctest::
				691
				692	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				693	>>> drug_effects = NormalDist(0.4, 0.15)
				694	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	695	>>> round(combined.mean, 1)
				696	3.1
				697	>>> round(combined.stdev, 1)
				698	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	699
				700	.. versionadded:: 3.8
				701
				702
				703	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	704	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	705
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	706	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	707
				708	For example, given `historical data for SAT exams
				709	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	710	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	711	determine the percentage of students with test scores between 1100 and
				712	1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	713
				714	.. doctest::
				715
				716	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	717	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	718	>>> round(fraction * 100.0, 1)
				719	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	720
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	721	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				722	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				723
				724	.. doctest::
				725
				726	>>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
				727	[928, 1060, 1192]
				728	>>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
				729	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				730
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	731	What percentage of men and women will have the same height in `two normally
				732	distributed populations with known means and standard deviations
				733	<http://www.usablestats.com/lessons/normal>`_?
				734
				735	>>> men = NormalDist(70, 4)
				736	>>> women = NormalDist(65, 3.5)
				737	>>> ovl = men.overlap(women)
				738	>>> round(ovl * 100.0, 1)
				739	50.3
				740
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	741	To estimate the distribution for a model than isn't easy to solve
				742	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	743	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	744
				745	.. doctest::
				746
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	747	>>> def model(x, y, z):
				748	... return (3x + 7xy - 5y) / (11 * z)
				749	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	750	>>> n = 100_000
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	751	>>> X = NormalDist(10, 2.5).samples(n)
				752	>>> Y = NormalDist(15, 1.75).samples(n)
				753	>>> Z = NormalDist(5, 1.25).samples(n)
				754	>>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
				755	NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	756
				757	Normal distributions commonly arise in machine learning problems.
				758
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	759	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	760	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				761	The challenge is to predict a person's gender from measurements of normally
				762	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	763
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	764	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	765	measurements are assumed to be normally distributed, so we summarize the data
				766	with :class:`NormalDist`:
				767
				768	.. doctest::
				769
				770	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				771	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				772	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				773	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				774	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				775	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				776
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	777	Next, we encounter a new person whose feature measurements are known but whose
				778	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	779
				780	.. doctest::
				781
				782	>>> ht = 6.0 # height
				783	>>> wt = 130 # weight
				784	>>> fs = 8 # foot size
				785
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	786	Starting with a 50% `prior probability
				787	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				788	we compute the posterior as the prior times the product of likelihoods for the
				789	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	790
				791	.. doctest::
				792
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	793	>>> prior_male = 0.5
				794	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	795	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				796	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				797
				798	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				799	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				800
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	801	The final prediction goes to the largest posterior. This is known as the
				802	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	803	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				804
				805	.. doctest::
				806
				807	>>> 'male' if posterior_male > posterior_female else 'female'
				808	'female'
				809
				810
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	811	..
				812	# This modelines must appear within the last ten lines of the file.
				813	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;