Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: 1d52d98b29971b9a7e58d607205d370cc3d651c7 [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
				29	currently unsupported. Mixed types are also undefined and
				30	implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, e.g.
				32	``map(float, input_data)``.
				33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	40	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	43	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	44	:func:`median` Median (middle value) of data.
				45	:func:`median_low` Low median of data.
				46	:func:`median_high` High median of data.
				47	:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	48	:func:`mode` Single mode (most common value) of discrete or nominal data.
				49	:func:`multimode` List of modes (most common values) of discrete or nomimal data.
				50	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	51
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	52	Measures of spread
				53	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	54
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	55	These functions calculate a measure of how much the population or sample
				56	tends to deviate from the typical or average values.
				57
				58	======================= =============================================
				59	:func:`pstdev` Population standard deviation of data.
				60	:func:`pvariance` Population variance of data.
				61	:func:`stdev` Sample standard deviation of data.
				62	:func:`variance` Sample variance of data.
				63	======================= =============================================
				64
				65
				66	Function details
				67	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	68
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	69	Note: The functions do not require the data given to them to be sorted.
				70	However, for reading convenience, most of the examples show sorted sequences.
				71
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	72	.. function:: mean(data)
				73
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	74	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	75
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	76	The arithmetic mean is the sum of the data divided by the number of data
				77	points. It is commonly called "the average", although it is only one of many
				78	different mathematical averages. It is a measure of the central location of
				79	the data.
				80
				81	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	82
				83	Some examples of use:
				84
				85	.. doctest::
				86
				87	>>> mean([1, 2, 3, 4, 4])
				88	2.8
				89	>>> mean([-1.0, 2.5, 3.25, 5.75])
				90	2.625
				91
				92	>>> from fractions import Fraction as F
				93	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				94	Fraction(13, 21)
				95
				96	>>> from decimal import Decimal as D
				97	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				98	Decimal('0.5625')
				99
				100	.. note::
				101
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	102	The mean is strongly affected by outliers and is not a robust estimator
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	103	for central location: the mean is not necessarily a typical example of the
				104	data points. For more robust, although less efficient, measures of
				105	central location, see :func:`median` and :func:`mode`. (In this case,
				106	"efficient" refers to statistical efficiency rather than computational
				107	efficiency.)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	108
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	109	The sample mean gives an unbiased estimate of the true population mean,
				110	which means that, taken on average over all the possible samples,
				111	``mean(sample)`` converges on the true mean of the entire population. If
				112	data represents the entire population rather than a sample, then
				113	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	114
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	115
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	116	.. function:: fmean(data)
				117
				118	Convert data to floats and compute the arithmetic mean.
				119
				120	This runs faster than the :func:`mean` function and it always returns a
				121	:class:`float`. The result is highly accurate but not as perfect as
				122	:func:`mean`. If the input dataset is empty, raises a
				123	:exc:`StatisticsError`.
				124
				125	.. doctest::
				126
				127	>>> fmean([3.5, 4.0, 5.25])
				128	4.25
				129
				130	.. versionadded:: 3.8
				131
				132
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	133	.. function:: harmonic_mean(data)
				134
				135	Return the harmonic mean of data, a sequence or iterator of
				136	real-valued numbers.
				137
				138	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	139	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	140	data. For example, the harmonic mean of three values a, b and c
				141	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				142
				143	The harmonic mean is a type of average, a measure of the central
				144	location of the data. It is often appropriate when averaging quantities
				145	which are rates or ratios, for example speeds. For example:
				146
				147	Suppose an investor purchases an equal value of shares in each of
				148	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				149	What is the average P/E ratio for the investor's portfolio?
				150
				151	.. doctest::
				152
				153	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				154	3.6
				155
				156	Using the arithmetic mean would give an average of about 5.167, which
				157	is too high.
				158
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	159	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	160	is less than zero.
				161
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	162	.. versionadded:: 3.6
				163
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	164
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	165	.. function:: median(data)
				166
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	167	Return the median (middle value) of numeric data, using the common "mean of
				168	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	169	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	170
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	171	The median is a robust measure of central location, and is less affected by
				172	the presence of outliers in your data. When the number of data points is
				173	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	174
				175	.. doctest::
				176
				177	>>> median([1, 3, 5])
				178	3
				179
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	180	When the number of data points is even, the median is interpolated by taking
				181	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	182
				183	.. doctest::
				184
				185	>>> median([1, 3, 5, 7])
				186	4.0
				187
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	188	This is suited for when your data is discrete, and you don't mind that the
				189	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	190
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	191	If your data is ordinal (supports order operations) but not numeric (doesn't
				192	support addition), you should use :func:`median_low` or :func:`median_high`
				193	instead.
				194
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	195	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	196
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	197
				198	.. function:: median_low(data)
				199
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	200	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	201	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	202
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	203	The low median is always a member of the data set. When the number of data
				204	points is odd, the middle value is returned. When it is even, the smaller of
				205	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	206
				207	.. doctest::
				208
				209	>>> median_low([1, 3, 5])
				210	3
				211	>>> median_low([1, 3, 5, 7])
				212	3
				213
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	214	Use the low median when your data are discrete and you prefer the median to
				215	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	216
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	217
				218	.. function:: median_high(data)
				219
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	220	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	221	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	222
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	223	The high median is always a member of the data set. When the number of data
				224	points is odd, the middle value is returned. When it is even, the larger of
				225	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	226
				227	.. doctest::
				228
				229	>>> median_high([1, 3, 5])
				230	3
				231	>>> median_high([1, 3, 5, 7])
				232	5
				233
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	234	Use the high median when your data are discrete and you prefer the median to
				235	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	236
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	237
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	238	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	239
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	240	Return the median of grouped continuous data, calculated as the 50th
				241	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	242	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	243
				244	.. doctest::
				245
				246	>>> median_grouped([52, 52, 53, 54])
				247	52.5
				248
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	249	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	250	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				251	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				252	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	253	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	254
				255	.. doctest::
				256
				257	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				258	3.7
				259
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	260	Optional argument interval represents the class interval, and defaults
				261	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	262
				263	.. doctest::
				264
				265	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				266	3.25
				267	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				268	3.5
				269
				270	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	271	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	272
				273	.. impl-detail::
				274
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	275	Under some circumstances, :func:`median_grouped` may coerce data points to
				276	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	277
				278	.. seealso::
				279
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	280	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				281	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	282
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	283	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	284	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	285	function in the Gnome Gnumeric spreadsheet, including `this discussion
				286	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	287
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	288
				289	.. function:: mode(data)
				290
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	291	Return the single most common data point from discrete or nominal data.
				292	The mode (when it exists) is the most typical value and serves as a
				293	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	294
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	295	If there are multiple modes, returns the first one encountered in the data.
				296	If data is empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	297
				298	``mode`` assumes discrete data, and returns a single value. This is the
				299	standard treatment of the mode as commonly taught in schools:
				300
				301	.. doctest::
				302
				303	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				304	3
				305
				306	The mode is unique in that it is the only statistic which also applies
				307	to nominal (non-numeric) data:
				308
				309	.. doctest::
				310
				311	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				312	'red'
				313
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	314	.. versionchanged:: 3.8
				315	Now handles multimodal datasets by returning the first mode encountered.
				316	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				317	found.
				318
				319
				320	.. function:: multimode(data)
				321
				322	Return a list of the most frequently occurring values in the order they
				323	were first encountered in the data. Will return more than one result if
				324	there are multiple modes or an empty list if the data is empty:
				325
				326	.. doctest::
				327
				328	>>> multimode('aabbbbccddddeeffffgg')
				329	['b', 'd', 'f']
				330	>>> multimode('')
				331	[]
				332
				333	.. versionadded:: 3.8
				334
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	335
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	336	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	337
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	338	Return the population standard deviation (the square root of the population
				339	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	340
				341	.. doctest::
				342
				343	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				344	0.986893273527251
				345
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	346
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	347	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	348
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	349	Return the population variance of data, a non-empty iterable of real-valued
				350	numbers. Variance, or second moment about the mean, is a measure of the
				351	variability (spread or dispersion) of data. A large variance indicates that
				352	the data is spread out; a small variance indicates it is clustered closely
				353	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	354
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	355	If the optional second argument mu is given, it should be the mean of
				356	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	357	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	358
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	359	Use this function to calculate the variance from the entire population. To
				360	estimate the variance from a sample, the :func:`variance` function is usually
				361	a better choice.
				362
				363	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	364
				365	Examples:
				366
				367	.. doctest::
				368
				369	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				370	>>> pvariance(data)
				371	1.25
				372
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	373	If you have already calculated the mean of your data, you can pass it as the
				374	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	375
				376	.. doctest::
				377
				378	>>> mu = mean(data)
				379	>>> pvariance(data, mu)
				380	1.25
				381
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	382	This function does not attempt to verify that you have passed the actual mean
				383	as mu. Using arbitrary values for mu may lead to invalid or impossible
				384	results.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	385
				386	Decimals and Fractions are supported:
				387
				388	.. doctest::
				389
				390	>>> from decimal import Decimal as D
				391	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				392	Decimal('24.815')
				393
				394	>>> from fractions import Fraction as F
				395	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				396	Fraction(13, 72)
				397
				398	.. note::
				399
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	400	When called with the entire population, this gives the population variance
				401	σ². When called on a sample instead, this is the biased sample variance
				402	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	403
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	404	If you somehow know the true population mean μ, you may use this function
				405	to calculate the variance of a sample, giving the known population mean as
				406	the second argument. Provided the data points are representative
				407	(e.g. independent and identically distributed), the result will be an
				408	unbiased estimate of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	409
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	410
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	411	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	412
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	413	Return the sample standard deviation (the square root of the sample
				414	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	415
				416	.. doctest::
				417
				418	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				419	1.0810874155219827
				420
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	421
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	422	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	423
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	424	Return the sample variance of data, an iterable of at least two real-valued
				425	numbers. Variance, or second moment about the mean, is a measure of the
				426	variability (spread or dispersion) of data. A large variance indicates that
				427	the data is spread out; a small variance indicates it is clustered closely
				428	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	429
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	430	If the optional second argument xbar is given, it should be the mean of
				431	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	432	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	433
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	434	Use this function when your data is a sample from a population. To calculate
				435	the variance from the entire population, see :func:`pvariance`.
				436
				437	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	438
				439	Examples:
				440
				441	.. doctest::
				442
				443	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				444	>>> variance(data)
				445	1.3720238095238095
				446
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	447	If you have already calculated the mean of your data, you can pass it as the
				448	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	449
				450	.. doctest::
				451
				452	>>> m = mean(data)
				453	>>> variance(data, m)
				454	1.3720238095238095
				455
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	456	This function does not attempt to verify that you have passed the actual mean
				457	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	458	impossible results.
				459
				460	Decimal and Fraction values are supported:
				461
				462	.. doctest::
				463
				464	>>> from decimal import Decimal as D
				465	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				466	Decimal('31.01875')
				467
				468	>>> from fractions import Fraction as F
				469	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				470	Fraction(67, 108)
				471
				472	.. note::
				473
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	474	This is the sample variance s² with Bessel's correction, also known as
				475	variance with N-1 degrees of freedom. Provided that the data points are
				476	representative (e.g. independent and identically distributed), the result
				477	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	478
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	479	If you somehow know the actual population mean μ you should pass it to the
				480	:func:`pvariance` function as the mu parameter to get the variance of a
				481	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	482
				483	Exceptions
				484	----------
				485
				486	A single exception is defined:
				487
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	488	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	489
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	490	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	491
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	492
				493	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	494	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	495
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	496	:class:`NormalDist` is a tool for creating and manipulating normal
				497	distributions of a `random variable
				498	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
				499	composite class that treats the mean and standard deviation of data
				500	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	501
				502	Normal distributions arise from the `Central Limit Theorem
				503	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	504	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	505
				506	.. class:: NormalDist(mu=0.0, sigma=1.0)
				507
				508	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	509	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	510	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	511	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	512
				513	If sigma is negative, raises :exc:`StatisticsError`.
				514
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	515	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	516
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	517	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	518	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				519	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	520
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	521	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	522
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	523	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	524	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				525	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	526
				527	.. attribute:: variance
				528
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	529	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	530	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				531	distribution. Equal to the square of the standard deviation.
				532
				533	.. classmethod:: NormalDist.from_samples(data)
				534
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	535	Makes a normal distribution instance computed from sample data. The
				536	data can be any :term:`iterable` and should consist of values that
				537	can be converted to type :class:`float`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	538
				539	If data does not contain at least two elements, raises
				540	:exc:`StatisticsError` because it takes at least one point to estimate
				541	a central value and at least two points to estimate dispersion.
				542
				543	.. method:: NormalDist.samples(n, seed=None)
				544
				545	Generates n random samples for a given mean and standard deviation.
				546	Returns a :class:`list` of :class:`float` values.
				547
				548	If seed is given, creates a new instance of the underlying random
				549	number generator. This is useful for creating reproducible results,
				550	even in a multi-threading context.
				551
				552	.. method:: NormalDist.pdf(x)
				553
				554	Using a `probability density function (pdf)
				555	<https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	556	compute the relative likelihood that a random variable X will be near
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	557	the given value x. Mathematically, it is the ratio ``P(x <= X <
				558	x+dx) / dx``.
				559
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	560	The relative likelihood is computed as the probability of a sample
				561	occurring in a narrow range divided by the width of the range (hence
				562	the word "density"). Since the likelihood is relative to other points,
				563	its value can be greater than `1.0`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	564
				565	.. method:: NormalDist.cdf(x)
				566
				567	Using a `cumulative distribution function (cdf)
				568	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	569	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	570	equal to x. Mathematically, it is written ``P(X <= x)``.
				571
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame^]	572	.. method:: NormalDist.inv_cdf(p)
				573
				574	Compute the inverse cumulative distribution function, also known as the
				575	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				576	or the `percent-point
				577	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				578	function. Mathematically, it is written ``x : P(X <= x) = p``.
				579
				580	Finds the value x of the random variable X such that the
				581	probability of the variable being less than or equal to that value
				582	equals the given probability p.
				583
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	584	.. method:: NormalDist.overlap(other)
				585
				586	Compute the `overlapping coefficient (OVL)
				587	<http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger	14bab7a	2019-03-07 08:54:31 -0800	[diff] [blame]	588	between two normal distributions, giving a measure of agreement.
				589	Returns a value between 0.0 and 1.0 giving `the overlapping area for
				590	two probability density functions
				591	<https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	592
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	593	Instances of :class:`NormalDist` support addition, subtraction,
				594	multiplication and division by a constant. These operations
				595	are used for translation and scaling. For example:
				596
				597	.. doctest::
				598
				599	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				600	>>> temperature_february * (9/5) + 32 # Fahrenheit
				601	NormalDist(mu=41.0, sigma=4.5)
				602
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	603	Dividing a constant by an instance of :class:`NormalDist` is not supported
				604	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	605
				606	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	607	variables, it is possible to `add and subtract two independent normally
				608	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	609	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				610	represented as instances of :class:`NormalDist`. For example:
				611
				612	.. doctest::
				613
				614	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				615	>>> drug_effects = NormalDist(0.4, 0.15)
				616	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	617	>>> round(combined.mean, 1)
				618	3.1
				619	>>> round(combined.stdev, 1)
				620	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	621
				622	.. versionadded:: 3.8
				623
				624
				625	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	626	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	627
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	628	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	629
				630	For example, given `historical data for SAT exams
				631	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	632	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	633	determine the percentage of students with scores between 1100 and 1200, after
				634	rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	635
				636	.. doctest::
				637
				638	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	639	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	640	>>> round(fraction * 100.0, 1)
				641	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	642
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame^]	643	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				644	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				645
				646	.. doctest::
				647
				648	>>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
				649	[928, 1060, 1192]
				650	>>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
				651	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				652
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	653	What percentage of men and women will have the same height in `two normally
				654	distributed populations with known means and standard deviations
				655	<http://www.usablestats.com/lessons/normal>`_?
				656
				657	>>> men = NormalDist(70, 4)
				658	>>> women = NormalDist(65, 3.5)
				659	>>> ovl = men.overlap(women)
				660	>>> round(ovl * 100.0, 1)
				661	50.3
				662
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	663	To estimate the distribution for a model than isn't easy to solve
				664	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	665	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	666
				667	.. doctest::
				668
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	669	>>> def model(x, y, z):
				670	... return (3x + 7xy - 5y) / (11 * z)
				671	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	672	>>> n = 100_000
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	673	>>> X = NormalDist(10, 2.5).samples(n)
				674	>>> Y = NormalDist(15, 1.75).samples(n)
				675	>>> Z = NormalDist(5, 1.25).samples(n)
				676	>>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
				677	NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	678
				679	Normal distributions commonly arise in machine learning problems.
				680
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	681	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	682	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				683	The challenge is to predict a person's gender from measurements of normally
				684	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	685
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	686	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	687	measurements are assumed to be normally distributed, so we summarize the data
				688	with :class:`NormalDist`:
				689
				690	.. doctest::
				691
				692	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				693	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				694	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				695	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				696	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				697	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				698
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	699	Next, we encounter a new person whose feature measurements are known but whose
				700	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	701
				702	.. doctest::
				703
				704	>>> ht = 6.0 # height
				705	>>> wt = 130 # weight
				706	>>> fs = 8 # foot size
				707
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	708	Starting with a 50% `prior probability
				709	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				710	we compute the posterior as the prior times the product of likelihoods for the
				711	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	712
				713	.. doctest::
				714
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	715	>>> prior_male = 0.5
				716	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	717	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				718	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				719
				720	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				721	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				722
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	723	The final prediction goes to the largest posterior. This is known as the
				724	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	725	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				726
				727	.. doctest::
				728
				729	>>> 'male' if posterior_male > posterior_female else 'female'
				730	'female'
				731
				732
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	733	..
				734	# This modelines must appear within the last ten lines of the file.
				735	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;