Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: 157500ed4b4a107c5f66327ebad7d15d997eaf48 [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
				29	currently unsupported. Mixed types are also undefined and
				30	implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, e.g.
				32	``map(float, input_data)``.
				33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
				40	======================= =============================================
				41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	43	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	44	:func:`median` Median (middle value) of data.
				45	:func:`median_low` Low median of data.
				46	:func:`median_high` High median of data.
				47	:func:`median_grouped` Median, or 50th percentile, of grouped data.
				48	:func:`mode` Mode (most common value) of discrete data.
				49	======================= =============================================
				50
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	51	Measures of spread
				52	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	53
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	54	These functions calculate a measure of how much the population or sample
				55	tends to deviate from the typical or average values.
				56
				57	======================= =============================================
				58	:func:`pstdev` Population standard deviation of data.
				59	:func:`pvariance` Population variance of data.
				60	:func:`stdev` Sample standard deviation of data.
				61	:func:`variance` Sample variance of data.
				62	======================= =============================================
				63
				64
				65	Function details
				66	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	67
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	68	Note: The functions do not require the data given to them to be sorted.
				69	However, for reading convenience, most of the examples show sorted sequences.
				70
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	71	.. function:: mean(data)
				72
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	73	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	75	The arithmetic mean is the sum of the data divided by the number of data
				76	points. It is commonly called "the average", although it is only one of many
				77	different mathematical averages. It is a measure of the central location of
				78	the data.
				79
				80	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	81
				82	Some examples of use:
				83
				84	.. doctest::
				85
				86	>>> mean([1, 2, 3, 4, 4])
				87	2.8
				88	>>> mean([-1.0, 2.5, 3.25, 5.75])
				89	2.625
				90
				91	>>> from fractions import Fraction as F
				92	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				93	Fraction(13, 21)
				94
				95	>>> from decimal import Decimal as D
				96	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				97	Decimal('0.5625')
				98
				99	.. note::
				100
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	101	The mean is strongly affected by outliers and is not a robust estimator
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	102	for central location: the mean is not necessarily a typical example of the
				103	data points. For more robust, although less efficient, measures of
				104	central location, see :func:`median` and :func:`mode`. (In this case,
				105	"efficient" refers to statistical efficiency rather than computational
				106	efficiency.)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	107
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	108	The sample mean gives an unbiased estimate of the true population mean,
				109	which means that, taken on average over all the possible samples,
				110	``mean(sample)`` converges on the true mean of the entire population. If
				111	data represents the entire population rather than a sample, then
				112	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	113
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	114
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	115	.. function:: fmean(data)
				116
				117	Convert data to floats and compute the arithmetic mean.
				118
				119	This runs faster than the :func:`mean` function and it always returns a
				120	:class:`float`. The result is highly accurate but not as perfect as
				121	:func:`mean`. If the input dataset is empty, raises a
				122	:exc:`StatisticsError`.
				123
				124	.. doctest::
				125
				126	>>> fmean([3.5, 4.0, 5.25])
				127	4.25
				128
				129	.. versionadded:: 3.8
				130
				131
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	132	.. function:: harmonic_mean(data)
				133
				134	Return the harmonic mean of data, a sequence or iterator of
				135	real-valued numbers.
				136
				137	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	138	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	139	data. For example, the harmonic mean of three values a, b and c
				140	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				141
				142	The harmonic mean is a type of average, a measure of the central
				143	location of the data. It is often appropriate when averaging quantities
				144	which are rates or ratios, for example speeds. For example:
				145
				146	Suppose an investor purchases an equal value of shares in each of
				147	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				148	What is the average P/E ratio for the investor's portfolio?
				149
				150	.. doctest::
				151
				152	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				153	3.6
				154
				155	Using the arithmetic mean would give an average of about 5.167, which
				156	is too high.
				157
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	158	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	159	is less than zero.
				160
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	161	.. versionadded:: 3.6
				162
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	163
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	164	.. function:: median(data)
				165
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	166	Return the median (middle value) of numeric data, using the common "mean of
				167	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	168	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	169
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	170	The median is a robust measure of central location, and is less affected by
				171	the presence of outliers in your data. When the number of data points is
				172	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	173
				174	.. doctest::
				175
				176	>>> median([1, 3, 5])
				177	3
				178
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	179	When the number of data points is even, the median is interpolated by taking
				180	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	181
				182	.. doctest::
				183
				184	>>> median([1, 3, 5, 7])
				185	4.0
				186
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	187	This is suited for when your data is discrete, and you don't mind that the
				188	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	189
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	190	If your data is ordinal (supports order operations) but not numeric (doesn't
				191	support addition), you should use :func:`median_low` or :func:`median_high`
				192	instead.
				193
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	194	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	195
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	196
				197	.. function:: median_low(data)
				198
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	199	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	200	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	201
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	202	The low median is always a member of the data set. When the number of data
				203	points is odd, the middle value is returned. When it is even, the smaller of
				204	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	205
				206	.. doctest::
				207
				208	>>> median_low([1, 3, 5])
				209	3
				210	>>> median_low([1, 3, 5, 7])
				211	3
				212
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	213	Use the low median when your data are discrete and you prefer the median to
				214	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	215
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	216
				217	.. function:: median_high(data)
				218
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	219	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	220	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	221
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	222	The high median is always a member of the data set. When the number of data
				223	points is odd, the middle value is returned. When it is even, the larger of
				224	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	225
				226	.. doctest::
				227
				228	>>> median_high([1, 3, 5])
				229	3
				230	>>> median_high([1, 3, 5, 7])
				231	5
				232
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	233	Use the high median when your data are discrete and you prefer the median to
				234	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	235
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	236
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	237	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	238
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	239	Return the median of grouped continuous data, calculated as the 50th
				240	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	241	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	242
				243	.. doctest::
				244
				245	>>> median_grouped([52, 52, 53, 54])
				246	52.5
				247
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	248	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	249	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				250	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				251	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	252	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	253
				254	.. doctest::
				255
				256	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				257	3.7
				258
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	259	Optional argument interval represents the class interval, and defaults
				260	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	261
				262	.. doctest::
				263
				264	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				265	3.25
				266	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				267	3.5
				268
				269	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	270	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	271
				272	.. impl-detail::
				273
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	274	Under some circumstances, :func:`median_grouped` may coerce data points to
				275	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	276
				277	.. seealso::
				278
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	279	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				280	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	281
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	282	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	283	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	284	function in the Gnome Gnumeric spreadsheet, including `this discussion
				285	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	286
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	287
				288	.. function:: mode(data)
				289
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	290	Return the most common data point from discrete or nominal data. The mode
				291	(when it exists) is the most typical value, and is a robust measure of
				292	central location.
				293
				294	If data is empty, or if there is not exactly one most common value,
				295	:exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	296
				297	``mode`` assumes discrete data, and returns a single value. This is the
				298	standard treatment of the mode as commonly taught in schools:
				299
				300	.. doctest::
				301
				302	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				303	3
				304
				305	The mode is unique in that it is the only statistic which also applies
				306	to nominal (non-numeric) data:
				307
				308	.. doctest::
				309
				310	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				311	'red'
				312
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	313
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	314	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	315
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	316	Return the population standard deviation (the square root of the population
				317	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	318
				319	.. doctest::
				320
				321	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				322	0.986893273527251
				323
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	324
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	325	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	326
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	327	Return the population variance of data, a non-empty iterable of real-valued
				328	numbers. Variance, or second moment about the mean, is a measure of the
				329	variability (spread or dispersion) of data. A large variance indicates that
				330	the data is spread out; a small variance indicates it is clustered closely
				331	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	332
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	333	If the optional second argument mu is given, it should be the mean of
				334	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	335	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	336
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	337	Use this function to calculate the variance from the entire population. To
				338	estimate the variance from a sample, the :func:`variance` function is usually
				339	a better choice.
				340
				341	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	342
				343	Examples:
				344
				345	.. doctest::
				346
				347	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				348	>>> pvariance(data)
				349	1.25
				350
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	351	If you have already calculated the mean of your data, you can pass it as the
				352	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	353
				354	.. doctest::
				355
				356	>>> mu = mean(data)
				357	>>> pvariance(data, mu)
				358	1.25
				359
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	360	This function does not attempt to verify that you have passed the actual mean
				361	as mu. Using arbitrary values for mu may lead to invalid or impossible
				362	results.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	363
				364	Decimals and Fractions are supported:
				365
				366	.. doctest::
				367
				368	>>> from decimal import Decimal as D
				369	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				370	Decimal('24.815')
				371
				372	>>> from fractions import Fraction as F
				373	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				374	Fraction(13, 72)
				375
				376	.. note::
				377
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	378	When called with the entire population, this gives the population variance
				379	σ². When called on a sample instead, this is the biased sample variance
				380	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	381
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	382	If you somehow know the true population mean μ, you may use this function
				383	to calculate the variance of a sample, giving the known population mean as
				384	the second argument. Provided the data points are representative
				385	(e.g. independent and identically distributed), the result will be an
				386	unbiased estimate of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	387
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	388
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	389	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	390
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	391	Return the sample standard deviation (the square root of the sample
				392	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	393
				394	.. doctest::
				395
				396	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				397	1.0810874155219827
				398
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	399
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	400	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	401
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	402	Return the sample variance of data, an iterable of at least two real-valued
				403	numbers. Variance, or second moment about the mean, is a measure of the
				404	variability (spread or dispersion) of data. A large variance indicates that
				405	the data is spread out; a small variance indicates it is clustered closely
				406	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	407
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	408	If the optional second argument xbar is given, it should be the mean of
				409	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	410	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	411
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	412	Use this function when your data is a sample from a population. To calculate
				413	the variance from the entire population, see :func:`pvariance`.
				414
				415	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	416
				417	Examples:
				418
				419	.. doctest::
				420
				421	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				422	>>> variance(data)
				423	1.3720238095238095
				424
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	425	If you have already calculated the mean of your data, you can pass it as the
				426	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	427
				428	.. doctest::
				429
				430	>>> m = mean(data)
				431	>>> variance(data, m)
				432	1.3720238095238095
				433
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	434	This function does not attempt to verify that you have passed the actual mean
				435	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	436	impossible results.
				437
				438	Decimal and Fraction values are supported:
				439
				440	.. doctest::
				441
				442	>>> from decimal import Decimal as D
				443	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				444	Decimal('31.01875')
				445
				446	>>> from fractions import Fraction as F
				447	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				448	Fraction(67, 108)
				449
				450	.. note::
				451
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	452	This is the sample variance s² with Bessel's correction, also known as
				453	variance with N-1 degrees of freedom. Provided that the data points are
				454	representative (e.g. independent and identically distributed), the result
				455	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	456
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	457	If you somehow know the actual population mean μ you should pass it to the
				458	:func:`pvariance` function as the mu parameter to get the variance of a
				459	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	460
				461	Exceptions
				462	----------
				463
				464	A single exception is defined:
				465
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	466	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	467
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	468	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	469
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	470
				471	:class:`NormalDist` objects
				472	===========================
				473
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	474	:class:`NormalDist` is a tool for creating and manipulating normal
				475	distributions of a `random variable
				476	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
				477	composite class that treats the mean and standard deviation of data
				478	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	479
				480	Normal distributions arise from the `Central Limit Theorem
				481	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	482	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	483
				484	.. class:: NormalDist(mu=0.0, sigma=1.0)
				485
				486	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	487	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	488	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	489	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	490
				491	If sigma is negative, raises :exc:`StatisticsError`.
				492
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	493	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	494
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	495	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	496	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				497	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	498
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	499	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	500
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	501	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	502	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				503	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	504
				505	.. attribute:: variance
				506
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	507	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	508	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				509	distribution. Equal to the square of the standard deviation.
				510
				511	.. classmethod:: NormalDist.from_samples(data)
				512
				513	Class method that makes a normal distribution instance
				514	from sample data. The data can be any :term:`iterable`
				515	and should consist of values that can be converted to type
				516	:class:`float`.
				517
				518	If data does not contain at least two elements, raises
				519	:exc:`StatisticsError` because it takes at least one point to estimate
				520	a central value and at least two points to estimate dispersion.
				521
				522	.. method:: NormalDist.samples(n, seed=None)
				523
				524	Generates n random samples for a given mean and standard deviation.
				525	Returns a :class:`list` of :class:`float` values.
				526
				527	If seed is given, creates a new instance of the underlying random
				528	number generator. This is useful for creating reproducible results,
				529	even in a multi-threading context.
				530
				531	.. method:: NormalDist.pdf(x)
				532
				533	Using a `probability density function (pdf)
				534	<https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	535	compute the relative likelihood that a random variable X will be near
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	536	the given value x. Mathematically, it is the ratio ``P(x <= X <
				537	x+dx) / dx``.
				538
				539	Note the relative likelihood of x can be greater than `1.0`. The
				540	probability for a specific point on a continuous distribution is `0.0`,
				541	so the :func:`pdf` is used instead. It gives the probability of a
				542	sample occurring in a narrow range around x and then dividing that
				543	probability by the width of the range (hence the word "density").
				544
				545	.. method:: NormalDist.cdf(x)
				546
				547	Using a `cumulative distribution function (cdf)
				548	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	549	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	550	equal to x. Mathematically, it is written ``P(X <= x)``.
				551
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	552	.. method:: NormalDist.overlap(other)
				553
				554	Compute the `overlapping coefficient (OVL)
				555	<http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
				556	between two normal distributions.
				557
				558	Measures the agreement between two normal probability distributions.
				559	Returns a value between 0.0 and 1.0 giving the overlapping area for
				560	two probability density functions.
				561
				562	In this `example from John M. Linacre
				563	<https://www.rasch.org/rmt/rmt101r.htm>`_ about 80% of each
				564	distribution overlaps the other:
				565
				566	.. doctest::
				567
				568	>>> N1 = NormalDist(2.4, 1.6)
				569	>>> N2 = NormalDist(3.2, 2.0)
				570	>>> ovl = N1.overlap(N2)
				571	>>> f'{ovl * 100.0 :.1f}%'
				572	'80.4%'
				573
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	574	Instances of :class:`NormalDist` support addition, subtraction,
				575	multiplication and division by a constant. These operations
				576	are used for translation and scaling. For example:
				577
				578	.. doctest::
				579
				580	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				581	>>> temperature_february * (9/5) + 32 # Fahrenheit
				582	NormalDist(mu=41.0, sigma=4.5)
				583
				584	Dividing a constant by an instance of :class:`NormalDist` is not supported.
				585
				586	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	587	variables, it is possible to `add and subtract two independent normally
				588	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	589	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				590	represented as instances of :class:`NormalDist`. For example:
				591
				592	.. doctest::
				593
				594	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				595	>>> drug_effects = NormalDist(0.4, 0.15)
				596	>>> combined = birth_weights + drug_effects
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	597	>>> f'mean: {combined.mean :.1f} standard deviation: {combined.stdev :.1f}'
				598	'mean: 3.1 standard deviation: 0.5'
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	599
				600	.. versionadded:: 3.8
				601
				602
				603	:class:`NormalDist` Examples and Recipes
				604	----------------------------------------
				605
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	606	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	607
				608	For example, given `historical data for SAT exams
				609	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	610	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	611	determine the percentage of students with scores between 1100 and 1200:
				612
				613	.. doctest::
				614
				615	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	616	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	617	>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	618	'18.4% score between 1100 and 1200'
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	619
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	620	What percentage of men and women will have the same height in `two normally
				621	distributed populations with known means and standard deviations
				622	<http://www.usablestats.com/lessons/normal>`_?
				623
				624	>>> men = NormalDist(70, 4)
				625	>>> women = NormalDist(65, 3.5)
				626	>>> ovl = men.overlap(women)
				627	>>> round(ovl * 100.0, 1)
				628	50.3
				629
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	630	To estimate the distribution for a model than isn't easy to solve
				631	analytically, :class:`NormalDist` can generate input samples for a `Monte
				632	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
				633	model:
				634
				635	.. doctest::
				636
				637	>>> n = 100_000
				638	>>> X = NormalDist(350, 15).samples(n)
				639	>>> Y = NormalDist(47, 17).samples(n)
				640	>>> Z = NormalDist(62, 6).samples(n)
				641	>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
				642	>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
				643	NormalDist(mu=267.6516398754636, sigma=101.357284306067)
				644
				645	Normal distributions commonly arise in machine learning problems.
				646
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	647	Wikipedia has a `nice example of a Naive Bayesian Classifier
				648	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge is to
				649	predict a person's gender from measurements of normally distributed features
				650	including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	651
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	652	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	653	measurements are assumed to be normally distributed, so we summarize the data
				654	with :class:`NormalDist`:
				655
				656	.. doctest::
				657
				658	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				659	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				660	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				661	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				662	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				663	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				664
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	665	Next, we encounter a new person whose feature measurements are known but whose
				666	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	667
				668	.. doctest::
				669
				670	>>> ht = 6.0 # height
				671	>>> wt = 130 # weight
				672	>>> fs = 8 # foot size
				673
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	674	Starting with a 50% `prior probability
				675	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				676	we compute the posterior as the prior times the product of likelihoods for the
				677	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	678
				679	.. doctest::
				680
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	681	>>> prior_male = 0.5
				682	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	683	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				684	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				685
				686	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				687	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				688
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame^]	689	The final prediction goes to the largest posterior. This is known as the
				690	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	691	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				692
				693	.. doctest::
				694
				695	>>> 'male' if posterior_male > posterior_female else 'female'
				696	'female'
				697
				698
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	699	..
				700	# This modelines must appear within the last ten lines of the file.
				701	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;