
Bootstrapping

In statistics, bootstrapping is a method for assigning measures of accuracy to sample estimates.[1] This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods.[2][3] Generally, it falls in the broader class of resampling methods.

Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples of the observed dataset (each of equal size to the observed dataset), each of which is obtained by random sampling with replacement from the original dataset.

It may also be used for constructing hypothesis tests. It is often used as an alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.

Informal description

The basic idea of bootstrapping is that the sample we have collected is often the best guess we have as to the shape of the population from which the sample was taken. For instance, a sample of observations with two peaks in its histogram would not be well approximated by a Gaussian or normal bell curve, which has only one peak. Therefore, instead of assuming a mathematical shape (like the normal curve or some other) for the population, we instead use the shape of the sample.

As an example, assume we are interested in the average (or mean) height of people worldwide. We cannot measure all the people in the global population, so instead we sample only a tiny part of it, and measure that. Assume the sample is of size N; that is, we measure the heights of N individuals. From that single sample, only one value of the mean can be obtained. In order to reason about the population, we need some sense of the variability of the mean that we have computed.

To use the simplest bootstrap technique, we take our original data set of N heights, and, using a computer, make a new sample (called a bootstrap sample) that is also of size N. This new sample is taken from the original using sampling with replacement, so it is not identical with the original "real" sample. We repeat this many times (maybe 1,000 or 10,000 times), and for each of these bootstrap samples we compute its mean (each of these is called a bootstrap estimate). We now have a histogram of bootstrap means. This provides an estimate of the shape of the distribution of the mean, from which we can answer questions about how much the mean varies. (The method here, described for the mean, can be applied to almost any other statistic or estimator.)
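To make this concrete, here is a minimal sketch in Python (not part of the original text); the heights below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sample of N = 50 heights in cm (illustrative only).
heights = rng.normal(loc=170, scale=10, size=50)

# Draw 10,000 bootstrap samples, each of size N, with replacement,
# and record the mean of each one.
boot_means = np.array([
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means estimates the variability of the sample mean.
print("sample mean:", heights.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
```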

The key principle of the bootstrap is to provide a way to simulate repeated observations from an unknown population using the obtained sample as a basis.

Situations where bootstrapping is useful

Adèr et al.[4] recommend the bootstrap procedure for the following situations:

  1. When the theoretical distribution of a statistic of interest is complicated or unknown.
  2. When the sample size is insufficient for straightforward statistical inference.
  3. When power calculations have to be performed, and only a small pilot sample is available.

Discussion

Advantages

A great advantage of the bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients. Moreover, it is an appropriate way to control and check the stability of the results.

Disadvantages

Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees. Furthermore, it has a tendency to be overly optimistic.[citation needed] The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples) where these would be more formally stated in other approaches.

Types of bootstrap scheme

In univariate problems, it is usually acceptable to resample the individual observations with replacement ("case resampling" below). In small samples, a parametric bootstrap approach might be preferred. For other problems, a smooth bootstrap will likely be preferred.

For regression problems, various other alternatives are available.[citation needed]

Case resampling

The bootstrap is generally useful for estimating the distribution of a statistic (e.g. mean, variance) without using normal theory (e.g. z-statistic, t-statistic). The bootstrap comes in handy when there is no analytical form or normal theory to help estimate the distribution of the statistic of interest, since bootstrap methods can apply to most random quantities, e.g., the ratio of variance and mean. There are at least two ways of performing case resampling.

  1. The Monte Carlo algorithm for case resampling is quite simple. First, we resample the data with replacement, and the size of the resample must be equal to the size of the original data set. Then the statistic of interest is computed from the resample from the first step. We repeat this routine many times to get a more precise estimate of the bootstrap distribution of the statistic. (Both variants are sketched in the code after this list.)
  2. The 'exact' version for case resampling is similar, but we exhaustively enumerate every possible resample of the data set. This can be computationally expensive, as there are a total of C(2n − 1, n) different resamples, where n is the size of the data set.
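A Python sketch of both approaches, using a tiny invented data set so that the exact enumeration stays small:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb, factorial, prod
from collections import Counter

rng = np.random.default_rng(1)
data = np.array([3.1, 4.7, 5.0, 6.2])   # tiny illustrative data set
n = data.size

# 1. Monte Carlo case resampling: resample with replacement,
#    recompute the statistic, and repeat many times.
mc_means = [rng.choice(data, size=n, replace=True).mean() for _ in range(5000)]

# 2. 'Exact' case resampling: enumerate all C(2n-1, n) distinct resamples.
resamples = list(combinations_with_replacement(data, n))
assert len(resamples) == comb(2 * n - 1, n)   # 35 resamples for n = 4

# Each distinct resample occurs with multinomial probability
# n! / (k1! ... km!) / n^n, where the k's are the value multiplicities.
def prob(sample):
    ks = Counter(sample).values()
    return factorial(n) / (prod(factorial(k) for k in ks) * n ** n)

# Sanity check: the exact expected bootstrap mean equals the sample mean.
exact_mean = sum(prob(s) * np.mean(s) for s in resamples)
assert np.isclose(exact_mean, data.mean())
```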

Estimating the distribution of the sample mean

Consider a coin-flipping experiment. We flip the coin and record whether it lands heads or tails. (Assume for simplicity that there are only two outcomes.) Let X = x1, x2, …, x10 be 10 observations from the experiment. xi = 1 if the i-th flip lands heads, and 0 otherwise. From normal theory, we can use the t-statistic to estimate the distribution of the sample mean, x̄ = (x1 + x2 + ⋯ + x10)/10.

Instead, we use the bootstrap, specifically case resampling, to derive the distribution of x̄. We first resample the data to obtain a bootstrap resample. An example of the first resample might look like this: X1* = x2, x1, x10, x10, x3, x4, x6, x7, x1, x9. Note that there are some duplicates, since a bootstrap resample comes from sampling with replacement from the data. Note also that the number of data points in a bootstrap resample is equal to the number of data points in our original observations. Then we compute the mean of this resample and obtain the first bootstrap mean: μ1*. We repeat this process to obtain the second resample X2* and compute the second bootstrap mean μ2*. If we repeat this 100 times, then we have μ1*, μ2*, …, μ100*. This represents an empirical bootstrap distribution of the sample mean. From this empirical distribution, one can derive a bootstrap confidence interval for the purpose of hypothesis testing.
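A minimal Python sketch of this procedure, with an invented sequence of ten flips standing in for the observations:

```python
import numpy as np

rng = np.random.default_rng(2)
# Ten illustrative coin flips: 1 = heads, 0 = tails.
x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

# 100 bootstrap resamples, each of size 10, drawn with replacement,
# and the mean of each one (the bootstrap means mu_1*, ..., mu_100*).
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean() for _ in range(100)
])

# Percentile bootstrap 95% confidence interval for the mean.
ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile CI for the mean: ({ci_lo:.2f}, {ci_hi:.2f})")
```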

Regression

In regression problems, case resampling refers to the simple scheme of resampling individual cases, often rows of a dataset. For regression problems, so long as the data set is fairly large, this simple scheme is often acceptable. However, the method is open to criticism.[citation needed]

In regression problems, the explanatory variables are often fixed, or at least observed with more control than the response variable. Also, the range of the explanatory variables defines the information available from them. Therefore, to resample cases means that each bootstrap sample will lose some information. As such, alternative bootstrap procedures should be considered.

Bayesian bootstrap

Bootstrapping can be interpreted in a Bayesian framework using a scheme that creates new datasets through reweighting the initial data. Given a set of N data points, the weighting w_i assigned to data point i in a new dataset is w_i = u_(i) − u_(i−1), where u_(1), …, u_(N−1) is a low-to-high ordered list of N − 1 uniformly distributed random numbers on [0, 1], preceded by u_(0) = 0 and succeeded by u_(N) = 1. The distributions of a parameter inferred from considering many such datasets are then interpretable as posterior distributions on that parameter.[5]
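An illustrative Python sketch of this reweighting scheme, with invented data; the gaps between the sorted uniforms are equivalent to a draw from a flat Dirichlet(1, …, 1) distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=20)   # invented data, N = 20
N = data.size

posterior_means = []
for _ in range(5000):
    # N - 1 uniform random numbers on [0, 1], sorted, padded with 0 and 1.
    u = np.sort(rng.uniform(size=N - 1))
    edges = np.concatenate(([0.0], u, [1.0]))
    weights = np.diff(edges)          # N gaps summing to 1
    posterior_means.append(weights @ data)

# The histogram of posterior_means approximates the posterior of the mean.
# (Equivalently: weights = rng.dirichlet(np.ones(N)).)
```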

Smooth bootstrap

Under this scheme, a small amount of (usually normally distributed) zero-centered random noise is added to each resampled observation. This is equivalent to sampling from a kernel density estimate of the data.
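A minimal sketch, assuming normally distributed noise with an arbitrarily chosen standard deviation h; this corresponds to sampling from a Gaussian kernel density estimate with bandwidth h:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=5.0, scale=2.0, size=30)   # invented data
n = data.size
h = 0.5   # noise standard deviation (kernel bandwidth); an assumed choice

smooth_boot_means = np.array([
    (rng.choice(data, size=n, replace=True)       # case resample ...
     + rng.normal(scale=h, size=n)).mean()        # ... plus zero-centered noise
    for _ in range(5000)
])
```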

Parametric bootstrap

In this case a parametric model is fitted to the data, often by maximum likelihood, and samples of random numbers are drawn from this fitted model. Usually the sample drawn has the same sample size as the original data. Then the quantity, or estimate, of interest is calculated from these data. This sampling process is repeated many times as for other bootstrap methods. The use of a parametric model at the sampling stage of the bootstrap methodology leads to procedures which are different from those obtained by applying basic statistical theory to inference for the same model.
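A sketch using an exponential model, where the maximum-likelihood fit reduces to the sample mean; the data and the model choice are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=3.0, size=40)   # invented data

# Fit the parametric model by maximum likelihood; for the exponential
# distribution, the MLE of the scale parameter is the sample mean.
scale_hat = data.mean()

# Draw bootstrap samples from the fitted model (not from the data),
# each of the original sample size, and compute the statistic of interest.
boot_medians = np.array([
    np.median(rng.exponential(scale=scale_hat, size=data.size))
    for _ in range(5000)
])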

Resampling residuals

Another approach to bootstrapping in regression problems is to resample residuals. The method proceeds as follows (a sketch in code follows the list).

  1. Fit the model and retain the fitted values ŷi and the residuals êi = yi − ŷi.
  2. For each pair, (xi, yi), in which xi is the (possibly multivariate) explanatory variable, add a randomly resampled residual, êj, to the fitted value ŷi. In other words, create synthetic response variables y*i = ŷi + êj, where j is selected randomly from the list (1, …, n) for every i.
  3. Refit the model using the fictitious response variables y*i, and retain the quantities of interest (often the parameters estimated from the synthetic y*i).
  4. Repeat steps 2 and 3 a large number of times.
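A minimal sketch of residual resampling for simple linear regression, with invented data; np.polyfit stands in for whatever fitting routine is actually in use:

```python
import numpy as np

rng = np.random.default_rng(6)

# Invented simple linear regression data.
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

# Step 1: fit the model, retain fitted values and residuals.
beta = np.polyfit(x, y, deg=1)           # [slope, intercept]
fitted = np.polyval(beta, x)
resid = y - fitted

# Steps 2-4: resample residuals, refit, collect the parameter of interest.
boot_slopes = []
for _ in range(2000):
    y_star = fitted + rng.choice(resid, size=resid.size, replace=True)
    boot_slopes.append(np.polyfit(x, y_star, deg=1)[0])

print("bootstrap SE of slope:", np.std(boot_slopes, ddof=1))
```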

This scheme has the advantage that it retains the information in the explanatory variables. However, a question arises as to which residuals to resample. Raw residuals are one option; another is studentized residuals (in linear regression). While there are arguments in favour of using studentized residuals, in practice it often makes little difference, and it is easy to run both schemes and compare the results against each other.

Gaussian process regression bootstrap

When data are temporally correlated, straightforward bootstrapping destroys the inherent correlations. This method uses Gaussian process regression to fit a probabilistic model from which replicates may then be drawn. Gaussian processes are methods from Bayesian non-parametric statistics but are here used to construct a parametric bootstrap approach, which implicitly allows the time-dependence of the data to be taken into account.
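A sketch using scikit-learn's Gaussian process regressor; the kernel choice (RBF plus white noise) and the data are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(7)
t = np.linspace(0, 10, 60)[:, None]          # time index (column vector)
y = np.sin(t).ravel() + rng.normal(scale=0.2, size=t.shape[0])

# Fit a GP to the temporally correlated series ...
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(t, y)

# ... then draw bootstrap replicates from the fitted probabilistic model,
# which preserves the time-dependence of the data.
replicates = gpr.sample_y(t, n_samples=200, random_state=0)   # shape (60, 200)
```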

Wild bootstrap

Each residual is randomly multiplied by a random variable with mean 0 and variance 1. This method assumes that the 'true' residual distribution is symmetric and can offer advantages over simple residual sampling for smaller sample sizes.[6]
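A sketch using Rademacher multipliers (+1 or −1 with equal probability), one common choice of a mean-0, variance-1 multiplier; the regression setup is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(0, 10, 25)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

beta = np.polyfit(x, y, deg=1)
fitted = np.polyval(beta, x)
resid = y - fitted

boot_slopes = []
for _ in range(2000):
    # One Rademacher multiplier per residual: mean 0, variance 1.
    v = rng.choice([-1.0, 1.0], size=resid.size)
    y_star = fitted + resid * v
    boot_slopes.append(np.polyfit(x, y_star, deg=1)[0])
```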

Moving block bootstrap

In the moving block bootstrap, n − b + 1 overlapping blocks of length b are created in the following way: observations 1 to b form block 1, observations 2 to b + 1 form block 2, and so on. Then n/b blocks are drawn at random with replacement from these n − b + 1 blocks. Aligning these n/b blocks in the order they were picked gives the bootstrap observations. This bootstrap works with dependent data; however, the bootstrapped observations will no longer be stationary by construction. It has been shown that varying the block length can avoid this problem.[7]
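A minimal sketch, assuming n is a multiple of the block length b and using an invented autocorrelated series:

```python
import numpy as np

rng = np.random.default_rng(9)
series = rng.normal(size=100).cumsum()   # invented dependent (random-walk) data
n, b = series.size, 10                   # n observations, block length b

# The n - b + 1 overlapping blocks: block i is series[i : i + b].
blocks = np.array([series[i:i + b] for i in range(n - b + 1)])

# Draw n / b blocks at random with replacement and concatenate them
# in the order drawn to form one bootstrap series of length n.
idx = rng.integers(0, n - b + 1, size=n // b)
boot_series = np.concatenate(blocks[idx])
```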

Choice of statistic

The bootstrap distribution of a point estimator of a population parameter has been used to produce a bootstrapped confidence interval for the parameter's true value, if the parameter can be written as a function of the population's distribution.

Population parameters are estimated with many point estimators. Popular families of point-estimators include mean-unbiased minimum-variance estimators, median-unbiased estimators, Bayesian estimators (for example, the posterior distribution's mode, median, mean), and maximum-likelihood estimators.

A Bayesian point estimator and a maximum-likelihood estimator have good performance when the sample size is infinite, according to asymptotic theory. For practical problems with finite samples, other estimators may be preferable. Asymptotic theory suggests techniques that often improve the performance of bootstrapped estimators; the bootstrapping of a maximum-likelihood estimator may often be improved using transformations related to pivotal quantities.[8]

Deriving confidence intervals from the bootstrap distribution

The bootstrap distribution of a parameter-estimator has been used to calculate confidence intervals for its population-parameter.[citation needed]

Effect of bias and the lack of symmetry on bootstrap confidence intervals

Methods for bootstrap confidence intervals

There are several methods for constructing confidence intervals from the bootstrap distribution of a real parameter; commonly used ones include the basic (reverse percentile) interval, the percentile interval, the studentized (bootstrap-t) interval, and the bias-corrected and accelerated (BCa) interval.

Example applications


Smoothed bootstrap

In 1878, Simon Newcomb took observations on the speed of light.[11] The data set contains two outliers, which greatly influence the sample mean. (Note that the sample mean need not be a consistent estimator for any population mean, because no mean need exist for heavy-tailed distributions.) A well-defined and robust statistic for central tendency is the sample median, which is consistent and median-unbiased for the population median.

The bootstrap distribution for Newcomb's data appears below. A convolution method of regularization reduces the discreteness of the bootstrap distribution by adding a small amount of N(0, σ²) random noise to each bootstrap sample. A conventional choice is σ = 1/√n for sample size n.[citation needed]
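A sketch of the smoothed bootstrap of the median with σ = 1/√n; the data below are an invented stand-in, not Newcomb's measurements:

```python
import numpy as np

rng = np.random.default_rng(10)
data = rng.normal(loc=27, scale=5, size=66)   # invented stand-in data
n = data.size
sigma = 1 / np.sqrt(n)                        # the conventional choice noted above

plain, smooth = [], []
for _ in range(5000):
    resample = rng.choice(data, size=n, replace=True)
    plain.append(np.median(resample))
    smooth.append(np.median(resample + rng.normal(scale=sigma, size=n)))

# `plain` takes only a few distinct values; `smooth` has a richer support.
```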

Histograms of the bootstrap distribution and the smoothed bootstrap distribution appear below.


The bootstrap distribution of the sample median has only a small number of values. The smoothed bootstrap distribution has a richer support.

In this example, the bootstrapped 95% (percentile) confidence interval for the population median is (26, 28.5), which is close to the interval (25.98, 28.46) for the smoothed bootstrap.

Relation to other approaches to inference

Relationship to other resampling methods

The bootstrap is distinguished from:

  1. the jackknife procedure, used to estimate biases of sample statistics and to estimate variances, and
  2. cross-validation, in which the parameters (e.g., regression weights, factor loadings) estimated in one subsample are applied to another subsample.

For more details see bootstrap resampling.

Bootstrap aggregating (bagging) is a meta-algorithm based on averaging the results of multiple bootstrap samples.

U-statistics

Main article: U-statistic

In situations where an obvious statistic can be devised to measure a required characteristic using only a small number, r, of data items, a corresponding statistic based on the entire sample can be formulated. Given an r-sample statistic, one can create an n-sample statistic by something similar to bootstrapping (taking the average of the statistic over all subsamples of size r). This procedure is known to have certain good properties and the result is a U-statistic. The sample mean and sample variance are of this form, for r = 1 and r = 2.
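A minimal sketch: averaging the 2-sample kernel h(x, y) = (x − y)²/2 over all size-2 subsamples recovers the unbiased sample variance:

```python
import numpy as np
from itertools import combinations

def u_statistic(data, kernel, r):
    """Average the r-sample kernel over all size-r subsamples."""
    return np.mean([kernel(*sub) for sub in combinations(data, r)])

rng = np.random.default_rng(11)
data = rng.normal(size=15)

# The r = 2 kernel (x - y)^2 / 2 yields the unbiased sample variance.
var_u = u_statistic(data, lambda x, y: (x - y) ** 2 / 2, r=2)
assert np.isclose(var_u, np.var(data, ddof=1))
```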

  
