You will write a 2-page review/abstract/summary of an article from a peer-reviewed scholarly journal. This is to assess your ability to select and summarize

the research of others, analyze and apply the research of others, and communicate

professionally and effectively in that regard. However, the most important

rationale for this assignment is for you to see how statistical analysis is presented.

Instructions:

1. Use the library system (https://www.ucumberlands.edu/library) or online catalog to

locate a journal article that pertains to your research, thesis, or area of

interest. The article you choose should have performed some statistical

analysis of gathered data and made an inference using something other than

just the average (mean). That is, they can’t just talk about averages; they

must have used one of these tests: t-test, chi-square, F-test, Fisher test,

ANOVA, MANOVA, ANCOVA, Mann-Whitney, correlation, regression.

2. Read the article thoroughly.

3. Write a 2-page summary about the article following the given guidelines.

Detailed Guidelines:

1. Your review should include:

a. The question/problem being researched by the author

b. The experiment that will answer the question

c. How they collected data

d. Analysis of the data (Must identify the statistical test used)

e. Their conclusion or findings

2. Your review can be single spaced or double spaced in at least 11 pt font.

3. You should make a reference page to list the one article you chose and

practice APA format. NOTE: The APA Guidelines address specifically how the

reference page is titled and how articles are cited. Note: It is not called

Bibliography.

4. Graphs or visual representations are not needed for this assignment.

5. A title page is not required, but you may include one if you want to practice

APA style. If you have a title page, your submission is 4 pages. If you do not

have a title page, your submission is 3 pages. Do NOT go over.

Grading:

Your grade on these assignments will be based on the rubric below. FOLLOW THE

GUIDELINES above to earn full credit for the assignment. Failure to follow

instructions, include all requested pieces, or keep within the page limit will result in a

loss of points.

Rubric:

ON DATA-DRIVEN CHI SQUARE

STATISTICS

by

Huiyu Qian

Presented to the Graduate and Research Committee

of Lehigh University

in Candidacy for the Degree of

Doctor of Philosophy

in

Mathematics

Lehigh University

April, 2009

UMI Number: 3354754


© Copyright by Huiyu Qian, 2009


Approved and recommended for acceptance as a dissertation in partial fulfillment of the

requirements for the degree of Doctor of Philosophy.

Prof. Wei-Min Huang

Dissertation Director

Huiyu Qian

On Data-Driven Chi Square Statistics

Prof. Bennett Eisenberg

Dissertation Coadvisor

Date

Committee Members:

Prof. Ping-Shi Wu

Accepted Date

Prof. D. Gary Harlow

Mechanical Engineering Dept.

Lehigh University


Acknowledgments

First of all, I would like to express my deep gratitude and appreciation to my advisor, Wei-Min Huang, and my coadvisor, Bennett Eisenberg, for their generous time and commitment

throughout my doctoral work. They encouraged me to develop independent analytical

thinking and research skills and also greatly assisted me with my scientific writing. Without

their guidance and patience, this dissertation would not have been possible. I enjoyed and

learned a lot in their lectures and academic meetings: not only mathematical and statistical knowledge, but also how to teach and how to generate new ideas.

I would also like to thank the members of my committee, Ping-Shi Wu and D. Gary

Harlow, for their helpful comments and encouragement. I would extend many thanks to

Christine Banzoff and especially Mary Ann Dent for working extremely hard to make our

department like a big family. Thanks to the graduate students who left before me, for their

kind support and friendship. I would particularly thank John Frommeyer and Francisco

Ojeda for their help while I was working in the Writing and Math Center.

Finally, I am grateful to my family members, especially my parents, Ping Lu and Weijiang Qian, for their constant love, understanding, and support during difficult times.


Contents

Acknowledgments  iv
List of Tables  viii
List of Figures  ix
Abstract  1
1 Introduction  2
2 Construction and Preliminaries of 2DDCS Statistics  9
  2.1 2DDCS Statistics on a Line  10
  2.2 Minimum Cell Length  16
  2.3 Related Stochastic Processes  19
  2.4 Set of Possible Cutpoints  21
  2.5 Circular 2DDCS Statistics  25
3 Null Distribution of the Cutpoints  31
  3.1 Cutpoint of One-Sided KS  32
  3.2 Cutpoint of Two-Sided KS  38
  3.3 Cutpoint of One-Sided 2DDCS  40
  3.4 Cutpoint of Two-Sided 2DDCS  45
  3.5 Cutpoint of Circular 2DDCS  52
4 Null Distributions of 2DDCS Statistics  56
  4.1 KS Statistics  57
  4.2 One-Sided 2DDCS Statistics  58
  4.3 Two-Sided 2DDCS Statistics  60
  4.4 Circular 2DDCS Statistics  65
5 Power Study  76
  5.1 Power Function of Two-Sided 2DDCS Tests  77
  5.2 Optimality of Circular 2DDCS Tests  79
  5.3 Simulation Results  81
6 Applications  97
  6.1 Goodness-of-Fit with Real Data  98
  6.2 Test Statistics with Unknown Parameters  103
    6.2.1 Parameter Estimation  106
    6.2.2 Methodology  108
    6.2.3 Simulation Results  110
    6.2.4 Critical Values  111
    6.2.5 Power Study  112
  6.3 Other Applications  115
    6.3.1 Optimal Cutpoint and Two Sample Tests  115
    6.3.2 Possible Application in Regression Without Replications  116
7 Conclusion  118
A Claims and Theorems  124
  A.1 Set of Possible Cutpoints  124
  A.2 Cutpoint of One-Sided KS  127
  A.3 Consistency of Two-Sided 2DDCS Tests  129
B Approximation of Powers  132
  B.1 Neyman-Pearson Test  132
  B.2 Likelihood Ratio Test  135
  B.3 Pearson 2CS Tests  136
C An Example: Power Function of Two-Sided 2DDCS  140
VITA  145

List of Tables

4.1 Critical Values of 2DDCS Statistics on a Line, n=200 and k=5*1000  64
4.2 Critical Values of 2DDCS Statistics on a Line, n=400 and k=5*1000  64
4.3 CDF of the circular 2DDCS statistic when n=10  71
4.4 Critical Values of Circular 2DDCS Statistics  74
4.5 Cumulative Distribution of N and the Circular 2DDCS Statistic, n=50  74
4.6 Cumulative Distribution of N and the Circular 2DDCS Statistic, n=100  74
5.1 Cutoff Values for Power Comparison I, alpha=.05, n=100, K=5000  86
5.2 Power Comparison I, alpha=.05, n=100, K=5000  87
5.3 Parameters of the Alternatives  90
5.4 Simulated Upper 5 Percentiles, n=100, K=5000  95
5.5 Power Comparison II, alpha=.05, n=100, K=5000  96
6.1 Simulated Critical Values (Samples Taken from Weibull(2,1)), n=10  111
6.2 Simulated Critical Values (Samples Taken from Weibull(2,1)), n=100  111
6.3 Simulated Powers of Testing Weibull(a,b), n=100, K=1000, alpha=.05  115

List of Figures

1.1 Example 1 of Regular Chi-Square Tests  6
2.1 Pearson's CS test for uniformity: various choices of cells under different alternatives  15
2.2 Pearson's 2CS Statistic as a Function of t  23
3.1 Frequency Histogram of V10+, k = 5000  35
3.2 Frequency Histogram of V2 (5000 replications)  40
3.3 Frequency Histogram of V50 (5000 replications)  41
3.4 Density of t̂+, ε = 0, n = 2  44
3.5 Frequency Histogram of t̂+ with ε = .05 (n = 50, 5000 replications)  46
3.6 Regions for X(1)(y) and X(2)(x)  48
3.7 Null Distribution of t̂ When n = 2  50
3.8 Frequency Histogram of t̂ε=0 (n = 3, 5000 replications)  51
3.9 Density Histogram of t̂c When n = 10, K = 5000  55
4.1 X²(t̂), n = 1000, ε = .01, K = 5000  66
4.2 Density Histogram of X²(t̂) with Fitted Curve, n = 1000, K = 5000  67
4.3 Probability Histogram of Circular 2DDCS Statistics, n = 200, K = 5000  75
5.1 Line Bounds a Curved Boundary  80
5.2 Power Comparison: Uniform vs. Linear Alternative  83
5.3 Densities of g1s and g2s  91
5.4 Densities of g5s and g6s  92
6.1 Density Histogram of the Galaxies Data  101
6.2 Circular Plot of the 76 Turtles' Orientations  102
6.3 Critical Values with Different Significance Levels and Sample Sizes  113

Abstract

Pearson chi square tests have been very popular because they are intuitive, natural and

easy to carry out for most categorical data sets. However, the construction of the cells

has to be determined when the population is continuous. Moreover, the power of such an

arbitrarily selected chi square test for continuous data is very unstable and depends on the

choice of the cells. We propose several data-driven chi square tests in which the choice of

cells is based on the data itself. Two-cell data-driven chi square tests for data on a line

and on a circle are our main concerns. For data on a line, the tests require a minimum cell

length ε to avoid singularity. We study how to choose the proper value of ε and the set of

possible cutpoints. For directional data, we show that the circular two-cell data-driven chi

square test with equal cell lengths is equivalent to Ajne’s N test. By comparing with several

related tests, we find that our proposed tests are more powerful for a generic alternative than a particular Pearson chi square test whose cells are chosen without investigating the data. Examples of applications of the methods are also given.


Chapter 1

Introduction

Pearson’s chi square statistic was introduced by Karl Pearson in 1900. It measures the

discrepancy between the data and the proposed model, which is called the null hypothesis

in a hypothesis test. The statistic is written as

χ² = Σ_{i=1}^{k} (Y_i − E_i)² / E_i .    (1.1)

It is the weighted sum of squares of the difference between the empirical frequency Yi and

the expected frequency Ei in each cell, group or category under the null hypothesis. Let

X1 , …, Xn be a random sample of n independent observations from a common population

with distribution function F . The chi square test was originally used to test the null

hypothesis that the data follows a specific distribution F0 versus a general alternative:

H0 : F(t) = F0(t) for all t ∈ R
Ha : F(t) ≠ F0(t) for some t.

Such a test based on the statistic (1.1) is called “Pearson’s chi square test for goodness-of-fit”. A similar test statistic to (1.1) is also used for a two-sample test of homogeneity or


independence. Since the idea and procedure of the three types of chi square tests are all

alike, without loss of generality, we concentrate on the test of goodness-of-fit. When the

expected frequency in each cell is large, it is a well known result that the statistic (1.1)

has approximately a chi square distribution with k − 1 degrees of freedom, which is usually

denoted as χ2k−1 . The null hypothesis is rejected when the observed value of the chi square

statistic is extreme. However, a small observed value of the chi square statistic does not

imply that the data is from the proposed model for continuous populations because the

selection of the cells for the test might hide the difference between the two distributions F

and F0 .

One of the reasons that a chi square goodness-of-fit test has been widely used is that

the statistic (1.1) can be used for any univariate distribution. The null distribution could

be discrete, continuous or mixed. We may also modify or tailor the statistic (1.1) when

the hypothesized distribution is not fully specified. The test is naturally carried out when

the data is grouped. In the continuous case, however, the construction of the cells has to

be determined before the calculation of the statistics. In the literature, there has been a

lot of discussion about the choice of the number of cells and the cell sizes. But there is no

explicit solution to the problem of constructing the cells such that the corresponding test will

be powerful for a general continuous alternative. For instance, Mann and Wald (1942)[20]

study the number of equiprobable disjoint cells for a given sample size and significance level.

In their result, the optimal number of cells is about n^{2/5}, where n is the sample size. They

show that the corresponding chi square test is unbiased and the test statistic can be closely

approximated by a chi square distribution. However, Kallenberg et al. (1985)[16] argue

that by taking unequiprobable cells one gets much more power for heavy-tailed alternatives


than by applying the Mann and Wald’s rule. Moreover, Hall (1985)[13] introduces a chi

square type statistic with overlapping cells and Inglot et al.(2003)[15] study a data-driven

chi square statistic. Here “data-driven” means the construction of the cells is driven by

the given data set. Neither Hall’s nor Inglot’s test statistic necessarily follows an

asymptotic chi square distribution under the null hypothesis. The main issues involved

are (i) the number of cells, (ii) equiprobable or not, (iii) overlapping or not, and (iv) data-driven or not. We propose data-driven chi square type statistics with disjoint cells and we

focus on the two-cell case. Let X(1) , …, X(n) be the order statistics of a random sample of

n independent observations from a common population with distribution function F0 . In

addition, let T_(i) = F0(X_(i)), i = 1, 2, …, n. Then, with the assumption of the continuity of F0(x), T_(1), T_(2), …, T_(n) is an ordered sample of n i.i.d. observations from the uniform

distribution on the unit interval. Therefore, we may focus on the goodness-of-fit test for

uniformity over [0, 1], that is,

H0 : F (t) = t, t ∈ [0, 1] .
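As a numerical aside (not part of the original text), the reduction to uniformity is easy to sketch; the Exponential(1) null, the sample size, and the seed below are illustrative choices:

```python
import math
import random

# Sketch of the probability integral transform: if X ~ F0 with F0 continuous,
# then T = F0(X) is Uniform(0, 1), so testing X1, ..., Xn against F0 reduces
# to testing T1, ..., Tn for uniformity on [0, 1].
random.seed(1)
F0 = lambda x: 1.0 - math.exp(-x)              # CDF of Exponential(1), an illustrative null
xs = [random.expovariate(1.0) for _ in range(2000)]
ts = sorted(F0(x) for x in xs)                 # ordered sample T_(1) <= ... <= T_(n)
mean_t = sum(ts) / len(ts)                     # close to 1/2 when H0 holds
```

Under H0 the transformed values behave like ordered uniforms on [0, 1], so any test of uniformity applies to them directly.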

Given a continuous data set, our concern is how to construct the cells of a Pearson chi

square (CS) statistic based on the data. Different constructions may lead to totally opposite

decisions as shown in the following example:

Example 1. Given the following data set with a sample size of 30:

.091 .162 .184 .309 .314 .329 .352 .359 .393 .404 .428 .474 .480 .512 .545 .547 .552 .556 .562

.563 .578 .591 .626 .627 .633 .656 .694 .766 .772 .850,

we test if the data are taken from a uniform distribution, that is, H0 : F (t) = t, for any

t ∈ [0, 1] , by carrying out a Pearson chi square test.


What a statistical analyst will typically do is to cut the unit interval into two or

three equiprobable cells. Suppose that analyst A would like to use two cells, [0, 1/2] and

(1/2, 1], while analyst B decides to have three cells [0, 1/3], (1/3, 2/3], (2/3, 1]. Then analyst

A will get the value of the chi square statistic

(13 − 15)²/15 + (17 − 15)²/15 = .533,

which is smaller than the cutoff value at level .05 (χ²_{1,.05} = 3.841). Therefore he will not reject the null hypothesis at level .05. However, analyst B will conclude to reject uniformity over [0, 1]

at the same significance level because his observed value is

(6 − 10)²/10 + (20 − 10)²/10 + (4 − 10)²/10 = 15.2,

and it is much larger than the corresponding cutoff value (χ²_{2,.05} = 5.991). The frequency

histograms of the data with two and three equiprobable bins under the null are shown

respectively in Figure 1.1. The shaded regions denote the frequency deviations of the data

from uniformity. There is a significant deviation in the three-bin case but not in the two-bin

case. Which test should we take? In this case, we may take the three-cell test since it shows

an extremely large discrepancy between the data and the uniform distribution. However,

given a different data set, we may construct the cells differently. Thus a good choice of the

cells is data-dependent.
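The two computations above can be reproduced directly; the sketch below (the helper name pearson_cs_uniform is ours, not the dissertation's) recovers .533 and 15.2 from the listed data:

```python
# Example 1 revisited: the same 30 observations tested for uniformity on [0, 1]
# with two equiprobable cells (analyst A) and three equiprobable cells (analyst B).
data = [.091, .162, .184, .309, .314, .329, .352, .359, .393, .404,
        .428, .474, .480, .512, .545, .547, .552, .556, .562, .563,
        .578, .591, .626, .627, .633, .656, .694, .766, .772, .850]

def pearson_cs_uniform(sample, k):
    """Pearson chi square statistic for k equiprobable cells on [0, 1]."""
    n = len(sample)
    expected = n / k
    stat = 0.0
    for i in range(k):
        lo, hi = i / k, (i + 1) / k
        # first cell is closed on the left, the rest are left-open, as in the text
        observed = sum(1 for x in sample if lo < x <= hi or (i == 0 and x == lo))
        stat += (observed - expected) ** 2 / expected
    return stat

stat_a = pearson_cs_uniform(data, 2)   # 8/15 ≈ .533 < 3.841: A does not reject
stat_b = pearson_cs_uniform(data, 3)   # 15.2 > 5.991: B rejects
```

The same data thus yield opposite decisions depending only on the cell construction, which is the point of the example.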

Our main interest is to find a good construction of the cells based on the given data

such that the corresponding chi square type test is relatively powerful under a broad class

of alternatives. In other words, the proposed test should be able to detect the deviation of

the given data from the null hypothesis even when we do not have much information

about the underlying distribution. We concentrate on tests with disjoint cells and a fixed

significance level α.

Figure 1.1: Example 1 of Regular Chi-Square Tests (frequency histograms of the data with two and three equiprobable bins under the null)

Intuitively, the choice of the cells should maximize the discrepancy

between the alternative and the null to raise the power of a chi square test.

In Chapter 2, we give a detailed derivation of the selection rule of the cells for a chi square

test. We call the derived statistic the “Data-Driven Chi Square” statistic and it is denoted by

“DDCS” throughout this dissertation. In the simplest case when the number of cells is fixed

to be two, the resulting statistic is called a two-cell data-driven chi square (2DDCS) statistic.

It is similar to the “Maximally Selected Chi-Square” (MSCS) statistic of Miller and Siegmund (1982)[21] and to Inglot and Janic-Wroblewska’s “Data Driven Chi-Square Test”, which we call

“IJDDCS”. All three are “Data-Driven” type chi square statistics. However, the MSCS test

is a two-sample test for homogeneity while our 2DDCS is a goodness-of-fit test. For any

given sample, the IJDDCS test allows more than two cells and the set of possible cutpoints

is fixed and finite while the 2DDCS test has two cells only and the set of possible cutpoints

is random and data dependent. To generalize the method for applications in directional

data, we propose a circular 2DDCS statistic with the edges wrapped around. For each

sample, we define the “cutpoint” t̂ to be the cutpoint of the 2DDCS statistic chosen based

on the data. Thus, a different sample may come up with a different cutpoint. Basically, the

cutpoint is the point where the corresponding chi square statistic is maximized. A drawback

of such a method is that an appropriate minimum cell length ε has to be determined to

avoid singularity.

In the third chapter, we consider the null distributions of the cutpoints for the 2DDCS

and related statistics. It turns out that the cutpoints of the 2DDCS tests on a line are

less likely to occur near the center under the null. For comparison, we also discuss the null

distribution of the cutpoint for a one-sided or two-sided Kolmogorov-Smirnov (KS) statistic.


The cutpoint of a circular 2DDCS statistic with fixed cell length is defined in such a way that

it is uniformly distributed over (0, 1) under the null.

In Chapter 4, we discuss the null distributions of the 2DDCS statistics. We adapt the

asymptotic results on the null distribution of the MSCS statistic to the 2DDCS statistic on

a line. The critical values depend on the value of the minimum cell length ε. The circular

2DDCS statistic with a fixed cell length .5 is shown to be equivalent to Ajne’s N test.

Power studies in Chapter 5 show that the proposed 2DDCS tests are more robust than

an arbitrarily selected Pearson 2CS test and have higher power under a generic alternative

than the IJDDCS and KS tests. The 2DDCS tests on a line can be applied to any continuous

data when a Pearson 2CS test is appropriate. The tests can be tailored for a goodness-of-fit

test with unknown parameters. Examples of the applications are given in Chapter 6.


Chapter 2

Construction and Preliminaries of 2DDCS Statistics

Let X1 , …, Xn be a random sample of n i.i.d. random variables from a common distribution

F on the unit interval [0, 1] and let X(1) , …, X(n) be the order statistics from smallest to

largest. The empirical cumulative distribution function (ECDF) Fn (t) is defined as the

proportion of the observations Xi that are ≤ t. In this chapter, we develop several

2DDCS statistics with which the corresponding tests will most likely make correct decisions

for a broad class of alternatives. In Section 2.1, we determine the selection rule of disjoint

cells based on the given data when the number of cells k is fixed. In Section 2.2, we conclude

that a minimum cell length ε is required for a 2DDCS statistic. Section 2.3 contains some

related results of stochastic processes for an asymptotic analysis of the 2DDCS statistics.

Moreover, we find that the set of possible cutpoints for the 2DDCS statistics on a line is the

set of all observations in Section 2.4. By such a claim, we rewrite the definitions of the test

statistics in terms of the order statistics {X(1) , …, X(n) } and we define the one-sided and


two-sided 2DDCS statistics for data on the real line. At the end of this chapter we propose

a wrapped-around 2DDCS statistic which can be used to test directional data.
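As a minimal sketch (the helper name ecdf is ours, not the dissertation's), the ECDF just defined can be computed as:

```python
import bisect

def ecdf(sample):
    """Return the empirical CDF Fn of a sample:
    Fn(t) = (number of observations <= t) / n."""
    xs = sorted(sample)
    n = len(xs)
    def Fn(t):
        return bisect.bisect_right(xs, t) / n   # count of xs <= t
    return Fn

Fn = ecdf([0.2, 0.4, 0.6])
# Fn is a right-continuous step function taking values in {0, 1/3, 2/3, 1}
```

Fn jumps by 1/n at each observation, which is why the statistics below are driven entirely by the order statistics.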

2.1 2DDCS Statistics on a Line

The main problem of a Pearson chi square (CS) goodness-of-fit test for continuous data is

that the power of the test is not stable. That is, any fixed construction of the cells is good for

certain groups of alternatives, but it could be a bad choice for some others. The solution to

this problem is to choose the cells based on the data. In this section we derive a data-driven

chi square type statistic. The cells of interest are two disjoint cells covering [0, 1] and the

hypothesized distribution is the uniform distribution. However, we start the discussion in a

more general case with the number of cells not necessarily two and with a general continuous

null distribution. Let k ≥ 2 be the number of disjoint cells and t = (t1 , …, tk−1 ) ∈ Rk−1 be

the vector of cell cutpoints. We suppose that the number of cells k is determined before

choosing the cell cutpoints t. Given the sample X = {X1 , …, Xn }, we need to test H0 : the

data is taken from a distribution with CDF F0 . The idea of “data-driven” is to select the

statistic with the “right” cell cutpoints t among those in the collection:

X²(t) = Σ_{i=1}^{k} [Y_i(t) − np_i(t)]² / ( np_i(t) ),    (2.1)

where t = (t1, …, tk−1) ∈ R^{k−1} with −∞ = t0 < t1 < ⋯ < tk−1 < tk = ∞, Y_i(t) is the number of observations falling in the i-th cell (ti−1, ti], and p̃(t) = (p̃1(t), …, p̃k(t)) are the cell probabilities according to the alternative F1. Therefore

p̃_i(t) = F1(ti) − F1(ti−1) , i = 1, 2, …, k.

Similarly, let the cell probabilities under the null hypothesis be

p_i(t) = F0(ti) − F0(ti−1) , i = 1, 2, …, k,

which reduces to p_i(t) = ti − ti−1 for the uniform null.

The form of the DDCS test statistic is similar to the Pearson CS statistic X 2 (t) except that

the cell cutpoints vector t̂ is chosen based on the data. But the cutpoints t for a Pearson CS

test should be selected without looking at the details of the data. We are looking for

the cutpoints with which the corresponding CS statistic will be able to detect the deviation

of the data from the null hypothesis as much as possible. In other words, we choose the

cutpoints such that the corresponding Pearson CS test is the most powerful test in the

collection (2.1). Therefore, the selection rule has to be derived from the power function of

a Pearson CS test with k cells. One solution is to apply the asymptotic theory of the chi

square statistics X 2 (t). However, for a fixed pair of null and alternative distributions, the

asymptotic power of any Pearson CS test is either 1 or 0 as shown by Neyman (1949)[23].

In order to make the asymptotic power neither 0 nor 1, the alternative has to be special. Thus, we may let the alternative differ from the null by an amount that shrinks as n grows, as Cochran (1952)[5] discussed. We consider the following

sequence of local alternatives,

p̃_i(t) = p_i(t) + δ_i(t)/√n , i = 1, 2, …, k − 1,

where p̃_i(t) is the i-th cell probability under the alternative, the quantities δ_i(t) remain fixed as n increases, and the vector t is fixed as well.

It is well known that the asymptotic power function of the Pearson CS statistic X²(t), as defined in (2.1), is

P( X²(t) ≥ c(d, α) | H1 ) ≈ P( χ²_d(λ(t)) ≥ c(d, α) ),

where c(d, α) is a constant such that P( χ²_d ≥ c(d, α) ) = α and χ²_d(λ(t)) is a noncentral chi square random variable with λ(t) = Σ_{i=1}^{k} δ_i(t)² / p_i(t) and d = k − 1. By the approximation given by Patnaik (1949)[24],

P( χ²_d(λ(t)) ≥ c(d, α) ) ≈ 1 − Φ( (c(d, α) − d − λ(t)) / √(2d + 4λ(t)) ).

Therefore we conclude that (c(d, α) − d − λ(t)) / √(2d + 4λ(t)) needs to be minimized to maximize the power of a Pearson CS test. This is equivalent to maximizing the noncentrality parameter λ(t), since both the critical value c(d, α) and the degrees of freedom d are fixed. Moreover, the noncentrality parameter

λ(t) = Σ_{i=1}^{k} δ_i(t)² / p_i(t)

is a measure of discrepancy between the null hypothesized distribution F0 and the alternative distribution F1.
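To make the role of λ(t) concrete, here is a small numerical sketch; the alternative CDF F1(x) = x², the sample size n = 100, and the cutoff c = 3.841 (d = 1, α = .05) are illustrative assumptions, not values from the text:

```python
import math

def patnaik_power(lam, c, d):
    """Patnaik-style normal approximation of the noncentral chi square tail:
    P(chi2_d(lam) >= c) ~ 1 - Phi((c - d - lam) / sqrt(2d + 4*lam))."""
    z = (c - d - lam) / math.sqrt(2 * d + 4 * lam)
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def noncentrality_two_cells(t, F1, n):
    """lam(t) = n * sum_i (p~_i(t) - p_i(t))^2 / p_i(t) for a two-cell test
    of uniformity with cutpoint t against an alternative CDF F1."""
    p = (t, 1.0 - t)                  # null (uniform) cell probabilities
    q = (F1(t), 1.0 - F1(t))          # cell probabilities under the alternative
    return n * sum((qi - pi) ** 2 / pi for pi, qi in zip(p, q))

F1 = lambda x: x * x                  # illustrative alternative: density 2x on [0, 1]
lam = {t: noncentrality_two_cells(t, F1, n=100) for t in (0.25, 0.5, 0.75)}
pow_ = {t: patnaik_power(lam[t], c=3.841, d=1) for t in lam}
# for this alternative, lam is largest at t = 0.5, and so is the approximate power
```

For this particular alternative the cutpoint t = 0.5 gives the largest noncentrality among the three candidates, illustrating why maximizing λ(t) is the natural target for choosing cells.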

Just for reference, there are a couple of other formulas available in the literature to approximate the power P( χ²_d(λ(t)) > c(d, α) ). These approximations are all monotonically increasing with respect to λ(t), so they lead to the same conclusion. Two examples are:

(1) Johnson (1959): 1 − Φ( (c(d, α) − d + 1 − λ(t)) / √(d + λ(t)) ), and

(2) Sankaran (1963): Φ( √(2λ(t) − 1 + d) − √(2c(d, α) + 1 − d) ), for d ≤ 2c(d, α) + 1.

The conclusion agrees with our graphical intuition shown in Figure 2.1. A good choice

of cells can maximize the chi square type discrepancy between the null and the alternative

distributions. On the other hand, an inappropriate construction of cells might hide the

deviations of the alternative from the null. The unit horizontal line represents the null

distribution and the vertical lines are the cutting lines to form the cells. The slant lines

are the densities of the alternative distributions. Notice that here we are discussing the

cases regarding the null and the alternative distributions without including any data. The

examples show that the effectiveness of detecting the difference of the null and alternative

depends a lot on our choice of the cells. That is, a Pearson CS test might have no power at

all under certain alternatives such as case (a) in Figure 2.1, but it can be really powerful

under other alternatives like case (c). On the other hand, given the alternative (or the

data), whether the decision of a blindly picked Pearson CS test is correct or not depends

on the construction of the cells as shown in cases (a), (b) and (d).

The agreement of the cell areas under the null and the alternative does not imply the

similarity of the hypotheses and that is why we usually conclude “do not reject the null”

instead of “accept the null” even when a very small value of the test statistic is observed.

That is, a conclusion of “do not reject the null” for a Pearson CS test does not mean the

data really follow the null distribution. Conversely, if the test concludes “reject the null”,

it is appropriate to say the data do not follow the proposed distribution as in cases (b), (c)

and (d). Therefore, we may conclude rejection of the null if two Pearson CS tests make

opposite conclusions as in Example 1. But it is not that simple and that is the reason

we have to modify the Pearson chi square tests. The cases in Figure 2.1 also indicate


that the construction of the cells for a powerful chi square test has to characterize the

null and alternative distributions to display the discrepancy between them when we have

specific information about the alternative distribution. However, the alternative is usually

not known in reality and so a modification of a Pearson chi square test should be made

to determine the cells based on the data and the hypothesized distribution such that the

difference can more likely be detected.

However, in real applications of a chi square test, the alternative distribution is not

known and λ(t) cannot be calculated, so maximizing λ(t) is not the appropriate selection

rule for the cells. Notice that the noncentrality parameter λ (t) can also be written as

λ(t) = Σ_{i=1}^{k} ( np̃_i(t) − np_i(t) )² / ( np_i(t) ),

where p̃_i(t) is the i-th cell probability under the alternative, which can be approximated by the Pearson chi square statistic with cell cutpoints t,

X²(t) = Σ_{i=1}^{k} ( Y_i(t) − np_i(t) )² / ( np_i(t) ).

Therefore, if t̂ is the vector of cutpoints maximizing the value of X²(t), then it approximately maximizes λ(t) as well, and so the Pearson CS test based on X²(t̂) will more likely be a powerful Pearson k-cell CS test for this specific population than others. We have to be aware that only when we pretend t̂ is randomly selected, without including any information about the sample, does X²(t̂) have a limiting chi square distribution under the null, and so the actual power of X²(t̂) is usually a little lower than the highest possible power in the collection (2.1). It is the trade-off for a test to be powerful in general rather than being very powerful locally under some alternatives but not at all under some others.

For any fixed number of cells $k \ge 2$, we may select the cutpoints $t$ which maximize the chi square statistic $X^2(t)$ with respect to $t$. However, since the cells are left-open, the absolute maximum value of $X^2(t)$ might not exist, and so we define the DDCS statistic to be the supremum of $X^2(t)$:

\[ X^2(\hat{t}) = \sup_{t} X^2(t). \]

[Figure 2.1: Pearson's CS test for uniformity: various choices of cells under different alternatives, panels (a)-(d)]

When the number of cells is fixed to be two and we test the data for uniformity over the unit interval, we get the following statistic:

\[ X^2(\hat{t}) = \sup_{t \in (0,1)} X^2(t) = \sup_{t \in (0,1)} \frac{\left(nF_n(t) - nt\right)^2}{nt(1-t)}, \tag{2.2} \]

where $X^2(t)$ denotes the 2-cell chi square (2CS) statistic $\frac{n\left(F_n(t) - t\right)^2}{t(1-t)}$ with cutpoint $t$, which is equivalent to $\frac{\left(nF_n(t) - nt\right)^2}{nt(1-t)}$. A test based on the statistic (2.2) is called a 2DDCS test.
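As a rough illustration of (2.2) (our own sketch, not from the thesis; the function names and the grid-search approximation are ours), the 2CS statistic can be evaluated over a grid of cutpoints and its supremum approximated numerically. A sample far from uniform yields a much larger 2DDCS value than a uniform one.

```python
import random

def two_cell_cs(sample, t):
    """2CS statistic X^2(t) = n (F_n(t) - t)^2 / (t (1 - t)) for testing uniformity."""
    n = len(sample)
    fn = sum(x <= t for x in sample) / n   # empirical CDF F_n(t)
    return n * (fn - t) ** 2 / (t * (1 - t))

def ddcs_grid(sample, grid_size=2000):
    """Approximate the 2DDCS statistic sup_t X^2(t) by a grid search over (0, 1)."""
    return max(two_cell_cs(sample, k / grid_size) for k in range(1, grid_size))

random.seed(1)
uniform_sample = [random.random() for _ in range(20)]
skewed_sample = [random.betavariate(5, 1) for _ in range(20)]  # concentrated near 1
x2_unif = ddcs_grid(uniform_sample)
x2_skew = ddcs_grid(skewed_sample)
```

The grid search is only an approximation of the supremum; a later result in this section shows the supremum is actually attained (or approached) at the observation points.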

The optimal cutpoint $t^*$ of a 2DDCS statistic is defined as

\[ t^* = \inf \operatorname{Arg\,sup}_{t \in (0,1)} \frac{\left(nF(t) - nt\right)^2}{nt(1-t)}, \tag{2.3} \]

where F (t) is the underlying distribution of the data.

2.2 Minimum Cell Length

For a 2CS statistic $X^2(t) = \frac{\left(nF_n(t) - nt\right)^2}{nt(1-t)}$, the denominator $t(1-t)$ cannot be 0 or too close to 0; otherwise, the value of the 2CS statistic will be very large as $t$ gets very close to 0 or 1. To avoid this singularity, we set up a minimum cell length $\varepsilon$ in Definition 2.2 and define a new statistic (2.4) that incorporates the minimum cell length:

\[ X^2(\hat{t}_\varepsilon) = \sup_{t \in (\varepsilon,\, 1-\varepsilon)} \frac{\left(nF_n(t) - nt\right)^2}{nt(1-t)}, \tag{2.4} \]

where ε is a small positive real number and called the minimum cell length. The appropriate

minimum cell length ε is dependent on the sample size, the underlying distribution of the

data and the null hypothesis. Larger sample size may allow smaller minimum cell length.

But what is an appropriate value of $\varepsilon$ given a sample size $n$ without any other information? Our goal in this section is to find a proper value of $\varepsilon$ such that the 2DDCS test will have a small chance of picking a cutpoint $\hat{t}$ near the two ends when the optimal cutpoint $t^*$ (2.3) is not there.

Since $nF_n(t)$ is a Binomial random variable with parameters $n$ and $t$ if the sample is taken from the null uniform distribution, the expected value of the 2CS statistic $X^2(t)$ is the constant 1 for any $t$. Moreover, $\operatorname{Var}\left[\left(nF_n(t) - nt\right)^2\right]$ is

\[ E\left[\left(\left(nF_n(t) - nt\right)^2\right)^2\right] - \left(E\left[\left(nF_n(t) - nt\right)^2\right]\right)^2, \]

where the first term is the fourth moment about the mean $nt$ under the null,

\[ E\left[\left(\left(nF_n(t) - nt\right)^2\right)^2\right] = nt(1-t)\left[3t^2(2-n) + 3t(n-2) + 1\right], \]

and the second term in the function $v_n(t)$ is the squared variance of $nF_n(t)$,

\[ \left(E\left[\left(nF_n(t) - nt\right)^2\right]\right)^2 = \left[nt(1-t)\right]^2 = n^2 t^2 (1-t)^2, \]

so we have

\[ \operatorname{Var}\left[\left(nF_n(t) - nt\right)^2\right] = nt(1-t)\left[-(2n-6)t^2 + (2n-6)t + 1\right]. \]

The variance of $X^2(t)$,

\[ \operatorname{Var}\left(X^2(t)\right) = \frac{-(2n-6)t^2 + (2n-6)t + 1}{nt(1-t)}, \]

can be simplified as

\[ \operatorname{Var}\left(X^2(t)\right) = 2 + \frac{6t^2 - 6t + 1}{nt(1-t)}. \]

Therefore, the 2CS statistic has the following properties under the null hypothesis:

1. $E\left(X^2(t)\right) = 1$, for any $t$.

2. $\operatorname{Var}\left(X^2(t)\right)$ goes to infinity as $t$ goes to 0 or 1.

3. The value of $\operatorname{Var}\left(X^2(t)\right)$ depends on the sample size $n$.

4. When $n \to \infty$,

   i) $\operatorname{Var}\left(X^2(t)\right) \to 2$ if $nt \to \infty$.

   ii) $\operatorname{Var}\left(X^2(t)\right) \to 2 + 1/\lambda$ if $nt \to \lambda$, a positive real number.

   iii) $\operatorname{Var}\left(X^2(t)\right) \to \infty$ if $nt \to 0$.

5. $\operatorname{Var}\left(X^2(t)\right)$ has a minimum point at $t = 1/2$.

To make sure that the value of the 2CS statistic $X^2(t)$ will most likely stay in a reasonable range, we restrict the cutpoint $t$ to a truncated interval $[\varepsilon, 1-\varepsilon]$ instead of $(0,1)$.
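These moment formulas can be checked by a small Monte Carlo experiment (a sketch of our own; the function name and parameter choices are ours, not from the thesis):

```python
import random

def two_cell_cs(sample, t):
    """2CS statistic X^2(t) = n (F_n(t) - t)^2 / (t (1 - t))."""
    n = len(sample)
    fn = sum(x <= t for x in sample) / n
    return n * (fn - t) ** 2 / (t * (1 - t))

random.seed(2)
n, t, reps = 50, 0.3, 5000
vals = [two_cell_cs([random.random() for _ in range(n)], t) for _ in range(reps)]
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
# Theoretical values under the null: E[X^2(t)] = 1 and
# Var(X^2(t)) = 2 + (6 t^2 - 6 t + 1) / (n t (1 - t)).
theory_var = 2 + (6 * t * t - 6 * t + 1) / (n * t * (1 - t))
```

The simulated mean and variance should land near 1 and near `theory_var`, up to Monte Carlo error.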

Remark 1 A minimum cell length $\varepsilon$ is required for a 2DDCS test. When the sample size is larger, the minimum cell length is required to be at least as large as $\lambda/n$, where $\lambda$ is a positive real number.

Moreover, the cutpoint $\hat{t}$ is more likely to occur at the tails than at the median. On the other hand, the two-sided KS statistic $D_n$ is equivalent to

\[ D_n^* = \sup_{t \in (0,1)} \left(nF_n(t) - nt\right)^2, \]

where the statistic $\left(nF_n(t) - nt\right)^2$ being maximized is the numerator of the 2CS statistic. The expected value of $\left(nF_n(t) - nt\right)^2$ under the null is $nt(1-t)$, a downward-opening parabola with its unique maximum point at $t = .5$, and the variance of $\left(nF_n(t) - nt\right)^2$ under the null has a unique maximum point at $t = .5$ as well. Thus the cutpoint for a two-sided KS statistic is more likely to occur at the median than at the tails, and it is not necessary to truncate the interval for the value of $t$.

2.3 Related Stochastic Processes

The 2DDCS statistic (2.4) involves the empirical CDF $F_n$ of a random sample from the uniform distribution over $[0,1]$ (the distribution is denoted "uniform(0,1)"). It is well known that the distribution of a random sample of size $n$ from the uniform distribution over $[0,1]$ is the same as $\{\{P_n(t)\} \mid P_n(1) = n\}$, where $\{P_n(t)\}$ is the Poisson process with occurrence rate $n$ and jumps of $\frac{1}{n}$ for $0 \le t \le 1$. Asymptotically, it is related to the Brownian bridge process and Brownian motion.

Let $X^2(t) = \frac{n\left(F_n(t) - t\right)^2}{t(1-t)}$, where $F_n(t)$ is the empirical CDF of a sample of size $n$ from uniform(0,1). There are several well-known results which will be used in later chapters:

1. If $Y_n(t) = \sqrt{n}\left(F_n(t) - t\right)$, then $\{Y_n(t)\}_{t \in [0,1]}$ is called the sample process. It converges weakly to the Brownian bridge process $W_0(t)$.

2. The Brownian bridge process $W_0(\cdot)$ can be transformed to a Brownian motion process $W(\cdot)$, with

\[ W(\tau) = (1+\tau)\, W_0\!\left(\frac{\tau}{1+\tau}\right), \qquad \tau = \frac{t}{1-t}. \]

3. Let $\tau = \frac{t}{1-t}$, that is, $t = \frac{\tau}{1+\tau}$; then the transformation from a Brownian bridge process $W_0(\cdot)$ to a Brownian motion process $W(\cdot)$ can be written in a way matching the form of a 2CS statistic:

\[ \frac{W(\tau)}{\sqrt{\tau}} = \frac{(1+\tau)\, W_0\!\left(\frac{\tau}{1+\tau}\right)}{\sqrt{\tau}} = \frac{W_0\!\left(\frac{\tau}{1+\tau}\right)}{\sqrt{\frac{\tau}{1+\tau}\left(1 - \frac{\tau}{1+\tau}\right)}} = \frac{W_0(t)}{\sqrt{t(1-t)}}. \]

4. If we rewrite the 2CS statistic $X_n^2(t) = \frac{n\left(F_n(t) - t\right)^2}{t(1-t)}$ as

\[ X_n^2(t) = \left( \frac{\sqrt{n}\,\left|F_n(t) - t\right|}{\sqrt{t(1-t)}} \right)^2, \]

then by 1. and 3., it can be shown that the chi square statistic converges weakly to $\left(\frac{W(\tau)}{\sqrt{\tau}}\right)^2$ with $\tau = \frac{t}{1-t}$.

5. The supremum of the 2CS statistic converges weakly to the supremum of the squared standardized Brownian motion. That is,

\[ \sup_{t \in (\varepsilon,\, 1-\varepsilon)} \left( \frac{\sqrt{n}\,\left|F_n(t) - t\right|}{\sqrt{t(1-t)}} \right)^2 \;\overset{\text{weakly}}{\Longrightarrow}\; \sup_{\tau \in (\tau_1, \tau_2)} \left( \frac{|W(\tau)|}{\sqrt{\tau}} \right)^2, \]

where $\tau_1 = \frac{\varepsilon}{1-\varepsilon}$, $\tau_2 = \frac{1-\varepsilon}{\varepsilon}$, and so

\[ \sup_{t \in (\varepsilon,\, 1-\varepsilon)} \frac{\sqrt{n}\,\left|F_n(t) - t\right|}{\sqrt{t(1-t)}} \;\overset{\text{weakly}}{\Longrightarrow}\; \sup_{\tau \in (\tau_1, \tau_2)} \frac{|W(\tau)|}{\sqrt{\tau}}. \]

6. By applying the law of the iterated logarithm for Brownian motion (Durrett (1996)[7]), we get

\[ \limsup_{\tau \to \infty} \frac{W(\tau)}{\sqrt{2\tau \log|\log\tau|}} = 1 \]

almost surely, and

\[ \lim_{\tau \to \infty} \frac{\sqrt{2\tau \log|\log\tau|}}{\sqrt{\tau}} = \lim_{\tau \to \infty} \sqrt{2\log|\log\tau|} = \infty, \]

so $\limsup_{\tau \to \infty} W(\tau)/\sqrt{\tau} = \infty$ almost surely. Thus the minimum cell length $\varepsilon$ for the 2DDCS statistic is necessary.

2.4 Set of Possible Cutpoints

For computational reasons, the set of possible cutpoints for the 2DDCS statistic (2.4) can have a large size but has to be finite. A natural candidate is

\[ S_{K_n} = \left\{ \frac{1}{K_n}, \frac{2}{K_n}, \dots, \frac{K_n - 1}{K_n} \right\}, \]

where $K_n$ is a constant integer depending only on the sample size $n$ (the larger $n$, the larger $K_n$ and the smaller $\frac{1}{K_n}$). It is the set considered by Inglot et al. (2003)[15]. However, we propose the set of the order statistics $S = \left\{X_{(1)}, \dots, X_{(n)}\right\}$, since we find that the supremum occurs only at the sample observations.

The graph of the function $X^2(t)$ based on a random sample from uniform(0,1) is displayed in Figure 2.2, where the sample is {0.07936203, 0.36753821, 0.45932248, 0.46040850, 0.50693105, 0.67570917, 0.68063271, 0.86449449, 0.87271449, 0.92663101}. Diamonds denote left-hand limits and solid circles represent right-hand limits of $X^2(t)$ as $t$ approaches the sample observation points. From the graph, we see that the supremum of $X^2(t)$ occurs at an observation point, as proved in Appendix A.1.
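Using the very sample listed above, this can also be checked numerically (our own sketch, not from the thesis): the maximum of the left- and right-hand limit values at the observation points dominates a fine grid search.

```python
# The sample displayed in Figure 2.2 (taken from the text above), already sorted.
sample = [0.07936203, 0.36753821, 0.45932248, 0.46040850, 0.50693105,
          0.67570917, 0.68063271, 0.86449449, 0.87271449, 0.92663101]
n = len(sample)

def x2(t):
    """2CS statistic X^2(t) = n (F_n(t) - t)^2 / (t (1 - t))."""
    fn = sum(x <= t for x in sample) / n
    return n * (fn - t) ** 2 / (t * (1 - t))

# Candidate values at the observations: the right-hand limit uses F_n = i/n,
# the left-hand limit uses F_n = (i - 1)/n (the cells are left-open).
cand = []
for i, x in enumerate(sample, start=1):
    denom = x * (1 - x)
    cand.append(n * (i / n - x) ** 2 / denom)        # right-hand limit
    cand.append(n * ((i - 1) / n - x) ** 2 / denom)  # left-hand limit
sup_at_obs = max(cand)

# A fine grid search never beats the observation-point candidates.
grid_sup = max(x2(k / 20000) for k in range(1, 20000))
```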

Claim 1 If $S = \left\{X_{(1)}, \dots, X_{(n)}\right\}$ is the set of ordered sample observations, then

A) \[ \sup_{t \in (0,1)} X^2(t) = \sup_{t \in (0,1)} \frac{n\left(F_n(t) - t\right)^2}{t(1-t)} = n \max_{1 \le i \le n} \left\{ \frac{\left(i/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)},\ \frac{\left((i-1)/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)} \right\}, \]

B) \[ \hat{t} = \inf \operatorname{Arg\,sup}_{t \in (0,1)} \frac{n\left(F_n(t) - t\right)^2}{t(1-t)} \in S. \]

In other words, the 2DDCS statistic $X^2(\hat{t})$ is the maximum of all right-hand and left-hand limits of $X^2(t)$ at the observation points. It is well known that the two-sided KS statistic is

\[ D_n = \sup_{t \in [0,1]} \left|F_n(t) - t\right| = \max_{1 \le i \le n} \left\{ \frac{i}{n} - X_{(i)},\ X_{(i)} - \frac{i-1}{n} \right\}. \]

On the other hand, the one-sided KS statistic $D_n^+$ is the maximum of all right-hand limits of $F_n(t) - t$ at the observation points, and

\[ D_n^+ = \sup_{t \in [0,1]} \left\{F_n(t) - t\right\} = \max_{1 \le i \le n} \left\{ \frac{i}{n} - X_{(i)} \right\} \quad \text{(one-sided KS)}. \]

Similarly, we define $X^2(\hat{t})$ as the two-sided 2DDCS statistic,

\[ X^2(\hat{t}) = n \max_{1 \le i \le n} \left\{ \frac{\left(i/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)},\ \frac{\left((i-1)/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)} \right\}. \tag{2.5} \]

We also define the maximum of all right-hand limits as the one-sided 2DDCS statistic, that is,

\[ X^2(\hat{t}^+) = n \max_{1 \le i \le n} \frac{\left(i/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)}. \tag{2.6} \]

When the sample size is very small, a minimum cell length is not required because the probability of having observations very close to the two ends is small. However, this probability increases as the sample size grows, and so a minimum cell length $\varepsilon$ is necessary if the sample size is not very small. Based on the order statistics $\{X_{(1)}, \dots, X_{(n)}\}$, we rewrite the 2DDCS test statistics with respect to the ordered observations. The one-sided 2DDCS statistic with a minimum cell length $\varepsilon$ is defined as

\[ X^2(\hat{t}^+_\varepsilon) = n \max_{[n\varepsilon] \le i \le n - [n\varepsilon]} \frac{\left(i/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)}, \]

where $[n\varepsilon]$ is the largest integer smaller than $n\varepsilon$. The two-sided 2DDCS statistic (2.4) is

\[ X^2(\hat{t}_\varepsilon) = n \max_{[n\varepsilon] \le i \le n - [n\varepsilon]} \left\{ \frac{\left(i/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)},\ \frac{\left((i-1)/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)} \right\}. \]

Notice that:
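A direct implementation of these order-statistic formulas might look like the following sketch (our own illustration; the function names, and reading $[n\varepsilon]$ as a floor, are our assumptions):

```python
import math
import random

def ddcs_two_sided(sample, eps=0.0):
    """Two-sided 2DDCS statistic with minimum cell length eps:
    n times the max over the truncated index range of the left- and
    right-hand limit values at each order statistic X_(i)."""
    xs = sorted(sample)
    n = len(xs)
    lo, hi = math.floor(n * eps), n - math.floor(n * eps)
    best = 0.0
    for i in range(max(lo, 1), hi + 1):   # i runs over [n*eps] .. n - [n*eps]
        x = xs[i - 1]
        denom = x * (1 - x)
        right = (i / n - x) ** 2 / denom        # right-hand limit at X_(i)
        left = ((i - 1) / n - x) ** 2 / denom   # left-hand limit at X_(i)
        best = max(best, right, left)
    return n * best

def ddcs_one_sided(sample, eps=0.0):
    """One-sided 2DDCS statistic: right-hand limits only."""
    xs = sorted(sample)
    n = len(xs)
    lo, hi = math.floor(n * eps), n - math.floor(n * eps)
    best = 0.0
    for i in range(max(lo, 1), hi + 1):
        x = xs[i - 1]
        best = max(best, (i / n - x) ** 2 / (x * (1 - x)))
    return n * best

random.seed(3)
s = [random.random() for _ in range(50)]
```

By construction, the one-sided value can never exceed the two-sided one, and truncating the index range can only shrink the supremum.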

[Figure 2.2: Pearson's 2CS Statistic as a Function of t]

1. The value of the one-sided 2DDCS statistic $X^2(\hat{t}^+_\varepsilon)$ is less than or equal to the value of the two-sided statistic $X^2(\hat{t}_\varepsilon)$ for any given sample. When they are equal, the corresponding first cell includes the cutpoint $\hat{t}$; otherwise it does not include $\hat{t}$ and is right-open.

2. The null distributions of the one-sided and two-sided 2DDCS statistics are no longer approximately chi square under $H_0$. To see the empirical null distribution of a 2DDCS statistic such as (2.4), we take 1000 random samples from uniform(0,1), say $X^1, \dots, X^{1000}$. Let $\hat{t}_i$ and $X^2(\hat{t}_i)$ be the $i$th pair of cutpoint and test-statistic value based on the $i$th sample, $i = 1, 2, \dots, 1000$. Then the 1000 cutpoints $\hat{t}_1, \dots, \hat{t}_{1000}$ are typically different from each other. The density histogram of the simulated values $\{X^2(\hat{t}_1), \dots, X^2(\hat{t}_{1000})\}$ is quite different from the $\chi^2_1$ density; it is actually shifted to the right of $\chi^2_1$. Details about the cutpoint and the tests are given in Chapter 3 and Chapter 4, respectively.

3. When the alternative is simple (fully specified) with CDF $F_1$, the best cutpoint for a Pearson 2CS test is $t^*$ as defined in (2.3) with $F = F_1$. But notice that a chi square test would typically not be used in this case, because with the alternative known the most powerful test is the Neyman-Pearson test.

4. The minimum cell length $\varepsilon$ should depend on the sample size; a larger sample size allows a smaller $\varepsilon$. Moreover, the closer the value of $\varepsilon$ is to $t^*$, the higher the power the test will achieve.

5. When a random sample is taken from a specific distribution, the actual power of the 2DDCS tests may not be as high as the "best luck" Pearson 2CS test using the cutpoint $t^*$ (2.3).
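The empirical-null simulation described in item 2 can be sketched as follows (our own code, with our own names and sample size; we compare the simulated median with the median of $\chi^2_1$, about 0.455, to exhibit the right shift):

```python
import random

def ddcs_two_sided(sample):
    """Two-sided 2DDCS statistic from the order statistics (no truncation)."""
    xs = sorted(sample)
    n = len(xs)
    best = 0.0
    for i, x in enumerate(xs, start=1):
        denom = x * (1 - x)
        best = max(best, (i / n - x) ** 2 / denom,
                   ((i - 1) / n - x) ** 2 / denom)
    return n * best

random.seed(4)
reps, n = 1000, 20
vals = sorted(ddcs_two_sided([random.random() for _ in range(n)])
              for _ in range(reps))
sim_median = vals[reps // 2]
chi2_1_median = 0.455  # median of the chi-square(1) distribution
```

The simulated median sits well to the right of the $\chi^2_1$ median, in line with the shift described above.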

2.5 Circular 2DDCS Statistics

Both the one-sided and two-sided 2DDCS statistics apply to data on a line. However, circular data, such as wind directions and directions of migrating birds, often arise in real life, and both of the 2DDCS statistics $X^2(\hat{t}^+_\varepsilon)$ and $X^2(\hat{t}_\varepsilon)$ are not appropriate for testing circular uniformity, since they are more sensitive at the two ends and so depend on the choice of the starting point. Therefore, we propose the wrap-around or circular 2DDCS statistic. Here the two end points of the unit line are wrapped around so that the line becomes a circle with unit circumference. Then we cut the circle into two semicircles to form the two cells of a chi square type statistic. This is equivalent to taking a segment of the unit line as the first cell and wrapping up the remaining piece(s) as the second cell. The resulting circular statistic is called the wrap-around chi square statistic or circular two-cell chi square statistic (circular 2CS statistic). Such a test can be used to test uniformity both on a line and on a circle. For simplicity, we first study the circular 2DDCS statistics with both cell lengths fixed at .5. Generalized wrap-around 2DDCS statistics are defined and discussed in Chapter 5. More details are given by Qian et al. (2009)[26].

Now we test the null hypothesis that the $n$ points randomly located on the circumference of the circle are uniformly distributed. We define the circular two-cell chi square (2CS) statistic

\[ X^2(t_c) = \frac{\left[nF_n(t) - nF_n(t-.5) - .5n\right]^2}{.5n} + \frac{\left[nF_n(t-.5) + n - nF_n(t) - .5n\right]^2}{.5n}, \]

where $t = t_c \bmod 1$, and $t \in [.5, 1]$ is the right endpoint of the middle segment $(t-.5,\, t]$ on the unwrapped line. The circular 2CS statistic $X^2(t_c)$ can be simplified:

\[ X^2(t_c) = \frac{n\left(F_n(t) - F_n(t-.5) - .5\right)^2}{.5(1-.5)} = 4n\left[F_n(t) - F_n(t-.5) - .5\right]^2. \]

To make sure that the test will detect deviations of the sample points from uniformity for a broad class of alternatives, the circular 2DDCS test statistic $X^2(\hat{t}_c)$ is defined to be the supremum of all possible circular 2CS statistics:

\[ X^2(\hat{t}_c) = \sup_{t \in [.5,1]} 4n\left[F_n(t) - F_n(t-.5) - .5\right]^2. \tag{2.7} \]

Here $t$ lies in the interval $[.5, 1]$, instead of $[0,1]$, because the total number of observations and the frequency of one of the cells are the only information required to calculate each statistic $X^2(t_c)$.

Due to computational limitations, the set of all possible selections of the right cutpoint $\hat{t}_c$ has to be finite. Just as for the 2DDCS statistics on a line, there are two possible ways to define the set. One option is to use a set of equally spaced points between 0 and 1, such as $\left\{\frac{1}{K_n}, \frac{2}{K_n}, \dots, \frac{K_n-1}{K_n}\right\}$. Another option is to choose $\hat{t}_c$ from the set of all observations and their diametrically opposite points $S_p = \{X_1, \dots, X_n, X_1 + .5, \dots, X_n + .5\}$. The drawback of the first option is that $K_n$ should depend on the sample size $n$, and the value of the resulting $X^2(\hat{t})$ might vary with the choice of $K_n$. The second option is better if it is true that the supremum of $X^2(t_c)$ occurs only at the points in $\bar{S}_c$. It can be shown that the value of the circular 2DDCS statistic $X^2(t_c)$ changes only at either an observation point $t = X_i$, if $X_i \ge .5$, or at its diametrically opposite point $t = X_i + .5$, if $X_i < .5$. If we let

\[ \bar{X}_i = \begin{cases} X_i, & \text{if } X_i \ge .5 \\ X_i + .5, & \text{otherwise} \end{cases} \qquad \text{and} \qquad \bar{S}_c = \left\{\bar{X}_1, \bar{X}_2, \dots, \bar{X}_n\right\}, \]

then

\[ X^2(\hat{t}_c) = \max_{1 \le i \le n} \left\{ \lim_{t \to \bar{X}_{(i)}^-} X^2(t_c),\ \lim_{t \to \bar{X}_{(i)}^+} X^2(t_c) \right\}. \]


Proof. We first define the new set of points $\bar{S}_c = \left\{\bar{X}_1, \bar{X}_2, \dots, \bar{X}_n\right\}$, where

\[ \bar{X}_i = \begin{cases} X_i, & \text{if } X_i \ge .5 \\ X_i + .5, & \text{otherwise.} \end{cases} \]

We order the points of $\bar{S}_c$ from smallest to largest and let $\bar{X}_{(1)}, \dots, \bar{X}_{(n)}$ be the ordered points. Then the circular 2CS function

\[ X^2(t_c) = 4n\left[F_n(t) - F_n(t-.5) - .5\right]^2 \]

is constant on each of the intervals $\left[.5, \bar{X}_{(1)}\right), \left[\bar{X}_{(1)}, \bar{X}_{(2)}\right), \dots, \left[\bar{X}_{(n-1)}, \bar{X}_{(n)}\right)$, and the value of $X^2(t_c)$ changes only at the points in $\bar{S}_c$. Thus we have

\[ X^2(\hat{t}_c) = \sup_{t \in [.5,1]} X^2(t_c) = \max_{1 \le i \le n} \left\{ \lim_{t \to \bar{X}_{(i)}^-} X^2(t_c),\ \lim_{t \to \bar{X}_{(i)}^+} X^2(t_c) \right\}. \]

Therefore the circular 2DDCS statistic can be expressed with respect to the sample observations as well. Furthermore, if we let $S(t)$ denote the number of observations covered by the semicircle $(t-.5,\, t]$, then $X^2(t_c)$ is equal to $4n\left[\frac{S(t)}{n} - .5\right]^2$, or equivalently $\frac{4}{n}\left[S(t) - .5n\right]^2$, for $t \ge .5$. With respect to $S(t)$, the parabola opens up and the axis of symmetry is the vertical line $S(t) = .5n$. Thus the supremum of $X^2(t_c)$ occurs at the supremum or infimum of $S(t)$, and

\[ X^2(S(t)) = X^2(n - S(t)). \]

Therefore we find $X^2(\hat{t}_c)$, the supremum of $X^2(t_c)$, by getting the supremum and infimum of $S(t)$. That is,

\[ X^2(\hat{t}_c) = \max\left\{ \frac{4}{n}\left[\sup_{t \in [.5,1]} S(t) - .5n\right]^2,\ \frac{4}{n}\left[\inf_{t \in [.5,1]} S(t) - .5n\right]^2 \right\}, \]

and by symmetry we have

\[ X^2(\hat{t}_c) = \frac{4}{n}\left[ \max\left\{ \sup_{t \in [.5,1]} S(t),\ n - \inf_{t \in [.5,1]} S(t) \right\} - .5n \right]^2. \]

The question is how to find the supremum and infimum of $S(t)$. By the definition of $S(t)$, the supremum and infimum are, respectively, the maximum and minimum of all left-hand and right-hand limits of $S(t)$ at the points of $\bar{S}_c$. That is,

\[ \sup_{t \in [.5,1]} S(t) = \max_{\bar{X}_i \in \bar{S}_c} \left\{ \lim_{t \to \bar{X}_{(i)}^-} S(t),\ \lim_{t \to \bar{X}_{(i)}^+} S(t) \right\} \]

and

\[ \inf_{t \in [.5,1]} S(t) = \min_{\bar{X}_i \in \bar{S}_c} \left\{ \lim_{t \to \bar{X}_{(i)}^-} S(t),\ \lim_{t \to \bar{X}_{(i)}^+} S(t) \right\}. \]

For any $X_i < .5$, the left-hand limit of $S(t)$ at $\bar{X}_i = X_i + .5$ equals the number of observations covered by the semicircle $[X_i,\, X_i + .5]$, and its right-hand limit is the number of observations covered by the semicircle $(X_i,\, X_i + .5]$. The maximum of these two is the left-hand limit, since it includes the observation point $X_i$, and the minimum is the right-hand limit. On the other hand, when $X_i \ge .5$, the maximum and minimum of the two limits are, respectively, the right-hand limit and the left-hand limit. It does not matter whether the diametrically opposite point of an observation is included in the semicircle or not, because the probability that it is also an observation point is 0. Thus, if we let $M_i$ denote the maximum of the left-hand and right-hand limits of $S(t)$ at $\bar{X}_i \in \bar{S}_c$ (for $X_i < .5$, this is the number of observations covered by the semicircle $[X_i,\, X_i + .5]$), then

\[ \max\left\{ \sup_{t \in [.5,1]} S(t),\ n - \inf_{t \in [.5,1]} S(t) \right\} = \max_{1 \le i \le n} \left\{ M_i,\ n - (M_i - 1) \right\}. \]

Let $N_i$ denote the maximum number of observations covered by one of the semicircles with endpoints $X_i$ and its opposite point $X_i^* = (X_i + .5) \bmod 1$; then

\[ N_i = \max\left\{M_i,\ n - (M_i - 1)\right\} = \max\left\{M_i,\ n - M_i + 1\right\}, \tag{2.9} \]

and thus we get the following claim:

Claim 3 The circular 2DDCS statistic can be written in terms of $N_i$:

\[ X^2(\hat{t}_c) = \frac{4}{n}\left[ \max_{1 \le i \le n}\{N_i\} - .5n \right]^2. \tag{2.10} \]

Given the order statistics $X_{(1)}, X_{(2)}, \dots, X_{(n)}$, the above claim provides a method to calculate the value of the circular 2DDCS statistic (2.10) and to find the cutpoint $\hat{t}_c$. Both the one-sided and two-sided 2DDCS statistics require a minimum cell length $\varepsilon$ when the sample size is not too small. The distributions of those test statistics depend on the value of $\varepsilon$, and so do the powers of the 2DDCS tests on a line. One advantage of the circular 2DDCS statistic is that the two cells have equal length .5, so we do not have to worry about the minimum cell length $\varepsilon$. Notice that on the wrapped circle, the point $t = 1$ is the same as the point $t = 0$.
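Claim 3 suggests a simple $O(n^2)$ computation (a sketch under our own naming, not from the thesis): for each observation, count the points covered by the closed semicircle it starts, form $N_i$, and plug the largest $N_i$ into (2.10).

```python
import random

def circular_ddcs(sample):
    """Circular 2DDCS statistic (2.10): (4/n) * (max_i N_i - n/2)^2, where N_i is
    the larger of the two counts of observations covered by the semicircles
    determined by X_i and its opposite point X_i* = (X_i + .5) mod 1."""
    n = len(sample)
    best_n = 0
    for x in sample:
        # M_i: observations on the closed arc [x, x + .5] (mod 1), including x.
        m = sum(1 for y in sample if (y - x) % 1.0 <= 0.5)
        best_n = max(best_n, m, n - m + 1)
    return 4.0 / n * (best_n - 0.5 * n) ** 2

random.seed(5)
uniform = [random.random() for _ in range(40)]
clustered = [0.25 + 0.05 * random.random() for _ in range(40)]  # one tight arc
```

For the clustered sample, some semicircle covers all 40 points, so the statistic attains its maximal value $(4/40)(40-20)^2 = 40$; a uniform sample gives a much smaller value.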

Chapter 3

Null Distribution of the Cutpoints

For the 2DDCS statistics on a line, we define the "cutpoint" to be the first point $t$ which maximizes the corresponding chi square value. It is well defined because the probability that there is more than one maximizing point is 0 for both the one-sided and two-sided 2DDCS statistics. However, for a circular 2DDCS statistic, the maximizing point is not unique. Therefore, we may define the "cutpoint" to be a randomly selected point from the set of all maximizing points. We expect a good test to be able to detect departures from the null equally well at every point of the domain; therefore we are looking for a test whose cutpoint has an exactly or approximately uniform null distribution. By checking the null distribution of the cutpoint, we know whether the corresponding test is fair or not and how to make it better. This information can also be used to decide what the minimum cell length should be for a 2DDCS test. For this reason, all the tests in this chapter start without a minimum cell length, that is, $\varepsilon = 0$. Let $X_{(1)}, X_{(2)}, \dots, X_{(n)}$ be the order statistics of a random sample from uniform(0,1) and $F_n$ be the ECDF of the sample as defined previously. The null distributions of the maximizing points of the different types of KS and 2DDCS statistics are discussed in separate sections.

3.1 Cutpoint of One-Sided KS

The right-sided KS statistic is defined as

\[ D_n^+ = \sup_{0 \le t \le 1} \left(F_n(t) - t\right). \]

For any $\varepsilon > 0$ and $\delta > 0$, let $N = \max\left\{ N_1,\ \frac{1}{\varepsilon} - 1 \right\}$; then we have

\[ P\left( \left| V_n^+ - \frac{L_r}{r+1} \right| \ge \varepsilon \right) \le \delta \quad \text{for } r \ge N. \]

Thus $\frac{L_r}{r+1}$ converges to $V_n^+$ in probability as $r \to \infty$. Hence $\frac{L_r}{r+1}$ converges to $V_n^+$ in distribution as well.

Moreover, Dwass (1958)[8] proved that $\frac{L_r}{r+1}$ is asymptotically uniformly distributed over $(0,1)$ as $r \to \infty$ by applying Andersen (1953)[2]'s result. Therefore $V_n^+$ is also uniformly distributed over the unit interval. A frequency histogram for the cutpoint $V_n^+$, generated by Monte Carlo simulation with 5000 replications and sample size 10, is shown in Figure 3.1; it is approximately uniform, confirming Theorem 1.

As is well known and shown above, the maximizing point $V_n^+$ corresponding to the right-sided KS statistic is uniformly distributed over $(0,1)$ for any finite sample size. What, then, about the null distribution of the maximizing point $V_n^-$ for the left-sided KS statistic? Intuitively, we may guess that $V_n^-$ has the same distribution as $V_n^+$ by the symmetry of a uniform random sample on $[0,1]$. However, not much has been done in the literature about the null distribution of either $V_n^-$ or $V_n$, the maximizing point of the two-sided KS statistic. This section and the following section focus on the null distributions of $V_n^-$ and $V_n$.

Here, similarly, we let

\[ D_n^- = \sup_{t \in [0,1]} \left(t - F_n(t)\right) \]

[Figure 3.1: Frequency Histogram of $V_{10}^+$, $k = 5000$]

be the left-sided KS statistic, and let

\[ V_n^- = \inf_{t \in [0,1]} \left\{ t : \left(t - F_n(t)\right) = D_n^- \right\} \]

be the maximizing point of the left-sided KS statistic. We prove that $V_n^-$ is also uniformly distributed over $(0,1)$.

Given the order statistics $X_{(1)}, X_{(2)}, \dots, X_{(n)}$, the ECDF $F_n(t)$ is right continuous but not left continuous. The function $F_n(t) - t$ is strictly decreasing on each interval $\left[X_{(i)}, X_{(i+1)}\right)$, $i = 1, 2, \dots, n-1$, and the supremum of $F_n(t) - t$ is realized at the left endpoint of some interval $\left[X_{(i)}, X_{(i+1)}\right)$. Hence the right-sided KS statistic $D_n^+$ is the supremum and maximum of $F_n(t) - t$. On the other hand, the function $t - F_n(t)$ is strictly increasing on each interval $\left[X_{(i)}, X_{(i+1)}\right)$, and so the supremum of $t - F_n(t)$ is approached at the right endpoint of some interval. Therefore, the left-sided KS statistic $D_n^-$ is the supremum but not the maximum of $t - F_n(t)$. As is well known, we have

\[ D_n^+ = \max_{1 \le i \le n} \lim_{t \to X_{(i)}^+} \left[F_n(t) - t\right] = \max_{1 \le i \le n} \left( \frac{i}{n} - X_{(i)} \right) \]

and

\[ D_n^- = \max_{1 \le i \le n} \lim_{t \to X_{(i)}^-} \left[t - F_n(t)\right] = \max_{1 \le i \le n} \left( X_{(i)} - \frac{i-1}{n} \right). \]

Lemma 1 For any $n$, the random variable $V_n^-$ has the same distribution as $1 - V_n^+$, and so $V_n^-$ is also uniformly distributed over $(0,1)$.

Proof. Given the order statistics $X_{(1)}, \dots, X_{(n)}$ of a random sample, let $K_i^+ = \frac{i}{n} - X_{(i)}$ and $K_i^- = X_{(i)} - \frac{i-1}{n}$. Since $X_{(i)}$ and $1 - X_{(n-i+1)}$ have the same distribution, it is also true for $\frac{i}{n} - X_{(i)}$ and $\frac{i}{n} - 1 + X_{(n-i+1)}$ $\left(= X_{(n-i+1)} - \frac{n-i}{n}\right)$. That is, $K_i^+$ has the same distribution as $K_{n-i+1}^-$, $i = 1, 2, \dots, n$. Let $D_n^+ = \max_i K_i^+$, $D_n^- = \max_i K_{n-i+1}^-$, $V_n^+ =$ the first $X_{(i)}$ such that $K_i^+ = D_n^+$, and $V_n^- =$ the first $X_{(n-i+1)}$ such that $K_{n-i+1}^- = D_n^-$.

Moreover, not only are the marginal distributions the same for each pair $K_i^+$ and $K_{n-i+1}^-$, but the joint distribution of the vector $\left(K_i^+\right)_{i=1,2,\dots,n}$ is also the same as the joint distribution of the vector $\left(K_{n-i+1}^-\right)_{i=1,2,\dots,n}$. By a linear transformation of the joint distribution of $X_{(1)}, \dots, X_{(n)}$, we have

\[ f_{\left(K_1^+, \dots, K_n^+\right)}(d_1, d_2, \dots, d_n) = n!, \quad \text{where } 0 \le \tfrac{1}{n} - d_1 \le \tfrac{2}{n} - d_2 \le \dots \le 1 - d_n \le 1, \]

and

\[ g_{\left(K_n^-, \dots, K_1^-\right)}(l_1, l_2, \dots, l_n) = g_{\left(X_{(n)} - \frac{n-1}{n},\ X_{(n-1)} - \frac{n-2}{n},\ \dots,\ X_{(1)}\right)}(l_1, l_2, \dots, l_n) = n!, \]

where

\[ 0 \le l_n + 1 - \tfrac{n}{n} \le \dots \le l_2 + 1 - \tfrac{2}{n} \le l_1 + 1 - \tfrac{1}{n} \le 1, \]

which is equivalent to

\[ 0 \le \tfrac{1}{n} - l_1 \le \tfrac{2}{n} - l_2 \le \dots \le 1 - l_n \le 1. \]

Therefore $D_n^+ = \max_{1 \le i \le n} K_i^+$ has the same distribution as

\[ \max_{1 \le i \le n} K_{n-i+1}^- = \max_{1 \le j \le n} K_j^- = D_n^-. \]

Since $X_{(i)}$ and $1 - X_{(n-i+1)}$ have the same distribution, $V_n^- \overset{d}{=} 1 - V_n^+$. Moreover, $V_n^+$ is uniformly distributed over $(0,1)$, so $V_n^-$ is uniformly distributed over $(0,1)$ as well. That is,

\[ P\left(V_n^- \le t\right) = P\left(1 - V_n^+ \le t\right) = P\left(V_n^+ \ge 1 - t\right) = 1 - (1 - t) = t, \quad \text{for any } t \in (0,1). \]


A frequency histogram for the cutpoint is simulated with number of replications 5000

and sample size 10, which is approximately uniform(0, 1) and so consistent with Lemma 1.
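The uniformity of both one-sided KS cutpoints can be illustrated by a simulation of this kind (our own sketch; the function name `ks_cutpoints` is ours):

```python
import random

def ks_cutpoints(sample):
    """Return (V_n^+, V_n^-): the maximizing points of the right- and
    left-sided KS statistics, computed from the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    # D_n^+ is attained at the X_(i) maximizing i/n - X_(i).
    i_plus = max(range(1, n + 1), key=lambda i: i / n - xs[i - 1])
    # D_n^- is approached at the X_(i) maximizing X_(i) - (i-1)/n.
    i_minus = max(range(1, n + 1), key=lambda i: xs[i - 1] - (i - 1) / n)
    return xs[i_plus - 1], xs[i_minus - 1]

random.seed(6)
reps, n = 4000, 10
vplus, vminus = [], []
for _ in range(reps):
    vp, vm = ks_cutpoints([random.random() for _ in range(n)])
    vplus.append(vp)
    vminus.append(vm)
mean_plus = sum(vplus) / reps
mean_minus = sum(vminus) / reps
```

If both cutpoints are uniform on $(0,1)$, their sample means should be close to $1/2$.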

3.2 Cutpoint of Two-Sided KS

Let $V_n$ be the maximizing point corresponding to the two-sided KS statistic, that is,

\[ V_n = \inf_{0 \le t \le 1} \left\{ t : \left|F_n(t) - t\right| = D_n \right\}; \]

moreover, we may notice that $D_n = \max\left\{D_n^-, D_n^+\right\}$ and

\[ V_n = \begin{cases} V_n^+, & \text{if } D_n^+ \ge D_n^- \\ V_n^-, & \text{if } D_n^- > D_n^+. \end{cases} \]

Although both Vn− and Vn+ are uniformly distributed over (0, 1) if the sample is taken from

uniform (0, 1) , it is not true for the two-sided maximizing point Vn whenever the sample

size is larger than 1. Vn occurs relatively more frequently at the median than at the tails

when the sample size is larger than 2.

Claim 4 If the sample size is 1, the maximizing point Vn is the observation itself and so it

is uniformly distributed over (0, 1).

When the sample size is 2, the conclusion is not as obvious, but it can be shown that the density function of $V_2$ is a shoulder-lowered "W"-shaped curve, with the highest probability around the center, medium probability around the two tails, and the lowest probability around the first and third quartiles. Therefore, the two-sided KS test with sample size 2 is more sensitive in the middle and less sensitive at the two sides.

Claim 5 The random variable V2 is NOT uniformly distributed over (0, 1) .


Proof. Let $X_{(1)}$ and $X_{(2)}$ be the ordered observations from smallest to largest; then $D_2 = \max\left\{\tfrac12 - X_{(1)},\ X_{(1)},\ 1 - X_{(2)},\ X_{(2)} - \tfrac12\right\}$, and $V_2 = X_{(1)}$ if $\tfrac12 - X_{(1)} = D_2$ or $X_{(1)} = D_2$; otherwise $V_2 = X_{(2)}$. Thus the probability $P(V_2 \le t)$ is the sum of the probabilities $P\left(X_{(1)} \le t,\ D_2 = \tfrac12 - X_{(1)} \text{ or } X_{(1)}\right)$ and $P\left(X_{(2)} \le t,\ D_2 = 1 - X_{(2)} \text{ or } X_{(2)} - \tfrac12\right)$.

To determine this probability we need to separate the domain of $X_{(1)}$ and $X_{(2)}$ into four regions, where $D_2 = \tfrac12 - X_{(1)}$, $X_{(1)}$, $1 - X_{(2)}$, or $X_{(2)} - \tfrac12$ respectively, and the probability has to be discussed in four cases; for example,

\[ P(V_2 \le t) = t - 2t^2 + t^2 = t - t^2, \quad \text{if } t \le \tfrac14. \]

\[ X^2(\hat{t}^+) = n \max_{1 \le i \le n} \frac{\left(i/n - X_{(i)}\right)^2}{X_{(i)}\left(1 - X_{(i)}\right)}. \]

To see the null distribution of the smallest $t$ at which $X^2(\hat{t}^+)$ occurs, we run a simulation or carry out a theoretical analysis of the null distribution of $\hat{t}^+$ given the ordered sample $X_{(1)}, \dots, X_{(n)}$ from the uniform(0,1) distribution. We start the analysis with the simplest cases, when the sample size is only 1 or 2. In the trivial case when the sample size $n$ is 1, the optimal cutpoint $\hat{t}^+$ has to be the only observation point $X_1$, and so it is uniformly distributed over $(0,1)$. However, when $n = 2$, the optimal cutpoint $\hat{t}^+$ can be either of the two observation points $X_{(1)}$ and $X_{(2)}$. Since the cutpoint is always included in the first cell in this case, one might expect the smaller one, $X_{(1)}$, to be more likely to be the cutpoint than the larger one, $X_{(2)}$; a simple calculation, however, shows the opposite:

\[ P\left(\hat{t}^+ = X_{(1)}\right) = P\left( \frac{\left(X_{(1)} - \tfrac12\right)^2}{X_{(1)}\left(1 - X_{(1)}\right)} \ge \frac{\left(1 - X_{(2)}\right)^2}{X_{(2)}\left(1 - X_{(2)}\right)} \right) = P\left( X_{(2)} \ge 4X_{(1)} - 4X_{(1)}^2 \right) = \frac{7}{16}. \]

Thus we have

\[ \begin{array}{c|cc} \hat{t}^+ & X_{(1)} & X_{(2)} \\ \hline \text{probability} & \frac{7}{16} & \frac{9}{16} \end{array} \]

The maximum is slightly more likely to be the cutpoint than the minimum.
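This probability can be checked by a quick Monte Carlo experiment (our own sketch): simulate pairs of uniform observations, evaluate the two candidate statistic values, and count how often the smaller observation wins.

```python
import random

random.seed(7)
reps = 40000
count_min = 0
for _ in range(reps):
    a, b = sorted((random.random(), random.random()))
    # One-sided 2DDCS candidate values for n = 2 (the common factor n cancels).
    s1 = (0.5 - a) ** 2 / (a * (1 - a))   # right-hand limit at X_(1)
    s2 = (1.0 - b) ** 2 / (b * (1 - b))   # right-hand limit at X_(2)
    if s1 >= s2:                          # ties (probability 0) go to X_(1)
        count_min += 1
p_min = count_min / reps   # estimates P(t_hat+ = X_(1)) = 7/16
```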

To get the distribution of $\hat{t}^+$, we need to find the probability $F_{\hat{t}^+}(t) = P\left(\hat{t}^+ \le t\right)$, which is the sum of the probabilities $P\left(X_{(1)} \le t,\ \hat{t}^+ = X_{(1)}\right)$ and $P\left(X_{(2)} \le t,\ \hat{t}^+ = X_{(2)}\right)$.

3.3. CUTPOINT OF ONE-SIDED 2DDCS

That is, it is the sum of the probability that the smaller observation is not larger than $t$ when $X_{(2)}$ is greater than or equal to $4X_{(1)} - 4X_{(1)}^2$, and the probability that the larger observation is less than or equal to $t$ otherwise. Therefore

\[ F_{\hat{t}^+}(t) = P\left(X_{(1)} \le t,\ X_{(2)} \ge 4X_{(1)} - 4X_{(1)}^2\right) + P\left(X_{(2)} \le t,\ X_{(2)} < 4X_{(1)} - 4X_{(1)}^2\right). \]

(a) When $t \le \tfrac34$, the CDF of $\hat{t}^+$, $F_{\hat{t}^+}(t) = P\left(\hat{t}^+ \le t\right)$, is the integral

\[ 2\left( \int_0^t \left(1 - 4x + 4x^2\right) dx + \int_0^t \left( y - \frac{1 - \sqrt{1-y}}{2} \right) dy \right). \]

Thus the density for $t \le \tfrac34$ is the derivative of the above integral, and so

\[ f_{\hat{t}^+}(t) = 2\left( 1 - 4t + 4t^2 + t - \frac{1 - \sqrt{1-t}}{2} \right). \]

After combining like terms, we get

\[ f_{\hat{t}^+}(t) = 8t^2 + \sqrt{1-t} - 6t + 1, \quad \text{when } 0 \le t \le \tfrac34. \]

(b) When $t \in \left(\tfrac34, 1\right]$, the probability that the cutpoint lies between $\tfrac34$ and $t$ is

\[ P\left(\hat{t}^+ \le t\right) - P\left(\hat{t}^+ \le \tfrac34\right) = \frac{1}{16} - (1-t)^2 + 2\int_{3/4}^{t} \sqrt{1-y}\; dy. \]

Therefore, the density function of the cutpoint here is

\[ f_{\hat{t}^+}(t) = 2 - 2t + 2\sqrt{1-t}, \quad \text{for } \tfrac34 < t \le 1. \]
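As a sanity check (our own sketch), the two pieces of this density should integrate to 1 over $(0,1)$; a midpoint-rule quadrature confirms this.

```python
def f_one_sided(t):
    """Density of the one-sided 2DDCS cutpoint for n = 2 (pieces from the text)."""
    if t <= 0.75:
        return 8 * t * t + (1 - t) ** 0.5 - 6 * t + 1
    return 2 - 2 * t + 2 * (1 - t) ** 0.5

# Midpoint-rule quadrature over (0, 1).
m = 200000
total = sum(f_one_sided((k + 0.5) / m) for k in range(m)) / m
```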

Thus the density function of $\hat{t}^+$ when $n = 2$ is

\[ f_{\hat{t}^+}(t) = \begin{cases} 8t^2 + \sqrt{1-t} - 6t + 1, & 0 \le t \le \tfrac34, \\ 2 - 2t + 2\sqrt{1-t}, & \tfrac34 < t \le 1. \end{cases} \]

[Figure 3.4: Density of $\hat{t}^+$, $\varepsilon = 0$, $n = 2$]

3.4 Cutpoint of Two-Sided 2DDCS

The two-sided 2DDCS statistic $X^2(\hat{t})$ is the supremum of all 2CS statistics with respect to $t$ on the interval $(\varepsilon, 1-\varepsilon)$. When the sample size is really small, the minimum cell length $\varepsilon$ is not required. In this section, we study the null distribution of the cutpoint $\hat{t}$ for the two-sided 2DDCS statistic $X^2(\hat{t})$. First we find the exact null distribution of $\hat{t}$ when there are only two observations. Then simulations with different sample sizes are given and analyzed to see what value of $\varepsilon$ is appropriate for each typical sample size.

Let $X_{(1)}, \dots, X_{(n)}$ be the order statistics taken from uniform(0,1). As with the one-sided 2DDCS cutpoint $\hat{t}^+$, in the simplest case when the sample size is only 1, the cutpoint $\hat{t}$ is just the observation $X_1$ itself, and so $\hat{t}$ is uniformly distributed on $(0,1)$. When the sample size is 2, $\hat{t}$ takes the value $X_{(1)}$ or $X_{(2)}$ equally likely because of symmetry. Furthermore, we may expect the probability that $X^2(\hat{t}) = X^2(\hat{t}^+)$ to be .5; that is, the two-sided 2DDCS cutpoint is also equally likely to be included in the left cell (the 1st cell) or the

[Figure 3.5: Frequency Histogram of $\hat{t}^+_{\varepsilon=.05}$ ($n = 50$, 5000 replications)]

right cell. To distinguish the left-closed and right-closed cutpoints, we let

\[ X_{(i)}^- = \lim_{x \to 0^-} \left(X_{(i)} + x\right) \quad \text{and} \quad X_{(i)}^+ = \lim_{x \to 0^+} \left(X_{(i)} + x\right), \quad i = 1, 2; \]

then the cutpoint $\hat{t} \in \left\{X_{(1)}^-,\ X_{(1)}^+,\ X_{(2)}^-,\ X_{(2)}^+\right\}$. With careful calculations of the probabilities, we have

\[ \begin{array}{c|cccc} \hat{t} & X_{(1)}^- & X_{(1)}^+ & X_{(2)}^- & X_{(2)}^+ \\ \hline \text{probability} & \frac{13}{48} & \frac{11}{48} & \frac{11}{48} & \frac{13}{48} \end{array} \]

Therefore, the probability that the cutpoint is $X_{(1)}$ is the same as the probability that it is $X_{(2)}$. Moreover, the statistic $X^2(\hat{t})$ is equally likely to be an attained maximum or only a supremum. In other words, the probability that the first cell is right-closed is equal to the probability that the first cell is right-open.

To find the distribution of t̂ when the sample size n is 2, we compute the cumulative probability

P(t̂ ≤ t) = P(X(1) ≤ t, t̂ = X(1)) + P(X(2) ≤ t, t̂ = X(2)),

which can be computed over four separate regions of t. The corresponding density function in each case is given according to these four regions, as shown in Figure 3.6.

(a) In the first region, when t ≤ 1/4, the CDF P(t̂ ≤ t) is the sum of two double integrals:

∫₀ᵗ ∫_{4u−4u²}^{1−u} 2 dv du + ∫₀ᵗ ∫_{(1−√(1−v))/2}^{v} 2 du dv.

Integrating both of them and simplifying the sum, we get the CDF of the data-driven cutpoint t̂:

t − 4t² + (8/3)t³ − (2/3)√(1−t) + (2/3)t√(1−t) + 2/3.

Figure 3.6: Regions for X(1) (y) and X(2) (x)

Then the density of the data-driven cutpoint t̂ in this case is the derivative of the CDF, and so

f_t̂(t) = 8t² − 8t + √(1−t) + 1;

(b) In the second region, when t ∈ (1/4, 1/2], we similarly get the density of the data-driven cutpoint t̂:

f_t̂(t) = √t + 4t + √(1−t) − 2;

(c) In the next region, when t ∈ (1/2, 3/4], the corresponding density of t̂ is

f_t̂(t) = √t − 4t + √(1−t) + 2;

(d) In the last region, when t ∈ (3/4, 1], the density is

f_t̂(t) = 2(1 − 2t)² + √t − 1.

Thus, the density curve of t̂ when the sample size is 2 is a “W”-shaped curve, as shown in Figure 3.7 (left). A simulated frequency histogram is also given in Figure 3.7 (right). We see that the simulated curve is consistent with the calculated density.
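The piecewise density above is easy to check numerically. The following sketch (my own simulation, not code from the thesis) draws n = 2 samples, determines which order statistic attains the supremum of the two-sided 2CS statistic, and compares the empirical probability P(t̂ ≤ 1/4) with the calculated CDF value, about 0.2753:

```python
import numpy as np

# Piecewise density of the cutpoint t_hat for n = 2, as derived in (a)-(d).
def f_cutpoint(t):
    t = np.asarray(t, dtype=float)
    out = np.empty_like(t)
    a = t <= 0.25
    b = (t > 0.25) & (t <= 0.5)
    c = (t > 0.5) & (t <= 0.75)
    d = t > 0.75
    out[a] = 8*t[a]**2 - 8*t[a] + np.sqrt(1 - t[a]) + 1
    out[b] = np.sqrt(t[b]) + 4*t[b] + np.sqrt(1 - t[b]) - 2
    out[c] = np.sqrt(t[c]) - 4*t[c] + np.sqrt(1 - t[c]) + 2
    out[d] = 2*(1 - 2*t[d])**2 + np.sqrt(t[d]) - 1
    return out

# The four pieces should integrate to 1 (trapezoidal rule on a fine grid).
grid = np.linspace(1e-9, 1 - 1e-9, 400001)
vals = f_cutpoint(grid)
area = float(np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2)

# Simulate the cutpoint: for n = 2 the statistic n(Fn(t) - t)^2 / (t(1 - t))
# is maximized around X(1) (from either side) or around X(2).
rng = np.random.default_rng(1)
u, v = np.sort(rng.uniform(size=(200000, 2)), axis=1).T
at_u = np.maximum(2*u/(1 - u), 2*(0.5 - u)**2 / (u*(1 - u)))  # sup around X(1)
at_v = np.maximum(2*(1 - v)/v, 2*(0.5 - v)**2 / (v*(1 - v)))  # sup around X(2)
t_hat = np.where(at_u >= at_v, u, v)
prop = float(np.mean(t_hat <= 0.25))
```

A histogram of `t_hat` reproduces the “W” shape, and `prop` agrees with the calculated CDF at t = 1/4.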

When the sample size n is 3, the cutpoint t̂ will be one of {X(1)⁻, X(1)⁺, X(2)⁻, X(2)⁺, X(3)⁻, X(3)⁺}. The probability of its being each of them can be summarized as

t̂            X(1)⁻     X(1)⁺     X(2)⁻     X(2)⁺     X(3)⁻     X(3)⁺
probability   0.21094   0.15356   0.13550   0.13550   0.15356   0.21094

Figure 3.7: Null Distribution of t̂ When n = 2

With respect to the observations, the probabilities are

t̂            X(1)     X(2)     X(3)
probability   0.3645   0.2710   0.3645

The cutpoint is thus more likely to be at the two tails than at the center. The calculation of the distribution of t̂ when the sample size is n = 3 is much more complicated, so it is skipped here; only the simulated frequency histogram of the cutpoint t̂, with 5000 replications, is given in Figure 3.8.

Figure 3.8: Frequency Histogram of t̂, ε = 0 (n = 3, 5000 replications)

By looking at the frequency histograms of the cutpoint t̂ with different sample sizes (small, medium, and large) and numbers of subintervals 20, 50, 100, 200, 400, and 800, respectively, we see that a significant number of frequencies accumulate at the two ends and the curve is concave up. Thus the 2DDCS test is less sensitive to the center than to the two ends when the sample size is larger than or about 10. The problem is how much we

should take off from the ends. We may choose different minimum cell lengths for typical

sample sizes. According to our simulation results conducted under the null distribution,

the smallest appropriate minimum cell length in each case is listed in the following table.

sample size

small (n

3.5 Cutpoint of Circular 2DDCS

Recall the definition of the circular 2DDCS statistic X²(t̂c) (2.7), which is similar to the 2DDCS statistics if we write it as

X²(t̂c) = sup_{t∈[.5,1]} n (Fn(t) − Fn(t − .5) − .5)² / (.5(1 − .5)).

It can be simplified as

X²(t̂c) = 4n sup_{t∈[.5,1]} (|Fn(t) − Fn(t − .5) − .5|)².

Then we might define the cutpoint t̂c* just as the ones on a line and let it be the first right cutpoint of the middle piece on the circle such that the corresponding circular 2CS statistic is maximized:

t̂c* = Arg max_{t∈[.5,1]} 4n[Fn(t) − Fn(t − .5) − .5]².    (3.2)

However, such a maximizing cutpoint is actually not unique here; instead, it is an interval of points. A simple example is given when n = 1, where any t ∈ [.5, 1] gives the same chi-square value

X²(t) = 4n[Fn(t) − Fn(t − .5) − .5]² = 4(1)(.5)² = 1,

which is also the value of X²(t̂c) whenever the sample size is 1. Definition (3.2) therefore seems unreasonable. Thus we look for another definition of the cutpoint with respect to the observations.

Let X₁, X₂, …, Xn be a random sample from uniform(0, 1) and let Ni be the maximum number of observations covered by one of the semicircles with endpoints Xi and Xi*, as defined in Section 2.4. Then the circular 2DDCS statistic X²(t̂c) can be written with respect to Ni as in Definition (2.10):

X²(t̂c) = (4/n)(max_{1≤i≤n}{Ni} − .5n)².

Therefore, we may define the cutpoint in terms of the sample observations:

t̂c = the first Xj such that Nj = max_{1≤i≤n}{Ni},    (3.3)

where max_{1≤i≤n}{Ni} is the maximum number of observations covered by some semicircle, as for the random variable N defined in Ajne's (1968) [1] paper. Intuitively, a well-defined cutpoint

should be unique given any random sample and the null distribution of the cutpoint for a


circular 2DDCS statistic should be uniformly distributed over the circumference, because the points are uniformly located and the null distributions of the random variables X²(t) and N are both free of the location t. That is, if {X₁, X₂, …, Xn} is the original random sample from uniform(0, 1) and {X₁′, X₂′, …, Xn′} is the shifted sample such that Xi′ = (Xi + θ) mod 1, i = 1, 2, …, n, and t̂c, t̂c′ are the cutpoints in terms of the original and shifted sets of sample observations respectively, then t̂c′ = (t̂c + θ) mod 1. Therefore the cutpoint should not be defined as the first order statistic which maximizes X²(t), since the null distribution of such a cutpoint would be skewed. The solution is randomization.

The cutpoint can be defined in terms of the random sample observations instead of the ordered ones. That is, we define the circular cutpoint t̂c to be the first sample observation Xi at which X²(t) is maximized. The word “first” is necessary in the definition because there can be more than one maximizing point. For example, when the sample size is 2, both of the observation points {X₁, X₂} maximize the circular 2CS statistic value. If we defined the smallest maximizing observation point to be t̂c, then t̂c would always equal X(1), and the cutpoint would not be uniformly distributed under the null. However, if the first random observation is defined to be the cutpoint, then t̂c can be either X₁ or X₂, and it is totally random. In other words, a point is randomly selected to be the cutpoint from the set of maximizing points. The simulated density histogram shown in Figure 3.9 verifies the uniformity of the cutpoint t̂c.
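A quick way to see the effect of this randomization is to simulate Definition (3.3) directly. The sketch below is my own illustration (not code from the thesis): it computes Ni for each unordered observation as the larger count of the two semicircles with endpoints Xi and Xi*, and takes the first maximizer in sample order.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10, 20000
x = rng.uniform(size=(reps, n))                  # unordered samples on the circle
arcs = (x[:, None, :] - x[:, :, None]) % 1.0     # clockwise arc length from x_i to x_j
c = (arcs < 0.5).sum(axis=2)                     # points in [x_i, x_i + .5), incl. x_i
N_i = np.maximum(c, n - c + 1)                   # better of the two semicircles at x_i
first = np.argmax(N_i, axis=1)                   # first sample index attaining the max
cutpoints = x[np.arange(reps), first]

mean = float(cutpoints.mean())
# maximum deviation of the empirical CDF of the cutpoints from uniform(0, 1)
ecdf_dev = float(np.max(np.abs(np.sort(cutpoints) - np.arange(1, reps + 1) / reps)))
```

Because the first maximizer is taken in the (random) sample order rather than the sorted order, the empirical distribution of `cutpoints` stays flat over (0, 1), matching Figure 3.9.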

Figure 3.9: Density Histogram of t̂c When n = 10, K = 5000

Chapter 4

Null Distributions of 2DDCS Statistics

Three 2DDCS tests have been proposed, and the null distributions of the cutpoints have been discussed. Can we use the critical values from the chi-square table? That is, can we look at the data, choose the best cutpoint, which maximizes the discrepancy between the null hypothesis and the data, and carry out a Pearson 2CS test with the level-α critical value taken from a chi-square tail probability table? Such a test should be very powerful. However, is the actual probability of a type I error controlled to be α? The answer is “No”. We look at the two probabilities

P(X²(t̂ε⁺) ≥ χ²₁,.05 | the data is taken from uniform(0, 1)),

and

P(X²(t̂ε) ≥ χ²₁,.05 | the data is taken from uniform(0, 1)),


where χ²₁,.05 is the upper 5th percentile of the chi-square distribution with one degree of freedom. By simulation, we get that the two probabilities are 0.4142 and 0.4142, respectively. In other words, the actual α level in each case is more than 40%. Therefore, the critical values for Pearson chi-square statistics are no longer appropriate. The corrected values should be larger than those taken from χ²₁, since the chi-square values are maximized.

In this chapter, we discuss the exact and asymptotic null distributions of the linear and circular 2DDCS statistics. The statistics of interest in the finite-sample case will have no minimum cell length, that is, ε = 0. Both simulation results and theoretical analysis are presented. We use the same notation as in the previous chapters.
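This inflation is easy to reproduce by simulation. The sketch below is only an illustration of the phenomenon: the sample size n = 50, the grid resolution, and ε = .05 are my own choices, not settings taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, eps = 50, 5000, 0.05
t = np.linspace(eps, 1 - eps, 501)    # cutpoint grid over (eps, 1 - eps)
crit = 3.841                          # chi-square(1) upper 5% critical value

rejections = 0
for _ in range(reps):
    x = np.sort(rng.uniform(size=n))
    Fn = np.searchsorted(x, t, side="right") / n   # empirical CDF at each t
    stat = n * (Fn - t) ** 2 / (t * (1 - t))       # Pearson 2CS statistic at t
    if stat.max() >= crit:
        rejections += 1
rate = rejections / reps
```

Under uniform(0, 1) data the rejection rate of the maximized statistic lands well above the nominal .05, consistent with the inflated levels reported above.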

4.1 KS Statistics

When a sample is taken from uniform(0, 1), it is well known that {√n[Fn(t) − t]}_{t∈(0,1)} is asymptotically a Brownian bridge process W₀(t), and so the KS statistic

Dn = √n sup_{t∈[0,1]} |Fn(t) − t|

is the supremum of √n |W₀(t)|. By this fact, the cumulative distribution function (CDF) of Dn is given by

P(Dn ≤ x) = 1 − 2 Σ_{i=1}^{∞} (−1)^{i−1} exp(−2i²x²/n),

which is sometimes written in the following form:

F_{Dn}(x) = (√(2nπ)/x) Σ_{i=1}^{∞} exp(−(2i−1)²nπ²/(8x²)).

Then the critical values for the two-sided KS test can be derived.
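Substituting z = x/√n, both series become the standard Kolmogorov distribution function. A small sketch (not from the thesis) checks that the two forms agree and reproduce the familiar two-sided critical value z ≈ 1.358 at level .05:

```python
import math

def ks_cdf_alternating(z, terms=100):
    # P <= z via 1 - 2 * sum_{i>=1} (-1)^(i-1) exp(-2 i^2 z^2)
    return 1 - 2 * sum((-1) ** (i - 1) * math.exp(-2 * i * i * z * z)
                       for i in range(1, terms + 1))

def ks_cdf_theta(z, terms=100):
    # equivalent theta-series form: sqrt(2 pi)/z * sum exp(-(2i-1)^2 pi^2 / (8 z^2))
    s = sum(math.exp(-(2 * i - 1) ** 2 * math.pi ** 2 / (8 * z * z))
            for i in range(1, terms + 1))
    return math.sqrt(2 * math.pi) / z * s
```

Both forms give 0.950 at z = 1.3581, which is where the usual KS critical value comes from; the alternating form converges fast for large z and the theta form for small z.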

4.2 One-Sided 2DDCS Statistics

To get the null distribution of the one-sided 2DDCS statistic

X²(t̂⁺) = max_{1≤i≤n} (i/n − X(i))² / (X(i)(1 − X(i))),

we need to calculate

P(X²(t̂⁺) ≤ w²) = P((i/n − X(i))² / (X(i)(1 − X(i))) ≤ w², i = 1, 2, …, n),

where w is a nonnegative constant. When w = 0, we know

P(X²(t̂⁺) ≤ 0) = P(X(i) = i/n, i = 1, 2, …, n) = 0.

For nontrivial w² > 0, however, it is harder to evaluate the probability. The quadratic inequality

(i/n − X(i))² / (X(i)(1 − X(i))) ≤ w², i = 1, 2, …, n,

is equivalent to (i/n − X(i))² ≤ X(i)(1 − X(i))w², and it can be written in the standard form

(1 + w²)X(i)² − (w² + 2i/n)X(i) + i²/n² ≤ 0, i = 1, 2, …, n.    (4.1)

Let the interval [li(w), ui(w)] denote the solution of (4.1) for each i. Then we have

li(w) = [w² + 2i/n − (1/n)√((−4i² + 4ni + w²n²)w²)] / (2(1 + w²)),

and

ui(w) = [w² + 2i/n + (1/n)√((−4i² + 4ni + w²n²)w²)] / (2(1 + w²)), i = 1, 2, …, n.
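These closed forms are easy to sanity-check numerically: substituting them back into the quadratic in (4.1) should give zero, and the endpoints should stay in [0, 1] and increase with i. A short sketch of such a check (my own code, not the thesis's):

```python
import math

# Closed-form interval [l_i(w), u_i(w)] solving the quadratic inequality (4.1).
def bounds(i, n, w):
    disc = math.sqrt((-4 * i * i + 4 * n * i + w * w * n * n) * w * w) / n
    lo = (w * w + 2 * i / n - disc) / (2 * (1 + w * w))
    hi = (w * w + 2 * i / n + disc) / (2 * (1 + w * w))
    return lo, hi

# Left-hand side of (4.1): (1 + w^2) x^2 - (w^2 + 2i/n) x + i^2/n^2.
def quad(x, i, n, w):
    return (1 + w * w) * x * x - (w * w + 2 * i / n) * x + i * i / (n * n)
```

For any n and w, `quad(bounds(i, n, w)[0], i, n, w)` and the same at the upper endpoint vanish up to rounding, and both endpoints are monotone in i.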

4.2. ONE-SIDED 2DDCS STATISTICS

We should verify that the existence of the roots in their domain such that (4.1) will satisfy

the following properties:

1. (w2 +

i2

2i 2

) − 4(1 + w2 ) 2 ≥ 0 for any i = 1, 2, …, n.

n

n

2. 0 ≤ li (w) ≤ ui (w) ≤ 1, i = 1, 2, …, n.

3. li (w) , ui (w) are respectively monotonically increasing with respect to i for any given

n and w2 .

By simple calculations, we prove that all the three conditions are satisfied. Therefore, for

w2 > 0, we have

2

i/n − X(i)

≤ w2 , i = 1, 2, …, n

P

X(i) 1 − X(i)

= P X 2 t̂+ ≤ w2

* 1 * x3 * x

= 0 … 0 0 2 n!dx1 dx2 …dxn .

To see the value of the probability, we first check the simplest cases, when the sample size is really small, such as n = 1 or 2. When n = 1, the exact CDF of the one-sided 2DDCS statistic is

P(X²(t̂⁺) ≤ w²) = ∫_{l₁(w)}^{u₁(w)} 1 dt₁ = u₁(w) − l₁(w) = w²/(1 + w²).

When n = 2, the bounds for X(1) are

l₁(w) = (w² + 1 − w√(1 + w²)) / (2(1 + w²)),

and

u₁(w) = (w² + 1 + w√(1 + w²)) / (2(1 + w²)).

The bounds for X(2) are

l₂(w) = 1/(1 + w²), and u₂(w) = 1.

Now we compute the CDF of X²(t̂⁺) when the sample size is 2:

P(X²(t̂⁺) ≤ w²) = 2 ∫_{l₂(w)}^{u₂(w)} ∫_{l₁(w)}^{u₁(w)} 1 dt₁ dt₂ = 2(u₁(w) − l₁(w))(u₂(w) − l₂(w)) = 2 · (w/√(1 + w²)) · (w²/(1 + w²)) = 2w³/(1 + w²)^{3/2}.

In this way, we can calculate the exact cumulative distribution of the one-sided 2DDCS statistic for any finite sample size, but the complexity of the calculation increases rapidly as the sample size grows. An alternative is to use simulated critical values when the sample size is not too small. Simulated critical values of the one-sided 2DDCS statistics are compared with those of the two-sided ones in the next section.
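For a small value of w, the n = 2 formula can be checked against a direct simulation of the one-sided statistic. The sketch below uses w = 0.5, my own choice of value:

```python
import numpy as np

w = 0.5
cdf_exact = 2 * w**3 / (1 + w**2) ** 1.5      # derived CDF of X^2(t_hat+) at w^2, n = 2

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(size=(200000, 2)), axis=1)
x1, x2 = x[:, 0], x[:, 1]
# one-sided 2DDCS statistic for n = 2: max over i of (i/n - X(i))^2 / (X(i)(1 - X(i)))
stat = np.maximum((0.5 - x1) ** 2 / (x1 * (1 - x1)),
                  (1.0 - x2) ** 2 / (x2 * (1 - x2)))
cdf_sim = float(np.mean(stat <= w**2))
```

At w = 0.5 the derived value is about 0.1789, and the simulated proportion matches it closely.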

4.3 Two-Sided 2DDCS Statistics

Miller and Siegmund's maximally selected chi-square statistic for the two-sample test of homogeneity has been discussed in the literature. Halpern (1982) [14] simulated the finite-sample distribution of the maximally selected chi-square statistic, and Koziol (1991) [17] derived the exact finite-sample distribution theory from Durbin's (1971) [6] combinatorial approach. In this section, Miller and Siegmund's method is applied to get tables of asymptotic critical values. Simulations are conducted for comparison as well.

Recall that the two-sided 2DDCS without a minimum cell length is

X²(t̂) = sup_{0<t<1} n(Fn(t) − t)² / (t(1 − t)).

…

4.4 Circular 2DDCS Statistics

… N ≥ .5n, and k − .5n > 0 if k = ⌊n/2⌋ + 1, …, n, so the inequality (N − .5n)² ≥ (k − .5n)² is equivalent to N − .5n ≥ k − .5n, and then we have

P(X²(t̂c) ≥ 4n(k/n − 1/2)²) = P(N ≥ k), if k ≥ ⌊n/2⌋ + 1.

Now we are ready to get the distributions of X²(t̂c). When the sample size is 1, we know that X²(t̂c) = 1 with probability 1, because any semicircle will cover either 0 or 1 observation. Similarly, the value of X²(t̂c) will be 4(2)(2/2 − .5)² = 2 with probability 1 if the sample size is n = 2, since the complement of the event that both of the two points can be covered by


some semicircle is X(2) − X(1) = .5, whose probability is 0. However, when n = 3, it requires some calculation to get the exact distribution of X²(t̂c). We know the maximum number of observations covered by some semicircle can be either 2 or 3. The event N = 3 occurs when one of the three circular spacings of the sample has length at least .5: either the wrap-around gap is at least .5, which happens exactly when X(3) − X(1) ≤ .5, or one of the two interior spacings is at least .5. Applying the joint density 6(v − u) of the minimum X(1) = u and the maximum X(3) = v for the first case, and the spacing distribution for the other two, we get

P(N = 3) = ∫₀¹ ∫_{max(v−.5, 0)}^{v} 6(v − u) du dv + 2(1 − .5)³ = .5 + .25 = .75.

Thus, the exact null distribution of X²(t̂c) can be summarized as below.

N             2                        3
X²(t̂c)       4(3)(2/3 − .5)² = 1/3    4(3)(3/3 − .5)² = 3
probability   .25                      .75

For larger sample sizes, we may apply Ajne's result about N. Ajne (1968) [1] proved (Theorem 1 in his paper) that for k = ⌊n/2⌋ + 1, …, n,

P(N ≥ k) = 2^{−(n−1)} (2k − n) Σ_{j=0}^{∞} C(n, j(2k − n) + k).

Thus, for k = ⌊n/2⌋ + 1, …, n and wk² = 4n(k/n − 1/2)², we have

P(X²(t̂c) ≤ wk²) = P(N ≤ k) = 1 − P(N ≥ k + 1),    (4.3)
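Ajne's series is a finite sum in practice, since the binomial coefficient vanishes once j(2k − n) + k exceeds n. A sketch (my own code, not the thesis's) evaluates it and compares with simulated values of N:

```python
import math
import numpy as np

def ajne_tail(n, k):
    # P(N >= k) = 2^-(n-1) (2k - n) sum_j C(n, j(2k - n) + k), Ajne (1968), Thm 1,
    # for k = floor(n/2) + 1, ..., n; the sum stops once j(2k - n) + k > n.
    total, j = 0, 0
    while j * (2 * k - n) + k <= n:
        total += math.comb(n, j * (2 * k - n) + k)
        j += 1
    return (2 * k - n) * total / 2 ** (n - 1)

# Simulate N, the maximum number of observations covered by some semicircle:
# for each x_i, take the larger count of the two semicircles with endpoints x_i, x_i*.
rng = np.random.default_rng(5)
n, reps = 7, 100000
x = rng.uniform(size=(reps, n))
arcs = (x[:, None, :] - x[:, :, None]) % 1.0   # clockwise arc from x_i to x_j
c = (arcs < 0.5).sum(axis=2)                   # points in the semicircle starting at x_i
N = np.maximum(c, n - c + 1).max(axis=1)
```

For n = 3, k = 3 the formula reproduces the value .75 derived above, and for n = 7 the tail probabilities agree with the simulated frequencies of N.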

that is, the events {N ≤ k} and {X²(t̂c) ≤ wk²} coincide. …
