Hi, I need help with this article review, thank you.

Article Review Instructions

You will write a 2-page review/abstract/summary of an article from a peer-reviewed scholarly journal. This is to assess your ability to select and summarize the research of others, analyze and apply the research of others, and communicate professionally and effectively in that regard. However, the most important rationale for this assignment is for you to see how statistical analysis is presented.
Instructions:
1. Use the library system (https://www.ucumberlands.edu/library) or online catalog to locate a journal article that pertains to your research, thesis, or area of interest. The article you choose should have performed some statistical analysis of gathered data and made an inference using something other than just the average (mean). That is, the authors can't just talk about averages; they must have used one of these tests: t-test, chi-square, F-test, Fisher test, ANOVA, MANOVA, ANCOVA, Mann-Whitney, correlation, regression.
2. Read the article thoroughly.
3. Write a 2-page summary about the article following the given guidelines.
Detailed Guidelines:
1. Your review should include:
a. The question/problem being researched by the author
b. The experiment that will answer the question
c. How they collected data
d. Analysis of the data (Must identify the statistical test used)
e. Their conclusion or findings
2. Your review can be single-spaced or double-spaced, in at least 11 pt font.
3. You should make a reference page to list the one article you chose and practice APA format. NOTE: the APA guidelines address specifically how the reference page is titled and how articles are cited; it is not called a Bibliography.
4. Graphs or visual representations are not needed for this assignment.
5. A title page is not required, but you may include one if you want to practice
APA style. If you have a title page, your submission is 4 pages. If you do not
have a title page, your submission is 3 pages. Do NOT go over.
Grading:
Your grade on these assignments will be based on the rubric below. FOLLOW THE GUIDELINES above to earn full credit for the assignment. Failure to follow instructions, include all requested pieces, or keep within the page limit will result in a loss of points.
Rubric:
ON DATA-DRIVEN CHI SQUARE STATISTICS
by
Huiyu Qian
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Doctor of Philosophy
in
Mathematics
Lehigh University
April, 2009
© Copyright by Huiyu Qian, 2009
Approved and recommended for acceptance as a dissertation in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Huiyu Qian
On Data-Driven Chi Square Statistics

Dissertation Director: Prof. Wei-Min Huang
Dissertation Coadvisor: Prof. Bennett Eisenberg
Committee Members: Prof. Ping-Shi Wu; Prof. D. Gary Harlow (Mechanical Engineering Dept., Lehigh University)
Acknowledgments
First of all, I would like to express my deep gratitude and appreciation to my advisor, Wei-Min Huang, and my coadvisor, Bennett Eisenberg, for their generous time and commitment throughout my doctoral work. They encouraged me to develop independent analytical thinking and research skills and also greatly assisted me with my scientific writing. Without their guidance and patience, this dissertation would not have been possible. I enjoyed and learned a lot in their lectures and academic meetings, not only mathematical and statistical knowledge, but also how to teach and how to generate new ideas.

I would also like to thank the members of my committee, Ping-Shi Wu and D. Gary Harlow, for their helpful comments and encouragement. I would also extend many thanks to Christine Banzoff and especially Mary Ann Dent for working extremely hard to make our department like a big family. Thanks to the graduate students who left before me for their kind support and friendship. I would particularly like to thank John Frommeyer and Francisco Ojeda for their help while I was working in the Writing and Math Center.

Finally, I am grateful to my family members, especially my parents, Ping Lu and Weijiang Qian, for their constant love, understanding and support during difficult times.
Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
2 Construction and Preliminaries of 2DDCS Statistics
2.1 2DDCS Statistics on a Line
2.2 Minimum Cell Length
2.3 Related Stochastic Processes
2.4 Set of Possible Cutpoints
2.5 Circular 2DDCS Statistics
3 Null Distribution of the Cutpoints
3.1 Cutpoint of One-Sided KS
3.2 Cutpoint of Two-Sided KS
3.3 Cutpoint of One-Sided 2DDCS
3.4 Cutpoint of Two-Sided 2DDCS
3.5 Cutpoint of Circular 2DDCS
4 Null Distributions of 2DDCS Statistics
4.1 KS Statistics
4.2 One-Sided 2DDCS Statistics
4.3 Two-Sided 2DDCS Statistics
4.4 Circular 2DDCS Statistics
5 Power Study
5.1 Power Function of Two-Sided 2DDCS Tests
5.2 Optimality of Circular 2DDCS Tests
5.3 Simulation Results
6 Applications
6.1 Goodness-of-Fit with Real Data
6.2 Test Statistics with Unknown Parameters
6.2.1 Parameter Estimation
6.2.2 Methodology
6.2.3 Simulation Results
6.2.4 Critical Values
6.2.5 Power Study
6.3 Other Applications
6.3.1 Optimal Cutpoint and Two Sample Tests
6.3.2 Possible Application in Regression Without Replications
7 Conclusion
A Claims and Theorems
A.1 Set of Possible Cutpoints
A.2 Cutpoint of One-Sided KS
A.3 Consistency of Two-Sided 2DDCS Tests
B Approximation of Powers
B.1 Neyman-Pearson Test
B.2 Likelihood Ratio Test
B.3 Pearson 2CS Tests
C An Example: Power Function of Two-Sided 2DDCS
VITA
List of Tables

4.1 Critical Values of 2DDCS Statistics on a Line, n=200 and k=5*1000
4.2 Critical Values of 2DDCS Statistics on a Line, n=400 and k=5*1000
4.3 CDF of the Circular 2DDCS Statistic when n=10
4.4 Critical Values of Circular 2DDCS Statistics
4.5 Cumulative Distribution of N and the Circular 2DDCS Statistic, n=50
4.6 Cumulative Distribution of N and the Circular 2DDCS Statistic, n=100
5.1 Cutoff Values for Power Comparison I, alpha=.05, n=100, K=5000
5.2 Power Comparison I, alpha=.05, n=100, K=5000
5.3 Parameters of the Alternatives
5.4 Simulated Upper 5 Percentiles, n=100, K=5000
5.5 Power Comparison II, alpha=.05, n=100, K=5000
6.1 Simulated Critical Values (Samples Taken from Weibull(2,1)), n=10
6.2 Simulated Critical Values (Samples Taken from Weibull(2,1)), n=100
6.3 Simulated Powers of Testing Weibull(a,b), n=100, K=1000, alpha=.05
List of Figures

1.1 Example 1 of Regular Chi-Square Tests
2.1 Pearson's CS Test for Uniformity: Various Choices of Cells under Different Alternatives
2.2 Pearson's 2CS Statistic as a Function of t
3.1 Frequency Histogram of V10+, k = 5000
3.2 Frequency Histogram of V2 (5000 replications)
3.3 Frequency Histogram of V50 (5000 replications)
3.4 Density of t̂+, ε = 0, n = 2
3.5 Frequency Histogram of t̂+ with ε = .05 (n = 50, 5000 replications)
3.6 Regions for X(1)(y) and X(2)(x)
3.7 Null Distribution of t̂ When n = 2
3.8 Frequency Histogram of t̂ with ε = 0 (n = 3, 5000 replications)
3.9 Density Histogram of t̂c When n = 10, K = 5000
4.1 X²(t̂), n = 1000, ε = .01, K = 5000
4.2 Density Histogram of X²(t̂) with Fitted Curve, n = 1000, K = 5000
4.3 Probability Histogram of Circular 2DDCS Statistics, n = 200, K = 5000
5.1 Line Bounds a Curved Boundary
5.2 Power Comparison: Uniform vs. Linear Alternative
5.3 Densities of g1s and g2s
5.4 Densities of g5s and g6s
6.1 Density Histogram of the Galaxies Data
6.2 Circular Plot of the 76 Turtles' Orientations
6.3 Critical Values with Different Significance Levels and Sample Sizes
Abstract
Pearson chi square tests have been very popular because they are intuitive, natural and
easy to carry out for most categorical data sets. However, the construction of the cells
has to be determined when the population is continuous. Moreover, the power of such an
arbitrarily selected chi square test for continuous data is very unstable and depends on the
choice of the cells. We propose several data-driven chi square tests in which the choice of
cells is based on the data itself. Two-cell data-driven chi square tests for data on a line
and on a circle are our main concerns. For data on a line, the tests require a minimum cell
length ε to avoid singularity. We study how to choose the proper value of ε and the set of
possible cutpoints. For directional data, we show that the circular two-cell data-driven chi
square test with equal cell lengths is equivalent to Ajne’s N test. By comparing with several
related tests, we find that our proposed tests are more powerful for a generic alternative
than a particular Pearson chi square test with the cells taken without investigating the data.
Examples of applications of the methods are also given.
Chapter 1
Introduction
Pearson’s chi square statistic was introduced by Karl Pearson in 1900. It measures the
discrepancy between the data and the proposed model, which is called the null hypothesis
in a hypothesis test. The statistic is written as

$$\chi^2 = \sum_{i=1}^{k} \frac{(Y_i - E_i)^2}{E_i}. \tag{1.1}$$
It is the weighted sum of squares of the difference between the empirical frequency Yi and
the expected frequency Ei in each cell, group or category under the null hypothesis. Let
X1 , …, Xn be a random sample of n independent observations from a common population
with distribution function F . The chi square test was originally used to test the null
hypothesis that the data follows a specific distribution F0 versus a general alternative:
$$H_0: F(t) = F_0(t) \ \text{for all } t \in \mathbb{R} \qquad \text{vs.} \qquad H_a: F(t) \neq F_0(t) \ \text{for some } t.$$

Such a test based on the statistic (1.1) is called "Pearson's chi square test for goodness-of-fit". A similar test statistic to (1.1) is also used for a two-sample test of homogeneity or
independence. Since the idea and procedure of the three types of chi square tests are all
alike, without loss of generality, we concentrate on the test of goodness-of-fit. When the
expected frequency in each cell is large, it is a well known result that the statistic (1.1)
has approximately a chi square distribution with k − 1 degrees of freedom, which is usually
denoted as χ2k−1 . The null hypothesis is rejected when the observed value of the chi square
statistic is extreme. However, a small observed value of the chi square statistic does not
imply that the data is from the proposed model for continuous populations because the
selection of the cells for the test might hide the difference between the two distributions F
and F0 .
One of the reasons that a chi square goodness-of-fit test has been widely used is that
the statistic (1.1) can be used for any univariate distribution. The null distribution could
be discrete, continuous or mixed. We may also modify or tailor the statistic (1.1) when
the hypothesized distribution is not fully specified. The test is naturally carried out when
the data is grouped. In the continuous case, however, the construction of the cells has to
be determined before the calculation of the statistics. In the literature, there has been a
lot of discussion about the choice of the number of cells and the cell sizes. But there is no
explicit solution to the problem of constructing the cells such that the corresponding test will
be powerful for a general continuous alternative. For instance, Mann and Wald (1942)[20]
study the number of equiprobable disjoint cells for a given sample size and significance level.
In their result, the optimal number of cells is about n2/5 , where n is the sample size. They
show that the corresponding chi square test is unbiased and the test statistic can be closely
approximated by a chi square distribution. However, Kallenberg et al. (1985)[16] argue
that by taking unequiprobable cells one gets much more power for heavy-tailed alternatives
than by applying Mann and Wald's rule. Moreover, Hall (1985)[13] introduces a chi square type statistic with overlapping cells, and Inglot et al. (2003)[15] study a data-driven chi square statistic. Here "data-driven" means the construction of the cells is driven by the given data set. Neither Hall's nor Inglot's test statistic necessarily follows an asymptotic chi square distribution under the null hypothesis. The main issues involved are (i) the number of cells, (ii) equiprobable cells or not, (iii) overlapping cells or not, and (iv) data-driven or not. We propose data-driven chi square type statistics with disjoint cells and we
focus on the two-cell case. Let X(1), ..., X(n) be the order statistics of a random sample of n independent observations from a common population with distribution function F0. In addition, let T(i) = F0(X(i)), i = 1, 2, ..., n. Then, with the assumption of the continuity of F0(x), T(1), T(2), ..., T(n) is an ordered sample of n i.i.d. observations from the uniform distribution on the unit interval. Therefore, we may focus on the goodness-of-fit test for uniformity over [0, 1], that is,

$$H_0: F(t) = t, \quad t \in [0, 1].$$
Given a continuous data set, our concern is how to construct the cells of a Pearson chi
square (CS) statistic based on the data. Different constructions may lead to totally opposite
decisions as shown in the following example:
Example 1 Given the following data set with the sample size of 30:
.091 .162 .184 .309 .314 .329 .352 .359 .393 .404 .428 .474 .480 .512 .545 .547 .552 .556 .562
.563 .578 .591 .626 .627 .633 .656 .694 .766 .772 .850,
we test if the data are taken from a uniform distribution, that is, H0 : F (t) = t, for any
t ∈ [0, 1] , by carrying out a Pearson chi square test.
What a statistical analyst will typically do is cut the unit interval into two or three equiprobable cells. Suppose that analyst A would like to use two cells, [0, 1/2] and (1/2, 1], while analyst B decides to have three cells, [0, 1/3], (1/3, 2/3], (2/3, 1]. Then analyst A will get the value of the chi square statistic

$$\frac{(13-15)^2}{15} + \frac{(17-15)^2}{15} = .533,$$

which is far below the level-.05 critical value (χ²₁,.₀₅ = 3.841). Therefore he will not reject the null hypothesis at level α = .05. However, analyst B will conclude to reject uniformity over [0, 1] at the same significance level because his observed value is

$$\frac{(6-10)^2}{10} + \frac{(20-10)^2}{10} + \frac{(4-10)^2}{10} = 15.2,$$

and it is much larger than the corresponding cutoff value (χ²₂,.₀₅ = 5.991). The frequency
histograms of the data with two and three equiprobable bins under the null are shown
respectively in Figure 1.1. The shaded regions denote the frequency deviations of the data
from uniformity. There is a significant deviation in the three-bin case but not in the two-bin
case. Which test should we take? In this case, we may take the three-cell test since it shows
an extremely large discrepancy between the data and the uniform distribution. However,
given a different data set, we may construct the cells differently. Thus a good choice of the
cells is data-dependent.
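For readers who want to reproduce Example 1, here is a minimal sketch (assuming numpy and scipy are available; the helper name pearson_cs is our own illustration, not part of the dissertation):

```python
import numpy as np
from scipy.stats import chi2

data = np.array([.091, .162, .184, .309, .314, .329, .352, .359, .393, .404,
                 .428, .474, .480, .512, .545, .547, .552, .556, .562, .563,
                 .578, .591, .626, .627, .633, .656, .694, .766, .772, .850])

def pearson_cs(sample, cutpoints):
    """Pearson chi square statistic for uniformity on [0, 1] with given cell cutpoints."""
    edges = np.concatenate(([0.0], np.asarray(cutpoints), [1.0]))
    observed, _ = np.histogram(sample, bins=edges)
    expected = len(sample) * np.diff(edges)   # n * cell length under H0
    return np.sum((observed - expected) ** 2 / expected)

for cuts in ([1 / 2], [1 / 3, 2 / 3]):        # analyst A, then analyst B
    stat = pearson_cs(data, cuts)
    crit = chi2.ppf(0.95, df=len(cuts))       # k - 1 degrees of freedom
    print(round(stat, 3), round(crit, 3), "reject" if stat > crit else "do not reject")
```

This prints .533 against 3.841 (do not reject) and 15.2 against 5.991 (reject), matching the two analysts' opposite decisions.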
Our main interest is to find a good construction of the cells based on the given data
such that the corresponding chi square type test is relatively powerful under a broad class
of alternatives. In other words, the proposed test should be able to detect the deviation of
the given data from the null hypothesis even when we do not have too much information
about the underlying distribution. We concentrate on tests with disjoint cells and a fixed
significance level α.

[Figure 1.1: Example 1 of Regular Chi-Square Tests; frequency histograms of the data with two and three equiprobable bins]

Intuitively, the choice of the cells should maximize the discrepancy
between the alternative and the null to raise the power of a chi square test.
In Chapter 2, we give a detailed derivation of the selection rule of the cells for a chi square
test. We call the derived statistic the “Data-Driven Chi Square” statistic and it is denoted by
“DDCS” throughout this dissertation. In the simplest case when the number of cells is fixed
to be two, the resulting statistic is called a two-cell data-driven chi square (2DDCS) statistic.
It is similar to Miller and Siegmund (1982)[21]’s “Maximally Selected Chi-Square” (MSCS)
statistic and Inglot and Janic-Wroblewska’s “Data Driven Chi-Square Test” which we call
“IJDDCS”. All three are “Data-Driven” type chi square statistics. However, the MSCS test
is a two-sample test for homogeneity while our 2DDCS is a goodness-of-fit test. For any
given sample, the IJDDCS test allows more than two cells and the set of possible cutpoints
is fixed and finite while the 2DDCS test has two cells only and the set of possible cutpoints
is random and data dependent. To generalize the method for applications in directional
data, we propose a circular 2DDCS statistic with the edges wrapped around. For each
sample, we define the “cutpoint” t̂ to be the cutpoint of the 2DDCS statistic chosen based
on the data. Thus, a different sample may come up with a different cutpoint. Basically, the
cutpoint is the point where the corresponding chi square statistic is maximized. A drawback
of such a method is that an appropriate minimum cell length ε has to be determined to
avoid singularity.
In the third chapter, we consider the null distributions of the cutpoints for the 2DDCS
and related statistics. It turns out that the cutpoints of the 2DDCS tests on a line are
less likely to occur near the center under the null. For comparison, we also discuss the null
distribution of the cutpoint for a one-sided or two-sided Kolmogorov-Smirnov (KS) statistic.
The cutpoint of a circular 2DDCS statistic with fixed cell length is defined in such a way that it is uniformly distributed over (0, 1) under the null.
In Chapter 4, we discuss the null distributions of the 2DDCS statistics. We adapt the
asymptotic results on the null distribution of the MSCS statistic to the 2DDCS statistic on
a line. The critical values depend on the value of the minimum cell length ε. The circular
2DDCS statistic with a fixed cell length .5 is shown to be equivalent to Ajne’s N test.
Power studies in Chapter 5 show that the proposed 2DDCS tests are more robust than
an arbitrarily selected Pearson 2CS test and have higher power under a generic alternative
than the IJDDCS and KS tests. The 2DDCS tests on a line can be applied to any continuous
data when a Pearson 2CS test is appropriate. The tests can be tailored for a goodness-of-fit
test with unknown parameters. Examples of the applications are given in Chapter 6.
Chapter 2

Construction and Preliminaries of 2DDCS Statistics
Let X1 , …, Xn be a random sample of n i.i.d. random variables from a common distribution
F on the unit interval [0, 1] and let X(1) , …, X(n) be the order statistics from smallest to
largest. The empirical cumulative distribution function (ECDF) Fn (t) is defined as the
proportion of the number of observations (Xi s) ≤ t. In this chapter, we develop several
2DDCS statistics with which the corresponding tests will most likely make correct decisions
for a broad class of alternatives. In Section 2.1, we determine the selection rule of disjoint
cells based on the given data when the number of cells k is fixed. In Section 2.2, we conclude
that a minimum cell length ε is required for a 2DDCS statistic. Section 2.3 contains some
related results of stochastic processes for an asymptotic analysis of the 2DDCS statistics.
Moreover, we find that the set of possible cutpoints for the 2DDCS statistics on a line is the
set of all observations in Section 2.4. By such a claim, we rewrite the definitions of the test
statistics in terms of the order statistics {X(1) , …, X(n) } and we define the one-sided and
two-sided 2DDCS statistics for data on the real line. At the end of this chapter we propose
a wrapped-around 2DDCS statistic which can be used to test directional data.
2.1 2DDCS Statistics on a Line
The main problem of a Pearson chi square (CS) goodness-of-fit test for continuous data is
that the power of the test is not stable. That is, any fixed construction of the cells is good for
certain groups of alternatives, but it could be a bad choice for some others. The solution to
this problem is to choose the cells based on the data. In this section we derive a data-driven
chi square type statistic. The cells of interest are two disjoint cells covering [0, 1] and the
hypothesized distribution is the uniform distribution. However, we start the discussion in a
more general case with the number of cells not necessarily two and with a general continuous
null distribution. Let k ≥ 2 be the number of disjoint cells and t = (t1 , …, tk−1 ) ∈ Rk−1 be
the vector of cell cutpoints. We suppose that the number of cells k is determined before
choosing the cell cutpoints t. Given the sample X = {X1 , …, Xn }, we need to test H0 : the
data is taken from a distribution with CDF F0 . The idea of “data-driven” is to select the
statistic with the “right” cell cutpoints t among those in the collection:
$$X^2(t) = \sum_{i=1}^{k} \frac{\left[Y_i(t) - n p_i(t)\right]^2}{n p_i(t)}, \tag{2.1}$$

where t = (t1, ..., tk−1) ∈ R^(k−1) with −∞ = t0 < t1 < ... < tk−1 < tk = ∞, and Yi(t) is the number of observations falling in the ith cell. Let the alternative distribution be F1 and p′(t) = (p′1(t), ..., p′k(t)) be the cell probabilities according to F1. Therefore

$$p_i'(t) = F_1(t_i) - F_1(t_{i-1}), \quad i = 1, 2, \ldots, k.$$

Similarly, let the cell probabilities under the null hypothesis be

$$p_i(t) = F_0(t_i) - F_0(t_{i-1}) = t_i - t_{i-1}, \quad i = 1, 2, \ldots, k.$$
The form of the DDCS test statistic is similar to the Pearson CS statistic X²(t), except that the cell cutpoint vector t̂ is chosen based on the data, whereas the cutpoints t for a Pearson CS test should be selected without looking at the details of the data. We are looking for the cutpoints with which the corresponding CS statistic will be able to detect the deviation of the data from the null hypothesis as much as possible. In other words, we choose the cutpoints such that the corresponding Pearson CS test is the most powerful test in the collection (2.1). Therefore, the selection rule has to be derived from the power function of
a Pearson CS test with k cells. One solution is to apply the asymptotic theory of the chi square statistics X²(t). However, for a fixed pair of null and alternative distributions, the asymptotic power of any Pearson CS test is either 1 or 0, as shown by Neyman (1949)[23]. To make the asymptotic power neither 0 nor 1, the alternative has to be special: we let the alternative differ from the null by an amount that shrinks as n gets larger, as Cochran (1952)[5] discussed. We consider the following sequence of local alternatives,

$$p_i'(t) = p_i(t) + \frac{\delta_i(t)}{\sqrt{n}}, \quad i = 1, 2, \ldots, k-1,$$
where the quantities δ i (t) remain fixed as n increases and the vector of t is fixed as well.
It is well known that the asymptotic power function of the Pearson CS statistic X²(t), as defined in (2.1), is

$$P\left(X^2(t) \ge c(d,\alpha) \mid H_1\right) \approx P\left(\chi_d^2(\lambda(t)) \ge c(d,\alpha)\right),$$

where c(d, α) is the constant such that P(χ²_d ≥ c(d, α)) = α, and χ²_d(λ(t)) is a noncentral chi square random variable with noncentrality parameter

$$\lambda(t) = \sum_{i=1}^{k} \frac{\delta_i(t)^2}{p_i(t)}$$

and d = k − 1 degrees of freedom. By the approximation given by Patnaik (1949)[24],

$$P\left(\chi_d^2(\lambda(t)) \ge c(d,\alpha)\right) \approx 1 - \Phi\left(\frac{c(d,\alpha) - d - \lambda(t)}{\sqrt{2d + 4\lambda(t)}}\right),$$

we conclude that (c(d, α) − d − λ(t))/√(2d + 4λ(t)) needs to be minimized to maximize the power of a Pearson CS test. This is equivalent to maximizing the noncentrality parameter λ(t), since both the critical value c(d, α) and the degrees of freedom d are fixed. Moreover, the noncentrality parameter

$$\lambda(t) = \sum_{i=1}^{k} \frac{\delta_i(t)^2}{p_i(t)}$$

is a measure of discrepancy between the null hypothesized distribution F0 and the alternative distribution F1.
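As a quick numerical check of Patnaik's approximation, one can compare it against the exact noncentral chi square tail probability; the sketch below uses illustrative λ values of our choosing, not figures from the text:

```python
from math import sqrt
from statistics import NormalDist
from scipy.stats import chi2, ncx2

d, alpha = 2, 0.05
c = chi2.ppf(1 - alpha, d)                 # critical value c(d, alpha)

for lam in (0.5, 2.0, 8.0):                # illustrative noncentrality values
    exact = ncx2.sf(c, d, lam)             # P(chi2_d(lambda) >= c), exact
    patnaik = 1 - NormalDist().cdf((c - d - lam) / sqrt(2 * d + 4 * lam))
    print(lam, round(exact, 4), round(patnaik, 4))
```

Both columns increase with λ, which is exactly the monotonicity the cell-selection rule relies on.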
Just for reference, there are a couple of other formulas available in the literature to approximate the power P(χ²_d(λ(t)) > c(d, α)), for example those of Johnson (1959) and Sankaran (1963), the latter valid for d ≤ 2c(d, α) + 1. These estimates are all monotonically increasing with respect to λ(t), so they lead to the same conclusion.
The conclusion agrees with our graphical intuition shown in Figure 2.1. A good choice
of cells can maximize the chi square type discrepancy between the null and the alternative
distributions. On the other hand, an inappropriate construction of cells might hide the
deviations of the alternative from the null. The unit horizontal line represents the null
distribution and the vertical lines are the cutting lines to form the cells. The slant lines
are the densities of the alternative distributions. Notice that here we are discussing the
cases regarding the null and the alternative distributions without including any data. The
examples show that the effectiveness of detecting the difference of the null and alternative
depends a lot on our choice of the cells. That is, a Pearson CS test might have no power at
all under certain alternatives such as case (a) in Figure 2.1, but it can be really powerful
under other alternatives like case (c). On the other hand, given the alternative (or the
data), whether the decision of a blindly picked Pearson CS test is correct or not depends
on the construction of the cells as shown in cases (a), (b) and (d).
The agreement of the cell areas under the null and the alternative does not imply the
similarity of the hypotheses and that is why we usually conclude “do not reject the null”
instead of “accept the null” even when a very small value of the test statistic is observed.
That is, a conclusion of “do not reject the null” for a Pearson CS test does not mean the
data really follow the null distribution. Conversely, if the test concludes “reject the null”,
it is appropriate to say the data do not follow the proposed distribution as in cases (b), (c)
and (d). Therefore, we may conclude rejection of the null if two Pearson CS tests make
opposite conclusions as in Example 1. But it is not that simple and that is the reason
we have to modify the Pearson chi square tests. The cases in Figure 2.1 also indicate
that the construction of the cells for a powerful chi square test has to characterize the
null and alternative distributions to display the discrepancy between them when we have
specific information about the alternative distribution. However, the alternative is usually
not known in reality and so a modification of a Pearson chi square test should be made
to determine the cells based on the data and the hypothesized distribution such that the
difference can more likely be detected.
However, in real applications of a chi square test, the alternative distribution is not
known and λ (t) can not be calculated, so maximizing λ (t) is not the appropriate selection
rule for the cells. Notice that the noncentrality parameter λ (t) can also be written as
$$\lambda(t) = \sum_{i=1}^{k} \frac{\left(n p_i'(t) - n p_i(t)\right)^2}{n p_i(t)},$$
which can be approximated by the Pearson chi square statistic with cell cutpoints t
$$X^2(t) = \sum_{i=1}^{k} \frac{\left(Y_i(t) - n p_i(t)\right)^2}{n p_i(t)}.$$
Therefore, if t̂ is the vector of cutpoints maximizing the value of X²(t), then it approximately maximizes λ(t) as well, and so the Pearson CS test based on X²(t̂) will more likely be a powerful Pearson k-cell CS test for this specific population than others. We have to be aware that only when we pretend t̂ is randomly selected, without including any information about the sample, does X²(t̂) have a limiting chi square distribution under the null; the actual power of X²(t̂) is usually a little bit lower than the highest possible power in the collection (2.1). It is the trade-off for a test to be powerful in general rather than being very powerful locally under some alternatives but not at all under some others.
[Figure 2.1: Pearson's CS test for uniformity: various choices of cells (a)-(d) under different alternatives]

For any fixed number of cells k ≥ 2, we may select the cutpoints t which maximize the chi square statistic X²(t) with respect to t. However, since the cells are left-open, the absolute maximum value of X²(t) might not exist, and so we define the DDCS statistic to be the supremum of X²(t):

$$X^2(\hat{t}\,) = \sup_{t} X^2(t).$$
When the number of cells is fixed to be two and we test the data for uniformity over the unit interval, we get the following statistic:

$$X^2(\hat{t}\,) = \sup_{t \in (0,1)} X^2(t) = \sup_{t \in (0,1)} \frac{(nF_n(t) - nt)^2}{nt(1-t)}, \tag{2.2}$$

where X²(t) denotes the 2-cell chi square (2CS) statistic n(Fn(t) − t)²/(t(1 − t)) with cutpoint t, which is equivalent to (nFn(t) − nt)²/(nt(1 − t)). A test based on the statistic (2.2) is called a 2DDCS test.

The optimal cutpoint t* of a 2DDCS statistic is defined as

$$t^* = \inf\left\{\operatorname{Arg}\sup_{t \in (0,1)} \frac{(nF(t) - nt)^2}{nt(1-t)}\right\}, \tag{2.3}$$

where F(t) is the underlying distribution of the data.
2.2 Minimum Cell Length
For a 2CS statistic X²(t) = (nFn(t) − nt)²/(nt(1 − t)), the denominator t(1 − t) can't be 0 or too close to 0; otherwise, the value of the 2CS statistic will be very large as t gets very close to 0 or 1. To avoid singularity, we set up a minimum cell length ε in Definition 2.2 and define a new statistic (2.4) to include the minimum cell length:

$$X^2(\hat{t}_\varepsilon) = \sup_{t \in (\varepsilon,\, 1-\varepsilon)} \frac{(nF_n(t) - nt)^2}{nt(1-t)}, \tag{2.4}$$
where ε is a small positive real number called the minimum cell length. The appropriate minimum cell length ε depends on the sample size, the underlying distribution of the data and the null hypothesis. A larger sample size may allow a smaller minimum cell length. But what is the appropriate value of ε given a sample size n without any other information? Our goal in this section is to find the proper value of ε such that the 2DDCS test will have a small chance of picking a cutpoint t̂ at the two ends when the optimal cutpoint t* (2.3) is not there.
Since nFn(t) is a Binomial random variable with parameters n and t if the sample is taken from the null uniform distribution, the expected value of the 2CS statistic X²(t) is then a constant 1 for any t. Moreover, Var[(nFn(t) − nt)²] is

$$E\left[(nF_n(t) - nt)^4\right] - \left(E\left[(nF_n(t) - nt)^2\right]\right)^2,$$

where the first term is the fourth moment about the mean nt under the null,

$$E\left[(nF_n(t) - nt)^4\right] = nt(1-t)\left[3t^2(2-n) + 3t(n-2) + 1\right],$$

and the second term is the squared variance of nFn(t),

$$\left(E\left[(nF_n(t) - nt)^2\right]\right)^2 = \left[nt(1-t)\right]^2 = n^2 t^2 (1-t)^2,$$

so we have

$$\operatorname{Var}\left[(nF_n(t) - nt)^2\right] = nt(1-t)\left[-(2n-6)t^2 + (2n-6)t + 1\right].$$

The variance of X²(t),

$$\operatorname{Var}(X^2(t)) = \frac{-(2n-6)t^2 + (2n-6)t + 1}{nt(1-t)},$$

can be simplified as

$$\operatorname{Var}(X^2(t)) = 2 + \frac{6t^2 - 6t + 1}{nt(1-t)}.$$
Therefore, the 2CS statistic has the following properties under the null hypothesis:

1. E(X²(t)) = 1, for any t.
2. Var(X²(t)) goes to infinity as t goes to 0 or 1.
3. The value of Var(X²(t)) depends on the sample size n.
4. When n → ∞,
   i) Var(X²(t)) → 2 if nt → ∞;
   ii) Var(X²(t)) → 2 + 1/λ if nt → λ, a positive real number;
   iii) Var(X²(t)) → ∞ if nt → 0.
5. Var(X²(t)) has a minimum at t = 1/2.

To make sure that the value of the 2CS statistic X²(t) will most likely be in a reasonable range, we set the value of the cutpoint t to be in a truncated interval [ε, 1 − ε] instead of (0, 1).
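The mean and variance properties above are easy to check by simulation; here is a minimal sketch, using the fact that nFn(t) is Binomial(n, t) under the uniform null:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, t = 50, 200_000, 0.1

counts = rng.binomial(n, t, size=reps)           # nF_n(t) under the null
x2 = (counts - n * t) ** 2 / (n * t * (1 - t))   # 2CS statistic at cutpoint t

print(x2.mean())                                  # close to 1
print(x2.var(), 2 + (6 * t**2 - 6 * t + 1) / (n * t * (1 - t)))
```

Rerunning with t closer to 0 shows the variance blowing up, which is the reason for truncating to [ε, 1 − ε].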
Remark 1 A minimum cell length ε is required for a 2DDCS test. For larger sample sizes, the minimum cell length is required to be at least of order λ/n, where λ is a positive real number.
Moreover, the cutpoint t̂ is more likely to occur at the tails than at the median. On the other hand, the two-sided KS statistic Dn is equivalent to

$$D_n^* = \sup_{t \in (0,1)} (nF_n(t) - nt)^2,$$

where the quantity (nFn(t) − nt)² being maximized is the numerator of the 2CS statistic. The expected value of (nFn(t) − nt)² under the null is nt(1 − t), a downward-opening parabola with its unique maximum at t = .5, and the variance of (nFn(t) − nt)² under the null has a unique maximum at t = .5 as well. Thus the cutpoint of a two-sided KS statistic is more likely to be at the median than at the tails, and it is not necessary to truncate the interval for the value of t.
2.3 Related Stochastic Processes
The 2DDCS statistic (2.4) involves the empirical CDF Fn of a random sample from the uniform distribution over [0, 1] (the distribution is denoted "uniform(0, 1)"). It is well known that the ECDF of a random sample of size n from the uniform distribution over [0, 1] has the same distribution as {Pn(t) | Pn(1) = n}, where {Pn(t)} is the Poisson process with occurrence rate n and jumps of 1/n for 0 ≤ t ≤ 1. Asymptotically, it is related to the Brownian bridge process and Brownian motion.

Let X²(t) = n(Fn(t) − t)²/(t(1 − t)), where Fn(t) is the empirical CDF of a sample of size n from uniform(0, 1). There are several well-known results which will be used in later chapters:

1. If Yn(t) = √n (Fn(t) − t), then {Yn(t)}, t ∈ [0, 1], is called the sample process. It converges weakly to the Brownian bridge process W0(t).

2. The Brownian bridge process W0(·) can be transformed to a Brownian motion process W(·) by

$$W(\tau) = (1+\tau)\, W_0\!\left(\frac{\tau}{1+\tau}\right), \qquad \tau = \frac{t}{1-t}.$$

3. Let τ = t/(1 − t), that is, t = τ/(1 + τ); then the transformation from a Brownian bridge process W0(·) to a Brownian motion process W(·) can be written in a way matching the form of a 2CS statistic:

$$\frac{W(\tau)}{\sqrt{\tau}} = \frac{(1+\tau)\, W_0\!\left(\frac{\tau}{1+\tau}\right)}{\sqrt{\tau}} = \frac{W_0(t)}{\sqrt{t(1-t)}}.$$

4. If we rewrite the 2CS statistic X²n(t) = n(Fn(t) − t)²/(t(1 − t)) as

$$X_n^2(t) = \left(\frac{\sqrt{n}\,|F_n(t) - t|}{\sqrt{t(1-t)}}\right)^2,$$

then by 1. and 3., the chi square statistic converges weakly to (W(τ)/√τ)² with τ = t/(1 − t).

5. The supremum of the 2CS statistic converges weakly to the supremum of the squared standardized Brownian motion. That is,

$$\sup_{t \in (\varepsilon,\,1-\varepsilon)} \left(\frac{\sqrt{n}\,|F_n(t)-t|}{\sqrt{t(1-t)}}\right)^2 \overset{\text{weakly}}{\Longrightarrow} \sup_{\tau \in (\tau_1,\,\tau_2)} \left(\frac{|W(\tau)|}{\sqrt{\tau}}\right)^2,$$

where τ1 = ε/(1 − ε) and τ2 = (1 − ε)/ε, and so

$$\sup_{t \in (\varepsilon,\,1-\varepsilon)} \frac{\sqrt{n}\,|F_n(t)-t|}{\sqrt{t(1-t)}} \overset{\text{weakly}}{\Longrightarrow} \sup_{\tau \in (\tau_1,\,\tau_2)} \frac{|W(\tau)|}{\sqrt{\tau}}.$$

6. By applying the law of the iterated logarithm for Brownian motion (Durrett (1996)[7]), we get

$$\limsup_{\tau \to \infty} \frac{W(\tau)}{\sqrt{\tau}} = \limsup_{\tau \to \infty} \frac{W(\tau)}{\sqrt{2\tau \log|\log \tau|}} \cdot \frac{\sqrt{2\tau \log|\log \tau|}}{\sqrt{\tau}} = \infty$$

almost surely, since limsup of W(τ)/√(2τ log|log τ|) is 1 almost surely and √(2τ log|log τ|)/√τ → ∞ as τ → ∞. Thus the minimum cell length ε for the 2DDCS statistic is necessary.

2.4 Set of Possible Cutpoints
For computational reasons, the set of possible cutpoints for the 2DDCS statistic (2.4) can have a large size but has to be finite. A natural candidate is

$$S_{K_n} = \left\{\frac{1}{K_n}, \frac{2}{K_n}, \ldots, \frac{K_n - 1}{K_n}\right\},$$

where Kn is a constant integer depending only on the sample size n (larger n, larger Kn, smaller 1/Kn). It is the set considered by Inglot et al. (2003)[15]. However, we propose the set of the order statistics S = {X(1), ..., X(n)}, since we find that the supremum occurs only at the sample observations.

The graph of a function X²(t) based on a random sample from uniform(0, 1) is displayed in Figure 2.2, where the sample is {0.07936203, 0.36753821, 0.45932248, 0.46040850, 0.50693105, 0.67570917, 0.68063271, 0.86449449, 0.87271449, 0.92663101}. Diamonds denote left-hand limits and solid circles represent right-hand limits of X²(t) as t goes to the sample observation points. From the graph, we see that the supremum of X²(t) occurs at an observation point, as proved in Appendix A.1.
Claim 1 If S = {X(1), ..., X(n)} is the set of ordered sample observations, then

A) $$\sup_{t \in (0,1)} X^2(t) = \sup_{t \in (0,1)} \frac{n(F_n(t) - t)^2}{t(1-t)} = n \max_{1 \le i \le n} \left\{ \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})},\; \frac{((i-1)/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})} \right\};$$

B) $$\hat{t} = \inf\left\{\operatorname{Arg}\sup_{t \in (0,1)} X^2(t)\right\} \in S.$$

In other words, the 2DDCS statistic X²(t̂) is the maximum of all right-hand and left-hand limits of X²(t) at the observation points. It is well known that the two-sided KS statistic is

$$D_n = \sup_{t \in [0,1]} |F_n(t) - t| = \max_{1 \le i \le n} \left\{ \frac{i}{n} - X_{(i)},\; X_{(i)} - \frac{i-1}{n} \right\}.$$

On the other hand, the one-sided KS statistic Dn+ is the maximum of all right-hand limits of Fn(t) − t at the observation points, and

$$D_n^+ = \sup_{t \in [0,1]} \{F_n(t) - t\} = \max_{1 \le i \le n} \left\{\frac{i}{n} - X_{(i)}\right\} \quad \text{(one-sided KS)}.$$
Similarly, we define X²(t̂) as the two-sided 2DDCS statistic,

$$X^2(\hat{t}\,) = n \max_{1 \le i \le n} \left\{ \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})},\; \frac{((i-1)/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})} \right\}. \tag{2.5}$$

We also define the maximum of all right-hand limits as the one-sided 2DDCS statistic, that is,

$$X^2(\hat{t}^+) = n \max_{1 \le i \le n} \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})}. \tag{2.6}$$

When the sample size is very small, a minimum cell length is not required because the probability that we have observations very close to the two ends is small. However, this probability increases as the sample size gets larger, and so a minimum cell length ε is necessary if the sample size is not very small. Based on the order statistics {X(1), ..., X(n)}, we rewrite the 2DDCS test statistics with respect to the ordered observations. The one-sided 2DDCS statistic with a minimum cell length ε is defined as

$$X^2(\hat{t}^+_\varepsilon) = n \max_{[n\varepsilon] \le i \le n - [n\varepsilon]} \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})},$$

where [nε] is the largest integer smaller than nε. The two-sided 2DDCS statistic (2.4) is

$$X^2(\hat{t}_\varepsilon) = n \max_{[n\varepsilon] \le i \le n - [n\varepsilon]} \left\{ \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})},\; \frac{((i-1)/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})} \right\}.$$

[Figure 2.2: Pearson's 2CS Statistic as a Function of t]

Notice that

1. The value of the one-sided 2DDCS statistic X²(t̂ε+) ≤ the value of the two-sided X²(t̂ε) for any given sample. When they are equal, the corresponding first cell includes the cutpoint t̂; otherwise it does not include t̂ and is right-open.
2. The null distributions of the one-sided and two-sided 2DDCS statistics are no longer approximately chi square under H0. To see the empirical null distribution of a 2DDCS statistic such as (2.4), we take 1000 random samples from uniform(0, 1), say X¹, ..., X¹⁰⁰⁰. Let t̂i and X²(t̂i) be the ith pair of cutpoint and test-statistic value based on the ith given sample, i = 1, 2, ..., 1000. Then the 1000 cutpoints t̂1, ..., t̂1000 are typically different from each other. The density histogram of the simulated values {X²(t̂1), ..., X²(t̂1000)} is quite different from the χ²₁ density; it is actually shifted to the right of χ²₁. The details about the cutpoints and the tests are given in Chapter 3 and Chapter 4, respectively.

3. When the alternative is simple (fully specified) with CDF F1, the best cutpoint for a Pearson 2CS test is t* as defined in (2.3) with F = F1. But notice that a chi square test may not be used in this case, because with the alternative known the most powerful test is the Neyman-Pearson test.

4. The minimum cell length ε should depend on the sample size; a larger sample size allows a smaller ε. Moreover, the closer the value of ε is to t*, the higher the power the test will achieve.

5. When a random sample is taken from a specific distribution, the actual power of the 2DDCS tests may not be as high as that of the "best luck" Pearson 2CS test using the cutpoint t* (2.3).
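The order-statistic expressions above make the tests straightforward to compute; here is a minimal sketch (the function name is ours; note the factor n coming from the 2CS statistic):

```python
import numpy as np

def two_ddcs(sample, eps=0.0, one_sided=False):
    """2DDCS statistics (2.5)/(2.6) with minimum cell length eps:
    n times the max over the order statistics of the left/right-hand
    limits of the 2CS statistic, restricted to [n*eps] <= i <= n - [n*eps]."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    i = np.arange(1, n + 1)
    denom = x * (1 - x)
    right = n * (i / n - x) ** 2 / denom          # right-hand limits at X_(i)
    left = n * ((i - 1) / n - x) ** 2 / denom     # left-hand limits at X_(i)
    vals = right if one_sided else np.maximum(left, right)
    lo = max(int(n * eps), 1)                     # restrict away from the two ends
    keep = slice(lo - 1, n - int(n * eps))
    j = int(np.argmax(vals[keep]))                # first maximizing observation
    return vals[keep][j], x[keep][j]              # statistic and cutpoint t-hat

rng = np.random.default_rng(1)
stats = [two_ddcs(rng.uniform(size=100), eps=0.05)[0] for _ in range(1000)]
print(np.percentile(stats, 95))   # simulated null critical value at alpha = .05
```

Histogramming stats against the χ²₁ density reproduces the rightward shift described in item 2.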
2.5 Circular 2DDCS Statistics
Both the one-sided and two-sided 2DDCS statistics are applied to data on a line. However, circular data such as wind directions and directions of migrating birds often arise in real life, and neither of the 2DDCS statistics X²(t̂ε+) and X²(t̂ε) is appropriate for testing circular uniformity, since they are more sensitive at the two ends and so depend on the choice of the starting point. Therefore, we propose the wrap-around or circular 2DDCS statistic. Here the two end points of the unit line are wrapped around such that it becomes a circle with unit circumference. Then we cut the circle into two semicircles to form the two cells for a chi square type statistic. This is equivalent to taking a segment of the unit line as the first cell and wrapping up the remaining piece(s) as the second cell. The resulting circular statistic is called the wrap-around chi square statistic or circular two-cell chi square statistic (circular 2CS statistic). Such a test can be used to test uniformity both on a line and on a circle. For simplicity, we first study the circular 2DDCS statistics with both cell lengths fixed to be .5. Generalized wrap-around 2DDCS statistics are defined and discussed in Chapter 5. More details are given by Qian et al. (2009)[26].
Now we are testing the null hypothesis that the n points randomly located on the circumference of the circle are uniformly distributed. We define the circular two-cell chi square (2CS) statistic

$$X^2(t_c) = \frac{\left[nF_n(t) - nF_n(t-.5) - .5n\right]^2}{.5n} + \frac{\left[nF_n(t-.5) + n - nF_n(t) - .5n\right]^2}{.5n},$$

where t = t_c mod 1 and t ∈ [.5, 1] is the right endpoint of the middle segment (t − .5, t] on the unwrapped line. The circular 2CS statistic X²(t_c) can be simplified, and we have

$$X^2(t_c) = \frac{n\left(F_n(t) - F_n(t-.5) - .5\right)^2}{.5(1-.5)} = 4n\left[F_n(t) - F_n(t-.5) - .5\right]^2.$$
To make sure that the test will detect the deviation of the sample points from uniformity for a broad class of alternatives, a circular 2DDCS test statistic X²(t̂_c) is defined to be the supremum of all possible circular 2CS statistics:

$$X^2(\hat{t}_c) = \sup_{t \in [.5,\,1]} 4n\left[F_n(t) - F_n(t-.5) - .5\right]^2. \tag{2.7}$$

Here t is in the interval [.5, 1], instead of [0, 1], because the total number of observations and the frequency of one of the cells are the only information required to calculate each statistic X²(t_c).
Due to computational limitations, the set of all possible selections of the right cutpoint t̂_c has to be finite. Just as for the 2DDCS statistics on a line, there are two possible ways to define the set. One option is to use a set of equally spaced points between 0 and 1, such as {1/Kn, 2/Kn, ..., (Kn − 1)/Kn}. Another option is to choose t̂_c from the set of all observations and their diametrically opposite points Sp = {X1, ..., Xn, X1 + .5, ..., Xn + .5}. The drawback of the first option is that Kn should depend on the sample size n, and the value of the resulting X²(t̂) might vary with the choice of Kn. The second option is better if it is true that the supremum of X²(t_c) occurs only at the points in S′_c. It can be shown that the value of the circular 2DDCS statistic X²(t_c) changes only at an observation point t = Xi if Xi ≥ .5, or at its diametrically opposite point t = Xi + .5 if Xi < .5.

Claim 2 Let

$$X_i' = \begin{cases} X_i, & \text{if } X_i \ge .5, \\ X_i + .5, & \text{otherwise,} \end{cases} \qquad S_c' = \{X_1', X_2', \ldots, X_n'\};$$

then

$$X^2(\hat{t}_c) = \max_{1 \le i \le n} \left\{ \lim_{t \to (X_i')^-} X^2(t_c),\; \lim_{t \to (X_i')^+} X^2(t_c) \right\}.$$
Proof. We first define a new set of points S′_c = {X′1, X′2, ..., X′n} with

$$X_i' = \begin{cases} X_i, & \text{if } X_i \ge .5, \\ X_i + .5, & \text{otherwise.} \end{cases}$$

We order the set of points S′_c from smallest to largest and let X′(1), ..., X′(n) be the ordered points. Then the circular 2CS function

$$X^2(t_c) = 4n\left[F_n(t) - F_n(t-.5) - .5\right]^2$$

is constant on each of the intervals (.5, X′(1)), (X′(1), X′(2)), ..., (X′(n−1), X′(n)), and the value of X²(t_c) changes only at the points in S′_c. Thus we have

$$X^2(\hat{t}_c) = \sup_{t \in [.5,\,1]} X^2(t_c) = \max_{1 \le i \le n} \left\{ \lim_{t \to (X_i')^-} X^2(t_c),\; \lim_{t \to (X_i')^+} X^2(t_c) \right\}.$$
Therefore the circular 2DDCS statistic can be expressed with respect to the sample observations as well. Furthermore, if we let S(t) denote the number of observations covered by the semicircle (t − .5, t], then X²(t_c) is equal to 4n[S(t)/n − .5]², or equivalently (4/n)[S(t) − .5n]², for t ≥ .5. With respect to S(t), the parabola opens up and the axis of symmetry is the vertical line S(t) = .5n. Thus the supremum of X²(t_c) occurs at the supremum or infimum of S(t), and

$$X^2(S(t)) = X^2(n - S(t)).$$

Therefore we find X²(t̂_c), the supremum of X²(t_c), by getting the supremum and infimum of S(t). That is,

$$X^2(\hat{t}_c) = \max\left\{ \frac{4}{n}\left[\sup_{t \in [.5,1]} S(t) - .5n\right]^2,\; \frac{4}{n}\left[\inf_{t \in [.5,1]} S(t) - .5n\right]^2 \right\},$$

and by symmetry we have

$$X^2(\hat{t}_c) = \frac{4}{n}\left[\max\left\{\sup_{t \in [.5,1]} S(t),\; n - \inf_{t \in [.5,1]} S(t)\right\} - .5n\right]^2.$$
The question is how to find the supremum and infimum of S(t). By the definition of S(t), the supremum and infimum are, respectively, the maximum and minimum of all left-hand and right-hand limits of S(t) at each point in the set S′_c. That is,

$$\sup_{t \in [.5,1]} S(t) = \max_{X_i' \in S_c'} \left\{ \lim_{t \to (X_i')^-} S(t),\; \lim_{t \to (X_i')^+} S(t) \right\}$$

and

$$\inf_{t \in [.5,1]} S(t) = \min_{X_i' \in S_c'} \left\{ \lim_{t \to (X_i')^-} S(t),\; \lim_{t \to (X_i')^+} S(t) \right\}.$$
For any Xi < .5, the left-hand limit of S(t) at X′i equals the number of observations covered by the semicircle [Xi, Xi + .5], and its right-hand limit is the number of observations covered by the semicircle (Xi, Xi + .5]. The maximum of these two is the left-hand limit, since it includes the observation point Xi, and the minimum is the right-hand limit. On the other hand, when Xi ≥ .5, the maximum and minimum of the two limits are, respectively, the right-hand limit and the left-hand limit. It does not matter whether a diametrically opposite point of an observation is included in the semicircle or not, because the probability that it is also an observation point is 0. Thus, if we let Mi denote the maximum of the left-hand and right-hand limits of S(t) at X′i ∈ S′_c (for Xi < .5 this is the number of observations covered by the semicircle [Xi, Xi + .5]), then

$$\max\left\{\sup_{t \in [.5,1]} S(t),\; n - \inf_{t \in [.5,1]} S(t)\right\} = \max_{1 \le i \le n} \{M_i,\; n - (M_i - 1)\}.$$

Let Ni denote the maximum number of observations covered by one of the semicircles with endpoints Xi and its opposite X*i = (Xi + .5) mod 1; then

$$N_i = \max\{M_i,\; n - (M_i - 1)\} = \max\{M_i,\; n - M_i + 1\}, \tag{2.9}$$

thus we get the following claim:
Claim 3 The circular 2DDCS statistic can be written in terms of Ni:

$$X^2(\hat{t}_c) = \frac{4}{n}\left(\max_{1 \le i \le n} \{N_i\} - .5n\right)^2. \tag{2.10}$$
Given the order statistics X(1), X(2), ..., X(n), the above claim provides a method to calculate the value of the circular 2DDCS statistic (2.10) and find the cutpoint t̂_c. Both the one-sided and two-sided 2DDCS statistics require a minimum cell length ε when the sample size is not too small; the distributions of the test statistics depend on the value of ε, and so will the powers of the 2DDCS tests on a line. One of the advantages of the circular 2DDCS statistic is that the two cells now have equal length .5, and so we do not have to worry about the minimum cell length ε. Notice that on the wrapped circle, the point t = 1 is the same as the point t = 0.
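A direct way to compute the circular statistic is through Claim 3; the following sketch (an O(n²) illustration of ours, not the dissertation's code) counts, for each observation, the semicircle it starts:

```python
import numpy as np

def circular_2ddcs(sample):
    """Circular 2DDCS statistic (2.10): N_i is the larger number of
    observations covered by one of the two semicircles with endpoints
    X_i and its opposite (X_i + .5) mod 1."""
    x = np.asarray(sample)
    n = len(x)
    n_max = 0
    for xi in x:
        d = (x - xi) % 1.0
        m = int(np.count_nonzero(d <= 0.5))   # obs on the closed semicircle [x_i, x_i + .5]
        n_max = max(n_max, m, n - m + 1)      # N_i = max{M_i, n - M_i + 1}, as in (2.9)
    return (4.0 / n) * (n_max - 0.5 * n) ** 2

rng = np.random.default_rng(2)
print(circular_2ddcs(rng.uniform(size=76)))   # e.g. directions rescaled to [0, 1)
```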
Chapter 3
Null Distribution of the Cutpoints
For the 2DDCS statistics on a line, we define the "cutpoint" to be the first point t which maximizes the corresponding chi square value. It is well defined because the probability that there is more than one maximizing point is 0 for both one-sided and two-sided 2DDCS statistics. However, for a circular 2DDCS statistic, the maximizing point is not unique. Therefore, we may define the "cutpoint" to be a randomly selected point from the set of all maximizing points. We expect a good test to be able to discriminate the difference from the alternative equally likely at each point in the domain; therefore we are looking for a test for which the null distribution of the cutpoint is exactly or approximately uniform. By checking the null distribution of the cutpoint, we know whether the corresponding test is fair or not and how to make it better. The information can also be used to decide what the minimum cell length should be for a 2DDCS test. For this reason, all the tests in this chapter start without a minimum cell length, that is, ε = 0. Let X(1), X(2), ..., X(n) be the order statistics of a random sample from uniform(0, 1) and Fn be the ECDF of the sample as defined previously. The null distributions of the maximizing points of different types of KS and 2DDCS statistics are discussed in separate sections.
3.1 Cutpoint of One-Sided KS

The right-sided KS statistic is defined as

$$D_n^+ = \sup_{0 < t < 1} \left(F_n(t) - t\right),$$

and Vn+ denotes its maximizing point.

Theorem 1 For any finite n, Vn+ is uniformly distributed over (0, 1).

For any ε > 0 and δ > 0, taking r large enough, we have

$$P\left(\left|V_n^+ - \frac{L_r}{r+1}\right| \ge \varepsilon\right) < \delta.$$

Thus Lr/(r + 1) converges to Vn+ in probability as r → ∞, and hence in distribution as well.
Moreover, Dwass (1958)[8] proves that Lr/(r + 1) is asymptotically uniformly distributed over (0, 1) as r → ∞ by applying Andersen (1953)[2]'s result. Therefore Vn+ is also uniformly distributed over the unit interval. A frequency histogram for the cutpoint Vn+, generated by Monte Carlo simulation with 5000 replications and sample size 10, is shown in Figure 3.1; it is approximately uniform, confirming Theorem 1.
As is well known and shown above, the maximizing point Vn+ corresponding to the right-sided KS statistic is uniformly distributed over (0, 1) for any finite sample size. What, then, about the null distribution of the maximizing point Vn− of the left-sided KS statistic? Intuitively, we may guess that Vn− has the same distribution as Vn+ by the symmetry of a uniform random sample on [0, 1]. However, not much in the literature has been done on the null distribution of either Vn− or Vn, the maximizing point of the two-sided KS statistic. This section and the following section focus on the null distributions of Vn− and Vn.
[Figure 3.1: Frequency Histogram of V10+, k = 5000]

Here, similarly, we let

$$D_n^- = \sup_{t \in [0,1]} \left(t - F_n(t)\right)$$
be the left-sided KS statistic and let
$$V_n^- = \inf_{t \in [0,1]} \left\{t : (t - F_n(t)) = D_n^-\right\},$$

the maximizing point of the left-sided KS statistic. We prove that Vn− is also uniformly distributed over (0, 1).
Given the order statistics X(1), X(2), ..., X(n), the ECDF Fn(t) is right continuous but not left continuous. The function Fn(t) − t is strictly decreasing in each interval [X(i), X(i+1)), i = 1, 2, ..., n − 1, and the supremum of Fn(t) − t is realized at the left endpoint of some interval [X(i), X(i+1)). Hence the right-sided KS statistic Dn+ is the supremum and maximum of Fn(t) − t. On the other hand, the function t − Fn(t) is strictly increasing in each interval [X(i), X(i+1)), and so the supremum of t − Fn(t) is achieved at the right endpoint of some interval. Therefore, the left-sided KS statistic Dn− is the supremum but not the maximum of t − Fn(t). As is well known, we have

$$D_n^+ = \max_{1 \le i \le n} \lim_{t \to X_{(i)}^+} \left[F_n(t) - t\right] = \max_{1 \le i \le n} \left\{\frac{i}{n} - X_{(i)}\right\}$$

and

$$D_n^- = \max_{1 \le i \le n} \lim_{t \to X_{(i)}^-} \left[t - F_n(t)\right] = \max_{1 \le i \le n} \left\{X_{(i)} - \frac{i-1}{n}\right\}.$$
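These two maxima are one line each in code; a small sketch of ours returns the statistics together with their maximizing points:

```python
import numpy as np

def ks_cutpoints(sample):
    """D_n^+ with V_n^+ and D_n^- with V_n^-, via the order statistics."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    i = np.arange(1, n + 1)
    kplus = i / n - x               # right-hand limits of F_n(t) - t
    kminus = x - (i - 1) / n        # left-hand limits of t - F_n(t)
    jp, jm = int(np.argmax(kplus)), int(np.argmax(kminus))
    return (kplus[jp], x[jp]), (kminus[jm], x[jm])

rng = np.random.default_rng(3)
vps = [ks_cutpoints(rng.uniform(size=10))[0][1] for _ in range(5000)]
print(np.histogram(vps, bins=10, range=(0, 1))[0])   # roughly flat, as in Figure 3.1
```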
Lemma 1 For any n, the random variable Vn− has the same distribution as 1 − Vn+, and so Vn− is also uniformly distributed over (0, 1).

Proof. Given the order statistics X(1), ..., X(n) of a random sample, let Ki+ = i/n − X(i) and Ki− = X(i) − (i − 1)/n. Since X(i) and 1 − X(n−i+1) have the same distribution, the same is true for i/n − X(i) and i/n − 1 + X(n−i+1) (= X(n−i+1) − (n − i)/n). That is, Ki+ has the same distribution as K−(n−i+1), i = 1, 2, ..., n. Let Dn+ = maxᵢ Ki+, Dn− = maxᵢ K−(n−i+1), Vn+ = the first X(i) such that Ki+ = Dn+, and Vn− = the first X(n−i+1) such that K−(n−i+1) = Dn−.

Moreover, not only are the marginal distributions the same for each pair Ki+ and K−(n−i+1), but the joint distribution of the vector (Ki+), i = 1, ..., n, is the same as the joint distribution of the vector (K−(n−i+1)), i = 1, ..., n. By a linear transformation of the joint distribution of (X(1), ..., X(n)), we have

$$f_{K_1^+, \ldots, K_n^+}(d_1, d_2, \ldots, d_n) = n!, \quad \text{where } 0 \le \tfrac{1}{n} - d_1 \le \tfrac{2}{n} - d_2 \le \cdots \le 1 - d_n \le 1,$$

and

$$g_{K_n^-, \ldots, K_1^-}(l_1, l_2, \ldots, l_n) = g_{X_{(n)} - \frac{n-1}{n},\, X_{(n-1)} - \frac{n-2}{n},\, \ldots,\, X_{(1)}}(l_1, l_2, \ldots, l_n) = n!,$$

where

$$0 \le l_n + 1 - \tfrac{n}{n} \le \cdots \le l_2 + 1 - \tfrac{2}{n} \le l_1 + 1 - \tfrac{1}{n} \le 1,$$

which is equivalent to

$$0 \le \tfrac{1}{n} - l_1 \le \tfrac{2}{n} - l_2 \le \cdots \le 1 - l_n \le 1.$$

Therefore Dn+ = max Ki+ has the same distribution as max K−(n−i+1) = max Kj− = Dn−. Since X(i) and 1 − X(n−i+1) have the same distribution, Vn− has the same distribution as 1 − Vn+. Moreover, Vn+ is uniformly distributed over (0, 1), so Vn− is uniformly distributed over (0, 1) as well. That is,

$$P(V_n^- \le t) = P(1 - V_n^+ \le t) = P(V_n^+ \ge 1 - t) = 1 - (1 - t) = t, \quad \text{for any } t \in (0, 1).$$
A frequency histogram for the cutpoint, simulated with 5000 replications and sample size 10, is approximately uniform(0, 1) and so consistent with Lemma 1.
3.2 Cutpoint of Two-Sided KS

Let V_n be the maximizing point corresponding to the two-sided KS statistic, that is,

$$V_n = \inf_{0 \le t \le 1} \{t : |F_n(t) - t| = D_n\};$$

moreover, we may notice that D_n = max{D_n^-, D_n^+} and

$$V_n = \begin{cases} V_n^+, & \text{if } D_n^+ \ge D_n^- \\ V_n^-, & \text{if } D_n^- > D_n^+ \end{cases}.$$
Although both V_n^- and V_n^+ are uniformly distributed over (0, 1) when the sample is taken from uniform(0, 1), the same is not true for the two-sided maximizing point V_n whenever the sample size is larger than 1. V_n occurs relatively more frequently at the median than at the tails when the sample size is larger than 2.
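Extending the sketch above to the two-sided maximizing point makes this non-uniformity visible (again a hedged illustration assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 10, 5000
    v = np.empty(reps)
    for r in range(reps):
        x = np.sort(rng.uniform(size=n))
        k_plus = np.arange(1, n + 1) / n - x   # K_i^+ = i/n - X_(i)
        k_minus = x - np.arange(n) / n         # K_i^- = X_(i) - (i-1)/n
        # V_n = V_n^+ if D_n^+ >= D_n^-, otherwise V_n^-
        if k_plus.max() >= k_minus.max():
            v[r] = x[np.argmax(k_plus)]
        else:
            v[r] = x[np.argmax(k_minus)]
    # The middle bins collect noticeably more counts than the end bins.
    print(np.histogram(v, bins=10, range=(0, 1))[0])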
Claim 4. If the sample size is 1, the maximizing point V_n is the observation itself, and so it is uniformly distributed over (0, 1).

When the sample size is 2, the conclusion is not as obvious, but it can be shown that the density function of V_2 is a shoulder-lowered "W"-shaped curve, with the highest probability around the center, intermediate probability around the two tails, and the lowest probability around the first and third quartiles. Therefore, the two-sided KS test with sample size 2 is more sensitive in the middle and less so at the two sides.
Claim 5. The random variable V_2 is NOT uniformly distributed over (0, 1).

Proof. Let X(1) and X(2) be the ordered observations from smallest to largest. Then

$$D_2 = \max\left\{\tfrac{1}{2} - X_{(1)},\; X_{(1)},\; 1 - X_{(2)},\; X_{(2)} - \tfrac{1}{2}\right\},$$

and V_2 = X(1) if 1/2 - X(1) = D_2 or X(1) = D_2; otherwise V_2 = X(2). Thus the probability P(V_2 ≤ t) is the sum of the probabilities

$$P\left(X_{(1)} \le t,\; D_2 = \tfrac{1}{2} - X_{(1)} \text{ or } X_{(1)}\right) \quad \text{and} \quad P\left(X_{(2)} \le t,\; D_2 = 1 - X_{(2)} \text{ or } X_{(2)} - \tfrac{1}{2}\right).$$

To determine this probability, we need to separate the domain of X(1) and X(2) into four regions, where D_2 = 1/2 - X(1), X(1), 1 - X(2), or X(2) - 1/2 respectively, and the probability has to be discussed in four cases:
$$P(V_2 \le t) = \begin{cases} t - 2t^2 + t^2 = t - t^2, & \text{if } t \le \frac{1}{4} \\ 3t^2 - t + \frac{1}{2}, & \text{if } \frac{1}{4} < t \le \ldots \end{cases}$$

[…]

3.3 Cutpoint of One-Sided 2DDCS

[…]

$$X^2(\hat{t}^+) = n \max_{1 \le i \le n} \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})}.$$
To see the null distribution of the smallest t at which X²(t̂+) occurs, we run a simulation or carry out a theoretical analysis of the null distribution of t̂+ given the ordered sample X(1), …, X(n) from the uniform(0, 1) distribution. We start the analysis with the simplest cases, when the sample size is only 1 or 2. In the trivial case when the sample size n is 1, the optimal cutpoint t̂+ has to be the only observation point X_1, and so it is uniformly distributed over (0, 1). However, when n = 2, the optimal cutpoint t̂+ can be either of the two observation points X(1) and X(2). Since the cutpoint is always included in the first cell in this case, we might expect the smaller one, X(1), to be more likely to be the cutpoint than the larger one, X(2); a simple calculation, however, shows that the opposite holds, as follows.
$$P(\hat{t}^+ = X_{(1)}) = P\left(\frac{(X_{(1)} - \tfrac{1}{2})^2}{X_{(1)}(1 - X_{(1)})} \ge \frac{(1 - X_{(2)})^2}{X_{(2)}(1 - X_{(2)})}\right) = P\left(X_{(2)} \ge 4X_{(1)} - 4X_{(1)}^2\right) = \frac{7}{16}.$$

Thus we have

t̂+            X(1)    X(2)
probability    7/16    9/16

The maximum is more likely to be the cutpoint than the minimum.
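A Monte Carlo check of this probability (a sketch assuming NumPy; the 2CS value is evaluated at each order statistic as in the definition above, and the common factor n is dropped since it does not affect the argmax):

    import numpy as np

    rng = np.random.default_rng(0)
    reps, n = 200_000, 2
    x = np.sort(rng.uniform(size=(reps, n)), axis=1)
    i = np.arange(1, n + 1) / n                    # i/n for i = 1, 2
    stat = (i - x) ** 2 / (x * (1 - x))            # 2CS value at each order statistic
    frac = np.mean(np.argmax(stat, axis=1) == 0)   # how often X_(1) is the cutpoint
    print(frac)  # close to 7/16 = 0.4375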
To get the distribution of t̂+, we need to find the probability F_{t̂+}(t) = P(t̂+ ≤ t), which is the sum of the probabilities

$$P\bigl(X_{(1)} \le t,\; \hat{t}^+ = X_{(1)}\bigr) \quad \text{and} \quad P\bigl(X_{(2)} \le t,\; \hat{t}^+ = X_{(2)}\bigr).$$
That is, it is the sum of the probability that the smaller observation is at most t when X(2) is greater than or equal to 4X(1) - 4X(1)², and the probability that the larger observation is at most t otherwise. Therefore

$$F_{\hat{t}^+}(t) = P\left(X_{(1)} \le t,\; X_{(2)} \ge 4X_{(1)} - 4X_{(1)}^2\right) + P\left(X_{(2)} \le t,\; X_{(2)} < 4X_{(1)} - 4X_{(1)}^2\right).$$

(a) When t ≤ 3/4, the CDF of t̂+ is the integral

$$F_{\hat{t}^+}(t) = 2\left(\int_0^t \bigl(1 - 4x + 4x^2\bigr)\, dx + \int_0^t \left(y - \frac{1 - \sqrt{1 - y}}{2}\right) dy\right).$$

Thus the density for t ≤ 3/4 is the derivative of the above integral, and so

$$f_{\hat{t}^+}(t) = 2\left(1 - 4t + 4t^2 + t - \frac{1 - \sqrt{1 - t}}{2}\right).$$

After combining like terms, we get

$$f_{\hat{t}^+}(t) = 8t^2 + \sqrt{1 - t} - 6t + 1, \quad \text{when } 0 \le t \le \tfrac{3}{4}.$$
(b) When t ∈ (3/4, 1], the probability that the cutpoint is between 3/4 and t is

$$P(\hat{t}^+ \le t) - P(\hat{t}^+ \le \tfrac{3}{4}) = \int_{3/4}^t 2(1 - y)\, dy + \int_{3/4}^t 2\sqrt{1 - y}\, dy = \tfrac{1}{16} - (1 - t)^2 + 2\int_{3/4}^t \sqrt{1 - y}\, dy.$$

Therefore, the density function of the cutpoint here is

$$f_{\hat{t}^+}(t) = 2 - 2t + 2\sqrt{1 - t}, \quad \text{for } \tfrac{3}{4} < t \le 1.$$
Thus the density function of t̂+ when n = 2 is

$$f_{\hat{t}^+}(t) = \begin{cases} 8t^2 + \sqrt{1 - t} - 6t + 1, & 0 \le t \le \frac{3}{4} \\ 2 - 2t + 2\sqrt{1 - t}, & \frac{3}{4} < t \le 1. \end{cases}$$

[Figure 3.4: Density of t̂+, ε = 0, n = 2]

[…]

3.4 Cutpoint of Two-Sided 2DDCS

[…] the two-sided 2DDCS statistic X²(t̂) is the supremum of all 2CS statistics with respect to t on the
interval (ε, 1 − ε). When the sample size is really small, the minimum cell length ε is not
required. In this section, we study the null distribution of the cutpoint t̂ for the two-sided 2DDCS statistic X²(t̂). First we find the exact null distribution of t̂ when there are only
two observations. Then simulations with different sample sizes are given and analyzed to
see what value of ε is appropriate for each typical sample size.
Let X(1), …, X(n) be the order statistics taken from uniform(0, 1). As with the one-sided 2DDCS cutpoint t̂+, in the simplest case when the sample size is only 1, the cutpoint t̂ is just the observation X_1 itself, and so t̂ is uniformly distributed on (0, 1). When the sample size is 2, t̂ takes the value X(1) or X(2) equally likely because of symmetry.
Furthermore, we may expect that the probability of X²(t̂) = X²(t̂+) is .5, that is, that the two-sided 2DDCS cutpoint is equally likely to be included in the left cell (the 1st cell) or the right cell.

[Figure 3.5: Frequency Histogram of t̂+ with ε = .05 (n = 50, 5000 replications)]

To distinguish the left-closed and right-closed cutpoints, we let
$$X_{(i)}^- = \lim_{x \to 0^-} \bigl(X_{(i)} + x\bigr) \quad \text{and} \quad X_{(i)}^+ = \lim_{x \to 0^+} \bigl(X_{(i)} + x\bigr), \quad i = 1, 2;$$

then the cutpoint t̂ ∈ {X(1)^-, X(1)^+, X(2)^-, X(2)^+}. With careful calculation of the probabilities, we have

t̂             X(1)^-   X(1)^+   X(2)^-   X(2)^+
probability    13/48    11/48    11/48    13/48
Therefore, the probability that the cutpoint is X(1) is the same as the probability that it is X(2). Moreover, the statistic X²(t̂) is equally likely to be a maximum (attained) or only a supremum. In other words, the probability that the first cell is right-closed is equal to the probability that the first cell is right-open.
To find the distribution of t̂ when the sample size n is 2, we compute the cumulative probability

$$P(\hat{t} \le t) = P\bigl(X_{(1)} \le t,\; \hat{t} = X_{(1)}\bigr) + P\bigl(X_{(2)} \le t,\; \hat{t} = X_{(2)}\bigr),$$

which can be computed over four separate regions of t. The corresponding density function in each case is given according to these four regions, as shown in Figure 3.6.
(a) In the first region, when t ≤ 1/4, the CDF P(t̂ ≤ t) is the sum of the two double integrals

$$\int_0^t \int_{4u - 4u^2}^{1 - u} 2\, dv\, du + \int_0^t \int_{\frac{1 - \sqrt{1 - v}}{2}}^{v} 2\, du\, dv.$$

Integrating both of them and simplifying the sum, we get the CDF of the data-driven cutpoint t̂
[Figure 3.6: Regions for X(1) (y) and X(2) (x)]

$$F_{\hat{t}}(t) = t - 4t^2 + \tfrac{8}{3}t^3 - \tfrac{2}{3}\sqrt{1 - t} + \tfrac{2}{3}t\sqrt{1 - t} + \tfrac{2}{3}.$$

Then, the density of the data-driven cutpoint t̂ in this case is the derivative of the CDF, and so

$$f_{\hat{t}}(t) = 8t^2 - 8t + \sqrt{1 - t} + 1;$$
(b) in the second region, when t ∈ (1/4, 1/2], we similarly get the density of the data-driven cutpoint t̂:

$$f_{\hat{t}}(t) = \sqrt{t} + 4t + \sqrt{1 - t} - 2;$$

(c) in the next region, when t ∈ (1/2, 3/4], the corresponding density of t̂ is

$$f_{\hat{t}}(t) = \sqrt{t} - 4t + \sqrt{1 - t} + 2;$$

(d) in the last region, when t ∈ (3/4, 1], the density is

$$f_{\hat{t}}(t) = 2(1 - 2t)^2 + \sqrt{t} - 1.$$

Thus, the density curve of t̂ when the sample size is 2 is a "W"-shaped curve, as shown in Figure 3.7 (left). A simulated frequency histogram is also given in Figure 3.7 (right). We see that the simulated curve is consistent with the calculated density.
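As a numerical sanity check (a sketch assuming NumPy and SciPy, not the author's code), the four pieces integrate to 1 and so form a valid density:

    import numpy as np
    from scipy.integrate import quad

    def f(t):
        """Piecewise density of the two-sided cutpoint t-hat when n = 2."""
        if t <= 0.25:
            return 8 * t**2 - 8 * t + np.sqrt(1 - t) + 1
        if t <= 0.5:
            return np.sqrt(t) + 4 * t + np.sqrt(1 - t) - 2
        if t <= 0.75:
            return np.sqrt(t) - 4 * t + np.sqrt(1 - t) + 2
        return 2 * (1 - 2 * t) ** 2 + np.sqrt(t) - 1

    pieces = [(0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1)]
    print(sum(quad(f, a, b)[0] for a, b in pieces))  # ~1.0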
When the sample size n is 3, the cutpoint t̂ will be one of {X(1)^-, X(1)^+, X(2)^-, X(2)^+, X(3)^-, X(3)^+}. The probability of each can be summarized as

t̂             X(1)^-    X(1)^+    X(2)^-    X(2)^+    X(3)^-    X(3)^+
probability    0.21094   0.15356   0.13550   0.13550   0.15356   0.21094
[Figure 3.7: Null Distribution of t̂ When n = 2]
With respect to the observations, the probabilities are

t̂             X(1)     X(2)     X(3)
probability    0.3645   0.2710   0.3645
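These observation-level probabilities can be reproduced by simulation (a sketch assuming NumPy; at each order statistic the 2CS value is taken as the larger of its left and right limits, and the common n scaling is omitted since it does not change the argmax):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 3, 100_000
    counts = np.zeros(n, dtype=int)
    i = np.arange(1, n + 1)
    for _ in range(reps):
        x = np.sort(rng.uniform(size=n))
        right = (i / n - x) ** 2 / (x * (1 - x))        # limit from the right at X_(i)
        left = ((i - 1) / n - x) ** 2 / (x * (1 - x))   # limit from the left at X_(i)
        counts[np.argmax(np.maximum(left, right))] += 1
    print(counts / reps)  # roughly (0.3645, 0.2710, 0.3645)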
The cutpoint is thus more likely to be at the two tails than at the center. The calculation of the distribution of t̂ when the sample size is n = 3 is much more complicated, so it is skipped here; only the simulated frequency histogram of the cutpoint t̂ with 5000 replications is given in Figure 3.8.

[Figure 3.8: Frequency Histogram of t̂ with ε = 0 (n = 3, 5000 replications)]
By looking at the frequency histograms of the cutpoint t̂ for different sample sizes (small, medium, and large) and numbers of subintervals 20, 50, 100, 200, 400, and 800, respectively, we see that a significant portion of the frequency accumulates at the two ends and the curve is concave up. Thus the 2DDCS test is less sensitive to the center than to the two ends when the sample size is larger than about 10. The problem is how much we
should take off from the ends. We may choose different minimum cell lengths for typical
sample sizes. According to our simulation results conducted under the null distribution,
the smallest appropriate minimum cell length in each case is listed in the following table.
sample size
small (n […]

3.5 Cutpoint of Circular 2DDCS
Recall the definition of the circular 2DDCS statistic X²(t̂_c) in (2.7), which is similar to the 2DDCS statistics if we write it as

$$X^2(\hat{t}_c) = \sup_{t \in [.5, 1]} \frac{n\,\bigl(F_n(t) - F_n(t - .5) - .5\bigr)^2}{.5\,(1 - .5)}.$$

It can be simplified as

$$X^2(\hat{t}_c) = 4n \sup_{t \in [.5, 1]} \bigl(|F_n(t) - F_n(t - .5) - .5|\bigr)^2.$$
Then we might define the cutpoint t̂_c^* just as the ones on a line, letting it be the first right cutpoint of the middle piece on the circle such that the corresponding circular 2CS statistic is maximized:

$$\hat{t}_c^* = \arg\max_{t \in [.5, 1]} 4n\,[F_n(t) - F_n(t - .5) - .5]^2. \qquad (3.2)$$

However, such a maximizing cutpoint is actually not unique here; instead, it is an interval of points. A simple example is given when n = 1, where any t ∈ [.5, 1] gives the same chi square value

$$X^2(t) = 4n\,[F_n(t) - F_n(t - .5) - .5]^2 = 4(1)(.5)^2 = 1,$$

which is also the value of X²(t̂_c) as long as the sample size is 1. Definition (3.2) therefore does not seem reasonable, and we may look for another definition of the cutpoint with respect to the observations.
Let X_1, X_2, …, X_n be a random sample from uniform(0, 1), and let N_i be the maximum number of observations covered by one of the semicircles with endpoints X_i and X_i^* as defined in Section 2.4. Then the circular 2DDCS statistic X²(t̂_c) can be written in terms of the N_i as in Definition (2.10):

$$X^2(\hat{t}_c) = \frac{4}{n}\left(\max_{1 \le i \le n} N_i - .5n\right)^2.$$

Therefore, we may define the cutpoint in terms of the sample observations:

$$\hat{t}_c = \text{the first } X_j \text{ such that } N_j = \max_{1 \le i \le n} N_i, \qquad (3.3)$$

where max_{1≤i≤n} N_i is the maximum number of observations covered by some semicircle, as is the random variable N defined in Ajne's (1968) paper [1]. Intuitively, a well-defined cutpoint
should be unique given any random sample and the null distribution of the cutpoint for a
circular 2DDCS statistic should be uniformly distributed over the circumference because
the points are uniformly located and the null distributions of the random variables X²(t) and N are both free of the location t. That is, if {X_1, X_2, …, X_n} is the original random sample from uniform(0, 1) and {X_1′, X_2′, …, X_n′} is the shifted set of sample observations such that X_i′ = (X_i + θ) mod 1, i = 1, 2, …, n, and t̂_c, t̂_c′ are the cutpoints in terms of the original and shifted sets of sample observations respectively, then t̂_c′ = (t̂_c + θ) mod 1. Therefore the cutpoint should not be defined as the first order statistic that maximizes X²(t), since the null distribution of such a cutpoint would be skewed. The solution is randomization: the cutpoint can be defined in terms of the random sample observations instead of the ordered ones. That is, we define the circular cutpoint t̂_c to be the first sample observation X_i at which X²(t) is maximized. The word "first" is necessary in the definition because there can be more than one maximizing point. For example, when the sample size is 2, both observation points {X_1, X_2} maximize the circular 2CS statistic. If we defined the smallest maximizing observation point to be t̂_c, then t̂_c would always equal X(1), and the cutpoint would not be uniformly distributed under the null. However, if the first random observation is defined to be the cutpoint, then t̂_c can be either X_1 or X_2, and it is completely random. In other words, a point is randomly selected to be the cutpoint from the set of maximizing points. The simulated density histogram shown in Figure 3.9 verifies the uniformity of the cutpoint t̂_c.
[Figure 3.9: Density Histogram of t̂_c when n = 10, K = 5000]
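The randomized definition is easy to simulate. The sketch below assumes Python with NumPy and a half-open semicircle [X_i, X_i + 1/2) taken modulo 1; the exact semicircle convention of Section 2.4 may differ in boundary details, which occur with probability 0 anyway:

    import numpy as np

    rng = np.random.default_rng(0)

    def circular_cutpoint(x):
        # x is left unsorted: the "first" maximizer is a randomly ordered observation.
        # N_i: points covered by the semicircle starting at x_i (assumed convention).
        counts = np.array([np.sum((x - xi) % 1.0 < 0.5) for xi in x])
        return x[np.argmax(counts)]  # argmax returns the FIRST maximizer

    n, reps = 10, 5000
    cuts = [circular_cutpoint(rng.uniform(size=n)) for _ in range(reps)]
    # Under the null the ten bin counts should all be near 500 (cf. Figure 3.9).
    print(np.histogram(cuts, bins=10, range=(0, 1))[0])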
Chapter 4
Null Distributions of 2DDCS Statistics
Three 2DDCS tests have been proposed, and the null distributions of their cutpoints have been discussed. Can we use the critical values from the chi square table? That is, can we look at the data, choose the best cutpoint, the one which maximizes the discrepancy between the null hypothesis and the data, and carry out a Pearson 2CS test with the level-α critical value taken from a chi square tail probability table? Such a test should be very powerful. However, is the actual probability of the type I error controlled at α? The answer is "No." We look at the two probabilities

$$P\bigl(X^2(\hat{t}_\varepsilon^+) \ge \chi^2_{1,.05} \mid \text{the data is taken from uniform}(0, 1)\bigr)$$

and

$$P\bigl(X^2(\hat{t}_\varepsilon) \ge \chi^2_{1,.05} \mid \text{the data is taken from uniform}(0, 1)\bigr),$$

where χ²_{1,.05} is the upper 5th percentile of the chi square distribution with one degree of freedom. By simulation, both probabilities come out to approximately 0.4142. In other words, the actual α level in each case is more than 40%. Therefore, the critical values for Pearson chi square statistics are no longer appropriate; the corrected values should be larger than those taken from χ²₁, since the chi square values are maximized.
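The inflation is easy to reproduce. The sketch below (assuming NumPy and SciPy, using the usual n-scaled Pearson 2CS value at each order statistic and no minimum cell length, so the figure need not match the 0.4142 reported above for t̂_ε) rejects far more often than the nominal 5%:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n, reps = 50, 5000
    crit = chi2.ppf(0.95, df=1)        # chi-square(1) critical value at level .05
    i = np.arange(1, n + 1) / n
    rejections = 0
    for _ in range(reps):
        x = np.sort(rng.uniform(size=n))
        stat = n * np.max((i - x) ** 2 / (x * (1 - x)))  # maximized 2CS value
        rejections += stat >= crit
    print(rejections / reps)  # well above the nominal 0.05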
In this chapter, we discuss the exact and asymptotic null distributions of the linear and
circular 2DDCS statistics. The statistics of interest in the finite-sample case will have no
minimum cell length, that is, ε = 0. Both simulation results and theoretical analysis are
presented. We use the same notations as those in the previous chapters.
4.1 KS Statistics
When a sample is taken from uniform(0, 1), it is well known that {√n [F_n(t) - t]}_{t∈(0,1)} is asymptotically a Brownian bridge process W_0(t), and so the KS statistic

$$D_n = n \sup_{t \in [0,1]} |F_n(t) - t|$$

is asymptotically the supremum of √n |W_0(t)|. By this fact, the cumulative distribution function (CDF) of D_n is given by

$$P(D_n \le x) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} \exp\bigl(-2 i^2 x^2 / n\bigr),$$

which is sometimes written in the following form:

$$F_{D_n}(x) = \frac{\sqrt{2\pi n}}{x} \sum_{i=1}^{\infty} \exp\left(-\frac{(2i - 1)^2 n \pi^2}{8 x^2}\right).$$

Then the critical values for the two-sided KS test can be derived.
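As a numerical aside, the two series expressions agree, as a short check (a sketch assuming NumPy) shows:

    import numpy as np

    def cdf_alternating(x, n, terms=100):
        i = np.arange(1, terms + 1)
        return 1 - 2 * np.sum((-1.0) ** (i - 1) * np.exp(-2 * i**2 * x**2 / n))

    def cdf_theta(x, n, terms=100):
        i = np.arange(1, terms + 1)
        s = np.sum(np.exp(-((2 * i - 1) ** 2) * n * np.pi**2 / (8 * x**2)))
        return np.sqrt(2 * np.pi * n) / x * s

    x, n = 4.0, 10
    print(cdf_alternating(x, n), cdf_theta(x, n))  # both ~0.918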
4.2 One-Sided 2DDCS Statistics
To get the null distribution of a one-sided 2DDCS statistic

$$X^2(\hat{t}^+) = \max_{1 \le i \le n} \frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})},$$

we need to calculate

$$P\bigl(X^2(\hat{t}^+) \le w^2\bigr) = P\left(\frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})} \le w^2,\ i = 1, 2, \ldots, n\right),$$

where w is a nonnegative constant. When w = 0, we know

$$P\bigl(X^2(\hat{t}^+) \le 0\bigr) = P\bigl(X_{(i)} = i/n,\ i = 1, 2, \ldots, n\bigr) = 0.$$
For nontrivial w² > 0, however, it is harder to evaluate the probability. The quadratic inequality

$$\frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})} \le w^2, \quad i = 1, 2, \ldots, n,$$

is equivalent to (i/n - X(i))² ≤ X(i)(1 - X(i)) w², and it can be written in the standard form

$$(1 + w^2)X_{(i)}^2 - \left(w^2 + \frac{2i}{n}\right)X_{(i)} + \frac{i^2}{n^2} \le 0, \quad i = 1, 2, \ldots, n. \qquad (4.1)$$

Let the interval [l_i(w), u_i(w)] denote the solution set of (4.1) for each i. Then we have

$$l_i(w) = \frac{w^2 + \frac{2i}{n} - \frac{1}{n}\sqrt{(-4i^2 + 4ni + w^2 n^2)\,w^2}}{2(1 + w^2)}$$

and

$$u_i(w) = \frac{w^2 + \frac{2i}{n} + \frac{1}{n}\sqrt{(-4i^2 + 4ni + w^2 n^2)\,w^2}}{2(1 + w^2)}, \quad i = 1, 2, \ldots, n.$$
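For a concrete check, the bounds can be evaluated directly from the discriminant (a sketch assuming NumPy):

    import numpy as np

    def bounds(i, n, w2):
        """Roots [l_i(w), u_i(w)] of (1+w^2)x^2 - (w^2 + 2i/n)x + i^2/n^2 = 0."""
        disc = np.sqrt((-4 * i**2 + 4 * n * i + w2 * n**2) * w2) / n
        lo = (w2 + 2 * i / n - disc) / (2 * (1 + w2))
        hi = (w2 + 2 * i / n + disc) / (2 * (1 + w2))
        return lo, hi

    # n = 2, w^2 = 0.3: the values match the closed forms given below
    print(bounds(1, 2, 0.3), bounds(2, 2, 0.3))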
We should verify the existence of the roots in their domain, that is, that (4.1) satisfies the following properties:

1. (w² + 2i/n)² − 4(1 + w²) i²/n² ≥ 0 for any i = 1, 2, …, n.
2. 0 ≤ l_i(w) ≤ u_i(w) ≤ 1, i = 1, 2, …, n.
3. l_i(w) and u_i(w) are each monotonically increasing with respect to i for any given n and w².

By simple calculations, we can prove that all three conditions are satisfied. Therefore, for w² > 0, we have

$$P\left(\frac{(i/n - X_{(i)})^2}{X_{(i)}(1 - X_{(i)})} \le w^2,\ i = 1, 2, \ldots, n\right) = P\bigl(X^2(\hat{t}^+) \le w^2\bigr) = \int \cdots \int n!\; dx_1\, dx_2 \cdots dx_n,$$

where the integral is taken over the region with 0 ≤ x₁ ≤ x₂ ≤ ⋯ ≤ xₙ ≤ 1 and l_i(w) ≤ x_i ≤ u_i(w) for each i.
To see the value of the probability, we first check the simplest cases, when the sample size is very small, such as n = 1 or 2. When n = 1, the exact CDF of the one-sided 2DDCS statistic is

$$P\bigl(X^2(\hat{t}^+) \le w^2\bigr) = \int_{l_1(w)}^{u_1(w)} 1\, dt_1 = u_1(w) - l_1(w) = \frac{w^2}{1 + w^2}.$$

When n = 2, the bounds for X(1) are

$$l_1(w) = \frac{w^2 + 1 - w\sqrt{1 + w^2}}{2(1 + w^2)} \quad \text{and} \quad u_1(w) = \frac{w^2 + 1 + w\sqrt{1 + w^2}}{2(1 + w^2)}.$$
The bounds for X(2) are

$$l_2(w) = \frac{1}{1 + w^2} \quad \text{and} \quad u_2(w) = 1.$$

Now we compute the CDF of X²(t̂+) when the sample size is 2:

$$P\bigl(X^2(\hat{t}^+) \le w^2\bigr) = 2\int_{l_2(w)}^{u_2(w)} \int_{l_1(w)}^{u_1(w)} 1\, dt_1\, dt_2 = 2\,(u_1(w) - l_1(w))\,(u_2(w) - l_2(w)) = 2 \cdot \frac{w}{\sqrt{1 + w^2}} \cdot \frac{w^2}{1 + w^2} = \frac{2w^3}{(1 + w^2)^{3/2}}.$$
In this way, we can calculate the exact cumulative distribution of the one-sided 2DDCS statistic for any finite sample size, but the computation grows considerably more complex as the sample size increases. An alternative is to use simulated critical values when the sample size is not too small. Simulated critical values of the one-sided 2DDCS statistics are compared with those of the two-sided ones in the next section.
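A quick Monte Carlo comparison against the n = 2 closed form (a sketch assuming NumPy; the statistic is the unscaled maximum used in this section, evaluated at w² = .3):

    import numpy as np

    rng = np.random.default_rng(0)

    def cdf_n2(w2):
        """P(X^2(t-hat+) <= w^2) for n = 2, i.e. 2 w^3 / (1 + w^2)^(3/2)."""
        return 2 * w2**1.5 / (1 + w2) ** 1.5

    reps, w2 = 200_000, 0.3
    x = np.sort(rng.uniform(size=(reps, 2)), axis=1)
    i = np.arange(1, 3) / 2
    stat = np.max((i - x) ** 2 / (x * (1 - x)), axis=1)
    print(np.mean(stat <= w2), cdf_n2(w2))  # both ~0.22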
4.3 Two-Sided 2DDCS Statistics
Miller and Siegmund's maximally selected chi square statistic for the two-sample test of homogeneity has been discussed in the literature. Halpern (1982) [14] simulated the finite-sample distribution of the maximally selected chi square statistic, and Koziol (1991) [17] derived the exact finite-sample distribution theory from Durbin's (1971) [6] combinatorial approach. In this section, Miller and Siegmund's method is applied to obtain tables of asymptotic critical values. Simulations are conducted for comparison as well.

Recall that the two-sided 2DDCS statistic without a minimum cell length is

$$X^2(\hat{t}) = \sup_{0 < t < 1} \frac{n\,(F_n(t) - t)^2}{t\,(1 - t)}.$$

[…]

4.4 Circular 2DDCS Statistics

[…] N > .5n and k - .5n > 0 if k = ⌊n/2⌋ + 1, …, n, so the inequality (N - .5n)² ≥ (k - .5n)² is equivalent to N - .5n ≥ k - .5n, and then we have

$$P\left(X^2(\hat{t}_c) \ge 4n\left(\frac{k}{n} - \frac{1}{2}\right)^2\right) = P(N \ge k), \quad \text{if } k \ge \left\lfloor\frac{n}{2}\right\rfloor + 1.$$

Now we are ready to get the distributions of X²(t̂_c). When the sample size is 1, we know that X²(t̂_c) = 1 with probability 1, because any semicircle will cover either 0 or 1 observation. Similarly, the value of X²(t̂_c) will be 2 with probability 1 if the sample size is n = 2, since the complement of the event that both points can be covered by some semicircle is X(2) - X(1) = .5, whose probability is 0. However, when n = 3, it requires some calculation to get the exact distribution of X²(t̂_c). We know the maximum number of observations covered by some semicircle can be either 2 or 3. The probability of N = 3 equals the probability that the distance between the largest and the smallest observations is at most the length of a semicircle, that is,

$$P(N = 3) = P\bigl(X_{(3)} - X_{(1)} \le .5\bigr).$$

It can be calculated by applying the joint distribution of the minimum X(1) and the maximum X(3):

$$\int_{.5}^{1} \int_{v - .5}^{v} 6\,(v - u)\, du\, dv = .75.$$
Thus, the exact null distribution of X²(t̂_c) can be summarized as below.

X²(t̂_c)                   N    Probability
4(3)(2/3 - .5)² = 1/3      2    .25
4(3)(3/3 - .5)² = 3        3    .75
For larger sample sizes, we may apply Ajne's result about N. Ajne (1968) [1] proved (Theorem 1 in his paper) that for k = ⌊n/2⌋ + 1, …, n,

$$P(N \ge k) = 2^{-(n-1)}\,(2k - n) \sum_{j \ge 0} \binom{n}{j(2k - n) + k}.$$

Thus, for k = ⌊n/2⌋ + 1, …, n and w_k² = 4n(k/n - 1/2)², we have

$$P\bigl(X^2(\hat{t}_c) \le w_k^2\bigr) = P(N \le k) = 1 - P(N \ge k + 1), \qquad (4.3)$$

since N ≤ k if and only if X²(t̂_c) ≤ w_k² […]
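A direct implementation of this tail probability (a sketch in Python; the summation stops once the binomial coefficient's lower index exceeds n) reproduces the n = 3 distribution above:

    from math import comb

    def ajne_tail(n, k):
        """P(N >= k) for k = floor(n/2) + 1, ..., n (Ajne 1968, Theorem 1)."""
        step = 2 * k - n                 # positive in this range of k
        total, m = 0, k
        while m <= n:
            total += comb(n, m)          # binom(n, j*(2k - n) + k)
            m += step
        return step * total / 2 ** (n - 1)

    print(ajne_tail(3, 3))  # 0.75 = P(N = 3) for n = 3
    print(ajne_tail(3, 2))  # 1.0: some semicircle always covers two of three points
    print(ajne_tail(4, 4))  # 0.5 = 4 / 2^3, the classical all-in-a-semicircle value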
