Datasets

The published datasets listed below are available for download so that users can practice implementing statistical techniques on them. We welcome contributions of additional datasets to this resource.

Study	Reference	Stata File	ASCII File
CASS	Leisenring et al. (2000); Weiner et al. (1979)	est1.dta	est1.csv est1_desc.txt
Pancreatic Ca biomarkers	Wieand et al. (1989)	wiedat2b.dta	wiedat2b.csv wiedat2b_desc.txt
Ultrasound for hepatic mets	Tosteson and Begg (1988)	tostbegg2.dta	tostbegg2.csv tostbegg2_desc.txt
CARET PSA	Etzioni et al. (1999)	psa2b.dta	psa2b.csv psa2b_desc.txt
Gene expression array	Pepe et al. (2003)	orchratio2.dta	orchratio2.csv orchratio2_desc.txt
Norton neonatal audiology	Norton et al. (2000)	nnhs2.dta	nnhs2.csv nnhs2_desc.txt
Leisenring neonatal audiology	Leisenring et al. (1997)	lplaudio_b.dta	lplaudio_b.csv lplaudio_b_desc.txt
Prostate Ca - St. Louis	Smith et al. (1997)	psa_dre_v2.dta	psa_dre_v2.csv psa_dre_desc_v2_.txt
Stover audiology	Stover et al. (1996)	dp2.dta	dp2.csv dp2_desc.txt
Scintigraphy study	Muller et al. (1989)	mlt1.dta	mlt1.csv mlt1_desc.txt
59 Pap screen studies	Fahey et al. (1995)	fim.dta	fim.csv fim_desc.txt
Prenatal screen data (hypothetical)		hpns.dta	hpns.csv hpns_desc.txt
Ovarian Ca markers (hypothetical)		ocdata_b.dta	ocdata_b.csv ocdata_b_desc.txt
Covariate adjustment datasets	Janes et al. (2009)	Figure 1, scenario 1; Figure 1, scenario 2	.csv file and .txt file; .csv file and .txt file
ROC regression dataset	Janes et al. (2009)	Figure 4	.csv file and .txt file
Simulated AKI data	Pepe et al. (2007, 2008)	aki_sim.dta	aki_sim.csv aki_sim_desc.txt
Two frameworks for ordinal ratings	Morris et al. (2010)	two_marker_sim.dta	two_marker_sim.csv two_marker_sim_desc.txt
Multiple Gene Risk Prediction	Pepe, Gu, Morris (2010)	modelA.dta modelB.dta	modelA.csv modelB.csv
Simulated Risk Reclassification dataset	Pepe (2011)	risk_reclass_b.dta	risk_reclass_b.csv risk_reclass_b_desc.txt

Stata format data files can be read with Stata version 8 and above.
Comma-separated ASCII (csv) files include variable names on the first row.
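
Because the csv files carry their variable names on the first row, they can be read with any header-aware CSV reader. The following is a minimal sketch in Python using only the standard library. The helper name `read_dataset` and the sample rows are hypothetical: the variable names and values are invented for illustration and do not come from any dataset listed above.

```python
import csv
import io


def read_dataset(handle):
    """Read a comma-separated file whose first row holds the variable names."""
    reader = csv.DictReader(handle)  # the first row becomes the field names
    rows = list(reader)              # each remaining row becomes a dict
    return reader.fieldnames, rows


# Hypothetical stand-in for a downloaded file. The variable names and values
# are invented for illustration only.
sample = io.StringIO("id,marker,disease\n1,0.8,0\n2,2.3,1\n")
names, rows = read_dataset(sample)
print(names)      # variable names taken from the first row
print(len(rows))  # number of data rows
```

For a downloaded file, the same helper could be called as `with open("est1.csv", newline="") as f: names, rows = read_dataset(f)`. All values are returned as strings, so numeric variables need to be converted before analysis. The accompanying `_desc.txt` file is the place to check what each variable means.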

Dataset References

Etzioni R, Pepe M, Longton G, Hu C, Goodman G (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19:242-51.

Fahey MT, Irwig LM, Macaskill P (1995). Meta-analysis of Pap test accuracy. American Journal of Epidemiology 141:680-9.

Janes H, Longton G, Pepe MS (2009). Accommodating Covariates in ROC Analysis. Stata Journal 9(1):17-39.

Leisenring W, Alonzo T, Pepe MS (2000). Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics 56:345-51.

Leisenring W, Pepe MS, Longton G (1997). A marginal regression modelling framework for evaluating medical diagnostic tests. Statistics in Medicine 16:1263-81.

Morris DE, Pepe MS, Barlow WE (2010). Contrasting two frameworks for ROC analysis of ordinal ratings. Medical Decision Making (in press).

Muller C, Wasserman HJ, Erlank P, Klopper JF, Morkel HR, Ellmann A (1989). Optimisation of density and contrast yielded by multiformat photographic images used for scintigraphy. Physics in Medicine and Biology 34:473-81.

Norton SJ, Gorga MP, Widen JE, Folsom RC, Sininger Y, Cone-Wesson B, Vohr BR, Mascher K, Fletcher K (2000). Identification of neonatal hearing impairment: Evaluation of transient evoked otoacoustic emission, distortion product otoacoustic emission, and auditory brain stem response test performance. Ear and Hearing 21:508-28.

Pepe MS (2011). Problems with Risk Reclassification Methods for Evaluating Prediction Models. American Journal of Epidemiology 173:1327-1335.

Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction (Oxford Statistical Science Series). Oxford University Press.

Pepe MS, Gu W, Morris DE (2010). The Potential of Genes and Other Markers to Inform about Risk. Cancer Epidemiology, Biomarkers and Prevention 19(3):655-665.

Pepe MS, Longton G, Anderson G, Schummer M (2003). Selecting differentially expressed genes from microarray experiments. Biometrics 59:133-42.

Pepe MS, Longton G, Janes H (2007). Estimation and Comparison of Receiver Operating Characteristic Curves. Stata Journal 9(1):1.

Pepe M, Zheng Y, Jin Y, Huang Y, Parikh C, Levy W (2008). Evaluating the ROC performance of markers for future events. Lifetime Data Analysis 14(1):86-113.

Smith DS, Bullock AD, Catalona WJ (1997). Racial differences in operating characteristics of prostate cancer screening tests. The Journal of Urology 158:1861-66.

Stover L, Gorga MP, Neely T (1996). Towards optimizing the clinical utility of distortion product otoacoustic emission measurements. Journal of the Acoustical Society of America 100:956-967.

Tosteson AN, Begg CB (1988). A general regression methodology for ROC curve estimation. Medical Decision Making 8:204-15.

Weiner DA, Ryan TJ, McCabe CH, Kennedy JW, Schloss M, Tristani F, Chaitman BR, Fisher LD (1979). Exercise stress testing. Correlations among history of angina, ST-segment response and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). New England Journal of Medicine 301(5):230-5.

Wieand S, Gail MH, James BR, James KL (1989). A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76:585-92.

Diagnostic and Biomarkers Statistical (DABS) Center
