The President's Challenge is a commonly used physical fitness assessment battery that includes criterion-referenced standards for certain items to identify the minimum levels of performance necessary for good health (President's Council on Physical Fitness and Sport, 2003). Although students in the school setting assist in administering the President's Challenge and typically record scores, empirical evidence is lacking to justify the use of student-recorded scores, especially for identifying healthy/unhealthy performance levels. More research specific to fitness test use is needed to uncover appropriate testing practices (Keating and Silverman, 2004). Therefore, the purpose of this study was to compare student and instructor norm-referenced and criterion-referenced scoring on health-fitness items included in the battery. This study was part of a larger measurement project, and parental assent and IRB approval were obtained. Middle-school students (n = 265), ages 12 to 15, were administered tests of muscular strength (pull-up, PU) and flexibility (v-sit, VSIT) on two occasions separated by one week. Following a week of rest, students were administered an assessment of cardiorespiratory fitness (one-mile run-walk, MILE) and were retested the following week. Student partners and eight trained administrators collected scores on participants. To examine norm-referenced scoring, separate t-tests were used to determine whether mean participant scores tallied by administrators differed significantly from mean participant scores tallied by student partners on MILE, PU, and VSIT (p < .01) following both administrations. There was a significant mean difference between scorers on the VSIT following the initial test administration (administrator M = 2.8 ± 3.5 in; student partner M = 3.8 ± 3.4 in; p < .01). 
There were no significant mean differences between administrator and student partner scorers, respectively, on the MILE (11.3 ± 2.5 min; 11.1 ± 2.5 min) or PU (2.3 ± 3.3; 2.4 ± 3.4) following the initial test administration (p > .01). No significant differences were noted for any of the items following the retest administration (p > .01). To examine criterion-referenced differences between scorers, separate chi-square analyses (p < .01) were conducted on MILE, PU, and VSIT following both administrations (healthy/unhealthy classification agreement between scorers). Although the large sample size resulted in significant differences between administrator and student partner passing rates on all three tests following both administrations (p < .01), few participants were misclassified on the MILE (3%, 2%), PU (4%, 4%), and VSIT (15%, 5%) following test and retest administrations, respectively.
Keyword(s): assessment, measurement/evaluation, physical education PK-12
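The criterion-referenced comparison described above can be illustrated with a minimal sketch. The abstract does not specify the exact form of the chi-square analysis, so the code below assumes a Pearson chi-square on a 2x2 table of administrator versus student partner healthy/unhealthy classifications, without continuity correction. The function names and the ten sample classifications are hypothetical and are not data from the study.

```python
import math

def agreement_table(admin_pass, student_pass):
    """Cross-tabulate two scorers' pass/fail calls into a 2x2 table.

    Rows index the administrator (0 = fail, 1 = pass);
    columns index the student partner (0 = fail, 1 = pass).
    """
    table = [[0, 0], [0, 0]]
    for a, s in zip(admin_pass, student_pass):
        table[int(a)][int(s)] += 1
    return table

def misclassification_rate(table):
    """Proportion of participants the two scorers classified differently."""
    n = sum(sum(row) for row in table)
    return (table[0][1] + table[1][0]) / n

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1, no correction)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For df = 1, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical classifications for ten participants (True = meets the standard).
admin = [True] * 6 + [False] * 4
student = [True] * 5 + [False] + [False] * 3 + [True]

table = agreement_table(admin, student)
rate = misclassification_rate(table)
chi2, p_value = chi_square_2x2(table)
print(table, rate, round(chi2, 3), round(p_value, 3))
```

With a sample as large as the one in the study, a test of this kind can reach significance even when the misclassification rate is small, which is why the two quantities are reported side by side.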