Heightened interest in authentic assessment (e.g., Lund, 1999) and qualitative analysis (e.g., Knudson & Morrison, 1997) illustrates the need for both practitioners and researchers to focus on the fundamental principles of measurement and evaluation: validity and reliability. An oft-overlooked element of reliability is test (or rater) objectivity, defined by Safrit and Wood (1995) as "the degree of accuracy in scoring a test" (p. 167). Because authentic assessment and qualitative analysis depend so strongly on observation and scoring rubrics, it is critical that the validity of such assessment results not be reduced by poor agreement within (intra-) and among (inter-) test administrators. The purpose of the current study was to examine and compare common methods for estimating how well judges or raters agree when using qualitative assessment measures. Specifically, the investigator used ordinal ratings derived from assessing trials of overarm throwing for force with three-, four-, and six-level developmental sequences (Roberton, 1978; Langendorfer, 1999) as the basis for comparison. The investigator reduced the data by observing, on two separate occasions, n = 100 side-view trials of persons of different ages performing forceful throws, recorded either on 16 mm film (at 64 frames per second) or on videotape (at 30 or 60 Hz). Several statistics commonly used to measure reliability and rater objectivity were calculated from these data: the Pearson product-moment and Spearman rank-order correlations, the proportion of (exact) agreement, kappa, and weighted kappa. As suspected, the Pearson and Spearman correlation techniques both produced inappropriately high estimates of rater objectivity (ranging from .85 to .98), largely because these statistics assess relationship rather than agreement. Proportion of agreement (P) coefficients (ranging from .80 to .95), while measuring exactly how often ratings coincided, did not take chance levels of agreement into account, a limitation most pronounced with the three- and four-level throwing sequences for humerus and stepping action. The most appropriate, but lowest, rater agreement values were those corrected for chance with kappa and weighted kappa (ranging from .64 to .85). Further investigation is needed to determine what levels of agreement should be deemed satisfactory when using the kappa adjustment, both in practitioners' live observations and in researchers' analyses of recorded behavior.

Keyword(s): assessment, measurement/evaluation, research
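
To make the comparison concrete, the sketch below computes each of the statistics named above for two sets of ordinal ratings, such as one rater's classifications of the same trials on two occasions. The ratings are fabricated placeholders (they are not the study's data), and the use of NumPy, SciPy, and scikit-learn is an assumption for illustration only.

```python
# Illustrative sketch only: fabricated ratings on a 4-level developmental
# sequence stand in for two viewings of the same n = 100 throwing trials.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
occasion1 = rng.integers(1, 5, size=100)            # levels 1-4, first viewing
# Second viewing: mostly identical, with a dozen trials re-scored one level off.
occasion2 = occasion1.copy()
changed = rng.choice(100, size=12, replace=False)
occasion2[changed] = np.clip(occasion2[changed] + rng.choice([-1, 1], size=12), 1, 4)

# Correlation-based indices describe the relationship between ratings, so they
# can remain high even when exact agreement is modest.
r, _ = pearsonr(occasion1, occasion2)
rho, _ = spearmanr(occasion1, occasion2)

# Proportion of exact agreement: how often the two ratings coincide, with no
# correction for agreement expected by chance.
p_obs = np.mean(occasion1 == occasion2)

# Chance-expected agreement from the marginal category proportions, and
# Cohen's kappa = (P_o - P_e) / (1 - P_e).
levels = np.arange(1, 5)
p1 = np.array([(occasion1 == k).mean() for k in levels])
p2 = np.array([(occasion2 == k).mean() for k in levels])
p_exp = np.sum(p1 * p2)
kappa = (p_obs - p_exp) / (1 - p_exp)

# Weighted kappa (linear weights) gives partial credit to near-misses on the
# ordinal scale.
kappa_w = cohen_kappa_score(occasion1, occasion2, weights="linear")

print(f"Pearson r         = {r:.2f}")
print(f"Spearman rho      = {rho:.2f}")
print(f"Exact agreement P = {p_obs:.2f}")
print(f"Cohen's kappa     = {kappa:.2f}")
print(f"Weighted kappa    = {kappa_w:.2f}")
```

With fewer rating levels, as in the three- and four-level sequences, the marginal proportions concentrate in fewer categories, so chance-expected agreement rises and kappa falls further below the raw proportion of agreement, which is the pattern reported above.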