Item banking, in which a set of items is calibrated on the same scale, is a modern test construction practice with several measurement advantages (e.g., assessment is invariant to the items selected). Except for FitSmart (Zhu et al., 1999), the field of physical education (PE) has not taken advantage of this practice. To assess the National Standards for PE (NASPE, 2004), Elementary Standard 1, NASPE developed an item/task bank called “PE Metrics.”
Purpose
This study reports the technical details of the development and calibration of the bank.
Methods
A total of 30 tasks and related scoring rubrics were developed for Kindergarten (K), Grade 2 (G2), and Grade 5 (G5):
K – Underhand Catching, Dribble with Hand (C = Common Task), Hopping (C), Running, Sliding, Striking, Underhand Throw, Weight Transfer
G2 – Approach & Kick a Ball, Dance Sequence, Dribble with Jog (C), Galloping, Gymnastics Sequence, Jumping & Landing Combination, Jump forward (C), Locomotor Sequence, Overhand Catching, Skipping, Striking with Paddle
G5 – Basketball: Dribble, Pass, and Receive; Defense; Offense; Dance, Floor Hockey, Gymnastics, Inline Skating, Overhand Throwing, Soccer: Dribble, Pass, and Receive (C); Offense; Striking Ball with Paddle (C).
After several pilots and revisions, the tasks were administered to a national sample of students (N = 4,956; 2,501 males and 2,385 females, from 57 schools; K = 1,488, G2 = 1,907, and G5 = 1,563). The common items, which link all items onto the same scale, were administered to all students in the same grade, whereas non-common items were administered only to selected subsamples.
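The common-item design described above can be illustrated with a minimal sketch. It assumes the administration records are available as (student, task) pairs and checks whether all tasks are connected through shared students, which is what allows them to be placed on one scale. The function name `linked_task_groups` and the sample records are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def linked_task_groups(administrations):
    """Group tasks that share students, directly or through a chain of tasks."""
    tasks_by_student = defaultdict(set)
    for student, task in administrations:
        tasks_by_student[student].add(task)

    # Tasks taken by the same student are neighbours in the linking graph.
    neighbours = defaultdict(set)
    for tasks in tasks_by_student.values():
        for task in tasks:
            neighbours[task] |= tasks

    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            task = stack.pop()
            if task in group:
                continue
            group.add(task)
            stack.extend(neighbours[task] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical Kindergarten records: both subsamples take the common task "Hopping".
design = [
    ("s1", "Hopping"), ("s1", "Running"),
    ("s2", "Hopping"), ("s2", "Sliding"),
]
groups = linked_task_groups(design)
```

A single group means every task is linked to every other task through the common task; more than one group would mean some tasks cannot be calibrated on the same scale from these data alone.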
Analysis/Results
The collected data were screened using descriptive statistics, item analysis, and outlier statistics. The cleaned data were analyzed with the Rasch rating scale model (Wright & Masters, 1982) using FACETS, a Rasch analysis program. Model-data fit was evaluated using Infit and Outfit statistics (acceptable range: .7 to 1.3), and task categorization was examined using related statistics (e.g., average measures). The tasks fit the model well according to the Infit and Outfit statistics. Task difficulties were well spread across a wide range; e.g., in K the most difficult tasks were Dribble with Hand and Striking (logits = .73) and the easiest was Underhand Catching (-1.12), and scoring rubric difficulties ranged from -1.44 to 1.16. Categorization statistics indicated that most of the tasks were well developed, with good discrimination.
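As an illustration of the fit evaluation, the following minimal sketch computes rating scale model category probabilities and the Infit and Outfit mean squares for one task, then checks them against the .7 to 1.3 range. It is not the FACETS implementation: it assumes student measures, the task difficulty, and the rubric thresholds are already estimated, and all function names and example values are hypothetical.

```python
import math


def rsm_probs(theta, delta, taus):
    """Probabilities of scores 0..len(taus) under the rating scale model."""
    # Cumulative sums of (theta - delta - tau_j); the score-0 term is 0 by convention.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_and_variance(probs):
    """Model-expected score and its variance for one student-task encounter."""
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


def fit_mean_squares(observed, thetas, delta, taus):
    """Return (infit, outfit) mean squares for one task across students."""
    z2_sum = resid2_sum = var_sum = 0.0
    for x, theta in zip(observed, thetas):
        mean, var = expected_and_variance(rsm_probs(theta, delta, taus))
        resid2 = (x - mean) ** 2
        z2_sum += resid2 / var   # squared standardized residual
        resid2_sum += resid2
        var_sum += var
    outfit = z2_sum / len(observed)  # unweighted mean square
    infit = resid2_sum / var_sum     # information-weighted mean square
    return infit, outfit


def within_fit_range(value, low=0.7, high=1.3):
    """Check a mean square against the range used in this study."""
    return low <= value <= high


# Hypothetical values: a 3-category rubric and four students.
infit, outfit = fit_mean_squares(
    observed=[0, 1, 2, 1],
    thetas=[-1.0, 0.0, 1.0, 0.5],
    delta=0.0,
    taus=[-1.0, 1.0],
)
acceptable = within_fit_range(infit) and within_fit_range(outfit)
```

Outfit averages the squared standardized residuals and is therefore sensitive to unexpected scores from students far from the task difficulty, whereas Infit weights each residual by its model variance and emphasizes students near the task difficulty.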
Conclusions
PE Metrics is ready to measure students' achievement and sets an excellent example for future test construction in PE.