Spring Math
Mathematics
Summary
Spring Math is a comprehensive RtI system that includes screening, progress monitoring, classwide and individual math intervention, and implementation and decision-making support. Assessments are generated within the tool when needed, and Spring Math uses student data to customize classwide and individual intervention plans. Clear, easy-to-understand graphs and reports are provided within the teacher and coach dashboards. Spring Math uses gated screening: CBMs are administered to the class as a whole, followed by classwide intervention, to identify students in need of intensive intervention. Spring Math assesses 130 skills in Grades K-8 and remedies gaps in learning for Grades K-12. The skills offer comprehensive but strategic coverage of the Common Core State Standards. Spring Math assesses mastery of number operations, pre-algebraic thinking, and mathematical logic. It also measures understanding of “tool skills.” Tool skills provide the foundation a child needs to question, speculate, reason, solve, and explain real-world problems. Spring Math emphasizes tool skills across grades with grade-appropriate techniques and materials.
 Where to Obtain:
 Developer: Amanda VanDerHeyden; Publisher: TIES/Sourcewell Tech
 sales@springmath.com
 1667 Snelling Avenue North, St. Paul, MN 55108
 (651) 999-6100
 www.springmath.com
 Initial Cost:
 $7.00 per student
 Replacement Cost:
 $7.00 per student per year
 Included in Cost:
 Sites must have access to one computer per teacher, an internet connection, and the ability to print in black and white. Spring Math provides extensive implementation support at no additional cost through a support portal to which all users have access. Support materials include how-to videos, brief how-to documents, access to all assessments and acquisition lesson plans for 130 skills, and live and archived webinars. In addition to the support portal, sites that wish to purchase additional coaching support can do so through our network of trained coaches, who have expertise in RtI/MTSS leadership and specific training in Spring Math (sites can hire a trainer as an independent contractor). http://www.springmath.com/trainingsupport/
 Training Requirements:
 Less than 1 hr of training
 Qualified Administrators:
 The examiners need to be educators trained in the administration of the measures. Training materials are provided.
 Access to Technical Support:
 Support materials are organized by user role (teacher, coach, data administrator) and provided via our support portal, in the drop-down menu under the user's login icon. We provide free webinars throughout the year for users and host free training institutes at least annually. We provide a systematic onboarding process to help new users get underway. If users encounter technical difficulties, they can submit a request for help directly from their account, which generates a support ticket to our tech support team. Support tickets are monitored during business hours and are responded to the same day.
 Assessment Format:

 Performance measure
 Scoring Time:

 1 minute per student
 Scores Generated:

 Raw score
 Administration Time:

 8 minutes per student
 Scoring Method:

 Manually (by hand)
 Technology Requirements:

 Computer or tablet
 Internet connection
 Accommodations:
 Assessments are standardized, but very brief in duration. If a student requires intervention, intervention allows for oral and written responding, the use of individual rewards for “beating the last best score,” a range of concrete, representational, and abstract understanding activities, and individualized modeling with immediate corrective feedback.
Descriptive Information
 Please provide a description of your tool:
 Spring Math is a comprehensive RtI system that includes screening, progress monitoring, classwide and individual math intervention, and implementation and decision-making support. Assessments are generated within the tool when needed, and Spring Math uses student data to customize classwide and individual intervention plans. Clear, easy-to-understand graphs and reports are provided within the teacher and coach dashboards. Spring Math uses gated screening: CBMs are administered to the class as a whole, followed by classwide intervention, to identify students in need of intensive intervention. Spring Math assesses 130 skills in Grades K-8 and remedies gaps in learning for Grades K-12. The skills offer comprehensive but strategic coverage of the Common Core State Standards. Spring Math assesses mastery of number operations, pre-algebraic thinking, and mathematical logic. It also measures understanding of “tool skills.” Tool skills provide the foundation a child needs to question, speculate, reason, solve, and explain real-world problems. Spring Math emphasizes tool skills across grades with grade-appropriate techniques and materials.
ACADEMIC ONLY: What skills does the tool screen?
 Please describe specific domain, skills or subtests:
 BEHAVIOR ONLY: Which category of behaviors does your tool target?

 BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each subdomain or subconstruct.
Acquisition and Cost Information
Administration
 Are norms available?
 No
 Are benchmarks available?
 Yes
 If yes, how many benchmarks per year?
 3
 If yes, for which months are benchmarks available?
 Fall, winter, spring
 BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
 If yes, how many students can be rated concurrently?
Training & Scoring
Training
 Is training for the administrator required?
 Yes
 Describe the time required for administrator training, if applicable:
 Less than 1 hr of training
 Please describe the minimum qualifications an administrator must possess.
 The examiners need to be educators trained in the administration of the measures. Training materials are provided.
 No minimum qualifications
 Are training manuals and materials available?
 Yes
 Are training manuals/materials field-tested?
 Yes
 Are training manuals/materials included in cost of tools?
 Yes
 If No, please describe training costs:
 Can users obtain ongoing professional and technical support?
 Yes
 If Yes, please describe how users can obtain support:
 Support materials are organized by user role (teacher, coach, data administrator) and provided via our support portal, in the drop-down menu under the user's login icon. We provide free webinars throughout the year for users and host free training institutes at least annually. We provide a systematic onboarding process to help new users get underway. If users encounter technical difficulties, they can submit a request for help directly from their account, which generates a support ticket to our tech support team. Support tickets are monitored during business hours and are responded to the same day.
Scoring
 Do you provide basis for calculating performance level scores?

Yes
 Does your tool include decision rules?

Yes
 If yes, please describe.
 Spring Math graphs student performance at screening relative to two criteria: the answers-correct equivalents of the digits-correct criteria reflecting instructional-level and mastery-level performance. Screening measures reflect a subskill mastery measurement approach to CBM and intentionally sample rigorous grade-level understanding (https://static1.squarespace.com/static/57ab866cf7e0ab5cbba29721/t/591b4b9a86e6c0d47e88a51d/1494961050245/SM_ScreeningByGrades_TimeOfYear.pdf). If 50% or more of the class scores below the instructional range on the screening measures, Spring Math recommends (and then provides) classwide math intervention. Classwide intervention data are recorded weekly. After 4 weeks of implementation, Spring Math recommends for individual intervention those students who score below the instructional range when the class median has reached mastery on any given skill. Spring Math then directs the diagnostic (i.e., “drill down”) assessment to provide the correctly aligned intervention for students needing individualized intervention. During intervention, the teacher is provided with all materials needed to conduct the intervention, including scripted activities to build related conceptual understanding. Student performance is graphed weekly. During classwide intervention, the student’s rate of improvement is reported relative to the class median rate of improvement; for a student receiving intensive individualized intervention, the rate of improvement on the intervention skill and on the generalization skill is reported. Spring Math adjusts interventions weekly based upon student data. The coach dashboard provides real-time summaries of intervention implementation (weeks with scores, most recent score entry) and intervention progress (rate of skill mastery) and directs the coach to provide support in cases where intervention growth is not sufficient.
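 The two gating rules described above can be sketched in a few lines of Python. This is only an illustration of the stated logic, not Spring Math's implementation: the function names, the `instructional_min` and `mastery_min` cut values, and the sample scores are all hypothetical, and the actual criteria vary by skill and grade.

```python
from statistics import median


def needs_classwide_intervention(scores, instructional_min):
    """Gate 1: True when 50% or more of the class scores below the
    instructional range on the screening measure."""
    below = sum(1 for s in scores if s < instructional_min)
    return below / len(scores) >= 0.5


def flag_for_individual_intervention(scores_by_student, instructional_min,
                                     mastery_min, weeks_of_intervention):
    """Gate 2: after at least 4 weeks of classwide intervention, and once the
    class median has reached mastery, return the students who are still
    below the instructional range."""
    if weeks_of_intervention < 4:
        return []
    if median(scores_by_student.values()) < mastery_min:
        return []
    return sorted(name for name, s in scores_by_student.items()
                  if s < instructional_min)


# Hypothetical class data and cut values, for illustration only.
screening = [5, 8, 12, 20, 22, 3]
weekly = {"A": 25, "B": 8, "C": 30, "D": 22}
print(needs_classwide_intervention(screening, instructional_min=10))
print(flag_for_individual_intervention(weekly, 10, 20, weeks_of_intervention=4))
```

 In this sketch, half of the hypothetical class falls below the instructional cut, so classwide intervention is recommended; once the class median reaches the mastery cut, only student "B" remains below the instructional range and is flagged.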
 Can you provide evidence in support of multiple decision rules?

Yes
 If yes, please describe.
 Our previous research (e.g., VanDerHeyden, Witt, & Naquin, 2001; VanDerHeyden & Witt, 2005; VanDerHeyden, Witt, & Gilbertson, 2007) used and tested the RtI decision rules used in Spring Math. More recent mathematics-specific research (e.g., VanDerHeyden et al., 2012; Burns, VanDerHeyden, & Jiban, 2006) has also used and tested the decision rules in the context of intervention delivery and RtI decision making. A full list of references is here: http://springmath.s3.amazonaws.com/pdf/faq/SpringMath_References.pdf. In this application, we provide classification accuracy data for screening + classwide intervention to determine the need for individualized intensive intervention.
 Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
 Composite scores included in the analyses here are the raw score total of the screening measures administered on each screening occasion.
 Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
 Spring Math assessment procedures are similar across measures and include clear and consistent assessment and scoring practices. Simple, clear language is used along with sample items in order to make the assessments appropriate for students who are linguistically diverse or those with disabilities. Supplements to the assessment instructions which do not alter the procedures are allowed.
Technical Standards
Classification Accuracy & Cross-Validation Summary
Grades rated: Kindergarten, Grade 1, Grade 3, Grade 5, Grade 7
Ratings reported: Classification Accuracy Fall, Classification Accuracy Winter, Classification Accuracy Spring
Spring Composite Score (Classwide Intervention Risk)
Classification Accuracy
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Classwide Intervention Risk is defined as the number of times a child remains in the frustrational range on a skill when the class median has reached mastery during a standard-protocol daily 12-15-min intervention, divided by the total number of weekly classwide intervention progress monitoring scores available for the student. The actual formula is: ((Number of times at risk during classwide intervention + 1) / Number of Progress Monitoring Scores) * -1. A constant is added in the numerator, and the entire quotient is reverse scaled by multiplying by -1 to reflect the negative linear relationship between academic risk and academic achievement (e.g., less risk is associated with greater academic performance scores). A scatterplot of classwide intervention risk against year-end standard scores on the Arizona accountability measure, with reference and screening cut scores marked, is available upon request from the Center.
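 As a worked illustration of the formula above, the following Python sketch computes the reverse-scaled risk score for one student. The function name and the example counts are hypothetical; the multiplication by -1 follows the reverse scaling described in the text.

```python
def classwide_intervention_risk(times_at_risk, n_progress_scores):
    """Reverse-scaled classwide intervention risk for one student.

    times_at_risk: number of weekly scores on which the student remained in
        the frustrational range after the class median reached mastery.
    n_progress_scores: total weekly classwide progress monitoring scores
        available for the student (must be positive).
    """
    if n_progress_scores <= 0:
        raise ValueError("at least one progress monitoring score is required")
    # Add the constant 1 to the numerator, divide, then multiply by -1 so that
    # higher (less negative) values mean less risk.
    return ((times_at_risk + 1) / n_progress_scores) * -1


# Hypothetical students: never at risk vs. at risk on 2 of 8 weekly scores.
print(classwide_intervention_risk(0, 10))  # -0.1
print(classwide_intervention_risk(2, 8))   # -0.375
```

 Because of the reverse scaling, the student with fewer at-risk weeks receives the higher score, so the composite correlates positively with achievement measures.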
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the classification analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?

Yes
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
 Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.
Cross-Validation
 Has a cross-validation study been conducted?

No
 If yes,
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the cross-validation analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
AzMERIT
Classification Accuracy
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 The outcome measure in grades 3, 5 and 7 was the AzMERIT (https://cms.azed.gov/home/GetDocumentFile?id=5b6b29191dcb250edc160590), which is the statewide achievement test in Arizona. The mean scores on the AZ measure for participants in grades 3, 5, and 7 were in the proficient range. The base rate of nonproficiency in grade 3 was 22% versus 58% nonproficient for the state. The base rate of nonproficiency in grade 5 was 26% versus 60% nonproficient for the state. The base rate of nonproficiency in grade 7 was 32% versus 69% nonproficient for the state. We used a local 20th percentile total math standard score equivalent on the AZ test as the reference criterion to identify students in need of more intensive intervention. At each grade, the local 20th percentile standard score equivalent was in the nonproficient range according to the state.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the classification analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 The reference criterion was performance below the 20th percentile (local norm) on the researcher-constructed composite measures in grades K and 1 and on the Arizona year-end test of mathematics for grades 3, 5, and 7. Nonparametric ROC analysis was conducted using STATA 12, with the fall and winter screening composite scores as the test variable and below-20th-percentile performance on the reference criterion as the reference variable. ROC-generated AUC values and classification agreement values were generated within the ROC analysis. Screening risk was determined in two ways: (1) on universal screening at fall and winter, and (2) by response to classwide intervention, which was delivered universally and has been shown to produce learning gains and permit more accurate determination of risk when base rates of risk are high (VanDerHeyden, McLaughlin, Algina, & Snyder, 2012; VanDerHeyden, 2013). Classification agreement analyses were also conducted directly using a priori screening decision rules for universal screening at fall and winter and subsequent risk during classwide intervention. Results replicated the ROC-generated classification agreement indices reported in this application. Universal screening measures were highly sensitive but generated a high number of false-positive errors. Classwide intervention response data (also collected universally) were slightly less sensitive but much more specific than universal screening (e.g., Grade 3 sensitivity = .82 and specificity = .80). In grades K and 1, the probability of year-end risk was zero for children who passed the screeners. K and 1st grade children who met the risk criterion during classwide intervention had a 62% and 59% probability of year-end risk, which was 2.67 and 2.36 times the base rate of risk, respectively. At Grade 3, children who passed the screeners had zero probability of year-end risk. 
Third graders meeting the risk criterion during classwide intervention had a 33% chance of year-end risk on the AZ measure, which was 1.5 times the base rate of risk in the sample. Grade 5 students who passed the screener had zero probability of year-end risk, but a 39% probability of year-end risk if they met the risk criterion during classwide intervention, which was 2.17 times the base rate of risk for the sample. Grade 7 students who passed the screeners had a 1% chance of year-end risk on the AZ measure, but a 44% chance of year-end risk if they met the risk criterion during classwide intervention, which was 1.63 times the base rate of risk in the sample. Functionally, universal screening is used to determine the need for classwide intervention. Classwide intervention data can then be used as a second screening gate.
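 The classification agreement indices discussed above (sensitivity, specificity, probability of year-end risk given a positive screen, and that probability expressed as a multiple of the base rate) all derive from a 2x2 table of screening decision by reference criterion. The following Python sketch shows the arithmetic; the function name and the cell counts are hypothetical and are not the study data.

```python
def classification_indices(tp, fp, fn, tn):
    """Classification agreement indices from a 2x2 table.

    tp: screened at risk, below the criterion cut (true positives)
    fp: screened at risk, at or above the cut (false positives)
    fn: passed the screen, below the cut (false negatives)
    tn: passed the screen, at or above the cut (true negatives)
    Assumes every row and column total is nonzero.
    """
    total = tp + fp + fn + tn
    base_rate = (tp + fn) / total      # proportion truly at risk
    post_test_positive = tp / (tp + fp)  # P(year-end risk | screened at risk)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "base_rate": base_rate,
        "post_test_positive": post_test_positive,
        "post_test_negative": fn / (fn + tn),  # P(year-end risk | passed)
        "multiple_of_base_rate": post_test_positive / base_rate,
    }


# Hypothetical counts for a sample of 100 students.
for name, value in classification_indices(tp=20, fp=20, fn=5, tn=55).items():
    print(f"{name}: {value:.2f}")
```

 With these illustrative counts, a positive screen raises the probability of year-end risk from a base rate of .25 to .50, that is, 2.0 times the base rate, which is the same kind of ratio reported for each grade above.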
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?

Yes
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
 Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.
Cross-Validation
 Has a cross-validation study been conducted?

No
 If yes,
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the cross-validation analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Winter Composite Score
Classification Accuracy
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 In grades K and 1, a Winter Composite (for fall screening) and a Spring Composite (for winter screening) were the outcome criteria. For Grade K, the Winter Composite was a researcher-constructed measure that reflected the raw score total of 4 timed measures, timed at 1 minute each. These measures were Count Objects to 20 & Write Answer, Identify Number and Draw Circles to 20, Quantity Discrimination with Dot Sets to 20, and Missing Number to 20. Thus, the composite score reflected understanding of object-number correspondence, cardinality, and ordinality. To respond correctly, children also had to be facile with identifying and writing numbers. Curriculum-based measurement of understanding of object-number correspondence, cardinality, number identification/naming, and ordinality has been studied extensively by multiple research teams (Floyd, Hojnoski, & Key, 2006). Among the first research teams to investigate curriculum-based measures of early numeracy were VanDerHeyden, Witt, Naquin, and Noell (2001), who studied counting objects and writing numbers, and identifying numbers and drawing corresponding object sets, with alternate-form reliability correlations ranging from r = .7 to .84 and concurrent validity correlations of r = .44 to .61 with the Comprehensive Inventory of Basic Skills, Revised (Brigance, 1999). Clarke and Shinn (2004) examined, among other measures, a missing number measure to quantify ordinal understanding among first graders, with excellent findings: a 26-week test-retest r = .81 and a concurrent validity correlation with the Number Knowledge Test (Okamoto & Case, 1996) of r = .74. In 2011, VanDerHeyden and colleagues developed and tested new measures of early mathematical understanding, but also included the missing number measure, counting objects and writing the number, and identifying the number and drawing circles. 
These authors reported test-retest correlation values ranging from r = .71 to .87, correlations with Test of Early Mathematical Ability (TEMA; Ginsburg & Baroody, 2003) scores of r = .55 to .71, and longitudinal correlations with curriculum-based measures for addition and subtraction at the end of first grade of r = .51 to .55. The quantity comparison measure using dot sets was also examined, with test-retest r = .82, concurrent validity with the TEMA of r = .49, and predictive validity with year-end first grade measures of addition and subtraction of r = .43. In First Grade, the Winter Composite was a researcher-constructed measure that reflected the raw score total from 4 timed measures. These measures were Sums to 12, Subtraction 0-5, Fact Families for Addition & Subtraction 0-5, and Quantity Discrimination with Numbers in the Hundreds. The Spring Composite was a researcher-constructed measure that reflected the raw score total from 3 timed measures: Sums to 20, Subtraction 0-20, and Fact Families for Addition & Subtraction 0-20. These measures assessed understanding of addition, subtraction, the relationship between addition and subtraction, and quantity comparison using place value understanding. These measures have been widely studied in the curriculum-based measurement literature and are commonly used as outcome measures within studies (Foegen, Jiban, & Deno, 2007; Fuchs, Fuchs, Compton, Bryant, Hamlett, & Seethaler, 2007). Research teams have demonstrated that proficiency on single-skill computation skills like those included in the first grade composite measures can be reliably measured and meaningfully predicts skill retention, performance on more distal measures, and rate of skill mastery (trials to criterion) when learning more complex related skills (Burns, VanDerHeyden, & Jiban, 2006; VanDerHeyden, Codding, & Martin, 2017; VanDerHeyden & Burns, 2009). 
The measures emphasize skills that are particularly important for forecasting (and subsequently avoiding) longer-term mathematical risk and thus are recommended for focus in RtI/MTSS systems (Gersten & Chard, 1999; Gersten et al., 2009).
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the classification analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 The reference criterion was performance below the 20th percentile (local norm) on the researcher-constructed composite measures in grades K and 1 and on the Arizona year-end test of mathematics for grades 3, 5, and 7. Nonparametric ROC analysis was conducted using STATA 12, with the fall and winter screening composite scores as the test variable and below-20th-percentile performance on the reference criterion as the reference variable. ROC-generated AUC values and classification agreement values were generated within the ROC analysis. Screening risk was determined in two ways: (1) on universal screening at fall and winter, and (2) by response to classwide intervention, which was delivered universally and has been shown to produce learning gains and permit more accurate determination of risk when base rates of risk are high (VanDerHeyden, McLaughlin, Algina, & Snyder, 2012; VanDerHeyden, 2013). Classification agreement analyses were also conducted directly using a priori screening decision rules for universal screening at fall and winter and subsequent risk during classwide intervention. Results replicated the ROC-generated classification agreement indices reported in this application. Universal screening measures were highly sensitive but generated a high number of false-positive errors. Classwide intervention response data (also collected universally) were slightly less sensitive but much more specific than universal screening (e.g., Grade 3 sensitivity = .82 and specificity = .80). In grades K and 1, the probability of year-end risk was zero for children who passed the screeners. K and 1st grade children who met the risk criterion during classwide intervention had a 62% and 59% probability of year-end risk, which was 2.67 and 2.36 times the base rate of risk, respectively. At Grade 3, children who passed the screeners had zero probability of year-end risk. 
Third graders meeting the risk criterion during classwide intervention had a 33% chance of year-end risk on the AZ measure, which was 1.5 times the base rate of risk in the sample. Grade 5 students who passed the screener had zero probability of year-end risk, but a 39% probability of year-end risk if they met the risk criterion during classwide intervention, which was 2.17 times the base rate of risk for the sample. Grade 7 students who passed the screeners had a 1% chance of year-end risk on the AZ measure, but a 44% chance of year-end risk if they met the risk criterion during classwide intervention, which was 1.63 times the base rate of risk in the sample. Functionally, universal screening is used to determine the need for classwide intervention. Classwide intervention data can then be used as a second screening gate.
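 For readers unfamiliar with nonparametric ROC analysis, the AUC can be read as the probability that a randomly chosen at-risk student has a lower screening score than a randomly chosen student who is not at risk, with ties counted as one half. The Python sketch below computes it by direct pairwise comparison. It is not the Stata procedure used in the study; the function name and the data are hypothetical, and it assumes that lower screening scores indicate greater risk.

```python
def auc_lower_is_riskier(screening_scores, at_risk):
    """Nonparametric AUC by pairwise comparison (Mann-Whitney formulation).

    screening_scores: screening composite raw scores, one per student.
    at_risk: booleans, True when the student fell below the reference
        criterion cut (e.g., the local 20th percentile on the outcome).
    """
    risk = [s for s, r in zip(screening_scores, at_risk) if r]
    no_risk = [s for s, r in zip(screening_scores, at_risk) if not r]
    if not risk or not no_risk:
        raise ValueError("both groups must contain at least one student")
    concordant = 0.0
    for a in risk:
        for b in no_risk:
            if a < b:
                concordant += 1.0   # at-risk student scored lower: concordant
            elif a == b:
                concordant += 0.5   # ties count as half
    return concordant / (len(risk) * len(no_risk))


# Hypothetical scores: two at-risk students, two not at risk.
print(auc_lower_is_riskier([3, 10, 9, 12], [True, True, False, False]))  # 0.75
```

 An AUC of .50 indicates chance-level discrimination, and 1.0 indicates that every at-risk student scored below every student who was not at risk.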
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?

Yes
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
 Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.
Cross-Validation
 Has a cross-validation study been conducted?

No
 If yes,
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the cross-validation analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Spring Composite Score
Classification Accuracy
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 In grades K and 1, a Winter Composite (for fall screening) and a Spring Composite (for winter screening) were the outcome criteria. For Grade K, the Winter Composite was a researcher-constructed measure that reflected the raw score total of 4 timed measures, timed at 1 minute each. These measures were Count Objects to 20 & Write Answer, Identify Number and Draw Circles to 20, Quantity Discrimination with Dot Sets to 20, and Missing Number to 20. Thus, the composite score reflected understanding of object-number correspondence, cardinality, and ordinality. To respond correctly, children also had to be facile with identifying and writing numbers. Curriculum-based measurement of understanding of object-number correspondence, cardinality, number identification/naming, and ordinality has been studied extensively by multiple research teams (Floyd, Hojnoski, & Key, 2006). Among the first research teams to investigate curriculum-based measures of early numeracy were VanDerHeyden, Witt, Naquin, and Noell (2001), who studied counting objects and writing numbers, and identifying numbers and drawing corresponding object sets, with alternate-form reliability correlations ranging from r = .7 to .84 and concurrent validity correlations of r = .44 to .61 with the Comprehensive Inventory of Basic Skills, Revised (Brigance, 1999). Clarke and Shinn (2004) examined, among other measures, a missing number measure to quantify ordinal understanding among first graders, with excellent findings: a 26-week test-retest r = .81 and a concurrent validity correlation with the Number Knowledge Test (Okamoto & Case, 1996) of r = .74. In 2011, VanDerHeyden and colleagues developed and tested new measures of early mathematical understanding, but also included the missing number measure, counting objects and writing the number, and identifying the number and drawing circles. 
These authors reported test-retest correlation values ranging from r = .71 to .87, correlations with Test of Early Mathematical Ability (TEMA; Ginsburg & Baroody, 2003) scores of r = .55 to .71, and longitudinal correlations with curriculum-based measures for addition and subtraction at the end of first grade of r = .51 to .55. The quantity comparison measure using dot sets was also examined, with test-retest r = .82, concurrent validity with the TEMA of r = .49, and predictive validity with year-end first grade measures of addition and subtraction of r = .43. The Spring Composite was a researcher-constructed measure that reflected the raw score total of 4 timed measures, timed at 1 minute each. These measures were Change Quantity of Dots to Make 10, Missing Number to 20, Addition 0-5 for Kindergarten, and Subtraction 0-5 for Kindergarten. The first measure reflected understanding of object-number correspondence to make sets of objects ranging from 1-10 given a starting set of 1-10 (except never matching). It required the child to strike out or add to a dot set to create the specified quantity to match a number; that is, the child had to understand the number quantity desired (number identification) and then add or remove dots to create an equivalent set. The second measure is the Missing Number measure, which assesses ordinality with excellent concurrent and predictive validity for children. The third and fourth measures assessed the child’s ability to combine and take away quantities to 5 using numbers.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the classification analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 The reference criterion was performing below the 20th percentile (local norm) on the researcher-constructed composite measures in grades K and 1 and on the Arizona year-end test of mathematics in grades 3, 5, and 7. Nonparametric ROC analysis was conducted in Stata 12 using the fall and winter screening composite scores as the test variable and below-20th-percentile performance on the reference criterion as the reference variable. AUC values and classification agreement values were generated within the ROC analysis. Screening risk was determined in two ways: (1) universal screening at fall and winter, and (2) response to classwide intervention, which was delivered universally and has been shown to produce learning gains and permit more accurate determination of risk when base rates of risk are high (VanDerHeyden, McLaughlin, Algina, & Snyder, 2012; VanDerHeyden, 2013). Classification agreement analyses were also conducted directly using a priori screening decision rules for universal screening at fall and winter and subsequent risk during classwide intervention. Results replicated the ROC-generated classification agreement indices reported in this application. Universal screening measures were highly sensitive but generated a high number of false-positive errors. Classwide intervention response data (also collected universally) were slightly less sensitive but much more specific than universal screening (e.g., Grade 3 sensitivity = .82 and specificity = .80). In grades K and 1, the probability of year-end risk was zero for children who passed the screeners. K and 1st grade children who met the risk criterion during classwide intervention had a 62% and 59% probability of year-end risk, respectively, which was 2.67 and 2.36 times the base rate of risk. At Grade 3, children who passed the screeners had zero probability of year-end risk.
Third graders meeting the risk criterion during classwide intervention had a 33% chance of year-end risk on the AZ measure, which was 1.5 times the base rate of risk in the sample. Grade 5 students who passed the screener had zero probability of year-end risk, but a 39% probability of year-end risk if they met the risk criterion during classwide intervention, which was 2.17 times the base rate of risk for the sample. Grade 7 students who passed the screeners had a 1% chance of year-end risk on the AZ measure, but a 44% chance of year-end risk if they met the risk criterion during classwide intervention, which was 1.63 times the base rate of risk in the sample. Functionally, universal screening is used to determine the need for classwide intervention. Classwide intervention data can then be used as a second screening gate.
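The classification indices discussed above all derive from a 2x2 table of screening decision against year-end status. As a minimal sketch (the counts below are hypothetical, not the study's data, and the function name is illustrative), the indices and the ratio of post-screening risk to base rate could be computed as follows:

```python
def classification_indices(tp, fp, fn, tn):
    """Screening accuracy indices from 2x2 counts.

    tp: flagged by the screener and at risk at year end
    fp: flagged by the screener but not at risk at year end
    fn: passed the screener but at risk at year end
    tn: passed the screener and not at risk at year end
    """
    total = tp + fp + fn + tn
    base_rate = (tp + fn) / total
    ppv = tp / (tp + fp)  # probability of year-end risk given a flag
    return {
        "base_rate": base_rate,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_power": ppv,
        "negative_predictive_power": tn / (tn + fn),
        # How many times the base rate the post-screening risk represents
        "ppv_to_base_rate": ppv / base_rate,
    }


# Hypothetical counts for illustration only (not taken from the study)
idx = classification_indices(tp=15, fp=20, fn=5, tn=60)
for name, value in idx.items():
    print(f"{name}: {value:.3f}")
```

The last entry is the same kind of quantity as the "times the base rate" figures reported above: the probability of year-end risk among flagged students divided by the base rate of risk in the sample.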
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?

Yes
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
 Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.
Cross-Validation
 Has a cross-validation study been conducted?

No
 If yes,
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the cross-validation analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
AzMERIT (Classwide Intervention Risk)
Classification Accuracy
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Classwide Intervention Risk is defined as the number of times a child remains in the frustrational range on a skill when the class median has reached mastery during a standard protocol daily 12-15 min intervention, divided by the total number of weekly classwide intervention progress monitoring scores available for the student. The actual formula is: ((Number of times at risk during classwide intervention + 1) / Number of Progress Monitoring Scores) * -1. A constant is added in the numerator and the entire quotient is reverse scaled by multiplying by -1 to reflect the negative linear relationship between academic risk and academic achievement (e.g., less risk is associated with greater academic performance scores). A scatterplot of classwide intervention risk against year-end standard scores on the Arizona accountability measure with reference and screening cut scores marked is available upon request from the Center.
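The risk formula above can be sketched in a few lines. This is an illustration only: it assumes the reverse-scaling multiplier is -1 (the sign appears to have been lost in the extracted text), and the function and argument names are hypothetical rather than taken from Spring Math.

```python
def classwide_intervention_risk(times_at_risk, n_scores):
    """Reverse-scaled classwide intervention risk score.

    times_at_risk: number of weekly scores on which the child stayed in the
        frustrational range while the class median had reached mastery
    n_scores: total number of weekly progress monitoring scores available
    Assumption: the multiplier is -1, so higher (less negative) values
    indicate less risk.
    """
    if n_scores <= 0:
        raise ValueError("at least one progress monitoring score is required")
    return -1 * (times_at_risk + 1) / n_scores


# Hypothetical students, each with 10 weekly scores
print(classwide_intervention_risk(0, 10))  # never at risk
print(classwide_intervention_risk(3, 10))  # at risk on 3 of 10 occasions
```

Because of the constant in the numerator, a child who was never at risk still receives a nonzero score that depends on how many progress monitoring scores are available.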
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the classification analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?

Yes
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
 Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.
Cross-Validation
 Has a cross-validation study been conducted?

No
 If yes,
 Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
 Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
 Describe how the cross-validation analyses were performed and cut points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
 Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
 If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Classification Accuracy - Fall
Evidence  Kindergarten  Grade 1  Grade 3  Grade 5  Grade 7 

Criterion measure  Winter Composite Score  Winter Composite Score  AzMERIT  AzMERIT  AzMERIT 
Cut Points  Percentile rank on criterion measure  20  20  20  20  20 
Cut Points  Performance score on criterion measure  17  59  3515  3582  3633 
Cut Points  Corresponding performance score (numeric) on screener measure  12  39  24  35  13 
Classification Data  True Positive (a)  
Classification Data  False Positive (b)  
Classification Data  False Negative (c)  
Classification Data  True Negative (d)  
Area Under the Curve (AUC)  0.85  0.85  0.86  0.77  0.91 
AUC Estimate’s 95% Confidence Interval: Lower Bound  0.76  0.76  0.73  0.63  0.87 
AUC Estimate’s 95% Confidence Interval: Upper Bound  0.94  0.93  0.99  0.91  0.96 
Statistics  Kindergarten  Grade 1  Grade 3  Grade 5  Grade 7 

Base Rate  
Overall Classification Rate  
Sensitivity  
Specificity  
False Positive Rate  
False Negative Rate  
Positive Predictive Power  
Negative Predictive Power 
Sample  Kindergarten  Grade 1  Grade 3  Grade 5  Grade 7 

Date  9/1/17  9/1/17  9/1/17  9/1/17  9/1/17 
Sample Size  
Geographic Representation  Mountain (AZ)  Mountain (AZ)  Mountain (AZ)  Mountain (AZ)  Mountain (AZ) 
Male  
Female  
Other  
Gender Unknown  
White, NonHispanic  
Black, NonHispanic  
Hispanic  
Asian/Pacific Islander  
American Indian/Alaska Native  
Other  
Race / Ethnicity Unknown  
Low SES  
IEP or diagnosed disability  
English Language Learner 
Classification Accuracy - Winter
Evidence  Kindergarten  Grade 1  Grade 3  Grade 5  Grade 7 

Criterion measure  Spring Composite Score  Spring Composite Score  AzMERIT  AzMERIT  AzMERIT 
Cut Points  Percentile rank on criterion measure  20  20  20  20  20 
Cut Points  Performance score on criterion measure  29  37  3515  3582  3633 
Cut Points  Corresponding performance score (numeric) on screener measure  21  78  31  31  26 
Classification Data  True Positive (a)  
Classification Data  False Positive (b)  
Classification Data  False Negative (c)  
Classification Data  True Negative (d)  
Area Under the Curve (AUC)  0.87  0.91  0.79  0.74  0.91 
AUC Estimate’s 95% Confidence Interval: Lower Bound  0.79  0.86  0.67  0.62  0.87 
AUC Estimate’s 95% Confidence Interval: Upper Bound  0.95  0.97  0.91  0.86  0.95 
Statistics  Kindergarten  Grade 1  Grade 3  Grade 5  Grade 7 

Base Rate  
Overall Classification Rate  
Sensitivity  
Specificity  
False Positive Rate  
False Negative Rate  
Positive Predictive Power  
Negative Predictive Power 
Sample  Kindergarten  Grade 1  Grade 3  Grade 5  Grade 7 

Date  1/5/18  1/5/18  1/5/18  1/5/18  1/5/18 
Sample Size  
Geographic Representation  Mountain (AZ)  Mountain (AZ)  Mountain (AZ)  Mountain (AZ)  Mountain (AZ) 
Male  
Female  
Other  
Gender Unknown  
White, NonHispanic  
Black, NonHispanic  
Hispanic  
Asian/Pacific Islander  
American Indian/Alaska Native  
Other  
Race / Ethnicity Unknown  
Low SES  
IEP or diagnosed disability  
English Language Learner 
Classification Accuracy - Spring
Evidence  Kindergarten  Grade 1  Grade 3  Grade 5 

Criterion measure  Spring Composite Score (Classwide Intervention Risk)  Spring Composite Score (Classwide Intervention Risk)  AzMERIT (Classwide Intervention Risk)  AzMERIT (Classwide Intervention Risk) 
Cut Points  Percentile rank on criterion measure  20  20  20  20 
Cut Points  Performance score on criterion measure  29  37  3515  3582 
Cut Points  Corresponding performance score (numeric) on screener measure  0.06  0.08  0.11  0.05 
Classification Data  True Positive (a)  
Classification Data  False Positive (b)  
Classification Data  False Negative (c)  
Classification Data  True Negative (d)  
Area Under the Curve (AUC)  0.79  0.95  0.86  0.85 
AUC Estimate’s 95% Confidence Interval: Lower Bound  0.67  0.90  0.78  0.78 
AUC Estimate’s 95% Confidence Interval: Upper Bound  0.91  0.99  0.96  0.93 
Statistics  Kindergarten  Grade 1  Grade 3  Grade 5 

Base Rate  
Overall Classification Rate  
Sensitivity  
Specificity  
False Positive Rate  
False Negative Rate  
Positive Predictive Power  
Negative Predictive Power 
Sample  Kindergarten  Grade 1  Grade 3  Grade 5 

Date  
Sample Size  
Geographic Representation  
Male  
Female  
Other  
Gender Unknown  
White, NonHispanic  
Black, NonHispanic  
Hispanic  
Asian/Pacific Islander  
American Indian/Alaska Native  
Other  
Race / Ethnicity Unknown  
Low SES  
IEP or diagnosed disability  
English Language Learner 
Reliability
Grade 
Kindergarten

Grade 1

Grade 3

Grade 5

Grade 7


Rating 
 *Offer a justification for each type of reliability reported, given the type and purpose of the tool.
 Probes are generated following a set of programmed parameters that were built and tested in a development phase. To determine measure equivalence, problem sets were generated, and each problem within a problem set was scored for possible digits correct. The digits correct metric comes from the curriculum-based measurement literature (Deno & Mirkin, 1977) and allows for sensitive measurement of child responding. Typically, each digit that appears in the correct place value position to arrive at the correct final answer is counted as a digit correct. Generally, digits correct are counted for all work that occurs below the problem (in the answer) but not for work that may appear above the problem, for example, in composing or decomposing hundreds or tens when regrouping. A standard response format was selected for all measures, which reflected the relevant responses in steps to arrive at a correct and complete answer. Potential digits correct was the unit of analysis that we used to test the equivalence of generated problem sets. For example, in scoring adding and subtracting fractions with unlike denominators, we counted all digits correct in generating fractions with equivalent denominators, then the digits correct in combining or taking the fraction quantity, and finally the digits correct in simplifying the final fraction. The number of problems generated depended upon the task difficulty of the measure. If the measure assessed an easier skill (defined as having fewer potential digits correct), more problems were generated than for harder skills, for which the possible digits correct scores were much higher. Problems generated for equivalence testing ranged from 80 to 480 problems per measure. A total of 46,022 problems were generated and scored for possible digits correct to test the equivalence of generated problem sets. Problem sets ranged from 8 to 48 problems.
Most problem sets contained 30 problems. For each round of testing, 10 problem sets were generated per measure. The mean possible digits correct per problem was computed for each problem set for each measure. The standard deviation of possible digits correct across the ten generated problem sets was computed and was required to be less than 10% of the mean possible digits correct to establish equivalence. Spring Math has 130 measures. Thirty-eight measures were not tested for equivalence because there was no variation in possible digits correct per problem type. These measures all had single-digit answers and included measures like Sums to 6, Subtraction 0-5, and Number Names. Eighty-three measures met equivalence standards on the first round of testing, with a standard deviation of possible digits correct per problem per problem set that was on average 4% of the mean possible digits correct per problem. Seven measures required revision and a second round of testing. These measures included Mixed Fraction Operations, Multiply Fractions, Convert Improper to Mixed, Solve 2-Step Equations, Solve Equations with Percentages, Convert Fractions to Decimals, and Collect Like Terms. After revision and retesting, the standard deviation represented on average 4% of the mean. One measure, Order of Operations, required a third round of revision and retesting. On the third round, it met the equivalence criterion with the standard deviation representing on average 10% of the mean possible digits correct per problem across generated problem sets. In this section of the application, we report the results from a yearlong study in Louisiana during which screening measures were generated and administered to classes of children with a 1-week interval between assessment occasions. Measures were administered by researchers with rigorous integrity and inter-rater reliability controls in place.
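The equivalence criterion described above (standard deviation across generated problem sets below 10% of the mean possible digits correct) could be checked along these lines. This is a minimal sketch under stated assumptions: the data are made up, the function name is hypothetical, and the sample standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import mean, stdev


def problem_sets_equivalent(problem_sets, threshold=0.10):
    """Check equivalence of generated problem sets for one measure.

    problem_sets: list of problem sets, each a list of possible digits
        correct per problem.
    Returns (is_equivalent, ratio), where ratio is the SD of the per-set
    means divided by the mean of the per-set means.
    Assumption: sample SD (n - 1 denominator); the source does not specify.
    """
    set_means = [mean(ps) for ps in problem_sets]
    ratio = stdev(set_means) / mean(set_means)
    return ratio < threshold, ratio


# Hypothetical data: ten problem sets with similar difficulty
similar_sets = [[4, 5, 6]] * 5 + [[5, 5, 6]] * 5
ok, ratio = problem_sets_equivalent(similar_sets)
print(ok, round(ratio, 3))

# Hypothetical data: ten problem sets with very different difficulty
uneven_sets = [[2, 2]] * 5 + [[6, 6]] * 5
ok_uneven, ratio_uneven = problem_sets_equivalent(uneven_sets)
print(ok_uneven, round(ratio_uneven, 3))
```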
 *Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
 Reliability data were collected in three schools in southeastern Louisiana with appropriate procedural controls. Researchers administered the screening measures for the reliability study following an administration script. On 25% of testing occasions balanced across times (time 1 and time 2), grades, and classrooms, a second trained observer documented the percentage of correctly completed steps during screening administration. Average integrity (percentage of steps correctly conducted) was 99.36% with a less than perfect integrity score on only 4 occasions (one missed the sentence in the protocol telling students not to skip around, one exceeded the 2-min timing interval for one measure by 5 seconds, and on two occasions, students turned their papers over before being told to do so). Demographic data are provided in the table below for the reliability sample.
 *Describe the analysis procedures for each reported type of reliability.
 Spring Math uses 3-4 timed measures per screening occasion. The initial risk decision and subsequent classwide intervention risk decision are based on the set of measures as a whole and the subsequent risk during classwide intervention. For these reliability analyses, we report the Pearson r correlation coefficient (with 95% CI) for two occasions (fall and winter) for the generated (i.e., alternate form) measures administered 1 week apart.
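An alternate-form coefficient of this kind can be illustrated with a short sketch. The scores below are hypothetical, and the confidence interval uses the Fisher z transformation, which is a common choice but an assumption here, since the text does not state how the 95% CI was computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z transformation (assumed method)."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical scores for the same 8 students on two forms given 1 week apart
time1 = [10, 14, 18, 22, 25, 30, 33, 37]
time2 = [12, 13, 20, 21, 27, 28, 35, 36]

r = pearson_r(time1, time2)
lower, upper = fisher_ci(r, len(time1))
print(f"r = {r:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With a sample this small the interval is wide, which is one reason reporting the CI alongside r is informative.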
*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or interrater reliability coefficients).
Type of  Subgroup  Informant  Age / Grade  Test or Criterion  n  Median Coefficient  95% Confidence Interval Lower Bound  95% Confidence Interval Upper Bound 

 Results from other forms of reliability analysis not compatible with above table format:
 Manual cites other published reliability studies:
 No
 Provide citations for additional published studies.
 Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
 No
If yes, fill in data for each subgroup with disaggregated reliability data.
Type of  Subgroup  Informant  Age / Grade  Test or Criterion  n  Median Coefficient  95% Confidence Interval Lower Bound  95% Confidence Interval Upper Bound 

 Results from other forms of reliability analysis not compatible with above table format:
 Manual cites other published reliability studies:
 No
 Provide citations for additional published studies.
Validity
Grade 
Kindergarten

Grade 1

Grade 3

Grade 5

Grade 7


Rating 
 *Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
 *Describe the sample(s), including size and characteristics, for each validity analysis conducted.
 The sample size is included in the table. The demographics are similar to those described in the reliability section.
 *Describe the analysis procedures for each reported type of validity.
 We have reported the Pearson r correlation for theoretically anticipated convergent measures and theoretically anticipated discriminant measures.
*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.
Type of  Subgroup  Informant  Age / Grade  Test or Criterion  n  Median Coefficient  95% Confidence Interval Lower Bound  95% Confidence Interval Upper Bound 

 Results from other forms of validity analysis not compatible with above table format:
 Manual cites other published reliability studies:
 No
 Provide citations for additional published studies.
 Describe the degree to which the provided data support the validity of the tool.
 We see a pattern of correlations that supports multitrait, multimethod logic (Campbell & Fiske, 1959).
 Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
If yes, fill in data for each subgroup with disaggregated validity data.
Type of  Subgroup  Informant  Age / Grade  Test or Criterion  n  Median Coefficient  95% Confidence Interval Lower Bound  95% Confidence Interval Upper Bound 

 Results from other forms of validity analysis not compatible with above table format:
 Manual cites other published reliability studies:
 Provide citations for additional published studies.
Bias Analysis
Grade 
Kindergarten

Grade 1

Grade 3

Grade 5

Grade 7


Rating  Yes  Yes  Yes  Yes  Yes 
 Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
 Yes
 If yes,
 a. Describe the method used to determine the presence or absence of bias:
 We conducted a series of binary logistic regression analyses using Stata. Scoring below the 20th percentile on the Arizona year-end state test was the outcome criterion. The interaction term for each subgroup and the fall composite screening score, winter composite screening score, and classwide intervention risk is provided in the table below.
 b. Describe the subgroups for which bias analyses were conducted:
 Gender, Students with Disabilities, Ethnicity, and SES.
 c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
 None of the interactions were significant. Thus, screening accuracy did not differ significantly across subgroups.
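One way to write the kind of model described in this section is shown below. The notation is an assumption introduced for illustration (Y, S, G, and the coefficients are not named in the source): Y_i indicates scoring below the 20th percentile on the year-end test, S_i is a screening score (fall composite, winter composite, or classwide intervention risk), and G_i is a subgroup indicator.

```latex
\[
\mathrm{logit}\,\Pr(Y_i = 1) = \beta_0 + \beta_1 S_i + \beta_2 G_i + \beta_3 \,(S_i \times G_i)
\]
```

Under this formulation, the interaction coefficient \beta_3 captures whether the relationship between the screening score and year-end risk differs by subgroup, so a nonsignificant \beta_3 corresponds to the result reported above.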
Data Collection Practices
Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.