Journal of Educational and Social Sciences (مجلة العلوم التربوية والاجتماعية)

Comparing the Effectiveness of Two Methods for Detecting Measurement Invariance at the Test Level (DIFT and SIBTEST) in Light of Differences in Ability Distribution and Sample Size

Dr. Abdulrahman A. Alnofei
Associate Professor of Psychological Measurement and Evaluation
Department of Psychology - Faculty of Education - Umm Al-Qura University
Email: alnofei@gmail.com

General specialization: Psychology

Specific specialization: Measurement and Evaluation

DOI: 10.36046/2162-000-019-019
Abstract

The aim of the study was to compare the effectiveness of the DIFT and SIBTEST methods in detecting measurement invariance at the test level according to sample size and differences in ability distribution. A factorial research design was used to examine the interaction between detection method, sample size, and ability-distribution differences, as reflected in Type I error rates and statistical power. Two studies were conducted: the first examined Type I error rates and the second examined power, with ability-distribution differences and sample size controlled in a three-factor repeated-measures experimental design. For each detection method, the data were analyzed with its statistical test of the null hypothesis of no differential functioning to obtain Type I error and power rates, and those rates were then processed with a mixed-design analysis of variance. The statistical analyses supported the following conclusions: both SIBTEST and DIFT were effective in detecting differential test functioning in general; the differential functioning of item bundles and tests (DIFT) method was more effective than the simultaneous item bias test (SIBTEST) at large sample sizes of 1000/1000 or more; and SIBTEST was more effective in detecting differential functioning of item bundles and tests when there was no difference in ability distribution. When the ability distributions differed, neither method was effective, as DIFT suffered from low statistical power and SIBTEST from inflated Type I error rates. The study therefore recommends using the two methods together to detect differential test functioning when the two groups differ in ability distribution.
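For context, the two procedures rest on different test statistics. The equations below sketch them as they are commonly defined in the literature cited in the references (Raju, van der Linden, & Fleer, 1995, for the DFIT/DIFT framework; Shealy & Stout, 1993, for SIBTEST); the notation is illustrative and not taken from the study itself.

```latex
% True score for group g (reference R or focal F) on an n-item test, and the
% DTF index: the expected squared true-score difference, taken over the focal
% group's ability distribution (Raju et al., 1995).
\begin{align}
  T_g(\theta) &= \sum_{i=1}^{n} P_{ig}(\theta), \qquad g \in \{R, F\}, \\
  \mathrm{DTF} &= \mathbb{E}_F\!\left[ D(\theta)^2 \right]
               = \sigma_D^2 + \mu_D^2,
  \qquad D(\theta) = T_F(\theta) - T_R(\theta), \\
% SIBTEST: weighted difference of regression-corrected means on the studied
% subtest, stratified by valid-subtest score k; under H_0 the standardized
% statistic B is approximately standard normal (Shealy & Stout, 1993).
  \hat{\beta}_{\mathrm{UNI}} &= \sum_{k} \hat{p}_k
      \left( \bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk} \right),
  \qquad B = \frac{\hat{\beta}_{\mathrm{UNI}}}
                  {\hat{\sigma}\big( \hat{\beta}_{\mathrm{UNI}} \big)}.
\end{align}
```

SIBTEST's regression correction is what compensates for group differences on the matching score, which is consistent with the finding above that its Type I error inflates when the ability distributions diverge.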
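To make the two-study design concrete, the following is a minimal Monte Carlo sketch in Python. It assumes a 2PL generating model and uses a simplified, regression-uncorrected SIBTEST-style statistic as a stand-in for the actual DIFT and SIBTEST procedures; every parameter value (30 items, a 5-item studied bundle, a 0.5 difficulty shift, 200 replications) is a hypothetical choice for illustration, not a condition from the study.

```python
# Monte Carlo sketch of the Type I error / power design. Hypothetical values;
# the detection statistic is a crude stand-in, not the published procedures.
import numpy as np

rng = np.random.default_rng(42)

def simulate_responses(thetas, a, b):
    """Generate dichotomous 2PL responses (examinees x items)."""
    p = 1.0 / (1.0 + np.exp(-a * (thetas[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

def sib_like_statistic(ref, foc, studied):
    """Crude SIBTEST-flavored statistic: difference in studied-subtest means,
    stratified by the rest (matching) score, weighted by focal frequency."""
    rest_r = ref[:, ~studied].sum(axis=1)
    rest_f = foc[:, ~studied].sum(axis=1)
    beta, var = 0.0, 0.0
    for k in np.union1d(rest_r, rest_f):
        yr = ref[rest_r == k][:, studied].sum(axis=1)
        yf = foc[rest_f == k][:, studied].sum(axis=1)
        if len(yr) < 2 or len(yf) < 2:
            continue                       # skip sparse strata
        w = len(yf) / len(foc)
        beta += w * (yr.mean() - yf.mean())
        var += w**2 * (yr.var(ddof=1) / len(yr) + yf.var(ddof=1) / len(yf))
    return beta / np.sqrt(var) if var > 0 else 0.0

def rejection_rate(n_per_group, focal_mean, dif_shift, reps=200, crit=1.96):
    """Proportion of replications in which |B| exceeds the critical value."""
    n_items = 30
    a = rng.uniform(0.8, 2.0, n_items)               # discriminations
    b = rng.normal(0.0, 1.0, n_items)                # difficulties
    studied = np.zeros(n_items, dtype=bool)
    studied[:5] = True                               # 5-item studied bundle
    b_focal = b + np.where(studied, dif_shift, 0.0)  # DTF present iff shift != 0
    hits = 0
    for _ in range(reps):
        theta_r = rng.normal(0.0, 1.0, n_per_group)
        theta_f = rng.normal(focal_mean, 1.0, n_per_group)  # ability gap
        ref = simulate_responses(theta_r, a, b)
        foc = simulate_responses(theta_f, a, b_focal)
        if abs(sib_like_statistic(ref, foc, studied)) > crit:
            hits += 1
    return hits / reps

# Study 1: Type I error (no shift); Study 2: power (bundle shifted by 0.5),
# crossed with sample size and an ability-distribution gap for the focal group.
for n in (500, 1000):
    for gap in (0.0, -1.0):
        t1 = rejection_rate(n, gap, 0.0)
        pw = rejection_rate(n, gap, 0.5)
        print(f"n={n}/{n}, ability gap={gap}: Type I error={t1:.3f}, power={pw:.3f}")
```

Because the stand-in omits the regression correction, the null condition with an ability gap should reproduce, in miniature, the Type I error inflation reported for SIBTEST above.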
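Finally, a sketch of the reported processing step, in which per-condition Type I error or power rates are submitted to a mixed-design ANOVA with detection method as the repeated (within) factor and sample size as a between factor. It uses the pingouin package, and the data frame here is fabricated purely to show the call; it is not the study's data.

```python
# Hypothetical illustration of the final processing step: mixed-design ANOVA
# on per-condition rejection rates (made-up numbers).
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
rows = []
for cell in range(40):                               # 40 simulated condition cells
    n_group = "large" if cell % 2 else "small"       # between factor: sample size
    base = 0.80 if n_group == "large" else 0.55
    for method in ("DIFT", "SIBTEST"):               # within factor: method,
        rows.append({                                # applied to the same data
            "cell": cell,
            "n_group": n_group,
            "method": method,
            "power": float(np.clip(base + (0.05 if method == "DIFT" else 0.0)
                                   + rng.normal(0, 0.05), 0, 1)),
        })
df = pd.DataFrame(rows)

# Mixed ANOVA: method (within, repeated over cells) x sample size (between)
aov = pg.mixed_anova(data=df, dv="power", within="method",
                     subject="cell", between="n_group")
print(aov.round(3))
```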

Keywords: Measurement invariance, differential test functioning, item response theory, DIFT method, SIBTEST method.
References

Alam, M. (2003). Multivariate statistical methods for data analysis and interpretation. CABI.

Al-Jadai, K. S. (2005). Decision-making techniques: Computer applications. Riyadh: Dar Al Assahab for Publishing and Distribution.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

American Psychological Association. (1988). Code of fair testing practices in education. Washington, DC: Author.

Angoff, W. H. (1972). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu. (ERIC Document Reproduction Service No. ED 069686)

Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113-141. https://doi.org/10.1207/S15324818AME1502_01

Bolt, D. M. (2002). Analyzing the bias and impact of test items. Educational Measurement: Issues and Practice, 21(2), 18-31. https://doi.org/10.1111/j.1745-3992.2002.tb00113.x

Chen, G. (2019). Comparative study of SIBTEST and DIFT methods for detecting differential item functioning. Journal of Educational Measurement, 56(1), 65-78. https://doi.org/10.1111/jedm.12196

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Dorans, N. J., & Kulick, E. M. (1983). Assessing unexpected differential item performance of female candidates on SAT and TSWE forms administered in December 1977: An application of the standardization approach (ETS Research Report No. RR-83-9). Princeton, NJ: Educational Testing Service.

Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86(2), 215-227.

Facteau, J. D., & Craig, S. B. (2001). DIF analysis: Simulation and exploratory data analyses. Educational and Psychological Measurement, 61(3), 373-396. https://doi.org/10.1177/00131640121971294

Gierl, M. J., Bisanz, J., Bisanz, G. L., Boughton, K. A., & Khaliq, S. N. (2001). Illustrating the utility of differential bundle functioning analyses to identify and interpret group differences on achievement tests. Educational Measurement: Issues and Practice, 20(2), 26-36. https://doi.org/10.1111/j.1745-3992.2001.tb00060.x

Girden, E. R. (1992). ANOVA: Repeated measures. Newbury Park, CA: Sage Publications. https://doi.org/10.4135/9781412985231

Gotzmann, A. (2001). Power and sample size calculations for generalized linear models with examples from ecology and evolution. Journal of Statistical Computation and Simulation, 69(2), 155-174. https://doi.org/10.1080/00949650008812041

Han, K. T. (2007). WinGen3: Windows software that generates IRT parameters and item responses [Computer program]. Amherst, MA: University of Massachusetts, Center for Educational Assessment. Retrieved May 13, 2007, from https://www.umass.edu/remp/software/simcata/wingen/

Han, K. T. (2009). IRTEQ: Windows application that implements IRT scaling and equating [Computer program]. Applied Psychological Measurement, 33(6), 491-493. https://www.umass.edu/remp/software/simcata/irteq/

Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.

Johnson, R. B. (2009). Learning SAS by example: A programmer's guide (2nd ed.). SAS Institute.

Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5(2), 159-173. https://doi.org/10.1177/014662168100500202

Liu, X. (2018). Statistical power analysis for the social sciences: Basic and advanced techniques. Routledge.

 Lord, F. M. (1980). Applications of item response theory to practical problems. Hillsdale, NJ: Erlbaum.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4), 719-748. https://doi.org/10.1093/jnci/22.4.719

Maurer, T. W., Hirsch, M. W., & Moskowitz, J. (1998). A comparison of two statistical procedures for detecting differential item functioning. Journal of Educational Measurement, 35(1), 47-67. https://doi.org/10.1111/j.1745-3984.1998.tb00575.x


Nandakumar, R. (1993). Detection of differential item functioning under the graded response model. Applied Psychological Measurement, 17(4), 355-365. https://doi.org/10.1177/014662169301700406

Nandakumar, R. (1993). Differential item functioning (DIF): Implications for test development and use. Educational Measurement: Issues and Practice, 12(3), 17-23. https://doi.org/10.1111/j.1745-3992.1993.tb00529.x

Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stout's test for DIF. Journal of Educational Measurement, 30(4), 293-311. https://doi.org/10.1111/j.1745-3984.1993.tb00428.x

Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18(4), 315-328. https://doi.org/10.1177/014662169401800403

Narayanan, P., & Swaminathan, H. (1994). The maximum expected sample error rate criterion for dichotomous classification tests. Journal of Educational Measurement, 31(3), 235-251. https://doi.org/10.1111/j.1745-3984.1994.tb00487.x

Oshima, T. C., Kushubar, S., Scott, J. C., & Raju, N. S. (2009). DFIT8 for Windows user's manual: Differential functioning of items and tests. St. Paul, MN: Assessment Systems Corporation.

 Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Development and Demonstration of Multidimensional IRT-Based Internal Measures of Differential Functioning of Items and Tests. Journal of Educational Measurement, 34(3), 253–272.

Oshima, Y., Nishii, R., Takane, Y., & Eguchi, S. (2009). DIFT: A new method for differential item functioning detection. Applied Psychological Measurement, 33(6), 419-434. https://doi.org/10.1177/0146621608326534

Popham, W. J. (1995). Classroom assessment: What teachers need to know. Needham Heights, MA: Allyn & Bacon.

Raju, N. S. (1999). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 36(2), 99-119. https://doi.org/10.1111/j.1745-3984.1999.tb00567.x

Raju, N. S., Oshima, T. C., & Flowers, C. P. (1999). A description and demonstration of the polytomous-DFIT framework. Applied Psychological Measurement, 23(4), 309-326. https://doi.org/10.1177/01466219922031437

Raju, N. S., Oshima, T. C., Flowers, C. P., & Slinde, J. A. (1998). Differential bundle functioning using the DFIT framework: Procedures for identifying possible sources of differential functioning. Applied Measurement in Education, 11(4), 353-369.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Educational Testing Service. https://doi.org/10.1002/j.2333-8504.1995.tb01614.x


Raju, N., Oshima, T., & Nanda, A. (2006). A new method for assessing the statistical significance in the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43(1), 1-17. https://doi.org/10.1111/j.1745-3984.2006.00001.x

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56(4), 611-630. https://doi.org/10.1007/BF02294494

Robie, C., Zickar, M. J., & Schmit, M. J. (2001). Measurement equivalence between applicant and incumbent groups: An IRT analysis of personality scales. Human Performance, 14(2), 187-207. https://doi.org/10.1207/S15327043HUP1402_04

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353-368. https://doi.org/10.1177/014662169501900405

Roussos, L. A., & Stout, W. F. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20(4), 355-371. https://doi.org/10.1177/014662169

Russell, M. K. (2005). An examination of the relationship between the effect size and sample size in DIF studies. Educational and Psychological Measurement, 65(1), 9-23. https://doi.org/10.1177/0013164404265332

Russell, S. (2005). Estimates of Type I error and power for indices of differential bundle and test functioning (Unpublished doctoral dissertation). Bowling Green State University.

Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194. https://doi.org/10.1007/BF02294572


Stark, S., Chernyshenko, O. S., Chan, K.-Y., Lee, W. C., & Drasgow, F. (2001). Effects of the testing situation on item responding: Cause for concern. Journal of Applied Psychology, 86(5), 943-953. https://doi.org/10.1037/0021-9010.86.5.943

Stevens, J. (2002). Applied multivariate statistics for the social sciences. Lawrence Erlbaum Associates.

Stout, W., & Roussos, L. (2005). SIBTEST 1.1: IRT-based educational psychological measurement software [Computer program]. Urbana-Champaign: University of Illinois, Department of Statistics; St. Paul, MN: Assessment Systems Corporation.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.

Teresi, J. A., & Fleishman, J. A. (2007). Differential item functioning and health assessment. Quality of Life Research, 16(Suppl. 1), 33-42. https://doi.org/10.1007/s11136-007-9184-6

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum.

Winer, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.