The study aimed to compare the effectiveness of the DIFT and SIBTEST methods in detecting measurement invariance for tests according to two variables: sample size and the difference in the ability distribution. A factorial research design was used to examine the interaction between detection method, sample size, and ability-distribution differences by inspecting Type I error rates and statistical power. To achieve this aim, two studies were conducted: the first examined Type I error rates and the second examined the power of the statistical test, while controlling ability-distribution differences and sample size, using a repeated-measures experimental design with three experimental factors. The data were analyzed with the statistical procedure of each of the two detection methods to test the null hypothesis of no differential item functioning and to obtain Type I error rates and power rates. The resulting rates were then processed using mixed-design analysis of variance. Based on the results of the statistical analyses, the following conclusions were reached:
• Both the SIBTEST and DIFT methods were generally effective in detecting differential test functioning.
• The differential functioning of item bundles and tests (DIFT) method was more effective than the simultaneous item bias test (SIBTEST) when sample size was taken into account, namely with large samples (1000/1000) or more.
• The simultaneous item bias test (SIBTEST) was more effective in detecting differential functioning of item bundles and tests when there was no difference in the ability distributions. When the ability distributions differed, neither method was effective: DIFT suffered from low statistical power, while SIBTEST suffered from inflated Type I error. It is therefore recommended to use the two methods together to detect differential test functioning when the two groups differ in ability distribution.
Keywords: measurement invariance, differential test functioning, item response theory, DIFT method, SIBTEST method.
The aim of the current study was to compare the effectiveness of the DIFT and SIBTEST methods in detecting measurement invariance for tests according to sample size and differences in the ability distribution. A factorial experimental design was used to examine the interaction between detection method, sample size, and ability-distribution differences through Type I error rates and statistical power. Two studies were conducted: the first examined Type I error rates and the second examined power, while controlling ability-distribution differences and sample size. Data were analyzed with the statistical procedure of each detection method to test the null hypothesis of no differential functioning and to obtain Type I error rates and power. The resulting rates were processed using mixed-design analysis of variance. Based on the results of the statistical analyses, several findings were obtained: both the SIBTEST and DIFT methods were generally effective in detecting differential test functioning; the differential functioning of items and tests (DIFT) method was more effective than the simultaneous item bias test (SIBTEST) with large samples (1000/1000 or more); and SIBTEST was more effective in detecting differential functioning of item bundles and tests in the absence of ability-distribution differences. In the presence of ability-distribution differences, however, both methods were ineffective, as DIFT suffered from low statistical power and SIBTEST suffered from inflated Type I error rates. The study therefore recommends using both methods together to detect differential test functioning when the groups differ in ability distribution.
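To make the simulation design concrete, the sketch below (Python) illustrates how empirical Type I error and power rates can be estimated by crossing the two manipulated factors, sample size and ability-distribution difference. It is not the authors' code: the study itself relied on WinGen, DFIT8, and SIBTEST 1.1 (see the references), whereas this sketch uses a simplified, hypothetical SIBTEST-style statistic without the regression correction for impact, and the 2PL model, 20-item test, factor levels, and DIF effect size are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_responses(n, theta_mean, a, b):
    """Generate dichotomous 2PL item responses for one group."""
    theta = rng.normal(theta_mean, 1.0, size=n)
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # (n, n_items)
    return (rng.random(p.shape) < p).astype(int)

def dif_p_value(resp_ref, resp_foc, item):
    """Simplified SIBTEST-style statistic: weighted difference in scores on the
    studied item between groups matched on the rest score (total minus the item).
    The real SIBTEST adds a regression correction for impact; this sketch omits it."""
    rest_ref = resp_ref.sum(axis=1) - resp_ref[:, item]
    rest_foc = resp_foc.sum(axis=1) - resp_foc[:, item]
    n_total = len(rest_ref) + len(rest_foc)
    num = var = 0.0
    for k in np.union1d(rest_ref, rest_foc):
        yr = resp_ref[rest_ref == k, item]
        yf = resp_foc[rest_foc == k, item]
        if len(yr) < 2 or len(yf) < 2:
            continue                        # skip strata too small to contribute
        w = (len(yr) + len(yf)) / n_total
        num += w * (yr.mean() - yf.mean())
        var += w**2 * (yr.var(ddof=1) / len(yr) + yf.var(ddof=1) / len(yf))
    z = num / np.sqrt(var) if var > 0 else 0.0
    return 2.0 * (1.0 - stats.norm.cdf(abs(z)))            # two-sided p-value

def rejection_rate(n_per_group, ability_shift, dif_size,
                   n_items=20, reps=200, alpha=0.05):
    """Proportion of replications rejecting H0 (no DIF) for the studied item.
    With dif_size = 0 this estimates the Type I error rate; otherwise power."""
    a = rng.uniform(0.8, 1.6, n_items)       # discriminations, fixed across reps
    b = rng.normal(0.0, 1.0, n_items)        # difficulties, fixed across reps
    rejections = 0
    for _ in range(reps):
        b_foc = b.copy()
        b_foc[0] += dif_size                  # uniform DIF injected into item 0
        ref = simulate_responses(n_per_group, 0.0, a, b)
        foc = simulate_responses(n_per_group, -ability_shift, a, b_foc)
        if dif_p_value(ref, foc, item=0) < alpha:
            rejections += 1
    return rejections / reps

# Cross the two manipulated factors: sample size and ability-distribution difference.
for n in (500, 1000):
    for shift in (0.0, 1.0):
        t1 = rejection_rate(n, shift, dif_size=0.0)
        pw = rejection_rate(n, shift, dif_size=0.6)
        print(f"n={n}/{n}, ability shift={shift}: Type I error={t1:.3f}, power={pw:.3f}")
```

In the full study, rejection rates of this kind, collected for every combination of method, sample size, and ability-distribution difference, are the quantities subsequently analyzed with the mixed-design ANOVA.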
Alam, M. (2003). Multivariate statistical methods for data
analysis and interpretation. CABI.
Al-Jadai, K. S. (2005). Decision-making
techniques: Computer applications. Riyadh: Dar Al Assahab for Publishing and
Distribution.
American Educational Research Association, American Psychological Association,
& National Council on Measurement in Education. (1999). Standards for
educational and psychological testing. Washington, DC: Author.
American Psychological Association. (1988). Code of fair testing practices in
education. Washington, DC: Author.
Angoff, W. H. (1972). A technique for the investigation of
cultural differences. Paper presented at the annual meeting of the American
Psychological Association, Honolulu. (ERIC Document Reproduction Service NO. ED
069686)
Bolt, D. M. (2002). A Monte Carlo Comparison of Parametric
and Nonparametric Polytomous DIF Detection Methods. Applied Measurement in
Education, 15(2), 113–141.
https://doi-org.sdl.idm.oclc.org/10.1207/S15324818AME1502_01
Bolt, D. M. (2002). Analyzing the bias and impact of test
items. Educational Measurement: Issues and Practice, 21(2), 18-31.
https://doi.org/10.1111/j.1745-3992.2002.tb00113.x
Chen, G. (2019). Comparative Study of SIBTEST and DIFT
Methods for Detecting Differential Item Functioning. Journal of Educational
Measurement, 56(1), 65-78. doi: 10.1111/jedm.12196
Cohen, J. (2005). Statistical power analysis for the
behavioral sciences (2nd ed.). Routledge.
Dorans, N. J., & Kulick, E. M. (1983). Assessing unexpected differential item
performance of female candidates on SAT and TSWE forms administered in
December 1977: An application of the standardization approach (ETS Technical
Report RR-83-9). Princeton, NJ: ETS.
Facteau, J. D., & Craig, S. B. (2001). Are performance
appraisal ratings from different rating sources comparable? Journal of Applied
Psychology, 86(2), 215-227.
https://search-ebscohost-com.sdl.idm.oclc.org/login.aspx?direct=true&db=edselc&AN=edselc.2-52.0-0035316436&site=eds-live
Facteau, J. D., & Craig, S. B. (2001). DIF analysis:
Simulation and exploratory data analyses. Educational and Psychological
Measurement, 61(3), 373-396. https://doi.org/10.1177/00131640121971294
Gierl, M. J., Bisanz, J., Bisanz, G. L., Boughton, K. A., & Khaliq, S. N.
(2001). Illustrating the Utility of Differential Bundle Functioning Analyses
to Identify and Interpret Group Differences on Achievement Tests. Educational
Measurement: Issues and Practice, 20(2), 26-36.
https://doi-org.sdl.idm.oclc.org/10.1111/j.1745-3992.2001.tb00060.x
Girden, E. R. (2005). ANOVA: Repeated measures. Sage
Publications. https://doi.org/10.4135/9781412985231
Gotzmann, A. (2001). Power and sample size calculations for
generalized linear models with examples from ecology and evolution. Journal of
Statistical Computation and Simulation, 69(2), 155-174.
https://doi.org/10.1080/00949650008812041
Han, K. T. (2007). WinGen3: Windows software that generates
IRT parameters and item responses [computer program]. Amherst, MA: University
of Massachusetts, Center for Educational Assessment. Retrieved May 13, 2007,
from https://www.umass.edu/remp/software/simcata/wingen/
Han, K. T. (2009). IRTEQ: Windows application that
implements IRT scaling and equating [computer program]. Applied Psychological
Measurement, 33(6), 491-493. https://www.umass.edu/remp/software/simcata/irteq/
Holland, P. W., & Thayer, D. T. (1988). Differential
item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I.
Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.
Johnson, R. B. (2009). Learning SAS by example: A
programmer's guide (2nd ed.). SAS Institute.
Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop,
J. L. (1981). Item Bias in a Test of Reading Comprehension. Applied
Psychological Measurement, 5(2), 159-173.
https://doi-org.sdl.idm.oclc.org/10.1177/014662168100500202
Liu, X. (2018). Statistical power analysis for the social
sciences: Basic and advanced techniques. Routledge.
Lord, F. M. (1980).
Applications of item response theory to practical problems. Hillsdale, NJ:
Erlbaum.
Mantel, N., & Haenszel, W. (1959). Statistical aspects
of the analysis of data from retrospective studies of disease. Journal of the
National Cancer Institute, 22(4), 719–748. doi: 10.1093/jnci/22.4.719
Maurer, T. W., Hirsch, M. W., & Moskowitz, J. (1998). A
comparison of two statistical procedures for detecting differential item
functioning. Journal of Educational Measurement, 35(1), 47-67.
https://doi.org/10.1111/j.1745-3984.1998.tb00575.x
Nandakumar, R. (1993). Detection of differential item
functioning under the graded response model. Applied Psychological Measurement,
17(4), 355-365. doi: 10.1177/014662169301700406
Nandakumar, R. (1993). Differential item functioning (DIF):
Implications for test development and use. Educational Measurement: Issues and
Practice, 12(3), 17-23. https://doi.org/10.1111/j.1745-3992.1993.tb00529.x
Nandakumar, R. (1993). Simultaneous DIF Amplification and
Cancellation: Shealy-Stout's Test for DIF. Journal of Educational Measurement,
30(4), 293-311.
https://doi-org.sdl.idm.oclc.org/10.1111/j.1745-3984.1993.tb00428.x
Narayanan, P., & Swaminathan, H. (1994). Performance of
the Mantel-Haenszel and Simultaneous Item Bias Procedures for Detecting
Differential Item Functioning. Applied Psychological Measurement, 18(4),
315–328. https://doi-org.sdl.idm.oclc.org/10.1177/014662169401800403
Narayanan, P., & Swaminathan, H. (1994). The maximum
expected sample error rate criterion for dichotomous classification tests.
Journal of Educational Measurement, 31(3), 235-251. doi:
10.1111/j.1745-3984.1994.tb00487.x
Oshima, T. C.,
Kushubar, S., Scott, J. C., & Raju, N. S. (2009). DFIT8 for Windows user's
manual: Differential functioning of items and tests. St. Paul, MN: Assessment
Systems Corporation.
Oshima, T. C., Raju,
N. S., & Flowers, C. P. (1997). Development and Demonstration of
Multidimensional IRT-Based Internal Measures of Differential Functioning of
Items and Tests. Journal of Educational Measurement, 34(3), 253–272.
Oshima, Y., Nishii, R., Takane, Y., & Eguchi, S. (2009).
DIFT: A new method for differential item functioning detection. Applied
Psychological Measurement, 33(6), 419-434.
https://doi.org/10.1177/0146621608326534
Popham, W. J. (1995). Classroom assessment: What teachers
need to know. Needham Heights, MA: Allyn & Bacon.
Raju, N. S. (1999). Detecting differential item functioning
using logistic regression procedures. Journal of Educational Measurement,
36(2), 99-119. https://doi.org/10.1111/j.1745-3984.1999.tb00567.x
Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A
Description and Demonstration of the Polytomous-DFIT Framework. Applied
Psychological Measurement, 23(4), 309-326.
https://doi-org.sdl.idm.oclc.org/10.1177/01466219922031437
Oshima, T. C., Raju, N. S., Flowers, C. P., & Slinde, J.
A. (1998). Differential Bundle Functioning Using the DFIT Framework: Procedures
for Identifying Possible Sources of Differential Functioning. Applied
Measurement in Education, 11(4), 353-369.
Raju, N. S., van der Linden, W. J., & Fleer, P. F.
(1995). IRT-based internal measures of differential functioning of items and
tests. Educational Testing Service.
https://doi.org/10.1002/j.2333-8504.1995.tb01614.x
Raju, N., Oshima, T., & Nanda, A. (2006). A New Method
for Assessing the Statistical Significance in the Differential Functioning of
Items and Tests (DFIT) Framework. Journal of Educational Measurement, 43(1),
1–17. https://doi-org.sdl.idm.oclc.org/10.1111/j.1745-3984.2006.00001.x
Ramsay, J. O. (1991). Kernel smoothing approaches to
nonparametric item characteristic curve estimation.
https://doi-org.sdl.idm.oclc.org/10.1007/bf02294494
Robie, C., Zickar, M.
J., & Schmit, M. J. (2001). Measurement Equivalence Between Applicant and
Incumbent Groups: An IRT Analysis of Personality Scales. Human Performance,
14(2), 187–207. https://doi-org.sdl.idm.oclc.org/10.1207/S15327043HUP1402_04
Raju, N. S., van der Linden, W. J., & Fleer, P. F.
(1995). IRT-Based Internal Measures of Differential Functioning of Items and
Tests. Applied Psychological Measurement, 19(4), 353–368.
https://doi-org.sdl.idm.oclc.org/10.1177/014662169501900405
Roussos, L. A., & Stout, W. F. (1996). A
multidimensionality-based DIF analysis paradigm. Applied Psychological
Measurement, 20(4), 355-371. https://doi.org/10.1177/014662169
Russell, M. K. (2005). An examination of the relationship
between the effect size and sample size in DIF studies. Educational and
Psychological Measurement, 65(1), 9-23.
https://doi.org/10.1177/0013164404265332
Russell, S. (2005). Estimates of Type I error and power for indices of
differential bundle and test functioning. Unpublished doctoral dissertation,
Bowling Green State University.
Shealy, R., & Stout, W. (1993). A model-based
standardization approach that separates true bias/DIF from group ability
differences and detects test bias/DTF as well as item bias/DIF. Psychometrika,
58(2), 159–194. https://doi-org.sdl.idm.oclc.org/10.1007/bf02294572
Stark, S., Chernyshenko, O. S., Chan, K.-Y., Lee, W. C.,
& Drasgow, F. (2001). Effects of the testing situation on item responding:
Cause for concern. https://doi-org.sdl.idm.oclc.org/10.1037/0021-9010.86.5.943
Stevens, J. (2002). Applied multivariate statistics for the
social sciences. Lawrence Erlbaum Associates.
Stout, W., &
Roussos, L. (2005). SIBTEST 1.1: IRT-based educational psychological
measurement software [computer program]. Urbana-Champaign: University of
Illinois, Department of Statistics. St. Paul, MN: Assessment Systems
Corporation.
Swaminathan, H., & Rogers, H. J. (1990). Detecting
Differential Item Functioning Using Logistic Regression Procedures. Journal of
Educational Measurement, 27(4), 361–370.
https://search-ebscohost-com.sdl.idm.oclc.org/login.aspx?direct=true&db=edsjsr&AN=edsjsr.1434855&site=eds-live
Teresi, J. A., & Fleishman, J. A. (2007). Differential
item functioning and health assessment. Quality of Life Research: An
International Journal of Quality of Life Aspects of Treatment, Care and
Rehabilitation, 16 Suppl 1, 33–42. https://doi-org.sdl.idm.oclc.org/10.1007/s11136-007-9184-6
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of
item response theory in the study of group differences in trace lines. In H.
Wainer & H. I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ:
Lawrence Erlbaum.
Winer, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.