e-ISSN: 2618-589X

Original article | TAY Journal 2023, Vol. 7(4) 902-921

Determination of Type I Error and Power Rate in Differential Item Functioning by Several Methods

Şeyma Erbay Mermer, Yasemin Kuzu, Hülya Kelecioğlu

pp. 902-921 | DOI: https://doi.org/10.29329/tayjournal.2023.610.09 | Manuscript Number: tay journal.2023.052

Published online: October 30, 2023


This study compared methods based on Classical Test Theory and Item Response Theory in terms of their Type I error and power rates in detecting Differential Item Functioning (DIF). Logistic regression, Mantel-Haenszel, Lord's χ², Breslow-Day, and Raju's area index methods were used, and all analyses were conducted in the R programming language. The Type I error and power rates of these methods were determined under specified conditions through a simulation study. Data were generated with the WinGen 3 program, and analyses were conducted under eight conditions in total that varied sample size and the proportion of DIF items. The results indicate that, in general, as the proportion of items containing DIF increased, Type I error increased and power decreased. Lord's χ² and Raju's area index, the methods based on Item Response Theory, gave better results than the other methods, with low Type I error and high power.
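To make the two outcome measures concrete, the following is a minimal sketch of how Type I error and power rates can be estimated by simulation for one of the methods named above, Mantel-Haenszel. It is not the authors' procedure: the study used R and WinGen 3, whereas this sketch uses Python, assumes a Rasch model with N(0, 1) abilities and uniform DIF, and every function name and parameter value (`simulate_responses`, `mantel_haenszel_chi2`, `error_and_power`, `dif_shift`, the sample sizes) is a hypothetical choice made for illustration.

```python
import math
import random

CHI2_CRIT_05 = 3.841  # chi-square critical value for df = 1, alpha = 0.05


def simulate_responses(n_persons, difficulties, rng):
    """Simulate 0/1 responses under a Rasch model (assumed here), abilities ~ N(0, 1)."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        data.append([int(rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))))
                     for b in difficulties])
    return data


def mantel_haenszel_chi2(ref, foc, item):
    """Mantel-Haenszel chi-square for one item, matching examinees on total score."""
    strata = {}
    for group, rows in ((0, ref), (1, foc)):
        for row in rows:
            cell = strata.setdefault(sum(row), [[0, 0], [0, 0]])
            cell[group][row[item]] += 1  # cell[group][0 = wrong, 1 = right]
    sum_a = sum_e = sum_v = 0.0
    for r, f in strata.values():
        n_ref, n_foc = sum(r), sum(f)
        m1, m0 = r[1] + f[1], r[0] + f[0]
        total = n_ref + n_foc
        if n_ref == 0 or n_foc == 0:
            continue  # a stratum with only one group carries no information
        sum_a += r[1]                       # observed correct in reference group
        sum_e += n_ref * m1 / total         # expected count under no DIF
        sum_v += n_ref * n_foc * m1 * m0 / (total ** 2 * (total - 1))
    if sum_v == 0:
        return 0.0
    # Continuity-corrected statistic, truncated at zero (a conservative variant).
    return max(0.0, abs(sum_a - sum_e) - 0.5) ** 2 / sum_v


def error_and_power(n_per_group=500, n_items=20, n_dif=4, dif_shift=0.6,
                    n_reps=10, seed=1):
    """Return (Type I error rate, power rate) over n_reps simulated data sets."""
    rng = random.Random(seed)
    false_flags = hits = 0
    for _ in range(n_reps):
        b_ref = [rng.uniform(-1.5, 1.5) for _ in range(n_items)]
        # Uniform DIF: the first n_dif items are harder for the focal group.
        b_foc = [b + dif_shift if i < n_dif else b for i, b in enumerate(b_ref)]
        ref = simulate_responses(n_per_group, b_ref, rng)
        foc = simulate_responses(n_per_group, b_foc, rng)
        for item in range(n_items):
            flagged = mantel_haenszel_chi2(ref, foc, item) > CHI2_CRIT_05
            if item < n_dif:
                hits += flagged          # DIF item correctly flagged
            else:
                false_flags += flagged   # DIF-free item wrongly flagged
    type1 = false_flags / (n_reps * (n_items - n_dif))
    power = hits / (n_reps * n_dif)
    return type1, power


if __name__ == "__main__":
    type1, power = error_and_power()
    print(f"Type I error rate: {type1:.3f}, power rate: {power:.3f}")
```

In this sketch the Type I error rate is the share of DIF-free items that are flagged and the power rate is the share of true DIF items that are flagged, both pooled over replications. Raising `n_dif` relative to `n_items` contaminates the total-score matching variable, which is one plausible mechanism behind the pattern reported above.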

Keywords: IRT, DIF, Type I error, power

How to Cite this Article?

APA 6th edition
Mermer, S. E., Kuzu, Y., & Kelecioglu, H. (2023). Determination of Type I Error and Power Rate in Differential Item Functioning by Several Methods. TAY Journal, 7(4), 902-921. doi: 10.29329/tayjournal.2023.610.09

Mermer, S., Kuzu, Y. and Kelecioglu, H. (2023). Determination of Type I Error and Power Rate in Differential Item Functioning by Several Methods. TAY Journal, 7(4), pp. 902-921.

Chicago 16th edition
Mermer, Seyma Erbay, Yasemin Kuzu and Hulya Kelecioglu (2023). "Determination of Type I Error and Power Rate in Differential Item Functioning by Several Methods". TAY Journal 7 (4):902-921. doi:10.29329/tayjournal.2023.610.09.

