WIC 2014 Tutorial - Plagiarism Detection in Digital Documents

Plagiarism Detection in Digital Documents: Some text mining techniques, algorithms and proposals for stopping one of the worst problems on the Web

Juan D. Velásquez - Web Intelligence Consortium Chile Research Centre & Department of Industrial Engineering, Universidad de Chile, Chile

In this tutorial, we will review some techniques and algorithms for detecting plagiarism in digital documents, with practical applications in educational institutions.
Traditional methods for automatic plagiarism detection compute similarity measures on a document-to-document basis, but this is not always possible because the potential source documents are not always available. By applying new text mining techniques and algorithms, we can extract more precise patterns for discovering possibly plagiarized paragraphs, for instance by using word usage as a linguistic feature to model the writing style present in a document.
Educational institutions have confronted plagiarism from different points of view, ranging from internal regulations to the adoption of software systems that discourage copy-and-paste practices. The main problem in education, beyond the moral and ethical issues, is that students who write a report by copying text from other sources without the corresponding citation learn nothing. In this tutorial we will also explore some software solutions for stopping plagiarism.
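To make the writing-style idea above more concrete, here is a minimal Python sketch of intrinsic detection. It is a simplified stand-in and not the method of the cited work: the function names, the fixed-size token window, and the mean-absolute-gap deviation measure are all assumptions chosen for illustration. The sketch compares each segment's word frequencies with those of the whole document and ranks segments by how far they deviate.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_deviation(segment_tokens, doc_freq, doc_total):
    """Mean absolute gap between each word's relative frequency in the
    segment and in the whole document (a hypothetical measure)."""
    seg_total = len(segment_tokens)
    if seg_total == 0:
        return 0.0
    seg_freq = Counter(segment_tokens)
    gaps = [abs(count / seg_total - doc_freq[word] / doc_total)
            for word, count in seg_freq.items()]
    return sum(gaps) / len(gaps)


def rank_segments(text, window=50):
    """Split the document into windows of `window` tokens and return
    (segment_index, deviation) pairs, most deviant segment first."""
    tokens = tokenize(text)
    doc_freq = Counter(tokens)
    segments = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    scores = [style_deviation(seg, doc_freq, len(tokens)) for seg in segments]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
```

Segments at the top of the ranking use words in a way that differs most from the rest of the document, so they are candidates for manual review rather than proof of plagiarism. No source documents are needed, which is the point of the intrinsic approach.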


  • Plagiarism in digital documents: Fixing the problem [2].
  • From text to vectors [7].
  • Intrinsic Plagiarism Detection [6].
  • Authorship Detection [1].
  • Quotation Detection. External Plagiarism Detection [5].
  • Collusion Analysis [3].
  • Indexing and Querying Approach
  • Systems for stopping plagiarism [4].
  • A proposed solution.
  • Conclusions
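
As a rough illustration of the "From text to vectors" item in the outline above, the following Python sketch turns two documents into raw term-frequency vectors and compares them with cosine similarity, in the spirit of the vector space model [7]. The function names and the use of unweighted term counts are assumptions made for brevity; a practical system would typically add term weighting and stop-word handling.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a document to a sparse vector of raw term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A suspicious document can then be scored against each available candidate source, for example with cosine_similarity(term_vector(report), term_vector(source)). This is the document-to-document comparison that external detection relies on, and it only works when the source documents are available.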


  1. Baayen, H., van Halteren, H., & Tweedie, F. (1996). Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11, 121-132.
  2. Fialkoff, F. (1993). There's no excuse for plagiarism.
  3. Irving, R. (2004). Plagiarism and collusion detection using the Smith-Waterman algorithm. Technical Report University of Glasgow, Department of Computing Science.
  4. Kang, N., & Han, S. (2006). Document copy detection system based on plagiarism patterns. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing (pp. 571-574). Springer Berlin / Heidelberg volume 3878 of Lecture Notes in Computer Science.
  5. Oberreuter, G., L'Huillier, G., Ríos, S. A., & Velásquez, J. D. (2011). Approaches for intrinsic and external plagiarism detection - Notebook for PAN at CLEF 2011. In V. Petras, P. Forner, & P. D. Clough (Eds.), CLEF (Notebook Papers/Labs/Workshop).
  6. Oberreuter, G., & Velásquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40, 3756-3763.
  7. Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18, 613-620.


Dr. Juan D. Velásquez is an Associate Professor in the Industrial Engineering Department at the University of Chile. During his academic career, he has advised more than 80 reports and theses (master's and doctoral) and has written more than 100 scientific publications and book chapters. He is the author of the book Adaptive Web Site: A Knowledge Extraction from Web Data Approach, published by IOS Press in 2008, and editor-in-chief of the book series Advanced Techniques in Web Intelligence, parts 1 and 2, published by Springer-Verlag in September 2010 and 2012. He has been a visiting professor at the Center for Collaborative Research, University of Tokyo, Japan, and Technology, and the VSB Ostrava Technical University, Czech Republic, besides being a guest lecturer in more than 10 countries. In 2009, he was the General Chair of the International Knowledge Engineering System (KES) Conference, which was held for the first time in Latin America (Santiago, Chile, September). He is Director of the Web Intelligence Consortium Chile Centre (wi.dii.uchile.cl), whose main focus is research in web mining, web personalization, business intelligence and social network analysis. He has also worked as an IT consultant, company owner and IT manager, and as a leader of complex engineering projects.



University of Warsaw - WIC 2014 venue
