CS: MQP: DS: MQP: AI vs. Human: Identify AI-generated text against human-written text

Capobianco, Marc; Shah-Nathwani, Krish; Luong, Duong; Reynolds, Matthew; Phelan, Charles

Student Work

Recent advances in Large Language Models have taken the world by storm, with far-reaching implications and applications for everyday life. However, as with most things, these leaps forward in technology are also being misused with malicious intent. Models that can effectively replicate human speech have been used to plagiarize texts, spread false information, and displace workers from their careers. To use these models in a way that benefits society, it is extremely important to have methods in place for detecting non-human-generated content. However, as these models become ever more complex, simple solutions that worked for previous models are rapidly becoming obsolete. In this project, we explore the effectiveness of BERT- and RoBERTa-based machine-generated text detection in a supervised setting. We trained several BERT and RoBERTa models on 5%, 10%, 15%, 20%, and 100% of the dataset, each in frozen and unfrozen parameter variations. When trained on the more limited subsets, frozen BERT outperformed frozen RoBERTa, whereas unfrozen RoBERTa outperformed unfrozen BERT on the same limited data. When trained on the entire dataset, RoBERTa outperformed BERT in both the frozen and unfrozen variations.
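The experimental design described above (two models, frozen and unfrozen variations, five training-set fractions) can be laid out as a small grid. The following is only a minimal sketch in standard-library Python, not the project's actual code: the names `MODELS`, `FRACTIONS`, `subsample`, and `experiment_grid` are hypothetical, and uniform random sampling with a fixed seed is an assumption, since the abstract does not say how the subsets were drawn.

```python
import itertools
import random

# Hypothetical labels for the two model families named in the abstract.
MODELS = ["bert", "roberta"]
# Training-set fractions reported in the abstract.
FRACTIONS = [0.05, 0.10, 0.15, 0.20, 1.00]
# True = encoder parameters frozen, False = unfrozen (fully fine-tuned).
FROZEN = [True, False]


def subsample(examples, fraction, seed=0):
    """Return a random subset holding roughly `fraction` of `examples`.

    Assumption: uniform sampling without replacement; the report may
    have used a different (for example, stratified) scheme.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one example, but never more than are available.
    k = min(len(examples), max(1, round(len(examples) * fraction)))
    return random.Random(seed).sample(examples, k)


def experiment_grid():
    """Enumerate every (model, frozen, fraction) configuration."""
    return [
        {"model": model, "frozen": frozen, "fraction": fraction}
        for model, frozen, fraction in itertools.product(MODELS, FROZEN, FRACTIONS)
    ]
```

Under these assumptions the grid contains 2 × 2 × 5 = 20 configurations, and each configuration would train on `subsample(train_set, config["fraction"])`, where `train_set` is a placeholder for the labeled training examples.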

This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.
