Core Concept
While not a replacement for traditional usability testing, AI-powered tools like UX-LLM offer a valuable supplementary approach for identifying usability issues, particularly for smaller teams or less common user paths.
Abstract
Bibliographic Information:
Ebrahimi Pourasad, A., & Maalej, W. (2024). Does GenAI Make Usability Testing Obsolete? arXiv preprint arXiv:2411.00634.
Research Objective:
This paper investigates the potential of Generative AI, specifically Large Language Models (LLMs), to support and potentially automate usability evaluations for mobile applications. The research aims to determine the accuracy of an LLM-based tool, UX-LLM, in predicting usability issues and compare its performance to traditional usability evaluation methods.
Methodology:
The researchers developed UX-LLM, a tool that leverages LLMs to identify usability issues in iOS apps using app context, source code, and view images. To evaluate UX-LLM, the researchers selected two open-source iOS apps and conducted three parallel usability evaluations: UX-LLM analysis, expert reviews, and usability testing with 10 participants. Two UX experts assessed the usability issues identified by each method to determine precision and recall. Additionally, a focus group with a student development team explored the perceived usefulness and integration challenges of UX-LLM in a real-world project.
Key Findings:
- UX-LLM demonstrated moderate to good precision (0.61-0.66) in identifying valid usability issues but lower recall (0.35-0.38), indicating it can detect issues but may miss a significant portion.
- Compared to expert reviews and usability testing, UX-LLM provided unique insights, particularly for less common user paths and code-level issues, but missed broader contextual or navigation-related problems.
- The student development team perceived UX-LLM as a valuable supplementary tool, appreciating its ability to uncover overlooked issues and provide actionable feedback. However, they highlighted integration challenges and suggested improvements like IDE integration and solution proposals.
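To make the two metrics above concrete: precision is the share of issues a tool reports that are valid, and recall is the share of all valid issues that the tool finds. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code. The function name `precision_recall` and the issue IDs are hypothetical, and the sample data is invented rather than taken from the study.

```python
def precision_recall(reported, valid):
    """Return (precision, recall) for reported issues against a set of valid issues."""
    reported, valid = set(reported), set(valid)
    true_positives = len(reported & valid)  # reported issues that are also valid
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(valid) if valid else 0.0
    return precision, recall


# Hypothetical issue IDs, invented for illustration only (not data from the study).
reported_by_tool = {"a", "b", "c", "d"}
ground_truth = {"a", "b", "c", "x", "y", "z"}

precision, recall = precision_recall(reported_by_tool, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

A tool with this profile, like the one in the findings, is right about most of what it flags while still leaving a large share of the known issues unreported.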
Main Conclusions:
The study concludes that while GenAI-powered tools like UX-LLM cannot fully replace traditional usability evaluation methods, they offer valuable support, especially for smaller teams with limited resources. UX-LLM's ability to analyze source code allows it to identify issues that might be missed by other methods.
Significance:
This research contributes to the growing field of AI-assisted software development by exploring the potential of GenAI in usability evaluation. It highlights the benefits and limitations of such tools, paving the way for further research and development in this area.
Limitations and Future Research:
The study acknowledges limitations regarding the generalizability of findings due to the selection of specific apps and the limited number of UX experts. Future research should explore UX-LLM's performance with more complex apps, diverse user groups, and different usability evaluation methods. Additionally, investigating the integration of UX-LLM into development workflows and exploring its potential to suggest solutions are promising avenues for future work.
Statistics
According to Nielsen, usability tests with five participants uncover about 80% of usability issues.
UX-LLM demonstrated precision ranging from 0.61 to 0.66 and recall between 0.35 and 0.38.
Expert 1 labelled 27 samples as actual usability issues, 13 as non-usability issues, 5 as uncertain, and 4 as incorrect/irrelevant statements.
Expert 2 labelled 31 samples as usability issues, 12 as non-usability issues, 2 as uncertain, and 4 as incorrect/irrelevant statements.
Cohen’s Kappa measure was κ = 0.53, suggesting "Moderate" agreement between the UX experts.
Of the 110 issues in total, usability testing uncovered 25, 8 of which were unique to it.
The expert review pointed out 54, including 31 unique issues.
UX-LLM identified 30 issues, contributing 8 unique insights.
Only 9 issues were identified by all three methods.
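The inter-rater agreement reported above (Cohen's Kappa, κ = 0.53) corrects the raw agreement between two raters for the agreement expected by chance. The sketch below shows the standard formula, κ = (p_o − p_e) / (1 − p_e). It is an illustration only: the function name `cohens_kappa` is hypothetical, and the labels are invented, because the summary gives each expert's totals but not their sample-by-sample labels, so the 0.53 figure cannot be reproduced from it.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same samples in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty list of samples")
    n = len(labels_a)
    # Observed agreement p_o: fraction of samples given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement p_e: from each rater's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels, invented for illustration only (not data from the study).
expert_1 = ["issue", "issue", "non-issue", "non-issue"]
expert_2 = ["issue", "non-issue", "non-issue", "non-issue"]
print(cohens_kappa(expert_1, expert_2))  # 0.5
```

In this toy case the raters agree on 3 of 4 samples (p_o = 0.75) while chance agreement is 0.5, giving κ = 0.5.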
Quotes
"Respond using app domain language; you must not use technical terminology or mention code details."
"Some issues feel a bit generic and some don’t make sense, since they are addressed in previous screens."
"I appreciate the fresh perspectives it offers. Even incorrect usability issues can be valuable as they make me reevaluate design decisions."
"The feedback on the button bug was spot on; it’s not something we would have thought about by ourselves."
"On some screens we assumed something is not ideal, but we did not know what the problem was, these issues are very helpful."
"I’m a laid-back person, so it would annoy me to have to use another application beside my IDE."
"It’s great to see an overview of what’s available; you can quickly eliminate unnecessary issues and reflect on them. In the end, it saves a lot of time as it is easier than conducting usability evaluations ourselves."
"When it criticised the accessibility of the colours, it would be nice if it could also show what colours to use instead."
"It has identified issues that we overlooked, and not just a few."