ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4

Abstract

In recent years, large language models (LLMs) have made significant progress in natural language processing (NLP), with models such as ChatGPT and GPT-4 achieving impressive capabilities across a variety of linguistic tasks. However, training models at such a large scale is challenging, and finding datasets that match the model's scale is often difficult. Fine-tuning and training models with fewer parameters using novel methods have emerged as promising approaches to overcome these challenges. One such model is MiniGPT-4, which achieves vision-language understanding comparable to GPT-4 by leveraging novel pre-training models and innovative training strategies. However, the model still faces some challenges in image understanding, particularly with artistic pictures. A novel multimodal model called ArtGPT-4 has been proposed to address these limitations. ArtGPT-4 was trained on image-text pairs on a Tesla A100 device in just 2 hours, using only about 200 GB of data. The model can depict images with an artistic flair and generate visual code, including aesthetically pleasing HTML/CSS web pages. Furthermore, the article proposes novel benchmarks for evaluating the performance of vision-language models. Under these evaluation methods, ArtGPT-4 scored more than 1 point higher than the current state-of-the-art model and was only 0.25 points lower than artists on a 6-point scale. Our code and pre-trained model are available at https://huggingface.co/Tyrannosaurus/ArtGPT-4.
