"Vanilla" image classification tasks typically don't require specialized models, as the default Hugging Face models tend to deliver satisfactory results.
A new study has shown that fine-tuned neural networks initialised via transfer learning generally outperform scikit-learn models trained on extracted neural-network features in complex, domain-specific tasks. This is mainly due to the adaptability of fine-tuning, which lets the feature representations themselves be reshaped to suit the task rather than staying fixed.
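As a minimal sketch of what that fine-tuning looks like in practice, assuming a BEiT backbone from Hugging Face and PyTorch (the checkpoint, label count, and hyperparameters are illustrative, not the study's exact setup):

```python
import torch
from transformers import BeitForImageClassification

# Load a pre-trained backbone and swap in a fresh 5-class head
# (one class per land cover type); this is the transfer-learning initialisation.
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224",
    num_labels=5,
    ignore_mismatched_sizes=True,  # discard the original 1000-class head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step on a dummy batch of preprocessed patches.
pixel_values = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))

outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()  # fine-tuning updates all weights, not just the new head
optimizer.step()
```

Because every layer is free to move, the learned representations can bend towards the land cover task rather than staying frozen at their ImageNet-era values.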
The research, which sits at the intersection of earth observation, computer vision, and machine learning, utilises a dataset from the 2013/2014 Chesapeake Conservancy land cover project. The dataset consists of 15,809 unique patches of size 128 x 128 pixels drawn from National Agriculture Imagery Program (NAIP) aerial imagery, with four bands of information at a 1-meter resolution (each pixel covers roughly one square meter). The dataset includes 5 land cover classes: Water, Tree Canopy and Shrubs, Low Vegetation, Barren, and Impervious Surfaces.
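As a rough sketch of what one such patch looks like in code, assuming the chips are stored as GeoTIFFs readable by rasterio (the file name is hypothetical):

```python
import rasterio

# Each patch is a 128 x 128 pixel, 4-band NAIP chip at ~1 m ground resolution.
with rasterio.open("patch_00001.tif") as src:
    patch = src.read()           # numpy array of shape (4, 128, 128)
    print(patch.shape, src.res)  # e.g. (4, 128, 128) and (1.0, 1.0)
```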
The dataset is significantly class imbalanced, with Tree Canopy and Shrubs heavily over-represented and Barren and Impervious Surfaces heavily under-represented. Separately, as a non-learned baseline, the researchers used Principal Component Analysis (PCA) for dimensionality reduction of the raw pixel values.
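A minimal sketch of that non-learned baseline with scikit-learn, where the array shapes and the number of components are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# X: one flattened 4-band patch per row, shape (n_patches, 4 * 128 * 128)
X = np.random.rand(1_000, 4 * 128 * 128)

# Project the raw pixel values down to a handful of principal components
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)             # shape (1000, 50)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```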
Learned features were extracted from two pre-trained models: Microsoft's BEiT (Bidirectional Encoder representation from Image Transformers) and Facebook's ConvNeXt. Scikit-learn models were then trained on these extracted features, and transfer-learned, fine-tuned neural networks were trained alongside them for comparison.
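A hedged sketch of that pipeline: pull embeddings out of a pre-trained BEiT checkpoint with the transformers library, then fit an ordinary scikit-learn classifier on top of them. The checkpoint name, the mean-pooling choice, and the logistic regression are illustrative assumptions rather than the study's exact configuration:

```python
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
backbone = AutoModel.from_pretrained("microsoft/beit-base-patch16-224")

def embed(image):
    """Mean-pool the backbone's token embeddings into one feature vector."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state   # (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()        # (dim,)

# Dummy RGB images stand in for NAIP patches here; in practice each patch
# would be loaded, embedded once, and cached.
images = [Image.new("RGB", (128, 128)) for _ in range(10)]
X = np.stack([embed(im) for im in images])
y = np.random.randint(0, 5, size=len(images))

clf = LogisticRegression(max_iter=1000).fit(X, y)
```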
Model evaluation was based on balanced accuracies, individual class accuracies, and confusion matrices on the held-out test set. The ConvNeXt model performed best overall with a balanced accuracy of 84.4%, while the fine-tuned BEiT model came second at 82.9%.
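Both metrics are one-liners in scikit-learn; the arrays below are placeholders for the held-out test labels and model predictions:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = [0, 0, 1, 2, 3, 4, 4, 1]   # placeholder test-set labels
y_pred = [0, 1, 1, 2, 3, 4, 3, 1]   # placeholder predictions

# Balanced accuracy is the mean of per-class recalls, so the rare Barren and
# Impervious Surfaces classes count just as much as the dominant ones.
print(balanced_accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```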
Interestingly, the study found that handcrafted features such as Histograms of Oriented Gradients (HOG) weren't as effective for supervised modeling purposes in this context, incurring a significant drop-off in balanced accuracy relative to the models trained on learned features.
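For reference, HOG features can be computed with scikit-image; the cell and block sizes below are arbitrary illustrative choices:

```python
import numpy as np
from skimage.feature import hog

patch = np.random.rand(128, 128)   # single-band stand-in for a NAIP patch

# Handcrafted descriptor: histograms of gradient orientations over local cells.
features = hog(
    patch,
    orientations=9,
    pixels_per_cell=(16, 16),
    cells_per_block=(2, 2),
)
print(features.shape)  # a fixed-length vector usable with any scikit-learn model
```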
The research also highlights the limitations of the resulting models, including worse classification performance on imagery containing other class types, at other resolutions, and under other conditions. However, it also suggests that pre-trained embeddings paired with simpler models can perform nearly as well as fine-tuned neural networks.
The study references a famous blog post by Andrej Karpathy on the shift from old-school hand-engineered software to the new school of deep learning, dubbed "Software 2.0". This shift, which is still relevant today, emphasises the importance of pre-trained models and transfer learning in achieving superior performance on complex, domain-specific tasks.
The research also mentions that knowledge of the Python package universe is critical for this kind of project. With thousands of pre-trained neural networks being released annually by companies like Microsoft, that knowledge will only become more essential in the future.
The findings are significant for data scientists and anyone interested in earth observation, computer vision, and machine learning, as they demonstrate the power of fine-tuned neural networks initialised via transfer learning on complex, domain-specific tasks, and underline the importance of understanding and utilising pre-trained models and transfer learning to achieve superior performance.
On a side note, it's worth mentioning that Hugging Face is not only great for Natural Language Processing, but also amazing for Computer Vision. In fact, Hugging Face's Model Hub offers a wide range of pre-trained models that can be used for various computer vision tasks.
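For example, the huggingface_hub client can enumerate those checkpoints programmatically (the filter tag and attribute names may vary slightly between library versions):

```python
from huggingface_hub import list_models

# Print a few of the image-classification checkpoints hosted on the Model Hub.
for model_info in list_models(filter="image-classification", limit=5):
    print(model_info.id)
```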
Lastly, it's interesting to note that Stable Diffusion, a neural network that can turn text prompts into images/art, has been downloaded by over 10 million users, demonstrating the growing interest and application of neural networks in various fields.
Pre-trained models such as Microsoft's BEiT and Facebook's ConvNeXt play a crucial role in the performance of fine-tuned neural networks on complex, domain-specific tasks, as the study demonstrates. Knowledge of the Python package universe, including the thousands of pre-trained neural networks released annually, is essential for such projects and will only become more important in the future.