[Translation] A Visual Guide to Vision Transformers

(discuss.pytorch.kr)

13 points by ninebow 2024-04-22 | 1 comment

ℹ️ After reading the visual guide to Vision Transformers that xguru introduced, I translated A Visual Guide to Vision Transformers with the permission of Dennis Turp, a data scientist and software engineer.
Vision Transformer (ViT) is a model that applies the Transformer to computer vision (CV) and performs strongly on tasks such as object detection and image classification. In particular, it is widely used as a visual encoder to extract features from images.
Because the explanations in the original are quite terse, I have added notes in places that may be hard to follow.

A Visual Guide to Vision Transformers (ViT)

This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art (SotA) performance on image classification tasks. Vision Transformers apply the Transformer architecture, originally designed for natural language processing (NLP), to image data. This guide walks you through the key components of Vision Transformers in a scroll-story format, using visualizations and simple explanations to help you understand how these models work and what the flow of data through the model looks like. (:pytorch::kr:: The scroll-driven explanation is hard to reproduce here, so screenshots are used instead. Reading the original alongside this translation is recommended.)

0. Let's start with the data

Like ordinary convolutional neural networks (CNNs), Vision Transformers are trained in a supervised manner. This means that the model is trained on a dataset of images and their corresponding labels.

1. Focus on one data point

To get a better understanding of what happens inside a Vision Transformer, let's focus on a single data point (a batch size of 1) and ask: how is this data point prepared so that a transformer can consume it?

2. Forget the label for the moment

The label will become more relevant later. For now, the only thing left to consider is a single image.

3. Create patches of the image

To prepare the image for use inside the transformer, we divide it into equally sized patches of size p x p.
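As a rough illustration of the patching step, the sketch below cuts a tiny image into non-overlapping p x p patches. Plain Python lists stand in for tensors so that it has no dependencies, and the sizes `H`, `W`, `C`, and `P` as well as the helper `split_into_patches` are assumptions made for this example, not part of the original guide.

```python
# A tiny "image" stored as nested lists: height x width x channels.
# All sizes are illustrative assumptions, not values from the guide.
H, W, C, P = 4, 4, 3, 2  # image height, width, channels, patch side length p

# Fill the image with distinct values so patches are easy to tell apart.
image = [[[(row * W + col) * C + ch for ch in range(C)]
          for col in range(W)]
         for row in range(H)]


def split_into_patches(img, p):
    """Cut an H x W x C image into non-overlapping p x p x C patches, row by row."""
    height, width = len(img), len(img[0])
    assert height % p == 0 and width % p == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            # Take p rows, and from each row the p pixels of this patch.
            patches.append([row[left:left + p] for row in img[top:top + p]])
    return patches


patches = split_into_patches(image, P)
n = len(patches)  # number of patches: (H / p) * (W / p)
print(n, len(patches[0]), len(patches[0][0]), len(patches[0][0][0]))
```

With these sizes the image yields n = 4 patches, each of shape 2 x 2 x 3.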

4. Flattening the image patches

The patches are now flattened into vectors of dimension p' = p² * c, where p is the side length of a patch and c is the number of channels. (:pytorch::kr:: For example, an RGB image has 3 channels.)
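The flattening step can be sketched as follows. This is a minimal example in plain Python; the values of `P` and `C` and the helper name `flatten_patch` are assumptions chosen for illustration.

```python
P, C = 2, 3  # assumed patch side length p and channel count c (RGB)

# One p x p x c patch with placeholder pixel values.
patch = [[[float(row * P + col + ch) for ch in range(C)]
          for col in range(P)]
         for row in range(P)]


def flatten_patch(patch):
    """Flatten a p x p x c patch into one vector of length p * p * c."""
    return [value for row in patch for pixel in row for value in pixel]


vector = flatten_patch(patch)
flat_dim = P * P * C  # p' = p^2 * c
print(len(vector), flat_dim)
```

For a sense of scale, 16 x 16 patches of an RGB image would give p' = 16 * 16 * 3 = 768.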

5. Creating patch embeddings

These image patch vectors are now encoded using a linear transformation. The resulting patch embedding vector has a fixed size d.

6. Embedding all patches

Now that every image patch has been embedded into a vector of fixed size, we are left with an array of size n x d, where n is the number of image patches and d is the size of each patch embedding.
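The linear transformation of steps 5 and 6 might look like the sketch below, which maps n flattened patches of length p' to an n x d array. The sizes, the random stand-in data, the initialization scale, and the helper `linear` are all assumptions made for this example.

```python
import random

random.seed(0)
n, flat_dim, d = 4, 12, 8  # assumed: 4 patches, p' = 12, embedding size d = 8

# Flattened patches (n x p'); random values stand in for pixel data.
flat_patches = [[random.random() for _ in range(flat_dim)] for _ in range(n)]

# Learnable parameters of the linear transformation: weights (p' x d) and bias (d).
weight = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(flat_dim)]
bias = [0.0] * d


def linear(x, w, b):
    """Apply y = x @ w + b to every row of x (all arguments are nested lists)."""
    return [[sum(xi * w_row[j] for xi, w_row in zip(row, w)) + b[j]
             for j in range(len(b))]
            for row in x]


patch_embeddings = linear(flat_patches, weight, bias)
print(len(patch_embeddings), len(patch_embeddings[0]))  # n rows, d columns
```

The same weights and bias are shared by every patch, so the number of parameters does not depend on n.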

7. Appending a classification token

To train the model effectively, we extend the array of patch embeddings with an additional vector called the classification token (CLS token). This vector is a learnable parameter of the network and is randomly initialized. Note that there is only one CLS token, and the same vector is appended for every data point. (:pytorch::kr:: At this point, adding the CLS token to the n patch embeddings gives an array of size (n+1) x d: (n+1) embeddings, each of size d.)
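A small sketch of this step is shown below. It uses a batch of two images to make the "same vector for all data points" point visible. Placing the CLS vector at index 0 is a choice made for this sketch (many implementations do it that way); the guide itself only says the array is extended. The sizes and the helper `add_cls_token` are likewise assumptions.

```python
import random

random.seed(0)
n, d = 4, 8  # assumed number of patches and embedding size

# A batch of two images, each already embedded as an n x d array.
batch = [[[random.random() for _ in range(d)] for _ in range(n)]
         for _ in range(2)]

# One learnable CLS vector, randomly initialized once and shared by all images.
cls_token = [random.gauss(0.0, 1.0) for _ in range(d)]


def add_cls_token(embeddings, cls):
    """Return a new (n + 1) x d array with a copy of the CLS vector placed first."""
    return [list(cls)] + embeddings


tokens = [add_cls_token(image_embeddings, cls_token) for image_embeddings in batch]
print(len(tokens[0]), len(tokens[0][0]))  # n + 1 rows, d columns
```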

8. Adding positional embedding vectors

So far, the patch embeddings carry no positional information. We remedy this by adding a learnable, randomly initialized positional embedding vector to each of the patch embeddings. We also add such a positional embedding vector to the classification token. (:pytorch::kr:: In a Transformer, the positional encoding values are added element-wise, so the vector size does not change.)
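The addition can be sketched as below, assuming one learnable positional vector per position (n + 1 of them, including the CLS position). The sizes, the initialization scale, and the helper `add_positions` are assumptions made for illustration.

```python
import random

random.seed(0)
n, d = 4, 8  # assumed number of patches and embedding size

# (n + 1) x d array: CLS token plus n patch embeddings (random stand-ins).
tokens = [[random.random() for _ in range(d)] for _ in range(n + 1)]

# One learnable positional vector per position, randomly initialized.
pos_embedding = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(n + 1)]


def add_positions(tokens, positions):
    """Element-wise sum of each token with the vector for its position."""
    return [[t + p for t, p in zip(token, position)]
            for token, position in zip(tokens, positions)]


transformer_input = add_positions(tokens, pos_embedding)
print(len(transformer_input), len(transformer_input[0]))  # still (n + 1) x d
```

Because the vectors are added rather than concatenated, the array keeps its (n+1) x d shape.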

9. Transformer input

After the positional embedding vectors have been added, we are left with an array of size (n+1) x d. This is the input to the transformer, which is explained in greater detail in the next steps.

10.1. Transformer: QKV creation

The transformer's input embedding vectors are linearly projected into larger vectors. Each of these new vectors is then split into three equally sized parts: Q, the query vector; K, the key vector; and V, the value vector. We get (n+1) of each.
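For a single head, the projection and the three-way split could be sketched like this. The per-head size `d_head`, the other sizes, and the helper `matmul` are assumptions for this example; a bias term is left out for brevity.

```python
import random

random.seed(0)
n, d, d_head = 4, 8, 4  # assumed: 4 patches, embedding size 8, per-head size 4

# Transformer input: (n + 1) x d (random stand-in values).
x = [[random.random() for _ in range(d)] for _ in range(n + 1)]

# One projection matrix that produces Q, K and V side by side: d x (3 * d_head).
w_qkv = [[random.gauss(0.0, 0.02) for _ in range(3 * d_head)] for _ in range(d)]


def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x c) matrix, both as nested lists."""
    return [[sum(a_val * b_row[j] for a_val, b_row in zip(row, b))
             for j in range(len(b[0]))]
            for row in a]


qkv = matmul(x, w_qkv)  # (n + 1) x (3 * d_head)

# Split every projected vector into three equally sized parts.
q = [row[:d_head] for row in qkv]
k = [row[d_head:2 * d_head] for row in qkv]
v = [row[2 * d_head:] for row in qkv]
print(len(q), len(k), len(v), len(q[0]))
```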

10.2. Transformer: Attention score calculation

To calculate the attention scores A, we multiply all of the query vectors Q with all of the key vectors K. That is, we take the dot product of every query with every key, which gives an (n+1) x (n+1) matrix.
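The score computation can be sketched as follows. Note one addition: the division by the square root of the vector size is the usual scaled dot-product convention and is not mentioned in the guide, which only describes the multiplication. The sizes and the helper `attention_scores` are assumptions for illustration.

```python
import math
import random

random.seed(0)
n, d_head = 4, 4  # assumed number of patches and per-head vector size

# (n + 1) query and key vectors (random stand-in values).
q = [[random.gauss(0.0, 1.0) for _ in range(d_head)] for _ in range(n + 1)]
k = [[random.gauss(0.0, 1.0) for _ in range(d_head)] for _ in range(n + 1)]


def attention_scores(queries, keys):
    """Return A where A[i][j] is the scaled dot product of query i and key j."""
    scale = 1.0 / math.sqrt(len(queries[0]))  # assumed scaling, see note above
    return [[scale * sum(qi * kj for qi, kj in zip(query, key)) for key in keys]
            for query in queries]


scores = attention_scores(q, k)
print(len(scores), len(scores[0]))  # (n + 1) rows and (n + 1) columns
```

Row i of the result holds the scores of token i against every token, including itself.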

10.3. Transformer: Attention score matrix

Now that we have the attention score matrix A, we apply a softmax function to every row so that each row sums to 1.
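A row-wise softmax might be written as in the sketch below. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result. The score values and the helper name `softmax` are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax: subtract the row maximum before exponentiating."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


# A small made-up score matrix; each row is normalized independently.
scores = [[2.0, 1.0, 0.1],
          [0.5, 0.5, 0.5],
          [-1.0, 0.0, 3.0]]
attention = [softmax(row) for row in scores]

for row in attention:
    print([round(weight, 3) for weight in row], round(sum(row), 6))
```

Larger scores receive larger weights, and a row of equal scores yields equal weights.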

10.4. Transformer: Aggregated contextual information calculation

To calculate the aggregated contextual information for the first patch embedding vector, we focus on the first row of the attention matrix and use its entries as weights for the value vectors V. The result is the aggregated contextual information vector for the first image patch embedding.

10.5. Transformer: Aggregated contextual information for every patch

We repeat this process for every row of the attention score matrix, which yields n+1 aggregated contextual information vectors: one for every patch (n in total) plus one for the classification token. This step concludes our first attention head.
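Steps 10.4 and 10.5 amount to a weighted sum of the value vectors, once per row of the attention matrix. A minimal sketch, with made-up weights and values and a hypothetical helper `aggregate`:

```python
# Made-up attention weights (each row sums to 1) and value vectors.
attention = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1],
             [0.25, 0.25, 0.5]]
v = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]


def aggregate(weights, values):
    """Weighted sum of the value vectors using one row of attention weights."""
    dim = len(values[0])
    return [sum(w * value[j] for w, value in zip(weights, values))
            for j in range(dim)]


# Step 10.4: the first row gives the aggregated vector for the first token.
first = aggregate(attention[0], v)

# Step 10.5: repeating this for every row gives the output of one head.
head_output = [aggregate(row, v) for row in attention]
print(first)
print(len(head_output), len(head_output[0]))
```

Computing all rows at once is the same as multiplying the attention matrix by the matrix of value vectors.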

10.6. Transformer: Multi-head attention

Because we are dealing with multi-head attention, we repeat the entire process from steps 10.1 to 10.5 with a different QKV mapping. The illustration assumes 2 heads, but a ViT typically has many more. In the end, this results in multiple aggregated contextual information vectors.

10.7. Transformer: Last attention layer step

These heads are stacked together and mapped to vectors of size d, the same size as the patch embeddings.
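The stacking and the final mapping can be sketched as below: the per-token outputs of all heads are joined end to end and then projected back to size d. The random head outputs stand in for the results of steps 10.1 to 10.5, and the sizes and the helper `matmul` are assumptions for this example.

```python
import random

random.seed(0)
n, d, d_head, num_heads = 4, 8, 4, 2  # assumed sizes; 2 heads as in the illustration

# Output of each head: (n + 1) x d_head (random stand-ins for steps 10.1 to 10.5).
head_outputs = [[[random.random() for _ in range(d_head)] for _ in range(n + 1)]
                for _ in range(num_heads)]

# Stack the heads: for each token, join its vectors from all heads end to end.
stacked = [[value for head in head_outputs for value in head[token]]
           for token in range(n + 1)]

# Learnable output projection: (num_heads * d_head) x d.
w_out = [[random.gauss(0.0, 0.02) for _ in range(d)]
         for _ in range(num_heads * d_head)]


def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x c) matrix, both as nested lists."""
    return [[sum(a_val * b_row[j] for a_val, b_row in zip(row, b))
             for j in range(len(b[0]))]
            for row in a]


attention_output = matmul(stacked, w_out)
print(len(stacked[0]), len(attention_output), len(attention_output[0]))
```

The result again has n + 1 rows of size d, so it can be added to the layer input in the residual step.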

10.8. Transformer: Attention layer result

The previous step concluded the attention layer, and we are left with the same number of embeddings, of exactly the same size, as we used as input.

10.9. Transformer: Residual connections

Transformers make heavy use of residual connections, which simply means adding the input of a layer to that layer's output. We do that here as well: the input of the attention layer is added to its output.

10.10. Transformer: Residual connection result

The addition combines vectors of the same size d and therefore results in vectors of that same size.
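A residual connection is just an element-wise sum, as in this small sketch. The values and the helper `add_residual` are made up for the example.

```python
def add_residual(layer_input, layer_output):
    """Element-wise sum of a layer's input and its output; shapes must match."""
    assert len(layer_input) == len(layer_output), "row counts differ"
    return [[a + b for a, b in zip(in_row, out_row)]
            for in_row, out_row in zip(layer_input, layer_output)]


# Made-up values: two tokens with embedding size 3.
attention_input = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
attention_output = [[0.5, -0.5, 0.0], [1.0, 1.0, -1.0]]

residual = add_residual(attention_input, attention_output)
print(residual)
```

This only works because the attention layer returns embeddings of the same shape as its input.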

10.11. Transformer: Feed-forward network

These outputs are now fed through a feed-forward neural network with nonlinear activation functions.
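One possible shape of such a network is sketched below: two linear layers with a nonlinearity in between, applied to each token independently. The guide does not name the activation or the layer sizes; GELU and the `hidden` width are assumptions made here, as are the helpers `gelu` and `linear`.

```python
import math
import random

random.seed(0)
n, d, hidden = 4, 8, 16  # assumed sizes; hidden is the inner width of the network


def gelu(x):
    """Exact GELU activation: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, w, b):
    """Apply y = x @ w + b to every row of x (all arguments are nested lists)."""
    return [[sum(xi * w_row[j] for xi, w_row in zip(row, w)) + b[j]
             for j in range(len(b))]
            for row in x]


# Learnable parameters of the two layers.
w1 = [[random.gauss(0.0, 0.02) for _ in range(hidden)] for _ in range(d)]
b1 = [0.0] * hidden
w2 = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(hidden)]
b2 = [0.0] * d

# Input: (n + 1) x d (random stand-in for the residual output).
x = [[random.random() for _ in range(d)] for _ in range(n + 1)]

hidden_activations = [[gelu(value) for value in row] for row in linear(x, w1, b1)]
ffn_output = linear(hidden_activations, w2, b2)
print(len(ffn_output), len(ffn_output[0]))  # (n + 1) x d again
```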

10.12. Transformer: Final result

After the feed-forward network there is another residual connection, which we skip here for brevity. This concludes the transformer layer. In the end, the transformer produces outputs of the same size as its input.

11. Repeat transformers

The entire transformer calculation, steps 10.1 to 10.12, is repeated several times, for example 6 times.

12. Identify classification token output

The last step is to identify the classification token output. This vector is used in the final step of our Vision Transformer journey.

13. Final step: Predicting classification probabilities

In the final step, we pass this classification output token through another fully connected neural network to predict the classification probabilities of the input image.
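Steps 12 and 13 can be sketched together as follows. The sketch assumes the CLS token sits at index 0 of the encoder output and uses a single linear layer followed by a softmax as the simplest instance of a fully connected network; the sizes, `num_classes`, and the helper `softmax` are assumptions for illustration.

```python
import math
import random

random.seed(0)
n, d, num_classes = 4, 8, 3  # assumed sizes and number of classes

# Output of the last transformer layer: (n + 1) x d (random stand-in values).
encoder_output = [[random.random() for _ in range(d)] for _ in range(n + 1)]

# Step 12: pick out the classification token (assumed to be at index 0).
cls_output = encoder_output[0]

# Step 13: a single fully connected layer maps d features to one logit per class.
w_head = [[random.gauss(0.0, 0.02) for _ in range(num_classes)] for _ in range(d)]
b_head = [0.0] * num_classes
logits = [sum(x * w_head[i][j] for i, x in enumerate(cls_output)) + b_head[j]
          for j in range(num_classes)]


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)
print([round(p, 3) for p in probabilities])
```

Only the CLS row is used for the prediction; the other n output vectors are ignored at this stage.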

14. Training of the Vision Transformer

We train the Vision Transformer using a standard cross-entropy loss function, which compares the predicted class probabilities with the true class labels. The model is trained using backpropagation and gradient descent, updating the model parameters to minimize the loss function.
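To make the loss concrete, the sketch below computes the cross-entropy for one example and takes a single gradient descent step. To stay short, it updates the logits directly as a stand-in for the model parameters; in a real model, backpropagation would carry this gradient back to every weight. The logits, the label, and the learning rate are made-up values.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(softmax(logits)[label])


logits = [1.0, 2.0, 0.5]  # made-up model outputs for three classes
label = 0                 # index of the true class
learning_rate = 0.5       # assumed step size

loss_before = cross_entropy(logits, label)

# Gradient of the loss with respect to the logits: probabilities minus one-hot label.
probabilities = softmax(logits)
gradient = [p - (1.0 if index == label else 0.0)
            for index, p in enumerate(probabilities)]

# One gradient descent step (on the logits, as a stand-in for the parameters).
updated_logits = [z - learning_rate * g for z, g in zip(logits, gradient)]
loss_after = cross_entropy(updated_logits, label)

print(round(loss_before, 4), round(loss_after, 4))
```

After the step, the probability of the true class rises and the loss falls.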

Conclusion

In this visual guide, we have walked through the key components of Vision Transformers, from data preparation to the training of the model. We hope this guide has helped you understand how Vision Transformers work and how they can be used to classify images.

I prepared this little Colab Notebook to help you understand the Vision Transformer even better. Please look for the 'Blogpost' comments in it. The code was taken from @lucidrains' great ViT PyTorch implementation, so be sure to check out his work.

If you have any questions or feedback, please feel free to reach out to me. Thank you for reading! (The author can be found on GitHub at https://github.com/mdturp, and on X (Twitter), Threads, and LinkedIn.)

Acknowledgements

ViT PyTorch implementation by @lucidrains
All images have been taken from Wikipedia and are licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

Further reading

Papers, code, and other resources related to Vision Transformers are collected on PapersWithCode:

https://paperswithcode.com/method/vision-transformer

1 comment

gcback 2024-04-22

Thank you for taking the trouble to prepare this useful material.