{"id":1992,"date":"2024-09-18T14:39:44","date_gmt":"2024-09-18T06:39:44","guid":{"rendered":"https:\/\/www.gnn.club\/?p=1992"},"modified":"2024-10-10T14:43:07","modified_gmt":"2024-10-10T06:43:07","slug":"transformers-attention-mechanism","status":"publish","type":"post","link":"http:\/\/gnn.club\/?p=1992","title":{"rendered":"\u81ea\u6ce8\u610f\u529b\u673a\u5236&#038;Transformer\u7b97\u6cd5"},"content":{"rendered":"<h1><img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918144730732.png\" style=\"height:50px;display:inline\"> Deep Learning<\/h1>\n<hr \/>\n<p>create by Arwin Yu<\/p>\n<h2>Tutorial 05 - Transformers - Attention Mechanism<\/h2>\n<hr \/>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918144800208.png\" style=\"height:250px\">\n<\/p>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/bubbles\/50\/000000\/checklist.png\" style=\"height:50px;display:inline\"> Agenda<\/h3>\n<hr \/>\n<ul>\n<li>\u6ce8\u610f\u529b\u673a\u5236\uff08The Attention Mechanism\uff09\n<ul>\n<li>\u81ea\u6211\u6ce8\u610f\u529b\u673a\u5236<\/li>\n<li>\u591a\u5934\u6ce8\u610f\u529b\u673a\u5236<\/li>\n<li>\u4ea4\u53c9\u6ce8\u610f\u529b\u673a\u5236<\/li>\n<\/ul>\n<\/li>\n<li>\u53d8\u5f62\u91d1\u521a\u6a21\u578b\uff08The Transformer\uff09\n<ul>\n<li>\u7f16\u7801\u5668<\/li>\n<li>\u4f4d\u7f6e\u7f16\u7801<\/li>\n<li>\u89e3\u7801\u5668<\/li>\n<li>\u6559\u5e08\u673a\u5236<\/li>\n<\/ul>\n<\/li>\n<li>\u9884\u8bad\u7ec3\u6a21\u578b\uff08Pretrained Models\uff09\n<ul>\n<li>Bert<\/li>\n<li>GPT<\/li>\n<\/ul>\n<\/li>\n<li>\u89c6\u89c9\u4e2d\u7684Transformer\n<ul>\n<li>Vision Transformer<\/li>\n<li>Swin Transformer<\/li>\n<\/ul>\n<\/li>\n<li>\u8bad\u7ec3Transformer\u7684\u6280\u5de7\uff08The training trick of Transformer\uff09\n<ul>\n<li>\u521d\u59cb\u5316<\/li>\n<li>\u5f52\u4e00\u5316<\/li>\n<li>\u6fc0\u6d3b\u51fd\u6570<\/li>\n<li>\u4f4d\u7f6e\u7f16\u7801<\/li>\n<li>\u4f18\u5316\u5668<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/cute-clipart\/64\/000000\/alarm.png\" style=\"height:50px;display:inline\"> The Attention Mechanism<\/h2>\n<hr \/>\n<p>Transformer\u6a21\u578b\u4e2d\u6700\u5173\u952e\u90e8\u5206\u5c31\u662f\u81ea\u6ce8\u610f\u529b\uff08Self-Attention\uff09\u673a\u5236\uff0c\u6b63\u5982 Transformer \u7684\u8bba\u6587\u7684\u6807\u9898\u662f\u201cAttention Is All You 
Let's explain the mechanism using text as the running example. When processing text, self-attention tells the model: while handling each word in a sentence, pay special attention to certain important words and more or less ignore the others. Put simply, it assigns different weights to different words in the sentence. This matches common sense, because the words in a sentence are not equally important: grammatically, the subject, predicate, and object matter more than the other constituents. Self-attention is the model's way of learning how important each part of the sentence is.

By learning the importance of the sentence's parts, self-attention lets the model better understand linguistic context, and context is crucial for a language model. For example, consider the Second Law of Robotics:

**Second Law of Robotics: a robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.**

When the model processes this sentence, it must be able to tell that:

- "it" refers to the robot;
- "such orders" refers to the first half of the law, i.e. "the orders given by human beings";
- "the First Law" refers to the entire First Law; and so on.

---

- The input to an attention layer is called the query vector $q$. The query $q$ is the piece of input information we want attention applied to.
- For every query vector $q$, the attention mechanism returns an output based on a memory. The memory is a set of key-value pairs (key-value, the $k$ and $v$ vectors for short) encoded in the attention layer. These key-value pairs help the model find the information relevant to the query.

There are three main kinds of attention:

1. Self-attention
2. Cross-attention
3. Multi-head attention
**Self-attention** means that within a single input sequence, the queries, keys, and values all come from the same sequence. In other words, when processing each element, the model looks at the entire sequence to compute the similarity between every pair of elements.

**Cross-attention** means that the queries and the key-value pairs come from different input sequences. In machine translation, for instance, one sequence may be the source-language sentence and the other the target-language sentence.

![Self-attention vs. cross-attention](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918144938489.png)

**Multi-head attention** computes several attention mechanisms in parallel to capture different features in different parts of the input sequence, which gives the model a more complete understanding of sequence data.

![Two attention heads attending to different words](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918145028778.png)

- The figure visualizes the outputs of two attention heads for the same query.
- We can see that when the query word is **it**, the first attention head attends mostly to the words **the animal**, while the second head attends mostly to the word **tired**.
- The final context representation therefore attends to all of the words **the**, **animal**, and **tired**, making it a richer representation than traditional approaches.
- [Images Source](https://blogs.oracle.com/datascience/multi-head-self-attention-in-nlp)

### Self-Attention in Detail

---

- We denote by $q$, $k$, and $v$ the query, key, and value vectors of **self-attention**, and by $W_q$, $W_k$, and $W_v$ their corresponding learnable parameter matrices, which **project** our input embedding $x \in \mathbb{R}^{d_x}$: $$ q = W_q x \in \mathbb{R}^{d} $$ $$ k = W_k x \in \mathbb{R}^{d} $$ $$ v = W_v x \in \mathbb{R}^{d_v} $$
- We usually do **not** include any nonlinearity here, since attention is based entirely on **direction**.
- To compare a query with all possible keys, $q$ and $k$ must have the same dimensionality, i.e. $q, k \in \mathbb{R}^d$.
- $v$ can have any dimensionality, $v \in \mathbb{R}^{d_v}$.
- For simplicity, we assume everything has the same dimensionality $d$ (i.e. $d_v = d$), which is also what is usually done in practice.

Creating these three vectors is very simple when implementing the model: a single neural-network (linear) projection suffices. Concretely, if the input is the token embedding itself (say, 64-dimensional), the projected vector can be 192-dimensional; dimensions 0-63 then serve as the $q$ vector, dimensions 64-127 as the $k$ vector, and dimensions 128-191 as the $v$ vector. Note that the query, key, and value vectors are concepts abstracted for computing and reasoning about attention, or rather, they express what we hope the model will learn. When freshly created, these three vectors are randomly initialized and carry no special meaning; only through training do they acquire their query-like, reply-like, and value-storing roles, which let one word vector interact with the other word vectors to model word-to-word relevance. After reading through the full computation below, you will see where their names come from.

A rough analogy for self-attention is a search through a filing cabinet. The query vector $q$ is like a sticky note with the topic you are researching written on it. The $k$ vectors are like the labels on the folders inside the cabinet. When a label matches the sticky note, we take out the contents of that folder; those contents are the value vector $v$.

![Query, key, and value as a filing-cabinet search](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918145316731.png)

The weight score of each folder is computed as the dot product between the query vector and the key vector of the word being scored. The dot-product formula is $a \cdot b = |a| \, |b| \cos \theta$; it measures how correlated two vectors are, and the higher the correlation, the larger the score. Note that after the dot product, the results are passed through a softmax to obtain the weight scores; the softmax-normalized scores determine how important each word is at a given position in the sentence.

![Dot-product scores followed by softmax](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918145420897.png)

We multiply each value vector by its weight score and sum them up, which gives our self-attention result.

![Weighted sum of the value vectors](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918145505496.png)

[Image Source](http://wenqianzhao.cn/2020/12/29/transformer/)

To summarize, the self-attention formula consists of three steps:

![The three steps of self-attention](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918145557491.png)

[Image by Lena Voita](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)

1. Compute the similarity between $Q$ and $K$, i.e. $Q K^{\mathrm{T}}$.
2. Since the dimensionality of $Q$ and $K$ may be large, divide by $\sqrt{d_k}$ to rescale. This helps avoid vanishing or exploding gradients in the softmax computation.
3. Apply a softmax to the similarity matrix to obtain, for each query vector, a weight distribution over all key vectors. Then multiply these weights with the value matrix $V$ and sum, giving the output matrix of the self-attention mechanism.
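To make these three steps concrete, here is a minimal sketch of $\mathrm{softmax}(QK^{\mathrm{T}}/\sqrt{d_k})\,V$ on a toy tensor (the function name and the random toy shapes are chosen purely for illustration):

```python
import torch

def attention(Q, K, V):
    # (1) similarity between queries and keys: Q @ K^T
    scores = Q @ K.transpose(-2, -1)
    # (2) scale by sqrt(d_k) so the softmax does not saturate
    scores = scores / K.size(-1) ** 0.5
    # (3) softmax over the keys, then a weighted sum of the values
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

x = torch.randn(3, 4)        # a toy sequence: 3 tokens with d = 4
out, w = attention(x, x, x)  # self-attention: q, k, v all come from the same sequence
print(out.shape, w.shape)    # torch.Size([3, 4]) torch.Size([3, 3])
```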
### Multi-Head Self-Attention

---

The mechanism of "multi-head self-attention" refines the self-attention layer further. Multi-head attention has several sets of query, key, and value vectors; one set of $q$, $k$, $v$ is called a head, and the original Transformer paper uses eight attention heads. Every attention head is trainable, and training them expands the model's ability to attend to different positions.

A vivid analogy: think of each attention head as a pupil. Different pupils develop different thinking patterns as they learn and will understand the same problem in different ways. That is exactly why multiple heads are used: we want the model to consider the input information from several different angles, as shown below.

![Multiple heads attending from different perspectives](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918145939147.png)

[Image Source](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#multi_head_attention)

Multi-head attention does introduce one problem, however. With eight heads, the multi-head attention step produces 8 outputs, but only a single output is actually needed. So we need a way to compress these eight outputs into one matrix, and the method is simple: multiply them by an extra weight matrix. This operation is implemented as one more neural-network projection, as shown:

![Concatenating the heads and projecting back to one output](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918150019565.png)

- The idea behind multi-head attention is very similar to **grouped convolution**.

![Grouped convolution animation](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918150100576.gif)

[Image Source](https://animatedai.github.io/)

```python
import numpy as np
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout, d_input=None):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        if d_input is None:
            d_xq = d_xk = d_xv = d_model
        else:
            d_xq, d_xk, d_xv = d_input

        # Make sure that the embedding dimension of model is a multiple of number of heads
        assert d_model % self.num_heads == 0

        self.d_k = d_model // self.num_heads  # here d is divided between the heads
        # each head has hidden dimension d_k

        # These are still of dimension d_model. They will be split into number of heads
        self.W_q = nn.Linear(d_xq, d_model, bias=False)
        self.W_k = nn.Linear(d_xk, d_model, bias=False)
        self.W_v = nn.Linear(d_xv, d_model, bias=False)

        # Outputs of all sub-layers need to be of dimension d_model
        self.W_h = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V):
        batch_size = Q.size(0)
        k_length = K.size(-2)

        # Scaling by d_k so that the soft(arg)max doesn't saturate
        Q = Q / np.sqrt(self.d_k)                            # (bs, n_heads, q_length, dim_per_head)
        scores = torch.matmul(Q, K.transpose(2, 3))          # (bs, n_heads, q_length, k_length)

        A = torch.softmax(scores, dim=-1)  # (bs, n_heads, q_length, k_length)
        A = self.dropout(A)

        # Get the weighted average of the values
        H = torch.matmul(A, V)     # (bs, n_heads, q_length, dim_per_head)

        return H, A

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (heads X depth)
        Return after transpose to put in shape (batch_size X num_heads X seq_length X d_k)
        """
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def group_heads(self, x, batch_size):
        """
        Combine the heads again to get (batch_size X seq_length X (num_heads times d_k))
        """
        return x.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)

    def forward(self, X_q, X_k, X_v):
        batch_size, seq_length, dim = X_q.size()  # dim = embedding dimension

        # After transforming, split into num_heads
        Q = self.split_heads(self.W_q(X_q), batch_size)  # (bs, n_heads, q_length, dim_per_head)
        K = self.split_heads(self.W_k(X_k), batch_size)  # (bs, n_heads, k_length, dim_per_head)
        V = self.split_heads(self.W_v(X_v), batch_size)  # (bs, n_heads, v_length, dim_per_head)

        # Calculate the attention weights for each of the heads
        H_cat, A = self.scaled_dot_product_attention(Q, K, V)

        # Put all the heads back together by concat
        H_cat = self.group_heads(H_cat, batch_size)    # (bs, q_length, dim)

        # Final linear layer
        H = self.W_h(H_cat)  # (bs, q_length, dim)

        return H, A
```
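As a quick sanity check of the module above, we can run one self-attention forward pass on random data (the shapes below are arbitrary, chosen only to show the expected output sizes):

```python
mha = MultiHeadAttention(d_model=32, num_heads=4, dropout=0.1)
x = torch.randn(2, 10, 32)   # (batch_size, seq_len, d_model)
H, A = mha(x, x, x)          # self-attention: the same tensor supplies Q, K, and V
print(H.shape, A.shape)      # torch.Size([2, 10, 32]) torch.Size([2, 4, 10, 10])
```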
## The Transformer

---

**Transformer**: an attention-based encoder-decoder architecture designed to combine the advantages of feed-forward networks (the FFN, basically an MLP implementing a 1-D convolution) and RNNs.

The RNN is replaced by multi-head attention layers, position information is integrated through **positional encoding**, and **layer normalization** is applied.

Computation can be parallelized, so training time is markedly shorter.

**When do we need an encoder-decoder, and when can we use only an encoder or only a decoder?**

- Classification (e.g. sentiment analysis) may use only the Transformer encoder.
- Next-token prediction (e.g. language modeling) may use only the Transformer decoder (autoregressive inference).
- Sequence-to-sequence models (seq2seq, e.g. machine translation) need both encoder and decoder modules, with **cross-attention** between the two different sequences (e.g. an English sentence and a French sentence).
- Let's now look at each component of the Transformer.

![The Transformer architecture](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918150158694.png)

#### Transformer's Encoder Module

---

The self-attention layer in each encoder is wrapped in a residual connection, followed by a layer-normalization step. The normalized output is then mapped through a feed-forward network (Feed Forward Network, FFN) for further processing. The FFN itself is just a pair of linear layers with a ReLU activation in between, and a residual connection is likewise applied around this sub-layer.

![Encoder block with residual connections and layer norm](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918150240291.png)

[Image Source](https://jalammar.github.io/illustrated-transformer/)

```python
"""
Feed Forward Network (FFN): an MLP with one hidden layer and ReLU activation
applied to each and every element in the set.
"""
class FFN(nn.Module):
    def __init__(self, d_model, hidden_dim_multiplier=4, resid_pdrop=0.1):
        super().__init__()
        self.fc_1 = nn.Linear(d_model, hidden_dim_multiplier * d_model)
        self.act = nn.ReLU(True)  # inplace=True, saves a little bit of memory
        self.proj = nn.Linear(hidden_dim_multiplier * d_model, d_model)
        self.dropout = nn.Dropout(resid_pdrop)

    def forward(self, x):
        # x: [batch_size, seq_len, embed_dim]
        x = self.dropout(self.proj(self.act(self.fc_1(x))))  # [batch_size, seq_len, embed_dim]
        return x
```
```python
import torch
ffn = FFN(d_model=4, hidden_dim_multiplier=2)
ffn.eval()
ffn(torch.ones((2, 3, 4)))[0]  # batch_size = 2, seq_len = 3, embed_dim = 4
# note that the FFN only operates on the last dimension
# PyTorch's nn.Linear knows how to handle tensors of shape [batch_size, seq_len, embed_dim]
```

```
tensor([[ 0.2905, -0.0280, -0.2088, -0.0021],
        [ 0.2905, -0.0280, -0.2088, -0.0021],
        [ 0.2905, -0.0280, -0.2088, -0.0021]], grad_fn=<SelectBackward0>)
```

#### Positional Encoding

---

- Unlike an RNN, the multi-head attention layer and the position-wise FFN compute the output for each element of the sequence independently.
- This property lets us **parallelize the computation**, but it cannot model the order of a given sequence.
- To better capture order, the Transformer uses positional encodings to **maintain the position information of the input sequence**.
- The positional encoding adds position information. This can be done in several ways; the original Transformer uses `sin` and `cos` functions to add it.

Let $X \in \mathbb{R}^{l \times d}$:

- $X$ is an example embedding matrix, where $l$ is the sequence length (how many words or elements the sequence contains) and $d$ is the embedding size (the representation dimensionality of each word or element).

The positional encoding matrix $P \in \mathbb{R}^{l \times d}$:

- The positional encoding layer produces a matrix $P$ of the same shape as $X$.
- Each entry of $P$ is computed from the word's position in the sequence (the index $i$) and its position within the embedding dimension (the index $j$).

The positional encoding formulas:

- For position $i$ and an even index $2j$ of the embedding dimension, the encoding value is:
$$
P_{i, 2j} = \sin \left( \frac{i}{10000^{2j/d}} \right)
$$
- For position $i$ and an odd index $2j+1$ of the embedding dimension, the encoding value is:
$$
P_{i, 2j+1} = \cos \left( \frac{i}{10000^{2j/d}} \right)
$$

where

- $i$ is the position index of the word in the sequence (starting at 0);
- $2j$ and $2j+1$ are the even and odd index positions within the embedding dimension;
- the denominator $10000^{2j/d}$ is a scaling factor that makes the positional encodings at different dimensions vary at suitably different rates.

The positional encoding matrix $P$ is added to the embedding matrix $X$ to form the new representation $X + P$. This new representation contains both the semantic information of the words (from $X$) and their position information (from $P$), strengthening the model's understanding of sequence data. The encodings generated by the sine and cosine functions ensure that words at different positions can be distinguished, while also carrying relative-position information.

![Sinusoidal positional encoding](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918151051367.png)

#### Why can't we use only sin, or only cos?

---

```python
class PositionalEncoding(nn.Module):
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)

        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(0, max_len, dtype=torch.float32).reshape(-1, 1)
        X = X / torch.pow(10_000, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X)

    def forward(self, X):
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        # if using learned embeddings:
        # X = X + self.positional_embeddings[:, :X.shape[1]]  # [bs, seq_len, embed_dim]
        return self.dropout(X)
```

```python
import matplotlib.pyplot as plt
import numpy as np
pe = PositionalEncoding(num_hiddens=20, dropout=0)
pe.eval()
Y = pe(torch.zeros((1, 100, 20))).data.cpu().numpy()  # 1 example, 100 words with embedding dim of 20
fig = plt.figure(figsize=(8, 4))
ax = fig.add_subplot(111)
for p in [4, 5, 6, 7]:
    ax.plot(np.arange(100), Y[0, :, p].T, label=f'dim {p}')
ax.legend()
ax.grid()
```

![Positional encoding curves for dims 4-7](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918151154269.png)

Each dim in the figure above is one embedding dimension of every element in the input sequence, plotted as a function of position.

Dimensions 4 and 6 use $\sin$; dimensions 5 and 7 use $\cos$.

For the smaller dimension indices (e.g. dims 4 and 5, i.e. smaller $j$), the scaling factor is smaller, so the positional encoding changes quickly (high-frequency variation). For the larger indices (e.g. dims 6 and 7), the scaling factor is larger, so the encoding changes slowly (low-frequency variation).
class=\"language-python\"># Embeddings class: sequences -&gt; features\n\nclass Embeddings(nn.Module):\n    def __init__(self, d_model, vocab_size, max_position_embeddings, dropout=0):\n        super().__init__()\n        self.dropout = dropout\n        self.word_embeddings = nn.Embedding(vocab_size, d_model, padding_idx=1)\n        self.position_embeddings = PositionalEncoding(num_hiddens=d_model, dropout=self.dropout,\n                                                      max_len=max_position_embeddings)\n        self.LayerNorm = nn.LayerNorm(d_model, eps=1e-12)\n        self.d_model = d_model\n\n    def forward(self, input_ids):\n        seq_length = input_ids.size(1)\n\n        # Get word embeddings for each input id\n        word_embeddings = self.word_embeddings(input_ids)                   # (bs, max_seq_length, dim)\n        # Get position embeddings for the word embeddings and add them     \n        embeddings = self.position_embeddings(word_embeddings) # (bs, max_seq_length, dim)\n\n        # Layer norm \n        embeddings = self.LayerNorm(embeddings)             # (bs, max_seq_length, dim)\n        return embeddings<\/code><\/pre>\n<pre><code class=\"language-python\"># Transformer encoder\nclass EncoderLayer(nn.Module):\n    def __init__(self, d_model, num_heads, hidden_dim_mult=4, dropout=0.1):\n        super().__init__()\n\n        self.dropout = dropout\n        self.mha = MultiHeadAttention(d_model, num_heads, dropout=dropout)\n        self.ffn = FFN(d_model, hidden_dim_mult, dropout)\n\n        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)\n        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)\n\n    def forward(self, x):\n\n        # Multi-head attention \n        attn_output, _ = self.mha(x, x, x)  # (batch_size, input_seq_len, d_model)\n\n        # Layer norm after adding the residual connection \n        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)\n\n        # Feed forward \n        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)\n\n        # Second layer norm after adding residual connection \n        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)\n\n        return out2\n\nclass TransformerEncoder(nn.Module):\n    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim_mult, input_vocab_size,\n               maximum_position_encoding, dropout=0.1):\n        super().__init__()\n\n        self.d_model = d_model\n        self.num_layers = num_layers\n        self.dropout = dropout\n\n        self.embedding = Embeddings(d_model, input_vocab_size, maximum_position_encoding, dropout)\n\n        self.enc_layers = nn.ModuleList()\n        for _ in range(num_layers):\n            self.enc_layers.append(EncoderLayer(d_model, num_heads, ff_hidden_dim_mult, self.dropout))\n\n    def forward(self, x):\n        x = self.embedding(x) # Transform to (batch_size, input_seq_length, d_model)\n\n        for i in range(self.num_layers):\n            x = self.enc_layers[i](x)\n\n        return x  # (batch_size, input_seq_len, d_model)<\/code><\/pre>\n<pre><code class=\"language-python\"># Transormer classifier for sentiment analysis\nclass TransformerClassifier(nn.Module):\n    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim_mult, input_vocab_size, num_answers):\n        super().__init__()\n\n        self.encoder = TransformerEncoder(num_layers, d_model, num_heads, ff_hidden_dim_mult, input_vocab_size,\n                              
```python
# Transformer classifier for sentiment analysis
class TransformerClassifier(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim_mult, input_vocab_size, num_answers):
        super().__init__()

        self.encoder = TransformerEncoder(num_layers, d_model, num_heads, ff_hidden_dim_mult, input_vocab_size,
                                          maximum_position_encoding=10000)
        self.dense = nn.Linear(d_model, num_answers)

    def forward(self, x):
        x = self.encoder(x)  # [batch_size, seq_len, d_model]
        # pooling
        x, _ = torch.max(x, dim=1)  # [batch_size, d_model], can also use torch.mean(dim=1) or just x[:, -1]
        x = self.dense(x)  # [batch_size, num_answers]
        return x
```

```python
import pandas as pd
import torch

# Load the CSV file of the IMDB dataset
csv_path = './datasets/imdb/IMDB Dataset.csv'

# Read the CSV file
ds = pd.read_csv(csv_path)

# Print dataset info
print(f"Dataset size: {len(ds)}")
```

```
Dataset size: 50000
```

```python
# Create an iterator
data_iter = iter(ds.itertuples(index=False, name=None))
labels = [sentiment for review, sentiment in data_iter]  # Extracting sentiments from the second column
num_data = len(labels)
classes = set(labels)
num_class = len(classes)
print(f'Total examples: {num_data}, Classes: {classes}')

# Print one example
data_iter = iter(ds.itertuples(index=False, name=None))
print("Example data entry:", next(data_iter))  # (review, sentiment)
```

```
Total examples: 50000, Classes: {'positive', 'negative'}
Example data entry: ("One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.", 'positive')
```
```python
import pandas as pd
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
import torchtext

# Initialize the tokenizer
tokenizer = get_tokenizer("basic_english")
print("Tokenizer initialized.")

def yield_tokens(data_iter):
    for review, sentiment in data_iter:
        yield tokenizer(review)

print("Tokenized example:", list(yield_tokens([('This is an example review.', 'positive')])))

# Build the vocabulary
max_vocab_size = 55_000
vocab = build_vocab_from_iterator(yield_tokens(iter(ds.itertuples(index=False, name=None))), specials=["<unk>", "<pad>"], max_tokens=max_vocab_size)
vocab.set_default_index(vocab["<unk>"])
print("Vocabulary built successfully.")

# Inspect the vocabulary
print("Vocabulary example:", vocab(['here', 'is', 'an', 'example']))
```

```
Tokenizer initialized.
Tokenized example: [['this', 'is', 'an', 'example', 'review', '.']]
Vocabulary built successfully.
Vocabulary example: [136, 10, 41, 472]
```

```python
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
vocab_size = len(vocab)
model = TransformerClassifier(num_layers=1, d_model=32, num_heads=2,
                              ff_hidden_dim_mult=4, input_vocab_size=vocab_size, num_answers=2)
model.to(device)
```

```
TransformerClassifier(
  (encoder): TransformerEncoder(
    (embedding): Embeddings(
      (word_embeddings): Embedding(55000, 32, padding_idx=1)
      (position_embeddings): PositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (LayerNorm): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
    )
    (enc_layers): ModuleList(
      (0): EncoderLayer(
        (mha): MultiHeadAttention(
          (W_q): Linear(in_features=32, out_features=32, bias=False)
          (W_k): Linear(in_features=32, out_features=32, bias=False)
          (W_v): Linear(in_features=32, out_features=32, bias=False)
          (W_h): Linear(in_features=32, out_features=32, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ffn): FFN(
          (fc_1): Linear(in_features=32, out_features=128, bias=True)
          (act): ReLU(inplace=True)
          (proj): Linear(in_features=128, out_features=32, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (layernorm1): LayerNorm((32,), eps=1e-06, elementwise_affine=True)
        (layernorm2): LayerNorm((32,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (dense): Linear(in_features=32, out_features=2, bias=True)
)
```
```python
# collate_fn processes samples from the DataLoader according to the data processing pipelines declared previously.
# label is a tensor saving the labels of individual text entries.
label_pipeline = lambda x: 1 if x == "positive" else 0
text_pipeline = lambda x: vocab(tokenizer(x))

max_seq_len = 200

def collate_batch(batch):
    label_list, text_list = [], []
    for _text, _label in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text)[:max_seq_len], dtype=torch.int64)
        if processed_text.shape[0] < max_seq_len:
            pad = vocab(['<pad>'])[0] * torch.ones(max_seq_len - len(processed_text), dtype=torch.int64, device=processed_text.device)
            processed_text = torch.cat([processed_text, pad])
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.stack(text_list, dim=0)
    return label_list.to(device), text_list.to(device)
```

```python
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(f"| epoch {epoch:3d} | {idx:5d}/{len(dataloader):5d} batches| accuracy {total_acc / total_count:8.3f}")
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count
```

```python
# hyper-parameters
batch_size = 128
epochs = 10
lr = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
criterion = torch.nn.CrossEntropyLoss()
```

```python
from torch.utils.data import DataLoader, random_split
# Split dataset into train, validation, and test sets
train_ratio = 0.8
valid_ratio = 0.1
test_ratio = 0.1

train_size = int(train_ratio * num_data)
valid_size = int(valid_ratio * num_data)
test_size = num_data - train_size - valid_size

train_dataset, valid_dataset, test_dataset = random_split(ds.to_numpy(), [train_size, valid_size, test_size])

# Create DataLoader instances
batch_size = 64
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
```
```python
# train loop
total_accu = None
for epoch in range(epochs):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    total_accu = accu_val
    print("-" * 59)
    print(f"| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s | valid accuracy {accu_val:8.3f}")
    print("-" * 59)
```

```
| epoch   0 |   500/  625 batches| accuracy    0.615
-----------------------------------------------------------
| end of epoch   0 | time: 13.31s | valid accuracy    0.728
-----------------------------------------------------------
| epoch   1 |   500/  625 batches| accuracy    0.765
-----------------------------------------------------------
| end of epoch   1 | time: 12.07s | valid accuracy    0.801
-----------------------------------------------------------
| epoch   2 |   500/  625 batches| accuracy    0.825
-----------------------------------------------------------
| end of epoch   2 | time: 12.19s | valid accuracy    0.827
-----------------------------------------------------------
| epoch   3 |   500/  625 batches| accuracy    0.854
-----------------------------------------------------------
| end of epoch   3 | time: 11.99s | valid accuracy    0.838
-----------------------------------------------------------
| epoch   4 |   500/  625 batches| accuracy    0.872
-----------------------------------------------------------
| end of epoch   4 | time: 11.99s | valid accuracy    0.851
-----------------------------------------------------------
| epoch   5 |   500/  625 batches| accuracy    0.887
-----------------------------------------------------------
| end of epoch   5 | time: 12.10s | valid accuracy    0.857
-----------------------------------------------------------
| epoch   6 |   500/  625 batches| accuracy    0.900
-----------------------------------------------------------
| end of epoch   6 | time: 12.10s | valid accuracy    0.854
-----------------------------------------------------------
| epoch   7 |   500/  625 batches| accuracy    0.910
-----------------------------------------------------------
| end of epoch   7 | time: 12.12s | valid accuracy    0.857
-----------------------------------------------------------
| epoch   8 |   500/  625 batches| accuracy    0.921
-----------------------------------------------------------
| end of epoch   8 | time: 12.14s | valid accuracy    0.854
-----------------------------------------------------------
| epoch   9 |   500/  625 batches| accuracy    0.929
-----------------------------------------------------------
| end of epoch   9 | time: 12.46s | valid accuracy    0.864
-----------------------------------------------------------
```

```python
# evaluation on test set
accu_test = evaluate(test_dataloader)
print(f"test accuracy {accu_test:8.3f}")
```

```
test accuracy    0.867
```

#### Transformer's Decoder Module

---

![Decoder block](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918151318423.png)
The Transformer decoder block is very similar to the encoder block, with a few key differences. Besides the multi-head attention layer and the position-wise feed-forward network, the decoder block contains one extra sub-layer: a layer that applies multi-head attention to the encoder's output.

Cross-attention: the cross-attention mechanism is similar to self-attention and uses the same query/key/value setup, but in the decoder the inputs are slightly more involved. The data point $y_i$ received by the decoder first passes through self-attention and an add-norm block before reaching the cross-attention block. In that block, the decoder's input acts as the query, while the encoder output $h^{Enc}$ (computed from all the preceding inputs $x_1, ..., x_t$) serves as the keys and values.

Take the translation task "我是学生" → "i am a student" as an example, illustrated below.

![Cross-attention in translation](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918151921196.png)

In this translation example, the query produced by the decoder expresses what the current time step needs to translate, while the encoder's keys and values supply the information extracted from all earlier time-step inputs of the source sentence.

For instance, while the decoder is generating some word of the target sentence, it issues a query expressing which part of the source sentence it needs information from. The encoder output (keys and values) contains all the information about the source sentence, such as its grammatical structure and contextual meaning, and this information is delivered to the decoder through the keys and values.

In this way, the decoder can consult the global information of the source sentence when generating the next target-language word. This cross-attention mechanism lets the model better understand and exploit the source-language context during translation, improving both accuracy and fluency. For example, when translating a long sentence, cross-attention lets the decoder make sure that every generated word corresponds to the relevant part of the source sentence, producing a coherent and accurate translation.
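To make the wiring explicit, here is a minimal sketch of a decoder block built from the MultiHeadAttention and FFN modules defined earlier. One simplification: a real Transformer decoder applies a causal mask inside its self-attention so position $t$ cannot see later positions, but the MultiHeadAttention class above takes no mask argument, so only the cross-attention wiring is shown here.

```python
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, hidden_dim_mult=4, dropout=0.1):
        super().__init__()
        self.self_mha = MultiHeadAttention(d_model, num_heads, dropout=dropout)
        self.cross_mha = MultiHeadAttention(d_model, num_heads, dropout=dropout)
        self.ffn = FFN(d_model, hidden_dim_mult, dropout)
        self.norm1 = nn.LayerNorm(d_model, eps=1e-6)
        self.norm2 = nn.LayerNorm(d_model, eps=1e-6)
        self.norm3 = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, y, enc_output):
        # self-attention over the decoder inputs (causally masked in a full implementation)
        attn, _ = self.self_mha(y, y, y)
        y = self.norm1(y + attn)
        # cross-attention: queries come from the decoder,
        # keys and values come from the encoder output h^Enc
        attn, _ = self.cross_mha(y, enc_output, enc_output)
        y = self.norm2(y + attn)
        # position-wise feed-forward network with its own add-norm
        y = self.norm3(y + self.ffn(y))
        return y
```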
### Teacher Forcing

---

In the context of the Transformer, **teacher forcing** is a training strategy used to improve the efficiency and accuracy of sequence-to-sequence models (machine translation, text generation, and so on) during training.

- During training, the model must learn to generate an output sequence from an input sequence. For the Transformer this involves both the encoder and the decoder: the encoder encodes the input sequence into hidden representations, and the decoder generates the output sequence from those representations.
- Without teacher forcing, at every step the decoder uses the word it generated at the previous step as input for generating the next word. The problem with this is that if the model produces a wrong word at some step, the error keeps accumulating and makes later predictions ever less accurate.
- Teacher forcing solves this by feeding the true target word (the correct word from the training data) to the decoder at every step. At each generation step, the decoder uses the corresponding correct word from the training data instead of the word it generated itself. This prevents error accumulation, speeds up convergence, and improves accuracy during the training phase.

![Teacher forcing during decoding](https://gnnclub-1311496010.cos.ap-beijing.myqcloud.com/wp-content/uploads/2024/09/20240918152016113.gif)

[Animation by Jay Alammar](https://jalammar.github.io/illustrated-transformer/)

Its drawback is that at actual inference (prediction) time no ground-truth target words are available, which can make the model behave inconsistently between training and inference. In practice, researchers may therefore combine teacher forcing with other training techniques, such as Scheduled Sampling, to mitigate this problem.

Scheduled Sampling is a concrete strategy for injecting this kind of randomness into training. The basic idea is that early in training, teacher forcing is used with high probability, and as training progresses this probability is gradually lowered. The concrete steps are:
- Early phase: use the correct target word with high probability. For example, 90% of the time steps use the correct word and 10% use the model's own generated word.
- Middle phase: as training progresses, gradually lower the probability of using the correct word. For example, 50% of the time steps use the correct word and 50% use the model's generated word.
- Late phase: eventually, make the model rely almost entirely on its own generated words. For example, 10% of the time steps use the correct word and 90% use the model's generated word.

In this way, the model can quickly learn the correct patterns during training while gradually adapting to the real inference setting, improving its generalization ability and real-world performance.
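As a rough sketch, the schedule can be implemented as a teacher-forcing probability that decays over epochs; inside the decoding loop, each step then flips a coin to choose between the ground-truth token and the model's previous prediction. The linear decay and the step-level sampling below are illustrative choices, not the only ones proposed in the literature:

```python
import random

def teacher_forcing_prob(epoch, num_epochs, start=0.9, end=0.1):
    # linearly decay the probability of feeding the ground-truth token
    return start + (end - start) * epoch / max(1, num_epochs - 1)

# inside a hypothetical training-time decoding loop:
# p = teacher_forcing_prob(epoch, num_epochs)
# prev_token = target_tokens[t - 1] if random.random() < p else predicted_tokens[t - 1]
```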
<h2><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/emoji\/96\/000000\/woman-lifting-weights.png\" style=\"height:50px;display:inline\"> Pretrained Models - BERT and GPT<\/h2>\n<hr \/>\n<ul>\n<li>Large-scale pretrained models have become increasingly popular over the past few years, because big companies can train very large models and then release them to the public, to be used as-is or fine-tuned on a user&#039;s custom dataset.<\/li>\n<li><strong>Bidirectional Encoder Representations from Transformers (BERT), Google<\/strong> - a Transformer-based machine learning technique for natural language processing (NLP) pretraining developed by Google. The idea is to mask out some of the words and then try to predict them. The original English BERT model comes in two pretrained general-purpose variants:<\/li>\n<li>(1) the $BERT_{BASE}$ model, a 12-layer, 768-hidden, 12-head, 110M-parameter neural network architecture.<\/li>\n<li>(2) the $BERT_{LARGE}$ model, a 24-layer, 1024-hidden, 16-head, 340M-parameter neural network architecture.<\/li>\n<li>Both were trained on the BooksCorpus dataset of 800 million words and a version of the English Wikipedia with 2.5 billion words.<\/li>\n<li>Extensions: RoBERTa (Facebook), DistilBERT (HuggingFace).<\/li>\n<li><strong>Generative Pre-trained Transformer (GPT), OpenAI<\/strong> - an autoregressive language model that uses deep learning to produce human-like text. GPT is trained with a causal language modeling (CLM) objective and is therefore good at predicting the next token in a sequence. The proposed approach generatively pretrains a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. Unlike BERT, GPT is a generative model, while BERT is an effective pretrained model for word\/sentence embeddings.<\/li>\n<li>GPT demo - <a href=\"https:\/\/transformer.huggingface.co\/doc\/gpt\">Write With Transformer<\/a>.<\/li>\n<li><a href=\"https:\/\/huggingface.co\/\">HuggingFace<\/a> is a company dedicated to releasing all available pretrained models, also for PyTorch - <a href=\"https:\/\/github.com\/huggingface\/transformers\">HuggingFace Transformers<\/a>.<\/li>\n<li><a href=\"https:\/\/pytorch.org\/hub\/huggingface_pytorch-transformers\/\">Example with PyTorch<\/a>.<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918152408388.gif\" style=\"height:400px\">\n<\/p>\n<p><a href=\"https:\/\/jalammar.github.io\/how-gpt3-works-visualizations-animations\/\">Animation by Jay Alammar<\/a><\/p>
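<p>To get a quick feel for both families, here is a small sketch using the high-level <code>pipeline<\/code> API of HuggingFace <code>transformers<\/code> (assuming the package is installed; the checkpoint names refer to the standard Hub models):<\/p>\n<pre><code class=\"language-python\">from transformers import pipeline\n\n# BERT-style: fill in a masked word\nunmasker = pipeline(&quot;fill-mask&quot;, model=&quot;bert-base-uncased&quot;)\nprint(unmasker(&quot;A robot must obey the orders given it by [MASK] beings.&quot;))\n\n# GPT-style: autoregressive text generation\ngenerator = pipeline(&quot;text-generation&quot;, model=&quot;gpt2&quot;)\nprint(generator(&quot;A robot may not injure a human being&quot;, max_length=30))<\/code><\/pre>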
<p><a href=\"https:\/\/huggingface.co\/models\">HF Models Hub<\/a><\/p>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918152458350.png\" style=\"height:500px\">\n<\/p>\n<h2><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/bubbles\/50\/null\/picture.png\" style=\"height:50px;display:inline\"> Vision Transformer (ViT)<\/h2>\n<hr \/>\n<ul>\n<li>\n<p>We can treat image patches as our &quot;words&quot;, i.e., tokens.<\/p>\n<\/li>\n<li>\n<p>This lets us use the Transformer architecture for vision tasks!<\/p>\n<\/li>\n<li>\n<p>First, the image is split into fixed-size patches, and each patch is linearly embedded. Then 2D positional embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder.<\/p>\n<\/li>\n<li>\n<p>To perform classification, an extra learnable &quot;classification token&quot; is added to the sequence, similar to Transformer-based NLP tasks.<\/p>\n<\/li>\n<li>\n<p>Transformers need a lot of data to reach high accuracy, so with little data CNNs usually outperform Transformers.<\/p>\n<\/li>\n<li>\n<p>To reach ViT&#039;s high performance, pretraining on large datasets is typically used; this dependence on big data is explained by ViT&#039;s weak local inductive bias, an important built-in property of CNNs that ViT lacks.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/pytorch.org\/vision\/main\/models\/vision_transformer.html\">Official ViT Pre-trained Models in PyTorch<\/a>.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/github.com\/lucidrains\/vit-pytorch\">ViT Models and Examples with PyTorch<\/a>.<\/p>\n<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918152543565.gif\" style=\"height:400px\">\n<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Vision_transformer\">Image Source<\/a><\/p>\n<pre><code class=\"language-python\"># code skeleton from: https:\/\/lightning.ai\/docs\/pytorch\/latest\/notebooks\/course_UvA-DL\/11-vision-transformer.html\n\ndef img_to_patch(x, patch_size, flatten_channels=True):\n    &quot;&quot;&quot;\n    Inputs:\n        x - Tensor representing the image of shape [B, C, H, W]\n        patch_size - Number of pixels per dimension of the patches (integer)\n        flatten_channels - If True, the patches will be returned in a flattened format\n                           as a feature vector instead of an image grid.\n    &quot;&quot;&quot;\n    B, C, H, W = x.shape\n    x = x.reshape(B, C, H \/\/ patch_size, patch_size, W \/\/ patch_size, patch_size)\n    x = x.permute(0, 2, 4, 1, 3, 5)  # [B, H&#039;, W&#039;, C, p_H, p_W]\n    x = x.flatten(1, 2)  # [B, H&#039;*W&#039;, C, p_H, p_W]\n    if flatten_channels:\n        x = x.flatten(2, 4)  # [B, H&#039;*W&#039;, 
C*p_H*p_W]\n    return x\n\nclass VisionTransformer(nn.Module):\n    def __init__(\n        self,\n        embed_dim,\n        hidden_dim,\n        num_channels,\n        num_heads,\n        num_layers,\n        num_classes,\n        patch_size,\n        num_patches,\n        dropout=0.0,\n    ):\n        &quot;&quot;&quot;\n        Inputs:\n            embed_dim - Dimensionality of the input feature vectors to the Transformer\n            hidden_dim - Dimensionality of the hidden layer in the feed-forward networks\n                         within the Transformer\n            num_channels - Number of channels of the input (3 for RGB)\n            num_heads - Number of heads to use in the Multi-Head Attention block\n            num_layers - Number of layers to use in the Transformer\n            num_classes - Number of classes to predict\n            patch_size - Number of pixels that the patches have per dimension\n            num_patches - Maximum number of patches an image can have\n            dropout - Amount of dropout to apply in the feed-forward network and\n                      on the input encoding\n        &quot;&quot;&quot;\n        super().__init__()\n\n        self.patch_size = patch_size\n\n        # Layers\/Networks\n        self.input_layer = nn.Linear(num_channels * (patch_size**2), embed_dim)\n        self.transformer = nn.Sequential(\n            *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))\n        )\n        self.mlp_head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))\n        self.dropout = nn.Dropout(dropout)\n\n        # Parameters\/Embeddings\n        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))\n        self.pos_embedding = nn.Parameter(torch.randn(1, 1 + num_patches, embed_dim))\n\n    def forward(self, x):\n        # Preprocess input\n        x = img_to_patch(x, self.patch_size)\n        B, T, _ = x.shape\n        x = self.input_layer(x)\n\n        # Add CLS token and positional encoding\n        cls_token = self.cls_token.repeat(B, 1, 1)\n        x = torch.cat([cls_token, x], dim=1)\n        x = x + self.pos_embedding[:, : T + 1]\n\n        # Apply Transformer\n        x = self.dropout(x)\n        x = x.transpose(0, 1)  # [T, B, embed_dim]: the attention blocks expect sequence-first input\n        x = self.transformer(x)\n\n        # Perform classification prediction\n        cls = x[0]  # the CLS token summarizes the whole sequence\n        out = self.mlp_head(cls)\n        return out<\/code><\/pre>\n<h2><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/?size=100&id=63921&format=png&color=000000\" style=\"height:50px;display:inline\"> Swin Transformer<\/h2>\n<hr \/>\n<p>The Swin Transformer (Shifted Window Transformer) is a Transformer model designed specifically for computer vision tasks. Unlike the conventional Vision Transformer (ViT), the Swin Transformer introduces a shifted-window mechanism that improves both image modeling capability and computational efficiency.<\/p>
\n<ul>\n<li>Window partitioning: the Swin Transformer divides the input image into fixed-size windows, and the pixels inside each window are processed as one unit. This reduces the computational cost while keeping local information intact.<\/li>\n<li>Within-window computation: inside each window, the standard Transformer attention mechanism is applied to extract local features.<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918152638737.png\" style=\"height:300px\">\n<\/p>\n<ul>\n<li>Window shifting: in subsequent layers, the window positions are shifted so that pixels on the window boundaries are also fully used, strengthening the exchange of global information (see the sketch after this list).\n<p align=\"center\">\n<img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918152722956.png\" style=\"height:500px\">\n<\/p>\n<\/li>\n<\/ul>
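<p>To make the partition-and-shift step concrete, here is a minimal sketch in the spirit of the official implementation (the tensor layout and the window\/shift sizes are illustrative choices):<\/p>\n<pre><code class=\"language-python\">import torch\n\ndef window_partition(x, window_size):\n    # Split a feature map [B, H, W, C] into non-overlapping windows\n    # of shape [num_windows * B, window_size, window_size, C].\n    B, H, W, C = x.shape\n    x = x.view(B, H \/\/ window_size, window_size, W \/\/ window_size, window_size, C)\n    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)\n\nx = torch.randn(1, 8, 8, 96)      # [B, H, W, C] feature map\nwindows = window_partition(x, 4)  # attention runs independently inside each 4x4 window\n\n# Shifted windows: cyclically roll the map by half a window before partitioning,\n# so that the next layer&#039;s windows straddle the previous layer&#039;s boundaries.\nshifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))\nshifted_windows = window_partition(shifted, 4)\nprint(windows.shape, shifted_windows.shape)  # torch.Size([4, 4, 4, 96]) for both<\/code><\/pre>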
\n<h2><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/external-others-maxicons\/62\/null\/external-magic-medieval-others-maxicons.png\" style=\"height:50px;display:inline\"> How to Tame Your Transformer?<\/h2>\n<hr \/>\n<ul>\n<li>Transformers are notoriously hard to train, because they are sensitive to the size of the dataset and to the choice of hyperparameters, including the learning rate, batch size, and optimizer.<\/li>\n<li>Below are some tricks and tips that make Transformers more stable and converge faster.<\/li>\n<li>For a more detailed analysis, check out <a href=\"https:\/\/www.borealisai.com\/research-blogs\/tutorial-17-transformers-iii-training\/\">Tricks For Training Transformers - Borealis AI - P. Xu, S. Prince<\/a>.<\/li>\n<\/ul>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/clouds\/100\/rocket.png\" style=\"height:50px;display:inline\"> Initialization<\/h3>\n<hr \/>\n<ul>\n<li>Initialization matters in LLMs, not only for stability but also for final performance!\n<ul>\n<li><a href=\"https:\/\/github.com\/bigscience-workshop\/bigscience\/blob\/master\/train\/lessons-learned.md\">BLOOM: Lessons Learned in Training LLMs<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>Transformer initialization<\/strong>: there are many methods and papers, but the basic rule is to initialize from a distribution with a low standard deviation (usually a normal\/Gaussian distribution).\n<ul>\n<li><strong>T-Fixup<\/strong>: <a href=\"https:\/\/proceedings.mlr.press\/v119\/huang20f.html\">&quot;Improving Transformer Optimization Through Better Initialization&quot;. Huang et al., ICML 2020<\/a>.<\/li>\n<li><strong>DT-Fixup<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2012.15355\">&quot;Optimizing deeper transformers on small datasets&quot;. Peng et al., ACL 2021<\/a>.<\/li>\n<li><strong>Admin<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2004.08249\">&quot;Understanding the difficulty of training transformers&quot;. Liu et al., EMNLP 2020<\/a>.<\/li>\n<li><strong>GradInit<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2102.08098\">&quot;GradInit: Learning to initialize neural networks for stable and efficient training&quot;. Chen et al., NeurIPS 2021<\/a>.<\/li>\n<li><strong>DS-Init<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/1908.11365\">&quot;Improving deep transformer with depth-scaled initialization and merged attention&quot;. Biao et al., 2019<\/a>.<\/li>\n<li><strong>Mimetic-Init<\/strong> (ViTs): <a href=\"https:\/\/arxiv.org\/abs\/2305.09828\">&quot;Mimetic Initialization of Self-Attention Layers&quot;. Trockman and Kolter, 2023<\/a>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-python\"># GPT initialization example\n# https:\/\/github.com\/karpathy\/minGPT\/blob\/master\/mingpt\/model.py\ndef _init_weights(self, module):\n    if isinstance(module, nn.Linear):\n        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)\n        if module.bias is not None:\n            torch.nn.init.zeros_(module.bias)\n    elif isinstance(module, nn.Embedding):\n        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)\n    elif isinstance(module, nn.LayerNorm):\n        torch.nn.init.zeros_(module.bias)\n        torch.nn.init.ones_(module.weight)<\/code><\/pre>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/external-flaticons-lineal-color-flat-icons\/64\/null\/external-workout-running-flaticons-lineal-color-flat-icons-4.png\" style=\"height:50px;display:inline\"> Learning Rate Warm-Up<\/h3>\n<hr \/>\n<ul>\n<li><strong>Learning rate warm-up<\/strong>: the learning rate is gradually increased during the early stages of training.<\/li>\n<li>While this is usually not required for most deep learning architectures, for Transformers training simply fails if we start from a typical learning rate.<\/li>\n<li>If we start <em>from a very small learning rate<\/em>, training is stable but then takes an impractically long time.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2002.04745\">Xiong et al., 2020<\/a> ran several experiments with different optimizers and learning rate schedules. Their results show that <strong>learning rate warm-up is crucial for both Adam and SGD<\/strong>, and that training is sensitive to the number of warm-up steps.<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918152824502.png\" 
style=\"height:500px\">\n<\/p>\n<ul>\n<li>\u867d\u7136\u5b66\u4e60\u7387\u9884\u70ed\u6709\u6548\uff0c\u4f46\u5b83\u6709\u4e00\u4e9b\u660e\u663e\u7684\u7f3a\u70b9\u2014\u2014\u5b83\u5f15\u5165\u4e86\u4e00\u4e2a\u989d\u5916\u7684\u8d85\u53c2\u6570\u2014\u2014\u9884\u70ed\u6b65\u9aa4\u7684\u6570\u91cf\uff0c\u5e76\u4e14\u5b83\u5c06\u5b66\u4e60\u7387\u521d\u59cb\u5316\u4e3a\u96f6\uff0c\u8fd9\u4f1a\u51cf\u6162\u8bad\u7ec3\u901f\u5ea6\u3002<\/li>\n<\/ul>\n<pre><code class=\"language-python\">import numpy as np\nimport matplotlib.pyplot as plt\n\nclass CosineWarmupScheduler:\n    def __init__(self, optimizer, warmup, max_iters):\n        self.optimizer = optimizer\n        self.warmup = warmup\n        self.max_num_iters = max_iters\n\n    def get_lr_factor(self, epoch):\n        lr_factor = 0.5 * (1 + np.cos(np.pi * epoch \/ self.max_num_iters))\n        if epoch &lt;= self.warmup:\n            lr_factor *= epoch * 1.0 \/ self.warmup\n        return lr_factor\n\n    def get_lr(self):\n        lr_factor = self.get_lr_factor(epoch=self.last_epoch)\n        return [base_lr * lr_factor for base_lr in self.base_lrs]\n\n# Parameters for visualization\nwarmup = 300  # Example warmup period\nmax_iter = 5000  # Example total iterations\n\n# Initialize scheduler\nscheduler = CosineWarmupScheduler(optimizer=None, warmup=warmup, max_iters=max_iter)\n\n# Generate learning rate factors for each epoch\nepochs = np.arange(max_iter)\nlr_factors = [scheduler.get_lr_factor(epoch) for epoch in epochs]\n\n# Plot the learning rate schedule\nplt.figure(figsize=(10, 6))\nplt.plot(epochs, lr_factors, label=&#039;Learning Rate Factor&#039;)\nplt.axvline(x=warmup, color=&#039;r&#039;, linestyle=&#039;--&#039;, label=&#039;End of Warmup&#039;)\nplt.title(&#039;Cosine Warmup Scheduler&#039;)\nplt.xlabel(&#039;Iteration&#039;)\nplt.ylabel(&#039;Learning Rate Factor&#039;)\nplt.legend()\nplt.grid(True)\nplt.show()\n<\/code><\/pre>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918153113291.png\" style=\"height:400px\">\n<\/p>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/external-flaticons-lineal-color-flat-icons\/64\/external-activation-media-agency-flaticons-lineal-color-flat-icons.png\" style=\"height:50px;display:inline\"> GLU Variants Activations<\/h3>\n<hr \/>\n<ul>\n<li>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2002.05202\">GLU Variants Improve Transformer<\/a> - Noam Shazeer, 2020.<\/p>\n<\/li>\n<li>\n<p>Transformer \u7684 FFN\uff08MLP\uff09\u90e8\u5206\u4e2d\u7684 ReLU \u6fc0\u6d3b\u53ef\u4ee5\u7528\u95e8\u63a7\u7ebf\u6027\u5355\u5143 (GLU) \u7cfb\u5217\u7684\u53d8\u4f53\u66ff\u6362\uff0c\u4ee5\u63d0\u9ad8\u6027\u80fd\u3002<\/p>\n<\/li>\n<li>\n<p>\u95e8\u63a7\u7ebf\u6027\u5355\u5143\u7531\u4e24\u4e2a\u7ebf\u6027\u6295\u5f71\u7684\u5206\u91cf\u4e58\u79ef\u7ec4\u6210\uff0c\u5176\u4e2d\u4e00\u4e2a\u9996\u5148\u901a\u8fc7 S \u578b\u51fd\u6570\u3002<\/p>\n<p align=\"center\">\n<img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918153003956.png\" style=\"height:150px\">\n<\/p>\n<\/li>\n<li>\n<p>\u539f\u56e0\u662f\u4ec0\u4e48\uff1f<\/p>\n<\/li>\n<\/ul>\n<p>We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.\u201d<\/p>\n<pre><code class=\"language-python\">from torch.nn import functional as F\n# replace FFN with FFNSwiglu\nclass FFNSwiglu(nn.Module):\n    def 
<pre><code class=\"language-python\">import torch\nimport torch.nn as nn\nfrom torch.nn import functional as F\n\n# replace the standard FFN with FFNSwiglu\nclass FFNSwiglu(nn.Module):\n    def __init__(self, d_model, hidden_dim_multiplier=4, resid_pdrop=0.1):\n        super().__init__()\n        self.w1 = nn.Linear(d_model, hidden_dim_multiplier * d_model, bias=False)\n        self.w2 = nn.Linear(hidden_dim_multiplier * d_model, d_model, bias=False)\n        self.w3 = nn.Linear(d_model, hidden_dim_multiplier * d_model, bias=False)\n\n        self.dropout = nn.Dropout(resid_pdrop)\n\n    def forward(self, x):\n        # SwiGLU: SiLU-gated product of two projections, then the output projection\n        x = self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))\n        return x<\/code><\/pre>\n<pre><code class=\"language-python\"># Visualize the shape of the function\ndef visualize_ffn_swiglu(d_model):\n    # Initialize the layer\n    ffn_swiglu = FFNSwiglu(d_model)\n\n    # Generate input data\n    x = torch.linspace(-10, 10, 100).view(-1, 1).repeat(1, d_model)\n\n    # Get the layer&#039;s output\n    with torch.no_grad():\n        y = ffn_swiglu(x).mean(dim=1).numpy()  # average to simplify the visualization\n\n    # Plot the function shape\n    plt.figure(figsize=(10, 6))\n    plt.plot(x[:, 0].numpy(), y, label=&#039;FFNSwiglu Output&#039;)\n    plt.title(&#039;FFNSwiglu Activation Function Shape&#039;)\n    plt.xlabel(&#039;Input&#039;)\n    plt.ylabel(&#039;Output&#039;)\n    plt.legend()\n    plt.grid(True)\n    plt.show()\n\n# Call the visualization function\nvisualize_ffn_swiglu(d_model=1)<\/code><\/pre>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918153046364.png\" style=\"height:400px\">\n<\/p>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918153157728.png\" style=\"height:250px\">\n<\/p>\n<ul>\n<li>Tip: FFNSwiglu is itself learnable - the gate is parameterized, so its shape adapts during training.<\/li>\n<\/ul>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/nolan\/64\/replace.png\" style=\"height:50px;display:inline\"> Alternatives to (Post) Layer Normalization<\/h3>\n<hr \/>\n<ul>\n<li>Pre-Layer Normalization (Pre-LN) Transformers are a variant of the Transformer model that, unlike the standard Transformer, applies Layer Normalization <em>before<\/em> each sub-layer (the sub-layers being multi-head self-attention and the feed-forward network). This design differs from the standard Transformer&#039;s Post-LN, which normalizes <em>after<\/em> each sub-layer.<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918153241539.png\" style=\"height:400px\">\n<\/p>\n<ul>\n<li>Applying layer normalization before each sub-layer helps stabilize the gradients, making training more stable, especially in deep networks, and reducing the risk of vanishing or exploding gradients.<\/li>\n<\/ul>\n<pre><code class=\"language-python\">class EncoderLayerPreLN(nn.Module):\n    def __init__(self, d_model, num_heads, hidden_dim_mult=4, dropout=0.1):\n        super().__init__()\n\n        self.dropout = dropout\n        self.mha = MultiHeadAttention(d_model, num_heads, 
dropout=dropout)\n        self.ffn = FFN(d_model, hidden_dim_mult, dropout)\n\n        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)\n        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)\n\n    def forward(self, x):\n\n        # pre-ln: normalize before the sub-layer, but keep the residual path un-normalized\n        h = self.layernorm1(x)\n\n        # Multi-head attention\n        attn_output, _ = self.mha(h, h, h)  # (batch_size, input_seq_len, d_model)\n\n        # the first residual connection\n        out1 = x + attn_output  # (batch_size, input_seq_len, d_model)\n\n        # Feed forward + pre-ln\n        ffn_output = self.ffn(self.layernorm2(out1))  # (batch_size, input_seq_len, d_model)\n\n        # the second residual connection\n        out2 = out1 + ffn_output  # (batch_size, input_seq_len, d_model)\n\n        return out2<\/code><\/pre>\n<pre><code class=\"language-python\"># in pytorch set `norm_first=True`\nmodel = torch.nn.Transformer(d_model=512,\n                             nhead=8, num_encoder_layers=6,\n                             num_decoder_layers=6,\n                             dim_feedforward=2048,\n                             dropout=0.1,\n                             activation=&#039;gelu&#039;,\n                             batch_first=True,\n                             norm_first=True)  # pre-ln: norm_first=True<\/code><\/pre>\n<ul>\n<li><strong>ReZero<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2003.04887\">Bachlechner et al., 2020<\/a> suggest removing layer normalization and introducing a trainable parameter $\\alpha$ for each residual layer, so that the self-attention residual block becomes $\\mathbf{X} + \\alpha\\,\\mathrm{MhSa}[\\mathbf{X}]$, where $\\alpha$ is initialized to zero.<\/li>\n<li>As a result, at initialization every residual block reduces to the identity, and the contributions of the self-attention and MLP layers are introduced gradually and adaptively during training.<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918155411604.png\" style=\"height:200px\">\n<\/p>\n<pre><code class=\"language-python\"># for more examples, check out https:\/\/github.com\/majumderb\/rezero, \n# https:\/\/github.com\/tbachlechner\/ReZero-examples\/blob\/master\/ReZero-Deep_Fast_Transformer.ipynb\nclass EncoderLayerReZero(nn.Module):\n    def __init__(self, d_model, num_heads, hidden_dim_mult=4, dropout=0.1):\n        super().__init__()\n\n        self.dropout = dropout\n        self.mha = MultiHeadAttention(d_model, num_heads, dropout=dropout)\n        self.ffn = FFN(d_model, hidden_dim_mult, dropout)\n\n        # instead of LN, we use a learnable alpha parameter initialized to zero\n        self.resweight = nn.Parameter(torch.tensor([0.0]), requires_grad=True)\n\n    def forward(self, x):\n        # Multi-head attention\n        attn_output, _ = self.mha(x, x, x)  # (batch_size, input_seq_len, d_model)\n\n        # the first residual connection + rezero\n        out1 = x + attn_output * self.resweight  # (batch_size, input_seq_len, d_model)\n\n        # Feed forward\n        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)\n\n        # the second residual connection + rezero\n        out2 = out1 + ffn_output * self.resweight  # (batch_size, input_seq_len, d_model)\n\n        return out2<\/code><\/pre>\n<ul>\n<li><strong>SandwichNorm<\/strong>: why not both?\n<ul>\n<li>The technique first appeared in the <a href=\"https:\/\/arxiv.org\/pdf\/2105.13290\">CogView<\/a> paper, a Chinese counterpart of the famous text-to-image transformer DALL-E.<\/li>\n<li>They suggest adding an extra LN to all branch outputs when using Pre-LN (see the sketch after this list).<\/li>\n<li>Some have found this very effective when facing instability during training.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918155514853.png\" style=\"height:200px\">\n<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2105.13290\">Image Source<\/a><\/p>
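<p>A minimal sketch of the sandwich placement for one encoder layer, reusing the <code>MultiHeadAttention<\/code> and <code>FFN<\/code> modules from the examples above (the layer name and exact structure here are our own illustration):<\/p>\n<pre><code class=\"language-python\">class EncoderLayerSandwichNorm(nn.Module):\n    def __init__(self, d_model, num_heads, hidden_dim_mult=4, dropout=0.1):\n        super().__init__()\n        self.mha = MultiHeadAttention(d_model, num_heads, dropout=dropout)\n        self.ffn = FFN(d_model, hidden_dim_mult, dropout)\n        # Pre-LN before each branch, plus an extra LN on each branch output\n        self.ln_in1, self.ln_out1 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)\n        self.ln_in2, self.ln_out2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)\n\n    def forward(self, x):\n        h = self.ln_in1(x)\n        attn_output, _ = self.mha(h, h, h)\n        x = x + self.ln_out1(attn_output)  # sandwich: normalize the branch output too\n        x = x + self.ln_out2(self.ffn(self.ln_in2(x)))\n        return x<\/code><\/pre>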
<ul>\n<li><strong>RMSNorm<\/strong>: LN, but without mean centering and learned bias.\n<ul>\n<li>Faster than LN.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2102.11972\">An investigative paper<\/a> found it to be the best-performing normalization variant.<\/li>\n<\/ul>\n<\/li>\n<li>Extensive experiments on multiple tasks with different network architectures show that RMSNorm achieves performance comparable to LayerNorm while reducing the running time by 7%\u223c64% across models.<\/li>\n<li>It is typically used in a Post-LN configuration.<\/li>\n<\/ul>\n<p>$$ y_i = \\text{RMSNorm}(x_i)=\\gamma_i \\hat{x}_i \\in \\mathbb{R}^d $$<br \/>\n$$ \\hat{x}_i = \\frac{x_i}{\\sqrt{\\frac{1}{d}\\sum_{l=1}^d x_{i,l}^2}} $$<\/p>\n<pre><code class=\"language-python\"># https:\/\/github.com\/lucidrains\/x-transformers\nclass RMSNorm(nn.Module):\n    def __init__(self, dim, eps = 1e-8):\n        super().__init__()\n        self.scale = dim ** -0.5\n        self.eps = eps\n        self.g = nn.Parameter(torch.ones(dim))\n\n    def forward(self, x):\n        # the scaled L2 norm equals the root mean square over the last dimension\n        norm = torch.norm(x, dim=-1, keepdim=True) * self.scale\n        return self.g * x \/ norm.clamp(min=self.eps)<\/code><\/pre>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918155634512.png\" style=\"height:300px\">\n<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2102.11972\">Image Source<\/a><\/p>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/clouds\/100\/null\/support.png\" style=\"height:50px;display:inline\"> Rectified Adam (RAdam) - Reducing Adam&#039;s Variance<\/h3>\n<hr \/>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1908.03265\">Liu et al., (2019)<\/a> argue that the high variance of the adaptive learning rate in the Adam optimizer early in training is caused by the limited number of samples seen in the early stages.<\/li>\n<\/ul>\n<p>They base this view on an experiment: during the first 2000 training steps they do not change the model parameters or Adam&#039;s momentum terms, and only adjust the learning rate.<br \/>\nAfter this period, warm-up is no longer needed!<br \/>
\nThey propose a new optimization method, called <strong>Rectified Adam, or RAdam<\/strong>, which gradually adjusts the momentum term over the course of training to avoid this high variance.<br \/>\nOne way to think about it: we effectively fold the learning rate warm-up into the Adam algorithm itself, but in a theoretically grounded way.<br \/>\n<a href=\"https:\/\/nn.labml.ai\/optimizers\/radam.html\">Step-by-step algorithm and implementation<\/a>.<\/p>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918155713752.png\" style=\"height:300px\">\n<\/p>\n<ul>\n<li>Training loss vs. number of iterations of Transformers on the De-En IWSLT&#039;14 dataset (machine translation).<\/li>\n<\/ul>\n<pre><code class=\"language-python\"># RAdam in pytorch: https:\/\/pytorch.org\/docs\/stable\/generated\/torch.optim.RAdam.html#torch.optim.RAdam\noptimizer = torch.optim.RAdam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)<\/code><\/pre>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/nolan\/64\/123.png\" style=\"height:50px;display:inline\"> Positional Encodings\/Embeddings\/Bias<\/h3>\n<hr \/>\n<ul>\n<li>The Transformer removes the usual inductive biases (e.g., the locality of CNNs).<\/li>\n<li>Positional embeddings are therefore crucial for sequence modeling with Transformers.<\/li>\n<li>Positional encodings can be <em>learnable<\/em> or <em>fixed<\/em>, and <strong>absolute<\/strong> or <strong>relative<\/strong>.<\/li>\n<li><strong>Vanilla Transformer<\/strong>: fixed absolute positional encodings (sines and cosines); a minimal sketch follows this list.<\/li>\n<li>Moreover, instead of adding positional encodings to the input before the Transformer, they can be <strong>injected directly into the attention matrix<\/strong>, which usually yields better performance.<\/li>\n<li>For example, GPT-3 uses <em>learnable<\/em> <strong>absolute<\/strong> positional encodings, while T5 uses <em>learnable<\/em> <strong>relative<\/strong> position biases.<\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918155759409.png\" style=\"height:200px\">\n<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2102.11090\">Image Source<\/a><\/p>
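<p>For reference, a minimal sketch of the fixed sinusoidal encoding from the original Transformer paper (the <code>max_len<\/code> cap and the batch-first layout are implementation choices):<\/p>\n<pre><code class=\"language-python\">import math\nimport torch\nimport torch.nn as nn\n\nclass SinusoidalPositionalEncoding(nn.Module):\n    def __init__(self, d_model, max_len=5000):\n        super().__init__()\n        pe = torch.zeros(max_len, d_model)\n        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)\n        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) \/ d_model))\n        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine\n        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine\n        self.register_buffer(&quot;pe&quot;, pe.unsqueeze(0))  # [1, max_len, d_model]\n\n    def forward(self, x):  # x: [B, T, d_model]\n        return x + self.pe[:, : x.size(1)]<\/code><\/pre>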
<ul>\n<li><strong>Relative Positional Encoding (RPE)<\/strong>: relative positional encodings are added directly to the attention matrix!<\/li>\n<li>Also known as &quot;relative position bias&quot;.<\/li>\n<li>Currently, RPE outperforms absolute positional encoding (APE) and has become the standard in recent LLMs.<\/li>\n<li>There are several ways to compute the relative position bias matrix; the values can be learned or predetermined.<\/li>\n<li>Some popular recent positional encodings:\n<ul>\n<li>Simple relative position bias (used in T5).<\/li>\n<li>Attention with Linear Biases (ALiBi).<\/li>\n<li>Rotary Position Embedding (RoPE, used in PaLM).<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/github.com\/lucidrains\/x-transformers\/blob\/52bcac25437064757d8c4e5bd9e77b9598b462bb\/x_transformers\/x_transformers.py#L227\">Code Examples<\/a><\/li>\n<\/ul>\n<p align=\"center\">\n  <img decoding=\"async\" src=\"https:\/\/gnnclub-1311496010.cos.ap-beijing.myqcloud.com\/wp-content\/uploads\/2024\/09\/20240918155836464.png\" style=\"height:300px\">\n<\/p>\n<p><a href=\"https:\/\/paperswithcode.com\/method\/relative-position-encodings\">Image Source<\/a><\/p>\n<h3><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/bubbles\/50\/refresh.png\" style=\"height:50px;display:inline\"> Staying Up-to-Date with Transformers<\/h3>\n<hr \/>\n<ul>\n<li>The field is moving extremely fast!<\/li>\n<li>How can we keep track of all the new improvements?<\/li>\n<li>Recommended repository: <a href=\"https:\/\/github.com\/lucidrains\/x-transformers\">https:\/\/github.com\/lucidrains\/x-transformers<\/a><\/li>\n<li>Other recommended repositories:<\/li>\n<li><a href=\"https:\/\/github.com\/facebookresearch\/fairseq\">https:\/\/github.com\/facebookresearch\/fairseq<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/microsoft\/unilm\">https:\/\/github.com\/microsoft\/unilm<\/a><\/li>\n<\/ul>\n<h2><img decoding=\"async\" src=\"https:\/\/img.icons8.com\/dusk\/64\/000000\/prize.png\" style=\"height:50px;display:inline\"> Credits<\/h2>\n<hr \/>\n<ul>\n<li>Icons made by <a href=\"https:\/\/www.flaticon.com\/authors\/becris\" title=\"Becris\">Becris<\/a> from <a href=\"https:\/\/www.flaticon.com\/\" title=\"Flaticon\">www.flaticon.com<\/a><\/li>\n<li>Icons from <a href=\"https:\/\/icons8.com\/\">Icons8.com<\/a> - <a href=\"https:\/\/icons8.com\">https:\/\/icons8.com<\/a><\/li>\n<li><a href=\"https:\/\/d2l.ai\/chapter_recurrent-neural-networks\/index.html\">Dive Into Deep Learning - Recurrent Neural Networks<\/a><\/li>\n<li><a href=\"https:\/\/atcold.github.io\/pytorch-Deep-Learning\/en\/week12\/12-1\/\">DS-GA 1008 - NYU CENTER FOR DATA SCIENCE - Deep Sequence Modeling<\/a><\/li>\n<li><a href=\"https:\/\/pytorch.org\/tutorials\/beginner\/text_sentiment_ngrams_tutorial.html\">Text classification with the torchtext library<\/a><\/li>\n<li><a href=\"https:\/\/www.borealisai.com\/research-blogs\/tutorial-17-transformers-iii-training\/\">Tricks For Training Transformers - Borealis AI - P. Xu, S. 
Prince<\/a><\/li>\n<li><a href=\"https:\/\/taldatech.github.io\">Tal Daniel<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Deep Learning create by Arwin Yu Tutorial 05 &#8211; Transfor [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2005,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18,24],"tags":[19],"class_list":["post-1992","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-18","category-24","tag-19"],"_links":{"self":[{"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/posts\/1992","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/gnn.club\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1992"}],"version-history":[{"count":6,"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/posts\/1992\/revisions"}],"predecessor-version":[{"id":2094,"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/posts\/1992\/revisions\/2094"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/gnn.club\/index.php?rest_route=\/wp\/v2\/media\/2005"}],"wp:attachment":[{"href":"http:\/\/gnn.club\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1992"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/gnn.club\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1992"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/gnn.club\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1992"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}