
[1] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

by JS LAB 2025. 3. 23.

 

Contribution

  • They propose a novel Multi-gate Mixture-of-Experts model which explicitly models task relationships.
  • They conduct control experiments on synthetic data. They report how task relatedness affects training dynamics in multi-task learning and how MMoE improves both model expressiveness and trainability.
  • They conduct experiments on real benchmark data and a large-scale production recommendation system with hundreds of millions of users and items.

 

Each expert is a feed-forward network.

The gating networks take the input features and output softmax gates that combine the experts with different weights, so different tasks can use the experts differently.

The combined expert outputs are then passed into the task-specific tower networks.

In this way, the gating network for each task can learn its own mixture pattern over the experts, and thereby capture the task relationships.

 

+ They conduct a synthetic experiment.

Benchmark dataset: UCI Census-income dataset

 


Modeling Approaches

$$
\begin{align}
y_k &= h^k(f(x)) \tag{1} \\
y &= \sum_{i=1}^{n} g(x)_i f_i(x) \tag{5}
\end{align}
$$

 

  • Eq (1) is the Shared-bottom Multi-task Model.
  • $K$ tasks,
  • shared-bottom network $f$,
  • $K$ tower networks $h^k$.
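As a rough illustration of Eq (1), here is a minimal pure-Python sketch of a shared-bottom forward pass. The layer sizes, the single ReLU layer used for $f$, the single linear layer used for each tower $h^k$, and all helper names (`linear`, `relu`, `rand_matrix`, `shared_bottom`) are assumptions made for this example, not details from the paper.

```python
import random

def linear(W, x):
    """Matrix-vector product W @ x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, hidden, K = 4, 3, 2   # feature dim, bottom width, number of tasks (illustrative)

W_bottom = rand_matrix(hidden, d, rng)                       # shared bottom f (assumed: one ReLU layer)
W_towers = [rand_matrix(1, hidden, rng) for _ in range(K)]   # one tower h^k per task (assumed: linear)

def shared_bottom(x):
    shared = relu(linear(W_bottom, x))                       # f(x), computed once for all tasks
    return [linear(W_k, shared)[0] for W_k in W_towers]      # y_k = h^k(f(x)), Eq (1)

x = [0.5, -1.0, 2.0, 0.1]
y = shared_bottom(x)   # list of K scalar task outputs
```

Every task reads the same representation `shared`; only the towers differ.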

 

  • Eq (5) is the Original MoE Model.
  • $f_i$, $i = 1, \dots, n$, are the $n$ expert networks,
  • $g$ is the gating network that ensembles the outputs of all experts.
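Eq (5) can be sketched in a few lines of pure Python. This is only an illustration: the gate is assumed to be a softmax over a linear map of $x$, each expert is assumed to be a single ReLU layer, and the names `moe_forward`, `W_experts`, and `W_g` are hypothetical.

```python
import math
import random

def linear(W, x):
    """Matrix-vector product W @ x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)                               # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def moe_forward(x, W_experts, W_g):
    """Return (y, g): the gated mixture of expert outputs and the gate vector."""
    expert_out = [relu(linear(W_e, x)) for W_e in W_experts]   # f_i(x), i = 1..n
    g = softmax(linear(W_g, x))                                # g(x), assumed softmax gate
    width = len(expert_out[0])
    y = [sum(g_i * out[j] for g_i, out in zip(g, expert_out))  # y = sum_i g(x)_i f_i(x)
         for j in range(width)]
    return y, g

rng = random.Random(0)
d, hidden, n = 4, 3, 3   # feature dim, expert width, number of experts (illustrative)
W_experts = [rand_matrix(hidden, d, rng) for _ in range(n)]
W_g = rand_matrix(n, d, rng)

x = [0.5, -1.0, 2.0, 0.1]
y, g = moe_forward(x, W_experts, W_g)
```

Because the gate weights are non-negative and sum to 1, each coordinate of `y` is a convex combination of the corresponding expert outputs.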

The new model is called Multi-gate Mixture-of-Experts (MMoE) model, where the key idea is to substitute the shared bottom network $ f $ in Eq (1) with the MoE layer in Eq (5).

 

$$
\begin{align}
y_k = h^k(f^k(x)) \tag{6} \\
f^k(x) = \sum_{i=1}^{n} g^k(x)_i f_i(x) \tag{7} \\
g^k(x) = \text{softmax}(W_{gk} \cdot x) \tag{8}
\end{align}
$$

 

  • Eq (5) is modified into Eq (7), because $g^k$ is a separate gating network for each task $k$.
  • $ W_{gk} \in \mathbb{R}^{n \times d} $ is a trainable matrix.
    • $n$ is the number of experts
    • $d$ is the feature dimension
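Putting Eq (6) to Eq (8) together, a minimal pure-Python sketch of one MMoE forward pass might look like the following. The experts are assumed to be single ReLU layers and the towers single linear layers with scalar output; the sizes and the names `mmoe_forward`, `W_experts`, `W_gates`, and `W_towers` are hypothetical choices for this example.

```python
import math
import random

def linear(W, x):
    """Matrix-vector product W @ x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)                               # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, W_experts, W_gates, W_towers):
    """Return (task outputs y_k, gate vectors g^k(x)) for a single input x."""
    expert_out = [relu(linear(W_e, x)) for W_e in W_experts]    # f_i(x), shared by all tasks
    width = len(expert_out[0])
    outputs, gates = [], []
    for W_gk, W_hk in zip(W_gates, W_towers):
        g_k = softmax(linear(W_gk, x))                          # Eq (8): g^k(x) = softmax(W_gk x)
        f_k = [sum(g * out[j] for g, out in zip(g_k, expert_out))
               for j in range(width)]                           # Eq (7): task-specific mixture
        outputs.append(linear(W_hk, f_k)[0])                    # Eq (6): y_k = h^k(f^k(x))
        gates.append(g_k)
    return outputs, gates

rng = random.Random(0)
d, hidden, n, K = 4, 3, 3, 2   # feature dim, expert width, experts, tasks (illustrative)
W_experts = [rand_matrix(hidden, d, rng) for _ in range(n)]
W_gates = [rand_matrix(n, d, rng) for _ in range(K)]     # each W_gk has shape n x d
W_towers = [rand_matrix(1, hidden, rng) for _ in range(K)]

x = [0.5, -1.0, 2.0, 0.1]
y, gates = mmoe_forward(x, W_experts, W_gates, W_towers)
```

The experts are evaluated once and reused by every task; only the gate and the tower are task-specific, which is what lets each task weight the same experts differently.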

 

 