Contribution
- They propose a novel Multi-gate Mixture-of-Experts model which explicitly models task relationships.
- They conduct control experiments on synthetic data. They report how task relatedness affects training dynamics in multi-task learning and how MMoE improves both model expressiveness and trainability.
- They conduct experiments on real benchmark data and a large-scale production recommendation system with hundreds of millions of users and items.
Each expert is a feed-forward network.
The gating networks take the input features and output softmax gates that assemble the experts with different weights, allowing different tasks to utilize the experts differently.
The results of the assembled experts are then passed into the task-specific tower networks.
In this way, the gating networks for different tasks can learn different mixture patterns over the experts and thus capture the task relationships.
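To make the gate-weighted assembly concrete, here is a tiny NumPy sketch (the expert outputs, gate logits, and dimensions are made up for illustration) showing how two task-specific softmax gates mix the same three expert outputs into different combinations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three experts, each producing a 2-dimensional representation for one input (toy values).
expert_outputs = np.array([[0.2, 1.0],    # expert 1
                           [0.9, -0.5],   # expert 2
                           [0.1, 0.3]])   # expert 3

# Each task has its own gate logits over the three experts (toy values).
gate_logits = {"task A": np.array([2.0, 0.1, -1.0]),
               "task B": np.array([-1.0, 0.5, 2.0])}

for task, logits in gate_logits.items():
    weights = softmax(logits)          # softmax gate for this task
    mixed = weights @ expert_outputs   # weighted sum of expert outputs
    print(task, "gates:", weights.round(3), "mixed:", mixed.round(3))
```

Because the two gates put their weight on different experts, the two tasks receive different mixtures of the same shared expert outputs.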
- They conduct a synthetic experiment.
Datasets: UCI Census-income dataset
Modeling Approaches
$$
\begin{align}
y_k = h^k(f(x)) \tag{1} \\
y = \sum_{i=1}^{n} g(x)_i f_i(x) \tag{5}
\end{align}
$$
- Eq (1) is the Shared-bottom Multi-task Model.
- $K$ tasks,
- shared-bottom network $f$,
- $K$ tower networks $h^k$.
- Eq (5) is the Original MoE Model.
  - $ f_i $ is one of the $n$ expert networks,
  - $ g $ is the gating network that ensembles the outputs of all experts (see the sketch after this list).
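As a reference point before MMoE, below is a minimal PyTorch sketch of the shared-bottom model of Eq (1) and the original one-gate MoE layer of Eq (5); the class names, layer sizes, and ReLU activations are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SharedBottom(nn.Module):
    """Eq (1): y_k = h^k(f(x)) -- one shared bottom f, K task towers h^k."""
    def __init__(self, input_dim, bottom_dim, tower_dim, num_tasks):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(input_dim, bottom_dim), nn.ReLU())
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(bottom_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks))

    def forward(self, x):
        shared = self.bottom(x)                           # f(x), shared by all tasks
        return [tower(shared) for tower in self.towers]   # y_k = h^k(f(x))


class OneGateMoE(nn.Module):
    """Eq (5): y = sum_i g(x)_i * f_i(x) -- a single softmax gate over n experts."""
    def __init__(self, input_dim, expert_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts))
        self.gate = nn.Linear(input_dim, num_experts, bias=False)   # gate logits

    def forward(self, x):
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n, expert_dim)
        weights = torch.softmax(self.gate(x), dim=-1)                   # g(x), (B, n)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)         # weighted expert sum
```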
The new model is called the Multi-gate Mixture-of-Experts (MMoE) model, where the key idea is to substitute the shared-bottom network $ f $ in Eq (1) with the MoE layer in Eq (5).
$$
\begin{align}
y_k = h^k(f^k(x)) \tag{6} \\
f^k(x) = \sum_{i=1}^{n} g^k(x)_i f_i(x) \tag{7} \\
g^k(x) = \text{softmax}(W_{gk} \cdot x) \tag{8}
\end{align}
$$
- Eq (5) is modified into Eq (7), since $g^k$ is a separate gating network for each task $k$ (a minimal sketch follows this list).
- $ W_{gk} \in \mathbb{R}^{n \times d} $ is a trainable matrix.
- $n$ is the number of experts
- $d$ is the feature dimension
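Putting Eqs (6)-(8) together, here is a minimal PyTorch sketch of an MMoE layer with per-task gates and towers; the dimensions, names, and single-hidden-layer experts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, input_dim, expert_dim, tower_dim, num_experts, num_tasks):
        super().__init__()
        # n shared experts f_i, each a feed-forward network (assumed one hidden layer).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts))
        # One gating network per task: g^k(x) = softmax(W_gk x), W_gk in R^{n x d} (Eq 8).
        self.gates = nn.ModuleList(
            nn.Linear(input_dim, num_experts, bias=False) for _ in range(num_tasks))
        # One tower network h^k per task (Eq 6).
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(expert_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks))

    def forward(self, x):
        # Shared expert outputs, stacked as (batch, n, expert_dim).
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1)                  # g^k(x), (batch, n)
            mixed = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)  # f^k(x), Eq (7)
            outputs.append(tower(mixed))                              # y_k = h^k(f^k(x)), Eq (6)
        return outputs

# Usage: two task outputs for a batch of 32 examples with d = 16 input features.
model = MMoE(input_dim=16, expert_dim=8, tower_dim=4, num_experts=3, num_tasks=2)
y1, y2 = model(torch.randn(32, 16))
```

Compared with the shared-bottom sketch above, the only structural change is the per-task gate, so each task mixes the shared experts with its own weights $g^k(x)$.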