Train a Paris-Style Decentralized Expert Model

Vision-Language & Cross-Modal Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Train a Paris-Style Decentralized Expert Model, Gating Network Sparsity, Kullback-Leibler Divergence, Gradient Clipping, Contrastive Loss Functions, Weight Quantization, Distributed Systems, Deep Learning Architectures, Multimodal Representation Learning, Optimization Theory, Information Theory, Mixture of Experts (MoE), Cross-Attention Mechanisms, Federated Learning Protocols, Stochastic Gradient Descent, Embedding Space Alignment.

Paris trains 8 expert diffusion models in complete isolation, with no gradient, parameter, or activation synchronization between them. The data is partitioned into semantic clusters, each expert trains only on its own partition, and a lightweight router learns to select experts at inference time. Implement a simplified version of this scheme: (1) The training routine (train paris model) partitions the data by cluster ID, trains each expert in isolation (each expert learns the mean target of its partition), and trains a router (which learns the centroid of each cluster's features). (2) The inference routine (paris inference) uses the router to select one or more experts based on the similarity of the input features to the centroids.
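The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference solution: the function names train_paris_model and paris_inference, the dictionary layout of the returned model, scalar targets, the use of negative squared Euclidean distance as the similarity measure, and the softmax-style weighting when more than one expert is selected (top_k > 1) are all choices made here for illustration and are not specified by the problem statement.

```python
import math
from collections import defaultdict


def train_paris_model(features, targets, cluster_ids):
    """Toy training: one 'expert' and one router centroid per cluster.

    Assumption: features is a list of equal-length float lists,
    targets is a list of floats, cluster_ids is a list of hashable ids.
    """
    # Partition sample indices by cluster id.
    partitions = defaultdict(list)
    for i, cid in enumerate(cluster_ids):
        partitions[cid].append(i)

    experts, centroids = {}, {}
    for cid, idx in partitions.items():
        n = len(idx)
        # Each expert sees only its own partition: no shared state,
        # no synchronization with any other expert.
        experts[cid] = sum(targets[i] for i in idx) / n
        # The router stores the mean feature vector (centroid) of the cluster.
        dim = len(features[idx[0]])
        centroids[cid] = [sum(features[i][d] for i in idx) / n for d in range(dim)]
    return {"experts": experts, "router": centroids}


def paris_inference(model, x, top_k=1):
    """Route x to the top_k nearest centroids and combine their experts.

    Assumption: similarity is negative squared Euclidean distance, and
    selected experts are blended with softmax weights over that similarity.
    Returns (prediction, list of selected cluster ids).
    """
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)), cid)
        for cid, c in model["router"].items()
    )
    chosen = scored[:top_k]
    best = chosen[0][0]
    # Subtract the smallest distance before exponentiating for stability.
    weights = [math.exp(-(dist - best)) for dist, _ in chosen]
    total = sum(weights)
    prediction = sum(
        w * model["experts"][cid] for w, (_, cid) in zip(weights, chosen)
    ) / total
    return prediction, [cid for _, cid in chosen]
```

With top_k=1 the prediction is simply the stored mean of the nearest cluster's expert. Larger values of top_k blend several experts, giving more weight to those whose centroids lie closer to the input.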