DICE: Data Influence Cascade in Decentralized Learning

Contents

Background

Main Results

Future Work

Citation


DICE

Updates in progress. More coming soon!

🗓️ 2025-07-10 - Added background context and practical applications.


Background

The Rise of Decentralized Learning

Training state-of-the-art AI models has become astronomically expensive, risking concentration of AI capabilities within a handful of tech giants. Decentralized Learning (DL) offers a compelling alternative: crowdsourcing both data and computational resources across a peer-to-peer (P2P) network. Unlike centralized and federated approaches, which rely on a central server, decentralized learning enables direct peer-to-peer model exchange through gossip protocols, democratizing access to large-scale machine learning while naturally preserving data privacy.

Decentralized Learning Architecture

Figure: Architectural comparison between centralized (server-based) and decentralized (peer-to-peer) learning paradigms.

The Core Challenge: Contribution Quantification and Incentive Mechanisms

However, this collaborative vision faces a fundamental obstacle: the absence of proper incentive mechanisms. In decentralized networks, participation is voluntary, and computational resources are contributed by independent entities. A critical question emerges:

Why would rational participants contribute their valuable data and computational resources without a fair and transparent way to measure and reward their efforts?

Without addressing this question, decentralized learning risks suffering from the “tragedy of the commons”: participants may free-ride on others’ contributions while minimizing their own efforts. This challenge is particularly acute in decentralized settings because:

  1. No central authority exists to monitor or enforce fair participation
  2. Contributions are heterogeneous, varying significantly across participants in both quantity and quality
  3. Networks are dynamic: participants may join, leave, or behave maliciously at any time

The solution requires a robust framework for quantifying individual contributions in decentralized learning. Only with accurate contribution measurement can we design fair incentive mechanisms that reward participants proportionally to their actual impact on the collective learning outcome.

But how should we measure contributions? While superficial metrics like GPU uptime or data quantity are easy to track, they fail to capture what truly matters: the actual impact on learning outcomes. A node providing 1000 low-quality or mislabeled data points contributes far less than one providing 100 high-quality, informative samples. Similarly, a node with high uptime but poor data has limited real value.

This insight leads us to data influence: a principled approach to measuring the causal impact of each data point on model performance. Data influence directly quantifies how much a specific piece of training data affects the final model’s predictions, making it an ideal metric for contribution assessment in collaborative learning.

Why Traditional Data Influence Methods Fall Short

Data influence estimation has been extensively studied in centralized machine learning. Classical approaches include leave-one-out (LOO) retraining and influence functions (Koh & Liang, 2017), which estimate a data point’s value by measuring how model performance changes when that data is removed.

However, these centralized methods fundamentally fail in decentralized settings due to two critical distinctions:

1. Multiple Models, Not One: In decentralized learning, there is no single global model. Each participant maintains their own local model, and the notion of “model performance” becomes ambiguous. Whose model should we evaluate? How do we aggregate influence across different models?

2. Dynamic, Cascading Influence: More fundamentally, data influence in decentralized learning is not static; it cascades through the network. When a node performs a local update using its data, this change propagates to its neighbors through gossip communication. These neighbors then pass the influence to their neighbors, creating a ripple effect that spreads across the entire network over time.

💡 The Ripple Effect Analogy

Like a stone dropped into a pond, a node's local update creates influence ripples that propagate through the network via peer-to-peer communications. Just as water ripples fade with distance, data influence decays across multiple hops.

Cascading Data Influence Visualization

Figure: Visualization of cascading data influence in decentralized learning as ripples spreading through a network.

Traditional influence estimators, designed for a single, static model, cannot capture this recursive, multi-hop propagation of influence. We need a fundamentally new framework.

Our Contributions: Introducing DICE

This work introduces DICE (Data Influence Cascade in dEcentralized learning), the first comprehensive framework for measuring data influence in fully decentralized learning systems. Our key contributions span three dimensions:

Conceptually, we establish a customer-centric perspective for defining data value in decentralized settings. Inspired by management theory, we recognize that when a node shares its model update, its neighbors are the customers. Data influence should therefore be measured by its impact on these receiving nodes’ test performance, rather than by the contributing node itself.

Theoretically, we develop a rigorous mathematical framework that defines ground-truth influence cascades in decentralized learning and derives a computable, gradient-based approximation (presented in Main Results below).

Practically, DICE enables critical applications for building robust and fair decentralized learning systems, including contribution assessment, anomaly detection, and incentive design.

In the following sections, we’ll dive deep into the theoretical foundations of DICE and explore how these insights translate into practical tools for decentralized learning.


Main Results

Theorem (Approximation of r-hop DICE-GT)

The r-hop DICE-GT influence \(\mathcal{I}_{\mathrm{DICE-GT}}^{(r)}(\boldsymbol{z}_j^t, \boldsymbol{z}^{\prime})\) can be approximated as follows:

\[\begin{equation} \begin{split} &\mathcal{I}_{\mathrm{DICE-E}}^{(r)}(\boldsymbol{z}_j^t, \boldsymbol{z}^{\prime})\\ & = - \sum_{\rho=0}^{r} \sum_{ (k_1, \dots, k_{\rho}) \in P_j^{(\rho)} } \eta^{t} q_{k_\rho}  \underbrace{ \left( \prod_{s=1}^{\rho} \boldsymbol{W}_{k_s, k_{s-1}}^{t+s-1} \right) }_{\text{communication graph-related term}} \times \underbrace{ \nabla L\bigl(\boldsymbol{\theta}_{k_{\rho}}^{t+\rho}; \boldsymbol{z}^{\prime}\bigr)^\top }_{\text{test gradient}} \\ & \quad \times \underbrace{ \left( \prod_{s=2}^{\rho} \left( \boldsymbol{I} - \eta^{t+s-1} \boldsymbol{H}(\boldsymbol{\theta}_{k_s}^{t+s-1}; \boldsymbol{z}_{k_s}^{t+s-1}) \right) \right) }_{\text{curvature-related term}} \times \underbrace{ \Delta_j(\boldsymbol{\theta}_j^t,\boldsymbol{z}_j^t) }_{\text{optimization-related term}} \end{split} \end{equation}\]

where \(\Delta_j(\boldsymbol{\theta}_j^t,\boldsymbol{z}_j^t) = \mathcal{O}_j(\boldsymbol{\theta}_j^t,\boldsymbol{z}_j^t)-\boldsymbol{\theta}_j^t\), \(k_0 = j\). Here \(P_j^{(\rho)}\) denotes the set of all sequences \((k_1, \dots, k_{\rho})\) such that \(k_s \in \mathcal{N}_{\mathrm{out}}^{(1)}(k_{s-1})\) for \(s=1,\dots,\rho\), and \(\boldsymbol{H}(\boldsymbol{\theta}_{k_s}^{t+s-1}; \boldsymbol{z}_{k_s}^{t+s-1})\) is the Hessian matrix of \(L\) with respect to \(\boldsymbol{\theta}\), evaluated at \(\boldsymbol{\theta}_{k_s}^{t+s-1}\) and data \(\boldsymbol{z}_{k_s}^{t+s-1}\).

For the cases when \(\rho = 0\) and \(\rho = 1\), the relevant product expressions are defined as identity matrices, thereby ensuring that the r-hop DICE-E remains well-defined.
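To make the structure of the formula concrete, here is a minimal numeric sketch that enumerates the paths \(P_j^{(\rho)}\) and sums the four terms, under simplifying assumptions that are ours, not the paper's: scalar models, quadratic loss \(L(\theta; z) = \frac{1}{2}(\theta - z)^2\) (so the gradient is \(\theta - z\) and the Hessian is 1), a static gossip matrix, a frozen parameter snapshot (time indices dropped), uniform weights \(q_k\), and plain SGD as the local optimizer \(\mathcal{O}_j\).

```python
import numpy as np

def out_neighbors(W, i):
    # k receives from i when W[k, i] > 0 (column i holds i's outgoing weights)
    return [k for k in range(W.shape[0]) if W[k, i] > 0]

def dice_e(W, theta, z_local, z_test, j, r, eta=0.1):
    """Enumerate paths (k_1, ..., k_rho) in P_j^(rho) and sum the four terms."""
    n = W.shape[0]
    q = np.ones(n) / n                                  # uniform weighting q_k
    grad_j = theta[j] - z_local[j]
    delta_j = (theta[j] - eta * grad_j) - theta[j]      # Delta_j = O_j(theta) - theta
    total = 0.0
    paths = [[j]]                                       # rho = 0: the path (k_0,) = (j,)
    for rho in range(r + 1):
        for p in paths:
            k_rho = p[-1]
            # communication graph-related term: product of mixing weights along the path
            comm = np.prod([W[p[s], p[s - 1]] for s in range(1, len(p))])
            test_grad = theta[k_rho] - z_test           # test gradient at the destination
            curv = (1.0 - eta) ** max(0, rho - 1)       # product of (I - eta*H), H = 1
            total += -eta * q[k_rho] * comm * test_grad * curv * delta_j
        paths = [p + [k] for p in paths for k in out_neighbors(W, p[-1])]
    return total

# Ring of 3 nodes, each mixing equally with itself and both neighbors
W = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
score = dice_e(W, theta=np.array([0.0, 1.0, 2.0]),
               z_local=np.array([2.0, 1.0, 0.0]), z_test=1.5, j=0, r=2)
```

The path enumeration makes the exponential growth of \(P_j^{(\rho)}\) visible, which is one reason the practical estimator in Layer 4 restricts attention to one-hop influence.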

Key Insights from DICE

Our theory uncovers the intricate interplay of factors that shape data influence in decentralized learning:

  1. Asymmetric Influence and Topological Importance: The influence of identical data is not uniform across the network; nodes with greater topological significance exert stronger influence.
  2. The Role of Intermediate Nodes and the Loss Landscape: Intermediate nodes actively contribute to an "influence chain", and their local loss landscapes shape the influence as it propagates through the network.
  3. Influence Cascades with Damped Decay: Data influence cascades with "damped decay" induced by the mixing matrix \(\boldsymbol{W}\). This decay, which can be exponential in the number of hops, keeps influence "localized".

Deconstructing the Theory: From Ground Truth to Practical Estimator

The theorem above represents our main theoretical contribution, but understanding its significance requires unpacking how we built this result from foundational principles. Let’s trace the conceptual journey from defining the ideal “ground truth” influence to deriving a practical, computable approximation.

Layer 1: Defining the Ground Truth 🎯

Before we can approximate influence, we must first define what we are approximating: the ideal, “gold standard” measure of data influence. This requires careful counterfactual reasoning: what would change in decentralized learning if a particular data point were removed?

Definition 1 (One-hop Ground-truth Influence)

For a data point \(\boldsymbol{z}_j^t\) at node \(j\) and time \(t\), its one-hop ground-truth influence on a test point \(\boldsymbol{z}'\) at a neighbor node \(k \in \mathcal{N}_{\mathrm{out}}^{(1)}(j)\) is defined as:

\[\mathcal{I}_{\mathrm{DICE-GT}}^{(1)}(\boldsymbol{z}_j^t, \boldsymbol{z}') = L(\boldsymbol{\theta}_k^{t+1,-j}; \boldsymbol{z}') - L(\boldsymbol{\theta}_k^{t+1}; \boldsymbol{z}')\]

where \(\boldsymbol{\theta}_k^{t+1}\) is node \(k\)’s model at time \(t+1\) trained with \(\boldsymbol{z}_j^t\), and \(\boldsymbol{\theta}_k^{t+1,-j}\) is the counterfactual model obtained by retraining without \(\boldsymbol{z}_j^t\).

Intuition: This measures the direct impact of data \(\boldsymbol{z}_j^t\) on its immediate neighbors’ test performance. If removing the data point increases test loss (positive influence), the data might be helpful; if it decreases test loss (negative influence), the data might be harmful.
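A toy numeric check of Definition 1 can make this intuition concrete. The setup below is a hypothetical simplification of ours: scalar models, quadratic loss \(L(\theta; z) = \frac{1}{2}(\theta - z)^2\), one local SGD step at node \(j\), then one gossip averaging step at neighbor \(k\), run once with \(\boldsymbol{z}_j^t\) and once without.

```python
def local_step(theta, z, eta=0.1):
    return theta - eta * (theta - z)            # SGD on L = 0.5 * (theta - z)^2

def gossip(theta_k, theta_j, w_kj=0.5):
    return w_kj * theta_j + (1 - w_kj) * theta_k

def loss(theta, z):
    return 0.5 * (theta - z) ** 2

theta_j, theta_k, z_j, z_test = 0.0, 1.0, 2.0, 1.5
theta_k_factual = gossip(theta_k, local_step(theta_j, z_j))   # trained with z_j
theta_k_counterfactual = gossip(theta_k, theta_j)             # retrained without z_j
influence = loss(theta_k_counterfactual, z_test) - loss(theta_k_factual, z_test)
# influence > 0 here: removing z_j raises the neighbor's test loss, so z_j was helpful
```

In this toy instance the influence comes out positive, matching the intuition above: the data point pulls \(k\)'s model toward the test target.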

Definition 2 (Multi-hop Ground-truth Influence)

In decentralized learning, influence does not stop at immediate neighbors; it cascades through the network. The \(r\)-hop ground-truth influence extends the counterfactual reasoning recursively:

\[\mathcal{I}_{\mathrm{DICE-GT}}^{(r)}(\boldsymbol{z}_j^t, \boldsymbol{z}') = \sum_{k \in \mathcal{N}_{\mathrm{out}}^{(r)}(j)} q_k \left[ L(\boldsymbol{\theta}_k^{t+r,-j}; \boldsymbol{z}') - L(\boldsymbol{\theta}_k^{t+r}; \boldsymbol{z}') \right]\]

where \(\mathcal{N}_{\mathrm{out}}^{(r)}(j)\) denotes all nodes reachable from \(j\) within \(r\) hops, and \(q_k\) is a weighting factor (e.g., sampling probability).

Intuition: This aggregates influence across all nodes within an \(r\)-hop neighborhood. The counterfactual \(\boldsymbol{\theta}_k^{t+r,-j}\) requires retraining the entire network for \(r\) rounds without \(\boldsymbol{z}_j^t\), a prohibitively expensive operation that makes direct computation infeasible in practice.
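The neighborhood \(\mathcal{N}_{\mathrm{out}}^{(r)}(j)\) itself is easy to compute; a small sketch with a hypothetical adjacency-list graph (here we include \(j\) itself, matching the \(\rho = 0\) term in the main theorem):

```python
from collections import deque

def r_hop_neighborhood(out_edges, j, r):
    """All nodes reachable from j within r hops (breadth-first search)."""
    seen = {j}
    frontier = deque([(j, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == r:
            continue                      # do not expand beyond r hops
        for k in out_edges[node]:
            if k not in seen:
                seen.add(k)
                frontier.append((k, depth + 1))
    return seen

# Directed line 0 -> 1 -> 2 -> 3: within 2 hops of node 0 we reach {0, 1, 2}
edges = {0: [1], 1: [2], 2: [3], 3: []}
```

What is expensive is not finding these nodes but producing the counterfactual models \(\boldsymbol{\theta}_k^{t+r,-j}\) at each of them, which is what the approximation in the next layer avoids.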

Layer 2: Building a Computable Bridge 🌉

The ground-truth definitions are conceptually clear but computationally intractable. We need an approximation that can be computed efficiently using only information available during standard training. This is where gradient-based influence estimation enters.

Proposition 1 (Approximation of One-hop DICE-GT)

Using a first-order Taylor expansion, the one-hop ground-truth influence can be approximated as:

\[\mathcal{I}_{\mathrm{DICE-E}}^{(1)}(\boldsymbol{z}_j^t, \boldsymbol{z}') \approx -\eta^t \nabla L(\boldsymbol{\theta}_k^{t+1}; \boldsymbol{z}')^\top \Delta_j(\boldsymbol{\theta}_j^t, \boldsymbol{z}_j^t)\]

where \(\Delta_j(\boldsymbol{\theta}_j^t, \boldsymbol{z}_j^t) = \mathcal{O}_j(\boldsymbol{\theta}_j^t, \boldsymbol{z}_j^t) - \boldsymbol{\theta}_j^t\) represents the model update direction from node \(j\)’s optimizer, and \(\eta^t\) is the learning rate.

Key Insight: This approximation reveals that one-hop influence depends on:

  1. The optimization step \(\Delta_j\) induced by the data at the source node
  2. The test gradient \(\nabla L(\boldsymbol{\theta}_k^{t+1}; \boldsymbol{z}')\) at the receiving node

The negative sign indicates that if the update \(\Delta_j\) aligns with the negative test gradient (the direction that reduces test loss), the data has positive influence.

This proposition is the crucial stepping stone: it shows that influence can be estimated using gradients computed during normal training, without expensive leave-one-out retraining.
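The structure of Proposition 1 is just a scaled dot product, which a short sketch can illustrate; the vectors below are hypothetical values of ours, not quantities from the paper.

```python
import numpy as np

def dice_e_one_hop(test_grad_k, delta_j, eta=0.1):
    """-eta * grad L(theta_k^{t+1}; z')^T Delta_j, per Proposition 1."""
    return -eta * float(test_grad_k @ delta_j)

# An update roughly aligned with the NEGATIVE test gradient (reduces test loss) ...
test_grad_k = np.array([0.9, -0.3])
delta_j = np.array([-0.5, 0.2])
helpful = dice_e_one_hop(test_grad_k, delta_j)    # comes out positive
# ... versus the opposite update, aligned with the test gradient (raises test loss)
harmful = dice_e_one_hop(test_grad_k, -delta_j)   # comes out negative
```

The sign behavior matches the key insight above: alignment of \(\Delta_j\) with the descent direction of the test loss yields positive influence, and anti-alignment yields negative influence.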

Layer 3: Dissecting the Main Theorem 🔬

Now we return to our main theorem with fresh eyes. The multi-hop approximation \(\mathcal{I}_{\mathrm{DICE-E}}^{(r)}\) extends Proposition 1 to capture how influence propagates and transforms as it cascades through multiple hops. The complex formula can be understood through four distinct components, each playing a specific physical role:

🧩 The Four Pillars of Cascading Influence

1. Optimization Term 💧 - Source of Influence: $$ \Delta_j(\boldsymbol{\theta}_j^t, \boldsymbol{z}_j^t) $$ The initial "push" that the data gives to its local model.

2. Communication Term 🌐 - Propagation Path: $$ \prod_{s=1}^{\rho} \boldsymbol{W}_{k_s, k_{s-1}}^{t+s-1} $$ Tracks the path influence takes through the network topology.

3. Curvature Term 🔍 - Influence Modulator: $$ \prod_{s=2}^{\rho} \left( \boldsymbol{I} - \eta^{t+s-1} \boldsymbol{H} \right) $$ The loss-landscape curvature shapes, dampens, or amplifies the passing influence signal.

4. Test Gradient 📏 - Impact Measurement: $$ \nabla L(\boldsymbol{\theta}_{k_\rho}^{t+\rho}; \boldsymbol{z}') $$ The "probe" that measures final influence by checking how the cascaded update affects test loss at the destination.

The Complete Picture: Data influence is born at the source node (optimization), travels through the network topology (communication), gets modulated by intermediate nodes' learning landscapes (curvature), and is ultimately measured by its impact on test performance (test gradient).

Remarkable Insights from This Decomposition:

  1. Topology Determines Power: The communication term reveals that network position matters enormously. Nodes with high out-degree or centrality can amplify their data’s influence across the network. This explains why identical data at different nodes can have vastly different global impact.

  2. Influence Cascades with Damped Decay: The communication matrix \(\boldsymbol{W}\) creates a natural damping effect on influence propagation. Since each entry of \(\boldsymbol{W}\) is at most 1 (and typically less), the multiplicative cascade \(\prod \boldsymbol{W}\) leads to exponential decay with the number of hops. Additionally, the curvature term acts as a modulating “friction” that further shapes how influence propagates. Together, these ensure influence remains localized rather than spreading indefinitely.

  3. Dynamic, State-Dependent Influence: Unlike centralized settings where data influence is static, DICE reveals that influence in decentralized learning is fundamentally dynamic. The same data can have different influence at different times, depending on the evolving loss landscapes of intermediate nodes.


Layer 4: From Theory to Practice with Proximal Influence

The full multi-hop theory is powerful but can be expensive to compute for large \(r\). For practical applications, we introduce a lightweight metric that captures the most immediate and significant influence relationships.

Definition 3 (Proximal Influence)

The proximal influence measures the direct, one-hop influence that a neighbor \(k\)’s data has on node \(j\):

\[\mathcal{I}_{\mathrm{Prox}}(k \to j; \boldsymbol{z}') = -\eta \sum_{t} \nabla L(\boldsymbol{\theta}_j^{t+1}; \boldsymbol{z}')^\top \boldsymbol{W}_{jk}^t \Delta_k(\boldsymbol{\theta}_k^t, \boldsymbol{z}_k^t)\]

Why This Matters: Proximal influence has three key advantages:

  1. Computational Efficiency: It only requires information already exchanged during standard gossip training (model updates \(\Delta_k\) and local gradients). No additional communication overhead!

  2. Reciprocity Measurement: Nodes can compute both \(\mathcal{I}_{\mathrm{Prox}}(k \to j)\) and \(\mathcal{I}_{\mathrm{Prox}}(j \to k)\) to assess whether collaboration is mutually beneficial. This enables reciprocity factors for fair contribution assessment.

  3. Anomaly Detection: Malicious or corrupted data produces distinctly negative or anomalously low proximal influence, allowing nodes to identify and exclude harmful neighbors dynamically.
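Definition 3 can be sketched in a few lines; the quantities below (per-round test gradients, neighbor updates, mixing weights) are hypothetical stand-ins for what node \(j\) already receives during gossip training.

```python
import numpy as np

def proximal_influence(test_grads, deltas_k, w_jk, eta=0.1):
    """Accumulate -eta * sum_t grad L(theta_j^{t+1}; z')^T W_jk^t Delta_k^t.

    test_grads[t]: gradient of L(theta_j^{t+1}; z') at node j
    deltas_k[t]:   neighbor k's model update at round t
    w_jk[t]:       mixing weight from k to j at round t
    """
    return -eta * sum(w_jk[t] * float(g @ d)
                      for t, (g, d) in enumerate(zip(test_grads, deltas_k)))

# Two rounds in which k's updates consistently oppose j's test gradient
test_grads = [np.array([1.0, -1.0]), np.array([0.8, -0.6])]
deltas_k = [np.array([-0.3, 0.3]), np.array([-0.2, 0.1])]
score = proximal_influence(test_grads, deltas_k, w_jk=[0.5, 0.5])
```

A helpful neighbor, whose updates point against the test gradient, accumulates a positive score; a harmful one accumulates a negative score, which is exactly the signal the anomaly-detection application exploits.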

🛡️ Application: Securing Decentralized Learning

Scenario: In a 50-node decentralized network, 5 malicious nodes inject label-flipped data (cats labeled as dogs, etc.). How can honest nodes detect and exclude these bad actors without a central authority?

Solution with Proximal Influence: Each node computes proximal influence for all its neighbors. Honest nodes observe that malicious neighbors consistently produce negative influence (increasing test loss rather than decreasing it). By setting a simple threshold (e.g., exclude neighbors with influence < 0), honest nodes can autonomously prune malicious connections.

Result: Within a few communication rounds, malicious nodes become isolated and the network self-heals, all without centralized coordination. This demonstrates DICE's power for building robust, self-governing decentralized systems.
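The pruning rule in this scenario reduces to a one-line threshold check; a hedged sketch with hypothetical scores:

```python
def prune_neighbors(prox_scores, threshold=0.0):
    """Split neighbors by accumulated proximal influence on our test loss.

    prox_scores: {neighbor_id: accumulated proximal influence}
    Returns (kept, dropped) neighbor-id sets.
    """
    kept = {k for k, s in prox_scores.items() if s >= threshold}
    dropped = set(prox_scores) - kept
    return kept, dropped

# Neighbors 2 and 5 consistently increase our test loss -> pruned
kept, dropped = prune_neighbors({1: 0.30, 2: -0.52, 3: 0.01, 5: -0.08})
```

In practice the threshold and the accumulation window would need tuning per deployment; zero is just the simplest choice mentioned in the scenario above.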


Future Work

DICE opens exciting avenues for building robust, fair, and sustainable decentralized learning systems. We envision four key directions:

Design Influence-weighted Learning Algorithms

A natural next step is developing new optimization algorithms that leverage DICE scores during training. The key challenge: How to estimate multi-hop DICE scores efficiently?

Promising directions include online estimation methods that incrementally compute influence without expensive retraining, and influence-weighted aggregation schemes that prioritize high-quality updates while filtering out noise.
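One possible shape of such an influence-weighted aggregation scheme, sketched as a hypothetical idea of ours rather than an algorithm from the paper: reweight neighbor models by a softmax over their influence scores, so higher-influence neighbors contribute more to the mix.

```python
import numpy as np

def influence_weighted_mix(models, scores, temperature=1.0):
    """Convex combination of neighbor models, weighted by influence scores."""
    scores = np.array(scores, dtype=float)
    w = np.exp((scores - scores.max()) / temperature)   # numerically stable softmax
    w /= w.sum()
    return np.tensordot(w, np.stack(models), axes=1)    # weighted model average

# Three neighbor models; the third has strongly negative influence and is down-weighted
models = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]
mixed = influence_weighted_mix(models, scores=[0.1, 0.5, -2.0])
```

The temperature controls how aggressively low-influence neighbors are suppressed; in the limit of low temperature this approaches hard neighbor selection.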

Detect Malicious and Free-riding Nodes

DICE provides a foundation for autonomous security mechanisms. Nodes with consistently negative influence can be automatically identified and excluded, enabling self-regulating networks that maintain quality without centralized oversight.

Future work could explore Byzantine-robust protocols and adaptive neighbor selection strategies for enhanced resilience.

Design Contribution-based Incentive Mechanisms

Perhaps the most transformative application: fair reward systems that compensate participants proportionally to their measured contributions. DICE could enable blockchain-integrated compensation schemes and reciprocity-based matchmaking between nodes.

This transforms decentralized learning from altruism into a sustainable economic model.

Build Decentralized Learning Ecosystem

The ultimate vision is a thriving decentralized AI economy with three interconnected markets:

DICE provides the foundational valuation framework that enables transparent and fair collaboration, helping to democratize AI for individuals and academia.


Citation

Cite Our Paper 😀

If you find our work insightful, we would greatly appreciate it if you could cite our paper.

@inproceedings{zhu2025dice,
  title={DICE: Data Influence Cascade in Decentralized Learning},
  author={Tongtian Zhu and Wenhao Li and Can Wang and Fengxiang He},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=2TIYkqieKw}
}