Brief Study Note on Three Privacy-Preserving Distributed Deep Learning Methods

Miaozhi Yu
May 2, 2021

Nowadays, more and more companies and industries are trying to incorporate ML/AI into their products. Among the newly emerging techniques, deep neural networks stand out as the new state of the art for classification and prediction on high-dimensional data. They have been widely used in areas like image classification, video classification, and bio-sensing. Industries such as biomedicine and healthcare especially benefit from these emerging technologies, using deep neural networks to predict or infer diagnostic results, which can automate human involvement and reduce cost. Training deep neural networks is usually very data intensive and requires the preparation of large-scale datasets collected from multiple entities. However, data in the biomedical and health industries is usually short of labels and is commonly distributed, needing to be aggregated at a centralized storage site. Furthermore, training deep neural networks also means computing millions of parameters, which requires tremendous computing power.

Besides these two difficulties, applying deep learning to such domains can be challenging because of the privacy and ethical issues associated with sharing non-anonymized data, regulated for example by the Health Insurance Portability and Accountability Act (HIPAA). Data preparation is often obligated to keep user data private, adding an extra layer of difficulty when building machine learning pipelines. In recent years, as more laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have been introduced to protect person-level identifiable information, being able to train machine learning or deep learning models without direct access to the data has become more and more important. Data from mobile devices and tablets provides a rich resource for training machine learning or deep learning models, which can greatly improve usability by powering more intelligent applications. These devices, on the other hand, rarely leave their owners. Thus the data usually comes in a distributed form and is sensitive and private in nature.

In this study note, I am going to introduce three privacy-preserving distributed deep learning methods that provide solutions to the above difficulties: Federated Learning, Split Learning, and SplitFed. I will then compare these three methods based on the experiments conducted by Gawali, Manish, et al. in the biomedical imaging field.

Let’s start by introducing these three methods:

Federated Learning

Federated learning (FL) (Konečný, Jakub, et al., arXiv preprint arXiv:1610.02527 (2016)) is a distributed learning method that enables training of neural network models across multiple devices or servers without the need to move the data. Instead of storing data from various sources at a centralized processing site and performing centralized training, FL runs multiple federated rounds to obtain a robust model. A federated round is defined as follows:

  1. fetch the global model from the main server to the local servers;
  2. train the model at each local server and send the local updates to the main server;
  3. the main server aggregates the updates received from the distributed sources and updates the global model using the federated averaging algorithm.

In this training scheme, the global model learns from the large and diverse data of every local center, and each local center in turn benefits from the other centers' data, as sketched below.
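
To make the round concrete, here is a minimal sketch of one federated round with federated averaging in plain Python/NumPy. The helper names (`federated_round`, `local_train`) and the single-weight-vector model are my own simplifications, not the authors' implementation.

```python
import numpy as np

def federated_round(global_weights, clients, local_train):
    """One federated round: broadcast, local training, weighted averaging.

    clients: list of (X, y) datasets, one per local node.
    local_train: function that trains on one client's data starting from the
                 broadcast weights (a NumPy array) and returns updated weights.
    """
    client_weights, client_sizes = [], []
    for X, y in clients:
        # Steps 1-2: fetch the global model and train locally.
        w_k = local_train(global_weights.copy(), X, y)
        client_weights.append(w_k)
        client_sizes.append(len(X))

    # Step 3: federated averaging, weighting each client by its data share n_k / n.
    n = sum(client_sizes)
    new_global = sum((n_k / n) * w_k for w_k, n_k in zip(client_weights, client_sizes))
    return new_global
```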

Take linear regression as an example. Suppose we have a set of labeled training data (x_i, y_i). For each data entry, we can calculate the loss according to a loss function f.
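
For least-squares linear regression, a standard per-example loss would be the following (with $w$ the weight vector; this is a generic choice rather than necessarily the exact loss used in the paper):

```latex
f_i(w) = \frac{1}{2}\left(x_i^\top w - y_i\right)^2,
\qquad
f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)
```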

In the federated learning setting, data is massively distributed, non-IID (since it is drawn from different distributions), and unbalanced (each node holds a very different number of training examples). So a standard optimization algorithm like SGD may not work well in this setting for finding the global/local minimum. In the paper, Jakub Konečný and his team introduced federated optimization, which has the following properties:

A. If the algorithm is initialized to the optimal solution, it stays there.

B. If all the data is on a single node, the algorithm should converge in O(1) rounds of communication.

C. If each feature occurs on a single node, so the problems are fully decomposable (each machine is essentially learning a disjoint block of parameters), then the algorithm should converge in O(1) rounds of communication.

D. If each node contains an identical dataset, then the algorithm should converge in O(1) rounds of communication.

Federated optimization combines two algorithms: SVRG (Stochastic Variance Reduced Gradient) and DANE (Distributed Approximate Newton algorithm). The former is an optimization algorithm from the SGD family, while the latter belongs to the family of distributed methods that operate via quadratic perturbation of the local objective.

SVRG algorithm
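
The algorithm listing itself is in the paper; its core update, to the best of my understanding, maintains a snapshot point $\tilde w$ at which a full gradient is computed once per outer loop, and then takes variance-reduced stochastic steps of the form:

```latex
w_{t+1} = w_t - \eta \left( \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde w) + \nabla f(\tilde w) \right)
```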

In the distributed formulation of the problem, assuming the loss function is convex, a local loss function is defined for each node, and the global empirical loss function is then written as a linear combination of these local loss functions, where n_k is the number of data points in partition k (the data held by one local node) and n is the total number of data points; the definitions are sketched below.
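
In symbols (a reconstruction consistent with the prose above and the notation of Konečný et al., where $\mathcal{P}_k$ is the index set of the data held by node $k$, so that $n_k = |\mathcal{P}_k|$):

```latex
F_k(w) \;=\; \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w),
\qquad
f(w) \;=\; \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w)
```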

The team further presents DANE, an algorithm that was originally analyzed for solving problems with exactly the structure of the local loss functions mentioned above; a sketch of its local update follows.
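
As a rough sketch (the exact constants and the averaging step are in the DANE paper, so treat this as a hedged reconstruction rather than the precise algorithm), each node $k$ solves a locally perturbed subproblem and the solutions are then averaged:

```latex
w_{t+1}^{(k)} \;=\; \arg\min_{w}\; F_k(w)
  \;-\; \bigl( \nabla F_k(w_t) - \eta\, \nabla f(w_t) \bigr)^{\!\top} w
  \;+\; \frac{\mu}{2}\, \lVert w - w_t \rVert^2
```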

By combining the above two algorithms, the team introduced the federated SVRG algorithm (federated optimization).

In this way, each local node can compute its parameter update and send it to the main server, where the global model is updated and the federated-averaged parameters are sent back to the local nodes for the next federated round. The resulting global model proves robust in terms of approaching the global minimum of the loss function.
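
At a high level, and omitting the per-coordinate scaling that the actual federated SVRG algorithm applies (so this is a simplified assumption, not the paper's exact update), the server aggregation at the end of a round looks like:

```latex
w_{t+1} \;=\; w_t \;+\; \sum_{k=1}^{K} \frac{n_k}{n}\,\bigl(w^{(k)} - w_t\bigr)
```

where $w^{(k)}$ is the result of client $k$'s local SVRG iterations.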

Split Learning

Split learning (SL) trains a machine learning model across multiple hosts by splitting the model into multiple segments. One split configuration is called the label-sharing configuration. In this configuration, the server holds the labels, and each client performs forward propagation up to a particular layer called the cut layer. The output at the cut layer is then sent to the main server, where the rest of the forward propagation is continued and the training loss is calculated. The server then continues with back-propagation down to the cut layer, after which the gradients are sent back to the clients, which carry out the rest of the back-propagation. The server cannot access the raw local client data during training, thus preserving the privacy of the clients.

In this method, there are two parties: the main server (Bob) and the clients (Alices).

  1. A single data entity (Alice) doesn't need to share its data with Bob or with other data sources.
  2. The supercomputing resource (Bob) keeps control over the architecture of the neural network(s).
  3. Bob also keeps a part of the network parameters required for inference.
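
To make the label-sharing flow described above concrete, here is a minimal PyTorch-style sketch of one training step. The two-segment model, the layer sizes, and the variable names are my own simplifications, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical split of a small network at the cut layer.
client_net = nn.Sequential(nn.Linear(784, 256), nn.ReLU())                     # runs on the client
server_net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))   # runs on the server

opt_client = torch.optim.SGD(client_net.parameters(), lr=0.01)
opt_server = torch.optim.SGD(server_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def split_training_step(x, y):
    # Client: forward propagation up to the cut layer; only the activations leave the client.
    activations = client_net(x)
    smashed = activations.detach().requires_grad_()  # what is actually transmitted

    # Server: finish the forward pass with the labels it holds, then back-propagate to the cut layer.
    out = server_net(smashed)
    loss = loss_fn(out, y)
    opt_server.zero_grad()
    loss.backward()
    opt_server.step()

    # Client: receive the gradient at the cut layer and finish back-propagation locally.
    opt_client.zero_grad()
    activations.backward(smashed.grad)
    opt_client.step()
    return loss.item()
```

For example, `split_training_step(torch.randn(32, 784), torch.randint(0, 10, (32,)))` runs one step on a random batch; note that only cut-layer activations and gradients cross the client/server boundary.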

The team proposed two algorithms, one in centralized mode and one in peer-to-peer mode. Both algorithms preserve data privacy during communication.

Centralized Mode:

Peer-to-Peer Mode:

SplitFed

SplitFed learning (SFL) is a new decentralized machine learning methodology, proposed by Thapa et al., that combines the strengths of FL and SL. In the simplest configuration, called the label-sharing configuration, the entire neural network architecture is 'split' into two parts. Instead of training the client networks sequentially, Thapa et al. proposed training the client networks in parallel, a property drawn from FL.

There are two variations of SFL:

SFLv1

Each client performs forward propagation in parallel on its local data and sends the activations at the cut layer to the main server, where the rest of the forward propagation is carried out for all client activations in parallel. The server then performs back-propagation and sends the gradients at the cut layer back to the respective clients, while the server-side network is updated using a weighted average of the gradients. The clients complete their back-propagation and send their updates to the fed server, which in turn sends the averaged updates back to all clients so that the client-side networks stay synchronized.

SFLv2

The training of the server-side network is sequential; i.e., the clients perform forward propagation and back-propagation one by one. The client networks are synchronized at the end of each epoch by averaging all client updates at the fed server.
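
A rough sketch of one SFLv2 epoch in the same simplified PyTorch setting as above; the `fedavg` and `split_step` helpers are assumptions of mine (with `split_step` standing in for a pass like `split_training_step` earlier), not the authors' code.

```python
import copy

def fedavg(state_dicts, sizes):
    """Weighted (federated) average of model state_dicts."""
    n = sum(sizes)
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum((n_k / n) * sd[key] for sd, n_k in zip(state_dicts, sizes))
    return avg

def splitfed_v2_epoch(clients, client_nets, server_net, split_step):
    """SFLv2: a single shared server-side network is trained sequentially, client by client.
    `split_step(client_net, server_net, x, y)` is a hypothetical helper that runs one
    split-learning pass: forward to the cut layer on the client, the rest on the server,
    then back-propagation across the cut.
    """
    sizes = []
    for (x, y), client_net in zip(clients, client_nets):
        split_step(client_net, server_net, x, y)
        sizes.append(len(x))
    # Fed server: synchronize the client-side networks by federated averaging at the end of the epoch.
    avg_client = fedavg([net.state_dict() for net in client_nets], sizes)
    for net in client_nets:
        net.load_state_dict(avg_client)

# In SFLv1, by contrast, the server would process all clients' cut-layer activations in
# parallel and update the server-side network with a weighted average of the gradients.
```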

Now let’s move on to comparing these three methods:

The paper (Gawali, Manish, et al. “Comparison of Privacy-Preserving Distributed Deep Learning Methods in Healthcare.” arXiv preprint arXiv:2012.12591 (2020)) conducts its experiments using chest X-ray scans from five different sources. Three of these were private datasets, referred to as DT1, DT2, and DT3. The labels come from a team of board-certified radiologists who manually annotated the X-ray images using a custom-built annotation tool. The following table describes the dataset distribution and the number of training, validation, and test samples taken from the various sources.

For FL, the paper uses the federated averaging algorithm introduced in Konečný et al. to update the global neural network model at the end of each federated round (epoch).

For SL, the paper experiments with two split learning configurations: the vanilla split learning / label-sharing (LS) configuration and the U-shaped split learning / non-label-sharing (NLS) configuration, as shown below.

In the LS configuration, the images stay with the clients and the labels are sent to the server. In the NLS configuration, both the images and the labels remain with the clients.

In the DenseNet experiments, the cut layer is set at layer 4, and the rest of the network is at the server for the LS configuration. For the NLS configuration, the last fully connected layer is present at the client side in addition to the first 4 layers.

In the U-Net experiments, the cut layer is set at layer 16 and the rest of the network is at the server for the LS configuration. For the NLS configuration, the segmentation head (consisting of the last 3 layers) is at the client-side in addition to the first 6 layers.
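
For intuition, a U-shaped (NLS) split places both the input-side layers and the final head on the client, so neither images nor labels leave it. A minimal sketch follows; the three-segment decomposition and layer sizes are illustrative assumptions, not the exact DenseNet or U-Net splits from the paper.

```python
import torch.nn as nn

# Client holds the front of the network and the final head; the server holds the middle.
client_front = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())          # layers before the first cut
server_body  = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten())              # bulk of the network
client_head  = nn.Linear(32, 2)                                                  # final layer(s): labels stay local

def u_shaped_forward(x):
    a1 = client_front(x)     # client -> server: activations at the first cut
    a2 = server_body(a1)     # server -> client: activations at the second cut
    return client_head(a2)   # client computes logits and the loss against its local labels
```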

The paper does not use any form of client-side weight synchronization for SL; the weights of every client's network segment are unique after training. An image from a particular data source (from the train, validation, or test set) is therefore passed through the corresponding client network.

The models used in the experiments are DenseNet and U-Net. The performance evaluation metrics are AUROC, AUPRC, and threshold-dependent metrics such as F1-score and kappa. The paper also takes other metrics into consideration, for example, training time, data communication, and computation. The traditional centralized method is used as the benchmark for performance.
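
For reference, the threshold-free and threshold-dependent metrics mentioned here can be computed with scikit-learn roughly as follows (binary-classification case; the 0.5 threshold is an assumption of mine, not necessarily the one used in the paper):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, cohen_kappa_score)

y_true = np.array([0, 1, 1, 0, 1])            # ground-truth labels
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])  # predicted probabilities

auroc = roc_auc_score(y_true, y_prob)            # area under the ROC curve
auprc = average_precision_score(y_true, y_prob)  # area under the precision-recall curve

y_pred = (y_prob >= 0.5).astype(int)             # threshold-dependent metrics need hard labels
f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
```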

For SplitFed, the paper excluded SFLv1 from the experiments due to the unavailability of a supercomputer. Instead, it proposes a novel architecture called SplitFedv3, which has the potential to outperform SL and SFLv2. Because a large trainable part of the network sits at the server in SL and SFLv2, “catastrophic forgetting” can occur, where the trained model favors the client data it most recently used for training (M. J. Sheller, B. Edwards, G. A. Reina, J. Martin, S. Pati, A. Kotrotsou, M. Milchenko, W. Xu, D. Marcus, R. R. Colen, et al., “Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data”).

In SFLv3, the client-side networks are unique for each client and the server-side network is an averaged version, the same as in SplitFedv1. The problem of catastrophic forgetting is avoided by averaging the server-side network. In SFLv2 and SFLv3, the split happens at the same position in the networks as described in the split learning settings for the DenseNet and U-Net experiments. For SplitFed, the paper uses only the alternate client training technique.

Now let’s take a look at the results.

Below are the comparison results for Performance:

One can see that no distributed learning method outperforms the benchmark for both models.

We can further investigate the confidence intervals of the AUROC and AUPRC curves:

For Elapsed Training Time:

The time taken to train the centralized and different distributed learning models is shown below.

SL, SFLv2, and SFLv3 models take almost the same time to train, depending on the configuration. FL models take significantly less time to train than SL, SFLv2, and SFLv3 for both sets of experiments.

For Data Communication:

The amount of back-and-forth data communication that takes place between the server and all clients is shown below.

For Computation:

The paper considers both the computation at the server (server FLOPs) and at the clients (client FLOPs), as both are in the range of teraFLOPs. Since different clients hold different datasets, the average of all client FLOPs is taken as the metric. The comparison results are shown below:

The number of computations that take place at the client is significantly greater in FL than in SL. SL, SFLv2, and SFLv3 have a similar number of computations.

Conclusion

The study in the paper demonstrates the cost and feasibility of using distributed learning methods in practice. In terms of model performance, even though none of the methods outperforms the traditional centralized method, SplitFedv3 performs the best among the distributed methods. For the other metrics, such as training time and data communication, the SL, SplitFedv2, and SplitFedv3 models take more time to train than the FL model and require more data communication. SL, SplitFedv2, and SplitFedv3 would need a high-speed network with large bandwidth to train in a practical setting. The FL model, however, has higher computational costs: to train an FL model, clients would need substantial computational resources to carry out heavy computations. SL, SplitFedv2, and SplitFedv3, on the other hand, require only a small number of computations at the client, even without access to GPUs.

Taking all the metrics into account, including performance, elapsed training time, data communication, and computation, FL is the best distributed learning method as long as sufficient computing power is available at the clients.

References

Gawali, Manish, et al. “Comparison of Privacy-Preserving Distributed Deep Learning Methods in Healthcare.” arXiv preprint arXiv:2012.12591 (2020).

Konečný, Jakub, et al. “Federated optimization: Distributed machine learning for on-device intelligence.” arXiv preprint arXiv:1610.02527 (2016).

Gupta, O., and R. Raskar. “Distributed learning of deep neural network over multiple agents.” Journal of Network and Computer Applications 116 (2018): 1–8.

Thapa, C., M. A. P. Chamikara, and S. Camtepe. “SplitFed: When federated learning meets split learning.” arXiv preprint arXiv:2004.12088 (2020).
