PS2: Parameter Server on Spark
Formal Metadata
Title: PS2: Parameter Server on Spark
Number of Parts: 155
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/43080 (DOI)
SIGMOD 2019, 144 / 155
Transcript: English (auto-generated)
00:02
I'm Zhipeng Zhang from Peking University, and I'm currently an intern at Tencent. Today I'm going to talk about our recent work, PS2, on how to build a parameter server on top of Spark. This work was done with Bin Cui and Xupeng Miao from Peking University, and Lele Yu and Jiawei Jiang from Tencent.
00:22
So let's start with the industrial scenario at Tencent. We have many large-scale machine learning applications, such as user profiling and video recommendation. A common solution for these tasks is to first use big data tools like Spark for data collection and pre-processing, and then feed the data into specialized machine learning systems such as TensorFlow or Angel.
00:47
However, this process incurs expensive data movement between two different systems, especially in the context of big data. To verify this, we logged the jobs running every day on Tencent's machine learning platform. We found that more than 80% of the data is extracted and processed by Spark,
01:06
but only 3% of the machine learning jobs run on Spark MLlib, as shown by the yellow slice in this pie chart, while the others run on TensorFlow, Angel, and XGBoost.
01:21
So our question is: can we use a single system that avoids the expensive data movement, but still preserves both big data processing and efficient machine learning training? To this end, we propose PS2, a parameter server architecture on top of Spark.
01:42
Here is the outline of this talk. We first analyze the performance bottlenecks of Spark MLlib when handling large-scale machine learning models. Then we present the PS2 system architecture. After that, we present the DCV abstraction, which we use to manage machine learning models on PS2.
02:02
Finally, I present some experimental results and show some industrial uses. We use stochastic gradient descent (SGD) as an example to analyze Spark MLlib. Here is how MLlib implements SGD.
02:20
Say you have a driver and multiple executors. The driver stores the model and the executors store the training data. When doing SGD, the driver first broadcasts the model to the executors. Each executor uses the received model and its local data to compute a gradient. Finally, the driver aggregates the gradients, sums them up, and updates the model.
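To make this flow concrete, here is a minimal sketch of the broadcast-and-aggregate pattern using plain Spark for logistic regression. It is illustrative only: the data layout (an RDD of (features, label) pairs) and the function name are assumptions, not MLlib's actual internals.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One SGD step in the MLlib-style pattern described above (illustrative sketch).
def sgdStep(sc: SparkContext,
            data: RDD[(Array[Double], Double)],   // (features, label)
            weights: Array[Double],
            stepSize: Double): Array[Double] = {
  val dim = weights.length
  // 1) The driver broadcasts the full model to every executor.
  val bcW = sc.broadcast(weights)

  // 2) Each executor computes a partial gradient on its local partitions,
  // 3) and the partial gradients are aggregated back at the driver.
  val (gradSum, count) = data.treeAggregate((new Array[Double](dim), 0L))(
    { case ((grad, n), (x, y)) =>
        val margin = (0 until dim).map(i => bcW.value(i) * x(i)).sum
        val mult = 1.0 / (1.0 + math.exp(-margin)) - y   // logistic-loss gradient factor
        var i = 0; while (i < dim) { grad(i) += mult * x(i); i += 1 }
        (grad, n + 1)
    },
    { case ((g1, n1), (g2, n2)) =>
        var i = 0; while (i < dim) { g1(i) += g2(i); i += 1 }
        (g1, n1 + n2)
    })

  // 4) The driver updates the model. Note that all gradient traffic
  //    funnels through this single driver node.
  weights.indices.map(i => weights(i) - stepSize * gradSum(i) / count).toArray
}
```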
02:45
Clearly, Spark uses a single-node driver for communication. We then analyzed the performance bottlenecks of Spark MLlib empirically, training logistic regression with Spark MLlib on datasets with different numbers of features.
03:03
The left figure shows the time cost per iteration on the different datasets. We can see that the time cost per iteration increases sharply as the model size increases. The right figure further presents a breakdown of the four steps mentioned above.
03:22
Here we can clearly see that the gradient aggregation step is the bottleneck, and this is where MLlib performs badly. The reason is that Spark employs a single node, the driver, to do the communication. As we know, parameter servers are good at communication because they can partition the model parameters
03:44
across different machines so that the communication is balanced. Motivated by this, we propose PS2, a parameter server architecture on top of Spark. Conceptually, PS2 replaces the Spark driver with multiple parameter servers.
04:03
This figure shows the system architecture of PS2. There are three system roles in PS2: the coordinator, the executors, and the PS servers. The coordinator schedules the whole system and contains the PS master and the driver. The executors process the input data and train machine learning models.
04:25
The PS servers store the machine learning models. When running a machine learning job on PS2, the coordinator first sends RDD operations to the executors and schedules the PS servers through DCV operations.
04:41
The executors can also communicate with the PS servers through DCV operations. Here, DCV is the abstraction we use to manage machine learning models on the PS servers. So now let's take a look at the DCV abstraction. The first question is: why do we need a new data abstraction?
05:04
Why is the traditional parameter server architecture not enough? The reason is that existing parameter server architectures ignore the computation pattern used to update many existing machine learning models. These models usually have the following two characteristics.
05:21
First, many machine learning models employ multiple vectors as the model. Second, we usually need to do element-wise operations across these model vectors. We use two examples to explain this. The first example uses Adam to train a logistic regression model.
05:42
This is the formula for updating the model vectors. Here beta1 and beta2 are hyperparameters, G is the gradient vector, and S, V, and W are model vectors. Clearly, in the two highlighted rectangles, updating the model vectors involves element-wise operations among them; in one of them, for example, the update involves element-wise operations between S, V, and W.
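The slide's exact formula is not reproduced in the transcript, but the standard Adam update (shown here without the bias-correction terms, and with my own assumption of how S and V map to the two moment vectors) has exactly this element-wise structure:

$$
\begin{aligned}
V &\leftarrow \beta_1 V + (1-\beta_1)\, G,\\
S &\leftarrow \beta_2 S + (1-\beta_2)\, G \odot G,\\
W &\leftarrow W - \eta\, \frac{V}{\sqrt{S} + \epsilon},
\end{aligned}
$$

where every product, square root, and division is applied element-wise across the gradient G and the model vectors S, V, and W.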
06:07
The second example uses SGD for graph embedding.
06:20
Here we use DeepWalk as the example for graph embedding. The model update function is shown here, where eta is a hyperparameter (the learning rate) and U and V are model vectors. When training the model, we use one model vector to update the other, so again we need element-wise operations between these two model vectors.
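The slide itself is not shown here, but a typical skip-gram-with-negative-sampling style update used for DeepWalk embeddings looks like the following; the symbols y and sigma are my notation, not necessarily the slide's, and they only serve to illustrate the same element-wise pattern:

$$
\begin{aligned}
\delta &= \eta\,\bigl(y - \sigma(U^{\top} V)\bigr),\\
U &\leftarrow U + \delta\, V, \qquad V \leftarrow V + \delta\, U,
\end{aligned}
$$

where y is 1 for an observed (node, context) pair and 0 for a negative sample; both updates are element-wise between the two embedding vectors U and V.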
06:46
Although this computation pattern is widespread, traditional pull and push operators cannot express these operations efficiently, because they lack the ability to do server-side computation. If we
07:02
only use pull and push operators to implement this computation pattern, we have to pull the model vectors to the workers, do the update there, and push them back to the servers. Clearly, this is not efficient and incurs a huge communication cost for large models.
07:25
So we propose the DCV abstraction, which stands for dimension co-located vector. A DCV is a distributed vector on the parameter servers that is partitioned by column and takes locality into account. For example,
07:42
the figure below shows two DCVs distributed over two PS servers. Both of them have K elements and are partitioned by columns. A DCV supports both row access and column access operators, compared to only row access operators in
08:01
traditional parameter servers, such as pull and push. Row access operators access a DCV by rows; for example, they can read or write a single row. Column access operators access the same dimension of multiple DCVs, and they support
08:22
element-wise operations across multiple DCVs. Next we show the power of column access operators. Suppose we want to compute the dot product between two DCVs. With row access operators, we have to pull both DCVs from the servers and then compute the dot product locally.
08:44
In that case, the communication cost is 2K. But with column access operators, we can compute the partial dot product inside each PS server and then simply aggregate the partial results. In this example, the communication cost is only two scalars.
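Stated generally, if the two DCVs have K dimensions and are spread over S servers, the communication volumes compare roughly as:

$$
\underbrace{2K}_{\text{row access: pull both DCVs}}
\quad\text{vs.}\quad
\underbrace{S}_{\text{column access: one partial dot per server}}
$$

with S = 2 servers in the example above.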
09:04
Thus, the larger K is, the more communication cost we can save. That is the power of column access operators. In more detail, the table below summarizes the operators that we support in PS2.
09:27
For row access operators, we support pull, push, sum, number of non-zeros, and norm. For column access operators, we support axpy, dot, sub, and so on. For creation ops,
09:46
we can create a dense DCV or a sparse DCV. We also support deriving a new DCV from an existing one, such that the same dimensions of the two DCVs are located on the same server.
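As a rough illustration of that operator table, the DCV surface could look something like the trait below; the method names and signatures are my paraphrase of the table, not the verified PS2 API.

```scala
// Hypothetical sketch of a DCV interface, paraphrasing the operator table above.
trait DCV {
  // Row access operators: act on a single DCV.
  def pull(): Array[Double]              // read the whole vector to the caller
  def push(values: Array[Double]): Unit  // write the whole vector
  def sum(): Double
  def nnz(): Long                        // number of non-zeros
  def norm(p: Double): Double

  // Column access operators: element-wise ops across co-located DCVs,
  // executed inside each PS server.
  def axpy(a: Double, other: DCV): Unit  // this += a * other
  def dot(other: DCV): Double
  def sub(other: DCV): DCV

  // Creation op: derive a new DCV co-partitioned with this one, so the same
  // dimension of both vectors lives on the same server.
  def derive(): DCV
}
```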
10:01
Next, we show how PS2 trains a logistic regression model with Adam by walking through the Scala code. Writing code in PS2 is as simple as writing Spark programs, so Spark users can easily use PS2 to write their own machine learning programs.
10:20
Here, the underlined functions are DCV operations, which are executed on the parameter servers. Let's walk through this code. We first use Spark's textFile function to load the data distributedly. Then we initialize the model vectors. Here we use the
10:41
derive operator to generate the model vectors so that they are co-located. Then, in each iteration, we first reset the gradient vector to zero. Each worker then pulls the model from the
11:01
servers with a pull operator, computes its local gradient, and pushes the local gradient to the servers with an add operation. Finally, we invoke a zip operation, which triggers server-side computation to update the model.
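The slide code itself is not in the transcript; below is a minimal sketch of what such a program might look like. The names (ps.dense, derive, fill, pull, add, zip, plus the parseInstance, computeGradient, and adamUpdate helpers) are assumptions based on the walkthrough, not the verified PS2 API.

```scala
// Hypothetical PS2-style program for logistic regression with Adam (sketch only).
val data = sc.textFile("hdfs://.../training_data")   // load data distributedly
             .map(parseInstance)                      // parser assumed: (features, label)
             .cache()

// Model vectors stored on the parameter servers as DCVs.
val w = ps.dense(dim)      // weights
val v = w.derive()         // Adam first-moment vector, co-located with w
val s = w.derive()         // Adam second-moment vector
val g = w.derive()         // gradient accumulator

for (iter <- 1 to numIterations) {
  g.fill(0.0)                                         // reset the gradient DCV
  data.foreachPartition { part =>
    val localW = w.pull()                             // each worker pulls the model
    val localGrad = computeGradient(localW, part)     // gradient on local data
    g.add(localGrad)                                  // push (add) local gradient to servers
  }
  // zip triggers a server-side, element-wise Adam update over (w, v, s, g),
  // so the model vectors never leave the servers.
  w.zip(v, s, g)(adamUpdate(beta1, beta2, eta))
}
```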
11:21
Let's see another example: the implementation of DeepWalk in PS2. The figure shows a setting with two PS servers, over which we partition the model into two pieces. When training the DeepWalk model, we first use the coordinator to
11:43
schedule the executors through RDD operations. The executors then invoke a dot operation to compute the dot products inside the PS servers. After that, they invoke a self-defined update function to update the model on the PS servers.
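A rough sketch of that dataflow follows; pairs, embeddings, dot, update, and sigmoid are hypothetical names standing in for the RDD of training pairs and the DCV-backed embedding table described in the talk.

```scala
// Hypothetical DeepWalk training loop on PS2 (sketch only).
pairs.foreachPartition { part =>          // coordinator schedules executors via RDD ops
  part.foreach { case (src, dst, label) =>
    // Server-side dot product between the two embedding rows;
    // only a scalar comes back to the executor.
    val score = embeddings.dot(src, dst)
    val coeff = eta * (label - sigmoid(score))
    // Self-defined server-side update: scale one embedding row by coeff and
    // add it to the other, without pulling either vector to the worker.
    embeddings.update(src, dst, coeff)
  }
}
```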
12:04
In this case, we again do not need to pull or push the model to the workers; we only send a few scalars, such as dot products. Now let me present some experimental results.
12:23
The experiments were conducted on an in-house YARN cluster at Tencent, which contains 2,700 machines. Each machine has one CPU with 12 cores, and the network is 10 gigabits per second. For baselines, we compared against Spark MLlib,
12:42
DSTML, Glint, Petuum, and XGBoost. For models, we compared logistic regression, LDA, gradient boosting trees, and DeepWalk. For datasets, we used both public datasets and Tencent workloads. We first present the benefits brought by the DCV abstraction. In this experiment, we compared
13:05
PS2 with plain Spark (no parameter servers) and with PS, a parameter server architecture without the DCV abstraction. Figures A and B present the comparison of Adam on the different systems.
13:21
From Figures A and B, we find that PS2 is about 4.7x to 5x faster than PS without the DCV abstraction, and
13:43
it is also 15.7x and 50.6x faster than the original Spark implementation. Figures C and D further compare PS and PS2 on the DeepWalk model on two Tencent graphs. On the smaller graph,
14:07
PS2 achieves a 5x speedup, but on the larger graph the speedup
14:21
decreases somewhat. This is because we use more parameter servers there, and the communication cost of PS2 grows with the number of servers.
14:41
Then we present the end-to-end comparison. This figure compares logistic regression on different systems. We find that PS2 is still the fastest; for example, it is 1.6x and 2.3x faster than Petuum on the KDDB and KDD12 datasets.
15:01
On gradient boosting trees, PS2 is 3.3x faster than XGBoost. For LDA, PS2 is 3.7x faster than Petuum and 9x faster than Glint.
15:23
In Figure B, we show that PS2 is 17x faster than Spark MLlib. Moreover, Figure C shows that PS2 can run a Tencent workload with a topic size of 1,000 while all the other systems fail. PS2 has now been used at Tencent for more than one year and is
15:47
also widely used in other companies. The left figure shows the number of PS2 jobs running each month last year on Tencent's machine learning platform; we observe a huge increase since July. The right figure shows that more than 100 companies
16:03
have tried or used PS2. If you want to try it, it is available online. Thanks.