In the past few years, many Byzantine-tolerant distributed machine learning (DML) algorithms have been proposed in the point-to-point communication model. In this paper, we focus on a popular DML framework -- the parameter server computation paradigm with iterative learning algorithms that proceed in rounds, e.g., \cite{vaidya-2f-redundancy, Rachid_genuine_BML,Su_BGD}. One limitation of prior algorithms in this domain is their \textit{high communication complexity}. All the Byzantine-tolerant DML algorithms that we are aware of need to send $n$ $d$-dimensional vectors from the worker nodes to the parameter server in each round, where $n$ is the number of workers and $d$ is the number of dimensions of the feature space (which may be in the order of millions). In wireless networks, power consumption is proportional to the number of bits transmitted; consequently, it is extremely difficult, if not impossible, to deploy these algorithms on power-limited wireless devices.

Motivated by this observation, we aim to reduce the \textit{communication complexity} of Byzantine-tolerant DML algorithms in the \textit{single-hop radio network} \cite{Broadcast_SPAA10,Broadcast_Nitin_PODC05,Broadcast_Koo_PODC04}. Inspired by the CGC filter developed by Gupta and Vaidya, PODC 2020 \cite{vaidya-2f-redundancy}, we propose a gradient descent-based algorithm, Echo-CGC. Our main novelty is a mechanism that utilizes the \textit{broadcast property} of the radio network to avoid transmitting raw gradients (full $d$-dimensional vectors). In the radio network, each worker can overhear the gradients previously transmitted to the parameter server. Roughly speaking, in Echo-CGC, if a worker ``agrees'' with a linear combination of prior gradients -- i.e., the angle between that combination and its own local gradient is within a bound $\alpha$ -- it broadcasts an ``echo message'' instead of its raw local gradient.
The echo message contains a vector of coefficients (of size at most $n$) and the ratio of the magnitudes of two gradients (a single float). In comparison, traditional approaches need to send $n$ local gradients in each round, where each gradient is typically a vector in an ultra-high-dimensional space ($d \gg n$). Compared to the traditional approach of sending the entire raw gradient, the echo message contains only $O(n)$ values; hence, Echo-CGC effectively reduces the communication complexity when $n \ll d$. The improvement in communication complexity achieved by our algorithm depends on multiple factors, including the number of nodes, the number of faulty workers in an execution, and the cost function. We numerically analyze the improvement and show that, with a large number of nodes, Echo-CGC reduces communication by roughly $80\%$ under suitable assumptions.
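To make the echo mechanism concrete, the following is a minimal sketch (not the paper's exact rule) of a worker's send decision. It assumes the ``agreement'' test is an angle bound $\alpha$ against a least-squares combination of the gradients overheard so far in the round; the function name \texttt{make\_message} and the least-squares choice of coefficients are our own illustrative simplifications.

```python
import numpy as np

def make_message(g, heard, alpha):
    """Decide whether a worker sends an echo or its raw gradient g.

    g:     this worker's local stochastic gradient (length-d array).
    heard: gradients overheard from earlier broadcasts this round.
    alpha: angle bound (radians) for "agreeing" with a combination.

    Illustrative sketch: coefficients are chosen by least squares,
    which is one plausible way to pick a combination in the span of
    the overheard gradients.
    """
    if heard:
        H = np.stack(heard, axis=1)                      # d x k matrix
        coeffs, *_ = np.linalg.lstsq(H, g, rcond=None)   # k coefficients
        combo = H @ coeffs
        denom = np.linalg.norm(combo) * np.linalg.norm(g)
        if denom > 0:
            cos = np.clip(combo @ g / denom, -1.0, 1.0)
            if np.arccos(cos) <= alpha:
                # Echo: O(k) floats (k <= n) plus one magnitude ratio,
                # instead of the full d-dimensional raw gradient.
                ratio = np.linalg.norm(g) / np.linalg.norm(combo)
                return ("echo", coeffs, ratio)
    return ("raw", g)
```

The parameter server can then reconstruct an approximation of $g$ from the coefficients and the magnitude ratio, since it already holds the gradients the echo refers to; this is why the message shrinks from $d$ floats to $O(n)$ floats.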