Hot Topics on Data Center (HotDC) 2018

Keynote Session

Accelerate Machine Intelligence: An Edge to Cloud Continuum

Hadi Esmaeilzadeh – UCSD

Background

open source: http://act-lab.org/artifacts

CoSMIC stack

how to distribute
  • understanding machine learning – solving an optimization problem
  • abstraction between algorithm and acceleration system – parallelized stochastic gradient descent solver (targets FPGA, GPU, ASIC, CGRA, Xeon Phi)
  • leverage the linearity of differentiation for distributed learning
  • programming and compilation
    • build a new language for the math
    • dataflow graph generation
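The "linearity of differentiation" point is what makes the distribution work: the gradient of a sum of per-example losses is the sum of per-example gradients, so each node can compute a partial gradient on its own shard and a reducer just sums them. A minimal sketch (toy loss, hypothetical function names, not CoSMIC's actual code):

```python
# Distributed SGD sketch: by linearity of differentiation, the gradient of a
# sum of per-example losses equals the sum of per-example gradients, so nodes
# compute partial gradients on their own data shards and a reducer sums them.

def gradient(w, shard):
    # d/dw of 0.5*(w*x - y)^2 summed over one shard: (w*x - y) * x per example.
    return sum((w * x - y) * x for x, y in shard)

def distributed_sgd_step(w, shards, lr=0.01):
    partials = [gradient(w, s) for s in shards]  # computed in parallel per node
    return w - lr * sum(partials)                # reduce: sum partial gradients

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_sharded = distributed_sgd_step(1.0, [data[:2], data[2:]])  # two "nodes"
w_single = distributed_sgd_step(1.0, [data])                 # one node
# Both produce the same update: sharding the data does not change the math.
```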
how to design a customizable accelerator
  • multi-threaded acceleration
  • connectivity and bussing
  • PE architecture – keep the hardware simple
how to reduce the overhead of distributed coordination

specialized system software in CoSMIC

benchmarks
  • 16-node CoSMIC with UltraScale+ FPGAs offers 18.8x speedup over 16-node Spark with E3 Skylake CPUs
  • the speedup comes from both the FPGA (66%) and the software stack (34%)

RoboX Accelerator Architecture

DNNs tolerate low-bitwidth operations – exploit this at the bit level
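The bitwidth-tolerance point can be made concrete with uniform symmetric quantization, shown below as an illustrative sketch (this is a generic scheme, not necessarily RoboX's actual one):

```python
def quantize(x, bits, x_max=1.0):
    # Uniform symmetric quantization of x in [-x_max, x_max] to `bits` bits.
    levels = 2 ** (bits - 1) - 1            # e.g. 7 positive levels at 4 bits
    q = round(x / x_max * levels)
    q = max(-levels, min(levels, q))        # clamp to the representable range
    return q * x_max / levels               # dequantized low-bitwidth value

# A 4-bit weight keeps only coarse resolution; DNN accuracy often tolerates it.
w4 = quantize(0.53, 4)      # snaps 0.53 to the nearest of the 15 levels
```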

Making Cloud Systems Reliable and Dependable: Challenges and Opportunities

Lidong Zhou – MSRA

Background

system reliability:

  • Fault Tolerance
  • Redundancies
  • State Machine Replication
  • Paxos
  • Erasure Coding

Real-World Gray Failures in Cloud

  • redundancies in data center networking
  • active device and link failure localization in data center
  • NetBouncer: large-scale path probing and diagnosis
  • NetBouncer: leverage the power of scale
  • root cause of gray failures – requests stuck due to network issues while heartbeats still look normal
  • Insight: detect the errors that requesters actually see
    • critical gray failures are observable
    • shift from error handling to error reporting

Solution – Panorama

  • Analysis – automatically convert a software component into an in-situ observer
  • Runtime – observers send to a local observation store (LOS)
    • locate ob-boundary
    • observations not always direct
    • observations split to ob-origin & ob-sink
    • match ob-origin & ob-sink
  • Detect what “requesters” see
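The core Panorama insight above – judge health from requester-side observations, not heartbeats – can be sketched in a few lines (class and method names are hypothetical stand-ins, not Panorama's actual API):

```python
from collections import defaultdict

class LocalObservationStore:
    # Toy stand-in for Panorama's LOS: collects requester-side observations.
    def __init__(self):
        self.obs = defaultdict(list)

    def report(self, observer, subject, status):
        self.obs[subject].append((observer, status))

    def verdict(self, subject):
        # A heartbeat can still look healthy while requests are stuck;
        # the requesters' errors are what reveal the gray failure.
        statuses = [s for _, s in self.obs[subject]]
        return "unhealthy" if statuses.count("error") > statuses.count("ok") else "healthy"

los = LocalObservationStore()
los.report("frontend", "storage-1", "error")   # request stuck / failed
los.report("indexer", "storage-1", "error")
los.report("heartbeat", "storage-1", "ok")     # heartbeat still normal
verdict = los.verdict("storage-1")
```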

Reliability of Large-Scale Distributed Systems

  • foundational reliability
  • rethink cloud reliability: new theory & new methods
  • understand gray failures
  • systematic and comprehensive observations

paper: Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!

Haibo Chen – SJTU

Background

  • (Distributed) transactions used to be slow
  • high cost for distributed TX – usually only 10s~100s of thousands of TPS (SIGMOD’12)
  • only 4% of wall-clock time spent in useful data processing

new features:

  • RDMA: remote direct memory access
    • ultra-low latency (5 µs)
    • ultra high throughput
  • NVM: Non-volatile memory

An Active Line of Research of RDMA-enabled TX

  • DrTM – DrTM (SOSP 2015), DrTM-R (EuroSys 2016), DrTM-B (USENIX ATC 2017)
  • FaRM – FaRM-KV (NSDI 2014), FaRM-TX (SOSP 2015)
  • FaSST (OSDI 2016)
  • LITE (SOSP 2017)

Transactions (TXs)

  • protocols – OCC, 2PL, SI…
  • implementations on hardware devices – CX3, CX4, CX5, RoCE; one-sided, two-sided…
  • OLTP workloads – TPC-C, TPC-E, TATP, SmallBank

Main: Use RDMA in TXs

outline:

  • RDMA primitive-level analysis
  • Phase-by-phase analysis for TX
  • DrTM+H: Putting it all together

content:

  • phases: Execution / Validation / Logging / Commit
  • one-sided offloading improves performance
  • one-sided primitives have good scalability on modern RNICs
  • execution framework & DrTM+H: https://github.com/SJTU-IPADS/drtmh
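The phase split above (Execution, Validation, Logging, Commit) is the standard OCC structure that DrTM+H analyzes phase by phase; a minimal single-node skeleton of those phases (a sketch, not DrTM+H's actual implementation) looks like:

```python
class Record:
    def __init__(self, value):
        self.value, self.version = value, 0

def occ_tx(store, reads, writes):
    # Execution: read values and remember versions (one-sided RDMA READs in
    # the RDMA setting).
    read_versions = {k: store[k].version for k in reads}
    # Validation: abort if any record we read was modified in the meantime.
    if any(store[k].version != v for k, v in read_versions.items()):
        return False
    # Logging: persist the write set before making it visible (elided here).
    # Commit: install the writes and bump versions.
    for k, val in writes.items():
        store[k].value, store[k].version = val, store[k].version + 1
    return True

store = {"x": Record(1), "y": Record(2)}
ok = occ_tx(store, reads=["x"], writes={"y": 10})
```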

RDMA in Data Centers: from Cloud Computing to Machine Learning

Chuanxiong Guo – ByteDance

Background

  • Data Center Networks (DCNs) offer many services
    • single ownership
    • large scale
    • bisection bandwidth
  • TCP/IP does not work well
    • latency
    • bandwidth
    • processing overhead (at 40G: 12% CPU at the receiver & 6% at the sender)

RDMA over Commodity Ethernet (RoCEv2)

  • no CPU overhead
  • single QP: 88 Gb/s at 1.7% CPU usage (TCP with 8 connections: 30–50 Gb/s, 2.6% client & 4.3% server CPU)
  • RoCEv2 needs a lossless Ethernet network
    • PFC (Priority-based Flow Control) – hop-by-hop flow control
    • DCQCN – sender–switch–receiver (RP–CP–NP)
  • the slow-receiver symptom – ToR to NIC is 40 Gb/s & NIC to server is 64 Gb/s; the NIC may still generate a large number of PFC pause frames

RDMA for DNN Training Acceleration

  • understanding and using DNNs
  • DNN training: backpropagation
  • distributed ML training on GPUs with mini-batches
  • RDMA acceleration: ResNet \ RNNs \ DNNs (RDMA outperforms TCP)

Highlighted Research Session

Congestion Control Mechanisms in Data Center Networks

Wei Bai – MSRA

Achieving low latency in DCNs

  • queueing delay – PIAS (NSDI 2015)
  • packet-loss retransmission delay – TLT

PIAS

  • Flow Completion Time (FCT) is the key metric
  • flow information cannot be assumed known; must be quickly deployable on existing hardware
  • PIAS uses a Multi-Level Feedback Queue (MLFQ) to emulate Shortest Job First (SJF)
  • three functions in PIAS:
    • packet tagging
    • switch queueing
    • rate control
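The tagging side of the MLFQ-to-SJF emulation can be sketched as follows: a packet's priority tag depends only on how many bytes its flow has already sent, so no flow-size knowledge is needed (the threshold values here are hypothetical, not PIAS's tuned ones):

```python
def pias_tag(bytes_sent, thresholds):
    # PIAS tags a packet by the bytes its flow has already sent; switches
    # serve lower tags first, so short flows finish early, emulating SJF
    # without knowing flow sizes in advance.
    for priority, limit in enumerate(thresholds):
        if bytes_sent < limit:
            return priority
    return len(thresholds)                  # longest flows: lowest priority

demotion = [100 * 1024, 1024 * 1024]        # hypothetical demotion thresholds
short_tag = pias_tag(10 * 1024, demotion)   # short flow: highest priority
long_tag = pias_tag(5 * 1024 * 1024, demotion)
```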

TLT

  • gets the benefits of both lossy and lossless networks at once
  • uses PFC to eliminate congestion packet losses
  • packet loss:
    • in the middle of a flow – fast retransmissions
    • at the tail – timeout retransmissions
  • identify important packets; when the switch queue exceeds a threshold, drop only the unimportant ones
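The selective-drop rule above can be sketched as a single enqueue decision (a toy model of the idea, not TLT's actual switch logic):

```python
def try_enqueue(queue, packet, threshold):
    # TLT sketch: once the queue exceeds a threshold, drop only unimportant
    # packets. Important packets stay lossless (avoiding timeout
    # retransmissions); dropped unimportant ones recover via fast retransmit.
    if len(queue) >= threshold and not packet["important"]:
        return False                        # dropped: lossy path
    queue.append(packet)
    return True

q = [{"important": True}]
kept = try_enqueue(q, {"important": True}, threshold=1)        # always admitted
dropped = not try_enqueue(q, {"important": False}, threshold=1)
```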

Understanding the challenges of Scaling Distributed DNN Training

Cheng Li – USTC

  • deep learning is growing fast
  • DNN – Deep Neural Networks
  • benefits from more data / bigger models / more computation
  • Jeff Dean – Google

Distributed DNN

  • model or data parallelism
    • data parallelism is the primary choice
  • BSP / ASP – BSP is the usual choice (ASP may not converge)
    • Bulk Synchronous Parallel – deterministic synchronization
    • Asynchronous Parallel
  • network \ server \ other bottlenecks for parallelism
  • determine the constraints on compute capacity through measurement
    • compressing data for transfer adds compression overhead
  • system design
    • elastic system design
    • the slowest component ultimately bounds overall computation speed
    • how to resize the system quickly – message-bus stream processing – producer–consumer model
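The BSP-vs-ASP point above is about synchronization: in BSP every worker blocks at a barrier before the averaged update is applied, which is what makes the result deterministic. A minimal thread-based sketch (names hypothetical):

```python
import threading

def bsp_average(grads):
    # BSP sketch: each worker computes its gradient, then all block at a
    # barrier; the averaged update is computed only after the last worker
    # arrives, making the step deterministic (unlike ASP).
    results = [None] * len(grads)
    barrier = threading.Barrier(len(grads))

    def worker(i):
        results[i] = grads[i]   # compute phase (stand-in for real work)
        barrier.wait()          # synchronization phase

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(grads))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results) / len(results)

avg = bsp_average([1.0, 3.0, 5.0])
```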

Octopus: an RDMA-enabled Distributed Persistent Memory File System

Youyou Lu – Tsinghua

  • distributed file system design
  • non-volatile memory – storage in memory
  • DRAM limitations
    • cell density
    • refresh – performance/power cost
  • NVDIMM – retains data across power loss
  • Intel 3D XPoint – near-DRAM latency, high capacity, non-volatile across power loss
  • RDMA – used in high-performance environments
  • DiskGluster – latency comes from the HDD | MemGluster – latency comes from the software
  • RDMA-enabled distributed file system
    • shared data management
    • new data-flow strategies
    • efficient RPC design
    • concurrency control

Design

  • I/O handling
    • organize all NVMM into one shared space
    • reduce data copies in the DFS (from 7 down to 4)
    • the server looks up the data's storage address; the client fetches the data itself once it has the address (offloading work to the client)
  • metadata RPC
  • collect-dispatch distributed transactions
  • performance evaluation
    • LAN server tests – achieves up to 88% of the network bandwidth
    • tested on the Hadoop platform
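The client-fetch flow in the I/O handling notes – server returns only the address, client reads the data directly – can be sketched like this (class and function names are hypothetical; the direct read stands in for an RDMA READ):

```python
class OctopusServerSketch:
    # Toy model of the notes' flow: the server only looks up where the data
    # lives in the shared NVMM space; the client then reads it directly
    # (an RDMA READ in the real system), shifting work to the client.
    def __init__(self, nvmm):
        self.nvmm = nvmm            # shared persistent-memory space
        self.index = {}             # filename -> (offset, length)

    def lookup(self, name):
        return self.index[name]     # cheap metadata work stays on the server

def client_read(server, nvmm, name):
    offset, length = server.lookup(name)         # 1. ask server for address
    return bytes(nvmm[offset:offset + length])   # 2. fetch data directly

nvmm = bytearray(b"....hello world....")
srv = OctopusServerSketch(nvmm)
srv.index["f"] = (4, 11)
data = client_read(srv, nvmm, "f")
```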

Short Talk

Computer Organization and Design Course with FPGA Cloud, Ke Zhang (ICT, CAS)

New technologies: AI \ IoT
Develop new hardware–software co-design skills – CPU \ GPU \ FPGA \ ASIC
ZyForce platform – virtualized FPGA lab experiments

ActionFlow:A Framework for Fast Multi-Robots Application Development, Jimin Han (UCAS)

Senior undergraduate at UCAS – started August 2018
Rapid application development for robots

Labeled Network Stack, Yifan Shen (ICT, CAS)

Caching or Not: Rethinking Virtual File System for Non-Volatile Main Memory, Ying Wang (ICT, CAS)

Data Motif-based Proxy Benchmarks for Big Data and AI Workloads, Chen Zheng (ICT, CAS)