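DynaShard is a novel blockchain sharding mechanism that continuously monitors and adjusts shard configuration according to system workload, optimizing resource utilization and performance.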
This paper proposes a decentralized control method for DC microgrids that locally restores bus voltage to its nominal value, eliminating reliance on communication links and enhancing reliability by compensating for voltage drops across feeder lines using local feedback within each converter.
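As a rough illustration of the feeder-drop compensation idea in the summary above, the sketch below assumes a purely resistive feeder of known resistance and a converter that measures its own output current. The function names, the resistance `r_feeder`, and the numeric values are hypothetical and are not taken from the paper; the paper's actual control law may differ.

```python
def compensated_setpoint(v_nominal: float, i_out: float, r_feeder: float) -> float:
    """Raise the converter's output-voltage setpoint by the estimated feeder drop.

    Uses only locally measured output current (assumption: resistive feeder).
    """
    return v_nominal + i_out * r_feeder


def bus_voltage(v_converter: float, i_out: float, r_feeder: float) -> float:
    """Voltage seen at the bus after the resistive drop across the feeder."""
    return v_converter - i_out * r_feeder


if __name__ == "__main__":
    v_nom, r = 48.0, 0.05  # hypothetical nominal voltage (V) and feeder resistance (ohm)
    for i in (0.0, 10.0, 25.0):  # hypothetical load currents (A)
        v_set = compensated_setpoint(v_nom, i, r)
        print(i, v_set, bus_voltage(v_set, i, r))
```

Under these assumptions the bus stays at the nominal value at any load, because the setpoint offset cancels the feeder drop without any communication between converters.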
Mint is a novel distributed tracing framework that addresses the limitations of traditional sampling methods by leveraging commonality and variability analysis to reduce trace overhead while capturing all requests, enabling comprehensive system observability with minimal resource consumption.
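Data streaming technologies and serialization protocols vary widely in performance depending on dataset characteristics and application requirements, so selecting the best combination is important.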
Large-scale distributed model training is susceptible to frequent machine failures, leading to significant downtime and economic losses. Minder, an automated faulty machine detection system, leverages machine-level similarity and continuity patterns in monitoring metrics to quickly and accurately identify faulty machines, minimizing manual effort and downtime.
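The similarity and continuity ideas in the Minder summary can be sketched in a few lines. This is a simplified stand-in, not Minder's actual detector: it assumes one scalar metric per machine per time window, treats distance from the cross-machine median as "dissimilarity", and flags a machine only after it is the outlier for several consecutive windows. The names `most_dissimilar` and `detect_faulty`, and the parameters `min_deviation` and `min_consecutive`, are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional


def most_dissimilar(window: Dict[str, float], min_deviation: float) -> Optional[str]:
    """Return the machine farthest from the cross-machine median,
    or None if no machine deviates by more than min_deviation."""
    mid = median(window.values())
    worst = max(window, key=lambda m: abs(window[m] - mid))
    return worst if abs(window[worst] - mid) > min_deviation else None


def detect_faulty(
    windows: List[Dict[str, float]],
    min_deviation: float = 0.2,
    min_consecutive: int = 3,
) -> Optional[str]:
    """Flag a machine once it is the outlier in min_consecutive windows in a row."""
    streak_machine: Optional[str] = None
    streak = 0
    for window in windows:
        candidate = most_dissimilar(window, min_deviation)
        if candidate is None:
            # No clear outlier in this window: reset the continuity streak.
            streak_machine, streak = None, 0
            continue
        if candidate == streak_machine:
            streak += 1
        else:
            streak_machine, streak = candidate, 1
        if streak >= min_consecutive:
            return streak_machine
    return None
```

Requiring a streak rather than a single outlying window is one simple way to avoid flagging transient jitter; a real system would likely combine several metrics and a learned notion of similarity.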
This paper presents context parallelism, a system optimization technique using ring attention, to improve the latency and scalability of large language model (LLM) inference, especially for long contexts, achieving near-linear scaling for long-context prefill latency with up to 128 GPUs.
The FCDCC framework enhances the fault tolerance and numerical stability of distributed CNNs by combining coded distributed computing (CDC) with novel tensor partitioning and encoding schemes.
Cephalo is a system designed to maximize the efficiency of transformer model training on heterogeneous GPU clusters, achieving significantly higher throughput than existing methods by decoupling compute distribution from training state assignment and optimizing resource utilization.
This paper introduces NDCG and DMBFGS, two novel decentralized optimization algorithms designed to efficiently solve nonconvex and strongly convex problems, respectively, by leveraging the strengths of conjugate gradient and memoryless BFGS methods in a decentralized setting.
EACO-RAG is a novel distributed RAG system that leverages edge computing, adaptive knowledge updates, and inter-node collaboration to enhance scalability, reduce delay and resource consumption, and improve the accuracy of responses in large-scale environments.