The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, making them a major barrier to scaling AI’s impact.
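
To put those figures in operational terms, the short Python sketch below converts the two quoted mean-time-to-failure values into expected failures per day. It assumes failures arrive at a steady average rate, which is an assumption of this illustration and not a claim taken from the Meta FAIR research; the function name is hypothetical.

```python
def failures_per_day(mttf_hours: float) -> float:
    """Expected failures in 24 hours, assuming a steady average failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


# MTTF figures quoted in the text (hours), keyed by cluster size in GPUs.
quoted_mttf = {1024: 7.9, 16384: 1.8}

for gpus, mttf in quoted_mttf.items():
    print(f"{gpus:>6} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

Under that assumption, a 1,024-GPU cluster sees roughly three failures a day and a 16,384-GPU cluster roughly thirteen.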

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass addresses this problem by handling costly AI workload failures proactively, resolving them before the job stops or needs to restart. Vital for enterprises running large AI workloads and AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and overall AI cloud economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
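
As a rough consistency check on these figures, the Python sketch below recomputes the reduction from the quoted daily lost time and compares it with the failure rate and recovery time cited earlier in this release. Treating "under ten minutes" as a ten-minute upper bound, and estimating daily lost time as failures per day multiplied by the roughly three-minute recovery time, are assumptions of this illustration and not Clockwork.io's stated method.

```python
# Figures quoted in the text for a 1,024-GPU cluster.
BASELINE_LOST_MIN_PER_DAY = 3 * 60   # "approximately three hours per day"
TORCHPASS_LOST_MIN_PER_DAY = 10      # "under ten minutes", taken as an upper bound
MTTF_HOURS = 7.9                     # quoted mean time to failure at 1,024 GPUs
RECOVERY_MIN = 3                     # "approximately three minutes" per recovery


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Assumption: each failure costs one recovery window of lost time.
failures_per_day = 24 / MTTF_HOURS
est_lost = failures_per_day * RECOVERY_MIN

bound = reduction_pct(BASELINE_LOST_MIN_PER_DAY, TORCHPASS_LOST_MIN_PER_DAY)
estimate = reduction_pct(BASELINE_LOST_MIN_PER_DAY, est_lost)

print(f"Reduction at the 10-minute bound: {bound:.1f}%")
print(f"Estimated lost time at 3 min per failure: {est_lost:.1f} min/day")
print(f"Reduction at that estimate: {estimate:.1f}%")
```

Under these assumptions, the ten-minute bound gives a reduction of about 94.4%, and the three-minute recovery estimate gives about 9.1 minutes lost per day, or roughly 94.9%, which lines up with the quoted 95%.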

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

