End-to-End Latency Decomposition in AI Web Applications: Rethinking Infrastructure in LLM-Based Systems

Jasna Hamzabegovic; Amel Džanić; Kenan Duraković

doi:10.7251/IJEEC2601037H

Authors

Jasna Hamzabegovic, University of Bihać
Amel Džanić, University of Bihać
Kenan Duraković, University of Bihać

DOI:

https://doi.org/10.7251/IJEEC2601037H

Keywords:

AI web applications; latency decomposition; large language models (LLM); serverless computing; virtual private server (VPS); end-to-end latency; performance evaluation; cloud computing; AI systems; benchmarking

Abstract

The increasing integration of artificial intelligence into web applications, particularly through large language models (LLMs), has fundamentally reshaped the performance characteristics of modern systems. Unlike traditional architectures, where latency is primarily determined by backend infrastructure, AI-driven applications operate as multi-stage pipelines involving orchestration logic, network communication, and external model inference.

This paper introduces an end-to-end latency decomposition framework for analyzing performance in AI-powered web applications. A controlled experimental study is conducted using two production-equivalent implementations deployed in serverless and virtual private server (VPS) environments. The methodology distinguishes between full-stack execution, including LLM inference, and infrastructure-only scenarios, enabling precise isolation of latency contributions across infrastructure, application, and model layers.
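The decomposition idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `LatencyBreakdown` and `decompose` are hypothetical, and the sketch assumes that the infrastructure-only measurement can be subtracted directly from the full-stack measurement and that time spent waiting on the model call is timed separately. The numbers in the usage note are placeholders, not results from the study.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LatencyBreakdown:
    """Per-layer latency for one request, in milliseconds (hypothetical type)."""
    infrastructure_ms: float
    application_ms: float
    model_ms: float

    @property
    def total_ms(self) -> float:
        return self.infrastructure_ms + self.application_ms + self.model_ms

    def shares(self) -> Dict[str, float]:
        """Return each layer's fraction of the total response time."""
        total = self.total_ms
        if total <= 0:
            raise ValueError("total latency must be positive")
        return {
            "infrastructure": self.infrastructure_ms / total,
            "application": self.application_ms / total,
            "model": self.model_ms / total,
        }


def decompose(full_stack_ms: float, infra_only_ms: float, model_ms: float) -> LatencyBreakdown:
    """Split a full-stack measurement into three layers.

    Assumptions (not taken from the paper):
      - infra_only_ms is measured with the LLM call stubbed out,
      - model_ms is the time spent waiting on the external model call,
      - the remainder is attributed to application/orchestration logic.
    """
    application_ms = full_stack_ms - infra_only_ms - model_ms
    if application_ms < 0:
        raise ValueError("component latencies exceed the full-stack measurement")
    return LatencyBreakdown(infra_only_ms, application_ms, model_ms)
```

For example, with placeholder values `decompose(1000.0, 100.0, 800.0)`, the remainder attributed to the application layer is 100 ms and `shares()["model"]` is 0.8. A real study would aggregate many such measurements per deployment environment rather than rely on a single request.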

The results indicate that in full-stack scenarios, model-related latency dominates system performance, accounting for approximately 85% of total response time, thereby minimizing the impact of infrastructure differences. In contrast, infrastructure-only scenarios reveal significant performance variations between deployment environments.

These findings challenge infrastructure-centric optimization approaches and demonstrate the need for system-level performance evaluation in LLM-based applications. The proposed framework provides a practical methodology for identifying performance bottlenecks and offers actionable insights for optimizing AI-driven web systems.
