Applying AI/ML Methods to Diagnose Network Issues Using Telemetry Data

At ONE Summit 2024, LFN Board member and Cisco Distinguished Engineer Frank Brockners delivered a presentation on applying artificial intelligence (AI) and machine learning (ML) to network troubleshooting. This summary captures the key points of his talk, with an emphasis on the practical use of AI/ML to diagnose network issues from telemetry data.

The Challenge of Data Overload

Modern network devices, such as the Cisco 8000 series routers, and runtime systems like Kubernetes (K8s) produce immense amounts of operational data in the form of logs and streaming telemetry. These systems can emit values from more than a million counters at frequent intervals, which can overwhelm traditional processing methods. Operators need sophisticated change detection methods to manage this data flood and to keep control of increasingly complex IT systems.

Harnessing AI for Clearer Insights

In his presentation, Frank explored how to draw insights from logs and streaming telemetry in natural language, either to help an IT engineer digest the information the systems provide, or even to enable systems to reason about the data like a human. He discussed how Large Language Models (LLMs) like GPT-4 can be harnessed to transform complex telemetry data into understandable insights. In many cases the amount of available data is much larger than the context window of an LLM. Techniques such as embeddings, or statistical methods like Term Frequency-Inverse Document Frequency (TF-IDF), are required to “filter” the vast datasets so that the crucial details of the input are kept: the “signals” stay while the “noise” is removed.
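To make the filtering idea concrete, here is a minimal sketch of TF-IDF scoring over log lines, written in plain Python. It is not the code from the talk: the function names, the tokenizer, the sample log lines and the `keep` budget are all assumptions made for illustration. Each log line is treated as a "document", and lines whose words are rare across the log score higher than repetitive ones.

```python
import math
import re
from collections import Counter


def tokenize(line):
    # Keep only alphabetic words so that lines differing just in
    # timestamps or interface numbers share the same tokens.
    return re.findall(r"[a-z_]+", line.lower())


def tfidf_scores(lines):
    # Treat every log line as one document (illustrative assumption).
    docs = [tokenize(line) for line in lines]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)
            continue
        term_freq = Counter(doc)
        # Sum of tf * idf over the distinct terms of the line.
        score = sum(
            (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        )
        scores.append(score)
    return scores


def filter_log(lines, keep):
    # Keep the `keep` highest-scoring lines, in their original order.
    scores = tfidf_scores(lines)
    top = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)[:keep]
    return [lines[i] for i in sorted(top)]


# Hypothetical sample data, not taken from the presentation.
sample_log = [
    "10:00:01 interface eth0 link up",
    "10:00:02 interface eth1 link up",
    "10:00:03 interface eth2 link up",
    "10:00:04 bgp neighbor 10.0.0.2 session reset hold timer expired",
    "10:00:05 interface eth3 link up",
]

print(filter_log(sample_log, keep=1))
```

On this sample the BGP session reset line scores highest, because its words appear in no other line, while the repetitive "link up" lines score low. In practice the `keep` budget would be chosen so that the surviving lines fit the context window of the LLM.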

Practical Examples: Diagnosing Logs and Streaming Telemetry with the Help of AI

Frank provided two practical examples. The first was about diagnosing a log file that was initially too cumbersome: the entire file would have amounted to more than 200,000 tokens, too large to fit the context window of, for example, GPT-4. Applying ML methods to reduce the dataset made the data manageable and ready for diagnosis by an LLM. The second was about detecting change points in a streaming telemetry dataset containing time series from 7333 distinct counters. Frank revealed the structure of the data using embedding methods, demonstrated how clustering mechanisms can be employed to identify change points, and finally explained how feature selection methods can again trim the input down to a size that an LLM can be asked to diagnose. In both examples the LLM's final diagnosis described the issue quite accurately, something the audience truly appreciated.
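The second example can be illustrated with a deliberately simplified sketch. It does not reproduce the embedding and clustering methods from the talk. Instead it swaps in a basic least-squares search for a single change point, followed by a mean-shift ranking as a stand-in for feature selection. The function names, counter names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def sse(values):
    # Sum of squared deviations from the segment mean.
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def find_change_point(series):
    """Return the index t that best splits all counters into [0:t] and [t:].

    `series` maps a counter name to a list of samples of equal length.
    """
    length = len(next(iter(series.values())))
    normalized = []
    for values in series.values():
        sd = pstdev(values)
        if sd == 0:
            continue  # constant counters carry no change information
        m = mean(values)
        # z-score each counter so large-valued counters do not dominate
        normalized.append([(v - m) / sd for v in values])
    if not normalized:
        return None
    best_t, best_cost = None, float("inf")
    # Require at least two samples on each side of the split.
    for t in range(2, length - 1):
        cost = sum(sse(v[:t]) + sse(v[t:]) for v in normalized)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def select_counters(series, t, k):
    # Rank counters by how far their mean shifts across the change point,
    # measured in units of their own standard deviation.
    def shift(values):
        sd = pstdev(values)
        if sd == 0:
            return 0.0
        return abs(mean(values[t:]) - mean(values[:t])) / sd

    ranked = sorted(series, key=lambda name: shift(series[name]), reverse=True)
    return ranked[:k]


# Hypothetical telemetry, not taken from the presentation.
telemetry = {
    "if_in_octets": [100, 102, 101, 99, 100, 101, 100, 102],
    "if_out_drops": [0, 0, 1, 0, 9, 10, 11, 10],
    "cpu_util": [30, 31, 30, 29, 55, 57, 56, 58],
    "fan_speed": [40, 40, 40, 40, 40, 40, 40, 40],
}

t = find_change_point(telemetry)
print(t, select_counters(telemetry, t, k=2))
```

On this toy data the split lands at index 4, and the two counters that shift there, `cpu_util` and `if_out_drops`, are the ones selected. Only those counters, together with the change-point time, would then need to be passed to an LLM for diagnosis. A dataset with thousands of counters and several change points would call for the more capable methods described in the talk.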

Navigating the Benefits and Challenges of AI/ML

While AI dramatically speeds up problem identification and can help provide a diagnosis in natural language, it requires significant initial setup and ongoing tuning. The benefits, such as quick problem identification and streamlined diagnostics, must be balanced against the challenges, which include the need for careful implementation and a deep understanding of the technology’s capabilities and limitations. Once the data is properly pre-processed and filtered, LLMs such as GPT-4 can interpret it and process it further, for example by determining dependencies between different entities in a larger system or by performing root-cause analysis.

Conclusion: A Look to the Future

Frank’s insights at the ONE Summit 2024 clearly demonstrate the promising future of AI and ML in network diagnostics, illustrated with practical uses of classic ML-methods and GPT-4. LLMs can enable “open world” reasoning about network data in natural language – complementing our classic root-cause analysis solutions.

For those interested in further exploring this topic, watch the full presentation on the LFN YouTube channel.

Author

lfnetworking
