Active network troubleshooting will always be a necessity for broadband and enterprise network operators. Issues can arise that require a network operations center (NOC) team to investigate closely. A recording of network traffic at the point of the problem, in the form of network packet captures (“pcaps”), is one of the best tools at an operator’s disposal when troubleshooting these customer issues. What are some of the best ways that operator network teams can gather and record network packets? Since captures can come from many different places and apply to many different teams, what is the best way to work with them and get the most use out of them?
Whether done through automated tools that catch events, or through direct interaction, the ability to investigate and resolve issues remotely - whenever possible - saves operators huge amounts of resources and ensures a quality end-user experience. Naturally, this ability is only as good as the data available to the operations team. When it comes to remote troubleshooting and management, large-scale telemetry combined with machine learning alert and resolution systems can cut down on the bulk of trouble tickets. When something goes wrong and does require the eyes of an analyst, there are many tools available to help actively manage the network from the core all the way to Wi-Fi and the end-user.
For operators, network packet captures present an opportunity to gather useful intelligence that can’t be gained through typical network management systems or protocols. This data is useful because it is a record of everything that happened on the network in a given period of time, and contains all of the details that would either be missed by “big data” telemetry processing or left out of a simple alert system.
Some examples include:
Obviously, operators have a much bigger - and more heterogeneous - network footprint to handle. Furthermore, each part of the network is often handled by a different department entirely, whether it’s the network core, interfaces with wholesale networks, enterprise broadband customers with SD-WAN or cloud services offered by the operator, or residential broadband subscribers themselves, with both retail and operator-provided Customer Premises Equipment (CPE).
This means that the method for gathering packet captures will be different depending on the use case. Here are a few examples.
For parts of the network where data is crossing on high-speed links (10G+), maintaining constant packet capture can be resource intensive and not always practical. In these cases, adding a network tap capable of sourcing packets at these speeds is the way to go. There are several products dedicated to this specific use case, given just how hardware intensive it can be. Fortunately, companies like LiveAction provide appliances that can handle recording at these speeds, summarize important data, and send the resulting pcaps to a central location or analysis system.
To troubleshoot services that are cloud dependent and deployed in web-service environments, packet captures need to be gathered on virtual interfaces in the cloud-network infrastructure. This is tricky, as access to packet-level network data isn’t always available in a containerized or virtual server.
One of the ways that this is done in the field is through traffic mirroring, which turns a virtual interface into a tunnel specifically to sink packet data from another interface. In Amazon Web Services deployments, this is done through a feature called VPC traffic mirroring. With the ability to record packet captures in a cloud environment, operators can then take the resulting files and send them to a central location where they can be accessed by analysts.
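As a rough illustration of the AWS case, the sketch below wires up a mirror session using the EC2 traffic mirroring calls exposed by boto3. The helper name `mirror_interface`, the placeholder interface IDs, and the accept-everything filter rule are assumptions for illustration only. A real deployment would likely narrow the filter, and it still needs a collector listening on the target interface, since mirrored packets arrive VXLAN-encapsulated on UDP port 4789.

```python
def mirror_interface(ec2, source_eni: str, target_eni: str,
                     session_number: int = 1) -> str:
    """Mirror inbound traffic from source_eni to target_eni.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the ID of the new traffic mirror session.
    """
    # The target is the interface that receives (sinks) the mirrored packets.
    target = ec2.create_traffic_mirror_target(
        NetworkInterfaceId=target_eni,
        Description="pcap collector (hypothetical)",
    )
    target_id = target["TrafficMirrorTarget"]["TrafficMirrorTargetId"]

    # Without rules a filter mirrors nothing, so add one rule that
    # accepts all inbound IPv4 traffic.
    mirror_filter = ec2.create_traffic_mirror_filter(
        Description="mirror all inbound",
    )
    filter_id = mirror_filter["TrafficMirrorFilter"]["TrafficMirrorFilterId"]
    ec2.create_traffic_mirror_filter_rule(
        TrafficMirrorFilterId=filter_id,
        TrafficDirection="ingress",
        RuleNumber=100,
        RuleAction="accept",
        SourceCidrBlock="0.0.0.0/0",
        DestinationCidrBlock="0.0.0.0/0",
    )

    # The session ties the source interface, the target, and the filter together.
    session = ec2.create_traffic_mirror_session(
        NetworkInterfaceId=source_eni,
        TrafficMirrorTargetId=target_id,
        TrafficMirrorFilterId=filter_id,
        SessionNumber=session_number,
    )
    return session["TrafficMirrorSession"]["TrafficMirrorSessionId"]


# Example (requires boto3 and AWS credentials; the IDs are placeholders):
#   import boto3
#   mirror_interface(boto3.client("ec2"), "eni-SOURCE", "eni-TARGET")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a live account.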
A lot can be learned about end-user performance and experience issues by monitoring the network gateway or Wi-Fi router itself. For enterprise use cases, high-end managed Wi-Fi systems from companies like Meraki, Cradlepoint, and Mist have native packet capture built into their products that can be initiated and collected through their management systems.
For residential gateways, many do include native packet capture in their products, but the open-source router operating system known as OpenWRT in particular has a great implementation for gathering pcaps. In addition to the ability to activate them via the UI, OpenWRT can automatically send the pcaps to a system such as CloudShark for storage, organization, and analysis.
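For captures collected by hand or by a script, rather than pushed by the router itself, an upload can be sketched in a few lines of Python. The host name, the API token, and the `/api/v1/<token>/upload` path with a `filename` query argument are assumptions here, based on our understanding of CloudShark's upload API. Check them against the API reference for your CloudShark version before relying on them.

```python
import urllib.request
from urllib.parse import quote, urlencode


def build_upload_request(base_url: str, api_token: str,
                         pcap_bytes: bytes, filename: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT carrying a capture file.

    The URL layout is an assumption about the upload endpoint, not a
    confirmed contract.
    """
    query = urlencode({"filename": filename})
    url = f"{base_url.rstrip('/')}/api/v1/{quote(api_token, safe='')}/upload?{query}"
    return urllib.request.Request(
        url,
        data=pcap_bytes,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


# Example (hypothetical host and token; uncomment to actually send):
#   with open("gateway.pcap", "rb") as f:
#       req = build_upload_request("https://cloudshark.example.com",
#                                  "API_TOKEN", f.read(), "gateway.pcap")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode())
```

Separating request construction from sending makes it easy to log or inspect the exact URL before any capture data leaves the machine.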
Many operators use standardized network management protocols like the Broadband Forum’s TR-069 protocol, and its successor, the User Services Platform (TR-369), to manage their CPE base. These protocols work much like enterprise network management protocols such as SNMP or NETCONF, but are built specifically for broadband/home CPE use cases.
These protocols use a “data model” to describe the capabilities of a managed device, and to describe the components, commands, and KPIs used to manage configurations, monitor and optimize network links and applications, and manage firmware upgrades. The primary data model today is known as “Device:2”.
Version 2.13 and later of Device:2 include the PacketCaptureDiagnostics object/command, which can be used to automatically initiate a packet capture on a capable device. Moreover, it can be used to automatically retrieve those captures as part of the process, rather than storing them locally on the CPE.
Here’s how it works. The Device.PacketCaptureDiagnostics. command object (Device.PacketCaptureDiagnostics() in USP) has an argument called “FileTarget”. This can be any URL capable of receiving the resulting packet capture file, and it can include the HTTP query arguments used in web application APIs. This makes it great for integration with CloudShark, which has a versatile upload API to receive a capture into an operator’s CloudShark Enterprise system. CloudShark returns a unique URL for the stored packet capture, which is recorded in the .Results. table (or returned in an OperationComplete notification in USP).
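To make the FileTarget idea concrete, here is a minimal sketch of how a management system might assemble the values it sends to a device. Only FileTarget comes from the description above. The other parameter names (Interface, Duration, DiagnosticsState), the `additional_tags` query argument, the upload URL layout, and the host and token are assumptions to verify against the data model version your devices implement and against your CloudShark API reference.

```python
from urllib.parse import quote, urlencode


def build_file_target(base_url: str, api_token: str,
                      filename: str, tags: list[str]) -> str:
    """Build an upload URL, with query arguments, to use as FileTarget."""
    query = urlencode({"filename": filename, "additional_tags": ",".join(tags)})
    return f"{base_url.rstrip('/')}/api/v1/{quote(api_token, safe='')}/upload?{query}"


def build_capture_request(interface: str, duration_s: int,
                          file_target: str) -> dict[str, str]:
    """Return parameter path -> value pairs for a TR-069 style request.

    In USP the same values would be passed as input arguments to the
    Device.PacketCaptureDiagnostics() command instead.
    """
    prefix = "Device.PacketCaptureDiagnostics."
    return {
        prefix + "Interface": interface,          # assumed parameter name
        prefix + "Duration": str(duration_s),     # assumed parameter name, seconds
        prefix + "FileTarget": file_target,
        prefix + "DiagnosticsState": "Requested", # assumed trigger convention
    }


if __name__ == "__main__":
    target = build_file_target("https://cloudshark.example.com", "API_TOKEN",
                               "cpe-001122.pcap", ["cpe", "wifi"])
    for path, value in build_capture_request("Device.IP.Interface.1.", 30, target).items():
        print(path, "=", value)
```

Because the tags and file name travel inside the URL, the device needs no CloudShark-specific logic; it only has to send the finished file to whatever FileTarget it was given.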
What should you do with all of these captures once you have them, and what is the best way to work with them? As we mentioned, operators are in an interesting position: network responsibilities often span many different teams, and the points where packet captures are useful can be distributed all over the world.
CloudShark Enterprise was built specifically with these use cases in mind. Here are some best practices enabled for operators who use CloudShark with these packet capture methods:
Ultimately, packet captures are a fundamental part of network troubleshooting and a great asset to network operators of all tiers. What are some of the ways you use captures now? Let us know!