TopoScope

Introduction

Nowadays, the Internet is composed of more than 60,000 autonomous systems (ASes). Based on their business contracts, these ASes propagate prefixes and exchange routing information with each other to control how traffic will be routed. In a simplified manner, the business contracts are often represented by the AS-to-AS relationships, including peer-to-peer (p2p), customer-to-provider (c2p) and provider-to-customer (p2c), in the AS level topology. As both the scale and the complexity of the Internet grow at an unprecedented rate, a good knowledge of the relationships between different ASes is of great importance for many aspects of Internet research.

Despite the significant progress achieved by latest inference algorithms, their inference results still suffer from errors on some critical links due to limited data, thus hindering many applications that rely on the inferred relationships. We take an in-depth analysis on the challenges inherent in the data, especially the limited coverage and biased concentration of the vantage points (VPs). Some aspects of them have been largely overlooked but will become more exacerbated when the Internet further grows.

Then we develop TopoScope, a framework for accurately recovering AS relationships from such fragmentary observations. TopoScope uses ensemble learning and Bayesian Network to mitigate the observation bias originating not only from a single VP, but also from the uneven distribution of available VPs. It also discovers the intrinsic similarities between groups of adjacent links, and infers the relationships on hidden links that are not directly observable. Compared to state-of-the-art inference algorithms, TopoScope reduces the inference error by up to 2.7~4 times, discovers the relationships for around 30,000 upper layer hidden AS links, and is still more accurate and stable under more incomplete or biased observations.

Challenges and Basic Ideas

links observed simultaneously by several VPs

links observed by VPs at different tiers

Noises in collected routes. Unexpected events, such as route leaks or link failures, may cause noisy routes that do not represent intended AS relationships to be reported. Since the perfection of such heuristics is not our focus in this paper, we will just use some standard filtering methods in our evaluations. Nonetheless, our inference framework, due to its better robustness, can reduce the negative impact caused by the unfiltered noises.
Limited and biased observations from each VP. Existing inference algorithms typically use routing data published by RouteViews and RIPE RIS, collected from around 1600 VPs. These VPs are divided into full VPs and partial VPs, according to their observation ability. However, even a full VP roughly observes only around 1.12% p2p links and 71.87% p2c links. Existing inference algorithms have taken this fully into account. Please note that due to the limited observation abilitie of partial VPs, in the following discussion we only discuss full VPs and their observation results.
Aggregated observation bias from limited VPs. The observation bias of each VP not only exists, but also differs a lot from each other. This will lead to an aggregated observation bias closely related to the set of VPs an inference algorithm takes as input. The lack of observations on AS links will trick inference algorithms that depend much on link's basic proporties. To overcome the sensitivity of empirical rules, TopoScope uses a simple but powerful ensemble learning technique, i.e., voting, to find those empirically derived relationships that are not sensitive to the removal of observations. It should be noted that, there are definitely many hidden links, especially p2p links, that cannot be seen by any of these VPs. We will solve the problem later.
Aggregated observation bias from unbalanced VP locations. Bias originates not only from the small cardinality of the VP set, but also from how the VPs distribute themselves in the network. We roughly divide the Internet hierarchy into four tiers, i.e., clique, high tier, low tier and stub tier. Existing VPs are more concentrated in the upper tiers. We find for p2c links, the observation capabilities of VPs in different tiers are very similar. However, for p2p links, a p2p link can be observed always better by VPs located in the same hierarchy as itself, than by VPs in other tiers. Thus, the concentration of VPs in higher tiers will cause imbalanced observations. In an inference algorithm that relies on a probability model of link features, the weight (or probability) of each feature is often related with its occurrence frequency, which is in turn affected by the number of links related to that feature. To solve this problem, TopoScope uses a Bayesian Network model that takes into account the inter-dependencies among link features, especially the observation bias in different tiers. In this way, the uneven distribution of observed features originating from the biased VP distributions can be effectively mitigated.
Hidden links that cannot be directly observed by VPs. As mentioned above, VPs will miss a non-negligible fraction of AS links, especially p2p links. We tackle this problem by studying the intrinsic similarities between different groups of links. We find for a singleton link l_AB which is observed by only one VP and easy to be missed, very probably a pair of consecutive links l_AC and l_CB can also be observed. So if we can successfully learn the characteristics of detour couples of singleton links, we can find more similar couples, hence hidden links. TopoScope learns a classifier for detour couples, thus dig out hidden links. To our knowledge, this is the first attempt to infer AS relationships on hidden links that cannot be directly observed.

Algorithm Design of TopoScope

error rate under different observation groups

Bayesian Network structure

TopoScope works in four steps as follows.

Divide the whole data set into several groups, apply some existing empirical inference algorithm (i.e., AS-Rank) to each group, and use voting to find a set of consensus links, for which a majority of groups can reach an agreement on their inferred relationships.
Among consensus links, we find that the p2c links seen by the majority of VPs and the p2p links seen by the minority of VPs have higher inference accuracy. We determine these two categoris of links as trusted links.
Train and optimize a Bayesian Network with Expectation Maximization for inference on fuzzy links, which include all links that have not been identified as trusted links in step 2.
Train a decision tree for detour couples, use it to find potential detour couples, and decide relationships on the corresponding hidden links.

Publications

TopoScope: Recover AS Relationships From Fragmentary Observations

Zitong Jin, Xingang Shi, Yan Yang, Xia Yin, Zhiliang Wang, Jianping Wu from Tsinghua University

TopoScope is an AS relationship inference algorithm that combines ensemble learning and Bayesian networks. In addition, TopoScope also supports hidden link inference. You can learn more about TopoScope in IMC 2020.

[PDF]

Code

We release the following code on Github:

TopoScope

You can also download the zip file here.

Presentation

TBA.

Inference Result

You can download the basic inference and hidden link inference of TopoScope here.

Authors

Zitong Jin, Zhongguancun Laborarty, jinzt@mail.zgclab.edu.cn
Xingang Shi, Tsinghua University, shixg@cernet.edu.cn
Yan Yang, Tsinghua University, yangyan15@mails.tsinghua.edu.cn
Xia Yin, Tsinghua University, yxia@tsinghua.edu.cn
Zhiliang Wang, Tsinghua University, wzl@cernet.edu.cn
Jianping Wu, Tsinghua University, jianping@cernet.edu.cn

Result Download →