Plenary
23rd October 2017
9 a.m.:
OSAMA I AL‑DOSARY: Good morning everyone. We are about to start. Hi. So I am Osama and this is Ondrej, we are part of the Programme Committee and we will be chairing this morning's session. And we will start with our first presenter for this morning, Luca.
LUCA SANI: Morning to everyone. I am Luca.
LUCA SANI: I am a researcher at the National Research Council of Italy in Pisa and, as some of you may know, I am one of the people running the Isolario project, which is a route collecting project providing realtime services to its participants. What I am going to present to you now is a piece of software that we needed to develop within this project, called the Interactive Collector Engine (ICE). This is basically route collecting software which enables realtime access to the routing table of a route collector.
So, as you know, a route collector is a device running route collecting software to which autonomous systems can connect by establishing a BGP session, sharing routing information with the route collector itself. In this presentation I will call a router of an autonomous system connected to a route collector a feeder. So basically the route collector maintains a routing table in which it stores the best routes received from its feeders. Usually it dumps the content of the table on a periodic basis, as well as the received BGP messages. This data can then be used to analyse how Internet interdomain routing looks from the perspective of all the feeders connected to the route collector itself.
These are the best known projects that not only deployed route collectors around the world but also make the BGP data they collect publicly available: Route Views of the University of Oregon and the Routing Information Service of the RIPE NCC, which have been collecting data since the late '90s; Packet Clearing House, which among other activities has been collecting data from a lot of IXPs around the world since 2011; and then there is Isolario, which is the project I am working on at my institute.
The data provided by these projects is invaluable to study the interdomain routing system. However, depending on the application you want to implement, the need to wait for the data may not be desirable. For example, let's say I want to check, on the fly, the set of routes that a given feeder is announcing for a given portion of the IP space. What I can do is parse the dumps of the routing table, but those are provided only every two to eight hours, because dumping the content of the table is an intensive process. On the other hand, you could parse the update messages received over time, but since update messages represent only the evolution of interdomain routing, depending on the stability of the routes that you want to monitor you may have to parse a lot of them. Another way would be to query the content of the routing table directly, by asking the route collecting software itself. The major problem with this approach is that most route collecting software is single threaded; most projects use Quagga. This means that while the software is answering a request to provide the content of the table, it cannot parse the messages arriving from the feeders, causing a delay in the collection of the data.
To highlight these problems, we ran a worst case scenario using Quagga. In this scenario there are two feeders, which announce to Quagga a full IPv4 routing table each, in a sequential way. The graphic reports the cumulative amount of MRT records that Quagga writes upon receiving the messages from the feeders. As you can see, from time zero to time 35 there is the first table transfer and the records are written normally. Then, after about 10 seconds, the second transfer starts, but more or less 15 seconds later a bunch of read requests, ten read requests, arrive to retrieve the full routing table of the first feeder, and as you can see, as the read requests arrive, Quagga literally stops recording the MRT messages until the requests are satisfied. Only then is the writing of MRT messages completed. There are two main problems in this case. The first is that all the messages arriving during the requests are not parsed, so they are collected with a delay of about 20 seconds. Also, all the messages arriving after the requests arrive at a rate which is not the expected one, because the packets were queued waiting to be parsed by the software.
So, what we propose here is route collecting software that is multi‑threaded by design. In this architecture both the BGP sessions and the requests coming from the applications are handled by dedicated sets of threads. The only point at which they have to synchronise is the access to the routing table. The routing table is implemented as a classic Patricia trie, where each node represents a prefix and holds the path attributes, one set for each feeder that announced that prefix to the route collecting software. In this context there are two different kinds of threads that can access the routing table: writers and readers. The writers are the threads handling the BGP sessions and the readers are the threads handling the requests coming from applications.
The access is regulated using the classic read/write lock paradigm, where a read lock on a given resource can coexist with other read locks, but one and only one write lock is permitted on a given resource. The resources that can be locked are both the Patricia trie itself and the single nodes.
So, for example, when a writer reads a packet from a feeder, it checks whether the announced prefix is present in the routing table by locking the table in read mode. If it is not present, it then locks the table in write mode and inserts the node; otherwise it simply retrieves the node. At this point the writer has a node on which it can operate, either because it just inserted it or because it just retrieved it, and it then inserts the path attributes into the node by locking the node itself in write mode. On the other hand, when a reader wants to read a prefix, it simply locks the routing table in read mode, checks whether the node is present and retrieves it, then locks the node and reads the attributes. In this way multiple readers and writers can access the routing table in parallel, as long as they access different nodes. The only moment in which the table is not accessible by anyone else is when a writer is locking the RIB, the routing table, in write mode. But this happens only when the writer has to insert a new node into the routing table, and this is a very unlikely event at steady state, because different feeders usually announce the same set of destination prefixes, with, of course, different path attributes.
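To make that locking discipline concrete, here is a minimal Python sketch of the same access pattern. The real ICE is written in C++; the RWLock class, the dict standing in for the Patricia trie and the function names here are illustrative assumptions, not the actual implementation.

```python
import threading

class RWLock:
    """Minimal reader/writer lock: many readers OR one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

class Node:
    """One prefix; path attributes keyed by feeder (assumed layout)."""
    def __init__(self):
        self.lock = RWLock()
        self.attrs_by_feeder = {}

table_lock = RWLock()   # lock on the table itself
table = {}              # prefix -> Node (a dict stands in for the Patricia trie)

def writer_update(prefix, feeder, path_attrs):
    # 1. look up the prefix under a table read lock (the common case)
    table_lock.acquire_read()
    node = table.get(prefix)
    table_lock.release_read()
    if node is None:
        # 2. rare case: insert a new node under a table write lock
        #    (setdefault guards against another writer inserting it first)
        table_lock.acquire_write()
        node = table.setdefault(prefix, Node())
        table_lock.release_write()
    # 3. update this node only, under the node's own write lock
    node.lock.acquire_write()
    node.attrs_by_feeder[feeder] = path_attrs
    node.lock.release_write()

def reader_lookup(prefix, feeder):
    table_lock.acquire_read()
    node = table.get(prefix)
    table_lock.release_read()
    if node is None:
        return None
    node.lock.acquire_read()
    attrs = node.attrs_by_feeder.get(feeder)
    node.lock.release_read()
    return attrs

if __name__ == "__main__":
    writer_update("192.0.2.0/24", "feeder-A", {"as_path": [64500, 64501]})
    print(reader_lookup("192.0.2.0/24", "feeder-A"))
```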
So, in order to show the effectiveness of this implementation, we ran a test similar to the previous one, where two feeders announce a full routing table each, sequentially, to the collecting software. Again, the graphic shows the cumulative amount of MRT records recorded by ICE. As you can see, more or less at time 40 the second RIB transfer starts and, at the same time, a bunch of readers tries to read the full routing table of the first feeder. There are different lines, each one representing a different number of readers trying to access the content of the routing table in parallel. The continuous red line is the case of zero readers and, as you can see, in all the other cases the lines are very, very close, meaning that the presence of the readers does not affect the performance of the writer.
So basically, writers and readers can go in parallel without affecting each other.
Given that we have multiple threads, we ran the test with different numbers of cores; here I report one and eight cores. As you can see, even in the case of one core, ICE is still able to allow readers and writers to go in parallel, thanks to the scheduler activity. The main difference between the two tests is that the one with eight cores finishes about ten seconds earlier. So this test was from the perspective of the writer, but what about the readers? Are they affected by the presence of the writer? In order to evaluate this aspect we measured the amount of time that the set of readers takes to read the full routing table of the first feeder, before and during the transfer of the second one. So, for example, if you take 3.66, the value read on the left, this means that a bunch of eight readers in the test with one core takes about 3.66 seconds to retrieve the full routing table of the first feeder, before the table transfer of the second one has started. If you take the same measurement during the table transfer of the second feeder, you get a value which is very, very close, meaning that the performance of the readers is also not affected by the presence of the writer.
So those tests were mainly used to evaluate the multi‑threaded aspect. However, what about memory? How much memory does ICE use to store the information? This graphic reports the memory usage while varying the number of full IPv4 feeders connected to ICE itself, and as you can see ICE consumes about 80 megabytes per feeder, meaning that on a standard machine you can have about 100 feeders. So in Isolario, where we want to use the collector to perform route collecting on a larger scale, you would need at least ten machines. So the question is whether it is possible to reduce the memory consumption of the software. Of course the memory usage is mainly due to the presence of the path attributes inside the routing table, so the question is whether it is possible to compress the attributes themselves. In order to answer this question, we analysed the redundancy of the path attributes in different periods of time, exploiting BGP data collected by route collectors. For each path attribute we computed a uniqueness index, which is the ratio between the number of different values assumed by that path attribute and the total number of appearances of that path attribute. So basically, the lower the index, the more redundant the path attribute. This graphic reports the uniqueness index of the path attributes in both the IPv4 and IPv6 cases. As you can see, at least in IPv4, for the most common path attributes there is a lot of redundancy. For example, if you take the AS path, in each period of time its uniqueness index is less than 0.2, meaning that out of 100 AS paths fewer than 20 are unique. This is not so much the case in the IPv6 scenario, maybe because a lower amount of routes share the same set of path attributes. However, this is not a problem, because the vast majority of the memory is consumed by the presence of the full IPv4 routing tables.
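As an illustration of that metric, here is a small Python sketch of the uniqueness index over a list of observed AS paths; the sample paths are made up.

```python
def uniqueness_index(values):
    """Ratio of distinct values to total appearances (lower = more redundant)."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0

# Hypothetical sample: AS paths observed in a table snapshot.
as_paths = [
    ("3333", "1103", "15169"),
    ("3333", "1103", "15169"),
    ("174", "2914", "15169"),
    ("3333", "1103", "15169"),
    ("174", "2914", "13335"),
]
print(uniqueness_index(as_paths))   # 0.6: 3 distinct paths out of 5 appearances
```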
So, path attributes can effectively be compressed, and what we need is a compression algorithm which must be lossless and adaptive, but which must also allow random access to the data, because applications usually want to access different portions of the table, and if I want to read the path attributes of a given prefix I don't want to have to decompress other path attributes. We had different choices and we decided to use a dictionary encoding compression algorithm, where the most recurring patterns are replaced with sequences of fixed‑size indexes.
So basically, we used the classic LZW compression algorithm, storing one dictionary per feeder. When ICE receives a packet from a feeder it extracts the path attributes, compresses them according to the dictionary related to that feeder, and then inserts the sequence of fixed‑size indexes in place of the path attributes. There is a small difference with respect to the classic compression/decompression architecture: here the compressor and decompressor share the dictionary, since both sit at the same end, and this allowed us to make a slight modification to the classic algorithm when emitting the compressed sequences.
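For reference, a compact Python sketch of the classic LZW scheme over a byte string; keeping one dictionary per feeder simply means keeping one such encoder state per BGP session. The attribute blob below is invented, and the shared-dictionary modification mentioned in the talk is not modelled here.

```python
def lzw_encode(data: bytes):
    """Classic LZW: replace recurring byte patterns with fixed-size integer codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    codes = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc
        else:
            codes.append(dictionary[w])
            dictionary[wc] = next_code
            next_code += 1
            w = bytes([byte])
    if w:
        codes.append(dictionary[w])
    return codes

def lzw_decode(codes):
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:      # classic LZW corner case
            entry = w + w[:1]
        else:
            raise ValueError("bad LZW code")
        out.append(entry)
        dictionary[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return b"".join(out)

# A made-up, repetitive path-attribute blob compresses well.
attrs = b"\x40\x01\x01\x00\x40\x02\x0e\x02\x03" * 10
codes = lzw_encode(attrs)
assert lzw_decode(codes) == attrs
print(len(attrs), "bytes ->", len(codes), "codes")
```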
This graphic on the left reports the results, and as you can see the memory consumption is reduced by about 30%, but we have to pay for this memory saving in performance: as you can see in the graphic on the right, the time ICE takes to process a route increases from about 12 microseconds to 13. This measurement is derived from the amount of time it takes for a feeder to transfer the full routing table to ICE. So finally, this is a use case of ICE, which is basically why we designed ICE. Isolario, as I said in the beginning, is a route collecting project which offers realtime services to its participants, and usually applications ask to retrieve a portion of the table to be shown to the users in a common web browser. So the steady state situation is retrieved from ICE, and the subsequent evolution is displayed thanks to the BGP feed that is inserted into Isolario.
If you want, you can try the service by logging into Isolario itself using guest credentials. So basically what we propose is a multi‑threaded collector engine which takes care of memory consumption. This is not a complete BGP daemon like Quagga or BIRD; there are many open issues. It is just designed to support realtime access to the content of the routing table and the simultaneous collection of BGP data. In the future we will try to add some other features, for example support for the ADD‑PATH standard, maybe also the ability to handle BMP input, and maybe we can transform this kind of software into a kind of route server and use it as a basis to build, for example, realtime looking glasses. Also, we are a research institute, so we are open to any kind of suggestion and collaboration. The software is open source, it is written in C++ and you can download it from the Isolario website and make the modifications that you want. Thanks.
(Applause)
ONDREJ SURY: Questions?
AUDIENCE SPEAKER: Hi. Colin Petrie, RIPE NCC. I work on the team that operates the RIS collector project and we should talk more about this, because we are basically doing the same thing. We have been developing a realtime streaming version of the RIS collectors, and if anyone wants to work with us on that, we are quite happy to. One thing I had a question about: can you go back to slide 13, I think it was? What you were doing with compressing the path attributes, you said that you were looking at the different attributes and seeing how repetitive they were. When you were doing that, were you using the data from the update messages or the RIB dumps?
LUCA SANI: Table snapshots.
Colin: In the implementations where the snapshots are based on Quagga, just so you know, it doesn't actually dump all of the attributes that are in the original update messages. Quagga selectively picks only a subset of the attributes to store in the table dumps. So the unknown transitive attributes that are in the update messages do not get stored in the table dumps; that is a Quagga‑specific thing. So that might affect the compressibility and how repetitive it is, if it's not including some of the extra attributes. I thought I would mention that as some feedback.
LUCA SANI: Okay. I didn't know that, I will look into that.
Colin: If you want, we can have a talk about that.
LUCA SANI: Of course. Thanks.
ONDREJ SURY: Any more questions? Then I have a question. You implemented a Patricia trie. Have you looked at crit‑bit trees or the QP trie?
LUCA SANI: That is one of the aspects that we have to evaluate. This is just a first implementation and ‑‑
ONDREJ SURY: I would recommend looking at the QP trie, it has really good performance.
LUCA SANI: Okay. Maybe I talk to you later.
ONDREJ SURY: Thank you.
(Applause)
And next up is Jose Leitao from Facebook.
JOSE LEITAO: Hello everyone. Let's take a picture. Smile. So good morning, first of all, we would like to thank the hosts, the sponsors and RIPE for giving us the opportunity to come share some of the things we are working on. So that is Daniel and myself, we are production network engineers for Facebook and based in Dublin and if you want to talk after the presentation, if you have any questions or if you want to tell us what your company is doing about the particular problem that we are going to talk about, please come and find us.
And essentially, I also tend to put this slide in most presentations because I think it helps illustrate, for people that might not be aware, what some of the challenges of this scale are. These numbers are recent as of June, and it essentially boils down to: we have over 1.3 billion daily active users and over 2 billion monthly active users, which means the network is very large and we run into very interesting scenarios.
So what are we trying to solve with this system, which we subsequently open sourced? We are trying to handle the scenario where you have a cluster type of topology and passive monitoring fails you. Passive monitoring is the first line of defence, and obviously if you have issues and it is very robust, your monitoring will pick them up. But we run into cases where this doesn't happen, because the problem hasn't really been surfaced by the vendor, it hasn't surfaced at the device level because there is no counter for it, and you are the first to run into it, which happens more often than not, and then we are kind of at a loss. So we needed to figure out an active approach to this, and this is what this system does: essentially, helping find that needle in the haystack. We call this NetNorad, and this is what it is called inside Facebook. What it does, to boil it down, is do the same thing a human would do, but at scale and all the time. If you give any network engineer or network operator a situation where there is a problem between two hosts, what the network engineer is going to do is some active probing to try to find the source, the needle in that haystack. And this is what we are doing with NetNorad. We run pingers on some of the machines that we have and we run responders on all of them, and we are talking about servers here. This information gets collected and sent to a database for further analysis, and the goal is to provide a system that helps the network operator narrow down where to do their troubleshooting and accelerate the resolution of the problem.
Regarding the evolution of this system: we like proof of concepts. Essentially: I have this idea, can we do a version one, can you spend a little time trying to validate it with a proof of concept to see if this makes sense? We followed that approach here. The first iteration, which was only measuring latency, was running a call into the ping agent on the host, and it worked well, it showed the idea made sense and produced interesting data: we have weird latencies between A and B, why is that? So we evolved this into essentially doing raw sockets, which provided ECMP coverage. The story with this is that we were obviously generating a bunch of synthetic TCP traffic and corrupting or changing certain metrics that the service owners were using, so they were unhappy about it. So we said, okay, we have more protocols, and we essentially started doing ICMP, and that worked well, but we had some issues with ECMP coverage: we were getting polarisation, which meant we were missing some paths, which is undesirable. So we changed the protocol again, we are slow like that, and we went to UDP, and this is actually what the open source version is going to be using. And then we decided to add a second source of signal, which was ICMP, because that was very good and efficient, and complementing it with UDP made a lot of sense, right?
Essentially this boils down, and Daniel is going to talk about it more when he covers the open source part, to pingers and responders. The pinger runs on a subset of machines, so the pinger can be complex and can have a high load; essentially what it gets is: this is your target list and this is what you have to reach. It does the time stamping and sends the results to the back ends, and it can do a high rate of PPS, because we exercise this in every cluster we have. The responders need to be able to run on every host that we have, or at least on a very large percentage of them, and obviously we don't want to disrupt the actual workload with our packet loss agents, so they have to be very light and flexible and consume very little memory. So we try to make them as simple as possible: just some time stamping and ensuring that when you send a reply to that synthetic traffic request it's actually using the correct marking, so mirroring whatever the request came with, right?
So why UDP? It solves the problem we had before with ICMP, and it is very extensible, so we can make the probe structure whatever we want it to be.
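As a toy illustration of the probe exchange, here is a Python sketch of a responder that echoes the probe payload back and a pinger that timestamps probes and derives loss and RTT. This is not the actual UdpPinger, which is a high-rate C++ library; the port number and packet format are arbitrary assumptions, and the QoS-marking mirroring is omitted.

```python
import socket
import struct
import threading
import time

PORT = 31338                    # arbitrary choice for the sketch
PROBE = struct.Struct("!Id")    # sequence number + sender timestamp

def responder():
    """Echo every probe back to the sender (the 'ponger')."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", PORT))
    while True:
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)          # mirror the payload unchanged

def ping(target, count=20, interval=0.05, timeout=0.2):
    """Send `count` probes, return (loss_ratio, avg_rtt_ms)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    rtts = []
    for seq in range(count):
        sock.sendto(PROBE.pack(seq, time.time()), (target, PORT))
        try:
            data, _ = sock.recvfrom(64)
            _, sent = PROBE.unpack(data)
            rtts.append((time.time() - sent) * 1000.0)
        except socket.timeout:
            pass                         # counted as loss
        time.sleep(interval)
    loss = 1.0 - len(rtts) / count
    avg = sum(rtts) / len(rtts) if rtts else float("nan")
    return loss, avg

if __name__ == "__main__":
    threading.Thread(target=responder, daemon=True).start()
    time.sleep(0.1)
    print(ping("127.0.0.1"))             # e.g. (0.0, 0.07)
```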
So with this in mind, the next step, after you have that core, is how you make sense of all that data. What we decided was, essentially, that for us it makes sense to think about this as an aggregation layer; in the diagram the yellow circles are the network devices and the blue ones are servers. For us it didn't make any sense to have alarms, or to build the output of this, based on individual servers; we didn't care that between two servers we had some sort of issue. We aggregate this to a cluster level, which is a group of racks, and this was, and is, granular enough to detect issues between racks, with dedicated pingers per cluster, and, to jump ahead a bit, we try to hit every single machine in that cluster. This applies to the open source edition as well, and in our production it lags realtime by a couple of minutes, which is good enough.
Like I said, your goal out of this is to try to narrow down that needle in the haystack, and with this sort of structure you can do that very distinctly. Your target cluster, the place whose health you want to see, is that red circle you have there. You have this structure, which each company and organisation will have in whatever form makes sense for them; in our case it's sort of like this: we have regions and data centres, and inside those data centres you might have buildings or clusters. So you have that target cluster; we ping it from inside that same data centre, and that tells us the health of those racks. You can have another data centre also trying to reach that target cluster, call it the metro connectivity: this is data centre Dubai one and the other is Dubai two, what is the connectivity between the two. And you can have something out of the region, like something in Europe, do the same thing. That gives you three data points: do I have loss inside the same data centre, inside the region, or between different regions? And now Daniel is going to come and talk about how you can use the system.
DANIEL RODRIGUEZ: So that was a little bit of history about how Facebook built NetNorad and a high level overview of how it works for us. The idea of this open source project was to provide a proof of concept, a baseline for anyone to build their own NetNorad‑style system to measure the same things we measure at Facebook: latency and packet loss. The main component of this is the UDP pinger library, which provides what you see there, pingers and responders; that is the core of NetNorad, that is what we use at Facebook and that is what we use for the open source version.
So, you have pingers and pongers, and let's say that is what you have available. If you are a human, what do you do when you want to ping something? You have the IP address, you know where the destination host is located, you go to your host and you do a ping. You get the results, you see the latency measurement and the packet loss, and with that information you are able to troubleshoot a specific path. You know where the traffic is going and you know whether that path is having a problem or not.
Well, with this solution we are basically automating that process. We are making an automated system to ping all possible areas of our network. If we want to do that, we need a way to know where the pongers, the receivers of the traffic, are located, and when I say where, I mean their IP address and a set of labels that give us information about the location of that host. By default, in the open source version of this, we use labels like data centre, cluster and rack. So by default you are able to say, okay, this specific host is located, I don't know, in New York, in cluster 2, in rack 3. You will be able to define this information, but you can change the labels and set them up to whatever you want, to whatever fits your network. So with the labels and the IP address, when we, let's say, fire up a ponger, it basically registers with a controller, and the controller is basically the simplest possible Python Flask code with a very simple SQLite database. So when we fire up the pongers, we register their information in the controller. After we are registered, we keep sending a keepalive to the controller, letting it know that we are still alive and that we are able to receive traffic, that we are able to respond to whoever pings us.
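A minimal sketch of that controller idea in Python Flask with SQLite follows; the endpoint names, table schema and label set here are assumptions for illustration, not the actual OpenNetNorad API.

```python
import sqlite3
import time
from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "pongers.db"

def db():
    conn = sqlite3.connect(DB)
    conn.execute("""CREATE TABLE IF NOT EXISTS pongers (
                        ip TEXT PRIMARY KEY,
                        datacenter TEXT, cluster TEXT, rack TEXT,
                        last_seen REAL)""")
    return conn

@app.route("/register", methods=["POST"])
def register():
    """A ponger announces itself (and later re-posts this as a keepalive)."""
    p = request.get_json()
    conn = db()
    conn.execute("INSERT OR REPLACE INTO pongers VALUES (?,?,?,?,?)",
                 (p["ip"], p["datacenter"], p["cluster"], p["rack"], time.time()))
    conn.commit()
    return jsonify(status="ok")

@app.route("/targets")
def targets():
    """A pinger asks for its target list: every ponger seen in the last 5 minutes."""
    conn = db()
    rows = conn.execute(
        "SELECT ip, datacenter, cluster, rack FROM pongers WHERE last_seen > ?",
        (time.time() - 300,)).fetchall()
    return jsonify([dict(zip(("ip", "datacenter", "cluster", "rack"), r)) for r in rows])

if __name__ == "__main__":
    app.run(port=8080)
```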
On the other hand, we have pingers. What the pinger does is, every minute, let's say via a basic background job that wakes up, it will contact the controller and ask: okay, give me the list of things I need to ping. By default the solution basically provides the full list of pongers available, so you will be pinging all the pongers in the network. This can be changed; it's a matter of changing the logic of what you are going to be pinging, so maybe you want to throttle down a big list of hosts and ping a segment, or give some pingers one list and other pingers another list, or do fancy things like: I just want this pinger to ping pongers in Madrid or in Barcelona or Melbourne. It depends on your logic, but it's a very simple piece of code, so you can put your specific logic in there.
So the pinger gets the list and will start pinging, every minute, all the pongers it has received. That information is sent to an InfluxDB database, a time series database which is open source and freely available for anyone to use. So that information is sent to InfluxDB, and our UI, the way we visualise that information, is Chronograf. We have used Chronograf because it works very well with InfluxDB; you can replace it with Grafana. And if you are not interested in the visualisation of the data, you can actually interact directly with InfluxDB and build your own alerting on it: if I have more than 5% packet loss in a specific rack I will shut down the rack; if I have more than 10% packet loss in a specific cluster in a specific region I drain the cluster. You can build your specific logic in there.
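For example, a sketch of the kind of alerting you could build directly against the InfluxDB 1.x HTTP query API; the measurement, field and tag names ("ping", "loss_pct", "target_cluster") are invented for illustration, and the threshold mirrors the 5% figure mentioned above.

```python
import requests

INFLUX_URL = "http://localhost:8086/query"

# Hypothetical schema: measurement "ping" with a "loss_pct" field,
# tagged by target_cluster.
QUERY = ('SELECT mean("loss_pct") AS loss FROM "ping" '
         'WHERE time > now() - 5m GROUP BY "target_cluster"')

def check_clusters(threshold=5.0):
    """Print an alert for every cluster averaging more than `threshold`% loss."""
    resp = requests.get(INFLUX_URL, params={"db": "netnorad", "q": QUERY})
    resp.raise_for_status()
    for result in resp.json().get("results", []):
        for series in result.get("series", []):
            cluster = series["tags"]["target_cluster"]
            loss = series["values"][0][1]          # columns are [time, loss]
            if loss is not None and loss > threshold:
                print(f"ALERT: {cluster} averaging {loss:.1f}% loss, consider draining it")

if __name__ == "__main__":
    check_clusters()
```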
So all of this is available on the OpenNetNorad GitHub site. You will find all the software needed to fire up this solution, and a very detailed step by step guide of what you need to do to put this into production. And, very importantly, we have made packages available for the libraries and for the actual UDP pinger library that we use. Compiling and building all these things is a very tedious and time‑consuming process, so the idea was to build a very easy solution to get all of this working. So if you are running on Debian you have the package; if not, you have to take the time to compile it yourself, but in the end you have it running.
So let's see a demo of how this works. We are going to be working with two scenarios. The first thing we want to show is how we can detect packet loss with this. Imagine that on rack 1 in every cluster we have a pinger, okay? So rack 1 in every cluster is pinging all racks in all clusters in all other available places. In this first packet loss detection, what we are going to do is detect that Rack 1 is measuring some packet loss to Rack 3. Let's say in Rack 3 we have a faulty fibre or a faulty rack switch that is dropping packets, and we are going to be able to detect this. For this specific demo we are running all of this in the cloud; these are basically machines for the demonstration, but it works on a real network. So we are going to see how Rack 1 picks up the loss to Rack 3. The other scenario we are going to see is this one: how we detect packet loss between regions. We have three regions: Madrid, SFO and Melbourne. We are going to see that rack 1 is picking up some loss to SFO but not seeing any loss to Melbourne. As you can see, all this information, and knowing that every pinger is pinging all pongers available in the network, will allow you to narrow down where the problem is.
So let's fire it up here. This is our UI, with a pre‑defined dashboard that we use for the demonstration, built like this. On your right side you have latency measurements, the latency as measured from rack 1 to other racks in the same cluster, so this is basically latency within the cluster, and this is packet loss within the cluster. As you can see here, we put the highest on top. You can see that rack 1, as it says in the title, Madrid cluster 1, rack 1, is measuring packet loss to all other racks in the same cluster, and you can see there that we are seeing about 10% packet loss between Rack 1 and Rack 3. As you can see, it's very easy to visualise where the packet loss is happening, and you can go and take action on that: you can actually go to Rack 3 and shut down the rack, because that is in fact impacting your customers and your service. When the loss is gone, once Jose isn't there introducing some packet loss, you see the issue is gone. You can see here we have this couple of graphs, built the same way as the rack 1 ones, to measure loss between clusters in the same data centre, let's say between clusters inside Madrid, and the latency inside Madrid. In this demonstration we are working with packet loss, but if you are interested in latency you will probably be playing with that side of the solution. And the same happens with Melbourne: you can see we have pretty bad packet loss between Madrid and, if you see here, SFO. We have around 70% packet loss between these two regions, so if you see that in your network you have a big problem, you are dropping more than half of your packets, and that is something to take action on. And on the right side you have the latency measurements we are getting, so we have around 3 milliseconds of delay to reach Melbourne and 93 to reach SFO from Madrid, which makes sense, as these are actually running in those locations, so the latency measurements are correct. And this is basically what you get in the UI. Like I said, if you want to just play with the data you can do it directly on the InfluxDB storage and build your alarming, build your detections, and probably automate that as well: if I see packet loss I will go and drain the rack; if I see packet loss or extreme latency I will move the customers around. So you can play around and adjust it to fit your business needs. And that is basically what we have right now. Like I say, everything is available on the OpenNetNorad GitHub project. And that is all, thank you very much. If you have any questions.
OSAMA I AL‑DOSARY: Any questions? I know it's early but...
AUDIENCE SPEAKER: Nasser from Shatel. I have a question; I am not sure if it is out of scope for this session or not. In your example you showed three regions, which were Madrid, SFO and, I think, Melbourne. Exactly. I am wondering what happens, actually, based on your destination: for example, you see packet loss from region Madrid to SFO, but from region Melbourne to SFO it's fine. So what kind of decision are you going to make here, because it could be the network which is congested between Madrid and SFO, not the cluster itself? So I am wondering, how do you distinguish these kinds of issues?
JOSE LEITAO: Like I said, this is to assist the operator in trying to narrow down where to start. For example, if this were a real case, the way we introduced it in the presentation, I would probably start looking at the backbone. It's specific to your network, but for us in the beginning, because now we have another system that actually narrows this down more, which we have spoken about publicly, a few years ago this would have helped in the sense of: we need to start looking at the network inside a POP, or at the backbone network between those two places. When you get to this point, your monitoring is not seeing anything, because you might be hitting a device that is having a bad day and is not reporting it in any way.
DANIEL RODRIGUEZ: If you have congestion because you are reaching the top of your capacity, maybe that is something that you can automate: let's say you are seeing packet loss in the system, you fire an alarm, and you have something else that picks that up, goes and checks the capacity utilisation, are we running at the top? Then okay, it's a problem derived from the lack of capacity and not because there is a fault. So this is something that allows a lot of automation to be built around it, and you can see it as part of a bigger system to measure a lot of issues on the network.
AUDIENCE SPEAKER: Exactly, this is what I am thinking, because actually you need a combination of these tools to make it work in a real life scenario. Thank you.
OSAMA I AL‑DOSARY: Any other questions? We still have time. Right. Thank you.
(Applause)
Next up is Louis Poinsignon.
LOUIS POINSIGNON: Good morning everyone. I work at Cloudflare in London. I would like to thank RIPE, the host and the sponsors for allowing me to present here. I am going to talk about network monitoring at the scale of Cloudflare and all the things we can do with all the information we have.
So, why do we monitor and what do we monitor? Usually it's all about billing, trying to reduce costs, traffic engineering: where should we peer, where should we set up a new data centre, and how can we optimise the network. We also do some anomaly detection and troubleshooting, to be faster at fixing problems. And eventually we'd like to be able to predict issues.
So, what are the usual sources of information? SNMP, where you usually poll a router and get per‑interface traffic; flow data, which are samples of packets; and the BGP routing table, which gives you the routes. I am just going to talk about the last two.
So I am going to start with flow data. NetFlow is a protocol from Cisco, and IPFIX is derived from NetFlow; it is the open standard. It's template based, so it is stateful, and I am going to come back to this. Between sampling and collection you usually have around 23 seconds on Cisco with NetFlow v9 and 65 seconds with IPFIX, and it samples just a few packets. Every minute you get a template, which is the router telling you: I am going to send you packets, bytes, source IP and so on; over all our routers it takes about eleven minutes to gather all the templates. sFlow, on the other hand, has its structure fixed in the specification, so it says: I am going to send you an Ethernet sample, a wi‑fi sample, and so on. It provides counters as well as packet samples, and it does not carry a time stamp, so I am going to consider it instantaneous. What we want in our case is a mix of both of these: data like source IP and destination interface, so we can compute a rate, store the data and extract network information. Each of the protocols provides part of what we want: sFlow adds the MAC address, because we have the packet header, while with NetFlow most of the templates I have seen from Juniper and Cisco do not provide the MAC address, but instead we have BGP information, the destination AS and a time stamp. Both of them provide information on the interface, source and destination.

At Cloudflare today we have more than 100 routers from various vendors all around the globe, in very different environments, and terabits of traffic. We consider it already too late if a user notifies us about an issue. What we used before was nfdump and nfacctd. We had two paths: one going to a collector and into storage, the other doing aggregations to give a rate, adding BGP information, aggregating and inserting into a database. And I am saying nfacctd is an important part because we had some issues with it. These tools were great, but they became unfit because we outgrew them. We had a vendor bug at some point corrupting information in our NetFlow packets, and at some point we had too many packets and we started dropping them, so we lost 10% of the visibility of our traffic. Plus, at some point we added more and more routers which did sFlow and we could not process it, and we had even more branches. We had two aggregations, so I thought we should unify this into one single pipeline; plus it would be great if we could monitor the collection itself; plus some of the teams and some of the salespeople were interested in the data we had, and they wanted a nice API and interface to display it. And also, if we could create aggregations for Cloudflare by type of customer or region, it would be better. So, for instance, the vendor bug I talked about: overnight an ASN completely disappeared, and it was a pretty big one in Europe, and it was corrupted into a small ISP from Brazil. So we completely lost it and we had to remap all the ASNs. This bug was present in the database, on nfdump, but the other collector had just fixed the issue, so we had two different sets of information, which made the issue very hard to troubleshoot: I need to know that this one has been corrupted.

So what we built in the end is a new pipeline. Routers send data, like sFlow and IPFIX, and we have decoders that convert it into an abstraction using a common format. It contains the interface, the ASN, all this network information; the decoders encode everything and then we have other processors: correcting issues with BGP information, adding Cloudflare‑specific fields or other additional information. We split it into two paths, data and aggregations: a data warehouse similar to nfdump, and aggregations for fast look‑ups, which are just graphs of aggregates. Of course, at scale, the way we scale this is that we load balance over multiple containers, but since NetFlow is stateful you have to load balance a given router consistently to the same container, the same instance, the same node, otherwise you are going to lose the templates and you are not going to be able to decode the NetFlow. Once everything is converted to the abstraction, everything is sent to Kafka, which is a great tool: it's a message queue that is able to distribute messages over multiple consumers, a producer/consumer kind of thing, and you can load balance very easily. Then you have the processors, which are more costly in terms of processing because they do mapping: they are going to look up every IP and every ASN and add the information. Then we send it to a database, Clickhouse, from Yandex; it's a data warehouse database, very well made for inserts and for summing and doing aggregations on this. Don't look to it for updating rows, it's not made for that. And we use Flink for aggregation; Flink is a great tool from the Technical University of Berlin, and it computes the aggregations, the same thing as nfacctd, and sends them to a time series database. GoFlow is just made for decoding; it was benchmarked at 30,000 messages per second for one collector, so you can scale it like this. You can easily extend it with new protocols and formats; it's just made for scale and for decoding packets. We have the processing units, which are also in Go and just do the correcting; we use other databases and they are just made for mapping the IP, for instance. Then we have inserters to populate the database, and we use Kafka, Flink and Clickhouse for the database.
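A small Python sketch of the consistent-routing idea for the stateful NetFlow case: key each raw datagram by its exporting router so that Kafka's default partitioner always delivers it to the same decoder instance, which therefore holds the right templates. The topic name and record layout are assumptions; the real pipeline serialises the decoded abstraction with protobuf rather than forwarding raw packets.

```python
import socket
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(bootstrap_servers=["localhost:9092"])

def forward_netflow(listen_port=2055, topic="netflow-raw"):
    """Receive raw NetFlow/IPFIX datagrams and publish them keyed by exporter.

    Kafka's default partitioner hashes the key, so all packets from one router
    land on one partition, and therefore on one decoder instance, which keeps
    that router's templates in memory.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", listen_port))
    while True:
        data, (router_ip, _) = sock.recvfrom(65535)
        producer.send(topic, key=router_ip.encode(), value=data)

if __name__ == "__main__":
    forward_netflow()
```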
About the aggregations: Flink is a Java framework for building stream processing apps. You create a job, which is going to be split into tasks; every branch you see here is one type of aggregation. Don't do it exactly like that, it's not the most optimised approach, but I use this to give another view of how the aggregation pipeline is done. And we get accurate time aggregation; I will come back to this later.
The main concept is map reduce: since we don't have enough capacity on a single node to do the aggregation, we just split the tasks. You use a key to send one specific flow, for example AS 6501, to one specific node, which is going to sum them and send the result to another node; everything happens in parallel and in the end you have a sum. It's way faster, and we want a very accurate, very small aggregation window: right now we are doing two minutes, which is sufficient for having a global estimate of the rate.
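In plain Python, that keyed map-reduce over two-minute windows looks roughly like the sketch below (in production it is a Flink job reading from and writing back to Kafka; the flow fields, sampling rate and sample records are invented).

```python
from collections import defaultdict

# Invented flow samples: (timestamp, src_asn, plan, bytes_sampled)
flows = [
    (0,   65001, "eyeball", 1500),
    (30,  65001, "eyeball", 9000),
    (95,  65002, "transit", 4000),
    (130, 65001, "eyeball", 3000),
    (185, 65002, "transit", 7000),
]

WINDOW = 120          # two-minute windows, as in the talk
SAMPLING_RATE = 1000  # scale sampled bytes back up (assumed rate)

def aggregate(flows):
    # filter: keep only eyeball traffic
    eyeball = (f for f in flows if f[2] == "eyeball")
    # map: key by (window start, ASN); reduce: sum estimated bytes per key
    totals = defaultdict(int)
    for ts, asn, _, nbytes in eyeball:
        window_start = (ts // WINDOW) * WINDOW
        totals[(window_start, asn)] += nbytes * SAMPLING_RATE
    # emit one rate per key (bytes per second over the window)
    return {k: v / WINDOW for k, v in totals.items()}

print(aggregate(flows))
# {(0, 65001): 87500.0, (120, 65001): 25000.0}
```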
For instance, in a sample program you get the source from Kafka, then you have a filter, for instance I say I want only the eyeball traffic, then you map it to a specific data structure and you reduce it; the reduce phase is what I showed before. And then you send it back to Kafka. A little word about windowing: it's the two minute aggregation, so you create buckets and every flow goes into the bucket corresponding to a certain time stamp; every two minutes, once I get no more data, I do the reduce phase. This allows some slack: if we have a node that is lagging, that is too loaded, it has two minutes to catch up and send all its flows.

As a result we can do business intelligence with something as simple as a SQL query or an API call, and easily get the top networks per country, data centre, plan or transit provider. I get the share of every country, like which provider has what share in which country. For instance, what is the IPv6 traffic ratio and where: Comcast, 46%. This was done with a single SQL query on Clickhouse. With the aggregations we see the traffic of every ASN, by data centre and country. We created different aggregations by type of traffic, and this is very useful to troubleshoot non‑network problems for other teams. This is a visualisation where, during the hurricane, we saw the traffic dropping; it's very easy to just get the data and plot it with Python or any other language. Another use: this is not abstract artwork, this is the traffic of all our data centres, normalised, and it allowed me to get the best hours for maintenance on our network: for this data centre it's better to do maintenance between this time and that time. I just automated it for every data centre and we have a list which is used. So you can do pretty much anything with a script and with the data stored in the time series database or Clickhouse.

And anomaly detection: we usually see every issue, since we have a big sample of Internet traffic. Taiwan's traffic dropped by a few percent, and we see countrywide shutdowns; I usually plot them with Python and try to detect them. What we want to do is automated anomaly detection, so things like derivatives, correlation, median and variance, because this traffic is usually periodic and representative. For instance, the correlation would be between two variables: over the same dates they are approximately the same, so all the points that are outside the straight line would be, for instance, strange traffic. The band which is green is the variance, so you can consider traffic good within that band; plus or minus 10% is an acceptable drop, it can happen. But I want to know when it starts dropping drastically.
So, how are we going to integrate it? This is currently a project. Basically all the aggregated data goes back into Flink, which computes the information, and we use classification using machine learning on some of the nodes to detect if there is an anomaly and then send an alert. Just a last word about the other source of information, the BGP routing table, which is very new. We have hundreds of routers with full tables, millions of routes; just as a comparison with RIPE RIS, which has 15 percent, and Route Views, which has 47 percent. So we have a broad view; it's very easy to see route leaks on every IX, and we decided to integrate this with a similar pipeline and tools.
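A rough sketch of the kind of automated check described: compare the current rate against the median and spread of the rates seen at the same time of day on previous days, and flag drops that fall outside the band. The 10% tolerance follows the talk; the data layout and the three-sigma threshold are assumptions.

```python
import statistics

def is_anomalous(current, history, tolerance=0.10, sigmas=3.0):
    """Flag `current` if it falls below both the tolerance band and the
    statistical band built from `history` (rates seen at the same time of
    day on previous days/weeks)."""
    median = statistics.median(history)
    stdev = statistics.pstdev(history)
    lower_band = min(median * (1 - tolerance), median - sigmas * stdev)
    return current < lower_band

# Invented example: Gbps seen at 14:00 UTC on the previous seven days.
history = [92.0, 95.5, 91.2, 94.8, 93.1, 96.0, 92.7]
print(is_anomalous(90.0, history))   # False: a few percent below median, acceptable
print(is_anomalous(60.0, history))   # True: a drastic drop
```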
So, collectors again, then processing, and data warehouse and aggregation at the same time. It's a bit different from how we process the flows, because it's batch processing: you have a full table, you have a dump and you batch it; it's kind of a one‑off, not a continuous stream. So we use Spark, though we could use Flink as well. It's very new, so we are still looking for more use cases, but for example: who has the longest AS path, and which prefixes do we peer with? I mean, which prefixes are learned via peering, and where could we reach them better, for instance. And also creating tables for mapping IP to ASN, which can be reused.
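A tiny Python sketch of those batch-style questions over an in-memory list of routes standing in for a parsed MRT dump; in reality this runs as a Spark job over hundreds of full tables, and the rows below are made up.

```python
# Each row: (prefix, AS path as a tuple of ASNs, how the route was learned)
routes = [
    ("192.0.2.0/24",    (64500, 64501, 64502, 64503, 64504), "transit"),
    ("198.51.100.0/24", (64510, 64511),                      "peering"),
    ("203.0.113.0/24",  (64520, 64521, 64522),               "peering"),
]

# Who has the longest AS path?
longest = max(routes, key=lambda r: len(r[1]))
print("longest AS path:", longest[0], "length", len(longest[1]))

# Which prefixes do we learn via peering?
peered = [prefix for prefix, _, via in routes if via == "peering"]
print("learned via peering:", peered)

# A prefix -> origin ASN mapping table, reusable elsewhere (e.g. to enrich flows)
prefix_to_origin = {prefix: path[-1] for prefix, path, _ in routes}
print(prefix_to_origin)
```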
Open source, because we are giving back. The flow collector will be open source soon; it has been agreed by my manager and everybody at Cloudflare. What it does is decode NetFlow and send it into a generic network format using Protobuf, and it provides metrics, such as how many packets we are receiving. We also have some routers where the data just gets corrupted, so we discard it. You can scale it, and it takes something like 23 microseconds for NetFlow and 18 for sFlow. If you have wi‑fi or GSM specific fields it won't decode them, but you can implement them. It does not do aggregation; you have to use different tools for that. How much would it cost? Because this is made for cloud computing, we put it on our cluster, so here is a mention of the costs for three use cases: 10 gigabits per second, 100 gigabits and 1 terabit. Don't take my word for it, this is a gross estimation: 100 or 200 dollars a month for the first case, which was 10 gigabits per second, and then 500 for the next; it's not exponential, you save on costs. The BGP library will also be released. Not a collector, just the library: it decodes BGP packets, can maintain a session, encodes and decodes MRT dumps, and includes many other RFCs and extensions. Why we did that: there are great collectors out there, but most of them did not support the load of more than 100 full tables, and we had to scale it; we want it so that if a node acting as a collector shuts down, another node starts and the routers are still collected.
That is all. Thank you. If you have any questions.
(Applause)
ONDREJ SURY: So any questions?
AUDIENCE SPEAKER: Sebastian Castro, NZRS. I love the presentation and the different things you have going on there. I have three questions. First, why Flink and not Spark?
LOUIS POINSIGNON: So Flink does stream processing, and it was also deployed by our platform team, so I thought I would give it a try; it's optimised for stream processing. Spark is more batch oriented and all the data sets need to fit in RAM, so I gave Flink a try. There are other tools you can use, but we decided this is probably the best for us.
AUDIENCE SPEAKER: Second question: what is the granularity in terms of time for the flows? You have a graph, it's in hours, but what is the granularity you have for your ‑‑
LOUIS POINSIGNON: So the data arrives delayed, depending on whether it's Juniper or Cisco, by 30 seconds to one minute, and we aggregate over two minute windows. That was found by experience: we tried a few aggregation windows and two minutes gave the best result. In one minute you don't get all the flows from everywhere, so you get spikes and then very low values. So it's two minutes, plus two more minutes just to be sure you have all the flows in one bucket. So it's delayed by about five minutes and we have two minute granularity.
AUDIENCE SPEAKER: So the third and final question is about the anomaly detection: do you have an SLA or expected time to detect anomalies?
LOUIS POINSIGNON: So there was a prototype and it worked. What I showed is what we are expecting to build for prediction with the new plan; we have some anomaly detection and it's probably around 15 minutes. You can do it shorter, but you might get some false positives.
AUDIENCE SPEAKER: Okay. Thank you.
AUDIENCE SPEAKER: Hi, Brian Trammell. I have done one or two things and it's cool to see how it's been integrated into this whole system. Real quick question, the GoFlow, the collector, is that available or is that a closed source thing?
LOUIS POINSIGNON: Not yet. It's going to be available.
AUDIENCE SPEAKER: Thanks. I will follow it up.
ONDREJ SURY: Any more questions? If no, then thank you.
(Applause)
And so this concludes the session. And we are running a little bit early so enjoy the break.
(Coffee break)
LIVE CAPTIONING BY AOIFE DOWNES, RPR, DUBLIN, IRELAND.