Transcription:
What is URL Classification
Hi, everyone. This is Barak from url-classification.io, and today I’m going to speak about what URL classification is and what the common uses for it are. Before starting, it’s important to know that URL classification goes by other names, like URL categorization and URL categorization database. There are nuances between the names, which I’m going to cover in future lectures.
So, what is URL classification? URL classification is the process of mapping a URL or domain to a category. So for example, the site CNN is News and the site Google is Search Engine. That’s very easy. That’s domain level. The next level is URL level. So for example, you have an article on CNN that speaks about money. So the domain is News, but the URL itself is about money. Now that was really straightforward. The next question is why would you need URL classification? What would you do with it? So, I think the first thing that people use URL classification for is parental control.
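To make the two levels concrete, here is a minimal sketch in Python. The lookup tables, the article path, and the `classify` function are made up for illustration; a real service would hold far more entries and would not necessarily work this way.

```python
from urllib.parse import urlsplit

# Hypothetical sample data, for illustration only.
DOMAIN_CATEGORIES = {
    "cnn.com": "News",
    "google.com": "Search Engine",
}
URL_CATEGORIES = {
    "cnn.com/business/markets-story": "Money",  # made-up article path
}


def classify(url: str) -> dict:
    """Return the domain-level and URL-level category for a URL."""
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    domain_category = DOMAIN_CATEGORIES.get(host, "Unknown")
    # Fall back to the domain category when the specific page is not known.
    key = host + parts.path.rstrip("/")
    url_category = URL_CATEGORIES.get(key, domain_category)
    return {"domain": domain_category, "url": url_category}
```

Calling `classify("https://www.cnn.com/business/markets-story")` gives News at domain level and Money at URL level, while a page with no entry of its own simply inherits the category of its domain.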
Parental control
Parental control is a type of software that limits kids from visiting certain sites. So the parental control application would intercept the visited site, and let’s say it’s an adult site. But how would it know it’s an adult site? It has no way of knowing that, because it only intercepts traffic and is able to block or report it. So it contacts a third-party API, although some parental control vendors have their own service, and queries what type of site it is, and that service would say, this is an adult site.
Now, according to the parental control policy, the site would most likely be blocked. Of course, parental control blocks other types of sites, for example, school cheating sites, maybe games, maybe news. It all depends on the parent, and they need some sort of service that tells them the type of site it is, because there is no inherent way of knowing the category of a site. Apart from sites that end with a known extension, where you know with some degree of certainty what type of site it is, there is no inherent way to know, and you need a third-party service.
Another feature of parental control is reporting on the sites the child visited. So, it can be on top of blocking or instead of blocking, and at the end of the day or the week, the parent would get a report that says, look, the child visited these sites for this amount of time and these sites for this amount of time. The report uses categories to show the types of sites, and the parent can decide if it’s good or bad. So we also need URL classification for that. So that would be the first usage.
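The flow just described, intercept the site, ask a classification service, then block or allow, can be sketched roughly like this. Everything here is an assumption for illustration: `lookup_category_stub` stands in for the third-party API, and the host names and category labels are invented.

```python
from typing import Callable, Set, Tuple


def lookup_category_stub(host: str) -> str:
    """Stand-in for a third-party classification API (hypothetical data)."""
    sample = {
        "adult-example.test": "Adult",
        "cheats-example.test": "School Cheating",
        "cnn.com": "News",
    }
    return sample.get(host, "Unknown")


def decide(
    host: str,
    blocked_categories: Set[str],
    lookup: Callable[[str], str] = lookup_category_stub,
) -> Tuple[str, str]:
    """Return (action, category) for a site the application intercepted."""
    category = lookup(host)
    action = "block" if category in blocked_categories else "allow"
    return action, category
```

The set of blocked categories is the part the parent controls, so one parent might pass `{"Adult", "School Cheating"}` and another might add News or Games to it.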
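The reporting side can be sketched in a few lines as well. This assumes the application already has a log of (site, seconds) pairs and a site-to-category mapping from a classification service; both are hypothetical inputs here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_report(
    visits: Iterable[Tuple[str, int]],
    categories: Dict[str, str],
) -> Dict[str, int]:
    """Sum the seconds spent per category, largest total first."""
    totals: Dict[str, int] = defaultdict(int)
    for host, seconds in visits:
        totals[categories.get(host, "Unknown")] += seconds
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Grouping by category rather than by site is what keeps the report readable: the parent sees a handful of category totals instead of a long list of individual domains.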
DLP (Data Leakage Prevention)
The next usage would be for data leakage prevention, or DLP. DLP is used by enterprises and businesses to stop data leakage. For example, an employee might want to send a competitor some Word document that shouldn’t be sent. The DLP application would detect sites like file sharing, webmail and other types of collaboration sites, and block them to avoid any kind of leakage. Of course, they do more stuff besides that.
Another possibility for DLP, similar to parental control, is adding a reporting feature to the application that shows which sites the employees went to. And then you can get a report once a month, once a week, whatever you choose, and see that John, your employee, spent most of his time on YouTube. And unless his job requires being on YouTube, that would be a waste of time, and you need to do something about that. So, you need to know which sites are in which category to be able to report it. So that’s why you need URL classification with DLP.
Endpoint devices
The next scenario is for endpoint devices, for example, routers. Some routers have built-in parental controls that need URL classification. So it’s similar to DLP and similar to a parental control application, but it’s at the router level. The differences would be in what type of URL classification service you need, but that’s for another lecture and not for this one. So that was pretty straightforward. The next one is for ISPs.
ISPs
ISPs may want to know where their users are going. For example, in some countries, if you’re the ISP, you need to block adult sites unless the user consented up front to visiting them. Another usage: if you want to maximize your resources, you want to know which sites your users are visiting most, and you want to adjust your resources accordingly. Maybe you want to block some users because they abused the system. And I’m sure there are more scenarios that I didn’t think about.
Sometimes a potential client comes with a scenario I never heard of for URL classification. So what I’m covering is not everything. There are other possibilities, and I’m just giving the most common ones.
Programmatic ads
So the next one is programmatic ads. And what does it mean? The first thing with programmatic ads is brand safety. Brand safety means you don’t want your ad to be where it shouldn’t be. For example, an ad by a children’s company or children’s entertainment should not be on adult sites or on sites that sell certain things.
Brand safety
Every advertiser has its own policies about where you can show the ads and where you can’t show the ads. And if you are caught showing the ad where you shouldn’t, the advertiser may block your account. Now when I’m speaking about brand safety, I’m not speaking about people that go to Facebook ads or Google ads. Those kinds of places are already filtered. I’m talking about people that are using ad exchanges that have more freedom in choosing where to go. And in that case, you would like that protection so you will not be blocked by certain advertisers.
Ad targeting
The next feature is being more specific with your ads. So for example, let’s say you have an ad for parties or balloons, something that is children related. You would like your ads to be in relevant places like children’s sites, parenting sites, recreation sites, for example. You wouldn’t want to be on a news site that always speaks about bad news, because bad news is bad for shopping.
Now, on the other hand, let’s say you have an ad for a home invasion alarm. You would want to place it in places like the news sites I mentioned before, because people might get scared and say, okay, it’s a good time to buy a house alarm. Just to reiterate, Facebook and Google already have these features. This is more for ad exchanges and big agencies that are doing that without Google and Facebook. Now, the last thing is RTB. RTB is real-time bidding.
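As a rough illustration of a brand safety check, assume each advertiser's policy is just a set of excluded categories. The advertiser names and category labels below are invented, and real policies are usually richer than this.

```python
from typing import Dict, List, Set


def eligible_advertisers(
    page_category: str,
    policies: Dict[str, Set[str]],
) -> List[str]:
    """Return the advertisers whose policy allows ads on this page category.

    `policies` maps an advertiser name to the categories it excludes.
    """
    return sorted(
        name
        for name, excluded in policies.items()
        if page_category not in excluded
    )


# Hypothetical policies for illustration.
EXAMPLE_POLICIES = {
    "kids-toys-brand": {"Adult", "Gambling", "Weapons"},
    "home-alarm-brand": {"Adult"},
}
```

With these example policies, a page classified as Gambling is still open to the alarm brand but not to the toy brand, and a page classified as Adult is open to neither.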
RTB
It means that there’s a site, for example, CNN, that right now is showing a page to a user, and CNN wants to maximize its profits, of course, so it says, look, I have a user that is going to this page and you have 200 milliseconds to make a decision. Do you want me to show your ad or not? And there’s bidding between advertisers over how much to pay for that ad.
So, let’s say, CNN says, I have this page. And this page is about cars. And the agency right now doesn’t have any ads about cars. So it would get the URL, and the URL classification will tag this URL as cars. Now the agency says, okay, I don’t have any inventory for that. I will not advertise on this page because most likely it’s not relevant.
But another page is about a home invasion. So the agency says, I’m very good with alarms. I have a lot of clients that are selling home alarms, and I can be on those pages and maybe pay some more, because that’s very relevant to what I do. The challenge with RTB is that you have a very short window to make a decision, about 200 milliseconds. So the solution must be able to give you an answer rather quickly.
So, these are the main uses for URL classification. To recap, that’s parental control, DLP, endpoint devices like routers, ISPs and programmatic ads.
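The time pressure can be sketched like this: ask the classifier, but give up and skip the auction if no answer arrives inside the budget. The 50 ms default is an assumed slice of the roughly 200 ms window, and the function names, stub classifiers, and flat bid value are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Set

_POOL = ThreadPoolExecutor(max_workers=4)


def decide_bid(
    url: str,
    classify: Callable[[str], str],
    wanted_categories: Set[str],
    budget_seconds: float = 0.05,  # assumed share of the ~200 ms window
    base_bid: float = 1.0,         # placeholder bid value
) -> float:
    """Return a bid, or 0.0 when the page is irrelevant or the lookup is late."""
    future = _POOL.submit(classify, url)
    try:
        category = future.result(timeout=budget_seconds)
    except FutureTimeout:
        return 0.0  # no category in time, so do not bid
    return base_bid if category in wanted_categories else 0.0


def fast_stub(url: str) -> str:
    """Hypothetical classifier that answers immediately."""
    return "Home Security" if "home-invasion" in url else "Cars"


def slow_stub(url: str) -> str:
    """Hypothetical classifier that is too slow for the auction."""
    time.sleep(0.3)
    return "Home Security"
```

An agency that only holds home alarm ads would pass `{"Home Security"}` as the wanted categories: the cars page gets no bid, the home invasion page gets one, and a classifier that answers after the budget has passed is treated the same as an irrelevant page.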