Website URL Classification

When speaking with potential clients about URL classification, one of the questions I ask is whether they want domain-level or URL-level classification. So what’s the difference?

In domain-level URL classification, the site is classified as a whole. CNN and Wikipedia are each classified at the main level, so CNN is News and Wikipedia is Reference (our category for dictionaries and wiki pages). This mode is suitable for most parental control and DLP (data loss prevention) applications. Our database holds about 65 million domains, and if our API gets a request for a domain it doesn’t know, it classifies the domain on the fly and returns a result within 30 seconds.
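To make the domain-level lookup concrete, here is a minimal Python sketch. It is an illustration under assumptions, not our actual API: `DOMAIN_DB`, `classify_on_the_fly` and `classify_domain` are hypothetical names, the in-memory dictionary stands in for the real database, and the on-the-fly classifier is only a placeholder.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the domain database; the two entries mirror the examples above.
DOMAIN_DB = {"cnn.com": "News", "wikipedia.org": "Reference"}


def classify_on_the_fly(domain):
    """Placeholder for the live classifier; a real one would fetch and analyze the site."""
    return "Uncategorized"


def classify_domain(url):
    """Return the category of the site as a whole, ignoring the path.

    Assumes a full URL including the scheme (e.g. "https://...").
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then each parent domain (edition.cnn.com, then cnn.com).
    # The bare top-level label (e.g. "com") is never tried.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in DOMAIN_DB:
            return DOMAIN_DB[candidate]
    # Unknown domain: classify it now and remember the result for the next request.
    category = classify_on_the_fly(host)
    DOMAIN_DB[host] = category
    return category
```

With this sketch, `classify_domain("https://edition.cnn.com/business/some-article")` returns `"News"` regardless of the path, which is exactly the trade-off of domain-level classification.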

In URL-level URL classification, the URL itself is examined and the classification is more specific. For example, an article about money on CNN would be classified as Finance. This mode has more latency, because if the URL is not stored in the database, the classifier needs to fetch the page on the fly and categorize it. To cut down on the wait time for new URLs, our database holds about 5 billion URLs.
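The lookup-then-fallback flow for URL-level classification could look roughly like the sketch below. Everything named here (`URL_DB`, `normalize`, `fetch_and_classify`, `classify_url`) is hypothetical, as is the sample URL, and dropping the query string during normalization is an assumption that would not suit sites that identify pages by query parameters.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the URL database, keyed by a normalized "host/path" string.
URL_DB = {"cnn.com/business/money-article": "Finance"}


def normalize(url):
    """Build a lookup key: lowercase host without "www." plus the path, no query string."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def fetch_and_classify(url):
    """Placeholder for the slow path; a real one would download and categorize the page."""
    return "Uncategorized"


def classify_url(url):
    """Return (category, source), where source shows whether the slow path was taken."""
    key = normalize(url)
    if key in URL_DB:
        return URL_DB[key], "database"
    # Cache miss: this is where the extra latency comes from.
    category = fetch_and_classify(url)
    URL_DB[key] = category
    return category, "on the fly"
```

The second request for the same URL is served from the database, which is why a large pre-populated URL store keeps the slow path rare.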

It’s also possible to use a hybrid approach. If the domain is one we planned to block anyway (adult content, for example), we don’t need to classify each sub-URL. Alternatively, we can decide to classify only the subpages of domains in specific categories, such as News, Reference, User Generated Content and similar categories whose pages can be either good or bad for our end users.
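One way to express the hybrid policy is as a small decision function. This is a sketch under assumptions: the category sets and the `hybrid_classify` name are illustrative, and the URL-level classifier is passed in as a parameter so that the policy stays independent of how pages are actually categorized.

```python
# Illustrative policy sets; the right contents depend on the deployment.
BLOCKED_CATEGORIES = {"Adult"}
DRILL_DOWN_CATEGORIES = {"News", "Reference", "User Generated Content"}


def hybrid_classify(url, domain_category, classify_url):
    """Decide whether the domain category is enough or the URL needs its own lookup.

    classify_url is any callable that takes a URL and returns a category.
    """
    if domain_category in BLOCKED_CATEGORIES:
        # The whole domain is blocked anyway, so skip the per-URL work.
        return domain_category
    if domain_category in DRILL_DOWN_CATEGORIES:
        # Mixed-content domains: individual pages can differ, so classify the URL itself.
        return classify_url(url)
    # Everything else is treated at the domain level.
    return domain_category
```

The slower URL-level call is made only for domains in the drill-down set, so the extra latency is paid only where page-level detail can change the outcome.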