Website classification protocol

Revision as of 08:34, 4 September 2020

HTTP Protocol

Request

Create a web request to the server in the form of:

http://thor.url-classification.io/url.php?version=&guid=&id=&url=

Mandatory version number:

  • Domain level categorization - Can be w1, w11, w2 or w21:
    • w1 - Returns the category name. If the site is unknown, the reply says so and you should query it again later.
    • w11 - Same as w1, but if the site is unknown the server waits until it has an answer before replying.
    • w2 - Returns the category ID.
    • w21 - Same as w2, but if the site is unknown the server waits until it has an answer before replying.
  • Page level categorization - Can be w3, w31, w4 or w41:
    • w3 - Returns the category name. If the page is unknown, the reply says so and you should query it again later.
    • w31 - Same as w3, but if the page is unknown the server waits until it has an answer before replying.
    • w4 - Returns the category ID.
    • w41 - Same as w4, but if the page is unknown the server waits until it has an answer before replying.
  • With page level categorization, if the main domain is not safe, you will receive the domain category and the page will not be classified.
  • Domain and sub URL categorization (for big sites, like: Craigslist, Huffington Post) - Can be w5, w51, w6 or w61:
    • w5 - Returns the category name. If the site is unknown, the reply says so and you should query it again later.
    • w51 - Same as w5, but if the site is unknown the server waits until it has an answer before replying.
    • w6 - Returns the category ID.
    • w61 - Same as w6, but if the site is unknown the server waits until it has an answer before replying.

Other mandatory values:

  • guid - Randomly generated upper case GUID, for example: 1707D6F1-70C8-4BB3-A721-CBB47962E01C (you must create a new one per request because the server remembers them and will not accept the same GUID twice).
  • id - MD5 of a formula given to clients only.

Data to classify (use only one of these parameters per request):

  • url - The URL to inspect, without the leading http://
  • url64 - If the URL contains characters that cannot be sent as plain text in the query string (?, #, or &), use url64 instead of url and pass the base64 encoding of the URL to classify.
  • keyword64 - A keyword phrase to classify, base64 encoded (you can specify several keywords, delimited by ~).
  • multiple64 - Multiple URLs to classify. The URLs are delimited by " , " (space comma space) and the whole string is base64 encoded.
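
The keyword64 and multiple64 values described above could be produced roughly as follows. This is a minimal sketch: the helper names are made up for illustration, and it assumes the server expects standard (not URL-safe) base64 over UTF-8 text, which this page does not state.

```python
import base64
from typing import Iterable


def b64(text: str) -> str:
    """Base64-encode a string (assumption: standard alphabet, UTF-8 input)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def keyword64(keywords: Iterable[str]) -> str:
    # Keywords are joined with "~" before encoding.
    return b64("~".join(keywords))


def multiple64(urls: Iterable[str]) -> str:
    # URLs are joined with " , " (space comma space) before encoding.
    return b64(" , ".join(urls))
```

Standard base64 output can contain "+", "/" and "=", so the value may need percent-encoding before it is placed in the query string.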

An example of an actual request:

http://thor.url-classification.io/url.php?version=w1&guid=1707D6F1-70C8-4BB3-A721-CBB47962E01C&id=MD5&url=google.com
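
A request like the one above could be assembled as in this sketch. The function name is hypothetical, and `client_id` stands in for the MD5 value that is given to clients only; the string "MD5" in the usage line is the same placeholder the example request uses, not a real value.

```python
import uuid

BASE = "http://thor.url-classification.io/url.php"


def build_request(version: str, client_id: str, url: str) -> str:
    """Return a request URL carrying a freshly generated upper-case GUID."""
    # A new GUID is generated on every call because the server
    # will not accept the same GUID twice.
    guid = str(uuid.uuid4()).upper()
    # `url` must not include "http://" and must not contain ?, # or &.
    return f"{BASE}?version={version}&guid={guid}&id={client_id}&url={url}"


request = build_request("w1", "MD5", "google.com")
```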

URL Flags

When the URL contains flags (query parameters), for example:

http://www.somesite.com/?flag1=data1&flag2=data2

You must use the url64 parameter and base64-encode the URL.
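
One way to decide between url and url64 is sketched below. The function name is illustrative, the input is assumed to already have its http:// prefix removed, and percent-encoding the base64 text is an assumption made so that "+", "/" and "=" survive the query string; the page does not say whether the server requires it.

```python
import base64
from urllib.parse import quote

RESERVED = ("?", "#", "&")


def data_parameter(url: str) -> str:
    """Return 'url=...' or 'url64=...' for a URL given without 'http://'."""
    if any(ch in url for ch in RESERVED):
        encoded = base64.b64encode(url.encode("utf-8")).decode("ascii")
        # Assumption: percent-encode the base64 text for use in a query string.
        return "url64=" + quote(encoded, safe="")
    return "url=" + url
```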

Reply

The reply is a single line composed of two strings in the form:

String1~String2

String1 is the result code, which can be:

  • FM - Found master: the URL was found and the site has the same classification on all its pages.
  • FR - Found regular: the URL was found, but the classification applies to this URL only; query the other URLs of this domain separately.
  • NF - Not found: the server doesn't know what this URL is.
  • CL - Check later: returned only for w1/w2. The server is scanning the site in real time, so you need to check again later.

For protocols w5, w51, w6 and w61, String1 can also be:

  • FS - Found sub URL master: all URLs under this sub URL have the same classification. For example, if you get FS for "www.huffingtonpost.com/celebrity/", the result applies to that URL and to any URL of the form "www.huffingtonpost.com/celebrity/somehtml.html", but "www.huffingtonpost.com/books/" requires a separate check.
  • FP - Found sub URL where a difference in parameters matters. For example, if you get FP for "forums.craigslist.org/?forumID=1204&areaID=372", you also need to query "forums.craigslist.org/?forumID=96&areaID=372", since the URL parameters (after the ?) have changed.

String2 is delimited by "," and contains the category IDs or names (depending on the protocol).

An example reply for version w1 for yahoo.com would be:

FM~Search engine,Portal

An example reply for a new site for w1 would be:

CL

Bad URL

In case the URL is bad, you will get a single-string reply: BU
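
The reply formats above (String1~String2, or a lone code such as CL or BU) can be parsed along these lines. The `Reply` type and `parse_reply` name are illustrative and not part of the protocol.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reply:
    status: str                                   # FM, FR, NF, CL, BU, FS or FP
    categories: List[str] = field(default_factory=list)


def parse_reply(line: str) -> Reply:
    """Split 'String1~String2' into a result code and a list of categories."""
    status, _, rest = line.strip().partition("~")
    # String2 is comma-delimited; it is absent for replies such as CL or BU.
    categories = [c.strip() for c in rest.split(",") if c.strip()]
    return Reply(status, categories)
```

With a category-ID protocol such as w2, the entries of `categories` would be ID strings rather than names.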

Query server

There are a number of servers. You may want to query each one, measure how long it takes to respond, and work with the fastest.

To make such a query, use the URL google.com and add the query flag. An example request looks like this:

http://thor.url-classification.io/url.php?version=w1&guid=1707D6F1-70C8-4BB3-A721-CBB47962E01C&id=MD5&url=google.com&query=1

Servers

These are the currently deployed servers:

  • thor.url-classification.io - US server.
  • optimus.url-classification.io - UK server.
  • rodimus.url-classification.io - Australian server.
  • julius.url-classification.io - German server.


Choosing the server

The best way to choose a server is to run a query against each server and see which is fastest, which means it is the closest server.

In the URL you send, query for google.com and add the flag query=1 to indicate that this is a server speed test.

Server's uptime

Our servers go down for maintenance once a week for ten minutes (not in parallel). When you can't connect to a server, connect to the next server on the list produced by the closest-server test; after ten minutes you can revert to the original server.
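
The speed test and fallback order could be sketched as follows. The function names, the five-second timeout and the "MD5" placeholder for the client id are assumptions for illustration; `rank_servers` accepts any timing function, so the ordering logic can be exercised without network access.

```python
import time
import urllib.request
import uuid
from typing import Callable, List

SERVERS = [
    "thor.url-classification.io",
    "optimus.url-classification.io",
    "rodimus.url-classification.io",
    "julius.url-classification.io",
]


def probe(host: str, client_id: str = "MD5", timeout: float = 5.0) -> float:
    """Time one query=1 speed-test request; infinity if the host is unreachable."""
    guid = str(uuid.uuid4()).upper()  # a fresh GUID for every request
    url = (f"http://{host}/url.php?version=w1&guid={guid}"
           f"&id={client_id}&url=google.com&query=1")
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:  # connection failures and timeouts
        return float("inf")
    return time.monotonic() - start


def rank_servers(hosts: List[str],
                 timer: Callable[[str], float] = probe) -> List[str]:
    """Return hosts fastest first; later entries serve as the fallback order."""
    return sorted(hosts, key=timer)
```

The first entry of the ranked list is the preferred server, and the remaining entries give the order to try while it is down for maintenance.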

How to work with the results

There are a number of ways to work with the classification server. The easiest is to query every URL, but it is not the most efficient: each query adds a delay, and you might not need to query every URL.

FM

FM means that the classification is the same across the entire domain. For example, if you got FM for www.url-classification.io, it doesn't matter what URI comes after www.url-classification.io (www.url-classification.io/main, www.url-classification.io/test): the classification will be the same. When you get FM, cache the result for a period of time (an hour is suggested). FM applies to an exact domain, so ads.url-classification.io and url-classification.io should still be queried even if you got FM for www.url-classification.io.

FR

FR means that this domain might have a different classification per URI, so you should cache only that complete address (including the flags; if the flags change, you need to query the server again). This is mostly for search engines and sites that have porn segments.
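
The FM and FR caching rules can be illustrated with a small in-memory cache. This is a sketch under assumptions: the class and helper names are made up, URLs are passed without the http:// prefix, and the one-hour lifetime suggested for FM is also applied to FR entries, which the text does not specify.

```python
import re
import time
from typing import Callable, Dict, List, Optional, Tuple

TTL_SECONDS = 3600.0  # one hour, the period suggested for FM results


def host_of(url: str) -> str:
    """Return the host part of a scheme-less URL such as 'a.com/x?y=1'."""
    return re.split(r"[/?#]", url, maxsplit=1)[0]


class ResultCache:
    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, List[str]]] = {}

    def store(self, status: str, url: str, categories: List[str]) -> None:
        # FM is keyed on the exact host, FR on the complete address.
        if status == "FM":
            key = ("FM", host_of(url))
        elif status == "FR":
            key = ("FR", url)
        else:
            return  # other result codes are not cached in this sketch
        self._entries[key] = (self._clock() + self._ttl, categories)

    def lookup(self, url: str) -> Optional[List[str]]:
        for key in (("FR", url), ("FM", host_of(url))):
            entry = self._entries.get(key)
            if entry is not None and entry[0] > self._clock():
                return entry[1]
        return None  # not cached or expired: query the server
```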

CL

CL means the site is new and the engine is classifying it. You should check again later; we recommend retrying every 1 second. This flag is only valid when you use protocol w1 and w11.
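
A retry loop for CL replies might look like this sketch. The `query` callable is a stand-in for whatever function sends one request and returns the raw reply line; it should build a new GUID on every call. The cap of ten attempts is an assumption, not part of the protocol.

```python
import time
from typing import Callable


def classify_with_retry(query: Callable[[], str],
                        attempts: int = 10,
                        delay: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Call query() until the reply is no longer CL or attempts run out."""
    reply = query()
    for _ in range(attempts - 1):
        if reply.strip() != "CL":
            break
        sleep(delay)  # wait one second between checks, as recommended
        reply = query()
    return reply
```

If the reply is still CL after the last attempt, the caller receives "CL" and can decide whether to try again later.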

BU

BU means that the URL is malformed.

FS

Found a sub URL: all sub URLs (excluding the html part) are considered to be under the same classification. If you received FR or FM for a site, you would never receive FS.

FP

Found a URL that can change content based on the URL parameters. If you received FR or FM for a site, you would never receive FP.
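
For FS and FP results, a cache key could be derived as in the sketch below. The function name is illustrative, and treating "the html part" as everything after the last "/" in the path is an interpretation of the Huffington Post example, not a documented rule.

```python
def sub_url_key(status: str, url: str) -> str:
    """Return a cache key for an FS or FP result (URL without 'http://')."""
    if status == "FP":
        # Parameters matter, so the full URL including them is the key.
        return url
    if status == "FS":
        path = url.split("?", 1)[0].split("#", 1)[0]
        if "/" not in path:
            return path + "/"
        # Drop the trailing document name and keep the directory prefix.
        return path[: path.rfind("/") + 1]
    raise ValueError("sub_url_key only handles FS and FP results")
```

Two URLs that map to the same FS key share one classification, while any change in an FP URL's parameters produces a new key and therefore a new query.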