Website classification protocol

HTTP/S Protocol

This protocol is used for deals that are priced per end user.

Request

Create a web request to the server in the form of:

https://thor.url-classification.io/url.php?version=&guid=&id=&url=

Mandatory version number:

  • Domain level categorization - Can be w1, w11, w2 or w21:
    • w1 - Returns the category name. If the site is unknown, the reply says so and you should query it again later.
    • w11 - Same as w1, but if the site is unknown the server waits until it has an answer before replying.
    • w2 - Returns the category ID.
    • w21 - Same as w2, but if the site is unknown the server waits until it has an answer before replying.
  • Page level categorization - Can be w3, w31, w4 or w41:
    • w3 - Returns the category name. If the page is unknown, the reply says so and you should query it again later.
    • w31 - Same as w3, but if the page is unknown the server waits until it has an answer before replying.
    • w4 - Returns the category ID.
    • w41 - Same as w4, but if the page is unknown the server waits until it has an answer before replying.
  • With page level categorization, if the main domain is not safe, you will receive the domain category and the page will not be classified.
  • Domain and sub URL categorization (for big sites such as Craigslist or Huffington Post) - Can be w5, w51, w6 or w61:
    • w5 - Returns the category name. If the site is unknown, the reply says so and you should query it again later.
    • w51 - Same as w5, but if the site is unknown the server waits until it has an answer before replying.
    • w6 - Returns the category ID.
    • w61 - Same as w6, but if the site is unknown the server waits until it has an answer before replying.

Other mandatory values:

  • guid - Randomly generated upper case GUID, for example: 1707D6F1-70C8-4BB3-A721-CBB47962E01C (you must create a new one per request because the server remembers them and will not accept the same GUID twice).
  • id - MD5 of a formula given to clients only.

Data to classify (only one of these parameters can be used per request):

  • url - The URL to inspect, sent without the leading http://
  • url64 - If the URL contains characters that can't be sent as plain text (?, # or &), use url64 instead of url and send the base64 encoding of the URL to classify.
  • keyword64 - A keyword phrase to classify, base64 encoded (you can specify a number of keywords, delimited by ~).
  • multiple64 - Multiple URLs to classify, URLs are delimited by " , " (space comma space) and base64 encoded.
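The base64 parameters above could be prepared as in the following sketch. It assumes the standard base64 alphabet over UTF-8 bytes, which this page does not spell out, and the helper names are illustrative only.

```python
import base64


def _b64(text: str) -> str:
    # Assumption: standard base64 alphabet over UTF-8 bytes.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def encode_multiple64(urls: list[str]) -> str:
    # URLs are joined with " , " (space, comma, space) before encoding.
    return _b64(" , ".join(urls))


def encode_keyword64(keywords: list[str]) -> str:
    # Keywords are joined with "~" before encoding.
    return _b64("~".join(keywords))
```

Standard base64 output can contain +, / and =, so it may need percent-encoding when placed in the query string.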

An example actual request would be:

https://thor.url-classification.io/url.php?version=w1&guid=1707D6F1-70C8-4BB3-A721-CBB47962E01C&id=MD5&url=google.com
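A minimal sketch of building such a request is shown below. The function name is illustrative, and client_id stands in for the MD5 value given to clients, which is treated here as an opaque string. Keeping "/" unescaped in the url value is an assumption made to match the example above.

```python
import uuid
from urllib.parse import urlencode


def build_request_url(host: str, version: str, client_id: str, url: str) -> str:
    # The url parameter is sent without the leading "http://".
    if url.startswith("http://"):
        url = url[len("http://"):]
    # URLs containing ?, # or & must go through url64 instead.
    if any(ch in url for ch in "?#&"):
        raise ValueError("URL contains ?, # or &; use the url64 parameter")
    # A new upper-case GUID per request; the server rejects reused GUIDs.
    guid = str(uuid.uuid4()).upper()
    query = urlencode(
        {"version": version, "guid": guid, "id": client_id, "url": url},
        safe="/",
    )
    return f"https://{host}/url.php?{query}"


print(build_request_url("thor.url-classification.io", "w1", "MD5", "google.com"))
```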

URL Flags

When the URL contains flags (query parameters), for example:

http://www.somesite.com/?flag1=data1&flag2=data2

You must use the url64 parameter and encode the URL using base64. For the site in the example, the request will be:

https://thor.url-classification.io/url.php?version=&guid=&id=&url64=aHR0cDovL3d3dy5zb21lc2l0ZS5jb20vP2ZsYWcxPWRhdGExJmZsYWcyPWRhdGEy
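The url64 value above can be reproduced with a few lines of Python. This is only a sketch: the function name is illustrative, and it uses the standard base64 alphabet, which matches the example value on this page.

```python
import base64


def encode_url64(url: str) -> str:
    # Standard base64 over the UTF-8 bytes of the full URL.
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


encoded = encode_url64("http://www.somesite.com/?flag1=data1&flag2=data2")
print(encoded)
# aHR0cDovL3d3dy5zb21lc2l0ZS5jb20vP2ZsYWcxPWRhdGExJmZsYWcyPWRhdGEy
```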

Reply

The reply is a single line composed of two strings in the form:

String1~String2

String1 is the result code, which can be:

  • FM - Found master. The URL was found and the site has the same classification on all its pages.
  • FR - Found regular. The URL was found, but the classification is specific to this URL only; you should query the other URLs for this domain.
  • NF - Not found. The server doesn't know what this URL is.
  • CL - Check later. Returned only for w1/w2; the server is scanning the site in real time and you need to query it again later.

For protocols w5, w51, w6 and w61, String1 can also be:

  • FS - Found sub URL master. Everything under this sub URL has the same classification. For example, if you get FS for "www.huffingtonpost.com/celebrity/", the result applies to that URL and to any URL of the form "www.huffingtonpost.com/celebrity/somehtml.html", but "www.huffingtonpost.com/books/" needs a separate check.
  • FP - Found sub URL where a difference in parameters matters. For example, if you get FP for "forums.craigslist.org/?forumID=1204&areaID=372", you still need to query "forums.craigslist.org/?forumID=96&areaID=372", since the URL parameters (after the ?) have changed.

String2 is delimited by "," and contains the category IDs or names (depending on the protocol).

An example reply for version w1 for yahoo.com would be:

FM~Search engine,Portal

An example reply for a new site for w1 would be:

CL

Bad URL

In case the URL is bad, you will get a single-string reply: BU
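The reply format can be parsed as in this sketch. The function name and the returned tuple shape are illustrative choices, not part of the protocol.

```python
def parse_reply(line: str) -> tuple[str, list[str]]:
    """Split a reply such as 'FM~Search engine,Portal' into (code, categories)."""
    line = line.strip()
    # String1 and String2 are separated by "~"; CL and BU arrive without String2.
    code, sep, rest = line.partition("~")
    categories = [c.strip() for c in rest.split(",")] if sep and rest else []
    return code, categories


print(parse_reply("FM~Search engine,Portal"))
print(parse_reply("CL"))
```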

Query server

There are a number of servers, and you may want to measure how long each one takes to answer and then work with the fastest one.

To make a query, use the URL google.com and add the query flag. An example request would look like this:

https://thor.url-classification.io/url.php?version=w1&guid=1707D6F1-70C8-4BB3-A721-CBB47962E01C&id=MD5&url=google.com&query=1

Servers

These are the currently deployed servers:

  • thor.url-classification.io - US server - East coast.
  • loki.url-classification.io - US server - West coast.
  • optimus.url-classification.io - UK server.
  • rodimus.url-classification.io - Australian server.
  • julius.url-classification.io - German server.

Choosing the server

The best way to choose a server is to run a query against each server and see which one answers fastest; that is normally the closest server.

In the URL you send, query for google.com and add the flag query=1 to indicate that this is a server speed test.
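One possible way to run the speed test is sketched below. The helper names are hypothetical, and probe_url_for is a caller-supplied function that returns the full test URL for a host (with query=1, a fresh GUID and your id). The timer is injectable so the ranking logic can be exercised without network access.

```python
import time
import urllib.request
from typing import Callable

SERVERS = [
    "thor.url-classification.io",
    "loki.url-classification.io",
    "optimus.url-classification.io",
    "rodimus.url-classification.io",
    "julius.url-classification.io",
]


def time_request(url: str, timeout: float = 5.0) -> float:
    # Wall-clock time for one full request/response cycle.
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def rank_servers(
    hosts: list[str],
    probe_url_for: Callable[[str], str],
    timer: Callable[[str], float] = time_request,
) -> list[str]:
    timings = []
    for host in hosts:
        try:
            timings.append((timer(probe_url_for(host)), host))
        except OSError:
            continue  # unreachable or timed out: leave it out of the ranking
    # Fastest first; later entries serve as fallbacks.
    return [host for _, host in sorted(timings)]
```

Keeping the whole ranked list, rather than only the winner, gives a ready-made fallback order.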

Server's uptime

Our servers go down for maintenance once a week for ten minutes (not in parallel). To handle this, when you can't connect to a server, connect to the next server on the list produced by the closest-server test. After ten minutes you can revert to the original server.
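The failover rule could be sketched as follows. The class and method names are hypothetical, and the host list is assumed to be already ordered fastest first. The clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable

MAINTENANCE_SECONDS = 600  # ten minutes, as described above


class ServerPool:
    def __init__(self, ranked_hosts: list[str], clock: Callable[[], float] = time.monotonic):
        self._hosts = list(ranked_hosts)
        self._down_until: dict[str, float] = {}
        self._clock = clock

    def mark_down(self, host: str) -> None:
        # Call this when a connection to the host fails.
        self._down_until[host] = self._clock() + MAINTENANCE_SECONDS

    def current(self) -> str:
        now = self._clock()
        for host in self._hosts:
            if self._down_until.get(host, float("-inf")) <= now:
                return host
        # Every host is marked down: fall back to the preferred one.
        return self._hosts[0]
```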

How to work with the results

There are a number of ways to work with the classification server. The easiest is to query every URL, but it is not the most efficient: each query adds a delay, and you may not need to query every URL.

FM

FM means that the classification is the same across the entire domain. For example, if you got FM for www.url-classification.io, then it doesn't matter what URI comes after www.url-classification.io (www.url-classification.io/main, www.url-classification.io/test); the classification will be the same. When you get FM you should cache the result for a period of time (an hour is suggested). FM applies to the exact domain only, so ads.url-classification.io and url-classification.io should still be queried even if you got FM for www.url-classification.io.

FR

FR means that this domain might have a different classification per URI, and you should cache only that complete address (including the flags, so if the flags change you need to query the server again). This is mostly for search engines and sites that have porn sections.

CL

CL means the site is new and the engine is still classifying it, so you should check again later; we recommend retrying every 1 second. This flag is only valid when you use protocol w1 and w11.
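A retry loop for CL might look like the sketch below. The function name and the max_attempts cap are assumptions. fetch is a caller-supplied function that sends a new request (with a new GUID each time) and returns the raw reply line.

```python
import time
from typing import Callable


def poll_until_classified(
    fetch: Callable[[], str],
    max_attempts: int = 10,   # assumption: the page does not give a limit
    delay: float = 1.0,       # the recommended 1 second between tries
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    reply = fetch()
    for _ in range(max_attempts - 1):
        if reply.strip() != "CL":
            break
        sleep(delay)
        reply = fetch()
    # May still be "CL" if the attempt budget ran out.
    return reply
```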

BU

BU means that the URL is malformed.

FS

Found a sub URL. All URLs under that sub URL (excluding the html part) are considered to have the same classification. If you received FR or FM for a site you would never receive FS.

FP

Found a URL that can change content based on the URL parameters. If you received FR or FM for a site you would never receive FS.
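The caching rules described in this section could be turned into a cache-key helper along these lines. This is a sketch under assumptions: the function name is hypothetical, URLs are passed without a scheme, and "excluding the html part" is read as dropping the last path segment. Only the FM lifetime is suggested on this page; lifetimes for the other codes are left to the caller.

```python
from typing import Optional

FM_TTL_SECONDS = 3600  # an hour is suggested for FM results


def cache_key(code: str, url: str) -> Optional[str]:
    """Return the key a result should be cached under, or None to skip caching."""
    if code == "FM":
        # Whole exact host shares the classification.
        return url.split("/", 1)[0].split("?", 1)[0]
    if code == "FS":
        # Everything under the same sub URL shares the classification.
        path = url.split("?", 1)[0]
        if "/" not in path:
            return path
        return path.rsplit("/", 1)[0] + "/"
    if code in ("FR", "FP"):
        # Only this complete address, including its parameters.
        return url
    return None  # NF, CL, BU: nothing to cache
```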

SSL

The servers support TLS 1.2 and TLS 1.3.

Port 80

The servers will reply to plain HTTP requests. These connections are not secure and should only be used for server-to-server communication and with data that does not contain any form of PII.

For avoidance of doubt, you should not use port 80 unless you know what you are doing.

Extra options

Unicode domains

For Unicode domains you should send the Punycode of that domain.
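As an illustration, Python's built-in idna codec can produce the Punycode form. Note that this codec implements IDNA 2003; whether the server expects that or a newer IDNA variant is not stated here, so treat it as an assumption. The function name is illustrative.

```python
def to_punycode(domain: str) -> str:
    # Encodes each label; ASCII-only labels pass through unchanged.
    return domain.encode("idna").decode("ascii")


print(to_punycode("bücher.example"))  # xn--bcher-kva.example
```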

IAB results

You can add a parameter &iab=1 or &iab=2 to receive the results in IAB format.

Google search

  • When using protocol version w1/w11/w2/w21, the server will classify the search phrase.
  • When using protocol version w3/w31/w4/w41, the server will try to fetch the search page and classify its content. Since Google blocks bots, the server will eventually be blocked and the result will be "Site under construction". If that happens, revert to protocol w1/w11/w2/w21.

HTTP/S Protocol for per call pricing

This protocol is used for deals that have X amount of queries per month or other timeframe. A reply can be immediate, or it can take a couple of seconds in case the server dynamically classifies a domain or a URL.

Request

Create a web request to the server in the form of:

https://app.url-classification.io/api.php?token=demo&apitype=geturlclassification&domain=

An example actual request would be:

https://app.url-classification.io/api.php?token=demo&apitype=geturlclassification&domain=ebay.com

Token

We provide the value for the token parameter.

Domain

Domain is the domain to get the URL classification for. It should not include any / or a leading http:// or https://

URL Based request

For URL based classification, send the full URL including the http:// or https:// prefix. If it contains & or ?, make sure to URL-encode it.

One URL based query counts as two domain queries.
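A request URL for this API could be assembled as sketched below. The function name is illustrative, and passing a full URL through the same domain parameter is an assumption, since this page does not name a separate parameter for URL based requests.

```python
from urllib.parse import urlencode

API_BASE = "https://app.url-classification.io/api.php"


def build_classification_url(token: str, target: str) -> str:
    # urlencode percent-encodes &, ?, : and / inside the target value.
    query = urlencode(
        {"token": token, "apitype": "geturlclassification", "domain": target}
    )
    return f"{API_BASE}?{query}"


print(build_classification_url("demo", "ebay.com"))
print(build_classification_url("demo", "https://example.com/?a=1&b=2"))
```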

Reply

The reply is a JSON object that contains both the IDs and the names of the classification.

A reply for the previous example for ebay.com would be:

{"database":0,"category1":"67","category2":0,"category3":0,"category4":0,"categorytext1":"Shopping","categorytext2":"","categorytext3":"","categorytext4":"","responsecode":0}

Possible errors

Bad URL

In case the URL is bad or something else is incorrect, you will get the following reply:

{"noclassification":1,"responsecode":0}

Bad token

In case the token doesn't exist, you will get the following reply:

{"responsecode":2}

Out of credits

If the user is out of credits, you will get the following reply:

{"responsecode":12}

Extra options

Unicode domains

For Unicode domains you should send the Punycode of that domain.

IAB results

You can add a parameter &iab=1 or &iab=2 to receive the results in IAB format.

Checking for available credits

You can check available credits by sending:

https://app.url-classification.io/api.php?token=demo&apitype=credits

Make sure to replace the demo token with your token.

The reply will be:

{"responsecode":0,"credits":1000}

Credits check does not change the available credits.
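A small sketch of the credits check follows. The helper names are hypothetical, and returning None for any reply without a zero response code is a design choice made here rather than documented behaviour.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def credits_url(token: str) -> str:
    query = urlencode({"token": token, "apitype": "credits"})
    return f"https://app.url-classification.io/api.php?{query}"


def parse_credits(body: str) -> Optional[int]:
    data = json.loads(body)
    if data.get("responsecode") != 0:
        return None  # for example, a bad token
    return data.get("credits")


print(credits_url("demo"))
print(parse_credits('{"responsecode":0,"credits":1000}'))  # 1000
```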