Website classification protocol
Latest revision as of 15:01, 18 April 2022
HTTP/S Protocol
This protocol is used for deals that are priced per end user.
Request
Create a web request to the server in the form of:
https://thor.url-classification.io/url.php?version=&guid=&id=&url=
Mandatory version number:
- Domain level categorization - Can be w1, w11, w2, or w21:
- w1 - You will get the category name. If the site is unknown, the server replies that it is unknown and you should query it again later.
- w11 - Same as w1, but if the site is unknown the server waits until there is an answer before sending the reply.
- w2 - You will get the category ID.
- w21 - Same as w2, but if the site is unknown the server waits until there is an answer before sending the reply.
- Page level categorization - Can be w3, w31, w4, or w41:
- w3 - You will get the category name. If the page is unknown, the server replies that it is unknown and you should query it again later.
- w31 - Same as w3, but if the page is unknown the server waits until there is an answer before sending the reply.
- w4 - You will get the category ID.
- w41 - Same as w4, but if the page is unknown the server waits until there is an answer before sending the reply.
- With page level categorization, if the main domain is not safe, you will receive the domain category and the page will not be classified.
- Domain and sub URL categorization (for big sites, like Craigslist or Huffington Post) - Can be w5, w51, w6, or w61:
- w5 - You will get the category name. If the site is unknown, the server replies that it is unknown and you should query it again later.
- w51 - Same as w5, but if the site is unknown the server waits until there is an answer before sending the reply.
- w6 - You will get the category ID.
- w61 - Same as w6, but if the site is unknown the server waits until there is an answer before sending the reply.
Other mandatory values:
- guid - Randomly generated upper case GUID, for example: 1707D6F1-70C8-4BB3-A721-CBB47962E01C (you must create a new one per request because the server remembers them and will not accept the same GUID twice).
- id - MD5 of a formula given to clients only.
Data to classify, can only use one parameter per request:
- url - The URL to inspect, sent without the http:// prefix.
- url64 - If the URL contains characters that can't be sent as plain text (?, #, or &), use url64 instead of url and pass the base64 encoding of the URL to classify.
- keyword64 - A keyword phrase to classify, base64 encoded (you can specify a number of keywords, delimited by ~).
- multiple64 - Multiple URLs to classify. The URLs are delimited by " , " (space comma space) and the whole string is base64 encoded.
An example actual request would be:
https://thor.url-classification.io/url.php?version=w1&guid=1707D6F1-70C8-4BB3-A721-CBB47962E01C&id=MD5&url=google.com
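The request format above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the function name `build_request` is hypothetical, the `id` value is passed in as an opaque string because the formula behind it is given to clients only, and stripping `https://` as well as `http://` is an assumption.

```python
import uuid
from urllib.parse import urlencode

BASE_URL = "https://thor.url-classification.io/url.php"


def build_request(version: str, client_id: str, url: str) -> str:
    """Return a request URL that uses the plain `url` parameter."""
    # A fresh upper-case GUID per request, since the server rejects repeats.
    guid = str(uuid.uuid4()).upper()
    # The `url` parameter is sent without the scheme prefix
    # (removing https:// as well is an assumption).
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    query = urlencode(
        {"version": version, "guid": guid, "id": client_id, "url": url},
        safe="/",  # keep path separators readable
    )
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    print(build_request("w1", "MD5", "http://google.com"))
```

URLs that contain ?, # or & should go through url64 rather than this helper.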
URL Flags
When the URL contains flags (query parameters), for example:
http://www.somesite.com/?flag1=data1&flag2=data2
You must use the url64 parameter and encode the URL using base64. For the site in the example, the request will be:
https://thor.url-classification.io/url.php?version=&guid=&id=&url64=aHR0cDovL3d3dy5zb21lc2l0ZS5jb20vP2ZsYWcxPWRhdGExJmZsYWcyPWRhdGEy
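The base64 values for url64, keyword64 and multiple64 can be produced with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, UTF-8 is assumed as the text encoding, and standard (not URL-safe) base64 is assumed because it reproduces the url64 value in the example above.

```python
import base64
from typing import Iterable


def _b64(text: str) -> str:
    """Standard base64 of the UTF-8 bytes of `text`."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def encode_url64(url: str) -> str:
    """Value for the url64 parameter."""
    return _b64(url)


def encode_keyword64(keywords: Iterable[str]) -> str:
    """Value for keyword64: keywords joined by ~, then base64 encoded."""
    return _b64("~".join(keywords))


def encode_multiple64(urls: Iterable[str]) -> str:
    """Value for multiple64: URLs joined by ' , ', then base64 encoded."""
    return _b64(" , ".join(urls))


if __name__ == "__main__":
    print(encode_url64("http://www.somesite.com/?flag1=data1&flag2=data2"))
```

Standard base64 output can contain +, / and =, so percent-encoding the value before placing it in the query string may be necessary; whether the server expects that is not stated here.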
Reply
The reply is a single line composed of two strings in the form:
String1~String2
String1 is the index of the result which can be:
- FM - Found master, URL found and the site has the same classification in all its pages.
- FR - Found regular, URL found, but the classification is specific to this URL only. You should query the other URLs for this domain.
- NF - Not found, the server doesn't know what this URL is.
- CL - Check later. Returned only for w1/w2; it means the server is scanning the site in real time and you need to check again later.
For protocols w5, w51, w6 and w61, String1 can also be:
- FS - Found sub URL master; everything under this sub URL has the same classification. For example, if you get FS with the URL "www.huffingtonpost.com/celebrity/", it applies to this URL and to any URL in the form of "www.huffingtonpost.com/celebrity/somehtml.html", but checking "www.huffingtonpost.com/books/" is a separate check.
- FP - Found sub URL where a difference in parameters matters. For example, if you get FP with "forums.craigslist.org/?forumID=1204&areaID=372", you would also need to query "forums.craigslist.org/?forumID=96&areaID=372", since the URL parameters (after the ?) have changed.
String2 is delimited by "," and contains the category ID or names (depending on the protocol).
An example reply for version w1 for yahoo.com would be:
FM~Search engine,Portal
An example reply for a new site for w1 would be:
CL
Bad URL
In case the URL was bad, you will get a single string reply: BU
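The reply format described above (String1~String2, or a bare code such as CL or BU) is simple to split. A minimal Python sketch follows; the `Reply` type and `parse_reply` name are hypothetical, not part of the protocol.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reply:
    status: str                                  # FM, FR, NF, CL, FS, FP or BU
    categories: List[str] = field(default_factory=list)


def parse_reply(line: str) -> Reply:
    """Split a reply line into its status and category list."""
    status, _, rest = line.strip().partition("~")
    # String2 is comma-delimited; bare replies such as CL or BU have none.
    categories = [item for item in rest.split(",") if item]
    return Reply(status, categories)


if __name__ == "__main__":
    print(parse_reply("FM~Search engine,Portal"))
```

For the category-ID protocol versions the entries in `categories` are IDs kept as strings.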
Query server
There are a number of servers, and you might want to query each of them, measure the time it takes to reach it, and work with the fastest one.
To make a query, use the URL google.com and add a query flag. An example request would look like this:
https://thor.url-classification.io/url.php?version=w1&guid=1707D6F1-70C8-4BB3-A721-CBB47962E01C&id=MD5&url=google.com&query=1
Servers
These are the current deployed servers:
- thor.url-classification.io - US server - East coast.
- loki.url-classification.io - US server - West coast.
- optimus.url-classification.io - UK server.
- rodimus.url-classification.io - Australian server.
- julius.url-classification.io - German server.
Choosing the server
The best way to choose a server is to run a query against each server and see which one is fastest, which means it is the closest server.
In the URL you send, query for google.com and add the flag query=1 to indicate that this is a server speed test.
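The speed test can be sketched in Python as follows. This is an illustration under assumptions: `time_request` and `rank_servers` are hypothetical names, the five-second timeout is arbitrary, and the caller supplies a `measure` function that builds the speed-test URL (google.com with query=1, a fresh GUID and the client id) for a given host.

```python
import time
import urllib.request
from typing import Callable, Iterable, List, Optional

SERVERS = [
    "thor.url-classification.io",
    "loki.url-classification.io",
    "optimus.url-classification.io",
    "rodimus.url-classification.io",
    "julius.url-classification.io",
]


def time_request(url: str, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds one request took, or None if it failed."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:  # covers URLError, timeouts and refused connections
        return None
    return time.monotonic() - start


def rank_servers(
    hosts: Iterable[str],
    measure: Callable[[str], Optional[float]],
) -> List[str]:
    """Return the reachable hosts ordered from fastest to slowest."""
    timed = []
    for host in hosts:
        elapsed = measure(host)
        if elapsed is not None:
            timed.append((elapsed, host))
    return [host for _, host in sorted(timed)]
```

Passing the measurement in as a function keeps the ranking logic testable without network access; in production `measure` would wrap `time_request`.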
Server's uptime
Our servers go down for maintenance once a week for ten minutes (not in parallel). When you can't connect to a server, connect to the next server on the list produced by the closest-server test; after ten minutes you can revert to the original server.
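The failover rule above can be sketched as a small helper that skips a server for ten minutes after a failed connection. The class name `ServerPicker` and its methods are hypothetical; the injectable clock exists only to make the behaviour easy to test.

```python
import time
from typing import Callable, Dict, Iterable


class ServerPicker:
    """Prefer the fastest server, skipping ones recently marked as down."""

    def __init__(
        self,
        ranked_hosts: Iterable[str],
        cooldown: float = 600.0,  # ten minutes, matching the maintenance window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ranked = list(ranked_hosts)  # fastest first
        self.cooldown = cooldown
        self.clock = clock
        self.down_until: Dict[str, float] = {}

    def mark_down(self, host: str) -> None:
        """Record a failed connection; the host is skipped until cooldown ends."""
        self.down_until[host] = self.clock() + self.cooldown

    def current(self) -> str:
        """Return the best host that is not in its cooldown period."""
        now = self.clock()
        for host in self.ranked:
            until = self.down_until.get(host)
            if until is None or until <= now:
                return host
        # Every host is marked down: fall back to the preferred one.
        return self.ranked[0]
```

Call `mark_down` when a connection attempt fails and `current` before each request.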
How to work with the results
There are a number of ways to work with the classification server. The easiest is to query every URL, but it is not the most efficient: each query adds a delay, and you might not need to query every URL.
FM
FM means that this domain has the same classification across the entire domain. For example, if you got FM for www.url-classification.io, then it doesn't matter what URI comes after www.url-classification.io (www.url-classification.io/main, www.url-classification.io/test); the classification will be the same. When you get FM, you should cache that result for a period of time (an hour is suggested). The FM applies to an exact domain, so ads.url-classification.io and url-classification.io should still be queried even if you got FM for www.url-classification.io.
FR
FR means that this domain might have a different classification per URI, and you should only cache that complete address (including the flags, so if the flags change you need to query the server again). This is mostly for search engines and sites that have porn segments.
CL
CL means the site is new and the engine is classifying it. You should check again later; we recommend retrying every 1 second. This flag is only valid when you use protocol w1 and w11.
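The one-second retry for CL can be sketched as a polling loop. This is an illustration only: `wait_for_classification` is a hypothetical name, `query` stands for whatever function sends the request and returns the raw reply line, and the cap of 30 attempts is an assumption, not a documented limit.

```python
import time
from typing import Callable


def wait_for_classification(
    query: Callable[[], str],
    interval: float = 1.0,   # recommended retry interval
    max_attempts: int = 30,  # assumed cap so the loop always ends
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `query` until the reply is no longer CL; return the last reply."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    reply = ""
    for attempt in range(max_attempts):
        reply = query()
        if reply.split("~", 1)[0].strip() != "CL":
            return reply
        if attempt < max_attempts - 1:
            sleep(interval)
    return reply  # still CL after all attempts
```

Each call to `query` should build a new request, since every request needs a new GUID.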
BU
BU means that the URL is malformed.
FS
Found a sub URL. All sub URLs (excluding the html part) are considered to be under the same classification. If you received FR or FM for a site, you will never receive FS.
FP
Found a URL that can change content based on the URL parameters. If you received FR or FM for a site you would never receive FS.
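The caching rules for FM, FR, FS and FP can be summarised as a choice of cache key. The sketch below is one possible reading, not a prescribed implementation: the function name, the key tuples, and the treatment of the "html part" as the last path segment are assumptions.

```python
from typing import Optional, Tuple


def cache_key(status: str, url: str) -> Optional[Tuple[str, str]]:
    """Return a (scope, key) pair to cache a reply under, or None to skip."""
    # Work on the URL without its scheme.
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    without_query = url.split("?", 1)[0].split("#", 1)[0]
    host = without_query.split("/", 1)[0]

    if status == "FM":
        # Same classification for the exact host, whatever the URI.
        return ("host", host)
    if status == "FS":
        # Same classification for the sub URL; drop the final segment
        # (assumed to be the "html part").
        directory = without_query.rsplit("/", 1)[0] if "/" in without_query else host
        return ("sub", directory + "/")
    if status in ("FR", "FP"):
        # Only this exact address, including its parameters.
        return ("url", url)
    return None  # NF, CL, BU: nothing worth caching under these rules
```

The scope label keeps a host-wide FM entry from colliding with an FR entry for the bare domain. The suggested one-hour lifetime is stated for FM only; using the same lifetime for the other scopes would be an assumption.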
SSL
The servers support TLS 1.2 and TLS 1.3.
Port 80
The servers will reply to plain HTTP requests. These connections are not secure and should only be used for server-to-server communication and with data that does not contain any form of PII.
For avoidance of doubt, you should not use port 80 unless you know what you are doing.
Extra options
Unicode domains
For Unicode domains you should send the Punycode of that domain.
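In Python, the Punycode form of a domain can be obtained with the built-in `idna` codec. This is a minimal sketch; the function name is hypothetical, and note that the standard-library codec follows the older IDNA 2003 rules, which may differ from newer IDNA 2008 behaviour for a few characters.

```python
def to_punycode(domain: str) -> str:
    """Convert a Unicode domain to its ASCII (Punycode) form."""
    # Each label is converted separately; ASCII labels pass through unchanged.
    return domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(to_punycode("b\u00fccher.example"))  # the domain "bücher.example"
```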
IAB results
You can add a parameter &iab=1 or &iab=2 to receive the results in IAB format.
Google search
- When using protocol version w1/w11/w2/w21 the server will classify the search phrase.
- When using protocol version w3/w31/w4/w41 the server will try to fetch the search page and classify the content. Since Google blocks bots, the server will eventually be blocked and the result will be Site under construction. If that happens, revert the protocol to w1/w11/w2/w21.
HTTP/S Protocol for per call pricing
This protocol is used for deals that have X amount of queries per month or other timeframe. A reply can be immediate, or it can take a couple of seconds in case the server dynamically classifies a domain or a URL.
Request
Create a web request to the server in the form of:
https://app.url-classification.io/api.php?token=demo&apitype=geturlclassification&domain=
An example actual request would be:
https://app.url-classification.io/api.php?token=demo&apitype=geturlclassification&domain=ebay.com
Token
We provide the token parameter
Domain
Domain is the domain to get the URL classification for. It should not include any / or a leading http:// or https://
URL Based request
For URL based classification, send the full URL including the http:// or https:// prefix, and make sure to URL-encode it if it contains & or ?.
One URL Based query is considered as two domain queries.
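Building the request for either a domain or a full URL can be sketched as follows. The function name is hypothetical, and percent-encoding every reserved character (not only & and ?) is an assumption that is normally harmless, since servers decode the parameter before use.

```python
from urllib.parse import quote

API_URL = "https://app.url-classification.io/api.php"


def build_classification_request(token: str, target: str) -> str:
    """Return the request URL for a domain or a full URL."""
    # quote(..., safe="") percent-encodes &, ?, :, / and other reserved
    # characters so the target cannot be confused with the API's own query.
    return (
        f"{API_URL}?token={quote(token, safe='')}"
        f"&apitype=geturlclassification&domain={quote(target, safe='')}"
    )


if __name__ == "__main__":
    print(build_classification_request("demo", "ebay.com"))
    print(build_classification_request(
        "demo", "http://www.somesite.com/?flag1=data1&flag2=data2"))
```

A plain domain passes through unchanged, so the same helper serves both request types.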
Reply
Reply will be a JSON that contains both the ID and strings of the classification.
The reply for the previous example for ebay.com would be:
{"database":0,"category1":"67","category2":0,"category3":0,"category4":0,"categorytext1":"Shopping","categorytext2":"","categorytext3":"","categorytext4":"","responsecode":0}
Possible errors
Bad URL
In case the URL was bad or something was incorrect, you will get the following reply:
{"noclassification":1,"responsecode":0}
Bad token
In case the token doesn't exist, you will get the following reply:
{"responsecode":2}
Out of credits
If the user is out of credits, you will get the following reply:
{"responsecode":12}
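The JSON replies and error cases above can be handled in one place. The sketch below is an assumption-laden illustration: the function name and the returned dictionary shape are hypothetical, and treating a category value of 0 as an empty slot is inferred from the ebay.com example rather than stated.

```python
import json
from typing import Any, Dict


def interpret_reply(body: str) -> Dict[str, Any]:
    """Turn a JSON reply into a status plus a list of (id, name) pairs."""
    data = json.loads(body)
    code = data.get("responsecode")
    if code == 2:
        return {"status": "bad_token"}
    if code == 12:
        return {"status": "out_of_credits"}
    if data.get("noclassification") == 1:
        return {"status": "no_classification"}
    if code != 0:
        # Any response code not described in this document.
        return {"status": "unknown", "responsecode": code}

    categories = []
    for slot in range(1, 5):
        category_id = data.get(f"category{slot}")
        if category_id in (0, "0", "", None):
            continue  # assumed to mean the slot is unused
        categories.append((str(category_id), data.get(f"categorytext{slot}", "")))
    return {"status": "ok", "categories": categories}
```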
Extra options
Unicode domains
For Unicode domains you should send the Punycode of that domain.
IAB results
You can add a parameter &iab=1 or &iab=2 to receive the results in IAB format.
Checking for available credits
You can check available credits by sending:
https://app.url-classification.io/api.php?token=demo&apitype=credits
Make sure to replace the demo token with your token.
The reply will be:
{"responsecode":0,"credits":1000}
Credits check does not change the available credits.