With most race results now published on the web, I set up processes to import data automatically into Canterbury Harriers’ results system.
Recently, however, I was getting a “403 Forbidden” response from one such site, which had evidently decided to cut down on “scraping”. This was particularly irritating as our system was only making one result-page request at a time and would, in fact, cause less load on their results server than a human would via a web browser. (Looking at the web console in Chrome, I discovered that a browser would make an additional 73 requests for the same page, with images making up the bulk of them.)
I tried the usual techniques for getting around such blocks, such as adding headers to mimic a browser, but to no avail.
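As an illustration, such an attempt might look something like the sketch below, assuming Guzzle (used later in this post); the target URL and the header values are placeholders, the sort of thing you would copy from a real browser session:

<?php

use GuzzleHttp\Client;

// One unsuccessful approach: send browser-like headers with the request.
// The header values here are illustrative placeholders.
$client = new Client();
$response = $client->get("https://results.example.com/race/123", [
    'headers' => [
        'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            . 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36',
        'Accept'          => 'text/html,application/xhtml+xml',
        'Accept-Language' => 'en-GB,en;q=0.9',
    ],
]);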
While investigating the problem I tried some web screenshot-grabbing services to see whether they got the same response I did. Most did, except for ProxyCrawl, which got the expected page back.
Getting set up with ProxyCrawl was really easy: a simple GET request to their server, including the API token and the required URL.
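Using the URL format from the class below, a request looks like this (the token value is a placeholder, and the target URL must be URL-encoded):

https://api.proxycrawl.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fresults.example.com%2Frace%2F123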
Below is a simple class in PHP to get the contents of a URL. The CrawlService could be defined to use any PSR-18-compliant HTTP client and any PSR-3-compliant logger; I have used GuzzleHttp and Monolog, with the latter logging successes and failures from the service. ProxyCrawl does have its own dashboard showing successes and failures, but it is best to have your own log for debugging purposes. The API token is stored in a .env file (with key CRAWL_SERVICE_TOKEN, as shown after the class) so as not to be exposed in any code repositories, and is retrieved using Dotenv. The constructor of the class uses PHP 8 constructor property promotion.
<?php

namespace TelfordCodes;

use GuzzleHttp\Client;
use Monolog\Logger;

class CrawlService
{
    // ProxyCrawl endpoint: the API token and the target URL are query parameters.
    private const CRAWL_SERVICE_URL_FMT = "https://api.proxycrawl.com/?token=%s&url=%s";

    public function __construct(private Client $client, private Logger $logger)
    {
    }

    /**
     * Fetch the contents of $base_url via ProxyCrawl.
     * Returns the response body, or an empty string on failure.
     */
    public function getResponse(string $base_url): string
    {
        // The target URL is passed as a query parameter, so it must be URL-encoded.
        $enc_base_url = urlencode($base_url);
        $service_url = sprintf(
            self::CRAWL_SERVICE_URL_FMT,
            $_ENV['CRAWL_SERVICE_TOKEN'],
            $enc_base_url
        );

        try {
            $response = $this->client->get($service_url);
            $text = $response->getBody()->getContents();
            $this->logger->info("Successful request for $base_url");
        } catch (\Exception $e) {
            $text = "";
            $this->logger->error("Failed request to $base_url " . $e->getMessage());
        }

        return $text;
    }
}
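For completeness, the .env file needs just the one entry (the value shown is a placeholder for your own token):

CRAWL_SERVICE_TOKEN=your_proxycrawl_api_token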
Then the class can be used with code like the following. The ColoredLineFormatter for Monolog makes the log easier to read, visually distinguishing successes from errors.
use GuzzleHttp\Client;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Bramus\Monolog\Formatter\ColoredLineFormatter;
use TelfordCodes\CrawlService;

// Load environment variables from .env, including the ProxyCrawl API token.
(Dotenv\Dotenv::createImmutable(__DIR__))->load();

// Log to a file, with colour-coded levels so successes and failures stand out.
$client = new Client();
$logger = new Logger("CrawlService");
$handler = new StreamHandler(__DIR__ . '/log/crawler.log', Logger::DEBUG);
$handler->setFormatter(new ColoredLineFormatter());
$logger->pushHandler($handler);

$crawl_service = new CrawlService($client, $logger);

$url = "https://site-you-need-goes-here";
$resp = $crawl_service->getResponse($url);
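Since getResponse() returns an empty string on failure, it is worth checking for that before parsing the page. A minimal sketch, where import_race_results() is a hypothetical stand-in for whatever parsing your results system does:

if ($resp !== "") {
    // Hand the HTML over for parsing (import_race_results() is hypothetical).
    import_race_results($resp);
}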