Surviving the ?q= flood: how crawler traffic turns PrestaShop faceted search into a DoS risk
You wake up to a thread of customer emails asking why the site was down at 3am. The order count is normal, the error page is intermittent, and your admin dashboard looks fine. The hosting provider sends a graph that shows MySQL connections pegged at the limit and PHP-FPM workers all busy. None of the requests were from real customers.
If you run a PrestaShop store with faceted (layered) search and you have not yet been hit by this pattern, you almost certainly will be. The shape of the attack is dull, the cause is structural, and the right fix is not the one most store owners reach for first. This is a write-up of how to actually defend against it.
The symptom
Logs show thousands of requests to URLs like /category-slug?q=Brand-Brand-A/Brand-Brand-B/Color-Red, every one slightly different, coming from rotating IPs and a mix of declared user agents. Some claim to be Bingbot, some claim to be Chrome, some declare themselves as GPTBot or ClaudeBot or PerplexityBot. The category URL itself is legitimate — it is your own faceted search exposing brand and attribute filters as query-string parameters — but the volume and combinatorial diversity of the requests is not.
The PrestaShop forums have collected reports of this pattern from store owners running 1.7.x and 8.x throughout 2024 and 2025. The shape is consistent: one thread alone documents store owners watching tens of thousands of unique ?q= URLs being crawled per day, with the controller continuing to respond even after the ps_facetedsearch module is disabled.
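To put numbers on this pattern for your own store, it helps to tally the ?q= requests in the access log. The following is a minimal sketch, assuming the web server writes the common "combined" log format; the function name summarize_q_traffic and the sample lines are illustrative and not taken from a real store.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Combined log format (assumed):
# ip - - [time] "METHOD target HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize_q_traffic(lines):
    """Return (unique ?q= URLs, distinct client IPs, Counter of user agents)."""
    urls, ips, agents = set(), set(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("method") != "GET":
            continue
        target = m.group("target")
        # Keep only requests that actually carry a 'q' query parameter.
        if "q" not in parse_qs(urlsplit(target).query, keep_blank_values=True):
            continue
        urls.add(target)
        ips.add(m.group("ip"))
        agents[m.group("ua")] += 1
    return len(urls), len(ips), agents


# Made-up sample lines using documentation IP ranges.
sample = [
    '203.0.113.5 - - [10/May/2025:03:00:01 +0000] "GET /shoes?q=Brand-Brand-A/Color-Red HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '198.51.100.7 - - [10/May/2025:03:00:02 +0000] "GET /shoes?q=Brand-Brand-B HTTP/1.1" 200 5010 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [10/May/2025:03:00:03 +0000] "GET /shoes HTTP/1.1" 200 4800 "-" "GPTBot/1.0"',
]

unique_urls, distinct_ips, agents = summarize_q_traffic(sample)
print(unique_urls, distinct_ips, agents.most_common(3))
```

On a real log, pass an open file handle instead of the sample list. A high count of unique URLs spread across many IPs and declared user agents is the signature described above.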
The first thing most owners try is disabling the faceted search module. It does not help. ps_facetedsearch exposes the filter UI in the storefront, but a GET request to /category-slug?q=anything still boots PrestaShop, dispatches to the category controller, and runs through the request lifecycle. The query string is a valid request shape whether or not the module that generated those URLs is currently enabled. Disabling the module hides the filter sidebar from human visitors; it does not prevent automated clients from continuing to hit URLs they already discovered.
Why the volume is growing
The combinatorial nature of faceted-search URLs has been around since the module shipped. What is new is the scale of automated traffic that finds and crawls them.
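The arithmetic behind that combinatorial growth is worth seeing once. The sketch below assumes that any subset of values can be selected within each facet and that every distinct selection maps to exactly one URL; sort order and pagination parameters, which multiply the space further, are ignored. The function name and the facet sizes are made up for illustration.

```python
from math import prod


def facet_url_count(values_per_facet):
    """Count distinct non-empty filter selections.

    Each value of each facet is independently on or off, so a facet with
    n values contributes 2**n states; subtract 1 for "nothing selected".
    """
    return prod(2 ** n for n in values_per_facet) - 1


# 5 brands, 4 colours, 3 sizes on a single category page
print(facet_url_count([5, 4, 3]))  # prints 4095
```

Even a small category with twelve filter values in total exposes thousands of crawlable URLs under these assumptions, and each additional value doubles the count.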
Cloudflare's July 2025 analysis of crawler traffic across its network reported:
- Crawler traffic up 18% year over year, with peaks of 32% in April 2025.
- Googlebot: +96% growth, expanding from ~30% to ~50% of all crawl traffic.
- GPTBot: +305% growth, 2.2% to 7.7% market share.
- PerplexityBot: +157,490% (essentially zero to meaningful).
- ClaudeBot: -46% in requests after early-2025 changes.
- Bytespider: -85% drop.
The composition shifts month to month but the trend does not: more crawlers, hitting more URLs, more often. A category page that generates a few hundred filter combinations becomes a target for everything from Googlebot building structured-data indexes to AI training scrapers harvesting product attributes. Cloudflare itself began blocking AI training bots by default on new zones in mid-2025, which tells you how the platform sees the trajectory.
Layered on top of that legitimate crawler traffic is a much smaller but more aggressive layer of hostile automation: scrapers, competitive intelligence bots, and stress-test traffic. These do not respect robots.txt, rotate IPs aggressively, and forge legitimate user-agent strings. They are the ones that take the site down.
Why the obvious fixes do not work
Most of the advice on community forums falls into one of five buckets. None of them are sufficient on their own.
Disabling the faceted-search module. Removes the UI, leaves the URL surface. The category controller still answers ?q= requests.
Adding Disallow: /*?q= to robots.txt. Googlebot largely honors it. Bingbot honors it inconsistently — there are public reports of it crawling disallowed query patterns anyway. AI crawlers respect robots.txt at very different rates depending on the operator. Hostile scrapers ignore it entirely. Worth doing, never sufficient.
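One detail to check when writing that rule: a pattern such as /*?q= only matches when q is the first query parameter. The sketch below is a small illustrative matcher (not a full robots.txt parser) that follows the wildcard rules in RFC 9309, where * matches any run of characters and a trailing $ anchors the end; the function name and sample URLs are assumptions.

```python
import re


def robots_pattern_matches(pattern, path_and_query):
    """Return True if a robots.txt path pattern matches the given URL path.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the URL, and everything else is matched literally from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path_and_query) is not None


print(robots_pattern_matches("/*?q=", "/shoes?q=Brand-Brand-A"))              # True
print(robots_pattern_matches("/*?q=", "/shoes?order=price&q=Brand-Brand-A"))  # False
print(robots_pattern_matches("/*&q=", "/shoes?order=price&q=Brand-Brand-A"))  # True
```

If q can appear after another parameter in your URLs, a second line, Disallow: /*&q=, covers that case for crawlers that honor wildcards.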
Blanket .htaccess block on any URL containing ?q=. This catches the bots, but it also breaks your own AJAX filter requests (the same URL pattern your storefront uses to fetch filter results without a page reload) and breaks every bookmarked filter URL real customers have saved or shared. SEO damage compounds if Google had indexed any of those filter pages as canonical for long-tail queries.
Blocking specific bot user-agents. Whack-a-mole. New bots appear weekly, hostile bots forge UA strings, and the legitimate AI referrers (ChatGPT, Perplexity, Copilot) that drive a growing share of inbound clicks would also be blocked. The crawl-to-click ratio is still poor for AI bots, but it is not zero — Cloudflare's own data shows referrals exist.
Full-page caching of ?q= URLs. The cache hit rate is near zero because each filter combination is a unique URL. You cache 50,000 pages and serve each one once. The cache becomes a write-amplification problem instead of a read shortcut.
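The cache argument can be checked with a toy model. The sketch below assumes an idealized cache that never evicts or expires anything, which is the best case for hit rate; the function name and the two request streams are illustrative.

```python
def cache_hit_rate(requested_urls):
    """Hit rate of an unbounded, never-expiring cache over a request list.

    The first request for a URL is a miss; every repeat is a hit.
    """
    seen = set()
    hits = 0
    for url in requested_urls:
        if url in seen:
            hits += 1
        else:
            seen.add(url)
    return hits / len(requested_urls) if requested_urls else 0.0


# A crawler walking 50,000 distinct filter combinations, each exactly once.
crawler = [f"/shoes?q=Combo-{i}" for i in range(50_000)]
# Shoppers revisiting the same two pages.
shoppers = ["/shoes"] * 900 + ["/shoes?q=Color-Red"] * 100

print(cache_hit_rate(crawler))   # 0.0
print(cache_hit_rate(shoppers))  # 0.998
```

Even with unlimited storage, the crawler stream never produces a hit, while ordinary repeat traffic caches almost perfectly. A real cache with eviction and TTLs can only do worse on the crawler stream.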
The remaining option — and the one the rest of this post is about — is to put the decision of who gets to hit those URLs in front of PrestaShop entirely, where it can be made cheaply and without booting PHP.
The right fix: Cloudflare Managed Challenge on the faceted-search shape
Cloudflare's Managed Challenge is the right tool for this problem for three reasons. It is non-interactive most of the time (humans pass it transparently), it does not require you to name specific bots in a blocklist, and it lets you keep your storefront's own AJAX filter calls working by matching on a request shape that automation cannot trivially mimic.
The pattern is two custom rules (paid Bot Management) or one (free tier), evaluated against any GET that contains the ?q= query parameter.
If you have Bot Management (paid)
Rule 1 — skip verified bots:
cf.client.bot
Action: Skip remaining custom rules.
This lets Googlebot, Bingbot, and any other Cloudflare-verified crawler through unchallenged. Cloudflare validates verified bots through signals such as reverse DNS and the operators' published IP ranges, not user-agent strings, so a hostile scraper forging "Mozilla/5.0 (compatible; Googlebot/2.1)" will not match.
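To see why a forged user agent fails this kind of check, here is a sketch of forward-confirmed reverse DNS, the general technique that crawler operators such as Google document for verifying their bots. It is not Cloudflare's implementation, only an illustration of the idea. The function name and the injectable lookup parameters are assumptions; the default lookups use the standard library and handle IPv4 only.

```python
import socket


def is_forward_confirmed(ip, allowed_suffixes,
                         reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check.

    The PTR name for the IP must end in one of the allowed domains AND
    that name must resolve back to the same IP. A user-agent string plays
    no part in the decision.
    """
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Lookup failures (no PTR record, NXDOMAIN, timeout) count as unverified.
        return False


# Live usage would hit DNS, for example:
# is_forward_confirmed("66.249.66.1", [".googlebot.com", ".google.com"])
```

The lookups are injectable so the logic can be exercised without network access. In production you would cache results, since two DNS queries per request are expensive.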
Rule 2 — challenge likely automation on the faceted-search shape:
http.request.method eq "GET"
and len(http.request.uri.args["q"]) >= 0
and not cf.client.bot
and not any(lower(http.request.headers["x-requested-with"][*])[*] eq "xmlhttprequest")
and not starts_with(http.request.uri.path, "/module/")
and not starts_with(http.request.uri.path, "/api/")
and cf.bot_management.score gt 1
and cf.bot_management.score lt 30
Action: Managed Challenge.
What this is saying, in plain English: a GET request that carries a q query parameter, is not a verified bot, is not the storefront's own AJAX filter call (which sends X-Requested-With: XMLHttpRequest), is not a module or API endpoint, and lands in the Bot Management score range 2 to 29, which means "probably automation, not yet definitely": challenge it. Humans usually pass without interaction, or with a single click at worst; automation gets stopped at the edge.
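Before deploying, it can help to replay a few representative requests through an offline model of Rule 2. The sketch below mirrors the rule's conditions in Python; it is a reasoning aid under stated assumptions, not something Cloudflare runs. The function name, the argument shapes, and the sample values are all hypothetical.

```python
def should_challenge(method, path, query_args, headers, verified_bot, bot_score):
    """Offline model of Rule 2; True means the request would be challenged.

    query_args: dict mapping parameter name -> list of values
    headers:    dict with lower-cased header names
    """
    is_ajax = headers.get("x-requested-with", "").lower() == "xmlhttprequest"
    return (
        method == "GET"
        and "q" in query_args
        and not verified_bot
        and not is_ajax
        and not path.startswith(("/module/", "/api/"))
        and 1 < bot_score < 30
    )


q = {"q": ["Brand-Brand-A/Color-Red"]}
print(should_challenge("GET", "/shoes", q, {}, False, 12))   # True: likely automation
print(should_challenge("GET", "/shoes", q,
                       {"x-requested-with": "XMLHttpRequest"}, False, 12))  # False: storefront AJAX
print(should_challenge("GET", "/shoes", q, {}, True, 12))    # False: verified bot
print(should_challenge("GET", "/shoes", q, {}, False, 85))   # False: likely human
```

Note that a score of exactly 1 falls outside this rule by design, which is why a separate, stricter rule for definitively automated traffic is an option.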
If you want to be stricter on definitively-automated traffic, add this rule before Rule 2:
http.request.method eq "GET"
and len(http.request.uri.args["q"]) >= 0
and not cf.client.bot
and cf.bot_management.score eq 1
Action: Block or Managed Challenge, depending on appetite. Score 1 is Cloudflare's "this is automation, full stop" rating; blocking outright is defensible.
If you do not have Bot Management (free or Pro plan)
There is no equivalent bot score on free plans. The old cf.threat_score field that older guides recommend is now permanently set to 0 — Cloudflare deprecated it. Do not waste time on rules that reference it.
Without a bot score, you fall back to matching on request shape alone. Start conservative:
http.request.method eq "GET"
and len(http.request.uri.args["q"]) >= 0
and not cf.client.bot
and not any(lower(http.request.headers["x-requested-with"][*])[*] eq "xmlhttprequest")
and not starts_with(http.request.uri.path, "/module/")
and not starts_with(http.request.uri.path, "/api/")
Action: Managed Challenge.
This challenges every non-AJAX, non-verified-bot GET that includes a q parameter. Real customers loading a bookmarked filter URL see a brief Managed Challenge and pass it. Bookmarked links keep working. Crawler traffic gets cut off at the edge. The cost is one extra round-trip for human visitors who land on a faceted URL via a saved bookmark or shared link.
Important caveats before you turn any of this on
Test in log-only / Security Events first. Set the rule action to "Log" for at least 24–48 hours and watch Cloudflare's Security Events dashboard. You are looking for two things: that the rule is firing on the traffic you expect (a lot of it), and that it is not firing on legitimate paths you forgot about (sitemap fetchers, analytics tools, a custom module that for some reason uses ?q= in its own URLs). Only flip to Managed Challenge after you have confirmed the matcher is clean.
The X-Requested-With: XMLHttpRequest exception is a UX preservation, not a security boundary. Any HTTP client can send that header. An attacker who realizes the rule exists can defeat it by adding the header to their scraper. The reason it is in the rule is to let your own storefront's filter AJAX through without a challenge — not to keep determined automation out. The cf.client.bot and bot-score checks are what actually stops automation; the AJAX exception just keeps real users on the same page when they click filters.
Make sure every DNS record is proxied (orange-cloud). A single grey-cloud A record (mail, dev, admin subdomain) leaks the origin IP. Anyone who can find the origin can bypass Cloudflare entirely and hit PrestaShop directly. Audit your DNS panel before you trust the WAF.
Restrict origin to Cloudflare IPs. Even with everything proxied, hard-block the origin at the firewall to accept HTTP/HTTPS only from Cloudflare's published IP ranges. Otherwise an attacker who learns the origin IP from old DNS history can hit it directly.
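The allow-list logic itself is simple, and a sketch makes it easy to test before you touch a real firewall. The example below assumes you feed it Cloudflare's current published ranges; the two CIDR blocks shown are a small snapshot for illustration only and must be refreshed from Cloudflare's published list. The function and constant names are hypothetical.

```python
import ipaddress

# Illustrative snapshot of two Cloudflare IPv4 ranges. Treat this as a
# placeholder: load the full, current list before relying on it.
SAMPLE_CLOUDFLARE_RANGES = ["173.245.48.0/20", "104.16.0.0/13"]


def is_allowed_source(ip, ranges):
    """Return True if the connecting IP falls inside any allowed CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


print(is_allowed_source("104.17.2.3", SAMPLE_CLOUDFLARE_RANGES))   # True
print(is_allowed_source("203.0.113.7", SAMPLE_CLOUDFLARE_RANGES))  # False
```

In practice the same check lives in the host firewall or security group rather than in application code, so that rejected connections never reach the web server.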
If you depend on AI search referrals, watch the impact. Some AI crawlers are also part of the referral path that brings customers in via ChatGPT, Perplexity, and Copilot links. A blanket challenge will reduce both crawl traffic and (a smaller amount of) inbound referrals. The trade-off is usually worth it for an e-commerce site, but it is a trade-off — check your analytics after a week.
The fallback: a dispatcher override
If you cannot deploy Cloudflare today — because you do not control DNS, because your host's CDN integration is fragile, because a payment module is incompatible with proxied origins, or because you simply need the bleeding to stop in the next ten minutes — the in-PHP fallback is a Dispatcher override that 301-redirects non-AJAX ?q= requests back to the same URL without the filter:
<?php
// override/classes/Dispatcher.php
class Dispatcher extends DispatcherCore
{
    public function dispatch()
    {
        // Only redirect if 'q' is present AND the request is NOT an AJAX call.
        $isAjax = !empty($_SERVER['HTTP_X_REQUESTED_WITH'])
            && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest';

        if (Tools::getValue('q') && !$isAjax) {
            // parse_url() returns false on a malformed URI, so fall back to the root path.
            $parts = parse_url($_SERVER['REQUEST_URI']);
            $path = is_array($parts) && isset($parts['path']) ? $parts['path'] : '/';

            // Rebuild the query string without the 'q' filter parameter.
            $queryParams = [];
            if (is_array($parts) && isset($parts['query'])) {
                parse_str($parts['query'], $queryParams);
            }
            unset($queryParams['q']);
            $newQuery = http_build_query($queryParams);

            $newUrl = Tools::getShopDomainSsl(true) . $path;
            if (!empty($newQuery)) {
                $newUrl .= '?' . $newQuery;
            }

            header('Location: ' . $newUrl, true, 301);
            exit;
        }

        return parent::dispatch();
    }
}
What this does: any direct-browser, non-AJAX request to a URL containing ?q= gets 301-redirected to the same path with the q parameter stripped. AJAX requests (the storefront's own filter calls) pass through unchanged. The faceted filter UI keeps working for real customers; the URL surface that bots crawl becomes a single canonical category page.
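When verifying the override after deployment, it is handy to know the redirect target to expect for a given request. The sketch below reproduces the URL transformation in Python so expected Location paths can be computed in a test script; it models only the path and query rewriting, not the AJAX check or the shop domain prefix, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def strip_q(request_uri):
    """Return the request URI with the 'q' parameter removed.

    Other query parameters are kept in their original order; if none
    remain, the bare path is returned.
    """
    parts = urlsplit(request_uri)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "q"]
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


print(strip_q("/shoes?q=Brand-Brand-A/Color-Red&page=2"))  # /shoes?page=2
print(strip_q("/shoes?q=Color-Red"))                       # /shoes
```

Comparing these expected paths against the Location header returned by the store (for example with a non-AJAX request from a command-line client) confirms the override is active.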
Three trade-offs to understand before you ship this:
- SEO impact. If Google has indexed any of your filter URLs as canonical for long-tail queries ("blue running shoes size 9"), those rankings collapse — every indexed filter URL now redirects to the unfiltered category. For most stores this is acceptable; the original facet URLs were generally weak rankings anyway. But check Search Console before you deploy, and consider a one-time export of the high-value filter URLs you want to keep, with explicit 301 maps to dedicated landing pages.
- Bookmarks and shares. A customer who bookmarked or shared a filter URL no longer lands on the filtered view. The Cloudflare Managed Challenge approach preserves this; the dispatcher override does not.
- It only helps with PHP-side load. The request still hits PHP — it just exits quickly with a 301 instead of running the full category page. If your bottleneck is connection count rather than CPU, this helps less than the edge-level fix.
Deploy it as a stopgap. Move to Cloudflare as soon as you can.
The architectural angle: replace the URL surface itself
Both fixes above protect the existing ps_facetedsearch URL surface from automation. Neither changes the fact that the module exposes a combinatorially large public URL space in the first place.
A different class of filter module avoids the problem at the source: filter state lives client-side (or in a single AJAX POST), the public URL stays clean, and only deliberately-curated SEO landing pages (e.g., a hand-built page for "men's running shoes" rather than every brand-color-size combination) become indexable. We built Filter Revolution in that style, partly in response to exactly the traffic patterns described above. It reduces the high-cardinality public URL exposure that makes ps_facetedsearch attractive to crawlers. It is not a magic shield against AI crawlers — nothing is — and it does not retroactively un-index the URLs Google and Bing have already discovered. But it removes the structural attack surface that the Cloudflare and dispatcher fixes are working around.
If you are tired of patching the symptom and you are due for a faceted-search rework anyway, look at replacing the module rather than wrapping it. If the existing module fits the store, the Cloudflare rule above will keep working indefinitely.
One last thing: the empty carts
Every bot request that hits the category page also triggers PrestaShop's session and cart bootstrap. Each unique automated visitor gets a ps_cart row. After a month of crawler traffic, your back-office cart counter shows 80,000 "active" carts, your cart-related admin pages slow to a crawl, and any "abandoned cart" workflow becomes unusable because 99% of those carts never had a human attached.
If you have already deployed the Cloudflare rule, this stops growing — but the existing pollution does not clean itself up. PrestaShop's built-in cart-cleanup tasks are not aggressive enough to handle months of accumulated bot carts. The pragmatic fix is to hide anonymous, bot-origin carts from the back-office counters and listings rather than delete them — the rows stay in the database for whatever forensic value they hold, but they stop polluting the UI. We maintain a small internal module for this and are happy to share it with anyone who asks; reach out if you need a copy.
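As a starting point for sizing the problem, the sketch below shows one possible heuristic for finding candidate bot carts: carts with no customer attached and no products in them. This is an assumption made for illustration, not the logic of the internal module mentioned above. It runs against an in-memory SQLite database with a cut-down version of the ps_cart and ps_cart_product tables (default ps_ prefix assumed), so it is safe to try; adapt the query to MySQL and your real schema before using it.

```python
import sqlite3

# Heuristic (assumption): anonymous carts (id_customer = 0) that contain no products.
EMPTY_ANONYMOUS_CARTS = """
SELECT c.id_cart
FROM ps_cart AS c
LEFT JOIN ps_cart_product AS cp ON cp.id_cart = c.id_cart
WHERE c.id_customer = 0
  AND cp.id_cart IS NULL
ORDER BY c.id_cart
"""


def demo():
    """Build a tiny sample database and return the ids of candidate bot carts."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE ps_cart (id_cart INTEGER PRIMARY KEY, id_customer INTEGER, date_add TEXT);
        CREATE TABLE ps_cart_product (id_cart INTEGER, id_product INTEGER);
        INSERT INTO ps_cart VALUES
            (1, 0, '2025-05-01'),   -- anonymous, empty: candidate
            (2, 0, '2025-05-01'),   -- anonymous, has a product: keep
            (3, 42, '2025-05-02');  -- belongs to a customer: keep
        INSERT INTO ps_cart_product VALUES (2, 1001);
    """)
    return [row[0] for row in db.execute(EMPTY_ANONYMOUS_CARTS)]


print(demo())  # [1]
```

Run a read-only SELECT like this first and review the count before hiding or archiving anything; an anonymous empty cart can also belong to a real visitor who has not added a product yet.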
Quick checklist
- Turn on Cloudflare for the domain. Confirm every DNS record is orange-cloud proxied.
- Add Rule 1 (skip verified bots) and Rule 2 (Managed Challenge on the faceted shape) in log-only mode. Watch Security Events for 24–48 hours.
- Confirm legitimate AJAX filter calls are not being challenged. Confirm Googlebot is reaching the site. Confirm scraper traffic is being matched.
- Flip Rule 2 from Log to Managed Challenge.
- Restrict the origin firewall to Cloudflare IP ranges. This is what closes the loop.
- If Cloudflare is not an option, deploy the Dispatcher override as a stopgap and plan migration.
- If you are due for a filter-module rework, evaluate replacing ps_facetedsearch entirely.
- Clean up the back-office cart pollution separately so the cart admin stays usable.
Closing
This is the new baseline. Faceted-search URL explosion meets AI-era crawler volume, and the result is a self-inflicted DoS surface that ships out of the box on every PrestaShop install. The defaults will not save you, the obvious workarounds break legitimate users, and "block the bots" gets harder every month as the bots get better.
The good news is that the right fix lives at the edge, costs nothing on a free Cloudflare plan, and takes about thirty minutes to deploy carefully. The bad news is that it is one more thing you should have done last year. The middle news is that the work compounds: once you have a clean origin behind a tuned WAF, the next class of automated nuisance — credential stuffing, scraping, fake-checkout DoS — gets noticeably easier to defend against from the same control plane.
For broader hardening, see our PrestaShop Security Hardening Checklist. For what happens when a store does get compromised by something more sophisticated than crawler abuse, see our anatomy of a Magecart-style attack on a PrestaShop 1.7.x store. The defensive playbook is the same shape either way: cheap controls at the edge, narrow exceptions where you have to make them, and absolutely no security claims that rest on user-agent strings.