Stopping a Coordinated Bot Attack on a WP Engine Site

When Content Scraping Turned Into an Availability Problem on WordPress

At around 2am on a Tuesday, UptimeRobot fired an alert for a client site I manage on WP Engine. Response times had spiked past 8 seconds. By the time I was looking at the dashboard, the site was effectively down.

At first glance, it looked like a denial-of-service incident. But the traffic pattern did not match a conventional DDoS. This was not a wall of obvious garbage requests, and it was not a single endpoint being hammered with brute-force volume.

What I found instead was a large-scale content scraping operation that had discovered the site’s taxonomy structure and was using real category values to harvest archive content at scale. In the process, the bots generated thousands of unique query-string combinations. Those unique URLs bypassed WP Engine’s cache layer, forced full WordPress renders, and overloaded the origin server’s CPU.

The goal appeared to be content harvesting. The availability problem was the consequence.

Here’s how I diagnosed it, what I built to stop it, and what I put in place to make the site more resilient against the same pattern in the future.

Why Content Scrapers Target Structured WordPress Sites

Content scraping is largely automated and opportunistic. Bots crawl the open web looking for sites with large bodies of structured, reusable content. WordPress sites with custom post types, archives, taxonomies, filters, and high post volume are especially attractive because their structure gives scrapers a map.

Once a site fits the profile, scraped content can be republished on thin affiliate sites, used to populate low-quality content farms, fed into aggregation systems, or harvested for AI-related datasets. The scraper does not need to care about the client, the brand, or the business context. It only needs to identify that the content is structured, crawlable, and worth extracting.

In this case, the bots appeared to have mapped two custom post type archives and the category terms associated with them. They were not browsing the site the way a user would. They were systematically generating combinations of real category values in the query string to pull different archive states and content groupings.

That behavior made the traffic look more legitimate than a blunt-force attack. The paths were real. The categories were real. The archive pages existed. But the pattern was not human, and the effect on the server was severe.

What the Scraping Pattern Looked Like in the Logs

The first useful signal was not total traffic volume. It was request pattern.

When I pulled the WP Engine access logs, I saw the same archive paths being requested repeatedly with different combinations of real category parameters. The requests were hitting in rapid succession, often within the same second, from different IP addresses.

GET /resource-type-a/?category=topic-1&category=topic-2&category=topic-3
GET /resource-type-a/?category=topic-2&category=topic-4&category=topic-5
GET /resource-type-b/?category=topic-1&category=topic-3&category=topic-6
GET /resource-type-b/?category=topic-4&category=topic-5&category=topic-6

Every request used real taxonomy values from the site. That was the key detail. These were not random query strings. The bots had enough knowledge of the content structure to generate plausible filtered archive requests.

The problem was that each unique query string created a unique URL. WP Engine’s cache treated each one as a separate cache entry, so nearly every request was a first-time miss that propagated back to the PHP origin. The server then had to execute full WordPress page renders, including database queries against the custom post type archives, for traffic that should never have required that much origin work.

The user agent was another red flag. Across hundreds of IPs, the requests claimed to be coming from the same impossible browser:

Chrome/142.0.0.0 on Mac OS X 10_15_7

Chrome 142 did not exist at the time. A fabricated browser version, repeated across a distributed set of IPs, made the automation obvious once the logs were examined closely.
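
The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used during the incident: the function name, the sample IPs, and the second user agent are hypothetical, and parsing the (IP, user agent) pairs out of the access log is left out because it depends on the log format. A widely shared user agent is not suspicious on its own, since real browsers share version strings too. The signal here was a nonexistent version spread across many IPs.

```python
from collections import defaultdict


def user_agents_by_ip_spread(records):
    """Return (user_agent, distinct_ip_count) pairs, widest spread first.

    `records` is an iterable of (ip, user_agent) pairs already extracted
    from the access log.
    """
    ips_per_agent = defaultdict(set)
    for ip, user_agent in records:
        ips_per_agent[user_agent].add(ip)
    return sorted(
        ((agent, len(ips)) for agent, ips in ips_per_agent.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical sample data, for illustration only.
sample_records = [
    ("198.51.100.1", "Chrome/142.0.0.0"),
    ("198.51.100.2", "Chrome/142.0.0.0"),
    ("198.51.100.3", "Chrome/142.0.0.0"),
    ("203.0.113.9", "Firefox/128.0"),
]

for agent, ip_count in user_agents_by_ip_spread(sample_records):
    print(f"{ip_count:>4}  {agent}")
```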

UptimeRobot showed the operational pattern clearly: downtime windows of roughly 4–7 minutes recurring every 20–30 minutes. The bots were not simply flooding the server nonstop. They appeared to be running in bursts, generating enough origin load to overwhelm the CPU, then backing off before repeating the cycle.

Why the Site Went Down

The important distinction is that the site did not go down simply because bots were scraping it.

Content scraping happens constantly. Most of the time, edge caching absorbs a large portion of that activity. The real issue here was the combination of taxonomy-aware scraping and query-string entropy.

The bots were rotating real category combinations, which created a large number of unique URLs. Those URLs bypassed the existing cache layer because each one appeared different. Instead of serving cached archive responses from the edge, the platform had to send those requests to the origin server.

That created a compounding performance problem:

  • The bots generated large numbers of unique archive URLs.
  • Each unique URL missed cache.
  • Each miss forced a full WordPress render.
  • Each render triggered database queries against archive content.
  • The repeated origin work exhausted available CPU.
  • Legitimate visitors then experienced slow responses or downtime.
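
To make the first step of that chain concrete, here is a rough count of how many distinct URLs a small taxonomy can produce. The numbers are assumptions for illustration: two archives (as in this incident), three category parameters per request (as in the sample log lines), and a hypothetical term count. A cache keyed on the raw URL treats a different parameter order as a different URL, so the ordered count is the relevant upper bound.

```python
from math import comb, perm


def distinct_archive_urls(archives, terms, params_per_request, ordered):
    """Count the distinct filtered-archive URLs a crawler could generate.

    ordered=True counts every parameter order as a separate URL, which is
    how a cache keyed on the raw URL would see them.
    """
    per_archive = (
        perm(terms, params_per_request)
        if ordered
        else comb(terms, params_per_request)
    )
    return archives * per_archive


# Hypothetical term counts; the real taxonomy size is not stated here.
for terms in (6, 20):
    unordered = distinct_archive_urls(2, terms, 3, ordered=False)
    ordered = distinct_archive_urls(2, terms, 3, ordered=True)
    print(f"{terms} terms: {unordered} unordered, {ordered} ordered URLs")
```

Even a modest taxonomy reaches thousands of cache-missing URLs once parameter order is taken into account.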

This was not necessarily a botnet trying to intentionally overload the CPU. It was more likely a scraping operation that was indifferent to the server cost of its crawl strategy.

From the site owner’s perspective, though, the result was the same: the site became unavailable.

Why This Was Difficult to Diagnose Quickly

This kind of incident is harder to diagnose than a simple traffic spike because the individual requests look plausible in isolation.

The archive paths were real. The category terms were real. The user agent looked superficially browser-like. The request rate per IP was not necessarily extreme. There were no obvious malicious payloads, login attacks, SQL injection attempts, or strange admin requests.

The signal only became obvious when the requests were viewed together.

The real signature was query-string entropy: many different query strings hitting the same small set of archive paths, using real taxonomy terms, from different IPs, in coordinated bursts.

If you are seeing elevated response times on a WordPress site without an obvious plugin failure, cron issue, or traffic surge, raw access logs are often the fastest path to the truth. Analytics dashboards may show that traffic increased, but logs show what the traffic was actually doing.
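
One way to surface that signature is to count, for each path, how many requests arrived and how many distinct query strings they carried. The sketch below is a minimal example under assumptions: the function name is hypothetical, the request targets are assumed to be already extracted from the log lines, and the `/about/` entries are made-up filler for contrast. A path where the number of distinct query strings tracks the number of requests is worth a closer look.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def query_variants_per_path(request_targets):
    """Map each path to (request_count, distinct_query_string_count)."""
    stats = defaultdict(lambda: {"requests": 0, "queries": set()})
    for target in request_targets:
        parts = urlsplit(target)
        entry = stats[parts.path]
        entry["requests"] += 1
        if parts.query:
            entry["queries"].add(parts.query)
    return {
        path: (entry["requests"], len(entry["queries"]))
        for path, entry in stats.items()
    }


# The four archive requests mirror the log excerpt; /about/ is filler.
sample_targets = [
    "/resource-type-a/?category=topic-1&category=topic-2&category=topic-3",
    "/resource-type-a/?category=topic-2&category=topic-4&category=topic-5",
    "/resource-type-b/?category=topic-1&category=topic-3&category=topic-6",
    "/resource-type-b/?category=topic-4&category=topic-5&category=topic-6",
    "/about/",
    "/about/",
]

for path, (requests, variants) in query_variants_per_path(sample_targets).items():
    print(f"{path}: {requests} requests, {variants} distinct query strings")
```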

Immediate Stabilization

The first priority was to stop the origin from being overwhelmed. I handled that in parallel on two fronts: blocking the most obvious bot signature and getting WP Engine involved early.

Blocking the Fabricated User Agent

Because the bot traffic was consistently presenting the same nonexistent Chrome version, I created a Cloudflare WAF rule to block that user agent immediately.

http.user_agent contains "Chrome/142.0.0.0"

That was a low-risk intervention. Since the browser version did not exist, legitimate user impact was effectively zero. The rule did not solve the entire problem, but it immediately reduced a significant portion of the automated traffic.

Opening a WP Engine Support Ticket

I also opened a WP Engine support ticket while the incident was still active.

That mattered for two reasons. First, WP Engine had server-level visibility that was not available from the WordPress dashboard alone. Second, if the incident escalated or required cache-layer changes, I wanted their infrastructure team already looking at the same pattern rather than starting cold after the fact.

The Cloudflare Response

Once the immediate bot signature was blocked, the more important task was addressing the behavior that made the scraping so expensive: query-string permutations against the archive paths.

The goal was not merely to block a single user agent. The goal was to stop category-parameter combinations from generating endless cache misses.

Normalizing Category Query Strings

The decisive mitigation was a Cloudflare Transform Rule that normalized the affected archive requests before they reached the cache layer.

The rule targeted repeated category parameters on the two affected custom post type archive paths and stripped those category parameters before the cache key was generated.

# Cloudflare Transform Rule: Query String Normalization
# Match expression only. The rewrite action that strips the category
# parameters is set on the same rule and is not shown here.
# Targets requests with three or more category parameters on the affected CPT archives.

(http.request.uri.path matches "^/resource-type-[ab]/")
and (http.request.uri.query matches "category=.*category=.*category=")

That changed the economics of the crawl.

Before normalization, every category permutation created a new URL and forced a new origin render. After normalization, those requests collapsed back toward a smaller set of cacheable archive URLs. The bots could still request pages, but they could no longer force WordPress to render a fresh version for every permutation.
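
The effect of the rule can be approximated in Python. This is a sketch of the idea, not Cloudflare's implementation: the function name is hypothetical, and it matches the `category` key exactly rather than by substring. It leaves other paths and lightly filtered requests alone, and strips the category parameters once three or more appear on an affected archive path.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same path pattern as the rule expression above.
ARCHIVE_PATH = re.compile(r"^/resource-type-[ab]/")


def normalize_target(target):
    """Strip category parameters from heavily filtered archive requests."""
    parts = urlsplit(target)
    if not ARCHIVE_PATH.match(parts.path):
        return target
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    category_count = sum(1 for key, _ in pairs if key == "category")
    if category_count < 3:
        return target
    # Keep every non-category parameter in its original order.
    kept = [(key, value) for key, value in pairs if key != "category"]
    return urlunsplit(("", "", parts.path, urlencode(kept), ""))


requests_seen = [
    "/resource-type-a/?category=topic-1&category=topic-2&category=topic-3",
    "/resource-type-a/?category=topic-2&category=topic-4&category=topic-5",
    "/resource-type-b/?category=topic-1&category=topic-3&category=topic-6",
    "/resource-type-b/?category=topic-4&category=topic-5&category=topic-6",
]

# Four distinct URLs collapse to two cacheable archive URLs.
print(sorted({normalize_target(target) for target in requests_seen}))
```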

Rate Limiting the Normalized Archive Paths

The second rule added rate limiting around the affected archive paths. The point was to catch the remaining automated traffic after query-string normalization removed the cache-busting effect.

# Cloudflare Rate Limiting Rule
Action: Managed Challenge
Threshold: 50 requests per minute per IP
Scope: /resource-type-a/, /resource-type-b/
Matching: URI Path after normalization

I used Managed Challenge rather than a hard block. A hard block returns an error response (HTTP 429 by default for Cloudflare rate limiting rules) with no recovery path for legitimate users who happen to trip the rule. A managed challenge gives real browsers a way through while filtering out a large amount of automated traffic.

In this case, the transform rule addressed the origin-load problem, and the rate limit helped suppress the remaining bot activity.
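
For readers who want to reason about the threshold, the per-IP limit can be modeled as a sliding window. This is only an illustration of "50 requests per minute per IP". The class and method names are hypothetical, and Cloudflare's own counting method may differ.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Flag a client once it exceeds `limit` requests within the window."""

    def __init__(self, limit=50, window_seconds=60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def should_challenge(self, ip, now):
        """Record a request at time `now` (seconds) and report the verdict."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.limit


limiter = SlidingWindowLimiter(limit=50, window_seconds=60.0)

# A hypothetical client sending ten requests per second.
verdicts = [limiter.should_challenge("198.51.100.1", i * 0.1) for i in range(51)]
print(verdicts.count(True))
```

In this model, a client is challenged only while its recent request count stays above the limit. Once it goes quiet for a full window, it starts clean again.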

What Resolved the Incident

Within about 15 minutes of the Cloudflare rules going live, origin request volume dropped sharply.

The query-string normalization rule was the decisive intervention. Once the bot requests stopped generating unique cache keys, the origin no longer had to render a new WordPress response for every category combination. The CPU load dropped because the platform could serve far more of the traffic from cache.

The remaining traffic was handled by the managed challenge and rate limiting as the bots continued rotating through their IP pool.

By around 4am, the site was responding normally and UptimeRobot stopped reporting downtime.

Hardening After the Incident

Stopping the active incident was only the first step. The second workstream was reducing the likelihood that the same scraping pattern could take the site down again.

Robots.txt Updates

Malicious bots ignore robots.txt, so this was not treated as a security control. But it was still worth updating.

I added disallow rules for parameterized archive variants so compliant crawlers would have a clearer signal that query-string permutations were not indexable content.

User-agent: *
Disallow: /resource-type-a/*?*
Disallow: /resource-type-b/*?*

This does not stop hostile scraping. It does reduce unnecessary crawl noise from compliant bots and helps define the intended crawl surface more clearly.

WP Engine Cache Configuration

I also worked with WP Engine support on custom cache behavior for the affected archive paths.

The goal was to reinforce the Cloudflare normalization at the hosting layer. Query parameters outside an approved set were stripped or ignored before cache keys were generated for those archive pages.

That mattered because Cloudflare rules are only one layer. If a bot found a way around the edge rule, the WP Engine cache configuration still reduced the chance that arbitrary parameter combinations would force origin renders.
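
The allowlist idea behind that configuration can be sketched as follows. This is not WP Engine's configuration syntax or implementation. The function name and the approved parameter set are hypothetical placeholders, since the actual approved set depends on what the site legitimately uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical approved set; replace with the parameters the site really needs.
ALLOWED_PARAMS = frozenset({"page"})


def cache_key(target, allowed=ALLOWED_PARAMS):
    """Build a cache key that ignores query parameters outside `allowed`."""
    parts = urlsplit(target)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    )
    query = urlencode(kept)
    return f"{parts.path}?{query}" if query else parts.path


print(cache_key("/resource-type-a/?category=topic-1&utm_source=x&page=2"))
print(cache_key("/resource-type-a/?anything=else"))
```

Sorting the kept parameters means that reordering them cannot produce a new cache key either.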

Monitoring the Archive Paths

UptimeRobot was already monitoring availability, but after the incident I added periodic Cloudflare Analytics review to the maintenance cadence for the site.

The review focused specifically on request anomalies around the affected archive paths, unusual query-string patterns, and user agents appearing at scale. The assumption was not that the exact same bot signature would return. The assumption was that a similar scraping pattern could reappear with different headers, different parameters, or different timing.

The goal was not permanent immunity. The goal was to make aggressive scraping expensive enough that the site would fall out of the bots’ crawl queue, or at least stop threatening origin stability.

Caveats

  • Query-string normalization requires careful auditing. If your site uses meaningful query parameters for filters, search, pagination, or dynamic content, stripping the wrong values can break real functionality.
  • Do not assume every query parameter is hostile. UTM parameters, search parameters, pagination values, and legitimate filters may need to be preserved.
  • Cloudflare free-tier WAF rules have limits. If you are already close to the rule limit, you may need to consolidate existing rules before adding new ones.
  • Managed Challenge is usually safer than a hard block when there is any chance of legitimate traffic matching the rule.
  • Do not purge cache during an active origin-load incident unless you have already stopped the bot behavior. Purging cache while bots are generating cache misses can make the problem worse.
  • After mitigation, audit WP Engine cache exclusions. Any uncached archive or landing page remains a soft target for origin exhaustion.

The Broader Lesson

This incident was not best understood as a conventional DDoS. It was a content harvesting operation that became an availability problem because of how the scraping strategy interacted with WordPress taxonomy archives and cache behavior.

The bots appeared to be harvesting content through real category combinations. Those combinations produced high query-string entropy. The cache layer treated those URLs as distinct resources. The origin server had to render too many archive pages. CPU usage spiked, and the site went down.

That chain matters because it changes the mitigation strategy.

If you treat the problem only as a traffic-volume issue, you reach for generic rate limits. If you treat it only as a cache problem, you may miss the scraping behavior that caused it. But if you understand the full sequence (taxonomy-aware scraping, query-string permutations, cache bypass, origin exhaustion), the response becomes much clearer.

The fix was not one rule. It was a layered response: identify the scraping pattern, block the obvious bot signature, normalize the query strings, challenge repeated archive traffic, reinforce cache behavior at the host level, and monitor the paths most likely to be targeted again.

The larger takeaway is simple: when a WordPress site with structured archives suddenly slows down without an obvious cause, look for patterns in the raw access logs before assuming the problem is a plugin, a cron job, or normal traffic growth.

Sometimes the issue is not traffic volume alone. Sometimes it is what the traffic is making WordPress do.