Shanhai Liao | Senior Software Engineer, Content Acquisition and Media Platform; Di Ruan | Senior Staff Software Engineer, Content Acquisition and Media Platform; Evan Li | Senior Engineering Manager, Content Acquisition and Media Platform
Introduction
Accurate content understanding underpins Pinterest’s ability to drive distribution and engagement. This requires deep insight not just into the image itself, but also the outbound links or items to which those images point. At the foundation of this process lies a deceptively simple problem: URL normalization.
When Pinterest ingests content from millions of merchant domains, the same product page often appears under many different URLs. A single pair of shoes might be referenced by dozens of URL variations — each one decorated with different tracking parameters, session tokens, or analytics tags. While downstream systems can eventually deduplicate by content identity, the inability to recognize these duplicates at the URL level means every variation is independently fetched, rendered, and processed. At scale, this redundant ingestion and processing represents a significant waste of computational resources — rendering the same page dozens of times simply because its URLs differ in irrelevant parameters.
Item canonicalization — ensuring that identical items represented by different URLs are unified — is critical for organizing shopping catalogs and presenting a consistent experience to users. For many partners, a provided item ID determines canonical identity, but in its absence, the onus falls to advanced URL normalization to deduplicate effectively.
This post details the technical journey behind the Minimal Important Query Param Set (MIQPS) algorithm: a system that automatically learns which URL parameters matter for content identity, enabling dynamic and precise URL normalization at scale.
Background: The URL Normalization Challenge
Consider a typical product URL from an e-commerce site:
https://example.com/shoes?id=42&color=red
This URL identifies a specific product variant. But in practice, the same product page is often reached through URLs like:
https://example.com/shoes?id=42&color=red&utm_source=facebook&session=abc123
https://example.com/shoes?id=42&color=red&ref=pinterest&click_id=xyz
https://example.com/shoes?id=42&color=red&tracking=campaign_spring
Figure 1: The URL duplication problem. Multiple URLs with different tracking parameters all resolve to the same product content.
The parameters `utm_source`, `session`, `ref`, `click_id`, and `tracking` are all neutral: they don’t change the content of the page. Meanwhile, `id` and `color` are non-neutral: they determine which product and variant are displayed.
The challenge is distinguishing between the two. For well-known e-commerce platforms, this can be solved with curated rules. Shopify URLs, for example, use `variants` as the key product differentiator. Salesforce Commerce Cloud uses parameters like `start`, `sz`, `prefn1`, and `prefv1`. For these platforms, static allowlists are sufficient.
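To make the static approach concrete, here is a minimal Python sketch of an allowlist lookup. The parameter names come from the examples above; the dictionary layout, the platform keys, and the `apply_static_allowlist` helper are assumptions for illustration, not Pinterest's actual rule format.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlists keyed by platform name (layout is an assumption).
STATIC_ALLOWLISTS = {
    "shopify": frozenset({"variants"}),
    "salesforce_commerce_cloud": frozenset({"start", "sz", "prefn1", "prefv1"}),
}


def apply_static_allowlist(url: str, platform: str) -> str:
    """Keep only allowlisted query parameters for a recognized platform."""
    keep = STATIC_ALLOWLISTS.get(platform)
    if keep is None:
        # Unknown platform: no static rule applies, so leave the URL untouched.
        return url
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A table like this works only as long as someone maintains an entry for every platform, which is exactly what stops scaling for the long tail.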
But Pinterest ingests content from a large number of domains, operating on a wide variety of platforms.
For this long tail of domains, URL parameter conventions vary wildly. Static rules cannot scale to cover them all. We need a dynamic, data-driven approach.
The MIQPS Algorithm
The core insight behind MIQPS is straightforward: if removing a query parameter changes the content of a page, that parameter is important; if it doesn’t, the parameter is noise and can be safely stripped. Crucially, this analysis runs independently per domain — each merchant site gets its own MIQPS map, because the same parameter name can be meaningful on one domain and irrelevant on another.
The algorithm operates in three steps.
Step 1: Collect the URL Corpus
As Pinterest’s content ingestion pipeline processes URLs from domains, the system accumulates a corpus of observed URLs per domain. This corpus is stored durably and represents a snapshot of all the URL variations seen for a given domain. It serves as the input to the MIQPS analysis.
Step 2: Group URLs by Query Parameter Pattern
Not all URLs from a domain share the same set of query parameters. A product page URL might carry `{id, color, utm_source}` while a category page might carry `{category, page, sort}`. Analyzing them together would be meaningless.
Moreover, the same parameter name can play different roles depending on its context. Consider the parameter `ref`: on a product page URL like `example.com/product?id=42&ref=homepage`, `ref` is purely a tracking parameter and is neutral, since removing it doesn’t change the product displayed. But on a comparison page URL like `example.com/compare?ref=99`, the same `ref` parameter identifies which items to compare and is non-neutral. By grouping URLs by their full parameter pattern, the algorithm evaluates each parameter within its specific context, correctly classifying it as neutral in one pattern and non-neutral in another.
To address this, the algorithm groups URLs by their query parameter pattern — the sorted set of parameter names present in the URL. For example:
URLs sharing the same query pattern are grouped together. The top K patterns by URL count are selected for analysis, focusing computational resources on the patterns that matter most.
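The grouping step can be sketched in a few lines of Python. This is an illustrative sketch only: `query_pattern` and `top_patterns` are hypothetical names, and the pattern is represented as a sorted tuple of parameter names as described above.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def query_pattern(url: str) -> tuple[str, ...]:
    """Return the sorted set of parameter names present in the URL."""
    names = {name for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)}
    return tuple(sorted(names))


def top_patterns(urls, k: int) -> dict[tuple[str, ...], list[str]]:
    """Group URLs by query pattern and keep the K patterns with the most URLs."""
    groups = defaultdict(list)
    for url in urls:
        groups[query_pattern(url)].append(url)
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ranked[:k])
```

Because the pattern ignores parameter order and values, `?id=1&color=red&utm_source=a` and `?utm_source=b&id=2&color=blue` land in the same group.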
Step 3: For Each Pattern, Test Each Parameter
For each query parameter within a pattern, the algorithm determines whether it is neutral or non-neutral through empirical testing:
1. Sample: Select up to S URLs with distinct values for the parameter under test.
2. Compare: For each sampled URL, compute the content ID (a fingerprint derived from the page’s rendered visual content) for both:
   - The original URL (with the parameter present)
   - A modified URL (with the parameter removed)
3. Classify: If removing the parameter changes the content ID in at least T% of samples, the parameter is classified as non-neutral (important). Otherwise, it is neutral (safe to drop).
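The three sub-steps above can be sketched as follows. This is a simplified illustration, not the production implementation: `content_id` stands in for whatever fingerprint function is available, all helper names are hypothetical, and treating a parameter with no usable samples as non-neutral is an assumption made here to stay on the cautious side.

```python
from typing import Callable, Hashable, Iterable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_param(url: str, name: str) -> str:
    """Return the URL with every occurrence of the named parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def sample_distinct(urls: Iterable[str], name: str, s: int) -> list[str]:
    """Pick up to S URLs whose values for the parameter under test differ."""
    seen, sample = set(), []
    for url in urls:
        pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
        values = tuple(v for k, v in pairs if k == name)
        if values and values not in seen:
            seen.add(values)
            sample.append(url)
            if len(sample) == s:
                break
    return sample


def is_non_neutral(urls, name: str, content_id: Callable[[str], Hashable],
                   s: int, t_percent: float) -> bool:
    """Classify a parameter by how often removing it changes the content ID."""
    sample = sample_distinct(urls, name, s)
    if not sample:
        return True  # assumption: nothing to test, so keep the parameter
    mismatches = sum(content_id(u) != content_id(remove_param(u, name)) for u in sample)
    return 100.0 * mismatches / len(sample) >= t_percent
```

In practice each `content_id` call is an expensive page render, so a real job would likely cache results and tolerate failed renders.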
The content ID is a hash of the page’s visual representation, meaning two URLs that render the same visible content will produce the same content ID, even if their underlying HTML differs slightly. This particular fingerprinting approach leverages Pinterest’s in-house page rendering infrastructure, which is tailored to our content pipeline. The core MIQPS algorithm, however, is agnostic to how the content fingerprint is produced — it only requires a function that returns the same identifier for the same page content. Third parties looking to adopt a similar approach could substitute alternatives such as DOM tree hashing, HTTP response body checksums, or even simpler heuristics like comparing the `
A natural question is: why not simply use the **canonical URL** declared in the page’s HTML (via the `<link rel="canonical">` tag) to resolve duplicates? If the merchant provides a canonical URL, two variant URLs pointing to the same product should share the same canonical, making deduplication trivial. In practice, however, canonical URLs are unreliable at scale. Many merchant sites omit them entirely, set them incorrectly (e.g., pointing every page to the homepage), or include tracking parameters in the canonical URL itself. Because we cannot assume canonical URLs are present or correct across the long tail of merchant domains, MIQPS uses visual content comparison as a ground-truth signal that works regardless of how well-maintained a site’s metadata is.
Algorithm Parameters
The behavior of the MIQPS algorithm is governed by a small set of tunable parameters:
Two additional design choices make the algorithm practical at scale:
- Early exit optimization: If the mismatch rate already exceeds T% after N successful tests, we stop testing that parameter early. This avoids unnecessary page rendering calls for parameters that are clearly non-neutral.
- Conservative default: When fewer than N sample URLs are available for a parameter, it is treated as non-neutral by default. The system errs on the side of keeping parameters rather than dropping ones that might matter.
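The interaction between the tunable parameters, the early exit, and the conservative default can be sketched as below. The default values are placeholders rather than Pinterest's settings, the names `MiqpsConfig` and `classify` are hypothetical, and counting only successful tests toward N is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MiqpsConfig:
    top_k_patterns: int = 20              # K: patterns analyzed per domain (placeholder)
    max_samples: int = 10                 # S: sampled URLs per parameter (placeholder)
    mismatch_threshold_pct: float = 20.0  # T: mismatch rate for non-neutral (placeholder)
    min_tests: int = 5                    # N: successful tests required (placeholder)


def classify(test_results: Iterable[Optional[bool]], cfg: MiqpsConfig) -> str:
    """Consume per-sample results lazily.

    Each result is True if the content ID changed, False if it did not,
    and None if the test failed (for example, a rendering error).
    """
    tested = mismatches = 0
    for changed in test_results:
        if changed is None:
            continue  # failed tests do not count toward N
        tested += 1
        mismatches += int(changed)
        rate = 100.0 * mismatches / tested
        if tested >= cfg.min_tests and rate > cfg.mismatch_threshold_pct:
            return "non-neutral"  # early exit: skip the remaining renders
    if tested < cfg.min_tests:
        return "non-neutral"  # conservative default: too little evidence
    rate = 100.0 * mismatches / tested
    return "non-neutral" if rate >= cfg.mismatch_threshold_pct else "neutral"
```

Passing the results as a lazy iterable (such as a generator that renders pages on demand) is what lets the early exit actually save rendering calls.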
Putting It Together
Figure 2: The MIQPS computation pipeline.
The output of this pipeline is a MIQPS map: a mapping from each query parameter pattern to the set of non-neutral parameters within that pattern. This map is published to a configuration store and consumed at runtime during URL normalization.
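As an illustration of the shape of this output, the map can be modeled as a dictionary from pattern to non-neutral set. The JSON record layout below is an assumption made for the sketch; the actual format used in the configuration store is not described here.

```python
import json

# Pattern (sorted tuple of parameter names) -> non-neutral parameters.
MiqpsMap = dict[tuple[str, ...], frozenset[str]]

example_map: MiqpsMap = {
    ("color", "id", "utm_source"): frozenset({"color", "id"}),
    ("category", "page", "sort"): frozenset({"category", "page", "sort"}),
}


def to_json(miqps_map: MiqpsMap) -> str:
    """Serialize the map as a list of records (hypothetical format)."""
    records = [
        {"pattern": list(pattern), "non_neutral": sorted(miqps_map[pattern])}
        for pattern in sorted(miqps_map)
    ]
    return json.dumps(records)


def from_json(text: str) -> MiqpsMap:
    """Rebuild the in-memory map from the serialized records."""
    return {
        tuple(record["pattern"]): frozenset(record["non_neutral"])
        for record in json.loads(text)
    }
```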
Multi-Layer Normalization Strategy
MIQPS does not operate in isolation. In production, URL normalization combines static rules with the dynamically computed MIQPS. Static rules capture known conventions — curated allowlists for recognized e-commerce platforms and regex patterns for widely used parameter naming schemes. These rules handle cases where we already have high confidence about which parameters matter.
MIQPS complements these static rules by covering the long tail of domains where no predefined rules exist. A URL parameter is kept if it is matched by either the static rules or the MIQPS non-neutral set. Only parameters that pass neither check are stripped. This combination ensures broad coverage: static rules provide immediate, reliable handling for known platforms, while MIQPS dynamically adapts to everything else.
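The keep-or-strip decision can be sketched as a union of checks. The allowlist entries and regex below are made-up examples, `normalize` is a hypothetical name, and treating an unseen query pattern as having an empty non-neutral set is an assumption of this sketch; how unseen patterns are handled in production is not described here.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

STATIC_ALLOWLIST = frozenset({"variants"})               # hypothetical curated names
STATIC_REGEXES = [re.compile(r"^(item|product)_?id$")]   # hypothetical naming scheme


def normalize(url: str, miqps_map: dict) -> str:
    """Keep a parameter if a static rule or the MIQPS non-neutral set matches it."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pattern = tuple(sorted({name for name, _ in pairs}))
    # Assumption: a pattern missing from the map contributes no MIQPS matches.
    non_neutral = miqps_map.get(pattern, frozenset())

    def keep(name: str) -> bool:
        return (
            name in STATIC_ALLOWLIST
            or any(rx.search(name) for rx in STATIC_REGEXES)
            or name in non_neutral
        )

    kept = [(name, value) for name, value in pairs if keep(name)]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Since the checks are combined with `or`, adding a layer can only cause more parameters to be kept, never fewer.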
Anomaly Detection: Guarding Against Regressions
Computing MIQPS is inherently dependent on external page rendering. Pages can change, rendering infrastructure can have transient issues, and a domain’s URL structure can shift between analysis runs. Without safeguards, a bad MIQPS computation could cause the system to start dropping parameters that are actually important — leading to content deduplication errors and degraded catalog quality.
To address this, the system includes an anomaly detection layer that compares each newly computed MIQPS against the previously published version. The comparison follows a set of conservative rules:
- Parameter removed from non-neutral set (anomaly): If a parameter that was previously classified as non-neutral is now classified as neutral, the pattern is flagged as anomalous. This is the dangerous case — it means we would start stripping a parameter that we previously determined was important.
- Parameter added to non-neutral set (not anomalous): If a previously neutral parameter is now classified as non-neutral, this is not considered an anomaly. It simply means we discovered a new important parameter, and the worst case is keeping slightly more parameters than necessary.
- Pattern removed entirely (not anomalous): If a query pattern from the previous MIQPS is absent in the new one, this is not flagged. Patterns can naturally disappear as a domain’s URL structure evolves.
If more than A% of existing patterns are flagged as anomalous, the entire MIQPS update is rejected and the previous version is retained. This ensures the system never regresses — it errs on the side of over-keeping parameters rather than accidentally dropping ones that affect content identity.
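These rules translate into a short comparison routine. The sketch below is illustrative: `accept_update` and its argument names are hypothetical, and the denominator is taken to be the number of patterns in the previously published map.

```python
def accept_update(old_map: dict, new_map: dict, max_anomalous_pct: float) -> bool:
    """Return True if the new MIQPS map may replace the old one.

    A pattern is anomalous only when a previously non-neutral parameter
    is missing from the new non-neutral set for that same pattern.
    """
    if not old_map:
        return True  # nothing published yet, so nothing can regress
    flagged = 0
    for pattern, old_non_neutral in old_map.items():
        if pattern not in new_map:
            continue  # pattern removed entirely: not anomalous
        if not old_non_neutral <= new_map[pattern]:
            flagged += 1  # a parameter moved from non-neutral to neutral
    return 100.0 * flagged / len(old_map) <= max_anomalous_pct
```

Parameters newly added to a non-neutral set never affect the count, because the subset check only looks for parameters that disappeared.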
System Architecture and Integration
The MIQPS system fits into Pinterest’s content processing pipeline as follows:
Figure 3: End-to-end system architecture.
The architecture has three distinct phases:
- Content Ingestion: As URLs are processed from domains, the system writes each unique URL to a per-domain corpus stored in S3. This happens continuously as part of normal content processing.
- MIQPS Computation: After a content processing cycle completes for a domain, an offline job is triggered. This job downloads the URL corpus, runs the MIQPS algorithm (grouping, sampling, content ID comparison), performs anomaly detection, and publishes the result to both a config store (for runtime consumption) and S3 (for archival and debugging).
- URL Normalization: At runtime, the URL processor loads the MIQPS map from the config store at initialization. For each URL it processes, it looks up the query pattern, retrieves the non-neutral parameter set, and strips all parameters not matched by any of the four normalization layers.
This separation of concerns means the expensive content ID comparison happens offline and asynchronously, while runtime URL normalization is a fast, in-memory lookup.
An alternative design would be to determine parameter importance **in realtime** — rendering the page with and without each parameter at the moment a URL is first encountered. This would eliminate staleness entirely and provide immediate coverage for newly discovered domains. However, we chose the offline approach for several reasons:
- Latency: Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency to the content processing pipeline.
- Cost: Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive.
- Reliability: Transient rendering failures in an offline job are isolated and retryable. In a realtime path, they would directly block content processing.
In practice, the offline approach is a natural fit because URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity.
Conclusion
URL normalization may seem like a mundane infrastructure problem, but at Pinterest’s scale — with a large number of domains and billions of URLs — getting it right has outsized impact on content quality.
The MIQPS algorithm brings several key properties to this challenge:
- Dynamic and data-driven: MIQPS automatically adapts to each domain’s URL conventions without requiring manual configuration or domain-specific rules. As a domain’s URL structure evolves, the algorithm discovers new patterns and adjusts accordingly.
- Layered and defense-in-depth: The multi-layer normalization strategy combines static allowlists, regex patterns, and dynamically computed MIQPS. Each layer catches a different class of parameters, and a parameter only needs to match one layer to be preserved.
- Conservative and regression-resistant: The anomaly detection system ensures that MIQPS updates never regress — previously important parameters cannot be silently dropped. The system consistently errs on the side of keeping parameters rather than stripping them.
- Scalable and cost-efficient: By grouping URLs by pattern, focusing on the top K patterns, and using early exit optimizations, the algorithm keeps computational costs manageable even across hundreds of thousands of domains.
By aligning normalization strategies with proven content identity signals, MIQPS ensures every unique item or experience is surfaced cleanly — improving search and recommendations, downstream catalog management, and ultimately the user experience.
Facts Only
Pinterest developed the Minimal Important Query Param Set (MIQPS) algorithm to address URL normalization challenges.
The algorithm distinguishes between neutral (tracking, session) and non-neutral (product-identifying) URL parameters.
MIQPS operates in three steps: collecting URLs per domain, grouping them by query parameter patterns, and testing each parameter’s impact on content.
The system uses content fingerprints derived from rendered page visuals to determine parameter importance.
Static rules handle known e-commerce platforms, while MIQPS dynamically covers the long tail of domains.
Anomaly detection prevents regressions by flagging changes in parameter classification.
The architecture separates offline MIQPS computation from runtime URL normalization to optimize performance.
MIQPS is integrated into Pinterest’s content ingestion pipeline, reducing redundant processing.
The algorithm is conservative, defaulting to retaining parameters when uncertainty exists.
Pinterest’s solution leverages existing page rendering infrastructure for content fingerprinting.
The system is designed to scale across hundreds of thousands of domains.
Executive Summary
Pinterest faces a significant challenge in content deduplication due to URL variations that point to identical product pages but differ in tracking parameters, session tokens, or analytics tags. To address this, Pinterest developed the Minimal Important Query Param Set (MIQPS) algorithm, which dynamically learns which URL parameters are essential for content identity. The algorithm operates by collecting URLs from domains, grouping them by query parameter patterns, and empirically testing each parameter to determine if its removal changes the page content. MIQPS complements static rules for known e-commerce platforms, ensuring broad coverage while adapting to the long tail of domains with varying URL conventions. The system includes anomaly detection to prevent regressions and is integrated into Pinterest’s content processing pipeline, balancing computational efficiency with accuracy. By reducing redundant processing, MIQPS improves content quality, search, recommendations, and user experience at scale.
The approach is conservative, erring on the side of retaining parameters rather than risking incorrect deduplication. It leverages Pinterest’s existing page rendering infrastructure to generate content fingerprints, ensuring that visual content—rather than unreliable metadata like canonical URLs—serves as the ground truth for deduplication. The offline computation model avoids latency and cost issues associated with real-time analysis, making it practical for large-scale deployment. MIQPS represents a data-driven solution to a deceptively complex problem, demonstrating how automated systems can adapt to diverse and evolving web structures while maintaining high standards of content integrity.
Full Take
The MIQPS algorithm presents a novel, data-driven approach to URL normalization, addressing a critical scalability challenge in content deduplication. The methodology is sound in its empirical testing of parameter importance, using visual content fingerprints as a ground truth—a robust alternative to unreliable metadata like canonical URLs. However, peer reviewers might flag several considerations:
1. **Methodology Check**: The reliance on visual content fingerprints assumes that rendering infrastructure is consistent and that visual differences correlate perfectly with content identity. Transient rendering failures or dynamic content (e.g., personalized recommendations) could introduce noise. The sample size (S) and threshold (T%) for parameter classification are tunable but not explicitly justified in the text—small samples or lenient thresholds might misclassify parameters.
2. **Claims vs. Evidence**: The article asserts that MIQPS reduces redundant processing "at scale," but quantitative metrics (e.g., percentage reduction in duplicate fetches, computational cost savings) are not provided. The claim of scalability is plausible given the offline computation model, but real-world performance data would strengthen the argument.
3. **Literature Context**: The problem of URL normalization is well-documented in web crawling and e-commerce literature. MIQPS extends prior work by automating parameter classification per domain, but it doesn’t engage with existing solutions (e.g., heuristic-based deduplication in search engines). A comparison to alternatives like DOM hashing or HTTP response checksums would contextualize its novelty.
4. **Real-World Implications**: If the findings hold, MIQPS could significantly improve efficiency in any system ingesting web content at scale. However, its effectiveness depends on the stability of domain URL structures—frequent changes might require more frequent recomputation, increasing costs.
5. **Bridge Questions**: How would MIQPS perform on domains with highly dynamic content (e.g., single-page applications)? Could the algorithm be adapted to use lighter-weight signals (e.g., HTTP headers) to reduce rendering overhead? What is the false positive/negative rate in parameter classification, and how does it impact downstream systems?
The work is a pragmatic solution to a real-world problem, but further validation—particularly around edge cases and cost-benefit trade-offs—would solidify its contributions.
