style: ruff/black autofix + misc updates #5

shua-ie · 2025-07-21T04:40:13Z

No description provided.

src/quarrycore/dedup/canonical.py

+        Less accurate but more resilient for malformed HTML.
+        """
+        # Remove script tags and their content
+        html = re.sub(r"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", "", html, flags=re.DOTALL | re.IGNORECASE)


To address the issue, replace the insecure regex that removes <script> tags with a parser-based approach. Parsing HTML with a regex is inherently error-prone, so the fallback method should also use a proper HTML parser library such as BeautifulSoup. This follows the general recommendation to use well-tested libraries for HTML parsing and sanitization.
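
To illustrate why the regex is fragile, here is a minimal sketch that uses only the standard library. The pattern is the one quoted above. The function name and sample strings are made up for this example. The alert text is not shown on this page, so picking the spaced closing tag as the failing case is an assumption. Browsers accept `</script >` as a valid end tag, but the pattern only looks for the exact string `</script>`.

```python
import re

# Pattern copied from the fallback method under review.
SCRIPT_RE = re.compile(
    r"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>",
    flags=re.DOTALL | re.IGNORECASE,
)


def strip_scripts_with_regex(html: str) -> str:
    """Hypothetical helper: apply the regex exactly as the old code did."""
    return SCRIPT_RE.sub("", html)


# A closing tag written exactly as the regex expects.
well_formed = "<p>a</p><script>alert(1)</script><p>b</p>"
# Same markup, but with whitespace before the final '>'.
spaced_close = "<p>a</p><script>alert(1)</script ><p>b</p>"

print(strip_scripts_with_regex(well_formed))   # script block is removed
print(strip_scripts_with_regex(spaced_close))  # script block is left in place
```

In the second case the regex never finds its literal `</script>` terminator, so the input comes back unchanged with the script still in it.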

The changes will:

Use BeautifulSoup in the _canonicalize_fallback method to safely remove <script> tags and their content.

Remove the regex-based logic for <script> tag handling in the fallback method.

Ensure compatibility by continuing to normalize whitespace and decode HTML entities as before.
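
The steps listed above can be sketched roughly as follows. Note the swap: this sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages, whereas the merged change uses BeautifulSoup. `_TextExtractor` and `visible_text` are hypothetical names, not part of quarrycore, and the sketch returns plain text rather than the cleaned HTML string that the real method keeps working on.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped as well.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Normalize whitespace: collapse runs and trim the ends.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def visible_text(html: str) -> str:
    """Hypothetical fallback: parse, drop script/style, normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


sample = (
    "<p>Hello &amp; welcome</p><script>alert(1)</script>"
    "<style>p{}</style><!-- c --><p>bye</p>"
)
print(visible_text(sample))
```

The point of the design is that tag boundaries are decided by a parser rather than by pattern matching, so script and style content can be handled by the same mechanism instead of two separate regexes.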

style: ruff/black autofix + misc updates

1f0f557

shua-ie merged commit 72023d4 into main Jul 21, 2025
4 of 15 checks passed

github-advanced-security bot found potential problems Jul 21, 2025

@@ -21,7 +21,7 @@
                 HAS_SELECTOLAX = True
             except ImportError:
-                from bs4 import BeautifulSoup, Comment
+                from bs4 import BeautifulSoup, Comment  # Already imported
                 HAS_SELECTOLAX = False
@@ -130,10 +130,11 @@
                     Less accurate but more resilient for malformed HTML.
                     """
-                    # Remove script tags and their content
-                    html = re.sub(r"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", "", html, flags=re.DOTALL | re.IGNORECASE)
-                    # Remove style tags and their content
+                    # Use BeautifulSoup to remove script tags and their content
+                    soup = BeautifulSoup(html, "html.parser")
+                    for script in soup.find_all("script"):
+                        script.decompose()
+                    html = str(soup)
                     html = re.sub(r"<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>", "", html, flags=re.DOTALL | re.IGNORECASE)
                     # Remove HTML comments
