style: ruff/black autofix + misc updates #5

shua-ie (Owner) commented on Jul 21, 2025

No description provided.

shua-ie merged commit 72023d4 into main on Jul 21, 2025
4 of 15 checks passed
Less accurate but more resilient for malformed HTML.
"""
# Remove script tags and their content
html = re.sub(r"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", "", html, flags=re.DOTALL | re.IGNORECASE)

Check failure

Code scanning / CodeQL: Bad HTML filtering regexp (severity: High)

This regular expression does not match script end tags like </script >.
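
The alert can be reproduced in a few lines. The sketch below copies the pattern from the flagged line and runs it on two inputs; the helper name `strip_scripts` and both sample strings are made up for illustration and are not part of the project.

```python
import re

# Pattern copied verbatim from the flagged line in canonical.py.
SCRIPT_RE = re.compile(
    r"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>",
    flags=re.DOTALL | re.IGNORECASE,
)


def strip_scripts(html: str) -> str:
    """Hypothetical helper: apply the flagged regex the way the fallback does."""
    return SCRIPT_RE.sub("", html)


# Illustrative inputs (not taken from the repository).
well_formed = "<p>a</p><script>alert(1)</script><p>b</p>"
with_space = "<p>a</p><script>alert(1)</script ><p>b</p>"

print(strip_scripts(well_formed))  # -> <p>a</p><p>b</p>
print(strip_scripts(with_space))   # unchanged: the script block survives
```

The pattern only accepts the literal closing sequence `</script>`, so the end tag `</script >`, which browsers still treat as closing the element, leaves the whole block in place.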

Copilot Autofix

AI about 1 month ago

To address the issue, we will replace the fragile regex with a more robust way of removing <script> tags. Parsing HTML with regular expressions is inherently error-prone, so the fallback method should also use a proper HTML parser such as BeautifulSoup. This follows the general recommendation to rely on well-tested libraries for HTML parsing and sanitization.

The changes will:

  1. Use BeautifulSoup in the _canonicalize_fallback method to safely remove <script> tags and their content.
  2. Remove the regex-based logic for <script> tag handling in the fallback method.
  3. Ensure compatibility by continuing to normalize whitespace and decode HTML entities as before.
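
As a rough illustration of the parser-based idea in the steps above, here is a minimal, dependency-free sketch. It deliberately swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, and it assumes the fallback's goal is whitespace-normalized plain text; the class and function names are hypothetical and this is not the project's `_canonicalize_fallback`.

```python
import re
from html.parser import HTMLParser


class _TextOutsideScripts(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def canonicalize_fallback_sketch(markup: str) -> str:
    """Hypothetical stand-in: drop script/style content, normalize whitespace."""
    parser = _TextOutsideScripts()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample = "<p>Tom &amp; Jerry</p><script>alert(1)</script ><p>  ok </p>"
print(canonicalize_fallback_sketch(sample))  # -> Tom & Jerry ok
```

Because the parser recognizes the end tag itself, the `</script >` variant from the alert is handled without any special-casing in the pattern.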
Suggested changeset 1
src/quarrycore/dedup/canonical.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/quarrycore/dedup/canonical.py b/src/quarrycore/dedup/canonical.py
--- a/src/quarrycore/dedup/canonical.py
+++ b/src/quarrycore/dedup/canonical.py
@@ -21,7 +21,7 @@
 
     HAS_SELECTOLAX = True
 except ImportError:
-    from bs4 import BeautifulSoup, Comment
+    from bs4 import BeautifulSoup, Comment  # Already imported
 
     HAS_SELECTOLAX = False
 
@@ -130,10 +130,11 @@
 
         Less accurate but more resilient for malformed HTML.
         """
-        # Remove script tags and their content
-        html = re.sub(r"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", "", html, flags=re.DOTALL | re.IGNORECASE)
-
-        # Remove style tags and their content
+        # Use BeautifulSoup to remove script tags and their content
+        soup = BeautifulSoup(html, "html.parser")
+        for script in soup.find_all("script"):
+            script.decompose()
+        html = str(soup)
         html = re.sub(r"<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>", "", html, flags=re.DOTALL | re.IGNORECASE)
 
         # Remove HTML comments
EOF