Skip to content

hunspell: don't fail when .dic exists in the temp dir and .aff doesn't; add a way to avoid temp file creation completely (with Lucene) #11359

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

donnerpeter
Copy link
Collaborator

@donnerpeter donnerpeter commented May 6, 2025

With a fix for #11321

Summary by CodeRabbit

  • Refactor
    • Enhanced dictionary loading with a flexible Factory interface supporting both file and stream inputs.
    • Streamlined initialization and pattern computation for dictionary resources, improving maintainability.
  • New Features
    • Added support for creating Hunspell dictionaries directly from input streams for better resource compatibility.
  • Deprecation
    • Deprecated the old factory setter method in favor of a new stream-based factory setter.

Copy link
Contributor

coderabbitai bot commented May 6, 2025

Walkthrough

This change refactors the Hunspell dictionary creation mechanism from being file-based to stream-based. It introduces a new Factory interface for creating HunspellDictionary instances from InputStreams, updates related initialization logic in HunspellRule, and removes temporary file extraction for resource-based dictionaries, streamlining dictionary and affix file handling.

Changes

Files/Paths Change Summary
Hunspell dictionary creation and factory refactor
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java
Refactored dictionary creation to use a new Factory interface with InputStreams, replaced BiFunction-based factory, updated related methods, added new overloads for resource-based dictionary loading, and deprecated the old setter method.
HunspellRule dictionary initialization and pattern extraction
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java
Simplified dictionary and affix file handling by using resource streams directly, updated non-word pattern logic to operate on InputStreams, removed file-copying and path resolution helpers, and changed nonWordPatterns map key to String.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested reviewers

  • SteVio89

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (2)

75-83: Potential collision when using filename as “language” code

dictionary.getFileName().toString() includes the .dic suffix and is used as the prefix for createTempFile.
If two paths share the same filename (e.g., both are default.dic in different directories), their temp prefixes collide, producing confusing names like default.dic123456. It also deviates from the semantic “language code” parameter expected by custom factories.

Consider passing only the stem without extension:

-        .create(dictionary.getFileName().toString(), dic, aff);
+        .create(FilenameUtils.removeExtension(dictionary.getFileName().toString()), dic, aff);

This keeps the parameter meaningful and unique.


90-97: Fail-fast message could be more precise

The thrown RuntimeException reports “Could not find the dictionary … in the classpath” even when only the .aff is missing.
Return separate messages (or include which stream is null) to aid debugging:

if (dic == null) {
  throw new RuntimeException("Missing .dic for language \"" + language + "\" in classpath: " + dicPath);
}
if (aff == null) {
  throw new RuntimeException("Missing .aff for language \"" + language + "\" in classpath: " + affPath);
}

Minor but saves users time when diagnosing mis-packaged resources.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2eaf905 and 7d2d06e.

📒 Files selected for processing (2)
  • languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (3 hunks)
  • languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java (3 hunks)

Comment on lines 537 to 564
private static Pattern computNonWordPattern(InputStream stream) {
try (Scanner sc = new Scanner(stream)) {
while (sc.hasNextLine()) {
String line = sc.nextLine();
if (line.startsWith("WORDCHARS ")) {
String wordCharsFromAff = line.substring("WORDCHARS ".length());
//System.out.println("#" + wordCharsFromAff+ "#");
return Pattern.compile("(?![" + wordCharsFromAff.replace("-", "\\-") + "])" + NON_ALPHABETIC);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Escape all regex-meta characters in WORDCHARS

Currently only the dash (-) is escaped when building the character class.
Characters like ^, \, ], or [ inside WORDCHARS will break the pattern and throw PatternSyntaxException.

Consider escaping every char that has a special meaning in a character class:

private static String escapeForCharClass(String s) {
  StringBuilder sb = new StringBuilder();
  for (char c : s.toCharArray()) {
    if ("\\^[]-".indexOf(c) >= 0) sb.append('\\');
    sb.append(c);
  }
  return sb.toString();
}
...
return Pattern.compile("(?![" + escapeForCharClass(wordCharsFromAff) + "])" + NON_ALPHABETIC);

This makes the method robust for all dictionaries.

Comment on lines 521 to 536
Path affPath = Paths.get(shortDicPath + ".aff");
hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);
}
}
if (affPath != null) {
nonWordPattern = nonWordPatterns.computeIfAbsent(affPath, (path) -> {
String wordChars = "";
if (path != null) {
try (Scanner sc = new Scanner(path)) {
while (sc.hasNextLine()) {
String line = sc.nextLine();
if (line.startsWith("WORDCHARS ")) {
String wordCharsFromAff = line.substring("WORDCHARS ".length());
//System.out.println("#" + wordCharsFromAff+ "#");
wordChars = "(?![" + wordCharsFromAff.replace("-", "\\-") + "])";
break;
}
}
nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
try (InputStream stream = Files.newInputStream(affPath)) {
return computNonWordPattern(stream);
} catch (IOException e) {
logger.error("Could not read aff file: " + path, e);
logger.error("Could not read aff file: " + affPath, e);
}
}
return Pattern.compile(wordChars + NON_ALPHABETIC);
});
return NON_ALPHABETIC_PATTERN;
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

aff file may be absent on disk – avoid NoSuchFileException

Files.newInputStream(affPath) will fail if the .aff vanished.
Mirror the in-resource handling by checking Files.exists first, falling back to the default pattern when the file is missing.

-        Path affPath = Paths.get(shortDicPath + ".aff");
-        hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);
-        nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
-          try (InputStream stream = Files.newInputStream(affPath)) {
+        Path affPath = Paths.get(shortDicPath + ".aff");
+        hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);
+        nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
+          if (!Files.exists(affPath)) {
+            logger.warn("Missing .aff file at {}, falling back to generic pattern", affPath);
+            return NON_ALPHABETIC_PATTERN;
+          }
+          try (InputStream stream = Files.newInputStream(affPath)) {
             return computNonWordPattern(stream);
           } catch (IOException e) {

Prevents unexpected start-up failures on corrupt installations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Path affPath = Paths.get(shortDicPath + ".aff");
hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);
}
}
if (affPath != null) {
nonWordPattern = nonWordPatterns.computeIfAbsent(affPath, (path) -> {
String wordChars = "";
if (path != null) {
try (Scanner sc = new Scanner(path)) {
while (sc.hasNextLine()) {
String line = sc.nextLine();
if (line.startsWith("WORDCHARS ")) {
String wordCharsFromAff = line.substring("WORDCHARS ".length());
//System.out.println("#" + wordCharsFromAff+ "#");
wordChars = "(?![" + wordCharsFromAff.replace("-", "\\-") + "])";
break;
}
}
nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
try (InputStream stream = Files.newInputStream(affPath)) {
return computNonWordPattern(stream);
} catch (IOException e) {
logger.error("Could not read aff file: " + path, e);
logger.error("Could not read aff file: " + affPath, e);
}
}
return Pattern.compile(wordChars + NON_ALPHABETIC);
});
return NON_ALPHABETIC_PATTERN;
});
}
Path affPath = Paths.get(shortDicPath + ".aff");
hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);
nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
if (!Files.exists(affPath)) {
logger.warn("Missing .aff file at {}, falling back to generic pattern", affPath);
return NON_ALPHABETIC_PATTERN;
}
try (InputStream stream = Files.newInputStream(affPath)) {
return computNonWordPattern(stream);
} catch (IOException e) {
logger.error("Could not read aff file: " + affPath, e);
}
return NON_ALPHABETIC_PATTERN;
});
}

Comment on lines 512 to 518
if (JLanguageTool.getDataBroker().resourceExists(shortDicPath)) {
String path = getDictionaryPath(langCountry, shortDicPath);
if ("".equals(path)) {
hunspell = null;
} else {
affPath = Paths.get(path + ".aff");
hunspell = Hunspell.getDictionary(Paths.get(path + ".dic"), affPath);
addIgnoreWords();
}
String aff = shortDicPath.substring(0, shortDicPath.length() - 3) + "aff";
hunspell = Hunspell.forDictionaryInResources(langCountry, shortDicPath, aff);
addIgnoreWords();
nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ ->
computNonWordPattern(JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff)));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Gracefully handle missing .aff resource to avoid NPE

getFromResourceDirAsStream(aff) may return null when the .dic exists but the .aff doesn’t – exactly the scenario described in #11321.
Passing a null stream into computNonWordPattern(..) will throw a NullPointerException, undoing the intent of this PR.

-        nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ ->
-          computNonWordPattern(JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff)));
+        InputStream affStream = JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff);
+        nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
+          if (affStream == null) {                // fallback: no WORDCHARS in aff
+            logger.warn("Missing .aff for {}, using default non-alphabetic pattern", shortDicPath);
+            return NON_ALPHABETIC_PATTERN;
+          }
+          return computNonWordPattern(affStream);
+        });

This keeps the rule operational even when only the dictionary is present.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (JLanguageTool.getDataBroker().resourceExists(shortDicPath)) {
String path = getDictionaryPath(langCountry, shortDicPath);
if ("".equals(path)) {
hunspell = null;
} else {
affPath = Paths.get(path + ".aff");
hunspell = Hunspell.getDictionary(Paths.get(path + ".dic"), affPath);
addIgnoreWords();
}
String aff = shortDicPath.substring(0, shortDicPath.length() - 3) + "aff";
hunspell = Hunspell.forDictionaryInResources(langCountry, shortDicPath, aff);
addIgnoreWords();
nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ ->
computNonWordPattern(JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff)));
if (JLanguageTool.getDataBroker().resourceExists(shortDicPath)) {
String aff = shortDicPath.substring(0, shortDicPath.length() - 3) + "aff";
hunspell = Hunspell.forDictionaryInResources(langCountry, shortDicPath, aff);
addIgnoreWords();
InputStream affStream = JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff);
nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
if (affStream == null) { // fallback: no WORDCHARS in aff
logger.warn("Missing .aff for {}, using default non-alphabetic pattern", shortDicPath);
return NON_ALPHABETIC_PATTERN;
}
return computNonWordPattern(affStream);
});
}

Comment on lines 44 to 73
private static Factory viaTempFiles(BiFunction<Path, Path, HunspellDictionary> factory) {
return (language, dictionaryStream, affixStream) -> {
Path dictionary = Files.createTempFile(language, ".dic");
Path affix = Files.createTempFile(language, ".aff");
Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
try {
return factory.apply(dictionary, affix);
} finally {
Files.deleteIfExists(dictionary);
Files.deleteIfExists(affix);
}
};
}
Copy link
Contributor

@coderabbitai coderabbitai bot May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Deleting temp files immediately can break on Windows / memory-mapped impls

viaTempFiles deletes the newly-created .dic/.aff right after the dictionary is built:

Files.deleteIfExists(dictionary);
Files.deleteIfExists(affix);

On Windows this will throw AccessDeniedException when the native Hunspell DLL still holds the files open.
For Lucene-based factories that memory-map the files, removing them invalidates the mmap’d buffer.

Safer options:

  1. Use File.deleteOnExit() and keep the files for the lifetime of the JVM.
  2. Delegate deletion to HunspellDictionary#close() (delete when the dictionary is closed).
  3. At minimum, swallow IOExceptions here and log a warning.
-      } finally {
-        Files.deleteIfExists(dictionary);
-        Files.deleteIfExists(affix);
+      } finally {
+        dictionary.toFile().deleteOnExit();
+        affix.toFile().deleteOnExit();
       }

Please reconsider; otherwise cross-platform stability may regress.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tested, yet, but is it safe to delete the files directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Let's first look at the DumontsHunspellDictionary implementation
rg -A 5 "class DumontsHunspellDictionary" --context=20

# Let's also check if there are any close methods and how they handle resources
rg -A 5 "void close" --type java

# And let's check the HunspellDictionary interface
rg -A 10 "interface HunspellDictionary" --type java

# Check for any memory-mapped usage
rg "map\(" --type java

# Any Lucene integrations that might be using memory-mapped files
rg "Lucene" --type java

# Check for any file locking or similar mechanisms
rg "FileLock|lock" --type java

Length of output: 65953


@SteVio89 No, it's not safe to delete the files immediately after creating the dictionary. There are two main issues:

  1. Windows file locking: The native Hunspell library (used by DumontsHunspellDictionary) likely keeps the files open. Windows doesn't allow deleting files that are still open, which would cause AccessDeniedException.

  2. Memory-mapped files: The PR mentions Lucene integration - Lucene typically uses memory-mapped files. Deleting a file that's memory-mapped invalidates the mapping and can cause crashes or undefined behavior.

The reviewer's suggestion to use deleteOnExit() is appropriate as it:

  • Ensures files are cleaned up when the JVM exits
  • Doesn't interfere with active file handles
  • Works cross-platform
dictionary.toFile().deleteOnExit();
affix.toFile().deleteOnExit();

Alternatively, you could delegate deletion to the dictionary's close() method since it knows when it's done with the files.


🧬 Code Graph Analysis Results

Relevant snippet from languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (lines 1-109)

package org.languagetool.rules.spelling.hunspell;

import org.languagetool.JLanguageTool;
import org.languagetool.broker.ResourceDataBroker;

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.BiFunction;

public final class Hunspell {
  static class LanguageAndPath {
    private final Path dictionary;
    private final Path affix;
    LanguageAndPath(Path dictionary, Path affix) {
      this.dictionary = Objects.requireNonNull(dictionary);
      this.affix = Objects.requireNonNull(affix);
    }
    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (o == null || getClass() != o.getClass()) return false;
      LanguageAndPath that = (LanguageAndPath) o;
      return Objects.equals(dictionary, that.dictionary) &&
          Objects.equals(affix, that.affix);
    }
    @Override
    public int hashCode() {
      return Objects.hash(dictionary, affix);
    }
  }

  private static final Map<LanguageAndPath, HunspellDictionary> map = new HashMap<>();
  private static Factory hunspellDictionaryFactory = viaTempFiles(DumontsHunspellDictionary::new);

  /**
   * @deprecated Use {@link #setHunspellStreamFactory}
   */
  @Deprecated
  public static void setHunspellDictionaryFactory(BiFunction<Path, Path, HunspellDictionary> factory) {
    hunspellDictionaryFactory = viaTempFiles(factory);
  }

  private static Factory viaTempFiles(BiFunction<Path, Path, HunspellDictionary> factory) {
    return (language, dictionaryStream, affixStream) -> {
      Path dictionary = Files.createTempFile(language, ".dic");
      Path affix = Files.createTempFile(language, ".aff");
      Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
      Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
      try {
        return factory.apply(dictionary, affix);
      } finally {
        Files.deleteIfExists(dictionary);
        Files.deleteIfExists(affix);
      }
    };
  }

  /**
   * Set a custom way to create Hunspell dictionaries,
   * e.g., more efficient or portable than the default, possibly via Apache Lucene.
   * The default one is to use a native wrapper over the real Hunspell binary,
   * creating temporary files from the streams.
   */
  public static void setHunspellStreamFactory(Factory factory) {
    hunspellDictionaryFactory = factory;
  }

  public static synchronized HunspellDictionary getDictionary(Path dictionary, Path affix) {
    LanguageAndPath key = new LanguageAndPath(dictionary, affix);
    HunspellDictionary hunspell = map.get(key);
    if (hunspell != null && !hunspell.isClosed()) {
      return hunspell;
    }
    try(var dic = Files.newInputStream(dictionary); var aff = Files.newInputStream(affix)) {
      HunspellDictionary newHunspell = hunspellDictionaryFactory
        .create(dictionary.getFileName().toString(), dic, aff);
      map.put(key, newHunspell);
      return newHunspell;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static HunspellDictionary forDictionaryInResources(String language, String resourcePath) {
    return forDictionaryInResources(language, resourcePath + language + ".dic", resourcePath + language + ".aff");
  }

  public static HunspellDictionary forDictionaryInResources(String language, String dicPath, String affPath) {
    ResourceDataBroker broker = JLanguageTool.getDataBroker();
    try (var dic = broker.getFromResourceDirAsStream(dicPath); var aff = broker.getFromResourceDirAsStream(affPath)) {
      if (dic == null || aff == null) {
        throw new RuntimeException("Could not find the dictionary for language \"" + language + "\" in the classpath");
      }
      return hunspellDictionaryFactory.create(language, dic, aff);
    } catch (IOException e) {
      throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);
    }
  }

  public static HunspellDictionary forDictionaryInResources(String language) {
    return forDictionaryInResources(language, "");
  }

  public interface Factory {
    HunspellDictionary create(String languageCode, InputStream dictionary, InputStream affix) throws IOException;
  }
}

Summary of relevant details from the above snippet:

  • The class Hunspell manages Hunspell dictionaries keyed by a combination of dictionary and affix file paths.
  • It uses a Factory interface to create HunspellDictionary instances from dictionary and affix input streams.
  • The default factory implementation (viaTempFiles) creates temporary files for the dictionary and affix streams, then deletes these temporary files immediately after creating the dictionary.
  • The temporary files are deleted using Files.deleteIfExists() in a finally block, ensuring cleanup happens even if dictionary creation throws.
  • There is a static map caching created HunspellDictionary instances to avoid reloading.
  • The method getDictionary(Path dictionary, Path affix) opens input streams from the given paths and uses the factory to create or retrieve a cached dictionary.
  • The method forDictionaryInResources loads dictionary and affix files from resources and creates a dictionary using the factory.
  • The user comment "Not tested, yet, but is it safe to delete the files directly?" relates to the temporary file deletion in viaTempFiles method, which deletes the temporary files immediately after dictionary creation.

No other code snippets from the provided possibly_relevant_code_snippets are directly relevant to the question about deleting files in Hunspell.java.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've seen in the native code of Hunspell, the files are read and closed. The same happens in Lucene. We can also add a Javadoc requirement that the streams should be closed by the call return.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@SteVio89
Copy link
Member

SteVio89 commented May 9, 2025

Thanks @donnerpeter, will do some tests before merging again :) starting next week.

String dictionaryPath;
//in the webstart, java EE or OSGi bundle version, we need to copy the files outside the jar
//to the local temporary directory
if (StringUtils.equalsAny(dictURL.getProtocol(), "jar", "vfs", "bundle", "bundleresource")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this check in your change, too? When running LT on our Servers, we have all language-module jar extracted in our file system see this. This means we don't need the copy to /tmp, before using the files.

On the contrary, it creates unnecessary files.

Sorry for not seeing this in the first review.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'd check for the file:// protocol though, what do you think?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)

45-66: File deletion strategy still presents cross-platform concerns

The current implementation in viaTempFiles still deletes temporary files immediately after dictionary creation with Files.deleteIfExists(). This approach can cause issues:

  1. On Windows, native libraries may keep files open, causing AccessDeniedException
  2. With memory-mapped files (like those used by Lucene), deleting the underlying file invalidates the mapping

While the PR author noted that "the files are read and closed" in the native code, this assumption might not hold across all platforms and implementations.

#!/bin/bash
# Check native Hunspell implementations that might be used
rg -A 5 "class.*HunspellDictionary" --type java

# Look for any file locking or file access patterns
rg "map\(|FileChannel|MappedByteBuffer" --type java

Consider using deleteOnExit() instead for more robust cross-platform behavior:

-          Files.deleteIfExists(dictionary);
-          Files.deleteIfExists(affix);
+          dictionary.toFile().deleteOnExit();
+          affix.toFile().deleteOnExit();
🧹 Nitpick comments (1)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)

125-143: Well-designed Factory interface with clear documentation

The new Factory interface provides a clean separation between file-based and stream-based dictionary creation with good documentation. The default implementation for createFromLocalFiles ensures implementers only need to focus on the stream-based method.

One suggestion - consider adding a clarification in the Javadoc for createFromStreams that while implementers don't need to close the streams, they still need to fully consume them before returning.

   /**
    * Create a Hunspell dictionary from the given streams.
    * All necessary information should be extracted from both streams by the time this method returns.
-   * Closing the streams is not necessary in the implementation of this method, as the caller will close them itself.
+   * Closing the streams is not necessary in the implementation of this method, as the caller will close them itself.
+   * However, the implementation must fully consume both streams before returning.
    */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2d06e and 0ff2a74.

📒 Files selected for processing (1)
  • languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (4 hunks)
🔇 Additional comments (3)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (3)

98-119: Good optimization for file-based URLs

The implementation now efficiently handles resource loading by checking if the resources are available as local files first, avoiding unnecessary stream operations when direct file access is possible.


84-91: Improved resource management in getDictionary method

The refactoring properly moves stream management responsibility to the Factory implementation, simplifying this method while maintaining proper resource handling.


37-43: Good deprecation pattern with migration path

The code properly marks the old method as deprecated and provides a clear migration path to the new API through the Javadoc link. The implementation maintains backward compatibility by adapting the old factory interface to the new one.

@SteVio89
Copy link
Member

I did a test deployment. With these changes, the server restarts every 7–15 minutes (depends on requests). Restart mostly happens because the health check fails.

Unfortunately, I could not find any helpful logs, just see that memory usage grows until RAM is full -> restart.

@donnerpeter
Copy link
Collaborator Author

@SteVio89 Can I do anything to diagnose this on my side?

@SteVio89
Copy link
Member

You could run the server org.languagetool.server.HTTPServer locally and send random text in different languages to it. I will get a memory dump from our server and see what's happens. But not before Friday.

@donnerpeter
Copy link
Collaborator Author

You could run the server org.languagetool.server.HTTPServer locally and send random text in different languages to it. I will get a memory dump from our server and see what's happens. But not before Friday.

I'd be grateful if you get that memory dump :)

@SteVio89
Copy link
Member

Unfortunately, I couldn't find the root cause. Another colleague will take another look next week.

I keep you updated.

@donnerpeter
Copy link
Collaborator Author

Any updates? :)

@ilia-permiashkin
Copy link

Hey @SteVio89! Do you have any updates on this? The issue you mentioned isn't reproducible locally on my machine. Sometimes there are memory spikes reaching up to 4Gb, but then everything goes back down to normal (~1Gb). So we'd really like to get the memory dump at least to take a look at the healthcheck problem.

@donnerpeter
Copy link
Collaborator Author

@SteVio89 any updates? :)

@SteVio89
Copy link
Member

Sorry, we're currently tied up in a project that's blocking all backend resources. We haven't been able to reinvestigate the issue on our server.

If you absolutely want this PR merged, would it be possible to enable the changes (or the old behavior) via an ENV variable?

This way, you're not directly dependent on us.

…t; add a way to avoid temp file creation completely (with Lucene)
@donnerpeter
Copy link
Collaborator Author

If you absolutely want this PR merged, would it be possible to enable the changes (or the old behavior) via an ENV variable?

Done!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)

68-69: Unsafe immediate deletion of temp files persists

The immediate deletion of temporary files after dictionary creation remains problematic, as identified in previous reviews. This can cause AccessDeniedException on Windows when the native Hunspell DLL keeps files open, and invalidate memory-mapped buffers in Lucene-based implementations.

As suggested in the previous review, use deleteOnExit() to defer deletion:

-        Files.deleteIfExists(dictionary);
-        Files.deleteIfExists(affix);
+        dictionary.toFile().deleteOnExit();
+        affix.toFile().deleteOnExit();
🧹 Nitpick comments (1)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)

134-137: Potential Windows path handling issue with URL.getPath()

Using URL.getPath() on Windows file URLs may return paths like "/C:/path/to/file" which Path.of() might not handle correctly due to the leading slash.

Consider using Paths.get(url.toURI()) for more robust cross-platform handling:

-        return hunspellDictionaryStreamFactory.createFromLocalFiles(language, Path.of(dicUrl.getPath()), Path.of(affUrl.getPath()));
+        return hunspellDictionaryStreamFactory.createFromLocalFiles(language, Paths.get(dicUrl.toURI()), Paths.get(affUrl.toURI()));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ff2a74 and 0a636b0.

📒 Files selected for processing (2)
  • languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (4 hunks)
  • languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java
🔇 Additional comments (6)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (6)

7-7: LGTM!

The URL import is necessary for the new resource URL handling functionality added later in the file.


34-35: LGTM!

The environment variable flag provides necessary backward compatibility and addresses the production issues mentioned in the PR discussion. This allows teams to revert to the old behavior if needed.


38-38: LGTM!

The new Factory-based field provides better abstraction than the previous BiFunction approach, supporting both file and stream-based dictionary creation.


40-51: LGTM!

The deprecation is properly documented and the conditional logic ensures backward compatibility. The method appropriately delegates to either the old or new approach based on the environment flag.


75-83: LGTM!

The new Factory setter provides a clean API for custom implementations. The Javadoc clearly explains the use cases and benefits over the default temporary file approach.


157-175: LGTM!

The Factory interface is well-designed with clear separation of concerns. The default implementation of createFromLocalFiles provides convenience while allowing optimized implementations. The Javadoc clearly specifies stream handling responsibilities.

Comment on lines +91 to 106

if (FORCE_TEMP_FILES) {
HunspellDictionary newHunspell = hunspellDictionaryFactory.apply(dictionary, affix);
map.put(key, newHunspell);
return newHunspell;
}

try {
HunspellDictionary newHunspell = hunspellDictionaryStreamFactory
.createFromLocalFiles(dictionary.getFileName().toString(), dictionary, affix);
map.put(key, newHunspell);
return newHunspell;
} catch (IOException e) {
throw new RuntimeException(e);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Inconsistent error handling between old and new approaches

The new Factory-based approach (lines 98-105) wraps IOException in RuntimeException, while the old BiFunction approach (lines 92-96) doesn't handle exceptions. This creates inconsistent behavior depending on the FORCE_TEMP_FILES flag.

Consider adding consistent exception handling:

 if (FORCE_TEMP_FILES) {
-  HunspellDictionary newHunspell = hunspellDictionaryFactory.apply(dictionary, affix);
-  map.put(key, newHunspell);
-  return newHunspell;
+  try {
+    HunspellDictionary newHunspell = hunspellDictionaryFactory.apply(dictionary, affix);
+    map.put(key, newHunspell);
+    return newHunspell;
+  } catch (Exception e) {
+    throw new RuntimeException(e);
+  }
 }
🤖 Prompt for AI Agents
In
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java
around lines 91 to 106, the code inconsistently handles exceptions between the
FORCE_TEMP_FILES block and the try-catch block. To fix this, add appropriate
exception handling to the FORCE_TEMP_FILES block by wrapping any IOException or
relevant exceptions thrown by hunspellDictionaryFactory.apply in a try-catch,
and rethrow as RuntimeException to match the behavior of the other block,
ensuring consistent error handling regardless of the flag.

Comment on lines +113 to +129
if (FORCE_TEMP_FILES) {
try {
ResourceDataBroker broker = JLanguageTool.getDataBroker();
InputStream dictionaryStream = broker.getAsStream(dicPath);
InputStream affixStream = broker.getAsStream(affPath);
if (dictionaryStream == null || affixStream == null) {
throw new RuntimeException("Could not find dictionary for language \"" + language + "\" in classpath");
}
Path dictionary = Files.createTempFile(language, ".dic");
Path affix = Files.createTempFile(language, ".aff");
Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
return hunspellDictionaryFactory.apply(dictionary, affix);
} catch (IOException e) {
throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Same temp file deletion issue in FORCE_TEMP_FILES path

The FORCE_TEMP_FILES code path has the same immediate temporary file deletion problem, but here the files aren't being deleted at all. The method creates temp files (lines 121-122) and copies streams to them (lines 123-124), but never cleans them up after calling hunspellDictionaryFactory.apply(dictionary, affix) on line 125.

Add cleanup for the temporary files:

         Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
         Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
-        return hunspellDictionaryFactory.apply(dictionary, affix);
+        try {
+          return hunspellDictionaryFactory.apply(dictionary, affix);
+        } finally {
+          dictionary.toFile().deleteOnExit();
+          affix.toFile().deleteOnExit();
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (FORCE_TEMP_FILES) {
try {
ResourceDataBroker broker = JLanguageTool.getDataBroker();
InputStream dictionaryStream = broker.getAsStream(dicPath);
InputStream affixStream = broker.getAsStream(affPath);
if (dictionaryStream == null || affixStream == null) {
throw new RuntimeException("Could not find dictionary for language \"" + language + "\" in classpath");
}
Path dictionary = Files.createTempFile(language, ".dic");
Path affix = Files.createTempFile(language, ".aff");
Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
return hunspellDictionaryFactory.apply(dictionary, affix);
} catch (IOException e) {
throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);
}
}
if (FORCE_TEMP_FILES) {
try {
ResourceDataBroker broker = JLanguageTool.getDataBroker();
InputStream dictionaryStream = broker.getAsStream(dicPath);
InputStream affixStream = broker.getAsStream(affPath);
if (dictionaryStream == null || affixStream == null) {
throw new RuntimeException("Could not find dictionary for language \"" + language + "\" in classpath");
}
Path dictionary = Files.createTempFile(language, ".dic");
Path affix = Files.createTempFile(language, ".aff");
Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
try {
return hunspellDictionaryFactory.apply(dictionary, affix);
} finally {
dictionary.toFile().deleteOnExit();
affix.toFile().deleteOnExit();
}
} catch (IOException e) {
throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);
}
}
🤖 Prompt for AI Agents
In
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java
around lines 113 to 129, the temporary files created for dictionary and affix
are not deleted after use, causing potential resource leaks. Modify the code to
ensure that after calling hunspellDictionaryFactory.apply(dictionary, affix),
both temporary files are deleted in a finally block or using try-with-resources
to guarantee cleanup regardless of success or failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants