hunspell: don't fail when .dic exists in the temp dir and .aff doesn't; add a way to avoid temp file creation completely (with Lucene) #11359

donnerpeter · 2025-05-06T12:41:18Z

With a fix for #11321

Summary by CodeRabbit

Refactor
- Enhanced dictionary loading with a flexible Factory interface supporting both file and stream inputs.
- Streamlined initialization and pattern computation for dictionary resources, improving maintainability.
New Features
- Added support for creating Hunspell dictionaries directly from input streams for better resource compatibility.
Deprecation
- Deprecated the old factory setter method in favor of a new stream-based factory setter.

coderabbitai · 2025-05-06T12:41:25Z

Walkthrough

This change refactors the Hunspell dictionary creation mechanism from being file-based to stream-based. It introduces a new Factory interface for creating HunspellDictionary instances from InputStreams, updates related initialization logic in HunspellRule, and removes temporary file extraction for resource-based dictionaries, streamlining dictionary and affix file handling.

Changes

Files/Paths	Change Summary
Hunspell dictionary creation and factory refactor `languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java`	Refactored dictionary creation to use a new Factory interface with InputStreams, replaced BiFunction-based factory, updated related methods, added new overloads for resource-based dictionary loading, and deprecated the old setter method.
HunspellRule dictionary initialization and pattern extraction `languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java`	Simplified dictionary and affix file handling by using resource streams directly, updated non-word pattern logic to operate on InputStreams, removed file-copying and path resolution helpers, and changed nonWordPatterns map key to String.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

hunspell: don't fail when .dic exists in the temp dir and .aff doesn't; add a way to avoid temp file creation completely (with Lucene) #11321: Both PRs modify the Hunspell class by replacing the BiFunction-based factory with a Factory interface for stream-based dictionary creation, indicating direct overlap in the dictionary instantiation mechanism.

Suggested reviewers

SteVio89

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (2)
75-83: Potential collision when using filename as “language” code

dictionary.getFileName().toString() includes the .dic suffix and is used as the prefix for createTempFile.
If two paths share the same filename (e.g., both are default.dic in different directories), their temp prefixes collide, producing confusing names like default.dic123456. It also deviates from the semantic “language code” parameter expected by custom factories.

Consider passing only the stem without extension:
-        .create(dictionary.getFileName().toString(), dic, aff);
+        .create(FilenameUtils.removeExtension(dictionary.getFileName().toString()), dic, aff);
This keeps the parameter meaningful and unique.

90-97: Fail-fast message could be more precise

The thrown RuntimeException reports “Could not find the dictionary … in the classpath” even when only the .aff is missing.
Return separate messages (or include which stream is null) to aid debugging:
if (dic == null) {
  throw new RuntimeException("Missing .dic for language \"" + language + "\" in classpath: " + dicPath);
}
if (aff == null) {
  throw new RuntimeException("Missing .aff for language \"" + language + "\" in classpath: " + affPath);
}
Minor but saves users time when diagnosing mis-packaged resources.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2eaf905 and 7d2d06e.

📒 Files selected for processing (2)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (3 hunks)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java (3 hunks)

coderabbitai · 2025-05-06T12:45:21Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java

+  private static Pattern computNonWordPattern(InputStream stream) {
+    try (Scanner sc = new Scanner(stream)) {
+      while (sc.hasNextLine()) {
+        String line = sc.nextLine();
+        if (line.startsWith("WORDCHARS ")) {
+          String wordCharsFromAff = line.substring("WORDCHARS ".length());
+          //System.out.println("#" + wordCharsFromAff+ "#");
+          return Pattern.compile("(?![" + wordCharsFromAff.replace("-", "\\-") + "])" + NON_ALPHABETIC);
+        }


⚠️ Potential issue

Escape all regex-meta characters in WORDCHARS

Currently only the dash (-) is escaped when building the character class.
Characters like ^, \, ], or [ inside WORDCHARS will break the pattern and throw PatternSyntaxException.

Consider escaping every char that has a special meaning in a character class:

private static String escapeForCharClass(String s) { StringBuilder sb = new StringBuilder(); for (char c : s.toCharArray()) { if ("\\^[]-".indexOf(c) >= 0) sb.append('\\'); sb.append(c); } return sb.toString(); } ... return Pattern.compile("(?![" + escapeForCharClass(wordCharsFromAff) + "])" + NON_ALPHABETIC);

This makes the method robust for all dictionaries.

coderabbitai · 2025-05-06T12:45:21Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java

+        Path affPath = Paths.get(shortDicPath + ".aff");
        hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);
-      }
-    }
-    if (affPath != null) {
-      nonWordPattern = nonWordPatterns.computeIfAbsent(affPath, (path) -> {
-        String wordChars = "";
-        if (path != null) {
-          try (Scanner sc = new Scanner(path)) {
-            while (sc.hasNextLine()) {
-              String line = sc.nextLine();
-              if (line.startsWith("WORDCHARS ")) {
-                String wordCharsFromAff = line.substring("WORDCHARS ".length());
-                //System.out.println("#" + wordCharsFromAff+ "#");
-                wordChars = "(?![" + wordCharsFromAff.replace("-", "\\-") + "])";
-                break;
-              }
-            }
+        nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {
+          try (InputStream stream = Files.newInputStream(affPath)) {
+            return computNonWordPattern(stream);
          } catch (IOException e) {
-            logger.error("Could not read aff file: " + path, e);
+            logger.error("Could not read aff file: " + affPath, e);
          }
-        }
-        return Pattern.compile(wordChars + NON_ALPHABETIC);
-      });
+          return NON_ALPHABETIC_PATTERN;
+        });
+      }


🛠️ Refactor suggestion

aff file may be absent on disk – avoid NoSuchFileException

Files.newInputStream(affPath) will fail if the .aff vanished.
Mirror the in-resource handling by checking Files.exists first, falling back to the default pattern when the file is missing.

- Path affPath = Paths.get(shortDicPath + ".aff"); - hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath); - nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> { - try (InputStream stream = Files.newInputStream(affPath)) { + Path affPath = Paths.get(shortDicPath + ".aff"); + hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath); + nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> { + if (!Files.exists(affPath)) { + logger.warn("Missing .aff file at {}, falling back to generic pattern", affPath); + return NON_ALPHABETIC_PATTERN; + } + try (InputStream stream = Files.newInputStream(affPath)) { return computNonWordPattern(stream); } catch (IOException e) {

Prevents unexpected start-up failures on corrupt installations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Path affPath = Paths.get(shortDicPath + ".aff");

hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);

}

}

if (affPath != null) {

nonWordPattern = nonWordPatterns.computeIfAbsent(affPath, (path) -> {

String wordChars = "";

if (path != null) {

try (Scanner sc = new Scanner(path)) {

while (sc.hasNextLine()) {

String line = sc.nextLine();

if (line.startsWith("WORDCHARS ")) {

String wordCharsFromAff = line.substring("WORDCHARS ".length());

//System.out.println("#" + wordCharsFromAff+ "#");

wordChars = "(?![" + wordCharsFromAff.replace("-", "\\-") + "])";

break;

}

}

nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {

try (InputStream stream = Files.newInputStream(affPath)) {

return computNonWordPattern(stream);

} catch (IOException e) {

logger.error("Could not read aff file: " + path, e);

logger.error("Could not read aff file: " + affPath, e);

}

}

return Pattern.compile(wordChars + NON_ALPHABETIC);

});

return NON_ALPHABETIC_PATTERN;

});

}

Path affPath = Paths.get(shortDicPath + ".aff");

hunspell = Hunspell.getDictionary(Paths.get(shortDicPath + ".dic"), affPath);

nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {

if (!Files.exists(affPath)) {

logger.warn("Missing .aff file at {}, falling back to generic pattern", affPath);

return NON_ALPHABETIC_PATTERN;

}

try (InputStream stream = Files.newInputStream(affPath)) {

return computNonWordPattern(stream);

} catch (IOException e) {

logger.error("Could not read aff file: " + affPath, e);

}

return NON_ALPHABETIC_PATTERN;

});

}

coderabbitai · 2025-05-06T12:45:21Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java

      if (JLanguageTool.getDataBroker().resourceExists(shortDicPath)) {
-        String path = getDictionaryPath(langCountry, shortDicPath);
-        if ("".equals(path)) {
-          hunspell = null;
-        } else {
-          affPath = Paths.get(path + ".aff");
-          hunspell = Hunspell.getDictionary(Paths.get(path + ".dic"), affPath);
-          addIgnoreWords();
-        }
+        String aff = shortDicPath.substring(0, shortDicPath.length() - 3) + "aff";
+        hunspell = Hunspell.forDictionaryInResources(langCountry, shortDicPath, aff);
+        addIgnoreWords();
+        nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ ->
+          computNonWordPattern(JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff)));
+


⚠️ Potential issue

Gracefully handle missing .aff resource to avoid NPE

getFromResourceDirAsStream(aff) may return null when the .dic exists but the .aff doesn’t – exactly the scenario described in #11321.
Passing a null stream into computNonWordPattern(..) will throw a NullPointerException, undoing the intent of this PR.

- nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> - computNonWordPattern(JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff))); + InputStream affStream = JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff); + nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> { + if (affStream == null) { // fallback: no WORDCHARS in aff + logger.warn("Missing .aff for {}, using default non-alphabetic pattern", shortDicPath); + return NON_ALPHABETIC_PATTERN; + } + return computNonWordPattern(affStream); + });

This keeps the rule operational even when only the dictionary is present.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (JLanguageTool.getDataBroker().resourceExists(shortDicPath)) {

String path = getDictionaryPath(langCountry, shortDicPath);

if ("".equals(path)) {

hunspell = null;

} else {

affPath = Paths.get(path + ".aff");

hunspell = Hunspell.getDictionary(Paths.get(path + ".dic"), affPath);

addIgnoreWords();

}

String aff = shortDicPath.substring(0, shortDicPath.length() - 3) + "aff";

hunspell = Hunspell.forDictionaryInResources(langCountry, shortDicPath, aff);

addIgnoreWords();

nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ ->

computNonWordPattern(JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff)));

if (JLanguageTool.getDataBroker().resourceExists(shortDicPath)) {

String aff = shortDicPath.substring(0, shortDicPath.length() - 3) + "aff";

hunspell = Hunspell.forDictionaryInResources(langCountry, shortDicPath, aff);

addIgnoreWords();

InputStream affStream = JLanguageTool.getDataBroker().getFromResourceDirAsStream(aff);

nonWordPattern = nonWordPatterns.computeIfAbsent(shortDicPath, __ -> {

if (affStream == null) { // fallback: no WORDCHARS in aff

logger.warn("Missing .aff for {}, using default non-alphabetic pattern", shortDicPath);

return NON_ALPHABETIC_PATTERN;

}

return computNonWordPattern(affStream);

});

}

coderabbitai · 2025-05-06T12:45:22Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java

+  private static Factory viaTempFiles(BiFunction<Path, Path, HunspellDictionary> factory) {
+    return (language, dictionaryStream, affixStream) -> {
+      Path dictionary = Files.createTempFile(language, ".dic");
+      Path affix = Files.createTempFile(language, ".aff");
+      Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
+      Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
+      try {
+        return factory.apply(dictionary, affix);
+      } finally {
+        Files.deleteIfExists(dictionary);
+        Files.deleteIfExists(affix);
+      }
+    };
+  }


⚠️ Potential issue

Deleting temp files immediately can break on Windows / memory-mapped impls

viaTempFiles deletes the newly-created .dic/.aff right after the dictionary is built:

Files.deleteIfExists(dictionary); Files.deleteIfExists(affix);

On Windows this will throw AccessDeniedException when the native Hunspell DLL still holds the files open.
For Lucene-based factories that memory-map the files, removing them invalidates the mmap’d buffer.

Safer options:

Use File.deleteOnExit() and keep the files for the lifetime of the JVM.

Delegate deletion to HunspellDictionary#close() (delete when the dictionary is closed).

At minimum, swallow IOExceptions here and log a warning.

- } finally { - Files.deleteIfExists(dictionary); - Files.deleteIfExists(affix); + } finally { + dictionary.toFile().deleteOnExit(); + affix.toFile().deleteOnExit(); }

Please reconsider; otherwise cross-platform stability may regress.

Not tested, yet, but is it safe to delete the files directly?

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Let's first look at the DumontsHunspellDictionary implementation rg -A 5 "class DumontsHunspellDictionary" --context=20 # Let's also check if there are any close methods and how they handle resources rg -A 5 "void close" --type java # And let's check the HunspellDictionary interface rg -A 10 "interface HunspellDictionary" --type java # Check for any memory-mapped usage rg "map\(" --type java # Any Lucene integrations that might be using memory-mapped files rg "Lucene" --type java # Check for any file locking or similar mechanisms rg "FileLock|lock" --type java

Length of output: 65953

@SteVio89 No, it's not safe to delete the files immediately after creating the dictionary. There are two main issues:

Windows file locking: The native Hunspell library (used by DumontsHunspellDictionary) likely keeps the files open. Windows doesn't allow deleting files that are still open, which would cause AccessDeniedException.

Memory-mapped files: The PR mentions Lucene integration - Lucene typically uses memory-mapped files. Deleting a file that's memory-mapped invalidates the mapping and can cause crashes or undefined behavior.

The reviewer's suggestion to use deleteOnExit() is appropriate as it:

Ensures files are cleaned up when the JVM exits

Doesn't interfere with active file handles

Works cross-platform

dictionary.toFile().deleteOnExit(); affix.toFile().deleteOnExit();

Alternatively, you could delegate deletion to the dictionary's close() method since it knows when it's done with the files.

🧬 Code Graph Analysis Results

Relevant snippet from languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (lines 1-109)

package org.languagetool.rules.spelling.hunspell; import org.languagetool.JLanguageTool; import org.languagetool.broker.ResourceDataBroker; import java.io.*; import java.nio.file.*; import java.util.*; import java.util.function.BiFunction; public final class Hunspell { static class LanguageAndPath { private final Path dictionary; private final Path affix; LanguageAndPath(Path dictionary, Path affix) { this.dictionary = Objects.requireNonNull(dictionary); this.affix = Objects.requireNonNull(affix); } @Override public boolean equals(Object o) { if (this == o) return true; if (o == null || getClass() != o.getClass()) return false; LanguageAndPath that = (LanguageAndPath) o; return Objects.equals(dictionary, that.dictionary) && Objects.equals(affix, that.affix); } @Override public int hashCode() { return Objects.hash(dictionary, affix); } } private static final Map<LanguageAndPath, HunspellDictionary> map = new HashMap<>(); private static Factory hunspellDictionaryFactory = viaTempFiles(DumontsHunspellDictionary::new); /** * @deprecated Use {@link #setHunspellStreamFactory} */ @Deprecated public static void setHunspellDictionaryFactory(BiFunction<Path, Path, HunspellDictionary> factory) { hunspellDictionaryFactory = viaTempFiles(factory); } private static Factory viaTempFiles(BiFunction<Path, Path, HunspellDictionary> factory) { return (language, dictionaryStream, affixStream) -> { Path dictionary = Files.createTempFile(language, ".dic"); Path affix = Files.createTempFile(language, ".aff"); Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING); Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING); try { return factory.apply(dictionary, affix); } finally { Files.deleteIfExists(dictionary); Files.deleteIfExists(affix); } }; } /** * Set a custom way to create Hunspell dictionaries, * e.g., more efficient or portable than the default, possibly via Apache Lucene. * The default one is to use a native wrapper over the real Hunspell binary, * creating temporary files from the streams. */ public static void setHunspellStreamFactory(Factory factory) { hunspellDictionaryFactory = factory; } public static synchronized HunspellDictionary getDictionary(Path dictionary, Path affix) { LanguageAndPath key = new LanguageAndPath(dictionary, affix); HunspellDictionary hunspell = map.get(key); if (hunspell != null && !hunspell.isClosed()) { return hunspell; } try(var dic = Files.newInputStream(dictionary); var aff = Files.newInputStream(affix)) { HunspellDictionary newHunspell = hunspellDictionaryFactory .create(dictionary.getFileName().toString(), dic, aff); map.put(key, newHunspell); return newHunspell; } catch (IOException e) { throw new RuntimeException(e); } } public static HunspellDictionary forDictionaryInResources(String language, String resourcePath) { return forDictionaryInResources(language, resourcePath + language + ".dic", resourcePath + language + ".aff"); } public static HunspellDictionary forDictionaryInResources(String language, String dicPath, String affPath) { ResourceDataBroker broker = JLanguageTool.getDataBroker(); try (var dic = broker.getFromResourceDirAsStream(dicPath); var aff = broker.getFromResourceDirAsStream(affPath)) { if (dic == null || aff == null) { throw new RuntimeException("Could not find the dictionary for language \"" + language + "\" in the classpath"); } return hunspellDictionaryFactory.create(language, dic, aff); } catch (IOException e) { throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e); } } public static HunspellDictionary forDictionaryInResources(String language) { return forDictionaryInResources(language, ""); } public interface Factory { HunspellDictionary create(String languageCode, InputStream dictionary, InputStream affix) throws IOException; } }

Summary of relevant details from the above snippet:

The class Hunspell manages Hunspell dictionaries keyed by a combination of dictionary and affix file paths.

It uses a Factory interface to create HunspellDictionary instances from dictionary and affix input streams.

The default factory implementation (viaTempFiles) creates temporary files for the dictionary and affix streams, then deletes these temporary files immediately after creating the dictionary.

The temporary files are deleted using Files.deleteIfExists() in a finally block, ensuring cleanup happens even if dictionary creation throws.

There is a static map caching created HunspellDictionary instances to avoid reloading.

The method getDictionary(Path dictionary, Path affix) opens input streams from the given paths and uses the factory to create or retrieve a cached dictionary.

The method forDictionaryInResources loads dictionary and affix files from resources and creates a dictionary using the factory.

The user comment "Not tested, yet, but is it safe to delete the files directly?" relates to the temporary file deletion in viaTempFiles method, which deletes the temporary files immediately after dictionary creation.

No other code snippets from the provided possibly_relevant_code_snippets are directly relevant to the question about deleting files in Hunspell.java.

From what I've seen in the native code of Hunspell, the files are read and closed. The same happens in Lucene. We can also add a Javadoc requirement that the streams should be closed by the call return.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

SteVio89 · 2025-05-09T13:46:39Z

Thanks @donnerpeter, will do some tests before merging again :) starting next week.

SteVio89 · 2025-05-09T14:32:40Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java

-    String dictionaryPath;
-    //in the webstart, java EE or OSGi bundle version, we need to copy the files outside the jar
-    //to the local temporary directory
-    if (StringUtils.equalsAny(dictURL.getProtocol(), "jar", "vfs", "bundle", "bundleresource")) {


Can we have this check in your change, too? When running LT on our Servers, we have all language-module jar extracted in our file system see this. This means we don't need the copy to /tmp, before using the files.

On the contrary, it creates unnecessary files.

Sorry for not seeing this in the first review.

Makes sense. I'd check for the file:// protocol though, what do you think?

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)
45-66: File deletion strategy still presents cross-platform concerns

The current implementation in viaTempFiles still deletes temporary files immediately after dictionary creation with Files.deleteIfExists(). This approach can cause issues:

On Windows, native libraries may keep files open, causing AccessDeniedException

With memory-mapped files (like those used by Lucene), deleting the underlying file invalidates the mapping

While the PR author noted that "the files are read and closed" in the native code, this assumption might not hold across all platforms and implementations.
#!/bin/bash
# Check native Hunspell implementations that might be used
rg -A 5 "class.*HunspellDictionary" --type java

# Look for any file locking or file access patterns
rg "map\(|FileChannel|MappedByteBuffer" --type java
Consider using deleteOnExit() instead for more robust cross-platform behavior:
-          Files.deleteIfExists(dictionary);
-          Files.deleteIfExists(affix);
+          dictionary.toFile().deleteOnExit();
+          affix.toFile().deleteOnExit();

🧹 Nitpick comments (1)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)
125-143: Well-designed Factory interface with clear documentation

The new Factory interface provides a clean separation between file-based and stream-based dictionary creation with good documentation. The default implementation for createFromLocalFiles ensures implementers only need to focus on the stream-based method.

One suggestion - consider adding a clarification in the Javadoc for createFromStreams that while implementers don't need to close the streams, they still need to fully consume them before returning.
   /**
    * Create a Hunspell dictionary from the given streams.
    * All necessary information should be extracted from both streams by the time this method returns.
-   * Closing the streams is not necessary in the implementation of this method, as the caller will close them itself.
+   * Closing the streams is not necessary in the implementation of this method, as the caller will close them itself.
+   * However, the implementation must fully consume both streams before returning.
    */

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2d06e and 0ff2a74.

📒 Files selected for processing (1)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (4 hunks)

🔇 Additional comments (3)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (3)

98-119: Good optimization for file-based URLs

The implementation now efficiently handles resource loading by checking if the resources are available as local files first, avoiding unnecessary stream operations when direct file access is possible.

84-91: Improved resource management in getDictionary method

The refactoring properly moves stream management responsibility to the Factory implementation, simplifying this method while maintaining proper resource handling.

37-43: Good deprecation pattern with migration path

The code properly marks the old method as deprecated and provides a clear migration path to the new API through the Javadoc link. The implementation maintains backward compatibility by adapting the old factory interface to the new one.

SteVio89 · 2025-05-13T16:28:05Z

I did a test deployment. With these changes, the server restarts every 7–15 minutes (depends on requests). Restart mostly happens because the health check fails.

Unfortunately, I could not find any helpful logs, just see that memory usage grows until RAM is full -> restart.

donnerpeter · 2025-05-14T15:25:23Z

@SteVio89 Can I do anything to diagnose this on my side?

SteVio89 · 2025-05-15T08:06:17Z

You could run the server org.languagetool.server.HTTPServer locally and send random text in different languages to it. I will get a memory dump from our server and see what's happens. But not before Friday.

donnerpeter · 2025-05-16T08:58:10Z

You could run the server org.languagetool.server.HTTPServer locally and send random text in different languages to it. I will get a memory dump from our server and see what's happens. But not before Friday.

I'd be grateful if you get that memory dump :)

SteVio89 · 2025-05-16T14:17:48Z

Unfortunately, I couldn't find the root cause. Another colleague will take another look next week.

I keep you updated.

donnerpeter · 2025-06-02T13:13:30Z

Any updates? :)

ilia-permiashkin · 2025-06-25T12:18:11Z

Hey @SteVio89! Do you have any updates on this? The issue you mentioned isn't reproducible locally on my machine. Sometimes there are memory spikes reaching up to 4Gb, but then everything goes back down to normal (~1Gb). So we'd really like to get the memory dump at least to take a look at the healthcheck problem.

donnerpeter · 2025-07-24T03:49:52Z

@SteVio89 any updates? :)

SteVio89 · 2025-07-29T09:48:05Z

Sorry, we're currently tied up in a project that's blocking all backend resources. We haven't been able to reinvestigate the issue on our server.

If you absolutely want this PR merged, would it be possible to enable the changes (or the old behavior) via an ENV variable?

This way, you're not directly dependent on us.

…t; add a way to avoid temp file creation completely (with Lucene)

donnerpeter · 2025-08-06T11:31:32Z

If you absolutely want this PR merged, would it be possible to enable the changes (or the old behavior) via an ENV variable?

Done!

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)
68-69: Unsafe immediate deletion of temp files persists

The immediate deletion of temporary files after dictionary creation remains problematic, as identified in previous reviews. This can cause AccessDeniedException on Windows when the native Hunspell DLL keeps files open, and invalidate memory-mapped buffers in Lucene-based implementations.

As suggested in the previous review, use deleteOnExit() to defer deletion:
-        Files.deleteIfExists(dictionary);
-        Files.deleteIfExists(affix);
+        dictionary.toFile().deleteOnExit();
+        affix.toFile().deleteOnExit();

🧹 Nitpick comments (1)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (1)
134-137: Potential Windows path handling issue with URL.getPath()

Using URL.getPath() on Windows file URLs may return paths like "/C:/path/to/file" which Path.of() might not handle correctly due to the leading slash.

Consider using Paths.get(url.toURI()) for more robust cross-platform handling:
-        return hunspellDictionaryStreamFactory.createFromLocalFiles(language, Path.of(dicUrl.getPath()), Path.of(affUrl.getPath()));
+        return hunspellDictionaryStreamFactory.createFromLocalFiles(language, Paths.get(dicUrl.toURI()), Paths.get(affUrl.toURI()));

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ff2a74 and 0a636b0.

📒 Files selected for processing (2)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (4 hunks)
languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java (3 hunks)

✅ Files skipped from review due to trivial changes (1)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/HunspellRule.java

🔇 Additional comments (6)

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (6)

7-7: LGTM!

The URL import is necessary for the new resource URL handling functionality added later in the file.

34-35: LGTM!

The environment variable flag provides necessary backward compatibility and addresses the production issues mentioned in the PR discussion. This allows teams to revert to the old behavior if needed.

38-38: LGTM!

The new Factory-based field provides better abstraction than the previous BiFunction approach, supporting both file and stream-based dictionary creation.

40-51: LGTM!

The deprecation is properly documented and the conditional logic ensures backward compatibility. The method appropriately delegates to either the old or new approach based on the environment flag.

75-83: LGTM!

The new Factory setter provides a clean API for custom implementations. The Javadoc clearly explains the use cases and benefits over the default temporary file approach.

157-175: LGTM!

The Factory interface is well-designed with clear separation of concerns. The default implementation of createFromLocalFiles provides convenience while allowing optimized implementations. The Javadoc clearly specifies stream handling responsibilities.

coderabbitai · 2025-08-06T11:33:52Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java

+
+    if (FORCE_TEMP_FILES) {
+      HunspellDictionary newHunspell = hunspellDictionaryFactory.apply(dictionary, affix);
+      map.put(key, newHunspell);
+      return newHunspell;
+    }
+
+    try {
+      HunspellDictionary newHunspell = hunspellDictionaryStreamFactory
+        .createFromLocalFiles(dictionary.getFileName().toString(), dictionary, affix);
+      map.put(key, newHunspell);
+      return newHunspell;
+    } catch (IOException e) {
+      throw new RuntimeException(e);
+    }
  }


🛠️ Refactor suggestion

Inconsistent error handling between old and new approaches

The new Factory-based approach (lines 98-105) wraps IOException in RuntimeException, while the old BiFunction approach (lines 92-96) doesn't handle exceptions. This creates inconsistent behavior depending on the FORCE_TEMP_FILES flag.

Consider adding consistent exception handling:

if (FORCE_TEMP_FILES) { - HunspellDictionary newHunspell = hunspellDictionaryFactory.apply(dictionary, affix); - map.put(key, newHunspell); - return newHunspell; + try { + HunspellDictionary newHunspell = hunspellDictionaryFactory.apply(dictionary, affix); + map.put(key, newHunspell); + return newHunspell; + } catch (Exception e) { + throw new RuntimeException(e); + } }

🤖 Prompt for AI Agents

In languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java around lines 91 to 106, the code inconsistently handles exceptions between the FORCE_TEMP_FILES block and the try-catch block. To fix this, add appropriate exception handling to the FORCE_TEMP_FILES block by wrapping any IOException or relevant exceptions thrown by hunspellDictionaryFactory.apply in a try-catch, and rethrow as RuntimeException to match the behavior of the other block, ensuring consistent error handling regardless of the flag.

coderabbitai · 2025-08-06T11:33:52Z

languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java

+    if (FORCE_TEMP_FILES) {
+      try {
+        ResourceDataBroker broker = JLanguageTool.getDataBroker();
+        InputStream dictionaryStream = broker.getAsStream(dicPath);
+        InputStream affixStream = broker.getAsStream(affPath);
+        if (dictionaryStream == null || affixStream == null) {
+          throw new RuntimeException("Could not find dictionary for language \"" + language + "\" in classpath");
+        }
+        Path dictionary = Files.createTempFile(language, ".dic");
+        Path affix = Files.createTempFile(language, ".aff");
+        Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);
+        Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);
+        return hunspellDictionaryFactory.apply(dictionary, affix);
+      } catch (IOException e) {
+        throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);
+      }
+    }


⚠️ Potential issue

Same temp file deletion issue in FORCE_TEMP_FILES path

The FORCE_TEMP_FILES code path has the same immediate temporary file deletion problem, but here the files aren't being deleted at all. The method creates temp files (lines 121-122) and copies streams to them (lines 123-124), but never cleans them up after calling hunspellDictionaryFactory.apply(dictionary, affix) on line 125.

Add cleanup for the temporary files:

Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING); Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING); - return hunspellDictionaryFactory.apply(dictionary, affix); + try { + return hunspellDictionaryFactory.apply(dictionary, affix); + } finally { + dictionary.toFile().deleteOnExit(); + affix.toFile().deleteOnExit(); + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (FORCE_TEMP_FILES) {

try {

ResourceDataBroker broker = JLanguageTool.getDataBroker();

InputStream dictionaryStream = broker.getAsStream(dicPath);

InputStream affixStream = broker.getAsStream(affPath);

if (dictionaryStream == null || affixStream == null) {

throw new RuntimeException("Could not find dictionary for language \"" + language + "\" in classpath");

}

Path dictionary = Files.createTempFile(language, ".dic");

Path affix = Files.createTempFile(language, ".aff");

Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);

Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);

return hunspellDictionaryFactory.apply(dictionary, affix);

} catch (IOException e) {

throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);

}

}

if (FORCE_TEMP_FILES) {

try {

ResourceDataBroker broker = JLanguageTool.getDataBroker();

InputStream dictionaryStream = broker.getAsStream(dicPath);

InputStream affixStream = broker.getAsStream(affPath);

if (dictionaryStream == null || affixStream == null) {

throw new RuntimeException("Could not find dictionary for language \"" + language + "\" in classpath");

}

Path dictionary = Files.createTempFile(language, ".dic");

Path affix = Files.createTempFile(language, ".aff");

Files.copy(dictionaryStream, dictionary, StandardCopyOption.REPLACE_EXISTING);

Files.copy(affixStream, affix, StandardCopyOption.REPLACE_EXISTING);

try {

return hunspellDictionaryFactory.apply(dictionary, affix);

} finally {

dictionary.toFile().deleteOnExit();

affix.toFile().deleteOnExit();

}

} catch (IOException e) {

throw new RuntimeException("Could not create temporary dictionaries for language \"" + language + "\"", e);

}

}

🤖 Prompt for AI Agents

In languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java around lines 113 to 129, the temporary files created for dictionary and affix are not deleted after use, causing potential resource leaks. Modify the code to ensure that after calling hunspellDictionaryFactory.apply(dictionary, affix), both temporary files are deleted in a finally block or using try-with-resources to guarantee cleanup regardless of success or failure.

donnerpeter mentioned this pull request May 6, 2025

hunspell: don't fail when .dic exists in the temp dir and .aff doesn't; add a way to avoid temp file creation completely (with Lucene) #11321

Merged

coderabbitai bot reviewed May 6, 2025

View reviewed changes

SteVio89 reviewed May 9, 2025

View reviewed changes

coderabbitai bot reviewed May 12, 2025

View reviewed changes

donnerpeter force-pushed the hunspell branch from 0ff2a74 to ba50788 Compare August 6, 2025 11:29

hunspell: don't fail when .dic exists in the temp dir and .aff doesn'…

0a636b0

…t; add a way to avoid temp file creation completely (with Lucene)

donnerpeter force-pushed the hunspell branch from ba50788 to 0a636b0 Compare August 6, 2025 11:30

coderabbitai bot reviewed Aug 6, 2025

View reviewed changes

hunspell: don't fail when .dic exists in the temp dir and .aff doesn't; add a way to avoid temp file creation completely (with Lucene) #11359

Are you sure you want to change the base?

hunspell: don't fail when .dic exists in the temp dir and .aff doesn't; add a way to avoid temp file creation completely (with Lucene) #11359

Uh oh!

Conversation

donnerpeter commented May 6, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 6, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 6, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 6, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SteVio89 May 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 9, 2025

Choose a reason for hiding this comment

Relevant snippet from languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java (lines 1-109)

Summary of relevant details from the above snippet:

No other code snippets from the provided possibly_relevant_code_snippets are directly relevant to the question about deleting files in Hunspell.java.

Uh oh!

donnerpeter May 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 9, 2025

Choose a reason for hiding this comment

Uh oh!

SteVio89 commented May 9, 2025

Uh oh!

SteVio89 May 9, 2025

Choose a reason for hiding this comment

Uh oh!

donnerpeter May 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

SteVio89 commented May 13, 2025

Uh oh!

donnerpeter commented May 14, 2025

Uh oh!

SteVio89 commented May 15, 2025

Uh oh!

donnerpeter commented May 16, 2025

Uh oh!

SteVio89 commented May 16, 2025

Uh oh!

donnerpeter commented Jun 2, 2025

Uh oh!

ilia-permiashkin commented Jun 25, 2025

Uh oh!

donnerpeter commented Jul 24, 2025

donnerpeter commented May 6, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented May 6, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

coderabbitai bot May 6, 2025 •

edited

Loading

Relevant snippet from `languagetool-core/src/main/java/org/languagetool/rules/spelling/hunspell/Hunspell.java` (lines 1-109)

No other code snippets from the provided `possibly_relevant_code_snippets` are directly relevant to the question about deleting files in `Hunspell.java`.