
failed_queue_investigation project

Contains code to experiment with an rq failed-queue -- and other general queueing experimentation.

On this page:

  • Typical usage, for running packages
  • Typical usage, for running scripts
  • One-time setup, or to upgrade patch versions
    • (this step auto-creates and populates a venv, which you never need to activate)
  • The investigation

Typical usage, for running packages

$ cd /path/to/failed_queue_investigation/

$ uv run rq
Usage: rq [OPTIONS] COMMAND [ARGS]...
  RQ command line tool.
<snip>

...or:

$ uv run redis-cli
127.0.0.1:6379>
127.0.0.1:6379> exit

...or:

$ uv run rq-dashboard
RQ Dashboard version 0.5.2
 * Serving Flask app 'rq_dashboard.cli' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   <snip>
   Press CTRL+C to quit

   If using VS Code, you can select "Open in Browser" and see all queues, workers, and jobs.
   The terminal then outputs:
127.0.0.1 - - [23/Jun/2025 16:50:25] "GET /queues.json HTTP/1.1" 200 -
127.0.0.1 - - [23/Jun/2025 16:50:25] "GET /jobs/default/1.json HTTP/1.1" 200 -
127.0.0.1 - - [23/Jun/2025 16:50:25] "GET /workers.json HTTP/1.1" 200 -
<snip>

<Control C>

Typical usage, for running scripts

$ cd /path/to/failed_queue_investigation/

$ uv run ./the_script.py

One-time setup, or to upgrade patch versions

$ cd /path/to/failed_queue_investigation/

$ uv sync --upgrade --group staging

The investigation

Problem

A script was run that was expected to delete duplicate jobs in the failed queue, within a certain date-range.

The result was two unexpected things:

  • All jobs on the failed queue were deleted.
  • The failed queue, instead of showing zero jobs, no longer appeared at all when running rq info. As a result, my queue-checker script generated alerts, because the list of expected queues (which included the failed queue) no longer matched the actual queues (which no longer included it).

Goal

Someone else will look into the script; my goal was to see if the non-existence of the failed queue was a problem.

The reason we thought it might be a problem is that we ingested multiple thousands of items into a collection, and from past experience we would have expected a few failures -- but were alerted to none. So we were wondering if some jobs may have been failing silently, because of the lack of a failed queue.

Plan

  • Determine how the failed queue might have been deleted (try to recreate the issue).

    • From experience, emptying the failed queue does not normally cause the disappearance of the failed queue.
  • Once the failed queue was removed, see what happens when a failed job is run.

    • Research indicated that in the rq world, a failed job will automatically create a new failed queue if it does not exist, and put the job on it. I wanted to confirm this.
  • If a new failed job properly re-creates the failed queue, then recreate the failed queue on production to stop the alerts. (Alternatively, I could have updated the queue-checker script.)

Result

Setup

Normally for any new code we want to work with Python 3.12. However, there were problems getting that to work with the old versions of rq we use, so I used the older Python 3.8.x that runs on our servers.

For initial setup, do these steps:

$ cd /path/to/failed_queue_investigation_stuff/
$ git clone git@github.com:Brown-University-Library/failed_queue_investigation.git
$ cd ./failed_queue_investigation/
$ uv sync --upgrade --group staging

This will create a (git-ignored) .venv directory, and install Python and the rq dependencies specified in uv.lock. (If there were no uv.lock file, one would be created from the pyproject.toml file.)

You can confirm everything's working properly by running:

$ cd /path/to/failed_queue_investigation/
$ uv run ./import_rq.py

This simply imports rq -- which wouldn't work if the environment were not set up correctly. It also confirms the versions of Python and rq your environment is using.
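A sanity-check script like this might look roughly as follows (an illustrative sketch; the actual import_rq.py may differ):

```python
import sys
from importlib import metadata


def env_report() -> str:
    """Return a short report of the python and rq versions in use."""
    lines = [f"python: {sys.version.split()[0]}"]
    try:
        lines.append(f"rq: {metadata.version('rq')}")
    except metadata.PackageNotFoundError:
        # rq not resolvable means the venv was not set up correctly
        lines.append("rq: NOT installed -- environment not set up correctly")
    return "\n".join(lines)


if __name__ == "__main__":
    print(env_report())
```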

Ok; ready to go.

Delete the failed queue

As noted, emptying the failed queue does not normally cause it to disappear. I confirmed this on one of our dev-servers that has queues set up, including a failed queue: emptying the failed queue did not delete it.

See file a__empty_all_failed_jobs.py for code. In the doc-string is a simple way to empty the failed queue. The code itself shows another way. After doing both, uv run rq info still showed the failed queue (with the expected 0 jobs).
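A minimal sketch of the emptying step, using the pre-1.0 rq API (get_failed_queue), since this project pins old rq versions -- a__empty_all_failed_jobs.py may do it differently:

```python
def empty_failed_queue(conn) -> None:
    """Empty the failed queue's jobs without touching the queue itself.

    `conn` is a redis-py client. Sketch only: uses the pre-1.0 rq API
    (`get_failed_queue`); newer rq versions replaced the failed queue
    with a FailedJobRegistry.
    """
    from rq import get_failed_queue  # imported here: requires rq installed

    failed_q = get_failed_queue(connection=conn)
    failed_q.empty()  # removes the jobs; `rq info` still lists the queue
```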

It turns out that's because rq maintains a list of the queues it knows about, and it does not automatically update that list when a queue's contents are deleted -- or even when the redis key representing the queue itself is deleted.
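The mechanics can be sketched with the key names classic rq uses in Redis (my understanding of the rq versions this project targets; treat the exact names as an assumption):

```python
# Classic rq keeps a registry of known queues in a Redis set, separate
# from the per-queue job lists -- which is why emptying (or even deleting)
# a queue's job list does not remove the queue from `rq info`.

REGISTRY_KEY = "rq:queues"  # Redis set of queue keys rq knows about


def queue_key(name: str) -> str:
    """The Redis key holding a queue's list of job ids."""
    return f"rq:queue:{name}"
```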

See file b__delete_failed_queue_itself.py for code. In the doc-string is a simple way to delete the failed queue itself. The code itself shows another way. After doing both, uv run rq info no longer showed the failed queue.
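A minimal sketch of the kind of deletion involved, assuming the classic rq key layout (rq:queue:failed for the job list, the rq:queues set for the registry); the real b__delete_failed_queue_itself.py may differ:

```python
def delete_failed_queue(conn) -> None:
    """Delete the failed queue's job list AND deregister it from rq.

    `conn` is a redis-py client. Both steps are needed: removing only the
    job-list key leaves the queue visible to `rq info`, because `rq info`
    reads the rq:queues registry set.
    """
    conn.delete("rq:queue:failed")              # the queue's job list
    conn.srem("rq:queues", "rq:queue:failed")   # the registry entry
```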

Success in the goal of reproducing the missing failed queue. The original script's code should still be reviewed to see what it did.

Create a failing job; then run it

See file c__create_failed_queue_job.py for code. The docstring at the top shows how to run the file. Other comments in the file indicate that this code only creates and enqueues the job that will fail -- it does NOT trigger the failure by running it.

The docstrings indicate how to actually run this job.

Doing this did auto-recreate the failed queue -- and also put the failed job on it.
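The enqueue-then-run flow can be sketched like this, assuming a pre-1.0 rq (the versions that still have a failed queue) and a Redis server on localhost:6379; the function and queue names are illustrative, not the actual contents of c__create_failed_queue_job.py:

```python
def failing_task():
    """A job body that raises; any unhandled exception in a job moves
    it to the failed queue (auto-creating that queue if needed)."""
    raise RuntimeError("intentional failure, for testing")


if __name__ == "__main__":
    # imports kept here so the sketch is readable without rq installed
    from redis import Redis
    from rq import Queue, SimpleWorker

    conn = Redis()
    Queue("default", connection=conn).enqueue(failing_task)

    # burst mode processes the pending jobs and exits; the failure
    # recreates the failed queue and puts the failed job on it
    SimpleWorker(["default"], connection=conn).work(burst=True)
```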

Success in the goal of confirming that a failed job will:

  • auto-recreate the failed queue.
  • put the failed job on it.

Update production

After confirming that creating a failed job recreates the failed queue, I performed the following steps to update the production queue:

  • I git-cloned this code to production
  • I ran the c__create_failed_queue_job.py script to enqueue a job on the newly-created default queue.
  • I ran the code to "run" the job -- which successfully recreated the failed queue with the failed job on it.
  • I deleted the failed job from the failed queue.
  • I deleted the default queue using the approach contained in b__delete_failed_queue_itself.py.

The queue-checker script no longer generates alerts.

