basic_publish timeout raises RecoverableChannelError instead of RecoverableConnectionError #452

@csabasim

Description

When a socket timeout occurs during basic_publish, the library raises RecoverableChannelError instead of a connection error. This prevents proper connection recovery in callers such as Kombu, whose ensure() mechanism does not re-establish the connection on channel errors. As a result, max_retries becomes ineffective: every retry runs against the same dead connection until the retries are exhausted, and a new connection is never attempted.

Environment

  • py-amqp: 5.3.1
  • Python 3.11
  • Kombu: 5.5.4
  • Using SSL connections
  • Heartbeat: 0 (typical for producer connections)
  • No application-level timeouts configured
  • TCP_USER_TIMEOUT has been increased to 30000 ms (30 sec), so it is not the culprit

Stacktrace (Kombu and AMQP part)

Traceback (most recent call last):
  File "/application/.local/lib/python3.11/site-packages/amqp/channel.py", line 1797, in _basic_publish
    return self.send_method(
           ^^^^^^^^^^^^^^^^^
  File "/application/.local/lib/python3.11/site-packages/amqp/abstract_channel.py", line 70, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/application/.local/lib/python3.11/site-packages/amqp/method_framing.py", line 186, in write_frame
    write(buffer_store.view[:offset])
  File "/application/.local/lib/python3.11/site-packages/amqp/transport.py", line 350, in write
    self._write(s)
  File "/application/.local/lib/python3.11/site-packages/amqp/transport.py", line 600, in _write
    n = write(s)
        ^^^^^^^^
  File "/usr/local/lib/python3.11/ssl.py", line 1185, in write
    return self._sslobj.write(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/application/.local/lib/python3.11/site-packages/kombu/connection.py", line 472, in _reraise_as_library_errors
    yield
  File "/application/.local/lib/python3.11/site-packages/kombu/connection.py", line 556, in _ensured
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/application/.local/lib/python3.11/site-packages/kombu/messaging.py", line 214, in _publish
    return channel.basic_publish(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/application/.local/lib/python3.11/site-packages/amqp/channel.py", line 1817, in basic_publish_confirm
    ret = self._basic_publish(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/application/.local/lib/python3.11/site-packages/amqp/channel.py", line 1802, in _basic_publish
    raise RecoverableChannelError('basic_publish: timed out')
amqp.exceptions.RecoverableChannelError: basic_publish: timed out

Problem

The socket timeout (errno 110) occurs immediately (within milliseconds) when attempting to publish. An OS-level timeout that fast strongly suggests the connection is already dead rather than slowed by the network. It may have been dropped by intermediate network equipment (LB, Ingress, etc.), though the exact cause is unclear.

The current code in _basic_publish catches this and raises RecoverableChannelError here:

except socket.timeout:
    raise RecoverableChannelError('basic_publish: timed out')

Raising a channel error here is problematic because:

  1. A socket timeout during write operations indicates the underlying TCP connection is dead, not just a channel issue
  2. Callers such as Kombu's ensure() mechanism only re-establish connections for connection errors, not channel errors
  3. This makes max_retries ineffective: it retries on the same dead connection until all retries are exhausted, never attempting to establish a new connection
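
The retry behaviour described in the list above can be modelled with a small simulation. This is a simplified sketch, not Kombu's or py-amqp's actual code: `FakeBroker`, `FakeChannelError`, `FakeConnectionError` and `publish_with_retries` are all hypothetical stand-ins, and the only assumption carried over from the issue is that reconnection happens on connection errors only.

```python
class FakeChannelError(Exception):
    """Stand-in for a channel-level error (hypothetical)."""


class FakeConnectionError(Exception):
    """Stand-in for a connection-level error (hypothetical)."""


class FakeBroker:
    """Models a connection that starts out dead and counts reconnects."""

    def __init__(self):
        self.dead = True
        self.reconnects = 0

    def reconnect(self):
        self.dead = False
        self.reconnects += 1

    def publish(self, error_cls):
        # A dead connection fails with whichever error type the library raises.
        if self.dead:
            raise error_cls("basic_publish: timed out")
        return "published"


def publish_with_retries(broker, error_cls, max_retries=3):
    """Retry publishing; reconnect only when a connection error is seen."""
    for _ in range(max_retries + 1):
        try:
            return broker.publish(error_cls)
        except FakeConnectionError:
            broker.reconnect()  # a fresh connection is used on the next attempt
        except FakeChannelError:
            pass  # the same dead connection is reused on the next attempt
    return None  # retries exhausted


channel_case = FakeBroker()
print(publish_with_retries(channel_case, FakeChannelError), channel_case.reconnects)

connection_case = FakeBroker()
print(publish_with_retries(connection_case, FakeConnectionError), connection_case.reconnects)
```

In this model the channel-error case exhausts every retry without a single reconnect, while the connection-error case succeeds on the second attempt.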

Expected Behavior

Socket timeouts during basic_publish should raise a connection error (e.g., RecoverableConnectionError) to trigger proper connection recovery in callers such as Kombu.

Proposed Solution

Should this be changed to raise a connection error instead?

except socket.timeout:
    raise RecoverableConnectionError('basic_publish: timed out')

Or perhaps we need a more nuanced approach to distinguish between different types of timeouts?
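
One possible shape for such a nuanced approach, offered as a sketch only: treat timeouts that carry the OS-level `ETIMEDOUT` errno as connection errors, and keep errno-less timeouts (such as those raised after a Python-level `settimeout()`) as channel errors. The exception classes below are local stand-ins so the snippet runs on its own, and `translate_timeout` is a hypothetical helper, not an existing py-amqp function. Whether errno-less timeouts should remain channel errors is an open question.

```python
import errno
import socket


class FakeRecoverableChannelError(Exception):
    """Local stand-in for amqp's RecoverableChannelError (hypothetical)."""


class FakeRecoverableConnectionError(Exception):
    """Local stand-in for amqp's RecoverableConnectionError (hypothetical)."""


def translate_timeout(exc):
    """Map a timeout exception to the error the publish call would raise."""
    if getattr(exc, "errno", None) == errno.ETIMEDOUT:
        # The kernel reported the timeout: assume the connection is dead.
        return FakeRecoverableConnectionError("basic_publish: connection timed out")
    # No errno: a Python-level timeout, kept as a channel error in this sketch.
    return FakeRecoverableChannelError("basic_publish: timed out")


os_level = OSError(errno.ETIMEDOUT, "Operation timed out")
py_level = socket.timeout("timed out")

print(type(translate_timeout(os_level)).__name__)
print(type(translate_timeout(py_level)).__name__)
```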

I'm happy to open a PR with the appropriate fix once we agree on the correct approach.

Additional Context

This issue manifests in production environments where connections traverse multiple network hops. Without heartbeats, these dead connections aren't detected until a publish attempt fails with an immediate timeout. The very quick OS-level timeout response strongly indicates the connection is already dead rather than experiencing temporary network delays. I'm not sure what the original reasoning was for raising RecoverableChannelError on socket.timeout, but it might be worth revisiting.

Links

Potentially related to #186
