Description
When a socket timeout occurs during basic_publish, the library raises RecoverableChannelError instead of a connection error. This prevents proper connection recovery in e.g. Kombu's ensure() mechanism, because channel errors don't trigger connection re-establishment. As a result, max_retries becomes ineffective: every retry is spent on the same dead connection, and a new connection is never attempted.
Environment
- py-amqp: 5.3.1
- Python 3.11
- Kombu: 5.5.4
- Using SSL connections
- Heartbeat: 0 (typical for producer connections)
- No application-level timeouts configured
- TCP_USER_TIMEOUT has been increased to 30000 (30 sec) -> not the culprit
Stacktrace (Kombu and AMQP part)
Traceback (most recent call last):
File "/application/.local/lib/python3.11/site-packages/amqp/channel.py", line 1797, in _basic_publish
return self.send_method(
^^^^^^^^^^^^^^^^^
File "/application/.local/lib/python3.11/site-packages/amqp/abstract_channel.py", line 70, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/application/.local/lib/python3.11/site-packages/amqp/method_framing.py", line 186, in write_frame
write(buffer_store.view[:offset])
File "/application/.local/lib/python3.11/site-packages/amqp/transport.py", line 350, in write
self._write(s)
File "/application/.local/lib/python3.11/site-packages/amqp/transport.py", line 600, in _write
n = write(s)
^^^^^^^^
File "/usr/local/lib/python3.11/ssl.py", line 1185, in write
return self._sslobj.write(data)
^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/application/.local/lib/python3.11/site-packages/kombu/connection.py", line 472, in _reraise_as_library_errors
yield
File "/application/.local/lib/python3.11/site-packages/kombu/connection.py", line 556, in _ensured
return fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/application/.local/lib/python3.11/site-packages/kombu/messaging.py", line 214, in _publish
return channel.basic_publish(
^^^^^^^^^^^^^^^^^^^^^^
File "/application/.local/lib/python3.11/site-packages/amqp/channel.py", line 1817, in basic_publish_confirm
ret = self._basic_publish(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/application/.local/lib/python3.11/site-packages/amqp/channel.py", line 1802, in _basic_publish
raise RecoverableChannelError('basic_publish: timed out')
amqp.exceptions.RecoverableChannelError: basic_publish: timed out
Problem
The socket timeout (errno 110) is raised within milliseconds of the publish attempt. A timeout that fast points to a connection that was already dead (possibly dropped by intermediate network equipment such as a load balancer or ingress; the exact cause is unclear) rather than to a slow network.
The current code in _basic_publish catches this and raises RecoverableChannelError here:
except socket.timeout:
    raise RecoverableChannelError('basic_publish: timed out')
This is problematic because:
- A socket timeout during write operations indicates the underlying TCP connection is dead, not just a channel issue
- e.g. Kombu's ensure() mechanism only re-establishes connections for connection errors, not channel errors
- This makes max_retries ineffective: it will retry on the same dead connection until all retries are exhausted, never attempting to establish a new connection
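The effect described in the list above can be reproduced with a small, self-contained model. This is only a sketch of the retry pattern, not Kombu's or py-amqp's actual code: FakeBroker, FakeChannelError, FakeConnectionError and publish_with_retries are hypothetical stand-ins, and the loop assumes (as described above) that a reconnect happens only when a connection error is caught.

```python
class FakeChannelError(Exception):
    """Hypothetical stand-in for a channel-level error."""


class FakeConnectionError(Exception):
    """Hypothetical stand-in for a connection-level error."""


class FakeBroker:
    """Models a broker link that starts out dead and counts reconnects."""

    def __init__(self):
        self.alive = False
        self.reconnects = 0

    def reconnect(self):
        self.alive = True
        self.reconnects += 1

    def publish(self, error_cls):
        # A dead connection fails with whichever error class the library picks.
        if not self.alive:
            raise error_cls("basic_publish: timed out")
        return "published"


def publish_with_retries(broker, error_cls, max_retries=3):
    """Retry loop that re-establishes the connection only on connection errors."""
    for _ in range(max_retries + 1):
        try:
            return broker.publish(error_cls)
        except FakeConnectionError:
            broker.reconnect()  # the next attempt uses a fresh connection
        except FakeChannelError:
            pass  # the next attempt reuses the same dead connection
    return None  # all retries exhausted


# Channel error: every retry hits the dead connection.
dead = FakeBroker()
channel_result = publish_with_retries(dead, FakeChannelError)

# Connection error: one reconnect, then the publish succeeds.
recovered = FakeBroker()
connection_result = publish_with_retries(recovered, FakeConnectionError)
```

In this model the channel-error path never calls reconnect(), so raising the retry count does not help; only the error class decides whether recovery happens.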
Expected Behavior
Socket timeouts during basic_publish should raise a connection error (e.g., RecoverableConnectionError) to trigger proper connection recovery in e.g. Kombu.
Proposed Solution
Should this be changed to raise a connection error instead?
except socket.timeout:
    raise RecoverableConnectionError('basic_publish: timed out')
Or perhaps we need a more nuanced approach to distinguish between different types of timeouts?
I'm happy to open a PR with the appropriate fix once we agree on the correct approach.
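One possible shape for the more nuanced approach is to look at errno. Since Python 3.10, socket.timeout is an alias of TimeoutError, so the except socket.timeout clause also catches OS-level ETIMEDOUT errors such as the [Errno 110] in the traceback above. A timeout raised by Python's own socket timeout handling normally carries no errno. The sketch below uses that difference; the exception classes and classify_timeout are hypothetical stand-ins rather than py-amqp code, and treating a Python-level timeout as a channel-level problem is an assumption that would need discussion.

```python
import errno
import socket


class StandInChannelError(Exception):
    """Hypothetical stand-in for a recoverable channel error."""


class StandInConnectionError(Exception):
    """Hypothetical stand-in for a recoverable connection error."""


def classify_timeout(exc):
    """Pick an error class for a timeout raised while writing to the socket.

    OS-level timeouts carry errno ETIMEDOUT (110 on Linux) and are treated
    as a dead connection. Timeouts without an errno are assumed to come from
    Python's socket timeout handling and keep the current channel-level
    classification.
    """
    if getattr(exc, "errno", None) == errno.ETIMEDOUT:
        return StandInConnectionError
    return StandInChannelError


# OS-level timeout, shaped like the one in the traceback.
os_level = TimeoutError(errno.ETIMEDOUT, "Operation timed out")
# Python-level timeout, as raised when a socket timeout set via settimeout() expires.
py_level = socket.timeout("timed out")

os_level_class = classify_timeout(os_level)
py_level_class = classify_timeout(py_level)
```

A simpler alternative is the unconditional change shown above; the errno check only matters if some timeouts should still be reported as channel errors.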
Additional Context
This issue manifests in production environments where connections traverse multiple network hops. Without heartbeats, these dead connections aren't detected until a publish attempt fails with an immediate timeout. The very quick OS-level timeout strongly indicates that the connection is already dead rather than experiencing temporary network delays. I'm not sure what the original reasoning was for raising RecoverableChannelError on socket.timeout, but it may be worth revisiting.
Links
Potentially related to #186