From: Jérémie Galarneau Date: Fri, 15 Dec 2023 18:49:07 +0000 (-0500) Subject: Fix: src.ctf.lttng-live: expect NEW_STREAM/METADATA for inactive streams X-Git-Url: https://git.efficios.com/?p=babeltrace.git;a=commitdiff_plain;h=ca6225465edba76718064ccd4b6fba6253cf02f4 Fix: src.ctf.lttng-live: expect NEW_STREAM/METADATA for inactive streams Observed issue -------------- Some LTTng-tools live tests failures appear to show babeltrace2 hanging (failing to print expected events). The problem is intermittent, but Kienan was able to develop a test case that's reproducible for him. The test case performs the following steps: - Start a ust application and leave it running - Configure and then start an lttng live session - Connect a live viewer (babeltrace) - Run a second ust application - Wait for the expected number of events - In the failing case, no events are seen by babeltrace Using per-uid buffers, the test typically completes normally. With per-pid buffers the test fails, hanging indefinitely if waiting for the specified number of events. While "hanging", babeltrace2 is polling the relayd. This affects for babeltrace2 stable-2.0 and master while using lttng-tools master. For more information, see the description of bug #1406[1] Cause ----- When consuming a live trace captured in per-PID mode, Babeltrace periodically requests the index of the next packet it should consume. As part of the reply, it gets a 'flags' field which is used to announce that new streams, or new metadata, are available to the viewer. Unfortunately, these 'flags' are only set when the relay daemon has new tracing data to deliver. It is not set when the relay daemon indicates that the stream is inactive (see LTTNG_VIEWER_INDEX_INACTIVE). In the average case where an application is spawned while others are actively emiting events, a request for new data will result in a reply that returns an index entry (code LTTNG_VIEWER_INDEX_OK) for an available packet accompanied by the LTTNG_VIEWER_FLAG_NEW_STREAM flag. This flag indicates to the viewer that it should request new streams (using the LTTNG_VIEWER_GET_NEW_STREAMS live protocol command) before consuming the new data. In the cases where we observe a hang, an application is running but not emiting new events. As such, the relay daemon periodically emits "live beacons" to indicate that the session's streams are inactive up to a given time 'T'. Since the existing application remains inactive and the viewer is never notified that new streams are available, the viewer effectively remains "stuck" and never notices the new application being traced. The LTTNG_VIEWER_FLAG_NEW_METADATA communicates a similar semantic with regards to the metadata. However, ignoring it for inactive streams isn't as deleterious: the same information is made available to the viewer the next time it will successfully request a new index to the relay daemon. This would only become a problem if the tracers start to express non-layout data (like supplemental environment information? but I don't see a real use-case) as part of the metadata stream that should be made available downstream even during periods of inactivity. Note that the same problem most likely affects the per-UID buffer allocation mode when multiple users are being traced. Solution -------- On the producer end, LTTNG_VIEWER_FLAG_NEW_STREAM is set even when returning an inactivity index. Note that to preserve compatibility with older live consumers that don't expect this flag in non-OK response, the LTTNG_VIEWER_FLAG_NEW_STREAM notification is repeated until the next LTTNG_VIEWER_GET_NEXT_INDEX command that returns LTTNG_VIEWER_INDEX_OK. On Babeltrace's end, the handler of the LTTNG_VIEWER_GET_NEXT_INDEX command (lttng_live_get_next_index) is modified to expect LTTNG_VIEWER_FLAG_NEW_STREAM in the cases where the command returns: - LTTNG_VIEWER_INDEX_OK (as done previously), - LTTNG_VIEWER_INDEX_HUP (new), - LTTNG_VIEWER_INDEX_INACTIVE (new). Drawbacks --------- This is arguably a protocol change as none of the producers ever set the NEW_METADATA/NEW_STREAM flags when indicating an inactive stream. References ---------- [1] https://bugs.lttng.org/issues/1406 Fixes #1406 Signed-off-by: Jérémie Galarneau Change-Id: Icc7f90f5cb805f50ea16968396fe35454d30b4a8 Reviewed-on: https://review.lttng.org/c/babeltrace/+/11587 Tested-by: jenkins Reviewed-by: Philippe Proulx --- diff --git a/src/plugins/ctf/lttng-live/viewer-connection.cpp b/src/plugins/ctf/lttng-live/viewer-connection.cpp index 2f077978..5ea98cdd 100644 --- a/src/plugins/ctf/lttng-live/viewer-connection.cpp +++ b/src/plugins/ctf/lttng-live/viewer-connection.cpp @@ -1411,6 +1411,14 @@ lttng_live_get_next_index(struct lttng_live_msg_iter *lttng_live_msg_iter, BT_COMP_LOGD("Received response from relay daemon: cmd=%s, response=%s", lttng_viewer_command_string(LTTNG_VIEWER_GET_NEXT_INDEX), lttng_viewer_next_index_return_code_string(rp_status)); + + if (flags & LTTNG_VIEWER_FLAG_NEW_STREAM) { + BT_COMP_LOGD("Marking all sessions as possibly needing new streams: " + "response=%s, response-flag=NEW_STREAM", + lttng_viewer_next_index_return_code_string(rp_status)); + lttng_live_need_new_streams(lttng_live_msg_iter); + } + switch (rp_status) { case LTTNG_VIEWER_INDEX_INACTIVE: { @@ -1442,7 +1450,6 @@ lttng_live_get_next_index(struct lttng_live_msg_iter *lttng_live_msg_iter, stream->ctf_stream_class_id.value = ctf_stream_class_id; stream->ctf_stream_class_id.is_set = true; } - lttng_live_stream_iterator_set_state(stream, LTTNG_LIVE_STREAM_ACTIVE_DATA); if (flags & LTTNG_VIEWER_FLAG_NEW_METADATA) { @@ -1451,12 +1458,6 @@ lttng_live_get_next_index(struct lttng_live_msg_iter *lttng_live_msg_iter, lttng_viewer_next_index_return_code_string(rp_status), trace->id); trace->metadata_stream_state = LTTNG_LIVE_METADATA_STREAM_STATE_NEEDED; } - if (flags & LTTNG_VIEWER_FLAG_NEW_STREAM) { - BT_COMP_LOGD("Marking all sessions as possibly needing new streams: " - "response=%s, response-flag=NEW_STREAM", - lttng_viewer_next_index_return_code_string(rp_status)); - lttng_live_need_new_streams(lttng_live_msg_iter); - } status = LTTNG_LIVE_ITERATOR_STATUS_OK; break; } @@ -1484,6 +1485,7 @@ lttng_live_get_next_index(struct lttng_live_msg_iter *lttng_live_msg_iter, status = LTTNG_LIVE_ITERATOR_STATUS_ERROR; goto end; } + goto end; error: